[jira] [Assigned] (SPARK-18108) Partition discovery fails with explicitly written long partitions

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18108:


Assignee: (was: Apache Spark)

> Partition discovery fails with explicitly written long partitions
> -
>
> Key: SPARK-18108
> URL: https://issues.apache.org/jira/browse/SPARK-18108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Richard Moorhead
>Priority: Minor
> Attachments: stacktrace.out
>
>
> We have parquet data written from Spark 1.6 that, when read from 2.0.1, 
> produces errors.
> {code}
> case class A(a: Long, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The above code fails; stack trace attached. 
> If an integer is used, explicit partition discovery succeeds.
> {code}
> case class A(a: Int, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The action succeeds. Additionally, if 'partitionBy' is used instead of 
> explicit writes, partition discovery succeeds. 
> Question: Is the first example a reasonable use case? 
> [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319]
>  seems to default to Integer types unless the partition value exceeds the 
> integer type's range.
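
For comparison, here is a minimal sketch of the partitionBy-based write mentioned in the description (the case class name and the /data2 path are illustrative only):
{code}
case class B(a: Long, b: Int)
val bs = Seq(B(1, 2))
// let the writer lay out the a=1 directory instead of encoding it in the path
spark.createDataFrame(bs).write.partitionBy("a").parquet("/data2/")
spark.read.parquet("/data2/").collect
{code}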






[jira] [Commented] (SPARK-18108) Partition discovery fails with explicitly written long partitions

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701268#comment-15701268
 ] 

Apache Spark commented on SPARK-18108:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16030

> Partition discovery fails with explicitly written long partitions
> -
>
> Key: SPARK-18108
> URL: https://issues.apache.org/jira/browse/SPARK-18108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Richard Moorhead
>Priority: Minor
> Attachments: stacktrace.out
>
>
> We have parquet data written from Spark 1.6 that, when read from 2.0.1, 
> produces errors.
> {code}
> case class A(a: Long, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The above code fails; stack trace attached. 
> If an integer is used, explicit partition discovery succeeds.
> {code}
> case class A(a: Int, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The action succeeds. Additionally, if 'partitionBy' is used instead of 
> explicit writes, partition discovery succeeds. 
> Question: Is the first example a reasonable use case? 
> [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319]
>  seems to default to Integer types unless the partition value exceeds the 
> integer type's range.






[jira] [Assigned] (SPARK-18108) Partition discovery fails with explicitly written long partitions

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18108:


Assignee: Apache Spark

> Partition discovery fails with explicitly written long partitions
> -
>
> Key: SPARK-18108
> URL: https://issues.apache.org/jira/browse/SPARK-18108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Richard Moorhead
>Assignee: Apache Spark
>Priority: Minor
> Attachments: stacktrace.out
>
>
> We have parquet data written from Spark 1.6 that, when read from 2.0.1, 
> produces errors.
> {code}
> case class A(a: Long, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The above code fails; stack trace attached. 
> If an integer is used, explicit partition discovery succeeds.
> {code}
> case class A(a: Int, b: Int)
> val as = Seq(A(1,2))
> //partition explicitly written
> spark.createDataFrame(as).write.parquet("/data/a=1/")
> spark.read.parquet("/data/").collect
> {code}
> The action succeeds. Additionally, if 'partitionBy' is used instead of 
> explicit writes, partition discovery succeeds. 
> Question: Is the first example a reasonable use case? 
> [PartitioningUtils|https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L319]
>  seems to default to Integer types unless the partition value exceeds the 
> integer type's range.






[jira] [Created] (SPARK-18606) [HISTORYSERVER]It will check html elems while searching HistoryServer

2016-11-28 Thread Tao Wang (JIRA)
Tao Wang created SPARK-18606:


 Summary: [HISTORYSERVER]It will check html elems while searching 
HistoryServer
 Key: SPARK-18606
 URL: https://issues.apache.org/jira/browse/SPARK-18606
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Tao Wang
Priority: Minor


When we search applications in HistoryServer, it will include all contents 
between the tag, including useless elements like "href", which makes the results 
confusing. We should remove those to make it clear.

[jira] [Created] (SPARK-18607) give a result on a percent of the tasks succeed

2016-11-28 Thread Ru Xiang (JIRA)
Ru Xiang created SPARK-18607:


 Summary: give a result on  a percent of the tasks succeed
 Key: SPARK-18607
 URL: https://issues.apache.org/jira/browse/SPARK-18607
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Ru Xiang


In this patch, we modify the code corresponding to runApproximateJob so that 
we can get a result when a specified percentage of tasks succeed.

In a production environment, the 'long tail' of straggling tasks is a common and 
urgent problem. In practice, as long as we can get results from a specified 
percentage of tasks, we can guarantee the final result. This is a common 
requirement in the practice of machine learning algorithms.
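
For reference, Spark already exposes a time-bounded variant of this idea on top of runApproximateJob; a small sketch of that existing API, which the proposed completion-fraction trigger would parallel (the timeout and confidence values are illustrative):
{code}
// today: approximate count that returns whatever is available once the timeout fires
val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
val partial = rdd.countApprox(timeout = 10000L, confidence = 0.95)
println(partial.initialValue)  // BoundedDouble once the timeout fires or the job finishes
{code}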










[jira] [Created] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-11-28 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-18608:
--

 Summary: Spark ML algorithms that check RDD cache level for 
internal caching double-cache data
 Key: SPARK-18608
 URL: https://issues.apache.org/jira/browse/SPARK-18608
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Nick Pentreath


Some algorithms in Spark ML (e.g. {{LogisticRegression}}, {{LinearRegression}}, 
and I believe now {{KMeans}}) handle persistence internally. They check whether 
the input dataset is cached, and if not they cache it for performance.

However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. This 
will actually always be true, since even if the dataset itself is cached, the 
RDD returned by {{dataset.rdd}} will not be cached.

Hence if the input dataset is cached, the data will end up being cached twice, 
which is wasteful.

To see this:

{code}
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> val df = spark.range(10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: bigint]

scala> df.storageLevel == StorageLevel.NONE
res0: Boolean = true

scala> df.persist
res1: df.type = [num: bigint]

scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
res2: Boolean = true

scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
res3: Boolean = false

scala> df.rdd.getStorageLevel == StorageLevel.NONE
res4: Boolean = true
{code}

Before SPARK-16063, there was no way to check the storage level of the input 
{{Dataset}}, but now we can, so the checks should be migrated to use 
{{dataset.storageLevel}}.
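
A minimal sketch of the migrated check (assuming it sits inside an ML algorithm's training path; {{dataset}}, the column names, and the storage level chosen here are illustrative):
{code}
import org.apache.spark.storage.StorageLevel

// check the Dataset itself, not dataset.rdd, to avoid double caching
val handlePersistence = dataset.storageLevel == StorageLevel.NONE

val instances = dataset.select("label", "features").rdd
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
try {
  // ... iterative optimization over `instances` ...
} finally {
  if (handlePersistence) instances.unpersist()
}
{code}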






[jira] [Assigned] (SPARK-18606) [HISTORYSERVER]It will check html elems while searching HistoryServer

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18606:


Assignee: (was: Apache Spark)

> [HISTORYSERVER]It will check html elems while searching HistoryServer
> -
>
> Key: SPARK-18606
> URL: https://issues.apache.org/jira/browse/SPARK-18606
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Tao Wang
>Priority: Minor
>
> When we search applications in HistoryServer, it will include all contents 
> between the tag, including useless elements like "href", which makes the results confusing. 
> We should remove those to make it clear.






[jira] [Assigned] (SPARK-18606) [HISTORYSERVER]It will check html elems while searching HistoryServer

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18606:


Assignee: Apache Spark

> [HISTORYSERVER]It will check html elems while searching HistoryServer
> -
>
> Key: SPARK-18606
> URL: https://issues.apache.org/jira/browse/SPARK-18606
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Tao Wang
>Assignee: Apache Spark
>Priority: Minor
>
> When we search applications in HistoryServer, it will include all contents 
> between the tag, including useless elements like "href", which makes the results confusing. 
> We should remove those to make it clear.






[jira] [Commented] (SPARK-18606) [HISTORYSERVER]It will check html elems while searching HistoryServer

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701444#comment-15701444
 ] 

Apache Spark commented on SPARK-18606:
--

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/16031

> [HISTORYSERVER]It will check html elems while searching HistoryServer
> -
>
> Key: SPARK-18606
> URL: https://issues.apache.org/jira/browse/SPARK-18606
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Tao Wang
>Priority: Minor
>
> When we search applications in HistoryServer, it will include all contents 
> between the tag, including useless elements like "href", which makes the results confusing. 
> We should remove those to make it clear.






[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-11-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701447#comment-15701447
 ] 

Sean Owen commented on SPARK-18608:
---

Agree, I had long since meant to note this. This would be great to fix.

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{Dataset}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.






[jira] [Commented] (SPARK-18118) SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows Beyond 64 KB

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701463#comment-15701463
 ] 

Apache Spark commented on SPARK-18118:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/16032

> SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows 
> Beyond 64 KB
> --
>
> Key: SPARK-18118
> URL: https://issues.apache.org/jira/browse/SPARK-18118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> For sufficiently wide or nested Java Objects, when SpecificSafeProjection 
> attempts to recreate the object from an InternalRow, the generated 
> SpecificSafeProjection.apply method is larger than allowed: 
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
> {code}
> Although related, this issue appears not to have been resolved by 
> SPARK-15285. Since there is only one top-level object when projecting, 
> splitExpressions finds no additional Expressions to split. The result is a 
> single large, nested Expression that forms the apply code.
> See the reproducer for an example [1].
> [1] - https://github.com/bdrillard/specific-safe-projection-error






[jira] [Assigned] (SPARK-18118) SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows Beyond 64 KB

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18118:


Assignee: Apache Spark

> SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows 
> Beyond 64 KB
> --
>
> Key: SPARK-18118
> URL: https://issues.apache.org/jira/browse/SPARK-18118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Apache Spark
>
> For sufficiently wide or nested Java Objects, when SpecificSafeProjection 
> attempts to recreate the object from an InternalRow, the generated 
> SpecificSafeProjection.apply method is larger than allowed: 
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
> {code}
> Although related, this issue appears not to have been resolved by 
> SPARK-15285. Since there is only one top-level object when projecting, 
> splitExpressions finds no additional Expressions to split. The result is a 
> single large, nested Expression that forms the apply code.
> See the reproducer for an example [1].
> [1] - https://github.com/bdrillard/specific-safe-projection-error






[jira] [Assigned] (SPARK-18118) SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows Beyond 64 KB

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18118:


Assignee: (was: Apache Spark)

> SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows 
> Beyond 64 KB
> --
>
> Key: SPARK-18118
> URL: https://issues.apache.org/jira/browse/SPARK-18118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>
> For sufficiently wide or nested Java Objects, when SpecificSafeProjection 
> attempts to recreate the object from an InternalRow, the generated 
> SpecificSafeProjection.apply method is larger than allowed: 
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
> {code}
> Although related, this issue appears not to have been resolved by 
> SPARK-15285. Since there is only one top-level object when projecting, 
> splitExpressions finds no additional Expressions to split. The result is a 
> single large, nested Expression that forms the apply code.
> See the reproducer for an example [1].
> [1] - https://github.com/bdrillard/specific-safe-projection-error






[jira] [Commented] (SPARK-18558) spark-csv: infer data type for mixed integer/null columns causes exception

2016-11-28 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701508#comment-15701508
 ] 

Jayadevan M commented on SPARK-18558:
-

[~PeterRose]
I tried to replicate this issue, but did not get a null pointer exception:

spark.read.option("header", "true").option("inferSchema", 
"true").format("csv").load("example.csv").show(5);
+---+
|column1|
+---+
|  1|
|  2|
|   null|
+---+


> spark-csv: infer data type for mixed integer/null columns causes exception
> --
>
> Key: SPARK-18558
> URL: https://issues.apache.org/jira/browse/SPARK-18558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Peter Rose
>
> Null pointer exception when using the following csv file:
> example.csv:
> column1
> "1"
> "2"
> ""
>  Dataset<Row> df = spark
>   .read()
>   .option("header", "true")
>   .option("inferSchema", "true")
>   .format("csv")
>   .load("example.csv");
>  df.printSchema();
> The type is correctly inferred:
> root
>  |-- col1: integer (nullable = true)
> df.show(5);
> The show method leads to this exception:
> java.lang.NumberFormatException: null
>   at java.lang.Integer.parseInt(Integer.java:542) ~[?:1.8.0_25]
>   at java.lang.Integer.parseInt(Integer.java:615) ~[?:1.8.0_25]
>   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) 
> ~[scala-library-2.11.8.jar:?]
>   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29) 
> ~[scala-library-2.11.8.jar:?]
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:241)
>  ~[spark-sql_2.11-2.0.2.jar:2.0.2]






[jira] [Updated] (SPARK-18560) Receiver data can not be dataSerialized properly.

2016-11-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18560:
--
Component/s: (was: Structured Streaming)

> Receiver data can not be dataSerialized properly.
> -
>
> Key: SPARK-18560
> URL: https://issues.apache.org/jira/browse/SPARK-18560
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Priority: Critical
>
> My Spark Streaming job runs correctly on Spark 1.6.1, but it cannot run 
> properly on Spark 2.0.1, with the following exception:
> {code}
> 16/11/22 19:20:15 ERROR executor.Executor: Exception in task 4.3 in stage 6.0 
> (TID 87)
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 13994
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:243)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1760)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1150)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1150)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1943)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1943)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Digging into the relevant implementation, I found that the type of the data received by 
> {{Receiver}} is erased. In Spark 2.x, the framework can choose an appropriate 
> {{Serializer}} from {{JavaSerializer}} and {{KryoSerializer}} based on the 
> type of the data. 
> At the {{Receiver}} side, the type of the data is erased to {{Object}}, so the 
> framework will choose {{JavaSerializer}}, per the following code:
> {code}
> def canUseKryo(ct: ClassTag[_]): Boolean = {
>   primitiveAndPrimitiveArrayClassTags.contains(ct) || ct == stringClassTag
> }
>
> def getSerializer(ct: ClassTag[_]): Serializer = {
>   if (canUseKryo(ct)) {
>     kryoSerializer
>   } else {
>     defaultSerializer
>   }
> }
> {code}
> At the task side, we get the correct data type, and the framework will choose 
> {{KryoSerializer}} if possible, with the following supported types:
> {code}
> private[this] val stringClassTag: ClassTag[String] = implicitly[ClassTag[String]]
>
> private[this] val primitiveAndPrimitiveArrayClassTags: Set[ClassTag[_]] = {
>   val primitiveClassTags = Set[ClassTag[_]](
>     ClassTag.Boolean,
>     ClassTag.Byte,
>     ClassTag.Char,
>     ClassTag.Double,
>     ClassTag.Float,
>     ClassTag.Int,
>     ClassTag.Long,
>     ClassTag.Null,
>     ClassTag.Short
>   )
>   val arrayClassTags = primitiveClassTags.map(_.wrap)
>   primitiveClassTags ++ arrayClassTags
> }
> {code}
> In my case, the type of the data is a byte array.
> This problem stems from SPARK-13990, a patch to have Spark automatically pick 
> the "best" serializer when caching RDDs.






[jira] [Closed] (SPARK-18474) Add StreamingQuery.status in python

2016-11-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das closed SPARK-18474.
-
Resolution: Duplicate

> Add StreamingQuery.status in python
> ---
>
> Key: SPARK-18474
> URL: https://issues.apache.org/jira/browse/SPARK-18474
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Resolved] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-28 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18407.
---
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.1.0

> Inferred partition columns cause assertion error
> 
>
> Key: SPARK-18407
> URL: https://issues.apache.org/jira/browse/SPARK-18407
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Michael Armbrust
>Assignee: Burak Yavuz
>Priority: Critical
> Fix For: 2.1.0
>
>
> [This 
> assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
>  fails when you run a stream against json data that is stored in partitioned 
> folders, if you manually specify the schema and that schema omits the 
> partitioned columns.
> My hunch is that we are inferring those columns even though the schema is 
> being passed in manually and adding them to the end.
> While we are fixing this bug, it would be nice to make the assertion better.  
> Truncating is not terribly useful as, at least in my case, it truncated the 
> most interesting part.  I changed it to this while debugging:
> {code}
>   s"""
>  |Batch does not have expected schema
>  |Expected: ${output.mkString(",")}
>  |Actual: ${newPlan.output.mkString(",")}
>  |
>  |== Original ==
>  |$logicalPlan
>  |
>  |== Batch ==
>  |$newPlan
>""".stripMargin
> {code}
> I also tried specifying the partition columns in the schema and now it 
> appears that they are filled with corrupted data.
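
For context, a minimal sketch of the setup described above (assuming JSON files under partitioned folders such as /data/date=2016-11-28/ and a user-supplied schema that omits the partition column; all names are illustrative):
{code}
import org.apache.spark.sql.types._

// user-specified schema deliberately omits the partition column `date`
val schema = new StructType().add("id", LongType).add("value", StringType)

val stream = spark.readStream
  .schema(schema)
  .json("/data/")   // folders like /data/date=2016-11-28/ trigger partition inference
{code}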






[jira] [Resolved] (SPARK-18577) Ambiguous reference with duplicate column names in aggregate

2016-11-28 Thread Yerui Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yerui Sun resolved SPARK-18577.
---
Resolution: Won't Fix

It's indeed not standard SQL usage; it is only supported in Hive, not in MySQL, 
PostgreSQL, or Presto, so it won't be fixed in Spark.

> Ambiguous reference with duplicate column names in aggregate
> 
>
> Key: SPARK-18577
> URL: https://issues.apache.org/jira/browse/SPARK-18577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Yerui Sun
>
> Assuming we have a table 't' with 3 columns 'id', 'name' and 'rank', 
> here's the SQL to reproduce the issue:
> {code}
> select id, count(*) from t t1 join t t2 on t1.name = t2.name group by t1.id
> {code}
> The error message is:
> {code}
> Reference 'id' is ambiguous, could be: id#3, id#9.; line 1 pos 7
> {code}
> The SQL can be parsed in Hive, since the select 'id' reference can be 
> resolved to 't1.id', which is present in the group expressions.
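
A qualified rewrite that should avoid the ambiguity in Spark, assuming 't1.id' is the intended reference (as in the group expression):
{code}
select t1.id, count(*) from t t1 join t t2 on t1.name = t2.name group by t1.id
{code}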






[jira] [Closed] (SPARK-18577) Ambiguous reference with duplicate column names in aggregate

2016-11-28 Thread Yerui Sun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yerui Sun closed SPARK-18577.
-

> Ambiguous reference with duplicate column names in aggregate
> 
>
> Key: SPARK-18577
> URL: https://issues.apache.org/jira/browse/SPARK-18577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Yerui Sun
>
> Assuming we have a table 't' with 3 columns 'id', 'name' and 'rank', 
> here's the SQL to reproduce the issue:
> {code}
> select id, count(*) from t t1 join t t2 on t1.name = t2.name group by t1.id
> {code}
> The error message is:
> {code}
> Reference 'id' is ambiguous, could be: id#3, id#9.; line 1 pos 7
> {code}
> The SQL can be parsed in Hive, since the select 'id' reference can be 
> resolved to 't1.id', which is present in the group expressions.






[jira] [Commented] (SPARK-18568) vertex attributes in the edge triplet not getting updated in super steps for Pregel API

2016-11-28 Thread Rohit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701622#comment-15701622
 ] 

Rohit commented on SPARK-18568:
---

Found the exact issue. If the vertex attribute is a complex object containing mutable 
objects, the edge triplet does not pick up the new state once the vertex attributes 
have already been shipped; if the vertex attributes are immutable objects, there is 
no issue. Below is code for the same: just changing the mutable HashMap to an 
immutable HashMap solves the issue. (This is not a fix for the bug; either this 
limitation should be made known to users, or the bug needs to be fixed for mutable 
objects.)
[~ankurd] Any suggestion on what I should look at more specifically to fix the 
above bug? Thanks

import org.apache.spark.graphx._
import com.alibaba.fastjson.JSONObject
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.log4j.Logger
import org.apache.log4j.Level
import scala.collection.mutable.HashMap


object PregelTest {
  val logger = Logger.getLogger(getClass().getName());

  def run(graph: Graph[HashMap[String, Int], HashMap[String, Int]]): Graph[HashMap[String, Int], HashMap[String, Int]] = {

    def vProg(v: VertexId, attr: HashMap[String, Int], msg: Integer): HashMap[String, Int] = {
      var updatedAttr = attr

      if (msg < 0) {
        // init message received
        if (v.equals(0.asInstanceOf[VertexId])) updatedAttr = attr.+=("LENGTH" -> 0)
        else updatedAttr = attr.+=("LENGTH" -> Integer.MAX_VALUE)
      } else {
        updatedAttr = attr.+=("LENGTH" -> (msg + 1))
      }
      updatedAttr
    }

    def sendMsg(triplet: EdgeTriplet[HashMap[String, Int], HashMap[String, Int]]): Iterator[(VertexId, Integer)] = {
      val len = triplet.srcAttr.get("LENGTH").get
      // send a msg if last hub is reachable
      if (len < Integer.MAX_VALUE) Iterator((triplet.dstId, len))
      else Iterator.empty
    }

    def mergeMsg(msg1: Integer, msg2: Integer): Integer = {
      if (msg1 < msg2) msg1 else msg2
    }

    Pregel(graph, new Integer(-1), 3, EdgeDirection.Either)(vProg, sendMsg, mergeMsg)
  }

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    val conf = new SparkConf().setAppName("Pregel Test")
    conf.set("spark.master", "local")
    val sc = new SparkContext(conf)
    val test = new HashMap[String, Int]

    // create a simplest test graph with 3 nodes and 2 edges
    val vertexList = Array(
      (0.asInstanceOf[VertexId], new HashMap[String, Int]),
      (1.asInstanceOf[VertexId], new HashMap[String, Int]),
      (2.asInstanceOf[VertexId], new HashMap[String, Int]))
    val edgeList = Array(
      Edge(0.asInstanceOf[VertexId], 1.asInstanceOf[VertexId], new HashMap[String, Int]),
      Edge(1.asInstanceOf[VertexId], 2.asInstanceOf[VertexId], new HashMap[String, Int]))

    val vertexRdd = sc.parallelize(vertexList)
    val edgeRdd = sc.parallelize(edgeList)
    val g = Graph[HashMap[String, Int], HashMap[String, Int]](vertexRdd, edgeRdd)

    // run test code
    val lpa = run(g)
    lpa.vertices.collect().map(println)
  }
}
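
For completeness, a sketch of the immutable-map variant described above; only the changed vertex program is shown, and the rest of PregelTest is assumed unchanged apart from building the attributes as immutable maps:

import scala.collection.immutable.HashMap

def vProg(v: VertexId, attr: HashMap[String, Int], msg: Integer): HashMap[String, Int] = {
  if (msg < 0) {
    // init message received
    if (v == 0L) attr + ("LENGTH" -> 0) else attr + ("LENGTH" -> Integer.MAX_VALUE)
  } else {
    attr + ("LENGTH" -> (msg + 1))
  }
}
// vertexList/edgeList would then use HashMap.empty[String, Int] instead of new HashMap[String, Int]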

> vertex attributes in the edge triplet not getting updated in super steps for 
> Pregel API
> ---
>
> Key: SPARK-18568
> URL: https://issues.apache.org/jira/browse/SPARK-18568
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.2
>Reporter: Rohit
>
> When running the Pregel API with vertex attributes that are complex objects, the 
> vertex attributes are not getting updated in the triplet view. For example, if 
> the vertex attributes change in the first superstep for vertex "a", the triplet 
> src attributes in the sendMsg program for the first superstep get the 
> latest attributes of vertex "a"; but in the second superstep, if the vertex 
> attributes change in vProg, the edge triplets are not updated with this 
> new state of the vertex for all the edge triplets having vertex "a" as 
> src or destination. If I re-create the graph using g = Graph(g.vertices, 
> g.edges) in the while loop before the next superstep, then it gets 
> updated, but this fix is not good performance-wise. A detailed description of 
> the bug along with the code to recreate it is in the attached URL.






[jira] [Commented] (SPARK-18604) Collapse Window optimizer rule changes column order

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701627#comment-15701627
 ] 

Apache Spark commented on SPARK-18604:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16027

> Collapse Window optimizer rule changes column order
> ---
>
> Key: SPARK-18604
> URL: https://issues.apache.org/jira/browse/SPARK-18604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> The recently added CollapseWindow optimizer rule changes the column order of 
> attributes. This actually modifies the schema of the logical plan (which 
> optimization should not do), and breaks `collect()` in a subtle way (we bind 
> the row encoder to the output of the logical plan and not the optimized 
> plan). 
> For example the following code:
> {noformat}
> val customers = Seq(
>   ("Alice", "2016-05-01", 50.00),
>   ("Alice", "2016-05-03", 45.00),
>   ("Alice", "2016-05-04", 55.00),
>   ("Bob", "2016-05-01", 25.00),
>   ("Bob", "2016-05-04", 29.00),
>   ("Bob", "2016-05-06", 27.00)).
>   toDF("name", "date", "amountSpent")
>  
> // Import the window functions.
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
>  
> // Create a window spec.
> val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
> val df2 = customers
>   .withColumn("total", sum(customers("amountSpent")).over(wSpec1))
>   .withColumn("cnt", count(customers("amountSpent")).over(wSpec1))
> {noformat}
> ...yields the following weird result:
> {noformat}
> +-----+----------+-----------+--------+-------------------+
> | name|      date|amountSpent|   total|                cnt|
> +-----+----------+-----------+--------+-------------------+
> |  Bob|2016-05-01|       25.0|1.0E-323|4632796641680687104|
> |  Bob|2016-05-04|       29.0|1.5E-323|4635400285215260672|
> |  Bob|2016-05-06|       27.0|1.0E-323|4633078116657397760|
> |Alice|2016-05-01|       50.0|1.0E-323|4636385447633747968|
> |Alice|2016-05-03|       45.0|1.5E-323|4639481672377565184|
> |Alice|2016-05-04|       55.0|1.0E-323|4636737291354636288|
> +-----+----------+-----------+--------+-------------------+
> {noformat}
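
An illustrative way to observe the mismatch with the same df2 (using the public queryExecution handle): compare the analyzed and optimized plan outputs, since the row encoder is bound to the former.
{code}
df2.queryExecution.analyzed.output.map(_.name)       // column order the encoder expects
df2.queryExecution.optimizedPlan.output.map(_.name)  // column order after CollapseWindow
{code}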






[jira] [Assigned] (SPARK-18604) Collapse Window optimizer rule changes column order

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18604:


Assignee: Apache Spark

> Collapse Window optimizer rule changes column order
> ---
>
> Key: SPARK-18604
> URL: https://issues.apache.org/jira/browse/SPARK-18604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> The recently added CollapseWindow optimizer rule changes the column order of 
> attributes. This actually modifies the schema of the logical plan (which 
> optimization should not do), and breaks `collect()` in a subtle way (we bind 
> the row encoder to the output of the logical plan and not the optimized 
> plan). 
> For example the following code:
> {noformat}
> val customers = Seq(
>   ("Alice", "2016-05-01", 50.00),
>   ("Alice", "2016-05-03", 45.00),
>   ("Alice", "2016-05-04", 55.00),
>   ("Bob", "2016-05-01", 25.00),
>   ("Bob", "2016-05-04", 29.00),
>   ("Bob", "2016-05-06", 27.00)).
>   toDF("name", "date", "amountSpent")
>  
> // Import the window functions.
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
>  
> // Create a window spec.
> val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
> val df2 = customers
>   .withColumn("total", sum(customers("amountSpent")).over(wSpec1))
>   .withColumn("cnt", count(customers("amountSpent")).over(wSpec1))
> {noformat}
> ...yields the following weird result:
> {noformat}
> +-----+----------+-----------+--------+-------------------+
> | name|      date|amountSpent|   total|                cnt|
> +-----+----------+-----------+--------+-------------------+
> |  Bob|2016-05-01|       25.0|1.0E-323|4632796641680687104|
> |  Bob|2016-05-04|       29.0|1.5E-323|4635400285215260672|
> |  Bob|2016-05-06|       27.0|1.0E-323|4633078116657397760|
> |Alice|2016-05-01|       50.0|1.0E-323|4636385447633747968|
> |Alice|2016-05-03|       45.0|1.5E-323|4639481672377565184|
> |Alice|2016-05-04|       55.0|1.0E-323|4636737291354636288|
> +-----+----------+-----------+--------+-------------------+
> {noformat}






[jira] [Assigned] (SPARK-18604) Collapse Window optimizer rule changes column order

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18604:


Assignee: (was: Apache Spark)

> Collapse Window optimizer rule changes column order
> ---
>
> Key: SPARK-18604
> URL: https://issues.apache.org/jira/browse/SPARK-18604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> The recently added CollapseWindow optimizer rule changes the column order of 
> attributes. This actually modifies the schema of the logical plan (which 
> optimization should not do), and breaks `collect()` in a subtle way (we bind 
> the row encoder to the output of the logical plan and not the optimized 
> plan). 
> For example the following code:
> {noformat}
> val customers = Seq(
>   ("Alice", "2016-05-01", 50.00),
>   ("Alice", "2016-05-03", 45.00),
>   ("Alice", "2016-05-04", 55.00),
>   ("Bob", "2016-05-01", 25.00),
>   ("Bob", "2016-05-04", 29.00),
>   ("Bob", "2016-05-06", 27.00)).
>   toDF("name", "date", "amountSpent")
>  
> // Import the window functions.
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
>  
> // Create a window spec.
> val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
> val df2 = customers
>   .withColumn("total", sum(customers("amountSpent")).over(wSpec1))
>   .withColumn("cnt", count(customers("amountSpent")).over(wSpec1))
> {noformat}
> ...yields the following weird result:
> {noformat}
> +-----+----------+-----------+--------+-------------------+
> | name|      date|amountSpent|   total|                cnt|
> +-----+----------+-----------+--------+-------------------+
> |  Bob|2016-05-01|       25.0|1.0E-323|4632796641680687104|
> |  Bob|2016-05-04|       29.0|1.5E-323|4635400285215260672|
> |  Bob|2016-05-06|       27.0|1.0E-323|4633078116657397760|
> |Alice|2016-05-01|       50.0|1.0E-323|4636385447633747968|
> |Alice|2016-05-03|       45.0|1.5E-323|4639481672377565184|
> |Alice|2016-05-04|       55.0|1.0E-323|4636737291354636288|
> +-----+----------+-----------+--------+-------------------+
> {noformat}






[jira] [Resolved] (SPARK-18604) Collapse Window optimizer rule changes column order

2016-11-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18604.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.1.0

> Collapse Window optimizer rule changes column order
> ---
>
> Key: SPARK-18604
> URL: https://issues.apache.org/jira/browse/SPARK-18604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.1.0
>
>
> The recently added CollapseWindow optimizer rule changes the column order of 
> attributes. This actually modifies the schema of the logical plan (which 
> optimization should not do), and breaks `collect()` in a subtle way (we bind 
> the row encoder to the output of the logical plan and not the optimized 
> plan). 
> For example the following code:
> {noformat}
> val customers = Seq(
>   ("Alice", "2016-05-01", 50.00),
>   ("Alice", "2016-05-03", 45.00),
>   ("Alice", "2016-05-04", 55.00),
>   ("Bob", "2016-05-01", 25.00),
>   ("Bob", "2016-05-04", 29.00),
>   ("Bob", "2016-05-06", 27.00)).
>   toDF("name", "date", "amountSpent")
>  
> // Import the window functions.
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions._
>  
> // Create a window spec.
> val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
> val df2 = customers
>   .withColumn("total", sum(customers("amountSpent")).over(wSpec1))
>   .withColumn("cnt", count(customers("amountSpent")).over(wSpec1))
> {noformat}
> ...yields the following weird result:
> {noformat}
> +-----+----------+-----------+--------+-------------------+
> | name|      date|amountSpent|   total|                cnt|
> +-----+----------+-----------+--------+-------------------+
> |  Bob|2016-05-01|       25.0|1.0E-323|4632796641680687104|
> |  Bob|2016-05-04|       29.0|1.5E-323|4635400285215260672|
> |  Bob|2016-05-06|       27.0|1.0E-323|4633078116657397760|
> |Alice|2016-05-01|       50.0|1.0E-323|4636385447633747968|
> |Alice|2016-05-03|       45.0|1.5E-323|4639481672377565184|
> |Alice|2016-05-04|       55.0|1.0E-323|4636737291354636288|
> +-----+----------+-----------+--------+-------------------+
> {noformat}






[jira] [Commented] (SPARK-18527) UDAFPercentile (bigint, array) needs explicity cast to double

2016-11-28 Thread Fabian Boehnlein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701661#comment-15701661
 ] 

Fabian Boehnlein commented on SPARK-18527:
--

Interesting [~hvanhovell], good to see that 
{code}percentile(a, array){code}
is aimed to be 
[covered|https://github.com/apache/spark/pull/14136/files#diff-a15a6f87f9676612c69435953a13ddd3R127]
 in Spark's own implementation. Indeed, that PR seems quite big for an upcoming release.

Maybe [~dongjoon] could give some starting points for this one, related to the 
closely related PR: https://github.com/apache/spark/pull/13930

Thanks!

> UDAFPercentile (bigint, array) needs explicity cast to double
> -
>
> Key: SPARK-18527
> URL: https://issues.apache.org/jira/browse/SPARK-18527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell
>Reporter: Fabian Boehnlein
>
> Same bug as SPARK-16228 but 
> {code}_FUNC_(bigint, array) {code}
> instead of 
> {code}_FUNC_(bigint, double){code}
> The fix for SPARK-16228 only fixes the non-array case that was hit.
> {code}
> sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")
> {code}
> fails in Spark 2 shell.
> Longer example
> {code}
> case class Record(key: Long, value: String)
> val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, 
> s"val_$i")))
> recordsDF.createOrReplaceTempView("records")
> sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 
> 0.2, 0.1)) AS test FROM records")
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.had
> oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible 
> choices: _FUNC_(bigint, array)  _FUNC_(bigint, double)  ; line 1 pos 7
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164)
>   at 
> org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code}
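
A hedged workaround sketch suggested by the issue title (untested here): cast the fractions explicitly so the call matches the {{_FUNC_(bigint, array<double>)}} signature.
{code}
sql("select percentile(value, array(cast(0.5 as double), cast(0.99 as double))) from values 1,2,3 T(value)")
{code}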






[jira] [Assigned] (SPARK-18607) give a result on a percent of the tasks succeed

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18607:


Assignee: Apache Spark

> give a result on  a percent of the tasks succeed
> 
>
> Key: SPARK-18607
> URL: https://issues.apache.org/jira/browse/SPARK-18607
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Ru Xiang
>Assignee: Apache Spark
>  Labels: patch
>
> In this patch, we modify the code corresponding to runApproximateJob so that 
> we can get a result when a specified percentage of tasks succeed.
> In a production environment, the 'long tail' of straggling tasks is a common and 
> urgent problem. In practice, as long as we can get results from a specified 
> percentage of tasks, we can guarantee the final result. This is a common 
> requirement in the practice of machine learning algorithms.






[jira] [Commented] (SPARK-18607) give a result on a percent of the tasks succeed

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701683#comment-15701683
 ] 

Apache Spark commented on SPARK-18607:
--

User 'Ru-Xiang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16033

> give a result on  a percent of the tasks succeed
> 
>
> Key: SPARK-18607
> URL: https://issues.apache.org/jira/browse/SPARK-18607
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Ru Xiang
>  Labels: patch
>
> In this patch, we modify the code corresponding to runApproximateJob so that 
> we can get a result when a specified percentage of tasks succeed.
> In a production environment, the 'long tail' of straggling tasks is a common and 
> urgent problem. In practice, as long as we can get results from a specified 
> percentage of tasks, we can guarantee the final result. This is a common 
> requirement in the practice of machine learning algorithms.






[jira] [Assigned] (SPARK-18607) give a result on a percent of the tasks succeed

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18607:


Assignee: (was: Apache Spark)

> give a result on  a percent of the tasks succeed
> 
>
> Key: SPARK-18607
> URL: https://issues.apache.org/jira/browse/SPARK-18607
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Ru Xiang
>  Labels: patch
>
> In this patch, we modify the code corresponding to runApproximateJob so that 
> we can get a result when a specified percentage of tasks succeed.
> In a production environment, the 'long tail' of straggling tasks is a common and 
> urgent problem. In practice, as long as we can get results from a specified 
> percentage of tasks, we can guarantee the final result. This is a common 
> requirement in the practice of machine learning algorithms.






[jira] [Created] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2016-11-28 Thread Furcy Pin (JIRA)
Furcy Pin created SPARK-18609:
-

 Summary: [SQL] column mixup with CROSS JOIN
 Key: SPARK-18609
 URL: https://issues.apache.org/jira/browse/SPARK-18609
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.2, 2.1.0
Reporter: Furcy Pin


Reproduced on spark-sql v2.0.2 and on branch master.

{code}
DROP TABLE IF EXISTS p1 ;
DROP TABLE IF EXISTS p2 ;

CREATE TABLE p1 (col TIMESTAMP) ;
CREATE TABLE p2 (col TIMESTAMP) ;

set spark.sql.crossJoin.enabled = true;

-- EXPLAIN
WITH CTE AS (
  SELECT
s2.col as col
  FROM p1
  CROSS JOIN (
SELECT
  e.col as col
FROM p2 E
  ) s2
)
SELECT
  T1.col as c1,
  T2.col as c2
FROM CTE T1
CROSS JOIN CTE T2
;
{code}

This returns the following stack trace:
{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: col#21
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at 
org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
at 
org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
at 
org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
at 
org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:30)
at 
org.apache.spark.sql.

[jira] [Commented] (SPARK-18527) UDAFPercentile (bigint, array) needs explicity cast to double

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701806#comment-15701806
 ] 

Apache Spark commented on SPARK-18527:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16034

> UDAFPercentile (bigint, array) needs explicity cast to double
> -
>
> Key: SPARK-18527
> URL: https://issues.apache.org/jira/browse/SPARK-18527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell
>Reporter: Fabian Boehnlein
>
> Same bug as SPARK-16228 but 
> {code}_FUNC_(bigint, array) {code}
> instead of 
> {code}_FUNC_(bigint, double){code}
> The fix for SPARK-16228 only fixes the non-array case that was hit.
> {code}
> sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")
> {code}
> fails in Spark 2 shell.
> Longer example
> {code}
> case class Record(key: Long, value: String)
> val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, 
> s"val_$i")))
> recordsDF.createOrReplaceTempView("records")
> sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 
> 0.2, 0.1)) AS test FROM records")
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.had
> oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible 
> choices: _FUNC_(bigint, array)  _FUNC_(bigint, double)  ; line 1 pos 7
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164)
>   at 
> org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code}
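
Until the pull request above is merged, one possible workaround (a sketch only, assuming the "records" temp view from the example and that the failure is just the missing implicit cast named in the title) is to cast the percentile positions to DOUBLE explicitly:

{code}
// Workaround sketch (not verified on every 2.0.x build): cast the percentile
// positions to DOUBLE explicitly so the Hive UDAF signature is matched without
// relying on implicit coercion. Assumes the "records" temp view from the example.
spark.sql(
  """SELECT percentile(key,
    |  array(CAST(0.95 AS DOUBLE), CAST(0.5 AS DOUBLE), CAST(0.1 AS DOUBLE))) AS test
    |FROM records""".stripMargin).show()
{code}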



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18527) UDAFPercentile (bigint, array) needs explicit cast to double

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18527:


Assignee: (was: Apache Spark)

> UDAFPercentile (bigint, array) needs explicit cast to double
> -
>
> Key: SPARK-18527
> URL: https://issues.apache.org/jira/browse/SPARK-18527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell
>Reporter: Fabian Boehnlein
>
> Same bug as SPARK-16228 but 
> {code}_FUNC_(bigint, array) {code}
> instead of 
> {code}_FUNC_(bigint, double){code}
> Fix of SPARK-16228 only fixes the non-array case that was hit.
> {code}
> sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")
> {code}
> fails in Spark 2 shell.
> Longer example
> {code}
> case class Record(key: Long, value: String)
> val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, 
> s"val_$i")))
> recordsDF.createOrReplaceTempView("records")
> sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 
> 0.2, 0.1)) AS test FROM records")
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.had
> oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible 
> choices: _FUNC_(bigint, array)  _FUNC_(bigint, double)  ; line 1 pos 7
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164)
>   at 
> org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18527) UDAFPercentile (bigint, array) needs explicit cast to double

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18527:


Assignee: Apache Spark

> UDAFPercentile (bigint, array) needs explicit cast to double
> -
>
> Key: SPARK-18527
> URL: https://issues.apache.org/jira/browse/SPARK-18527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: spark-2.0.1-bin-hadoop2.7/bin/spark-shell
>Reporter: Fabian Boehnlein
>Assignee: Apache Spark
>
> Same bug as SPARK-16228 but 
> {code}_FUNC_(bigint, array) {code}
> instead of 
> {code}_FUNC_(bigint, double){code}
> Fix of SPARK-16228 only fixes the non-array case that was hit.
> {code}
> sql("select percentile(value, array(0.5,0.99)) from values 1,2,3 T(value)")
> {code}
> fails in Spark 2 shell.
> Longer example
> {code}
> case class Record(key: Long, value: String)
> val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i.toLong, 
> s"val_$i")))
> recordsDF.createOrReplaceTempView("records")
> sql("SELECT percentile(key, Array(0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 
> 0.2, 0.1)) AS test FROM records")
> org.apache.spark.sql.AnalysisException: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.had
> oop.hive.ql.udf.UDAFPercentile with (bigint, array). Possible 
> choices: _FUNC_(bigint, array)  _FUNC_(bigint, double)  ; line 1 pos 7
>   at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164)
>   at 
> org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47){code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18605) Spark Streaming ERROR TransportResponseHandler: Still have 1 requests outstanding when connection

2016-11-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18605:
--
Target Version/s:   (was: 1.6.2)
   Fix Version/s: (was: 1.6.2)

> Spark Streaming ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection 
> --
>
> Key: SPARK-18605
> URL: https://issues.apache.org/jira/browse/SPARK-18605
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2
> Environment: spark-submit \
> --driver-java-options "-XX:PermSize=1024M -XX:MaxPermSize=3072M" \
> --driver-memory 3G  \
> --class cn.com.jldata.ETLDiver \
> --master yarn \
> --deploy-mode cluster \
> --proxy-user hdfs \
> --executor-memory 5G \
> --executor-cores 3 \
> --num-executors 6 \
> --conf spark.dynamicAllocation.enabled=true \
> --conf spark.dynamicAllocation.initialExecutors=10 \
> --conf spark.dynamicAllocation.maxExecutors=20 \
> --conf spark.dynamicAllocation.minExecutors=6 \
> --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
> --conf spark.network.timeout=300 \
> --conf spark.yarn.executor.memoryOverhead=4096 \
> --conf spark.yarn.driver.memoryOverhead=2048 \
> --conf spark.driver.cores=3 \
> --conf spark.shuffle.memoryFraction=0.5 \
> --conf spark.storage.memoryFraction=0.3 \
> --conf spark.core.connection.ack.wait.timeout=300  \
> --conf spark.shuffle.service.enabled=true \
> --conf spark.shuffle.service.port=7337 \
> --queue spark \
>Reporter: jiafeng.zhang
>
> 16/11/26 11:01:02 WARN TransportChannelHandler: Exception in connection from 
> dpnode12/192.168.9.26:7337
> java.io.IOException: Connection timed out
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/26 11:01:02 ERROR TransportResponseHandler: Still have 1 requests 
> outstanding when connection from dpnode12/192.168.9.26:7337 is closed
> 16/11/26 11:01:02 ERROR OneForOneBlockFetcher: Failed while starting block 
> fetches
> java.io.IOException: Connection timed out
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>   at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/26 11:01:02 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 
> outstanding blocks after 5000 ms
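
The retry in the last log line ("Retrying fetch (1/3) ... after 5000 ms") reflects the shuffle client's defaults ({{spark.shuffle.io.maxRetries}}=3, {{spark.shuffle.io.retryWait}}=5s). A hedged mitigation sketch, not a confirmed fix for this report, is to raise those settings together with the network timeout:

{code}
// Mitigation sketch only; the values below are illustrative, not tuned recommendations.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "600s")      // the submit command above already raises this to 300
  .set("spark.shuffle.io.maxRetries", "10")  // default is 3, matching "(1/3)" in the log
  .set("spark.shuffle.io.retryWait", "30s")  // default is 5s, matching "5000 ms" in the log
// pass `conf` to the SparkContext / SparkSession builder, or supply the same keys via --conf
{code}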



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18118) SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows Beyond 64 KB

2016-11-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18118.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.1.0
   2.0.3

> SpecificSafeProjection.apply of Java Object from Dataset to JavaRDD Grows 
> Beyond 64 KB
> --
>
> Key: SPARK-18118
> URL: https://issues.apache.org/jira/browse/SPARK-18118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
> Fix For: 2.0.3, 2.1.0
>
>
> For sufficiently wide or nested Java Objects, when SpecificSafeProjection 
> attempts to recreate the object from an InternalRow, the generated 
> SpecificSafeProjection.apply method is larger than allowed: 
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection"
>  grows beyond 64 KB
> {code}
> Although related, this issue appears not to have been resolved by 
> SPARK-15285. Since there is only one top-level object when projecting, 
> splitExpressions finds no additional Expressions to split. The result is a 
> single large, nested Expression that forms the apply code.
> See the reproducer for an example [1].
> [1] - https://github.com/bdrillard/specific-safe-projection-error
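
For context, a minimal sketch of the conversion path involved is below; this narrow case class will not hit the 64 KB limit by itself, a real reproduction needs an object as wide or as deeply nested as the reproducer above.

{code}
// Sketch of the conversion path only: turning a typed Dataset back into an RDD/JavaRDD
// deserializes each InternalRow into the object type through the generated
// SpecificSafeProjection.apply, which is where the code grows with object width.
import spark.implicits._

case class Small(a: Int, b: String)   // illustrative type only

val ds = Seq(Small(1, "a"), Small(2, "b")).toDS()
ds.rdd.collect()
{code}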



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2016-11-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18609:
--
Component/s: SQL

> [SQL] column mixup with CROSS JOIN
> --
>
> Key: SPARK-18609
> URL: https://issues.apache.org/jira/browse/SPARK-18609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Furcy Pin
>
> Reproduced on spark-sql v2.0.2 and on branch master.
> {code}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> CREATE TABLE p1 (col TIMESTAMP) ;
> CREATE TABLE p2 (col TIMESTAMP) ;
> set spark.sql.crossJoin.enabled = true;
> -- EXPLAIN
> WITH CTE AS (
>   SELECT
> s2.col as col
>   FROM p1
>   CROSS JOIN (
> SELECT
>   e.col as col
> FROM p2 E
>   ) s2
> )
> SELECT
>   T1.col as c1,
>   T2.col as c2
> FROM CTE T1
> CROSS JOIN CTE T2
> ;
> {code}
> This returns the following stacktrace :
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: col#21
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperatio

[jira] [Updated] (SPARK-18609) [SQL] column mixup with CROSS JOIN

2016-11-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18609:
--
Labels:   (was: sql)

> [SQL] column mixup with CROSS JOIN
> --
>
> Key: SPARK-18609
> URL: https://issues.apache.org/jira/browse/SPARK-18609
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Furcy Pin
>
> Reproduced on spark-sql v2.0.2 and on branch master.
> {code}
> DROP TABLE IF EXISTS p1 ;
> DROP TABLE IF EXISTS p2 ;
> CREATE TABLE p1 (col TIMESTAMP) ;
> CREATE TABLE p2 (col TIMESTAMP) ;
> set spark.sql.crossJoin.enabled = true;
> -- EXPLAIN
> WITH CTE AS (
>   SELECT
> s2.col as col
>   FROM p1
>   CROSS JOIN (
> SELECT
>   e.col as col
> FROM p2 E
>   ) s2
> )
> SELECT
>   T1.col as c1,
>   T2.col as c2
> FROM CTE T1
> CROSS JOIN CTE T2
> ;
> {code}
> This returns the following stacktrace :
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: col#21
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55)
>   at 
> org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
>   at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOper

[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2016-11-28 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701839#comment-15701839
 ] 

Nick Pentreath commented on SPARK-18608:


I've also been meaning to log this for a little while now. 

It's actually not a simple fix - there is now some automated casting of the label column to 
Double in {{predictor.fit}}, and that cast produces a new dataset which loses the original 
storage level information, so a check inside the algorithm would still see nothing cached. 
We could perhaps centralize the persistence-handling logic in {{fit}} itself.

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.
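
A minimal sketch of the corrected check suggested above (names are illustrative, not the actual ML code):

{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Use the Dataset's own storage level (available since SPARK-16063) instead of
// dataset.rdd.getStorageLevel, which always reports NONE even for a cached DataFrame.
def needsInternalCaching(dataset: Dataset[_]): Boolean =
  dataset.storageLevel == StorageLevel.NONE

// e.g. if (needsInternalCaching(dataset)) instances.persist(StorageLevel.MEMORY_AND_DISK)
{code}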



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-17732:
-

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17732:

Target Version/s: 2.2.0

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17732:

Fix Version/s: (was: 2.2.0)

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18610) greatest/least fails to run with string against date/timestamp

2016-11-28 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-18610:


 Summary: greatest/least fails to run with string against date/timestamp
 Key: SPARK-18610
 URL: https://issues.apache.org/jira/browse/SPARK-18610
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Hyukjin Kwon


It seems Spark SQL fails to implicitly cast (or detect widen type) from string 
with date/timestamp.

{code}
spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
DATE))' due to data type mismatch: The expressions should all have the same 
type, got GREATEST(string, date).; line 1 pos 7
{code}

It seems, at least, other DBMS support this by implicit casting/widened types.

{code}
hive> select greatest("2015-02-021", date("2015-01-01"));
OK
2015-01-01
Time taken: 0.019 seconds, Fetched: 1 row(s)
hive> select greatest("-02-021", date("2015-01-01"));
OK
2015-01-01
Time taken: 0.02 seconds, Fetched: 1 row(s)
hive>
hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
date("2015-01-01"));
OK
Time taken: 2.63 seconds
hive> DESCRIBE typeof;
OK
_c0 date
Time taken: 0.031 seconds, Fetched: 1 row(s)
{code}

{code}
mysql> select greatest("2015-02-02abc", date("2015-01-01"));
+-----------------------------------------------+
| greatest("2015-02-02abc", date("2015-01-01")) |
+-----------------------------------------------+
| 2015-02-02abc                                 |
+-----------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
date("2015-01-01"));
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> DESCRIBE typeof;
+--------------------------------------------+-------------+------+-----+---------+-------+
| Field                                      | Type        | Null | Key | Default | Extra |
+--------------------------------------------+-------------+------+-----+---------+-------+
| greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  |     | NULL    |       |
+--------------------------------------------+-------------+------+-----+---------+-------+
1 row in set (0.00 sec)
{code}

{code}
postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
ERROR:  invalid input syntax for type date: "2015-02-02abc"
LINE 1: select greatest('2015-02-02abc', date('2015-01-01'));

postgres=# CREATE TEMPORARY TABLE typeof as select greatest('2015-02-02', 
date('2015-01-01'));
SELECT 1

postgres=# \d+ typeof
  Table "pg_temp_3.typeof"
  Column  | Type | Modifiers | Storage | Stats target | Description
--+--+---+-+--+-
 greatest | date |   | plain   |  |
Has OIDs: no
{code}

I tracked down and it seems we want Hive's behaviour assuming from SPARK-12201.
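
Until the coercion rule is settled, one workaround is to make the argument types match explicitly; which way to cast depends on the semantics wanted (date comparison vs. string comparison). A sketch:

{code}
// Both arguments as DATE: compares as dates.
spark.sql("SELECT greatest(CAST('2015-02-02' AS DATE), date('2015-01-01'))").show()

// Both arguments as STRING: compares lexicographically, as the MySQL example above does.
spark.sql("SELECT greatest('2015-02-02', CAST(date('2015-01-01') AS STRING))").show()
{code}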



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17732:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17732:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701984#comment-15701984
 ] 

Apache Spark commented on SPARK-17732:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16036

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18610) greatest/least fails to run with string against date/timestamp

2016-11-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702007#comment-15702007
 ] 

Sean Owen commented on SPARK-18610:
---

I know I should know this, and could test it, but maybe you know the answer 
directly: does Spark otherwise coerce a string to a date? It does seem a little 
odd to me.

> greatest/least fails to run with string against date/timestamp
> ---
>
> Key: SPARK-18610
> URL: https://issues.apache.org/jira/browse/SPARK-18610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It seems Spark SQL fails to implicitly cast (or detect widen type) from 
> string with date/timestamp.
> {code}
> spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
> Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
> DATE))' due to data type mismatch: The expressions should all have the same 
> type, got GREATEST(string, date).; line 1 pos 7
> {code}
> It seems, at least, other DBMS support this by implicit casting/widened types.
> {code}
> hive> select greatest("2015-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.019 seconds, Fetched: 1 row(s)
> hive> select greatest("-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.02 seconds, Fetched: 1 row(s)
> hive>
> hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> OK
> Time taken: 2.63 seconds
> hive> DESCRIBE typeof;
> OK
> _c0   date
> Time taken: 0.031 seconds, Fetched: 1 row(s)
> {code}
> {code}
> mysql> select greatest("2015-02-02abc", date("2015-01-01"));
> +---+
> | greatest("2015-02-02abc", date("2015-01-01")) |
> +---+
> | 2015-02-02abc |
> +---+
> 1 row in set, 1 warning (0.00 sec)
> mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> Query OK, 1 row affected (0.01 sec)
> Records: 1  Duplicates: 0  Warnings: 0
> mysql> DESCRIBE typeof;
> ++-+--+-+-+---+
> | Field  | Type| Null | Key | 
> Default | Extra |
> ++-+--+-+-+---+
> | greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  | | 
> NULL|   |
> ++-+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}
> {code}
> postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
> ERROR:  invalid input syntax for type date: "2015-02-02abc"
> LINE 1: select greatest('2015-02-02abc', date('2015-01-01'));
> postgres=# CREATE TEMPORARY TABLE typeof as select greatest('2015-02-02', 
> date('2015-01-01'));
> SELECT 1
> postgres=# \d+ typeof
>   Table "pg_temp_3.typeof"
>   Column  | Type | Modifiers | Storage | Stats target | Description
> --+--+---+-+--+-
>  greatest | date |   | plain   |  |
> Has OIDs: no
> {code}
> I tracked down and it seems we want Hive's behaviour assuming from 
> SPARK-12201.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file

2016-11-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702009#comment-15702009
 ] 

Wenchen Fan commented on SPARK-18220:
-

> I created a separate table by querying an existing table in hive for testing 
> purposes.

Can you provide the SQL statement that created this table? It would also be good 
if you could run `DESC TABLE` on the Hive table and post the results. Thanks!
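
For reference, the requested information can be pulled from spark-shell roughly as follows ({{your_orc_table}} is a placeholder for the actual table name):

{code}
// Gather the DDL and the declared column types for the Hive table.
spark.sql("SHOW CREATE TABLE your_orc_table").show(truncate = false)
spark.sql("DESC FORMATTED your_orc_table").show(200, truncate = false)
{code}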

> ClassCastException occurs when using select query on ORC file
> -
>
> Key: SPARK-18220
> URL: https://issues.apache.org/jira/browse/SPARK-18220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jerryjung
>  Labels: orcfile, sql
>
> Error message is below.
> {noformat}
> ==
> 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from 
> hdfs://xxx/part-00022 with {include: [true], offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
> 1220 bytes result sent to driver
> 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 
> 42) in 116 ms on localhost (executor driver) (19/20)
> 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 
> 35)
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ORC dump info.
> ==
> File Version: 0.12 with HIVE_8732
> 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from 
> hdfs://XXX/part-0 with {include: null, offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on 
> read. Using file schema.
> Rows: 7
> Compression: ZLIB
> Compression size: 262144
> Type: 
> struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18610) greatest/least fails to run with string against date/timestamp

2016-11-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702032#comment-15702032
 ] 

Hyukjin Kwon commented on SPARK-18610:
--

Ah, it is a bit more complicated than I fully understand, but to the best of my 
knowledge the answer is yes.

It seems there are some cases that use {{TypeCoercion.findWiderTypeForTwo}}, which 
promotes the type to {{StringType}} when it fails to find a compatible type.

It seems only {{Least}}/{{Greatest}} have a special rule, 
{{TypeCoercion.findWiderTypeWithoutStringPromotion}}, which does not fall back to 
{{StringType}} when coercing.

BTW, some expressions extending {{ImplicitTypeCasts}} implicitly cast their input 
types to the desired types specified in {{inputTypes}}.
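
A quick way to see this asymmetry from the shell (a sketch; the behaviour is assumed from the coercion rules described above rather than re-verified on every branch):

{code}
// coalesce resolves through the coercion path that still falls back to string promotion ...
spark.sql("SELECT coalesce(date('2015-01-01'), '2015-02-02')").show()

// ... while greatest/least use findWiderTypeWithoutStringPromotion and fail to resolve.
spark.sql("SELECT greatest(date('2015-01-01'), '2015-02-02')").show()   // AnalysisException: data type mismatch
{code}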


> greatest/least fails to run with string against date/timestamp
> ---
>
> Key: SPARK-18610
> URL: https://issues.apache.org/jira/browse/SPARK-18610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It seems Spark SQL fails to implicitly cast (or detect widen type) from 
> string with date/timestamp.
> {code}
> spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
> Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
> DATE))' due to data type mismatch: The expressions should all have the same 
> type, got GREATEST(string, date).; line 1 pos 7
> {code}
> It seems, at least, other DBMS support this by implicit casting/widened types.
> {code}
> hive> select greatest("2015-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.019 seconds, Fetched: 1 row(s)
> hive> select greatest("-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.02 seconds, Fetched: 1 row(s)
> hive>
> hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> OK
> Time taken: 2.63 seconds
> hive> DESCRIBE typeof;
> OK
> _c0   date
> Time taken: 0.031 seconds, Fetched: 1 row(s)
> {code}
> {code}
> mysql> select greatest("2015-02-02abc", date("2015-01-01"));
> +---+
> | greatest("2015-02-02abc", date("2015-01-01")) |
> +---+
> | 2015-02-02abc |
> +---+
> 1 row in set, 1 warning (0.00 sec)
> mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> Query OK, 1 row affected (0.01 sec)
> Records: 1  Duplicates: 0  Warnings: 0
> mysql> DESCRIBE typeof;
> ++-+--+-+-+---+
> | Field  | Type| Null | Key | 
> Default | Extra |
> ++-+--+-+-+---+
> | greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  | | 
> NULL|   |
> ++-+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}
> {code}
> postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
> ERROR:  invalid input syntax for type date: "2015-02-02abc"
> LINE 1: select greatest('2015-02-02abc', date('2015-01-01'));
> postgres=# CREATE TEMPORARY TABLE typeof as select greatest('2015-02-02', 
> date('2015-01-01'));
> SELECT 1
> postgres=# \d+ typeof
>   Table "pg_temp_3.typeof"
>   Column  | Type | Modifiers | Storage | Stats target | Description
> --+--+---+-+--+-
>  greatest | date |   | plain   |  |
> Has OIDs: no
> {code}
> I tracked down and it seems we want Hive's behaviour assuming from 
> SPARK-12201.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18610) greatest/least fails to run with string against date/timestamp

2016-11-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702047#comment-15702047
 ] 

Hyukjin Kwon commented on SPARK-18610:
--

So my current understanding is that we might need to implicitly cast string to 
date, as some expressions extending {{ImplicitTypeCasts}} do, but of course I 
should look into this more deeply.

> greatest/least fails to run with string against date/timestamp
> ---
>
> Key: SPARK-18610
> URL: https://issues.apache.org/jira/browse/SPARK-18610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It seems Spark SQL fails to implicitly cast (or detect widen type) from 
> string with date/timestamp.
> {code}
> spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
> Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
> DATE))' due to data type mismatch: The expressions should all have the same 
> type, got GREATEST(string, date).; line 1 pos 7
> {code}
> It seems, at least, other DBMS support this by implicit casting/widened types.
> {code}
> hive> select greatest("2015-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.019 seconds, Fetched: 1 row(s)
> hive> select greatest("-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.02 seconds, Fetched: 1 row(s)
> hive>
> hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> OK
> Time taken: 2.63 seconds
> hive> DESCRIBE typeof;
> OK
> _c0   date
> Time taken: 0.031 seconds, Fetched: 1 row(s)
> {code}
> {code}
> mysql> select greatest("2015-02-02abc", date("2015-01-01"));
> +---+
> | greatest("2015-02-02abc", date("2015-01-01")) |
> +---+
> | 2015-02-02abc |
> +---+
> 1 row in set, 1 warning (0.00 sec)
> mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> Query OK, 1 row affected (0.01 sec)
> Records: 1  Duplicates: 0  Warnings: 0
> mysql> DESCRIBE typeof;
> ++-+--+-+-+---+
> | Field  | Type| Null | Key | 
> Default | Extra |
> ++-+--+-+-+---+
> | greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  | | 
> NULL|   |
> ++-+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}
> {code}
> postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
> ERROR:  invalid input syntax for type date: "2015-02-02abc"
> LINE 1: select greatest('2015-02-02abc', date('2015-01-01'));
> postgres=# CREATE TEMPORARY TABLE typeof as select greatest('2015-02-02', 
> date('2015-01-01'));
> SELECT 1
> postgres=# \d+ typeof
>   Table "pg_temp_3.typeof"
>   Column  | Type | Modifiers | Storage | Stats target | Description
> --+--+---+-+--+-
>  greatest | date |   | plain   |  |
> Has OIDs: no
> {code}
> I tracked down and it seems we want Hive's behaviour assuming from 
> SPARK-12201.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18610) greatest/least fails to run with string against date/timestamp

2016-11-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702057#comment-15702057
 ] 

Hyukjin Kwon commented on SPARK-18610:
--

Actually, in the case of MySQL, it seems to convert the datatype to string, whereas 
Hive and Postgres do not.

> greatest/least fails to run with string against date/timestamp
> ---
>
> Key: SPARK-18610
> URL: https://issues.apache.org/jira/browse/SPARK-18610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It seems Spark SQL fails to implicitly cast (or detect widen type) from 
> string with date/timestamp.
> {code}
> spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
> Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
> DATE))' due to data type mismatch: The expressions should all have the same 
> type, got GREATEST(string, date).; line 1 pos 7
> {code}
> It seems, at least, other DBMS support this by implicit casting/widened types.
> {code}
> hive> select greatest("2015-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.019 seconds, Fetched: 1 row(s)
> hive> select greatest("-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.02 seconds, Fetched: 1 row(s)
> hive>
> hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> OK
> Time taken: 2.63 seconds
> hive> DESCRIBE typeof;
> OK
> _c0   date
> Time taken: 0.031 seconds, Fetched: 1 row(s)
> {code}
> {code}
> mysql> select greatest("2015-02-02abc", date("2015-01-01"));
> +---+
> | greatest("2015-02-02abc", date("2015-01-01")) |
> +---+
> | 2015-02-02abc |
> +---+
> 1 row in set, 1 warning (0.00 sec)
> mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> Query OK, 1 row affected (0.01 sec)
> Records: 1  Duplicates: 0  Warnings: 0
> mysql> DESCRIBE typeof;
> ++-+--+-+-+---+
> | Field  | Type| Null | Key | 
> Default | Extra |
> ++-+--+-+-+---+
> | greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  | | 
> NULL|   |
> ++-+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}
> {code}
> postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
> ERROR:  invalid input syntax for type date: "2015-02-02abc"
> LINE 1: select greatest('2015-02-02abc', date('2015-01-01'));
> postgres=# CREATE TEMPORARY TABLE typeof as select greatest('2015-02-02', 
> date('2015-01-01'));
> SELECT 1
> postgres=# \d+ typeof
>   Table "pg_temp_3.typeof"
>   Column  | Type | Modifiers | Storage | Stats target | Description
> --+--+---+-+--+-
>  greatest | date |   | plain   |  |
> Has OIDs: no
> {code}
> I tracked down and it seems we want Hive's behaviour assuming from 
> SPARK-12201.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18610) greatest/least fails to run with string against date/timestamp

2016-11-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702083#comment-15702083
 ] 

Hyukjin Kwon commented on SPARK-18610:
--

Hi [~hvanhovell] and [~cloud_fan], after rethinking this, I think it might be more 
reasonable to coerce to string when the arguments are a string and a date. May I 
ask what you think, if you don't mind?

> greatest/least fails to run with string against date/timestamp
> ---
>
> Key: SPARK-18610
> URL: https://issues.apache.org/jira/browse/SPARK-18610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> It seems Spark SQL fails to implicitly cast (or detect widen type) from 
> string with date/timestamp.
> {code}
> spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
> Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
> DATE))' due to data type mismatch: The expressions should all have the same 
> type, got GREATEST(string, date).; line 1 pos 7
> {code}
> It seems, at least, other DBMS support this by implicit casting/widened types.
> {code}
> hive> select greatest("2015-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.019 seconds, Fetched: 1 row(s)
> hive> select greatest("-02-021", date("2015-01-01"));
> OK
> 2015-01-01
> Time taken: 0.02 seconds, Fetched: 1 row(s)
> hive>
> hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> OK
> Time taken: 2.63 seconds
> hive> DESCRIBE typeof;
> OK
> _c0   date
> Time taken: 0.031 seconds, Fetched: 1 row(s)
> {code}
> {code}
> mysql> select greatest("2015-02-02abc", date("2015-01-01"));
> +---+
> | greatest("2015-02-02abc", date("2015-01-01")) |
> +---+
> | 2015-02-02abc |
> +---+
> 1 row in set, 1 warning (0.00 sec)
> mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
> date("2015-01-01"));
> Query OK, 1 row affected (0.01 sec)
> Records: 1  Duplicates: 0  Warnings: 0
> mysql> DESCRIBE typeof;
> ++-+--+-+-+---+
> | Field  | Type| Null | Key | 
> Default | Extra |
> ++-+--+-+-+---+
> | greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  | | 
> NULL|   |
> ++-+--+-+-+---+
> 1 row in set (0.00 sec)
> {code}
> {code}
> postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
> ERROR:  invalid input syntax for type date: "2015-02-02abc"
> LINE 1: select greatest('2015-02-02abc', date('2015-01-01'));
> postgres=# CREATE TEMPORARY TABLE typeof as select greatest('2015-02-02', 
> date('2015-01-01'));
> SELECT 1
> postgres=# \d+ typeof
>   Table "pg_temp_3.typeof"
>   Column  | Type | Modifiers | Storage | Stats target | Description
> --+--+---+-+--+-
>  greatest | date |   | plain   |  |
> Has OIDs: no
> {code}
> I tracked down and it seems we want Hive's behaviour assuming from 
> SPARK-12201.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18610) greatest/least fails to run with string against date/timestamp

2016-11-28 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-18610:
-
Description: 
It seems Spark SQL fails to implicitly cast (or detect a widened type) between string 
and date/timestamp.

{code}
spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
DATE))' due to data type mismatch: The expressions should all have the same 
type, got GREATEST(string, date).; line 1 pos 7
{code}

It seems that, at least, other DBMSes support this via implicit casting/widened types.

{code}
hive> select greatest("2015-02-02", date("2015-01-01"));
OK
2015-02-02
hive> select greatest("2015-02-021", date("2015-01-01"));
OK
2015-01-01
Time taken: 0.019 seconds, Fetched: 1 row(s)
hive> select greatest("-02-021", date("2015-01-01"));
OK
2015-01-01
Time taken: 0.02 seconds, Fetched: 1 row(s)
hive>
hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
date("2015-01-01"));
OK
Time taken: 2.63 seconds
hive> DESCRIBE typeof;
OK
_c0 date
Time taken: 0.031 seconds, Fetched: 1 row(s)
{code}

{code}
mysql> select greatest("2015-02-02abc", date("2015-01-01"));
+-----------------------------------------------+
| greatest("2015-02-02abc", date("2015-01-01")) |
+-----------------------------------------------+
| 2015-02-02abc                                 |
+-----------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", date("2015-01-01"));
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> DESCRIBE typeof;
+--------------------------------------------+-------------+------+-----+---------+-------+
| Field                                      | Type        | Null | Key | Default | Extra |
+--------------------------------------------+-------------+------+-----+---------+-------+
| greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  |     | NULL    |       |
+--------------------------------------------+-------------+------+-----+---------+-------+
1 row in set (0.00 sec)
{code}

{code}
postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
ERROR:  invalid input syntax for type date: "2015-02-02abc"
LINE 1: select greatest('2015-02-02abc', date('2015-01-01'));

postgres=# CREATE TEMPORARY TABLE typeof as select greatest('2015-02-02', 
date('2015-01-01'));
SELECT 1

postgres=# \d+ typeof
             Table "pg_temp_3.typeof"
  Column  | Type | Modifiers | Storage | Stats target | Description
----------+------+-----------+---------+--------------+-------------
 greatest | date |           | plain   |              |
Has OIDs: no
{code}

I tracked this down, and judging from SPARK-12201 it seems we want Hive's behaviour.

  was:
It seems Spark SQL fails to implicitly cast (or detect a widened type) between string 
and date/timestamp.

{code}
spark-sql> select greatest("2015-02-02", date("2015-01-01")) ;
Error in query: cannot resolve 'greatest('2015-02-02', CAST('2015-01-01' AS 
DATE))' due to data type mismatch: The expressions should all have the same 
type, got GREATEST(string, date).; line 1 pos 7
{code}

It seems that, at least, other DBMSes support this via implicit casting/widened types.

{code}
hive> select greatest("2015-02-021", date("2015-01-01"));
OK
2015-01-01
Time taken: 0.019 seconds, Fetched: 1 row(s)
hive> select greatest("-02-021", date("2015-01-01"));
OK
2015-01-01
Time taken: 0.02 seconds, Fetched: 1 row(s)
hive>
hive> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", 
date("2015-01-01"));
OK
Time taken: 2.63 seconds
hive> DESCRIBE typeof;
OK
_c0 date
Time taken: 0.031 seconds, Fetched: 1 row(s)
{code}

{code}
mysql> select greatest("2015-02-02abc", date("2015-01-01"));
+-----------------------------------------------+
| greatest("2015-02-02abc", date("2015-01-01")) |
+-----------------------------------------------+
| 2015-02-02abc                                 |
+-----------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> CREATE TEMPORARY TABLE typeof as select greatest("2015-02-02", date("2015-01-01"));
Query OK, 1 row affected (0.01 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> DESCRIBE typeof;
+--------------------------------------------+-------------+------+-----+---------+-------+
| Field                                      | Type        | Null | Key | Default | Extra |
+--------------------------------------------+-------------+------+-----+---------+-------+
| greatest("2015-02-02", date("2015-01-01")) | varchar(10) | YES  |     | NULL    |       |
+--------------------------------------------+-------------+------+-----+---------+-------+
1 row in set (0.00 sec)
{code}

{code}
postgres=# select greatest('2015-02-02abc', date('2015-01-01'));
ERROR:  invalid inp

[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file

2016-11-28 Thread Jerryjung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702104#comment-15702104
 ] 

Jerryjung commented on SPARK-18220:
---

CREATE EXTERNAL TABLE `d_c`.`dcoc_ircs_op_brch`(`ircs_op_brch_cd` string, 
`ircs_op_brch_nm` string, `cms_brch_cd` string, `cms_brch_nm` string, 
`etl_job_dtm` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://xxx:8020/xxx'
TBLPROPERTIES (
  'rawDataSize' = '2426',
  'numFiles' = '0',
  'transient_lastDdlTime' = '1480313167',
  'totalSize' = '0',
  'COLUMN_STATS_ACCURATE' = 'true',
  'numRows' = '6'
)
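
A hedged diagnosis sketch, under assumptions: the ORC files under the table's LOCATION were 
written with varchar columns while the metastore declares string, which is what the 
HiveVarcharWritable-to-Text cast suggests; whether the config below sidesteps this 
particular cast is also an assumption.

{code}
// Reading through the Hive SerDe reproduces the ClassCastException above.
spark.sql("SELECT * FROM d_c.dcoc_ircs_op_brch").show()

// One thing to try: let Spark read the ORC files with its native reader
// instead of the Hive SerDe.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("SELECT * FROM d_c.dcoc_ircs_op_brch").show()
{code}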



> ClassCastException occurs when using select query on ORC file
> -
>
> Key: SPARK-18220
> URL: https://issues.apache.org/jira/browse/SPARK-18220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jerryjung
>  Labels: orcfile, sql
>
> Error message is below.
> {noformat}
> ==
> 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from 
> hdfs://xxx/part-00022 with {include: [true], offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
> 1220 bytes result sent to driver
> 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 
> 42) in 116 ms on localhost (executor driver) (19/20)
> 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 
> 35)
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ORC dump info.
> ==
> File Version: 0.12 with HIVE_8732
> 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from 
> hdfs://XXX/part-0 with {include: null, offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on 
> read. Using file schema.
> Rows: 7
> Compression: ZLIB
> Compression size: 262144
> Type: 
> struct
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18020) Kinesis receiver does not snapshot when shard completes

2016-11-28 Thread Basile Deustua (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702167#comment-15702167
 ] 

Basile Deustua commented on SPARK-18020:


Same here

> Kinesis receiver does not snapshot when shard completes
> ---
>
> Key: SPARK-18020
> URL: https://issues.apache.org/jira/browse/SPARK-18020
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Yonathan Randolph
>Priority: Minor
>  Labels: kinesis
>
> When a kinesis shard is split or combined and the old shard ends, the Amazon 
> Kinesis Client library [calls 
> IRecordProcessor.shutdown|https://github.com/awslabs/amazon-kinesis-client/blob/v1.7.0/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/ShutdownTask.java#L100]
>  and expects that {{IRecordProcessor.shutdown}} must checkpoint the sequence 
> number {{ExtendedSequenceNumber.SHARD_END}} before returning. Unfortunately, 
> spark’s 
> [KinesisRecordProcessor|https://github.com/apache/spark/blob/v2.0.1/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala]
>  sometimes does not checkpoint SHARD_END. This results in an error message, 
> and spark is then blocked indefinitely from processing any items from the 
> child shards.
> This issue has also been raised on StackOverflow: [resharding while spark 
> running on kinesis 
> stream|http://stackoverflow.com/questions/38898691/resharding-while-spark-running-on-kinesis-stream]
> Exception that is logged:
> {code}
> 16/10/19 19:37:49 ERROR worker.ShutdownTask: Application exception. 
> java.lang.IllegalArgumentException: Application didn't checkpoint at end of 
> shard shardId-0030
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:106)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Command used to split shard:
> {code}
> aws kinesis --region us-west-1 split-shard --stream-name my-stream 
> --shard-to-split shardId-0030 --new-starting-hash-key 
> 5316911983139663491615228241121378303
> {code}
> After the spark-streaming job has hung, examining the DynamoDB table 
> indicates that the parent shard processor has not reached 
> {{ExtendedSequenceNumber.SHARD_END}} and the child shards are still at 
> {{ExtendedSequenceNumber.TRIM_HORIZON}} waiting for the parent to finish:
> {code}
> aws kinesis --region us-west-1 describe-stream --stream-name my-stream
> {
> "StreamDescription": {
> "RetentionPeriodHours": 24, 
> "StreamName": "my-stream", 
> "Shards": [
> {
> "ShardId": "shardId-0030", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "0"
> },
> ...
> }, 
> {
> "ShardId": "shardId-0062", 
> "HashKeyRange": {
> "EndingHashKey": "5316911983139663491615228241121378302", 
> "StartingHashKey": "0"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087883755242230188435465744452396445937434624994"
> }
> }, 
> {
> "ShardId": "shardId-0063", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "5316911983139663491615228241121378303"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087906055987428719058607280170669094298940605426"
> }
> },
> ...
> ],
> "StreamStatus": "ACTIVE"
> }
> }
> aws dynamodb --region us-west-1 scan --table-name my-processor
> {
> "Items": [
> {
> "leaseOwner": {
> "S": "localhost:fd385c95-5d19-467

[jira] [Issue Comment Deleted] (SPARK-18020) Kinesis receiver does not snapshot when shard completes

2016-11-28 Thread Basile Deustua (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Basile Deustua updated SPARK-18020:
---
Comment: was deleted

(was: Same here)

> Kinesis receiver does not snapshot when shard completes
> ---
>
> Key: SPARK-18020
> URL: https://issues.apache.org/jira/browse/SPARK-18020
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Yonathan Randolph
>Priority: Minor
>  Labels: kinesis
>
> When a kinesis shard is split or combined and the old shard ends, the Amazon 
> Kinesis Client library [calls 
> IRecordProcessor.shutdown|https://github.com/awslabs/amazon-kinesis-client/blob/v1.7.0/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/ShutdownTask.java#L100]
>  and expects that {{IRecordProcessor.shutdown}} must checkpoint the sequence 
> number {{ExtendedSequenceNumber.SHARD_END}} before returning. Unfortunately, 
> spark’s 
> [KinesisRecordProcessor|https://github.com/apache/spark/blob/v2.0.1/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisRecordProcessor.scala]
>  sometimes does not checkpoint SHARD_END. This results in an error message, 
> and spark is then blocked indefinitely from processing any items from the 
> child shards.
> This issue has also been raised on StackOverflow: [resharding while spark 
> running on kinesis 
> stream|http://stackoverflow.com/questions/38898691/resharding-while-spark-running-on-kinesis-stream]
> Exception that is logged:
> {code}
> 16/10/19 19:37:49 ERROR worker.ShutdownTask: Application exception. 
> java.lang.IllegalArgumentException: Application didn't checkpoint at end of 
> shard shardId-0030
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:106)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
> at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Command used to split shard:
> {code}
> aws kinesis --region us-west-1 split-shard --stream-name my-stream 
> --shard-to-split shardId-0030 --new-starting-hash-key 
> 5316911983139663491615228241121378303
> {code}
> After the spark-streaming job has hung, examining the DynamoDB table 
> indicates that the parent shard processor has not reached 
> {{ExtendedSequenceNumber.SHARD_END}} and the child shards are still at 
> {{ExtendedSequenceNumber.TRIM_HORIZON}} waiting for the parent to finish:
> {code}
> aws kinesis --region us-west-1 describe-stream --stream-name my-stream
> {
> "StreamDescription": {
> "RetentionPeriodHours": 24, 
> "StreamName": "my-stream", 
> "Shards": [
> {
> "ShardId": "shardId-0030", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "0"
> },
> ...
> }, 
> {
> "ShardId": "shardId-0062", 
> "HashKeyRange": {
> "EndingHashKey": "5316911983139663491615228241121378302", 
> "StartingHashKey": "0"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087883755242230188435465744452396445937434624994"
> }
> }, 
> {
> "ShardId": "shardId-0063", 
> "HashKeyRange": {
> "EndingHashKey": 
> "10633823966279326983230456482242756606", 
> "StartingHashKey": "5316911983139663491615228241121378303"
> }, 
> "ParentShardId": "shardId-0030", 
> "SequenceNumberRange": {
> "StartingSequenceNumber": 
> "49566806087906055987428719058607280170669094298940605426"
> }
> },
> ...
> ],
> "StreamStatus": "ACTIVE"
> }
> }
> aws dynamodb --region us-west-1 scan --table-name my-processor
> {
> "Items": [
> {
> "leaseOwner": {
> "S": "localhost:fd385c95-5d19-4678-926f-b6d5f5503cbe"
>   

[jira] [Resolved] (SPARK-17783) Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP Table for JDBC

2016-11-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17783.
---
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.1.0

> Hide Credentials in CREATE and DESC FORMATTED/EXTENDED a PERSISTENT/TEMP 
> Table for JDBC
> ---
>
> Key: SPARK-17783
> URL: https://issues.apache.org/jira/browse/SPARK-17783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.1.0
>
>
> We should never expose credentials in the EXPLAIN and DESC FORMATTED/EXTENDED output. 
> However, the commands below exposed the credentials. 
> {noformat}
> CREATE TABLE tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateDataSourceTableCommand CatalogTable(
>   Table: `tab1`
>   Created: Tue Oct 04 21:39:44 PDT 2016
>   Last Access: Wed Dec 31 15:59:59 PST 1969
>   Type: MANAGED
>   Provider: org.apache.spark.sql.jdbc
>   Storage(Properties: 
> [url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, 
> dbtable=TEST.PEOPLE, user=testUser, password=testPass])), false
> {noformat}
> {noformat}
> DESC FORMATTED tab1
> {noformat}
> {noformat}
> ...
> |# Storage Information   |
>   |   |
> |Compressed: |No  
>   |   |
> |Storage Desc Parameters:|
>   |   |
> |  path  
> |file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1|   |
> |  url   
> |jdbc:h2:mem:testdb0;user=testUser;password=testPass   |   |
> |  dbtable   |TEST.PEOPLE 
>   |   |
> |  user  |testUser
>   |   |
> |  password  |testPass
>   |   |
> ++--+---+
> {noformat}
> {noformat}
> DESC EXTENDED tab1
> {noformat}
> {noformat}
> ...
>   Storage(Properties: 
> [path=file:/Users/xiaoli/IdeaProjects/sparkDelivery/spark-warehouse/tab1, 
> url=jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable=TEST.PEOPLE, 
> user=testUser, password=testPass]))|   |
> {noformat}
> {noformat}
> CREATE TEMP VIEW tab1 USING org.apache.spark.sql.jdbc
> {noformat}
> {noformat}
> == Physical Plan ==
> ExecutedCommand
>+- CreateTempViewUsing `tab1`, false, org.apache.spark.sql.jdbc, Map(url 
> -> jdbc:h2:mem:testdb0;user=testUser;password=testPass, dbtable -> 
> TEST.PEOPLE, user -> testUser, password -> testPass)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18611) mesos shuffle service on v2 isn't compatible with spark v1

2016-11-28 Thread Adrian Bridgett (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Bridgett updated SPARK-18611:

Affects Version/s: 1.6.3

> mesos shuffle service on v2 isn't compatible with spark v1
> --
>
> Key: SPARK-18611
> URL: https://issues.apache.org/jira/browse/SPARK-18611
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.3
>Reporter: Adrian Bridgett
>Priority: Minor
>
> In SPARK-12583 heartbeating was added to the shuffle service.
> This unfortunately changes registerDriver to a different signature, so Spark v1 clients 
> can't be used with a Spark v2 Mesos shuffle service. 
> http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html
>   is a slightly different issue, but it does suggest that it's agreed that 
> compatibility is a good idea in these cases.
> https://github.com/apache/spark/pull/13279 was created a while ago to 
> backport this to 1.6.x but was rejected (too invasive).  However I don't 
> believe this issue (v1/v2 compatibility) was recognised back then so perhaps 
> it's time to reconsider?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18597.
---
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.1.0

> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).
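
One hedged way to observe this on a local build (the plan shapes are from my reading of 
the report, not quoted from a run): print the optimized plan and check whether the 
predicate has been pushed below the anti join as a Filter on tbl_a.

{code}
spark.sql("""
  SELECT *
  FROM tbl_a
  LEFT ANTI JOIN tbl_b ON ((tbl_a.c1 = tbl_a.c2) IS NULL OR tbl_a.c1 = tbl_a.c2)
""").explain(true)
{code}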



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18611) mesos shuffle service on v2 isn't compatible with spark v1

2016-11-28 Thread Adrian Bridgett (JIRA)
Adrian Bridgett created SPARK-18611:
---

 Summary: mesos shuffle service on v2 isn't compatible with spark v1
 Key: SPARK-18611
 URL: https://issues.apache.org/jira/browse/SPARK-18611
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Reporter: Adrian Bridgett
Priority: Minor


In SPARK-12583 heartbeating was added to the shuffle service.

This unfortunately changes registerDriver to a different signature, so Spark v1 clients 
can't be used with a Spark v2 Mesos shuffle service. 

http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html
  is a slightly different issue, but it does suggest that it's agreed that 
compatibility is a good idea in these cases.

https://github.com/apache/spark/pull/13279 was created a while ago to backport 
this to 1.6.x but was rejected (too invasive).  However I don't believe this 
issue (v1/v2 compatibility) was recognised back then so perhaps it's time to 
reconsider?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18611) mesos shuffle service on v2 isn't compatible with spark v1

2016-11-28 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702213#comment-15702213
 ] 

Adrian Bridgett commented on SPARK-18611:
-

dragos spotted this problem in his comment on 
https://github.com/apache/spark/pull/11272/files FYI.

> mesos shuffle service on v2 isn't compatible with spark v1
> --
>
> Key: SPARK-18611
> URL: https://issues.apache.org/jira/browse/SPARK-18611
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.3
>Reporter: Adrian Bridgett
>Priority: Minor
>
> In SPARK-12583 heartbeating was added to the shuffle service.
> This unfortunately changes registerDriver to a different signature, so Spark v1 clients 
> can't be used with a Spark v2 Mesos shuffle service. 
> http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html
>   is a slightly different issue, but it does suggest that it's agreed that 
> compatibility is a good idea in these cases.
> https://github.com/apache/spark/pull/13279 was created a while ago to 
> backport this to 1.6.x but was rejected (too invasive).  However I don't 
> believe this issue (v1/v2 compatibility) was recognised back then so perhaps 
> it's time to reconsider?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2016-11-28 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702258#comment-15702258
 ] 

Jiang Xingbo commented on SPARK-18549:
--

I'd suggest we store the view dependencies in the view's CatalogTable. Note that we only 
do this for permanent views; temp/global temp views don't participate in the dependency 
management.
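
A rough sketch of that idea only; the property key and the comma-separated encoding below 
are hypothetical, not an existing Spark API.

{code}
import org.apache.spark.sql.catalyst.catalog.CatalogTable

// Record the tables a permanent view reads from in its CatalogTable properties at
// CREATE VIEW time, so UNCACHE/DROP can consult them even if a referenced table is gone.
def withViewDependencies(view: CatalogTable, referencedTables: Seq[String]): CatalogTable =
  view.copy(properties = view.properties +
    ("view.dependencies" -> referencedTables.mkString(",")))
{code}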

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Priority: Critical
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> -- So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18471) In treeAggregate, generate (big) zeros instead of sending them.

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702262#comment-15702262
 ] 

Apache Spark commented on SPARK-18471:
--

User 'AnthonyTruchet' has created a pull request for this issue:
https://github.com/apache/spark/pull/16037

> In treeAggregate, generate (big) zeros instead of sending them.
> ---
>
> Key: SPARK-18471
> URL: https://issues.apache.org/jira/browse/SPARK-18471
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Anthony Truchet
>Priority: Minor
>
> When using an optimization routine like LBFGS, treeAggregate currently sends the 
> zero vector as part of the closure. This zero can be huge (e.g. ML vectors 
> with millions of zeros) but can be easily generated.
> Several options are possible (upcoming patches to come soon for some of them).
> One is to provide a treeAggregateWithZeroGenerator method (either in core or 
> in MLlib) which wraps treeAggregate in an Option and generates the zero if None.
> Another one is to rewrite treeAggregate to wrap an underlying implementation 
> which uses a zero generator directly.
> There might be other, better alternatives we have not spotted...
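
A sketch of the first option, as a hypothetical helper rather than an existing Spark API: 
the closure captures only a zero-generating function, and the real zero is materialized 
lazily on the executors instead of being serialized and shipped.

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def treeAggregateWithZeroGenerator[T, U: ClassTag](rdd: RDD[T])(zeroGen: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = {
  // The aggregated value is wrapped in an Option; the tiny None stands in for the big zero.
  val combined = rdd.treeAggregate(Option.empty[U])(
    (acc, elem) => Some(seqOp(acc.getOrElse(zeroGen()), elem)),
    (a, b) => (a, b) match {
      case (Some(x), Some(y)) => Some(combOp(x, y))
      case (None, y)          => y
      case (x, None)          => x
    },
    depth)
  combined.getOrElse(zeroGen())
}
{code}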



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18471) In treeAggregate, generate (big) zeros instead of sending them.

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702296#comment-15702296
 ] 

Apache Spark commented on SPARK-18471:
--

User 'AnthonyTruchet' has created a pull request for this issue:
https://github.com/apache/spark/pull/16038

> In treeAggregate, generate (big) zeros instead of sending them.
> ---
>
> Key: SPARK-18471
> URL: https://issues.apache.org/jira/browse/SPARK-18471
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Anthony Truchet
>Priority: Minor
>
> When using an optimization routine like LBFGS, treeAggregate currently sends the 
> zero vector as part of the closure. This zero can be huge (e.g. ML vectors 
> with millions of zeros) but can be easily generated.
> Several options are possible (upcoming patches to come soon for some of them).
> One is to provide a treeAggregateWithZeroGenerator method (either in core or 
> in MLlib) which wraps treeAggregate in an Option and generates the zero if None.
> Another one is to rewrite treeAggregate to wrap an underlying implementation 
> which uses a zero generator directly.
> There might be other, better alternatives we have not spotted...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702328#comment-15702328
 ] 

Herman van Hovell commented on SPARK-18597:
---

[~nsyca] LEFT SEMI and LEFT ANTI are both SEMI joins (half joins). A semi join 
only returns rows from the first table when it matches one or more rows from 
the second table (I got this from 
http://www.slideshare.net/alokeparnachoudhury/semi-joins). The anti join is the 
opposite, and only returns a row from the first table when it does not match 
any row in the second table.

Hive also supports LEFT SEMI JOIN: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
Impala supports both LEFT SEMI and LEFT ANTI JOIN: 
https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_joins.html
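
A toy illustration of the two half joins (the data is made up for the example):

{code}
spark.sql("CREATE OR REPLACE TEMPORARY VIEW l AS VALUES (1), (2), (3) AS t(id)")
spark.sql("CREATE OR REPLACE TEMPORARY VIEW r AS VALUES (2), (3) AS t(id)")

// Semi join: rows of l that match at least one row of r -> 2 and 3.
spark.sql("SELECT * FROM l LEFT SEMI JOIN r ON l.id = r.id").show()

// Anti join: rows of l that match no row of r -> 1.
spark.sql("SELECT * FROM l LEFT ANTI JOIN r ON l.id = r.id").show()
{code}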

> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18262) JSON.org license is now CatX

2016-11-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702340#comment-15702340
 ] 

Steve Loughran commented on SPARK-18262:


~tdunning has done a mostly-compatible org.json replacement JAR; see HADOOP-13794. If 
org.json can't be pulled, and there's a lib using a non-shaded dependency, this should 
be able to act as a replacement.

> JSON.org license is now CatX
> 
>
> Key: SPARK-18262
> URL: https://issues.apache.org/jira/browse/SPARK-18262
> Project: Spark
>  Issue Type: Bug
>Reporter: Sean Busbey
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.1.0
>
>
> per [update resolved legal|http://www.apache.org/legal/resolved.html#json]:
> {quote}
> CAN APACHE PRODUCTS INCLUDE WORKS LICENSED UNDER THE JSON LICENSE?
> No. As of 2016-11-03 this has been moved to the 'Category X' license list. 
> Prior to this, use of the JSON Java library was allowed. See Debian's page 
> for a list of alternatives.
> {quote}
> I'm not actually clear if Spark is using one of the JSON.org licensed 
> libraries. As of current master (dc4c6009) the java library gets called out 
> in the [NOTICE file for our source 
> repo|https://github.com/apache/spark/blob/dc4c60098641cf64007e2f0e36378f000ad5f6b1/NOTICE#L424]
>  but:
> 1) It doesn't say where in the source
> 2) the given url is 404 (http://www.json.org/java/index.html)
> 3) It doesn't actually say in the NOTICE what license the inclusion is under
> 4) the JSON.org license for the java {{org.json:json}} artifact (what the 
> blurb in #2 is usually referring to) doesn't show up in our LICENSE file, nor 
> in the {{licenses/}} directory
> 5) I don't see a direct reference to the {{org.json:json}} artifact in our 
> poms.
> So maybe it's just coming in transitively and we can exclude it / ping 
> whoever is bringing it in?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702336#comment-15702336
 ] 

Apache Spark commented on SPARK-18597:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16039

> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-11-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702368#comment-15702368
 ] 

Steve Loughran commented on SPARK-18512:


This looks like a consistency problem; s3 listing always lags the 
creation/deletion/update of contents.

The committer has listed the paths to merge in, then gone through each one to check its 
type: if it is not a file, it lists the subdirectory and, interestingly, gets an exception 
at that point. Maybe the first listing found an object which was no longer there by the 
time the second listing went through; that is, the exception isn't a delay-on-create, it's 
a delay-on-delete.

Create-listing delays could be handled in the committer by retrying on an FNFE; it'd 
slightly increase the time before a failure, but as that's a failure path, not too 
serious. Delete delays could be addressed the opposite way: ignore the problem, on the 
basis that if the listing failed, there's no file to rename. That's more worrying, as 
it's a sign of a problem which could have implications further up the commit process: 
things are changing in the listing of files being renamed.

HADOOP-13345 is going to address list inconsistency; I'm doing a committer there which I 
could also try to make more robust even when not using a DynamoDB-backed bucket. The 
question is: what is a good retry policy here, especially given that once an inconsistency 
has surfaced, a large amount of the merge may already have taken place. Backing up and 
retrying may be dangerous in a different way.

One thing I would recommend trying is: commit to HDFS, then copy. Do that and you can turn 
speculation on in your executors, and get local virtual-HDD performance and networking as 
well as a consistent view. Copy to S3A after everything you want done is complete.
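
A sketch of the "commit to HDFS, then copy" approach; the paths, the DataFrame {{df}} and 
the decision to delete the HDFS copy afterwards are all placeholders, not from the report.

{code}
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hdfsDir = "hdfs:///tmp/streaming-batch-00042"
df.write.parquet(hdfsDir)  // the commit runs against HDFS, so speculation is safe here

val hadoopConf = spark.sparkContext.hadoopConfiguration
val srcFs = FileSystem.get(URI.create(hdfsDir), hadoopConf)
val dstFs = FileSystem.get(URI.create("s3a://xxx/"), hadoopConf)
FileUtil.copy(srcFs, new Path(hdfsDir),
  dstFs, new Path("s3a://xxx/streaming-batch-00042"),
  true /* deleteSource */, hadoopConf)
{code}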





> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> 
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A with read-after-write consistency)
>Reporter: Giuseppe Bonaccorso
>
> After a few hours of stream processing and saving data in Parquet format, 
> I always got this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>   at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan

[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-11-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702373#comment-15702373
 ] 

Steve Loughran commented on SPARK-18512:


one question: what's the size of data being committed here?

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> 
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A with read-after-write consistency)
>Reporter: Giuseppe Bonaccorso
>
> After a few hours of stream processing and saving data in Parquet format, 
> I always got this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>   at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've also tried s3:// and s3n://, but it always happens after 3-5 hours. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18612) Leaked broadcasted variable Mllib

2016-11-28 Thread Anthony Truchet (JIRA)
Anthony Truchet created SPARK-18612:
---

 Summary: Leaked broadcasted variable Mllib
 Key: SPARK-18612
 URL: https://issues.apache.org/jira/browse/SPARK-18612
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.2, 1.6.3
Reporter: Anthony Truchet


Fix broadcasted variable leaks in MLlib.

For example, `bcW` in the L-BFGS CostFun.
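
A minimal sketch of the pattern in question ({{w}}, {{data}} and {{computeLoss}} are 
placeholders, not the actual MLlib code): broadcast, use inside the aggregation, then 
release the broadcast explicitly instead of leaving it to the ContextCleaner.

{code}
val bcW = sc.broadcast(w)
try {
  data.treeAggregate(0.0)(
    (loss, point) => loss + computeLoss(bcW.value, point),
    _ + _)
} finally {
  bcW.destroy()
}
{code}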




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-11-28 Thread Giuseppe Bonaccorso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702394#comment-15702394
 ] 

Giuseppe Bonaccorso commented on SPARK-18512:
-

It happens after about 1,000,000 writes of 100-1000 KB each.
I've tried your suggestion of writing to HDFS and it works. Unfortunately I cannot use 
distcp on the same cluster, because we're working with Spark Streaming. Is it possible to 
do it on the same cluster without distcp? Thanks

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> 
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A with read-after-write consistency)
>Reporter: Giuseppe Bonaccorso
>
> After a few hours of stream processing and saving data in Parquet format, 
> I always got this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>   at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've also tried s3:// and s3n://, but it always happens after 3-5 hours. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18612) Leaked broadcasted variable Mllib

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18612:


Assignee: Apache Spark

> Leaked broadcasted variable Mllib
> -
>
> Key: SPARK-18612
> URL: https://issues.apache.org/jira/browse/SPARK-18612
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Anthony Truchet
>Assignee: Apache Spark
>
> Fix broadcasted variable leaks in MLlib.
> For example, `bcW` in the L-BFGS CostFun.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18612) Leaked broadcasted variable Mllib

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702392#comment-15702392
 ] 

Apache Spark commented on SPARK-18612:
--

User 'AnthonyTruchet' has created a pull request for this issue:
https://github.com/apache/spark/pull/16040

> Leaked broadcasted variable Mllib
> -
>
> Key: SPARK-18612
> URL: https://issues.apache.org/jira/browse/SPARK-18612
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Anthony Truchet
>
> Fix broadcasted variable leaks in MLlib.
> For example, `bcW` in the L-BFGS CostFun.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18612) Leaked broadcasted variable Mllib

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18612:


Assignee: (was: Apache Spark)

> Leaked broadcasted variable Mllib
> -
>
> Key: SPARK-18612
> URL: https://issues.apache.org/jira/browse/SPARK-18612
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Anthony Truchet
>
> Fix broadcasted variable leaks in MLlib.
> For example, `bcW` in the L-BFGS CostFun.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18612) Leaked broadcasted variable Mllib

2016-11-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18612:
--
  Priority: Trivial  (was: Major)
Issue Type: Improvement  (was: Bug)

> Leaked broadcasted variable Mllib
> -
>
> Key: SPARK-18612
> URL: https://issues.apache.org/jira/browse/SPARK-18612
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Anthony Truchet
>Priority: Trivial
>
> Fix broadcasted variable leaks in MLlib.
> For example, `bcW` in the L-BFGS CostFun.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18058) AnalysisException may be thrown when union two DFs whose struct fields have different nullability

2016-11-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702431#comment-15702431
 ] 

Apache Spark commented on SPARK-18058:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16041

> AnalysisException may be thrown when union two DFs whose struct fields have 
> different nullability
> -
>
> Key: SPARK-18058
> URL: https://issues.apache.org/jira/browse/SPARK-18058
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Nan Zhu
> Fix For: 2.0.2, 2.1.0
>
>
> The following Spark shell snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t1")
> spark.range(10).map(i => i: 
> java.lang.Long).toDF("id").createOrReplaceTempView("t2")
> sql("SELECT struct(id) FROM t1 UNION ALL SELECT struct(id) FROM t2")
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. StructType(StructField(id,LongType,true)) 
> <> StructType(StructField(id,LongType,false)) at the first column of the 
> second table;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:291)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11$$anonfun$apply$12.apply(CheckAnalysis.scala:289)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:289)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$11.apply(CheckAnalysis.scala:278)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:278)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:573)
>   ... 50 elided
> {noformat}
> The reason is that we treat two {{StructType}}s as incompatible even if they 
> only differ from each other in field nullability.
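
A hedged workaround sketch (not the committed fix): rebuild one side with the other side's 
exact schema so the struct fields' nullability matches before the union.

{code}
val df1 = spark.sql("SELECT struct(id) AS s FROM t1")
val df2 = spark.sql("SELECT struct(id) AS s FROM t2")

// createDataFrame applies df1's schema (including nullability) to df2's rows.
val df2Aligned = spark.createDataFrame(df2.rdd, df1.schema)
df1.union(df2Aligned).show()
{code}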



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18481:
--
Description: 
Remove deprecated methods for ML.

This task removed the following (deprecated) public APIs in org.apache.spark.ml:
* classification.RandomForestClassificationModel.numTrees  (This now refers to 
the Param called "numTrees")
* feature.ChiSqSelectorModel.setLabelCol
* regression.LinearRegressionSummary.model
* regression.RandomForestRegressionModel.numTrees  (This now refers to the 
Param called "numTrees")
* PipelineStage.validateParams

This task made the following changes to match existing patterns for Params:
* These methods were made final:
** classification.RandomForestClassificationModel.getNumTrees
** regression.RandomForestRegressionModel.getNumTrees
* These methods return the concrete class type, rather than an arbitrary trait. 
 This only affected Java compatibility, not Scala.
** classification.RandomForestClassificationModel.setFeatureSubsetStrategy
** regression.RandomForestRegressionModel.setFeatureSubsetStrategy


  was:
Remove deprecated methods for ML.

We removed the following public APIs in this JIRA:
org.apache.spark.ml.classification.RandomForestClassificationModel.numTrees
org.apache.spark.ml.feature.ChiSqSelectorModel.setLabelCol
org.apache.spark.ml.regression.LinearRegressionSummary.model
org.apache.spark.ml.regression.RandomForestRegressionModel.numTrees



> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Remove deprecated methods for ML.
> This task removed the following (deprecated) public APIs in 
> org.apache.spark.ml:
> * classification.RandomForestClassificationModel.numTrees  (This now refers 
> to the Param called "numTrees")
> * feature.ChiSqSelectorModel.setLabelCol
> * regression.LinearRegressionSummary.model
> * regression.RandomForestRegressionModel.numTrees  (This now refers to the 
> Param called "numTrees")
> * PipelineStage.validateParams
> This task made the following changes to match existing patterns for Params:
> * These methods were made final:
> ** classification.RandomForestClassificationModel.getNumTrees
> ** regression.RandomForestRegressionModel.getNumTrees
> * These methods return the concrete class type, rather than an arbitrary 
> trait.  This only affected Java compatibility, not Scala.
> ** classification.RandomForestClassificationModel.setFeatureSubsetStrategy
> ** regression.RandomForestRegressionModel.setFeatureSubsetStrategy
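
For callers migrating off the removed {{numTrees}} method, a short sketch of the 
replacement pattern described above (training itself is omitted; the fitted-model 
lines are shown as comments):

{code}
import org.apache.spark.ml.classification.RandomForestClassifier

// "numTrees" is now a Param: set it on the estimator and read it back with the
// getter; on a fitted model, trees.length gives the actual ensemble size.
val rf = new RandomForestClassifier().setNumTrees(20)
println(rf.getNumTrees)          // 20, the Param value

// val model = rf.fit(trainingData)
// model.getNumTrees             // Param value (replaces the removed model.numTrees)
// model.trees.length            // number of trees actually in the fitted ensemble
{code}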



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18612) Leaked broadcasted variable Mllib

2016-11-28 Thread Anthony Truchet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702446#comment-15702446
 ] 

Anthony Truchet commented on SPARK-18612:
-

See related:
* https://issues.apache.org/jira/browse/SPARK-16440
* https://issues.apache.org/jira/browse/SPARK-16696


> Leaked broadcasted variable Mllib
> -
>
> Key: SPARK-18612
> URL: https://issues.apache.org/jira/browse/SPARK-18612
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Anthony Truchet
>Priority: Trivial
>
> Fix broadcasted variable leaks in MLlib.
> For example, `bcW` in the L-BFGS CostFun.
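
For context, a minimal sketch of the cleanup pattern being asked for (spark-shell 
style; the weight array and the computation below are placeholders, not the actual 
L-BFGS CostFun code):

{code}
// Broadcast, use inside an action, then release explicitly instead of leaking it.
val w = Array(0.1, 0.2, 0.3)                 // placeholder weights
val bcW = sc.broadcast(w)
try {
  val data = sc.parallelize(1 to 1000)
  val obj = data.map(i => i * bcW.value.sum).sum()   // placeholder computation
  println(obj)
} finally {
  bcW.destroy()   // or bcW.unpersist() if the broadcast should stay reusable
}
{code}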



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18592) Move DT/RF/GBT Param setter methods to subclasses

2016-11-28 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18592:
--
Target Version/s: 2.1.0, 2.2.0

> Move DT/RF/GBT Param setter methods to subclasses
> -
>
> Key: SPARK-18592
> URL: https://issues.apache.org/jira/browse/SPARK-18592
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Move DT/RF/GBT Param setter methods to subclasses and deprecate these methods 
> in the Model classes to make them more Java-friendly.
> See discussion at 
> https://github.com/apache/spark/pull/15913#discussion_r89662469 .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18324) ML, Graph 2.1 QA: Programming guide update and migration guide

2016-11-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702451#comment-15702451
 ] 

Joseph K. Bradley commented on SPARK-18324:
---

Note API changes from [SPARK-18481] and deprecations in [SPARK-18592]

> ML, Graph 2.1 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-18324
> URL: https://issues.apache.org/jira/browse/SPARK-18324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702454#comment-15702454
 ] 

Nattavut Sutyanyong commented on SPARK-18597:
-

[~hvanhovell], @dongjoon, [~smilegator] FYI:

I edited the description to reflect the correct behaviour of the {{LeftAnti}} 
in the example. The correct answer is all the rows from tbl_a, as no row in 
tbl_a satisfies the predicate tbl_a.c1 = tbl_a.c2 in the ON clause. Since 
LeftAnti semantics return the rows of tbl_a except those that satisfy the 
predicate in the ON clause, every row of tbl_a is returned. 
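
For reference, re-running the reproduction from the description with the corrected 
expectation (output shown as comments; row order may vary):

{code}
sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) as t(c1, c2)")
sql("create or replace temporary view tbl_b as values 1 as t(c1)")
sql("""
select *
from   tbl_a
       left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = tbl_a.c2)
""").show()
// Correct LeftAnti result: every row of tbl_a, since no row satisfies the ON predicate
// +---+---+
// | c1| c2|
// +---+---+
// |  1|  5|
// |  2|  1|
// |  3|  6|
// +---+---+
{code}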

> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nattavut Sutyanyong updated SPARK-18597:

Description: 
The optimizer pushes down filters for left anti joins. This unfortunately has 
the opposite effect. For example:
{noformat}
sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) as 
t(c1, c2)")
sql("create or replace temporary view tbl_b as values 1 as t(c1)")
sql("""
select *
from   tbl_a
   left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
tbl_a.c2)
""")
{noformat}

Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.

The upside is that this will only happen when you use a really weird anti-join 
(only referencing the table on the left hand side).

  was:
The optimizer pushes down filters for left anti joins. This unfortunately has 
the opposite effect. For example:
{noformat}
sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) as 
t(c1, c2)")
sql("create or replace temporary view tbl_b as values 1 as t(c1)")
sql("""
select *
from   tbl_a
   left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
tbl_a.c2)
""")
{noformat}

Should return rows [2, 1] & [3, 6], but returns no rows.

The upside is that this will only happen when you use a really weird anti-join 
(only referencing the table on the left hand side).


> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702454#comment-15702454
 ] 

Herman van Hovell edited comment on SPARK-18597 at 11/28/16 4:53 PM:
-

[~hvanhovell], [~dongjoon]], [~smilegator] FYI:

I edited the description to reflect the correct behaviour of the {{LeftAnti}} 
in the example. The correct answer is all the rows from tbl_a as there is no 
row in tbl_a that satisfies the predicate tbl_a.c1 = tbl_a.c2 in the ON clause. 
By the LeftAnti semantics of tbl_a except rows that satisfy the predicate in 
the ON clause, all rows from tbl_a are returned. 


was (Author: nsyca):
[~hvanhovell], @dongjoon, [~smilegator] FYI:

I edited the description to reflect the correct behaviour of the {{LeftAnti}} 
in the example. The correct answer is all the rows from tbl_a as there is no 
row in tbl_a that satisfies the predicate tbl_a.c1 = tbl_a.c2 in the ON clause. 
By the LeftAnti semantics of tbl_a except rows that satisfy the predicate in 
the ON clause, all rows from tbl_a are returned. 

> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702474#comment-15702474
 ] 

Herman van Hovell commented on SPARK-18597:
---

Thanks!

> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702454#comment-15702454
 ] 

Herman van Hovell edited comment on SPARK-18597 at 11/28/16 4:53 PM:
-

[~hvanhovell], [~dongjoon], [~smilegator] FYI:

I edited the description to reflect the correct behaviour of the {{LeftAnti}} 
in the example. The correct answer is all the rows from tbl_a as there is no 
row in tbl_a that satisfies the predicate tbl_a.c1 = tbl_a.c2 in the ON clause. 
By the LeftAnti semantics of tbl_a except rows that satisfy the predicate in 
the ON clause, all rows from tbl_a are returned. 


was (Author: nsyca):
[~hvanhovell], [~dongjoon]], [~smilegator] FYI:

I edited the description to reflect the correct behaviour of the {{LeftAnti}} 
in the example. The correct answer is all the rows from tbl_a as there is no 
row in tbl_a that satisfies the predicate tbl_a.c1 = tbl_a.c2 in the ON clause. 
By the LeftAnti semantics of tbl_a except rows that satisfy the predicate in 
the ON clause, all rows from tbl_a are returned. 

> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18535) Redact sensitive information from Spark logs and UI

2016-11-28 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18535.

   Resolution: Fixed
 Assignee: Mark Grover
Fix Version/s: 2.2.0

> Redact sensitive information from Spark logs and UI
> ---
>
> Key: SPARK-18535
> URL: https://issues.apache.org/jira/browse/SPARK-18535
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.1.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
> Attachments: redacted.png
>
>
> A Spark user may have to provide sensitive information for a Spark 
> configuration property, or source out an environment variable in the 
> executor or driver environment that contains sensitive information. A good 
> example of this would be when reading/writing data from/to S3 using Spark. 
> The S3 secret and S3 access key can be placed in a [hadoop credential 
> provider|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html].
>  However, one still needs to provide the password for the credential provider 
> to Spark, which is typically supplied as an environment variable to the 
> driver and executor environments. This environment variable shows up in logs, 
> and may also show up in the UI.
> 1. For logs, it shows up in a few places:
>   1A. Event logs under {{SparkListenerEnvironmentUpdate}} event.
>   1B. YARN logs, when printing the executor launch context.
> 2. For UI, it would show up in the _Environment_ tab, but it is redacted if 
> it contains the words "password" or "secret" in it. And, these magic words 
> are 
> [hardcoded|https://github.com/apache/spark/blob/a2d464770cd183daa7d727bf377bde9c21e29e6a/core/src/main/scala/org/apache/spark/ui/env/EnvironmentPage.scala#L30]
>  and hence not customizable.
> This JIRA is to track the work to make sure sensitive information is redacted 
> from all logs and UIs in Spark, while still being passed on to all relevant 
> places it needs to get passed on to.
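
A sketch of how a job might opt into stricter redaction once the pattern is 
configurable; the configuration key name below is an assumption for illustration, 
not confirmed by this issue:

{code}
import org.apache.spark.sql.SparkSession

// Assumed key name ("spark.redaction.regex"); the value extends the hardcoded
// "password"/"secret" match mentioned above with an access-key pattern.
val spark = SparkSession.builder()
  .appName("redaction-example")
  .config("spark.redaction.regex", "(?i)secret|password|access.?key")
  .getOrCreate()
{code}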



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702505#comment-15702505
 ] 

Nattavut Sutyanyong commented on SPARK-18597:
-

I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {{ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.
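
A spark-shell sketch to inspect the rewrite described above (data follows the 
example; B's single value is arbitrary since its content is irrelevant here):

{code}
Seq((1, 1), (1, 2)).toDF("c1", "c2").createOrReplaceTempView("A")
Seq(0).toDF("c1").createOrReplaceTempView("B")

val q = sql("""
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or  A.c2 = 2
""")
q.explain(true)   // the rewritten plan should show a Filter on `exists` above an ExistenceJoin
q.show()          // correct result: both rows of A, (1,1) and (1,2)
{code}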


> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702505#comment-15702505
 ] 

Nattavut Sutyanyong edited comment on SPARK-18597 at 11/28/16 5:09 PM:
---

I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ...}} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {{ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.



was (Author: nsyca):
I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {{ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.


> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702505#comment-15702505
 ] 

Nattavut Sutyanyong edited comment on SPARK-18597 at 11/28/16 5:09 PM:
---

I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {{ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.



was (Author: nsyca):
I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ... }} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.


> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18597) Do not push down filters for LEFT ANTI JOIN

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702505#comment-15702505
 ] 

Nattavut Sutyanyong edited comment on SPARK-18597 at 11/28/16 5:11 PM:
---

I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ...}} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {{ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to move this piece 
of code from {{InnerLike}} to {{LeftOuter}}.



was (Author: nsyca):
I also left another comment in the PR that {{ExistenceJoin}} should be treated 
the same as {{LeftOuter}} and {{LeftAnti}}, not {{InnerLike}} and {{LeftSemi}}. 
This is not currently exposed because the rewrite of {{\[NOT\] EXISTS OR ...}} 
to {{ExistenceJoin}} happens in rule {{RewritePredicateSubquery}}, which is in 
a separate rule set and placed after the rule {{PushPredicateThroughJoin}}. 
During the transformation in the rule {{PushPredicateThroughJoin}}, an 
ExistenceJoin never exists.

The semantics of {{ExistenceJoin}} says we need to preserve all the rows from 
the left table through the join operation as if it is a regular {{LeftOuter}} 
join. The {{ExistenceJoin}} augments the {{LeftOuter}} operation with a new 
column called {{exists}}, set to true when the join condition in the ON clause 
is true and false otherwise. The filter of any rows will happen in the 
{{Filter}} operation above the {{ExistenceJoin}}.

Example:

A(c1, c2): { (1, 1), (1, 2) }
B(c1): { (NULL) }   // can be any value as it is irrelevant in this example

{code:SQL}
select A.*
from   A
where  exists (select 1 from B where A.c1 = A.c2)
   or A.c2=2
{code}

In this example, the correct result is all the rows from A. If the pattern 
{{ExistenceJoin}} at line 935 in {{Optimizer.scala}} added by the PR of this 
JIRA is indeed active, the code will push down the predicate A.c1 = A.c2 to be 
a {{Filter}} on relation A, which will filter the row (1,2) from A.

If you agree with my analysis above, I can open a JIRA/a PR to remove this 
piece of code.


> Do not push down filters for LEFT ANTI JOIN
> ---
>
> Key: SPARK-18597
> URL: https://issues.apache.org/jira/browse/SPARK-18597
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>  Labels: correctness
> Fix For: 2.1.0
>
>
> The optimizer pushes down filters for left anti joins. This unfortunately has 
> the opposite effect. For example:
> {noformat}
> sql("create or replace temporary view tbl_a as values (1, 5), (2, 1), (3, 6) 
> as t(c1, c2)")
> sql("create or replace temporary view tbl_b as values 1 as t(c1)")
> sql("""
> select *
> from   tbl_a
>left anti join tbl_b on ((tbl_a.c1 = tbl_a.c2) is null or tbl_a.c1 = 
> tbl_a.c2)
> """)
> {noformat}
> Should return rows [1, 5], [2, 1] & [3, 6], but returns no rows.
> The upside is that this will only happen when you use a really weird 
> anti-join (only referencing the table on the left hand side).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-11-28 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702373#comment-15702373
 ] 

Steve Loughran edited comment on SPARK-18512 at 11/28/16 5:54 PM:
--

one question: what's the size of data being committed here?

And a suggestion: use algorithm 2 for committing files if you aren't already; I 
think from the stack trace you may be using the v1 algorithm, which has more 
renames than it needs. 

Here are the options I have for reading ORC and Parquet, and for output work
{code}

  private val ORC_OPTIONS = Map(
"spark.hadoop.orc.splits.include.file.footer" -> "true",
"spark.hadoop.orc.cache.stripe.details.size" -> "1000",
"spark.hadoop.orc.filterPushdown" -> "true")

  private val PARQUET_OPTIONS = Map(
"spark.sql.parquet.mergeSchema" -> "false",
"spark.sql.parquet.filterPushdown" -> "true")

  private val MAPREDUCE_OPTIONS = Map(
"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version" -> "2",
"spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored" -> 
"true" )
{code}
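
A sketch of how these maps could be applied, assuming they are intended as plain 
Spark configuration entries on the session (the app name is illustrative):

{code}
import org.apache.spark.sql.SparkSession

// Fold the ORC/Parquet/committer options above into a single SparkSession.
val builder = SparkSession.builder().appName("s3a-streaming-sink")
(ORC_OPTIONS ++ PARQUET_OPTIONS ++ MAPREDUCE_OPTIONS).foreach { case (k, v) =>
  builder.config(k, v)
}
val spark = builder.getOrCreate()
{code}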


was (Author: ste...@apache.org):
one question: what's the size of data being committed here?

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> 
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A with read-after-write consistency)
>Reporter: Giuseppe Bonaccorso
>
> After a few hours of streaming processing and data saving in Parquet format, 
> I got always this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>   at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've tried also s3://

[jira] [Commented] (SPARK-18512) FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and S3A

2016-11-28 Thread Giuseppe Bonaccorso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702656#comment-15702656
 ] 

Giuseppe Bonaccorso commented on SPARK-18512:
-

Thanks! I've already set mergeSchema=false, but I'm going to add the filterPushdown 
and mapreduce committer options.

> FileNotFoundException on _temporary directory with Spark Streaming 2.0.1 and 
> S3A
> 
>
> Key: SPARK-18512
> URL: https://issues.apache.org/jira/browse/SPARK-18512
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.1
> Environment: AWS EMR 5.0.1
> Spark 2.0.1
> S3 EU-West-1 (S3A with read-after-write consistency)
>Reporter: Giuseppe Bonaccorso
>
> After a few hours of streaming processing and data saving in Parquet format, 
> I got always this exception:
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3a://xxx/_temporary/0/task_
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1004)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:745)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:426)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:362)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
>   at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:510)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
> {code}
> I've also tried s3:// and s3n://, but it always happens after 3-5 hours. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18602) Dependency list still shows that the version of org.codehaus.janino:commons-compiler is 2.7.6

2016-11-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-18602:


Assignee: Yin Huai

> Dependency list still shows that the version of 
> org.codehaus.janino:commons-compiler is 2.7.6
> -
>
> Key: SPARK-18602
> URL: https://issues.apache.org/jira/browse/SPARK-18602
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 2.1.0
>
>
> org.codehaus.janino:janino:3.0.0 depends on 
> org.codehaus.janino:commons-compiler:3.0.0.
> However, 
> https://github.com/apache/spark/blob/branch-2.1/dev/deps/spark-deps-hadoop-2.7
>  still shows that commons-compiler from janino is 2.7.6. This is probably 
> because the hive module depends on calcite-core, which depends on 
> commons-compiler 2.7.6.
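
For downstream builds hitting the same transitive conflict, one way to pin the 
version in an sbt project (illustrative only; this is not the change made in 
Spark's own build):

{code}
// build.sbt sketch: force commons-compiler to match janino 3.0.0, overriding
// the 2.7.6 pulled in transitively via calcite-core.
dependencyOverrides += "org.codehaus.janino" % "commons-compiler" % "3.0.0"
{code}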



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18602) Dependency list still shows that the version of org.codehaus.janino:commons-compiler is 2.7.6

2016-11-28 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18602.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16025
[https://github.com/apache/spark/pull/16025]

> Dependency list still shows that the version of 
> org.codehaus.janino:commons-compiler is 2.7.6
> -
>
> Key: SPARK-18602
> URL: https://issues.apache.org/jira/browse/SPARK-18602
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 2.1.0
>Reporter: Yin Huai
> Fix For: 2.1.0
>
>
> org.codehaus.janino:janino:3.0.0 depends on 
> org.codehaus.janino:commons-compiler:3.0.0.
> However, 
> https://github.com/apache/spark/blob/branch-2.1/dev/deps/spark-deps-hadoop-2.7
>  still shows that commons-compiler from janino is 2.7.6. This is probably 
> because the hive module depends on calcite-core, which depends on 
> commons-compiler 2.7.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702684#comment-15702684
 ] 

Nattavut Sutyanyong commented on SPARK-18582:
-

We shall classify the operators into 4 categories:

# Operators that are allowed anywhere in a correlated subquery and that, by 
   definition, cannot host outer references.
# Operators that are allowed anywhere in a correlated subquery so long as they 
   do not host outer references.
# Operators that need special treatment: Project, Filter, Join, Aggregate, and 
   Generate.
# Any other operator, which is allowed in a correlated subquery only if it is 
   not on a correlation path. In other words, these operators are allowed only 
   under a correlation point.

Note that a correlation path is defined as the sub-tree of all the operators 
that are on the path from the operator hosting the correlated expressions up to 
the operator producing the correlated values.


> Whitelist LogicalPlan operators allowed in correlated subqueries
> 
>
> Key: SPARK-18582
> URL: https://issues.apache.org/jira/browse/SPARK-18582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> We want to tighten the code that handles correlated subquery to whitelist 
> operators that are allowed in it.
> The current code in {{def pullOutCorrelatedPredicates}} looks like
> {code}
>   // Simplify the predicates before pulling them out.
>   val transformed = BooleanSimplification(sub) transformUp {
> case f @ Filter(cond, child) => ...
> case p @ Project(expressions, child) => ...
> case a @ Aggregate(grouping, expressions, child) => ...
> case w : Window => ...
> case j @ Join(left, _, RightOuter, _) => ...
> case j @ Join(left, right, FullOuter, _) => ...
> case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
> case u: Union => ...
> case s: SetOperation => ...
> case e: Expand => ...
> case l : LocalLimit => ...
> case g : GlobalLimit => ...
> case s : Sample => ...
> case p =>
>   failOnOuterReference(p)
>   ...
>   }
> {code}
> The code disallows operators in a sub plan of an operator hosting correlation 
> on a case by case basis. As it is today, it only blocks {{Union}}, 
> {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} 
> {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of 
> {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the 
> list above are permitted to be under a correlation point. Is this risky? 
> There are many (30+ at least from browsing the {{LogicalPlan}} type 
> hierarchy) operators derived from {{LogicalPlan}} class.
> For the case of {{ScalarSubquery}}, it explicitly checks that only 
> {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed 
> ({{CheckAnalysis.scala}} around line 126-165 in and after {{def 
> cleanQuery}}). We should whitelist which operators are allowed in correlated 
> subqueries. At my first glance, we should allow, in addition to the ones 
> allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18582) Whitelist LogicalPlan operators allowed in correlated subqueries

2016-11-28 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702690#comment-15702690
 ] 

Nattavut Sutyanyong commented on SPARK-18582:
-

For the {{Window}} operator, I propose to place it in the last category noted in my 
previous comment. Here is an example showing that it may return incorrect results.

{code}
Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")

sql("select * from t1 where c1 in (select sum(c1) over () from t2 where t1.c2=t2.c2)")
{code}

When pulling up the correlated predicate through the {{Window}} operator, Spark 
does not add the column T2.C2 to the PARTITION BY clause of the {{Window}} 
operator, and the query returns no rows in the above example. The correct result 
is the row (1, 1).

> Whitelist LogicalPlan operators allowed in correlated subqueries
> 
>
> Key: SPARK-18582
> URL: https://issues.apache.org/jira/browse/SPARK-18582
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> We want to tighten the code that handles correlated subquery to whitelist 
> operators that are allowed in it.
> The current code in {{def pullOutCorrelatedPredicates}} looks like
> {code}
>   // Simplify the predicates before pulling them out.
>   val transformed = BooleanSimplification(sub) transformUp {
> case f @ Filter(cond, child) => ...
> case p @ Project(expressions, child) => ...
> case a @ Aggregate(grouping, expressions, child) => ...
> case w : Window => ...
> case j @ Join(left, _, RightOuter, _) => ...
> case j @ Join(left, right, FullOuter, _) => ...
> case j @ Join(_, right, jt, _) if !jt.isInstanceOf[InnerLike] => ...
> case u: Union => ...
> case s: SetOperation => ...
> case e: Expand => ...
> case l : LocalLimit => ...
> case g : GlobalLimit => ...
> case s : Sample => ...
> case p =>
>   failOnOuterReference(p)
>   ...
>   }
> {code}
> The code disallows operators in a sub plan of an operator hosting correlation 
> on a case by case basis. As it is today, it only blocks {{Union}}, 
> {{Intersect}}, {{Except}}, {{Expand}} {{LocalLimit}} {{GlobalLimit}} 
> {{Sample}} {{FullOuter}} and right table of {{LeftOuter}} (and left table of 
> {{RightOuter}}). That means any {{LogicalPlan}} operators that are not in the 
> list above are permitted to be under a correlation point. Is this risky? 
> There are many (30+ at least from browsing the {{LogicalPlan}} type 
> hierarchy) operators derived from {{LogicalPlan}} class.
> For the case of {{ScalarSubquery}}, it explicitly checks that only 
> {{SubqueryAlias}} {{Project}} {{Filter}} {{Aggregate}} are allowed 
> ({{CheckAnalysis.scala}} around line 126-165 in and after {{def 
> cleanQuery}}). We should whitelist which operators are allowed in correlated 
> subqueries. At my first glance, we should allow, in addition to the ones 
> allowed in {{ScalarSubquery}}: {{Join}}, {{Distinct}}, {{Sort}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2016-11-28 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702760#comment-15702760
 ] 

Kazuaki Ishizaki commented on SPARK-18492:
--


I realized that the following code reproduces the "... grows beyond 64 KB" 
message; the query then succeeds, with whole-stage codegen disabled as shown by 
the warning below. I am working on avoiding these messages for nested UDFs.

{code}
09:43:17.332 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
...
org.codehaus.janino.JaninoRuntimeException: Code of method "processNext()V" of 
class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949)
...
09:43:17.496 WARN org.apache.spark.sql.execution.WholeStageCodegenExec: 
Whole-stage codegen disabled for this plan:
...
{code}

Source program
{code:java}
val ua = udf((i: Int) => Array[Int](i))
val u = udf((i: Int) => i)

val df = spark.sparkContext.parallelize(
  (1 to 1).map(i => i)).toDF("i")
df.select($"*",
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)),
  ua(u('i)), ua(u('i)), ua(u('i)), ua(u('i)))
  .showString(1)
{code}
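As a stop-gap for queries like this, whole-stage codegen can also be turned off explicitly so the oversized {{processNext()}} method is never generated. A sketch, assuming the Spark 2.x conf key; it trades some performance for silencing the error/warning:

{code}
// Sketch only: spark.sql.codegen.wholeStage controls whole-stage codegen in Spark 2.x.
// With it off, the plan runs on the non-whole-stage path, so processNext() is not generated.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}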

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> The error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then emits code for several methods ({{init}}, {{processNext}}), each of 
> which repeats the same sequence of statements for every group of variables 
> declared in the class. For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The JVM's 64 KB per-method bytecode limit is exceeded because the code 
> generator uses an incredibly naive strategy: it emits a sequence like the 
> one shown below for each of the 1,454 groups of variables declared above:
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.data
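The usual way around a per-method bytecode limit is to split long runs of generated statements into smaller helper methods. The sketch below is only a generic illustration of that idea (it is not Spark's actual code generator or the fix for this ticket; all names are made up):

{code}
// Conceptual sketch: chunk generated statements into numbered helper methods
// and call them from the entry point, so no single method exceeds the limit.
def splitIntoMethods(stmts: Seq[String], chunkSize: Int = 100): String = {
  val helpers = stmts.grouped(chunkSize).zipWithIndex.map { case (chunk, i) =>
    s"""private void initChunk_$i() {
       |  ${chunk.mkString("\n  ")}
       |}""".stripMargin
  }.toSeq
  val calls = helpers.indices.map(i => s"initChunk_$i();").mkString("\n  ")
  s"""public void init(int index, scala.collection.Iterator inputs[]) {
     |  $calls
     |}
     |${helpers.mkString("\n")}""".stripMargin
}
{code}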

[jira] [Resolved] (SPARK-16282) Implement percentile SQL function

2016-11-28 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-16282.
---
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.1.0

> Implement percentile SQL function
> -
>
> Key: SPARK-16282
> URL: https://issues.apache.org/jira/browse/SPARK-16282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Jiang Xingbo
> Fix For: 2.1.0
>
>







[jira] [Resolved] (SPARK-17680) Unicode Character Support for Column Names and Comments

2016-11-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-17680.
---
  Resolution: Fixed
   Fix Version/s: 2.1.0
Target Version/s: 2.1.0

> Unicode Character Support for Column Names and Comments
> ---
>
> Key: SPARK-17680
> URL: https://issues.apache.org/jira/browse/SPARK-17680
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
> Fix For: 2.1.0
>
>
> Spark SQL supports Unicode characters in column names when they are 
> specified within backticks (`). When Hive support is enabled, the version of 
> the Hive metastore must be higher than 0.12; see HIVE-6013 
> (https://issues.apache.org/jira/browse/HIVE-6013): the Hive metastore has 
> supported Unicode characters in column names since 0.13.
> In Spark SQL, table comments and view comments always allow Unicode 
> characters without backticks.
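A small illustration of the behaviour described above, shown only as a sketch with made-up column and view names:

{code}
// Unicode column names work when quoted with backticks in SQL text.
import spark.implicits._
Seq((1, "Tokyo"), (2, "Osaka")).toDF("識別子", "都市名").createOrReplaceTempView("cities")
spark.sql("SELECT `識別子`, `都市名` FROM cities WHERE `識別子` = 1").show()
{code}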






[jira] [Updated] (SPARK-17680) Unicode Character Support for Column Names and Comments

2016-11-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-17680:
--
Assignee: Kazuaki Ishizaki

> Unicode Character Support for Column Names and Comments
> ---
>
> Key: SPARK-17680
> URL: https://issues.apache.org/jira/browse/SPARK-17680
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.0
>
>
> Spark SQL supports Unicode characters in column names when they are 
> specified within backticks (`). When Hive support is enabled, the version of 
> the Hive metastore must be higher than 0.12; see HIVE-6013 
> (https://issues.apache.org/jira/browse/HIVE-6013): the Hive metastore has 
> supported Unicode characters in column names since 0.13.
> In Spark SQL, table comments and view comments always allow Unicode 
> characters without backticks.






[jira] [Reopened] (SPARK-17680) Unicode Character Support for Column Names and Comments

2016-11-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-17680:
---

> Unicode Character Support for Column Names and Comments
> ---
>
> Key: SPARK-17680
> URL: https://issues.apache.org/jira/browse/SPARK-17680
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.0
>
>
> Spark SQL supports Unicode characters in column names when they are 
> specified within backticks (`). When Hive support is enabled, the version of 
> the Hive metastore must be higher than 0.12; see HIVE-6013 
> (https://issues.apache.org/jira/browse/HIVE-6013): the Hive metastore has 
> supported Unicode characters in column names since 0.13.
> In Spark SQL, table comments and view comments always allow Unicode 
> characters without backticks.






[jira] [Assigned] (SPARK-17680) Unicode Character Support for Column Names and Comments

2016-11-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17680:


Assignee: Kazuaki Ishizaki  (was: Apache Spark)

> Unicode Character Support for Column Names and Comments
> ---
>
> Key: SPARK-17680
> URL: https://issues.apache.org/jira/browse/SPARK-17680
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
> Fix For: 2.1.0
>
>
> Spark SQL supports Unicode characters in column names when they are 
> specified within backticks (`). When Hive support is enabled, the version of 
> the Hive metastore must be higher than 0.12; see HIVE-6013 
> (https://issues.apache.org/jira/browse/HIVE-6013): the Hive metastore has 
> supported Unicode characters in column names since 0.13.
> In Spark SQL, table comments and view comments always allow Unicode 
> characters without backticks.





