[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-26 Thread Kris Mok (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753310#comment-16753310
 ] 

Kris Mok commented on SPARK-26741:
--

Note that there's a similar issue with non-aggregate functions. Here's an 
example:
{code:none}
spark.sql("create table foo (id int, blob binary)")
val df = spark.sql("select length(blob) from foo where id = 1 order by 
length(blob) limit 10")
df.explain(true)
{code}
{code:none}
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Sort ['length('blob) ASC NULLS FIRST], true
      +- 'Project [unresolvedalias('length('blob), None)]
         +- 'Filter ('id = 1)
            +- 'UnresolvedRelation `foo`

== Analyzed Logical Plan ==
length(blob): int
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (id#23 = 1)
               +- SubqueryAlias `default`.`foo`
                  +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Project [length(blob)#25]
      +- Sort [length(blob#24) ASC NULLS FIRST], true
         +- Project [length(blob#24) AS length(blob)#25, blob#24]
            +- Filter (isnotnull(id#23) && (id#23 = 1))
               +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]

== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[length(blob#24) ASC NULLS FIRST], output=[length(blob)#25])
+- *(1) Project [length(blob#24) AS length(blob)#25, blob#24]
   +- *(1) Filter (isnotnull(id#23) && (id#23 = 1))
      +- Scan hive default.foo [blob#24, id#23], HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24]
{code}

Note how the {{Sort}} operator evaluates {{length()}} again, even though the 
projection right below it already computes it. The root cause of this problem 
in the Analyzer is the same as for the main example in this ticket, although 
this example is not as harmful (at least the query still runs...)
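
For anyone who wants to see the duplicated evaluation directly, here is a minimal sketch 
(assuming the spark-shell session from the snippet above; it pokes at Catalyst internals, 
which are not a stable API):
{code:java}
// Walk the analyzed plan and print what each Sort actually orders by.
// With a correct resolution, the ordering key would be the projected attribute
// length(blob)#25 rather than a fresh length(blob#24) expression.
import org.apache.spark.sql.catalyst.plans.logical.Sort

df.queryExecution.analyzed.collect { case s: Sort => s }.foreach { s =>
  s.order.foreach(o => println(o.child))
}
{code}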

> Analyzer incorrectly resolves aggregate function outside of Aggregate 
> operators
> ---
>
> Key: SPARK-26741
> URL: https://issues.apache.org/jira/browse/SPARK-26741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Priority: Major
>
> The analyzer can sometimes hit issues when resolving functions, e.g.
> {code:sql}
> select max(id)
> from range(10)
> group by id
> having count(1) >= 1
> order by max(id)
> {code}
> The analyzed plan of this query is:
> {code:none}
> == Analyzed Logical Plan ==
> max(id): bigint
> Project [max(id)#91L]
> +- Sort [max(id#88L) ASC NULLS FIRST], true
>    +- Project [max(id)#91L, id#88L]
>       +- Filter (count(1)#93L >= cast(1 as bigint))
>          +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L]
>             +- Range (0, 10, step=1, splits=None)
> {code}
> Note how an aggregate function is outside of {{Aggregate}} operators in the 
> fully analyzed plan:
> {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.
> Trying to run this query will lead to weird issues in codegen, but the root 
> cause is in the analyzer:
> {code:none}
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> max(input[1, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> 

[jira] [Resolved] (SPARK-26735) Verify plan integrity for special expressions

2019-01-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26735.
-
   Resolution: Fixed
 Assignee: Kris Mok
Fix Version/s: 3.0.0

> Verify plan integrity for special expressions
> -
>
> Key: SPARK-26735
> URL: https://issues.apache.org/jira/browse/SPARK-26735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 3.0.0
>
>
> Add verification of plan integrity with regard to special expressions being 
> hosted only in supported operators (a rough sketch follows after the list). Specifically:
> * AggregateExpression: should only be hosted in Aggregate, or indirectly in 
> Window
> * WindowExpression: should only be hosted in Window
> * Generator: should only be hosted in Generate
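
A rough sketch of what such a check could look like (an illustration only, not the actual 
patch; it uses Catalyst internals and only covers the AggregateExpression bullet):
{code:java}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Window}

// Flag any operator other than Aggregate or Window whose expressions contain a
// raw AggregateExpression, i.e. an aggregate hosted in an unsupported operator.
def hostsInvalidAggregate(plan: LogicalPlan): Boolean = plan.collect {
  case _: Aggregate => Seq.empty
  case _: Window => Seq.empty
  case other => other.expressions.filter(_.find(_.isInstanceOf[AggregateExpression]).isDefined)
}.flatten.nonEmpty
{code}
Applied to the analyzed plan from SPARK-26741, a check along these lines would flag the 
{{Sort}} that references {{max(id#88L)}}.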






[jira] [Created] (SPARK-26744) Create API supportDataType in File Source V2 framework

2019-01-26 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26744:
--

 Summary: Create API supportDataType in File Source V2 framework
 Key: SPARK-26744
 URL: https://issues.apache.org/jira/browse/SPARK-26744
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang









[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'

2019-01-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26743:


Assignee: (was: Apache Spark)

> Add a test to check the actual resource limit set via 
> 'spark.executor.pyspark.memory'
> -
>
> Key: SPARK-26743
> URL: https://issues.apache.org/jira/browse/SPARK-26743
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Looks like the test that checks the actual resource limit set (by 
> 'spark.executor.pyspark.memory') is missing.






[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'

2019-01-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26743:


Assignee: Apache Spark

> Add a test to check the actual resource limit set via 
> 'spark.executor.pyspark.memory'
> -
>
> Key: SPARK-26743
> URL: https://issues.apache.org/jira/browse/SPARK-26743
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Looks like the test that checks the actual resource limit set (by 
> 'spark.executor.pyspark.memory') is missing.






[jira] [Updated] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'

2019-01-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26743:
-
Description: Looks like the test that checks the actual resource limit set (by 
'spark.executor.pyspark.memory') is missing.  (was: Looks the test that checks 
the actual resource limit set (by 'spark.executor.pyspark.memory'0 is missing.)

> Add a test to check the actual resource limit set via 
> 'spark.executor.pyspark.memory'
> -
>
> Key: SPARK-26743
> URL: https://issues.apache.org/jira/browse/SPARK-26743
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Looks like the test that checks the actual resource limit set (by 
> 'spark.executor.pyspark.memory') is missing.






[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows

2019-01-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26080:
-
Labels: release-notes  (was: )

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Assignee: Hyukjin Kwon
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.1, 3.0.0
>
>
> Use of the resource module in python means worker.py cannot run on a windows 
> system. This package is only available in unix based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.






[jira] [Created] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'

2019-01-26 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26743:


 Summary: Add a test to check the actual resource limit set via 
'spark.executor.pyspark.memory'
 Key: SPARK-26743
 URL: https://issues.apache.org/jira/browse/SPARK-26743
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Looks the test that checks the actual resource limit set (by 
'spark.executor.pyspark.memory'0 is missing.






[jira] [Resolved] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame

2019-01-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25981.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22954
[https://github.com/apache/spark/pull/22954]

> Arrow optimization for conversion from R DataFrame to Spark DataFrame
> -
>
> Key: SPARK-25981
> URL: https://issues.apache.org/jira/browse/SPARK-25981
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> PySpark introduced an optimization for toPandas and createDataFrame with 
> Pandas DataFrame, which leveraged the PyArrow API.
> The R Arrow API is under development 
> (https://github.com/apache/arrow/tree/master/r) and about to be released via 
> CRAN (https://issues.apache.org/jira/browse/ARROW-3204).
> Once it's released, we can reuse PySpark's Arrow optimization code path and 
> leverage it with minimal additional code.






[jira] [Commented] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753248#comment-16753248
 ] 

Dongjoon Hyun commented on SPARK-26630:
---

Thank you, [~Deegue] and [~smilegator].

> Support reading Hive-serde tables whose INPUTFORMAT is 
> org.apache.hadoop.mapreduce
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Assignee: Deegue
>Priority: Major
> Fix For: 3.0.0
>
>
> This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506].
> It will throw a ClassCastException when we use a new input format (e.g. 
> `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD, so we need 
> to use NewHadoopRDD to handle this input format in TableReader.scala.
> Exception :
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}
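
For context, a hedged sketch of the kind of dispatch the description refers to 
(illustrative only, not the actual TableReader change):
{code:java}
// Decide which RDD flavor can consume a Hive table's configured INPUTFORMAT:
// classes implementing the old mapred API work with HadoopRDD, while classes
// that only implement the new mapreduce API need NewHadoopRDD.
def usesNewInputFormatApi(inputFormatClassName: String): Boolean = {
  val cls = Class.forName(inputFormatClassName)
  val oldApi = classOf[org.apache.hadoop.mapred.InputFormat[_, _]].isAssignableFrom(cls)
  val newApi = classOf[org.apache.hadoop.mapreduce.InputFormat[_, _]].isAssignableFrom(cls)
  newApi && !oldApi
}

// e.g. usesNewInputFormatApi("org.apache.hadoop.mapreduce.lib.input.TextInputFormat") => true
//      usesNewInputFormatApi("org.apache.hadoop.mapred.TextInputFormat")              => false
{code}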






[jira] [Assigned] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame

2019-01-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25981:


Assignee: Hyukjin Kwon

> Arrow optimization for conversion from R DataFrame to Spark DataFrame
> -
>
> Key: SPARK-25981
> URL: https://issues.apache.org/jira/browse/SPARK-25981
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> PySpark introduced an optimization for toPandas and createDataFrame with 
> Pandas DataFrame, which leveraged the PyArrow API.
> The R Arrow API is under development 
> (https://github.com/apache/arrow/tree/master/r) and about to be released via 
> CRAN (https://issues.apache.org/jira/browse/ARROW-3204).
> Once it's released, we can reuse PySpark's Arrow optimization code path and 
> leverage it with minimal additional code.






[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753218#comment-16753218
 ] 

Dongjoon Hyun commented on SPARK-26663:
---

I'm closing this issue as `Cannot Reproduce`. It may depend on your environment.

> Cannot query a Hive table with subdirectories
> -
>
> Key: SPARK-26663
> URL: https://issues.apache.org/jira/browse/SPARK-26663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Aäron
>Priority: Major
>
> Hello,
>  
> I want to report the following issue (my first one :) )
> When I create a table in Hive based on a union all, Spark 2.4 is unable 
> to query this table.
> To reproduce:
> *Hive 1.2.1*
> {code:java}
> hive> creat table a(id int);
> insert into a values(1);
> hive> creat table b(id int);
> insert into b values(2);
> hive> create table c(id int) as select id from a union all select id from b;
> {code}
>  
> *Spark 2.3.1*
>  
> {code:java}
> scala> spark.table("c").show
> +---+
> | id|
> +---+
> | 1|
> | 2|
> +---+
> scala> spark.table("c").count
> res5: Long = 2
>  {code}
>  
> *Spark 2.4.0*
> {code:java}
> scala> spark.table("c").show
> 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table 
> perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using 
> metastore schema.
> +---+
> | id|
> +---+
> +---+
> scala> spark.table("c").count
> res3: Long = 0
> {code}
> I did not find an existing issue for this.  Might be important to investigate.
>  
> +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf.
>  
> Kind regards.
>  






[jira] [Resolved] (SPARK-26663) Cannot query a Hive table with subdirectories

2019-01-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26663.
---
Resolution: Cannot Reproduce

> Cannot query a Hive table with subdirectories
> -
>
> Key: SPARK-26663
> URL: https://issues.apache.org/jira/browse/SPARK-26663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Aäron
>Priority: Major
>
> Hello,
>  
> I want to report the following issue (my first one :) )
> When I create a table in Hive based on a union all, Spark 2.4 is unable 
> to query this table.
> To reproduce:
> *Hive 1.2.1*
> {code:java}
> hive> creat table a(id int);
> insert into a values(1);
> hive> creat table b(id int);
> insert into b values(2);
> hive> create table c(id int) as select id from a union all select id from b;
> {code}
>  
> *Spark 2.3.1*
>  
> {code:java}
> scala> spark.table("c").show
> +---+
> | id|
> +---+
> | 1|
> | 2|
> +---+
> scala> spark.table("c").count
> res5: Long = 2
>  {code}
>  
> *Spark 2.4.0*
> {code:java}
> scala> spark.table("c").show
> 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table 
> perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using 
> metastore schema.
> +---+
> | id|
> +---+
> +---+
> scala> spark.table("c").count
> res3: Long = 0
> {code}
> I did not find an existing issue for this.  Might be important to investigate.
>  
> +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf.
>  
> Kind regards.
>  






[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753216#comment-16753216
 ] 

Dongjoon Hyun commented on SPARK-26663:
---

Hi, [~pomptuintje]. Thank you for reporting. Actually, the given example has 
incorrect syntax like `creat table`. It would be better if you report the exact 
script you used. I did the following, but I cannot reproduce the issue.
{code:java}
Logging initialized using configuration in 
jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive> create table a(id int);
OK
Time taken: 1.299 seconds
hive> create table b(id int);
OK
Time taken: 0.046 seconds
hive> insert into a values(1);
Query ID = dongjoon_20190126145804_2c0252e1-d07c-4213-a387-90efe26d450b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-01-26 14:58:06,272 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local2005651311_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: 
file:/user/hive/warehouse/a/.hive-staging_hive_2019-01-26_14-58-04_030_4426868381325183205-1/-ext-1
Loading data to table default.a
Table default.a stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1]
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.436 seconds
hive> insert into b values(1);
Query ID = dongjoon_20190126145810_034d9c36-0f23-42a6-ac0a-681839335bd6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-01-26 14:58:11,941 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local966105199_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: 
file:/user/hive/warehouse/b/.hive-staging_hive_2019-01-26_14-58-10_554_693159949912597124-1/-ext-1
Loading data to table default.b
Table default.b stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1]
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.551 seconds
hive> create table c as select id from a union all select id from b;
Query ID = dongjoon_20190126145831_c2b31651-c88b-47ab-9081-2375cf064b15
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-01-26 14:58:33,130 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local1725928125_0003
Stage-4 is filtered out by condition resolver.
Stage-3 is selected by condition resolver.
Stage-5 is filtered out by condition resolver.
Launching Job 3 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2019-01-26 14:58:34,449 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local1246940820_0004
Moving data to: file:/user/hive/warehouse/c
Table default.c stats: [numFiles=1, numRows=2, totalSize=4, rawDataSize=2]
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Stage-Stage-3:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.806 seconds
{code}
{code:java}
19/01/26 14:58:42 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = 
local-1548543527800).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("select * from c").show
19/01/26 14:58:57 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
+---+
| id|
+---+
|  1|
|  1|
+---+
{code}

> Cannot query a Hive table with subdirectories
> -
>
> Key: SPARK-26663
> URL: https://issues.apache.org/jira/browse/SPARK-26663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Aäron
>Priority: Major
>
> Hello,
>  
> I want to report the following issue (my first one :) )
> When I create a table in Hive based on a union all, Spark 2.4 is unable 
> to query this table.
> To reproduce:

[jira] [Commented] (SPARK-26675) Error happened during creating avro files

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753213#comment-16753213
 ] 

Dongjoon Hyun commented on SPARK-26675:
---

[~tony0918], do you hit this problem in the Spark Scala shell environment, too?
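
A minimal sketch for narrowing that down from a Spark shell (assumption: the 
NoSuchMethodError comes from a conflicting Avro jar on the classpath rather than from 
spark-avro itself):
{code:java}
// Print which jar provided org.apache.avro.Schema and which createUnion
// overloads that version actually exposes.
val schemaClass = classOf[org.apache.avro.Schema]
println(schemaClass.getProtectionDomain.getCodeSource.getLocation)
schemaClass.getMethods.filter(_.getName == "createUnion")
  .foreach(m => println(m.getParameterTypes.map(_.getSimpleName).mkString("createUnion(", ", ", ")")))
{code}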

> Error happened during creating avro files
> -
>
> Key: SPARK-26675
> URL: https://issues.apache.org/jira/browse/SPARK-26675
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Tony Mao
>Priority: Major
>
> Run cmd
> {code:java}
> spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 
> /nke/reformat.py
> {code}
> code in reformat.py
> {code:java}
> df = spark.read.option("multiline", "true").json("file:///nke/example1.json")
> df.createOrReplaceTempView("traffic")
> a = spark.sql("""SELECT store.*, store.name as store_name, 
> store.dataSupplierId as store_dataSupplierId, trafficSensor.*,
> trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as 
> trafficSensor_dataSupplierId, readings.*
> FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as 
> trafficSensor,
> explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""")
> b = a.drop("trafficSensors", "trafficSensorReadings", "name", 
> "dataSupplierId")
> b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> {code}
> Error message:
> {code:java}
> Traceback (most recent call last):
> File "/nke/reformat.py", line 18, in 
> b.select("store_name", 
> "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 736, in save
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
> in deco
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o45.save.
> : java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> 

[jira] [Commented] (SPARK-26675) Error happened during creating avro files

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753211#comment-16753211
 ] 

Dongjoon Hyun commented on SPARK-26675:
---

cc [~Gengliang.Wang]

> Error happened during creating avro files
> -
>
> Key: SPARK-26675
> URL: https://issues.apache.org/jira/browse/SPARK-26675
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Tony Mao
>Priority: Major
>
> Run cmd
> {code:java}
> spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 
> /nke/reformat.py
> {code}
> code in reformat.py
> {code:java}
> df = spark.read.option("multiline", "true").json("file:///nke/example1.json")
> df.createOrReplaceTempView("traffic")
> a = spark.sql("""SELECT store.*, store.name as store_name, 
> store.dataSupplierId as store_dataSupplierId, trafficSensor.*,
> trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as 
> trafficSensor_dataSupplierId, readings.*
> FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as 
> trafficSensor,
> explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""")
> b = a.drop("trafficSensors", "trafficSensorReadings", "name", 
> "dataSupplierId")
> b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> {code}
> Error message:
> {code:java}
> Traceback (most recent call last):
> File "/nke/reformat.py", line 18, in 
> b.select("store_name", 
> "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 736, in save
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
> in deco
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o45.save.
> : java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> 

[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753207#comment-16753207
 ] 

Dongjoon Hyun commented on SPARK-26742:
---

BTW, we are already using 4.1.0.
{code}
4.1.0
{code}

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Priority: Major
>  Labels: easyfix
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old, and the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix






[jira] [Comment Edited] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753207#comment-16753207
 ] 

Dongjoon Hyun edited comment on SPARK-26742 at 1/26/19 10:38 PM:
-

BTW, we are already using 4.1.0 for Spark 3.0.0.
{code}
4.1.0
{code}


was (Author: dongjoon):
BTW, we are already using 4.1.0.
{code}
4.1.0
{code}

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Priority: Major
>  Labels: easyfix
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old, and the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix






[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-01-26 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753206#comment-16753206
 ] 

Dongjoon Hyun commented on SPARK-26742:
---

SPARK-26603 is working on the testing environment and trying to embrace new K8S 
versions.

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Priority: Major
>  Labels: easyfix
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old, and the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix






[jira] [Created] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-01-26 Thread Steve Davids (JIRA)
Steve Davids created SPARK-26742:


 Summary: Bump Kubernetes Client Version to 4.1.1
 Key: SPARK-26742
 URL: https://issues.apache.org/jira/browse/SPARK-26742
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Kubernetes
Affects Versions: 2.4.0, 3.0.0
Reporter: Steve Davids


Spark 2.x is using Kubernetes Client 3.x, which is pretty old, and the master branch 
has 4.0. The client should be upgraded to 4.1.1 to have the broadest Kubernetes 
compatibility support for newer clusters: 
https://github.com/fabric8io/kubernetes-client#compatibility-matrix






[jira] [Created] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-26 Thread Kris Mok (JIRA)
Kris Mok created SPARK-26741:


 Summary: Analyzer incorrectly resolves aggregate function outside 
of Aggregate operators
 Key: SPARK-26741
 URL: https://issues.apache.org/jira/browse/SPARK-26741
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


The analyzer can sometimes hit issues when resolving functions, e.g.
{code:sql}
select max(id)
from range(10)
group by id
having count(1) >= 1
order by max(id)
{code}
The analyzed plan of this query is:
{code:none}
== Analyzed Logical Plan ==
max(id): bigint
Project [max(id)#91L]
+- Sort [max(id#88L) ASC NULLS FIRST], true
   +- Project [max(id)#91L, id#88L]
      +- Filter (count(1)#93L >= cast(1 as bigint))
         +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L]
            +- Range (0, 10, step=1, splits=None)
{code}
Note how an aggregate function is outside of {{Aggregate}} operators in the 
fully analyzed plan:
{{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.

Trying to run this query will lead to weird issues in codegen, but the root 
cause is in the analyzer:
{code:none}
java.lang.UnsupportedOperationException: Cannot generate code for expression: 
max(input[1, bigint, false])
  at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
  at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
  at 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
  at scala.Option.getOrElse(Option.scala:138)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at scala.collection.TraversableLike.map(TraversableLike.scala:237)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
  at scala.collection.AbstractTraversable.map(Traversable.scala:108)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:195)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:192)
  at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
{code}

The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar 
subquery)}} in {{SubquerySuite}} has been disabled because it hits this issue, 
which was caught by the plan integrity check from SPARK-26735. We should 
re-enable that test once this bug is fixed.
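
As a point of comparison (a hedged illustration, not a fix for the analyzer), ordering by 
an alias of the aggregate should sidestep the mis-resolution, because the Sort then 
references the Aggregate's output attribute instead of a freshly resolved aggregate 
expression:
{code:java}
// Same query, but the ORDER BY refers to the select-list alias.
spark.sql("""
  select max(id) as max_id
  from range(10)
  group by id
  having count(1) >= 1
  order by max_id
""").explain(true)
{code}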






[jira] [Assigned] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone

2019-01-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26740:


Assignee: Apache Spark

> Statistics for date and timestamp columns depend on system time zone
> 
>
> Key: SPARK-26740
> URL: https://issues.apache.org/jira/browse/SPARK-26740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> While saving statistics for timestamp/date columns, the default time zone is 
> used to convert the internal type (microseconds or days since the epoch) to a 
> textual representation. The textual representation doesn't contain a time 
> zone, so when it is converted back to the internal types (Long for 
> TimestampType or DateType), Timestamp.valueOf and Date.valueOf are used, and 
> those methods use the current system time zone.
> If the system time zone differs between saving and retrieving statistics for 
> timestamp/date columns, the restored microseconds/days since the epoch will be 
> different.






[jira] [Assigned] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone

2019-01-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26740:


Assignee: (was: Apache Spark)

> Statistics for date and timestamp columns depend on system time zone
> 
>
> Key: SPARK-26740
> URL: https://issues.apache.org/jira/browse/SPARK-26740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> While saving statistics for timestamp/date columns, the default time zone is 
> used to convert the internal type (microseconds or days since the epoch) to a 
> textual representation. The textual representation doesn't contain a time 
> zone, so when it is converted back to the internal types (Long for 
> TimestampType or DateType), Timestamp.valueOf and Date.valueOf are used, and 
> those methods use the current system time zone.
> If the system time zone differs between saving and retrieving statistics for 
> timestamp/date columns, the restored microseconds/days since the epoch will be 
> different.






[jira] [Created] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone

2019-01-26 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26740:
--

 Summary: Statistics for date and timestamp columns depend on 
system time zone
 Key: SPARK-26740
 URL: https://issues.apache.org/jira/browse/SPARK-26740
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


While saving statistics for timestamp/date columns, the default time zone is used 
to convert the internal type (microseconds or days since the epoch) to a textual 
representation. The textual representation doesn't contain a time zone, so when 
it is converted back to the internal types (Long for TimestampType or DateType), 
Timestamp.valueOf and Date.valueOf are used, and those methods use the current 
system time zone.
If the system time zone differs between saving and retrieving statistics for 
timestamp/date columns, the restored microseconds/days since the epoch will be 
different.
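
A minimal sketch of the underlying behaviour (plain JDK calls, not Spark's statistics 
code): the same zone-less text maps to different instants depending on the JVM's default 
time zone.
{code:java}
import java.sql.Timestamp
import java.util.TimeZone

val text = "2019-01-26 12:00:00"   // zone-less textual form, as stored in the statistics

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
val microsInUtc = Timestamp.valueOf(text).getTime * 1000L

TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
val microsInPst = Timestamp.valueOf(text).getTime * 1000L

println(microsInUtc == microsInPst)   // false: the restored value depends on the default zone
{code}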






[jira] [Commented] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce

2019-01-26 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753146#comment-16753146
 ] 

Xiao Li commented on SPARK-26630:
-

This is an improvement. Thanks!


> Support reading Hive-serde tables whose INPUTFORMAT is 
> org.apache.hadoop.mapreduce
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Assignee: Deegue
>Priority: Major
> Fix For: 3.0.0
>
>
> This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506].
> It will throw a ClassCastException when we use a new input format (e.g. 
> `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD, so we need 
> to use NewHadoopRDD to handle this input format in TableReader.scala.
> Exception :
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}






[jira] [Resolved] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce

2019-01-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26630.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Support reading Hive-serde tables whose INPUTFORMAT is 
> org.apache.hadoop.mapreduce
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Assignee: Deegue
>Priority: Major
> Fix For: 3.0.0
>
>
> This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506].
> It will throw a ClassCastException when we use a new input format (e.g. 
> `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD, so we need 
> to use NewHadoopRDD to handle this input format in TableReader.scala.
> Exception :
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}
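As background, the distinction here is between the old `org.apache.hadoop.mapred` InputFormat API (handled by HadoopRDD) and the new `org.apache.hadoop.mapreduce` API (which needs NewHadoopRDD). The sketch below is not the actual TableReader.scala change; it is a minimal, hypothetical illustration of dispatching on the two APIs using SparkContext's public {{hadoopRDD}} / {{newAPIHadoopRDD}} methods, with TextInputFormat chosen only as an example and the class name hard-coded for the sketch.
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextInputFormat => OldTextInputFormat}
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("input-format-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical: in the real code path the input format class name would come
// from the Hive table's storage descriptor, not a hard-coded string.
val inputFormatClassName = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
val cls = Class.forName(inputFormatClassName)

// New-API (mapreduce) formats must go through newAPIHadoopRDD; old-API (mapred)
// formats go through hadoopRDD. Forcing a mapreduce format into HadoopRDD is
// what produces the ClassCastException above.
val rdd: RDD[(LongWritable, Text)] =
  if (classOf[org.apache.hadoop.mapreduce.InputFormat[_, _]].isAssignableFrom(cls)) {
    sc.newAPIHadoopRDD(
      new Configuration(sc.hadoopConfiguration),
      classOf[NewTextInputFormat],
      classOf[LongWritable],
      classOf[Text])
  } else {
    sc.hadoopRDD(
      new JobConf(sc.hadoopConfiguration),
      classOf[OldTextInputFormat],
      classOf[LongWritable],
      classOf[Text])
  }
{code}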






[jira] [Assigned] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce

2019-01-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-26630:
---

Assignee: Deegue

> Support reading Hive-serde tables whose INPUTFORMAT is 
> org.apache.hadoop.mapreduce
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Assignee: Deegue
>Priority: Major
>
> This bug was found in PR [#23506|https://github.com/apache/spark/pull/23506].
> A ClassCastException is thrown when a new-API input format (e.g. 
> `org.apache.hadoop.mapreduce.InputFormat`) is used to create a HadoopRDD, so we 
> need to use NewHadoopRDD to handle this input format in TableReader.scala.
> Exception:
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}






[jira] [Updated] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce

2019-01-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26630:

Summary: Support reading Hive-serde tables whose INPUTFORMAT is 
org.apache.hadoop.mapreduce  (was: ClassCastException in TableReader while 
creating HadoopRDD)

> Support reading Hive-serde tables whose INPUTFORMAT is 
> org.apache.hadoop.mapreduce
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Priority: Major
>
> This bug was found in PR [#23506|https://github.com/apache/spark/pull/23506].
> A ClassCastException is thrown when a new-API input format (e.g. 
> `org.apache.hadoop.mapreduce.InputFormat`) is used to create a HadoopRDD, so we 
> need to use NewHadoopRDD to handle this input format in TableReader.scala.
> Exception:
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}






[jira] [Updated] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce

2019-01-26 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-26630:

Issue Type: Improvement  (was: Bug)

> Support reading Hive-serde tables whose INPUTFORMAT is 
> org.apache.hadoop.mapreduce
> --
>
> Key: SPARK-26630
> URL: https://issues.apache.org/jira/browse/SPARK-26630
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Deegue
>Priority: Major
>
> This bug was found in PR [#23506|https://github.com/apache/spark/pull/23506].
> A ClassCastException is thrown when a new-API input format (e.g. 
> `org.apache.hadoop.mapreduce.InputFormat`) is used to create a HadoopRDD, so we 
> need to use NewHadoopRDD to handle this input format in TableReader.scala.
> Exception:
> {noformat}
> Caused by: java.lang.ClassCastException: 
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to 
> org.apache.hadoop.mapred.InputFormat
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>   at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254)
>   at scala.Option.getOrElse(Option.scala:138)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:96)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
>   ... 87 more
> {noformat}






[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames

2019-01-26 Thread Skyler Lehan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Skyler Lehan updated SPARK-26739:
-
Description: 
h3. *Q1.* What are you trying to do? Articulate your objectives using 
absolutely no jargon.

Currently, in the join functions on 
[DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
 the join types are defined via a string parameter called joinType. In order 
for a developer to know which joins are possible, they must look up the API 
call for join. While this works fine, it can cause the developer to make a typo 
resulting in improper joins and/or unexpected errors. The objective of this 
improvement would be to allow developers to use a common definition for join 
types (by enum or constants) called JoinTypes. This would contain the possible 
joins and remove the possibility of a typo. It would also allow Spark to alter 
the names of the joins in the future without impacting end-users.
h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow, it would not solve anything other 
than providing a common definition for join types.
h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
Here they manually type in the join type. As stated above, this:
 * Requires developers to manually type in the join type
 * Leaves room for typos
 * Restricts renaming of join types, as it's a literal string
 * Does not restrict or compile-check the join type being used, leading to 
runtime errors

h3. *Q4.* What is new in your approach and why do you think it will be 
successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
JoinType.LEFT_OUTER)
{code}
This would provide:
 * In-code references/definitions of the possible join types
 ** This subsequently allows adding scaladoc describing what each join type 
does and how it operates
 * Removes the possibility of a typo in the join type
 * Provides compile-time checking of the join type (only if an enum is used)

To clarify, if JoinType is a constant, it would just fill in the joinType 
string parameter for users. If an enum is used, it would restrict the domain of 
possible join types to whatever is defined in the future JoinType enum. The 
enum is preferred; however, it would take longer to implement.
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This will make the join function 
easier to wield and lead to fewer runtime errors. It will save time by bringing 
join type validation to compile time. It will also provide an in-code reference to 
the join types, which saves the developer the time of having to look up and 
navigate the multiple join functions to find the possible join types. In 
addition to that, the resulting constants/enum would have documentation on how 
that join type works.
h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types 
could be impacted if an enum is chosen as the common definition. This risk can 
be mitigated by using string constants. The string constants would be the exact 
same string as the string literals used today. For example:
{code:java}
JoinType.INNER = "inner"
{code}
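For illustration, a constants-based JoinTypes definition could be as small as the hypothetical sketch below (this is not an existing Spark API; the names simply mirror the string literals accepted today):
{code:java}
// Hypothetical sketch only: Spark does not currently ship such an object.
// String constants keep full backward compatibility with the existing API,
// while a sealed trait / enum variant would add compile-time checking.
object JoinTypes {
  val Inner = "inner"
  val Cross = "cross"
  val LeftOuter = "left_outer"
  val RightOuter = "right_outer"
  val FullOuter = "full_outer"
  val LeftSemi = "left_semi"
  val LeftAnti = "left_anti"
}

// Usage with the existing string-based join signature:
// val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinTypes.LeftOuter)
{code}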
If an enum is still the preferred way of defining the join types, new join 
functions could be added that take in these enums and the join calls that 
contain string parameters for joinType could be deprecated. This would give 
developers a chance to change over to the new join types.
h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.
h3. *Q8.* What are the mid-term and final "exams" to check for success?

The mid-term exam would be the addition of a common definition of the join types 
and of additional join functions that take the join type enum/constant. The 
final exam would be working tests that check the functionality of these new 
join functions, with the join functions that take a string for joinType 
deprecated.
h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, 
if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
{code}
The following functions would be deprecated:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): 
DataFrame
{code}
A new enum would be created called JoinType. Developers would be 

[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames

2019-01-26 Thread Skyler Lehan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Skyler Lehan updated SPARK-26739:
-
Description: 
h3. *Q1.* What are you trying to do? Articulate your objectives using 
absolutely no jargon.

Currently, in the join functions on 
[DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
 the join types are defined via a string parameter called joinType. In order 
for a developer to know which joins are possible, they must look up the API 
call for join. While this works fine, it can cause the developer to make a typo 
resulting in improper joins and/or unexpected errors. The objective of this 
improvement would be to allow developers to use a common definition for join 
types (by enum or constants) called JoinTypes. This would contain the possible 
joins and remove the possibility of a typo. It would also allow Spark to alter 
the names of the joins in the future without impacting end-users.
h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow, it would not solve anything other 
than providing a common definition for join types.
h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
Here they manually type in the join type. As stated above, this:
 * Requires developers to manually type in the join type
 * Leaves room for typos
 * Restricts renaming of join types, as it's a literal string
 * Does not restrict or compile-check the join type being used, leading to 
runtime errors

h3. *Q4.* What is new in your approach and why do you think it will be 
successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
JoinType.LEFT_OUTER)
{code}
This would provide:
 * In-code references/definitions of the possible join types
 ** This subsequently allows adding scaladoc describing what each join type 
does and how it operates
 * Removes the possibility of a typo in the join type
 * Provides compile-time checking of the join type (only if an enum is used)

To clarify, if JoinType is a constant, it would just fill in the joinType 
string parameter for users. If an enum is used, it would restrict the domain of 
possible join types to whatever is defined in the future JoinType enum. The 
enum is preferred; however, it would take longer to implement.
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This will make the join function 
easier to wield and lead to fewer runtime errors. It will save time by bringing 
join type validation to compile time. It will also provide an in-code reference to 
the join types, which saves the developer the time of having to look up and 
navigate the multiple join functions to find the possible join types. In 
addition to that, the resulting constants/enum would have documentation on how 
that join type works.
h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types 
could be impacted if an enum is chosen as the common definition. This risk can 
be mitigated by using string constants. The string constants would be the exact 
same string as the string literals used today. For example:
{code:java}
JoinType.INNER = "inner"
{code}
If an enum is still the preferred way of defining the join types, new join 
functions could be added that take in these enums and the join calls that 
contain string parameters for joinType could be deprecated. This would give 
developers a chance to change over to the new join types.
h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.
h3. *Q8.* What are the mid-term and final "exams" to check for success?

The mid-term exam would be the addition of a common definition of the join types 
and of additional join functions that take the join type enum/constant. The 
final exam would be working tests that check the functionality of these new 
join functions, with the join functions that take a string for joinType 
deprecated.
h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, 
if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
{code}
The following functions would be deprecated:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): 
DataFrame
{code}
A new enum would be created called JoinType. Developers would be 

[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames

2019-01-26 Thread Skyler Lehan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Skyler Lehan updated SPARK-26739:
-
Description: 
h3. *Q1.* What are you trying to do? Articulate your objectives using 
absolutely no jargon.

Currently, in the [join 
functions|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame]
 on DataFrames, the join types are defined via a string parameter called 
joinType. In order for a developer to know which joins are possible, they must 
look up the API call for join. While this works fine, it can cause the 
developer to make a typo resulting in improper joins and/or unexpected errors. 
The objective of this improvement would be to allow developers to use a common 
definition for join types (by enum or constants) called JoinTypes. This would 
contain the possible joins and remove the possibility of a typo. It would also 
allow Spark to alter the names of the joins in the future without impacting 
end-users.
h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow, it would not solve anything other 
than providing a common definition for join types.
h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
Here they manually type in the join type. As stated above, this:
 * Requires developers to manually type in the join type
 * Leaves room for typos
 * Restricts renaming of join types, as it's a literal string
 * Does not restrict or compile-check the join type being used, leading to 
runtime errors

h3. *Q4.* What is new in your approach and why do you think it will be 
successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
JoinType.LEFT_OUTER)
{code}
This would provide:
 * In-code references/definitions of the possible join types
 ** This subsequently allows adding scaladoc describing what each join type 
does and how it operates
 * Removes the possibility of a typo in the join type
 * Provides compile-time checking of the join type (only if an enum is used)

To clarify, if JoinType is a constant, it would just fill in the joinType 
string parameter for users. If an enum is used, it would restrict the domain of 
possible join types to whatever is defined in the future JoinType enum. The 
enum is preferred; however, it would take longer to implement.
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This will make the join function 
easier to wield and lead to fewer runtime errors. It will save time by bringing 
join type validation to compile time. It will also provide an in-code reference to 
the join types, which saves the developer the time of having to look up and 
navigate the multiple join functions to find the possible join types. In 
addition to that, the resulting constants/enum would have documentation on how 
that join type works.
h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types 
could be impacted if an enum is chosen as the common definition. This risk can 
be mitigated by using string constants. The string constants would be the exact 
same string as the string literals used today. For example:
{code:java}
JoinType.INNER = "inner"
{code}
If an enum is still the preferred way of defining the join types, new join 
functions could be added that take in these enums and the join calls that 
contain string parameters for joinType could be deprecated. This would give 
developers a chance to change over to the new join types.
h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.
h3. *Q8.* What are the mid-term and final "exams" to check for success?

The mid-term exam would be the addition of a common definition of the join types 
and of additional join functions that take the join type enum/constant. The 
final exam would be working tests that check the functionality of these new 
join functions, with the join functions that take a string for joinType 
deprecated.
h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, 
if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
{code}
The following functions would be deprecated:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def join(right: Dataset[_], 

[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames

2019-01-26 Thread Skyler Lehan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Skyler Lehan updated SPARK-26739:
-
Description: 
h3. *Q1.* What are you trying to do? Articulate your objectives using 
absolutely no jargon.

Currently, in the [join 
functions|"http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame"]
 on DataFrames, the join types are defined via a string parameter called 
joinType. In order for a developer to know which joins are possible, they must 
look up the API call for join. While this works fine, it can cause the 
developer to make a typo resulting in improper joins and/or unexpected errors. 
The objective of this improvement would be to allow developers to use a common 
definition for join types (by enum or constants) called JoinTypes. This would 
contain the possible joins and remove the possibility of a typo. It would also 
allow Spark to alter the names of the joins in the future without impacting 
end-users.
h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow, it would not solve anything other 
than providing a common definition for join types.
h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
Here they manually type in the join type. As stated above, this:
 * Requires developers to manually type in the join type
 * Leaves room for typos
 * Restricts renaming of join types, as it's a literal string
 * Does not restrict or compile-check the join type being used, leading to 
runtime errors

h3. *Q4.* What is new in your approach and why do you think it will be 
successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
JoinType.LEFT_OUTER)
{code}
This would provide:
 * In-code references/definitions of the possible join types
 ** This subsequently allows adding scaladoc describing what each join type 
does and how it operates
 * Removes the possibility of a typo in the join type
 * Provides compile-time checking of the join type (only if an enum is used)

To clarify, if JoinType is a constant, it would just fill in the joinType 
string parameter for users. If an enum is used, it would restrict the domain of 
possible join types to whatever is defined in the future JoinType enum. The 
enum is preferred; however, it would take longer to implement.
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This will make the join function 
easier to wield and lead to fewer runtime errors. It will save time by bringing 
join type validation to compile time. It will also provide an in-code reference to 
the join types, which saves the developer the time of having to look up and 
navigate the multiple join functions to find the possible join types. In 
addition to that, the resulting constants/enum would have documentation on how 
that join type works.
h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types 
could be impacted if an enum is chosen as the common definition. This risk can 
be mitigated by using string constants. The string constants would be the exact 
same string as the string literals used today. For example:
{code:java}
JoinType.INNER = "inner"
{code}
If an enum is still the preferred way of defining the join types, new join 
functions could be added that take in these enums and the join calls that 
contain string parameters for joinType could be deprecated. This would give 
developers a chance to change over to the new join types.
h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.
h3. *Q8.* What are the mid-term and final "exams" to check for success?

The mid-term exam would be the addition of a common definition of the join types 
and of additional join functions that take the join type enum/constant. The 
final exam would be working tests that check the functionality of these new 
join functions, with the join functions that take a string for joinType 
deprecated.
h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, 
if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
{code}
The following functions would be deprecated:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def join(right: Dataset[_], 

[jira] [Created] (SPARK-26739) Standardized Join Types for DataFrames

2019-01-26 Thread Skyler Lehan (JIRA)
Skyler Lehan created SPARK-26739:


 Summary: Standardized Join Types for DataFrames
 Key: SPARK-26739
 URL: https://issues.apache.org/jira/browse/SPARK-26739
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Skyler Lehan
 Fix For: 2.4.1


h3. *Q1.* What are you trying to do? Articulate your objectives using 
absolutely no jargon.

Currently, in the [join 
functions|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame]
 on DataFrames, the join types are defined via a string parameter called 
joinType. In order for a developer to know which joins are possible, they must 
look up the API call for join. While this works fine, it can cause the 
developer to make a typo resulting in improper joins and/or unexpected errors. 
The objective of this improvement would be to allow developers to use a common 
definition for join types (by enum or constants) called JoinTypes. This would 
contain the possible joins and remove the possibility of a typo. It would also 
allow Spark to alter the names of the joins in the future without impacting 
end-users.
h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow, it would not solve anything other 
than providing a common definition for join types.
h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so: 
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
Here they manually type in the join type. As stated above, this:
 * Requires developers to manually type in the join type
 * Leaves room for typos
 * Restricts renaming of join types, as it's a literal string
 * Does not restrict or compile-check the join type being used, leading to 
runtime errors

h3. *Q4.* What is new in your approach and why do you think it will be 
successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
JoinType.LEFT_OUTER)
{code}
This would provide:
 * In-code references/definitions of the possible join types
 ** This subsequently allows adding scaladoc describing what each join type 
does and how it operates
 * Removes the possibility of a typo in the join type
 * Provides compile-time checking of the join type (only if an enum is used)

To clarify, if JoinType is a constant, it would just fill in the joinType 
string parameter for users. If an enum is used, it would restrict the domain of 
possible join types to whatever is defined in the future JoinType enum. The 
enum is preferred; however, it would take longer to implement.
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This will make the join function 
easier to wield and lead to fewer runtime errors. It will save time by bringing 
join type validation to compile time. It will also provide an in-code reference to 
the join types, which saves the developer the time of having to look up and 
navigate the multiple join functions to find the possible join types. In 
addition to that, the resulting constants/enum would have documentation on how 
that join type works.
h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types 
could be impacted if an enum is chosen as the common definition. This risk can 
be mitigated by using string constants. The string constants would be the exact 
same string as the string literals used today. For example:
{code:java}
JoinType.INNER = "inner"
{code}
If an enum is still the preferred way of defining the join types, new join 
functions could be added that take in these enums and the join calls that 
contain string parameters for joinType could be deprecated. This would give 
developers a chance to change over to the new join types.
h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.
h3. *Q8.* What are the mid-term and final "exams" to check for success?

The mid-term exam would be the addition of a common definition of the join types 
and of additional join functions that take the join type enum/constant. The 
final exam would be working tests that check the functionality of these new 
join functions, with the join functions that take a string for joinType 
deprecated.
h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, 
if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, 

[jira] [Assigned] (SPARK-26656) Benchmark for date/time functions and expressions

2019-01-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26656:


Assignee: Apache Spark

> Benchmark for date/time functions and expressions
> -
>
> Key: SPARK-26656
> URL: https://issues.apache.org/jira/browse/SPARK-26656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Write benchmarks for datetimeExpressions
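As a rough illustration of the kind of measurement intended (this is not the project's Benchmark harness under sql/core, and the expressions, data size, and session setup are arbitrary assumptions), a spark-shell style sketch might look like:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("datetime-bench-sketch").getOrCreate()

// 10 million rows with a synthetic timestamp column.
val df = spark.range(10L * 1000 * 1000)
  .select((col("id") % 100000).cast("timestamp").as("ts"))

// Crude wall-clock timer for the sketch.
def time(name: String)(f: => Unit): Unit = {
  val start = System.nanoTime()
  f
  println(s"$name: ${(System.nanoTime() - start) / 1e6} ms")
}

// Force full evaluation of each expression with a cheap aggregate.
time("year")           { df.select(year(col("ts"))).agg(count(lit(1))).collect() }
time("date_trunc")     { df.select(date_trunc("HOUR", col("ts"))).agg(count(lit(1))).collect() }
time("unix_timestamp") { df.select(unix_timestamp(col("ts"))).agg(count(lit(1))).collect() }
{code}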






[jira] [Assigned] (SPARK-26656) Benchmark for date/time functions and expressions

2019-01-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26656:


Assignee: (was: Apache Spark)

> Benchmark for date/time functions and expressions
> -
>
> Key: SPARK-26656
> URL: https://issues.apache.org/jira/browse/SPARK-26656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Write benchmarks for datetimeExpressions






[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes

2019-01-26 Thread Sergey (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753005#comment-16753005
 ] 

Sergey commented on SPARK-26688:


Hi Imran,

thanks for your reply.

"Meanwhile devs start to apply this willy-nilly, as these configs tend to just 
keep getting built up over time "

An SLA could mitigate this problem. Every blacklisted node for a specific job 
slows it down in the long run. In any case, devs would have to report the node 
issue to ops and communicate with them to resolve it. 

 

"Ideally, blacklisting and speculation should be able to prevent that problem"

We are going to try out speculation but we are not there yet. 

> Provide configuration of initially blacklisted YARN nodes
> -
>
> Key: SPARK-26688
> URL: https://issues.apache.org/jira/browse/SPARK-26688
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Introducing new config for initially blacklisted YARN nodes.
> This came up in the apache spark user mailing list: 
> [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html]
>  
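For context, a purely hypothetical usage sketch follows. The second configuration key shown does not exist (this ticket only proposes introducing such a setting); only {{spark.blacklist.enabled}} is an existing option here.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("initially-blacklisted-nodes-sketch")
  // Existing blacklisting machinery.
  .config("spark.blacklist.enabled", "true")
  // Hypothetical key for the config proposed in this ticket: nodes listed here
  // would be blacklisted before any task is scheduled on them.
  .config("spark.yarn.blacklist.initial.nodes", "badhost1.example.com,badhost2.example.com")
  .getOrCreate()
{code}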






[jira] [Updated] (SPARK-26738) Pyspark random forest classifier feature importance with column names

2019-01-26 Thread Praveen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen updated SPARK-26738:

Description: 
I am trying to plot the feature importances of a random forest classifier with 
column names. I am using Spark 2.3.2 and PySpark.

The input X is sentences, and I am using TF-IDF (HashingTF + IDF) + StringIndexer 
to generate the feature vectors.

I have included all the stages in a Pipeline

 
{code:java}
regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, outputCol= 
"words", pattern="[a-zA-Z_]+", toLowercase=True, 
minTokenLength=minimum_token_size)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", 
numFeatures=number_of_feature)
idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col)
indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name)
converter = IndexToString(inputCol='prediction', outputCol="original_label", 
labels=indexer.fit(df).labels)
feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
estimator = RandomForestClassifier(labelCol=label_col, 
featuresCol=features_col, numTrees=100)
pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
model = pipeline.fit(df)
{code}
Generating the feature importances as
{code:java}
rdc = model.stages[-2]
print (rdc.featureImportances)
{code}
So far so good, but when I try to map the feature importances to the feature 
columns as below
{code:java}
attrs = sorted((attr["idx"], attr["name"]) for attr in 
(chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values(

[(name, rdc.featureImportances[idx])
   for idx, name in attrs
   if dtModel_1.featureImportances[idx]]{code}
 

I get the key error on ml_attr
{code:java}
KeyError: 'ml_attr'{code}
I printed the dictionary,
{code:java}
print (df_pred.schema["featurescol"].metadata){code}
and it's empty {}

Any thoughts on what I am doing wrong? How can I map the feature importances 
to the column names?

Thanks

  was:
I am trying to plot the feature importances of a random forest classifier with 
column names. I am using Spark 2.3.2 and PySpark.

The input X is sentences, and I am using TF-IDF (HashingTF + IDF) + StringIndexer 
to generate the feature vectors.

I have included all the stages in a Pipeline

 

 

{{regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, 
outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, 
minTokenLength=minimum_token_size) 

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", 
numFeatures=number_of_feature) 

idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) 

indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) 

converter = IndexToString(inputCol='prediction', outputCol="original_label", 
labels=indexer.fit(df).labels) 

feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) 

estimator = RandomForestClassifier(labelCol=label_col, 
featuresCol=features_col, numTrees=100) 

pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])

model = pipeline.fit(df)}}{{}}

 

 

Generating the feature importances as

 
{code:java}
rdc = model.stages[-2]
print (rdc.featureImportances)
{code}
So far so good, but when I try to map the feature importances to the feature 
columns as below
{code:java}
attrs = sorted((attr["idx"], attr["name"]) for attr in 
(chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values(

[(name, rdc.featureImportances[idx])
   for idx, name in attrs
   if dtModel_1.featureImportances[idx]]{code}
 

I get the key error on ml_attr
{code:java}
KeyError: 'ml_attr'{code}
I printed the dictionary,
{code:java}
print (df_pred.schema["featurescol"].metadata){code}
and it's empty {}

Any thoughts on what I am doing wrong? How can I map the feature importances 
to the column names?

Thanks


> Pyspark random forest classifier feature importance with column names
> -
>
> Key: SPARK-26738
> URL: https://issues.apache.org/jira/browse/SPARK-26738
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.2
> Environment: {code:java}
>  {code}
>Reporter: Praveen
>Priority: Major
>  Labels: RandomForest, pyspark
>
> I am trying to plot the feature importances of a random forest classifier with 
> column names. I am using Spark 2.3.2 and PySpark.
> The input X is sentences, and I am using TF-IDF (HashingTF + IDF) + 
> StringIndexer to generate the feature vectors.
> I have included all the stages in a Pipeline
>  
> {code:java}
> regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, 
> outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, 
> minTokenLength=minimum_token_size)
> hashingTF = HashingTF(inputCol="words", 

[jira] [Updated] (SPARK-26731) remove EOLed spark jobs from jenkins

2019-01-26 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated SPARK-26731:

Attachment: LICENSE

> remove EOLed spark jobs from jenkins
> 
>
> Key: SPARK-26731
> URL: https://issues.apache.org/jira/browse/SPARK-26731
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 1.6.3, 2.0.2, 2.1.3
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz
>
>
> i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 
> and 2.1 on jenkins.
> these include all test builds, as well as docs, lint, compile, and packaging.






[jira] [Created] (SPARK-26738) Pyspark random forest classifier feature importance with column names

2019-01-26 Thread Praveen (JIRA)
Praveen created SPARK-26738:
---

 Summary: Pyspark random forest classifier feature importance with 
column names
 Key: SPARK-26738
 URL: https://issues.apache.org/jira/browse/SPARK-26738
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 2.3.2
 Environment: {code:java}
 {code}
Reporter: Praveen


I am trying to plot the feature importances of a random forest classifier with 
column names. I am using Spark 2.3.2 and PySpark.

The input X is sentences, and I am using TF-IDF (HashingTF + IDF) + StringIndexer 
to generate the feature vectors.

I have included all the stages in a Pipeline

 

 

{{regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, 
outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, 
minTokenLength=minimum_token_size) 

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", 
numFeatures=number_of_feature) 

idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) 

indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) 

converter = IndexToString(inputCol='prediction', outputCol="original_label", 
labels=indexer.fit(df).labels) 

feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) 

estimator = RandomForestClassifier(labelCol=label_col, 
featuresCol=features_col, numTrees=100) 

pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])

model = pipeline.fit(df)}}{{}}

 

 

Generating the feature importances as

 
{code:java}
rdc = model.stages[-2]
print (rdc.featureImportances)
{code}
So far so good, but when I try to map the feature importances to the feature 
columns as below
{code:java}
attrs = sorted((attr["idx"], attr["name"]) for attr in 
(chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values(

[(name, rdc.featureImportances[idx])
   for idx, name in attrs
   if dtModel_1.featureImportances[idx]]{code}
 

I get the key error on ml_attr
{code:java}
KeyError: 'ml_attr'{code}
I printed the dictionary,
{code:java}
print (df_pred.schema["featurescol"].metadata){code}
and it's empty {}

Any thoughts on what I am doing wrong? How can I map the feature importances 
to the column names?

Thanks
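One note that may help here, sketched in Scala rather than PySpark: per-feature names come from the ML attribute metadata on the features column, and HashingTF output generally carries no per-feature names, which is consistent with the empty metadata printed above. The helper below is a hypothetical sketch that assumes the features column was produced by a transformer that does attach attributes (for example VectorAssembler over named columns); the function name and fallback naming scheme are illustrative only.
{code:java}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.sql.DataFrame

// Hypothetical helper: pair featureImportances with attribute names when the
// metadata is present, falling back to positional names otherwise.
def importancesByName(
    model: RandomForestClassificationModel,
    predictions: DataFrame,
    featuresCol: String): Seq[(String, Double)] = {
  val attrGroup = AttributeGroup.fromStructField(predictions.schema(featuresCol))
  val names: Array[String] = attrGroup.attributes match {
    case Some(attrs) => attrs.map(a => a.name.getOrElse(s"f${a.index.getOrElse(-1)}"))
    case None        => Array.tabulate(model.numFeatures)(i => s"f$i")
  }
  // Sort descending by importance.
  names.zip(model.featureImportances.toArray).sortBy(-_._2).toSeq
}
{code}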






[jira] [Updated] (SPARK-26722) add SPARK_TEST_KEY=1 to pull request builder and spark-master-test-sbt-hadoop-2.7

2019-01-26 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated SPARK-26722:

Attachment: SPARK-26731.doc

> add SPARK_TEST_KEY=1 to pull request builder and 
> spark-master-test-sbt-hadoop-2.7
> -
>
> Key: SPARK-26722
> URL: https://issues.apache.org/jira/browse/SPARK-26722
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: SPARK-26731.doc
>
>
> from https://github.com/apache/spark/pull/23117:
> we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and 
> {{spark-master-test-sbt-hadoop-2.7}} builds.
> this is done for the PRB, and was manually added to the 
> {{spark-master-test-sbt-hadoop-2.7}} build.
> i will leave this open until i finish porting the JJB configs in to the main 
> spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build).






[jira] [Issue Comment Deleted] (SPARK-26731) remove EOLed spark jobs from jenkins

2019-01-26 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated SPARK-26731:

Comment: was deleted

(was: {code:java}
// code placeholder
{code}
$ cd infrastructure/content/dev/ $ $EDITOR infra-site.mdtext $ svn commit ... 
wait a short while for the page to be rebuilt ... ... **ONLY IF** you are an 
ASF Member, then publish: ... $ curl -sL http://s.apache.org/cms-cli | perl})

> remove EOLed spark jobs from jenkins
> 
>
> Key: SPARK-26731
> URL: https://issues.apache.org/jira/browse/SPARK-26731
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 1.6.3, 2.0.2, 2.1.3
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz
>
>
> i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 
> and 2.1 on jenkins.
> these include all test builds, as well as docs, lint, compile, and packaging.






[jira] [Commented] (SPARK-26731) remove EOLed spark jobs from jenkins

2019-01-26 Thread Chris Bogan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752988#comment-16752988
 ] 

Chris Bogan commented on SPARK-26731:
-

{code:java}
// code placeholder
{code}
$ cd infrastructure/content/dev/ $ $EDITOR infra-site.mdtext $ svn commit ... 
wait a short while for the page to be rebuilt ... ... **ONLY IF** you are an 
ASF Member, then publish: ... $ curl -sL http://s.apache.org/cms-cli | perl}

> remove EOLed spark jobs from jenkins
> 
>
> Key: SPARK-26731
> URL: https://issues.apache.org/jira/browse/SPARK-26731
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 1.6.3, 2.0.2, 2.1.3
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz
>
>
> i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 
> and 2.1 on jenkins.
> these include all test builds, as well as docs, lint, compile, and packaging.






[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set

2019-01-26 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated SPARK-26327:

Attachment: request_handler_interface.json

> Metrics in FileSourceScanExec not update correctly while 
> relation.partitionSchema is set
> 
>
> Key: SPARK-26327
> URL: https://issues.apache.org/jira/browse/SPARK-26327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
> Attachments: Homepage - Material Design, 
> apache-opennlp-1.9.1-bin.tar.gz, request_handler_interface.json
>
>
> With the current approach in `FileSourceScanExec`, the "numFiles" and 
> "metadataTime" (fileListingTime) metrics are updated when the lazy val 
> `selectedPartitions` is initialized, in the scenario where relation.partitionSchema 
> is set. But `selectedPartitions` is first initialized by `metadata`, which is 
> called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when 
> `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding entry in 
> liveExecutions in SQLAppStatusListener, and the metrics update does not work.






[jira] [Updated] (SPARK-26731) remove EOLed spark jobs from jenkins

2019-01-26 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated SPARK-26731:

Attachment: activemq-cli-tools-4a984ec.tar.gz

> remove EOLed spark jobs from jenkins
> 
>
> Key: SPARK-26731
> URL: https://issues.apache.org/jira/browse/SPARK-26731
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 1.6.3, 2.0.2, 2.1.3
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz
>
>
> i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 
> and 2.1 on jenkins.
> these include all test builds, as well as docs, lint, compile, and packaging.






[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set

2019-01-26 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated SPARK-26327:

Attachment: apache-opennlp-1.9.1-bin.tar.gz

> Metrics in FileSourceScanExec not update correctly while 
> relation.partitionSchema is set
> 
>
> Key: SPARK-26327
> URL: https://issues.apache.org/jira/browse/SPARK-26327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
> Attachments: Homepage - Material Design, 
> apache-opennlp-1.9.1-bin.tar.gz
>
>
> With the current approach in `FileSourceScanExec`, the "numFiles" and 
> "metadataTime" (fileListingTime) metrics are updated when the lazy val 
> `selectedPartitions` is initialized, in the scenario where relation.partitionSchema 
> is set. But `selectedPartitions` is first initialized by `metadata`, which is 
> called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when 
> `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding entry in 
> liveExecutions in SQLAppStatusListener, and the metrics update does not work.






[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set

2019-01-26 Thread Chris Bogan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bogan updated SPARK-26327:

Attachment: Homepage - Material Design

> Metrics in FileSourceScanExec not update correctly while 
> relation.partitionSchema is set
> 
>
> Key: SPARK-26327
> URL: https://issues.apache.org/jira/browse/SPARK-26327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
> Attachments: Homepage - Material Design
>
>
> With the current approach in `FileSourceScanExec`, the "numFiles" and 
> "metadataTime" (fileListingTime) metrics are updated when the lazy val 
> `selectedPartitions` is initialized, in the scenario where relation.partitionSchema 
> is set. But `selectedPartitions` is first initialized by `metadata`, which is 
> called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when 
> `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding entry in 
> liveExecutions in SQLAppStatusListener, and the metrics update does not work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26689) Disk broken causing broadcast failure

2019-01-26 Thread liupengcheng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752961#comment-16752961
 ] 

liupengcheng commented on SPARK-26689:
--

[~tgraves] In production environments, yarn.nodemanager.local-dirs is always 
configured with multiple directories mounted on different disks. Since Spark uses 
this parameter, I think it should also take advantage of that layout and should 
not fail the whole job when only a single disk goes bad.

The PR I put up can also reduce FetchFailures, and even job failures caused by 
FetchFailed, when the blacklist is not enabled or the node is not blacklisted 
(tasks may otherwise be repeatedly scheduled onto the unhealthy node).
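
For illustration, here is a hedged sketch of the general idea only (not the actual 
PR, which is not reproduced here): instead of failing on the first broken 
directory, the block manager could try each configured local dir in turn and only 
give up once all of them have failed. The object and method names below are 
invented, not Spark's DiskBlockManager API:
{code:none}
import java.io.{File, IOException}
import scala.util.{Success, Try}

object DiskFallbackSketch {
  def createLocalFile(localDirs: Seq[File], subDir: String, fileName: String): File = {
    // Lazily try each local dir; stop at the first one where the file can be created.
    val attempts = localDirs.iterator.map { dir =>
      Try {
        val parent = new File(dir, subDir)
        if (!parent.isDirectory && !parent.mkdirs()) {
          throw new IOException(s"Failed to create local dir in $parent")
        }
        new File(parent, fileName)
      }
    }
    attempts.collectFirst { case Success(f) => f }.getOrElse {
      throw new IOException(
        s"Failed to create $fileName in any of the ${localDirs.size} configured local dirs")
    }
  }
}
{code}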

> Disk broken causing broadcast failure
> -
>
> Key: SPARK-26689
> URL: https://issues.apache.org/jira/browse/SPARK-26689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
> Environment: Spark on Yarn
> Multiple Disks
>Reporter: liupengcheng
>Priority: Major
>
> We encountered an application failure in our production cluster that was caused 
> by a bad disk. A single broken disk can bring down the whole application.
> {code:java}
> Job aborted due to stage failure: Task serialization failed: 
> java.io.IOException: Failed to create local dir in 
> /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
> org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> scala.Option.foreach(Option.scala:236)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> We have multiple disks on our cluster nodes, yet the application still fails. I 
> think this is because Spark does not currently handle bad disks in 
> `DiskBlockManager`. In a multi-disk environment we could tolerate a bad disk and 
> avoid the application failure.
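
For context, here is a simplified paraphrase of the directory selection that makes 
a single bad disk fatal: each block name hashes to exactly one local dir, so if 
that dir's disk is broken the write fails even though healthy dirs remain. This is 
a sketch of the behavior, not the exact Spark source, and the object and method 
names are invented:
{code:none}
import java.io.File

object DirSelectionSketch {
  // The choice of local dir is a pure function of the block name, with no
  // fallback to another dir.
  def pickLocalFile(localDirs: Array[File], subDirsPerLocalDir: Int, blockName: String): File = {
    val hash = blockName.hashCode & Integer.MAX_VALUE   // non-negative hash
    val dirId = hash % localDirs.length                 // fixed dir per block name
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
    // If localDirs(dirId) is on a broken disk, creating subDir fails here and the
    // task or broadcast fails, no matter how many healthy dirs are configured.
    new File(subDir, blockName)
  }
}
{code}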



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org