[jira] [Created] (SPARK-23761) Dataframe filter(udf) followed by groupby in pyspark throws a casting error
Dhaniram Kshirsagar created SPARK-23761: --- Summary: Dataframe filter(udf) followed by groupby in pyspark throws a casting error Key: SPARK-23761 URL: https://issues.apache.org/jira/browse/SPARK-23761 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.6.0 Environment: pyspark 1.6.0 Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37) [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2 CentOS 6.7 Reporter: Dhaniram Kshirsagar On pyspark with a dataframe, we get the following exception when a filter (with a UDF) is followed by groupBy: # Snippet of error observed in pyspark {code:java} py4j.protocol.Py4JJavaError: An error occurred while calling o56.filter. : java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate{code} This looks similar to https://issues.apache.org/jira/browse/SPARK-12981, however it is not clear whether it is the same issue. Here is a gist with pyspark steps to reproduce this issue: [https://gist.github.com/dhaniram-kshirsagar/d72545620b6a05d145a1a6bece797b6d] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
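For readers who cannot follow the gist, here is a minimal sketch of the filter-then-groupBy pattern being reported. The column names, sample data, and UDF are illustrative assumptions; the linked gist remains the authoritative reproduction.

{code:python}
# Hypothetical reproduction sketch of the reported pattern on the
# Spark 1.6-era DataFrame API; "key"/"value" are made-up column names.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

sc = SparkContext(appName="spark-23761-repro")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([(1, 10), (1, 20), (2, 30)], ["key", "value"])
is_big = udf(lambda v: v > 15, BooleanType())

# filter(UDF) followed by groupBy -- the combination reported to raise
# ClassCastException: Project cannot be cast to Aggregate.
df.filter(is_big(col("value"))).groupBy("key").count().show()
{code}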
[jira] [Resolved] (SPARK-23666) Undeterministic column name with UDFs
[ https://issues.apache.org/jira/browse/SPARK-23666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23666. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Undeterministic column name with UDFs > - > > Key: SPARK-23666 > URL: https://issues.apache.org/jira/browse/SPARK-23666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Daniel Darabos >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.4.0 > > > When you access structure fields in Spark SQL, the auto-generated result > column name includes an internal ID. > {code:java} > scala> import spark.implicits._ > scala> Seq(((1, 2), 3)).toDF("a", "b").createOrReplaceTempView("x") > scala> spark.udf.register("f", (a: Int) => a) > scala> spark.sql("select f(a._1) from x").show > +-+ > |UDF:f(a._1 AS _1#148)| > +-+ > |1| > +-+ > {code} > This ID ({{#148}}) is only included for UDFs. > {code:java} > scala> spark.sql("select factorial(a._1) from x").show > +---+ > |factorial(a._1 AS `_1`)| > +---+ > | 1| > +---+ > {code} > The internal ID is different on every invocation. The problem this causes for > us is that the schema of the SQL output is never the same: > {code:java} > scala> spark.sql("select f(a._1) from x").schema == >spark.sql("select f(a._1) from x").schema > Boolean = false > {code} > We rely on similar schema checks when reloading persisted data. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
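Until the 2.4.0 fix is picked up, one way to sidestep the unstable auto-generated name is to alias the UDF expression explicitly: an alias pins the output column name, so the schema compares equal across invocations. A PySpark rendering of that workaround, assuming an active SparkSession named spark (the Scala repro above would alias the same way):

{code:python}
from pyspark.sql.types import IntegerType

# Same setup as the Scala example: a struct column "a" and a registered UDF "f".
spark.createDataFrame([((1, 2), 3)], ["a", "b"]).createOrReplaceTempView("x")
spark.udf.register("f", lambda a: a, IntegerType())

# With an explicit alias the output name no longer embeds an internal ID,
# so the schema is stable across invocations.
schema1 = spark.sql("select f(a._1) as f_a1 from x").schema
schema2 = spark.sql("select f(a._1) as f_a1 from x").schema
assert schema1 == schema2
{code}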
[jira] [Resolved] (SPARK-23234) ML python test failure due to default outputCol
[ https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23234. -- Resolution: Duplicate > ML python test failure due to default outputCol > --- > > Key: SPARK-23234 > URL: https://issues.apache.org/jira/browse/SPARK-23234 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Major > > SPARK-22799 and SPARK-22797 are causing valid Python test failures. The > reason is that Python is setting the default params with set(), so they are not > considered as defaults, but as params passed by the user. > This means that an outputCol is set not as a default but as a real value. > Anyway, this is a misbehavior of the Python API which can cause serious > problems, and I'd suggest rethinking the way this is done. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
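The distinction at the heart of this report (a value that merely has a default versus one explicitly set by the user) can be inspected directly on any pyspark.ml stage. A small illustrative check, using Bucketizer only as a convenient example:

{code:python}
from pyspark.ml.feature import Bucketizer

b = Bucketizer()
b.hasDefault(b.outputCol)   # True  -- a default value exists
b.isSet(b.outputCol)        # False -- not explicitly set by the user
b.isDefined(b.outputCol)    # True  -- set OR has a default

b.setOutputCol("bucketed")
b.isSet(b.outputCol)        # True  -- now it counts as user-supplied
{code}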
[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers
[ https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407510#comment-16407510 ] Bryan Cutler commented on SPARK-23244: -- Just to clarify, the PySpark save/load is just a wrapper making the same calls in Java, so a fix there will also fix the root cause of the issue. > Incorrect handling of default values when deserializing python wrappers of > scala transformers > - > > Key: SPARK-23244 > URL: https://issues.apache.org/jira/browse/SPARK-23244 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Tomas Nykodym >Priority: Minor > > Default values are not handled properly when serializing/deserializing Python > transformers which are wrappers around Scala objects. It looks like, > after deserialization, the default values which were based on the uid do not get > properly restored, and values which were not set are set to their (original) > default values. > Here's a simple code example using Bucketizer: > {code:python} > >>> from pyspark.ml.feature import Bucketizer > >>> a = Bucketizer() > >>> a.save("bucketizer0") > >>> b = Bucketizer.load("bucketizer0") > >>> a._defaultParamMap[a.outputCol] > u'Bucketizer_440bb49206c148989db7__output' > >>> b._defaultParamMap[b.outputCol] > u'Bucketizer_41cf9afbc559ca2bfc9a__output' > >>> a.isSet(a.outputCol) > False > >>> b.isSet(b.outputCol) > True > >>> a.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > >>> b.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers
[ https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23244. -- Resolution: Duplicate > Incorrect handling of default values when deserializing python wrappers of > scala transformers > - > > Key: SPARK-23244 > URL: https://issues.apache.org/jira/browse/SPARK-23244 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Tomas Nykodym >Priority: Minor > > Default values are not handled properly when serializing/deserializing Python > transformers which are wrappers around Scala objects. It looks like, > after deserialization, the default values which were based on the uid do not get > properly restored, and values which were not set are set to their (original) > default values. > Here's a simple code example using Bucketizer: > {code:python} > >>> from pyspark.ml.feature import Bucketizer > >>> a = Bucketizer() > >>> a.save("bucketizer0") > >>> b = Bucketizer.load("bucketizer0") > >>> a._defaultParamMap[a.outputCol] > u'Bucketizer_440bb49206c148989db7__output' > >>> b._defaultParamMap[b.outputCol] > u'Bucketizer_41cf9afbc559ca2bfc9a__output' > >>> a.isSet(a.outputCol) > False > >>> b.isSet(b.outputCol) > True > >>> a.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > >>> b.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers
[ https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407507#comment-16407507 ] Bryan Cutler commented on SPARK-23244: -- I looked into this and it is a little bit different because with save/load, params are only transferred from Java to Python. So the actual problem is in Scala: {code:java} scala> import org.apache.spark.ml.feature.Bucketizer import org.apache.spark.ml.feature.Bucketizer scala> val a = new Bucketizer() a: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18 scala> a.isSet(a.outputCol) res2: Boolean = false scala> a.save("bucketizer0") scala> val b = Bucketizer.load("bucketizer0") b: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18 scala> b.isSet(b.outputCol) res4: Boolean = true{code} It seems this is being worked on in SPARK-23455, so I'll still close this as a duplicate. > Incorrect handling of default values when deserializing python wrappers of > scala transformers > - > > Key: SPARK-23244 > URL: https://issues.apache.org/jira/browse/SPARK-23244 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Tomas Nykodym >Priority: Minor > > Default values are not handled properly when serializing/deserializing Python > transformers which are wrappers around Scala objects. It looks like, > after deserialization, the default values which were based on the uid do not get > properly restored, and values which were not set are set to their (original) > default values. > Here's a simple code example using Bucketizer: > {code:python} > >>> from pyspark.ml.feature import Bucketizer > >>> a = Bucketizer() > >>> a.save("bucketizer0") > >>> b = Bucketizer.load("bucketizer0") > >>> a._defaultParamMap[a.outputCol] > u'Bucketizer_440bb49206c148989db7__output' > >>> b._defaultParamMap[b.outputCol] > u'Bucketizer_41cf9afbc559ca2bfc9a__output' > >>> a.isSet(a.outputCol) > False > >>> b.isSet(b.outputCol) > True > >>> a.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > >>> b.getOutputCol() > u'Bucketizer_440bb49206c148989db7__output' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
[ https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407466#comment-16407466 ] Apache Spark commented on SPARK-23760: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/20870 > CodegenContext.withSubExprEliminationExprs should save/restore CSE state > correctly > -- > > Key: SPARK-23760 > URL: https://issues.apache.org/jira/browse/SPARK-23760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kris Mok >Priority: Major > > There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes > it effectively always clear the subexpression elimination state after it's > called. > The original intent of this function was that it should save the old state, > set the given new state as current and perform codegen (invoke > {{Expression.genCode()}}), and at the end restore the subexpression > elimination state back to the old state. This ticket tracks a fix to actually > implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
[ https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23760: Assignee: (was: Apache Spark) > CodegenContext.withSubExprEliminationExprs should save/restore CSE state > correctly > -- > > Key: SPARK-23760 > URL: https://issues.apache.org/jira/browse/SPARK-23760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kris Mok >Priority: Major > > There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes > it effectively always clear the subexpression elimination state after it's > called. > The original intent of this function was that it should save the old state, > set the given new state as current and perform codegen (invoke > {{Expression.genCode()}}), and at the end restore the subexpression > elimination state back to the old state. This ticket tracks a fix to actually > implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
[ https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23760: Assignee: Apache Spark > CodegenContext.withSubExprEliminationExprs should save/restore CSE state > correctly > -- > > Key: SPARK-23760 > URL: https://issues.apache.org/jira/browse/SPARK-23760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Major > > There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes > it effectively always clear the subexpression elimination state after it's > called. > The original intent of this function was that it should save the old state, > set the given new state as current and perform codegen (invoke > {{Expression.genCode()}}), and at the end restore the subexpression > elimination state back to the old state. This ticket tracks a fix to actually > implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly
Kris Mok created SPARK-23760: Summary: CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly Key: SPARK-23760 URL: https://issues.apache.org/jira/browse/SPARK-23760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.1, 2.2.0 Reporter: Kris Mok There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes it effectively always clear the subexpression elimination state after it's called. The original intent of this function was that it should save the old state, set the given new state as current and perform codegen (invoke {{Expression.genCode()}}), and at the end restore the subexpression elimination state back to the old state. This ticket tracks a fix to actually implement the original intent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
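In other words, the helper should follow the usual save/set/restore idiom rather than clearing the state on exit. A sketch of that contract in Python pseudocode; the names below are illustrative only and are not the actual CodegenContext API, which lives in Scala:

{code:python}
# Illustrative pseudocode of the intended save/set/restore contract.
def with_sub_expr_elimination_exprs(ctx, new_exprs, do_gen_code):
    saved = ctx.sub_expr_elimination_exprs       # save the old state
    ctx.sub_expr_elimination_exprs = new_exprs   # install the new state
    try:
        return do_gen_code()                     # run Expression.genCode()
    finally:
        ctx.sub_expr_elimination_exprs = saved   # restore, don't clear
{code}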
[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407451#comment-16407451 ] Teng Peng edited comment on SPARK-19208 at 3/21/18 4:44 AM: [~timhunter] Has the Jira ticket been opened? I believe the new API for statistical info would be a great improvement. was (Author: teng peng): [~timhunter] Has the Jira ticket been opened? I believe this would be a great improvement. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Major > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407451#comment-16407451 ] Teng Peng commented on SPARK-19208: --- [~timhunter] Has the Jira ticket been opened? I believe this would be a great improvement. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng >Priority: Major > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task; moreover, it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
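To make the overhead concrete: for {{MaxAbsScaler}} the whole fit reduces to a single column-wise max(|x|) pass. A hedged PySpark sketch of that one statistic, assuming a DataFrame {{df}} with a dense "features" vector column of modest dimensionality (the KDD2010 data above is sparse and ~30M-dimensional, so the real fix works on sparse vectors, but the reduction is the same idea):

{code:python}
import numpy as np

# Column-wise maximum absolute value -- the only statistic MaxAbsScaler needs.
max_abs = (df.select("features").rdd
             .map(lambda row: np.abs(row["features"].toArray()))
             .reduce(np.maximum))
{code}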
[jira] [Comment Edited] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389150#comment-16389150 ] Franck Tago edited comment on SPARK-23519 at 3/21/18 3:29 AM: -- Any updates on this ? Could someone assist with this? was (Author: tafra...@gmail.com): Any updates on this ? > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
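Not a fix, but a possible workaround while the issue is open: move the renaming into the SELECT itself, so the query output no longer contains two columns named col1 before the duplicate-name assertion in generateViewProperties runs. An untested sketch (the column-name-list form of CREATE VIEW still fails as reported):

{code:python}
# Possible workaround sketch: alias the repeated column inside the SELECT.
spark.sql("""
  create view default.aview as
  select col1 as int1, col1 as int2 from atable
""")
{code}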
[jira] [Updated] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-23519: Description: 1- create and populate a hive table . I did this in a hive cli session .[ not that this matters ] create table atable (col1 int) ; insert into atable values (10 ) , (100) ; 2. create a view from the table. [These actions were performed from a spark shell ] spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ") java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name. at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) was: 1- create and populate a hive table . I did this in a hive cli session .[ not that this matters ] create table atable (col1 int) ; insert into atable values (10 ) , (100) ; 2. create a view form the table. [ I did this from a spark shell ] spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 from atable ") java.lang.AssertionError: assertion failed: The view output (col1,col1) contains duplicate column name. at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23513) java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit error
[ https://issues.apache.org/jira/browse/SPARK-23513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407355#comment-16407355 ] abel-sun commented on SPARK-23513: -- Can you provide more of the error message? [~Fray] > java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit > error > --- > > Key: SPARK-23513 > URL: https://issues.apache.org/jira/browse/SPARK-23513 > Project: Spark > Issue Type: Bug > Components: EC2, Examples, Input/Output, Java API >Affects Versions: 1.4.0, 2.2.0 >Reporter: Rawia >Priority: Blocker > > Hello > I'm trying to run a spark application (distributedWekaSpark) but when I'm > using the spark-submit command I get this error > {quote}{quote}ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.io.IOException: Expected 12 fields, but got 5 for row: > outlook,temperature,humidity,windy,play > {quote}{quote} > I tried with other datasets but the same error always appeared (always 12 > fields expected) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20709) spark-shell use proxy-user failed
[ https://issues.apache.org/jira/browse/SPARK-20709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407334#comment-16407334 ] KaiXinXIaoLei commented on SPARK-20709: --- [~ffbin] [~srowen] i also meet this problem. Can u tell me how to solve this > spark-shell use proxy-user failed > - > > Key: SPARK-20709 > URL: https://issues.apache.org/jira/browse/SPARK-20709 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.1.0 >Reporter: fangfengbin >Priority: Major > > cmd is : spark-shell --master yarn-client --proxy-user leoB > Throw Exception: failedto find any Kerberos tgt > Log is: > 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, > sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of > successful kerberos logins and latency (milliseconds)]) > 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, > sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[Rate of > failed kerberos logins and latency (milliseconds)]) > 17/05/11 15:56:21 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, > sampleName=Ops, always=false, type=DEFAULT, valueName=Time, value=[GetGroups]) > 17/05/11 15:56:21 DEBUG MetricsSystemImpl: UgiMetrics, User and group related > metrics > 17/05/11 15:56:22 DEBUG Shell: setsid exited with exit code 0 > 17/05/11 15:56:22 DEBUG Groups: Creating new Groups object > 17/05/11 15:56:22 DEBUG NativeCodeLoader: Trying to load the custom-built > native-hadoop library... > 17/05/11 15:56:22 DEBUG NativeCodeLoader: Loaded the native-hadoop library > 17/05/11 15:56:22 DEBUG JniBasedUnixGroupsMapping: Using > JniBasedUnixGroupsMapping for Group resolution > 17/05/11 15:56:22 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping > impl=org.apache.hadoop.security.JniBasedUnixGroupsMapping > 17/05/11 15:56:22 DEBUG Groups: Group mapping > impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; > cacheTimeout=30; warningDeltaMs=5000 > 17/05/11 15:56:22 DEBUG UserGroupInformation: hadoop login > 17/05/11 15:56:22 DEBUG UserGroupInformation: hadoop login commit > 17/05/11 15:56:22 DEBUG UserGroupInformation: using kerberos > user:sp...@hadoop.com > 17/05/11 15:56:22 DEBUG UserGroupInformation: Using user: "sp...@hadoop.com" > with name sp...@hadoop.com > 17/05/11 15:56:22 DEBUG UserGroupInformation: User entry: "sp...@hadoop.com" > 17/05/11 15:56:22 DEBUG UserGroupInformation: Assuming keytab is managed > externally since logged in from subject. 
> 17/05/11 15:56:22 DEBUG UserGroupInformation: UGI loginUser:sp...@hadoop.com > (auth:KERBEROS) > 17/05/11 15:56:22 DEBUG UserGroupInformation: Current time is 1494489382449 > 17/05/11 15:56:22 DEBUG UserGroupInformation: Next refresh is 1494541210600 > 17/05/11 15:56:22 DEBUG UserGroupInformation: PrivilegedAction as:leoB > (auth:PROXY) via sp...@hadoop.com (auth:KERBEROS) > from:org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170) > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > 17/05/11 15:56:29 WARN SparkConf: In Spark 1.0 and later spark.local.dir will > be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS > in mesos/standalone and LOCAL_DIRS in YARN). > 17/05/11 15:56:56 WARN SessionState: load mapred-default.xml, HIVE_CONF_DIR > env not found! > 17/05/11 15:56:56
[jira] [Commented] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407302#comment-16407302 ] Weichen Xu commented on SPARK-23751: I will work on this. :) > Kolmogorov-Smirnoff test Python API in pyspark.ml > - > > Key: SPARK-23751 > URL: https://issues.apache.org/jira/browse/SPARK-23751 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > Python wrapper for new DataFrame-based API for KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
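For context, a sketch of roughly what the wrapper could look like once implemented, modeled on the DataFrame-based Scala API (org.apache.spark.ml.stat.KolmogorovSmirnovTest); the exact pyspark names and signature are assumptions until the work is merged:

{code:python}
from pyspark.ml.stat import KolmogorovSmirnovTest

df = spark.createDataFrame([(0.1,), (0.15,), (-0.2,), (2.3,), (-1.1,)], ["sample"])

# Test the "sample" column against a standard normal distribution.
result = KolmogorovSmirnovTest.test(df, "sample", "norm", 0.0, 1.0).first()
print(result.pValue, result.statistic)
{code}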
[jira] [Updated] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23455: -- Shepherd: Joseph K. Bradley > Default Params in ML should be saved separately > --- > > Key: SPARK-23455 > URL: https://issues.apache.org/jira/browse/SPARK-23455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We save ML's user-supplied params and default params as one entity in JSON. > During loading the saved models, we set all the loaded params into created ML > model instances as user-supplied params. > It causes some problems, e.g., if we strictly disallow some params to be set > at the same time, a default param can fail the param check because it is > treated as user-supplied param after loading. > The loaded default params should not be set as user-supplied params. We > should save ML default params separately in JSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407288#comment-16407288 ] Apache Spark commented on SPARK-23759: -- User 'felixalbani' has created a pull request for this issue: https://github.com/apache/spark/pull/20867 > Unable to bind Spark2 history server to specific host name / IP > --- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Priority: Major > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23759: Assignee: Apache Spark > Unable to bind Spark2 history server to specific host name / IP > --- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Assignee: Apache Spark >Priority: Major > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
[ https://issues.apache.org/jira/browse/SPARK-23759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23759: Assignee: (was: Apache Spark) > Unable to bind Spark2 history server to specific host name / IP > --- > > Key: SPARK-23759 > URL: https://issues.apache.org/jira/browse/SPARK-23759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.2.0 >Reporter: Felix >Priority: Major > > Ideally, exporting SPARK_LOCAL_IP= in spark2 > environment should allow Spark2 History server to bind to private interface > however this is not working in spark 2.2.0 > Spark2 history server still listens on 0.0.0.0 > {code:java} > [root@sparknode1 ~]# netstat -tulapn|grep 18081 > tcp0 0 0.0.0.0:18081 0.0.0.0:* > LISTEN 21313/java > tcp0 0 172.26.104.151:39126172.26.104.151:18081 > TIME_WAIT - > {code} > On earlier versions this change was working fine: > {code:java} > [root@dwphive1 ~]# netstat -tulapn|grep 18081 > tcp0 0 172.26.113.55:18081 0.0.0.0:* > LISTEN 2565/java > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23749) Avoid Hive.get() to compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-23749: Description: {noformat} 18/03/15 22:34:46 WARN Hive: Failed to register all functions. org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.TApplicationException: Invalid method name: 'get_all_functions' at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3897) at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:248) at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:231) at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:388) at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:332) at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:312) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:288) at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:273) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anonfun$obtainDelegationTokens$1.apply$mcV$sp(HiveDelegationTokenProvider.scala:95) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anonfun$obtainDelegationTokens$1.apply(HiveDelegationTokenProvider.scala:94) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anonfun$obtainDelegationTokens$1.apply(HiveDelegationTokenProvider.scala:94) at org.apache.spark.deploy.security.HiveDelegationTokenProvider$$anon$1.run(HiveDelegationTokenProvider.scala:131) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709) at org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130) at org.apache.spark.deploy.security.HiveDelegationTokenProvider.obtainDelegationTokens(HiveDelegationTokenProvider.scala:94) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:132) at org.apache.spark.deploy.security.HadoopDelegationTokenManager$$anonfun$obtainDelegationTokens$2.apply(HadoopDelegationTokenManager.scala:130) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:130) at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.obtainDelegationTokens(YARNHadoopDelegationTokenManager.scala:56) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:388) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:869) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:169) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164) at org.apache.spark.SparkContext.(SparkContext.scala:501) at 
org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2489) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921) at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:48) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:304) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:157) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.Java
[jira] [Commented] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
[ https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407248#comment-16407248 ] Apache Spark commented on SPARK-23750: -- User 'ioana-delaney' has created a pull request for this issue: https://github.com/apache/spark/pull/20868 > [Performance] Inner Join Elimination based on Informational RI constraints > -- > > Key: SPARK-23750 > URL: https://issues.apache.org/jira/browse/SPARK-23750 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ioana Delaney >Priority: Major > > +*Inner Join Elimination based on Informational RI constraints*+ > This transformation detects RI joins and eliminates the parent/PK table if > none of its columns, other than the PK columns, are referenced in the query. > Typical examples that benefit from this rewrite are queries over complex > views. > *View using TPC-DS schema:* > {code} > create view customer_purchases_2002 (id, last, first, product, store_id, > month, quantity) as > select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, > d_moy, ss_quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 > {code} > The view returns customer purchases made in year 2002. It is a join between > fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and > _date_. The tables are joined using RI predicates. > If we write a query that only selects a subset of columns from the view, for > example, we are only interested in the items bought and not the stores, > internally, the Optimizer, will first merge the view into the query, and > then, based on the _primary key – foreign key_ join predicate analysis, it > will decide that the join with the _store_ table is not needed, and therefore > the _store_ table is removed. > *Query:* > {code} > select id, first, last, product, quantity > from customer_purchases_2002 > where product like ‘bicycle%’ and > month between 1 and 2 > {code} > *Internal query after view expansion:* > {code} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > *Internal optimized query after join elimination:* > {code:java} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > The join with _store_ table can be removed since no columns are retrieved > from the table, and every row from the _store_sales_ fact table will find a > match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
[ https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23750: Assignee: Apache Spark > [Performance] Inner Join Elimination based on Informational RI constraints > -- > > Key: SPARK-23750 > URL: https://issues.apache.org/jira/browse/SPARK-23750 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ioana Delaney >Assignee: Apache Spark >Priority: Major > > +*Inner Join Elimination based on Informational RI constraints*+ > This transformation detects RI joins and eliminates the parent/PK table if > none of its columns, other than the PK columns, are referenced in the query. > Typical examples that benefit from this rewrite are queries over complex > views. > *View using TPC-DS schema:* > {code} > create view customer_purchases_2002 (id, last, first, product, store_id, > month, quantity) as > select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, > d_moy, ss_quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 > {code} > The view returns customer purchases made in year 2002. It is a join between > fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and > _date_. The tables are joined using RI predicates. > If we write a query that only selects a subset of columns from the view, for > example, we are only interested in the items bought and not the stores, > internally, the Optimizer, will first merge the view into the query, and > then, based on the _primary key – foreign key_ join predicate analysis, it > will decide that the join with the _store_ table is not needed, and therefore > the _store_ table is removed. > *Query:* > {code} > select id, first, last, product, quantity > from customer_purchases_2002 > where product like ‘bicycle%’ and > month between 1 and 2 > {code} > *Internal query after view expansion:* > {code} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > *Internal optimized query after join elimination:* > {code:java} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > The join with _store_ table can be removed since no columns are retrieved > from the table, and every row from the _store_sales_ fact table will find a > match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
[ https://issues.apache.org/jira/browse/SPARK-23750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23750: Assignee: (was: Apache Spark) > [Performance] Inner Join Elimination based on Informational RI constraints > -- > > Key: SPARK-23750 > URL: https://issues.apache.org/jira/browse/SPARK-23750 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ioana Delaney >Priority: Major > > +*Inner Join Elimination based on Informational RI constraints*+ > This transformation detects RI joins and eliminates the parent/PK table if > none of its columns, other than the PK columns, are referenced in the query. > Typical examples that benefit from this rewrite are queries over complex > views. > *View using TPC-DS schema:* > {code} > create view customer_purchases_2002 (id, last, first, product, store_id, > month, quantity) as > select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, > d_moy, ss_quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 > {code} > The view returns customer purchases made in year 2002. It is a join between > fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and > _date_. The tables are joined using RI predicates. > If we write a query that only selects a subset of columns from the view, for > example, we are only interested in the items bought and not the stores, > internally, the Optimizer, will first merge the view into the query, and > then, based on the _primary key – foreign key_ join predicate analysis, it > will decide that the join with the _store_ table is not needed, and therefore > the _store_ table is removed. > *Query:* > {code} > select id, first, last, product, quantity > from customer_purchases_2002 > where product like ‘bicycle%’ and > month between 1 and 2 > {code} > *Internal query after view expansion:* > {code} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item, store > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > s_store_sk = ss_store_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > *Internal optimized query after join elimination:* > {code:java} > select c_customer_id as id, c_first_name as first, c_last_name as last, >i_product_name as product,ss_quantity as quantity > from store_sales, date_dim, customer, item > where d_date_sk = ss_sold_date_sk and > c_customer_sk = ss_customer_sk and > i_item_sk = ss_item_sk and > d_year = 2002 and > month between 1 and 2 and > product like ‘bicycle%’ > {code} > The join with _store_ table can be removed since no columns are retrieved > from the table, and every row from the _store_sales_ fact table will find a > match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
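A note on verification: once such a rewrite is in place, the simplest way to confirm that the parent/PK table was dropped is to compare optimized plans. A small PySpark sketch, assuming the TPC-DS tables are registered as tables or temp views in the current session:

{code:python}
plan = spark.sql("""
  select c_customer_id, c_first_name, c_last_name, i_product_name, ss_quantity
  from store_sales, date_dim, customer, item, store
  where d_date_sk = ss_sold_date_sk
    and c_customer_sk = ss_customer_sk
    and i_item_sk = ss_item_sk
    and s_store_sk = ss_store_sk
    and d_year = 2002
""")
# With the rewrite, 'store' should no longer appear in the Optimized Logical Plan.
plan.explain(True)
{code}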
[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407247#comment-16407247 ] Darek commented on SPARK-23534: --- https://github.com/Azure/azure-storage-java 7.0 will only work with org.apache.hadoop/hadoop-azure/3.0.0. I am wary of using an older version of azure-storage because of all the security issues that have been found and fixed in the newer version, not to mention all the new features that Azure has added in the last 2 years. Using old software with a public cloud = bad idea. > Spark run on Hadoop 3.0.0 > - > > Key: SPARK-23534 > URL: https://issues.apache.org/jira/browse/SPARK-23534 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > > Major Hadoop vendors have already stepped into, or soon will step into, Hadoop 3.0. So we should also make > sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark > run on Hadoop 3.0. > The work includes: > # Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0. > # Test to see if there are dependency issues with Hadoop 3.0. > # Investigate the feasibility of using shaded client jars (HADOOP-11804). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Priority: Critical (was: Major) > MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. > -- > > Key: SPARK-20697 > URL: https://issues.apache.org/jira/browse/SPARK-20697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0 >Reporter: Abhishek Madav >Priority: Critical > > MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table > does not restore the bucketing information to the storage descriptor in the > metastore. > Steps to reproduce: > 1) Create a paritioned+bucketed table in hive: CREATE TABLE partbucket(a int) > PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED > FIELDS TERMINATED BY ','; > 2) In Hive-CLI issue a desc formatted for the table. > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:8020/user/hive/warehouse/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > transient_lastDdlTime 1494437467 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: 10 > Bucket Columns: [a] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > 3) In spark-shell, > scala> spark.sql("MSCK REPAIR TABLE partbucket") > 4) Back to Hive-CLI > desc formatted partbucket; > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: > hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > spark.sql.partitionProvider catalog > transient_lastDdlTime 1494437647 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > Further inserts to this table cannot be made in bucketed fashion through > Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20697) MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables.
[ https://issues.apache.org/jira/browse/SPARK-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Madav updated SPARK-20697: --- Affects Version/s: 2.2.0 2.2.1 2.3.0 > MSCK REPAIR TABLE resets the Storage Information for bucketed hive tables. > -- > > Key: SPARK-20697 > URL: https://issues.apache.org/jira/browse/SPARK-20697 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.2.1, 2.3.0 >Reporter: Abhishek Madav >Priority: Major > > MSCK REPAIR TABLE used to recover partitions for a partitioned+bucketed table > does not restore the bucketing information to the storage descriptor in the > metastore. > Steps to reproduce: > 1) Create a paritioned+bucketed table in hive: CREATE TABLE partbucket(a int) > PARTITIONED BY (b int) CLUSTERED BY (a) INTO 10 BUCKETS ROW FORMAT DELIMITED > FIELDS TERMINATED BY ','; > 2) In Hive-CLI issue a desc formatted for the table. > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://localhost:8020/user/hive/warehouse/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > transient_lastDdlTime 1494437467 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: 10 > Bucket Columns: [a] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > 3) In spark-shell, > scala> spark.sql("MSCK REPAIR TABLE partbucket") > 4) Back to Hive-CLI > desc formatted partbucket; > # col_namedata_type comment > > a int > > # Partition Information > # col_namedata_type comment > > b int > > # Detailed Table Information > Database: sparkhivebucket > Owner:devbld > CreateTime: Wed May 10 10:31:07 PDT 2017 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: > hdfs://localhost:8020/user/hive/warehouse/sparkhivebucket.db/partbucket > Table Type: MANAGED_TABLE > Table Parameters: > spark.sql.partitionProvider catalog > transient_lastDdlTime 1494437647 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > > InputFormat: org.apache.hadoop.mapred.TextInputFormat > OutputFormat: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > field.delim , > serialization.format, > Further inserts to this table cannot be made in bucketed fashion through > Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
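For convenience, the same metadata check can be driven entirely from spark-shell. A minimal sketch, assuming a Hive-enabled SparkSession and the partitioned+bucketed {{partbucket}} table created as in step 1 of the reproduction:
{code:java}
// Minimal spark-shell sketch, assuming a Hive-enabled SparkSession and the
// partitioned+bucketed `partbucket` table from step 1 of the reproduction.
import spark.implicits._

spark.sql("USE sparkhivebucket")
spark.sql("MSCK REPAIR TABLE partbucket")

// Inspect the storage metadata after the repair; per this report the bucket
// count and bucketing columns are expected to be lost at this point.
spark.sql("DESCRIBE FORMATTED partbucket")
  .filter($"col_name".contains("Bucket"))
  .show(truncate = false)
{code}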
[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407205#comment-16407205 ] Joseph K. Bradley commented on SPARK-23686: --- This will be useful! Synced offline: we'll split this up into subtasks. > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
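To make the suggestion concrete, here is a sketch of the statistics mentioned above (number of training examples, mean/variance of the label) computed with plain DataFrame aggregations; the {{training}} DataFrame and its {{label}} column are made up for illustration, and the internal Instrumentation API itself is not shown since its exact methods are not quoted in this issue:
{code:java}
// Sketch of the statistics the issue proposes to log, computed with plain
// DataFrame aggregations. The `training` DataFrame and its "label" column
// are hypothetical; hooking the values into the Instrumentation class is
// left out because its exact methods are not quoted in this issue.
import org.apache.spark.sql.functions.{count, mean, variance}

val training = spark.range(0, 100).selectExpr("id % 2 AS label", "id AS feature")

val labelStats = training.agg(
  count("*").as("numTrainingExamples"),
  mean("label").as("labelMean"),
  variance("label").as("labelVariance")
).first()

// These values could then be emitted through the instrumentation logger
// instead of being dropped or sent to debug logging.
println(labelStats)
{code}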
[jira] [Commented] (SPARK-10884) Support prediction on single instance for regression and classification related models
[ https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407163#comment-16407163 ] Joseph K. Bradley commented on SPARK-10884: --- I know a lot of people are watching this, so I'm just pinging to say I'm about to merge the PR. Please check out the changes in case you have comments & have not seen the updates. Thanks! > Support prediction on single instance for regression and classification > related models > -- > > Key: SPARK-10884 > URL: https://issues.apache.org/jira/browse/SPARK-10884 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yanbo Liang >Assignee: Weichen Xu >Priority: Major > Labels: 2.2.0 > > Support prediction on single instance for regression and classification > related models (i.e., PredictionModel, ClassificationModel and their sub > classes). > Add corresponding test cases. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
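For readers following along, a sketch of the usage this issue enables, assuming {{predict(features: Vector)}} is exposed on PredictionModel/ClassificationModel subclasses as described; the tiny training set is made up for illustration:
{code:java}
// Sketch of single-instance prediction as proposed here, assuming
// `predict(features: Vector)` becomes public on PredictionModel subclasses.
// The tiny training set is made up for illustration.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

val model = new LogisticRegression().setMaxIter(10).fit(training)

// Predict on a single feature vector, without wrapping it in a DataFrame.
val prediction: Double = model.predict(Vectors.dense(0.0, 1.1, 0.1))
{code}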
[jira] [Created] (SPARK-23759) Unable to bind Spark2 history server to specific host name / IP
Felix created SPARK-23759: - Summary: Unable to bind Spark2 history server to specific host name / IP Key: SPARK-23759 URL: https://issues.apache.org/jira/browse/SPARK-23759 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 2.2.0 Reporter: Felix Ideally, exporting SPARK_LOCAL_IP= in the spark2 environment should allow the Spark2 history server to bind to a private interface; however, this is not working in Spark 2.2.0. The Spark2 history server still listens on 0.0.0.0: {code:java} [root@sparknode1 ~]# netstat -tulapn|grep 18081 tcp0 0 0.0.0.0:18081 0.0.0.0:* LISTEN 21313/java tcp0 0 172.26.104.151:39126172.26.104.151:18081 TIME_WAIT - {code} On earlier versions this change was working fine: {code:java} [root@dwphive1 ~]# netstat -tulapn|grep 18081 tcp0 0 172.26.113.55:18081 0.0.0.0:* LISTEN 2565/java {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem
[ https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407112#comment-16407112 ] Marco Gaido commented on SPARK-23739: - Can you provide some more info about how you are getting this error and how to reproduce? Which command are you using to submit your application? May you also provide a sample to reproduce this? Thanks. > Spark structured streaming long running problem > --- > > Key: SPARK-23739 > URL: https://issues.apache.org/jira/browse/SPARK-23739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Florencio >Priority: Critical > Labels: spark, streaming, structured > > I had a problem with long running spark structured streaming in spark 2.1. > Caused by: java.lang.ClassNotFoundException: > org.apache.kafka.common.requests.LeaveGroupResponse. > The detailed error is the following: > 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. > Metadata OffsetSeqMetadata(0,1521216656590) > 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = > Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = > \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}} > 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map() > 18/03/16 16:10:57 ERROR StreamExecution: Query [id = > a233b9ff-cc39-44d3-b953-a255986c04bf, runId = > 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error > java.util.zip.ZipException: invalid code lengths set > at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at > org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45) > at > org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2101) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370) > at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) > at org.apache.spark.rdd.RDD.map(RDD.scala:369) > at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator > java.lang.NoClassDefFoundError: > org/apache/kafka/common/requests/LeaveGroupResponse > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566) > at > org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555) > at > org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377) > at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66) > at > org.apache.kafka.clients.consumer.KafkaConsumer.close(Kafk
[jira] [Commented] (SPARK-18813) MLlib 2.2 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407108#comment-16407108 ] Joseph K. Bradley commented on SPARK-18813: --- I just linked the roadmap for 2.4 (since we did not have one for 2.3): [SPARK-23758] > MLlib 2.2 Roadmap > - > > Key: SPARK-18813 > URL: https://issues.apache.org/jira/browse/SPARK-18813 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.2.0 > > > *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.* > The roadmap process described below is significantly updated since the 2.1 > roadmap [SPARK-15581]. Please refer to [SPARK-15581] for more discussion on > the basis for this proposal, and comment in this JIRA if you have suggestions > for improvements. > h1. Roadmap process > This roadmap is a master list for MLlib improvements we are working on during > this release. This includes ML-related changes in PySpark and SparkR. > *What is planned for the next release?* > * This roadmap lists issues which at least one Committer has prioritized. > See details below in "Instructions for committers." > * This roadmap only lists larger or more critical issues. > *How can contributors influence this roadmap?* > * If you believe an issue should be in this roadmap, please discuss the issue > on JIRA and/or the dev mailing list. Make sure to ping Committers since at > least one must agree to shepherd the issue. > * For general discussions, use this JIRA or the dev mailing list. For > specific issues, please comment on those issues or the mailing list. > * Vote for & watch issues which are important to you. > ** MLlib, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] > ** SparkR, sorted by: [Votes | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] > or [Watchers | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] > h2. Target Version and Priority > This section describes the meaning of Target Version and Priority. _These > meanings have been updated in this proposal for the 2.2 process._ > || Category | Target Version | Priority | Shepherd | Put on roadmap? | In > next release? 
|| > | [1 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Blocker | *must* | *must* | *must* | > | [2 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Critical | *must* | yes, unless small | *best effort* | > | [3 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Major | *must* | optional | *best effort* | > | [4 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Minor | optional | no | maybe | > | [5 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0] > | next release | Trivial | optional | no | maybe | > | [6 | > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2
[jira] [Updated] (SPARK-23758) MLlib 2.4 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23758: -- Description: h1. Roadmap process This roadmap is a master list for MLlib improvements we are working on during this release. This includes ML-related changes in PySpark and SparkR. *What is planned for the next release?* * This roadmap lists issues which at least one Committer has prioritized. See details below in "Instructions for committers." * This roadmap only lists larger or more critical issues. *How can contributors influence this roadmap?* * If you believe an issue should be in this roadmap, please discuss the issue on JIRA and/or the dev mailing list. Make sure to ping Committers since at least one must agree to shepherd the issue. * For general discussions, use this JIRA or the dev mailing list. For specific issues, please comment on those issues or the mailing list. * Vote for & watch issues which are important to you. ** MLlib, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] ** SparkR, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] h2. Target Version and Priority This section describes the meaning of Target Version and Priority. || Category | Target Version | Priority | Shepherd | Put on roadmap? | In next release? 
|| | [1 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Blocker | *must* | *must* | *must* | | [2 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Critical | *must* | yes, unless small | *best effort* | | [3 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Major | *must* | optional | *best effort* | | [4 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Minor | optional | no | maybe | | [5 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Trivial | optional | no | maybe | | [6 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | yes | no | maybe | | [7 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(EMPTY)%20AND%20Shepherd%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | no | no | maybe | The *Category* in the table above has the following meaning: 1. A committer has promised to see this issue to completion for the next release. Contributions *will* receive attention. 2-3. A committer has promised to see this issue to completion for the next release. Contributions *will* receive attention. The issue ma
[jira] [Created] (SPARK-23758) MLlib 2.4 Roadmap
Joseph K. Bradley created SPARK-23758: - Summary: MLlib 2.4 Roadmap Key: SPARK-23758 URL: https://issues.apache.org/jira/browse/SPARK-23758 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.4.0 Reporter: Joseph K. Bradley h1. Roadmap process This roadmap is a master list for MLlib improvements we are working on during this release. This includes ML-related changes in PySpark and SparkR. *What is planned for the next release?* * This roadmap lists issues which at least one Committer has prioritized. See details below in "Instructions for committers." * This roadmap only lists larger or more critical issues. *How can contributors influence this roadmap?* * If you believe an issue should be in this roadmap, please discuss the issue on JIRA and/or the dev mailing list. Make sure to ping Committers since at least one must agree to shepherd the issue. * For general discussions, use this JIRA or the dev mailing list. For specific issues, please comment on those issues or the mailing list. * Vote for & watch issues which are important to you. ** MLlib, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC] ** SparkR, sorted by: [Votes | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC] or [Watchers | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC] h2. Target Version and Priority This section describes the meaning of Target Version and Priority. || Category | Target Version | Priority | Shepherd | Put on roadmap? | In next release? 
|| | [1 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Blocker | *must* | *must* | *must* | | [2 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Critical | *must* | yes, unless small | *best effort* | | [3 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Major | *must* | optional | *best effort* | | [4 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Minor | optional | no | maybe | | [5 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)] | next release | Trivial | optional | no | maybe | | [6 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | yes | no | maybe | | [7 | https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(EMPTY)%20AND%20Shepherd%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC] | (empty) | (any) | no | no | maybe | The *Category* in the table above has the following meaning: 1. A committer has promised to see this issue to completion for the next release. Contributions *will
[jira] [Updated] (SPARK-23690) VectorAssembler should have handleInvalid to handle columns with null values
[ https://issues.apache.org/jira/browse/SPARK-23690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23690: -- Shepherd: Joseph K. Bradley > VectorAssembler should have handleInvalid to handle columns with null values > > > Key: SPARK-23690 > URL: https://issues.apache.org/jira/browse/SPARK-23690 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > > VectorAssembler only takes in numeric (and vectors (of numeric?)) columns as > an input and returns the assembled vector. It currently throws an error if it > sees a null value in any column. This behavior also affects `RFormula` that > uses VectorAssembler to assemble numeric columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
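A sketch of the current failure mode and the proposed knob; the {{handleInvalid}} values shown are assumptions borrowed from existing params such as StringIndexer's, and the DataFrame is made up for illustration:
{code:java}
// Sketch of the failure mode and the proposed `handleInvalid` parameter.
// The "skip" value is an assumption borrowed from existing handleInvalid
// params (e.g. StringIndexer); the DataFrame is made up for illustration.
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

val df = Seq(
  (Some(1.0), Some(2.0)),
  (Some(3.0), None) // null in column "b"
).toDF("a", "b")

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")

// Today this fails once the row containing the null is assembled:
// assembler.transform(df).show()

// Proposed: let the caller choose what to do with such rows, e.g. skip them:
// assembler.setHandleInvalid("skip").transform(df).show()
{code}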
[jira] [Resolved] (SPARK-23500) Filters on named_structs could be pushed into scans
[ https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23500. - Resolution: Fixed Assignee: Henry Robinson Fix Version/s: 2.4.0 > Filters on named_structs could be pushed into scans > --- > > Key: SPARK-23500 > URL: https://issues.apache.org/jira/browse/SPARK-23500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Assignee: Henry Robinson >Priority: Major > Fix For: 2.4.0 > > > Simple filters on dataframes joined with {{joinWith()}} are missing an > opportunity to get pushed into the scan because they're written in terms of > {{named_struct}} that could be removed by the optimizer. > Given the following simple query over two dataframes: > {code:java} > scala> val df = spark.read.parquet("one_million") > df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> val df2 = spark.read.parquet("one_million") > df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint] > scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > > 30").explain > == Physical Plan == > *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight > :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94] > : +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > struct, false].id)) >+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95] > +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30) > +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: > [], PushedFilters: [], ReadSchema: struct > {code} > Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and > is then pushed down. When the filter is just above the scan, the > wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be > removed. Then the filter can be pushed down to Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
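For contrast, a sketch of the same predicate written with a plain {{join}}, where the filter stays on a base column and is eligible for Parquet pushdown (visible as {{PushedFilters}} in the scan node); paths and column names are the ones from the description:
{code:java}
// Same predicate written with a plain join instead of joinWith, so the filter
// is expressed on a base column rather than on a named_struct. Paths and
// column names are taken from the description above.
import spark.implicits._

val left  = spark.read.parquet("one_million").alias("l")
val right = spark.read.parquet("one_million").alias("r")

// With the filter on a plain column it can reach the Parquet scan as a
// pushed filter, unlike the named_struct form produced by joinWith.
left.join(right, $"r.id" === $"l.id2")
  .filter($"r.id" > 30)
  .explain()
{code}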
[jira] [Created] (SPARK-23757) [Performance] Star schema detection improvements
Ioana Delaney created SPARK-23757: - Summary: [Performance] Star schema detection improvements Key: SPARK-23757 URL: https://issues.apache.org/jira/browse/SPARK-23757 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney Star schema consists of one or more fact tables referencing a number of dimension tables. Queries against star schema are expected to run fast because of the established RI constraints among the tables. In general, star schema joins are detected using the following conditions: 1. RI constraints (reliable detection) * Dimension contains a primary key that is being joined to the fact table. * Fact table contains foreign keys referencing multiple dimension tables. 2. Cardinality based heuristics * Usually, the table with the highest cardinality is the fact table. Existing SPARK-17791 uses a combination of the above two conditions to detect and optimize star joins. With support for informational RI constraints, the algorithm in SPARK-17791 can be improved with reliable RI detection. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
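To make the targeted join shape concrete, here is an illustrative star-schema query over TPC-DS-style tables; the table and column names are assumed to be registered already, and nothing here is new API:
{code:java}
// Illustration of the join shape the detection logic targets: one fact table
// (store_sales) joined to several dimension tables on their keys. The TPC-DS
// style tables are assumed to be registered already.
val starJoin = spark.sql("""
  SELECT d.d_year, s.s_store_name, SUM(f.ss_net_paid) AS revenue
  FROM store_sales f
  JOIN date_dim d ON f.ss_sold_date_sk = d.d_date_sk
  JOIN store s ON f.ss_store_sk = s.s_store_sk
  GROUP BY d.d_year, s.s_store_name
""")
starJoin.explain()
{code}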
[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark
[ https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406909#comment-16406909 ] Ioana Delaney commented on SPARK-19842: --- I opened several performance JIRAs to show the benefits of the informational RI constraints. > Informational Referential Integrity Constraints Support in Spark > > > Key: SPARK-19842 > URL: https://issues.apache.org/jira/browse/SPARK-19842 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Priority: Major > Attachments: InformationalRIConstraints.doc > > > *Informational Referential Integrity Constraints Support in Spark* > This work proposes support for _informational primary key_ and _foreign key > (referential integrity) constraints_ in Spark. The main purpose is to open up > an area of query optimization techniques that rely on referential integrity > constraints semantics. > An _informational_ or _statistical constraint_ is a constraint such as a > _unique_, _primary key_, _foreign key_, or _check constraint_, that can be > used by Spark to improve query performance. Informational constraints are not > enforced by the Spark SQL engine; rather, they are used by Catalyst to > optimize the query processing. They provide semantics information that allows > Catalyst to rewrite queries to eliminate joins, push down aggregates, remove > unnecessary Distinct operations, and perform a number of other optimizations. > Informational constraints are primarily targeted to applications that load > and analyze data that originated from a data warehouse. For such > applications, the conditions for a given constraint are known to be true, so > the constraint does not need to be enforced during data load operations. > The attached document covers constraint definition, metastore storage, > constraint validation, and maintenance. The document shows many examples of > query performance improvements that utilize referential integrity constraints > and can be implemented in Spark. > Link to the google doc: > [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23756) [Performance] Redundant join elimination
Ioana Delaney created SPARK-23756: - Summary: [Performance] Redundant join elimination Key: SPARK-23756 URL: https://issues.apache.org/jira/browse/SPARK-23756 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney This rewrite eliminates self-joins on unique keys. Self-joins may be introduced after view expansion. *User view:* {code} create view manager(mgrno, income) as select e.empno, e.salary + e.bonus from employee e, department d where e.empno = d.mgrno; {code} *User query:* {code} select e.empname, e.empno from employee e, manager m where e.empno = m.mgrno and m.income > 100K {code} *Internal query after view expansion:* {code} select e.lastname, e.empno from employee e, employee m, department d where e.empno = m.empno /* PK = PK */ and e.empno = d.mgrno and m.salary + m.bonus > 100K {code} *Internal query after join elimination:* {code} select e.lastname, e.empno from employee e, department d where e.empno = d.mgrno and e.salary + e.bonus > 100K {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23755) [Performance] Distinct elimination
Ioana Delaney created SPARK-23755: - Summary: [Performance] Distinct elimination Key: SPARK-23755 URL: https://issues.apache.org/jira/browse/SPARK-23755 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney The Distinct requirement can be removed if it is proved that the operation produces unique output. {code} select distinct d.deptno /* PK */, e.empname from employee e, department d where e.empno = d.mgrno /*PK = FK*/ {code} *Internal query after rewrite:* {code} select d.deptno, e.empname from employee e, department d where e.empno = d.mgrno {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-23754: --- Description: Reproduce: {code:java} df = spark.range(0, 1000) from pyspark.sql.functions import udf def foo(x): raise StopIteration() df.withColumn('v', udf(foo)).show() # Results # +---+---+ # | id| v| # +---+---+ # +---+---+{code} I think the task should fail in this case was: {code:java} df = spark.range(0, 1000) from pyspark.sql.functions import udf def foo(x): raise StopIteration() df.withColumn('v', udf(foo)).show() # Results # +---+---+ # | id| v| # +---+---+ # +---+---+ {code} > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > > Reproduce: > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+{code} > I think the task should fail in this case -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23754) StopIterator exception in Python UDF results in partial result
[ https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-23754: --- Description: {code:java} df = spark.range(0, 1000) from pyspark.sql.functions import udf def foo(x): raise StopIteration() df.withColumn('v', udf(foo)).show() # Results # +---+---+ # | id| v| # +---+---+ # +---+---+ {code} > StopIterator exception in Python UDF results in partial result > -- > > Key: SPARK-23754 > URL: https://issues.apache.org/jira/browse/SPARK-23754 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > > {code:java} > df = spark.range(0, 1000) > from pyspark.sql.functions import udf > def foo(x): > raise StopIteration() > df.withColumn('v', udf(foo)).show() > # Results > # +---+---+ > # | id| v| > # +---+---+ > # +---+---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23754) StopIterator exception in Python UDF results in partial result
Li Jin created SPARK-23754: -- Summary: StopIterator exception in Python UDF results in partial result Key: SPARK-23754 URL: https://issues.apache.org/jira/browse/SPARK-23754 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Li Jin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406875#comment-16406875 ] Matthew Porter commented on SPARK-6190: --- We are experiencing similar frustrations to Brian: we have well-partitioned datasets that are simply massive in size. Every once in a while Spark fails and we spend much longer than I would like doing nothing but tweaking partition values and crossing our fingers that the next run succeeds. Again, it is very hard to explain to higher management that despite having hundreds of GB of RAM at our disposal, we are limited to 2 GB during data shuffles. Are there any plans or intentions to resolve this bug? This and https://issues.apache.org/jira/browse/SPARK-5928 have been "In Progress" for more than 3 years now with no visible progress. > create LargeByteBuffer abstraction for eliminating 2GB limit on blocks > -- > > Key: SPARK-6190 > URL: https://issues.apache.org/jira/browse/SPARK-6190 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Attachments: LargeByteBuffer_v3.pdf > > > A key component in eliminating the 2GB limit on blocks is creating a proper > abstraction for storing more than 2GB. Currently Spark is limited by a > reliance on nio ByteBuffer and netty ByteBuf, both of which are limited to > 2GB. This task will introduce the new abstraction and the relevant > implementation and utilities, without affecting the existing implementation > at all. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23737. -- Resolution: Duplicate > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23574) SinglePartition in data source V2 scan
[ https://issues.apache.org/jira/browse/SPARK-23574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23574. - Resolution: Fixed Assignee: Jose Torres Fix Version/s: 2.4.0 > SinglePartition in data source V2 scan > -- > > Key: SPARK-23574 > URL: https://issues.apache.org/jira/browse/SPARK-23574 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 2.4.0 > > > DataSourceV2ScanExec currently reports UnknownPartitioning whenever the > reader doesn't mix in SupportsReportPartitioning. It can also report > SinglePartition in the case where there's a single reader factory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23753) [Performance] Group By Push Down through Join
Ioana Delaney created SPARK-23753: - Summary: [Performance] Group By Push Down through Join Key: SPARK-23753 URL: https://issues.apache.org/jira/browse/SPARK-23753 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney *Group By push down through Join* Another transformation that benefits from RI constraints is Group By push down through joins. The transformation interchanges the order of the group-by and join operations. The benefit of pushing down a group-by is that it may reduce the number of input rows to the join. On the other hand, if the join is very selective, it might make sense to execute the group by after the join. That is why this transformation is in general applied based on cost or selectivity estimates. However, if the join is an RI join, under certain conditions, it is safe to push down group by operation below the join. An example is shown below. {code} select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, sum(ss.ss_quantity) as store_sales_quantity from store_sales ss, date_dim, customer, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and s_store_sk = ss_store_sk and d_year between 2000 and 2002 group by c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name {code} The query computes the quantities sold grouped by _customer_ and _store_ tables. The tables are in a _star schema_ join. The grouping columns are a super set of the join keys. The aggregate columns come from the fact table _store_sales_. The group by operation can be pushed down to the fact table _store_sales_ through the join with the _customer_ and _store_ tables. The join will not affect the partitions nor the aggregates computed by the pushed down group-by since every tuple in _store_sales_ will join with a tuple in _customer_ and _store_ tables. {code} select c_customer_sk, c_first_name, c_last_name, s_store_sk, s_store_name, v1.store_sales_quantity from customer, store, (select ss_customer_sk, ss_store_sk, sum(ss_quantity) as store_sales_quantity from store_sales, date_dim where d_date_sk = ss_sold_date_sk and d_year between 2000 and 2002 group by ss_customer_sk, ss_store_sk ) v1 where c_customer_sk = v1.ss_customer_sk and s_store_sk = v1.ss_store_sk {code} \\ When the query is run using a 1TB TPC-DS setup, the group by reduces the number of rows from 1.5 billion to 100 million rows and the query execution drops from about 70 secs to 30 secs, a 2x improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
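The rewritten form can also be expressed with the DataFrame API. A sketch, assuming DataFrames {{storeSales}}, {{dateDim}}, {{customer}} and {{store}} are already loaded from the corresponding TPC-DS tables; the early aggregate is only safe under the RI assumption described above:
{code:java}
// DataFrame-API sketch of the rewritten query, assuming DataFrames
// storeSales, dateDim, customer and store are already loaded from the
// corresponding TPC-DS tables. Pre-aggregating store_sales is only safe
// under the RI assumption above (every fact row joins exactly one customer
// row and one store row).
import org.apache.spark.sql.functions.sum

val preAgg = storeSales
  .join(dateDim, storeSales("ss_sold_date_sk") === dateDim("d_date_sk"))
  .filter(dateDim("d_year").between(2000, 2002))
  .groupBy(storeSales("ss_customer_sk"), storeSales("ss_store_sk"))
  .agg(sum(storeSales("ss_quantity")).as("store_sales_quantity"))

val result = preAgg
  .join(customer, preAgg("ss_customer_sk") === customer("c_customer_sk"))
  .join(store, preAgg("ss_store_sk") === store("s_store_sk"))
  .select(customer("c_customer_sk"), customer("c_first_name"), customer("c_last_name"),
    store("s_store_sk"), store("s_store_name"), preAgg("store_sales_quantity"))
{code}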
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact that the FromUTCTimestamp expression, despite its name, expects the input to be in the user's local timezone. There is some magic under the covers to make things work (mostly) as the user expects. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. 
When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018
[jira] [Commented] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406836#comment-16406836 ] Bruce Robbins commented on SPARK-23715: --- A fix to this requires some ugly hacking of the implicit casts rule and the Cast class. It doesn't seem worth the mess. New Timestamp types (with timezone awareness) might help with this issue. > from_utc_timestamp returns incorrect results for some UTC date/time values > -- > > Key: SPARK-23715 > URL: https://issues.apache.org/jira/browse/SPARK-23715 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Major > > This produces the expected answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 07:18:23| > +---+ > {noformat} > However, the equivalent UTC input (but with an explicit timezone) produces a > wrong answer: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > Additionally, the equivalent Unix time (1520921903, which is also > "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: > {noformat} > df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" > ).as("dt")).show > +---+ > | dt| > +---+ > |2018-03-13 00:18:23| > +---+ > {noformat} > These issues stem from the fact the FromUTCTimestamp expression, despite its > name, expects the input to be in the user's local timezone. There is some > magic under the covers to make things work (mostly) as the user expects. > As an example, let's say a user in Los Angeles issues the following: > {noformat} > df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" > ).as("dt")).show > {noformat} > FromUTCTimestamp gets as input a Timestamp (long) value representing > {noformat} > 2018-03-13T06:18:23-07:00 (long value 152094710300) > {noformat} > What FromUTCTimestamp needs instead is > {noformat} > 2018-03-13T06:18:23+00:00 (long value 152092190300) > {noformat} > So, it applies the local timezone's offset to the input timestamp to get the > correct value (152094710300 minus 7 hours is 152092190300). Then it > can process the value and produce the expected output. > When the user explicitly specifies a time zone, FromUTCTimestamp's > assumptions break down. The input is no longer in the local time zone. > Because of the way input data is implicitly casted, FromUTCTimestamp never > knows whether the input data had an explicit timezone. > Here are some gory details: > There is sometimes a mismatch in expectations between the (string => > timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp > expression never sees the actual input string (the cast "intercepts" the > input and converts it to a long timestamp before FromUTCTimestamp uses the > value), FromUTCTimestamp cannot reject any input value that would exercise > this mismatch in expectations. > There is a similar mismatch in expectations in the (integer => timestamp) > cast and FromUTCTimestamp. As a result, Unix time input almost always > produces incorrect output. > h3. When things work as expected for String input: > When from_utc_timestamp is passed a string time value with no time zone, > DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the > datetime string as though it's in the user's local time zone. 
Because > DateTimeUtils.stringToTimestamp is a general function, this is reasonable. > As a result, FromUTCTimestamp's input is a timestamp shifted by the local > time zone's offset. FromUTCTimestamp assumes this (or more accurately, a > utility function called by FromUTCTimestamp assumes this), so the first thing > it does is reverse-shift to get it back the correct value. Now that the long > value has been shifted back to the correct timestamp value, it can now > process it (by shifting it again based on the specified time zone). > h3. When things go wrong with String input: > When from_utc_timestamp is passed a string datetime value with an explicit > time zone, stringToTimestamp honors that timezone and ignores the local time > zone. stringToTimestamp does not shift the timestamp by the local timezone's > offset, but by the timezone specified on the datetime string. > Unfortunately, FromUTCTimestamp, which has no insight into the actual input > or the conversion, still assumes the timestamp is shifted
[jira] [Created] (SPARK-23752) [Performance] Existential Subquery to Inner Join
Ioana Delaney created SPARK-23752: - Summary: [Performance] Existential Subquery to Inner Join Key: SPARK-23752 URL: https://issues.apache.org/jira/browse/SPARK-23752 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney *+Existential Subquery to Inner Join+* Another enhancement that uses Informational Constraints is existential subquery to inner join. This rewrite converts an existential subquery to an inner join, and thus provides alternative join choices for the Optimizer based on the selectivity of the tables. An example using the TPC-DS schema is shown below. {code} select c_first_name, c_last_name, c_email_address from customer c where EXISTS (select * from store_sales, date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 4 and 4+3) {code} Spark uses a left semi-join to evaluate existential subqueries. A left semi-join will return a row from the outer table if there is at least one match in the inner. Semi-join is a commonly used technique to rewrite existential subqueries, but it has some limitations, as it imposes a certain order on the joined tables. In this case the large fact table _store_sales_ has to be on the inner side of the join. A more efficient execution can be obtained if the subquery is converted to a regular inner join, which allows the Optimizer to choose better join orders. Converting a subquery to an inner join is possible either if the subquery produces at most one row, or by introducing a _Distinct_ on the outer table’s row key to remove the duplicate rows that result from the inner join and thus enforce the semantics of the subquery. As a key for the outer table, we can use the primary key of the _customer_ table. *Internal query after rewrite:* {code} select distinct c_customer_sk /*PK */, c_first_name, c_last_name, c_email_address from customer c, store_sales, date_dim where c.c_customer_sk = ss_customer_sk and ss_sold_date_sk = d_date_sk and d_year = 2002 and d_moy between 4 and 4+3 {code} \\ *Example performance results using the 1TB TPC-DS benchmark:* \\ ||TPC-DS Query||spark-2.2||spark-2.2 w/ sub2join||Query speedup|| ||||(secs)||(secs)|| || |Q10|355|190|2x| |Q16|1394|706|2x| |Q35|462|285|1.5x| |Q69|327|173|1.5x| |Q94|603|307|2x| -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact the FromUTCTimestamp expression, despite its name, expects the input to be in the user's local timezone. There is some magic under the covers to make things work (mostly) as the user expects. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. 
When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-1
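The shifting behaviour described in the report can be reproduced outside Spark with plain java.time, which may make the failure mode easier to see. This is only a sketch of the arithmetic, not Spark's internal code; it assumes the session time zone is America/Los_Angeles, as in the example above.
{code:scala}
import java.time._

val localZone = ZoneId.of("America/Los_Angeles")  // assumed local/session time zone

// No explicit zone: the string is parsed in the local zone, so the stored instant
// is shifted by the local offset, and reverse-shifting by that offset (what
// FromUTCTimestamp effectively does) recovers the intended wall-clock time.
val noZone = LocalDateTime.parse("2018-03-13T06:18:23").atZone(localZone).toInstant
val backToLocal = noZone.atZone(localZone).toLocalDateTime   // 2018-03-13T06:18:23

// Explicit zone: the string is parsed with its own +00:00 offset, so it is not
// shifted by the local offset. Reverse-shifting by the local offset anyway lands
// seven hours early, which (after adding GMT+1) is where 00:18:23 comes from.
val withZone = OffsetDateTime.parse("2018-03-13T06:18:23+00:00").toInstant
val wrongWallClock = withZone.atZone(localZone).toLocalDateTime // 2018-03-12T23:18:23
{code}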
[jira] [Updated] (SPARK-23519) Create View Command Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago updated SPARK-23519: Component/s: SQL > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1- create and populate a hive table . I did this in a hive cli session .[ > not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view form the table. [ I did this from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact the FromUTCTimestamp expression, despite its name, expects the input to be in the user's local timezone. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. 
stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with
[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values
[ https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-23715: -- Description: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explicit timezone) produces a wrong answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} Additionally, the equivalent Unix time (1520921903, which is also "2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer: {noformat} df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 00:18:23| +---+ {noformat} These issues stem from the fact the FromUTCTimestamp, despite its name, expects the input to be in the user's local timezone. As an example, let's say a user in Los Angeles issues the following: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show {noformat} FromUTCTimestamp gets as input a Timestamp (long) value representing {noformat} 2018-03-13T06:18:23-07:00 (long value 152094710300) {noformat} What FromUTCTimestamp needs instead is {noformat} 2018-03-13T06:18:23+00:00 (long value 152092190300) {noformat} So, it applies the local timezone's offset to the input timestamp to get the correct value (152094710300 minus 7 hours is 152092190300). Then it can process the value and produce the expected output. When the user explicitly specifies a time zone, FromUTCTimestamp's assumptions break down. The input is no longer in the local time zone. Because of the way input data is implicitly casted, FromUTCTimestamp never knows whether the input data had an explicit timezone. Here are some gory details: There is sometimes a mismatch in expectations between the (string => timestamp) cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never sees the actual input string (the cast "intercepts" the input and converts it to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp cannot reject any input value that would exercise this mismatch in expectations. There is a similar mismatch in expectations in the (integer => timestamp) cast and FromUTCTimestamp. As a result, Unix time input almost always produces incorrect output. h3. When things work as expected for String input: When from_utc_timestamp is passed a string time value with no time zone, DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the datetime string as though it's in the user's local time zone. Because DateTimeUtils.stringToTimestamp is a general function, this is reasonable. As a result, FromUTCTimestamp's input is a timestamp shifted by the local time zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility function called by FromUTCTimestamp assumes this), so the first thing it does is reverse-shift to get it back the correct value. Now that the long value has been shifted back to the correct timestamp value, it can now process it (by shifting it again based on the specified time zone). h3. When things go wrong with String input: When from_utc_timestamp is passed a string datetime value with an explicit time zone, stringToTimestamp honors that timezone and ignores the local time zone. 
stringToTimestamp does not shift the timestamp by the local timezone's offset, but by the timezone specified on the datetime string. Unfortunately, FromUTCTimestamp, which has no insight into the actual input or the conversion, still assumes the timestamp is shifted by the local time zone. So it reverse-shifts the long value by the local time zone's offset, which produces a incorrect timestamp (except in the case where the input datetime string just happened to have an explicit timezone that matches the local timezone). FromUTCTimestamp then uses this incorrect value for further processing. h3. When things go wrong for Unix time input: The cast in this case simply multiplies the integer by 100. The cast does not shift the resulting timestamp by the local time zone's offset. Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the result is wrong. was: This produces the expected answer: {noformat} df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" ).as("dt")).show +---+ | dt| +---+ |2018-03-13 07:18:23| +---+ {noformat} However, the equivalent UTC input (but with an explici
[jira] [Created] (SPARK-23751) Kolmogorov-Smirnov test Python API in pyspark.ml
Joseph K. Bradley created SPARK-23751: - Summary: Kolmogorov-Smirnov test Python API in pyspark.ml Key: SPARK-23751 URL: https://issues.apache.org/jira/browse/SPARK-23751 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Joseph K. Bradley Python wrapper for the new DataFrame-based API for the KS test -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
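For reference, the Scala side of this API already exists (see SPARK-21898 below), and the Python wrapper requested here would expose the same call. A rough spark-shell sketch, with an arbitrary column name and sample data:
{code:scala}
import org.apache.spark.ml.stat.KolmogorovSmirnovTest
import spark.implicits._

// Test a numeric column against a standard normal distribution.
val data = Seq(0.1, 0.15, 0.2, 0.3, 0.25, -0.1).toDF("sample")
val result = KolmogorovSmirnovTest.test(data, "sample", "norm", 0.0, 1.0)
result.select("pValue", "statistic").show()
{code}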
[jira] [Resolved] (SPARK-21898) Feature parity for KolmogorovSmirnovTest in MLlib
[ https://issues.apache.org/jira/browse/SPARK-21898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21898. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 19108 [https://github.com/apache/spark/pull/19108] > Feature parity for KolmogorovSmirnovTest in MLlib > - > > Key: SPARK-21898 > URL: https://issues.apache.org/jira/browse/SPARK-21898 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.4.0 > > > Feature parity for KolmogorovSmirnovTest in MLlib. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21898) Feature parity for KolmogorovSmirnovTest in MLlib
[ https://issues.apache.org/jira/browse/SPARK-21898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-21898: - Assignee: Weichen Xu > Feature parity for KolmogorovSmirnovTest in MLlib > - > > Key: SPARK-21898 > URL: https://issues.apache.org/jira/browse/SPARK-21898 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.4.0 > > > Feature parity for KolmogorovSmirnovTest in MLlib. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23499) Mesos Cluster Dispatcher should support priority queues to submit drivers
[ https://issues.apache.org/jira/browse/SPARK-23499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406796#comment-16406796 ] Pascal GILLET commented on SPARK-23499: --- [~susanxhuynh] Certainly, none of the proposed solutions above can prevent a "misuse" of the dispatcher, like keeping adding drivers in the URGENT queue. To understand you well, are you in favor of implementing one of the solutions above? * The solution not related to the Mesos weights? * The solution that maps the Mesos weights to priorities? * A solution that must allow the preemption of jobs at the dispatcher level? > Mesos Cluster Dispatcher should support priority queues to submit drivers > - > > Key: SPARK-23499 > URL: https://issues.apache.org/jira/browse/SPARK-23499 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.1 >Reporter: Pascal GILLET >Priority: Major > Attachments: Screenshot from 2018-02-28 17-22-47.png > > > As for Yarn, Mesos users should be able to specify priority queues to define > a workload management policy for queued drivers in the Mesos Cluster > Dispatcher. > Submitted drivers are *currently* kept in order of their submission: the > first driver added to the queue will be the first one to be executed (FIFO). > Each driver could have a "priority" associated with it. A driver with high > priority is served (Mesos resources) before a driver with low priority. If > two drivers have the same priority, they are served according to their submit > date in the queue. > To set up such priority queues, the following changes are proposed: > * The Mesos Cluster Dispatcher can optionally be configured with the > _spark.mesos.dispatcher.queue.[QueueName]_ property. This property takes a > float as value. This adds a new queue named _QueueName_ for submitted drivers > with the specified priority. > Higher numbers indicate higher priority. > The user can then specify multiple queues. > * A driver can be submitted to a specific queue with > _spark.mesos.dispatcher.queue_. This property takes the name of a queue > previously declared in the dispatcher as value. > By default, the dispatcher has a single "default" queue with 0.0 priority > (cannot be overridden). If none of the properties above are specified, the > behavior is the same as the current one (i.e. simple FIFO). > Additionaly, it is possible to implement a consistent and overall workload > management policy throughout the lifecycle of drivers by mapping these > priority queues to weighted Mesos roles if any (i.e. from the QUEUED state in > the dispatcher to the final states in the Mesos cluster), and by specifying a > _spark.mesos.role_ along with a _spark.mesos.dispatcher.queue_ when > submitting an application. > For example, with the URGENT Mesos role: > {code:java} > # Conf on the dispatcher side > spark.mesos.dispatcher.queue.URGENT=1.0 > # Conf on the driver side > spark.mesos.dispatcher.queue=URGENT > spark.mesos.role=URGENT > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
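To make the proposed workflow concrete, a submission under this scheme might look like the sketch below. The queue and role properties are the ones proposed in this ticket (they are not existing Spark configuration keys), and the dispatcher URL and jar name are placeholders.
{code}
# Dispatcher side (proposed): declare a high-priority queue
spark.mesos.dispatcher.queue.URGENT=1.0

# Driver submission against the Mesos cluster dispatcher
spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.dispatcher.queue=URGENT \
  --conf spark.mesos.role=URGENT \
  my-app.jar
{code}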
[jira] [Created] (SPARK-23750) [Performance] Inner Join Elimination based on Informational RI constraints
Ioana Delaney created SPARK-23750: - Summary: [Performance] Inner Join Elimination based on Informational RI constraints Key: SPARK-23750 URL: https://issues.apache.org/jira/browse/SPARK-23750 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ioana Delaney +*Inner Join Elimination based on Informational RI constraints*+ This transformation detects RI joins and eliminates the parent/PK table if none of its columns, other than the PK columns, are referenced in the query. Typical examples that benefit from this rewrite are queries over complex views. *View using TPC-DS schema:* {code} create view customer_purchases_2002 (id, last, first, product, store_id, month, quantity) as select c_customer_id, c_last_name, c_first_name, i_product_name, s_store_id, d_moy, ss_quantity from store_sales, date_dim, customer, item, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and d_year = 2002 {code} The view returns customer purchases made in year 2002. It is a join between fact table _store_sales_ and dimensions _customer_, _item,_ _store_, and _date_. The tables are joined using RI predicates. If we write a query that only selects a subset of columns from the view, for example, we are only interested in the items bought and not the stores, internally, the Optimizer, will first merge the view into the query, and then, based on the _primary key – foreign key_ join predicate analysis, it will decide that the join with the _store_ table is not needed, and therefore the _store_ table is removed. *Query:* {code} select id, first, last, product, quantity from customer_purchases_2002 where product like ‘bicycle%’ and month between 1 and 2 {code} *Internal query after view expansion:* {code} select c_customer_id as id, c_first_name as first, c_last_name as last, i_product_name as product,ss_quantity as quantity from store_sales, date_dim, customer, item, store where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and d_year = 2002 and month between 1 and 2 and product like ‘bicycle%’ {code} *Internal optimized query after join elimination:* {code:java} select c_customer_id as id, c_first_name as first, c_last_name as last, i_product_name as product,ss_quantity as quantity from store_sales, date_dim, customer, item where d_date_sk = ss_sold_date_sk and c_customer_sk = ss_customer_sk and i_item_sk = ss_item_sk and d_year = 2002 and month between 1 and 2 and product like ‘bicycle%’ {code} The join with _store_ table can be removed since no columns are retrieved from the table, and every row from the _store_sales_ fact table will find a match in _store_ based on the RI relationship. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
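A quick way to observe whether the rewrite applies is to explain the query over the view. This is only a sketch: it assumes the TPC-DS tables and the customer_purchases_2002 view are registered in the session, and without the informational RI constraints proposed here the join with store will still appear in the plan.
{code:scala}
spark.sql("""
  select id, first, last, product, quantity
  from customer_purchases_2002
  where product like 'bicycle%' and month between 1 and 2""").explain(true)
// Today the optimized plan still joins `store`; with the proposed RI-based
// elimination, that join should disappear since no store columns are referenced.
{code}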
[jira] [Assigned] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23749: Assignee: (was: Apache Spark) > Avoid Hive.get() to compatible with different Hive metastore > > > Key: SPARK-23749 > URL: https://issues.apache.org/jira/browse/SPARK-23749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406724#comment-16406724 ] Apache Spark commented on SPARK-23749: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/20866 > Avoid Hive.get() to compatible with different Hive metastore > > > Key: SPARK-23749 > URL: https://issues.apache.org/jira/browse/SPARK-23749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-23749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23749: Assignee: Apache Spark > Avoid Hive.get() to compatible with different Hive metastore > > > Key: SPARK-23749 > URL: https://issues.apache.org/jira/browse/SPARK-23749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23749) Avoid Hive.get() to be compatible with different Hive metastore
Yuming Wang created SPARK-23749: --- Summary: Avoid Hive.get() to be compatible with different Hive metastore Key: SPARK-23749 URL: https://issues.apache.org/jira/browse/SPARK-23749 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23748) Support select from temp tables
Jose Torres created SPARK-23748: --- Summary: Support select from temp tables Key: SPARK-23748 URL: https://issues.apache.org/jira/browse/SPARK-23748 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres As reported in the dev list, the following currently fails: val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "join_test").option("startingOffsets", "earliest").load(); jdf.createOrReplaceTempView("table") val resultdf = spark.sql("select * from table") resultdf.writeStream.outputMode("append").format("console").option("truncate", false).trigger(Trigger.Continuous("1 second")).start() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
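For anyone trying to reproduce this, the reported snippet is shown below reformatted and with the import it needs; the broker address and topic are the placeholders from the report.
{code:scala}
import org.apache.spark.sql.streaming.Trigger

val jdf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

jdf.createOrReplaceTempView("table")

val resultdf = spark.sql("select * from table")

// Fails today when the temp view is read back under a continuous trigger.
resultdf.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .trigger(Trigger.Continuous("1 second"))
  .start()
{code}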
[jira] [Created] (SPARK-23747) Add EpochCoordinator unit tests
Jose Torres created SPARK-23747: --- Summary: Add EpochCoordinator unit tests Key: SPARK-23747 URL: https://issues.apache.org/jira/browse/SPARK-23747 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23737) Scala API documentation leads to nonexistent pages for sources
[ https://issues.apache.org/jira/browse/SPARK-23737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406591#comment-16406591 ] Alexander Bessonov commented on SPARK-23737: Oh, thanks. Linked them. > Scala API documentation leads to nonexistent pages for sources > -- > > Key: SPARK-23737 > URL: https://issues.apache.org/jira/browse/SPARK-23737 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.0 >Reporter: Alexander Bessonov >Priority: Minor > > h3. Steps to reproduce: > # Go to [Scala API > homepage|[http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package]]. > # Click "Source: package.scala" > h3. Result: > The link leads to nonexistent page: > [https://github.com/apache/spark/tree/v2.3.0/Users/sameera/dev/spark/core/src/main/scala/org/apache/spark/package.scala] > h3. Expected result: > The link leads to proper page: > [https://github.com/apache/spark/tree/v2.3.0/core/src/main/scala/org/apache/spark/package.scala] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23746) HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF
Izhar Ahmed created SPARK-23746: --- Summary: HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF Key: SPARK-23746 URL: https://issues.apache.org/jira/browse/SPARK-23746 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.6.2 Reporter: Izhar Ahmed I am trying to use a custom HashMap implementation as UserDefinedType instead of MapType in spark. The code is *working fine in spark 1.5.2* but giving {{java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 cannot be cast to org.apache.spark.sql.catalyst.util.MapData}} *exception in spark 1.6.2* The code:- {code:java} import org.apache.spark.sql.Row import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction} import org.apache.spark.sql.types._ import scala.collection.immutable.HashMap class Test extends UserDefinedAggregateFunction { def inputSchema: StructType = StructType(Array(StructField("input", StringType))) def bufferSchema = StructType(Array(StructField("top_n", CustomHashMapType))) def dataType: DataType = CustomHashMapType def deterministic = true def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = HashMap.empty[String, Long] } def update(buffer: MutableAggregationBuffer, input: Row): Unit = { val buff0 = buffer.getAs[HashMap[String, Long]](0) buffer(0) = buff0.updated("test", buff0.getOrElse("test", 0L) + 1L) } def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = { buffer1(0) = buffer1. getAs[HashMap[String, Long]](0) .merged(buffer2.getAs[HashMap[String, Long]](0))({ case ((k, v1), (_, v2)) => (k, v1 + v2) }) } def evaluate(buffer: Row): Any = { buffer(0) } } private case object CustomHashMapType extends UserDefinedType[HashMap[String, Long]] { override def sqlType: DataType = MapType(StringType, LongType) override def serialize(obj: Any): Map[String, Long] = obj.asInstanceOf[Map[String, Long]] override def deserialize(datum: Any): HashMap[String, Long] = { datum.asInstanceOf[Map[String, Long]] ++: HashMap.empty[String, Long] } override def userClass: Class[HashMap[String, Long]] = classOf[HashMap[String, Long]] } {code} The wrapper Class to run the UDAF:- {code:scala} import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} object TestJob { def main(args: Array[String]): Unit = { val conf = new SparkConf().setMaster("local[4]").setAppName("DataStatsExecution") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) import sqlContext.implicits._ val df = sc.parallelize(Seq(1,2,3,4)).toDF("col") val udaf = new Test() val outdf = df.agg(udaf(df("col"))) outdf.show } } {code} Stacktrace:- {code:java} Caused by: java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 cannot be cast to org.apache.spark.sql.catalyst.util.MapData at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getMap(rows.scala:50) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getMap(rows.scala:248) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getMap(JoinedRow.scala:115) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:345) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:344) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:154) 
at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issue
[jira] [Assigned] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23542: Assignee: Apache Spark > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Assignee: Apache Spark >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406172#comment-16406172 ] Apache Spark commented on SPARK-23542: -- User 'KaiXinXiaoLei' has created a pull request for this issue: https://github.com/apache/spark/pull/20865 > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23542: Assignee: (was: Apache Spark) > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23542) The exists action should be further optimized in logical plan
[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-23542: -- Description: The optimized logical plan of query '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*' is : {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} The `exists` action will be rewritten as semi jion. But i the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized logical plan is : {noformat} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} So i think the optimized logical plan of '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- Filter isnotnull(i#14) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] :- Filter isnotnull(i#16) +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} was: The optimized logical plan of query '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*' is : {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} The `exists` action will be rewritten as semi jion. But i the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized logical plan is : {noformat} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} So i think the optimized logical plan of '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. 
{code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > The exists action shoule be further optimized in logical plan > - > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: KaiXinXIaoLei >Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > The `exists` action will be rewritten as semi jion. But i the query of > `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*`, the optimized > logical plan is : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#14) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > :- Filter isnotnull(i#16) > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16
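The comparison above can be reproduced directly by looking at the optimized plans of the two queries. A sketch, assuming Hive tables tt1 and tt2 with an integer column i already exist in the session:
{code:scala}
val existsPlan = spark.sql(
  "select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)")
  .queryExecution.optimizedPlan

val semiJoinPlan = spark.sql(
  "select * from tt1 left semi join tt2 on tt2.i = tt1.i")
  .queryExecution.optimizedPlan

// The explicit left-semi-join form carries a Filter isnotnull(i) on tt1 that
// the EXISTS form currently lacks, which is the gap this ticket points out.
println(existsPlan)
println(semiJoinPlan)
{code}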
[jira] [Commented] (SPARK-23513) java.io.IOException: Expected 12 fields, but got 5 for row: Spark submit error
[ https://issues.apache.org/jira/browse/SPARK-23513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406164#comment-16406164 ] Narsireddy AVula commented on SPARK-23513: -- Seems provided information is not sufficient to proceed to validate > java.io.IOException: Expected 12 fields, but got 5 for row :Spark submit > error > --- > > Key: SPARK-23513 > URL: https://issues.apache.org/jira/browse/SPARK-23513 > Project: Spark > Issue Type: Bug > Components: EC2, Examples, Input/Output, Java API >Affects Versions: 1.4.0, 2.2.0 >Reporter: Rawia >Priority: Blocker > > Hello > I'm trying to run a spark application (distributedWekaSpark) but when I'm > using the spark-submit command I get this error > {quote}{quote}ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.io.IOException: Expected 12 fields, but got 5 for row: > outlook,temperature,humidity,windy,play > {quote}{quote} > I tried with other datasets but always the same error appeared, (always 12 > fields expected) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16745) Spark job completed however have to wait for 13 mins (data size is small)
[ https://issues.apache.org/jira/browse/SPARK-16745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406153#comment-16406153 ] Sujit Kumar Mahapatra commented on SPARK-16745: --- +1. Getting similar issue with standalone spark on Mac for spark 2.2.1 > Spark job completed however have to wait for 13 mins (data size is small) > - > > Key: SPARK-16745 > URL: https://issues.apache.org/jira/browse/SPARK-16745 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.1 > Environment: Max OS X Yosemite, Terminal, MacBook Air Late 2014 >Reporter: Joe Chong >Priority: Minor > > I submitted a job in scala spark shell to show a DataFrame. The data size is > about 43K. The job was successful in the end, but took more than 13 minutes > to resolve. Upon checking the log, there's multiple exception raised on > "Failed to check existence of class" with a java.net.connectionexpcetion > message indicating timeout trying to connect to the port 52067, the repl port > that Spark setup. Please assist to troubleshoot. Thanks. > Started Spark in standalone mode > $ spark-shell --driver-memory 5g --master local[*] > 16/07/26 21:05:29 WARN util.NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:30 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:30 INFO spark.HttpServer: Starting HTTP Server > 16/07/26 21:05:30 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:30 INFO server.AbstractConnector: Started > SocketConnector@0.0.0.0:52067 > 16/07/26 21:05:30 INFO util.Utils: Successfully started service 'HTTP class > server' on port 52067. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.1 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66) > Type in expressions to have them evaluated. > Type :help for more information. > 16/07/26 21:05:34 INFO spark.SparkContext: Running Spark version 1.6.1 > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing view acls to: joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: Changing modify acls to: > joechong > 16/07/26 21:05:34 INFO spark.SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(joechong); users > with modify permissions: Set(joechong) > 16/07/26 21:05:35 INFO util.Utils: Successfully started service 'sparkDriver' > on port 52072. > 16/07/26 21:05:35 INFO slf4j.Slf4jLogger: Slf4jLogger started > 16/07/26 21:05:35 INFO Remoting: Starting remoting > 16/07/26 21:05:35 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.199.29.218:52074] > 16/07/26 21:05:35 INFO util.Utils: Successfully started service > 'sparkDriverActorSystem' on port 52074. 
> 16/07/26 21:05:35 INFO spark.SparkEnv: Registering MapOutputTracker > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering BlockManagerMaster > 16/07/26 21:05:35 INFO storage.DiskBlockManager: Created local directory at > /private/var/folders/r7/bs2f87nj6lnd5vm51lvxcw68gn/T/blockmgr-cd542a27-6ff1-4f51-a72b-78654142fdb6 > 16/07/26 21:05:35 INFO storage.MemoryStore: MemoryStore started with capacity > 3.4 GB > 16/07/26 21:05:35 INFO spark.SparkEnv: Registering OutputCommitCoordinator > 16/07/26 21:05:36 INFO server.Server: jetty-8.y.z-SNAPSHOT > 16/07/26 21:05:36 INFO server.AbstractConnector: Started > SelectChannelConnector@0.0.0.0:4040 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service 'SparkUI' on > port 4040. > 16/07/26 21:05:36 INFO ui.SparkUI: Started SparkUI at > http://10.199.29.218:4040 > 16/07/26 21:05:36 INFO executor.Executor: Starting executor ID driver on host > localhost > 16/07/26 21:05:36 INFO executor.Executor: Using REPL class URI: > http://10.199.29.218:52067 > 16/07/26 21:05:36 INFO util.Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52075. > 16/07/26 21:05:36 INFO netty.NettyBlockTransferService: Server created on > 52075 > 16/07/26 21:05:36 INFO storage.BlockManagerMaster: Trying to register > BlockManager > 16/07/26 21:05:36 INFO storage.BlockManagerMasterEndpoint: Registering block > manager localhost:52075 with 3.4 GB RAM, BlockManagerId(driver, localhost, > 52075) > 16/07/26 21:05:36 INF
[jira] [Commented] (SPARK-16872) Include Gaussian Naive Bayes Classifier
[ https://issues.apache.org/jira/browse/SPARK-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406128#comment-16406128 ] zhengruifeng commented on SPARK-16872: -- I think both 1) a new GNB estimator and 2) current NB includes Gaussian are OK. [~mlnick] [~josephkb] [~yanboliang] What are your thoughts? It has been a long time since my first PR, and I really hope to finish it in following months. Could you help shepherding this ? > Include Gaussian Naive Bayes Classifier > --- > > Key: SPARK-16872 > URL: https://issues.apache.org/jira/browse/SPARK-16872 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > I implemented Gaussian NB according to scikit-learn's {{GaussianNB}}. > In GaussianNB model, the {{theta}} matrix is used to store means and there is > a extra {{sigma}} matrix storing the variance of each feature. > GaussianNB in spark > {code} > scala> import org.apache.spark.ml.classification.GaussianNaiveBayes > import org.apache.spark.ml.classification.GaussianNaiveBayes > scala> val path = > "/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt" > path: String = > /Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt > scala> val data = spark.read.format("libsvm").load(path).persist() > data: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: > double, features: vector] > scala> val gnb = new GaussianNaiveBayes() > gnb: org.apache.spark.ml.classification.GaussianNaiveBayes = gnb_54c50467306c > scala> val model = gnb.fit(data) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training: numPartitions=1 > storageLevel=StorageLevel(1 replicas) > 17/01/03 14:25:48 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numFeatures":4} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: {"numClasses":3} > 17/01/03 14:25:49 INFO Instrumentation: > GaussianNaiveBayes-gnb_54c50467306c-720112035-1: training finished > model: org.apache.spark.ml.classification.GaussianNaiveBayesModel = > GaussianNaiveBayesModel (uid=gnb_54c50467306c) with 3 classes > scala> model.pi > res0: org.apache.spark.ml.linalg.Vector = > [-1.0986122886681098,-1.0986122886681098,-1.0986122886681098] > scala> model.pi.toArray.map(math.exp) > res1: Array[Double] = Array(0., 0., > 0.) 
> scala> model.theta > res2: org.apache.spark.ml.linalg.Matrix = > 0.270067018001 -0.188540006 0.543050720001 0.60546 > -0.60779998 0.18172 -0.842711740006 > -0.88139998 > -0.091425964 -0.35858001 0.105084738 > 0.021666701507102017 > scala> model.sigma > res3: org.apache.spark.ml.linalg.Matrix = > 0.1223012510889361 0.07078051983960698 0.0343595243976 > 0.051336071297393815 > 0.03758145300924998 0.09880280046403413 0.003390296940069426 > 0.007822241779598893 > 0.08058763609659315 0.06701386661293329 0.024866409227781675 > 0.02661391644759426 > scala> model.transform(data).select("probability").take(10) > [rdd_68_0] > res4: Array[org.apache.spark.sql.Row] = > Array([[1.0627410543476422E-21,0.9938,6.2765233965353945E-15]], > [[7.254521422345374E-26,1.0,1.3849442153180895E-18]], > [[1.9629244119173135E-24,0.9998,1.9424765181237926E-16]], > [[6.061218297948492E-22,0.9902,9.853216073401884E-15]], > [[0.9972225671942837,8.844241161578932E-165,0.002777432805716399]], > [[5.361683970373604E-26,1.0,2.3004604508982183E-18]], > [[0.01062850630038623,3.3102617689978775E-100,0.9893714936996136]], > [[1.9297314618271785E-4,2.124922209137708E-71,0.9998070268538172]], > [[3.118816393732361E-27,1.0,6.5310299615983584E-21]], > [[0.926009854522,8.734773657627494E-206,7.399014547943611E-6]]) > scala> model.transform(data).select("prediction").take(10) > [rdd_68_0] > res5: Array[org.apache.spark.sql.Row] = Array([1.0], [1.0], [1.0], [1.0], > [0.0], [1.0], [2.0], [2.0], [1.0], [0.0]) > {code} > GaussianNB in scikit-learn > {code} > import numpy as np > from sklearn.naive_bayes import GaussianNB > from sklearn.datasets import load_svmlight_file > path = > '/Users/zrf/.dev/spark-2.1.0-bin-hadoop2.7/data/mllib/sample_multiclass_classification_data.txt' > X, y = load_svmlight_file(path) > X = X.toarray() > clf = GaussianNB() > clf.fit(X, y) > >>> clf.class_prior_ > array([ 0., 0., 0.]) > >>> clf.theta_ > array([[ 0.2701, -0.1885, 0.54305072,
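As background for the theta/sigma matrices shown above, the per-class score of a Gaussian naive Bayes model is just the log prior plus a sum of per-feature Gaussian log-densities. The sketch below illustrates only that formula; it is not the implementation proposed in the PR.
{code:scala}
// Raw score for one class: log(prior) + sum_i log N(x_i | mean_i, var_i).
def gaussianClassScore(x: Array[Double],
                       logPrior: Double,
                       means: Array[Double],
                       variances: Array[Double]): Double = {
  val logDensity = x.indices.map { i =>
    val diff = x(i) - means(i)
    -0.5 * math.log(2.0 * math.Pi * variances(i)) - diff * diff / (2.0 * variances(i))
  }.sum
  logPrior + logDensity
}

// Prediction: argmax over classes k of gaussianClassScore(x, pi(k), theta row k, sigma row k).
{code}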
[jira] [Assigned] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23745: Assignee: Apache Spark > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Assignee: Apache Spark >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > When HiveThriftServer2 is started, we create some directories for > hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not > remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23745: Assignee: (was: Apache Spark) > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > When HiveThriftServer2 is started, we create some directories for > hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not > remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16406050#comment-16406050 ] Apache Spark commented on SPARK-23745: -- User 'zuotingbing' has created a pull request for this issue: https://github.com/apache/spark/pull/20864 > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > When HiveThriftServer2 is started, we create some directories for > hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not > remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Description: !2018-03-20_164832.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories. The directories could accumulate a lot. was: !2018-03-20_164832.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories. The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Description: !2018-03-20_164832.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. was: when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !2018-03-20_164832.png! > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories.The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Description: when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. was: !image-2018-03-20-16-49-00-175.png! when start the HiveThriftServer2, we create some directories for hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not remove these directories.The directories could accumulate a lot. > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories.The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
[ https://issues.apache.org/jira/browse/SPARK-23745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-23745: Attachment: 2018-03-20_164832.png > Remove the directories of the “hive.downloaded.resources.dir” when > HiveThriftServer2 stopped > > > Key: SPARK-23745 > URL: https://issues.apache.org/jira/browse/SPARK-23745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: linux >Reporter: zuotingbing >Priority: Major > Attachments: 2018-03-20_164832.png > > > !image-2018-03-20-16-49-00-175.png! > when start the HiveThriftServer2, we create some directories for > hive.downloaded.resources.dir, but when stop the HiveThriftServer2 we do not > remove these directories.The directories could accumulate a lot. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23745) Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped
zuotingbing created SPARK-23745: --- Summary: Remove the directories of the “hive.downloaded.resources.dir” when HiveThriftServer2 stopped Key: SPARK-23745 URL: https://issues.apache.org/jira/browse/SPARK-23745 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: linux Reporter: zuotingbing !image-2018-03-20-16-49-00-175.png! When HiveThriftServer2 is started, we create some directories for hive.downloaded.resources.dir, but when HiveThriftServer2 is stopped we do not remove these directories, so they can accumulate over time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
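The fix requested here follows a standard pattern: any scratch directory created for {{hive.downloaded.resources.dir}} at startup should be registered for removal at shutdown. The pull request linked above presumably does this on the JVM side inside the server code; as a hedged, language-neutral illustration of the same create-then-register-cleanup pattern (the names and the {{atexit}} mechanism are assumptions of this sketch, not Spark code):

{code:python}
import atexit
import shutil
import tempfile

def create_resources_dir(prefix="hive_resources_"):
    """Create a scratch directory and make sure it is removed at shutdown."""
    path = tempfile.mkdtemp(prefix=prefix)
    # Register cleanup so a normal stop does not leave the directory behind.
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path

if __name__ == "__main__":
    resources_dir = create_resources_dir()
    print("downloaded resources go to", resources_dir)
    # ... server runs, downloads resources into resources_dir ...
    # On interpreter exit, the registered atexit hook removes resources_dir.
{code}

On the JVM the equivalent is a shutdown hook, or cleanup in the server's stop path, that recursively deletes the directories it created.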
[jira] [Updated] (SPARK-23691) Use sql_conf util in PySpark tests where possible
[ https://issues.apache.org/jira/browse/SPARK-23691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23691: - Fix Version/s: 2.3.1 > Use sql_conf util in PySpark tests where possible > - > > Key: SPARK-23691 > URL: https://issues.apache.org/jira/browse/SPARK-23691 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > https://github.com/apache/spark/commit/d6632d185e147fcbe6724545488ad80dce20277e > added a useful util > {code} > @contextmanager > def sql_conf(self, pairs): > ... > {code} > to allow configurations to be set and then unset within a block: > {code} > with self.sql_conf({"spark.blah.blah.blah": "blah"}): > # test code > {code} > It would be nicer if we used it where possible. > Note that there already appear to be a few places in the unittest classes that > change configurations for tests without restoring the original values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
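The real helper lives in the commit linked above as a method on the PySpark test base class; the following is only a rough sketch of the pattern it describes, written as a free function against a local {{SparkSession}} (the session setup and the unset-sentinel are assumptions of the sketch):

{code:python}
from contextlib import contextmanager

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("sql-conf-demo").getOrCreate()

@contextmanager
def sql_conf(pairs):
    """Temporarily set SQL confs for a block, then restore the previous values."""
    keys = list(pairs.keys())
    # Remember what each conf was before entering the block ("<undefined>" if unset).
    old_values = [spark.conf.get(key, "<undefined>") for key in keys]
    try:
        for key, value in pairs.items():
            spark.conf.set(key, value)
        yield
    finally:
        for key, old in zip(keys, old_values):
            if old == "<undefined>":
                spark.conf.unset(key)
            else:
                spark.conf.set(key, old)

# Usage, mirroring the snippet in the description:
with sql_conf({"spark.sql.shuffle.partitions": "4"}):
    assert spark.conf.get("spark.sql.shuffle.partitions") == "4"
{code}

Because the restore happens in a {{finally}} block, a failing test cannot leak a modified conf into later tests, which is exactly the problem the last note in the description points at.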
[jira] [Comment Edited] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard
[ https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405962#comment-16405962 ] Alex Ott edited comment on SPARK-20964 at 3/20/18 8:36 AM: --- Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: {{SELECT state, count( * ) FROM user_addresses *where* group by state;}} In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression was (Author: alexott): Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: {{SELECT state, count(*) FROM user_addresses *where* group by state;}} In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression > Make some keywords reserved along with the ANSI/SQL standard > > > Key: SPARK-20964 > URL: https://issues.apache.org/jira/browse/SPARK-20964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current Spark has many non-reserved words that are essentially reserved > in the ANSI/SQL standard > (http://developer.mimer.se/validator/sql-reserved-words.tml). > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709 > This is because there are many datasources (for instance twitter4j) that > unfortunately use reserved keywords for column names (See [~hvanhovell]'s > comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). > We might fix this issue in future major releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard
[ https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405962#comment-16405962 ] Alex Ott edited comment on SPARK-20964 at 3/20/18 8:35 AM: --- Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: {{SELECT state, count(*) FROM user_addresses *where* group by state;}} In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression was (Author: alexott): Just want to add another example of query that is rejected by sqlite, but works fine with Spark SQL: SELECT state, count(*) FROM user_addresses *where* group by state; In this case the *where* keyword is treated as table alias - although it matches to SQL specification, it really just hides the error that I did in this query by forgetting to add condition expression > Make some keywords reserved along with the ANSI/SQL standard > > > Key: SPARK-20964 > URL: https://issues.apache.org/jira/browse/SPARK-20964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current Spark has many non-reserved words that are essentially reserved > in the ANSI/SQL standard > (http://developer.mimer.se/validator/sql-reserved-words.tml). > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709 > This is because there are many datasources (for instance twitter4j) that > unfortunately use reserved keywords for column names (See [~hvanhovell]'s > comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). > We might fix this issue in future major releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard
[ https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405962#comment-16405962 ] Alex Ott commented on SPARK-20964: -- Just want to add another example of a query that is rejected by SQLite but works fine with Spark SQL: SELECT state, count(*) FROM user_addresses *where* group by state; In this case the *where* keyword is treated as a table alias. Although this matches the SQL specification, it just hides the mistake I made in this query by forgetting to add a condition expression. > Make some keywords reserved along with the ANSI/SQL standard > > > Key: SPARK-20964 > URL: https://issues.apache.org/jira/browse/SPARK-20964 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current Spark has many non-reserved words that are essentially reserved > in the ANSI/SQL standard > (http://developer.mimer.se/validator/sql-reserved-words.tml). > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709 > This is because there are many datasources (for instance twitter4j) that > unfortunately use reserved keywords for column names (See [~hvanhovell]'s > comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). > We might fix this issue in future major releases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
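A minimal PySpark sketch of the behaviour the comment describes (assumptions: a local session and a tiny stand-in for {{user_addresses}}; whether the query still parses this way depends on the Spark version and its ANSI/reserved-keyword settings):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("reserved-keyword-demo").getOrCreate()

# A small stand-in for the user_addresses table from the comment.
spark.createDataFrame(
    [("CA", "1 Main St"), ("CA", "2 Oak Ave"), ("NY", "3 Elm St")],
    ["state", "address"],
).createOrReplaceTempView("user_addresses")

# The WHERE clause has no condition, yet the statement may still parse:
# `where` is then read as a table alias rather than rejected as a keyword.
df = spark.sql("SELECT state, count(*) FROM user_addresses where GROUP BY state")
df.show()

# The parsed/analyzed plans make the silent aliasing visible.
df.explain(True)
{code}

Against SQLite, or any engine that reserves WHERE, the same statement fails to parse, which is the stricter behaviour this ticket proposes.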