[jira] [Created] (SPARK-29326) ANSI store assignment policy: throw exception on insertion failure

2019-10-02 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-29326:
--

 Summary: ANSI store assignment policy: throw exception on 
insertion failure
 Key: SPARK-29326
 URL: https://issues.apache.org/jira/browse/SPARK-29326
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


As per the ANSI SQL standard, the ANSI store assignment policy should throw an 
exception on insertion failure, such as inserting an out-of-range value into a 
numeric field.
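
For illustration, a minimal sketch of the expected behaviour in a spark-shell 
session, assuming the {{spark.sql.storeAssignmentPolicy}} configuration from the 
parent task (the table and values are just examples):

{code:scala}
// With the ANSI store assignment policy, writing a value that does not fit the
// target column type should fail at runtime instead of silently storing null
// or an overflowed value.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE t (i INT) USING parquet")
// 2147483648 does not fit into INT, so this insert is expected to throw.
spark.sql("INSERT INTO t VALUES (2147483648L)")
{code}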








[jira] [Created] (SPARK-29327) Support specifying features via multiple columns in Predictor and PredictionModel

2019-10-02 Thread Liangcai Li (Jira)
Liangcai Li created SPARK-29327:
---

 Summary: Support specifying features via multiple columns in 
Predictor and PredictionModel
 Key: SPARK-29327
 URL: https://issues.apache.org/jira/browse/SPARK-29327
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 3.0.0
Reporter: Liangcai Li


A classification/regression task usually involves more than one feature, but the 
current API for specifying feature columns in Spark MLlib's Predictor supports 
only a single column. This requires users to assemble the multiple feature 
columns into an "org.apache.spark.ml.linalg.Vector" before fitting a Spark ML 
pipeline.

This improvement lets users specify the feature columns directly, without 
vectorization. To support this, we can introduce two new APIs in both 
"Predictor" and "PredictionModel", and a new parameter named "featuresCols" that 
stores the feature column names as an Array. (A PR is ready here: 
[https://github.com/apache/spark/pull/25983])
*APIs:*
{{def setFeaturesCol(value: Array[String]): M = ...}}
{{protected def isSupportMultiColumnsForFeatures: Boolean = false}}
*Parameter:*
{{final val featuresCols: StringArrayParam = new StringArrayParam(this, 
"featuresCols",   ...)}}

ML implementations can then read the feature column names from the new 
"featuresCols" parameter and consume the raw feature data directly from the 
separate columns of the dataset.
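
For context, a sketch of the assembly boilerplate that this proposal removes; 
the estimator and column names below are illustrative, and only the pipeline is 
constructed:

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Today: raw feature columns must first be assembled into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "income", "score"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(assembler, rf))

// With the proposed API the assembler stage could be dropped and the raw columns
// passed directly, e.g. rf.setFeaturesCol(Array("age", "income", "score")).
{code}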






[jira] [Created] (SPARK-29328) Incorrect calculation of mean seconds per month

2019-10-02 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29328:
--

 Summary: Incorrect calculation of mean seconds per month
 Key: SPARK-29328
 URL: https://issues.apache.org/jira/browse/SPARK-29328
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Maxim Gekk


The existing implementation assumes 31 days per month, i.e. 372 days per year, 
which is far from the correct number. Spark uses the proleptic Gregorian calendar 
by default (SPARK-26651), in which the average year is 365.2425 days long: 
https://en.wikipedia.org/wiki/Gregorian_calendar. The calculation needs to be 
fixed in at least 3 places (a corrected constant is sketched after this list):
- GroupStateImpl.scala:167:val millisPerMonth = 
TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- EventTimeWatermark.scala:32:val millisPerMonth = 
TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)
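
For reference, a sketch of the corrected constant derived from the 365.2425-day 
average Gregorian year (an illustration, not the eventual patch):

{code:scala}
// Mean Gregorian month, derived from the 365.2425-day average year.
val secondsPerDay   = 86400L
val daysPerMonth    = 365.2425 / 12                          // 30.436875 days
val secondsPerMonth = (daysPerMonth * secondsPerDay).toLong  // 2629746 seconds

// For comparison, the current assumption of 31 days per month:
val currentSecondsPerMonth = 31L * secondsPerDay             // 2678400 seconds, ~1.85% too long
{code}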






[jira] [Created] (SPARK-29329) maven incremental builds not working

2019-10-02 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-29329:
-

 Summary: maven incremental builds not working
 Key: SPARK-29329
 URL: https://issues.apache.org/jira/browse/SPARK-29329
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: Thomas Graves


It looks like since we upgraded scala-maven-plugin to 4.2.0 
(https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental builds 
stopped working. Every time you build, it recompiles all files, which takes 
forever.

It would be nice to fix this.






[jira] [Commented] (SPARK-28759) Upgrade scala-maven-plugin to 4.2.0

2019-10-02 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942787#comment-16942787
 ] 

Thomas Graves commented on SPARK-28759:
---

I rolled back this commit and the incremental compile now works.  Without 
incremental compiles the build takes forever so I'm against disabling it.  I 
filed https://issues.apache.org/jira/browse/SPARK-29329 for us to look at.

> Upgrade scala-maven-plugin to 4.2.0
> ---
>
> Key: SPARK-28759
> URL: https://issues.apache.org/jira/browse/SPARK-28759
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-29329) maven incremental builds not working

2019-10-02 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-29329:
--
Description: 
It looks like since we upgraded scala-maven-plugin to 4.2.0 
(https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental builds 
stopped working. Every time you build, it recompiles all files, which takes 
forever.

It would be nice to fix this.

To reproduce, just build Spark once (I happened to be using the command below):

build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl 
-Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests

Then build it again and you will see that it compiles all the files and takes 
15-30 minutes. With incremental compilation it skips all unnecessary files and 
takes closer to 5 minutes.

  was:
It looks like since we upgraded scala-maven-plugin to 4.2.0 
(https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental builds 
stopped working. Every time you build, it recompiles all files, which takes 
forever.

It would be nice to fix this.


> maven incremental builds not working
> 
>
> Key: SPARK-29329
> URL: https://issues.apache.org/jira/browse/SPARK-29329
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like since we upgraded scala-maven-plugin to 4.2.0 
> (https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental builds 
> stopped working. Every time you build, it recompiles all files, which takes 
> forever.
> It would be nice to fix this.
>  
> To reproduce, just build Spark once (I happened to be using the command below):
> build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl 
> -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests
> Then build it again and you will see that it compiles all the files and takes 
> 15-30 minutes. With incremental compilation it skips all unnecessary files and 
> takes closer to 5 minutes.






[jira] [Commented] (SPARK-29329) maven incremental builds not working

2019-10-02 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942790#comment-16942790
 ] 

Thomas Graves commented on SPARK-29329:
---

There are a few comments on SPARK-28759 regarding this; see:

https://issues.apache.org/jira/browse/SPARK-28759?focusedCommentId=16942407&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16942407

> maven incremental builds not working
> 
>
> Key: SPARK-29329
> URL: https://issues.apache.org/jira/browse/SPARK-29329
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like since we upgraded scala-maven-plugin to 4.2.0 
> (https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental builds 
> stopped working. Every time you build, it recompiles all files, which takes 
> forever.
> It would be nice to fix this.
>  
> To reproduce, just build Spark once (I happened to be using the command below):
> build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl 
> -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests
> Then build it again and you will see that it compiles all the files and takes 
> 15-30 minutes. With incremental compilation it skips all unnecessary files and 
> takes closer to 5 minutes.






[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-10-02 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942809#comment-16942809
 ] 

Maciej Szymkiewicz commented on SPARK-29212:


[~podongfeng] It sounds about right. I will also argue that, conditional on 1., 
we should remove the Java-specific mixins if they don't serve any practical 
value (provide no implementation whatsoever, like {{JavaPredictorParams}}, or 
have no JVM-wrapper-specific implementation, like {{JavaPredictor}}).

As for the second point, there is an additional consideration here - some 
{{Java*}} classes are considered part of the public API, and this should stay 
as is (these provide crucial information to the end user). However, the deeper 
we go, the less useful they are (once again, conditional on 1.).

On a side note, the current approach to the ML API requires a lot of boilerplate 
code. Lately I've been playing with [some 
ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611] that 
wouldn't require code generation - they have some caveats, but maybe there is 
something there.

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> copied from [https://github.com/apache/spark/pull/25776.]
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am a Python user and want to implement a pure Python ML classifier 
> without providing a JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model), but it is hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
>  
> Maciej's *Proposal*:
> {code:java}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
> def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> class Classifier(Predictor, ClassifierParams):
> def setRawPredictionCol(self, value): ...
> class PredictionModel(Model,PredictorParams):
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> def numFeatures(self): ...
> def predict(self, value): ...
> and JVM interop should extend from this hierarchy, i.e.
> class JavaPredictionModel(PredictionModel): ...
> In other words it should be consistent with existing approach, where we have 
> ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and 
> Java* va

[jira] [Comment Edited] (SPARK-29212) Add common classes without using JVM backend

2019-10-02 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942809#comment-16942809
 ] 

Maciej Szymkiewicz edited comment on SPARK-29212 at 10/2/19 1:41 PM:
-

[~podongfeng] It sounds about right. I will also argue that, conditional on 1., 
we should remove the Java-specific mixins if they don't serve any practical 
value (provide no implementation whatsoever or don't extend other {{Java*}} 
mixins, like {{JavaPredictorParams}}, or have no JVM-wrapper-specific 
implementation, like {{JavaPredictor}}).

As for the second point, there is an additional consideration here - some 
{{Java*}} classes are considered part of the public API, and this should stay 
as is (these provide crucial information to the end user). However, the deeper 
we go, the less useful they are (once again, conditional on 1.).

On a side note, the current approach to the ML API requires a lot of boilerplate 
code. Lately I've been playing with [some 
ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611] that 
wouldn't require code generation - they have some caveats, but maybe there is 
something there.


was (Author: zero323):
[~podongfeng] It sounds about right. I will also argue that, conditional on 1., 
we should remove the Java-specific mixins if they don't serve any practical 
value (provide no implementation whatsoever, like {{JavaPredictorParams}}, or 
have no JVM-wrapper-specific implementation, like {{JavaPredictor}}).

As for the second point, there is an additional consideration here - some 
{{Java*}} classes are considered part of the public API, and this should stay 
as is (these provide crucial information to the end user). However, the deeper 
we go, the less useful they are (once again, conditional on 1.).

On a side note, the current approach to the ML API requires a lot of boilerplate 
code. Lately I've been playing with [some 
ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611] that 
wouldn't require code generation - they have some caveats, but maybe there is 
something there.

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> copied from [https://github.com/apache/spark/pull/25776.]
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am a Python user and want to implement a pure Python ML classifier 
> without providing a JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model), but it is hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Prot

[jira] [Assigned] (SPARK-28970) implement USE CATALOG/NAMESPACE for Data Source V2

2019-10-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28970:
---

Assignee: Terry Kim

> implement USE CATALOG/NAMESPACE for Data Source V2
> --
>
> Key: SPARK-28970
> URL: https://issues.apache.org/jira/browse/SPARK-28970
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
>
> Currently Spark has a `USE abc` command to switch the current database.
> We should have something similar for Data Source V2, to switch the current 
> catalog and/or current namespace.
> We can introduce 2 new commands: `USE CATALOG abc` and `USE NAMESPACE abc`






[jira] [Resolved] (SPARK-28970) implement USE CATALOG/NAMESPACE for Data Source V2

2019-10-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28970.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25771
[https://github.com/apache/spark/pull/25771]

> implement USE CATALOG/NAMESPACE for Data Source V2
> --
>
> Key: SPARK-28970
> URL: https://issues.apache.org/jira/browse/SPARK-28970
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently Spark has a `USE abc` command to switch the current database.
> We should have something similar for Data Source V2, to switch the current 
> catalog and/or current namespace.
> We can introduce 2 new commands: `USE CATALOG abc` and `USE NAMESPACE abc`






[jira] [Created] (SPARK-29330) Allow users to choose the name of Spark Shuffle service

2019-10-02 Thread Alexander Bessonov (Jira)
Alexander Bessonov created SPARK-29330:
--

 Summary: Allow users to choose the name of Spark Shuffle service
 Key: SPARK-29330
 URL: https://issues.apache.org/jira/browse/SPARK-29330
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 2.4.4
Reporter: Alexander Bessonov


As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the 
Shuffle Service.

HDP distribution of Spark, on the other hand, uses [{{spark2_shuffle}}|#L117]]. 
This is done to be able to run both Spark 1.6 and Spark 2.x on the same Hadoop 
cluster.

Running vanilla Spark on an HDP cluster with only the Spark 2.x shuffle service 
(HDP flavor) running becomes impossible due to the shuffle service name mismatch.






[jira] [Updated] (SPARK-29330) Allow users to choose the name of Spark Shuffle service

2019-10-02 Thread Alexander Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Bessonov updated SPARK-29330:
---
Description: 
As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the 
Shuffle Service.

HDP distribution of Spark, on the other hand, uses 
[{{spark2_shuffle}}|https://github.com/hortonworks/spark2-release/blob/HDP-3.1.0.0-78-tag/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L117].
 This is done to be able to run both Spark 1.6 and Spark 2.x on the same Hadoop 
cluster.

Running vanilla Spark on an HDP cluster with only the Spark 2.x shuffle service 
(HDP flavor) running becomes impossible due to the shuffle service name mismatch.

  was:
As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the 
Shuffle Service.

HDP distribution of Spark, on the other hand, uses [{{spark2_shuffle}}|#L117]]. 
This is done to be able to run both Spark 1.6 and Spark 2.x on the same Hadoop 
cluster.

Running vanilla Spark on an HDP cluster with only the Spark 2.x shuffle service 
(HDP flavor) running becomes impossible due to the shuffle service name mismatch.


> Allow users to choose the name of Spark Shuffle service
> --
>
> Key: SPARK-29330
> URL: https://issues.apache.org/jira/browse/SPARK-29330
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.4.4
>Reporter: Alexander Bessonov
>Priority: Minor
>
> As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the 
> Shuffle Service.
> HDP distribution of Spark, on the other hand, uses 
> [{{spark2_shuffle}}|https://github.com/hortonworks/spark2-release/blob/HDP-3.1.0.0-78-tag/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L117].
>  This is done to be able to run both Spark 1.6 and Spark 2.x on the same 
> Hadoop cluster.
> Running vanilla Spark on an HDP cluster with only the Spark 2.x shuffle service 
> (HDP flavor) running becomes impossible due to the shuffle service name mismatch.






[jira] [Created] (SPARK-29331) create DS v2 Write at physical plan

2019-10-02 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-29331:
---

 Summary: create DS v2 Write at physical plan
 Key: SPARK-29331
 URL: https://issues.apache.org/jira/browse/SPARK-29331
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2019-10-02 Thread Simeon H.K. Fitch (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942906#comment-16942906
 ] 

Simeon H.K. Fitch commented on SPARK-13802:
---

Is there a workaround to this problem? Ordering is important when encoders are 
used to reify structs into Scala types, and not being able to specify the order 
(without a lot of boilerplate schema work) results in Exceptions.

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>Priority: Major
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}






[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method

2019-10-02 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942912#comment-16942912
 ] 

Maciej Szymkiewicz commented on SPARK-13802:


[~metasim] namedtuples are the simplest and the most efficient replacement 
(https://stackoverflow.com/a/49949762).

> Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
> -
>
> Key: SPARK-13802
> URL: https://issues.apache.org/jira/browse/SPARK-13802
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Szymon Matejczyk
>Priority: Major
>
> When using Row constructor from kwargs, fields in the tuple underneath are 
> sorted by name. When Schema is reading the row, it is not using the fields in 
> this order.
> {code}
> from pyspark.sql import Row
> from pyspark.sql.types import *
> schema = StructType([
> StructField("id", StringType()),
> StructField("first_name", StringType())])
> row = Row(id="39", first_name="Szymon")
> schema.toInternal(row)
> Out[5]: ('Szymon', '39')
> {code}
> {code}
> df = sqlContext.createDataFrame([row], schema)
> df.show(1)
> +------+----------+
> |    id|first_name|
> +------+----------+
> |Szymon|        39|
> +------+----------+
> {code}






[jira] [Resolved] (SPARK-29273) Spark peakExecutionMemory metrics is zero

2019-10-02 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29273.

Fix Version/s: 3.0.0
 Assignee: huangweiyi
   Resolution: Fixed

> Spark peakExecutionMemory metrics is zero
> -
>
> Key: SPARK-29273
> URL: https://issues.apache.org/jira/browse/SPARK-29273
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Assignee: huangweiyi
>Priority: Major
> Fix For: 3.0.0
>
>
> With Spark 2.4.3 in our production environment, I want to get the 
> peakExecutionMemory which is exposed by TaskMetrics, but I always get a zero 
> value.






[jira] [Created] (SPARK-29332) Upgrade zstd-jni library to 1.4.3

2019-10-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29332:
-

 Summary: Upgrade zstd-jni library to 1.4.3
 Key: SPARK-29332
 URL: https://issues.apache.org/jira/browse/SPARK-29332
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun









[jira] [Created] (SPARK-29333) Sample weight in RandomForestRegressor

2019-10-02 Thread Jiaqi Guo (Jira)
Jiaqi Guo created SPARK-29333:
-

 Summary: Sample weight in RandomForestRegressor
 Key: SPARK-29333
 URL: https://issues.apache.org/jira/browse/SPARK-29333
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.4.4
Reporter: Jiaqi Guo


I think there have been some tickets related to this feature request. Even 
though those earlier tickets have been marked as resolved, it still seems 
impossible to add sample weights to the random forest classifier/regressor.

Support for sample weights would definitely be useful for many use cases, for 
example class imbalance and weighted bias correction for the samples. I think 
the sample weight should be taken into account in the splitting criterion.

Please correct me if I am missing an existing feature. Otherwise, it would be 
great to have an update on whether there is a path forward for supporting this 
in the near future.
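
For illustration, a sketch of the kind of API being requested, modeled on the 
{{weightCol}} parameter that some other estimators (e.g. {{LinearRegression}}) 
already expose; the {{setWeightCol}} call on {{RandomForestRegressor}} is 
hypothetical here:

{code:scala}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("weighted-rf-sketch").getOrCreate()
import spark.implicits._

// (features, label, weight) -- the weight column is what this ticket asks to honor.
val training = Seq(
  (Vectors.dense(1.0, 0.0), 1.0, 2.0),
  (Vectors.dense(0.0, 1.0), 0.0, 1.0),
  (Vectors.dense(1.0, 1.0), 1.0, 0.5),
  (Vectors.dense(0.0, 0.0), 0.0, 1.5)
).toDF("features", "label", "weight")

val rf = new RandomForestRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
  // .setWeightCol("weight")   // requested knob; not available on RandomForestRegressor yet

val model = rf.fit(training)
{code}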






[jira] [Created] (SPARK-29334) Supported vector operators in scala should have parity with pySpark

2019-10-02 Thread Patrick Pisciuneri (Jira)
Patrick Pisciuneri created SPARK-29334:
--

 Summary: Supported vector operators in scala should have parity 
with pySpark 
 Key: SPARK-29334
 URL: https://issues.apache.org/jira/browse/SPARK-29334
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.3.5, 2.4.5, 3.0.0
Reporter: Patrick Pisciuneri


pySpark supports various overloaded operators for the DenseVector type that the 
scala class does not support. 

# ML

https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462

# MLLIB

https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506

We should be able to leverage the BLAS wrappers to implement these methods on 
the scala side.
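
For illustration, a naive sketch (not using the BLAS wrappers) of the kind of 
operator support being requested on the Scala side, as it could be pasted into 
spark-shell; the enrichment class and its methods are hypothetical:

{code:scala}
import org.apache.spark.ml.linalg.{DenseVector, Vectors}

// Hypothetical enrichment; a real implementation would delegate to the BLAS
// wrappers instead of copying arrays element by element.
implicit class DenseVectorOps(val v: DenseVector) {
  def +(other: DenseVector): DenseVector =
    new DenseVector(v.values.zip(other.values).map { case (a, b) => a + b })
  def -(other: DenseVector): DenseVector =
    new DenseVector(v.values.zip(other.values).map { case (a, b) => a - b })
  def *(scalar: Double): DenseVector =
    new DenseVector(v.values.map(_ * scalar))
}

val a = Vectors.dense(1.0, 2.0).toDense
val b = Vectors.dense(3.0, 4.0).toDense
a + b      // [4.0, 6.0]
a * 2.0    // [2.0, 4.0]
{code}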







[jira] [Updated] (SPARK-29334) Supported vector operators in scala should have parity with pySpark

2019-10-02 Thread Patrick Pisciuneri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Pisciuneri updated SPARK-29334:
---
Description: 
pySpark supports various overloaded operators for the DenseVector type that the 
scala class does not support. 

- ML: 
https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462
- MLLIB: 
https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506

We should be able to leverage the BLAS wrappers to implement these methods on 
the scala side.


  was:
pySpark supports various overloaded operators for the DenseVector type that the 
scala class does not support. 

# ML

https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462

# MLLIB

https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506

We should be able to leverage the BLAS wrappers to implement these methods on 
the scala side.



> Supported vector operators in scala should have parity with pySpark 
> 
>
> Key: SPARK-29334
> URL: https://issues.apache.org/jira/browse/SPARK-29334
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.5, 2.4.5, 3.0.0
>Reporter: Patrick Pisciuneri
>Priority: Minor
>
> pySpark supports various overloaded operators for the DenseVector type that 
> the scala class does not support. 
> - ML: 
> https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462
> - MLLIB: 
> https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506
> We should be able to leverage the BLAS wrappers to implement these methods on 
> the scala side.






[jira] [Updated] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

2019-10-02 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-28917:
-
Description: 
{{RDD.dependencies}} stores the precomputed cache value, but it is not 
thread-safe.  This can lead to a race where the value gets overwritten, but the 
DAGScheduler gets stuck in an inconsistent state.  In particular, this can 
happen when there is a race between the DAGScheduler event loop, and another 
thread (eg. a user thread, if there is multi-threaded job submission).


First, a job is submitted by the user, which then computes the result Stage and 
its parents:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983

Which eventually makes a call to {{rdd.dependencies}}:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519

At the same time, the user could also touch {{rdd.dependencies}} in another 
thread, which could overwrite the stored value because of the race.

Then the DAGScheduler checks the dependencies *again* later on in the job 
submission, via {{getMissingParentStages}}

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025

Because it will find new dependencies, it will create entirely different 
stages.  Now the job has some orphaned stages which will never run.

One symptom of this is seeing disjoint sets of stages in the "Parents of 
final stage" and the "Missing parents" messages on job submission (however this 
is not required).

(*EDIT*: Seeing repeated "Registering RDD X" messages is actually just fine; it 
is not a symptom of a problem at all. It just means the RDD is the *input* to 
multiple shuffles.)

{noformat}
[INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - Starting 
job: count at XXX.scala:462
...
[INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler logInfo 
- Registering RDD 14 (repartition at XXX.scala:421)
...
...
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Got job 1 (count at XXX.scala:462) with 40 output partitions
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Final stage: ResultStage 5 (count at XXX.scala:462)
[INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo 
- Parents of final stage: List(ShuffleMapStage 4)
[INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo 
- Registering RDD 14 (repartition at XXX.scala:421)
[INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo 
- Missing parents: List(ShuffleMapStage 6)
{noformat}

Another symptom is only visible with DEBUG logging turned on for the 
DAGScheduler -- you will see calls to {{submitStage(Stage X)}} multiple times, 
followed by a different set of missing stages. E.g. here, we see that stage 1 is 
first missing stage 0 as a dependency, and then later on it is missing stage 23:

{noformat}
19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 0)
...
19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 
23)
{noformat}

Note that there is a similar issue with {{rdd.partitions}}. In particular, for 
some RDDs, {{partitions}} references {{dependencies}} (e.g. {{CoGroupedRDD}}).

There is also an issue that {{rdd.storageLevel}} is read and cached in the 
scheduler, but it could be modified simultaneously by the user in another 
thread. But I can't see a way it could affect the scheduler.

*WORKAROUND*:
(a) call {{rdd.dependencies}} while you know that the RDD is only being touched 
by one thread (e.g. in the thread that created it, or before you submit multiple 
jobs touching that RDD from other threads). Then that value will get cached; a 
sketch follows below.
(b) don't submit jobs from multiple threads.
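
A minimal sketch of workaround (a), assuming a spark-shell session ({{sc}} is 
the SparkContext; the RDD itself is just an example):

{code:scala}
// Workaround (a): compute rdd.dependencies once, from a single thread, before
// the same RDD is touched by multi-threaded job submission.
val rdd = sc.parallelize(1 to 1000)
  .map(i => (i % 10, i))
  .reduceByKey(_ + _)

rdd.dependencies   // forces and caches the dependency list while only this thread owns the RDD

// Only after that, submit jobs touching `rdd` from multiple threads.
val threads = Seq.fill(2)(new Thread(new Runnable {
  override def run(): Unit = { rdd.count(); () }
}))
threads.foreach(_.start())
threads.foreach(_.join())
{code}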

  was:
{{RDD.dependencies}} stores the precomputed cache value, but it is not 
thread-safe.  This can lead to a race where the value gets overwritten, but the 
DAGScheduler gets stuck in an inconsistent state.  In particular, this can 
happen when there is a race between the DAGScheduler event loop, and another 
thread (eg. a user thread, if there is multi-threaded job submission).


First, a job is submitted by the user, which then computes the result Stage and 
its parents:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983

Which eventually makes a call to {{rdd.dependencies}}:

https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scal

[jira] [Assigned] (SPARK-29332) Upgrade zstd-jni library to 1.4.3

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29332:
-

Assignee: Dongjoon Hyun

> Upgrade zstd-jni library to 1.4.3
> -
>
> Key: SPARK-29332
> URL: https://issues.apache.org/jira/browse/SPARK-29332
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>







[jira] [Resolved] (SPARK-29332) Upgrade zstd-jni library to 1.4.3

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29332.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26002
[https://github.com/apache/spark/pull/26002]

> Upgrade zstd-jni library to 1.4.3
> -
>
> Key: SPARK-29332
> URL: https://issues.apache.org/jira/browse/SPARK-29332
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-02 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943095#comment-16943095
 ] 

Peter Toth commented on SPARK-29078:


[~misutoth], if we look closer at the stack trace ({{at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)}}),
 it shows that the AccessControlException comes from the default database 
existence check (on the master branch this corresponds to 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L139):

{noformat}
// Create default database if it doesn't exist
if (!externalCatalog.databaseExists(SessionCatalog.DEFAULT_DATABASE)) {
  // There may be another Spark application creating default database at 
the same time, here we
{noformat}

This exception is expected if the user doesn't have access to the directory of 
the default database. In that case the user can't use Spark SQL.

I would suggest closing this ticket.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedSessionState}} when 
> {{GlobalTempViewManager}} is created, it is checked that no database exists 
> that has the same name as the global temp database (the name is 
> configurable with {{spark.sql.globalTempDatabase}}), because that is a 
> special database, which should not exist in the metastore. For this, a read 
> permission is required on the warehouse directory at the moment, which on the 
> other hand would allow listing all the databases of all users.
> When such a read access is not granted for security reasons, an access 
> violation exception should be ignored upon such initial validation.






[jira] [Assigned] (SPARK-27297) Add higher order functions to Scala API

2019-10-02 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-27297:
-

Assignee: Nikolas Vanderhoof

> Add higher order functions to Scala API
> ---
>
> Key: SPARK-27297
> URL: https://issues.apache.org/jira/browse/SPARK-27297
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Assignee: Nikolas Vanderhoof
>Priority: Major
>
> There is currently no existing Scala API equivalent for the higher order 
> functions introduced in Spark 2.4.0.
>  * transform
>  * aggregate
>  * filter
>  * exists
>  * zip_with
>  * map_zip_with
>  * map_filter
>  * transform_values
>  * transform_keys
> Equivalent column based functions should be added to the Scala API for 
> org.apache.spark.sql.functions with the following signatures:
>  
> {code:scala}
> def transform(column: Column, f: Column => Column): Column = ???
> def transform(column: Column, f: (Column, Column) => Column): Column = ???
> def exists(column: Column, f: Column => Column): Column = ???
> def filter(column: Column, f: Column => Column): Column = ???
> def aggregate(
> expr: Column,
> zero: Column,
> merge: (Column, Column) => Column,
> finish: Column => Column): Column = ???
> def aggregate(
> expr: Column,
> zero: Column,
> merge: (Column, Column) => Column): Column = ???
> def zip_with(
> left: Column,
> right: Column,
> f: (Column, Column) => Column): Column = ???
> def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ???
> def transform_values(expr: Column, f: (Column, Column) => Column): Column = 
> ???
> def map_filter(expr: Column, f: (Column, Column) => Column): Column = ???
> def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => 
> Column): Column = ???
> {code}






[jira] [Resolved] (SPARK-27297) Add higher order functions to Scala API

2019-10-02 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-27297.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24232
https://github.com/apache/spark/pull/24232

> Add higher order functions to Scala API
> ---
>
> Key: SPARK-27297
> URL: https://issues.apache.org/jira/browse/SPARK-27297
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Assignee: Nikolas Vanderhoof
>Priority: Major
> Fix For: 3.0.0
>
>
> There is currently no existing Scala API equivalent for the higher order 
> functions introduced in Spark 2.4.0.
>  * transform
>  * aggregate
>  * filter
>  * exists
>  * zip_with
>  * map_zip_with
>  * map_filter
>  * transform_values
>  * transform_keys
> Equivalent column based functions should be added to the Scala API for 
> org.apache.spark.sql.functions with the following signatures:
>  
> {code:scala}
> def transform(column: Column, f: Column => Column): Column = ???
> def transform(column: Column, f: (Column, Column) => Column): Column = ???
> def exists(column: Column, f: Column => Column): Column = ???
> def filter(column: Column, f: Column => Column): Column = ???
> def aggregate(
> expr: Column,
> zero: Column,
> merge: (Column, Column) => Column,
> finish: Column => Column): Column = ???
> def aggregate(
> expr: Column,
> zero: Column,
> merge: (Column, Column) => Column): Column = ???
> def zip_with(
> left: Column,
> right: Column,
> f: (Column, Column) => Column): Column = ???
> def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ???
> def transform_values(expr: Column, f: (Column, Column) => Column): Column = 
> ???
> def map_filter(expr: Column, f: (Column, Column) => Column): Column = ???
> def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => 
> Column): Column = ???
> {code}






[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-02 Thread Mihaly Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943118#comment-16943118
 ] 

Mihaly Toth commented on SPARK-29078:
-

But if the user has access to that directory (which is the Hive warehouse 
directory), they can see what databases are there regardless of whether they 
have access to those databases or not. This is not the worst security gap, so 
if we believe this is acceptable I don't mind closing this Jira.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedSessionState}} when 
> {{GlobalTempViewManager}} is created, it is checked that no database exists 
> that has the same name as the global temp database (the name is 
> configurable with {{spark.sql.globalTempDatabase}}), because that is a 
> special database, which should not exist in the metastore. For this, a read 
> permission is required on the warehouse directory at the moment, which on the 
> other hand would allow listing all the databases of all users.
> When such a read access is not granted for security reasons, an access 
> violation exception should be ignored upon such initial validation.






[jira] [Resolved] (SPARK-28962) High-order function: filter(array, function) → array

2019-10-02 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-28962.
---
Fix Version/s: 3.0.0
 Assignee: Henry Davidge
   Resolution: Fixed

Issue resolved by pull request 25666
https://github.com/apache/spark/pull/25666

> High-order function: filter(array, function) → array
> ---
>
> Key: SPARK-28962
> URL: https://issues.apache.org/jira/browse/SPARK-28962
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Henry Davidge
>Assignee: Henry Davidge
>Priority: Major
> Fix For: 3.0.0
>
>
> It's helpful to have access to the index when using the {{filter}} function. 
> For instance, we're using SparkSQL to manipulate genomic data. We store some 
> fields in a long array that has the same length for every row in the 
> DataFrame. We compute aggregates that are per array position (so we look at 
> the kth element for each row's array) and then want to filter each row's 
> array by looking up values in the aggregate array.






[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-02 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943127#comment-16943127
 ] 

Peter Toth commented on SPARK-29078:


I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if the {{default}} database points to {{/apps/hive/warehouse}}.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedSessionState}} when 
> {{GlobalTempViewManager}} is created, it is checked that no database exists 
> that has the same name as the global temp database (the name is 
> configurable with {{spark.sql.globalTempDatabase}}), because that is a 
> special database, which should not exist in the metastore. For this, a read 
> permission is required on the warehouse directory at the moment, which on the 
> other hand would allow listing all the databases of all users.
> When such a read access is not granted for security reasons, an access 
> violation exception should be ignored upon such initial validation.






[jira] [Comment Edited] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-02 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943127#comment-16943127
 ] 

Peter Toth edited comment on SPARK-29078 at 10/2/19 8:15 PM:
-

I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if the {{default}} database points to {{/apps/hive/warehouse}}. I 
mean that way we could avoid this issue.


was (Author: petertoth):
I don't think there should be other databases under {{/apps/hive/warehouse}} 
directory if the {{default}} database points to {{/apps/hive/warehouse}}.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedSessionState}} when 
> {{GlobalTempViewManager}} is created, it is checked that no database exists 
> that has the same name as the global temp database (the name is 
> configurable with {{spark.sql.globalTempDatabase}}), because that is a 
> special database, which should not exist in the metastore. For this, a read 
> permission is required on the warehouse directory at the moment, which on the 
> other hand would allow listing all the databases of all users.
> When such a read access is not granted for security reasons, an access 
> violation exception should be ignored upon such initial validation.






[jira] [Created] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-10-02 Thread Srini E (Jira)
Srini E created SPARK-29335:
---

 Summary: Cost Based Optimizer stats are not used while evaluating 
query plans in Spark Sql
 Key: SPARK-29335
 URL: https://issues.apache.org/jira/browse/SPARK-29335
 Project: Spark
  Issue Type: Question
  Components: Optimizer
Affects Versions: 2.3.0
 Environment: We tried to execute the same using spark-sql and the Thrift 
server (via SQLWorkbench), but we are not able to use the stats.
Reporter: Srini E


We are trying to leverage CBO to get better plan results for a few critical 
queries run through spark-sql or through the Thrift server using the JDBC 
driver.

The following settings were added to spark-defaults.conf:

*spark.sql.cbo.enabled true* 

*spark.experimental.extrastrategies intervaljoin* 

*spark.sql.cbo.joinreorder.enabled true* 

 

The tables that we are using are not partitioned.

Spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;

analyze table arrow.t_fperiods_sundar compute statistics for columns eid, year, 
ptype, absref, fpid , pid ;

analyze table arrow.t_fdata_sundar compute statistics ;

analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
absref;

Analyze completed successfully.

DESCRIBE EXTENDED does not show column-level stats data, and queries are not 
leveraging table- or column-level stats.
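
A minimal way to check whether the statistics were actually persisted (a sketch; 
it assumes the column form of DESC EXTENDED, which should be available in Spark 
2.3+):

{code:scala}
// Table-level stats (sizeInBytes / rowCount) appear in the "Statistics" row of
// DESCRIBE EXTENDED; column-level stats need the column form of the command.
spark.sql("DESC EXTENDED arrow.t_fperiods_sundar").show(100, truncate = false)
spark.sql("DESC EXTENDED arrow.t_fperiods_sundar eid").show(truncate = false)
{code}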

We are using Oracle as our Hive catalog store, not Glue.

 

+*When we run queries using Spark SQL, we are not able to see the stats being 
used in the explain plan, and we are not sure whether CBO is put to use.*+ 

 

*A quick response would be helpful.*

 

*Explain Plan:*

The following explain command does not reference any statistics usage.
 
spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
 
19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
2017),(ptype#4546 = A),(eid#4542 = 29940),isnull(PID#4527),isnotnull(fpid#4523)
19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct
19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
isnotnull(absref#4569),(absref#4569 = 
Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct
19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
== Parsed Logical Plan ==
'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
+- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
(('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
('a12.eid = 29940)) && isnull('a12.PID)))
 +- 'Join Inner
 :- 'SubqueryAlias a12
 : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
 +- 'SubqueryAlias a13
 +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
 
== Analyzed Logical Plan ==
imnem: string, fvalue: string, ptype: string, absref: string
Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
+- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = 
2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527)))
 +- Join Inner
 :- SubqueryAlias a12
 : +- SubqueryAlias t_fperiods_sundar
 : +- 
Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
 parquet
 +- SubqueryAlias a13
 +- SubqueryAlias t_fdata_sundar
 +- 
Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
 parquet
 
== Optimized Logical Plan ==
Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
+- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as 
decimal(38,0
 :- Project [FPID#4523, EID#4542, PTYPE#4546]
 : +- Filter (((isnotnull(ptype#4546) && isnotnull(year#4545)) && 
isnotnull(

[jira] [Created] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected

2019-10-02 Thread Guilherme Souza (Jira)
Guilherme Souza created SPARK-29336:
---

 Summary: The implementation of QuantileSummaries.merge  does not 
guarantee the relativeError will be respected 
 Key: SPARK-29336
 URL: https://issues.apache.org/jira/browse/SPARK-29336
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Guilherme Souza






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected

2019-10-02 Thread Guilherme Souza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guilherme Souza updated SPARK-29336:

   Shepherd:   (was: Sean Zhong)
Description: (sorry for the early submission, I'm still writing the 
description...)

> The implementation of QuantileSummaries.merge  does not guarantee the 
> relativeError will be respected 
> --
>
> Key: SPARK-29336
> URL: https://issues.apache.org/jira/browse/SPARK-29336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Guilherme Souza
>Priority: Minor
>
> (sorry for the early submission, I'm still writing the description...)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-10-02 Thread Srini E (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srini E updated SPARK-29335:

Attachment: explain_plan_cbo_spark.txt

> Cost Based Optimizer stats are not used while evaluating query plans in Spark 
> Sql
> -
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
>  Issue Type: Question
>  Components: Optimizer
>Affects Versions: 2.3.0
> Environment: We tried to execute the same using spark-sql and the Thrift 
> server via SQLWorkbench, but we are not able to use the stats.
>Reporter: Srini E
>Priority: Major
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage CBO to get better plans for a few critical queries 
> run through spark-sql or through the Thrift server using the JDBC driver. 
> The following settings were added to spark-defaults.conf:
> *spark.sql.cbo.enabled true* 
> *spark.experimental.extrastrategies intervaljoin* 
> *spark.sql.cbo.joinreorder.enabled true* 
>  
> The tables that we are using are not partitioned.
> Spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, 
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
> absref;
> Analyze completed successfully.
> Describe extended does not show column-level stats, and queries are not 
> leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
>  
> +*When we run queries through Spark SQL, we are not able to see the stats 
> being used in the explain plan, and we are not sure whether CBO is put to 
> use.*+ 
>  
> *A quick response would be helpful.*
>  
> *Explain Plan:*
> The following explain command does not reference any statistics usage.
>  
> spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
>  
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
> 2017),(ptype#4546 = A),(eid#4542 = 
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(absref#4569),(absref#4569 = 
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: 
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
> ('a12.eid = 29940)) && isnull('a12.PID)))
>  +- 'Join Inner
>  :- 'SubqueryAlias a12
>  : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
>  +- 'SubqueryAlias a13
>  +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>  
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
> cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = 
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
> cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527)))
>  +- Join Inner
>  :- SubqueryAlias a12
>  : +- SubqueryAlias t_fperiods_sundar
>  : +- 
> Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
>  parquet

[jira] [Updated] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server

2019-10-02 Thread Srini E (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srini E updated SPARK-29337:

Attachment: Cache+Image.png

> How to Cache Table and Pin it in Memory and should not Spill to Disk on 
> Thrift Server 
> --
>
> Key: SPARK-29337
> URL: https://issues.apache.org/jira/browse/SPARK-29337
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Srini E
>Priority: Major
> Attachments: Cache+Image.png
>
>
> Hi Team,
> How can we pin a table in the cache so it does not get evicted from memory?
> Situation: We are using MicroStrategy BI reporting, with a semantic layer 
> built on top. We wanted to cache highly used tables with Spark SQL CACHE 
> TABLE; we did cache them on the Spark context of the Thrift server. Please 
> see the attached snapshot of the cached table: it went to disk over time. 
> Initially it was all in memory; now some is in memory and some on disk. That 
> disk may be local disk, which is relatively more expensive to read than S3. 
> Queries may take longer and show inconsistent times from a user-experience 
> perspective. If more queries run against the cached tables, copies of the 
> cached table are made and those copies do not stay in memory, causing reports 
> to run longer. So how can we pin the table so it does not spill to disk? 
> Spark memory management uses dynamic allocation; how can we keep those few 
> tables pinned in memory?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server

2019-10-02 Thread Srini E (Jira)
Srini E created SPARK-29337:
---

 Summary: How to Cache Table and Pin it in Memory and should not 
Spill to Disk on Thrift Server 
 Key: SPARK-29337
 URL: https://issues.apache.org/jira/browse/SPARK-29337
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.3.0
Reporter: Srini E
 Attachments: Cache+Image.png

Hi Team,

How can we pin a table in the cache so it does not get evicted from memory?

Situation: We are using MicroStrategy BI reporting, with a semantic layer built 
on top. We wanted to cache highly used tables with Spark SQL CACHE TABLE; we 
did cache them on the Spark context of the Thrift server. Please see the 
attached snapshot of the cached table: it went to disk over time. Initially it 
was all in memory; now some is in memory and some on disk. That disk may be 
local disk, which is relatively more expensive to read than S3. Queries may 
take longer and show inconsistent times from a user-experience perspective. If 
more queries run against the cached tables, copies of the cached table are made 
and those copies do not stay in memory, causing reports to run longer. So how 
can we pin the table so it does not spill to disk? Spark memory management uses 
dynamic allocation; how can we keep those few tables pinned in memory?
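A hedged sketch of one way to keep a cached table out of local disk: cache 
through the Dataset API with an explicit MEMORY_ONLY storage level instead of 
the default MEMORY_AND_DISK (the table name below is a placeholder; blocks that 
do not fit in memory are then dropped and recomputed rather than spilled, so 
this trades disk reads for recomputation):

{code:scala}
import org.apache.spark.storage.StorageLevel

// Placeholder table name; run on the Thrift server's Spark context.
val t = spark.table("reporting_db.highly_used_table")

// MEMORY_ONLY never writes cached blocks to local disk; partitions that do not fit
// in the storage pool are dropped and recomputed from the source on access.
t.persist(StorageLevel.MEMORY_ONLY)
t.count()  // materialize the cache eagerly
{code}
Depending on the Spark version, CACHE TABLE may also accept a storage level via 
OPTIONS ('storageLevel' 'MEMORY_ONLY'); whether that syntax is available in 
2.3.0 should be verified.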



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

2019-10-02 Thread Guilherme Souza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guilherme Souza updated SPARK-29336:

Summary: The implementation of QuantileSummaries.merge  does not guarantee 
that the relativeError will be respected   (was: The implementation of 
QuantileSummaries.merge  does not guarantee the relativeError will be respected 
)

> The implementation of QuantileSummaries.merge  does not guarantee that the 
> relativeError will be respected 
> ---
>
> Key: SPARK-29336
> URL: https://issues.apache.org/jira/browse/SPARK-29336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Guilherme Souza
>Priority: Minor
>
> Hello Spark maintainers,
> I was experimenting with my own implementation of the [space-efficient 
> quantile 
> algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
>  in another language and I was using the Spark's one as a reference.
> In my analysis, I believe to have found an issue with the {{merge()}} logic. 
> Here is some simple Scala code that reproduces the issue I've found:
>  
> {code:java}
> var values = (1 to 100).toArray
> val all_quantiles = values.indices.map(i => (i+1).toDouble / 
> values.length).toArray
> for (n <- 0 until 5) {
>   var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
>   val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
>   val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
>   val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) 
> => Math.abs(expected - answer) }).toArray
>   val max_error = error.max
>   print(max_error + "\n")
> }
> {code}
> I query for all possible quantiles in a 100-element array with a desired 10% 
> max error. In this scenario, one would expect to observe a maximum error of 
> 10 ranks or less (10% of 100). However, the output I observe is:
>  
> {noformat}
> 16
> 12
> 10
> 11
> 17{noformat}
> The variance is probably due to non-deterministic operations behind the 
> scenes, but irrelevant to the core cause.
> Interestingly enough, if I change from five to one partition the code works 
> as expected and gives 10 every time. This seems to point to some problem at 
> the [merge 
> logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]
> The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
> the history) suggest the published paper is not clear on how that should be 
> done and, honestly, I was not confident in the current approach either.
> I've found SPARK-21184 that reports the same problem, but it was 
> unfortunately closed with no fix applied.
> In my external implementation I believe to have found a sound way to 
> implement the merge method. [Here is my take in Rust, if relevant
> |https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]I'd
>  be really glad to add unit tests and contribute my implementation adapted to 
> Scala.
> I'd love to hear your opinion on the matter.
> Best regards
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected

2019-10-02 Thread Guilherme Souza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guilherme Souza updated SPARK-29336:

Description: 
Hello Spark maintainers,

I was experimenting with my own implementation of the [space-efficient quantile 
algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
 in another language and I was using the Spark's one as a reference.

In my analysis, I believe to have found an issue with the {{merge()}} logic. 
Here is some simple Scala code that reproduces the issue I've found:

 
{code:java}
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / 
values.length).toArray
for (n <- 0 until 5) {
  var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => 
Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}
{code}
I query for all possible quantiles in a 100-element array with a desired 10% 
max error. In this scenario, one would expect to observe a maximum error of 10 
ranks or less (10% of 100). However, the output I observe is:

 
{noformat}
16
12
10
11
17{noformat}
The variance is probably due to non-deterministic operations behind the scenes, 
but irrelevant to the core cause.

Interestingly enough, if I change from five to one partition the code works as 
expected and gives 10 every time. This seems to point to some problem at the 
[merge 
logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]

The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
the history) suggest the published paper is not clear on how that should be 
done and, honestly, I was not confident in the current approach either.

I've found SPARK-21184 that reports the same problem, but it was unfortunately 
closed with no fix applied.

In my external implementation I believe to have found a sound way to implement 
the merge method. [Here is my take in Rust, if relevant
|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]I'd
 be really glad to add unit tests and contribute my implementation adapted to 
Scala.
I'd love to hear your opinion on the matter.

Best regards

 

 

  was:(sorry for the early submission, I'm still writing the description...)


> The implementation of QuantileSummaries.merge  does not guarantee the 
> relativeError will be respected 
> --
>
> Key: SPARK-29336
> URL: https://issues.apache.org/jira/browse/SPARK-29336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Guilherme Souza
>Priority: Minor
>
> Hello Spark maintainers,
> I was experimenting with my own implementation of the [space-efficient 
> quantile 
> algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
>  in another language and I was using the Spark's one as a reference.
> In my analysis, I believe to have found an issue with the {{merge()}} logic. 
> Here is some simple Scala code that reproduces the issue I've found:
>  
> {code:java}
> var values = (1 to 100).toArray
> val all_quantiles = values.indices.map(i => (i+1).toDouble / 
> values.length).toArray
> for (n <- 0 until 5) {
>   var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
>   val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
>   val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
>   val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) 
> => Math.abs(expected - answer) }).toArray
>   val max_error = error.max
>   print(max_error + "\n")
> }
> {code}
> I query for all possible quantiles in a 100-element array with a desired 10% 
> max error. In this scenario, one would expect to observe a maximum error of 
> 10 ranks or less (10% of 100). However, the output I observe is:
>  
> {noformat}
> 16
> 12
> 10
> 11
> 17{noformat}
> The variance is probably due to non-deterministic operations behind the 
> scenes, but irrelevant to the core cause.
> Interestingly enough, if I change from five to one partition the code works 
> as expected and gives 10 every time. This seems to point to some problem at 
> the [merge 
> logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/Quanti

[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

2019-10-02 Thread Guilherme Souza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guilherme Souza updated SPARK-29336:

Description: 
Hello Spark maintainers,

I was experimenting with my own implementation of the [space-efficient quantile 
algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
 in another language and I was using the Spark's one as a reference.

In my analysis, I believe to have found an issue with the {{merge()}} logic. 
Here is some simple Scala code that reproduces the issue I've found:

 
{code:java}
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / 
values.length).toArray
for (n <- 0 until 5) {
  var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => 
Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}
{code}
I query for all possible quantiles in a 100-element array with a desired 10% 
max error. In this scenario, one would expect to observe a maximum error of 10 
ranks or less (10% of 100). However, the output I observe is:

 
{noformat}
16
12
10
11
17{noformat}
The variance is probably due to non-deterministic operations behind the scenes, 
but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it)

Interestingly enough, if I change from five to one partition the code works as 
expected and gives 10 every time. This seems to point to some problem at the 
[merge 
logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]

The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
the history) suggest the published paper is not clear on how that should be 
done and, honestly, I was not confident in the current approach either.

I've found SPARK-21184 that reports the same problem, but it was unfortunately 
closed with no fix applied.

In my external implementation I believe to have found a sound way to implement 
the merge method. [Here is my take in Rust, if 
relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]
I'd be really glad to add unit tests and contribute my implementation adapted 
to Scala.
 I'd love to hear your opinion on the matter.
Best regards

 

 

  was:
Hello Spark maintainers,

I was experimenting with my own implementation of the [space-efficient quantile 
algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
 in another language and I was using the Spark's one as a reference.

In my analysis, I believe to have found an issue with the {{merge()}} logic. 
Here is some simple Scala code that reproduces the issue I've found:

 
{code:java}
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / 
values.length).toArray
for (n <- 0 until 5) {
  var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => 
Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}
{code}
I query for all possible quantiles in a 100-element array with a desired 10% 
max error. In this scenario, one would expect to observe a maximum error of 10 
ranks or less (10% of 100). However, the output I observe is:

 
{noformat}
16
12
10
11
17{noformat}
The variance is probably due to non-deterministic operations behind the scenes, 
but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it)

Interestingly enough, if I change from five to one partition the code works as 
expected and gives 10 every time. This seems to point to some problem at the 
[merge 
logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]

The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
the history) suggest the published paper is not clear on how that should be 
done and, honestly, I was not confident in the current approach either.

I've found SPARK-21184 that reports the same problem, but it was unfortunately 
closed with no fix applied.

In my external implementation I believe to have found a sound way to implement 
the merge method. [Here is my take in Rust, if relevant
|https://github.com/sitegui/space-efficient-quantile/blo

[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

2019-10-02 Thread Guilherme Souza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guilherme Souza updated SPARK-29336:

Description: 
Hello Spark maintainers,

I was experimenting with my own implementation of the [space-efficient quantile 
algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
 in another language and I was using the Spark's one as a reference.

In my analysis, I believe to have found an issue with the {{merge()}} logic. 
Here is some simple Scala code that reproduces the issue I've found:

 
{code:java}
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / 
values.length).toArray
for (n <- 0 until 5) {
  var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => 
Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}
{code}
I query for all possible quantiles in a 100-element array with a desired 10% 
max error. In this scenario, one would expect to observe a maximum error of 10 
ranks or less (10% of 100). However, the output I observe is:

 
{noformat}
16
12
10
11
17{noformat}
The variance is probably due to non-deterministic operations behind the scenes, 
but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it)

Interestingly enough, if I change from five to one partition the code works as 
expected and gives 10 every time. This seems to point to some problem at the 
[merge 
logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]

The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
the history) suggest the published paper is not clear on how that should be 
done and, honestly, I was not confident in the current approach either.

I've found SPARK-21184 that reports the same problem, but it was unfortunately 
closed with no fix applied.

In my external implementation I believe to have found a sound way to implement 
the merge method. [Here is my take in Rust, if relevant
|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]I'd
 be really glad to add unit tests and contribute my implementation adapted to 
Scala.
 I'd love to hear your opinion on the matter.|

Best regards

 

 

  was:
Hello Spark maintainers,

I was experimenting with my own implementation of the [space-efficient quantile 
algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
 in another language and I was using the Spark's one as a reference.

In my analysis, I believe to have found an issue with the {{merge()}} logic. 
Here is some simple Scala code that reproduces the issue I've found:

 
{code:java}
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / 
values.length).toArray
for (n <- 0 until 5) {
  var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
  val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
  val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
  val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => 
Math.abs(expected - answer) }).toArray
  val max_error = error.max
  print(max_error + "\n")
}
{code}
I query for all possible quantiles in a 100-element array with a desired 10% 
max error. In this scenario, one would expect to observe a maximum error of 10 
ranks or less (10% of 100). However, the output I observe is:

 
{noformat}
16
12
10
11
17{noformat}
The variance is probably due to non-deterministic operations behind the scenes, 
but irrelevant to the core cause.

Interestingly enough, if I change from five to one partition the code works as 
expected and gives 10 every time. This seems to point to some problem at the 
[merge 
logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]

The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
the history) suggest the published paper is not clear on how that should be 
done and, honestly, I was not confident in the current approach either.

I've found SPARK-21184 that reports the same problem, but it was unfortunately 
closed with no fix applied.

In my external implementation I believe to have found a sound way to implement 
the merge method. [Here is my take in Rust, if relevant
|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/

[jira] [Commented] (SPARK-18748) UDF multiple evaluations causes very poor performance

2019-10-02 Thread Anton Baranau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943176#comment-16943176
 ] 

Anton Baranau commented on SPARK-18748:
---

I got the same problem having the code below with 2.4.4
{code:python}
df.withColumn("scores", 
sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 
'scores.score2')
{code}
In my case data isn't huge so I can afford to cahce it like below
{code:python}
df.withColumn("scores", 
sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1',
 'scores.score2')
{code}
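Besides caching, another mitigation that is sometimes suggested for this class 
of problem is marking the UDF as non-deterministic, so the optimizer does not 
inline the expression into every place that references the output column. A 
minimal Scala sketch follows (an illustration only, not a fix from this ticket; 
it also disables optimizations that require determinism, and whether it helps 
depends on the plan shape):

{code:scala}
import org.apache.spark.sql.functions.{lit, udf}

// Marking the UDF non-deterministic keeps rules such as project collapsing and
// filter pushdown from duplicating the expensive expression.
val veryExpensiveCalc =
  udf((s: String) => { println("blahblah1"); "nothing" }).asNondeterministic()

val result = spark.range(1)
  .select(veryExpensiveCalc(lit("a")).as("c"))
  .where("c is not null and c <> ''")
result.show()
{code}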

> UDF multiple evaluations causes very poor performance
> -
>
> Key: SPARK-18748
> URL: https://issues.apache.org/jira/browse/SPARK-18748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> We have a use case where we have a relatively expensive UDF that needs to be 
> calculated. The problem is that instead of being calculated once, it gets 
> calculated over and over again.
> for example:
> {quote}
> def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\}
> hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _)
> hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is 
> not null and c<>''").show
> {quote}
> with the output:
> {quote}
> blahblah1
> blahblah1
> blahblah1
> +---+
> |  c|
> +---+
> |nothing|
> +---+
> {quote}
> You can see that for each reference of column "c" you will get the println.
> That causes very poor performance for our real use case.
> This also came out on StackOverflow:
> http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns
> http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/
> with two problematic work-arounds:
> 1. cache() after the first time. e.g.
> {quote}
> hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not 
> null and c<>''").show
> {quote}
> while it works, in our case we can't do that because the table is too big to 
> cache.
> 2. move back and forth to rdd:
> {quote}
> val df = hiveContext.sql("select veryExpensiveCalc('a') as c")
> hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and 
> c<>''").show
> {quote}
> which works, but then we lose some of the optimizations like predicate 
> pushdown, etc., and it's very ugly.
> Any ideas on how we can make the UDF get calculated just once in a reasonable 
> way?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18748) UDF multiple evaluations causes very poor performance

2019-10-02 Thread Anton Baranau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943176#comment-16943176
 ] 

Anton Baranau edited comment on SPARK-18748 at 10/2/19 9:32 PM:


I got the same problem having the code below with 2.4.4
{code:python}
df.withColumn("scores", 
sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 
'scores.score2')
{code}
In my case data isn't huge so I can afford to cache it like below
{code:python}
df.withColumn("scores", 
sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1',
 'scores.score2')
{code}


was (Author: ahtokca):
I got the same problem having the code below with 2.4.4
{code:python}
df.withColumn("scores", 
sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 
'scores.score2')
{code}
In my case data isn't huge so I can afford to cahce it like below
{code:python}
df.withColumn("scores", 
sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1',
 'scores.score2')
{code}

> UDF multiple evaluations causes very poor performance
> -
>
> Key: SPARK-18748
> URL: https://issues.apache.org/jira/browse/SPARK-18748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> We have a use case where we have a relatively expensive UDF that needs to be 
> calculated. The problem is that instead of being calculated once, it gets 
> calculated over and over again.
> for example:
> {quote}
> def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\}
> hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _)
> hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is 
> not null and c<>''").show
> {quote}
> with the output:
> {quote}
> blahblah1
> blahblah1
> blahblah1
> +---+
> |  c|
> +---+
> |nothing|
> +---+
> {quote}
> You can see that for each reference of column "c" you will get the println.
> That causes very poor performance for our real use case.
> This also came out on StackOverflow:
> http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns
> http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/
> with two problematic work-arounds:
> 1. cache() after the first time. e.g.
> {quote}
> hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not 
> null and c<>''").show
> {quote}
> while it works, in our case we can't do that because the table is too big to 
> cache.
> 2. move back and forth to rdd:
> {quote}
> val df = hiveContext.sql("select veryExpensiveCalc('a') as c")
> hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and 
> c<>''").show
> {quote}
> which works, but then we lose some of the optimizations like predicate 
> pushdown, etc., and it's very ugly.
> Any ideas on how we can make the UDF get calculated just once in a reasonable 
> way?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28725) Spark ML not able to de-serialize Logistic Regression model saved in previous version of Spark

2019-10-02 Thread Sharad Varshney (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943192#comment-16943192
 ] 

Sharad Varshney commented on SPARK-28725:
-

Even with the same Spark version (2.4.3), there is a huge difference in the probabilities.

> Spark ML not able to de-serialize Logistic Regression model saved in previous 
> version of Spark
> --
>
> Key: SPARK-28725
> URL: https://issues.apache.org/jira/browse/SPARK-28725
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.3
> Environment: PROD
>Reporter: Sharad Varshney
>Priority: Major
>
> A Logistic Regression model saved with Spark 2.3.0 on HDI is not correctly 
> de-serialized with Spark 2.4.3. It loads into memory, but the probabilities 
> it emits at inference are around 1.45e-44 (approximately 0).
>  
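A minimal way to compare the deserialized model across versions is sketched 
below (the paths and data set are placeholders, not taken from this report):

{code:scala}
import org.apache.spark.ml.classification.LogisticRegressionModel

// Placeholder path: a model written with model.save(...) on the Spark 2.3.0 cluster.
val model = LogisticRegressionModel.load("/models/lr-spark-2.3.0")

// Score the same held-out data on both Spark versions and diff the probability column;
// large discrepancies point at a (de)serialization or metadata-compatibility problem.
val scored = model.transform(spark.read.parquet("/data/holdout_features"))
scored.select("probability", "prediction").show(5, truncate = false)
{code}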



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29338) Add overload for filter with index to Scala/Java API

2019-10-02 Thread Nikolas Vanderhoof (Jira)
Nikolas Vanderhoof created SPARK-29338:
--

 Summary: Add overload for filter with index to Scala/Java API
 Key: SPARK-29338
 URL: https://issues.apache.org/jira/browse/SPARK-29338
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Nikolas Vanderhoof


Add an overload of the higher-order function `filter` to 
`org.apache.spark.sql.functions` whose lambda takes the array index as its 
second argument.

See: SPARK-28962 and SPARK-27297
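For reference, the index-aware form can already be reached from Scala through a 
SQL expression; a sketch is below, assuming a build that already contains 
SPARK-28962 (run in spark-shell), together with a guess at the shape of the 
proposed overload, which is an assumption and not the merged API:

{code:scala}
import org.apache.spark.sql.functions.expr
import spark.implicits._

val df = Seq(Seq(10, 20, 30, 40)).toDF("xs")

// Until a typed overload exists, the index-aware SQL form works through expr():
df.select(expr("filter(xs, (x, i) -> i % 2 = 0)").as("elems_at_even_index")).show()

// Hypothetical shape of the proposed org.apache.spark.sql.functions overload:
//   def filter(column: Column, f: (Column, Column) => Column): Column
{code}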



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29338) Add overload for filter with index to Scala/Java API

2019-10-02 Thread Nikolas Vanderhoof (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolas Vanderhoof resolved SPARK-29338.

Resolution: Duplicate

> Add overload for filter with index to Scala/Java API
> 
>
> Key: SPARK-29338
> URL: https://issues.apache.org/jira/browse/SPARK-29338
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Nikolas Vanderhoof
>Priority: Major
>
> Add an overload of the higher-order function `filter` to 
> `org.apache.spark.sql.functions` whose lambda takes the array index as its 
> second argument.
> See: SPARK-28962 and SPARK-27297



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29339) Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build)

2019-10-02 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-29339:


 Summary: Support Arrow 0.14 in vectorized dapply and gapply (test 
it in AppVeyor build)
 Key: SPARK-29339
 URL: https://issues.apache.org/jira/browse/SPARK-29339
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


dapply and gapply with Arrow optimization and Arrow 0.14 seem to be failing:

{code}
> collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, 
> structType("gear double")))
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
{code}

We should fix this and also test it in AppVeyor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29322) History server is stuck reading incomplete event log file compressed with zstd

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29322:
-

Assignee: Jungtaek Lim

> History server is stuck reading incomplete event log file compressed with zstd
> --
>
> Key: SPARK-29322
> URL: https://issues.apache.org/jira/browse/SPARK-29322
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Attachments: history-server-1.jstack, history-server-2.jstack, 
> history-server-3.jstack, history-server-4.jstack
>
>
> While working on SPARK-28869, I've discovered that reading an in-progress 
> event log file compressed with zstd can leave the reading thread stuck. I 
> experimented with the Spark History Server and observed the same issue. I'll 
> attach the jstack files.
> This is very easy to reproduce: set the configuration as below
> - spark.eventLog.enabled=true
> - spark.eventLog.compress=true
> - spark.eventLog.compression.codec=zstd
> and start a Spark application. While the application is running, load the 
> application in the SHS web page. It may succeed in replaying the event log, 
> but most likely it will get stuck and the loading page will be stuck as well.
> Only listing the thread stack trace being stuck across jstack files:
> {code}
> 2019-10-02 11:32:36
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.191-b12 mixed mode):
> ...
> "qtp2072313080-30" #30 daemon prio=5 os_prio=31 tid=0x7ff5b90e7800 
> nid=0x9703 runnable [0x7f22]
>java.lang.Thread.State: RUNNABLE
>   at java.io.FileInputStream.readBytes(Native Method)
>   at java.io.FileInputStream.read(FileInputStream.java:255)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:156)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <0x0007b5f97c60> (a 
> org.apache.hadoop.fs.BufferedFSInputStream)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at 
> org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:257)
>   at 
> org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
>   at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:228)
>   at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
>   - locked <0x0007b5f97b58> (a 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <0x0007b5f97af8> (a java.io.BufferedInputStream)
>   at 
> com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:129)
>   at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107)
>   - locked <0x0007b5f97ac0> (a com.github.luben.zstd.ZstdInputStream)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <0x0007b5cd3bd0> (a java.io.BufferedInputStream)
>   at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   - locked <0x0007b5f94a00> (a java.io.InputStreamReader)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   - locked <0x0007b5f94a00> (a java.io.InputStreamReader)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
>   at scala.collection.Iterator$$anon$20.hasNext(Iterator.scala:884)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
>   at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:80)
>   at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$rebuildAppStore$5(FsHistoryProvider.scala:976)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.$a

[jira] [Resolved] (SPARK-29322) History server is stuck reading incomplete event log file compressed with zstd

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29322.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25996
[https://github.com/apache/spark/pull/25996]

> History server is stuck reading incomplete event log file compressed with zstd
> --
>
> Key: SPARK-29322
> URL: https://issues.apache.org/jira/browse/SPARK-29322
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: history-server-1.jstack, history-server-2.jstack, 
> history-server-3.jstack, history-server-4.jstack
>
>
> While working on SPARK-28869, I've discovered that reading an in-progress 
> event log file compressed with zstd can leave the reading thread stuck. I 
> experimented with the Spark History Server and observed the same issue. I'll 
> attach the jstack files.
> This is very easy to reproduce: set the configuration as below
> - spark.eventLog.enabled=true
> - spark.eventLog.compress=true
> - spark.eventLog.compression.codec=zstd
> and start a Spark application. While the application is running, load the 
> application in the SHS web page. It may succeed in replaying the event log, 
> but most likely it will get stuck and the loading page will be stuck as well.
> Only listing the thread stack trace being stuck across jstack files:
> {code}
> 2019-10-02 11:32:36
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.191-b12 mixed mode):
> ...
> "qtp2072313080-30" #30 daemon prio=5 os_prio=31 tid=0x7ff5b90e7800 
> nid=0x9703 runnable [0x7f22]
>java.lang.Thread.State: RUNNABLE
>   at java.io.FileInputStream.readBytes(Native Method)
>   at java.io.FileInputStream.read(FileInputStream.java:255)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:156)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <0x0007b5f97c60> (a 
> org.apache.hadoop.fs.BufferedFSInputStream)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at 
> org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:257)
>   at 
> org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
>   at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:228)
>   at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
>   - locked <0x0007b5f97b58> (a 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <0x0007b5f97af8> (a java.io.BufferedInputStream)
>   at 
> com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:129)
>   at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107)
>   - locked <0x0007b5f97ac0> (a com.github.luben.zstd.ZstdInputStream)
>   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>   - locked <0x0007b5cd3bd0> (a java.io.BufferedInputStream)
>   at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   - locked <0x0007b5f94a00> (a java.io.InputStreamReader)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   - locked <0x0007b5f94a00> (a java.io.InputStreamReader)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
>   at scala.collection.Iterator$$anon$20.hasNext(Iterator.scala:884)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511)
>   at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:80)
>   at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>   at 
> org.apache.spark.deploy.history.FsHistor

[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29328:
--
Affects Version/s: 2.3.4

> Incorrect calculation mean seconds per month
> 
>
> Key: SPARK-29328
> URL: https://issues.apache.org/jira/browse/SPARK-29328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Maxim Gekk
>Priority: Minor
>
> Existing implementation assumes 31 days per month or 372 days per year which 
> is far away from the correct number. Spark uses the proleptic Gregorian 
> calendar by default SPARK-26651 in which the average year is 365.2425 days 
> long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix 
> calculation in 3 places at least:
> - GroupStateImpl.scala:167:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - EventTimeWatermark.scala:32:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29328:
--
Affects Version/s: 2.2.3

> Incorrect calculation mean seconds per month
> 
>
> Key: SPARK-29328
> URL: https://issues.apache.org/jira/browse/SPARK-29328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.4
>Reporter: Maxim Gekk
>Priority: Minor
>
> Existing implementation assumes 31 days per month or 372 days per year which 
> is far away from the correct number. Spark uses the proleptic Gregorian 
> calendar by default SPARK-26651 in which the average year is 365.2425 days 
> long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix 
> calculation in 3 places at least:
> - GroupStateImpl.scala:167:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - EventTimeWatermark.scala:32:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29328:
--
Description: 
Existing implementation assumes 31 days per month or 372 days per year which is 
far away from the correct number. Spark uses the proleptic Gregorian calendar 
by default SPARK-26651 in which the average year is 365.2425 days long: 
https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix calculation in 3 
places at least:
- GroupStateImpl.scala:167:val millisPerMonth = 
TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- EventTimeWatermark.scala:32:val millisPerMonth = 
TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)

*BEFORE*
{code}
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.4516129
{code}
*AFTER*
{code}
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.45996838
{code}

  was:
Existing implementation assumes 31 days per month or 372 days per year which is 
far away from the correct number. Spark uses the proleptic Gregorian calendar 
by default SPARK-26651 in which the average year is 365.2425 days long: 
https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix calculation in 3 
places at least:
- GroupStateImpl.scala:167:val millisPerMonth = 
TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- EventTimeWatermark.scala:32:val millisPerMonth = 
TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
- DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)


> Incorrect calculation mean seconds per month
> 
>
> Key: SPARK-29328
> URL: https://issues.apache.org/jira/browse/SPARK-29328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.4
>Reporter: Maxim Gekk
>Priority: Minor
>
> The existing implementation assumes 31 days per month, i.e. 372 days per 
> year, which is far from the correct value. Since SPARK-26651, Spark uses the 
> proleptic Gregorian calendar by default, in which the average year is 
> 365.2425 days long: https://en.wikipedia.org/wiki/Gregorian_calendar . The 
> calculation needs to be fixed in at least 3 places:
> - GroupStateImpl.scala:167:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - EventTimeWatermark.scala:32:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)
> *BEFORE*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.4516129
> {code}
> *AFTER*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.45996838
> {code}
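A quick sanity check of the two numbers above (a sketch only, assuming the 
fractional part of months_between is the 14 remaining days divided by the 
assumed month length):
{code}
// 1970-01-01 to 2019-09-15: 596 whole months plus 14 remaining days.
val wholeMonths = (2019 - 1970) * 12 + (9 - 1)              // 596
val remainingDays = 15 - 1                                  // 14

val before = wholeMonths + remainingDays / 31.0             // 31-day month
val after  = wholeMonths + remainingDays / (365.2425 / 12)  // mean month

println(f"$before%.7f")   // 596.4516129
println(f"$after%.8f")    // 596.45996838
{code}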



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month

2019-10-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29328:
--
Affects Version/s: 2.1.3

> Incorrect calculation mean seconds per month
> 
>
> Key: SPARK-29328
> URL: https://issues.apache.org/jira/browse/SPARK-29328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4
>Reporter: Maxim Gekk
>Priority: Minor
>
> The existing implementation assumes 31 days per month, i.e. 372 days per 
> year, which is far from the correct value. Since SPARK-26651, Spark uses the 
> proleptic Gregorian calendar by default, in which the average year is 
> 365.2425 days long: https://en.wikipedia.org/wiki/Gregorian_calendar . The 
> calculation needs to be fixed in at least 3 places:
> - GroupStateImpl.scala:167:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - EventTimeWatermark.scala:32:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)
> *BEFORE*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.4516129
> {code}
> *AFTER*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.45996838
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2

2019-10-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29305:


Assignee: angerszhu

> Update LICENSE and NOTICE for hadoop 3.2
> 
>
> Key: SPARK-29305
> URL: https://issues.apache.org/jira/browse/SPARK-29305
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> {code}
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5  
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 
> com.fasterxml.woodstox:woodstox-core:5.0.3
> com.github.stephenc.jcip:jcip-annotations:1.0-1   
> com.google.re2j:re2j:1.1  
> com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 
> com.nimbusds:nimbus-jose-jwt:4.41.1   
> dnsjava:dnsjava:2.1.7 
> net.minidev:accessors-smart:1.2   
> net.minidev:json-smart:2.3
> org.apache.commons:commons-configuration2:2.1.1   
> org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1
> org.apache.hadoop:hadoop-hdfs-client:3.2.0 
> org.apache.kerby:kerb-admin:1.0.1 
> org.apache.kerby:kerb-client:1.0.1
> org.apache.kerby:kerb-common:1.0.1
> org.apache.kerby:kerb-core:1.0.1  
> org.apache.kerby:kerb-crypto:1.0.1
> org.apache.kerby:kerb-identity:1.0.1  
> org.apache.kerby:kerb-server:1.0.1
> org.apache.kerby:kerb-simplekdc:1.0.1 
> org.apache.kerby:kerb-util:1.0.1  
> org.apache.kerby:kerby-asn1:1.0.1 
> org.apache.kerby:kerby-config:1.0.1   
> org.apache.kerby:kerby-pkix:1.0.1 
> org.apache.kerby:kerby-util:1.0.1 
> org.apache.kerby:kerby-xdr:1.0.1  
> org.apache.kerby:token-provider:1.0.1 
> org.codehaus.woodstox:stax2-api:3.1.4 
> org.ehcache:ehcache:3.3.1 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2

2019-10-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29305.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25978
[https://github.com/apache/spark/pull/25978]

> Update LICENSE and NOTICE for hadoop 3.2
> 
>
> Key: SPARK-29305
> URL: https://issues.apache.org/jira/browse/SPARK-29305
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code}
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5  
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 
> com.fasterxml.woodstox:woodstox-core:5.0.3
> com.github.stephenc.jcip:jcip-annotations:1.0-1   
> com.google.re2j:re2j:1.1  
> com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 
> com.nimbusds:nimbus-jose-jwt:4.41.1   
> dnsjava:dnsjava:2.1.7 
> net.minidev:accessors-smart:1.2   
> net.minidev:json-smart:2.3
> org.apache.commons:commons-configuration2:2.1.1   
> org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1
> org.apache.hadoop:hadoop-hdfs-client:3.2.0 
> org.apache.kerby:kerb-admin:1.0.1 
> org.apache.kerby:kerb-client:1.0.1
> org.apache.kerby:kerb-common:1.0.1
> org.apache.kerby:kerb-core:1.0.1  
> org.apache.kerby:kerb-crypto:1.0.1
> org.apache.kerby:kerb-identity:1.0.1  
> org.apache.kerby:kerb-server:1.0.1
> org.apache.kerby:kerb-simplekdc:1.0.1 
> org.apache.kerby:kerb-util:1.0.1  
> org.apache.kerby:kerby-asn1:1.0.1 
> org.apache.kerby:kerby-config:1.0.1   
> org.apache.kerby:kerby-pkix:1.0.1 
> org.apache.kerby:kerby-util:1.0.1 
> org.apache.kerby:kerby-xdr:1.0.1  
> org.apache.kerby:token-provider:1.0.1 
> org.codehaus.woodstox:stax2-api:3.1.4 
> org.ehcache:ehcache:3.3.1 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2

2019-10-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29305:
-
Priority: Minor  (was: Major)

> Update LICENSE and NOTICE for hadoop 3.2
> 
>
> Key: SPARK-29305
> URL: https://issues.apache.org/jira/browse/SPARK-29305
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
>
> {code}
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5  
> com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 
> com.fasterxml.woodstox:woodstox-core:5.0.3
> com.github.stephenc.jcip:jcip-annotations:1.0-1   
> com.google.re2j:re2j:1.1  
> com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 
> com.nimbusds:nimbus-jose-jwt:4.41.1   
> dnsjava:dnsjava:2.1.7 
> net.minidev:accessors-smart:1.2   
> net.minidev:json-smart:2.3
> org.apache.commons:commons-configuration2:2.1.1   
> org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1
> org.apache.hadoop:hadoop-hdfs-client:3.2.0 
> org.apache.kerby:kerb-admin:1.0.1 
> org.apache.kerby:kerb-client:1.0.1
> org.apache.kerby:kerb-common:1.0.1
> org.apache.kerby:kerb-core:1.0.1  
> org.apache.kerby:kerb-crypto:1.0.1
> org.apache.kerby:kerb-identity:1.0.1  
> org.apache.kerby:kerb-server:1.0.1
> org.apache.kerby:kerb-simplekdc:1.0.1 
> org.apache.kerby:kerb-util:1.0.1  
> org.apache.kerby:kerby-asn1:1.0.1 
> org.apache.kerby:kerby-config:1.0.1   
> org.apache.kerby:kerby-pkix:1.0.1 
> org.apache.kerby:kerby-util:1.0.1 
> org.apache.kerby:kerby-xdr:1.0.1  
> org.apache.kerby:token-provider:1.0.1 
> org.codehaus.woodstox:stax2-api:3.1.4 
> org.ehcache:ehcache:3.3.1 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month

2019-10-02 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29328:
---
Labels: correctness  (was: )

> Incorrect calculation mean seconds per month
> 
>
> Key: SPARK-29328
> URL: https://issues.apache.org/jira/browse/SPARK-29328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4
>Reporter: Maxim Gekk
>Priority: Minor
>  Labels: correctness
>
> The existing implementation assumes 31 days per month, i.e. 372 days per 
> year, which is far from the correct value. Since SPARK-26651, Spark uses the 
> proleptic Gregorian calendar by default, in which the average year is 
> 365.2425 days long: https://en.wikipedia.org/wiki/Gregorian_calendar . The 
> calculation needs to be fixed in at least 3 places:
> - GroupStateImpl.scala:167:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - EventTimeWatermark.scala:32:val millisPerMonth = 
> TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31
> - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31)
> *BEFORE*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.4516129
> {code}
> *AFTER*
> {code}
> spark-sql> select months_between('2019-09-15', '1970-01-01');
> 596.45996838
> {code}
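The same 31-day assumption also affects the millisPerMonth conversion quoted 
above (GroupStateImpl / EventTimeWatermark). A minimal sketch of the difference 
for a "1 month" delay, with illustrative names only, not the actual Spark API:
{code}
import java.util.concurrent.TimeUnit

// Sketch: converting a "1 month" delay to milliseconds.
val millisPerDay = TimeUnit.DAYS.toMillis(1)

val with31DayMonth = 31L * millisPerDay                        // 2678400000
val withMeanMonth  = math.round(365.2425 / 12 * millisPerDay)  // 2629746000

// The two differ by roughly 0.56 days per month of delay.
println((with31DayMonth - withMeanMonth) / millisPerDay.toDouble)  // ~0.563
{code}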



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org