[jira] [Created] (SPARK-29326) ANSI store assignment policy: throw exception on insertion failure
Gengliang Wang created SPARK-29326: -- Summary: ANSI store assignment policy: throw exception on insertion failure Key: SPARK-29326 URL: https://issues.apache.org/jira/browse/SPARK-29326 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang As per the ANSI SQL standard, the ANSI store assignment policy should throw an exception on insertion failure, such as inserting an out-of-range value into a numeric field. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
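A minimal PySpark sketch of the behavior this sub-task asks for, assuming the {{spark.sql.storeAssignmentPolicy}} configuration targeted for Spark 3.0; the table name and values are illustrative only:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: "ANSI" is an accepted value of spark.sql.storeAssignmentPolicy.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

# Hypothetical table: TINYINT only holds -128..127.
spark.sql("CREATE TABLE bytes_tbl (b TINYINT) USING parquet")

# Under the ANSI policy this insert should fail at runtime with an exception
# instead of silently overflowing or writing NULL.
try:
    spark.sql("INSERT INTO bytes_tbl VALUES (1000)")
except Exception as e:
    print("insertion rejected:", e)
{code}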
[jira] [Created] (SPARK-29327) Support specifying features via multiple columns in Predictor and PredictionModel
Liangcai Li created SPARK-29327: --- Summary: Support specifying features via multiple columns in Predictor and PredictionModel Key: SPARK-29327 URL: https://issues.apache.org/jira/browse/SPARK-29327 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 3.0.0 Reporter: Liangcai Li A classification/regression task usually has more than one feature; however, the current API for specifying feature columns in Predictor of Spark MLlib only supports a single column, which requires users to assemble the multiple feature columns into an "org.apache.spark.ml.linalg.Vector" before fitting a Spark ML pipeline. This improvement lets users specify the feature columns directly, without vectorization. To support this, we can introduce two new APIs in both "Predictor" and "PredictionModel", and a new parameter named "featuresCols" storing the feature column names as an Array. (A PR is ready here: [https://github.com/apache/spark/pull/25983]) *APIs:* {{def setFeaturesCol(value: Array[String]): M = ...}} {{protected def isSupportMultiColumnsForFeatures: Boolean = false}} *Parameter:* {{final val featuresCols: StringArrayParam = new StringArrayParam(this, "featuresCols", ...)}} Then ML implementations can read the feature column names from the new "featuresCols" parameter and consume the raw feature data directly from the separate columns of the dataset. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
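For context, a rough PySpark sketch of the assembly boilerplate this proposal aims to remove; the column names and DataFrame are illustrative:

{code:python}
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Today the separate feature columns must first be packed into one vector column
# before any Predictor can consume them.
assembler = VectorAssembler(inputCols=["age", "income", "score"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
# model = pipeline.fit(training_df)  # training_df: hypothetical DataFrame with the columns above
{code}

The proposed {{setFeaturesCol(Array[String])}} overload would let the estimator read those columns directly, dropping the VectorAssembler stage.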
[jira] [Created] (SPARK-29328) Incorrect calculation mean seconds per month
Maxim Gekk created SPARK-29328: -- Summary: Incorrect calculation mean seconds per month Key: SPARK-29328 URL: https://issues.apache.org/jira/browse/SPARK-29328 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: Maxim Gekk The existing implementation assumes 31 days per month, or 372 days per year, which is far from the correct number. Spark uses the proleptic Gregorian calendar by default (SPARK-26651), in which the average year is 365.2425 days long: https://en.wikipedia.org/wiki/Gregorian_calendar . The calculation needs to be fixed in at least 3 places: - GroupStateImpl.scala:167:val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 - EventTimeWatermark.scala:32:val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
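A small sketch of the arithmetic behind the proposed fix, using the 365.2425-day average Gregorian year cited above:

{code:python}
SECONDS_PER_DAY = 24 * 60 * 60        # 86400
DAYS_PER_YEAR = 365.2425              # average proleptic Gregorian year

# What the current code effectively assumes (31-day months):
seconds_per_month_now = 31 * SECONDS_PER_DAY              # 2678400

# Using the Gregorian average instead (~30.436875 days per month):
seconds_per_month_avg = DAYS_PER_YEAR / 12 * SECONDS_PER_DAY
print(seconds_per_month_avg)                              # 2629746.0
{code}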
[jira] [Created] (SPARK-29329) maven incremental builds not working
Thomas Graves created SPARK-29329: - Summary: maven incremental builds not working Key: SPARK-29329 URL: https://issues.apache.org/jira/browse/SPARK-29329 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: Thomas Graves It looks like since we upgraded scala-maven-plugin to 4.2.0 (https://issues.apache.org/jira/browse/SPARK-28759), Spark incremental builds stopped working. Every time you build, it recompiles all files, which takes forever. It would be nice to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28759) Upgrade scala-maven-plugin to 4.2.0
[ https://issues.apache.org/jira/browse/SPARK-28759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942787#comment-16942787 ] Thomas Graves commented on SPARK-28759: --- I rolled back this commit and the incremental compile now works. Without incremental compiles the build takes forever so I'm against disabling it. I filed https://issues.apache.org/jira/browse/SPARK-29329 for us to look at. > Upgrade scala-maven-plugin to 4.2.0 > --- > > Key: SPARK-28759 > URL: https://issues.apache.org/jira/browse/SPARK-28759 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29329) maven incremental builds not working
[ https://issues.apache.org/jira/browse/SPARK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-29329: -- Description: It looks like since we Upgraded scala-maven-plugin to 4.2.0 https://issues.apache.org/jira/browse/SPARK-28759 spark incremental builds stop working. Everytime you build its building all files, which takes forever. It would be nice to fix this. To reproduce, just build spark once ( I happened to be using the command below): build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests Then build it again and you will see that it compiles all the files and takes 15-30 minutes. With incremental it skips all unnecessary files and takes closer to 5 minutes. was: It looks like since we Upgraded scala-maven-plugin to 4.2.0 https://issues.apache.org/jira/browse/SPARK-28759 spark incremental builds stop working. Everytime you build its building all files, which takes forever. It would be nice to fix this. > maven incremental builds not working > > > Key: SPARK-29329 > URL: https://issues.apache.org/jira/browse/SPARK-29329 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It looks like since we Upgraded scala-maven-plugin to 4.2.0 > https://issues.apache.org/jira/browse/SPARK-28759 spark incremental builds > stop working. Everytime you build its building all files, which takes > forever. > It would be nice to fix this. > > To reproduce, just build spark once ( I happened to be using the command > below): > build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl > -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests > Then build it again and you will see that it compiles all the files and takes > 15-30 minutes. With incremental it skips all unnecessary files and takes > closer to 5 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29329) maven incremental builds not working
[ https://issues.apache.org/jira/browse/SPARK-29329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942790#comment-16942790 ] Thomas Graves commented on SPARK-29329: --- there are few comments on SPARK-28759 in regards to this, see: https://issues.apache.org/jira/browse/SPARK-28759?focusedCommentId=16942407&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16942407 > maven incremental builds not working > > > Key: SPARK-29329 > URL: https://issues.apache.org/jira/browse/SPARK-29329 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > It looks like since we Upgraded scala-maven-plugin to 4.2.0 > https://issues.apache.org/jira/browse/SPARK-28759 spark incremental builds > stop working. Everytime you build its building all files, which takes > forever. > It would be nice to fix this. > > To reproduce, just build spark once ( I happened to be using the command > below): > build/mvn -Phadoop-3.2 -Phive-thriftserver -Phive -Pyarn -Pkinesis-asl > -Pkubernetes -Pmesos -Phadoop-cloud -Pspark-ganglia-lgpl package -DskipTests > Then build it again and you will see that it compiles all the files and takes > 15-30 minutes. With incremental it skips all unnecessary files and takes > closer to 5 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942809#comment-16942809 ] Maciej Szymkiewicz commented on SPARK-29212: [~podongfeng] It sounds about right. I will also argue that, conditioning on 1., we should remove Java specific mixins, if they don't serve any practical value (provide no implementation whatsoever, like {{JavaPredictorParams}}, or have no JVM wrapper specific implementation, like {{JavaPredictor}}). As of the second point there is additional consideration here - some {{Java*}} classes are considered part of the public API, and this should stay as is (these provide crucial information to the end user). However deeper we go, the less useful they are (once again conditioning on 1.). On a side note current approach to ML API requires a lot of boilerplate code. Lately I've been playing with [some ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611], that wouldn't require code generation - they have some caveats, but maybe there is something there. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > copyed from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. 
> ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > > Maciej's *Proposal*: > {code:java} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* va
[jira] [Comment Edited] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942809#comment-16942809 ] Maciej Szymkiewicz edited comment on SPARK-29212 at 10/2/19 1:41 PM: - [~podongfeng] It sounds about right. I will also argue that, conditioning on 1., we should remove Java specific mixins, if they don't serve any practical value (provide no implementation whatsoever or don't extend other {{Java*}} mixins, like {{JavaPredictorParams}}, or have no JVM wrapper specific implementation, like {{JavaPredictor}}). As of the second point there is additional consideration here - some {{Java*}} classes are considered part of the public API, and this should stay as is (these provide crucial information to the end user). However deeper we go, the less useful they are (once again conditioning on 1.). On a side note current approach to ML API requires a lot of boilerplate code. Lately I've been playing with [some ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611], that wouldn't require code generation - they have some caveats, but maybe there is something there. was (Author: zero323): [~podongfeng] It sounds about right. I will also argue that, conditioning on 1., we should remove Java specific mixins, if they don't serve any practical value (provide no implementation whatsoever, like {{JavaPredictorParams}}, or have no JVM wrapper specific implementation, like {{JavaPredictor}}). As of the second point there is additional consideration here - some {{Java*}} classes are considered part of the public API, and this should stay as is (these provide crucial information to the end user). However deeper we go, the less useful they are (once again conditioning on 1.). On a side note current approach to ML API requires a lot of boilerplate code. Lately I've been playing with [some ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611], that wouldn't require code generation - they have some caveats, but maybe there is something there. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > copyed from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. 
> *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Prot
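To make the "pure Python ML classifier without a JVM backend" use case above concrete, here is a minimal sketch built only on the public {{pyspark.ml}} base classes; the thresholding logic and column names are invented for illustration, and a real implementation would also handle params, uids and persistence:

{code:python}
from pyspark.ml import Estimator, Model
from pyspark.sql import functions as F

class MeanThresholdModel(Model):
    """Pure-Python model: predicts 1.0 when `feature` exceeds the learned mean."""
    def __init__(self, mean):
        super().__init__()
        self.mean = mean

    def _transform(self, dataset):
        return dataset.withColumn(
            "prediction", (F.col("feature") > F.lit(self.mean)).cast("double"))

class MeanThresholdEstimator(Estimator):
    """Pure-Python estimator with no JVM counterpart."""
    def _fit(self, dataset):
        mean = dataset.agg(F.avg("feature")).first()[0]
        return MeanThresholdModel(mean)
{code}

As the discussion notes, such a class can only extend the high-level ABCs ({{Estimator}}, {{Model}}); there is currently no public {{Classifier}}/{{ClassificationModel}} layer to position it in.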
[jira] [Assigned] (SPARK-28970) implement USE CATALOG/NAMESPACE for Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-28970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28970: --- Assignee: Terry Kim > implement USE CATALOG/NAMESPACE for Data Source V2 > -- > > Key: SPARK-28970 > URL: https://issues.apache.org/jira/browse/SPARK-28970 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Terry Kim >Priority: Major > > Currently Spark has a `USE abc` command to switch the current database. > We should have something similar for Data Source V2, to switch the current > catalog and/or current namespace. > We can introduce 2 new command: `USE CATALOG abc` and `USE NAMESPACE abc` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28970) implement USE CATALOG/NAMESPACE for Data Source V2
[ https://issues.apache.org/jira/browse/SPARK-28970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28970. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25771 [https://github.com/apache/spark/pull/25771] > implement USE CATALOG/NAMESPACE for Data Source V2 > -- > > Key: SPARK-28970 > URL: https://issues.apache.org/jira/browse/SPARK-28970 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Currently Spark has a `USE abc` command to switch the current database. > We should have something similar for Data Source V2, to switch the current > catalog and/or current namespace. > We can introduce 2 new command: `USE CATALOG abc` and `USE NAMESPACE abc` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29330) Allow users to chose the name of Spark Shuffle service
Alexander Bessonov created SPARK-29330: -- Summary: Allow users to chose the name of Spark Shuffle service Key: SPARK-29330 URL: https://issues.apache.org/jira/browse/SPARK-29330 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 2.4.4 Reporter: Alexander Bessonov As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the Shuffle Service. HDP distribution of Spark, on the other hand, uses [{{spark2_shuffle}}|#L117]]. This is done to be able to run both Spark 1.6 and Spark 2.x on the same Hadoop cluster. Running vanilla Spark on HDP cluster with only Spark 2.x shuffle service (HDP favor) running becomes impossible due to the shuffle service name mismatch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29330) Allow users to chose the name of Spark Shuffle service
[ https://issues.apache.org/jira/browse/SPARK-29330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Bessonov updated SPARK-29330: --- Description: As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the Shuffle Service. HDP distribution of Spark, on the other hand, uses [{{spark2_shuffle}}|https://github.com/hortonworks/spark2-release/blob/HDP-3.1.0.0-78-tag/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L117]. This is done to be able to run both Spark 1.6 and Spark 2.x on the same Hadoop cluster. Running vanilla Spark on HDP cluster with only Spark 2.x shuffle service (HDP favor) running becomes impossible due to the shuffle service name mismatch. was: As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the Shuffle Service. HDP distribution of Spark, on the other hand, uses [{{spark2_shuffle}}|#L117]]. This is done to be able to run both Spark 1.6 and Spark 2.x on the same Hadoop cluster. Running vanilla Spark on HDP cluster with only Spark 2.x shuffle service (HDP favor) running becomes impossible due to the shuffle service name mismatch. > Allow users to chose the name of Spark Shuffle service > -- > > Key: SPARK-29330 > URL: https://issues.apache.org/jira/browse/SPARK-29330 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 2.4.4 >Reporter: Alexander Bessonov >Priority: Minor > > As of now, Spark uses hardcoded value {{spark_shuffle}} as the name of the > Shuffle Service. > HDP distribution of Spark, on the other hand, uses > [{{spark2_shuffle}}|https://github.com/hortonworks/spark2-release/blob/HDP-3.1.0.0-78-tag/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L117]. > This is done to be able to run both Spark 1.6 and Spark 2.x on the same > Hadoop cluster. > Running vanilla Spark on HDP cluster with only Spark 2.x shuffle service (HDP > favor) running becomes impossible due to the shuffle service name mismatch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29331) create DS v2 Write at physical plan
Wenchen Fan created SPARK-29331: --- Summary: create DS v2 Write at physical plan Key: SPARK-29331 URL: https://issues.apache.org/jira/browse/SPARK-29331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
[ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942906#comment-16942906 ] Simeon H.K. Fitch commented on SPARK-13802: --- Is there a workaround to this problem? Ordering is important when encoders are used to reify structs into Scala types, and not being able to specify the order (without a lot of boilerplate schema work) results in Exceptions. > Fields order in Row(**kwargs) is not consistent with Schema.toInternal method > - > > Key: SPARK-13802 > URL: https://issues.apache.org/jira/browse/SPARK-13802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Szymon Matejczyk >Priority: Major > > When using Row constructor from kwargs, fields in the tuple underneath are > sorted by name. When Schema is reading the row, it is not using the fields in > this order. > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > schema = StructType([ > StructField("id", StringType()), > StructField("first_name", StringType())]) > row = Row(id="39", first_name="Szymon") > schema.toInternal(row) > Out[5]: ('Szymon', '39') > {code} > {code} > df = sqlContext.createDataFrame([row], schema) > df.show(1) > +--+--+ > |id|first_name| > +--+--+ > |Szymon|39| > +--+--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13802) Fields order in Row(**kwargs) is not consistent with Schema.toInternal method
[ https://issues.apache.org/jira/browse/SPARK-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942912#comment-16942912 ] Maciej Szymkiewicz commented on SPARK-13802: [~metasim] namedtuples are the simplest and the most efficient replacement (https://stackoverflow.com/a/49949762). > Fields order in Row(**kwargs) is not consistent with Schema.toInternal method > - > > Key: SPARK-13802 > URL: https://issues.apache.org/jira/browse/SPARK-13802 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.0 >Reporter: Szymon Matejczyk >Priority: Major > > When using Row constructor from kwargs, fields in the tuple underneath are > sorted by name. When Schema is reading the row, it is not using the fields in > this order. > {code} > from pyspark.sql import Row > from pyspark.sql.types import * > schema = StructType([ > StructField("id", StringType()), > StructField("first_name", StringType())]) > row = Row(id="39", first_name="Szymon") > schema.toInternal(row) > Out[5]: ('Szymon', '39') > {code} > {code} > df = sqlContext.createDataFrame([row], schema) > df.show(1) > +--+--+ > |id|first_name| > +--+--+ > |Szymon|39| > +--+--+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
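A minimal sketch of the namedtuple workaround mentioned above, reusing the schema from the report; it keeps field order aligned with the schema:

{code:python}
from collections import namedtuple
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("id", StringType()),
    StructField("first_name", StringType())])

# Unlike Row(**kwargs), a namedtuple preserves the declared field order.
Person = namedtuple("Person", ["id", "first_name"])
row = Person(id="39", first_name="Szymon")

schema.toInternal(row)                        # ('39', 'Szymon')
# df = spark.createDataFrame([row], schema)   # columns now line up with the values
{code}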
[jira] [Resolved] (SPARK-29273) Spark peakExecutionMemory metrics is zero
[ https://issues.apache.org/jira/browse/SPARK-29273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-29273. Fix Version/s: 3.0.0 Assignee: huangweiyi Resolution: Fixed > Spark peakExecutionMemory metrics is zero > - > > Key: SPARK-29273 > URL: https://issues.apache.org/jira/browse/SPARK-29273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.3 > Environment: hadoop 2.7.3 > spark 2.4.3 > jdk 1.8.0_60 >Reporter: huangweiyi >Assignee: huangweiyi >Priority: Major > Fix For: 3.0.0 > > > with spark 2.4.3 in our production environment, i want to get the > peakExecutionMemory which is exposed by the TaskMetrics, but alway get the > zero value -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29332) Upgrade zstd-jni library to 1.4.3
Dongjoon Hyun created SPARK-29332: - Summary: Upgrade zstd-jni library to 1.4.3 Key: SPARK-29332 URL: https://issues.apache.org/jira/browse/SPARK-29332 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29333) Sample weight in RandomForestRegressor
Jiaqi Guo created SPARK-29333: - Summary: Sample weight in RandomForestRegressor Key: SPARK-29333 URL: https://issues.apache.org/jira/browse/SPARK-29333 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.4.4 Reporter: Jiaqi Guo I think there have been some tickets that are related to this feature request. Even though the tickets earlier have been designated with resolved status, it still seems impossible to add sample weight to random forest classifier/regressor. The possibility of having sample weight is definitely useful for many use cases, for example class imbalance and weighted bias correction for the samples. I think the sample weight should be considered in the splitting criterion. Please correct me if I am missing the new feature. Otherwise, it would be great to have an update on whether we have a path forward supporting this in the near future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29334) Supported vector operators in scala should have parity with pySpark
Patrick Pisciuneri created SPARK-29334: -- Summary: Supported vector operators in scala should have parity with pySpark Key: SPARK-29334 URL: https://issues.apache.org/jira/browse/SPARK-29334 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.3.5, 2.4.5, 3.0.0 Reporter: Patrick Pisciuneri pySpark supports various overloaded operators for the DenseVector type that the scala class does not support. # ML https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462 # MLLIB https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506 We should be able to leverage the BLAS wrappers to implement these methods on the scala side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29334) Supported vector operators in scala should have parity with pySpark
[ https://issues.apache.org/jira/browse/SPARK-29334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Pisciuneri updated SPARK-29334: --- Description: pySpark supports various overloaded operators for the DenseVector type that the scala class does not support. - ML: https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462 - MLLIB: https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506 We should be able to leverage the BLAS wrappers to implement these methods on the scala side. was: pySpark supports various overloaded operators for the DenseVector type that the scala class does not support. # ML https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462 # MLLIB https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506 We should be able to leverage the BLAS wrappers to implement these methods on the scala side. > Supported vector operators in scala should have parity with pySpark > > > Key: SPARK-29334 > URL: https://issues.apache.org/jira/browse/SPARK-29334 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.3.5, 2.4.5, 3.0.0 >Reporter: Patrick Pisciuneri >Priority: Minor > > pySpark supports various overloaded operators for the DenseVector type that > the scala class does not support. > - ML: > https://github.com/apache/spark/blob/master/python/pyspark/ml/linalg/__init__.py#L441-L462 > - MLLIB: > https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg/__init__.py#L485-L506 > We should be able to leverage the BLAS wrappers to implement these methods on > the scala side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
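For reference, a short snippet showing the PySpark operator overloads in question (the Scala {{DenseVector}} currently has no equivalents):

{code:python}
from pyspark.ml.linalg import DenseVector

u = DenseVector([1.0, 2.0, 3.0])
v = DenseVector([4.0, 5.0, 6.0])

print(u + v)   # DenseVector([5.0, 7.0, 9.0])   element-wise addition
print(u * v)   # DenseVector([4.0, 10.0, 18.0]) element-wise multiplication
print(u / 2)   # DenseVector([0.5, 1.0, 1.5])   scalar division
{code}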
[jira] [Updated] (SPARK-28917) Jobs can hang because of race of RDD.dependencies
[ https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-28917: - Description: {{RDD.dependencies}} stores the precomputed cache value, but it is not thread-safe. This can lead to a race where the value gets overwritten, but the DAGScheduler gets stuck in an inconsistent state. In particular, this can happen when there is a race between the DAGScheduler event loop, and another thread (eg. a user thread, if there is multi-threaded job submission). First, a job is submitted by the user, which then computes the result Stage and its parents: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983 Which eventually makes a call to {{rdd.dependencies}}: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519 At the same time, the user could also touch {{rdd.dependencies}} in another thread, which could overwrite the stored value because of the race. Then the DAGScheduler checks the dependencies *again* later on in the job submission, via {{getMissingParentStages}} https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025 Because it will find new dependencies, it will create entirely different stages. Now the job has some orphaned stages which will never run. One symptoms of this are seeing disjoint sets of stages in the "Parents of final stage" and the "Missing parents" messages on job submission (however this is not required). (*EDIT*: Seeing repeated msgs "Registering RDD X" actually is just fine, it is not a symptom of a problem at all. It just means the RDD is the *input* to multiple shuffles.) {noformat} [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - Starting job: count at XXX.scala:462 ... [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421) ... ... [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 5 (count at XXX.scala:462) [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List(ShuffleMapStage 4) [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Registering RDD 14 (repartition at XXX.scala:421) [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler logInfo - Missing parents: List(ShuffleMapStage 6) {noformat} Another symptom is only visible with DEBUG logs turned on for DAGScheduler -- you will calls to {{submitStage(Stage X)}} multiple times, followed by a different set of missing stages. eg. here, we see stage 1 first is missing stage 0 as a dependency, and then later on its missing stage 23: {noformat} 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1) 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 0) ... 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1) 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 23) {noformat} Note that there is a similar issue w/ {{rdd.partitions}}. 
In particular for some RDDs, {{partitions}} references {{dependencies}} (eg. {{CoGroupedRDD}}). There is also an issue that {{rdd.storageLevel}} is read and cached in the scheduler, but it could be modified simultaneously by the user in another thread. But, I can't see a way it could effect the scheduler. *WORKAROUND*: (a) call {{rdd.dependencies}} while you know that RDD is only getting touched by one thread (eg. in the thread that created it, or before you submit multiple jobs touching that RDD from other threads). Then that value will get cached. (b) don't submit jobs from multiple threads. was: {{RDD.dependencies}} stores the precomputed cache value, but it is not thread-safe. This can lead to a race where the value gets overwritten, but the DAGScheduler gets stuck in an inconsistent state. In particular, this can happen when there is a race between the DAGScheduler event loop, and another thread (eg. a user thread, if there is multi-threaded job submission). First, a job is submitted by the user, which then computes the result Stage and its parents: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983 Which eventually makes a call to {{rdd.dependencies}}: https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scal
[jira] [Assigned] (SPARK-29332) Upgrade zstd-jni library to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-29332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29332: - Assignee: Dongjoon Hyun > Upgrade zstd-jni library to 1.4.3 > - > > Key: SPARK-29332 > URL: https://issues.apache.org/jira/browse/SPARK-29332 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29332) Upgrade zstd-jni library to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-29332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29332. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26002 [https://github.com/apache/spark/pull/26002] > Upgrade zstd-jni library to 1.4.3 > - > > Key: SPARK-29332 > URL: https://issues.apache.org/jira/browse/SPARK-29332 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943095#comment-16943095 ] Peter Toth commented on SPARK-29078: [~misutoth], if we look closer at the stacktrace ({{at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)}}) it shows that the AccessControlException issue is with the default database existence check (on master branch this corresponds to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L139: {noformat} // Create default database if it doesn't exist if (!externalCatalog.databaseExists(SessionCatalog.DEFAULT_DATABASE)) { // There may be another Spark application creating default database at the same time, here we {noformat} This exception is expected if the user doesn't have access to the directory of the default database. In that case the user can't use Spark SQL. I would suggest closing this ticket. > Spark shell fails if read permission is not granted to hive warehouse > directory > --- > > Key: SPARK-29078 > URL: https://issues.apache.org/jira/browse/SPARK-29078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mihaly Toth >Priority: Major > > Similarly to SPARK-20256, in {{SharedSessionState}} when > {{GlobalTempViewManager}} is created, it is checked that there is no database > exists that has the same name as of the global temp database (name is > configurable with {{spark.sql.globalTempDatabase}}) , because that is a > special database, which should not exist in the metastore. For this, a read > permission is required on the warehouse directory at the moment, which on the > other hand would allow listing all the databases of all users. > When such a read access is not granted for security reasons, an access > violation exception should be ignored upon such initial validation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27297) Add higher order functions to Scala API
[ https://issues.apache.org/jira/browse/SPARK-27297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-27297: - Assignee: Nikolas Vanderhoof > Add higher order functions to Scala API > --- > > Key: SPARK-27297 > URL: https://issues.apache.org/jira/browse/SPARK-27297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Nikolas Vanderhoof >Assignee: Nikolas Vanderhoof >Priority: Major > > There is currently no existing Scala API equivalent for the higher order > functions introduced in Spark 2.4.0. > * transform > * aggregate > * filter > * exists > * zip_with > * map_zip_with > * map_filter > * transform_values > * transform_keys > Equivalent column based functions should be added to the Scala API for > org.apache.spark.sql.functions with the following signatures: > > {code:scala} > def transform(column: Column, f: Column => Column): Column = ??? > def transform(column: Column, f: (Column, Column) => Column): Column = ??? > def exists(column: Column, f: Column => Column): Column = ??? > def filter(column: Column, f: Column => Column): Column = ??? > def aggregate( > expr: Column, > zero: Column, > merge: (Column, Column) => Column, > finish: Column => Column): Column = ??? > def aggregate( > expr: Column, > zero: Column, > merge: (Column, Column) => Column): Column = ??? > def zip_with( > left: Column, > right: Column, > f: (Column, Column) => Column): Column = ??? > def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ??? > def transform_values(expr: Column, f: (Column, Column) => Column): Column = > ??? > def map_filter(expr: Column, f: (Column, Column) => Column): Column = ??? > def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => > Column): Column = ??? > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27297) Add higher order functions to Scala API
[ https://issues.apache.org/jira/browse/SPARK-27297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-27297. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 24232 https://github.com/apache/spark/pull/24232 > Add higher order functions to Scala API > --- > > Key: SPARK-27297 > URL: https://issues.apache.org/jira/browse/SPARK-27297 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Nikolas Vanderhoof >Assignee: Nikolas Vanderhoof >Priority: Major > Fix For: 3.0.0 > > > There is currently no existing Scala API equivalent for the higher order > functions introduced in Spark 2.4.0. > * transform > * aggregate > * filter > * exists > * zip_with > * map_zip_with > * map_filter > * transform_values > * transform_keys > Equivalent column based functions should be added to the Scala API for > org.apache.spark.sql.functions with the following signatures: > > {code:scala} > def transform(column: Column, f: Column => Column): Column = ??? > def transform(column: Column, f: (Column, Column) => Column): Column = ??? > def exists(column: Column, f: Column => Column): Column = ??? > def filter(column: Column, f: Column => Column): Column = ??? > def aggregate( > expr: Column, > zero: Column, > merge: (Column, Column) => Column, > finish: Column => Column): Column = ??? > def aggregate( > expr: Column, > zero: Column, > merge: (Column, Column) => Column): Column = ??? > def zip_with( > left: Column, > right: Column, > f: (Column, Column) => Column): Column = ??? > def transform_keys(expr: Column, f: (Column, Column) => Column): Column = ??? > def transform_values(expr: Column, f: (Column, Column) => Column): Column = > ??? > def map_filter(expr: Column, f: (Column, Column) => Column): Column = ??? > def map_zip_with(left: Column, right: Column, f: (Column, Column, Column) => > Column): Column = ??? > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943118#comment-16943118 ] Mihaly Toth commented on SPARK-29078: - But if the user has access to that directory (which is the hive warehouse directory), it can see what databases are there regardless of having access to those databases or not. This is not the worst security gap, so if we believe this is acceptable I dont mind closing this jira. > Spark shell fails if read permission is not granted to hive warehouse > directory > --- > > Key: SPARK-29078 > URL: https://issues.apache.org/jira/browse/SPARK-29078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mihaly Toth >Priority: Major > > Similarly to SPARK-20256, in {{SharedSessionState}} when > {{GlobalTempViewManager}} is created, it is checked that there is no database > exists that has the same name as of the global temp database (name is > configurable with {{spark.sql.globalTempDatabase}}) , because that is a > special database, which should not exist in the metastore. For this, a read > permission is required on the warehouse directory at the moment, which on the > other hand would allow listing all the databases of all users. > When such a read access is not granted for security reasons, an access > violation exception should be ignored upon such initial validation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28962) High-order function: filter(array, function) → array
[ https://issues.apache.org/jira/browse/SPARK-28962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-28962. --- Fix Version/s: 3.0.0 Assignee: Henry Davidge Resolution: Fixed Issue resolved by pull request 25666 https://github.com/apache/spark/pull/25666 > High-order function: filter(array, function) → array > --- > > Key: SPARK-28962 > URL: https://issues.apache.org/jira/browse/SPARK-28962 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Henry Davidge >Assignee: Henry Davidge >Priority: Major > Fix For: 3.0.0 > > > It's helpful to have access to the index when using the {{filter}} function. > For instance, we're using SparkSQL to manipulate genomic data. We store some > fields in a long array that has the same length for every row in the > DataFrame. We compute aggregates that are per array position (so we look at > the kth element for each row's array) and then want to filter each row's > array by looking values in the aggregate array. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
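A quick illustration of the indexed form of {{filter}} this ticket adds, run as SQL from PySpark; the array literal is made up and assumes the two-argument (element, index) lambda the change introduces:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The second lambda argument is the 0-based element index; keep even positions.
spark.sql("SELECT filter(array(10, 20, 30, 40), (x, i) -> i % 2 = 0) AS kept").show()
# +--------+
# |    kept|
# +--------+
# |[10, 30]|
# +--------+
{code}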
[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943127#comment-16943127 ] Peter Toth commented on SPARK-29078: I don't think there should be other databases under {{/apps/hive/warehouse}} directory if the {{default}} database points to {{/apps/hive/warehouse}}. > Spark shell fails if read permission is not granted to hive warehouse > directory > --- > > Key: SPARK-29078 > URL: https://issues.apache.org/jira/browse/SPARK-29078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mihaly Toth >Priority: Major > > Similarly to SPARK-20256, in {{SharedSessionState}} when > {{GlobalTempViewManager}} is created, it is checked that there is no database > exists that has the same name as of the global temp database (name is > configurable with {{spark.sql.globalTempDatabase}}) , because that is a > special database, which should not exist in the metastore. For this, a read > permission is required on the warehouse directory at the moment, which on the > other hand would allow listing all the databases of all users. > When such a read access is not granted for security reasons, an access > violation exception should be ignored upon such initial validation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943127#comment-16943127 ] Peter Toth edited comment on SPARK-29078 at 10/2/19 8:15 PM: - I don't think there should be other databases under {{/apps/hive/warehouse}} directory if the {{default}} database points to {{/apps/hive/warehouse}}. I mean that way we could avoid this issue. was (Author: petertoth): I don't think there should be other databases under {{/apps/hive/warehouse}} directory if the {{default}} database points to {{/apps/hive/warehouse}}. > Spark shell fails if read permission is not granted to hive warehouse > directory > --- > > Key: SPARK-29078 > URL: https://issues.apache.org/jira/browse/SPARK-29078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mihaly Toth >Priority: Major > > Similarly to SPARK-20256, in {{SharedSessionState}} when > {{GlobalTempViewManager}} is created, it is checked that there is no database > exists that has the same name as of the global temp database (name is > configurable with {{spark.sql.globalTempDatabase}}) , because that is a > special database, which should not exist in the metastore. For this, a read > permission is required on the warehouse directory at the moment, which on the > other hand would allow listing all the databases of all users. > When such a read access is not granted for security reasons, an access > violation exception should be ignored upon such initial validation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
Srini E created SPARK-29335: --- Summary: Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql Key: SPARK-29335 URL: https://issues.apache.org/jira/browse/SPARK-29335 Project: Spark Issue Type: Question Components: Optimizer Affects Versions: 2.3.0 Environment: We tried to execute the same using Spark-sql and Thrify server using SQLWorkbench but we are not able to use the stats. Reporter: Srini E We are trying to leverage CBO for getting better plan results for few critical queries run thru spark-sql or thru thrift server using jdbc driver. Following settings added to spark-defaults.conf *spark.sql.cbo.enabled true* *spark.experimental.extrastrategies intervaljoin* *spark.sql.cbo.joinreorder.enabled true* The tables that we are using are not partitioned. Spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ; analyze table arrow.t_fperiods_sundar compute statistics for columns eid, year, ptype, absref, fpid , pid ; analyze table arrow.t_fdata_sundar compute statistics ; analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, absref; Analyze completed success fully. Describe extended , does not show column level stats data and queries are not leveraging table or column level stats . we are using Oracle as our Hive Catalog store and not Glue . +*When we are using spark sql and running queries we are not able to see the stats in use in the explain plan and we are not sure if cbo is put to use.*+ *A quick response would be helpful.* *Explain Plan:* Following Explain command does not reference to any Statistics usage. spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 2017),(ptype#4546 = A),(eid#4542 = 29940),isnull(PID#4527),isnotnull(fpid#4523) 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(absref#4569),(absref#4569 = Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) == Parsed Logical Plan == 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && ('a12.eid = 29940)) && isnull('a12.PID))) +- 'Join Inner :- 'SubqueryAlias a12 : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` +- 'SubqueryAlias a13 +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` == Analyzed Logical Plan == imnem: string, fvalue: string, ptype: string, absref: string Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = cast(fpid#4567 
as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527))) +- Join Inner :- SubqueryAlias a12 : +- SubqueryAlias t_fperiods_sundar : +- Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546] parquet +- SubqueryAlias a13 +- SubqueryAlias t_fdata_sundar +- Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569] parquet == Optimized Logical Plan == Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] +- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as decimal(38,0 :- Project [FPID#4523, EID#4542, PTYPE#4546] : +- Filter (((isnotnull(ptype#4546) && isnotnull(year#4545)) && isnotnull(
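Sketched below is one way to check from PySpark whether the collected statistics are actually visible to the optimizer; the table and column names follow the report, the exact output layout varies by Spark version, and note the camelCase spelling of the join-reorder key (the all-lowercase form in the report may simply be ignored):

{code:python}
# Assumes an active SparkSession bound to `spark`.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")   # note: joinReorder, camelCase

spark.sql("ANALYZE TABLE arrow.t_fdata_sundar COMPUTE STATISTICS FOR COLUMNS eid, fpid, absref")

# Table-level stats (sizeInBytes, rowCount) show up in the 'Statistics' row:
spark.sql("DESCRIBE EXTENDED arrow.t_fdata_sundar").show(truncate=False)

# Column-level stats (min, max, num_nulls, distinct_count) for one column:
spark.sql("DESCRIBE EXTENDED arrow.t_fdata_sundar eid").show(truncate=False)
{code}

If the column rows above still show no statistics after ANALYZE, the metastore is not persisting them, which would explain why the query plans never reference stats.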
[jira] [Created] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected
Guilherme Souza created SPARK-29336: --- Summary: The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected Key: SPARK-29336 URL: https://issues.apache.org/jira/browse/SPARK-29336 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Reporter: Guilherme Souza -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected
[ https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guilherme Souza updated SPARK-29336: Shepherd: (was: Sean Zhong) Description: (sorry for the early submission, I'm still writing the description...) > The implementation of QuantileSummaries.merge does not guarantee the > relativeError will be respected > -- > > Key: SPARK-29336 > URL: https://issues.apache.org/jira/browse/SPARK-29336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Guilherme Souza >Priority: Minor > > (sorry for the early submission, I'm still writing the description...) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
[ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srini E updated SPARK-29335: Attachment: explain_plan_cbo_spark.txt > Cost Based Optimizer stats are not used while evaluating query plans in Spark > Sql > - > > Key: SPARK-29335 > URL: https://issues.apache.org/jira/browse/SPARK-29335 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 2.3.0 > Environment: We tried to execute the same using Spark-sql and Thrify > server using SQLWorkbench but we are not able to use the stats. >Reporter: Srini E >Priority: Major > Attachments: explain_plan_cbo_spark.txt > > > We are trying to leverage CBO for getting better plan results for few > critical queries run thru spark-sql or thru thrift server using jdbc driver. > Following settings added to spark-defaults.conf > *spark.sql.cbo.enabled true* > *spark.experimental.extrastrategies intervaljoin* > *spark.sql.cbo.joinreorder.enabled true* > > The tables that we are using are not partitioned. > Spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ; > analyze table arrow.t_fperiods_sundar compute statistics for columns eid, > year, ptype, absref, fpid , pid ; > analyze table arrow.t_fdata_sundar compute statistics ; > analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, > absref; > Analyze completed success fully. > Describe extended , does not show column level stats data and queries are not > leveraging table or column level stats . > we are using Oracle as our Hive Catalog store and not Glue . > > +*When we are using spark sql and running queries we are not able to see the > stats in use in the explain plan and we are not sure if cbo is put to use.*+ > > *A quick response would be helpful.* > > *Explain Plan:* > Following Explain command does not reference to any Statistics usage. > > spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref > from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = > a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 > and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* > > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = > 2017),(ptype#4546 = A),(eid#4542 = > 29940),isnull(PID#4527),isnotnull(fpid#4523) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... > 3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(absref#4569),(absref#4569 = > Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: > string ... 
3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) > == Parsed Logical Plan == > 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] > +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && > (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && > ('a12.eid = 29940)) && isnull('a12.PID))) > +- 'Join Inner > :- 'SubqueryAlias a12 > : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` > +- 'SubqueryAlias a13 > +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` > > == Analyzed Logical Plan == > imnem: string, fvalue: string, ptype: string, absref: string > Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] > +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = > cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = > 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = > cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527))) > +- Join Inner > :- SubqueryAlias a12 > : +- SubqueryAlias t_fperiods_sundar > : +- > Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546] > parquet
[jira] [Updated] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srini E updated SPARK-29337: Attachment: Cache+Image.png > How to Cache Table and Pin it in Memory and should not Spill to Disk on > Thrift Server > -- > > Key: SPARK-29337 > URL: https://issues.apache.org/jira/browse/SPARK-29337 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Srini E >Priority: Major > Attachments: Cache+Image.png > > > Hi Team, > How to pin the table in cache so it would not swap out of memory? > Situation: We are using Microstrategy BI reporting. Semantic layer is built. > We wanted to Cache highly used tables into CACHE using Spark SQL CACHE Table > ; we did cache for SPARK context( Thrift server). Please see > below snapshot of Cache table, went to disk over time. Initially it was all > in cache , now some in cache and some in disk. That disk may be local disk > relatively more expensive reading than from s3. Queries may take longer and > inconsistent times from user experience perspective. If More queries running > using Cache tables, copies of the cache table images are copied and copies > are not staying in memory causing reports to run longer. so how to pin the > table so would not swap to disk. Spark memory management is dynamic > allocation, and how to use those few tables to Pin in memory . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server
Srini E created SPARK-29337: --- Summary: How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server Key: SPARK-29337 URL: https://issues.apache.org/jira/browse/SPARK-29337 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.3.0 Reporter: Srini E Attachments: Cache+Image.png Hi Team, How can we pin a table in cache so it does not get evicted from memory? Situation: We are using MicroStrategy BI reporting, and a semantic layer is built. We wanted to cache heavily used tables with Spark SQL CACHE TABLE; we cached them in the Spark context (Thrift server). Please see the attached snapshot of the cached table: it went to disk over time. Initially it was all in cache; now some is in cache and some on disk. That disk may be local disk, which is relatively more expensive to read from than S3. Queries may take longer and show inconsistent times from a user-experience perspective. When more queries run against the cached tables, copies of the cached table images are made, and those copies do not stay in memory, causing reports to run longer. So how can we pin the table so it will not swap to disk? Spark memory management uses dynamic allocation, so how can we keep those few tables pinned in memory? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
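A minimal sketch of one commonly suggested approach, not a confirmed answer to this question: the SQL CACHE TABLE path uses MEMORY_AND_DISK by default, while persisting with MEMORY_ONLY never writes cached blocks to disk (partitions that do not fit are dropped and recomputed instead). The table name below is hypothetical.

{code:java}
import org.apache.spark.storage.StorageLevel

// Hypothetical hot table; MEMORY_ONLY keeps blocks in memory only and
// recomputes partitions that do not fit rather than spilling them to disk.
val hot = spark.table("arrow.some_hot_table")
hot.persist(StorageLevel.MEMORY_ONLY)
hot.count()  // materialize the cache eagerly
{code}

Whether this helps still depends on executor storage memory being large enough; Spark does not expose a hard "pin against eviction" knob, so sizing memory remains the main lever.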
[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected
[ https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guilherme Souza updated SPARK-29336: Summary: The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected (was: The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected ) > The implementation of QuantileSummaries.merge does not guarantee that the > relativeError will be respected > --- > > Key: SPARK-29336 > URL: https://issues.apache.org/jira/browse/SPARK-29336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Guilherme Souza >Priority: Minor > > Hello Spark maintainers, > I was experimenting with my own implementation of the [space-efficient > quantile > algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] > in another language and I was using the Spark's one as a reference. > In my analysis, I believe to have found an issue with the {{merge()}} logic. > Here is some simple Scala code that reproduces the issue I've found: > > {code:java} > var values = (1 to 100).toArray > val all_quantiles = values.indices.map(i => (i+1).toDouble / > values.length).toArray > for (n <- 0 until 5) { > var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) > val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) > val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray > val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) > => Math.abs(expected - answer) }).toArray > val max_error = error.max > print(max_error + "\n") > } > {code} > I query for all possible quantiles in a 100-element array with a desired 10% > max error. In this scenario, one would expect to observe a maximum error of > 10 ranks or less (10% of 100). However, the output I observe is: > > {noformat} > 16 > 12 > 10 > 11 > 17{noformat} > The variance is probably due to non-deterministic operations behind the > scenes, but irrelevant to the core cause. > Interestingly enough, if I change from five to one partition the code works > as expected and gives 10 every time. This seems to point to some problem at > the [merge > logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] > The original authors ([~clockfly] and [~cloud_fan] for what I could dig from > the history) suggest the published paper is not clear on how that should be > done and, honestly, I was not confident in the current approach either. > I've found SPARK-21184 that reports the same problem, but it was > unfortunately closed with no fix applied. > In my external implementation I believe to have found a sound way to > implement the merge method. [Here is my take in Rust, if relevant > |https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]I'd > be really glad to add unit tests and contribute my implementation adapted to > Scala. > I'd love to hear your opinion on the matter. > Best regards > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
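For what it is worth, the single-partition behaviour mentioned in the description can be checked from almost the same snippet by collapsing the partitions before calling approxQuantile (a small sketch along the lines of the reporter's code, assuming a spark-shell session):

{code:java}
import spark.implicits._

// Same data as the snippet above, but merged into a single partition first,
// so the merge path reported above is effectively avoided.
val values = (1 to 100).toArray
val quantiles = values.indices.map(i => (i + 1).toDouble / values.length).toArray
val df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
val answers = df.coalesce(1).stat.approxQuantile("value", quantiles, 0.1)
{code}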
[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee the relativeError will be respected
[ https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guilherme Souza updated SPARK-29336: Description: Hello Spark maintainers, I was experimenting with my own implementation of the [space-efficient quantile algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] in another language and I was using the Spark's one as a reference. In my analysis, I believe to have found an issue with the {{merge()}} logic. Here is some simple Scala code that reproduces the issue I've found: {code:java} var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } {code} I query for all possible quantiles in a 100-element array with a desired 10% max error. In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of 100). However, the output I observe is: {noformat} 16 12 10 11 17{noformat} The variance is probably due to non-deterministic operations behind the scenes, but irrelevant to the core cause. Interestingly enough, if I change from five to one partition the code works as expected and gives 10 every time. This seems to point to some problem at the [merge logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] The original authors ([~clockfly] and [~cloud_fan] for what I could dig from the history) suggest the published paper is not clear on how that should be done and, honestly, I was not confident in the current approach either. I've found SPARK-21184 that reports the same problem, but it was unfortunately closed with no fix applied. In my external implementation I believe to have found a sound way to implement the merge method. [Here is my take in Rust, if relevant |https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]I'd be really glad to add unit tests and contribute my implementation adapted to Scala. I'd love to hear your opinion on the matter. Best regards was:(sorry for the early submission, I'm still writing the description...) > The implementation of QuantileSummaries.merge does not guarantee the > relativeError will be respected > -- > > Key: SPARK-29336 > URL: https://issues.apache.org/jira/browse/SPARK-29336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Guilherme Souza >Priority: Minor > > Hello Spark maintainers, > I was experimenting with my own implementation of the [space-efficient > quantile > algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] > in another language and I was using the Spark's one as a reference. > In my analysis, I believe to have found an issue with the {{merge()}} logic. 
> Here is some simple Scala code that reproduces the issue I've found: > > {code:java} > var values = (1 to 100).toArray > val all_quantiles = values.indices.map(i => (i+1).toDouble / > values.length).toArray > for (n <- 0 until 5) { > var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) > val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) > val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray > val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) > => Math.abs(expected - answer) }).toArray > val max_error = error.max > print(max_error + "\n") > } > {code} > I query for all possible quantiles in a 100-element array with a desired 10% > max error. In this scenario, one would expect to observe a maximum error of > 10 ranks or less (10% of 100). However, the output I observe is: > > {noformat} > 16 > 12 > 10 > 11 > 17{noformat} > The variance is probably due to non-deterministic operations behind the > scenes, but irrelevant to the core cause. > Interestingly enough, if I change from five to one partition the code works > as expected and gives 10 every time. This seems to point to some problem at > the [merge > logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/Quanti
[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected
[ https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guilherme Souza updated SPARK-29336: Description: Hello Spark maintainers, I was experimenting with my own implementation of the [space-efficient quantile algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] in another language and I was using the Spark's one as a reference. In my analysis, I believe to have found an issue with the {{merge()}} logic. Here is some simple Scala code that reproduces the issue I've found: {code:java} var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } {code} I query for all possible quantiles in a 100-element array with a desired 10% max error. In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of 100). However, the output I observe is: {noformat} 16 12 10 11 17{noformat} The variance is probably due to non-deterministic operations behind the scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it) Interestingly enough, if I change from five to one partition the code works as expected and gives 10 every time. This seems to point to some problem at the [merge logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] The original authors ([~clockfly] and [~cloud_fan] for what I could dig from the history) suggest the published paper is not clear on how that should be done and, honestly, I was not confident in the current approach either. I've found SPARK-21184 that reports the same problem, but it was unfortunately closed with no fix applied. In my external implementation I believe to have found a sound way to implement the merge method. [Here is my take in Rust, if relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218] I'd be really glad to add unit tests and contribute my implementation adapted to Scala. I'd love to hear your opinion on the matter. Best regards was: Hello Spark maintainers, I was experimenting with my own implementation of the [space-efficient quantile algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] in another language and I was using the Spark's one as a reference. In my analysis, I believe to have found an issue with the {{merge()}} logic. 
Here is some simple Scala code that reproduces the issue I've found: {code:java} var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } {code} I query for all possible quantiles in a 100-element array with a desired 10% max error. In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of 100). However, the output I observe is: {noformat} 16 12 10 11 17{noformat} The variance is probably due to non-deterministic operations behind the scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it) Interestingly enough, if I change from five to one partition the code works as expected and gives 10 every time. This seems to point to some problem at the [merge logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] The original authors ([~clockfly] and [~cloud_fan] for what I could dig from the history) suggest the published paper is not clear on how that should be done and, honestly, I was not confident in the current approach either. I've found SPARK-21184 that reports the same problem, but it was unfortunately closed with no fix applied. In my external implementation I believe to have found a sound way to implement the merge method. [Here is my take in Rust, if relevant |https://github.com/sitegui/space-efficient-quantile/blo
[jira] [Updated] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected
[ https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guilherme Souza updated SPARK-29336: Description: Hello Spark maintainers, I was experimenting with my own implementation of the [space-efficient quantile algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] in another language and I was using the Spark's one as a reference. In my analysis, I believe to have found an issue with the {{merge()}} logic. Here is some simple Scala code that reproduces the issue I've found: {code:java} var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } {code} I query for all possible quantiles in a 100-element array with a desired 10% max error. In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of 100). However, the output I observe is: {noformat} 16 12 10 11 17{noformat} The variance is probably due to non-deterministic operations behind the scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not used to it) Interestingly enough, if I change from five to one partition the code works as expected and gives 10 every time. This seems to point to some problem at the [merge logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] The original authors ([~clockfly] and [~cloud_fan] for what I could dig from the history) suggest the published paper is not clear on how that should be done and, honestly, I was not confident in the current approach either. I've found SPARK-21184 that reports the same problem, but it was unfortunately closed with no fix applied. In my external implementation I believe to have found a sound way to implement the merge method. [Here is my take in Rust, if relevant |https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]I'd be really glad to add unit tests and contribute my implementation adapted to Scala. I'd love to hear your opinion on the matter.| Best regards was: Hello Spark maintainers, I was experimenting with my own implementation of the [space-efficient quantile algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] in another language and I was using the Spark's one as a reference. In my analysis, I believe to have found an issue with the {{merge()}} logic. 
Here is some simple Scala code that reproduces the issue I've found: {code:java} var values = (1 to 100).toArray val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray for (n <- 0 until 5) { var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray val max_error = error.max print(max_error + "\n") } {code} I query for all possible quantiles in a 100-element array with a desired 10% max error. In this scenario, one would expect to observe a maximum error of 10 ranks or less (10% of 100). However, the output I observe is: {noformat} 16 12 10 11 17{noformat} The variance is probably due to non-deterministic operations behind the scenes, but irrelevant to the core cause. Interestingly enough, if I change from five to one partition the code works as expected and gives 10 every time. This seems to point to some problem at the [merge logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] The original authors ([~clockfly] and [~cloud_fan] for what I could dig from the history) suggest the published paper is not clear on how that should be done and, honestly, I was not confident in the current approach either. I've found SPARK-21184 that reports the same problem, but it was unfortunately closed with no fix applied. In my external implementation I believe to have found a sound way to implement the merge method. [Here is my take in Rust, if relevant |https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/
[jira] [Commented] (SPARK-18748) UDF multiple evaluations causes very poor performance
[ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943176#comment-16943176 ] Anton Baranau commented on SPARK-18748: --- I got the same problem having the code below with 2.4.4 {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 'scores.score2') {code} In my case data isn't huge so I can afford to cahce it like below {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1', 'scores.score2') {code} > UDF multiple evaluations causes very poor performance > - > > Key: SPARK-18748 > URL: https://issues.apache.org/jira/browse/SPARK-18748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Ohad Raviv >Priority: Major > > We have a use case where we have a relatively expensive UDF that needs to be > calculated. The problem is that instead of being calculated once, it gets > calculated over and over again. > for example: > {quote} > def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\} > hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _) > hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is > not null and c<>''").show > {quote} > with the output: > {quote} > blahblah1 > blahblah1 > blahblah1 > +---+ > | c| > +---+ > |nothing| > +---+ > {quote} > You can see that for each reference of column "c" you will get the println. > that causes very poor performance for our real use case. > This also came out on StackOverflow: > http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns > http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/ > with two problematic work-arounds: > 1. cache() after the first time. e.g. > {quote} > hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not > null and c<>''").show > {quote} > while it works, in our case we can't do that because the table is too big to > cache. > 2. move back and forth to rdd: > {quote} > val df = hiveContext.sql("select veryExpensiveCalc('a') as c") > hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and > c<>''").show > {quote} > which works but then we loose some of the optimizations like push down > predicate features, etc. and its very ugly. > Any ideas on how we can make the UDF get calculated just once in a reasonable > way? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
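Another workaround that is sometimes suggested for this class of problem is to mark the expensive UDF as non-deterministic (available since Spark 2.3), which keeps the optimizer from inlining the same expression into every place that references the output column. This is a sketch under that assumption, not a guaranteed fix, and it can block other optimizations; df is assumed to be a DataFrame with a string column "texts" as in the comment above, and the function body is a placeholder.

{code:java}
import org.apache.spark.sql.functions.{col, udf}

// Placeholder for the genuinely expensive computation.
val expensiveCalc = udf((s: String) => s.toUpperCase).asNondeterministic()

// Because the UDF is marked non-deterministic, rules such as CollapseProject
// and predicate pushdown will not substitute it into the filter, so it is
// evaluated once per row here.
val out = df.withColumn("c", expensiveCalc(col("texts")))
  .where(col("c").isNotNull && col("c") =!= "")
{code}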
[jira] [Comment Edited] (SPARK-18748) UDF multiple evaluations causes very poor performance
[ https://issues.apache.org/jira/browse/SPARK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943176#comment-16943176 ] Anton Baranau edited comment on SPARK-18748 at 10/2/19 9:32 PM: I got the same problem having the code below with 2.4.4 {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 'scores.score2') {code} In my case data isn't huge so I can afford to cache it like below {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1', 'scores.score2') {code} was (Author: ahtokca): I got the same problem having the code below with 2.4.4 {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).selectExpr('scores.score1', 'scores.score2') {code} In my case data isn't huge so I can afford to cahce it like below {code:python} df.withColumn("scores", sf.explode(expensive_spacy_nlp_udf("texts"))).cache().selectExpr('scores.score1', 'scores.score2') {code} > UDF multiple evaluations causes very poor performance > - > > Key: SPARK-18748 > URL: https://issues.apache.org/jira/browse/SPARK-18748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Ohad Raviv >Priority: Major > > We have a use case where we have a relatively expensive UDF that needs to be > calculated. The problem is that instead of being calculated once, it gets > calculated over and over again. > for example: > {quote} > def veryExpensiveCalc(str:String) = \{println("blahblah1"); "nothing"\} > hiveContext.udf.register("veryExpensiveCalc", veryExpensiveCalc _) > hiveContext.sql("select * from (select veryExpensiveCalc('a') c)z where c is > not null and c<>''").show > {quote} > with the output: > {quote} > blahblah1 > blahblah1 > blahblah1 > +---+ > | c| > +---+ > |nothing| > +---+ > {quote} > You can see that for each reference of column "c" you will get the println. > that causes very poor performance for our real use case. > This also came out on StackOverflow: > http://stackoverflow.com/questions/40320563/spark-udf-called-more-than-once-per-record-when-df-has-too-many-columns > http://stackoverflow.com/questions/34587596/trying-to-turn-a-blob-into-multiple-columns-in-spark/ > with two problematic work-arounds: > 1. cache() after the first time. e.g. > {quote} > hiveContext.sql("select veryExpensiveCalc('a') as c").cache().where("c is not > null and c<>''").show > {quote} > while it works, in our case we can't do that because the table is too big to > cache. > 2. move back and forth to rdd: > {quote} > val df = hiveContext.sql("select veryExpensiveCalc('a') as c") > hiveContext.createDataFrame(df.rdd, df.schema).where("c is not null and > c<>''").show > {quote} > which works but then we loose some of the optimizations like push down > predicate features, etc. and its very ugly. > Any ideas on how we can make the UDF get calculated just once in a reasonable > way? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28725) Spark ML not able to de-serialize Logistic Regression model saved in previous version of Spark
[ https://issues.apache.org/jira/browse/SPARK-28725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943192#comment-16943192 ] Sharad Varshney commented on SPARK-28725: - Even with the same Spark version (2.4.3), we see a huge difference in the probabilities. > Spark ML not able to de-serialize Logistic Regression model saved in previous > version of Spark > -- > > Key: SPARK-28725 > URL: https://issues.apache.org/jira/browse/SPARK-28725 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.3 > Environment: PROD >Reporter: Sharad Varshney >Priority: Major > > Logistic Regression model saved using Spark version 2.3.0 in HDI is not > correctly de-serialized with Spark 2.4.3 version. It loads into the memory > but probabilities it emits on inference is like 1.45 e-44(to 44th decimal > place approx equal to 0) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
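For context, the scenario being described boils down to loading a model directory written by an older release and scoring it with a newer one; a minimal sketch with a hypothetical path and an assumed test DataFrame testDf:

{code:java}
import org.apache.spark.ml.classification.LogisticRegressionModel

// Hypothetical path: a model directory written by Spark 2.3.0 on HDI.
val model = LogisticRegressionModel.load("hdfs:///models/lr-2.3.0")

// Scored with Spark 2.4.3; per the report the probabilities come back ~1.45e-44.
val scored = model.transform(testDf)
scored.select("probability", "prediction").show(5, truncate = false)
{code}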
[jira] [Created] (SPARK-29338) Add overload for filter with index to Scala/Java API
Nikolas Vanderhoof created SPARK-29338: -- Summary: Add overload for filter with index to Scala/Java API Key: SPARK-29338 URL: https://issues.apache.org/jira/browse/SPARK-29338 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Nikolas Vanderhoof Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`. See: SPARK-28962 and SPARK-27297 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
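Until the Scala/Java overload exists, the indexed form is reachable from Scala through SQL expression syntax, assuming the (element, index) lambda from SPARK-28962 is in the build being used; a small sketch with a made-up array column:

{code:java}
// The SQL higher-order filter already accepts an (element, index) lambda;
// here we keep only the even-indexed elements of the array.
val df = spark.sql("SELECT array(10, 20, 30, 40, 50) AS xs")
df.selectExpr("filter(xs, (x, i) -> i % 2 = 0) AS even_positions").show(false)
{code}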
[jira] [Resolved] (SPARK-29338) Add overload for filter with index to Scala/Java API
[ https://issues.apache.org/jira/browse/SPARK-29338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikolas Vanderhoof resolved SPARK-29338. Resolution: Duplicate > Add overload for filter with index to Scala/Java API > > > Key: SPARK-29338 > URL: https://issues.apache.org/jira/browse/SPARK-29338 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Nikolas Vanderhoof >Priority: Major > > Add an overload for the higher order function `filter` that takes array index > as its second argument to `org.apache.spark.sql.functions`. > See: SPARK-28962 and SPARK-27297 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29339) Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build)
Hyukjin Kwon created SPARK-29339: Summary: Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build) Key: SPARK-29339 URL: https://issues.apache.org/jira/browse/SPARK-29339 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 3.0.0 Reporter: Hyukjin Kwon dapply and gapply with Arrow optimization and Arrow 0.14 seem to be failing: {code} > collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, > structType("gear double"))) Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument {code} We should fix this and also test it in AppVeyor. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29322) History server is stuck reading incomplete event log file compressed with zstd
[ https://issues.apache.org/jira/browse/SPARK-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-29322: - Assignee: Jungtaek Lim > History server is stuck reading incomplete event log file compressed with zstd > -- > > Key: SPARK-29322 > URL: https://issues.apache.org/jira/browse/SPARK-29322 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Attachments: history-server-1.jstack, history-server-2.jstack, > history-server-3.jstack, history-server-4.jstack > > > While working on SPARK-28869, I've discovered the issue that reading > inprogress event log file on zstd compression could lead the thread being > stuck. I just experimented with Spark History Server and observed same issue. > I'll attach the jstack files. > This is very easy to reproduce: setting configuration as below > - spark.eventLog.enabled=true > - spark.eventLog.compress=true > - spark.eventLog.compression.codec=zstd > and start Spark application. While the application is running, load the > application in SHS webpage. It may succeed to replay the event log, but high > likely it will be stuck and loading page will be also stuck. > Only listing the thread stack trace being stuck across jstack files: > {code} > 2019-10-02 11:32:36 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.191-b12 mixed mode): > ... > "qtp2072313080-30" #30 daemon prio=5 os_prio=31 tid=0x7ff5b90e7800 > nid=0x9703 runnable [0x7f22] >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:156) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x0007b5f97c60> (a > org.apache.hadoop.fs.BufferedFSInputStream) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:257) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:228) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > - locked <0x0007b5f97b58> (a > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker) > at java.io.DataInputStream.read(DataInputStream.java:149) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x0007b5f97af8> (a java.io.BufferedInputStream) > at > com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:129) > at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107) > - locked <0x0007b5f97ac0> (a com.github.luben.zstd.ZstdInputStream) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x0007b5cd3bd0> (a java.io.BufferedInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > - locked <0x0007b5f94a00> 
(a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > - locked <0x0007b5f94a00> (a java.io.InputStreamReader) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74) > at scala.collection.Iterator$$anon$20.hasNext(Iterator.scala:884) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:80) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$rebuildAppStore$5(FsHistoryProvider.scala:976) > at > org.apache.spark.deploy.history.FsHistoryProvider.$a
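The three reproduction settings listed in the description, written programmatically for a throwaway test application (a sketch; putting them in spark-defaults.conf as described works just as well):

{code:java}
import org.apache.spark.sql.SparkSession

// Event logging compressed with zstd, matching the reproduction steps above.
val spark = SparkSession.builder()
  .appName("zstd-eventlog-repro")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.compress", "true")
  .config("spark.eventLog.compression.codec", "zstd")
  .getOrCreate()
{code}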
[jira] [Resolved] (SPARK-29322) History server is stuck reading incomplete event log file compressed with zstd
[ https://issues.apache.org/jira/browse/SPARK-29322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29322. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25996 [https://github.com/apache/spark/pull/25996] > History server is stuck reading incomplete event log file compressed with zstd > -- > > Key: SPARK-29322 > URL: https://issues.apache.org/jira/browse/SPARK-29322 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > Attachments: history-server-1.jstack, history-server-2.jstack, > history-server-3.jstack, history-server-4.jstack > > > While working on SPARK-28869, I've discovered the issue that reading > inprogress event log file on zstd compression could lead the thread being > stuck. I just experimented with Spark History Server and observed same issue. > I'll attach the jstack files. > This is very easy to reproduce: setting configuration as below > - spark.eventLog.enabled=true > - spark.eventLog.compress=true > - spark.eventLog.compression.codec=zstd > and start Spark application. While the application is running, load the > application in SHS webpage. It may succeed to replay the event log, but high > likely it will be stuck and loading page will be also stuck. > Only listing the thread stack trace being stuck across jstack files: > {code} > 2019-10-02 11:32:36 > Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.191-b12 mixed mode): > ... > "qtp2072313080-30" #30 daemon prio=5 os_prio=31 tid=0x7ff5b90e7800 > nid=0x9703 runnable [0x7f22] >java.lang.Thread.State: RUNNABLE > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:156) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x0007b5f97c60> (a > org.apache.hadoop.fs.BufferedFSInputStream) > at java.io.DataInputStream.read(DataInputStream.java:149) > at > org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:436) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:257) > at > org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276) > at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:228) > at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196) > - locked <0x0007b5f97b58> (a > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker) > at java.io.DataInputStream.read(DataInputStream.java:149) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x0007b5f97af8> (a java.io.BufferedInputStream) > at > com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:129) > at com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:107) > - locked <0x0007b5f97ac0> (a com.github.luben.zstd.ZstdInputStream) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > - locked <0x0007b5cd3bd0> (a java.io.BufferedInputStream) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at 
sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > - locked <0x0007b5f94a00> (a java.io.InputStreamReader) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > - locked <0x0007b5f94a00> (a java.io.InputStreamReader) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74) > at scala.collection.Iterator$$anon$20.hasNext(Iterator.scala:884) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:511) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:80) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistor
[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month
[ https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29328: -- Affects Version/s: 2.3.4 > Incorrect calculation mean seconds per month > > > Key: SPARK-29328 > URL: https://issues.apache.org/jira/browse/SPARK-29328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > > Existing implementation assumes 31 days per month or 372 days per year which > is far away from the correct number. Spark uses the proleptic Gregorian > calendar by default SPARK-26651 in which the average year is 365.2425 days > long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix > calculation in 3 places at least: > - GroupStateImpl.scala:167:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - EventTimeWatermark.scala:32:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month
[ https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29328: -- Affects Version/s: 2.2.3 > Incorrect calculation mean seconds per month > > > Key: SPARK-29328 > URL: https://issues.apache.org/jira/browse/SPARK-29328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > > Existing implementation assumes 31 days per month or 372 days per year which > is far away from the correct number. Spark uses the proleptic Gregorian > calendar by default SPARK-26651 in which the average year is 365.2425 days > long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix > calculation in 3 places at least: > - GroupStateImpl.scala:167:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - EventTimeWatermark.scala:32:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month
[ https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29328: -- Description: Existing implementation assumes 31 days per month or 372 days per year which is far away from the correct number. Spark uses the proleptic Gregorian calendar by default SPARK-26651 in which the average year is 365.2425 days long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix calculation in 3 places at least: - GroupStateImpl.scala:167:val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 - EventTimeWatermark.scala:32:val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) *BEFORE* {code} spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.4516129 {code} *AFTER* {code} spark-sql> select months_between('2019-09-15', '1970-01-01'); 596.45996838 {code} was: Existing implementation assumes 31 days per month or 372 days per year which is far away from the correct number. Spark uses the proleptic Gregorian calendar by default SPARK-26651 in which the average year is 365.2425 days long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix calculation in 3 places at least: - GroupStateImpl.scala:167:val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 - EventTimeWatermark.scala:32:val millisPerMonth = TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) > Incorrect calculation mean seconds per month > > > Key: SPARK-29328 > URL: https://issues.apache.org/jira/browse/SPARK-29328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > > Existing implementation assumes 31 days per month or 372 days per year which > is far away from the correct number. Spark uses the proleptic Gregorian > calendar by default SPARK-26651 in which the average year is 365.2425 days > long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix > calculation in 3 places at least: > - GroupStateImpl.scala:167:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - EventTimeWatermark.scala:32:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) > *BEFORE* > {code} > spark-sql> select months_between('2019-09-15', '1970-01-01'); > 596.4516129 > {code} > *AFTER* > {code} > spark-sql> select months_between('2019-09-15', '1970-01-01'); > 596.45996838 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
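The magnitude of the error is easy to see from the two constants involved (a worked sketch; the numbers follow from the description and the BEFORE/AFTER output above):

{code:java}
import java.util.concurrent.TimeUnit.DAYS

// What the current code assumes: every month is 31 days long.
val secondsPer31DayMonth = DAYS.toSeconds(31)                         // 2678400

// Mean proleptic Gregorian month: 365.2425 days / 12 = 30.436875 days.
val secondsPerMeanMonth = (DAYS.toSeconds(1) * 365.2425 / 12).toLong  // 2629746

// Roughly a 1.85% difference, which is why the corrected months_between above
// moves from 596.4516129 to 596.45996838 for the same pair of dates.
{code}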
[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month
[ https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29328: -- Affects Version/s: 2.1.3 > Incorrect calculation mean seconds per month > > > Key: SPARK-29328 > URL: https://issues.apache.org/jira/browse/SPARK-29328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > > Existing implementation assumes 31 days per month or 372 days per year which > is far away from the correct number. Spark uses the proleptic Gregorian > calendar by default SPARK-26651 in which the average year is 365.2425 days > long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix > calculation in 3 places at least: > - GroupStateImpl.scala:167:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - EventTimeWatermark.scala:32:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) > *BEFORE* > {code} > spark-sql> select months_between('2019-09-15', '1970-01-01'); > 596.4516129 > {code} > *AFTER* > {code} > spark-sql> select months_between('2019-09-15', '1970-01-01'); > 596.45996838 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2
[ https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29305: Assignee: angerszhu > Update LICENSE and NOTICE for hadoop 3.2 > > > Key: SPARK-29305 > URL: https://issues.apache.org/jira/browse/SPARK-29305 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > {code} > com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5 > com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 > com.fasterxml.woodstox:woodstox-core:5.0.3 > com.github.stephenc.jcip:jcip-annotations:1.0-1 > com.google.re2j:re2j:1.1 > com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 > com.nimbusds:nimbus-jose-jwt:4.41.1 > dnsjava:dnsjava:2.1.7 > net.minidev:accessors-smart:1.2 > net.minidev:json-smart:2.3 > org.apache.commons:commons-configuration2:2.1.1 > org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1 > org.apache.hadoop:hadoop-hdfs-client:3.2.0 > org.apache.kerby:kerb-admin:1.0.1 > org.apache.kerby:kerb-client:1.0.1 > org.apache.kerby:kerb-common:1.0.1 > org.apache.kerby:kerb-core:1.0.1 > org.apache.kerby:kerb-crypto:1.0.1 > org.apache.kerby:kerb-identity:1.0.1 > org.apache.kerby:kerb-server:1.0.1 > org.apache.kerby:kerb-simplekdc:1.0.1 > org.apache.kerby:kerb-util:1.0.1 > org.apache.kerby:kerby-asn1:1.0.1 > org.apache.kerby:kerby-config:1.0.1 > org.apache.kerby:kerby-pkix:1.0.1 > org.apache.kerby:kerby-util:1.0.1 > org.apache.kerby:kerby-xdr:1.0.1 > org.apache.kerby:token-provider:1.0.1 > org.codehaus.woodstox:stax2-api:3.1.4 > org.ehcache:ehcache:3.3.1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2
[ https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29305. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25978 [https://github.com/apache/spark/pull/25978] > Update LICENSE and NOTICE for hadoop 3.2 > > > Key: SPARK-29305 > URL: https://issues.apache.org/jira/browse/SPARK-29305 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > Fix For: 3.0.0 > > > {code} > com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5 > com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 > com.fasterxml.woodstox:woodstox-core:5.0.3 > com.github.stephenc.jcip:jcip-annotations:1.0-1 > com.google.re2j:re2j:1.1 > com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 > com.nimbusds:nimbus-jose-jwt:4.41.1 > dnsjava:dnsjava:2.1.7 > net.minidev:accessors-smart:1.2 > net.minidev:json-smart:2.3 > org.apache.commons:commons-configuration2:2.1.1 > org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1 > org.apache.hadoop:hadoop-hdfs-client:3.2.0 > org.apache.kerby:kerb-admin:1.0.1 > org.apache.kerby:kerb-client:1.0.1 > org.apache.kerby:kerb-common:1.0.1 > org.apache.kerby:kerb-core:1.0.1 > org.apache.kerby:kerb-crypto:1.0.1 > org.apache.kerby:kerb-identity:1.0.1 > org.apache.kerby:kerb-server:1.0.1 > org.apache.kerby:kerb-simplekdc:1.0.1 > org.apache.kerby:kerb-util:1.0.1 > org.apache.kerby:kerby-asn1:1.0.1 > org.apache.kerby:kerby-config:1.0.1 > org.apache.kerby:kerby-pkix:1.0.1 > org.apache.kerby:kerby-util:1.0.1 > org.apache.kerby:kerby-xdr:1.0.1 > org.apache.kerby:token-provider:1.0.1 > org.codehaus.woodstox:stax2-api:3.1.4 > org.ehcache:ehcache:3.3.1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29305) Update LICENSE and NOTICE for hadoop 3.2
[ https://issues.apache.org/jira/browse/SPARK-29305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-29305: - Priority: Minor (was: Major) > Update LICENSE and NOTICE for hadoop 3.2 > > > Key: SPARK-29305 > URL: https://issues.apache.org/jira/browse/SPARK-29305 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Minor > > {code} > com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:2.9.5 > com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:2.9.5 > com.fasterxml.woodstox:woodstox-core:5.0.3 > com.github.stephenc.jcip:jcip-annotations:1.0-1 > com.google.re2j:re2j:1.1 > com.microsoft.sqlserver:mssql-jdbc:6.2.1.jre7 > com.nimbusds:nimbus-jose-jwt:4.41.1 > dnsjava:dnsjava:2.1.7 > net.minidev:accessors-smart:1.2 > net.minidev:json-smart:2.3 > org.apache.commons:commons-configuration2:2.1.1 > org.apache.geronimo.specs:geronimo-jcache_1.0_spec:1.0-alpha-1 > org.apache.hadoop:hadoop-hdfs-client:3.2.0 > org.apache.kerby:kerb-admin:1.0.1 > org.apache.kerby:kerb-client:1.0.1 > org.apache.kerby:kerb-common:1.0.1 > org.apache.kerby:kerb-core:1.0.1 > org.apache.kerby:kerb-crypto:1.0.1 > org.apache.kerby:kerb-identity:1.0.1 > org.apache.kerby:kerb-server:1.0.1 > org.apache.kerby:kerb-simplekdc:1.0.1 > org.apache.kerby:kerb-util:1.0.1 > org.apache.kerby:kerby-asn1:1.0.1 > org.apache.kerby:kerby-config:1.0.1 > org.apache.kerby:kerby-pkix:1.0.1 > org.apache.kerby:kerby-util:1.0.1 > org.apache.kerby:kerby-xdr:1.0.1 > org.apache.kerby:token-provider:1.0.1 > org.codehaus.woodstox:stax2-api:3.1.4 > org.ehcache:ehcache:3.3.1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29328) Incorrect calculation mean seconds per month
[ https://issues.apache.org/jira/browse/SPARK-29328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29328: --- Labels: correctness (was: ) > Incorrect calculation mean seconds per month > > > Key: SPARK-29328 > URL: https://issues.apache.org/jira/browse/SPARK-29328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > Labels: correctness > > Existing implementation assumes 31 days per month or 372 days per year which > is far away from the correct number. Spark uses the proleptic Gregorian > calendar by default SPARK-26651 in which the average year is 365.2425 days > long: https://en.wikipedia.org/wiki/Gregorian_calendar . Need to fix > calculation in 3 places at least: > - GroupStateImpl.scala:167:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - EventTimeWatermark.scala:32:val millisPerMonth = > TimeUnit.MICROSECONDS.toMillis(CalendarInterval.MICROS_PER_DAY) * 31 > - DateTimeUtils.scala:610:val secondsInMonth = DAYS.toSeconds(31) > *BEFORE* > {code} > spark-sql> select months_between('2019-09-15', '1970-01-01'); > 596.4516129 > {code} > *AFTER* > {code} > spark-sql> select months_between('2019-09-15', '1970-01-01'); > 596.45996838 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org