[jira] [Resolved] (SPARK-25390) data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25390. - Fix Version/s: 3.0.0 Assignee: Wenchen Fan Resolution: Fixed > data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Currently it's not very clear how we should abstract the data source v2 API. The > abstraction should be unified between batch and streaming, or similar but > with a well-defined difference between batch and streaming. And the > abstraction should also include catalog/table. > An example of the abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} > We should refactor the data source v2 API according to this abstraction. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
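The catalog -> table -> scan / stream layering described above can be sketched in plain Scala. All trait and object names below are hypothetical illustrations of the abstraction, not Spark's actual DataSourceV2 interfaces:

```scala
// Hypothetical sketch of the proposed abstraction (not Spark's real API):
//   batch:     catalog -> table -> scan
//   streaming: catalog -> table -> stream -> scan
trait Catalog { def loadTable(name: String): Table }
trait Table {
  def newScanBuilder(): ScanBuilder     // batch path
  def newStreamBuilder(): StreamBuilder // streaming path
}
trait ScanBuilder { def build(): Scan }
trait StreamBuilder { def build(): Stream }
trait Scan { def description: String }
trait Stream { def toScan: Scan } // a stream yields scans, e.g. per micro-batch

// Tiny demo wiring both paths end-to-end.
object DemoCatalog extends Catalog {
  def loadTable(name: String): Table = new Table {
    def newScanBuilder(): ScanBuilder = new ScanBuilder {
      def build(): Scan = new Scan { def description: String = s"batch scan of $name" }
    }
    def newStreamBuilder(): StreamBuilder = new StreamBuilder {
      def build(): Stream = new Stream {
        def toScan: Scan = new Scan { def description: String = s"stream scan of $name" }
      }
    }
  }
}
```

The only difference between the two paths is the extra `Stream` hop, which is exactly the "well-defined difference between batch and streaming" the refactoring asks for.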
[jira] [Assigned] (SPARK-29373) DataSourceV2: Commands should not submit a spark job.
[ https://issues.apache.org/jira/browse/SPARK-29373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29373: --- Assignee: Terry Kim > DataSourceV2: Commands should not submit a spark job. > - > > Key: SPARK-29373 > URL: https://issues.apache.org/jira/browse/SPARK-29373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Blocker > > DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all > extend LeafExecNode. This results in running a job when executeCollect() is > called. This breaks the previous behavior > [SPARK-19650|https://issues.apache.org/jira/browse/SPARK-19650]. > A new command physical operator will be introduced from which all V2 Exec > classes derive to avoid running a job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29373) DataSourceV2: Commands should not submit a spark job.
[ https://issues.apache.org/jira/browse/SPARK-29373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29373. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26048 [https://github.com/apache/spark/pull/26048] > DataSourceV2: Commands should not submit a spark job. > - > > Key: SPARK-29373 > URL: https://issues.apache.org/jira/browse/SPARK-29373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Blocker > Fix For: 3.0.0 > > > DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all > extend LeafExecNode. This results in running a job when executeCollect() is > called. This breaks the previous behavior > [SPARK-19650|https://issues.apache.org/jira/browse/SPARK-19650]. > A new command physical operator will be introduced from which all V2 Exec > classes derive to avoid running a job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
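The fix described above can be illustrated with a self-contained sketch: a base class that computes its rows with purely local work and caches them, so collect-style calls never submit a distributed job. Class and method names below are simplified stand-ins, not Spark's actual operators:

```scala
// Hypothetical sketch of a command physical operator base class.
// Subclasses produce their output locally; the result is cached so it is
// computed at most once, and executeCollect() never submits a job.
abstract class V2CommandExec {
  // Purely local computation, implemented by each command.
  protected def run(): Seq[String]

  // Evaluated lazily and at most once; all access paths reuse it.
  private lazy val result: Seq[String] = run()

  def executeCollect(): Seq[String] = result // no job submission
}

// Example command: lists namespaces it was given, sorted.
class ShowNamespacesExec(namespaces: Seq[String]) extends V2CommandExec {
  var runs = 0 // instrumentation to show run() executes only once
  protected def run(): Seq[String] = { runs += 1; namespaces.sorted }
}

val demo = new ShowNamespacesExec(Seq("b", "a", "c"))
```

The `lazy val` is the key design point: repeated `executeCollect()` calls return the cached rows instead of re-running the command.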
[jira] [Resolved] (SPARK-29401) Replace ambiguous varargs call parallelize(Array) with parallelize(Seq)
[ https://issues.apache.org/jira/browse/SPARK-29401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29401. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26062 [https://github.com/apache/spark/pull/26062] > Replace ambiguous varargs call parallelize(Array) with parallelize(Seq) > --- > > Key: SPARK-29401 > URL: https://issues.apache.org/jira/browse/SPARK-29401 > Project: Spark > Issue Type: Sub-task > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.0.0 > > > Another general class of Scala 2.13 compile errors: > {code} > [ERROR] [Error] > /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: > overloaded method value apply with alternatives: > (x: Unit,xs: Unit*)Array[Unit] > (x: Double,xs: Double*)Array[Double] > (x: Float,xs: Float*)Array[Float] > (x: Long,xs: Long*)Array[Long] > (x: Int,xs: Int*)Array[Int] > (x: Char,xs: Char*)Array[Char] > (x: Short,xs: Short*)Array[Short] > (x: Byte,xs: Byte*)Array[Byte] > (x: Boolean,xs: Boolean*)Array[Boolean] > cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int)) > {code} > All of these (almost all?) are from calls to {{SparkContext.parallelize}} > with an Array of tuples. We can replace them with Seqs, which seems to work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
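A minimal illustration of the replacement, using a stand-in with a `parallelize`-like signature (hypothetical, no Spark dependency). Wrapping the tuples in a `Seq` sidesteps the `Array.apply` overload resolution that produces the ambiguity shown in the error:

```scala
import scala.reflect.ClassTag

// Stand-in mimicking the shape of SparkContext.parallelize (illustrative only).
def parallelize[T: ClassTag](data: Seq[T]): Seq[T] = data

// Under Scala 2.13, writing Array((1, 1), (2, 2), ...) at a call site like
// this can fail with "overloaded method value apply with alternatives ..."
// because Array.apply is overloaded for every primitive element type.
// A Seq literal has a single apply, so inference succeeds:
val pairs = parallelize(Seq((1, 1), (2, 2), (3, 3), (4, 4)))
```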
[jira] [Resolved] (SPARK-29392) Remove use of deprecated symbol literal " 'name " syntax in favor Symbol("name")
[ https://issues.apache.org/jira/browse/SPARK-29392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29392. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26061 [https://github.com/apache/spark/pull/26061] > Remove use of deprecated symbol literal " 'name " syntax in favor > Symbol("name") > > > Key: SPARK-29392 > URL: https://issues.apache.org/jira/browse/SPARK-29392 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.0.0 > > > Example: > {code} > [WARNING] [Warn] > /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala:308: > symbol literal is deprecated; use Symbol("assertInvariants") instead > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
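The mechanical replacement the deprecation warning suggests, shown in isolation; `Symbol("name")` produces the same interned `Symbol` the literal `'name` syntax did:

```scala
// Replacement for the deprecated symbol literal 'assertInvariants:
val col = Symbol("assertInvariants")
```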
[jira] [Created] (SPARK-29403) Uses Arrow R 0.14.1 in AppVeyor for now
Hyukjin Kwon created SPARK-29403: Summary: Uses Arrow R 0.14.1 in AppVeyor for now Key: SPARK-29403 URL: https://issues.apache.org/jira/browse/SPARK-29403 Project: Spark Issue Type: Test Components: Project Infra, SparkR Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon Due to SPARK-29378, the test is currently failing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28797) Document DROP FUNCTION statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-28797. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25553 [https://github.com/apache/spark/pull/25553] > Document DROP FUNCTION statement in SQL Reference. > -- > > Key: SPARK-28797 > URL: https://issues.apache.org/jira/browse/SPARK-28797 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Sandeep Katta >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28797) Document DROP FUNCTION statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-28797: - Priority: Minor (was: Major) > Document DROP FUNCTION statement in SQL Reference. > -- > > Key: SPARK-28797 > URL: https://issues.apache.org/jira/browse/SPARK-28797 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Sandeep Katta >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28797) Document DROP FUNCTION statement in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-28797: Assignee: Sandeep Katta > Document DROP FUNCTION statement in SQL Reference. > -- > > Key: SPARK-28797 > URL: https://issues.apache.org/jira/browse/SPARK-28797 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Sandeep Katta >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-28502. -- Fix Version/s: 3.0.0 Resolution: Fixed This was fixed once support for StructType was added to pandas_udf, because the window range sent to the udf is a struct column.

> Error with struct conversion while using pandas_udf
> ---------------------------------------------------
>
>                 Key: SPARK-28502
>                 URL: https://issues.apache.org/jira/browse/SPARK-28502
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: OS: Ubuntu
> Python: 3.6
>            Reporter: Nasir Ali
>            Priority: Minor
>             Fix For: 3.0.0
>
> What I am trying to do: group data based on time intervals (e.g., a 15-day window) and perform some operations on the dataframe using (pandas) UDFs. I don't know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and the error message I am getting.
>
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
>                                    (13.00, "2018-03-11T12:27:18+00:00"),
>                                    (25.00, "2018-03-12T11:27:18+00:00"),
>                                    (20.00, "2018-03-13T15:27:18+00:00"),
>                                    (17.00, "2018-03-14T12:27:18+00:00"),
>                                    (99.00, "2018-03-15T11:27:18+00:00"),
>                                    (156.00, "2018-03-22T11:27:18+00:00"),
>                                    (17.00, "2018-03-31T11:27:18+00:00"),
>                                    (25.00, "2018-03-15T11:27:18+00:00"),
>                                    (25.00, "2018-03-16T11:27:18+00:00")
>                                    ],
>                                   ["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
>     StructField("id", IntegerType()),
>     StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
>     # some computation
>     return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws the following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>
> However, if I use a builtin agg method then it works fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output:
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf
[ https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947272#comment-16947272 ] Bryan Cutler commented on SPARK-28502: -- I'm closing this since it is working in master, and will mark the fixed version as 3.0.0. I also created SPARK-29402 to add proper testing for this use case.

> Error with struct conversion while using pandas_udf
> ---------------------------------------------------
>
>                 Key: SPARK-28502
>                 URL: https://issues.apache.org/jira/browse/SPARK-28502
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3
>         Environment: OS: Ubuntu
> Python: 3.6
>            Reporter: Nasir Ali
>            Priority: Minor
>
> What I am trying to do: group data based on time intervals (e.g., a 15-day window) and perform some operations on the dataframe using (pandas) UDFs. I don't know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and the error message I am getting.
>
> {code:java}
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
>                                    (13.00, "2018-03-11T12:27:18+00:00"),
>                                    (25.00, "2018-03-12T11:27:18+00:00"),
>                                    (20.00, "2018-03-13T15:27:18+00:00"),
>                                    (17.00, "2018-03-14T12:27:18+00:00"),
>                                    (99.00, "2018-03-15T11:27:18+00:00"),
>                                    (156.00, "2018-03-22T11:27:18+00:00"),
>                                    (17.00, "2018-03-31T11:27:18+00:00"),
>                                    (25.00, "2018-03-15T11:27:18+00:00"),
>                                    (25.00, "2018-03-16T11:27:18+00:00")
>                                    ],
>                                   ["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
>     StructField("id", IntegerType()),
>     StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
>     # some computation
>     return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws the following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>
> However, if I use a builtin agg method then it works fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output:
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29402) Add tests for grouped map pandas_udf using window
[ https://issues.apache.org/jira/browse/SPARK-29402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947266#comment-16947266 ] Bryan Cutler commented on SPARK-29402: -- This is related to SPARK-28502, where using a grouped map pandas_udf with a window failed because the key contains a struct column for the window range. > Add tests for grouped map pandas_udf using window > - > > Key: SPARK-29402 > URL: https://issues.apache.org/jira/browse/SPARK-29402 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL, Tests >Affects Versions: 2.4.4 >Reporter: Bryan Cutler >Priority: Major > > Grouped map pandas_udf should have tests that use a window, due to the key > containing the grouping key and window range. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29402) Add tests for grouped map pandas_udf using window
Bryan Cutler created SPARK-29402: Summary: Add tests for grouped map pandas_udf using window Key: SPARK-29402 URL: https://issues.apache.org/jira/browse/SPARK-29402 Project: Spark Issue Type: Improvement Components: PySpark, SQL, Tests Affects Versions: 2.4.4 Reporter: Bryan Cutler Grouped map pandas_udf should have tests that use a window, due to the key containing the grouping key and window range. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29368) Port interval.sql
[ https://issues.apache.org/jira/browse/SPARK-29368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29368: -- Component/s: Tests > Port interval.sql > - > > Key: SPARK-29368 > URL: https://issues.apache.org/jira/browse/SPARK-29368 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Here is interval.sql: > [https://raw.githubusercontent.com/postgres/postgres/REL_12_STABLE/src/test/regress/sql/interval.sql] > Results: > https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/interval.out -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29382) Support the `INTERVAL` type by Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-29382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947234#comment-16947234 ] Dongjoon Hyun commented on SPARK-29382: --- Since this is not a Parquet-specific format, could you update the issue title? > Support the `INTERVAL` type by Parquet datasource > - > > Key: SPARK-29382 > URL: https://issues.apache.org/jira/browse/SPARK-29382 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Spark cannot create a table using parquet if a column has the `INTERVAL` type: > {code} > spark-sql> CREATE TABLE INTERVAL_TBL (f1 interval) USING PARQUET; > Error in query: Parquet data source does not support interval data type.; > {code} > This is needed for SPARK-29368 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29401) Replace ambiguous varargs call parallelize(Array) with parallelize(Seq)
Sean R. Owen created SPARK-29401: Summary: Replace ambiguous varargs call parallelize(Array) with parallelize(Seq) Key: SPARK-29401 URL: https://issues.apache.org/jira/browse/SPARK-29401 Project: Spark Issue Type: Sub-task Components: ML, Spark Core Affects Versions: 3.0.0 Reporter: Sean R. Owen Assignee: Sean R. Owen Another general class of Scala 2.13 compile errors: {code} [ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives: (x: Unit,xs: Unit*)Array[Unit] (x: Double,xs: Double*)Array[Double] (x: Float,xs: Float*)Array[Float] (x: Long,xs: Long*)Array[Long] (x: Int,xs: Int*)Array[Int] (x: Char,xs: Char*)Array[Char] (x: Short,xs: Short*)Array[Short] (x: Byte,xs: Byte*)Array[Byte] (x: Boolean,xs: Boolean*)Array[Boolean] cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int)) {code} All of these (almost all?) are from calls to {{SparkContext.parallelize}} with an Array of tuples. We can replace them with Seqs, which seems to work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29392) Remove use of deprecated symbol literal " 'name " syntax in favor Symbol("name")
[ https://issues.apache.org/jira/browse/SPARK-29392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29392: Assignee: Sean R. Owen > Remove use of deprecated symbol literal " 'name " syntax in favor > Symbol("name") > > > Key: SPARK-29392 > URL: https://issues.apache.org/jira/browse/SPARK-29392 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Minor > > Example: > {code} > [WARNING] [Warn] > /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala:308: > symbol literal is deprecated; use Symbol("assertInvariants") instead > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29400) Improve PrometheusResource to use labels
[ https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29400: -- Priority: Minor (was: Major) > Improve PrometheusResource to use labels > > > Key: SPARK-29400 > URL: https://issues.apache.org/jira/browse/SPARK-29400 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > SPARK-29064 introduced `PrometheusResource` for native support. This issue > aims to improve it to use **labels**. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29400) Improve PrometheusResource to use labels
Dongjoon Hyun created SPARK-29400: - Summary: Improve PrometheusResource to use labels Key: SPARK-29400 URL: https://issues.apache.org/jira/browse/SPARK-29400 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun SPARK-29064 introduced `PrometheusResource` for native support. This issue aims to improve it to use **labels**. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22256) Introduce spark.mesos.driver.memoryOverhead
[ https://issues.apache.org/jira/browse/SPARK-22256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947177#comment-16947177 ] David McWhorter commented on SPARK-22256: - I created this pull request https://github.com/pmackles/spark/pull/1 to update the branch for this pull request with changes on branch https://github.com/dmcwhorter/spark/tree/dmcwhorter-SPARK-22256. The updates do the following: 1. Rebase pmackles/paul-SPARK-22256 onto the latest apache/spark master branch 2. Add a test case for when the default value of spark.mesos.driver.memoryOverhead is 10% of driver memory as requested in apache#21006 My hope is this update will make this pull request ready to merge and include in the next spark release. > Introduce spark.mesos.driver.memoryOverhead > > > Key: SPARK-22256 > URL: https://issues.apache.org/jira/browse/SPARK-22256 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.2.0, 2.3.0 >Reporter: Cosmin Lehene >Priority: Minor > Labels: docker, memory, mesos > Original Estimate: 24h > Remaining Estimate: 24h > > When running spark driver in a container such as when using the Mesos > dispatcher service, we need to apply the same rules as for executors in order > to avoid the JVM going over the allotted limit and then killed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
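The 10%-of-driver-memory default requested in the update above can be sketched as follows, assuming it mirrors Spark's executor-side convention of a 0.10 overhead factor with a 384 MB floor. The names and the driver-side applicability are illustrative assumptions, not Spark's actual code:

```scala
// Sketch of the overhead rule discussed above (assumed values: Spark uses a
// 0.10 factor and a 384 MB minimum for executor memory overhead; the PR
// proposes the same default for the Mesos driver).
val MinimumOverheadMb = 384
val OverheadFactor = 0.10

def defaultDriverMemoryOverheadMb(driverMemoryMb: Int): Int =
  math.max(MinimumOverheadMb, (driverMemoryMb * OverheadFactor).toInt)
```

So a 1 GB driver would get the 384 MB floor, while larger drivers get a flat 10% on top of their requested memory.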
[jira] [Commented] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression
[ https://issues.apache.org/jira/browse/SPARK-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947171#comment-16947171 ] Huaxin Gao commented on SPARK-29381: Yes. I am happy to work on this. [~podongfeng] > Add 'private' _XXXParams classes for classification & regression > > > Key: SPARK-29381 > URL: https://issues.apache.org/jira/browse/SPARK-29381 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > ping [~huaxingao] would you like to work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29399) Remove old ExecutorPlugin API (or wrap it using new API)
Marcelo Masiero Vanzin created SPARK-29399: -- Summary: Remove old ExecutorPlugin API (or wrap it using new API) Key: SPARK-29399 URL: https://issues.apache.org/jira/browse/SPARK-29399 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Masiero Vanzin The parent bug is proposing a new plugin API for Spark, so we could remove the old one since it's a developer API. If we can get the new API into Spark 3.0, then removal might be a better idea than deprecation. That would be my preference since then we can remove the new elements added as part of SPARK-28091 without having to deprecate them first. If it doesn't make it into 3.0, then we should deprecate the APIs from 3.0 and manage old plugins through the new code being added. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29398) Allow RPC endpoints to use dedicated thread pools
Marcelo Masiero Vanzin created SPARK-29398: -- Summary: Allow RPC endpoints to use dedicated thread pools Key: SPARK-29398 URL: https://issues.apache.org/jira/browse/SPARK-29398 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Masiero Vanzin This is a new feature of the RPC framework so that we can isolate RPC message delivery for plugins from normal Spark RPC needs. This minimizes the impact that plugins can have on normal RPC communication - they'll still fight for CPU, but they wouldn't block the dispatcher threads used by existing Spark RPC endpoints. See parent bug for further details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
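A minimal sketch of the isolation idea described above: give an endpoint its own single-thread pool so its messages are delivered off the shared dispatcher threads, and a slow plugin endpoint can only ever occupy its own thread. All names are illustrative, not Spark's actual RPC classes:

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

// Hypothetical endpoint wrapper: messages posted to it are handled on a
// dedicated single-thread pool, in order, isolated from other endpoints.
class IsolatedEndpoint(name: String) {
  private val pool = Executors.newSingleThreadExecutor()
  val received = new ConcurrentLinkedQueue[String]()

  // Delivery happens on this endpoint's own thread, not a shared dispatcher.
  def post(msg: String): Unit = pool.execute(() => received.add(s"$name:$msg"))

  def shutdown(): Unit = {
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
  }
}

val plugin = new IsolatedEndpoint("plugin")
plugin.post("init")
plugin.post("metrics")
plugin.shutdown() // waits for both messages to be delivered
```

As the issue notes, the isolated endpoint still competes for CPU, but it can no longer block threads that other endpoints depend on.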
[jira] [Created] (SPARK-29397) Create new plugin interface for driver and executor plugins
Marcelo Masiero Vanzin created SPARK-29397: -- Summary: Create new plugin interface for driver and executor plugins Key: SPARK-29397 URL: https://issues.apache.org/jira/browse/SPARK-29397 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Masiero Vanzin This task covers the work of adding a new interface for Spark plugins, covering both driver and executor side component. See parent bug for details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
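One hypothetical shape for such an interface (an illustrative sketch, not the API that was ultimately added): a single plugin entry point exposing a driver-side and an executor-side component, with the driver able to hand configuration to the executors:

```scala
// Illustrative plugin interface sketch covering both sides.
trait DriverPlugin {
  def init(): Map[String, String] = Map.empty // conf to ship to executors
  def shutdown(): Unit = ()
}
trait ExecutorPlugin {
  def init(extraConf: Map[String, String]): Unit = ()
  def shutdown(): Unit = ()
}
trait SparkPluginSketch {
  def driverPlugin(): DriverPlugin
  def executorPlugin(): ExecutorPlugin
}

// Example implementation of the sketch.
class MetricsPlugin extends SparkPluginSketch {
  def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(): Map[String, String] = Map("metrics.mode" -> "demo")
  }
  def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {}
}
```

Pairing the two components under one entry point is what lets executor plugins get information from, or communicate with, their driver counterpart.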
[jira] [Created] (SPARK-29396) Extend Spark plugin interface to driver
Marcelo Masiero Vanzin created SPARK-29396: -- Summary: Extend Spark plugin interface to driver Key: SPARK-29396 URL: https://issues.apache.org/jira/browse/SPARK-29396 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.0.0 Reporter: Marcelo Masiero Vanzin Spark provides an extension API for people to implement executor plugins, added in SPARK-24918 and later extended in SPARK-28091. That API does not offer any functionality for doing similar things on the driver side, though. As a consequence of that, there is not a good way for the executor plugins to get information or communicate in any way with the Spark driver. I've been playing with such an improved API for developing some new functionality. I'll file a few child bugs for the work to get the changes in. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srini E updated SPARK-29337: Labels: Question stack-overflow (was: ) > How to Cache Table and Pin it in Memory and should not Spill to Disk on > Thrift Server > -- > > Key: SPARK-29337 > URL: https://issues.apache.org/jira/browse/SPARK-29337 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Srini E >Priority: Major > Labels: Question, stack-overflow > Attachments: Cache+Image.png > > > Hi Team, > How to pin the table in cache so it would not swap out of memory? > Situation: We are using Microstrategy BI reporting. Semantic layer is built. > We wanted to Cache highly used tables into CACHE using Spark SQL CACHE Table > ; we did cache for SPARK context( Thrift server). Please see > below snapshot of Cache table, went to disk over time. Initially it was all > in cache , now some in cache and some in disk. That disk may be local disk > relatively more expensive reading than from s3. Queries may take longer and > inconsistent times from user experience perspective. If More queries running > using Cache tables, copies of the cache table images are copied and copies > are not staying in memory causing reports to run longer. so how to pin the > table so would not swap to disk. Spark memory management is dynamic > allocation, and how to use those few tables to Pin in memory . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srini E reopened SPARK-29337: - Reopened and labelled it with stack-overflow and Question > How to Cache Table and Pin it in Memory and should not Spill to Disk on > Thrift Server > -- > > Key: SPARK-29337 > URL: https://issues.apache.org/jira/browse/SPARK-29337 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Srini E >Priority: Major > Labels: Question, stack-overflow > Attachments: Cache+Image.png > > > Hi Team, > How to pin the table in cache so it would not swap out of memory? > Situation: We are using Microstrategy BI reporting. Semantic layer is built. > We wanted to Cache highly used tables into CACHE using Spark SQL CACHE Table > ; we did cache for SPARK context( Thrift server). Please see > below snapshot of Cache table, went to disk over time. Initially it was all > in cache , now some in cache and some in disk. That disk may be local disk > relatively more expensive reading than from s3. Queries may take longer and > inconsistent times from user experience perspective. If More queries running > using Cache tables, copies of the cache table images are copied and copies > are not staying in memory causing reports to run longer. so how to pin the > table so would not swap to disk. Spark memory management is dynamic > allocation, and how to use those few tables to Pin in memory . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
[ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srini E reopened SPARK-29335: - Updated the label to Question and stack-overflow > Cost Based Optimizer stats are not used while evaluating query plans in Spark > Sql > - > > Key: SPARK-29335 > URL: https://issues.apache.org/jira/browse/SPARK-29335 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 2.3.0 > Environment: We tried to execute the same using spark-sql and the Thrift > server via SQLWorkbench, but we are not able to use the stats. >Reporter: Srini E >Priority: Major > Labels: Question, stack-overflow > Attachments: explain_plan_cbo_spark.txt > > > We are trying to leverage CBO to get better plan results for a few > critical queries run through spark-sql or through the Thrift server using the JDBC driver. > The following settings were added to spark-defaults.conf > {code} > spark.sql.cbo.enabled true > spark.experimental.extrastrategies intervaljoin > spark.sql.cbo.joinreorder.enabled true > {code} > > The tables that we are using are not partitioned. > {code} > spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ; > analyze table arrow.t_fperiods_sundar compute statistics for columns eid, > year, ptype, absref, fpid , pid ; > analyze table arrow.t_fdata_sundar compute statistics ; > analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, > absref; > {code} > Analyze completed successfully. > Describe extended does not show column-level stats data, and queries are not > leveraging table- or column-level stats. > We are using Oracle as our Hive catalog store, not Glue. > *When we run queries with Spark SQL, we are not able to see the > stats in use in the explain plan, and we are not sure whether CBO is put to use.* > *A quick response would be helpful.* > *Explain Plan:* > The following explain command does not reference any statistics usage. 
> > {code} > spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref > from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = > a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 > and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* > > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = > 2017),(ptype#4546 = A),(eid#4542 = > 29940),isnull(PID#4527),isnotnull(fpid#4523) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... > 3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(absref#4569),(absref#4569 = > Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: > string ... 
3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) > == Parsed Logical Plan == > 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] > +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && > (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && > ('a12.eid = 29940)) && isnull('a12.PID))) > +- 'Join Inner > :- 'SubqueryAlias a12 > : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` > +- 'SubqueryAlias a13 > +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` > > == Analyzed Logical Plan == > imnem: string, fvalue: string, ptype: string, absref: string > Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] > +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = > cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = > 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = > cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527))) > +- Join Inner > :- SubqueryAlias a12 > : +- SubqueryAlias t_fperiods_sundar > : +- >
[jira] [Updated] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
[ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srini E updated SPARK-29335: Labels: Question stack-overflow (was: ) > Cost Based Optimizer stats are not used while evaluating query plans in Spark > Sql > - > > Key: SPARK-29335 > URL: https://issues.apache.org/jira/browse/SPARK-29335 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 2.3.0 > Environment: We tried to execute the same using spark-sql and the Thrift > server via SQLWorkbench, but we are not able to use the stats. >Reporter: Srini E >Priority: Major > Labels: Question, stack-overflow > Attachments: explain_plan_cbo_spark.txt > > > We are trying to leverage CBO to get better plan results for a few > critical queries run through spark-sql or through the Thrift server using the JDBC driver. > The following settings were added to spark-defaults.conf > {code} > spark.sql.cbo.enabled true > spark.experimental.extrastrategies intervaljoin > spark.sql.cbo.joinreorder.enabled true > {code} > > The tables that we are using are not partitioned. > {code} > spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ; > analyze table arrow.t_fperiods_sundar compute statistics for columns eid, > year, ptype, absref, fpid , pid ; > analyze table arrow.t_fdata_sundar compute statistics ; > analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, > absref; > {code} > Analyze completed successfully. > Describe extended does not show column-level stats data, and queries are not > leveraging table- or column-level stats. > We are using Oracle as our Hive catalog store, not Glue. > *When we run queries with Spark SQL, we are not able to see the > stats in use in the explain plan, and we are not sure whether CBO is put to use.* > *A quick response would be helpful.* > *Explain Plan:* > The following explain command does not reference any statistics usage. 
> > {code} > spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref > from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = > a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 > and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* > > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = > 2017),(ptype#4546 = A),(eid#4542 = > 29940),isnull(PID#4527),isnotnull(fpid#4523) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... > 3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(absref#4569),(absref#4569 = > Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: > string ... 
3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) > == Parsed Logical Plan == > 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] > +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && > (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && > ('a12.eid = 29940)) && isnull('a12.PID))) > +- 'Join Inner > :- 'SubqueryAlias a12 > : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` > +- 'SubqueryAlias a13 > +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` > > == Analyzed Logical Plan == > imnem: string, fvalue: string, ptype: string, absref: string > Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] > +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = > cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = > 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = > cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527))) > +- Join Inner > :- SubqueryAlias a12 > : +- SubqueryAlias t_fperiods_sundar > : +- >
[jira] [Assigned] (SPARK-28917) Jobs can hang because of race of RDD.dependencies
[ https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-28917: -- Assignee: Imran Rashid > Jobs can hang because of race of RDD.dependencies > - > > Key: SPARK-28917 > URL: https://issues.apache.org/jira/browse/SPARK-28917 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.3 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > > {{RDD.dependencies}} stores the precomputed cache value, but it is not > thread-safe. This can lead to a race where the value gets overwritten, but > the DAGScheduler gets stuck in an inconsistent state. In particular, this > can happen when there is a race between the DAGScheduler event loop, and > another thread (eg. a user thread, if there is multi-threaded job submission). > First, a job is submitted by the user, which then computes the result Stage > and its parents: > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983 > Which eventually makes a call to {{rdd.dependencies}}: > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519 > At the same time, the user could also touch {{rdd.dependencies}} in another > thread, which could overwrite the stored value because of the race. > Then the DAGScheduler checks the dependencies *again* later on in the job > submission, via {{getMissingParentStages}} > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025 > Because it will find new dependencies, it will create entirely different > stages. Now the job has some orphaned stages which will never run. 
> One symptom of this is seeing disjoint sets of stages in the "Parents of > final stage" and the "Missing parents" messages on job submission (however > this is not required). > (*EDIT*: Seeing repeated msgs "Registering RDD X" actually is just fine, it > is not a symptom of a problem at all. It just means the RDD is the *input* > to multiple shuffles.) > {noformat} > [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - > Starting job: count at XXX.scala:462 > ... > [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler > logInfo - Registering RDD 14 (repartition at XXX.scala:421) > ... > ... > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Final stage: ResultStage 5 (count at XXX.scala:462) > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Parents of final stage: List(ShuffleMapStage 4) > [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler > logInfo - Registering RDD 14 (repartition at XXX.scala:421) > [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler > logInfo - Missing parents: List(ShuffleMapStage 6) > {noformat} > Another symptom is only visible with DEBUG logs turned on for DAGScheduler -- > you will see calls to {{submitStage(Stage X)}} multiple times, followed by a > different set of missing stages. eg. here, we see stage 1 is first missing > stage 0 as a dependency, and then later on it is missing stage 23: > {noformat} > 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1) > 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage > 0) > ... 
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1) > 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage > 23) > {noformat} > Note that there is a similar issue w/ {{rdd.partitions}}. In particular for > some RDDs, {{partitions}} references {{dependencies}} (eg. {{CoGroupedRDD}}). > > There is also an issue that {{rdd.storageLevel}} is read and cached in the > scheduler, but it could be modified simultaneously by the user in another > thread. But, I can't see a way it could affect the scheduler. > *WORKAROUND*: > (a) call {{rdd.dependencies}} while you know that RDD is only getting touched > by one thread (eg. in the thread that created it, or before you submit > multiple jobs touching that RDD from other threads). Then that value will get > cached. > (b) don't submit jobs from multiple threads. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail:
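The failure mode described above — a lazily cached {{RDD.dependencies}} value that two threads can recompute and overwrite — can be illustrated outside Spark. This is a minimal sketch in Python with hypothetical names, not Spark's actual code; it shows the unsynchronized cache and the lock-based variant that generalizes workaround (a):

```python
import threading

class Node:
    """Sketch of an RDD-like object that lazily caches its dependencies.
    Hypothetical illustration of the race above, not Spark's code."""

    def __init__(self, compute_deps):
        self._compute_deps = compute_deps
        self._deps = None            # cached value, like RDD.dependencies
        self._lock = threading.Lock()

    def dependencies_unsafe(self):
        # Race: two threads may both observe None, both recompute, and
        # different callers (e.g. the scheduler vs. a user thread) can end
        # up holding two different, non-identical dependency objects.
        if self._deps is None:
            self._deps = self._compute_deps()
        return self._deps

    def dependencies_safe(self):
        # Compute once under a lock so every caller observes the same
        # cached object, no matter which thread asked first.
        with self._lock:
            if self._deps is None:
                self._deps = self._compute_deps()
        return self._deps
```

With `dependencies_safe`, repeated calls always return the identical cached object, which is the property the DAGScheduler implicitly relies on.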
[jira] [Resolved] (SPARK-28917) Jobs can hang because of race of RDD.dependencies
[ https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-28917. Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 25951 [https://github.com/apache/spark/pull/25951] > Jobs can hang because of race of RDD.dependencies > - > > Key: SPARK-28917 > URL: https://issues.apache.org/jira/browse/SPARK-28917 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.3, 2.4.3 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > {{RDD.dependencies}} stores the precomputed cache value, but it is not > thread-safe. This can lead to a race where the value gets overwritten, but > the DAGScheduler gets stuck in an inconsistent state. In particular, this > can happen when there is a race between the DAGScheduler event loop, and > another thread (eg. a user thread, if there is multi-threaded job submission). > First, a job is submitted by the user, which then computes the result Stage > and its parents: > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983 > Which eventually makes a call to {{rdd.dependencies}}: > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519 > At the same time, the user could also touch {{rdd.dependencies}} in another > thread, which could overwrite the stored value because of the race. > Then the DAGScheduler checks the dependencies *again* later on in the job > submission, via {{getMissingParentStages}} > https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025 > Because it will find new dependencies, it will create entirely different > stages. 
Now the job has some orphaned stages which will never run. > One symptom of this is seeing disjoint sets of stages in the "Parents of > final stage" and the "Missing parents" messages on job submission (however > this is not required). > (*EDIT*: Seeing repeated msgs "Registering RDD X" actually is just fine, it > is not a symptom of a problem at all. It just means the RDD is the *input* > to multiple shuffles.) > {noformat} > [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - > Starting job: count at XXX.scala:462 > ... > [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler > logInfo - Registering RDD 14 (repartition at XXX.scala:421) > ... > ... > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Final stage: ResultStage 5 (count at XXX.scala:462) > [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler > logInfo - Parents of final stage: List(ShuffleMapStage 4) > [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler > logInfo - Registering RDD 14 (repartition at XXX.scala:421) > [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler > logInfo - Missing parents: List(ShuffleMapStage 6) > {noformat} > Another symptom is only visible with DEBUG logs turned on for DAGScheduler -- > you will see calls to {{submitStage(Stage X)}} multiple times, followed by a > different set of missing stages. eg. here, we see stage 1 is first missing > stage 0 as a dependency, and then later on it is missing stage 23: > {noformat} > 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1) > 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage > 0) > ... 
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1) > 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage > 23) > {noformat} > Note that there is a similar issue w/ {{rdd.partitions}}. In particular for > some RDDs, {{partitions}} references {{dependencies}} (eg. {{CoGroupedRDD}}). > > There is also an issue that {{rdd.storageLevel}} is read and cached in the > scheduler, but it could be modified simultaneously by the user in another > thread. But, I can't see a way it could affect the scheduler. > *WORKAROUND*: > (a) call {{rdd.dependencies}} while you know that RDD is only getting touched > by one thread (eg. in the thread that created it, or before you submit > multiple jobs touching that RDD from other threads). Then that value will get > cached. > (b) don't submit jobs from multiple threads. -- This message
[jira] [Updated] (SPARK-29370) Interval strings without explicit unit markings
[ https://issues.apache.org/jira/browse/SPARK-29370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29370: --- Description: In PostgreSQL, quantities of days, hours, minutes, and seconds can be specified without explicit unit markings. For example, '1 12:59:10' is read the same as '1 day 12 hours 59 min 10 sec'. For example: {code:java} maxim=# select interval '1 12:59:10'; interval 1 day 12:59:10 (1 row) {code} It should allow specifying the sign: {code} maxim=# SELECT interval '1 +2:03:04' minute to second; interval 1 day 02:03:04 maxim=# SELECT interval '1 -2:03:04' minute to second; interval - 1 day -02:03:04 {code} was: In PostgreSQL, quantities of days, hours, minutes, and seconds can be specified without explicit unit markings. For example, '1 12:59:10' is read the same as '1 day 12 hours 59 min 10 sec'. For example: {code} maxim=# select interval '1 12:59:10'; interval 1 day 12:59:10 (1 row) {code} > Interval strings without explicit unit markings > --- > > Key: SPARK-29370 > URL: https://issues.apache.org/jira/browse/SPARK-29370 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > In PostgreSQL, quantities of days, hours, minutes, and seconds can be > specified without explicit unit markings. For example, '1 12:59:10' is read > the same as '1 day 12 hours 59 min 10 sec'. For example: > {code:java} > maxim=# select interval '1 12:59:10'; > interval > > 1 day 12:59:10 > (1 row) > {code} > It should allow specifying the sign: > {code} > maxim=# SELECT interval '1 +2:03:04' minute to second; > interval > > 1 day 02:03:04 > maxim=# SELECT interval '1 -2:03:04' minute to second; > interval > - > 1 day -02:03:04 > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
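The PostgreSQL semantics described above — a day count followed by an optionally signed time part, so '1 -2:03:04' means 1 day minus 02:03:04 — can be sketched as a small parser. This is a hypothetical helper for illustration, not Spark's or PostgreSQL's implementation:

```python
import re
from datetime import timedelta

def parse_day_time_interval(s):
    """Parse a PostgreSQL-style unmarked interval such as '1 12:59:10' or
    '1 -2:03:04' into a timedelta. Illustrative sketch only."""
    m = re.fullmatch(r"\s*(\d+)\s+([+-]?)(\d+):(\d+):(\d+)\s*", s)
    if m is None:
        raise ValueError(f"cannot parse interval: {s!r}")
    days, sign, hours, minutes, seconds = m.groups()
    time_part = timedelta(hours=int(hours), minutes=int(minutes),
                          seconds=int(seconds))
    # A leading '-' negates only the time-of-day part, matching the
    # '1 day -02:03:04' output shown above.
    if sign == "-":
        time_part = -time_part
    return timedelta(days=int(days)) + time_part
```

For example, `parse_day_time_interval("1 -2:03:04")` yields one day minus two hours, three minutes, and four seconds, mirroring the `1 day -02:03:04` result in the quoted psql session.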
[jira] [Created] (SPARK-29395) Precision of the interval type
Maxim Gekk created SPARK-29395: -- Summary: Precision of the interval type Key: SPARK-29395 URL: https://issues.apache.org/jira/browse/SPARK-29395 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk PostgreSQL allows specifying the interval precision, see [https://www.postgresql.org/docs/12/datatype-datetime.html] |{{interval [ _{{fields}}_ ] [ (_{{p}}_) ]}}|16 bytes|time interval|-17800 years|17800 years|1 microsecond| For example: {code} maxim=# SELECT interval '1 2:03.4567' day to second(2); interval --- 1 day 00:02:03.46 (1 row) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
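The precision field `(p)` rounds the fractional seconds of the interval, which is why `day to second(2)` prints `00:02:03.46` above. A rough sketch of that seconds-rounding (a hypothetical helper, not Spark's or PostgreSQL's formatting code):

```python
def format_seconds(seconds, p):
    """Format an interval's seconds component with p fractional digits,
    mimicking the precision-2 output '03.46' for 3.4567 shown above.
    Hypothetical illustration only."""
    if p <= 0:
        # No fractional digits: round to whole seconds, zero-padded.
        return f"{round(seconds):02d}"
    # Width = 2 integer digits + '.' + p fractional digits.
    return f"{round(seconds, p):0{3 + p}.{p}f}"
```

So `format_seconds(3.4567, 2)` produces `"03.46"`, matching the quoted psql output.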
[jira] [Created] (SPARK-29394) Support ISO 8601 format for intervals
Maxim Gekk created SPARK-29394: -- Summary: Support ISO 8601 format for intervals Key: SPARK-29394 URL: https://issues.apache.org/jira/browse/SPARK-29394 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Interval values can also be written as ISO 8601 time intervals, using either the “format with designators” of the standard's section 4.4.3.2 or the “alternative format” of section 4.4.3.3. For example: |P1Y2M3DT4H5M6S|ISO 8601 “format with designators”| |P0001-02-03T04:05:06|ISO 8601 “alternative format”: same meaning as above| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
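The "format with designators" mentioned above (e.g. P1Y2M3DT4H5M6S) is regular enough to sketch with a regex. This is an illustrative parser for the non-negative integer-field case only, not Spark's implementation, and it ignores the "alternative format":

```python
import re

# P<years>Y<months>M<days>D T<hours>H<minutes>M<seconds>S, all parts optional.
ISO_8601_INTERVAL = re.compile(
    r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?"
    r"(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?"
)

def parse_iso8601_interval(s):
    """Parse an ISO 8601 'format with designators' interval into a dict
    of integer fields. Sketch only; no fractions, weeks, or signs."""
    m = ISO_8601_INTERVAL.fullmatch(s)
    if m is None or s in ("P", "PT"):
        raise ValueError(f"not an ISO 8601 interval: {s!r}")
    names = ("years", "months", "days", "hours", "minutes", "seconds")
    return {name: int(value or 0) for name, value in zip(names, m.groups())}
```

Note the `T` separator is what disambiguates months from minutes, both written `M`.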
[jira] [Created] (SPARK-29393) Add the make_interval() function
Maxim Gekk created SPARK-29393: -- Summary: Add the make_interval() function Key: SPARK-29393 URL: https://issues.apache.org/jira/browse/SPARK-29393 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk PostgreSQL allows making an interval using the make_interval() function: |{{make_interval(_{{years}}_ }}{{int}}{{ DEFAULT 0, _{{months}}_ }}{{int}}{{ DEFAULT 0, _{{weeks}}_ }}{{int}}{{ DEFAULT 0, _{{days}}_ }}{{int}}{{ DEFAULT 0, _{{hours}}_ }}{{int}}{{ DEFAULT 0, _{{mins}}_ }}{{int}}{{ DEFAULT 0, _{{secs}}_ }}{{double precision}}{{ DEFAULT 0.0)}}|{{interval}}|Create interval from years, months, weeks, days, hours, minutes and seconds fields|{{make_interval(days => 10)}}|{{10 days}}| See https://www.postgresql.org/docs/12/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
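The make_interval() signature above maps naturally onto keyword arguments with defaults. The sketch below is only an illustration of that calling convention: it folds years and months into days using fixed 365/30-day lengths, which a real interval type (with distinct month and microsecond components) deliberately avoids.

```python
from datetime import timedelta

def make_interval(years=0, months=0, weeks=0, days=0,
                  hours=0, mins=0, secs=0.0):
    """Sketch of PostgreSQL's make_interval() argument shape.
    The years/months folding below is an illustrative simplification,
    not how calendar intervals actually behave."""
    total_days = years * 365 + months * 30 + weeks * 7 + days
    return timedelta(days=total_days, hours=hours, minutes=mins, seconds=secs)
```

For example, `make_interval(days=10)` corresponds to the `make_interval(days => 10)` call in the quoted table, yielding a 10-day interval.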
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947042#comment-16947042 ] Huaxin Gao commented on SPARK-29212: +1 for Sean's comment. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > Copied from [https://github.com/apache/spark/pull/25776]. > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > * > ** I am a Python user and want to implement a pure Python ML classifier without > providing a JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high-level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run-time parameter validation for both user-defined (see above) and > existing class hierarchies, > * > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that a user-passed argument is indeed a classifier: > *** For built-in objects using the "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user-defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it is hardly satisfying. > *** Provide a parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and user-provided classes. 
> *** Provide a parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > * > ** I am either an end user or library developer and want to use PEP-484 > annotations to indicate components that require a classifier or classification > model. > * > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > (...) > * First of all, nothing in the original API indicated this. On the contrary, > the original API clearly suggests that a non-Java path is supported, by > providing base classes (Params, Transformer, Estimator, Model, ML > \{Reader,Writer}, ML\{Readable,Writable}) as well as Java-specific > implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, > JavaML\{Reader,Writer}, JavaML > {Readable,Writable} > ). > * Furthermore, authoritative (IMHO) and open Python ML extensions exist > (spark-sklearn is one of these, but if I recall correctly spark-deep-learning > provides some pure-Python utilities). Personally I've seen quite a lot of > private implementations, but that's just anecdotal evidence. > Let us assume for the sake of argument that the above observations are > irrelevant. 
I will argue that having a complete, public type hierarchy is still > desired: > * Two out of three use cases I described can be narrowed down to the Java > implementation only, though they are less compelling if we do that. > * More importantly, a public type hierarchy with Java-specific extensions is > the pyspark.ml standard. There is no reason to treat this specific case as an > exception, especially when the implementation is far from utilitarian (for > example, the implementation-free JavaClassifierParams and > JavaProbabilisticClassifierParams serve, as of now, no practical purpose > whatsoever). > > Maciej's *Proposal*: > {code:python} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class
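The structural-subtyping option sketched in the discussion above can be expressed with `typing.Protocol`. The names below are hypothetical and heavily simplified relative to the real pyspark.ml hierarchy; the point is only to show that a runtime-checkable protocol lets a pure-Python class pass an "is this a classifier?" check without inheriting from any Java-backed base:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Classifier(Protocol):
    """Structural type: anything with setRawPredictionCol counts.
    Hypothetical sketch, not part of pyspark.ml."""
    def setRawPredictionCol(self, value: str) -> "Classifier": ...

class MyPurePythonClassifier:
    # No inheritance from Classifier needed: providing the method is
    # enough for isinstance() to succeed with a runtime_checkable Protocol.
    def setRawPredictionCol(self, value: str) -> "MyPurePythonClassifier":
        self.rawPredictionCol = value
        return self
```

Note that `runtime_checkable` only verifies the presence of the method, not its signature, which is one reason the comment above calls pure duck typing "hardly satisfying".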
[jira] [Created] (SPARK-29392) Remove use of deprecated symbol literal " 'name " syntax in favor Symbol("name")
Sean R. Owen created SPARK-29392: Summary: Remove use of deprecated symbol literal " 'name " syntax in favor Symbol("name") Key: SPARK-29392 URL: https://issues.apache.org/jira/browse/SPARK-29392 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL, Tests Affects Versions: 3.0.0 Reporter: Sean R. Owen Example: {code} [WARNING] [Warn] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala:308: symbol literal is deprecated; use Symbol("assertInvariants") instead {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation
[ https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947019#comment-16947019 ] Hyukjin Kwon commented on SPARK-19609: -- It was resolved as incomplete because it affects an EOL release and has been inactive for more than a year, as discussed on the dev mailing list. Please reopen if you can verify it is still an issue in Spark 2.4+ > Broadcast joins should pushdown join constraints as Filter to the larger > relation > - > > Key: SPARK-19609 > URL: https://issues.apache.org/jira/browse/SPARK-19609 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk >Priority: Major > Labels: bulk-closed > > For broadcast inner-joins, where the smaller relation is known to be small > enough to materialize on a worker, the set of values for all join columns is > known and fits in memory. Spark should translate these values into a > {{Filter}} pushed down to the datasource. The common join condition of > equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. > An example of pushing such filters is already present in the form of > {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks. > This optimization could even work when the smaller relation does not fit > entirely in memory. This could be done by partitioning the smaller relation > into N pieces, applying this predicate pushdown for each piece, and unioning > the results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation
[ https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947001#comment-16947001 ] Nick Dimiduk commented on SPARK-19609: -- Hi [~hyukjin.kwon], mind adding a comment as to why this issue was closed? Has the functionality been implemented elsewhere? How about a link off to the relevant JIRA so I know what fix version to look for? Thanks! > Broadcast joins should pushdown join constraints as Filter to the larger > relation > - > > Key: SPARK-19609 > URL: https://issues.apache.org/jira/browse/SPARK-19609 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nick Dimiduk >Priority: Major > Labels: bulk-closed > > For broadcast inner-joins, where the smaller relation is known to be small > enough to materialize on a worker, the set of values for all join columns is > known and fits in memory. Spark should translate these values into a > {{Filter}} pushed down to the datasource. The common join condition of > equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. > An example of pushing such filters is already present in the form of > {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks. > This optimization could even work when the smaller relation does not fit > entirely in memory. This could be done by partitioning the smaller relation > into N pieces, applying this predicate pushdown for each piece, and unioning > the results. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
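The rewrite proposed in the issue above — collect the broadcast side's join-key values and push them down as an `a in ...` filter on the larger relation — can be sketched outside Spark. This is a hypothetical illustration of the idea with plain Python dicts, not Spark's optimizer:

```python
def pushdown_in_filter(small_rows, key):
    """Build a predicate from the small (broadcast) relation's join-key
    values, to be applied at the large relation's scan. Sketch only."""
    keys = {row[key] for row in small_rows}
    return lambda row: row[key] in keys

small = [{"a": 1}, {"a": 3}]
large = [{"a": 1, "v": "x"}, {"a": 2, "v": "y"}, {"a": 3, "v": "z"}]

pred = pushdown_in_filter(small, "a")
# Only rows that could possibly join survive the scan.
filtered = [row for row in large if pred(row)]
```

The benefit is that rows which cannot join are dropped at the data source instead of being shuffled or probed against the broadcast hash table.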
[jira] [Assigned] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected
[ https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-29336: Assignee: Guilherme Souza > The implementation of QuantileSummaries.merge does not guarantee that the > relativeError will be respected > --- > > Key: SPARK-29336 > URL: https://issues.apache.org/jira/browse/SPARK-29336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Guilherme Souza >Assignee: Guilherme Souza >Priority: Minor > > Hello Spark maintainers, > I was experimenting with my own implementation of the [space-efficient > quantile > algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] > in another language and I was using the Spark's one as a reference. > In my analysis, I believe to have found an issue with the {{merge()}} logic. > Here is some simple Scala code that reproduces the issue I've found: > > {code:java} > var values = (1 to 100).toArray > val all_quantiles = values.indices.map(i => (i+1).toDouble / > values.length).toArray > for (n <- 0 until 5) { > var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) > val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) > val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray > val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) > => Math.abs(expected - answer) }).toArray > val max_error = error.max > print(max_error + "\n") > } > {code} > I query for all possible quantiles in a 100-element array with a desired 10% > max error. In this scenario, one would expect to observe a maximum error of > 10 ranks or less (10% of 100). However, the output I observe is: > > {noformat} > 16 > 12 > 10 > 11 > 17{noformat} > The variance is probably due to non-deterministic operations behind the > scenes, but irrelevant to the core cause. 
(and sorry for my Scala, I'm not > used to it) > Interestingly enough, if I change from five to one partition the code works > as expected and gives 10 every time. This seems to point to some problem in > the [merge > logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] > The original authors ([~clockfly] and [~cloud_fan] for what I could dig from > the history) suggest the published paper is not clear on how that should be > done and, honestly, I was not confident in the current approach either. > I've found SPARK-21184, which reports the same problem, but it was > unfortunately closed with no fix applied. > In my external implementation I believe I have found a sound way to > implement the merge method. [Here is my take in Rust, if > relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218] > I'd be really glad to add unit tests and contribute my implementation adapted > to Scala. > I'd love to hear your opinion on the matter. > Best regards
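For readers following along without a Spark cluster, the reporter's error measurement can be reproduced in a few lines of pure Python. The `shifted` answer set below is a made-up stand-in for an approximate summary whose answers are consistently 16 ranks off, i.e. the kind of out-of-bound result the report observed for `relativeError = 0.1` (which permits at most 10 ranks on 100 elements):

```python
def max_rank_error(values, answers):
    """values: sorted exact data; answers[i] approximates the (i+1)/n quantile."""
    return max(abs(expected - values.index(ans))
               for expected, ans in enumerate(answers))

values = list(range(1, 101))

# A perfect summary answers every quantile exactly: zero rank error.
assert max_rank_error(values, values) == 0

# An answer set shifted up by 16 ranks (clamped at the maximum value).
shifted = values[16:] + values[-1:] * 16
```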
[jira] [Resolved] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected
[ https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-29336. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26029 [https://github.com/apache/spark/pull/26029] > The implementation of QuantileSummaries.merge does not guarantee that the > relativeError will be respected > --- > > Key: SPARK-29336 > URL: https://issues.apache.org/jira/browse/SPARK-29336 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: Guilherme Souza >Assignee: Guilherme Souza >Priority: Minor > Fix For: 3.0.0 > > > Hello Spark maintainers, > I was experimenting with my own implementation of the [space-efficient > quantile > algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf] > in another language and I was using the Spark's one as a reference. > In my analysis, I believe to have found an issue with the {{merge()}} logic. > Here is some simple Scala code that reproduces the issue I've found: > > {code:java} > var values = (1 to 100).toArray > val all_quantiles = values.indices.map(i => (i+1).toDouble / > values.length).toArray > for (n <- 0 until 5) { > var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5) > val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1) > val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray > val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) > => Math.abs(expected - answer) }).toArray > val max_error = error.max > print(max_error + "\n") > } > {code} > I query for all possible quantiles in a 100-element array with a desired 10% > max error. In this scenario, one would expect to observe a maximum error of > 10 ranks or less (10% of 100). 
However, the output I observe is: > > {noformat} > 16 > 12 > 10 > 11 > 17{noformat} > The variance is probably due to non-deterministic operations behind the > scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not > used to it) > Interestingly enough, if I change from five to one partition the code works > as expected and gives 10 every time. This seems to point to some problem at > the [merge > logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171] > The original authors ([~clockfly] and [~cloud_fan] for what I could dig from > the history) suggest the published paper is not clear on how that should be > done and, honestly, I was not confident in the current approach either. > I've found SPARK-21184 that reports the same problem, but it was > unfortunately closed with no fix applied. > In my external implementation I believe to have found a sound way to > implement the merge method. [Here is my take in Rust, if > relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218] > I'd be really glad to add unit tests and contribute my implementation adapted > to Scala. > I'd love to hear your opinion on the matter. > Best regards > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
[ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29335: - Description: We are trying to leverage CBO for getting better plan results for a few critical queries run through spark-sql or through the Thrift server using the JDBC driver. The following settings were added to spark-defaults.conf {code} spark.sql.cbo.enabled true spark.experimental.extrastrategies intervaljoin spark.sql.cbo.joinreorder.enabled true {code} The tables that we are using are not partitioned. {code} spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ; analyze table arrow.t_fperiods_sundar compute statistics for columns eid, year, ptype, absref, fpid , pid ; analyze table arrow.t_fdata_sundar compute statistics ; analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, absref; {code} Analyze completed successfully. Describe extended does not show column-level stats data, and queries are not leveraging table- or column-level stats. We are using Oracle as our Hive catalog store, not Glue. *When we are using spark sql and running queries we are not able to see the stats in use in the explain plan and we are not sure if CBO is put to use.* *A quick response would be helpful.* *Explain Plan:* The following explain command does not reference any statistics usage.
{code} spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 2017),(ptype#4546 = A),(eid#4542 = 29940),isnull(PID#4527),isnotnull(fpid#4523) 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(absref#4569),(absref#4569 = Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) == Parsed Logical Plan == 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && ('a12.eid = 29940)) && isnull('a12.PID))) +- 'Join Inner :- 'SubqueryAlias a12 : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` +- 'SubqueryAlias a13 +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` == Analyzed Logical Plan == imnem: string, fvalue: string, ptype: string, absref: string Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) 
&& (year#4545 = 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527))) +- Join Inner :- SubqueryAlias a12 : +- SubqueryAlias t_fperiods_sundar : +- Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546] parquet +- SubqueryAlias a13 +- SubqueryAlias t_fdata_sundar +- Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569] parquet == Optimized Logical Plan == Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] +- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as decimal(38,0 :- Project [FPID#4523, EID#4542, PTYPE#4546] : +- Filter (((isnotnull(ptype#4546) && isnotnull(year#4545)) && isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542 = 29940)) && isnull(PID#4527)) && isnotnull(fpid#4523)) : +-
[jira] [Resolved] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql
[ https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29335. -- Resolution: Invalid Questions should go to the mailing list or Stack Overflow; you are likely to get a better answer there. > Cost Based Optimizer stats are not used while evaluating query plans in Spark > Sql > - > > Key: SPARK-29335 > URL: https://issues.apache.org/jira/browse/SPARK-29335 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 2.3.0 > Environment: We tried to execute the same using Spark-sql and Thrift > server using SQLWorkbench but we are not able to use the stats. >Reporter: Srini E >Priority: Major > Attachments: explain_plan_cbo_spark.txt > > > We are trying to leverage CBO for getting better plan results for a few > critical queries run through spark-sql or through the Thrift server using the > JDBC driver. > The following settings were added to spark-defaults.conf > {code} > spark.sql.cbo.enabled true > spark.experimental.extrastrategies intervaljoin > spark.sql.cbo.joinreorder.enabled true > {code} > > The tables that we are using are not partitioned. > {code} > spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ; > analyze table arrow.t_fperiods_sundar compute statistics for columns eid, > year, ptype, absref, fpid , pid ; > analyze table arrow.t_fdata_sundar compute statistics ; > analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, > absref; > {code} > Analyze completed successfully. > Describe extended does not show column-level stats data, and queries are not > leveraging table- or column-level stats. > We are using Oracle as our Hive catalog store, not Glue. > *When we are using spark sql and running queries we are not able to see the > stats in use in the explain plan and we are not sure if CBO is put to use.* > *A quick response would be helpful.* > *Explain Plan:* > The following explain command does not reference any statistics usage.
> > {code} > spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref > from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = > a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 > and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;* > > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = > 2017),(ptype#4546 = A),(eid#4542 = > 29940),isnull(PID#4527),isnotnull(fpid#4523) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... > 3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID) > 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with: > 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: > isnotnull(absref#4569),(absref#4569 = > Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940) > 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: > string ... 
3 more fields> > 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: > IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940) > == Parsed Logical Plan == > 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref] > +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && > (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && > ('a12.eid = 29940)) && isnull('a12.PID))) > +- 'Join Inner > :- 'SubqueryAlias a12 > : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar` > +- 'SubqueryAlias a13 > +- 'UnresolvedRelation `arrow`.`t_fdata_sundar` > > == Analyzed Logical Plan == > imnem: string, fvalue: string, ptype: string, absref: string > Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569] > +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = > cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = > 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = > cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527))) > +- Join Inner > :- SubqueryAlias a12 > : +- SubqueryAlias t_fperiods_sundar > : +- >
[jira] [Commented] (SPARK-29356) Stopping Spark doesn't shut down all network connections
[ https://issues.apache.org/jira/browse/SPARK-29356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946810#comment-16946810 ] Hyukjin Kwon commented on SPARK-29356: -- Can you post a reproducer or at least some warn/error messages to track? > Stopping Spark doesn't shut down all network connections > > > Key: SPARK-29356 > URL: https://issues.apache.org/jira/browse/SPARK-29356 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Malthe Borch >Priority: Minor > > The Spark session's gateway client still has an open network connection after > a call to `spark.stop()`. This is unexpected and for example in a test suite, > this triggers a resource warning when tearing down the test case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29358. -- Resolution: Won't Fix > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > |   x|   y| > +----+----+ > |   1|null| > |   2|null| > |   3|null| > |null|   a| > |null|   b| > |null|   c| > +----+----+ > {code} > Currently the workaround to make this possible is by adding the missing > columns with {{withColumn}} before calling unionByName, but this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code}
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946809#comment-16946809 ] Hyukjin Kwon commented on SPARK-29358: -- Given that the workaround is pretty easy, I don't think we need the new API. This doesn't seem like a strong enough reason to me (a strong reason would be, e.g., that there's no way to work around it, or that considerable code is required). > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > |   x|   y| > +----+----+ > |   1|null| > |   2|null| > |   3|null| > |null|   a| > |null|   b| > |null|   c| > +----+----+ > {code} > Currently the workaround to make this possible is by adding the missing > columns with {{withColumn}} before calling unionByName, but this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code}
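For illustration, the requested semantics are easy to state outside Spark: take the order-preserving union of the two column sets, then pad each side's rows with nulls for the columns it lacks. A pure-Python sketch (plain dicts stand in for rows; this is not the DataFrame API, and the function name is made up):

```python
def union_by_name_fill_missing(rows1, cols1, rows2, cols2):
    """Union two row sets by column name, padding absent columns with None."""
    all_cols = list(dict.fromkeys(cols1 + cols2))  # order-preserving union
    def pad(rows):
        return [{c: row.get(c) for c in all_cols} for row in rows]
    return all_cols, pad(rows1) + pad(rows2)

cols, rows = union_by_name_fill_missing(
    [{"x": 1}, {"x": 2}, {"x": 3}], ["x"],
    [{"y": "a"}, {"y": "b"}, {"y": "c"}], ["y"],
)
```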
[jira] [Updated] (SPARK-29372) Codegen grows beyond 64 KB for more columns in case of SupportsScanColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-29372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29372: - Priority: Major (was: Critical) > Codegen grows beyond 64 KB for more columns in case of > SupportsScanColumnarBatch > > > Key: SPARK-29372 > URL: https://issues.apache.org/jira/browse/SPARK-29372 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: Shubham Chaurasia >Priority: Major > > In case of vectorized DSv2 readers i.e. if it implements > {{SupportsScanColumnarBatch}} and number of columns is around(or greater > than) 1000 then it throws > {code:java} > Caused by: org.codehaus.janino.InternalCompilerException: Code of method > "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:990) > at org.codehaus.janino.CodeContext.write(CodeContext.java:899) > at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:1016) > at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:11911) > at > org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:3675) > at org.codehaus.janino.UnitCompiler.access$5500(UnitCompiler.java:212) > {code} > I can see from logs that it tries to disable Whole-stage codegen but it's > failing even after that on each retry. > {code} > 19/10/07 20:49:35 WARN WholeStageCodegenExec: Whole-stage codegen disabled > for plan (id=0): > *(0) DataSourceV2Scan [column_0#3558, column_1#3559, column_2#3560, > column_3#3561, column_4#3562, column_5#3563, column_6#3564, column_7#3565, > column_8#3566, column_9#3567, column_10#3568, column_11#3569, column_12#3570, > column_13#3571, column_14#3572, column_15#3573, column_16#3574, > column_17#3575, column_18#3576, column_19#3577, column_20#3578, > column_21#3579, column_22#3580, column_23#3581, ... 
976 more fields], > com.shubham.reader.MyDataSourceReader@5c7673b8 > {code} > Repro code for a simple reader can be: > {code:java} > public class MyDataSourceReader implements DataSourceReader, > SupportsScanColumnarBatch { > private StructType schema; > private int numCols = 10; > private int numRows = 10; > private int numReaders = 1; > public MyDataSourceReader(Map<String, String> options) { > initOptions(options); > System.out.println("MyDataSourceReader.MyDataSourceReader: > Instantiated " + this); > } > private void initOptions(Map<String, String> options) { > String numColumns = options.get("num_columns"); > if (numColumns != null) { > numCols = Integer.parseInt(numColumns); > } > String numRowsOption = options.get("num_rows_per_reader"); > if (numRowsOption != null) { > numRows = Integer.parseInt(numRowsOption); > } > String readersOption = options.get("num_readers"); > if (readersOption != null) { > numReaders = Integer.parseInt(readersOption); > } > } > @Override public StructType readSchema() { > final String colPrefix = "column_"; > StructField[] fields = new StructField[numCols]; > for (int i = 0; i < numCols; i++) { > fields[i] = new StructField(colPrefix + i, DataTypes.IntegerType, true, > Metadata.empty()); > } > schema = new StructType(fields); > return schema; > } > @Override public List<DataReaderFactory<ColumnarBatch>> > createBatchDataReaderFactories() { > System.out.println("MyDataSourceReader.createDataReaderFactories: " + > numReaders); > return new ArrayList<>(); > } > } > {code} > If I pass {{num_columns}} 1000 or greater, the issue appears. > {code:java} > spark.read.format("com.shubham.MyDataSource").option("num_columns", > "1000").option("num_rows_per_reader", 2).option("num_readers", 1).load.show > {code} > Any fixes/workarounds for this? > SPARK-16845 and SPARK-17092 are resolved but looks like they don't deal with > the vectorized part.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
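For context on the failure above: the 64 KB limit applies to a single JVM method's bytecode, so the usual mitigation is to split generated code into helper methods once a size budget is exceeded. A toy pure-Python sketch of that splitting strategy (the statement shapes and the character-count budget are illustrative, not Spark's real codegen accounting):

```python
def split_into_methods(stmts, budget):
    """Greedily pack generated statements into methods under `budget` chars."""
    methods, current, size = [], [], 0
    for s in stmts:
        if current and size + len(s) > budget:
            methods.append(current)  # close the current helper method
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        methods.append(current)
    return methods

# ~1000 column writes, as in the report, split across several helpers.
stmts = [f"columnVector_{i}.putInt(rowId, value_{i});" for i in range(1000)]
methods = split_into_methods(stmts, budget=4096)
```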
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946784#comment-16946784 ] Sean R. Owen commented on SPARK-29212: -- I don't have a strong opinion on it. I'd make it all consistent first, before changing to another consistent state. Removing truly no-op classes is OK, unless we can foresee a use for them later. Pyspark does mean to wrap the JVM implementation, so "Java" related classes aren't inherently wrong even as part of a developer API. Refactoring is good, but weigh it against breaking existing extensions of these classes, even in Spark 3. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > Copied from [https://github.com/apache/spark/pull/25776]. > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > * > ** I am a Python user and want to implement a pure Python ML classifier without > providing a JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > * > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that a user-passed argument is indeed a classifier: > *** For built-in objects using the "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?).
> ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it is hardly satisfying. > *** Provide a parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and user-provided classes. > *** Provide a parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > * > ** I am either an end user or a library developer and want to use PEP-484 > annotations to indicate components that require a classifier or classification > model. > * > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > (...)
> * Furthermore authoritative (IMHO) and open Python ML extensions exist > (spark-sklearn is one of these, but if I recall correctly spark-deep-learning > provides some pure-Python utilities). Personally I've seen quite a lot of > private implementations, but that's just anecdotal evidence. > Let us assume for the sake of argument that the above observations are > irrelevant. I will argue that having a complete, public type hierarchy is still > desired: > * Two out of three use cases I described can be narrowed down to the Java > implementation only, though they are less compelling if we do that. > * More importantly, a public type hierarchy with Java specific extensions is > the pyspark.ml standard. There is no reason to treat this specific case as an > exception, especially when the implementation is far from utilitarian (for > example implementation-free JavaClassifierParams and >
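The structural-subtyping option mentioned in the thread can be made concrete with `typing.Protocol`: any object exposing the right methods counts as a classifier, whether JVM-backed or pure Python. The class names below are illustrative, not actual pyspark.ml API:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ClassifierLike(Protocol):
    """Anything with a classifier-shaped interface, JVM-backed or not."""
    def setRawPredictionCol(self, value: str) -> "ClassifierLike": ...

class PurePythonClassifier:  # no JVM backend, no explicit registration
    def setRawPredictionCol(self, value: str) -> "PurePythonClassifier":
        self._raw_prediction_col = value
        return self

# Structural check: method presence is enough, no inheritance required.
is_classifier = isinstance(PurePythonClassifier(), ClassifierLike)
```

Note that `runtime_checkable` only verifies method presence at runtime, not signatures, which is part of why duck typing alone was called "hardly satisfying" above.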
[jira] [Assigned] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24640: Assignee: Maxim Gekk > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Assignee: Maxim Gekk >Priority: Major > Labels: api, bulk-closed > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24640. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26051 [https://github.com/apache/spark/pull/26051] > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Assignee: Maxim Gekk >Priority: Major > Labels: api, bulk-closed > Fix For: 3.0.0 > > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29391) Default year-month units
Maxim Gekk created SPARK-29391: -- Summary: Default year-month units Key: SPARK-29391 URL: https://issues.apache.org/jira/browse/SPARK-29391 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk PostgreSQL assumes year-month units by default: {code} maxim=# SELECT interval '1-2'; interval --- 1 year 2 mons {code} but the same literal produces NULL in Spark.
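The PostgreSQL behaviour amounts to a parsing rule: a bare `Y-M` literal is read as years and months. A hedged pure-Python sketch of just that rule (Spark's real parser lives in Catalyst; this only models the semantics, with `None` standing in for Spark's current NULL):

```python
import re

def parse_year_month(literal):
    """Parse 'Y-M' with implied year-month units; None models Spark's NULL."""
    m = re.fullmatch(r"(-?\d+)-(\d+)", literal.strip())
    if m is None:
        return None
    years, months = int(m.group(1)), int(m.group(2))
    return years * 12 + months  # total months, a canonical representation
```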
[jira] [Created] (SPARK-29390) Add the justify_days(), justify_hours() and justify_interval() functions
Maxim Gekk created SPARK-29390: -- Summary: Add the justify_days(), justify_hours() and justify_interval() functions Key: SPARK-29390 URL: https://issues.apache.org/jira/browse/SPARK-29390 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk See *Table 9.31. Date/Time Functions* ([https://www.postgresql.org/docs/12/functions-datetime.html]) |{{justify_days(interval)}}|{{interval}}|Adjust interval so 30-day time periods are represented as months|{{justify_days(interval '35 days')}}|{{1 mon 5 days}}| |{{justify_hours(interval)}}|{{interval}}|Adjust interval so 24-hour time periods are represented as days|{{justify_hours(interval '27 hours')}}|{{1 day 03:00:00}}| |{{justify_interval(interval)}}|{{interval}}|Adjust interval using {{justify_days}} and {{justify_hours}}, with additional sign adjustments|{{justify_interval(interval '1 mon -1 hour')}}|{{29 days 23:00:00}}| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
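The arithmetic behind these functions is straightforward; a sketch under PostgreSQL's 30-day-month and 24-hour-day conventions (the helper names mirror the PostgreSQL functions and are not an existing Spark API):

```python
def justify_hours(days, hours):
    # Represent each full 24-hour period as a day.
    extra_days, hours = divmod(hours, 24)
    return days + extra_days, hours

def justify_days(months, days):
    # Represent each full 30-day period as a month.
    extra_months, days = divmod(days, 30)
    return months + extra_months, days
```

These reproduce the table's examples: 27 hours becomes 1 day 3 hours, and 35 days becomes 1 month 5 days.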
[jira] [Created] (SPARK-29389) Short synonyms of interval units
Maxim Gekk created SPARK-29389: -- Summary: Short synonyms of interval units Key: SPARK-29389 URL: https://issues.apache.org/jira/browse/SPARK-29389 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The following synonyms should be supported: {code} ["MILLENNIUM", ("MILLENNIA", "MIL", "MILS"), "CENTURY", ("CENTURIES", "C", "CENT"), "DECADE", ("DECADES", "DEC", "DECS"), "YEAR", ("Y", "YEARS", "YR", "YRS"), "QUARTER", ("QTR"), "MONTH", ("MON", "MONS", "MONTHS"), "DAY", ("D", "DAYS"), "HOUR", ("H", "HOURS", "HR", "HRS"), "MINUTE", ("M", "MIN", "MINS", "MINUTES"), "SECOND", ("S", "SEC", "SECONDS", "SECS"), "MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", "MSECONDS", "MS"), "MICROSECONDS", ("USEC", "USECS", "USECONDS", "MICROSECON", "US"), "EPOCH"] {code} For example: {code} maxim=# select '1y 10mon -10d -10h -10min -10.01s ago'::interval; interval -1 years -10 mons +10 days 10:10:10.01 (1 row) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29389) Support synonyms for interval units
[ https://issues.apache.org/jira/browse/SPARK-29389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29389: --- Summary: Support synonyms for interval units (was: Short synonyms of interval units) > Support synonyms for interval units > --- > > Key: SPARK-29389 > URL: https://issues.apache.org/jira/browse/SPARK-29389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > The following synonyms should be supported: > {code} > ["MILLENNIUM", ("MILLENNIA", "MIL", "MILS"), >"CENTURY", ("CENTURIES", "C", "CENT"), >"DECADE", ("DECADES", "DEC", "DECS"), >"YEAR", ("Y", "YEARS", "YR", "YRS"), >"QUARTER", ("QTR"), >"MONTH", ("MON", "MONS", "MONTHS"), >"DAY", ("D", "DAYS"), >"HOUR", ("H", "HOURS", "HR", "HRS"), >"MINUTE", ("M", "MIN", "MINS", "MINUTES"), >"SECOND", ("S", "SEC", "SECONDS", "SECS"), >"MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", > "MSECONDS", "MS"), >"MICROSECONDS", ("USEC", "USECS", "USECONDS", > "MICROSECON", "US"), >"EPOCH"] > {code} > For example: > {code} > maxim=# select '1y 10mon -10d -10h -10min -10.01s > ago'::interval; > interval > > -1 years -10 mons +10 days 10:10:10.01 > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
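Synonym support boils down to a canonicalization lookup; a sketch over an abridged subset of the unit list above (illustrative only, not Spark parser code):

```python
UNIT_SYNONYMS = {
    "YEAR": ("Y", "YEARS", "YR", "YRS"),
    "MONTH": ("MON", "MONS", "MONTHS"),
    "DAY": ("D", "DAYS"),
    "HOUR": ("H", "HOURS", "HR", "HRS"),
    "MINUTE": ("M", "MIN", "MINS", "MINUTES"),
    "SECOND": ("S", "SEC", "SECONDS", "SECS"),
}

# Invert the table once, so parsing a unit token is a single lookup.
CANONICAL = {s: unit for unit, syns in UNIT_SYNONYMS.items() for s in syns}
CANONICAL.update({unit: unit for unit in UNIT_SYNONYMS})

def canonical_unit(token):
    # Map 'y', 'yr', 'YEARS', ... all onto 'YEAR', and so on.
    return CANONICAL[token.upper()]
```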
[jira] [Created] (SPARK-29388) Construct intervals from the `millenniums`, `centuries` or `decades` units
Maxim Gekk created SPARK-29388: -- Summary: Construct intervals from the `millenniums`, `centuries` or `decades` units Key: SPARK-29388 URL: https://issues.apache.org/jira/browse/SPARK-29388 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk PostgreSQL supports `millenniums`, `centuries` or `decades` interval units. See {code} maxim=# select '4 millenniums 5 centuries 4 decades 1 year 4 months 4 days 17 minutes 31 seconds'::interval; interval --- 4541 years 4 mons 4 days 00:17:31 (1 row) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29387) Support `*` and `/` operators for intervals
Maxim Gekk created SPARK-29387: -- Summary: Support `*` and `/` operators for intervals Key: SPARK-29387 URL: https://issues.apache.org/jira/browse/SPARK-29387 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Support multiplying an interval by a numeric (`*`) and dividing an interval by a numeric (`/`). See [https://www.postgresql.org/docs/12/functions-datetime.html] ||Operator||Example||Result|| |*|900 * interval '1 second'|interval '00:15:00'| |*|21 * interval '1 day'|interval '21 days'| |/|interval '1 hour' / double precision '1.5'|interval '00:40:00'| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
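Python's `datetime.timedelta` already implements exactly this arithmetic, which makes the expected results from the table easy to check:

```python
from datetime import timedelta

second = timedelta(seconds=1)
day = timedelta(days=1)
hour = timedelta(hours=1)

# 900 * interval '1 second' = interval '00:15:00'
assert 900 * second == timedelta(minutes=15)
# 21 * interval '1 day' = interval '21 days'
assert 21 * day == timedelta(days=21)
# interval '1 hour' / 1.5 = interval '00:40:00'
assert hour / 1.5 == timedelta(minutes=40)
```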
[jira] [Created] (SPARK-29386) Copy data between a file and a table
Maxim Gekk created SPARK-29386: -- Summary: Copy data between a file and a table Key: SPARK-29386 URL: https://issues.apache.org/jira/browse/SPARK-29386 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk https://www.postgresql.org/docs/12/sql-copy.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29385) Make `INTERVAL` values comparable
Maxim Gekk created SPARK-29385: -- Summary: Make `INTERVAL` values comparable Key: SPARK-29385 URL: https://issues.apache.org/jira/browse/SPARK-29385 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk PostgreSQL allows comparing intervals with `=`, `<>`, `<`, `<=`, `>`, `>=`. For example: {code} maxim=# select interval '1 month' > interval '29 days'; ?column? -- t {code} but the same fails in Spark: {code} spark-sql> select interval 1 month > interval 29 days; Error in query: cannot resolve '(interval 1 months > interval 4 weeks 1 days)' due to data type mismatch: GreaterThan does not support ordering on type interval; line 1 pos 7; 'Project [unresolvedalias((interval 1 months > interval 4 weeks 1 days), None)] +- OneRowRelation {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
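One way to define a total ordering, and roughly what PostgreSQL does, is to collapse an interval to a single microsecond count assuming 30-day months and 24-hour days (the helper below is a hypothetical sketch, not Spark code):

```python
MICROS_PER_DAY = 24 * 60 * 60 * 1_000_000

def interval_sort_key(months=0, days=0, micros=0):
    # Collapse (months, days, micros) to one microsecond count, assuming
    # 30-day months and 24-hour days, so intervals become totally ordered.
    return (months * 30 + days) * MICROS_PER_DAY + micros
```

Under this key, interval '1 month' compares greater than interval '29 days', matching the PostgreSQL result above.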
[jira] [Created] (SPARK-29384) Support `ago` in interval strings
Maxim Gekk created SPARK-29384: -- Summary: Support `ago` in interval strings Key: SPARK-29384 URL: https://issues.apache.org/jira/browse/SPARK-29384 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk PostgreSQL allows specifying the direction of an interval string with the `ago` keyword: {code} maxim=# select interval '@ 1 year 2 months 3 days 14 seconds ago'; interval -1 years -2 mons -3 days -00:00:14 {code} See https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29383) Support the optional prefix `@` in interval strings
Maxim Gekk created SPARK-29383: -- Summary: Support the optional prefix `@` in interval strings Key: SPARK-29383 URL: https://issues.apache.org/jira/browse/SPARK-29383 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk PostgreSQL allows `@` at the beginning and `ago` at the end of interval strings: {code} maxim=# select interval '@ 14 seconds'; interval -- 00:00:14 {code} See https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
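Both decorations (the `@` prefix from this ticket and the `ago` suffix from SPARK-29384) can be peeled off before the normal interval parser runs; a sketch with a hypothetical helper:

```python
def strip_interval_decorations(text):
    # Returns (core, negate): the interval string without the optional
    # '@' prefix or 'ago' suffix, and whether 'ago' flips its sign.
    s = text.strip()
    if s.startswith("@"):
        s = s[1:].lstrip()
    negate = False
    if s.lower().endswith("ago"):
        s = s[:-len("ago")].rstrip()
        negate = True
    return s, negate
```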
[jira] [Created] (SPARK-29382) Support the `INTERVAL` type by Parquet datasource
Maxim Gekk created SPARK-29382: -- Summary: Support the `INTERVAL` type by Parquet datasource Key: SPARK-29382 URL: https://issues.apache.org/jira/browse/SPARK-29382 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Spark cannot create a table using parquet if a column has the `INTERVAL` type: {code} spark-sql> CREATE TABLE INTERVAL_TBL (f1 interval) USING PARQUET; Error in query: Parquet data source does not support interval data type.; {code} This is needed for SPARK-29368 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9636) Treat $SPARK_HOME as write-only
[ https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philipp Angerer reopened SPARK-9636: This is not fixed and there was no reason given why it was closed, so I’ll reopen it. > Treat $SPARK_HOME as write-only > --- > > Key: SPARK-9636 > URL: https://issues.apache.org/jira/browse/SPARK-9636 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 1.4.1 > Environment: Linux >Reporter: Philipp Angerer >Priority: Minor > Labels: bulk-closed > > When starting Spark scripts as a user while Spark is installed in a directory the user has no write permission on, many things work fine, except for the logs (e.g. for {{start-master.sh}}). > Logs are written by default to {{$SPARK_LOG_DIR}} or (if unset) to {{$SPARK_HOME/logs}}. > If installed in this way, Spark should write logs to {{/var/log/spark/}} instead of throwing an error. That is easy to fix by testing a few log directories in sequence for writability before trying to use one. I suggest using {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} → {{$SPARK_HOME/logs/}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29379) SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'
[ https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-29379: -- Comment: was deleted (was: Don't need to add new expression class. If we just add code in ShowFunctionsCommand, we should change a lot UT about functions: {code:java} case class ShowFunctionsCommand( db: Option[String], pattern: Option[String], showUserFunctions: Boolean, showSystemFunctions: Boolean) extends RunnableCommand { override val output: Seq[Attribute] = { val schema = StructType(StructField("function", StringType, nullable = false) :: Nil) schema.toAttributes } override def run(sparkSession: SparkSession): Seq[Row] = { val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase) // If pattern is not specified, we use '*', which is used to // match any sequence of characters (including no characters). val functionNames = sparkSession.sessionState.catalog .listFunctions(dbName, pattern.getOrElse("*")) .collect { case (f, "USER") if showUserFunctions => f.unquotedString case (f, "SYSTEM") if showSystemFunctions => f.unquotedString } (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_)) } } {code} ) > SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case' > > > Key: SPARK-29379 > URL: https://issues.apache.org/jira/browse/SPARK-29379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
[ https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946609#comment-16946609 ] huangtianhua commented on SPARK-29222: -- The tests specified in -SPARK-29205- failed every time when run on an ARM instance; after increasing the timeout and batch time they succeed, but we only ran them several times, not 100. Is there a guiding principle for choosing the batchDuration of the StreamingContext? > Flaky test: > pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence > --- > > Key: SPARK-29222 > URL: https://issues.apache.org/jira/browse/SPARK-29222 > Project: Spark > Issue Type: Test > Components: MLlib, Tests >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/] > {code:java} > Error Message > 7 != 10 > Stacktrace Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 429, in test_parameter_convergence > self._eventually(condition, catch_assertions=True) > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 74, in _eventually > raise lastValue > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 65, in _eventually > lastValue = condition() > File > "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 425, in condition > self.assertEqual(len(model_weights), len(batches)) > AssertionError: 7 != 10 >{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression
zhengruifeng created SPARK-29381: Summary: Add 'private' _XXXParams classes for classification & regression Key: SPARK-29381 URL: https://issues.apache.org/jira/browse/SPARK-29381 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: zhengruifeng ping [~huaxingao] would you like to work on this? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size
zhengruifeng created SPARK-29380: Summary: RFormula avoid repeated 'first' jobs to get vector size Key: SPARK-29380 URL: https://issues.apache.org/jira/browse/SPARK-29380 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng In the current implementation, {{RFormula}} triggers one {{first}} job to get the vector size if the size cannot be obtained from {{AttributeGroup}}. This can be optimized by getting the first row lazily and reusing it for each vector column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
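The "fetch once, reuse everywhere" idea can be sketched with a small memoizing wrapper (hypothetical class; the real change would live inside RFormula's Scala implementation, where `fetch` would be the `first` job):

```python
class LazyFirstRow:
    """Run the expensive fetch at most once, on first use."""

    def __init__(self, fetch):
        self._fetch = fetch   # e.g. lambda: df.first(), which triggers a job
        self._row = None
        self._done = False

    def get(self):
        if not self._done:
            self._row = self._fetch()
            self._done = True
        return self._row
```

Every vector column then calls `get()` and only the first call pays for the job.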
[jira] [Reopened] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk reopened SPARK-24640: > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > Labels: api, bulk-closed > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend
[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946583#comment-16946583 ] zhengruifeng commented on SPARK-29212: -- [~zero323] ??we should remove Java specific mixins, if they don't serve any practical value (provide no implementation whatsoever or don't extend other {{Java*}} mixins, like {{JavaPredictorParams}}, or have no JVM wrapper specific implementation, like {{JavaPredictor}}).?? I am neutral on it, what are your thoughts? [~huaxingao] [~srowen] ??As of the second point there is additional consideration here - some {{Java*}} classes are considered part of the public API, and this should stay as is (these provide crucial information to the end user). ?? I guess we have reached an agreement in related tickets (like _XXXParams in features/clustering). ??On a side note current approach to ML API requires a lot of boilerplate code. Lately I've been playing with [some ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611], that wouldn't require code generation - they have some caveats, but maybe there is something there. ?? It looks succinct, I think we may take it into account in the future. > Add common classes without using JVM backend > > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > Copied from [https://github.com/apache/spark/pull/25776]. > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > * > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). 
> * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > * > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > * > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > * > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... 
def numClasses(self) -> > int: ... > (...) > * First of all nothing in the original API indicated this. On the contrary, > the original API clearly suggests that non-Java path is supported, by > providing base classes (Params, Transformer, Estimator, Model, ML > \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific > implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, > JavaML\{Reader,Writer}, JavaML > {Readable,Writable} > ). > * Furthermore authoritative (IMHO) and open Python ML extensions exist > (spark-sklearn is one of these, but if I recall correctly spark-deep-learning > provides so pure-Python utilities). Personally I've seen quite a lot of > private implementations, but that's just anecdotal evidence. > Let us assume
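The structural-subtyping idea quoted above can be made concrete with `typing.Protocol`; a runnable sketch (class names are hypothetical, not part of any PySpark API):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class SupportsRawPrediction(Protocol):
    # Structural stand-in for "is a classifier": anything exposing
    # setRawPredictionCol qualifies, whether built-in or user-defined,
    # with no registration against a private type hierarchy needed.
    def setRawPredictionCol(self, value: str) -> "SupportsRawPrediction": ...

class PurePythonClassifier:
    # A user-defined estimator with no JVM backend; it satisfies the
    # protocol purely by shape.
    def __init__(self) -> None:
        self.raw_prediction_col = "rawPrediction"

    def setRawPredictionCol(self, value: str) -> "PurePythonClassifier":
        self.raw_prediction_col = value
        return self
```

With `runtime_checkable`, `isinstance` checks work at run time, though they verify only method presence, not signatures.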
[jira] [Assigned] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-29269: Assignee: Huaxin Gao > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-29269: - Comment: was deleted (was: It seems that I do not have the permission to assign a ticket: ``` JIRAError: JiraError HTTP 403 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee text: You do not have permission to assign issues. ``` [~dongjoon] Could you please help assign this ticket to Huaxin? Thanks!) > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29366) Subqueries created for DPP are not printed in EXPLAIN FORMATTED
[ https://issues.apache.org/jira/browse/SPARK-29366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-29366. - Fix Version/s: 3.0.0 Assignee: Dilip Biswal Resolution: Fixed > Subqueries created for DPP are not printed in EXPLAIN FORMATTED > --- > > Key: SPARK-29366 > URL: https://issues.apache.org/jira/browse/SPARK-29366 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > The subquery expressions introduced by DPP are not printed in the newer > explain. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29379) SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'
[ https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946526#comment-16946526 ] angerszhu commented on SPARK-29379: --- We don't need to add a new expression class. If we just add code in ShowFunctionsCommand, we would have to change a lot of UTs about functions: {code:java} case class ShowFunctionsCommand( db: Option[String], pattern: Option[String], showUserFunctions: Boolean, showSystemFunctions: Boolean) extends RunnableCommand { override val output: Seq[Attribute] = { val schema = StructType(StructField("function", StringType, nullable = false) :: Nil) schema.toAttributes } override def run(sparkSession: SparkSession): Seq[Row] = { val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase) // If pattern is not specified, we use '*', which is used to // match any sequence of characters (including no characters). val functionNames = sparkSession.sessionState.catalog .listFunctions(dbName, pattern.getOrElse("*")) .collect { case (f, "USER") if showUserFunctions => f.unquotedString case (f, "SYSTEM") if showSystemFunctions => f.unquotedString } (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_)) } } {code} > SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case' > > > Key: SPARK-29379 > URL: https://issues.apache.org/jira/browse/SPARK-29379 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: angerszhu >Priority: Major > > SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946524#comment-16946524 ] zhengruifeng commented on SPARK-29269: -- It seems that I do not have the permission to assign a ticket: ``` JIRAError: JiraError HTTP 403 url: https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee text: You do not have permission to assign issues. ``` [~dongjoon] Could you please help assign this ticket to Huaxin? Thanks! > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946503#comment-16946503 ] Maxim Gekk edited comment on SPARK-24640 at 10/8/19 6:11 AM: - As far as I remember we planned to remove spark.sql.legacy.sizeOfNull in 3.0. [~hyukjin.kwon] [~smilegator] This ticket is a reminder of this. See https://github.com/apache/spark/pull/21598#issuecomment-399695523 was (Author: maxgekk): As far as I remember we planned to remove spark.sql.legacy.sizeOfNull in 3.0. [~hyukjin.kwon] [~smilegator] This ticket is a reminder of this. > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > Labels: api, bulk-closed > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29269) Pyspark ALSModel support getters/setters
[ https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-29269. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25947 [https://github.com/apache/spark/pull/25947] > Pyspark ALSModel support getters/setters > > > Key: SPARK-29269 > URL: https://issues.apache.org/jira/browse/SPARK-29269 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > ping [~huaxingao] , would you like to work on this? This is similar to your > previous works. Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org