[jira] [Resolved] (SPARK-25390) data source V2 API refactoring

2019-10-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25390.
-
Fix Version/s: 3.0.0
 Assignee: Wenchen Fan
   Resolution: Fixed

> data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it is not very clear how we should abstract the data source v2 API. The 
> abstraction should either be unified between batch and streaming, or be similar with 
> a well-defined difference between the two. The abstraction should also include 
> catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.
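> Below is a rough, illustrative sketch of that layering as Scala traits. The names 
> are hypothetical and only show the shape of the abstraction, not the actual API:
> {code:scala}
> // Illustrative sketch only: hypothetical names, not the real Spark interfaces.
> // Batch and streaming share the catalog/table layers and differ only in the
> // extra "stream" step before producing a scan.
> trait Scan                                             // reads the data
> trait Stream { def toScan(): Scan }                    // streaming-only step
> trait Table {
>   def newScan(): Scan                                  // batch: catalog -> table -> scan
>   def newStream(): Stream                              // streaming: catalog -> table -> stream -> scan
> }
> trait Catalog { def loadTable(name: String): Table }
> {code}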






[jira] [Assigned] (SPARK-29373) DataSourceV2: Commands should not submit a spark job.

2019-10-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29373:
---

Assignee: Terry Kim

> DataSourceV2: Commands should not submit a spark job.
> -
>
> Key: SPARK-29373
> URL: https://issues.apache.org/jira/browse/SPARK-29373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Blocker
>
> DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all 
> extend LeafExecNode. This results in running a job when executeCollect() is 
> called. This breaks the previous behavior 
> [SPARK-19650|https://issues.apache.org/jira/browse/SPARK-19650].
> A new command physical operator will be introduced from which all V2 Exec 
> classes derive, to avoid running a job.
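> A rough sketch of the idea (hypothetical names, not the actual patch):
> {code:scala}
> // Sketch only: a command node computes its rows locally on the driver and
> // serves executeCollect() from that result, so no Spark job is submitted.
> trait Row
> abstract class V2CommandLikeExec {
>   // Subclasses (e.g. a ShowTables-style exec) produce their output here,
>   // typically by querying the catalog on the driver.
>   protected def run(): Seq[Row]
>
>   private lazy val result: Seq[Row] = run()
>
>   // Returning the precomputed rows avoids launching a job on collect.
>   def executeCollect(): Array[Row] = result.toArray
> }
> {code}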






[jira] [Resolved] (SPARK-29373) DataSourceV2: Commands should not submit a spark job.

2019-10-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29373.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26048
[https://github.com/apache/spark/pull/26048]

> DataSourceV2: Commands should not submit a spark job.
> -
>
> Key: SPARK-29373
> URL: https://issues.apache.org/jira/browse/SPARK-29373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Blocker
> Fix For: 3.0.0
>
>
> DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all 
> extend LeafExecNode. This results in running a job when executeCollect() is 
> called. This breaks the previous behavior 
> [SPARK-19650|https://issues.apache.org/jira/browse/SPARK-19650].
> A new command physical operator will be introduced from which all V2 Exec 
> classes derive, to avoid running a job.






[jira] [Resolved] (SPARK-29401) Replace ambiguous varargs call parallelize(Array) with parallelize(Seq)

2019-10-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29401.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26062
[https://github.com/apache/spark/pull/26062]

> Replace ambiguous varargs call parallelize(Array) with parallelize(Seq)
> ---
>
> Key: SPARK-29401
> URL: https://issues.apache.org/jira/browse/SPARK-29401
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> Another general class of Scala 2.13 compile errors:
> {code}
> [ERROR] [Error] 
> /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47:
>  overloaded method value apply with alternatives:
>   (x: Unit,xs: Unit*)Array[Unit] 
>   (x: Double,xs: Double*)Array[Double] 
>   (x: Float,xs: Float*)Array[Float] 
>   (x: Long,xs: Long*)Array[Long] 
>   (x: Int,xs: Int*)Array[Int] 
>   (x: Char,xs: Char*)Array[Char] 
>   (x: Short,xs: Short*)Array[Short] 
>   (x: Byte,xs: Byte*)Array[Byte] 
>   (x: Boolean,xs: Boolean*)Array[Boolean]
>  cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
> {code}
> All of these (almost all?) are from calls to {{SparkContext.parallelize}} 
> with an Array of tuples. We can replace them with Seqs, which seems to work.
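> A minimal sketch of the replacement (local mode only for illustration; not taken 
> from the actual patch):
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
>
> val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("seq-vs-array"))
> // Before: sc.parallelize(Array((1, 2), (3, 4), (5, 6)))
> //   -- the kind of call that fails to compile on Scala 2.13 with the error above.
> // After: a Seq of tuples resolves cleanly.
> val rdd = sc.parallelize(Seq((1, 2), (3, 4), (5, 6)))
> assert(rdd.count() == 3)
> sc.stop()
> {code}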






[jira] [Resolved] (SPARK-29392) Remove use of deprecated symbol literal " 'name " syntax in favor Symbol("name")

2019-10-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29392.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26061
[https://github.com/apache/spark/pull/26061]

> Remove use of deprecated symbol literal " 'name " syntax in favor 
> Symbol("name")
> 
>
> Key: SPARK-29392
> URL: https://issues.apache.org/jira/browse/SPARK-29392
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> Example:
> {code}
> [WARNING] [Warn] 
> /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala:308:
>  symbol literal is deprecated; use Symbol("assertInvariants") instead
> {code}
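> A minimal before/after illustration (the identifier is taken from the warning above):
> {code:scala}
> // Deprecated symbol literal (emits the warning above on Scala 2.13):
> val before = 'assertInvariants
> // Replacement that compiles without the deprecation warning:
> val after = Symbol("assertInvariants")
> assert(before == after)  // Symbols are interned, so the two are equal
> {code}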






[jira] [Created] (SPARK-29403) Uses Arrow R 0.14.1 in AppVeyor for now

2019-10-08 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-29403:


 Summary: Uses Arrow R 0.14.1 in AppVeyor for now
 Key: SPARK-29403
 URL: https://issues.apache.org/jira/browse/SPARK-29403
 Project: Spark
  Issue Type: Test
  Components: Project Infra, SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


Due to SPARK-29378, the test is currently failing.






[jira] [Resolved] (SPARK-28797) Document DROP FUNCTION statement in SQL Reference.

2019-10-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-28797.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25553
[https://github.com/apache/spark/pull/25553]

> Document DROP FUNCTION statement in SQL Reference.
> --
>
> Key: SPARK-28797
> URL: https://issues.apache.org/jira/browse/SPARK-28797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Sandeep Katta
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-28797) Document DROP FUNCTION statement in SQL Reference.

2019-10-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-28797:
-
Priority: Minor  (was: Major)

> Document DROP FUNCTION statement in SQL Reference.
> --
>
> Key: SPARK-28797
> URL: https://issues.apache.org/jira/browse/SPARK-28797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Sandeep Katta
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-28797) Document DROP FUNCTION statement in SQL Reference.

2019-10-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-28797:


Assignee: Sandeep Katta

> Document DROP FUNCTION statement in SQL Reference.
> --
>
> Key: SPARK-28797
> URL: https://issues.apache.org/jira/browse/SPARK-28797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Sandeep Katta
>Priority: Major
>







[jira] [Resolved] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-10-08 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-28502.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

This was fixed once support for StructType was added to pandas_udf, because the 
window range sent to the UDF is a struct column.

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
> Fix For: 3.0.0
>
>
> What I am trying to do: Group data based on time intervals (e.g., 15 days 
> window) and perform some operations on dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and error message I am getting.
>  
> {code:java}
> from pyspark.sql import functions as F
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType
>
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
>                                    (13.00, "2018-03-11T12:27:18+00:00"),
>                                    (25.00, "2018-03-12T11:27:18+00:00"),
>                                    (20.00, "2018-03-13T15:27:18+00:00"),
>                                    (17.00, "2018-03-14T12:27:18+00:00"),
>                                    (99.00, "2018-03-15T11:27:18+00:00"),
>                                    (156.00, "2018-03-22T11:27:18+00:00"),
>                                    (17.00, "2018-03-31T11:27:18+00:00"),
>                                    (25.00, "2018-03-15T11:27:18+00:00"),
>                                    (25.00, "2018-03-16T11:27:18+00:00")],
>                                   ["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
>
> schema = StructType([
>     StructField("id", IntegerType()),
>     StructField("ts", TimestampType())
> ])
>
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
>     # some computation
>     return df
>
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use a built-in agg method then it works fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}






[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-10-08 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947272#comment-16947272
 ] 

Bryan Cutler commented on SPARK-28502:
--

I'm closing this since it is working in master and will mark fixed version as 
3.0.0. I also created SPARK-29402 to add proper testing for this use case.

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
>
> What I am trying to do: Group data based on time intervals (e.g., 15 days 
> window) and perform some operations on dataframe using (pandas) UDFs. I don't 
> know if there is a better/cleaner way to do it.
> Below is the sample code that I tried and error message I am getting.
>  
> {code:java}
> from pyspark.sql import functions as F
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType
>
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
>                                    (13.00, "2018-03-11T12:27:18+00:00"),
>                                    (25.00, "2018-03-12T11:27:18+00:00"),
>                                    (20.00, "2018-03-13T15:27:18+00:00"),
>                                    (17.00, "2018-03-14T12:27:18+00:00"),
>                                    (99.00, "2018-03-15T11:27:18+00:00"),
>                                    (156.00, "2018-03-22T11:27:18+00:00"),
>                                    (17.00, "2018-03-31T11:27:18+00:00"),
>                                    (25.00, "2018-03-15T11:27:18+00:00"),
>                                    (25.00, "2018-03-16T11:27:18+00:00")],
>                                   ["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
>
> schema = StructType([
>     StructField("id", IntegerType()),
>     StructField("ts", TimestampType())
> ])
>
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
>     # some computation
>     return df
>
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use a built-in agg method then it works fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}






[jira] [Commented] (SPARK-29402) Add tests for grouped map pandas_udf using window

2019-10-08 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947266#comment-16947266
 ] 

Bryan Cutler commented on SPARK-29402:
--

This is related to SPARK-28502, where using a grouped map pandas_udf with a window 
failed because the key contains a struct column for the window range.

> Add tests for grouped map pandas_udf using window
> -
>
> Key: SPARK-29402
> URL: https://issues.apache.org/jira/browse/SPARK-29402
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL, Tests
>Affects Versions: 2.4.4
>Reporter: Bryan Cutler
>Priority: Major
>
> Grouped map pandas_udf should have tests that use a window, due to the key 
> containing the grouping key and window range.






[jira] [Created] (SPARK-29402) Add tests for grouped map pandas_udf using window

2019-10-08 Thread Bryan Cutler (Jira)
Bryan Cutler created SPARK-29402:


 Summary: Add tests for grouped map pandas_udf using window
 Key: SPARK-29402
 URL: https://issues.apache.org/jira/browse/SPARK-29402
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL, Tests
Affects Versions: 2.4.4
Reporter: Bryan Cutler


Grouped map pandas_udf should have tests that use a window, due to the key 
containing the grouping key and window range.






[jira] [Updated] (SPARK-29368) Port interval.sql

2019-10-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29368:
--
Component/s: Tests

> Port interval.sql
> -
>
> Key: SPARK-29368
> URL: https://issues.apache.org/jira/browse/SPARK-29368
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Here is interval.sql: 
> [https://raw.githubusercontent.com/postgres/postgres/REL_12_STABLE/src/test/regress/sql/interval.sql]
> Results: 
> https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/interval.out






[jira] [Commented] (SPARK-29382) Support the `INTERVAL` type by Parquet datasource

2019-10-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947234#comment-16947234
 ] 

Dongjoon Hyun commented on SPARK-29382:
---

Since this is not a Parquet-specific format, could you update the issue title?

> Support the `INTERVAL` type by Parquet datasource
> -
>
> Key: SPARK-29382
> URL: https://issues.apache.org/jira/browse/SPARK-29382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Spark cannot create a table using parquet if a column has the `INTERVAL` type:
> {code}
> spark-sql> CREATE TABLE INTERVAL_TBL (f1 interval) USING PARQUET;
> Error in query: Parquet data source does not support interval data type.;
> {code}
> This is needed for SPARK-29368






[jira] [Created] (SPARK-29401) Replace ambiguous varargs call parallelize(Array) with parallelize(Seq)

2019-10-08 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-29401:


 Summary: Replace ambiguous varargs call parallelize(Array) with 
parallelize(Seq)
 Key: SPARK-29401
 URL: https://issues.apache.org/jira/browse/SPARK-29401
 Project: Spark
  Issue Type: Sub-task
  Components: ML, Spark Core
Affects Versions: 3.0.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


Another general class of Scala 2.13 compile errors:

{code}
[ERROR] [Error] 
/Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47:
 overloaded method value apply with alternatives:
  (x: Unit,xs: Unit*)Array[Unit] 
  (x: Double,xs: Double*)Array[Double] 
  (x: Float,xs: Float*)Array[Float] 
  (x: Long,xs: Long*)Array[Long] 
  (x: Int,xs: Int*)Array[Int] 
  (x: Char,xs: Char*)Array[Char] 
  (x: Short,xs: Short*)Array[Short] 
  (x: Byte,xs: Byte*)Array[Byte] 
  (x: Boolean,xs: Boolean*)Array[Boolean]
 cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
{code}

All of these (almost all?) are from calls to {{SparkContext.parallelize}} with 
an Array of tuples. We can replace them with Seqs, which seems to work.






[jira] [Assigned] (SPARK-29392) Remove use of deprecated symbol literal " 'name " syntax in favor Symbol("name")

2019-10-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29392:


Assignee: Sean R. Owen

> Remove use of deprecated symbol literal " 'name " syntax in favor 
> Symbol("name")
> 
>
> Key: SPARK-29392
> URL: https://issues.apache.org/jira/browse/SPARK-29392
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Example:
> {code}
> [WARNING] [Warn] 
> /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala:308:
>  symbol literal is deprecated; use Symbol("assertInvariants") instead
> {code}






[jira] [Updated] (SPARK-29400) Improve PrometheusResource to use labels

2019-10-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29400:
--
Priority: Minor  (was: Major)

> Improve PrometheusResource to use labels
> 
>
> Key: SPARK-29400
> URL: https://issues.apache.org/jira/browse/SPARK-29400
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> SPARK-29064 introduced `PrometheusResource` for native support. This issue 
> aims to improve it to use **labels**.






[jira] [Created] (SPARK-29400) Improve PrometheusResource to use labels

2019-10-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29400:
-

 Summary: Improve PrometheusResource to use labels
 Key: SPARK-29400
 URL: https://issues.apache.org/jira/browse/SPARK-29400
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


SPARK-29064 introduced `PrometheusResource` for native support. This issue aims 
to improve it to use **labels**.






[jira] [Commented] (SPARK-22256) Introduce spark.mesos.driver.memoryOverhead

2019-10-08 Thread David McWhorter (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947177#comment-16947177
 ] 

David McWhorter commented on SPARK-22256:
-

I created the pull request https://github.com/pmackles/spark/pull/1 to update the 
branch for that pull request with the changes from 
https://github.com/dmcwhorter/spark/tree/dmcwhorter-SPARK-22256.

The updates do the following:
1. Rebase pmackles/paul-SPARK-22256 onto the latest apache/spark master branch
2. Add a test case verifying that the default value of 
spark.mesos.driver.memoryOverhead is 10% of driver memory, as requested in 
apache#21006

My hope is that this update makes the pull request ready to merge and include 
in the next Spark release.

> Introduce spark.mesos.driver.memoryOverhead 
> 
>
> Key: SPARK-22256
> URL: https://issues.apache.org/jira/browse/SPARK-22256
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Cosmin Lehene
>Priority: Minor
>  Labels: docker, memory, mesos
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running the Spark driver in a container, such as when using the Mesos 
> dispatcher service, we need to apply the same rules as for executors in order 
> to avoid the JVM going over the allotted limit and then being killed. 






[jira] [Commented] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression

2019-10-08 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947171#comment-16947171
 ] 

Huaxin Gao commented on SPARK-29381:


Yes. I am happy to work on this. [~podongfeng]

> Add 'private' _XXXParams classes for classification & regression
> 
>
> Key: SPARK-29381
> URL: https://issues.apache.org/jira/browse/SPARK-29381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> ping [~huaxingao]  would you like to work on this?






[jira] [Created] (SPARK-29399) Remove old ExecutorPlugin API (or wrap it using new API)

2019-10-08 Thread Marcelo Masiero Vanzin (Jira)
Marcelo Masiero Vanzin created SPARK-29399:
--

 Summary: Remove old ExecutorPlugin API (or wrap it using new API)
 Key: SPARK-29399
 URL: https://issues.apache.org/jira/browse/SPARK-29399
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Masiero Vanzin


The parent bug is proposing a new plugin API for Spark, so we could remove the 
old one since it's a developer API.

If we can get the new API into Spark 3.0, then removal might be a better idea 
than deprecation. That would be my preference since then we can remove the new 
elements added as part of SPARK-28091 without having to deprecate them first.

If it doesn't make it into 3.0, then we should deprecate the APIs from 3.0 and 
manage old plugins through the new code being added.






[jira] [Created] (SPARK-29398) Allow RPC endpoints to use dedicated thread pools

2019-10-08 Thread Marcelo Masiero Vanzin (Jira)
Marcelo Masiero Vanzin created SPARK-29398:
--

 Summary: Allow RPC endpoints to use dedicated thread pools
 Key: SPARK-29398
 URL: https://issues.apache.org/jira/browse/SPARK-29398
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Masiero Vanzin


This is a new feature of the RPC framework so that we can isolate RPC message 
delivery for plugins from normal Spark RPC needs. This minimizes the impact 
that plugins can have on normal RPC communication - they'll still fight for 
CPU, but they wouldn't block the dispatcher threads used by existing Spark RPC 
endpoints.

See parent bug for further details.






[jira] [Created] (SPARK-29397) Create new plugin interface for driver and executor plugins

2019-10-08 Thread Marcelo Masiero Vanzin (Jira)
Marcelo Masiero Vanzin created SPARK-29397:
--

 Summary: Create new plugin interface for driver and executor 
plugins
 Key: SPARK-29397
 URL: https://issues.apache.org/jira/browse/SPARK-29397
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Masiero Vanzin


This task covers the work of adding a new interface for Spark plugins, covering 
both driver-side and executor-side components.

See parent bug for details.






[jira] [Created] (SPARK-29396) Extend Spark plugin interface to driver

2019-10-08 Thread Marcelo Masiero Vanzin (Jira)
Marcelo Masiero Vanzin created SPARK-29396:
--

 Summary: Extend Spark plugin interface to driver
 Key: SPARK-29396
 URL: https://issues.apache.org/jira/browse/SPARK-29396
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Masiero Vanzin


Spark provides an extension API for people to implement executor plugins, added 
in SPARK-24918 and later extended in SPARK-28091.

That API does not offer any functionality for doing similar things on the 
driver side, though. As a consequence of that, there is not a good way for the 
executor plugins to get information or communicate in any way with the Spark 
driver.

I've been playing with such an improved API for developing some new 
functionality. I'll file a few child bugs for the work to get the changes in.






[jira] [Updated] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server

2019-10-08 Thread Srini E (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srini E updated SPARK-29337:

Labels: Question stack-overflow  (was: )

> How to Cache Table and Pin it in Memory and should not Spill to Disk on 
> Thrift Server 
> --
>
> Key: SPARK-29337
> URL: https://issues.apache.org/jira/browse/SPARK-29337
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Srini E
>Priority: Major
>  Labels: Question, stack-overflow
> Attachments: Cache+Image.png
>
>
> Hi Team,
> How can we pin a table in cache so it does not get evicted from memory?
> Situation: We are using MicroStrategy BI reporting and the semantic layer is built. 
> We wanted to cache highly used tables using Spark SQL CACHE TABLE; we cached them 
> in the Spark context (Thrift server). Please see the attached snapshot of the 
> cached table, which went to disk over time. Initially it was all in cache; now 
> some is in cache and some on disk. That disk may be local disk, which is relatively 
> more expensive to read from than s3. Queries may take longer and show inconsistent 
> times from a user-experience perspective. If more queries run against the cached 
> tables, copies of the cache table images are made and those copies do not stay in 
> memory, causing reports to run longer. So how can we pin the table so it does not 
> spill to disk? Spark memory management is dynamic allocation, and how can we pin 
> those few tables in memory?






[jira] [Reopened] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server

2019-10-08 Thread Srini E (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srini E reopened SPARK-29337:
-

Reopened and labelled it with stack-overflow and Question.

> How to Cache Table and Pin it in Memory and should not Spill to Disk on 
> Thrift Server 
> --
>
> Key: SPARK-29337
> URL: https://issues.apache.org/jira/browse/SPARK-29337
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Srini E
>Priority: Major
>  Labels: Question, stack-overflow
> Attachments: Cache+Image.png
>
>
> Hi Team,
> How can we pin a table in cache so it does not get evicted from memory?
> Situation: We are using MicroStrategy BI reporting and the semantic layer is built. 
> We wanted to cache highly used tables using Spark SQL CACHE TABLE; we cached them 
> in the Spark context (Thrift server). Please see the attached snapshot of the 
> cached table, which went to disk over time. Initially it was all in cache; now 
> some is in cache and some on disk. That disk may be local disk, which is relatively 
> more expensive to read from than s3. Queries may take longer and show inconsistent 
> times from a user-experience perspective. If more queries run against the cached 
> tables, copies of the cache table images are made and those copies do not stay in 
> memory, causing reports to run longer. So how can we pin the table so it does not 
> spill to disk? Spark memory management is dynamic allocation, and how can we pin 
> those few tables in memory?






[jira] [Reopened] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-10-08 Thread Srini E (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srini E reopened SPARK-29335:
-

Updated the labels to Question and stack-overflow.

> Cost Based Optimizer stats are not used while evaluating query plans in Spark 
> Sql
> -
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
>  Issue Type: Question
>  Components: Optimizer
>Affects Versions: 2.3.0
> Environment: We tried to execute the same using Spark-sql and Thrify 
> server using SQLWorkbench but we are not able to use the stats.
>Reporter: Srini E
>Priority: Major
>  Labels: Question, stack-overflow
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage CBO to get better plan results for a few critical 
> queries run through spark-sql or through the Thrift server using the JDBC driver. 
> The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
>  
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, 
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
> absref;
> {code}
> Analyze completed successfully.
> Describe extended does not show column-level stats data, and queries are not 
> leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
> *When we run queries using Spark SQL, we are not able to see the stats being 
> used in the explain plan, and we are not sure if CBO is put to use.*
> *A quick response would be helpful.*
> *Explain Plan:*
> The following explain command does not reference any statistics usage.
>  
> {code}
> spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
>  
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
> 2017),(ptype#4546 = A),(eid#4542 = 
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<FPID: 
> decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(absref#4569),(absref#4569 = 
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<IMNEM: 
> string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: 
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter (((('a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
> ('a12.eid = 29940)) && isnull('a12.PID)))
>  +- 'Join Inner
>  :- 'SubqueryAlias a12
>  : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
>  +- 'SubqueryAlias a13
>  +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>  
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter ((((eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
> cast(fpid#4567 as decimal(38,0)))) && ((absref#4569 = Y2017) && (year#4545 = 
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
> cast(cast(29940 as decimal(5,0)) as decimal(10,0)))) && isnull(PID#4527)))
>  +- Join Inner
>  :- SubqueryAlias a12
>  : +- SubqueryAlias t_fperiods_sundar
>  : +- 
> 

[jira] [Updated] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-10-08 Thread Srini E (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srini E updated SPARK-29335:

Labels: Question stack-overflow  (was: )

> Cost Based Optimizer stats are not used while evaluating query plans in Spark 
> Sql
> -
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
>  Issue Type: Question
>  Components: Optimizer
>Affects Versions: 2.3.0
> Environment: We tried to execute the same using Spark-sql and Thrify 
> server using SQLWorkbench but we are not able to use the stats.
>Reporter: Srini E
>Priority: Major
>  Labels: Question, stack-overflow
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage CBO to get better plan results for a few critical 
> queries run through spark-sql or through the Thrift server using the JDBC driver. 
> The following settings were added to spark-defaults.conf:
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
>  
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, 
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
> absref;
> {code}
> Analyze completed successfully.
> Describe extended does not show column-level stats data, and queries are not 
> leveraging table- or column-level stats.
> We are using Oracle as our Hive catalog store, not Glue.
> *When we run queries using Spark SQL, we are not able to see the stats being 
> used in the explain plan, and we are not sure if CBO is put to use.*
> *A quick response would be helpful.*
> *Explain Plan:*
> The following explain command does not reference any statistics usage.
>  
> {code}
> spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
>  
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
> 2017),(ptype#4546 = A),(eid#4542 = 
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<FPID: 
> decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(absref#4569),(absref#4569 = 
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct<IMNEM: 
> string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: 
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter (((('a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
> ('a12.eid = 29940)) && isnull('a12.PID)))
>  +- 'Join Inner
>  :- 'SubqueryAlias a12
>  : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
>  +- 'SubqueryAlias a13
>  +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>  
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter ((((eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
> cast(fpid#4567 as decimal(38,0)))) && ((absref#4569 = Y2017) && (year#4545 = 
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
> cast(cast(29940 as decimal(5,0)) as decimal(10,0)))) && isnull(PID#4527)))
>  +- Join Inner
>  :- SubqueryAlias a12
>  : +- SubqueryAlias t_fperiods_sundar
>  : +- 
> 

[jira] [Assigned] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

2019-10-08 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-28917:
--

Assignee: Imran Rashid

> Jobs can hang because of race of RDD.dependencies
> -
>
> Key: SPARK-28917
> URL: https://issues.apache.org/jira/browse/SPARK-28917
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
>
> {{RDD.dependencies}} stores the precomputed cache value, but it is not 
> thread-safe.  This can lead to a race where the value gets overwritten, but 
> the DAGScheduler gets stuck in an inconsistent state.  In particular, this 
> can happen when there is a race between the DAGScheduler event loop, and 
> another thread (eg. a user thread, if there is multi-threaded job submission).
> First, a job is submitted by the user, which then computes the result Stage 
> and its parents:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983
> Which eventually makes a call to {{rdd.dependencies}}:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519
> At the same time, the user could also touch {{rdd.dependencies}} in another 
> thread, which could overwrite the stored value because of the race.
> Then the DAGScheduler checks the dependencies *again* later on in the job 
> submission, via {{getMissingParentStages}}
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025
> Because it will find new dependencies, it will create entirely different 
> stages.  Now the job has some orphaned stages which will never run.
> One symptoms of this are seeing disjoint sets of stages in the "Parents of 
> final stage" and the "Missing parents" messages on job submission (however 
> this is not required).
> (*EDIT*: Seeing repeated msgs "Registering RDD X" actually is just fine, it 
> is not a symptom of a problem at all.  It just means the RDD is the *input* 
> to multiple shuffles.)
> {noformat}
> [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - 
> Starting job: count at XXX.scala:462
> ...
> [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> ...
> ...
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Final stage: ResultStage 5 (count at XXX.scala:462)
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Parents of final stage: List(ShuffleMapStage 4)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Missing parents: List(ShuffleMapStage 6)
> {noformat}
> Another symptom is only visible with DEBUG logs turned on for DAGScheduler -- 
> you will see calls to {{submitStage(Stage X)}} multiple times, followed by a 
> different set of missing stages.  eg. here, we see stage 1 first is missing 
> stage 0 as a dependency, and then later on it's missing stage 23:
> {noformat}
> 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
> 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 
> 0)
> ...
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 
> 23)
> {noformat}
> Note that there is a similar issue w/ {{rdd.partitions}}.  In particular for 
> some RDDs, {{partitions}} references {{dependencies}} (eg. {{CoGroupedRDD}}). 
>  
> There is also an issue that {{rdd.storageLevel}} is read and cached in the 
> scheduler, but it could be modified simultaneously by the user in another 
> thread.   But, I can't see a way it could affect the scheduler.
> *WORKAROUND*:
> (a) call {{rdd.dependencies}} while you know that RDD is only getting touched 
> by one thread (eg. in the thread that created it, or before you submit 
> multiple jobs touching that RDD from other threads). Then that value will get 
> cached.
> (b) don't submit jobs from multiple threads.
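> A small sketch of workaround (a), assuming an existing SparkContext {{sc}} (the 
> jobs themselves are just placeholders):
> {code:scala}
> // Workaround (a): touch rdd.dependencies from a single thread first, so the
> // cached value is stable before any multi-threaded job submission.
> val rdd = sc.parallelize(1 to 100).map(_ * 2).repartition(4)
> rdd.dependencies  // computed and cached here, while only this thread owns the RDD
>
> val pool = java.util.concurrent.Executors.newFixedThreadPool(2)
> (1 to 2).foreach { _ =>
>   pool.submit(new Runnable {
>     override def run(): Unit = rdd.count()  // safe: dependencies already cached
>   })
> }
> pool.shutdown()
> {code}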




[jira] [Resolved] (SPARK-28917) Jobs can hang because of race of RDD.dependencies

2019-10-08 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-28917.

Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 25951
[https://github.com/apache/spark/pull/25951]

> Jobs can hang because of race of RDD.dependencies
> -
>
> Key: SPARK-28917
> URL: https://issues.apache.org/jira/browse/SPARK-28917
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> {{RDD.dependencies}} stores the precomputed cache value, but it is not 
> thread-safe.  This can lead to a race where the value gets overwritten, but 
> the DAGScheduler gets stuck in an inconsistent state.  In particular, this 
> can happen when there is a race between the DAGScheduler event loop, and 
> another thread (eg. a user thread, if there is multi-threaded job submission).
> First, a job is submitted by the user, which then computes the result Stage 
> and its parents:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L983
> Which eventually makes a call to {{rdd.dependencies}}:
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L519
> At the same time, the user could also touch {{rdd.dependencies}} in another 
> thread, which could overwrite the stored value because of the race.
> Then the DAGScheduler checks the dependencies *again* later on in the job 
> submission, via {{getMissingParentStages}}
> https://github.com/apache/spark/blob/24655583f1cb5dae2e80bb572604fb4a9761ec07/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1025
> Because it will find new dependencies, it will create entirely different 
> stages.  Now the job has some orphaned stages which will never run.
> One symptoms of this are seeing disjoint sets of stages in the "Parents of 
> final stage" and the "Missing parents" messages on job submission (however 
> this is not required).
> (*EDIT*: Seeing repeated msgs "Registering RDD X" actually is just fine, it 
> is not a symptom of a problem at all.  It just means the RDD is the *input* 
> to multiple shuffles.)
> {noformat}
> [INFO] 2019-08-15 23:22:31,570 org.apache.spark.SparkContext logInfo - 
> Starting job: count at XXX.scala:462
> ...
> [INFO] 2019-08-15 23:22:31,573 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> ...
> ...
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Got job 1 (count at XXX.scala:462) with 40 output partitions
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Final stage: ResultStage 5 (count at XXX.scala:462)
> [INFO] 2019-08-15 23:22:31,582 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Parents of final stage: List(ShuffleMapStage 4)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Registering RDD 14 (repartition at XXX.scala:421)
> [INFO] 2019-08-15 23:22:31,599 org.apache.spark.scheduler.DAGScheduler 
> logInfo - Missing parents: List(ShuffleMapStage 6)
> {noformat}
> Another symptom is only visible with DEBUG logs turned on for DAGScheduler -- 
> you will see calls to {{submitStage(Stage X)}} multiple times, followed by a 
> different set of missing stages.  eg. here, we see stage 1 first is missing 
> stage 0 as a dependency, and then later on it's missing stage 23:
> {noformat}
> 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
> 19/09/19 22:28:15 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 
> 0)
> ...
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: submitStage(ShuffleMapStage 1)
> 19/09/19 22:32:01 DEBUG scheduler.DAGScheduler: missing: List(ShuffleMapStage 
> 23)
> {noformat}
> Note that there is a similar issue w/ {{rdd.partitions}}.  In particular for 
> some RDDs, {{partitions}} references {{dependencies}} (eg. {{CoGroupedRDD}}). 
>  
> There is also an issue that {{rdd.storageLevel}} is read and cached in the 
> scheduler, but it could be modified simultaneously by the user in another 
> thread.   But, I can't see a way it could affect the scheduler.
> *WORKAROUND*:
> (a) call {{rdd.dependencies}} while you know that RDD is only getting touched 
> by one thread (eg. in the thread that created it, or before you submit 
> multiple jobs touching that RDD from other threads). Then that value will get 
> cached.
> (b) don't submit jobs from multiple threads.




[jira] [Updated] (SPARK-29370) Interval strings without explicit unit markings

2019-10-08 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29370:
---
Description: 
In PostgreSQL, Quantities of days, hours, minutes, and seconds can be specified 
without explicit unit markings. For example, '1 12:59:10' is read the same as 
'1 day 12 hours 59 min 10 sec'. For example:
{code:java}
maxim=# select interval '1 12:59:10';
interval

 1 day 12:59:10
(1 row)
{code}
It should also allow specifying the sign:
{code}
maxim=# SELECT interval '1 +2:03:04' minute to second;
interval

 1 day 02:03:04
maxim=# SELECT interval '1 -2:03:04' minute to second;
interval 
-
 1 day -02:03:04
{code}
 

  was:
In PostgreSQL, Quantities of days, hours, minutes, and seconds can be specified 
without explicit unit markings. For example, '1 12:59:10' is read the same as 
'1 day 12 hours 59 min 10 sec'. For example:
{code}
maxim=# select interval '1 12:59:10';
interval

 1 day 12:59:10
(1 row)
{code}


> Interval strings without explicit unit markings
> ---
>
> Key: SPARK-29370
> URL: https://issues.apache.org/jira/browse/SPARK-29370
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> In PostgreSQL, Quantities of days, hours, minutes, and seconds can be 
> specified without explicit unit markings. For example, '1 12:59:10' is read 
> the same as '1 day 12 hours 59 min 10 sec'. For example:
> {code:java}
> maxim=# select interval '1 12:59:10';
> interval
> 
>  1 day 12:59:10
> (1 row)
> {code}
> It should also allow specifying the sign:
> {code}
> maxim=# SELECT interval '1 +2:03:04' minute to second;
> interval
> 
>  1 day 02:03:04
> maxim=# SELECT interval '1 -2:03:04' minute to second;
> interval 
> -
>  1 day -02:03:04
> {code}
>  






[jira] [Created] (SPARK-29395) Precision of the interval type

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29395:
--

 Summary: Precision of the interval type
 Key: SPARK-29395
 URL: https://issues.apache.org/jira/browse/SPARK-29395
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL allows specifying the interval precision, see 
[https://www.postgresql.org/docs/12/datatype-datetime.html]
|interval [ fields ] [ (p) ]|16 bytes|time interval|-178000000 years|178000000 
years|1 microsecond|

For example:
{code}
maxim=# SELECT interval '1 2:03.4567' day to second(2);
 interval  
---
 1 day 00:02:03.46
(1 row)
{code}






[jira] [Created] (SPARK-29394) Support ISO 8601 format for intervals

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29394:
--

 Summary: Support ISO 8601 format for intervals
 Key: SPARK-29394
 URL: https://issues.apache.org/jira/browse/SPARK-29394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Interval values can also be written as ISO 8601 time intervals, using either 
the “format with designators” of the standard's section 4.4.3.2 or the 
“alternative format” of section 4.4.3.3. 
 For example:
|P1Y2M3DT4H5M6S|ISO 8601 “format with designators”|
|P0001-02-03T04:05:06|ISO 8601 “alternative format”: same meaning as above|
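For reference, the "format with designators" splits into a date part and a time part 
that the JDK can already parse; a small sketch using plain java.time (only to 
illustrate the format, not how Spark would implement the support):
{code:scala}
import java.time.{Duration, Period}

// "P1Y2M3DT4H5M6S" = 1 year 2 months 3 days, plus 4 hours 5 minutes 6 seconds.
val designators = "P1Y2M3DT4H5M6S"
val datePart = Period.parse(designators.takeWhile(_ != 'T'))                   // P1Y2M3D
val timePart = Duration.parse("PT" + designators.dropWhile(_ != 'T').drop(1))  // PT4H5M6S
println(s"$datePart + $timePart")  // prints: P1Y2M3D + PT4H5M6S
{code}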



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29393) Add the make_interval() function

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29393:
--

 Summary: Add the make_interval() function
 Key: SPARK-29393
 URL: https://issues.apache.org/jira/browse/SPARK-29393
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL allows making an interval by using the make_interval() function:
|{{make_interval(_years_ int DEFAULT 0, _months_ int DEFAULT 0, _weeks_ int DEFAULT 0, _days_ int DEFAULT 0, _hours_ int DEFAULT 0, _mins_ int DEFAULT 0, _secs_ double precision DEFAULT 0.0)}}|{{interval}}|Create interval from years, months, weeks, days, hours, minutes and seconds fields|{{make_interval(days => 10)}}|{{10 days}}|
See https://www.postgresql.org/docs/12/functions-datetime.html
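For illustration, a Scala sketch of such a constructor over a simplified (months, days, microseconds) value; the SimpleInterval case class is a stand-in for whatever interval representation Spark ends up using, not an existing API:
{code:scala}
// Simplified stand-in for an interval value: months, days, microseconds.
final case class SimpleInterval(months: Long, days: Long, micros: Long)

object MakeInterval {
  private val MicrosPerSecond = 1000000L

  /** Mirrors PostgreSQL's make_interval(years, months, weeks, days, hours, mins, secs). */
  def makeInterval(
      years: Int = 0, months: Int = 0, weeks: Int = 0, days: Int = 0,
      hours: Int = 0, mins: Int = 0, secs: Double = 0.0): SimpleInterval = {
    val totalMonths = years * 12L + months
    val totalDays = weeks * 7L + days
    val micros = (hours * 3600L + mins * 60L) * MicrosPerSecond + math.round(secs * MicrosPerSecond)
    SimpleInterval(totalMonths, totalDays, micros)
  }
}

// Example: MakeInterval.makeInterval(days = 10) == SimpleInterval(0, 10, 0), i.e. "10 days".
{code}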



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-10-08 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947042#comment-16947042
 ] 

Huaxin Gao commented on SPARK-29212:


+1 for Sean's comment. 

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> Copied from [https://github.com/apache/spark/pull/25776].
>  
>  Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  * 
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  * 
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
>  Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  * 
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  * 
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
>  def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
>  or try to narrow things down using structural subtyping:
>  class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
> (...)
>  * First of all nothing in the original API indicated this. On the contrary, 
> the original API clearly suggests that non-Java path is supported, by 
> providing base classes (Params, Transformer, Estimator, Model, ML 
> \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific 
> implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, 
> JavaML\{Reader,Writer}, JavaML
> {Readable,Writable}
> ).
>  * Furthermore authoritative (IMHO) and open Python ML extensions exist 
> (spark-sklearn is one of these, but if I recall correctly spark-deep-learning 
> provides so pure-Python utilities). Personally I've seen quite a lot of 
> private implementations, but that's just anecdotal evidence.
> Let us assume for the sake of argument that above observations are 
> irrelevant. I will argue that having complete, public type hierarchy is still 
> desired:
>  * Two out three use cases I described, can narrowed down to Java 
> implementation only, though there are less compelling if we do that.
>  * More importantly, public type hierarchy with Java specific extensions, is 
> pyspark.ml standard. There is no reason to treat this specific case as an 
> exception, especially when the implementations, is far from utilitarian (for 
> example implementation-free JavaClassifierParams and 
> JavaProbabilisticClassifierParams save, as for now, no practical purpose 
> whatsoever).
>   
> Maciej's *Proposal*:
> {code:python}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
> def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> class 

[jira] [Created] (SPARK-29392) Remove use of deprecated symbol literal " 'name " syntax in favor of Symbol("name")

2019-10-08 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-29392:


 Summary: Remove use of deprecated symbol literal " 'name " syntax 
in favor of Symbol("name")
 Key: SPARK-29392
 URL: https://issues.apache.org/jira/browse/SPARK-29392
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL, Tests
Affects Versions: 3.0.0
Reporter: Sean R. Owen


Example:

{code}
[WARNING] [Warn] 
/Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala:308:
 symbol literal is deprecated; use Symbol("assertInvariants") instead
{code}
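For reference, a minimal before/after sketch with an arbitrary symbol name; in SQL tests that use symbols as column references, the `$"name"` interpolator from `spark.implicits._` can be an alternative, though this ticket only targets the literal syntax itself:
{code:scala}
object SymbolLiteralMigration {
  // Deprecated symbol literal syntax (triggers the warning above):
  // val s = 'assertInvariants
  // Replacement that compiles without the deprecation warning:
  val s: Symbol = Symbol("assertInvariants")
}
{code}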



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation

2019-10-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947019#comment-16947019
 ] 

Hyukjin Kwon commented on SPARK-19609:
--

It was resolved as Incomplete because it targets an EOL release and has been inactive 
for more than a year, as discussed on the dev mailing list. Please reopen if you can 
verify it is still an issue in Spark 2.4+.

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-19609
> URL: https://issues.apache.org/jira/browse/SPARK-19609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>Priority: Major
>  Labels: bulk-closed
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation

2019-10-08 Thread Nick Dimiduk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947001#comment-16947001
 ] 

Nick Dimiduk commented on SPARK-19609:
--

Hi [~hyukjin.kwon], mind adding a comment as to why this issue was closed? Has 
the functionality been implemented elsewhere? How about a link off to the 
relevant JIRA so I know what fix version to look for? Thanks!

> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-19609
> URL: https://issues.apache.org/jira/browse/SPARK-19609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>Priority: Major
>  Labels: bulk-closed
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

2019-10-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29336:


Assignee: Guilherme Souza

> The implementation of QuantileSummaries.merge  does not guarantee that the 
> relativeError will be respected 
> ---
>
> Key: SPARK-29336
> URL: https://issues.apache.org/jira/browse/SPARK-29336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Guilherme Souza
>Assignee: Guilherme Souza
>Priority: Minor
>
> Hello Spark maintainers,
> I was experimenting with my own implementation of the [space-efficient 
> quantile 
> algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
>  in another language and I was using the Spark's one as a reference.
> In my analysis, I believe to have found an issue with the {{merge()}} logic. 
> Here is some simple Scala code that reproduces the issue I've found:
>  
> {code:java}
> var values = (1 to 100).toArray
> val all_quantiles = values.indices.map(i => (i+1).toDouble / 
> values.length).toArray
> for (n <- 0 until 5) {
>   var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
>   val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
>   val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
>   val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) 
> => Math.abs(expected - answer) }).toArray
>   val max_error = error.max
>   print(max_error + "\n")
> }
> {code}
> I query for all possible quantiles in a 100-element array with a desired 10% 
> max error. In this scenario, one would expect to observe a maximum error of 
> 10 ranks or less (10% of 100). However, the output I observe is:
>  
> {noformat}
> 16
> 12
> 10
> 11
> 17{noformat}
> The variance is probably due to non-deterministic operations behind the 
> scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not 
> used to it)
> Interestingly enough, if I change from five to one partition the code works 
> as expected and gives 10 every time. This seems to point to some problem at 
> the [merge 
> logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]
> The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
> the history) suggest the published paper is not clear on how that should be 
> done and, honestly, I was not confident in the current approach either.
> I've found SPARK-21184 that reports the same problem, but it was 
> unfortunately closed with no fix applied.
> In my external implementation I believe to have found a sound way to 
> implement the merge method. [Here is my take in Rust, if 
> relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]
> I'd be really glad to add unit tests and contribute my implementation adapted 
> to Scala.
>  I'd love to hear your opinion on the matter.
> Best regards
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29336) The implementation of QuantileSummaries.merge does not guarantee that the relativeError will be respected

2019-10-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29336.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26029
[https://github.com/apache/spark/pull/26029]

> The implementation of QuantileSummaries.merge  does not guarantee that the 
> relativeError will be respected 
> ---
>
> Key: SPARK-29336
> URL: https://issues.apache.org/jira/browse/SPARK-29336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Guilherme Souza
>Assignee: Guilherme Souza
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hello Spark maintainers,
> I was experimenting with my own implementation of the [space-efficient 
> quantile 
> algorithm|http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf]
>  in another language and I was using the Spark's one as a reference.
> In my analysis, I believe to have found an issue with the {{merge()}} logic. 
> Here is some simple Scala code that reproduces the issue I've found:
>  
> {code:java}
> var values = (1 to 100).toArray
> val all_quantiles = values.indices.map(i => (i+1).toDouble / 
> values.length).toArray
> for (n <- 0 until 5) {
>   var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
>   val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
>   val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
>   val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) 
> => Math.abs(expected - answer) }).toArray
>   val max_error = error.max
>   print(max_error + "\n")
> }
> {code}
> I query for all possible quantiles in a 100-element array with a desired 10% 
> max error. In this scenario, one would expect to observe a maximum error of 
> 10 ranks or less (10% of 100). However, the output I observe is:
>  
> {noformat}
> 16
> 12
> 10
> 11
> 17{noformat}
> The variance is probably due to non-deterministic operations behind the 
> scenes, but irrelevant to the core cause. (and sorry for my Scala, I'm not 
> used to it)
> Interestingly enough, if I change from five to one partition the code works 
> as expected and gives 10 every time. This seems to point to some problem at 
> the [merge 
> logic|https://github.com/apache/spark/blob/51d6ba7490eaac32fc33b8996fdf06b747884a54/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L153-L171]
> The original authors ([~clockfly] and [~cloud_fan] for what I could dig from 
> the history) suggest the published paper is not clear on how that should be 
> done and, honestly, I was not confident in the current approach either.
> I've found SPARK-21184 that reports the same problem, but it was 
> unfortunately closed with no fix applied.
> In my external implementation I believe to have found a sound way to 
> implement the merge method. [Here is my take in Rust, if 
> relevant|https://github.com/sitegui/space-efficient-quantile/blob/188c74638c9840e5f47d6c6326b2886d47b149bc/src/modified_gk/summary.rs#L162-L218]
> I'd be really glad to add unit tests and contribute my implementation adapted 
> to Scala.
>  I'd love to hear your opinion on the matter.
> Best regards
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-10-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29335:
-
Description: 
We are trying to leverage CBO to get better plan results for a few critical 
queries run through spark-sql or through the Thrift server using the JDBC driver. 

The following settings were added to spark-defaults.conf:

{code}
spark.sql.cbo.enabled true
spark.experimental.extrastrategies intervaljoin
spark.sql.cbo.joinreorder.enabled true
{code}

 

The tables that we are using are not partitioned.

{code}
spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
analyze table arrow.t_fperiods_sundar compute statistics for columns eid, year, 
ptype, absref, fpid , pid ;
analyze table arrow.t_fdata_sundar compute statistics ;
analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
absref;
{code}

Analyze completed successfully.

DESCRIBE EXTENDED does not show column-level stats data, and queries are not 
leveraging table- or column-level stats.

We are using Oracle as our Hive catalog store, not Glue.

*When we run queries through Spark SQL, we are not able to see the stats being used 
in the explain plan, and we are not sure if CBO is applied.*

*A quick response would be helpful.*

*Explain Plan:*

The following EXPLAIN command does not reference any statistics usage.
 
{code}
spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
 
19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
2017),(ptype#4546 = A),(eid#4542 = 29940),isnull(PID#4527),isnotnull(fpid#4523)
19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct
19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
isnotnull(absref#4569),(absref#4569 = 
Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct
19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
== Parsed Logical Plan ==
'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
+- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
(('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
('a12.eid = 29940)) && isnull('a12.PID)))
 +- 'Join Inner
 :- 'SubqueryAlias a12
 : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
 +- 'SubqueryAlias a13
 +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
 
== Analyzed Logical Plan ==
imnem: string, fvalue: string, ptype: string, absref: string
Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
+- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = 
2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527)))
 +- Join Inner
 :- SubqueryAlias a12
 : +- SubqueryAlias t_fperiods_sundar
 : +- 
Relation[FPID#4523,QTR#4524,ABSREF#4525,DSET#4526,PID#4527,NID#4528,CCD#4529,LOCKDATE#4530,LOCKUSER#4531,UPDUSER#4532,UPDDATE#4533,RESTATED#4534,DATADATE#4535,DATASTATE#4536,ACCSTD#4537,INDFMT#4538,DATAFMT#4539,CONSOL#4540,FYR#4541,EID#4542,FPSTATE#4543,BATCH_ID#4544L,YEAR#4545,PTYPE#4546]
 parquet
 +- SubqueryAlias a13
 +- SubqueryAlias t_fdata_sundar
 +- 
Relation[FDID#4547,IMNEM#4548,DSET#4549,DSRS#4550,PID#4551,FVALUE#4552,FCOMP#4553,CONF#4554,UPDDATE#4555,UPDUSER#4556,SEQ#4557,CCD#4558,NID#4559,CONVTYPE#4560,DATADATE#4561,DCOMMENT#4562,UDOMAIN#4563,FPDID#4564L,VLP_CALC#4565,EID#4566,FPID#4567,BATCH_ID#4568L,ABSREF#4569]
 parquet
 
== Optimized Logical Plan ==
Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
+- Join Inner, ((eid#4542 = eid#4566) && (fpid#4523 = cast(fpid#4567 as 
decimal(38,0
 :- Project [FPID#4523, EID#4542, PTYPE#4546]
 : +- Filter (((isnotnull(ptype#4546) && isnotnull(year#4545)) && 
isnotnull(eid#4542)) && (year#4545 = 2017)) && (ptype#4546 = A)) && (eid#4542 = 
29940)) && isnull(PID#4527)) && isnotnull(fpid#4523))
 : +- 

[jira] [Resolved] (SPARK-29335) Cost Based Optimizer stats are not used while evaluating query plans in Spark Sql

2019-10-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29335.
--
Resolution: Invalid

Questions should go to the mailing list or Stack Overflow. You are likely to get a 
better answer there than here.

> Cost Based Optimizer stats are not used while evaluating query plans in Spark 
> Sql
> -
>
> Key: SPARK-29335
> URL: https://issues.apache.org/jira/browse/SPARK-29335
> Project: Spark
>  Issue Type: Question
>  Components: Optimizer
>Affects Versions: 2.3.0
> Environment: We tried to execute the same using Spark-sql and Thrify 
> server using SQLWorkbench but we are not able to use the stats.
>Reporter: Srini E
>Priority: Major
> Attachments: explain_plan_cbo_spark.txt
>
>
> We are trying to leverage CBO for getting better plan results for few 
> critical queries run thru spark-sql or thru thrift server using jdbc driver. 
> Following settings added to spark-defaults.conf
> {code}
> spark.sql.cbo.enabled true
> spark.experimental.extrastrategies intervaljoin
> spark.sql.cbo.joinreorder.enabled true
> {code}
>  
> The tables that we are using are not partitioned.
> {code}
> spark-sql> analyze table arrow.t_fperiods_sundar compute statistics ;
> analyze table arrow.t_fperiods_sundar compute statistics for columns eid, 
> year, ptype, absref, fpid , pid ;
> analyze table arrow.t_fdata_sundar compute statistics ;
> analyze table arrow.t_fdata_sundar compute statistics for columns eid, fpid, 
> absref;
> {code}
> Analyze completed success fully.
> Describe extended , does not show column level stats data and queries are not 
> leveraging table or column level stats .
> we are using Oracle as our Hive Catalog store and not Glue .
> *When we are using spark sql and running queries we are not able to see the 
> stats in use in the explain plan and we are not sure if cbo is put to use.*
> *A quick response would be helpful.*
> *Explain Plan:*
> Following Explain command does not reference to any Statistics usage.
>  
> {code}
> spark-sql> *explain extended select a13.imnem,a13.fvalue,a12.ptype,a13.absref 
> from arrow.t_fperiods_sundar a12, arrow.t_fdata_sundar a13 where a12.eid = 
> a13.eid and a12.fpid = a13.fpid and a13.absref = 'Y2017' and a12.year = 2017 
> and a12.ptype = 'A' and a12.eid = 29940 and a12.PID is NULL ;*
>  
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(ptype#4546),isnotnull(year#4545),isnotnull(eid#4542),(year#4545 = 
> 2017),(ptype#4546 = A),(eid#4542 = 
> 29940),isnull(PID#4527),isnotnull(fpid#4523)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct decimal(38,0), PID: string, EID: decimal(10,0), YEAR: int, PTYPE: string ... 
> 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(PTYPE),IsNotNull(YEAR),IsNotNull(EID),EqualTo(YEAR,2017),EqualTo(PTYPE,A),EqualTo(EID,29940),IsNull(PID),IsNotNull(FPID)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Pruning directories with:
> 19/09/05 14:15:15 INFO FileSourceStrategy: Post-Scan Filters: 
> isnotnull(absref#4569),(absref#4569 = 
> Y2017),isnotnull(fpid#4567),isnotnull(eid#4566),(eid#4566 = 29940)
> 19/09/05 14:15:15 INFO FileSourceStrategy: Output Data Schema: struct string, FVALUE: string, EID: decimal(10,0), FPID: decimal(10,0), ABSREF: 
> string ... 3 more fields>
> 19/09/05 14:15:15 INFO FileSourceScanExec: Pushed Filters: 
> IsNotNull(ABSREF),EqualTo(ABSREF,Y2017),IsNotNull(FPID),IsNotNull(EID),EqualTo(EID,29940)
> == Parsed Logical Plan ==
> 'Project ['a13.imnem, 'a13.fvalue, 'a12.ptype, 'a13.absref]
> +- 'Filter 'a12.eid = 'a13.eid) && ('a12.fpid = 'a13.fpid)) && 
> (('a13.absref = Y2017) && ('a12.year = 2017))) && ((('a12.ptype = A) && 
> ('a12.eid = 29940)) && isnull('a12.PID)))
>  +- 'Join Inner
>  :- 'SubqueryAlias a12
>  : +- 'UnresolvedRelation `arrow`.`t_fperiods_sundar`
>  +- 'SubqueryAlias a13
>  +- 'UnresolvedRelation `arrow`.`t_fdata_sundar`
>  
> == Analyzed Logical Plan ==
> imnem: string, fvalue: string, ptype: string, absref: string
> Project [imnem#4548, fvalue#4552, ptype#4546, absref#4569]
> +- Filter eid#4542 = eid#4566) && (cast(fpid#4523 as decimal(38,0)) = 
> cast(fpid#4567 as decimal(38,0 && ((absref#4569 = Y2017) && (year#4545 = 
> 2017))) && (((ptype#4546 = A) && (cast(eid#4542 as decimal(10,0)) = 
> cast(cast(29940 as decimal(5,0)) as decimal(10,0 && isnull(PID#4527)))
>  +- Join Inner
>  :- SubqueryAlias a12
>  : +- SubqueryAlias t_fperiods_sundar
>  : +- 
> 

[jira] [Commented] (SPARK-29356) Stopping Spark doesn't shut down all network connections

2019-10-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946810#comment-16946810
 ] 

Hyukjin Kwon commented on SPARK-29356:
--

Can you post a reproducer or at least some warn/error messages to track?

> Stopping Spark doesn't shut down all network connections
> 
>
> Key: SPARK-29356
> URL: https://issues.apache.org/jira/browse/SPARK-29356
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Malthe Borch
>Priority: Minor
>
> The Spark session's gateway client still has an open network connection after 
> a call to `spark.stop()`. This is unexpected and for example in a test suite, 
> this triggers a resource warning when tearing down the test case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29358.
--
Resolution: Won't Fix

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946809#comment-16946809
 ] 

Hyukjin Kwon commented on SPARK-29358:
--

Given that the workaround is pretty easy, I don't think we need the new API. The case 
doesn't seem strong to me (e.g., it's not that there is no way to work around it, or 
that considerable code would be required).

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29372) Codegen grows beyond 64 KB for more columns in case of SupportsScanColumnarBatch

2019-10-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29372:
-
Priority: Major  (was: Critical)

> Codegen grows beyond 64 KB for more columns in case of 
> SupportsScanColumnarBatch
> 
>
> Key: SPARK-29372
> URL: https://issues.apache.org/jira/browse/SPARK-29372
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.2
>Reporter: Shubham Chaurasia
>Priority: Major
>
> In case of vectorized DSv2 readers i.e. if it implements 
> {{SupportsScanColumnarBatch}} and number of columns is around(or greater 
> than) 1000 then it throws
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Code of method 
> "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:990)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:899)
>   at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:1016)
>   at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:11911)
>   at 
> org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:3675)
>   at org.codehaus.janino.UnitCompiler.access$5500(UnitCompiler.java:212)
> {code}
> I can see from logs that it tries to disable Whole-stage codegen but it's 
> failing even after that on each retry.
> {code}
> 19/10/07 20:49:35 WARN WholeStageCodegenExec: Whole-stage codegen disabled 
> for plan (id=0):
>  *(0) DataSourceV2Scan [column_0#3558, column_1#3559, column_2#3560, 
> column_3#3561, column_4#3562, column_5#3563, column_6#3564, column_7#3565, 
> column_8#3566, column_9#3567, column_10#3568, column_11#3569, column_12#3570, 
> column_13#3571, column_14#3572, column_15#3573, column_16#3574, 
> column_17#3575, column_18#3576, column_19#3577, column_20#3578, 
> column_21#3579, column_22#3580, column_23#3581, ... 976 more fields], 
> com.shubham.reader.MyDataSourceReader@5c7673b8
> {code}
> Repro code for a simple reader can be: 
> {code:java}
> public class MyDataSourceReader implements DataSourceReader, 
> SupportsScanColumnarBatch {
>   private StructType schema;
>   private int numCols = 10;
>   private int numRows = 10;
>   private int numReaders = 1;
>   public MyDataSourceReader(Map options) {
> initOptions(options);
> System.out.println("MyDataSourceReader.MyDataSourceReader: 
> Instantiated" + this);
>   }
>   private void initOptions(Map options) {
> String numColumns = options.get("num_columns");
> if (numColumns != null) {
>   numCols = Integer.parseInt(numColumns);
> }
> String numRowsOption = options.get("num_rows_per_reader");
> if (numRowsOption != null) {
>   numRows = Integer.parseInt(numRowsOption);
> }
> String readersOption = options.get("num_readers");
> if (readersOption != null) {
>   numReaders = Integer.parseInt(readersOption);
> }
>   }
>   @Override public StructType readSchema() {
> final String colPrefix = "column_";
> StructField[] fields = new StructField[numCols];
> for (int i = 0; i < numCols; i++) {
>   fields[i] = new StructField(colPrefix + i, DataTypes.IntegerType, true, 
> Metadata.empty());
> }
> schema = new StructType(fields);
> return schema;
>   }
>   @Override public List> 
> createBatchDataReaderFactories() {
> System.out.println("MyDataSourceReader.createDataReaderFactories: " + 
> numReaders);
> return new ArrayList<>();
>   }
> }
> {code}
> If I pass {{num_columns}} 1000 or greater, the issue appears.
> {code:java}
> spark.read.format("com.shubham.MyDataSource").option("num_columns", 
> "1000").option("num_rows_per_reader", 2).option("num_readers", 1).load.show
> {code}
> Any fixes/workarounds for this? 
> SPARK-16845 and SPARK-17092 are resolved but looks like they don't deal with 
> the vectorized part.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-10-08 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946784#comment-16946784
 ] 

Sean R. Owen commented on SPARK-29212:
--

I don't have a strong opinion on it. I'd make it all consistent first, before 
changing to another consistent state. Removing truly no-op classes is OK, 
unless we can foresee a use for them later. PySpark does mean to wrap the JVM 
implementation, so "Java"-related classes aren't inherently wrong even as part 
of a developer API. Refactoring is good, but weigh it against breaking existing 
extensions of these classes, even in Spark 3.

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> Copied from [https://github.com/apache/spark/pull/25776].
>  
>  Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  * 
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  * 
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
>  Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  * 
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  * 
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
>  def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
>  or try to narrow things down using structural subtyping:
>  class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
> (...)
>  * First of all nothing in the original API indicated this. On the contrary, 
> the original API clearly suggests that non-Java path is supported, by 
> providing base classes (Params, Transformer, Estimator, Model, ML 
> \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific 
> implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, 
> JavaML\{Reader,Writer}, JavaML
> {Readable,Writable}
> ).
>  * Furthermore authoritative (IMHO) and open Python ML extensions exist 
> (spark-sklearn is one of these, but if I recall correctly spark-deep-learning 
> provides so pure-Python utilities). Personally I've seen quite a lot of 
> private implementations, but that's just anecdotal evidence.
> Let us assume for the sake of argument that above observations are 
> irrelevant. I will argue that having complete, public type hierarchy is still 
> desired:
>  * Two out three use cases I described, can narrowed down to Java 
> implementation only, though there are less compelling if we do that.
>  * More importantly, public type hierarchy with Java specific extensions, is 
> pyspark.ml standard. There is no reason to treat this specific case as an 
> exception, especially when the implementations, is far from utilitarian (for 
> example implementation-free JavaClassifierParams and 
> 

[jira] [Assigned] (SPARK-24640) size(null) returns null

2019-10-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24640:


Assignee: Maxim Gekk

> size(null) returns null 
> 
>
> Key: SPARK-24640
> URL: https://issues.apache.org/jira/browse/SPARK-24640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: api, bulk-closed
>
> Size(null) should return null instead of -1 in 3.0 release. This is a 
> behavior change. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24640) size(null) returns null

2019-10-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24640.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26051
[https://github.com/apache/spark/pull/26051]

> size(null) returns null 
> 
>
> Key: SPARK-24640
> URL: https://issues.apache.org/jira/browse/SPARK-24640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Major
>  Labels: api, bulk-closed
> Fix For: 3.0.0
>
>
> Size(null) should return null instead of -1 in 3.0 release. This is a 
> behavior change. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29391) Default year-month units

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29391:
--

 Summary: Default year-month units
 Key: SPARK-29391
 URL: https://issues.apache.org/jira/browse/SPARK-29391
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL assumes year-month units by default:
{code}
maxim=# SELECT interval '1-2'; 
   interval
---
 1 year 2 mons
{code}
but the same literal produces NULL in Spark.
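A Scala sketch (illustrative only, not the proposed implementation) of interpreting a bare 'Y-M' literal as a number of months:
{code:scala}
object YearMonthShorthand {
  // "1-2" -> 1 year 2 months; a leading '-' negates the whole value,
  // as in SQL-standard year-month interval literals.
  private val Pattern = """([+-]?)(\d+)-(\d+)""".r

  /** Returns the total number of months, or None if the literal does not match. */
  def toMonths(s: String): Option[Int] = s.trim match {
    case Pattern(sign, y, m) =>
      val months = y.toInt * 12 + m.toInt
      Some(if (sign == "-") -months else months)
    case _ => None
  }
}

// Example: YearMonthShorthand.toMonths("1-2") == Some(14), i.e. 1 year 2 mons.
{code}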



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29390) Add the justify_days(), justify_hours() and justify_interval() functions

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29390:
--

 Summary: Add the justify_days(), justify_hours() and justify_interval() functions
 Key: SPARK-29390
 URL: https://issues.apache.org/jira/browse/SPARK-29390
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


See *Table 9.31. Date/Time Functions* ([https://www.postgresql.org/docs/12/functions-datetime.html])
|{{justify_days(interval)}}|{{interval}}|Adjust interval so 30-day time periods are represented as months|{{justify_days(interval '35 days')}}|{{1 mon 5 days}}|
|{{justify_hours(interval)}}|{{interval}}|Adjust interval so 24-hour time periods are represented as days|{{justify_hours(interval '27 hours')}}|{{1 day 03:00:00}}|
|{{justify_interval(interval)}}|{{interval}}|Adjust interval using {{justify_days}} and {{justify_hours}}, with additional sign adjustments|{{justify_interval(interval '1 mon -1 hour')}}|{{29 days 23:00:00}}|
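A hedged Scala sketch of the justification logic over a simplified (months, days, microseconds) representation; the 30-days-per-month and 24-hours-per-day conventions follow the table above, and PostgreSQL's additional sign adjustments in justify_interval are omitted:
{code:scala}
// Simplified interval representation: months, days, microseconds.
final case class SimpleInterval(months: Long, days: Long, micros: Long)

object Justify {
  private val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000

  /** justify_days: 30-day periods become months, e.g. 35 days -> 1 mon 5 days. */
  def justifyDays(i: SimpleInterval): SimpleInterval =
    i.copy(months = i.months + i.days / 30, days = i.days % 30)

  /** justify_hours: 24-hour periods become days, e.g. 27 hours -> 1 day 03:00:00. */
  def justifyHours(i: SimpleInterval): SimpleInterval =
    i.copy(days = i.days + i.micros / MicrosPerDay, micros = i.micros % MicrosPerDay)

  /** Both adjustments; the extra sign normalization is not modelled here. */
  def justifyInterval(i: SimpleInterval): SimpleInterval = justifyDays(justifyHours(i))
}

// Example: Justify.justifyDays(SimpleInterval(0, 35, 0)) == SimpleInterval(1, 5, 0)
{code}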



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29389) Short synonyms of interval units

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29389:
--

 Summary: Short synonyms of interval units
 Key: SPARK-29389
 URL: https://issues.apache.org/jira/browse/SPARK-29389
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


The following synonyms should be supported:
{code}
 ["MILLENNIUM", ("MILLENNIA", "MIL", "MILS"),
   "CENTURY", ("CENTURIES", "C", "CENT"),
   "DECADE", ("DECADES", "DEC", "DECS"),
   "YEAR", ("Y", "YEARS", "YR", "YRS"),
   "QUARTER", ("QTR"),
   "MONTH", ("MON", "MONS", "MONTHS"),
   "DAY", ("D", "DAYS"),
   "HOUR", ("H", "HOURS", "HR", "HRS"),
   "MINUTE", ("M", "MIN", "MINS", "MINUTES"),
   "SECOND", ("S", "SEC", "SECONDS", "SECS"),
   "MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", "MSECONDS", 
"MS"),
   "MICROSECONDS", ("USEC", "USECS", "USECONDS", "MICROSECON", 
"US"),
   "EPOCH"]
{code}

For example:
{code}
maxim=# select '1y 10mon -10d -10h -10min -10.01s 
ago'::interval;
interval

 -1 years -10 mons +10 days 10:10:10.01
(1 row)
{code}
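A minimal Scala sketch of a synonym table that normalizes unit spellings before parsing; only a subset of the units above is shown, and the choice of canonical names is an assumption:
{code:scala}
object UnitSynonyms {
  // Maps every accepted spelling (lower-cased) to a canonical unit name.
  private val canonical: Map[String, String] = Map(
    "year" -> "year", "years" -> "year", "y" -> "year", "yr" -> "year", "yrs" -> "year",
    "month" -> "month", "months" -> "month", "mon" -> "month", "mons" -> "month",
    "day" -> "day", "days" -> "day", "d" -> "day",
    "hour" -> "hour", "hours" -> "hour", "h" -> "hour", "hr" -> "hour", "hrs" -> "hour",
    "minute" -> "minute", "minutes" -> "minute", "m" -> "minute", "min" -> "minute", "mins" -> "minute",
    "second" -> "second", "seconds" -> "second", "s" -> "second", "sec" -> "second", "secs" -> "second")

  def normalize(unit: String): Option[String] = canonical.get(unit.toLowerCase)
}

// Example: UnitSynonyms.normalize("mon") == Some("month"); UnitSynonyms.normalize("fortnight") == None
{code}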



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29389) Support synonyms for interval units

2019-10-08 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29389:
---
Summary: Support synonyms for interval units  (was: Short synonyms of 
interval units)

> Support synonyms for interval units
> ---
>
> Key: SPARK-29389
> URL: https://issues.apache.org/jira/browse/SPARK-29389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Should be supported the following synonyms:
> {code}
>  ["MILLENNIUM", ("MILLENNIA", "MIL", "MILS"),
>"CENTURY", ("CENTURIES", "C", "CENT"),
>"DECADE", ("DECADES", "DEC", "DECS"),
>"YEAR", ("Y", "YEARS", "YR", "YRS"),
>"QUARTER", ("QTR"),
>"MONTH", ("MON", "MONS", "MONTHS"),
>"DAY", ("D", "DAYS"),
>"HOUR", ("H", "HOURS", "HR", "HRS"),
>"MINUTE", ("M", "MIN", "MINS", "MINUTES"),
>"SECOND", ("S", "SEC", "SECONDS", "SECS"),
>"MILLISECONDS", ("MSEC", "MSECS", "MILLISECON", 
> "MSECONDS", "MS"),
>"MICROSECONDS", ("USEC", "USECS", "USECONDS", 
> "MICROSECON", "US"),
>"EPOCH"]
> {code}
> For example:
> {code}
> maxim=# select '1y 10mon -10d -10h -10min -10.01s 
> ago'::interval;
> interval
> 
>  -1 years -10 mons +10 days 10:10:10.01
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29388) Construct intervals from the `millenniums`, `centuries` or `decades` units

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29388:
--

 Summary: Construct intervals from the `millenniums`, `centuries` 
or `decades` units
 Key: SPARK-29388
 URL: https://issues.apache.org/jira/browse/SPARK-29388
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL supports the `millenniums`, `centuries` and `decades` interval units. For example:
{code}
maxim=# select '4 millenniums 5 centuries 4 decades 1 year 4 months 4 days 17 
minutes 31 seconds'::interval;
 interval  
---
 4541 years 4 mons 4 days 00:17:31
(1 row)
{code}
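These units reduce to months; a tiny Scala sketch of the conversion factors (1000, 100 and 10 years respectively):
{code:scala}
object BigYearUnits {
  private val MonthsPerYear = 12L

  /** Months contributed by n units of "millennium", "century" or "decade". */
  def toMonths(unit: String, n: Long): Long = unit.toLowerCase match {
    case "millennium" | "millenniums" | "millennia" => n * 1000 * MonthsPerYear
    case "century" | "centuries"                    => n * 100 * MonthsPerYear
    case "decade" | "decades"                       => n * 10 * MonthsPerYear
    case other => throw new IllegalArgumentException(s"Unsupported unit: $other")
  }
}

// Example: 4 millenniums + 5 centuries + 4 decades = (4000 + 500 + 40) years = 4540 years,
// which together with the 1 year 4 months above gives the 4541 years 4 mons shown.
{code}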



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29387) Support `*` and `/` operators for intervals

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29387:
--

 Summary: Support `*` and `/` operators for intervals
 Key: SPARK-29387
 URL: https://issues.apache.org/jira/browse/SPARK-29387
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Support multiplying an interval by a numeric (`*`) and dividing it by a numeric (`/`). See 
[https://www.postgresql.org/docs/12/functions-datetime.html]
||Operator||Example||Result||
|*|900 * interval '1 second'|interval '00:15:00'|
|*|21 * interval '1 day'|interval '21 days'|
|/|interval '1 hour' / double precision '1.5'|interval '00:40:00'|
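A hedged Scala sketch of multiplying and dividing a simplified (months, days, microseconds) interval by a numeric factor; carrying fractional months or days over into smaller units, which PostgreSQL performs, is deliberately left out:
{code:scala}
final case class SimpleInterval(months: Long, days: Long, micros: Long)

object IntervalArithmetic {
  /** Scales every field by factor; fractional months/days are rounded here
    * instead of being carried into smaller units as PostgreSQL does. */
  def multiply(i: SimpleInterval, factor: Double): SimpleInterval =
    SimpleInterval(
      math.round(i.months * factor),
      math.round(i.days * factor),
      math.round(i.micros * factor))

  def divide(i: SimpleInterval, divisor: Double): SimpleInterval = {
    require(divisor != 0.0, "interval division by zero")
    multiply(i, 1.0 / divisor)
  }
}

// Examples (micros field only):
//   multiply(SimpleInterval(0, 0, 1000000L), 900)  == SimpleInterval(0, 0, 900000000L)   // 00:15:00
//   divide(SimpleInterval(0, 0, 3600000000L), 1.5) == SimpleInterval(0, 0, 2400000000L)  // 00:40:00
{code}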



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29386) Copy data between a file and a table

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29386:
--

 Summary: Copy data between a file and a table 
 Key: SPARK-29386
 URL: https://issues.apache.org/jira/browse/SPARK-29386
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


https://www.postgresql.org/docs/12/sql-copy.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29385) Make `INTERVAL` values comparable

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29385:
--

 Summary: Make `INTERVAL` values comparable
 Key: SPARK-29385
 URL: https://issues.apache.org/jira/browse/SPARK-29385
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL allows comparing intervals with `=`, `<>`, `<`, `<=`, `>`, `>=`. For example:
{code}
maxim=# select interval '1 month' > interval '29 days';
 ?column? 
--
 t
{code}
but the same fails in Spark:
{code}
spark-sql> select interval 1 month > interval 29 days;
Error in query: cannot resolve '(interval 1 months > interval 4 weeks 1 days)' 
due to data type mismatch: GreaterThan does not support ordering on type 
interval; line 1 pos 7;
'Project [unresolvedalias((interval 1 months > interval 4 weeks 1 days), None)]
+- OneRowRelation
{code}
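For comparison to work, intervals need a total order; below is an illustrative Scala Ordering over a simplified (months, days, microseconds) value. The 30-days-per-month and 24-hours-per-day assumptions mirror PostgreSQL's comparison semantics and are spelled out because month lengths are not fixed in general:
{code:scala}
final case class SimpleInterval(months: Long, days: Long, micros: Long)

object IntervalOrdering {
  private val MicrosPerDay = 24L * 60 * 60 * 1000 * 1000

  /** Total microseconds, assuming 1 month = 30 days and 1 day = 24 hours. */
  private def toMicros(i: SimpleInterval): Long =
    (i.months * 30 + i.days) * MicrosPerDay + i.micros

  implicit val ordering: Ordering[SimpleInterval] =
    Ordering.by((i: SimpleInterval) => toMicros(i))
}

// Example: IntervalOrdering.ordering.gt(SimpleInterval(1, 0, 0), SimpleInterval(0, 29, 0)) == true
// i.e. interval '1 month' > interval '29 days', matching the PostgreSQL result above.
{code}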



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29384) Support `ago` in interval strings

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29384:
--

 Summary: Support `ago` in interval strings
 Key: SPARK-29384
 URL: https://issues.apache.org/jira/browse/SPARK-29384
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL allows specifying the direction in an interval string with the `ago` keyword:
{code}
maxim=# select interval '@ 1 year 2 months 3 days 14 seconds ago';
  interval  

 -1 years -2 mons -3 days -00:00:14
{code}
 See 
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
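A small Scala sketch of the string pre-processing this would need; the optional leading `@` (tracked separately in SPARK-29383) is stripped as well, and the negation is applied to a simplified (months, days, microseconds) value. The `base` parser passed in is hypothetical:
{code:scala}
final case class SimpleInterval(months: Long, days: Long, micros: Long)

object IntervalDirection {
  /** Strips an optional leading '@' and, if the string ends with "ago",
    * removes that suffix and negates the value produced by the given parser. */
  def parseWith(base: String => SimpleInterval)(s: String): SimpleInterval = {
    val noPrefix = s.trim.stripPrefix("@").trim
    if (noPrefix.toLowerCase.endsWith("ago")) {
      val i = base(noPrefix.dropRight(3).trim)
      SimpleInterval(-i.months, -i.days, -i.micros)
    } else {
      base(noPrefix)
    }
  }
}

// Example (with a hypothetical parseBody that understands "1 year 2 months 3 days 14 seconds"):
//   parseWith(parseBody)("@ 1 year 2 months 3 days 14 seconds ago")
//   == SimpleInterval(-14, -3, -14000000L)
{code}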



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29383) Support the optional prefix `@` in interval strings

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29383:
--

 Summary: Support the optional prefix `@` in interval strings
 Key: SPARK-29383
 URL: https://issues.apache.org/jira/browse/SPARK-29383
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


PostgreSQL allows `@` at the beginning and `ago` at the end of interval strings:
{code}
maxim=# select interval '@ 14 seconds';
 interval 
--
 00:00:14
{code}
See 
https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29382) Support the `INTERVAL` type by Parquet datasource

2019-10-08 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29382:
--

 Summary: Support the `INTERVAL` type by Parquet datasource
 Key: SPARK-29382
 URL: https://issues.apache.org/jira/browse/SPARK-29382
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Spark cannot create a table using parquet if a column has the `INTERVAL` type:
{code}
spark-sql> CREATE TABLE INTERVAL_TBL (f1 interval) USING PARQUET;
Error in query: Parquet data source does not support interval data type.;
{code}
This is needed for SPARK-29368



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9636) Treat $SPARK_HOME as write-only

2019-10-08 Thread Philipp Angerer (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Angerer reopened SPARK-9636:


This is not fixed and there was no reason given why it was closed, so I’ll 
reopen it.

> Treat $SPARK_HOME as write-only
> ---
>
> Key: SPARK-9636
> URL: https://issues.apache.org/jira/browse/SPARK-9636
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.4.1
> Environment: Linux
>Reporter: Philipp Angerer
>Priority: Minor
>  Labels: bulk-closed
>
> when starting spark scripts as user and it is installed in a directory the 
> user has no write permissions on, many things work fine, except for the logs 
> (e.g. for {{start-master.sh}})
> logs are per default written to {{$SPARK_LOG_DIR}} or (if unset) to 
> {{$SPARK_HOME/logs}}.
> if installed in this way, it should, instead of throwing an error, write logs 
> to {{/var/log/spark/}}. that’s easy to fix by simply testing a few log dirs 
> in sequence for writability before trying to use one. i suggest using 
> {{$SPARK_LOG_DIR}} (if set) → {{/var/log/spark/}} → {{~/.cache/spark-logs/}} 
> → {{$SPARK_HOME/logs/}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29379) SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'

2019-10-08 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29379:
--
Comment: was deleted

(was: Don't need to add new expression class. 

If we just add code in ShowFunctionsCommand, we should change a lot UT about 
functions:

{code:java}
case class ShowFunctionsCommand(
    db: Option[String],
    pattern: Option[String],
    showUserFunctions: Boolean,
    showSystemFunctions: Boolean) extends RunnableCommand {

  override val output: Seq[Attribute] = {
    val schema = StructType(StructField("function", StringType, nullable = false) :: Nil)
    schema.toAttributes
  }

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase)
    // If pattern is not specified, we use '*', which is used to
    // match any sequence of characters (including no characters).
    val functionNames =
      sparkSession.sessionState.catalog
        .listFunctions(dbName, pattern.getOrElse("*"))
        .collect {
          case (f, "USER") if showUserFunctions => f.unquotedString
          case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
        }
    (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_))
  }
}
{code}
)

> SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'
> 
>
> Key: SPARK-29379
> URL: https://issues.apache.org/jira/browse/SPARK-29379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence

2019-10-08 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946609#comment-16946609
 ] 

huangtianhua commented on SPARK-29222:
--

The tests specified in -SPARK-29205- failed every time when run on an ARM 
instance, and after increasing the timeout and the batch duration they succeed, 
but we didn't run them 100 times, only several times. I have no idea how the 
batchDuration setting of StreamingContext should be chosen; is there a guiding 
principle?
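
For reference, the batch duration is the interval passed when constructing the 
streaming context; a minimal Scala sketch (the flaky test itself is in PySpark, 
where the equivalent value is the second argument to StreamingContext; the app 
name and durations below are made up):
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative only: a longer batch duration gives slow executors (e.g. ARM CI
// instances) more time per batch, at the cost of a longer-running test.
val conf = new SparkConf().setMaster("local[2]").setAppName("batch-duration-demo")
val ssc = new StreamingContext(conf, Seconds(2))  // batch duration of 2s instead of, say, 1s
{code}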

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
> ---
>
> Key: SPARK-29222
> URL: https://issues.apache.org/jira/browse/SPARK-29222
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/]
> {code:java}
> Error Message
> 7 != 10
> StacktraceTraceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 7 != 10
>{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29381) Add 'private' _XXXParams classes for classification & regression

2019-10-08 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29381:


 Summary: Add 'private' _XXXParams classes for classification & 
regression
 Key: SPARK-29381
 URL: https://issues.apache.org/jira/browse/SPARK-29381
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


ping [~huaxingao], would you like to work on this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29380) RFormula avoid repeated 'first' jobs to get vector size

2019-10-08 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-29380:


 Summary: RFormula avoid repeated 'first' jobs to get vector size
 Key: SPARK-29380
 URL: https://issues.apache.org/jira/browse/SPARK-29380
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


In the current implementation, {{RFormula}} triggers one {{first}} job to get 
the vector size whenever the size cannot be obtained from the {{AttributeGroup}}.

This can be optimized by fetching the first row lazily and reusing it for each 
vector column.
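
A minimal sketch of the idea (illustrative only, not the actual {{RFormula}} 
code; the helper name and signature are made up):
{code:java}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col

// Compute the vector size of several columns while triggering at most one
// `first` job: the first row is fetched lazily and reused for every column.
def vectorSizes(dataset: DataFrame, vectorCols: Seq[String]): Map[String, Int] = {
  lazy val firstRow: Row = dataset.select(vectorCols.map(col): _*).first()
  vectorCols.zipWithIndex.map { case (name, idx) =>
    name -> firstRow.getAs[Vector](idx).size
  }.toMap
}
{code}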



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24640) size(null) returns null

2019-10-08 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk reopened SPARK-24640:


> size(null) returns null 
> 
>
> Key: SPARK-24640
> URL: https://issues.apache.org/jira/browse/SPARK-24640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>  Labels: api, bulk-closed
>
> Size(null) should return null instead of -1 in the 3.0 release. This is a 
> behavior change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

2019-10-08 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946583#comment-16946583
 ] 

zhengruifeng commented on SPARK-29212:
--

[~zero323]  ??we should remove Java specific mixins, if they don't serve any 
practical value (provide no implementation whatsoever or don't extend other 
{{Java*}} mixins, like {{JavaPredictorParams}}, or have no JVM wrapper specific 
implementation, like {{JavaPredictor}}).??

I am neutral on it; what are your thoughts? [~huaxingao] [~srowen]

 

??As of the second point there is additional consideration here - some 
{{Java*}} classes are considered part of the public API, and this should stay 
as is (these provide crucial information to the end user). ??

I guess we have reached agreement in related tickets (like _XXXParams in 
features/clustering).

 

??On a side note current approach to ML API requires a lot of boilerplate code. 
Lately I've been playing with [some 
ideas|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611], that 
wouldn't require code generation - they have some caveats, but maybe there is 
something there. ??

It looks succinct; I think we may take it into account in the future.

 

> Add common classes without using JVM backend
> 
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Major
>
> Copied from [https://github.com/apache/spark/pull/25776].
>  
>  Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  * 
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  * 
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
>  Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  * 
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  * 
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
>  def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
>  or try to narrow things down using structural subtyping:
>  class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
> (...)
>  * First of all nothing in the original API indicated this. On the contrary, 
> the original API clearly suggests that non-Java path is supported, by 
> providing base classes (Params, Transformer, Estimator, Model, ML 
> \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific 
> implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, 
> JavaML\{Reader,Writer}, JavaML\{Readable,Writable}).
>  * Furthermore authoritative (IMHO) and open Python ML extensions exist 
> (spark-sklearn is one of these, but if I recall correctly spark-deep-learning 
> provides some pure-Python utilities). Personally I've seen quite a lot of 
> private implementations, but that's just anecdotal evidence.
> Let us assume 

[jira] [Assigned] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-29269:


Assignee: Huaxin Gao

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao], would you like to work on this? This is similar to your 
> previous work. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-29269:
-
Comment: was deleted

(was: It seems that I do not have the permission to assign a ticket:
```
JIRAError: JiraError HTTP 403 url: 
https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee
 text: You do not have permission to assign issues.

```

[~dongjoon] Could you please help assign this ticket to Huaxin? Thanks!)

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao], would you like to work on this? This is similar to your 
> previous work. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29366) Subqueries created for DPP are not printed in EXPLAIN FORMATTED

2019-10-08 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-29366.
-
Fix Version/s: 3.0.0
 Assignee: Dilip Biswal
   Resolution: Fixed

> Subqueries created for DPP are not printed in EXPLAIN FORMATTED
> ---
>
> Key: SPARK-29366
> URL: https://issues.apache.org/jira/browse/SPARK-29366
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> The subquery expressions introduced by DPP (dynamic partition pruning) are not 
> printed in the new EXPLAIN FORMATTED output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29379) SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'

2019-10-08 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946526#comment-16946526
 ] 

angerszhu commented on SPARK-29379:
---

There is no need to add a new expression class. 

If we just add the operators in ShowFunctionsCommand, we would have to change a 
lot of unit tests about functions:

{code:java}
case class ShowFunctionsCommand(
    db: Option[String],
    pattern: Option[String],
    showUserFunctions: Boolean,
    showSystemFunctions: Boolean) extends RunnableCommand {

  override val output: Seq[Attribute] = {
    val schema = StructType(StructField("function", StringType, nullable = false) :: Nil)
    schema.toAttributes
  }

  override def run(sparkSession: SparkSession): Seq[Row] = {
    val dbName = db.getOrElse(sparkSession.sessionState.catalog.getCurrentDatabase)
    // If pattern is not specified, we use '*', which is used to
    // match any sequence of characters (including no characters).
    val functionNames =
      sparkSession.sessionState.catalog
        .listFunctions(dbName, pattern.getOrElse("*"))
        .collect {
          case (f, "USER") if showUserFunctions => f.unquotedString
          case (f, "SYSTEM") if showSystemFunctions => f.unquotedString
        }
    (functionNames ++ Seq("!=", "<>", "between", "case")).sorted.map(Row(_))
  }
}
{code}


> SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'
> 
>
> Key: SPARK-29379
> URL: https://issues.apache.org/jira/browse/SPARK-29379
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> SHOW FUNCTIONS don't show '!=', '<>' , 'between', 'case'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946524#comment-16946524
 ] 

zhengruifeng commented on SPARK-29269:
--

It seems that I do not have the permission to assign a ticket:
```
JIRAError: JiraError HTTP 403 url: 
https://issues.apache.org/jira/rest/api/latest/issue/SPARK-29269/assignee
 text: You do not have permission to assign issues.

```

[~dongjoon] Could you please help assign this ticket to Huaxin? Thanks!

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao], would you like to work on this? This is similar to your 
> previous work. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24640) size(null) returns null

2019-10-08 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946503#comment-16946503
 ] 

Maxim Gekk edited comment on SPARK-24640 at 10/8/19 6:11 AM:
-

As far as I remember we planned to remove spark.sql.legacy.sizeOfNull in 3.0. 
[~hyukjin.kwon] [~smilegator] This ticket is a reminder of that. See 
https://github.com/apache/spark/pull/21598#issuecomment-399695523


was (Author: maxgekk):
As far as I remember we planed to remove spark.sql.legacy.sizeOfNull in 3.0. 
[~hyukjin.kwon] [~smilegator] This ticket is a remainder of this.

> size(null) returns null 
> 
>
> Key: SPARK-24640
> URL: https://issues.apache.org/jira/browse/SPARK-24640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>  Labels: api, bulk-closed
>
> Size(null) should return null instead of -1 in the 3.0 release. This is a 
> behavior change.
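
For reference, a minimal spark-shell sketch of the legacy flag mentioned in the 
comment above (assuming a Spark build that still exposes 
spark.sql.legacy.sizeOfNull):
{code:java}
// Toggle the legacy flag and compare size(NULL): -1 is the legacy behavior,
// NULL is the behavior proposed as the 3.0 default.
spark.conf.set("spark.sql.legacy.sizeOfNull", true)
spark.sql("SELECT size(NULL)").show()   // -1 with the legacy behavior
spark.conf.set("spark.sql.legacy.sizeOfNull", false)
spark.sql("SELECT size(NULL)").show()   // NULL otherwise
{code}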



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29269) Pyspark ALSModel support getters/setters

2019-10-08 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-29269.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25947
[https://github.com/apache/spark/pull/25947]

> Pyspark ALSModel support getters/setters
> 
>
> Key: SPARK-29269
> URL: https://issues.apache.org/jira/browse/SPARK-29269
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> ping [~huaxingao], would you like to work on this? This is similar to your 
> previous work. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org