[jira] [Commented] (SPARK-39457) Support IPv6-only environment

2022-06-15 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17554830#comment-17554830
 ] 

DB Tsai commented on SPARK-39457:
-

If there are any IPv6 issues on the Hadoop client side, we might hit them once we get 
Spark fully working in a pure IPv6 environment. We will test it once we get there.

> Support IPv6-only environment
> -
>
> Key: SPARK-39457
> URL: https://issues.apache.org/jira/browse/SPARK-39457
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: DB Tsai
>Priority: Major
>  Labels: releasenotes
>
> Spark doesn't fully work in a pure IPv6 environment that has no IPv4 at 
> all. This is an umbrella JIRA tracking support for pure IPv6 deployments. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39459) LocalSchedulerBackend doesn't support IPV6

2022-06-13 Thread DB Tsai (Jira)
DB Tsai created SPARK-39459:
---

 Summary: LocalSchedulerBackend doesn't support IPV6
 Key: SPARK-39459
 URL: https://issues.apache.org/jira/browse/SPARK-39459
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: DB Tsai



{code:java}
➜  ./bin/spark-shell
22/06/09 14:52:35 WARN Utils: Your hostname, DBs-Mac-mini-2.local resolves to a 
loopback address: 127.0.0.1; using 2600:1700:1151:11ef:0:0:0:2000 instead (on 
interface en1)
22/06/09 14:52:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
22/06/09 14:52:43 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
22/06/09 14:52:44 ERROR SparkContext: Error initializing SparkContext.
java.lang.AssertionError: assertion failed: Expected hostname or IPv6 IP 
enclosed in [] but got 2600:1700:1151:11ef:0:0:0:2000
at scala.Predef$.assert(Predef.scala:223) ~[scala-library-2.12.15.jar:?]
at org.apache.spark.util.Utils$.checkHost(Utils.scala:1110) 
~[spark-core_2.12-3.2.0.jar:3.2.0.37]
at org.apache.spark.executor.Executor.<init>(Executor.scala:89) 
~[spark-core_2.12-3.2.0.jar:3.2.0.37]
at 
org.apache.spark.scheduler.local.LocalEndpoint.<init>(LocalSchedulerBackend.scala:64)
 ~[spark-core_2.12-3.2.0.jar:3.2.0]
at 
org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:132)
 ~[spark-core_2.12-3.2.0.jar:3.2.0]
{code}
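
The assertion comes from Spark's host validation (the {{Utils.checkHost}} frame in the stack trace), which expects bare IPv6 literals to be wrapped in square brackets. Below is a minimal sketch of that normalization with a hypothetical helper name, not the actual Spark code:

{code:scala}
// Hypothetical helper: wrap a bare IPv6 literal in brackets so it satisfies
// checks that expect "hostname or IPv6 IP enclosed in []".
object HostFormat {
  def normalizeHost(host: String): String = {
    // An unbracketed address containing ':' is assumed to be a raw IPv6 literal.
    if (host.contains(":") && !host.startsWith("[")) s"[$host]" else host
  }

  def main(args: Array[String]): Unit = {
    println(normalizeHost("2600:1700:1151:11ef:0:0:0:2000")) // [2600:1700:1151:11ef:0:0:0:2000]
    println(normalizeHost("my-host.example.com"))            // unchanged
  }
}
{code}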




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39457) Support pure IPV6 environment without IPV4

2022-06-13 Thread DB Tsai (Jira)
DB Tsai created SPARK-39457:
---

 Summary: Support pure IPV6 environment without IPV4
 Key: SPARK-39457
 URL: https://issues.apache.org/jira/browse/SPARK-39457
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.2.1
Reporter: DB Tsai


Spark doesn't fully work in a pure IPv6 environment that has no IPv4 at 
all. This is an umbrella JIRA tracking support for pure IPv6 deployments. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36895) Add Create Index syntax support

2021-10-26 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36895.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34148
[https://github.com/apache/spark/pull/34148]

> Add Create Index syntax support
> ---
>
> Key: SPARK-36895
> URL: https://issues.apache.org/jira/browse/SPARK-36895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36895) Add Create Index syntax support

2021-10-26 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-36895:
---

Assignee: Huaxin Gao

> Add Create Index syntax support
> ---
>
> Key: SPARK-36895
> URL: https://issues.apache.org/jira/browse/SPARK-36895
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37113) Upgrade Parquet to 1.12.2

2021-10-25 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-37113.
-
Fix Version/s: 3.2.1
   3.3.0
   Resolution: Fixed

Issue resolved by pull request 34391
[https://github.com/apache/spark/pull/34391]

> Upgrade Parquet to 1.12.2
> -
>
> Key: SPARK-37113
> URL: https://issues.apache.org/jira/browse/SPARK-37113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
>
> Upgrade Parquet version to 1.12.2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37113) Upgrade Parquet to 1.12.2

2021-10-25 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-37113:
---

Assignee: Chao Sun

> Upgrade Parquet to 1.12.2
> -
>
> Key: SPARK-37113
> URL: https://issues.apache.org/jira/browse/SPARK-37113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Upgrade Parquet version to 1.12.2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36821) Create a test to extend ColumnarBatch

2021-09-27 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-36821:
---

Assignee: Yufei Gu

> Create a test to extend ColumnarBatch
> -
>
> Key: SPARK-36821
> URL: https://issues.apache.org/jira/browse/SPARK-36821
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
>
> As a follow-up of SPARK-36814, create a test that extends ColumnarBatch to 
> prevent future changes from breaking it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36821) Create a test to extend ColumnarBatch

2021-09-27 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36821.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34087
[https://github.com/apache/spark/pull/34087]

> Create a test to extend ColumnarBatch
> -
>
> Key: SPARK-36821
> URL: https://issues.apache.org/jira/browse/SPARK-36821
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 3.3.0
>
>
> As a follow-up of SPARK-36814, create a test that extends ColumnarBatch to 
> prevent future changes from breaking it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36814) Make class ColumnarBatch extendable

2021-09-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-36814:
---

Assignee: Yufei Gu

> Make class ColumnarBatch extendable
> ---
>
> Key: SPARK-36814
> URL: https://issues.apache.org/jira/browse/SPARK-36814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Assignee: Yufei Gu
>Priority: Major
> Fix For: 3.3.0
>
>
> To support better vectorized reading in multiple data sources, ColumnarBatch 
> needs to be extendable. For example, to support row-level deletes 
> ([https://github.com/apache/iceberg/issues/3141]) in Iceberg's vectorized 
> read, we need to filter out deleted rows in a batch, which requires 
> ColumnarBatch to be extendable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36814) Make class ColumnarBatch extendable

2021-09-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36814.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34054
[https://github.com/apache/spark/pull/34054]

> Make class ColumnarBatch extendable
> ---
>
> Key: SPARK-36814
> URL: https://issues.apache.org/jira/browse/SPARK-36814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Priority: Major
> Fix For: 3.3.0
>
>
> To support better vectorized reading in multiple data sources, ColumnarBatch 
> needs to be extendable. For example, to support row-level deletes 
> ([https://github.com/apache/iceberg/issues/3141]) in Iceberg's vectorized 
> read, we need to filter out deleted rows in a batch, which requires 
> ColumnarBatch to be extendable.
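
Once the class is non-final, a subclass could look roughly like the sketch below; the class name and the delete-filter bookkeeping are made up for illustration and are not the actual Iceberg implementation:

{code:scala}
import org.apache.spark.sql.vectorized.{ColumnVector, ColumnarBatch}

// Hypothetical subclass: wraps the same column vectors but reports only the
// rows that survive a row-level delete filter. It only illustrates why
// ColumnarBatch needs to be extendable.
class DeleteAwareColumnarBatch(
    columns: Array[ColumnVector],
    liveRowIds: Array[Int]) extends ColumnarBatch(columns, liveRowIds.length) {

  // Map the i-th surviving row back to its physical position in the batch.
  def physicalRowId(i: Int): Int = liveRowIds(i)
}
{code}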



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-15 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415703#comment-17415703
 ] 

DB Tsai edited comment on SPARK-36696 at 9/15/21, 7:21 PM:
---

This issue is addressed by SPARK-34542. Can you verify and close this?


was (Author: dbtsai):
This issue is addressed by SPARK-34542 can we close this JIRA?

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by Parquet 1.12.0.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-15 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415703#comment-17415703
 ] 

DB Tsai edited comment on SPARK-36696 at 9/15/21, 7:20 PM:
---

This issue is addressed by SPARK-34542. Can we close this JIRA?


was (Author: dbtsai):
This issue is addressed by https://issues.apache.org/jira/browse/SPARK-34542 
can we close this JIRA?

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by Parquet 1.12.0.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-15 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415703#comment-17415703
 ] 

DB Tsai commented on SPARK-36696:
-

This issue is addressed by https://issues.apache.org/jira/browse/SPARK-34542. 
Can we close this JIRA?

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by Parquet 1.12.0.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36726) Upgrade Parquet to 1.12.1

2021-09-15 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36726.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33969
[https://github.com/apache/spark/pull/33969]

> Upgrade Parquet to 1.12.1
> -
>
> Key: SPARK-36726
> URL: https://issues.apache.org/jira/browse/SPARK-36726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Blocker
> Fix For: 3.2.0
>
>
> Upgrade Apache Parquet to 1.12.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36726) Upgrade Parquet to 1.12.1

2021-09-15 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-36726:
---

Assignee: Chao Sun

> Upgrade Parquet to 1.12.1
> -
>
> Key: SPARK-36726
> URL: https://issues.apache.org/jira/browse/SPARK-36726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Blocker
>
> Upgrade Apache Parquet to 1.12.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36644) Push down boolean column filter

2021-09-03 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36644.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33898
[https://github.com/apache/spark/pull/33898]

> Push down boolean column filter
> ---
>
> Key: SPARK-36644
> URL: https://issues.apache.org/jira/browse/SPARK-36644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE boolean_field
> ```
> although the following query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE boolean_field = true
> ```
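
A small, self-contained way to compare the two query shapes and inspect the pushed filters via explain(); the table path and column names below are made up for the example:

{code:scala}
import org.apache.spark.sql.SparkSession

object BooleanPushdownExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("bool-pushdown").getOrCreate()
    import spark.implicits._

    // Write a tiny Parquet table with a boolean column so a data-source
    // filter can be pushed down on read.
    Seq((1, true), (2, false), (3, true)).toDF("id", "boolean_field")
      .write.mode("overwrite").parquet("/tmp/bool_pushdown_t")
    spark.read.parquet("/tmp/bool_pushdown_t").createOrReplaceTempView("t")

    // Compare the PushedFilters section of the two physical plans.
    spark.sql("SELECT * FROM t WHERE boolean_field").explain()
    spark.sql("SELECT * FROM t WHERE boolean_field = true").explain()

    spark.stop()
  }
}
{code}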



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36644) Push down boolean column filter

2021-09-03 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-36644:
---

Assignee: Kazuyuki Tanimura

> Push down boolean column filter
> ---
>
> Key: SPARK-36644
> URL: https://issues.apache.org/jira/browse/SPARK-36644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE boolean_field
> ```
> although the following query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE boolean_field = true
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36481) Expose LogisticRegression.setInitialModel

2021-08-11 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-36481.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33710
[https://github.com/apache/spark/pull/33710]

> Expose LogisticRegression.setInitialModel
> -
>
> Key: SPARK-36481
> URL: https://issues.apache.org/jira/browse/SPARK-36481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.3.0
>
>
> Several Spark ML components already allow setting of an initial model, 
> including KMeans, LogisticRegression, and GaussianMixture. This is useful to 
> begin training from a known reasonably good model.
> However, the method in LogisticRegression is private to Spark. I don't see a 
> good reason why it should be, as the others in KMeans et al. are not.
> None of these are exposed in PySpark, which I don't necessarily want to 
> question or deal with now; there are other places one could arguably set an 
> initial model too, but here I'm just interested in exposing the existing, tested 
> functionality to callers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35921) ${spark.yarn.isHadoopProvided} in config.properties is not edited if build with SBT

2021-06-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-35921.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33121
[https://github.com/apache/spark/pull/33121]

> ${spark.yarn.isHadoopProvided} in config.properties is not edited if build 
> with SBT
> ---
>
> Key: SPARK-35921
> URL: https://issues.apache.org/jira/browse/SPARK-35921
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> yarn sub-module contains config.properties.
> {code}
> spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}
> {code}
> The ${spark.yarn.isHadoopProvided} part is replaced with true or false at 
> build time, depending on whether Hadoop is provided (specified by 
> -Phadoop-provided).
> The edited config.properties is loaded at runtime to control how the 
> Hadoop-related classpath is populated.
> This process works when building with Maven, but not with SBT.
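
A minimal sketch of how such a substituted flag could be read at runtime; the resource path and key handling below are placeholders for illustration, not the actual Spark launcher code:

{code:scala}
import java.util.Properties

object HadoopProvidedFlag {
  // Load config.properties from the classpath and read the substituted flag.
  // "/config.properties" is a placeholder path for this sketch.
  def isHadoopProvided(): Boolean = {
    val props = new Properties()
    val in = getClass.getResourceAsStream("/config.properties")
    if (in != null) {
      try props.load(in) finally in.close()
    }
    // If the ${...} placeholder was never substituted at build time,
    // the value will not equal "true" and we treat Hadoop as not provided.
    props.getProperty("spark.yarn.isHadoopProvided", "false").trim.equalsIgnoreCase("true")
  }

  def main(args: Array[String]): Unit = println(isHadoopProvided())
}
{code}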



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35782) leveldbjni doesn't work in Apple Silicon on macOS

2021-06-17 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365014#comment-17365014
 ] 

DB Tsai commented on SPARK-35782:
-

[~yikunkero] It will only work for Apple Silicon on Linux, not macOS. For 
macOS, we need to recompile it for that specific OS.

> leveldbjni doesn't work in Apple Silicon on macOS
> -
>
> Key: SPARK-35782
> URL: https://issues.apache.org/jira/browse/SPARK-35782
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: DB Tsai
>Priority: Major
>
> leveldbjni doesn't contain the native library for Apple Silicon on macOS. We 
> will need to build the native library for Apple Silicon on macOS and cut a new 
> release so Spark can use it.
> However, it has not been maintained for a long time, and the last release was in 
> 2016. Per 
> [discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/leveldbjni-dependency-td30146.html]
>  in the Spark dev mailing list, other platforms also run into the same support 
> issue. Perhaps we should consider RocksDB as a replacement.
> Note, here is the RocksDB task to support Apple Silicon: 
> https://github.com/facebook/rocksdb/issues/7720



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35782) leveldbjni doesn't work in Apple Silicon on macOS

2021-06-15 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-35782:

Description: 
leveldbjni doesn't contain the native library for Apple Silicon on macOS. We 
will need to build the native library for Apple Silicon on macOS and cut a new 
release so Spark can use it.

However, it has not been maintained for a long time, and the last release was in 
2016. Per 
[discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/leveldbjni-dependency-td30146.html]
 in the Spark dev mailing list, other platforms also run into the same support 
issue. Perhaps we should consider RocksDB as a replacement.

Note, here is the RocksDB task to support Apple Silicon: 
https://github.com/facebook/rocksdb/issues/7720

  was:
leveldbjni doesn't contain the native library for Apple Silicon on macOS. We 
will need to build native library for Apple Silicon on macOS, and cut a new 
release so Spark can use it.

However, it is not maintained for a long time, and the last release was in 
2016. Per 
[discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/leveldbjni-dependency-td30146.html]
 in spark dev mailing list, other platform also runs into the same support 
issue. Perhaps, we should we consider racksdb as replacement.


> leveldbjni doesn't work in Apple Silicon on macOS
> -
>
> Key: SPARK-35782
> URL: https://issues.apache.org/jira/browse/SPARK-35782
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: DB Tsai
>Priority: Major
>
> leveldbjni doesn't contain the native library for Apple Silicon on macOS. We 
> will need to build the native library for Apple Silicon on macOS and cut a new 
> release so Spark can use it.
> However, it has not been maintained for a long time, and the last release was in 
> 2016. Per 
> [discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/leveldbjni-dependency-td30146.html]
>  in the Spark dev mailing list, other platforms also run into the same support 
> issue. Perhaps we should consider RocksDB as a replacement.
> Note, here is the RocksDB task to support Apple Silicon: 
> https://github.com/facebook/rocksdb/issues/7720



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35782) leveldbjni doesn't work in Apple Silicon on macOS

2021-06-15 Thread DB Tsai (Jira)
DB Tsai created SPARK-35782:
---

 Summary: leveldbjni doesn't work in Apple Silicon on macOS
 Key: SPARK-35782
 URL: https://issues.apache.org/jira/browse/SPARK-35782
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: DB Tsai


leveldbjni doesn't contain the native library for Apple Silicon on macOS. We 
will need to build the native library for Apple Silicon on macOS and cut a new 
release so Spark can use it.

However, it has not been maintained for a long time, and the last release was in 
2016. Per 
[discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/leveldbjni-dependency-td30146.html]
 in the Spark dev mailing list, other platforms also run into the same support 
issue. Perhaps we should consider RocksDB as a replacement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35781) Support Spark on Apple Silicon on macOS natively

2021-06-15 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-35781:

Description: This is an umbrella JIRA tracking the progress of supporting 
Apple Silicon on macOS natively.

> Support Spark on Apple Silicon on macOS natively
> 
>
> Key: SPARK-35781
> URL: https://issues.apache.org/jira/browse/SPARK-35781
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.1.2
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA tracking the progress of supporting Apple Silicon on 
> macOS natively.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35781) Support Spark on Apple Silicon on macOS natively

2021-06-15 Thread DB Tsai (Jira)
DB Tsai created SPARK-35781:
---

 Summary: Support Spark on Apple Silicon on macOS natively
 Key: SPARK-35781
 URL: https://issues.apache.org/jira/browse/SPARK-35781
 Project: Spark
  Issue Type: New Feature
  Components: Build
Affects Versions: 3.1.2
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35640) Refactor Parquet vectorized reader to remove duplicated code paths

2021-06-10 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-35640.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32777
[https://github.com/apache/spark/pull/32777]

> Refactor Parquet vectorized reader to remove duplicated code paths
> --
>
> Key: SPARK-35640
> URL: https://issues.apache.org/jira/browse/SPARK-35640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently in Parquet vectorized code path, there are many code duplications 
> such as the following:
> {code:java}
>   public void readIntegers(
>   int total,
>   WritableColumnVector c,
>   int rowId,
>   int level,
>   VectorizedValuesReader data) throws IOException {
> int left = total;
> while (left > 0) {
>   if (this.currentCount == 0) this.readNextGroup();
>   int n = Math.min(left, this.currentCount);
>   switch (mode) {
> case RLE:
>   if (currentValue == level) {
> data.readIntegers(n, c, rowId);
>   } else {
> c.putNulls(rowId, n);
>   }
>   break;
> case PACKED:
>   for (int i = 0; i < n; ++i) {
> if (currentBuffer[currentBufferIdx++] == level) {
>   c.putInt(rowId + i, data.readInteger());
> } else {
>   c.putNull(rowId + i);
> }
>   }
>   break;
>   }
>   rowId += n;
>   left -= n;
>   currentCount -= n;
> }
>   }
> {code}
> This makes it hard to maintain, as any change here needs to be 
> replicated in 20+ places. The issue becomes more serious when we are going to 
> implement column index and complex type support for the vectorized path.
> The original intention was performance. However, nowadays JIT compilers 
> tend to be smart about this and will inline virtual calls as much as possible.
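
The gist of the refactor is to write the definition-level loop once and pass in the type-specific actions, instead of copying the loop for every value type. The following is a language-agnostic sketch of that pattern in Scala, not the actual Spark Java code:

{code:scala}
object LevelLoopSketch {
  // The RLE/PACKED loop is written once; the type-specific actions
  // ("write a value at position p" / "write a null at position p") are passed in.
  def readBatch(
      total: Int,
      startRowId: Int,
      level: Int,
      groups: Iterator[(String, Array[Int])], // (mode, per-position definition levels)
      putValue: Int => Unit,
      putNull: Int => Unit): Unit = {
    var left = total
    var rowId = startRowId
    while (left > 0 && groups.hasNext) {
      val (mode, levels) = groups.next()
      val n = math.min(left, levels.length)
      mode match {
        case "RLE" =>
          // In an RLE run every position carries the same level.
          if (levels(0) == level) (0 until n).foreach(i => putValue(rowId + i))
          else (0 until n).foreach(i => putNull(rowId + i))
        case "PACKED" =>
          (0 until n).foreach { i =>
            if (levels(i) == level) putValue(rowId + i) else putNull(rowId + i)
          }
      }
      rowId += n
      left -= n
    }
  }
}
{code}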



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33966) Two-tier encryption key management

2021-01-05 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259454#comment-17259454
 ] 

DB Tsai commented on SPARK-33966:
-

cc [~dongjoon] [~chaosun] and [~viirya]

> Two-tier encryption key management
> --
>
> Key: SPARK-33966
> URL: https://issues.apache.org/jira/browse/SPARK-33966
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gidon Gershinsky
>Priority: Major
>
> Columnar data formats (Parquet and ORC) have recently added a column 
> encryption capability. The data protection follows the practice of envelope 
> encryption, where the Data Encryption Key (DEK) is freshly generated for each 
> file/column, and is encrypted with a master key (or an intermediate key, that 
> is in turn encrypted with a master key). The master keys are kept in a 
> centralized Key Management Service (KMS) - meaning that each Spark worker 
> needs to interact with a (typically slow) KMS server. 
> This Jira (and its sub-tasks) introduce an alternative approach, that on one 
> hand preserves the best practice of generating fresh encryption keys for each 
> data file/column, and on the other hand allows Spark clusters to have a 
> scalable interaction with a KMS server, by delegating it to the application 
> driver. This is done via two-tier management of the keys, where a random Key 
> Encryption Key (KEK) is generated by the driver, encrypted by the master key 
> in the KMS, and distributed by the driver to the workers, so they can use it 
> to encrypt the DEKs, generated there by Parquet or ORC libraries. In the 
> workers, the KEKs are distributed to the executors/threads in the write path. 
> In the read path, the encrypted KEKs are fetched by workers from file 
> metadata, decrypted via interaction with the driver, and shared among the 
> executors/threads.
> The KEK layer further improves the scalability of key management, because 
> neither the driver nor the workers need to interact with the KMS for each file/column.
> Stand-alone Parquet/ORC libraries (without Spark) and/or other frameworks 
> (e.g., Presto, pandas) must be able to read/decrypt the files, 
> written/encrypted by this Spark-driven key management mechanism - and 
> vice-versa. [of course, only if both sides have proper authorisation for 
> using the master keys in the KMS]
> A link to a discussion/design doc is attached.
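
A minimal sketch of the two-tier idea using plain javax.crypto: the KEK wraps the per-file/column DEKs, so only the KEK needs a master-key/KMS interaction. All names are hypothetical, and this is not the Parquet/ORC or KMS integration described above:

{code:scala}
import javax.crypto.{Cipher, KeyGenerator, SecretKey}

object EnvelopeSketch {
  private def newAesKey(): SecretKey = {
    val gen = KeyGenerator.getInstance("AES")
    gen.init(128)
    gen.generateKey()
  }

  def main(args: Array[String]): Unit = {
    // Driver side: generate a random Key Encryption Key (KEK). In the design above,
    // the KEK itself is encrypted by a master key held in the KMS.
    val kek = newAesKey()

    // Worker side: a fresh Data Encryption Key (DEK) is generated per file/column
    // and wrapped with the KEK instead of calling the KMS for every DEK.
    val dek = newAesKey()
    val wrap = Cipher.getInstance("AESWrap")
    wrap.init(Cipher.WRAP_MODE, kek)
    val wrappedDek: Array[Byte] = wrap.wrap(dek)

    // Read path: unwrap the DEK with the shared KEK fetched from file metadata.
    val unwrap = Cipher.getInstance("AESWrap")
    unwrap.init(Cipher.UNWRAP_MODE, kek)
    val recovered = unwrap.unwrap(wrappedDek, "AES", Cipher.SECRET_KEY)
    println(recovered.getEncoded.sameElements(dek.getEncoded)) // true
  }
}
{code}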



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-33212.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29843
[https://github.com/apache/spark/pull/29843]

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled from the Hadoop side 
> (which used to be a lot).
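
For a downstream build, depending on the shaded client artifacts looks roughly like the sbt sketch below; the version number is a placeholder, not the one Spark pins:

{code:scala}
// build.sbt sketch: use the shaded Hadoop client artifacts instead of
// hadoop-common / hadoop-client. The version below is a placeholder.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client-api"     % "3.3.1",
  "org.apache.hadoop" % "hadoop-client-runtime" % "3.3.1" % Runtime
)
{code}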



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33212) Move to shaded clients for Hadoop 3.x profile

2020-10-21 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-33212:
---

Assignee: Chao Sun

> Move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> better evolve without worrying about dependencies pulled from the Hadoop side 
> (which used to be a lot).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-32721.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29567
[https://github.com/apache/spark/pull/29567]

> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.
> {{and}} and {{or}} clauses are more optimization friendly. For instance, by 
> converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
> potentially push the filter {{col > 42}} down to data sources to avoid 
> unnecessary IO.
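
A quick way to check the equivalence for a deterministic predicate, using a throwaway local session; this is only a sketch, and the actual rewrite is an optimizer rule in Catalyst:

{code:scala}
import org.apache.spark.sql.SparkSession

object SimplifyIfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("simplify-if").getOrCreate()
    import spark.implicits._

    Seq(41, 42, 43).toDF("col").createOrReplaceTempView("t")

    // if(p, null, false) matches and(p, null): false when p is false, null when p is true.
    spark.sql("SELECT col, if(col > 42, null, false) AS if_form, col > 42 AND null AS and_form FROM t").show()

    // if(p, null, true) matches or(not(p), null): true when p is false, null when p is true.
    spark.sql("SELECT col, if(col > 42, null, true) AS if_form, NOT(col > 42) OR null AS or_form FROM t").show()

    spark.stop()
  }
}
{code}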



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-32721:
---

Assignee: Chao Sun

> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.
> {{and}} and {{or}} clauses are more optimization friendly. For instance, by 
> converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
> potentially push the filter {{col > 42}} down to data sources to avoid 
> unnecessary IO.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22231) Support of map, filter, withField, dropFields in nested list of structures

2020-08-02 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-22231:

Summary: Support of map, filter, withField, dropFields in nested list of 
structures  (was: Support of map, filter, withColumn, dropColumn in nested list 
of structures)

> Support of map, filter, withField, dropFields in nested list of structures
> --
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema can be looked like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because all the optimizations like predicate pushdown in SQL cannot be 
> performed, we cannot leverage the columnar format, and the serialization 
> and deserialization cost becomes really huge even if we just want to add a new 
> column in the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe which take a function from 
> *Column* to *Column*, and then apply the function on nested dataframe. Here 
> is an example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability of applying a function in the nested dataframe, we can 
> add a new function, *withColumn* in *Column* to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---++--+
> // |foo|bar |items |
> // +---++--+
> // |10 

[jira] [Resolved] (SPARK-32397) Snapshot artifacts can have differing timestamps, making it hard to consume

2020-07-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-32397.
-
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29274
[https://github.com/apache/spark/pull/29274]

> Snapshot artifacts can have differing timestamps, making it hard to consume
> ---
>
> Key: SPARK-32397
> URL: https://issues.apache.org/jira/browse/SPARK-32397
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> Since we use multiple sub-components in building Spark, we can get into a 
> situation where the timestamps for these components are different. This can 
> make it difficult to consume Spark snapshots in an environment where someone 
> is running a nightly build for other folks to develop on top of.
> I believe I have a small fix for this already, but just waiting to verify and 
> then I'll open a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-07-29 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167504#comment-17167504
 ] 

DB Tsai edited comment on SPARK-32385 at 7/29/20, 9:04 PM:
---

+1 This will be very useful for users to include Spark as deps.

[~hyukjin.kwon] from  [https://www.baeldung.com/spring-maven-bom]

Following is an example of how to write a BOM file: 
{code:java}
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>baeldung</groupId>
    <artifactId>Baeldung-BOM</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>pom</packaging>
    <name>BaelDung-BOM</name>
    <description>parent pom</description>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>test</groupId>
                <artifactId>a</artifactId>
                <version>1.2</version>
            </dependency>
            <dependency>
                <groupId>test</groupId>
                <artifactId>b</artifactId>
                <version>1.0</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>test</groupId>
                <artifactId>c</artifactId>
                <version>1.0</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>
{code}

As we can see, the BOM is a normal POM file with a dependencyManagement section 
where we can include all an artifact's information and versions.




was (Author: dbtsai):
+1 This will be very useful for users to include Spark as deps.

 

[~hyukjin.kwon] from  [https://www.baeldung.com/spring-maven-bom]

Following is an example of how to write a BOM file: 
{code:java}
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>baeldung</groupId>
    <artifactId>Baeldung-BOM</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>pom</packaging>
    <name>BaelDung-BOM</name>
    <description>parent pom</description>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>test</groupId>
                <artifactId>a</artifactId>
                <version>1.2</version>
            </dependency>
            <dependency>
                <groupId>test</groupId>
                <artifactId>b</artifactId>
                <version>1.0</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>test</groupId>
                <artifactId>c</artifactId>
                <version>1.0</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>
{code}

As we can see, the BOM is a normal POM file with a dependencyManagement section 
where we can include all an artifact's information and versions.



> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g. as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issue is publishing a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring 
> projects, e.g. Spring Boot, there are BOM descriptors published which can be 
> used in downstream projects to fix the versions of Spring components and 
> their dependencies, significantly reducing confusion around proper version 
> numbers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-32385) Publish a "bill of materials" (BOM) descriptor for Spark with correct versions of various dependencies

2020-07-29 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167504#comment-17167504
 ] 

DB Tsai commented on SPARK-32385:
-

+1 This will be very useful for users to include Spark as deps.

 

[~hyukjin.kwon] from  [https://www.baeldung.com/spring-maven-bom]

Following is an example of how to write a BOM file: 
{code:java}
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>baeldung</groupId>
    <artifactId>Baeldung-BOM</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>pom</packaging>
    <name>BaelDung-BOM</name>
    <description>parent pom</description>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>test</groupId>
                <artifactId>a</artifactId>
                <version>1.2</version>
            </dependency>
            <dependency>
                <groupId>test</groupId>
                <artifactId>b</artifactId>
                <version>1.0</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>test</groupId>
                <artifactId>c</artifactId>
                <version>1.0</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>
{code}

As we can see, the BOM is a normal POM file with a dependencyManagement section 
where we can include all an artifact's information and versions.



> Publish a "bill of materials" (BOM) descriptor for Spark with correct 
> versions of various dependencies
> --
>
> Key: SPARK-32385
> URL: https://issues.apache.org/jira/browse/SPARK-32385
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Spark has a lot of dependencies, many of them very common (e.g. Guava, 
> Jackson). Also, versions of these dependencies are not updated as frequently 
> as they are released upstream, which is totally understandable and natural, 
> but which also means that often Spark has a dependency on a lower version of 
> a library, which is incompatible with a higher, more recent version of the 
> same library. This incompatibility can manifest in different ways, e.g. as 
> classpath errors or runtime check errors (like with Jackson), in certain 
> cases.
>  
> Spark does attempt to "fix" versions of its dependencies by declaring them 
> explicitly in its {{pom.xml}} file. However, this approach, being somewhat 
> workable if the Spark-using project itself uses Maven, breaks down if another 
> build system is used, like Gradle. The reason is that Maven uses an 
> unconventional "nearest first" version conflict resolution strategy, while 
> many other tools like Gradle use the "highest first" strategy which resolves 
> the highest possible version number inside the entire graph of dependencies. 
> This means that other dependencies of the project can pull a higher version 
> of some dependency, which is incompatible with Spark.
>  
> One example would be an explicit or a transitive dependency on a higher 
> version of Jackson in the project. Spark itself depends on several modules of 
> Jackson; if only one of them gets a higher version, and others remain on the 
> lower version, this will result in runtime exceptions due to an internal 
> version check in Jackson.
>  
> A widely used solution for this kind of version issue is publishing a 
> "bill of materials" descriptor (see here: 
> [https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html]
>  and here: 
> [https://docs.gradle.org/current/userguide/platforms.html#sub:bom_import]). 
> This descriptor would contain all versions of all dependencies of Spark; then 
> downstream projects will be able to use their build system's support for BOMs 
> to enforce version constraints required for Spark to function correctly.
>  
> One example of successful implementation of the BOM-based approach is Spring: 
> [https://www.baeldung.com/spring-maven-bom#spring-bom]. For different Spring 
> projects, e.g. Spring Boot, there are BOM descriptors published which can be 
> used in downstream projects to fix the versions of Spring components and 
> their dependencies, significantly reducing confusion around proper version 
> numbers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build

2020-06-18 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31960.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28788
[https://github.com/apache/spark/pull/28788]

> Only populate Hadoop classpath for no-hadoop build
> --
>
> Key: SPARK-31960
> URL: https://issues.apache.org/jira/browse/SPARK-31960
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build

2020-06-11 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31960:

Parent: SPARK-31582
Issue Type: Sub-task  (was: Bug)

> Only populate Hadoop classpath for no-hadoop build
> --
>
> Key: SPARK-31960
> URL: https://issues.apache.org/jira/browse/SPARK-31960
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31960) Only populate Hadoop classpath for no-hadoop build

2020-06-10 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31960:

Summary: Only populate Hadoop classpath for no-hadoop build  (was: Only 
populate Hadoop classpath when hadoop is provided)

> Only populate Hadoop classpath for no-hadoop build
> --
>
> Key: SPARK-31960
> URL: https://issues.apache.org/jira/browse/SPARK-31960
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31960) Only populate Hadoop classpath when hadoop is provided

2020-06-10 Thread DB Tsai (Jira)
DB Tsai created SPARK-31960:
---

 Summary: Only populate Hadoop classpath when hadoop is provided
 Key: SPARK-31960
 URL: https://issues.apache.org/jira/browse/SPARK-31960
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 3.0.0
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31876) Upgrade to Zstd 1.4.5

2020-06-02 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31876:
---

Assignee: William Hyun

> Upgrade to Zstd 1.4.5
> -
>
> Key: SPARK-31876
> URL: https://issues.apache.org/jira/browse/SPARK-31876
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
>
> This issue aims to upgrade to Zstd 1.4.5
> Zstd 1.4.5 improves performance.
> [https://github.com/facebook/zstd/releases/tag/v1.4.5]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31876) Upgrade to Zstd 1.4.5

2020-06-02 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31876.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28682
[https://github.com/apache/spark/pull/28682]

> Upgrade to Zstd 1.4.5
> -
>
> Key: SPARK-31876
> URL: https://issues.apache.org/jira/browse/SPARK-31876
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> This issue aims to upgrade to Zstd 1.4.5
> Zstd 1.4.5 improves performance.
> [https://github.com/facebook/zstd/releases/tag/v1.4.5]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31399) Closure cleaner broken in Scala 2.12

2020-05-19 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31399:

Fix Version/s: 2.4.6

> Closure cleaner broken in Scala 2.12
> 
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Kris Mok
>Priority: Blocker
> Fix For: 2.4.6, 3.0.0
>
>
> The `ClosureCleaner` only supports Scala functions, and it uses the following 
> check to catch closures
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This no longer works in 3.0 because we upgraded to Scala 2.12, and most Scala 
> functions become Java lambdas.
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
>   ... 47 more
> {code}
> **Apache Spark 2.4.5 with Scala 2.12**
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at 

[jira] [Comment Edited] (SPARK-25557) ORC predicate pushdown for nested fields

2020-05-12 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105680#comment-17105680
 ] 

DB Tsai edited comment on SPARK-25557 at 5/12/20, 6:57 PM:
---

[~omalley] Nothing is missing on the ORC side. We already have the foundation for 
nested predicate pushdown in Spark from the parent JIRA; all we need is to hook it 
up with ORC, as was done for Parquet in 
[https://github.com/apache/spark/pull/27728/files#diff-67a76299606811fd795f69f8d53b6f2bR56]


was (Author: dbtsai):
[~omalley] No missing part on ORC side. We already have the foundation to 
support nested predicate pushdown in Spark in the parent JIRA, and all we need 
is hook it up with ORC like 
[https://github.com/apache/spark/pull/27728/files#diff-67a76299606811fd795f69f8d53b6f2bR56]
 

> ORC predicate pushdown for nested fields
> 
>
> Key: SPARK-25557
> URL: https://issues.apache.org/jira/browse/SPARK-25557
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields

2020-05-12 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17105680#comment-17105680
 ] 

DB Tsai commented on SPARK-25557:
-

[~omalley] Nothing is missing on the ORC side. We already have the foundation for 
nested predicate pushdown in Spark from the parent JIRA; all we need is to hook it 
up with ORC, like 
[https://github.com/apache/spark/pull/27728/files#diff-67a76299606811fd795f69f8d53b6f2bR56]
 

> ORC predicate pushdown for nested fields
> 
>
> Key: SPARK-25557
> URL: https://issues.apache.org/jira/browse/SPARK-25557
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31582:

Fix Version/s: 3.0.0

> Be able to not populate Hadoop classpath
> 
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The Spark YARN client populates the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this results in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath 
> from `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31582) Being able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31582:

Summary: Being able to not populate Hadoop classpath  (was: Be able to not 
populate Hadoop classpath)

> Being able to not populate Hadoop classpath
> ---
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6, 3.0.0
>
>
> The Spark YARN client populates the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this results in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath 
> from `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31582.
-
Fix Version/s: 2.4.6
   Resolution: Fixed

Issue resolved by pull request 28376
[https://github.com/apache/spark/pull/28376]

> Be able to not populate Hadoop classpath
> 
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.6
>
>
> The Spark YARN client populates the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this results in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath 
> from `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-29 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31582:
---

Assignee: DB Tsai

> Be able to not populate Hadoop classpath
> 
>
> Key: SPARK-31582
> URL: https://issues.apache.org/jira/browse/SPARK-31582
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> The Spark YARN client populates the Hadoop classpath from 
> `yarn.application.classpath` and `mapreduce.application.classpath`. However, 
> for a Spark build with embedded Hadoop, this results in jar conflicts because 
> the Spark distribution can contain a different version of the Hadoop jars.
> We are adding a new YARN configuration to not populate the Hadoop classpath 
> from `yarn.application.classpath` and `mapreduce.application.classpath`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31582) Be able to not populate Hadoop classpath

2020-04-27 Thread DB Tsai (Jira)
DB Tsai created SPARK-31582:
---

 Summary: Be able to not populate Hadoop classpath
 Key: SPARK-31582
 URL: https://issues.apache.org/jira/browse/SPARK-31582
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Affects Versions: 2.4.5
Reporter: DB Tsai


The Spark YARN client populates the Hadoop classpath from 
`yarn.application.classpath` and `mapreduce.application.classpath`. However, 
for a Spark build with embedded Hadoop, this results in jar conflicts because 
the Spark distribution can contain a different version of the Hadoop jars.

We are adding a new YARN configuration to not populate the Hadoop classpath 
from `yarn.application.classpath` and `mapreduce.application.classpath`.
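
A sketch of how the new switch might be used (the property name below is an 
assumption for illustration; this description does not fix the final key name):
{code}
// Hypothetical: turn off population of the YARN/MapReduce application classpath
// when the Spark distribution already bundles its own Hadoop jars.
val conf = new org.apache.spark.SparkConf()
  .set("spark.yarn.populateHadoopClasspath", "false") // assumed key name
{code}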



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31364) Benchmark Nested Parquet Predicate Pushdown

2020-04-24 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31364.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28319
[https://github.com/apache/spark/pull/28319]

> Benchmark Nested Parquet Predicate Pushdown
> ---
>
> Key: SPARK-31364
> URL: https://issues.apache.org/jira/browse/SPARK-31364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> We would like to benchmark the best and worst scenarios, such as when no 
> record matches the predicate, and measure how much extra overhead is added.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31364) Benchmark Nested Parquet Predicate Pushdown

2020-04-24 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31364:

Summary: Benchmark Nested Parquet Predicate Pushdown  (was: Benchmark 
Parquet Predicate Pushdown)

> Benchmark Nested Parquet Predicate Pushdown
> ---
>
> Key: SPARK-31364
> URL: https://issues.apache.org/jira/browse/SPARK-31364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>
> We would like to benchmark the best and worst scenarios, such as when no 
> record matches the predicate, and measure how much extra overhead is added.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31365) Enable nested predicate pushdown per data sources

2020-04-06 Thread DB Tsai (Jira)
DB Tsai created SPARK-31365:
---

 Summary: Enable nested predicate pushdown per data sources
 Key: SPARK-31365
 URL: https://issues.apache.org/jira/browse/SPARK-31365
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: DB Tsai


Currently, nested predicate pushdown is either enabled or disabled for all data 
sources at once. We should create a configuration for each supported data source.
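
One possible shape for such a configuration (illustrative only; the final key 
name and format are decided in the implementing PR) is a list of file sources 
that accept nested predicates:
{code}
// Hypothetical: allow nested predicate pushdown only for Parquet and ORC.
spark.conf.set("spark.sql.optimizer.nestedPredicatePushdown.supportedFileSources",
  "parquet,orc")
{code}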



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2020-04-06 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076731#comment-17076731
 ] 

DB Tsai edited comment on SPARK-25558 at 4/6/20, 11:05 PM:
---

Issue resolved by [https://github.com/apache/spark/pull/27728]


was (Author: dbtsai):
Issued resolved by [https://github.com/apache/spark/pull/27728]

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2020-04-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25558.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25558) Pushdown predicates for nested fields in DataSource Strategy

2020-04-06 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076731#comment-17076731
 ] 

DB Tsai commented on SPARK-25558:
-

Issued resolved by [https://github.com/apache/spark/pull/27728]

> Pushdown predicates for nested fields in DataSource Strategy 
> -
>
> Key: SPARK-25558
> URL: https://issues.apache.org/jira/browse/SPARK-25558
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31364) Benchmark Parquet Predicate Pushdown

2020-04-06 Thread DB Tsai (Jira)
DB Tsai created SPARK-31364:
---

 Summary: Benchmark Parquet Predicate Pushdown
 Key: SPARK-31364
 URL: https://issues.apache.org/jira/browse/SPARK-31364
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: DB Tsai


We would like to benchmark the best and worst scenarios, such as when no 
record matches the predicate, and measure how much extra overhead is added.
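
A rough sketch of the worst-case measurement (schema, path, and sizes here are 
made up for illustration, not the benchmark that was eventually committed):
{code}
// Write nested data, then time a scan whose pushed predicate matches nothing.
spark.range(10L * 1000 * 1000)
  .selectExpr("named_struct('a', id, 'b', cast(id as string)) as s")
  .write.mode("overwrite").parquet("/tmp/nested_bench")

val start = System.nanoTime()
spark.read.parquet("/tmp/nested_bench").where("s.a < 0").count() // matches no record
println(s"elapsed: ${(System.nanoTime() - start) / 1e9} s")
{code}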



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default

2020-04-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-27644.
-
Resolution: Duplicate

> Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default
> -
>
> Key: SPARK-27644
> URL: https://issues.apache.org/jira/browse/SPARK-27644
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We can enable this after resolving all on-going issues and finishing more 
> verifications.
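> In the meantime the flag can be turned on manually (the key is the one named 
> in this issue's title):
> {code}
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
> {code}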



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29805) Enable nested schema pruning and pruning on expressions by default

2020-04-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-29805:

Parent: SPARK-25603
Issue Type: Sub-task  (was: New Feature)

> Enable nested schema pruning and pruning on expressions by default
> --
>
> Key: SPARK-29805
> URL: https://issues.apache.org/jira/browse/SPARK-29805
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default

2020-04-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-27644:

Parent: (was: SPARK-25556)
Issue Type: New Feature  (was: Sub-task)

> Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default
> -
>
> Key: SPARK-27644
> URL: https://issues.apache.org/jira/browse/SPARK-27644
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We can enable this after resolving all on-going issues and finishing more 
> verifications.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default

2020-04-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-27644:

Parent: SPARK-25603
Issue Type: Sub-task  (was: New Feature)

> Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default
> -
>
> Key: SPARK-27644
> URL: https://issues.apache.org/jira/browse/SPARK-27644
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We can enable this after resolving all on-going issues and finishing more 
> verifications.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x

2020-03-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31313.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28080
[https://github.com/apache/spark/pull/28080]

> Add `m01` node name to support Minikube 1.8.x
> -
>
> Key: SPARK-31313
> URL: https://issues.apache.org/jira/browse/SPARK-31313
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x

2020-03-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31313:
---

Assignee: Dongjoon Hyun

> Add `m01` node name to support Minikube 1.8.x
> -
>
> Key: SPARK-31313
> URL: https://issues.apache.org/jira/browse/SPARK-31313
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields

2020-03-31 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072142#comment-17072142
 ] 

DB Tsai commented on SPARK-17636:
-

[~MasterDDT] We are not able to merge a new feature into an old release. That 
being said, we have used this internally at Apple for two years in production 
without issues. You can backport it if you want to use it in your environment.

> Parquet predicate pushdown for nested fields
> 
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Minor
> Fix For: 3.0.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a limitation imposed by Parquet, 
> or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071969#comment-17071969
 ] 

DB Tsai commented on SPARK-22231:
-

[~fqaiser94] Thanks for continuing this work. We implemented this feature while 
I was at Netflix, and it's really useful for end users who need to manipulate 
nested dataframes. Currently, we try not to assign tickets, to avoid having 
someone assigned who is not actually working on it. That is why I unassigned 
this JIRA.

Your PR implements the first part, and I have created a sub-task for it: 
https://issues.apache.org/jira/browse/SPARK-31317 Can you change the Jira 
number to SPARK-31317 so it is properly linked? 

 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranking of videos for a given profile_id in a 
> given country, our data schema can look like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying functions 
> to them. Sometimes we're dropping or adding columns in the nested list of 
> structs. Currently, there is no easy solution in open source Apache Spark to 
> perform those operations using SQL primitives; many people just convert the 
> data into an RDD to work on the nested level of data and then reconstruct a 
> new dataframe as a workaround. This is extremely inefficient because 
> optimizations like SQL predicate pushdown cannot be performed, we cannot 
> leverage the columnar format, and the serialization and deserialization cost 
> becomes huge even if we just want to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with, and we 
> plan to open source it in upstream Spark. We would like to socialize the API 
> design to see if we missed any use cases.
> The first API we added is *mapItems* on the dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can add 
> a new method, *withColumn*, on *Column* to add or replace an existing column 
> with the same name in the nested list of structs. Here are two examples 
> demonstrating the API together with *mapItems*; the first one replaces the 
> existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // 

[jira] [Created] (SPARK-31317) Add withField method to Column class

2020-03-31 Thread DB Tsai (Jira)
DB Tsai created SPARK-31317:
---

 Summary: Add withField method to Column class
 Key: SPARK-31317
 URL: https://issues.apache.org/jira/browse/SPARK-31317
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-22231:
---

Assignee: (was: Jeremy Smith)

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranking of videos for a given profile_id in a 
> given country, our data schema can look like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying functions 
> to them. Sometimes we're dropping or adding columns in the nested list of 
> structs. Currently, there is no easy solution in open source Apache Spark to 
> perform those operations using SQL primitives; many people just convert the 
> data into an RDD to work on the nested level of data and then reconstruct a 
> new dataframe as a workaround. This is extremely inefficient because 
> optimizations like SQL predicate pushdown cannot be performed, we cannot 
> leverage the columnar format, and the serialization and deserialization cost 
> becomes huge even if we just want to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with, and we 
> plan to open source it in upstream Spark. We would like to socialize the API 
> design to see if we missed any use cases.
> The first API we added is *mapItems* on the dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can add 
> a new method, *withColumn*, on *Column* to add or replace an existing column 
> with the same name in the nested list of structs. Here are two examples 
> demonstrating the API together with *mapItems*; the first one replaces the 
> existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---++--+
> // |foo|bar |items |
> // +---++--+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // +---++--+
> {code}
> and the second one 

[jira] [Assigned] (SPARK-31064) New Parquet Predicate Filter APIs with multi-part Identifier Support

2020-03-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31064:
---

Assignee: DB Tsai

> New Parquet Predicate Filter APIs with multi-part Identifier Support
> 
>
> Key: SPARK-31064
> URL: https://issues.apache.org/jira/browse/SPARK-31064
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses *dots* as 
> separators to split a column name into the parts of a nested field. The 
> drawback is that this causes issues when a field name itself contains a *dot*.
> The new APIs will take an array of strings directly for the parts of a nested 
> field, so there is no ambiguity from using a *dot* as a separator.
> The intent is to move this code back to the Parquet community. See [PARQUET-1809]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31064) New Parquet Predicate Filter APIs with multi-part Identifier Support

2020-03-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31064.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27824
[https://github.com/apache/spark/pull/27824]

> New Parquet Predicate Filter APIs with multi-part Identifier Support
> 
>
> Key: SPARK-31064
> URL: https://issues.apache.org/jira/browse/SPARK-31064
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses *dots* as 
> separators to split a column name into the parts of a nested field. The 
> drawback is that this causes issues when a field name itself contains a *dot*.
> The new APIs will take an array of strings directly for the parts of a nested 
> field, so there is no ambiguity from using a *dot* as a separator.
> The intent is to move this code back to the Parquet community. See [PARQUET-1809]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31064) New Parquet Predicate Filter APIs with multi-part Identifier Support

2020-03-05 Thread DB Tsai (Jira)
DB Tsai created SPARK-31064:
---

 Summary: New Parquet Predicate Filter APIs with multi-part 
Identifier Support
 Key: SPARK-31064
 URL: https://issues.apache.org/jira/browse/SPARK-31064
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.5
Reporter: DB Tsai


Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses *dots* as 
separators to split a column name into the parts of a nested field. The 
drawback is that this causes issues when a field name itself contains a *dot*.

The new APIs will take an array of strings directly for the parts of a nested 
field, so there is no ambiguity from using a *dot* as a separator.

The intent is to move this code back to the Parquet community. See [PARQUET-1809]
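
A short sketch of the ambiguity the new APIs remove (column names are made up 
for illustration; the actual new signatures are in the linked PR and PARQUET-1809):
{code}
import org.apache.parquet.filter2.predicate.FilterApi

// Existing API: the dotted string is split on '.', so field "b" nested inside
// struct "a" cannot be told apart from a top-level column literally named "a.b".
val pred = FilterApi.eq(FilterApi.intColumn("a.b"), java.lang.Integer.valueOf(4))

// A multi-part API would instead take the parts explicitly, e.g. ["a", "b"] for
// the nested field versus ["a.b"] for the dotted column name.
{code}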



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31058) Consolidate the implementation of quoteIfNeeded

2020-03-05 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31058.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27814
[https://github.com/apache/spark/pull/27814]

> Consolidate the implementation of quoteIfNeeded
> ---
>
> Key: SPARK-31058
> URL: https://issues.apache.org/jira/browse/SPARK-31058
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> There are two implementations of quoteIfNeeded: one in 
> *org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the 
> other in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will 
> consolidate them into one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31058) Consolidate the implementation of quoteIfNeeded

2020-03-05 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31058:
---

Assignee: DB Tsai

> Consolidate the implementation of quoteIfNeeded
> ---
>
> Key: SPARK-31058
> URL: https://issues.apache.org/jira/browse/SPARK-31058
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> There are two implementations of quoteIfNeeded: one in 
> *org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the 
> other in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will 
> consolidate them into one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31060) Handle column names containing `dots` in data source `Filter`

2020-03-05 Thread DB Tsai (Jira)
DB Tsai created SPARK-31060:
---

 Summary: Handle column names containing `dots` in data source 
`Filter`
 Key: SPARK-31060
 URL: https://issues.apache.org/jira/browse/SPARK-31060
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31058) Consolidate the implementation of quoteIfNeeded

2020-03-05 Thread DB Tsai (Jira)
DB Tsai created SPARK-31058:
---

 Summary: Consolidate the implementation of quoteIfNeeded
 Key: SPARK-31058
 URL: https://issues.apache.org/jira/browse/SPARK-31058
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
Reporter: DB Tsai


There are two implementations of quoteIfNeeded: one in 
*org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote* and the other 
in *OrcFiltersBase.quoteAttributeNameIfNeeded*. This PR will consolidate them 
into one.
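
A rough sketch of what the single shared helper looks like conceptually 
(simplified; not the exact code in the PR):
{code}
// Quote an identifier part only when it contains characters that require quoting,
// escaping embedded backticks by doubling them.
def quoteIfNeeded(part: String): String = {
  if (part.contains(".") || part.contains("`")) {
    s"`${part.replace("`", "``")}`"
  } else {
    part
  }
}
{code}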



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown

2020-03-04 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31027:

Fix Version/s: (was: 3.1.0)
   3.0.0

> Refactor `DataSourceStrategy.scala` to minimize the changes to support nested 
> predicate pushdown
> 
>
> Key: SPARK-31027
> URL: https://issues.apache.org/jira/browse/SPARK-31027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31026) Parquet predicate pushdown on columns with dots

2020-03-03 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31026:

Summary: Parquet predicate pushdown on columns with dots  (was: Enable 
Parquet predicate pushdown on columns with dots)

> Parquet predicate pushdown on columns with dots
> ---
>
> Key: SPARK-31026
> URL: https://issues.apache.org/jira/browse/SPARK-31026
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>
> Parquet predicate pushdown on columns with dots is disabled in -SPARK-20364- 
> because Parquet's APIs don't support it. A new set of APIs is proposed in 
> PARQUET-1809 to generalize support for nested columns, which can address this 
> issue. The implementation will be merged into the Spark repo first, until we 
> get a new release from the Parquet community.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown

2020-03-03 Thread DB Tsai (Jira)
DB Tsai created SPARK-31027:
---

 Summary: Refactor `DataSourceStrategy.scala` to minimize the 
changes to support nested predicate pushdown
 Key: SPARK-31027
 URL: https://issues.apache.org/jira/browse/SPARK-31027
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.5
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31026) Enable Parquet predicate pushdown on columns with dots

2020-03-03 Thread DB Tsai (Jira)
DB Tsai created SPARK-31026:
---

 Summary: Enable Parquet predicate pushdown on columns with dots
 Key: SPARK-31026
 URL: https://issues.apache.org/jira/browse/SPARK-31026
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: DB Tsai


Parquet predicate pushdown on columns with dots is disabled in -SPARK-20364- 
because Parquet's APIs don't support it. A new set of APIs is proposed in 
PARQUET-1809 to generalize support for nested columns, which can address this 
issue. The implementation will be merged into the Spark repo first, until we 
get a new release from the Parquet community.
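
A minimal sketch of the case this enables (file path and column name are 
hypothetical):
{code}
// A Parquet file with a top-level column whose name literally contains dots,
// referenced with backticks in the filter expression.
val df = spark.read.parquet("/tmp/dotted")   // assume a column named "col.with.dot"
df.filter("`col.with.dot` > 4").explain()
// Previously the plan showed no PushedFilters entry for that column; with the
// new Parquet APIs the GreaterThan filter can be pushed down.
{code}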



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30920) DSv2 Predicate Framework

2020-02-21 Thread DB Tsai (Jira)
DB Tsai created SPARK-30920:
---

 Summary: DSv2 Predicate Framework
 Key: SPARK-30920
 URL: https://issues.apache.org/jira/browse/SPARK-30920
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30289) Partitioned by Nested Column for `InMemoryTable`

2020-02-14 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-30289.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26929
[https://github.com/apache/spark/pull/26929]

> Partitioned by Nested Column for `InMemoryTable`
> 
>
> Key: SPARK-30289
> URL: https://issues.apache.org/jira/browse/SPARK-30289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30289) Partitioned by Nested Column for `InMemoryTable`

2020-02-07 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-30289:

Summary: Partitioned by Nested Column for `InMemoryTable`  (was: DSv2 
partitioning should not accept nested columns)

> Partitioned by Nested Column for `InMemoryTable`
> 
>
> Key: SPARK-30289
> URL: https://issues.apache.org/jira/browse/SPARK-30289
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30289) DSv2 partitioning should not accept nested columns

2019-12-17 Thread DB Tsai (Jira)
DB Tsai created SPARK-30289:
---

 Summary: DSv2 partitioning should not accept nested columns
 Key: SPARK-30289
 URL: https://issues.apache.org/jira/browse/SPARK-30289
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue

2019-11-17 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25694.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26530
[https://github.com/apache/spark/pull/26530]

> URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
> ---
>
> Key: SPARK-25694
> URL: https://issues.apache.org/jira/browse/SPARK-25694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0
>Reporter: Bo Yang
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() 
> to return an FsUrlConnection object, which is not compatible with 
> HttpURLConnection. This will cause an exception when using some third-party 
> HTTP libraries (e.g. scalaj.http).
> The following code in Spark 2.3.0 introduced the issue: 
> sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala:
> {code}
> object SharedState extends Logging  {   ...   
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())   ...
> }
> {code}
> Here is the example exception when using scalaj.http in Spark:
> {code}
>  StackTrace: scala.MatchError: 
> org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/]
>  (of class org.apache.hadoop.fs.FsUrlConnection)
>  at 
> scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343)
>  at scalaj.http.HttpRequest.exec(Http.scala:335)
>  at scalaj.http.HttpRequest.asString(Http.scala:455)
> {code}
>   
> One option to fix the issue is to return null in 
> URLStreamHandlerFactory.createURLStreamHandler when the protocol is 
> http/https, so it will use the default behavior and be compatible with 
> scalaj.http. Following is the code example:
> {code}
> class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with 
> Logging {
>   private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory()
>   override def createURLStreamHandler(protocol: String): URLStreamHandler = {
> val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol)
> if (handler == null) {
>   return null
> }
> if (protocol != null &&
>   (protocol.equalsIgnoreCase("http")
>   || protocol.equalsIgnoreCase("https"))) {
>   // return null to use system default URLStreamHandler
>   null
> } else {
>   handler
> }
>   }
> }
> {code}
> I would like to get some discussion here before submitting a pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue

2019-11-17 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-25694:
---

Assignee: Dongjoon Hyun

> URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
> ---
>
> Key: SPARK-25694
> URL: https://issues.apache.org/jira/browse/SPARK-25694
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0
>Reporter: Bo Yang
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() to
> return an FsUrlConnection object, which is not compatible with
> HttpURLConnection. This causes exceptions when using some third-party HTTP
> libraries (e.g. scalaj.http).
> The following code in Spark 2.3.0 introduced the issue:
> sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala:
> {code}
> object SharedState extends Logging {
>   ...
>   URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())
>   ...
> }
> {code}
> Here is the example exception when using scalaj.http in Spark:
> {code}
> StackTrace: scala.MatchError: org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] (of class org.apache.hadoop.fs.FsUrlConnection)
>   at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343)
>   at scalaj.http.HttpRequest.exec(Http.scala:335)
>   at scalaj.http.HttpRequest.asString(Http.scala:455)
> {code}
>
> One option to fix the issue is to return null from
> URLStreamHandlerFactory.createURLStreamHandler when the protocol is
> http/https, so those protocols fall back to the default behavior and remain
> compatible with scalaj.http. The following is a code example:
> {code}
> import java.net.{URLStreamHandler, URLStreamHandlerFactory}
>
> import org.apache.hadoop.fs.FsUrlStreamHandlerFactory
> import org.apache.spark.internal.Logging
>
> class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with Logging {
>   private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory()
>
>   override def createURLStreamHandler(protocol: String): URLStreamHandler = {
>     val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol)
>     if (handler == null) {
>       // Hadoop's factory does not handle this protocol either.
>       return null
>     }
>     if (protocol != null &&
>         (protocol.equalsIgnoreCase("http") || protocol.equalsIgnoreCase("https"))) {
>       // Return null so the system default URLStreamHandler is used for http/https.
>       null
>     } else {
>       handler
>     }
>   }
> }
> {code}
> I would like to get some discussion here before submitting a pull request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24203) Make executor's bindAddress configurable

2019-11-13 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24203.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26331
[https://github.com/apache/spark/pull/26331]

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-24203
> URL: https://issues.apache.org/jira/browse/SPARK-24203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lukas Majercak
>Assignee: Nishchal Venkataramana
>Priority: Major
>  Labels: bulk-closed
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29805) Enable nested schema pruning and pruning on expressions by default

2019-11-11 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-29805.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26443
[https://github.com/apache/spark/pull/26443]

> Enable nested schema pruning and pruning on expressions by default
> --
>
> Key: SPARK-29805
> URL: https://issues.apache.org/jira/browse/SPARK-29805
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29805) Enable nested schema pruning and pruning on expressions by default

2019-11-08 Thread DB Tsai (Jira)
DB Tsai created SPARK-29805:
---

 Summary: Enable nested schema pruning and pruning on expressions 
by default
 Key: SPARK-29805
 URL: https://issues.apache.org/jira/browse/SPARK-29805
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: DB Tsai
Assignee: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24203) Make executor's bindAddress configurable

2019-10-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24203:
---

Assignee: Nishchal Venkataramana

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-24203
> URL: https://issues.apache.org/jira/browse/SPARK-24203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lukas Majercak
>Assignee: Nishchal Venkataramana
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24203) Make executor's bindAddress configurable

2019-10-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reopened SPARK-24203:
-

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-24203
> URL: https://issues.apache.org/jira/browse/SPARK-24203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lukas Majercak
>Priority: Major
>  Labels: bulk-closed
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29670) Make executor's bindAddress configurable

2019-10-31 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964599#comment-16964599
 ] 

DB Tsai commented on SPARK-29670:
-

This is a duplicate of SPARK-24203.

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-29670
> URL: https://issues.apache.org/jira/browse/SPARK-29670
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4
>Reporter: Nishchal Venkataramana
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29670) Make executor's bindAddress configurable

2019-10-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-29670.
-
Resolution: Duplicate

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-29670
> URL: https://issues.apache.org/jira/browse/SPARK-29670
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4
>Reporter: Nishchal Venkataramana
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29576) Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus

2019-10-23 Thread DB Tsai (Jira)
DB Tsai created SPARK-29576:
---

 Summary: Use Spark's CompressionCodec for Ser/Deser of 
MapOutputStatus
 Key: SPARK-29576
 URL: https://issues.apache.org/jira/browse/SPARK-29576
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: DB Tsai
 Fix For: 3.0.0


Instead of using the ZStd codec directly, we use Spark's CompressionCodec, which
wraps the ZStd codec in a buffered stream to avoid the excessive overhead of JNI
calls when compressing small amounts of data.

Also, by using Spark's CompressionCodec, we can easily make it configurable in
the future if needed.
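
For illustration only (a sketch, not Spark's actual code: GZIP from the JDK stands in for the ZStd codec, and the object and method names are invented), the point of the buffered stream is to coalesce many small writes into a few large calls into the JNI-backed compressor:

{code}
import java.io.{BufferedOutputStream, ByteArrayOutputStream, ObjectOutputStream}
import java.util.zip.GZIPOutputStream

object MapStatusSerSketch {
  // The BufferedOutputStream sits between the serializer and the compressor,
  // so the compressor sees a few large chunks instead of many tiny writes.
  def serializeCompressed(statuses: Seq[AnyRef]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(
      new BufferedOutputStream(new GZIPOutputStream(bytes), 32 * 1024))
    try statuses.foreach(out.writeObject) finally out.close()
    bytes.toByteArray
  }
}
{code}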



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29515) MapStatuses SerDeser Benchmark

2019-10-18 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-29515:
---

Assignee: DB Tsai

> MapStatuses SerDeser Benchmark
> --
>
> Key: SPARK-29515
> URL: https://issues.apache.org/jira/browse/SPARK-29515
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29515) MapStatuses SerDeser Benchmark

2019-10-18 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-29515.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26169
[https://github.com/apache/spark/pull/26169]

> MapStatuses SerDeser Benchmark
> --
>
> Key: SPARK-29515
> URL: https://issues.apache.org/jira/browse/SPARK-29515
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29515) MapStatuses SerDeser Benchmark

2019-10-18 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-29515:

Affects Version/s: (was: 3.0.0)
   2.4.4

> MapStatuses SerDeser Benchmark
> --
>
> Key: SPARK-29515
> URL: https://issues.apache.org/jira/browse/SPARK-29515
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29515) MapStatuses SerDeser Benchmark

2019-10-18 Thread DB Tsai (Jira)
DB Tsai created SPARK-29515:
---

 Summary: MapStatuses SerDeser Benchmark
 Key: SPARK-29515
 URL: https://issues.apache.org/jira/browse/SPARK-29515
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: DB Tsai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29434) Improve the MapStatuses serialization performance

2019-10-10 Thread DB Tsai (Jira)
DB Tsai created SPARK-29434:
---

 Summary: Improve the MapStatuses serialization performance
 Key: SPARK-29434
 URL: https://issues.apache.org/jira/browse/SPARK-29434
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: DB Tsai
 Fix For: 3.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29351) Avoid full synchronization in ShuffleMapStage

2019-10-03 Thread DB Tsai (Jira)
DB Tsai created SPARK-29351:
---

 Summary: Avoid full synchronization in ShuffleMapStage
 Key: SPARK-29351
 URL: https://issues.apache.org/jira/browse/SPARK-29351
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: DB Tsai
 Fix For: 3.0.0


In one of our production streaming jobs, which has more than 1k executors with
20 cores each, Spark spends a significant portion of time (30s) sending out the
`ShuffleStatus`. We found two issues:

# In the driver's message loop, `serializedMapStatus` is called inside a
synchronized block. When the job scales up, this causes contention.
# When the job is big, the `MapStatus` is huge as well, so serialization and
compression are slow.

This work aims to address the first problem.
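
As a rough sketch of one way to address the first point (illustrative only, not the actual ShuffleStatus code; the class and method names are invented): cache the serialized bytes so that concurrent readers share a read lock, and only one caller pays the serialization cost under an exclusive lock.

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock

// Serialize once under the write lock, then serve cached bytes to concurrent
// readers under a shared read lock instead of a single global sync block.
class CachedSerializedStatus(serialize: () => Array[Byte]) {
  private val lock = new ReentrantReadWriteLock()
  private var cached: Array[Byte] = _

  def get(): Array[Byte] = {
    lock.readLock().lock()
    val existing = try cached finally lock.readLock().unlock()
    if (existing != null) {
      existing
    } else {
      lock.writeLock().lock()
      try {
        if (cached == null) cached = serialize()  // only the first caller serializes
        cached
      } finally lock.writeLock().unlock()
    }
  }

  // Call when the map statuses change so the next get() re-serializes.
  def invalidate(): Unit = {
    lock.writeLock().lock()
    try cached = null finally lock.writeLock().unlock()
  }
}
{code}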
 
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29079) Enable GitHub Action on PR

2019-09-13 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-29079.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25786
[https://github.com/apache/spark/pull/25786]

> Enable GitHub Action on PR
> --
>
> Key: SPARK-29079
> URL: https://issues.apache.org/jira/browse/SPARK-29079
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> So far, we detect JDK11 compilation errors only after merging.
> This will catch JDK11 compilation errors at the PR stage.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29032) Simplify Prometheus support by adding PrometheusServlet

2019-09-13 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-29032.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25769
[https://github.com/apache/spark/pull/25769]

> Simplify Prometheus support by adding PrometheusServlet
> ---
>
> Key: SPARK-29032
> URL: https://issues.apache.org/jira/browse/SPARK-29032
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue aims to simplify `Prometheus` support in Spark standalone and
> K8s environments.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29064) Add PrometheusResource to export Executor metrics

2019-09-13 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-29064.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25770
[https://github.com/apache/spark/pull/25770]

> Add PrometheusResource to export Executor metrics
> -
>
> Key: SPARK-29064
> URL: https://issues.apache.org/jira/browse/SPARK-29064
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


