[jira] [Created] (SPARK-49310) Upgrade Apache Parquet to 1.14.2

2024-08-19 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-49310:


 Summary: Upgrade Apache Parquet to 1.14.2
 Key: SPARK-49310
 URL: https://issues.apache.org/jira/browse/SPARK-49310
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Fokko Driesprong









[jira] [Created] (SPARK-49014) Bump Apache Avro to 1.12.0

2024-07-26 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-49014:


 Summary: Bump Apache Avro to 1.12.0
 Key: SPARK-49014
 URL: https://issues.apache.org/jira/browse/SPARK-49014
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 3.4.3
Reporter: Fokko Driesprong









[jira] [Updated] (SPARK-48177) Upgrade `Parquet` to 1.14.1

2024-06-17 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-48177:
-
Summary: Upgrade `Parquet` to 1.14.1  (was: Upgrade `Parquet` to 1.14.0)

> Upgrade `Parquet` to 1.14.1
> ---
>
> Key: SPARK-48177
> URL: https://issues.apache.org/jira/browse/SPARK-48177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48177) Bump Parquet to 1.14.0

2024-05-07 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-48177:


 Summary: Bump Parquet to 1.14.0
 Key: SPARK-48177
 URL: https://issues.apache.org/jira/browse/SPARK-48177
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.2
Reporter: Fokko Driesprong
 Fix For: 4.0.0









[jira] [Created] (SPARK-43594) Add LocalDateTime to anyToMicros

2023-05-19 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-43594:


 Summary: Add LocalDateTime to anyToMicros
 Key: SPARK-43594
 URL: https://issues.apache.org/jira/browse/SPARK-43594
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Fokko Driesprong









[jira] [Updated] (SPARK-43425) Add TimestampNTZType to ColumnarBatchRow

2023-05-09 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-43425:
-
Issue Type: Bug  (was: Improvement)

> Add TimestampNTZType to ColumnarBatchRow
> 
>
> Key: SPARK-43425
> URL: https://issues.apache.org/jira/browse/SPARK-43425
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-43425) Add TimestampNTZType to ColumnarBatchRow

2023-05-09 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-43425:


 Summary: Add TimestampNTZType to ColumnarBatchRow
 Key: SPARK-43425
 URL: https://issues.apache.org/jira/browse/SPARK-43425
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Fokko Driesprong
 Fix For: 3.4.1, 3.5.0









[jira] [Updated] (SPARK-32797) Install mypy on the Jenkins CI workers

2020-10-18 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-32797:
-
Affects Version/s: (was: 3.0.0)
   3.1.0

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/]. It should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!






[jira] [Updated] (SPARK-32797) Install mypy on the Jenkins CI workers

2020-09-04 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-32797:
-
Description: 
We want to check the types of the PySpark code. This requires mypy to be 
installed on the CI. Can you do this [~shaneknapp]? 

Related PR: [https://github.com/apache/spark/pull/29180]

You can install this using pip: [https://pypi.org/project/mypy/]. It should be 
similar to flake8 and sphinx. The latest version is ok! Thanks!
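
As a quick illustration (mine, not from the ticket) of what such a mypy pass buys us, here is a hypothetical annotated PySpark snippet; the function and its annotations are invented for the example:

{code:python}
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def newest(df: DataFrame, column: str) -> DataFrame:
    """Return the maximum value of `column` as a 1x1 DataFrame."""
    return df.agg(F.max(column))

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "n")

newest(df, "n").show()
# The call below would only fail at runtime; mypy flags it statically:
# newest(df, 42)  -> error: Argument 2 to "newest" has incompatible type "int"; expected "str"
{code}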

> Install mypy on the Jenkins CI workers
> --
>
> Key: SPARK-32797
> URL: https://issues.apache.org/jira/browse/SPARK-32797
> Project: Spark
>  Issue Type: Improvement
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> We want to check the types of the PySpark code. This requires mypy to be 
> installed on the CI. Can you do this [~shaneknapp]? 
> Related PR: [https://github.com/apache/spark/pull/29180]
> You can install this using pip: [https://pypi.org/project/mypy/]. It should be 
> similar to flake8 and sphinx. The latest version is ok! Thanks!






[jira] [Created] (SPARK-32797) Install mypy on the Jenkins CI workers

2020-09-04 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32797:


 Summary: Install mypy on the Jenkins CI workers
 Key: SPARK-32797
 URL: https://issues.apache.org/jira/browse/SPARK-32797
 Project: Spark
  Issue Type: Improvement
  Components: jenkins, PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong









[jira] [Resolved] (SPARK-32770) Add missing imports

2020-09-02 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved SPARK-32770.
--
Resolution: Won't Fix

> Add missing imports
> ---
>
> Key: SPARK-32770
> URL: https://issues.apache.org/jira/browse/SPARK-32770
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>







[jira] [Created] (SPARK-32770) Add missing imports

2020-09-01 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32770:


 Summary: Add missing imports
 Key: SPARK-32770
 URL: https://issues.apache.org/jira/browse/SPARK-32770
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0, 2.4.6
Reporter: Fokko Driesprong









[jira] [Created] (SPARK-32719) Add Flake8 check for missing imports

2020-08-27 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32719:


 Summary: Add Flake8 check for missing imports
 Key: SPARK-32719
 URL: https://issues.apache.org/jira/browse/SPARK-32719
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong


Add a Flake8 check to detect missing imports. While working on SPARK-17333 I 
noticed that we're missing some imports. This PR will enable a check using 
Flake8. One of the side effects is that we can't use wildcard imports, since 
Flake8 is unable to resolve them. However, wildcard imports aren't best 
practice anyway, since it can be unclear from which wildcard import a specific 
class is coming.
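
As a hedged illustration (not from the ticket), the two kinds of findings the check would surface; the snippets and names are invented, and the error codes are the standard pyflakes ones:

{code:python}
# Missing import: flake8/pyflakes reports F821 ("undefined name 'F'").
def select_date(df):
    return df.select(F.col("date"))
{code}

{code:python}
# Wildcard import: F403 on the import line itself, and F405 on each use,
# because pyflakes cannot tell where 'StructType' really comes from.
from pyspark.sql.types import *

schema = StructType([StructField("date", DateType(), False)])
{code}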






[jira] [Commented] (SPARK-10520) Dates cannot be summarised

2020-08-22 Thread Fokko Driesprong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182266#comment-17182266
 ] 

Fokko Driesprong commented on SPARK-10520:
--

Can this issue be assigned to my name? Normally this happens automatically when 
opening a PR.

> Dates cannot be summarised
> --
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>Priority: Major
>  Labels: bulk-closed
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR; this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 






[jira] [Updated] (SPARK-32572) Run all the tests at once, instead of having separate entrypoints.

2020-08-08 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-32572:
-
Description: 
Started with this comment thread: 
https://github.com/apache/spark/pull/29121/files#r456683561

Each file is invoked separately and has a separate entry point: 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/tests/test_wrapper.py#L120]

We would replace this subprocess call 
([https://github.com/apache/spark/blob/master/dev/run-tests.py#L470]) with 
something that invokes the Python tests directly.
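
A minimal sketch (an assumption on my side, not the ticket's plan) of what a single entry point could look like, using stdlib unittest discovery in place of the per-file subprocess calls:

{code:python}
# Hypothetical single entry point: discover and run every pyspark test module
# in one process instead of shelling out to each file's own __main__ block.
import sys
import unittest

def run_all(start_dir: str = "python/pyspark") -> int:
    suite = unittest.defaultTestLoader.discover(start_dir, pattern="test_*.py")
    result = unittest.TextTestRunner(verbosity=2).run(suite)
    return 0 if result.wasSuccessful() else 1

if __name__ == "__main__":
    sys.exit(run_all())
{code}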

> Run all the tests at once, instead of having separate entrypoints.
> --
>
> Key: SPARK-32572
> URL: https://issues.apache.org/jira/browse/SPARK-32572
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> Started with this comment thread: 
> https://github.com/apache/spark/pull/29121/files#r456683561
> Each file is invoked separately and has a separate entry point: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/tests/test_wrapper.py#L120]
> We would replace this subprocess call 
> ([https://github.com/apache/spark/blob/master/dev/run-tests.py#L470]) with 
> something that invokes the Python tests directly.






[jira] [Created] (SPARK-32572) Run all the tests at once, instead of having separate entrypoints.

2020-08-08 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32572:


 Summary: Run all the tests at once, instead of having separate 
entrypoints.
 Key: SPARK-32572
 URL: https://issues.apache.org/jira/browse/SPARK-32572
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong









[jira] [Updated] (SPARK-32319) Disallow the use of unused imports

2020-08-03 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-32319:
-
Summary: Disallow the use of unused imports  (was: Remove unused imports)

> Disallow the use of unused imports
> --
>
> Key: SPARK-32319
> URL: https://issues.apache.org/jira/browse/SPARK-32319
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> We don't want to import stuff that we're not going to use, to reduce the 
> memory pressure.






[jira] [Resolved] (SPARK-32313) Remove Python 2 artifacts

2020-07-22 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved SPARK-32313.
--
Resolution: Won't Fix

> Remove Python 2 artifacts
> -
>
> Key: SPARK-32313
> URL: https://issues.apache.org/jira/browse/SPARK-32313
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>







[jira] [Created] (SPARK-32320) Remove mutable default arguments

2020-07-15 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32320:


 Summary: Remove mutable default arguments
 Key: SPARK-32320
 URL: https://issues.apache.org/jira/browse/SPARK-32320
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong
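
The ticket carries no description, but the pitfall it targets is the classic Python one; a minimal illustration (not taken from the Spark codebase):

{code:python}
def append_bad(value, items=[]):      # the default list is created once, at def time
    items.append(value)
    return items

def append_good(value, items=None):   # idiomatic fix: use None as the sentinel
    if items is None:
        items = []
    items.append(value)
    return items

print(append_bad(1))   # [1]
print(append_bad(2))   # [1, 2] -- state leaks between calls
print(append_good(1))  # [1]
print(append_good(2))  # [2]
{code}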









[jira] [Created] (SPARK-32319) Remove unused imports

2020-07-15 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32319:


 Summary: Remove unused imports
 Key: SPARK-32319
 URL: https://issues.apache.org/jira/browse/SPARK-32319
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong


We don't want to import stuff that we're not going to use, to reduce the memory 
pressure.
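
A small invented example of what this looks like in practice; the F401 code is pyflakes' standard "imported but unused" finding:

{code:python}
import heapq                            # F401: 'heapq' imported but unused
from pyspark.sql import functions as F

def doubled(df):
    # Only F is actually used; dropping the unused import above means the
    # module is never loaded (or kept resident) for nothing.
    return df.select((F.col("id") * 2).alias("doubled"))
{code}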






[jira] [Closed] (SPARK-27757) Bump Jackson to 2.9.9

2020-07-14 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong closed SPARK-27757.


Master is on Jackson 2.10 now :)

> Bump Jackson to 2.9.9
> -
>
> Key: SPARK-27757
> URL: https://issues.apache.org/jira/browse/SPARK-27757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Minor
> Fix For: 3.0.0
>
>
> This fixes CVE-2019-12086 on Databind: 
> https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9.9






[jira] [Updated] (SPARK-17333) Make pyspark interface friendly with mypy static analysis

2020-07-14 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-17333:
-
Summary: Make pyspark interface friendly with mypy static analysis  (was: 
Make pyspark interface friendly with static analysis)

> Make pyspark interface friendly with mypy static analysis
> -
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically, such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()), and since 
> python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly. 
> def max(col):
>     """
>     docstring
>     """
>     _create_function(max, "docstring")
> Second we can add type indications to all functions as defined in pep 484 or 
> pycharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So for example max might look like this:
> def max(col):
>     """
>     does a max.
>     :type col: Column
>     :rtype: Column
>     """
> This would provide a wide range of support, as these types of hints, while 
> old, are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files).
> In this case we might have a functions.pyi file, which would contain 
> something like:
> def max(col: Column) -> Column:
>     """
>     Aggregate function: returns the maximum value of the expression in a
>     group.
>     """
>     ...
> This has the advantage of easier-to-understand types and of not touching the 
> code (only supporting code), but has the disadvantage of being separately 
> managed (i.e. a greater chance of making a mistake) and of needing some 
> configuration in the IDE/static analysis tool instead of working out of the 
> box.
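
For readers outside the thread, a simplified sketch (not Spark's actual code) of the programmatic-definition pattern the description refers to; it is exactly this indirection that static analyzers cannot follow:

{code:python}
# Simplified model of how names like 'max' are generated at import time.
_function_docs = {
    "max": "Aggregate function: returns the maximum value of the expression in a group.",
    "min": "Aggregate function: returns the minimum value of the expression in a group.",
}

def _create_function(name, doc):
    def _(col):
        ...  # would delegate to the JVM-side function
    _.__name__ = name
    _.__doc__ = doc
    return _

# The names only exist after this loop runs, so IDEs and type checkers
# cannot discover them statically:
for _name, _doc in _function_docs.items():
    globals()[_name] = _create_function(_name, _doc)
{code}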






[jira] [Created] (SPARK-32313) Remove Python 2 artifacts

2020-07-14 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32313:


 Summary: Remove Python 2 artifacts
 Key: SPARK-32313
 URL: https://issues.apache.org/jira/browse/SPARK-32313
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong









[jira] [Created] (SPARK-32311) Remove duplicate import

2020-07-14 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32311:


 Summary: Remove duplicate import
 Key: SPARK-32311
 URL: https://issues.apache.org/jira/browse/SPARK-32311
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong









[jira] [Created] (SPARK-32309) Fix missing import

2020-07-14 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32309:


 Summary: Fix missing import
 Key: SPARK-32309
 URL: https://issues.apache.org/jira/browse/SPARK-32309
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong









[jira] [Created] (SPARK-31981) Keep TimestampType when taking an average of a Timestamp

2020-06-13 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-31981:


 Summary: Keep TimestampType when taking an average of a Timestamp
 Key: SPARK-31981
 URL: https://issues.apache.org/jira/browse/SPARK-31981
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Fokko Driesprong
 Fix For: 3.1.0


Currently, when you take an average of a Timestamp, you'll end up with a 
Double, representing the seconds since epoch. This is because of old Hive 
behavior. I strongly believe that it is better to return a Timestamp.

root@8c4241b617ec:/# psql postgres postgres
psql (12.3 (Debian 12.3-1.pgdg100+1))
Type "help" for help.

postgres=# CREATE TABLE timestamp_demo (ts TIMESTAMP);
CREATE TABLE
postgres=# INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
INSERT 0 1
postgres=# INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
INSERT 0 1
postgres=# SELECT AVG(ts) FROM timestamp_demo;
ERROR: function avg(timestamp without time zone) does not exist
LINE 1: SELECT AVG(ts) FROM timestamp_demo;

 


root@bab43a5731e8:/# mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.20 MySQL Community Server - GPL

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE TABLE timestamp_demo (ts TIMESTAMP);
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT AVG(ts) FROM timestamp_demo;
+-----------------+
| AVG(ts)         |
+-----------------+
| 20180101182211. |
+-----------------+
1 row in set (0.00 sec)
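
For contrast, the current Spark behaviour the ticket wants changed, as a small PySpark sketch written for this summary (not part of the original report):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2019-01-01 18:22:11",), ("2018-01-01 18:22:11",), ("2017-01-01 18:22:11",)],
    "ts string",
).select(F.col("ts").cast("timestamp").alias("ts"))

# Per the report, the old Hive behaviour makes avg over a timestamp come
# back as a double (seconds since epoch) rather than as a timestamp:
df.agg(F.avg("ts")).printSchema()
{code}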

 

 

 

 






[jira] [Reopened] (SPARK-10520) Dates cannot be summarised

2020-06-07 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong reopened SPARK-10520:
--

This is the underlying issue of [https://github.com/apache/spark/pull/28554]

Let me check if I can come up with a fix.

> Dates cannot be summarised
> --
>
> Key: SPARK-10520
> URL: https://issues.apache.org/jira/browse/SPARK-10520
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SparkR, SQL
>Affects Versions: 1.5.0
>Reporter: Vincent Warmerdam
>Priority: Major
>  Labels: bulk-closed
>
> I create a simple dataframe in R and call the summary function on it 
> (standard R, not SparkR). 
> {code}
> > library(magrittr)
> > df <- data.frame(
>   date = as.Date("2015-01-01") + 0:99, 
>   r = runif(100)
> )
> > df %>% summary
>   date  r  
>  Min.   :2015-01-01   Min.   :0.01221  
>  1st Qu.:2015-01-25   1st Qu.:0.30003  
>  Median :2015-02-19   Median :0.46416  
>  Mean   :2015-02-19   Mean   :0.50350  
>  3rd Qu.:2015-03-16   3rd Qu.:0.73361  
>  Max.   :2015-04-10   Max.   :0.99618  
> {code}
> Notice that the date can be summarised here. In SparkR; this will give an 
> error.
> {code}
> > ddf <- createDataFrame(sqlContext, df) 
> > ddf %>% summary
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   org.apache.spark.sql.AnalysisException: cannot resolve 'avg(date)' due to 
> data type mismatch: function average requires numeric types, not DateType;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
>   at org.apache.spark.sql.
> {code}
> This is a rather annoying bug since the SparkR documentation currently 
> suggests that dates are now supported in SparkR. 






[jira] [Created] (SPARK-31735) Include all columns in the summary report

2020-05-16 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-31735:


 Summary: Include all columns in the summary report
 Key: SPARK-31735
 URL: https://issues.apache.org/jira/browse/SPARK-31735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.4.5
Reporter: Fokko Driesprong


Dates and other columns are excluded:

{code:python}
from datetime import datetime, timedelta, timezone
from pyspark.sql import types as T
from pyspark.sql import Row
from pyspark.sql import functions as F

START = datetime(2014, 1, 1, tzinfo=timezone.utc)
n_days = 22
date_range = [Row(date=(START + timedelta(days=n))) for n in range(0, n_days)]
schema = T.StructType([T.StructField(name="date", dataType=T.DateType(), nullable=False)])

rdd = spark.sparkContext.parallelize(date_range)
df = spark.createDataFrame(data=rdd, schema=schema)

df.agg(F.max("date")).show()
df.summary().show()
{code}

{code}
+-------+
|summary|
+-------+
| count |
| mean  |
| stddev|
| min   |
| 25%   |
| 50%   |
| 75%   |
| max   |
+-------+
{code}






[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient

2020-04-10 Thread Fokko Driesprong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081174#comment-17081174
 ] 

Fokko Driesprong commented on SPARK-29245:
--

Thanks, on Iceberg we have a similar issue: 
[https://github.com/apache/incubator-iceberg/pull/577]

For reference.

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> From `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:200)
>   at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> ++
> |databaseName|
> ++
> |  .  |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic tests are tested.
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE






[jira] [Resolved] (SPARK-30103) Remove duplicate schema merge logic

2020-03-17 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong resolved SPARK-30103.
--
Resolution: Won't Fix

> Remove duplicate schema merge logic
> ---
>
> Key: SPARK-30103
> URL: https://issues.apache.org/jira/browse/SPARK-30103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> There is duplicate logic for merging two schemas: once in StructType.merge() 
> and once in Catalyst via TypeCoercion.findTightestCommonType(). My suggestion 
> is to remove the first one.






[jira] [Updated] (SPARK-27506) Function `from_avro` doesn't allow deserialization of data using other compatible schemas

2019-12-11 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-27506:
-
Fix Version/s: 3.0.0

> Function `from_avro` doesn't allow deserialization of data using other 
> compatible schemas
> -
>
> Key: SPARK-27506
> URL: https://issues.apache.org/jira/browse/SPARK-27506
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gianluca Amori
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 3.0.0
>
>
>  SPARK-24768 and subtasks introduced support to read and write Avro data by 
> parsing a binary column of Avro format and converting it into its 
> corresponding catalyst value (and vice versa).
>  
> The current implementation has the limitation of requiring deserialization of 
> an event with the exact same schema with which it was serialized. This breaks 
> one of the most important features of Avro, schema evolution 
> [https://docs.confluent.io/current/schema-registry/avro.html] - most 
> importantly, the ability to read old data with a newer (compatible) schema 
> without breaking the consumer.
>  
> The GenericDatumReader in the Avro library already supports passing an 
> optional *writer's schema* (the schema with which the record was serialized) 
> alongside a mandatory *reader's schema* (the schema with which the record is 
> going to be deserialized). The proposed change is to do the same in the 
> from_avro function, allowing the possibility to pass an optional writer's 
> schema to be used in the deserialization.
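
The writer/reader split is easiest to see in Avro itself; below is a hedged sketch using the avro Python package (the record schemas are invented) of the resolution step the proposal would expose through from_avro:

{code:python}
import io
import avro.schema
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

writer_schema = avro.schema.parse(
    '{"type": "record", "name": "Event", "fields": ['
    '{"name": "id", "type": "long"}]}')
# The reader schema adds a field with a default: a compatible evolution.
reader_schema = avro.schema.parse(
    '{"type": "record", "name": "Event", "fields": ['
    '{"name": "id", "type": "long"},'
    '{"name": "tag", "type": "string", "default": ""}]}')

buf = io.BytesIO()
DatumWriter(writer_schema).write({"id": 1}, BinaryEncoder(buf))
buf.seek(0)

# DatumReader resolves the old data against the new (reader) schema.
record = DatumReader(writer_schema, reader_schema).read(BinaryDecoder(buf))
print(record)  # {'id': 1, 'tag': ''}
{code}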






[jira] [Created] (SPARK-30103) Remove duplicate schema merge logic

2019-12-02 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-30103:


 Summary: Remove duplicate schema merge logic
 Key: SPARK-30103
 URL: https://issues.apache.org/jira/browse/SPARK-30103
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Fokko Driesprong
 Fix For: 3.0.0


There is duplicate logic for merging two schemas: once in StructType.merge() and 
once in Catalyst via TypeCoercion.findTightestCommonType(). My suggestion is to 
remove the first one.
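
To make the overlap concrete, a toy model (deliberately much simpler than either Spark implementation) of what "tightest common type" promotion means:

{code:python}
# Toy model of tightest-common-type promotion between two field types; both
# StructType.merge() and TypeCoercion duplicate logic of this flavour.
NUMERIC_PRECEDENCE = ["byte", "short", "int", "long", "float", "double"]

def tightest_common_type(a: str, b: str):
    if a == b:
        return a
    if a in NUMERIC_PRECEDENCE and b in NUMERIC_PRECEDENCE:
        # Promote to the wider of the two numeric types.
        return max(a, b, key=NUMERIC_PRECEDENCE.index)
    return None  # incompatible: no tightest common type exists

assert tightest_common_type("int", "long") == "long"
assert tightest_common_type("int", "string") is None
{code}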






[jira] [Updated] (SPARK-30004) Allow UserDefinedType to be merged into a standard DateType

2019-11-23 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-30004:
-
Component/s: (was: Spark Core)
 SQL

> Allow UserDefinedType to be merged into a standard DateType
> ---
>
> Key: SPARK-30004
> URL: https://issues.apache.org/jira/browse/SPARK-30004
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.5
>
>
> I've registered a custom type, namely XMLGregorianCalendar, which is used by 
> Scalaxb, an XML databinding tool for generating case classes based on an XSD. 
> I want to convert the XMLGregorianCalendar to a regular TimestampType. This 
> works, but when I update the table (using Delta), I get an error:
> Failed to merge fields 'START_DATE_MAINTENANCE_FLPL' and 
> 'START_DATE_MAINTENANCE_FLPL'.
> Failed to merge incompatible data types TimestampType and 
> org.apache.spark.sql.types.CustomXMLGregorianCalendarType@5ff12345;;
> There are two ways of fixing this:
> * Add a rule which compares the sqlType.
> * Change the compare function so it also checks the rhs against the lhs, 
> letting me override the equals function on the UserDefinedType.






[jira] [Created] (SPARK-30004) Allow UserDefinedType to be merged into a standard DateType

2019-11-23 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-30004:


 Summary: Allow UserDefinedType to be merged into a standard 
DateType
 Key: SPARK-30004
 URL: https://issues.apache.org/jira/browse/SPARK-30004
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Fokko Driesprong
 Fix For: 2.4.5


I've registered a custom type, namely XMLGregorianCalendar, which is used by 
Scalaxb, an XML databinding tool for generating case classes based on an XSD. I 
want to convert the XMLGregorianCalendar to a regular TimestampType. This works, 
but when I update the table (using Delta), I get an error:

Failed to merge fields 'START_DATE_MAINTENANCE_FLPL' and 
'START_DATE_MAINTENANCE_FLPL'.
Failed to merge incompatible data types TimestampType and 
org.apache.spark.sql.types.CustomXMLGregorianCalendarType@5ff12345;;

There are two ways of fixing this:

* Add a rule which compares the sqlType.
* Change the compare function so it also checks the rhs against the lhs, 
letting me override the equals function on the UserDefinedType.






[jira] [Created] (SPARK-29483) Bump Jackson to 2.10.0

2019-10-15 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-29483:


 Summary: Bump Jackson to 2.10.0
 Key: SPARK-29483
 URL: https://issues.apache.org/jira/browse/SPARK-29483
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Fokko Driesprong
 Fix For: 3.0.0


Fixes the following CVEs:
https://www.cvedetails.com/cve/CVE-2019-16942/
https://www.cvedetails.com/cve/CVE-2019-16943/






[jira] [Created] (SPARK-29445) Bump netty-all from 4.1.39.Final to 4.1.42.Final

2019-10-11 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-29445:


 Summary: Bump netty-all from 4.1.39.Final to 4.1.42.Final
 Key: SPARK-29445
 URL: https://issues.apache.org/jira/browse/SPARK-29445
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Fokko Driesprong


https://www.cvedetails.com/cve/CVE-2019-16869/






[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2019-09-09 Thread Fokko Driesprong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926122#comment-16926122
 ] 

Fokko Driesprong commented on SPARK-28921:
--

I can confirm that we're running into the same issue with an on-premise k8s 
cluster with RBAC enabled. After updating the kubernetes client to 4.4.2 
everything works fine again.

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.






[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2019-09-02 Thread Fokko Driesprong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920685#comment-16920685
 ] 

Fokko Driesprong commented on SPARK-27733:
--

The regression issue has been resolved with the freshly released Avro 1.9.1. 
I'll look into the issues with the Hive dependency.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features, including reduced size (1MB 
> less), removed dependencies (no paranamer, no shaded guava), and security 
> updates, so it is probably a worthwhile upgrade.






[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-28728:
-
Description: (was: Due to CVE's: 
https://www.cvedetails.com/vulnerability-list/vendor_id-15866/product_id-42991/version_id-238179/opec-1/Fasterxml-Jackson-databind-2.9.0.html)

> Bump Jackson Databind to 2.9.9.3
> 
>
> Key: SPARK-28728
> URL: https://issues.apache.org/jira/browse/SPARK-28728
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>







[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-28728:
-
Description: Needs to be upgraded due to issues.

> Bump Jackson Databind to 2.9.9.3
> 
>
> Key: SPARK-28728
> URL: https://issues.apache.org/jira/browse/SPARK-28728
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> Needs to be upgraded due to issues.






[jira] [Created] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-28728:


 Summary: Bump Jackson Databind to 2.9.9.3
 Key: SPARK-28728
 URL: https://issues.apache.org/jira/browse/SPARK-28728
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Fokko Driesprong
 Fix For: 2.4.4, 3.0.0


Due to CVEs: 
https://www.cvedetails.com/vulnerability-list/vendor_id-15866/product_id-42991/version_id-238179/opec-1/Fasterxml-Jackson-databind-2.9.0.html






[jira] [Created] (SPARK-28713) Bump checkstyle from 8.14 to 8.18

2019-08-13 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-28713:


 Summary: Bump checkstyle from 8.14 to 8.18
 Key: SPARK-28713
 URL: https://issues.apache.org/jira/browse/SPARK-28713
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Fokko Driesprong


From the GitHub Security Advisory Database:

Moderate severity vulnerability that affects com.puppycrawl.tools:checkstyle
Checkstyle prior to 8.18 loads external DTDs by default, which can potentially 
lead to denial of service attacks or the leaking of confidential information.

Affected versions: < 8.18






[jira] [Created] (SPARK-27757) Bump Jackson to 2.9.9

2019-05-17 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-27757:


 Summary: Bump Jackson to 2.9.9
 Key: SPARK-27757
 URL: https://issues.apache.org/jira/browse/SPARK-27757
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Fokko Driesprong


This fixes CVE-2019-12086 on Databind: 
https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9.9






[jira] [Created] (SPARK-25408) Move to idiomatic Java 8

2018-09-11 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-25408:


 Summary: Move to idiomatic Java 8
 Key: SPARK-25408
 URL: https://issues.apache.org/jira/browse/SPARK-25408
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Fokko Driesprong


Java 8 has some nice features, such as try-with-resources and the Collections 
library, which aren't used a lot in the Spark codebase. We might consider 
using these.






[jira] [Created] (SPARK-25033) Bump Apache commons.{httpclient, httpcore}

2018-08-06 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-25033:


 Summary: Bump Apache commons.{httpclient, httpcore}
 Key: SPARK-25033
 URL: https://issues.apache.org/jira/browse/SPARK-25033
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Fokko Driesprong


I would like to bump the versions to bring them up to date with my other 
dependencies, in my case Stocator.






[jira] [Created] (SPARK-24603) Typo in comments

2018-06-20 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-24603:


 Summary: Typo in comments
 Key: SPARK-24603
 URL: https://issues.apache.org/jira/browse/SPARK-24603
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Fokko Driesprong


The findTightestCommonTypeOfTwo function has been renamed to findTightestCommonType.






[jira] [Created] (SPARK-24601) Bump Jackson version to 2.9.6

2018-06-20 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-24601:


 Summary: Bump Jackson version to 2.9.6
 Key: SPARK-24601
 URL: https://issues.apache.org/jira/browse/SPARK-24601
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Fokko Driesprong


The Jackson version is lagging behind, and therefore I have to add a lot of 
exclusions to the SBT files: 

```
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.9.5
at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:751)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala:82)
at org.apache.spark.rdd.RDDOperationScope$.<init>(RDDOperationScope.scala)
```






[jira] [Created] (SPARK-24520) Double braces in link

2018-06-11 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-24520:


 Summary: Double braces in link
 Key: SPARK-24520
 URL: https://issues.apache.org/jira/browse/SPARK-24520
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.3.0
Reporter: Fokko Driesprong


Double braces in the markdown, which break the link






[jira] [Created] (SPARK-23854) Update Guava to 16.0.1

2018-04-03 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-23854:


 Summary: Update Guava to 16.0.1
 Key: SPARK-23854
 URL: https://issues.apache.org/jira/browse/SPARK-23854
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Fokko Driesprong


Currently Spark is still on Guava 14.0.1, and therefore I would like to bump the 
version to 16.0.1.

Baby steps are important here, because we don't want to become incompatible with 
other technology stacks, but 14.0.1 is getting old.






[jira] [Created] (SPARK-22919) Bump Apache httpclient versions

2017-12-28 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-22919:


 Summary: Bump Apache httpclient versions
 Key: SPARK-22919
 URL: https://issues.apache.org/jira/browse/SPARK-22919
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Fokko Driesprong


I would like to bump the PATCH versions of both Apache httpclient and Apache 
httpcore. I use the SparkTC Stocator library for connecting to an object store, 
and I would like to align the versions to reduce Java version mismatches. 
Furthermore, it is good to bump these versions since they fix stability and 
performance issues:
https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt
https://www.apache.org/dist/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt






[jira] [Created] (SPARK-16481) Spark does not update statistics when making use of Hive partitions

2016-07-11 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-16481:


 Summary: Spark does not update statistics when making use of Hive 
partitions
 Key: SPARK-16481
 URL: https://issues.apache.org/jira/browse/SPARK-16481
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Fokko Driesprong


Hi all,

I've had some strange behaviour using Hive partitions. It turned out that, when 
using Hive partitions, the Parquet statistics do not get updated properly when 
inserting new data. I've isolated the issue in the following case:
https://github.com/Fokko/spark-strange-refresh-behaviour

The fix right now is to refresh the data by hand, which is quite error prone, as 
it can be easily forgotten.
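
For a Spark 1.6-era session, the by-hand refresh looks roughly like this (the table name is illustrative, and `sqlContext` is assumed to be the usual HiveContext from the shell):

{code:python}
# Manual workaround: force Spark to re-read metadata and statistics for the
# Hive-partitioned table after new data has been inserted.
sqlContext.refreshTable("my_partitioned_table")
{code}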

Cheers, Fokko Driesprong.






[jira] [Updated] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-02-14 Thread Fokko Driesprong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-12869:
-
Flags: Patch
Affects Version/s: 1.6.0
 Target Version/s: 1.6.1
Fix Version/s: 1.6.1

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Fokko Driesprong
> Fix For: 1.6.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.






[jira] [Commented] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-01-19 Thread Fokko Driesprong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107533#comment-15107533
 ] 

Fokko Driesprong commented on SPARK-12869:
--

Hi guys,

I've implemented an improved version of the toIndexedRowMatrix function on the 
BlockMatrix. I needed this for a project, but would like to share it with the 
rest of the community. In the case of dense matrices, it can increase 
performance by up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

The pull-request on Github:
https://github.com/apache/spark/pull/10839

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Fokko Driesprong
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.






[jira] [Created] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-01-17 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-12869:


 Summary: Optimize conversion from BlockMatrix to IndexedRowMatrix
 Key: SPARK-12869
 URL: https://issues.apache.org/jira/browse/SPARK-12869
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Fokko Driesprong


In the current implementation of the BlockMatrix, the conversion to the 
IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This is 
somewhat ok when the matrix is very sparse, but for dense matrices this is very 
inefficient.
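
As a usage sketch (PySpark API; the block values are invented, and an existing SparkContext `sc` is assumed), the conversion in question:

{code:python}
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# A single dense 2x2 block at block-coordinate (0, 0).
blocks = sc.parallelize([((0, 0), Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0]))])
mat = BlockMatrix(blocks, 2, 2)

# Before the proposed change, this hops through a CoordinateMatrix, i.e. one
# entry object per matrix element -- costly when the blocks are dense.
rows = mat.toIndexedRowMatrix()
{code}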


