Let's continue on the ticket - I am not sure this is established as a bug. We would block a release for critical problems even when they are not regressions, but this is not a data-loss / "deleting data" issue even if it is valid. You're welcome to provide feedback, but votes are for the PMC.
On Fri, Jan 21, 2022 at 5:24 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

> Ok, but deleting users' data without them knowing it is never a good idea.
> That's why I give this RC a -1.
>
> On Sat, Jan 22, 2022 at 12:16 AM Sean Owen <sro...@gmail.com> wrote:
>
>> (Bjorn - unless this is a regression, it would not block a release, even
>> if it's a bug)
>>
>> On Fri, Jan 21, 2022 at 5:09 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>
>>> [x] -1 Do not release this package because it deletes all my columns
>>> that contain only nulls.
>>>
>>> I have opened https://issues.apache.org/jira/browse/SPARK-37981 for
>>> this bug.
>>>
>>> On Fri, Jan 21, 2022 at 9:45 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> (Are you suggesting this is a regression, or is it a general question?
>>>> Here we're trying to figure out whether there are critical bugs
>>>> introduced in 3.2.1 vs 3.2.0.)
>>>>
>>>> On Fri, Jan 21, 2022 at 1:58 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>
>>>>> Hi, I am wondering whether this is a bug or not.
>>>>>
>>>>> I have a lot of JSON files in which some columns are all "null".
>>>>>
>>>>> I start Spark with:
>>>>>
>>>>> from pyspark import pandas as ps
>>>>> import re
>>>>> import numpy as np
>>>>> import os
>>>>> import pandas as pd
>>>>>
>>>>> from pyspark import SparkContext, SparkConf
>>>>> from pyspark.sql import SparkSession
>>>>> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
>>>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>>>>
>>>>> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>>>>>
>>>>> def get_spark_session(app_name: str, conf: SparkConf):
>>>>>     conf.setMaster('local[*]')
>>>>>     conf \
>>>>>         .set('spark.driver.memory', '64g') \
>>>>>         .set("fs.s3a.access.key", "minio") \
>>>>>         .set("fs.s3a.secret.key", "") \
>>>>>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>>>>>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>>>>>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>>>>>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>>>>>         .set("spark.sql.adaptive.enabled", "True") \
>>>>>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>>>>>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>>>>>         .set("sc.setLogLevel", "error")
>>>>>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>>>>>
>>>>> spark = get_spark_session("Falk", SparkConf())
>>>>>
>>>>> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>>>>>
>>>>> import pyspark
>>>>> def sparkShape(dataFrame):
>>>>>     return (dataFrame.count(), len(dataFrame.columns))
>>>>> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>>>>> print(d3.shape())
>>>>>
>>>>> (653610, 267)
>>>>>
>>>>> d3.write.json("d3.json")
>>>>>
>>>>> d3 = spark.read.json("d3.json/*.json")
>>>>>
>>>>> print(d3.shape())
>>>>>
>>>>> (653610, 186)
>>>>>
>>>>> So Spark is dropping 81 columns. I think all 81 of the dropped
>>>>> columns contain only null values.
>>>>>
>>>>> Is this a bug, or is it intentional?
>>>>>
>>>>> On Fri, Jan 21, 2022 at 4:59 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 3.2.1. The vote is open until 8:00pm Pacific time January 25
>>>>>> and passes if a majority of +1 PMC votes are cast, with a minimum of
>>>>>> 3 +1 votes.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 3.2.1
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v3.2.1-rc2
>>>>>> (commit 4f25b3f71238a00508a356591553f2dfa89f8290):
>>>>>> https://github.com/apache/spark/tree/v3.2.1-rc2
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-bin/
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1398/
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.2.1-rc2-docs/_site/
>>>>>>
>>>>>> The list of bug fixes going into 3.2.1 can be found at the following URL:
>>>>>> https://s.apache.org/yu0cy
>>>>>>
>>>>>> This release is using the release script of the tag v3.2.1-rc2.
>>>>>>
>>>>>> FAQ
>>>>>>
>>>>>> =========================
>>>>>> How can I help test this release?
>>>>>> =========================
>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>> an existing Spark workload and running it on this release candidate,
>>>>>> then reporting any regressions.
>>>>>> If you're working in PySpark, you can set up a virtual env, install
>>>>>> the current RC, and see if anything important breaks; in Java/Scala,
>>>>>> you can add the staging repository to your project's resolvers and
>>>>>> test with the RC (make sure to clean up the artifact cache
>>>>>> before/after so you don't end up building with an out-of-date RC
>>>>>> going forward).
>>>>>>
>>>>>> ===========================================
>>>>>> What should happen to JIRA tickets still targeting 3.2.1?
>>>>>> ===========================================
>>>>>> The current list of open tickets targeted at 3.2.1 can be found at
>>>>>> https://issues.apache.org/jira/projects/SPARK by searching for
>>>>>> "Target Version/s" = 3.2.1. Committers should look at those and
>>>>>> triage. Extremely important bug fixes, documentation, and API tweaks
>>>>>> that impact compatibility should be worked on immediately.
>>>>>> Everything else, please retarget to an appropriate release.
>>>>>>
>>>>>> ==================
>>>>>> But my bug isn't fixed?
>>>>>> ==================
>>>>>> In order to make timely releases, we will typically not hold the
>>>>>> release unless the bug in question is a regression from the previous
>>>>>> release. That being said, if there is something which is a
>>>>>> regression that has not been correctly targeted, please ping me or a
>>>>>> committer to help target the issue.
>>>>>
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297
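[Editor's note] The round-trip column loss reported above is most likely the JSON writer's null handling rather than data corruption: by default (the `spark.sql.jsonGenerator.ignoreNullFields` SQL config, or the JSON data source's `ignoreNullFields` write option, both `true` by default in Spark 3.x), null fields are omitted from the written JSON, so a column that is null in every row leaves no key behind, and schema inference on re-read cannot recover it. The following is a minimal stdlib sketch of that mechanism (plain Python standing in for Spark; the helper names `write_json_lines` and `infer_columns` are illustrative, not Spark APIs):

```python
import json

# A "DataFrame" of two rows where column "b" is null in every row.
rows = [{"a": 1, "b": None}, {"a": 2, "b": None}]

def write_json_lines(records):
    """Mimic the JSON writer's default ignoreNullFields=true behavior:
    null fields are dropped entirely from each output record."""
    return [json.dumps({k: v for k, v in r.items() if v is not None})
            for r in records]

def infer_columns(lines):
    """Mimic schema inference on read: the schema is the union of keys
    that appear at least once, so an all-null column leaves no trace."""
    cols = set()
    for line in lines:
        cols.update(json.loads(line).keys())
    return sorted(cols)

lines = write_json_lines(rows)
print(lines)                 # ['{"a": 1}', '{"a": 2}']
print(infer_columns(lines))  # ['a']  -- column "b" is gone after the round trip
```

Under this reading, writing with `d3.write.option("ignoreNullFields", "false").json(...)` should keep the null fields in the output (at a file-size cost), and explicitly passing the original schema to `spark.read.schema(...).json(...)` avoids depending on inference at all.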