[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2022-07-27 Thread zhiming she (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572169#comment-17572169
 ] 

zhiming she commented on SPARK-39743:
-------------------------------------

[~hyukjin.kwon]

 

Can you mark this issue as `Resolved`?

> Unable to set zstd compression level while writing parquet files
> ----------------------------------------------------------------
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> While writing zstd-compressed parquet files, the following setting, 
> `spark.io.compression.zstd.level`, does not have any effect on the zstd 
> compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd CLI tool, we confirmed that setting a higher compression level 
> for the same file tested in Spark resulted in a smaller file.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2022-07-28 Thread zhiming she (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572546#comment-17572546
 ] 

zhiming she commented on SPARK-39743:
-------------------------------------

[~yeachan153] 

Spark only exposes some of the more important codec parameters, such as 
`avro.mapred.deflate.level`. Less important parameters, such as 
`avro.mapred.zstd.level` and `parquet.compression.codec.zstd.level`, are usually 
not exposed in Spark unless there is a special reason.

In addition, the problem you mentioned does exist; I will update the 
corresponding parameter description in the documentation later.
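
In the meantime, a possible workaround is to pass the Parquet-level setting 
through the Hadoop configuration instead of `spark.io.compression.zstd.level`, 
which, as far as I understand, only applies to Spark's internal I/O compression 
(shuffle, broadcast, etc.). A minimal sketch, assuming the parquet-mr version on 
the classpath reads `parquet.compression.codec.zstd.level` from the Hadoop 
configuration (the application name, output path, and level 19 below are 
illustrative only):

{code:scala}
// Sketch of a possible workaround, assuming the bundled parquet-mr honours
// parquet.compression.codec.zstd.level when present in the Hadoop configuration.
// The application name, output path, and level 19 are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-zstd-level-sketch")
  // "spark.hadoop."-prefixed keys are copied into the Hadoop configuration,
  // which is where parquet-mr looks for its codec options.
  .config("spark.hadoop.parquet.compression.codec.zstd.level", "19")
  .getOrCreate()

import spark.implicits._

val df = (1 to 1000000).toDF("id")

df.write
  .option("compression", "zstd") // selects the codec for this write
  .mode("overwrite")
  .parquet("/tmp/zstd-level-test")
{code}

Comparing the output size against a run without the `spark.hadoop.*` setting 
should show whether the level is being picked up.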

> Unable to set zstd compression level while writing parquet files
> ----------------------------------------------------------------
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> While writing zstd-compressed parquet files, the following setting, 
> `spark.io.compression.zstd.level`, does not have any effect on the zstd 
> compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd CLI tool, we confirmed that setting a higher compression level 
> for the same file tested in Spark resulted in a smaller file.






[jira] [Commented] (SPARK-39743) Unable to set zstd compression level while writing parquet files

2022-07-28 Thread zhiming she (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572551#comment-17572551
 ] 

zhiming she commented on SPARK-39743:
-------------------------------------

[~hyukjin.kwon] 

As [~yeachan153] said, is it possible to add some clarification to the Spark 
docs, or a link to external docs, about parameters that are not controlled by 
Spark (like the Parquet zstd level)? I'm not sure whether this is possible, as I 
haven't found links to other libraries' documentation in the Spark docs.

> Unable to set zstd compression level while writing parquet files
> ----------------------------------------------------------------
>
> Key: SPARK-39743
> URL: https://issues.apache.org/jira/browse/SPARK-39743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Yeachan Park
>Priority: Minor
>
> While writing zstd-compressed parquet files, the following setting, 
> `spark.io.compression.zstd.level`, does not have any effect on the zstd 
> compression level.
> All files seem to be written with the default zstd compression level, and the 
> config option seems to be ignored.
> Using the zstd CLI tool, we confirmed that setting a higher compression level 
> for the same file tested in Spark resulted in a smaller file.






[jira] [Commented] (SPARK-39976) NULL check in ArrayIntersect adds extraneous null from first param

2022-08-04 Thread zhiming she (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575177#comment-17575177
 ] 

zhiming she commented on SPARK-39976:
-------------------------------------

I can reproduce this case; I will try to fix it.
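
For reference, a rough sanity check along these lines (a hypothetical sketch, 
not the actual regression test that will go into the fix; it assumes a 
spark-shell session so that `spark.implicits._` is available):

{code:scala}
// Hypothetical sanity check (spark-shell): array_intersect should not leak a
// null that exists only in the first argument, and swapping the arguments
// should produce the same set of elements.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((
  Array[java.lang.Integer](1, 2, null, 4),
  Array[java.lang.Integer](4, 5, 6, 7)
)).toDF("a", "b")

val Row(ab: Seq[_], ba: Seq[_]) = df
  .select(array_intersect(col("a"), col("b")), array_intersect(col("b"), col("a")))
  .head()

assert(ab.toSet == ba.toSet, s"not commutative: $ab vs $ba")
assert(!ab.contains(null), s"null leaked into the result: $ab")
{code}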

> NULL check in ArrayIntersect adds extraneous null from first param
> ------------------------------------------------------------------
>
> Key: SPARK-39976
> URL: https://issues.apache.org/jira/browse/SPARK-39976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Navin Kumar
>Priority: Major
>
> This is very likely a regression from SPARK-36829.
> When using {{array_intersect(a, b)}}, if the first parameter contains a 
> {{NULL}} value and the second one does not, an extraneous {{NULL}} is present 
> in the output. This also leads to {{array_intersect(a, b) != 
> array_intersect(b, a)}} which is incorrect as set intersection should be 
> commutative.
> Example using PySpark:
> {code:python}
> >>> a = [1, 2, 3]
> >>> b = [3, None, 5]
> >>> df = spark.sparkContext.parallelize([(a, b)]).toDF(["a","b"])
> >>> df.show()
> +---------+------------+
> |        a|           b|
> +---------+------------+
> |[1, 2, 3]|[3, null, 5]|
> +---------+------------+
> >>> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |                  [3]|
> +---------------------+
> >>> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |            [3, null]|
> +---------------------+
> {code}
> Note that in the first case, {{a}} does not contain a {{NULL}} and the final 
> output is correct: {{[3]}}. In the second case, the extraneous {{NULL}} appears 
> because {{b}} does contain a {{NULL}} and is now the first parameter.
> The same behavior occurs in Scala when writing to Parquet:
> {code:scala}
> scala> val a = Array[java.lang.Integer](1, 2, null, 4)
> a: Array[Integer] = Array(1, 2, null, 4)
> scala> val b = Array[java.lang.Integer](4, 5, 6, 7)
> b: Array[Integer] = Array(4, 5, 6, 7)
> scala> val df = Seq((a, b)).toDF("a","b")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.write.parquet("/tmp/simple.parquet")
> scala> val df = spark.read.parquet("/tmp/simple.parquet")
> df: org.apache.spark.sql.DataFrame = [a: array<int>, b: array<int>]
> scala> df.show()
> +---------------+------------+
> |              a|           b|
> +---------------+------------+
> |[1, 2, null, 4]|[4, 5, 6, 7]|
> +---------------+------------+
> scala> df.selectExpr("array_intersect(a,b)").show()
> +---------------------+
> |array_intersect(a, b)|
> +---------------------+
> |            [null, 4]|
> +---------------------+
> scala> df.selectExpr("array_intersect(b,a)").show()
> +---------------------+
> |array_intersect(b, a)|
> +---------------------+
> |                  [4]|
> +---------------------+
> {code}


