[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

```

val path = "/tmp/sample_parquet_file"

spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()

```

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.
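
Below is a minimal sketch, not part of the original report, of how both failure modes can be observed from spark-shell. It assumes a stock Spark build, where spark.sql.columnVector.offheap.enabled is the flag that switches the vectorized reader's column vectors between on-heap and off-heap memory (see SPARK-44718 further down this thread):

```
// Hedged sketch: toggle the column-vector memory mode to see each failure.
val path = "/tmp/sample_parquet_file"
spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)

// On-heap column vectors (the default): the mismatched read schema
// should fail with a NullPointerException.
spark.conf.set("spark.sql.columnVector.offheap.enabled", "false")
spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()

// Off-heap column vectors: the same read chases an invalid native
// address and can crash the JVM with a SIGSEGV instead.
spark.conf.set("spark.sql.columnVector.offheap.enabled", "true")
spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()
```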

  was:
Repro:

val path = "/tmp/sample_parquet_file"

spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>  Labels: pull-request-available
>
> Repro:
> ```
> val path = "/tmp/sample_parquet_file"
> spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
> spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()
> ```
> Depending on the memory mode, it will throw an NPE in OnHeap mode and a
> SEGFAULT in OffHeap mode.





[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

val path = "/tmp/sample_parquet_file"

spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.

  was:
Repro:


```
val path = "/tmp/zamil/timestamp"

spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>  Labels: pull-request-available
>
> Repro:
> val path = "/tmp/sample_parquet_file"
> spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
> spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()
> Depending on the memory mode, it will throw an NPE in OnHeap mode and a
> SEGFAULT in OffHeap mode.






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:


```
val path = "/tmp/zamil/timestamp"

spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.

  was:
Repro:

{{val path = "/tmp/someparquetfile"}}
{{spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)}}
{{spark.read.schema("field array<timestamp_ntz>").parquet(path).collect()}}

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>  Labels: pull-request-available
>
> Repro:
> ```
> val path = "/tmp/zamil/timestamp"
> spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
> spark.read.schema("field ARRAY<TIMESTAMP_NTZ>").parquet(path).collect()
> ```
> Depending on the memory mode, it will throw an NPE in OnHeap mode and a
> SEGFAULT in OffHeap mode.






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

{{val path = "/tmp/someparquetfile"}}
{{spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)}}
{{spark.read.schema("field array<timestamp_ntz>").parquet(path).collect()}}

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.

  was:
Repro:

{{val path = "/tmp/someparquetfile"
spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)
spark.read.schema("field array<timestamp_ntz>").parquet(path).collect()}}

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>  Labels: pull-request-available
>
> Repro:
> {{val path = "/tmp/someparquetfile"}}
> {{spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)}}
> {{spark.read.schema("field array<timestamp_ntz>").parquet(path).collect()}}
> Depending on the memory mode, it will throw an NPE in OnHeap mode and a
> SEGFAULT in OffHeap mode.






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

{{val path = "/tmp/someparquetfile"
spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)
spark.read.schema("field array<timestamp_ntz>").parquet(path).collect()}}

Depending on the memory mode, it will throw an NPE in OnHeap mode and a SEGFAULT in OffHeap mode.

  was:
Repro:

```
val path = "/tmp/someparquetfile"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>  Labels: pull-request-available
>
> Repro:
> {{val path = "/tmp/someparquetfile"
> spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)
> spark.read.schema("field array<timestamp_ntz>").parquet(path).collect()}}
> Depending on the memory mode, it will throw an NPE in OnHeap mode and a
> SEGFAULT in OffHeap mode.






[jira] [Created] (SPARK-45608) Migrate SchemaColumnConvertNotSupportedException onto DATATYPE_MISMATCH error classes

2023-10-19 Thread Zamil Majdy (Jira)
Zamil Majdy created SPARK-45608:
---

 Summary: Migrate SchemaColumnConvertNotSupportedException onto 
DATATYPE_MISMATCH error classes
 Key: SPARK-45608
 URL: https://issues.apache.org/jira/browse/SPARK-45608
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zamil Majdy


SchemaColumnConvertNotSupportedException is not currently part of 
SparkThrowable.
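
As a rough sketch, not the actual patch: making the exception a SparkThrowable mainly means attaching an error class to it. The constructor fields below are assumptions for illustration, and DATATYPE_MISMATCH is the target class named in the summary:

```
// Illustrative sketch with assumed fields -- not the real Spark change.
import org.apache.spark.SparkThrowable

class SchemaColumnConvertNotSupportedException(
    column: String,         // assumed: the column that failed to convert
    physicalType: String,   // assumed: the Parquet physical type on disk
    logicalType: String)    // assumed: the Spark type requested by the schema
  extends RuntimeException(
    s"column $column: cannot convert Parquet type $physicalType to $logicalType")
  with SparkThrowable {

  // The error class ties the exception to a stable, documented error code.
  override def getErrorClass: String = "DATATYPE_MISMATCH"
}
```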






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

```
val path = "/tmp/someparquetfile"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

```
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>
> Repro:
> ```
> val path = "/tmp/someparquetfile"
> val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
> df.write.mode("overwrite").parquet(path)
> spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
> ```
> Depending on which memory mode is used, it will produce an NPE in on-heap
> mode and a segfault in off-heap mode.






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

 

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```
Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>
> Repro:
> ```
> spark.conf.set("spark.databricks.photon.enabled", "false")
> val path = "/tmp/zamil/timestamp"
> val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
> df.write.mode("overwrite").parquet(path)
> spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
> ```
> Depending on which memory mode is used, it will produce an NPE in on-heap
> mode and a segfault in off-heap mode.






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

```
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>
> Repro:
> ```
> val path = "/tmp/zamil/timestamp"
> val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
> df.write.mode("overwrite").parquet(path)
> spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
> ```
> Depending on which memory mode is used, it will produce an NPE in on-heap
> mode and a segfault in off-heap mode.






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

 

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```
Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

 

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>
> Repro:
>  
> ```
> spark.conf.set("spark.databricks.photon.enabled", "false")
> val path = "/tmp/zamil/timestamp"
> val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
> df.write.mode("overwrite").parquet(path)
> spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
> ```
> Depending on which memory mode is used, it will produce an NPE in on-heap
> mode and a segfault in off-heap mode.






[jira] [Created] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)
Zamil Majdy created SPARK-45604:
---

 Summary: Converting timestamp_ntz to array<timestamp_ntz> can
cause NPE or SEGFAULT on parquet vectorized reader
 Key: SPARK-45604
 URL: https://issues.apache.org/jira/browse/SPARK-45604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Zamil Majdy


Repro:

 

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/somepath"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.






[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT on parquet vectorized reader

2023-10-19 Thread Zamil Majdy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zamil Majdy updated SPARK-45604:

Description: 
Repro:

 

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

 

```
spark.conf.set("spark.databricks.photon.enabled", "false")

val path = "/tmp/somepath"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")

df.write.mode("overwrite").parquet(path)
spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.


> Converting timestamp_ntz to array<timestamp_ntz> can cause NPE or SEGFAULT
> on parquet vectorized reader
> ---
>
> Key: SPARK-45604
> URL: https://issues.apache.org/jira/browse/SPARK-45604
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Zamil Majdy
>Priority: Major
>
> Repro:
>  
> ```
> spark.conf.set("spark.databricks.photon.enabled", "false")
> val path = "/tmp/zamil/timestamp"
> val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
> df.write.mode("overwrite").parquet(path)
> spark.read.schema("field map<string, array<timestamp_ntz>>").parquet(path).collect()
> ```
> Depending on which memory mode is used, it will produce an NPE in on-heap
> mode and a segfault in off-heap mode.






[jira] [Created] (SPARK-44718) High on-heap memory usage detected when reading parquet files with off-heap memory mode enabled in Spark

2023-08-08 Thread Zamil Majdy (Jira)
Zamil Majdy created SPARK-44718:
---

 Summary: High on-heap memory usage detected when reading parquet files
with off-heap memory mode enabled in Spark
 Key: SPARK-44718
 URL: https://issues.apache.org/jira/browse/SPARK-44718
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.4.1
Reporter: Zamil Majdy


I see high on-heap memory usage during parquet file reading when off-heap memory mode is enabled. This is because the memory mode for the vectorized reader's column vectors is configured by a separate flag, whose default is always On-Heap.

Conf to reproduce the issue:

{{spark.memory.offHeap.size 100}}
{{spark.memory.offHeap.enabled true}}

Enabling these configurations alone will not switch the memory mode used by the vectorized reader for parquet reading to Off-Heap.
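
For illustration, a sketch of the config mismatch being described, assuming a stock Spark build where spark.sql.columnVector.offheap.enabled is the separate flag in question (the actual fix proposed in the PR may differ):

```
# The two flags above only enable the off-heap memory manager:
spark.memory.offHeap.enabled            true
spark.memory.offHeap.size               100
# The vectorized reader's column vectors are governed by this separate flag;
# without it they stay on-heap regardless of the settings above.
spark.sql.columnVector.offheap.enabled  true
```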

 

Proposed PR: https://github.com/apache/spark/pull/42394






[jira] [Created] (SPARK-43264) Avoid allocation of unwritten ColumnVector in VectorizedReader

2023-04-24 Thread Zamil Majdy (Jira)
Zamil Majdy created SPARK-43264:
---

 Summary: Avoid allocation of unwritten ColumnVector in 
VectorizedReader
 Key: SPARK-43264
 URL: https://issues.apache.org/jira/browse/SPARK-43264
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.4.1, 3.5.0
Reporter: Zamil Majdy


The Spark vectorized reader allocates the array for every field for each value count, even when the array ends up empty. This causes high memory consumption when reading tables with large struct+array fields or many columns with sparse values. One way to fix this is to allocate the column vector lazily, creating the array only when it is actually needed (i.e., when the array is written), as sketched below.
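
A minimal sketch of the lazy-allocation idea; the class and method names are made up for illustration and are not the actual Spark ColumnVector API:

```
// Hedged sketch: defer the backing array until the first write, so a
// column vector that is never written never pays the allocation cost.
class LazyColumnVector(capacity: Int) {
  private var data: Array[Long] = _   // stays null until the first write

  def putLong(rowId: Int, value: Long): Unit = {
    if (data == null) data = new Array[Long](capacity)  // allocate on demand
    data(rowId) = value
  }

  // Reads from a never-written vector fall back to a default value.
  def getLong(rowId: Int): Long =
    if (data == null) 0L else data(rowId)
}
```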


