[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
val path = "/tmp/sample_parquet_file"
spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.

  was:
Repro:

val path = "/tmp/sample_parquet_file"
spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY").parquet(path).collect()

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.

> Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-45604
>                 URL: https://issues.apache.org/jira/browse/SPARK-45604
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Zamil Majdy
>            Priority: Major
>              Labels: pull-request-available
>
> Repro:
> ```
> val path = "/tmp/sample_parquet_file"
> spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
> spark.read.schema("field ARRAY").parquet(path).collect()
> ```
> Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
val path = "/tmp/sample_parquet_file"
spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.

  was:
Repro:

```
val path = "/tmp/zamil/timestamp"
spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
val path = "/tmp/zamil/timestamp"
spark.sql("SELECT CAST('2019-01-01' AS TIMESTAMP_NTZ) AS field").write.parquet(path)
spark.read.schema("field ARRAY").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.

  was:
Repro:

```
val path = "/tmp/someparquetfile"
spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)
spark.read.schema("field array").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
val path = "/tmp/someparquetfile"
spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)
spark.read.schema("field array").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.

  was:
Repro:

```
val path = "/tmp/someparquetfile"
spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)
spark.read.schema("field array").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
val path = "/tmp/someparquetfile"
spark.sql("SELECT CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ) AS field").write.mode("overwrite").parquet(path)
spark.read.schema("field array").parquet(path).collect()
```

Depending on the memory mode, it will throw an NPE on OnHeap mode and SEGFAULT on OffHeap mode.

  was:
Repro:

```
val path = "/tmp/someparquetfile"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.
[jira] [Created] (SPARK-45608) Migrate SchemaColumnConvertNotSupportedException onto DATATYPE_MISMATCH error classes
Zamil Majdy created SPARK-45608:
-----------------------------------

             Summary: Migrate SchemaColumnConvertNotSupportedException onto DATATYPE_MISMATCH error classes
                 Key: SPARK-45608
                 URL: https://issues.apache.org/jira/browse/SPARK-45608
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.5.0
            Reporter: Zamil Majdy

SchemaColumnConvertNotSupportedException is not currently part of SparkThrowable.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
val path = "/tmp/someparquetfile"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

```
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.
[jira] [Created] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
Zamil Majdy created SPARK-45604:
-----------------------------------

             Summary: Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
                 Key: SPARK-45604
                 URL: https://issues.apache.org/jira/browse/SPARK-45604
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.0
            Reporter: Zamil Majdy

Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/somepath"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.
[jira] [Updated] (SPARK-45604) Converting timestamp_ntz to array can cause NPE or SEGFAULT on parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-45604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zamil Majdy updated SPARK-45604:
--------------------------------

    Description:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/zamil/timestamp"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.

  was:
Repro:

```
spark.conf.set("spark.databricks.photon.enabled", "false")
val path = "/tmp/somepath"
val df = sql("SELECT MAP('key', CAST('2019-01-01 00:00:00' AS TIMESTAMP_NTZ)) AS field")
df.write.mode("overwrite").parquet(path)
spark.read.schema("field map>").parquet(path).collect()
```

Depending on which memory mode is used, it will produce an NPE in on-heap mode and a segfault in off-heap mode.
[jira] [Created] (SPARK-44718) High On-heap memory usage is detected while doing parquet-file reading with Off-Heap memory mode enabled on spark
Zamil Majdy created SPARK-44718:
-----------------------------------

             Summary: High On-heap memory usage is detected while doing parquet-file reading with Off-Heap memory mode enabled on spark
                 Key: SPARK-44718
                 URL: https://issues.apache.org/jira/browse/SPARK-44718
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 3.4.1
            Reporter: Zamil Majdy

High on-heap memory usage shows up during parquet file reading even when off-heap memory mode is enabled. This happens because the memory mode for the vectorized reader's column vectors is configured by a different flag, whose default is always on-heap.

Conf to reproduce the issue:

```
spark.memory.offHeap.size 100
spark.memory.offHeap.enabled true
```

Enabling only these configurations will not switch the memory mode used by the vectorized parquet reader to off-heap.

Proposed PR: https://github.com/apache/spark/pull/42394
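For context, a minimal sketch of the settings involved. The third flag name below is taken from Spark's internal SQLConf and is an assumption here (it is not mentioned in the issue text and may vary across versions):

```
# spark-defaults.conf (sketch)
spark.memory.offHeap.enabled            true
spark.memory.offHeap.size               100
# separate, internal flag that controls the vectorized reader's column vectors
spark.sql.columnVector.offheap.enabled  true
```

The point of the issue is precisely that the first two settings alone do not flip the column-vector memory mode; the reader consults a different flag with an on-heap default.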
[jira] [Created] (SPARK-43264) Avoid allocation of unwritten ColumnVector in VectorizedReader
Zamil Majdy created SPARK-43264:
-----------------------------------

             Summary: Avoid allocation of unwritten ColumnVector in VectorizedReader
                 Key: SPARK-43264
                 URL: https://issues.apache.org/jira/browse/SPARK-43264
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 3.4.1, 3.5.0
            Reporter: Zamil Majdy

Spark's vectorized reader allocates an array for every field for each value count, even when the array ends up empty. This causes high memory consumption when reading a table with large struct+array columns or many sparsely populated columns. One way to fix this is to allocate the column vector lazily, creating the array only when it is actually needed (i.e., when the array is written).
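The lazy-allocation idea described above can be sketched generically. This is a hypothetical illustration, not Spark's actual OnHeapColumnVector implementation; the class and method names are invented for the example:

```java
// Hypothetical sketch of lazy child-array allocation: the potentially large
// payload array is only allocated on the first write, so unwritten (sparse or
// empty) array columns cost almost no memory.
final class LazyArrayColumn {
    private final int capacity;
    private int[] childData; // allocated lazily, stays null until first write

    LazyArrayColumn(int capacity) {
        this.capacity = capacity;
        // childData is intentionally left null here
    }

    boolean childAllocated() {
        return childData != null;
    }

    void putArrayElement(int index, int value) {
        if (childData == null) {
            // allocate only once a value is actually written
            childData = new int[capacity];
        }
        childData[index] = value;
    }

    int getArrayElement(int index) {
        // an unwritten column reads back as zeros, with no allocation
        return childData == null ? 0 : childData[index];
    }
}
```

The trade-off is an extra null check on each write, which is cheap compared to eagerly allocating arrays for every field of every batch.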