[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2023-02-06 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Fix Version/s: 0.12.2
   (was: 0.13.0)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
> Issue Type: Bug
> Components: writer-core
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Critical
> Fix For: 0.12.2
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png
>
>
> While experimenting with bulk-insert, I stumbled upon an OOM failure that 
> occurs when bulk-inserting with sort-mode "NONE" into a table with a large 
> number of partitions (> 1000).
>  
> This happens for the same reason as HUDI-3883: since no re-partitioning is 
> done to align with the actual partition column, every logical partition 
> handled by Spark (say there are N of these, equal to the shuffle parallelism 
> in Hudi) will likely contain a record from every physical partition on disk 
> (say there are M of these). Because of that, every logical partition will be 
> writing into every physical one.
> This eventually results in:
>  # M * N files in the table
>  # For every file being written, Hudi keeps a "handle" in memory, which in 
> turn holds a full buffer's worth of Parquet data (until flushed).
> This ultimately leads to an OOM.
>  
> !Screen Shot 2022-06-15 at 6.06.06 PM.png!
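
The arithmetic in the description can be sketched as follows. This is only a back-of-the-envelope illustration, not Hudi code: the function name, the result type, and the per-handle buffer size are all hypothetical, and the worst case assumed here is that every Spark partition holds a record for every table partition and keeps all of its handles open at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BulkInsertEstimate:
    total_files: int              # files created across the whole table
    open_handles_per_task: int    # write handles one Spark task may hold at once
    buffered_bytes_per_task: int  # worst-case buffered Parquet data per task


def estimate_bulk_insert_footprint(
    num_physical_partitions: int,  # M in the description
    num_logical_partitions: int,   # N in the description
    buffer_bytes_per_handle: int,  # ASSUMPTION: illustrative value, not a Hudi default
) -> BulkInsertEstimate:
    """Worst-case estimate when no re-partitioning aligns Spark partitions
    with the table's partition column (hypothetical helper)."""
    # Every logical partition writes one file into every physical partition.
    total_files = num_physical_partitions * num_logical_partitions
    # A single task therefore keeps one open handle per physical partition.
    open_handles_per_task = num_physical_partitions
    # Each open handle buffers data until it is flushed.
    buffered_bytes_per_task = open_handles_per_task * buffer_bytes_per_handle
    return BulkInsertEstimate(
        total_files=total_files,
        open_handles_per_task=open_handles_per_task,
        buffered_bytes_per_task=buffered_bytes_per_task,
    )


if __name__ == "__main__":
    # M = 1000 table partitions, N = 200 Spark partitions, 64 MiB per handle (assumed).
    estimate = estimate_bulk_insert_footprint(1000, 200, 64 * 1024 * 1024)
    print(estimate.total_files)
    print(estimate.buffered_bytes_per_task / 2**30, "GiB per task (worst case)")
```

Under these assumed numbers a single task could end up buffering about 62.5 GiB in the worst case, which is consistent with the OOM reported above. Real buffers are usually only partially filled, so this is an upper bound rather than a prediction.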



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2023-02-06 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Component/s: writer-core

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: writer-core
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-12-20 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12, 2023-01-09  (was: 
2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12, 0.13.0 Final Sprint)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-12-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12, 0.13.0 Final Sprint 
 (was: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/12)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-12-07 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Priority: Critical  (was: Blocker)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-12-07 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Fix Version/s: (was: 0.13.0)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Critical
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-12-02 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/12/06  (was: 2022/08/22, 
2022/09/05, 2022/10/18)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-12-02 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22, 2022/09/05, 2022/10/18  (was: 2022/08/22, 2022/09/05, 
2022/10/18, 2022/11/29)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/29  (was: 2022/08/22, 
2022/09/05, 2022/10/18)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-11-17 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Sprint: 2022/08/22, 2022/09/05, 2022/10/18  (was: 2022/08/22, 2022/09/05, 
2022/10/18, 2022/11/15)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-11-01 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/15  (was: 2022/08/22, 
2022/09/05, 2022/10/18, 2022/11/01)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-11-01 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22, 2022/09/05, 2022/10/18, 2022/11/01  (was: 2022/08/22, 
2022/09/05, 2022/10/18)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-10-18 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Sprint: 2022/08/22, 2022/09/05, 2022/10/18  (was: 2022/08/22, 2022/09/05)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-09-22 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Sprint: 2022/08/22, 2022/09/05  (was: 2022/08/22, 2022/09/05, 2022/09/19)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-09-19 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22, 2022/09/05, 2022/09/19  (was: 2022/08/22, 2022/09/05)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-09-07 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22, 2022/09/05  (was: 2022/08/22)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-08-22 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4261:
-
Sprint: 2022/08/22

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-08-22 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Summary: OOM in bulk-insert when using "NONE" sort-mode for table w/ large 
# of partitions  (was: OOM in bulk-insert when using "NONE" sort-mode for table 
w/ large # of partitino)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png


[jira] [Updated] (HUDI-4261) OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of partitions

2022-08-22 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4261:
--
Fix Version/s: 0.13.0
   (was: 0.12.0)

> OOM in bulk-insert when using "NONE" sort-mode for table w/ large # of 
> partitions
> -
>
> Key: HUDI-4261
> URL: https://issues.apache.org/jira/browse/HUDI-4261
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.13.0
>
> Attachments: Screen Shot 2022-06-15 at 6.06.06 PM.png