vinothchandar commented on a change in pull request #3549:
URL: https://github.com/apache/hudi/pull/3549#discussion_r700624615



##########
File path: website/blog/2021-08-27-bulk-insert-sort-modes.md
##########
@@ -0,0 +1,88 @@
+---
+title: "Bulk Insert Sort Modes with Apache Hudi"
+excerpt: "Different sort modes available with BulkInsert"
+author: shivnarayan
+category: blog
+---
+
+Apache Hudi supports a `bulk_insert` operation, in addition to `insert` and `upsert`, for ingesting data into a Hudi table.
+This blog walks through the different sort modes available out of the box and how they compare with one another.
+<!--truncate-->
+
+Apache Hudi supports `bulk_insert` to assist with the initial loading of data into a Hudi table. It is expected
+to be faster than the `insert` or `upsert` operations. Bulk insert differs from insert in two
+aspects: existing records are never looked up, and some writer-side optimizations, such as
+small file handling, are not performed.
+
+Bulk insert offers three different sort modes to cater to different user needs, based on the following principles.
+
+- Sorting gives us good compression and upsert performance if data is laid out well. In particular, if your record keys
+  have some ordering characteristics (a timestamp prefix, for example), sorting helps trim down a lot of files
+  during upserts. If data is sorted by frequently queried columns, queries can leverage parquet predicate pushdown
+  to prune data and lower latency as well.
+  
+- Additionally, parquet writing is quite a memory-intensive operation. When writing large volumes of data into a table
+  that is also partitioned into thousands of partitions, without sorting of any kind the writer may have to keep
+  thousands of parquet writers open simultaneously, incurring unsustainable memory pressure and eventually crashing.
+  
+- It's also desirable to produce the smallest number of files possible when bulk importing data, so as to avoid
+  metadata overhead later on for writers and queries.
+
+The three sort modes supported out of the box are: `PARTITION_SORT`, `GLOBAL_SORT` and `NONE`.
+
+## Configurations
+One can set the config [`hoodie.bulkinsert.sort.mode`](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to one
+of three values, namely `NONE`, `GLOBAL_SORT` and `PARTITION_SORT`. The default sort mode is `GLOBAL_SORT`.
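For instance, the sort mode can be set alongside the other write options. Here is a minimal sketch; the table name, record key field, and partition path field below are hypothetical placeholders, not values from this post:

```python
# Hypothetical Hudi write options. "hoodie.bulkinsert.sort.mode" selects the
# sort mode: NONE, GLOBAL_SORT, or PARTITION_SORT (GLOBAL_SORT is the default).
hudi_options = {
    "hoodie.table.name": "my_table",                       # hypothetical table name
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.recordkey.field": "uuid",     # hypothetical key field
    "hoodie.datasource.write.partitionpath.field": "date", # hypothetical partition field
    "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
}

# With a SparkSession in hand, the write itself would then look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/my_table")
```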
+
+## Different Sort Modes
+
+### Global Sort
+
+As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files
+pruned via key ranges during index lookups for subsequent upserts. This is because each file has non-overlapping
+min and max key values, which helps significantly when the key has some ordering characteristics, such as a time-based prefix.
+Given that we write to a single parquet file on a single output partition path on storage at any given time, this mode
+greatly helps control memory pressure during large partitioned writes. Also, due to the global sorting, each small table
+partition path will be written from at most two Spark partitions and will thus contain just two files.
+This is the default sort mode for the bulk_insert operation in Hudi.
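To see why global sorting yields non-overlapping key ranges, here is a toy standalone simulation (plain Python, not Hudi code): records are sorted by key across all input partitions and then chunked into "files", so each file's min/max range is disjoint from the next and an index lookup can prune all but one file.

```python
# Toy simulation of GLOBAL_SORT: sort records by key globally, then split into
# fixed-size "files" and check that the per-file key ranges never overlap.
records = [(key, f"payload-{key}") for key in [42, 7, 99, 3, 58, 21, 76, 14]]

globally_sorted = sorted(records, key=lambda r: r[0])
file_size = 3
files = [globally_sorted[i:i + file_size]
         for i in range(0, len(globally_sorted), file_size)]

# (min_key, max_key) per file; sorting guarantees these ranges are disjoint.
ranges = [(f[0][0], f[-1][0]) for f in files]

# Each file's max key is strictly below the next file's min key, so a lookup
# for a single key can skip every file whose range does not contain it.
assert all(ranges[i][1] < ranges[i + 1][0] for i in range(len(ranges) - 1))
```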
+
+### Partition sort
+In this sort mode, records within a given Spark partition are sorted, but a given Spark partition
+may contain records from several different table partitions. So, even though we sort within each Spark partition, this sort
+mode can result in a large number of files at the end of the bulk_insert, since records for a given table partition could
+be spread across many Spark partitions. During the actual write, the writer does not keep many files open
+simultaneously, since each file is closed before moving on to the next one (records are sorted within a Spark partition),
+and hence memory pressure stays low.
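This trade-off can be illustrated with a toy simulation (plain Python, not Hudi internals): because each Spark partition's records are grouped by table partition path after sorting, the writer only ever needs one open file at a time, yet the same table partition can still produce a file per Spark partition that touches it.

```python
# Toy simulation of PARTITION_SORT: each "Spark partition" sorts its own
# records by table-partition path, so the writer closes one file before
# opening the next, but paths repeating across Spark partitions repeat files.
spark_partitions = [
    [("2021-08-01", 5), ("2021-08-02", 1), ("2021-08-01", 9)],
    [("2021-08-02", 3), ("2021-08-03", 7), ("2021-08-01", 2)],
]

total_files = 0
max_open_writers = 0
for partition in spark_partitions:
    current_path, open_writers = None, 0
    for path, _key in sorted(partition):  # sorted => grouped by partition path
        if path != current_path:          # the previous file is closed first
            open_writers = 1
            total_files += 1
            current_path = path
        max_open_writers = max(max_open_writers, open_writers)
```

Here only one writer is ever open at a time, but the three table partitions end up with five files, since two of them are written from both Spark partitions.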
+
+### None
+
+In this mode, no transformation (such as sorting) is applied to the user records, which are delegated to the writers as-is. So,
+when writing large volumes of data into a table partitioned into thousands of partitions, the writer may have to keep thousands of
+parquet writers open simultaneously, incurring unsustainable memory pressure and eventually leading to crashes. Also,
+the min and max key ranges of a given file could be very wide (unsorted records), so subsequent upserts may read
+bloom filters from a lot of files during index lookup. Since records are not sorted, and each writer could receive records
+spanning N table partitions, this sort mode can result in a huge number of files at the end of the bulk import.
+This can also impact your upsert or query performance due to the large number of small files.
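For contrast with the partition-sort case, a toy count of open writers under `NONE` (again plain Python, not Hudi internals): with unsorted input, a writer must hold one open file for every distinct table partition it has encountered so far.

```python
# Toy simulation of NONE: records arrive unsorted, so a writer accumulates one
# open file per distinct table-partition path it has seen.
unsorted_records = [
    ("2021-08-03", 7), ("2021-08-01", 5), ("2021-08-02", 1),
    ("2021-08-01", 9), ("2021-08-03", 2), ("2021-08-02", 4),
]

open_paths = set()
max_open_writers = 0
for path, _key in unsorted_records:
    open_paths.add(path)  # an unseen path forces another simultaneously open writer
    max_open_writers = max(max_open_writers, len(open_paths))

# All three table partitions need a writer open at once; with thousands of
# table partitions, this grows to thousands of open parquet writers.
assert max_open_writers == 3
```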
+
+## User defined partitioner
+
+If none of the built-in sort modes above suffice, users can also choose to implement their own
+[partitioner](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java)
+and plug it into bulk insert as needed.
+
+## Bulk insert with different sort modes
+Here is a microbenchmark showing the performance differences between the sort modes.
+
+![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>

Review comment:
       its a bit misleading how sorting adds very little overhead for 
bulk_insert. thoughts?



