[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.

2020-12-01 Thread GitBox


asharma4-lucid commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-736998571


   Thanks @bvaradar. Would you know when 0.7.0 is slated for release? The S3 listing time will continue to grow for us as we add more partitions, even with cleaning turned off. Also, since we are using a COW table with mostly inserts, would new versions of files still be created, and hence would old versions still need to be cleaned up?
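   If cleaning is turned back on later, the retention window is what bounds how many old COW file versions accumulate. Below is a minimal sketch of the relevant cleaner options (option keys are Hudi's documented cleaner configs; the retained-commit count is purely illustrative):

```scala
// Illustrative cleaner settings to pass via .option(...) on the Hudi write once
// automatic cleaning is re-enabled. Tune the retained-commit count to how far back
// incremental / point-in-time readers need to go.
val cleanerOpts = Map(
  "hoodie.clean.automatic"          -> "true",                // re-enable inline cleaning
  "hoodie.cleaner.policy"           -> "KEEP_LATEST_COMMITS", // keep file slices for the last N commits
  "hoodie.cleaner.commits.retained" -> "10"                   // N: older COW file versions become candidates for deletion
)
```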







[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.

2020-11-26 Thread GitBox


asharma4-lucid commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-734453893


   Is there a downside to keeping hoodie.clean.automatic=false?







[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.

2020-11-26 Thread GitBox


asharma4-lucid commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-734450744


   Thanks @bvaradar. Setting hoodie.clean.automatic=false has reduced the processing time significantly; the 5 records were inserted in less than a minute.
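   For reference, a minimal sketch of the kind of bulk-insert write this corresponds to, with inline cleaning disabled as above. The option keys are Hudi's documented write configs; the paths, table name, and column names are placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-bulk-insert")                 // placeholder app name
  .getOrCreate()

// Placeholder source: the daily ~5 GB parquet drop
val df = spark.read.parquet("s3://my-bucket/incoming/2020-11-26/")

df.write.format("hudi")
  .option("hoodie.table.name", "my_hudi_table")                           // placeholder table name
  .option("hoodie.datasource.write.operation", "bulk_insert")
  .option("hoodie.datasource.write.recordkey.field", "record_key")        // placeholder key column
  .option("hoodie.datasource.write.partitionpath.field", "partition_col") // placeholder partition column
  .option("hoodie.datasource.write.precombine.field", "ts")               // placeholder ordering column
  .option("hoodie.clean.automatic", "false")                              // inline cleaning turned off, as discussed
  .mode(SaveMode.Append)
  .save("s3://my-bucket/hudi/my_hudi_table")                              // placeholder base path
```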







[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.

2020-11-24 Thread GitBox


asharma4-lucid commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-733323629


   Yes, this is a COW table.







[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.

2020-11-24 Thread GitBox


asharma4-lucid commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-733174238


   Thanks @bvaradar. I tried inserting just 5 records into the existing table with ~300K partitions and it took close to 5 hrs; inserting ~5 records into a new table takes less than 2 mins. Is this extra ~5 hrs entirely due to the cleaner and compaction processes? For our use case we mostly get inserts. With that in mind, would it be beneficial for us to switch from COW to MOR and do async compaction (I am most likely making an incorrect assumption that this huge extra processing time is only because of compaction)? Also, since our data does not have frequent record-level updates, would switching to MOR make any difference?
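   For what it's worth, a sketch of the write options the MOR-with-deferred-compaction variant would amount to, if that route were tried. The option keys are Hudi's documented write configs; the values are illustrative only:

```scala
// Illustrative options for MERGE_ON_READ with compaction kept out of the ingest job.
// Compaction would then run out of band (a separate Spark job or the Hudi CLI), so the
// daily write does not pay for it.
val morOpts = Map(
  "hoodie.datasource.write.table.type"      -> "MERGE_ON_READ", // instead of COPY_ON_WRITE
  "hoodie.datasource.write.operation"       -> "insert",        // mostly-insert workload
  "hoodie.compact.inline"                   -> "false",         // no compaction inside the write path
  "hoodie.compact.inline.max.delta.commits" -> "5"              // cadence if inline compaction were ever enabled
)
```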







[GitHub] [hudi] asharma4-lucid commented on issue #2269: [SUPPORT] - HUDI Table Bulk Insert for 5 gb parquet file progressively taking longer time to insert.

2020-11-22 Thread GitBox


asharma4-lucid commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-731908622


   Thanks @bvaradar for your response. I have a few more questions:
   
   1) The reason we chose the partition key we are using is that we wanted O(1) read performance on it. My understanding is that this many partitions puts memory pressure on the executors, since each executor creates as many writers as there are partitions. (I assume the HDFS namenode would also be impacted, but since we are using S3, I am discounting that; do let me know if I am mistaken.) This is what I want to confirm: every day our process will update around ~12K partitions and insert ~33K new partitions. Will the executors doing the hudi table write create ~44K writers, contributing to the memory pressure? Or will the already existing ~300K partitions also be touched in some way by the hudi table write executors, leading to performance degradation as we continue to add more data to the hudi table?
   
   2) Just to confirm my understanding: when you mentioned S3 listing as the bottleneck, you meant listing all the partitions and files of the hudi table, not just the partitions updated and/or inserted by that specific run. In our case, that would imply the hudi table write is doing an S3 listing of the already existing ~300K partitions and their files, not just the ~44K partitions for the specific execution. This is probably in line with what we have observed: for the initial 15 daily runs, each hudi table write completed in around 4 hrs, and from the 16th day onwards it gradually increased from 4 to 5 to 6 and now to almost 9 hrs per day. Can you please confirm?
   
   3) If the S3 listing requirement is made optional in hudi 0.7.0, can we continue to use the partition key we are using, given that our process will add/update ~44K partitions in the hudi table every day (see the sketch after this list)? I understand it is not the best partition key as it has very high cardinality, but our read requirement is what is driving us towards it. This is probably related to question 1 above, but is there any other downside you can see to our use of this partition key, apart from the S3 listing dependency?
   
   4) We are trying to see if Spark bucketing on the key would be a good middle ground between partitioning on the key and not partitioning at all. Does the hudi table write support bucketed writes, and consequently are hudi table reads able to use the buckets for optimal read performance? Something like O(1) hash + O(log m) binary search, where m is the number of records in each bucketed file.
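   Regarding question 3, a sketch of the listing-related flag as described for the 0.7.0 metadata table (RFC-15); whether it ships under exactly this name should be checked against the released version:

```scala
// With the metadata table enabled, partition and file listings can be served from the
// table's own metadata instead of a recursive S3 listing of all ~300K partitions.
val listingOpts = Map(
  "hoodie.metadata.enable" -> "true" // build and use the internal metadata table for listings
)
```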


