[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2021-01-29 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-769682139


   Yes, feel free to close it. I'll check that in the new release.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2021-01-27 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-768163638


   I found it very time- and resource-consuming. That is why I decided to 
change my requirements and partition the data by a column that should never 
change for a given row (I also switched my index to plain SIMPLE), so a record 
should never move from part_1 to part_2. With this approach I will pay more 
for AWS Athena queries (because the bigger partitions mean I scan more data), 
but less for AWS EMR to process the data.
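
   For anyone landing here later, the switch described above could look roughly like the sketch below. It only builds the write options as a plain dict. The `hoodie.*` keys are standard Hudi write configs, but the helper name `hudi_write_options` and the table and column names in the example are hypothetical, not taken from my actual job.

```python
# Minimal sketch: build Hudi upsert options with a non-global index.
# Assumption: the partition column never changes for a given record key,
# so a partition-scoped index such as SIMPLE is sufficient.

# Index types named in this thread, plus the default BLOOM.
KNOWN_INDEX_TYPES = {"BLOOM", "GLOBAL_BLOOM", "SIMPLE", "GLOBAL_SIMPLE", "HBASE"}


def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, index_type="SIMPLE"):
    """Return a dict of Hudi datasource write options for an upsert."""
    if index_type not in KNOWN_INDEX_TYPES:
        raise ValueError(f"unexpected index type: {index_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.index.type": index_type,
    }


# Hypothetical table and column names, for illustration only.
opts = hudi_write_options("events", "event_id", "created_date", "updated_at")

# With a Spark DataFrame `df` and a target `path`, the options would be passed as:
#   df.write.format("hudi").options(**opts).mode("append").save(path)
```

   The design choice is the trade-off described above: a stable partition column removes the need for a global lookup across all partitions on every upsert.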







[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2021-01-25 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-766649165


   Hi @nsivabalan,
   
   I think we can close this issue for now. I've switched from GLOBAL_BLOOM to 
the SIMPLE index with static partition keys, because GLOBAL_BLOOM was too slow 
in my use case. 







[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2020-12-15 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-745166974


   @n3nash 
   
   1. I don't really understand the difference between GLOBAL_BLOOM and 
GLOBAL_SIMPLE. Will the latter solve my problem with updating the partition 
(I mean removing the record from the previous partition and adding it to the 
new one)?
   Where should I use GLOBAL_SIMPLE, and in which use cases?
   2. Do you have any recommendations about performance tuning, such as the 
number of instances, cores, memory, etc.?
   3. Do you use GLOBAL_BLOOM in your use cases at Uber? I learned on the Slack 
channel that you use the HBASE index. Does that mean the HBASE index does the 
same thing as GLOBAL_BLOOM? I'm wondering whether my use case (deleting from 
the old partition and inserting into the new one) is so rare that nobody has 
raised this problem so far.
   4. Do you think switching to Kafka and DeltaStreamer (with continuous 
integration) would solve my issue, since I would have fewer rows to upsert each 
time? Or would each DeltaStreamer upsert still have to list all partitions?







[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2020-12-14 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-744511122


   Hi guys, do you have any update here?







[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2020-12-11 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-743110390


   @n3nash I believe those are the hardware-configuration defaults that come 
with spark.dynamicAllocation.enabled = true. I was running it on an AWS EMR 
cluster of 1 master and 6 core nodes (r5d.4xlarge), each with 16 vCores and 
128 GB of RAM. This is why:
   
   spark.executor.instances is 6
   spark.executor.cores is 16
   
   Can you advise me on how I should parametrize it?
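
   To make the question concrete, here is a rough sketch of one common rule of thumb for splitting nodes into smaller executors. It is not a recommendation from the Hudi maintainers. The helper `size_executors` and its defaults (5 cores per executor, 1 core and 8 GB reserved per node, about 10% held back for memory overhead) are assumptions for illustration and would need tuning for a real workload.

```python
# Minimal sketch of a rule-of-thumb executor sizing calculation.
# All defaults below are assumptions, not values confirmed in this thread.

def size_executors(nodes, vcores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=8, overhead_percent=10):
    """Derive Spark executor settings from the cluster shape."""
    # Leave some cores on each node for the OS and cluster daemons.
    usable_cores = vcores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")

    # Keep one executor slot free for the driver / application master.
    total_executors = nodes * executors_per_node - 1

    # Split the remaining node memory evenly, then hold back a share
    # of each executor's slice for off-heap memory overhead.
    mem_per_executor_gb = (mem_gb_per_node - reserved_mem_gb) // executors_per_node
    heap_gb = mem_per_executor_gb * (100 - overhead_percent) // 100

    return {
        "spark.executor.instances": total_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{heap_gb}g",
    }


# The cluster from this comment: 6 core nodes, 16 vCores and 128 GB each.
conf = size_executors(nodes=6, vcores_per_node=16, mem_gb_per_node=128)
print(conf)
```

   Under these assumptions the cluster above yields 17 executors with 5 cores and a 36g heap each, instead of 6 executors with 16 cores each.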







[GitHub] [hudi] kirkuz commented on issue #2323: [SUPPORT] GLOBAL_BLOOM index significantly slowing down processing time

2020-12-11 Thread GitBox


kirkuz commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-743105824


   @n3nash, sure
   
   
![image](https://user-images.githubusercontent.com/22114492/101891064-d8eb3a00-3ba1-11eb-9d2b-ff374507879c.png)
   
   
![image](https://user-images.githubusercontent.com/22114492/101891153-f0c2be00-3ba1-11eb-853c-f1544da10fcf.png)
   
   
![image](https://user-images.githubusercontent.com/22114492/101891205-020bca80-3ba2-11eb-83d2-dc22a97e38aa.png)
   


