Hi all,

I have been discussing the following style of workload with Priyank sir, and whether we can improve HBase's performance in this area. The use case is as follows:

1) Bulk load data.
2) Query the data multiple times (mostly read access, with no real-time writes).

This is a common workload, and I am quite interested in benchmarking HBase's performance in this area, as well as improving it further.

Please advise me on how I can proceed with benchmarking. Specifically, how should I set up an HBase cluster, and are there any specific requirements on the cluster for this type of testing?

I worked on a patch to improve performance for a similar use case in PostgreSQL. The case is much the same: bulk load of data, a large number of mostly read-only queries, and then deletion of the data. The optimization I targeted was the cost of writes to disk. Specifically, setting the flags (hint bits) that track the commit status of the inserting/deleting transaction was causing a write overhead. I tried to mitigate this with a cache that holds the transaction ID for the above-mentioned workload, thereby reducing the cost of writes.

I will start benchmarking once I have the system set up, and then start thinking about tests. Once I have an outline in mind, I shall post it on the list. I will need a lot of guidance from the community on this.

Thoughts/comments/advice, please?

Regards,
Atri

--
Regards,
Atri
l'apprenant