Also, I would be interested to know whether Phoenix gathers any stats today.
Thanks
Abhishek
On 01/31/2014 10:46 PM, abhishek wrote:
Hi James
I am building a schema recommendation system based on cost modeling.
Currently, I gather all the necessary features offline using Phoenix
queries or the HBase API. However, as you mentioned in your mail, these
stats can be gathered more efficiently during major compaction or by
other background processes.
I agree with you that cardinalities will contribute to cost; thank you
for pointing that out. Are there other stats that could have a large
impact?
Thanks for replying and showing interest in this project.
Abhishek
On 01/31/2014 10:16 PM, James Taylor wrote:
Hi,
This is a good start. How do you get the number of rows per table?
I think the biggest missing piece is histogram information, so you can
approximate cardinalities. We planned to track this through a stats
collection process run during major compaction. And, of course, the
query engine needs to combine the cardinalities based on the ANDs/ORs
used in the query.
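To illustrate that last point, here is a minimal sketch (not Phoenix code; function names are my own, and it assumes statistically independent predicates) of how per-predicate selectivities estimated from histograms might be combined for ANDs and ORs:

```python
def and_selectivity(sels):
    # Under the independence assumption, AND multiplies selectivities.
    result = 1.0
    for s in sels:
        result *= s
    return result

def or_selectivity(sels):
    # Via De Morgan: sel(A OR B) = 1 - (1 - sel(A)) * (1 - sel(B)).
    result = 1.0
    for s in sels:
        result *= (1.0 - s)
    return 1.0 - result

# Example: two predicates with selectivities 0.1 and 0.2,
# as might be estimated from histogram stats.
rows = 1_000_000
est_and = rows * and_selectivity([0.1, 0.2])  # ~20,000 rows
est_or = rows * or_selectivity([0.1, 0.2])    # ~280,000 rows
```

Real optimizers refine this when predicates are correlated, but the independence assumption is the usual starting point.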
Thanks,
James
On Fri, Jan 31, 2014 at 6:15 PM, abhishek <[email protected]> wrote:
Hi Taylor
I am currently working on cost modeling for join and scan queries.
Currently, my feature set includes the following:
1) number of region servers
2) number of threads per region server
3) number of client-side threads
4) number of rows per table
5) record size
6) rowkey length
7) HDFS block size
8) HDFS replication factor
9) HBase cache size
and a few more.
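For concreteness, a minimal sketch of how such a feature set could be grouped and carried around (the field names, units, and the crude size estimate are illustrative assumptions, not from any existing system):

```python
from dataclasses import dataclass

@dataclass
class QueryCostFeatures:
    # Cluster-level features (illustrative names).
    num_region_servers: int
    threads_per_region_server: int
    client_side_threads: int
    # Table-level features.
    rows_per_table: int
    record_size_bytes: int
    rowkey_length_bytes: int
    # Storage-level features.
    hdfs_block_size_bytes: int
    hdfs_replication_factor: int
    hbase_cache_size_bytes: int

    def estimated_table_bytes(self) -> int:
        # Crude table-size estimate: rows * record size,
        # useful as an input to a scan-cost model.
        return self.rows_per_table * self.record_size_bytes
```

Grouping the features this way also makes it easy to see which ones change per query (table-level) versus per deployment (cluster- and storage-level).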
Would you be able to point out more features that could affect
scan and join query performance?
Thanks
Abhishek