Also, I would be interested to know whether Phoenix gathers any stats now.

Thanks
Abhishek
On 01/31/2014 10:46 PM, abhishek wrote:
Hi James

I am building a schema recommendation system based on cost modeling. Currently, I gather all the necessary features offline using Phoenix queries or the HBase API. However, as you mentioned in your mail, these stats could be gathered more efficiently during major compaction or by other background processes.

I agree with you that cardinalities will contribute to cost. Thank you for pointing that out. Are there other stats that could have an immense impact?

Thanks for replying and showing interest in this project.

Abhishek

On 01/31/2014 10:16 PM, James Taylor wrote:
Hi,
This is a good start. How do you get the number of rows per table?

I think the biggest missing piece is histogram information, so you can approximate cardinalities. We planned to track this through a stats collection process run during major compaction. And, of course, the query engine needs to combine the cardinalities based on the ANDs/ORs used in the query.
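[Editor's sketch: the combination step James describes can be illustrated with the standard independence assumption. The function names and selectivity values below are illustrative, not Phoenix internals.]

```python
# Combining per-predicate selectivities into a row-count estimate,
# assuming the predicates are statistically independent (a common
# simplification in cost-based optimizers).

def and_selectivity(selectivities):
    """Under independence, AND multiplies selectivities."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

def or_selectivity(selectivities):
    """Under independence, OR follows inclusion-exclusion:
    1 - product of (1 - s)."""
    miss = 1.0
    for s in selectivities:
        miss *= (1.0 - s)
    return 1.0 - miss

def estimate_rows(table_rows, selectivity):
    """Scale the table's row count by the combined selectivity."""
    return table_rows * selectivity

# Example: WHERE a = 1 AND (b = 2 OR c = 3) over a 1M-row table,
# with per-column selectivities taken from (hypothetical) histograms.
sel = and_selectivity([0.10, or_selectivity([0.20, 0.05])])
print(estimate_rows(1_000_000, sel))  # ~24,000 rows expected
```

Real optimizers refine this with histograms per column, since the independence assumption can be badly wrong for correlated columns.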

Thanks,
James


On Fri, Jan 31, 2014 at 6:15 PM, abhishek <[email protected] <mailto:[email protected]>> wrote:

    Hi Taylor

    I am currently working on cost modeling for join and scan queries.

    Currently, my feature set includes the following:
    1) number of region servers
    2) number of threads per region server
    3) number of client-side threads
    4) number of rows per table
    5) record size
    6) rowkey length
    7) HDFS block size
    8) HDFS replication factor
    9) HBase cache size
    and a few more
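    [Editor's sketch: one way to see how these features interact is a toy linear scan-cost formula. The formula and the throughput constant are illustrative assumptions, not a real Phoenix cost model.]

```python
# Toy scan-cost estimate built from features like those listed above:
# total bytes scanned divided by the aggregate parallel I/O throughput.
# Assumes regions are evenly distributed across region servers.

def scan_cost(num_rows, record_size_bytes, num_region_servers,
              threads_per_rs, io_bytes_per_sec=100e6):
    """Return an estimated scan time in seconds."""
    total_bytes = num_rows * record_size_bytes
    parallelism = num_region_servers * threads_per_rs
    return total_bytes / (parallelism * io_bytes_per_sec)

# Example: 10M rows of 1 KB each, 5 region servers, 10 threads per RS.
print(scan_cost(10_000_000, 1024, 5, 10))  # 2.048 seconds
```

    A real model would also account for cache hit rates, HDFS replica locality, and filter selectivity, which is where the histogram stats come in.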

    Would you be able to point out more features that could affect
    scan and join query performance?

    Thanks
    Abhishek



