Hi, I wanted to extract the following into a separate thread:
> I was going to ask about partitioning as a way to handle (querying against)
> large volumes of data. This is related to my Q above about date-based
> partitioning. But I'm wondering if one can go further. Partitioning by
> date, partitioning by tenant, but then also partitioning by some other
> columns, which would be different for each type of data being inserted.
> e.g. for sales data maybe the partitions would be date, tenantID, but then
> also customerCountry, customerGender, etc. For performance metrics data
> maybe it would be date, tenantID, but then also environment (prod vs. dev),
> or applicationType (e.g. my HBase cluster performance metrics vs. my Tomcat
> performance metrics), and so on.
>
> Essentially, a secondary index is declaring a partitioning. The indexed
> columns make up the row key, which in HBase determines the partitioning.

Aha! Hmmm. But, as far as I know, how one constructs the key is... the key.
That is, doesn't one typically construct the key based on access patterns?

How would that work in the scenario I described in my other email - an
unknown number of columns and ad-hoc SQL queries? How do you handle the
above without having to create all possible combinations of columns (to
anticipate any sort of query) and having to insert N rows in the index
table for each 1 row in the primary table? Don't you have to do that in
order to handle any ad-hoc query one may choose to run?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
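[Editor's note: a minimal sketch of the composite-key idea discussed above. The field names (`date`, `tenantId`, `country`) and the `|` separator are illustrative assumptions, not from the thread; real HBase keys would be byte arrays, and date components are often reversed or salted to avoid region hotspotting.]

```java
// Sketch: a composite row key fixes one access pattern; each secondary
// index re-declares the partitioning by reordering the indexed columns.
public class RowKeySketch {

    // Primary table key, assuming the common query is
    // "scan a date range within one tenant".
    static String primaryKey(String date, String tenantId, String country) {
        return date + "|" + tenantId + "|" + country;
    }

    // Hypothetical index table key for queries that lead with country.
    // Every additional access pattern needs another index row per put.
    static String indexKeyByCountry(String country, String date, String tenantId) {
        return country + "|" + date + "|" + tenantId;
    }

    // Covering *any* ad-hoc combination of k optional columns would need
    // up to 2^k index variants -- the explosion the email asks about.
    static int subsetCount(int k) {
        return 1 << k;
    }

    public static void main(String[] args) {
        System.out.println(primaryKey("2016-03-01", "t42", "US"));
        System.out.println(subsetCount(4)); // 16 column subsets for k = 4
    }
}
```

The point of the sketch is only that key order is the partitioning: you pick one ordering per table, so ad-hoc queries over arbitrary column combinations can't all be served by a single key layout.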
