Hi Sergei:

You raise a very important algorithmic problem: multi-column cardinality 
estimation. It is a challenge that every what-if analysis database needs to 
address.

In the current VIDEX open-source version, we released a simple solution that 
assumes independence between columns, so Card(AB) = total_rows * 
(Card(A)/total_rows) * (Card(B)/total_rows). This performs well on benchmarks 
like TPC-H but tends to under-estimate in more complex scenarios, for example 
when the columns are correlated.
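
As a rough sketch of the independence assumption (illustrative only, not the 
VIDEX source; the function name and its arguments are made up for this 
example), the estimated row count is the table size multiplied by the two 
per-column selectivities:

```python
def independent_cardinality(card_a: int, card_b: int, total_rows: int) -> float:
    """Estimate rows matching predicates on both A and B, assuming independence.

    card_a / card_b are the row counts matching each single-column predicate.
    Hypothetical helper for illustration; not part of any real API.
    """
    if total_rows <= 0:
        return 0.0
    # total_rows * (card_a / total_rows) * (card_b / total_rows),
    # simplified to a single division.
    return card_a * card_b / total_rows


# 1,000,000-row table: the predicate on A matches 100,000 rows,
# the predicate on B matches 50,000 rows.
estimate = independent_cardinality(100_000, 50_000, 1_000_000)
print(estimate)  # 5000.0
```

If A and B were perfectly correlated and each predicate matched the same 
100,000 rows, the true answer would be 100,000, while this formula would 
return 10,000. That is the kind of under-estimate mentioned above.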

In ByteDance's production environment, when sampling is permitted, we 
pre-collect up to 100k rows covering all columns that appear in query 
conditions, and estimate joint cardinality based on this sample. Without 
sampling, we use a pre-trained language model: faster but coarser. We are 
currently preparing a paper on this work, and will release it in the future.
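
The general idea behind the sampling-based estimate can be sketched roughly as 
follows. This is only an illustration, not our production code: the row format 
(a list of dicts), the predicate callables, and the function name are all 
assumptions made for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def sample_based_cardinality(
    sample: List[Row],
    predicates: List[Callable[[Row], bool]],
    total_rows: int,
) -> float:
    """Scale the fraction of sampled rows matching ALL predicates to the table size.

    Hypothetical helper for illustration only.
    """
    if not sample:
        return 0.0
    matches = sum(1 for row in sample if all(p(row) for p in predicates))
    return total_rows * matches / len(sample)


# Toy sample with perfectly correlated columns: b is always 2 * a.
sample = [{"a": i % 10, "b": (i % 10) * 2} for i in range(100)]
predicates = [lambda r: r["a"] == 3, lambda r: r["b"] == 6]

estimate = sample_based_cardinality(sample, predicates, total_rows=1_000_000)
print(estimate)  # 100000.0
```

Because the predicates are evaluated together on the same rows, correlation 
between columns is captured directly. On this toy data the independence 
formula would give 1,000,000 * 0.1 * 0.1 = 10,000 instead.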

Nevertheless, existing methods still don't perfectly solve this problem. That's 
why we've opened the algorithm interfaces and welcome research contributions.

Best,
Rong
_______________________________________________
discuss mailing list -- [email protected]
To unsubscribe send an email to [email protected]
