Hi Adam,

Thank you for your thoughtful questions. I'm pleased to share our real-world 
experience at ByteDance, where VIDEX currently powers our internal index 
recommendation service, processing thousands of optimization tasks daily. We're 
also planning to launch it on our public cloud (https://www.volcengine.com/) 
within the next 1-2 quarters.

Regarding your specific inquiries:

1. **AI models for cardinality and NDV estimation in production:**

   NDV estimation approaches fall into two categories: sampling-based and 
dataless. When partial data access is available, many classical NDV algorithms 
exist, though they typically excel only with specific distributions [1]. VIDEX 
employs AdaNDV (our work accepted by VLDB'25 [1]), an adaptive approach that 
combines multiple NDV algorithms for optimal results.

   For users with strict privacy requirements or those needing rapid 
recommendations (<5s), we deploy PLM4NDV, our dataless solution accepted by 
SIGMOD'25 [2], which ranks among the leading approaches in this domain.
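To make the sampling-based side concrete, here is a minimal sketch that runs two classical sample-based NDV estimators (GEE and Chao, using their standard published formulas) and fuses their outputs. The plain average at the end is only a toy stand-in for AdaNDV's learned select-and-fuse step; the toy column and sample are illustrative, not from our production workloads.

```python
import random
from collections import Counter
from math import sqrt

def frequency_profile(sample):
    """f[i] = number of distinct values appearing exactly i times in the sample."""
    value_counts = Counter(sample)
    return Counter(value_counts.values()), len(value_counts)

def gee(sample, n_total):
    """Guaranteed-Error Estimator: sqrt(N/n) * f1 + sum of f_i for i >= 2."""
    f, d = frequency_profile(sample)
    return sqrt(n_total / len(sample)) * f[1] + (d - f[1])

def chao(sample):
    """Chao estimator: d + f1^2 / (2 * f2); falls back to d when f2 = 0."""
    f, d = frequency_profile(sample)
    return d + f[1] ** 2 / (2 * f[2]) if f[2] else d

random.seed(0)
column = [random.randint(0, 199) for _ in range(10000)]  # ~200 distinct values
sample = random.sample(column, 200)                      # 2% sample

estimates = [gee(sample, len(column)), chao(sample)]
# Stand-in for AdaNDV's learned select-and-fuse step: a plain average.
fused = sum(estimates) / len(estimates)
```

As the AdaNDV paper discusses, individual estimators like these excel only under specific distributions, which is exactly what motivates learning to select and fuse them.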

   Cardinality estimation methods are generally classified as query-driven or 
data-driven. Data-driven methods (such as Naru and DeepDB) provide superior 
single-table cardinality accuracy but require greater preprocessing resources. 
For privacy-conscious cloud users or environments where full table scans are 
restricted, query-driven methods like MSCN [4] are more appropriate. Our 
query-driven approach, GRASP [3], achieves state-of-the-art results and has 
been accepted at VLDB'25.
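The defining property of the query-driven family is that it learns only from past (query, observed cardinality) pairs, never from the data itself. The sketch below illustrates that idea with a one-feature linear model trained by SGD on range-predicate widths; this is a deliberately tiny stand-in for the neural featurizers used by MSCN and GRASP, and the synthetic table and workload are assumptions for illustration.

```python
import math
import random

random.seed(1)
# Toy single-column table; in a real deployment this is the customer database
# and would never be read directly by a query-driven estimator.
table = sorted(random.uniform(0.0, 1.0) for _ in range(5000))

def true_card(lo, hi):
    """Ground-truth cardinality of the range predicate lo <= c <= hi."""
    return sum(lo <= v <= hi for v in table)

# Training workload: past queries with their observed cardinalities.
# Feature here is just the predicate width; target is log1p(cardinality).
workload = []
for _ in range(300):
    lo = random.uniform(0.0, 1.0)
    hi = random.uniform(lo, 1.0)
    workload.append((hi - lo, math.log1p(true_card(lo, hi))))

# One-feature linear model trained by SGD (toy stand-in for MSCN/GRASP).
w, b, lr = 0.0, 0.0, 0.05
for _ in range(200):
    for x, y in workload:
        err = w * x + b - y
        w -= lr * err * x
        b -= lr * err

def estimate_card(lo, hi):
    """Predict cardinality for a new range predicate from its width alone."""
    return math.expm1(w * (hi - lo) + b)
```

A wider predicate should yield a larger estimate once the model has fit the workload; real query-driven estimators replace the width feature with learned encodings of tables, joins, and predicates.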

   For index recommendations in practice, we provide further details in point 
#2 below.

2. **Limitations and production generalization:**

   The primary challenge for VIDEX in production environments is accurately 
modeling multi-column join distributions with limited data access. Our approach 
varies according to customer requirements:

   - With sampling permission, we gather data via PK-based sampling and use 
AdaNDV for NDV estimation. We construct histograms for single-column 
cardinality estimation and apply correlation coefficients for multi-column 
cardinality.

   - In zero-sampling scenarios, we rely on our pre-trained models (PLM4NDV and 
a dataless CardEst method).

   Our testing across 5,000+ index recommendation tasks demonstrates that these 
approaches consistently outperform traditional sampling-based recommendations.
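To illustrate the histogram-plus-correlation idea from the first bullet: the sketch below estimates a two-column conjunctive selectivity by interpolating between the independence assumption (the product of single-column selectivities) and full dependence (their minimum), weighted by the absolute Pearson correlation. This interpolation is one common heuristic, not necessarily the exact formula VIDEX uses; the correlated synthetic columns are assumptions for illustration, and the exact `selectivity` function stands in for a histogram lookup.

```python
import random

random.seed(2)
N = 10000
# Two correlated columns: b tracks a with added noise.
a = [random.gauss(0.0, 1.0) for _ in range(N)]
b = [x + random.gauss(0.0, 0.5) for x in a]

def selectivity(col, lo, hi):
    """Single-column selectivity; stands in for a histogram lookup."""
    return sum(lo <= v <= hi for v in col) / len(col)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

def joint_selectivity(sa, sb, rho):
    """Blend independence (sa * sb) and full dependence (min(sa, sb))
    using |rho| as the weight -- a common heuristic, shown for illustration."""
    r = abs(rho)
    return r * min(sa, sb) + (1 - r) * sa * sb

sa = selectivity(a, 0.0, 1.0)
sb = selectivity(b, 0.0, 1.0)
rho = pearson(a, b)
est = joint_selectivity(sa, sb, rho)
true_sel = sum(0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
               for x, y in zip(a, b)) / N
```

For strongly correlated columns like these, the plain independence product badly underestimates the joint selectivity, which is why some correlation correction is needed on top of single-column histograms.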

3. **Natural language models with VIDEX:**

   Regarding NDV, PLM4NDV (our SIGMOD 2025 paper) leverages pre-trained 
language models to extract semantic schema information without accessing actual 
data. This approach is particularly valuable in cloud environments where data 
access is restricted. Our models are pre-trained on thousands of public schema 
datasets, making them immediately applicable to new business scenarios without 
additional training.

   In terms of cardinality, we've achieved promising results using language 
models for entirely dataless cardinality estimation.

Thank you again for your interest. I welcome any additional questions regarding 
our research technology or business implementations.

- [1] AdaNDV (Our NDV work, VLDB 2025): Xu, X., Zhang, T., He, X., Li, H., 
Kang, R., Wang, S., ... & Chen, J. (2025). AdaNDV: Adaptive Number of Distinct 
Value Estimation via Learning to Select and Fuse Estimators.
- [2] PLM4NDV (Our language-model-based NDV work, SIGMOD 2025): Xu, X., He, X., 
Zhang, T., Zhang, L., Shi, R., & Chen, J. (2025). PLM4NDV: Minimizing Data 
Access for Number of Distinct Values Estimation with Pre-trained Language 
Models.
- [3] GRASP (Our query-driven cardinality work, VLDB 2025): Wu, P., Kang, R., 
Zhang, T., Chen, J., Marcus, R., & Ives, Z. G. (2025). Data-Agnostic 
Cardinality Learning from Imperfect Workloads.
- [4] MSCN: A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper, 
“Learned Cardinalities: Estimating Correlated Joins with Deep Learning,” Dec. 
18, 2018, arXiv: arXiv:1809.00677. doi: 10.48550/arXiv.1809.00677.

Best regards,
Rong
ByteBrain Team, ByteDance
_______________________________________________
discuss mailing list -- [email protected]
To unsubscribe send an email to [email protected]
