Hi everyone,

I would like to propose introducing support for the Vortex columnar format in 
GraphAr to enhance storage efficiency and query performance, particularly in 
analytical and AI scenarios.
I have initiated a discussion regarding this proposal at 
https://github.com/apache/incubator-graphar/discussions/887.


Background
Currently, several emerging columnar file formats—such as 
[Vortex](https://github.com/vortex-data/vortex), 
[Lance](https://github.com/lance-format/lance), 
[F3](https://github.com/future-file-format/F3), BtrBlocks, Nimble, and Parquet 
variants—demonstrate strong performance advantages in specific scenarios.


I wonder whether supporting these formats in GraphAr could significantly reduce 
storage overhead and improve query performance at scale.


Benefits


1. Introducing the Vortex columnar format can improve storage efficiency and 
query performance through better compression and vectorized execution.
2. It enables more flexible column-level encoding strategies, which can better 
align with analytical graph workloads.
3. Vortex is designed to be GPU-friendly, which is particularly useful in AI 
and analytics scenarios.




Impact of Changes
1. The storage layer needs a Vortex implementation and format adapters.
2. All language bindings need to adopt the new format.

For example, the existing FileType enum would need a new entry:

```cpp
enum class FileType : int32_t { CSV = 0, PARQUET = 1, ORC = 2, JSON = 3 };
```
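
As a rough sketch of what the enum change might look like, the block below adds a `VORTEX` entry and a small name-lookup helper. The value `4`, the helper name `FileTypeName`, and the string `"vortex"` are all assumptions made for illustration; none of them exist in GraphAr today, and the real change would also have to cover the readers, writers, and each language binding.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical extension of the existing enum: VORTEX = 4 is an assumed
// value chosen only to follow the current numbering.
enum class FileType : int32_t {
  CSV = 0,
  PARQUET = 1,
  ORC = 2,
  JSON = 3,
  VORTEX = 4
};

// Hypothetical helper (not an existing GraphAr function): returns a
// lowercase name for each file type, or "unknown" for unmapped values.
constexpr std::string_view FileTypeName(FileType type) {
  switch (type) {
    case FileType::CSV:
      return "csv";
    case FileType::PARQUET:
      return "parquet";
    case FileType::ORC:
      return "orc";
    case FileType::JSON:
      return "json";
    case FileType::VORTEX:
      return "vortex";
  }
  return "unknown";
}
```

Appending the new value at the end keeps the existing integer values stable, so any metadata that already stores them would not need to change.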
Evidence from DuckDB
Vortex has already been integrated into DuckDB, where it demonstrates 
substantial performance improvements on analytical workloads such as TPC-H. 
Reported results show significant gains in scan efficiency and query execution 
time compared to traditional columnar formats. Details are available in this 
[blog post](https://duckdb.org/2026/01/23/duckdb-vortex-extension).


What do others think about this idea?
I’m happy to hear suggestions or alternative approaches.


Thanks, yao jun
