kosiew commented on issue #15819:
URL: https://github.com/apache/datafusion/issues/15819#issuecomment-2826690238
Thanks for raising this! This is a very relevant use case—particularly in
genomics, where the desired sort order of chromosome labels (e.g., `chr1`,
`chr2`, `chrX`, `chrY`, `chrM`) doesn’t follow lexicographic rules.
### ✅ Proposed Solution: Use a Sort Key via `MemTable`
> add a chrom_num column which explicitly numbers the chromosomes in the
desired order, and then sort on that, then drop the column. It's not very
elegant.
While DataFusion doesn’t yet support custom comparators or sorting
dictionary arrays by keys, a clean and efficient workaround is to introduce a
numeric **sort key** column using a `MemTable` (or a CSV/Parquet file) as a
lookup.
#### 🧬 Example
Create a `chrom_order` mapping table:
```rust
let chrom_map = vec![
("chr1", 1),
("chr2", 2),
("chrX", 23),
("chrY", 24),
("chrM", 25),
("GL000220.1", 1001),
];
let schema = Arc::new(Schema::new(vec![
Field::new("chrom", DataType::Utf8, false),
Field::new("sort_key", DataType::Int32, false),
]));
let chrom_array = StringArray::from_iter_values(chrom_map.iter().map(|(c,
_)| *c));
let key_array = Int32Array::from_iter_values(chrom_map.iter().map(|(_, k)|
*k));
let batch = RecordBatch::try_new(
schema.clone(),
vec![Arc::new(chrom_array), Arc::new(key_array)],
)?;
let mem_table = MemTable::try_new(schema, vec![vec![batch]])?;
ctx.register_table("chrom_order", Arc::new(mem_table))?;
```
```sql
SELECT v.*
FROM variants v
JOIN chrom_order c ON v.chrom = c.chrom
ORDER BY c.sort_key, v.pos;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]