Re: quering H5 "flatten" data in apache drill in releases after 1.19.0

Charles Givre Thu, 05 Mar 2026 18:47:36 -0800

Hi Tore, 
Thanks for your interest and use of Drill.  Could you try this:

1.  In the configuration for your dfs plugin, make sure that the config for the 
hdf5 format is as shown below:


"hdf5": {
  "type": "hdf5",
  "extensions": [
    "h5"
  ],
  "showPreview": true
}

2.  Run a SELECT * query on your HDF5 file and report back what the results 
look like.

A word about the HDF5 plugin: the preview column you are looking for is only 
meant to give a sample of the data.  If your data set is large, it will be 
truncated in that view.  Also, if I remember correctly, the name “int_data” 
is the actual name of that column from the dataset.
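
If the preview column does come back once showPreview is set, something like 
the query below should unpack it.  This is an untested sketch: the workspace 
and file name are taken from your message, and the nested FLATTEN assumes 
int_data is returned as a list of lists, as in your 1.19.0 output.

```sql
-- Untested sketch; assumes showPreview is enabled and int_data is a list of lists.
-- The inner FLATTEN yields one row per inner array, the outer one row per value.
SELECT FLATTEN(t.row_values) AS int_value
FROM (
  SELECT FLATTEN(int_data) AS row_values
  FROM dfs.test.`dset.h5`
) AS t;
```

Keep in mind that this only sees whatever the preview contains, so it inherits 
the truncation mentioned above.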

The better way to query your data is to use the defaultPath option, which 
lets you query a dataset inside an HDF5 file as a table:

SELECT int_col_0, int_col_1 
FROM table(dfs.`hdf5/scalar.h5` (type => 'hdf5', defaultPath => '/nd/3D'))
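
Applied to the file from your message, a defaultPath query would look roughly 
like the one below.  The dataset path /dset and the dfs.test workspace come 
from your SELECT * output; I have not run this against your file, so treat it 
as a sketch.

```sql
-- Untested sketch; '/dset' is the dataset path shown in the quoted SELECT * output.
SELECT *
FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'));
```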

Best,
— C


> On Mar 5, 2026, at 15:46, Tore Van Grembergen via user 
> <[email protected]> wrote:
> 
> Hi Team,
> 
> I am looking into using the apache drill capabilities for querying H5 data.
> The documentation on this as provided on the site 
> https://drill.apache.org/docs/hdf5-format-plugin/ works for version 1.19.0, 
> however not as of 1.20.0.
> The column where the actual data is mapped into seems to be no longer 
> available.
> 
> e.g. the column int_data as per below example is no longer there .
> 
> apache drill> select * from dfs.test.`dset.h5`;
> |-------|-----------|-----------|-----------|---------------|--------------|------------------|-------------------|------------|--------------------------------------------------------------------------|
> | path  | data_type | file_name | data_size | element_count | is_timestamp | is_time_duration | dataset_data_type | dimensions | int_data                                                                 |
> |-------|-----------|-----------|-----------|---------------|--------------|------------------|-------------------|------------|--------------------------------------------------------------------------|
> | /dset | DATASET   | dset.h5   | 96        | 24            | false        | false            | INTEGER           | [4, 6]     | [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
> |-------|-----------|-----------|-----------|---------------|--------------|------------------|-------------------|------------|--------------------------------------------------------------------------|
> 
> 
> I have read somewhere that a parameter in the workspace definition 
> "showPreview" : true should restore the original way of working, however when 
> trying to save this parameter, it is automagically removed.
> (remark : the environment is running the apache/drill image in a docker 
> container, the config is stored on a mounted drive)
> 
> The reason for needing the int_data and double_data columns is that the 
> datasets often contain too many values, and the number of values in a field 
> is not known up front.
> Hence the "column" approach in the select * from table(xyz) is not workable.
> It is necessary to be able to do e.g. select flatten(int_data) as int_data 
> from dfs.test.`dset.h5`;
> 
> Is there a way to get this (re)activated in Apache Drill 1.22 and successors?
> 
> All help is much appreciated.
> 
> Kind regards
> 
> Tore
> 
>
