Hi,

I have a sample data set (a few million records) that is saved to Parquet
in two ways: a simple file structure that uses primitive types (String,
Double) to store dimensions and metrics, and a structure that uses nested
maps (String,String and String,Double respectively).

Querying the data set with the simple types only:

select roundTimeStamp(s.occurred_at,'PT1H') as `at`, sum(metrics_price) as
price, sum(metrics_kwh) as kwh from
dfs.asa.`/processed/etactica-dev-p1/entitysamples/metrics/D2017*` as s
group by roundTimeStamp(s.occurred_at,'PT1H')


takes: *28.442* sec. (on a single dev laptop)


Same query against the nested structure:

select roundTimeStamp(s.occurred_at,'PT1H') as `at`, sum(s.metrics.price)
as price, sum(s.metrics.kwh) as kwh from
dfs.asa.`/processed/etactica-dev-p1/entitysamples/metrics/D2017*` as s
group by roundTimeStamp(s.occurred_at,'PT1H')

takes: *719.810* sec.

Even counting the number of records takes very, very long if there is a
nested structure involved. (select count(*) from)

It does not behave like this on our production servers (1.8), but I have not
run this particular test on them (their performance has never been an
issue).

I have these sample files available if anyone wishes to reproduce this
consistently.

Regards,
 -Stefán
