Hello folks, I have been testing Iceberg reads with and without stats built into the Iceberg dataset manifest, and found that there's a huge jump in network traffic when stats are present.
In my test I am comparing two datasets, both written in Iceberg format: one with and one without stats collected in the Iceberg manifests. The only difference between the writers used for the two datasets is this PR: https://github.com/apache/incubator-iceberg/pull/63/files , which uses Iceberg's writers for writing the Parquet data. I captured tcpdump from query scans run against the two datasets. The partition being scanned contains 1 manifest, 1 Parquet data file, and ~3700 rows in both datasets. There is a 30x jump in network traffic to the remote filesystem (ADLS) when I switch to the stats-based dataset. Both queries used the same Iceberg reader code to access both datasets.

```
root@d69e104e7d40:/usr/local/spark# tcpdump -r iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap | grep perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
reading from file iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap, link-type EN10MB (Ethernet)
8844
root@d69e104e7d40:/usr/local/spark# tcpdump -r iceberg_scratch_pad_demo_11_batch_query.pcap | grep perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
reading from file iceberg_scratch_pad_demo_11_batch_query.pcap, link-type EN10MB (Ethernet)
269708
```

As a consequence, query response times suffer drastically (illustrated below). I must confess that I am on a slow internet connection, reaching the remote FS over VPN, but the relative difference is stark: the dataset without stats took just 1m 49s, while the dataset with stats took 26m 48s to read the same amount of data. Most of that time on the stats-based dataset was spent in split planning, i.e. reading the manifest and evaluating stats.

```
all=> select count(*) from iceberg_geo1_metrixx_qc_postvalues where batchId = '4a6f95abac924159bb3d7075373395c9';
 count(1)
----------
     3627
(1 row)

Time: 109673.202 ms (01:49.673)

all=> select count(*) from iceberg_scratch_pad_demo_11 where _ACP_YEAR=2018 and _ACP_MONTH=01 and _ACP_DAY=01 and batchId = '6d50eeb3e7d74b4f99eea91a27fc8f15';
 count(1)
----------
     3808
(1 row)

Time: 1608058.616 ms (26:48.059)
```
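To confirm that the overhead sits in split planning rather than in the data read itself, the planning phase can be timed in isolation. Below is a rough sketch, not our actual reader code: it assumes the table loads via HadoopTables, takes the table location as a command-line argument, and just reuses the batchId literal from the query above.

```
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class PlanTimer {
  public static void main(String[] args) throws Exception {
    // args[0] is a placeholder for the table location on the remote FS (ADLS).
    Table table = new HadoopTables(new Configuration()).load(args[0]);

    long start = System.nanoTime();
    int tasks = 0;
    // planFiles() is where the manifest is read and, when stats are present,
    // the filter is evaluated against each data file's column bounds. No
    // Parquet data is read here, so this times only split planning.
    try (CloseableIterable<FileScanTask> plan = table.newScan()
        .filter(Expressions.equal("batchId", "6d50eeb3e7d74b4f99eea91a27fc8f15"))
        .planFiles()) {
      for (FileScanTask task : plan) {
        tasks++;
      }
    }
    System.out.printf("planned %d tasks in %.1fs%n",
        tasks, (System.nanoTime() - start) / 1e9);
  }
}
```

If the slowdown really is in split planning, roughly all of the extra time (and the extra traffic to ADLS) should show up inside planFiles() here, with nothing attributable to the actual scan.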
Has anyone faced this? I'm wondering if there's some caching or parallelism option here that could be leveraged. Would appreciate some guidance. If there isn't a straightforward fix and others feel this is an issue, I can raise an issue and look into it further.

Cheers,
-Gautam.