No, we haven’t experienced it yet. The manifest size is huge in your case. I think Ryan is right: it is either big lower/upper bounds (then truncation will help) or a big number of columns (then collecting lower/upper bounds only for specific columns will help). I think both optimizations are needed and will reduce the manifest size.
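To make the truncation idea concrete, here is a rough, untested sketch of what truncating string lower/upper bounds could look like (the helper below is hypothetical, not the change proposed in #113, and it ignores UTF-16 surrogate edge cases):

```
import java.util.Optional;

public class BoundTruncation {
  // Upper bound: cut to a prefix and bump the last character that can be
  // incremented, so the result still compares >= the original upper bound.
  static Optional<String> truncateUpperBound(String value, int maxLength) {
    if (value.length() <= maxLength) {
      return Optional.of(value);
    }
    char[] prefix = value.substring(0, maxLength).toCharArray();
    for (int i = prefix.length - 1; i >= 0; i--) {
      if (prefix[i] != Character.MAX_VALUE) {
        prefix[i]++;
        return Optional.of(new String(prefix, 0, i + 1));
      }
    }
    return Optional.empty(); // every char is already at the max; keep no bound
  }

  // Lower bound: a plain prefix always compares <= the original value,
  // so simple truncation is enough.
  static String truncateLowerBound(String value, int maxLength) {
    return value.length() <= maxLength ? value : value.substring(0, maxLength);
  }
}
```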
Since you mentioned you have a lot of columns and we collect bounds for nested struct fields, I am wondering if you could revert [1] locally and compare the manifest size.

[1] - https://github.com/apache/incubator-iceberg/commit/c383dd87a89e35d622e9c458fd711931cbc5e96f

> On 19 Apr 2019, at 15:42, Gautam <gautamkows...@gmail.com> wrote:
>
> Thanks for responding Anton! Do we think the delay is mainly due to lower/upper bound filtering? Have you faced this? I haven't exactly found where the slowness is yet. It's generally due to the stats filtering, but I'm not sure which part of it is causing this much network traffic. There's a CloseableIterable that takes a ton of time on the next() and hasNext() calls. My guess is that the expression evaluation on each manifest entry is what's doing it.
>
> On Fri, Apr 19, 2019 at 1:41 PM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
> I think we need to have a list of columns for which we want to collect stats, and that should be configurable by the user. Maybe this config should be applicable only to lower/upper bounds. As we now collect stats even for nested struct fields, this might generate a lot of data. In most cases, users cluster/sort their data by a subset of data columns to have fast queries with predicates on those columns. So, being able to configure the columns for which to collect lower/upper bounds seems reasonable.
>
>> On 19 Apr 2019, at 08:03, Gautam <gautamkows...@gmail.com> wrote:
>>
>> > The length in bytes of the schema is 109M as compared to 687K of the non-stats dataset.
>>
>> Typo: length in bytes of the *manifest*. The schema is the same.
>>
>> On Fri, Apr 19, 2019 at 12:16 PM Gautam <gautamkows...@gmail.com> wrote:
>> Correction, partition count = 4308.
>>
>> > Re: Changing the way we keep stats. Avro is a block-splittable format and is friendly with parallel compute frameworks like Spark.
>>
>> Here I am trying to say that we don't need to change the format to columnar, right? The current format is already friendly for parallelization.
>>
>> thanks.
>>
>> On Fri, Apr 19, 2019 at 12:12 PM Gautam <gautamkows...@gmail.com> wrote:
>> Ah, my bad. I missed adding the schema details. Here are some details on the dataset with stats:
>>
>> Iceberg Schema Columns : 20
>> Spark Schema fields : 20
>> Snapshot Summary : {added-data-files=4308, added-records=11494037, changed-partition-count=4308, total-records=11494037, total-data-files=4308}
>> Manifest files : 1
>> Manifest details:
>>  => manifest file path: adl://[dataset_base_path]/metadata/4bcda033-9df5-4c84-8eef-9d6ef93e4347-m0.avro
>>  => manifest file length: 109,028,885
>>  => existing files count: 0
>>  => added files count: 4308
>>  => deleted files count: 0
>>  => partitions count: 4
>>  => partition fields count: 4
>>
>> Re: Num data files. It has a single manifest keeping track of 4308 files. Total record count is 11.4 million.
>>
>> Re: Columns. You are right that this table has many columns: although it has only 20 top-level columns, the number of leaf columns is in the order of thousands. This schema is heavy on structs (in the thousands) and has deep levels of nesting.
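(For reference, a listing like the "Manifest details" above can be produced directly from the snapshot API; this is a from-memory sketch, so exact method names may differ between Iceberg versions.)

```
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class DumpManifests {
  public static void main(String[] args) {
    // args[0] is the table location; HadoopTables works for path-based tables
    Table table = new HadoopTables(new Configuration()).load(args[0]);
    for (ManifestFile manifest : table.currentSnapshot().manifests()) {
      System.out.println("manifest file path:   " + manifest.path());
      System.out.println("manifest file length: " + manifest.length());
      System.out.println("existing files count: " + manifest.existingFilesCount());
      System.out.println("added files count:    " + manifest.addedFilesCount());
      System.out.println("deleted files count:  " + manifest.deletedFilesCount());
    }
  }
}
```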
>> I know Iceberg keeps column_sizes, value_counts, and null_value_counts for all leaf fields, and additionally lower-bounds and upper-bounds for native and struct types (not yet for map KVs and arrays). The length in bytes of the schema is 109M as compared to 687K for the non-stats dataset.
>>
>> Re: Turning off stats. I am looking to leverage stats because, for our datasets with a much larger number of data files, we want to leverage Iceberg's ability to skip entire files based on these stats. This is one of the big incentives for us to use Iceberg.
>>
>> Re: Changing the way we keep stats. Avro is a block-splittable format and is friendly with parallel compute frameworks like Spark. So would it make sense, for instance, to add an option to have a Spark job / Futures handle split planning? In a larger context, 109M is not that much metadata, given that Iceberg is meant for datasets where the metadata itself is big-data scale. I'm curious how folks with larger-sized metadata (in GB) are optimizing this today.
>>
>> Cheers,
>> -Gautam.
>>
>> On Fri, Apr 19, 2019 at 12:40 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>> Thanks for bringing this up! My initial theory is that this table has a ton of stats data that you have to read. That could happen in a couple of cases.
>>
>> First, you might have large values in some columns. Parquet will suppress its stats if values are larger than 4k, and those are what Iceberg uses. But that could still cause you to store two 1k+ objects for each large column (lower and upper bounds). With a lot of data files, that could add up quickly. The solution here is to implement #113 (https://github.com/apache/incubator-iceberg/issues/113) so that we don't store the actual min and max for string or binary columns, but instead a truncated value that is just above or just below.
>>
>> The second case is when you have a lot of columns. Each column stores both a lower and an upper bound, so 1,000 columns could easily take 8k per file. If this is the problem, then maybe we want to have a way to turn off column stats. We could also think of ways to change how stats are stored in the manifest files, but that only helps if we move to a columnar format to store manifests, so this is probably not a short-term fix.
>>
>> If you can share a bit more information about this table, we can probably tell which one is the problem. I'm guessing it is the large values problem.
>>
>> On Thu, Apr 18, 2019 at 11:52 AM Gautam <gautamkows...@gmail.com> wrote:
>> Hello folks,
>>
>> I have been testing Iceberg reading with and without stats built into the Iceberg dataset manifests and found that there's a huge jump in network traffic with the latter.
>>
>> In my test I am comparing two Iceberg datasets, both written in Iceberg format: one with and the other without stats collected in the Iceberg manifests. In particular, the difference between the writers used for the two datasets is this PR: https://github.com/apache/incubator-iceberg/pull/63/files which uses Iceberg's writers for writing Parquet data. I captured tcpdump from query scans run on these two datasets. The partition being scanned contains 1 manifest, 1 parquet data file, and ~3700 rows in both datasets.
>> There's a 30x jump in network traffic to the remote filesystem (ADLS) when I switch to the stats-based Iceberg dataset. Both queries used the same Iceberg reader code to access both datasets.
>>
>> ```
>> root@d69e104e7d40:/usr/local/spark# tcpdump -r iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap | grep perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
>> reading from file iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap, link-type EN10MB (Ethernet)
>> 8844
>>
>> root@d69e104e7d40:/usr/local/spark# tcpdump -r iceberg_scratch_pad_demo_11_batch_query.pcap | grep perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
>> reading from file iceberg_scratch_pad_demo_11_batch_query.pcap, link-type EN10MB (Ethernet)
>> 269708
>> ```
>>
>> As a consequence, the query response times are affected drastically (illustrated below). I must confess that I am on a slow internet connection via VPN connecting to the remote FS, but the dataset without stats took just 1m 49s while the dataset with stats took 26m 48s to read the same-sized data. Most of that time in the latter dataset was spent in split planning: manifest reading and stats evaluation.
>>
>> ```
>> all=> select count(*) from iceberg_geo1_metrixx_qc_postvalues where batchId = '4a6f95abac924159bb3d7075373395c9';
>>  count(1)
>> ----------
>>      3627
>> (1 row)
>> Time: 109673.202 ms (01:49.673)
>>
>> all=> select count(*) from iceberg_scratch_pad_demo_11 where _ACP_YEAR=2018 and _ACP_MONTH=01 and _ACP_DAY=01 and batchId = '6d50eeb3e7d74b4f99eea91a27fc8f15';
>>  count(1)
>> ----------
>>      3808
>> (1 row)
>> Time: 1608058.616 ms (26:48.059)
>> ```
>>
>> Has anyone faced this? I'm wondering if there's some caching or parallelism option here that can be leveraged. Would appreciate some guidance. If there isn't a straightforward fix and others feel this is an issue, I can raise an issue and look into it further.
>>
>> Cheers,
>> -Gautam.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
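To help narrow down where the time goes (manifest reading and stats evaluation vs. the data read itself), something like the rough sketch below times planFiles() in isolation. The table location and filter column are placeholders for your setup, and exact method names may differ between Iceberg versions:

```
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class TimePlanning {
  public static void main(String[] args) throws Exception {
    // args[0] = table location, args[1] = batchId value to filter on
    Table table = new HadoopTables(new Configuration()).load(args[0]);

    long start = System.nanoTime();
    int tasks = 0;
    // planFiles() is where manifests are read and lower/upper bounds are evaluated
    try (CloseableIterable<FileScanTask> files = table.newScan()
        .filter(Expressions.equal("batchId", args[1]))
        .planFiles()) {
      for (FileScanTask task : files) {
        tasks++;
      }
    }
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("planned " + tasks + " file scan tasks in " + elapsedMs + " ms");
  }
}
```

If most of the 26m 48s shows up here rather than in the actual data scan, that would confirm the stats evaluation during split planning is the bottleneck.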