Ah, looks like MergingSnapshotUpdate.mergeGroup() has the relevant logic. It
preserves the natural order of manifests, so it groups them based on when they
were created; in other words, the answer is whatever order the commits were
done in. If batches from multiple days were committed out of order, a merged
manifest could end up spanning multiple days.
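
Roughly, the behavior I'm describing would look something like this (a
simplified sketch of size-based grouping in commit order, not the actual
MergingSnapshotUpdate code; the method name is made up and `targetSizeBytes`
stands in for commit.manifest.target-size-bytes):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.ManifestFile;

public class MergeGroupSketch {
  // Bin manifests by accumulated size, walking them in commit (natural) order.
  // Each bin would become one merged manifest, so a bin can mix partitions
  // (e.g. days) if the commits arrived out of order.
  static List<List<ManifestFile>> groupByCommitOrder(
      List<ManifestFile> manifestsInCommitOrder, long targetSizeBytes) {
    List<List<ManifestFile>> groups = new ArrayList<>();
    List<ManifestFile> current = new ArrayList<>();
    long currentSize = 0;

    for (ManifestFile manifest : manifestsInCommitOrder) {
      current.add(manifest);
      currentSize += manifest.length();
      if (currentSize >= targetSizeBytes) {   // e.g. the 8 MB default target
        groups.add(current);
        current = new ArrayList<>();
        currentSize = 0;
      }
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```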


On Thu, May 2, 2019 at 2:23 PM Gautam <gautamkows...@gmail.com> wrote:

> Ok, thanks for the tip on not having to be tied to a hierarchical
> partition spec.
>
> Although this still doesn't explain why all the manifests are scanned,
> because it should be pruning the list of manifests and it's not. Is my
> understanding correct that the manifest grouping might be re-shuffling the
> days, so that a query on one day could map to all manifests? Does manifest
> merging optimize for partition boundaries, or is it based on the manifests'
> natural order?
>
> On Thu, May 2, 2019 at 2:06 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> You also don't need to use year, month, and day. You can just use day.
>>
>> The time-based partition functions all produce ordinals, not local
>> values: month(Jan 1970) = 0 and month(Jan 1972) = 24. Same thing with day
>> and hour. In fact, I should open a PR to throw an exception when there are
>> duplicate partition functions...
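>>
>> For example, a quick way to see what those ordinals look like (plain
>> java.time, just for illustration -- this is not Iceberg's transform code):
>>
>> ```java
>> import java.time.LocalDate;
>> import java.time.YearMonth;
>> import java.time.temporal.ChronoUnit;
>>
>> public class TransformOrdinals {
>>   public static void main(String[] args) {
>>     // day() and month() produce counts since the Unix epoch, so a single
>>     // day partition value already implies the year and month.
>>     long dayOrdinal =
>>         ChronoUnit.DAYS.between(LocalDate.of(1970, 1, 1), LocalDate.of(2019, 1, 1));
>>     long monthOrdinal =
>>         ChronoUnit.MONTHS.between(YearMonth.of(1970, 1), YearMonth.of(2019, 1));
>>     System.out.println(dayOrdinal);    // 17897
>>     System.out.println(monthOrdinal);  // 588
>>   }
>> }
>> ```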
>>
>> On Thu, May 2, 2019 at 1:52 PM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> FYI, the test partition spec is:
>>> [
>>>   YEAR: identity(21)
>>>   MONTH: identity(22)
>>>   DAY: identity(23)
>>>   batchId: identity(24)
>>> ]
>>>
>>>
>>>
>>> On Thu, May 2, 2019 at 1:46 PM Gautam <gautamkows...@gmail.com> wrote:
>>>
>>>> > Using those, you should be able to control parallelism. If you want
>>>> to test with 4,000, then you can set the min count to 5,000 so Iceberg
>>>> won’t compact manifests.
>>>>
>>>> This is helpful. Thanks for the pointer on increasing parallelism. Will
>>>> try this out. So that I understand the behaviour: if a different dataset
>>>> has >= 5,000 batches, would the resultant number of manifests be
>>>> (total_num_batches % 5000)?
>>>>
>>>> > What surprises me is that you’re not getting much benefit from
>>>> filtering out manifests that aren’t helpful. We see a lot of benefit from
>>>> it.
>>>>
>>>> Pardon the verbose example, but I think it'll help explain what I'm
>>>> seeing.
>>>>
>>>> Regarding manifest filtering: I tested whether partition filters in the
>>>> SQL query actually reduce the number of manifests being inspected. In my
>>>> example, I have 16 manifests that point to 4000 batch partitions (each
>>>> file is restricted to one partition, as we're using physical partitioning
>>>> in the table). So when querying for .. WHERE batchId = 'xyz' .. at most 1
>>>> manifest should be read, because 1 batch == 1 file, which should be
>>>> tracked by 1 manifest (among the 16), right? But I see that all 16 are
>>>> being inspected in BaseTableScan.planFiles(). Correct me if I'm wrong,
>>>> but it's this call [1] that should be giving me the manifests that match
>>>> a partition. When I inspect it, it says `matchingManifests = 16`, which
>>>> is all the manifests in the table. This *could* be due to the fact that
>>>> our batch ids are random UUIDs, so lower/upper bounds may not help
>>>> because there's no inherent ordering amongst batches.
>>>> But then I tried year = 2019 and month = 01 and day = 01, which also
>>>> scanned all manifests. Could this be due to the way Iceberg manifests are
>>>> re-grouped and merged? If so, shouldn't re-grouping honour partition
>>>> boundaries and optimize for them?
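>>>>
>>>> For reference, this is roughly how the scan is driven (a sketch; `table`
>>>> is the already-loaded Iceberg Table and the batch id is just an example
>>>> value):
>>>>
>>>> ```java
>>>> import org.apache.iceberg.FileScanTask;
>>>> import org.apache.iceberg.expressions.Expressions;
>>>> import org.apache.iceberg.io.CloseableIterable;
>>>>
>>>> // With one file per batch, this filter should prune down to the single
>>>> // manifest whose partition bounds can contain the batchId -- instead,
>>>> // all 16 manifests show up as matching.
>>>> CloseableIterable<FileScanTask> tasks = table.newScan()
>>>>     .filter(Expressions.equal("batchId", "xyz"))
>>>>     .planFiles();
>>>> ```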
>>>>
>>>>
>>>> Cheers,
>>>> -Gautam.
>>>>
>>>> [1] -
>>>> https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTableScan.java#L173
>>>>
>>>>
>>>> On Thu, May 2, 2019 at 12:27 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Good questions. Grouping manifests is configurable at the table level.
>>>>> There are 2 settings:
>>>>>
>>>>>    - commit.manifest.target-size-bytes defaults to 8MB, this is the
>>>>>    target size that Iceberg will compact to
>>>>>    - commit.manifest.min-count-to-merge defaults to 100, this is the
>>>>>    minimum number of files before a compaction is triggered
>>>>>
>>>>> Using those, you should be able to control parallelism. If you want to
>>>>> test with 4,000, then you can set the min count to 5,000 so Iceberg won’t
>>>>> compact manifests.
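>>>>>
>>>>> For example, something like this on the table should do it (a sketch,
>>>>> assuming `table` is the already-loaded org.apache.iceberg.Table):
>>>>>
>>>>> ```java
>>>>> // Raise the merge threshold so up to 5,000 manifests are left alone
>>>>> // instead of being compacted on commit.
>>>>> table.updateProperties()
>>>>>     .set("commit.manifest.min-count-to-merge", "5000")
>>>>>     .commit();
>>>>> ```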
>>>>>
>>>>> What surprises me is that you’re not getting much benefit from
>>>>> filtering out manifests that aren’t helpful. We see a lot of benefit from
>>>>> it. You might try sorting the data files by partition before adding them 
>>>>> to
>>>>> the table. That will cluster data files in the same partition so you can
>>>>> read fewer manifests.
>>>>>
>>>>> On Thu, May 2, 2019 at 12:09 PM Gautam <gautamkows...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey Anton,
>>>>>>             Sorry about the delay on this; I've been caught up with
>>>>>> some other things. Thanks for raising issue #173.
>>>>>>
>>>>>> So the root cause is indeed the density and size of the schema. I
>>>>>> agree that the option to configure stats per column is good, although
>>>>>> I'm not fully convinced this is purely due to lower/upper bounds; for
>>>>>> instance, maybe it's just taking a while to iterate over manifest rows
>>>>>> and deserialize the DataFile stats on each read. The solution I'm using
>>>>>> right now is to parallelize the manifest reading in split planning. We
>>>>>> regenerated the Iceberg table with more manifests, and the code now
>>>>>> enables the ParallelIterator, which uses a worker pool of threads (1
>>>>>> thread per CPU by default, configurable using
>>>>>> 'iceberg.worker.num-threads') to read manifests.
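>>>>>>
>>>>>> In case it helps anyone else, the pool size is read from a JVM system
>>>>>> property, so it has to be set on the driver before the first scan is
>>>>>> planned (a sketch, not production config):
>>>>>>
>>>>>> ```java
>>>>>> // Equivalent to passing -Diceberg.worker.num-threads=16 to the driver
>>>>>> // JVM (e.g. via spark.driver.extraJavaOptions); must run before the
>>>>>> // worker pool is first created.
>>>>>> System.setProperty("iceberg.worker.num-threads", "16");
>>>>>> ```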
>>>>>>
>>>>>> On that note, the ability to parallelize is limited by how many
>>>>>> manifests are in the table. So as a test, for a table with 4000 files
>>>>>> we created one manifest per file (think of one file as a single batch
>>>>>> commit in this case). I was hoping to get a parallelism factor of 4000,
>>>>>> but Iceberg summarizes manifests into fewer manifests with each commit,
>>>>>> so we instead ended up with 16 manifests. Split planning is therefore
>>>>>> limited to at most 16 units of parallelism. Is this grouping of
>>>>>> manifests into fewer manifests configurable? If not, should we make it
>>>>>> configurable?
>>>>>>
>>>>>> Sorry if this is forking a different conversation. If so, I can start
>>>>>> a separate conversation thread on this.
>>>>>>
>>>>>> On Wed, May 1, 2019 at 9:42 PM Anton Okolnychyi <
>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>
>>>>>>> Hey Gautam,
>>>>>>>
>>>>>>> Out of my curiosity, did you manage to confirm the root cause of the
>>>>>>> issue?
>>>>>>>
>>>>>>> P.S. I created [1] so that we can make collection of lower/upper
>>>>>>> bounds configurable.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anton
>>>>>>>
>>>>>>> [1] - https://github.com/apache/incubator-iceberg/issues/173
>>>>>>>
>>>>>>> On 22 Apr 2019, at 09:15, Gautam <gautamkows...@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks guys for the insights ..
>>>>>>>
>>>>>>> > I like Anton's idea to have an optional list of columns for which
>>>>>>> we keep stats. That would allow us to avoid storing stats for thousands 
>>>>>>> of
>>>>>>> columns that won't ever be used. Another option here is to add a flag to
>>>>>>> keep stats only for top-level columns. That's much less configuration 
>>>>>>> for
>>>>>>> users and probably does the right thing in many cases. Simpler to use 
>>>>>>> but
>>>>>>> not as fast in all cases is sometimes a good compromise.
>>>>>>>
>>>>>>> This makes sense to me. It adds a variable that data pipelines can
>>>>>>> tweak to improve performance. I will add an issue on GitHub to add a
>>>>>>> stats config/flag. Having said that, I would still try to optimize
>>>>>>> around this, because read patterns are hardly ever known a priori and
>>>>>>> adding a column to this list means having to rewrite the entire
>>>>>>> dataset. So I'll try the other suggestion, which is parallelizing over
>>>>>>> multiple manifests.
>>>>>>>
>>>>>>> >  To clarify my comment on changing the storage: the idea is to use
>>>>>>> separate columns instead of a map and then use a columnar storage 
>>>>>>> format so
>>>>>>> we can project those columns independently. Avro can't project columns
>>>>>>> independently. This wouldn't help on the write side and may just cause a
>>>>>>> lot of seeking on the read side that diminishes the benefits.
>>>>>>>
>>>>>>> Gotcha.
>>>>>>>
>>>>>>> > Also, now that we have more details, I think there is a second
>>>>>>> problem. Because we expect several manifests in a table, we parallelize
>>>>>>> split planning on manifests instead of splits of manifest files. This
>>>>>>> planning operation is happening in a single thread instead of in 
>>>>>>> parallel.
>>>>>>> I think if you split the write across several manifests, you'd improve 
>>>>>>> wall
>>>>>>> time.
>>>>>>>
>>>>>>> This might actually be the issue here: this was a test-bench dataset,
>>>>>>> so the writer job created a single manifest for all the data in the
>>>>>>> dataset, which isn't really how we will do things in prod. I'll try to
>>>>>>> create the metadata based on production's expected commit pattern.
>>>>>>>
>>>>>>>
>>>>>>> Regarding Iceberg not truncating large column bound values (
>>>>>>> https://github.com/apache/incubator-iceberg/issues/113): I hadn't
>>>>>>> considered this with our dataset. The current evidence points towards
>>>>>>> the number of columns and the sheer number of files that the manifest
>>>>>>> is tracking, but this is a good thing to look into.
>>>>>>>
>>>>>>> Thanks again guys.
>>>>>>>
>>>>>>> -Gautam.
>>>>>>>
>>>>>>> On Fri, Apr 19, 2019 at 9:05 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> I like Anton's idea to have an optional list of columns for which
>>>>>>>> we keep stats. That would allow us to avoid storing stats for 
>>>>>>>> thousands of
>>>>>>>> columns that won't ever be used. Another option here is to add a flag 
>>>>>>>> to
>>>>>>>> keep stats only for top-level columns. That's much less configuration 
>>>>>>>> for
>>>>>>>> users and probably does the right thing in many cases. Simpler to use 
>>>>>>>> but
>>>>>>>> not as fast in all cases is sometimes a good compromise.
>>>>>>>>
>>>>>>>> To clarify my comment on changing the storage: the idea is to use
>>>>>>>> separate columns instead of a map and then use a columnar storage 
>>>>>>>> format so
>>>>>>>> we can project those columns independently. Avro can't project columns
>>>>>>>> independently. This wouldn't help on the write side and may just cause 
>>>>>>>> a
>>>>>>>> lot of seeking on the read side that diminishes the benefits.
>>>>>>>>
>>>>>>>> Also, now that we have more details, I think there is a second
>>>>>>>> problem. Because we expect several manifests in a table, we parallelize
>>>>>>>> split planning on manifests instead of splits of manifest files. This
>>>>>>>> planning operation is happening in a single thread instead of in 
>>>>>>>> parallel.
>>>>>>>> I think if you split the write across several manifests, you'd improve 
>>>>>>>> wall
>>>>>>>> time.
>>>>>>>>
>>>>>>>> On Fri, Apr 19, 2019 at 8:15 AM Anton Okolnychyi <
>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>
>>>>>>>>> No, we haven’t experienced it yet. The manifest size is huge in
>>>>>>>>> your case. To me, Ryan is correct: it might be either big lower/upper
>>>>>>>>> bounds (then truncation will help) or a big number of columns (then collecting
>>>>>>>>> collecting
>>>>>>>>> lower/upper bounds only for specific columns will help). I think both
>>>>>>>>> optimizations are needed and will reduce the manifest size.
>>>>>>>>>
>>>>>>>>> Since you mentioned you have a lot of columns and we collect
>>>>>>>>> bounds for nested struct fields, I am wondering if you could revert 
>>>>>>>>> [1]
>>>>>>>>> locally and compare the manifest size.
>>>>>>>>>
>>>>>>>>> [1] -
>>>>>>>>> https://github.com/apache/incubator-iceberg/commit/c383dd87a89e35d622e9c458fd711931cbc5e96f
>>>>>>>>>
>>>>>>>>> On 19 Apr 2019, at 15:42, Gautam <gautamkows...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for responding, Anton! Do we think the delay is mainly due
>>>>>>>>> to lower/upper bound filtering? Have you faced this? I haven't
>>>>>>>>> pinpointed where the slowness is yet. It's generally due to the
>>>>>>>>> stats filtering, but I'm not sure what part of it is causing this
>>>>>>>>> much network traffic. There's a CloseableIterable that takes a ton
>>>>>>>>> of time in the next() and hasNext() calls. My guess is that the
>>>>>>>>> expression evaluation on each manifest entry is what's doing it.
>>>>>>>>>
>>>>>>>>> On Fri, Apr 19, 2019 at 1:41 PM Anton Okolnychyi <
>>>>>>>>> aokolnyc...@apple.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think we need to have a list of columns for which we want to
>>>>>>>>>> collect stats and that should be configurable by the user. Maybe, 
>>>>>>>>>> this
>>>>>>>>>> config should be applicable only to lower/upper bounds. As we now 
>>>>>>>>>> collect
>>>>>>>>>> stats even for nested struct fields, this might generate a lot of 
>>>>>>>>>> data. In
>>>>>>>>>> most cases, users cluster/sort their data by a subset of data 
>>>>>>>>>> columns to
>>>>>>>>>> have fast queries with predicates on those columns. So, being able to
>>>>>>>>>> configure columns for which to collect lower/upper bounds seems 
>>>>>>>>>> reasonable.
>>>>>>>>>>
>>>>>>>>>> On 19 Apr 2019, at 08:03, Gautam <gautamkows...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> >  The length in bytes of the schema is 109M as compared to 687K
>>>>>>>>>> of the non-stats dataset.
>>>>>>>>>>
>>>>>>>>>> Typo: length in bytes of the *manifest*. The schema is the same.
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 19, 2019 at 12:16 PM Gautam <gautamkows...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Correction, partition count = 4308.
>>>>>>>>>>>
>>>>>>>>>>> > Re: Changing the way we keep stats. Avro is a block splittable
>>>>>>>>>>> format and is friendly with parallel compute frameworks like Spark.
>>>>>>>>>>>
>>>>>>>>>>> Here I am trying to say that we don't need to change the format
>>>>>>>>>>> to columnar, right? The current format is already friendly to
>>>>>>>>>>> parallelization.
>>>>>>>>>>>
>>>>>>>>>>> thanks.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 19, 2019 at 12:12 PM Gautam <gautamkows...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ah, my bad, I missed adding the schema details. Here are some
>>>>>>>>>>>> details on the dataset with stats:
>>>>>>>>>>>>
>>>>>>>>>>>>  Iceberg Schema Columns : 20
>>>>>>>>>>>>  Spark Schema fields : 20
>>>>>>>>>>>>  Snapshot Summary :{added-data-files=4308,
>>>>>>>>>>>> added-records=11494037, changed-partition-count=4308,
>>>>>>>>>>>> total-records=11494037, total-data-files=4308}
>>>>>>>>>>>>  Manifest files :1
>>>>>>>>>>>>  Manifest details:
>>>>>>>>>>>>      => manifest file path:
>>>>>>>>>>>> adl://[dataset_base_path]/metadata/4bcda033-9df5-4c84-8eef-9d6ef93e4347-m0.avro
>>>>>>>>>>>>      => manifest file length: 109,028,885
>>>>>>>>>>>>      => existing files count: 0
>>>>>>>>>>>>      => added files count: 4308
>>>>>>>>>>>>      => deleted files count: 0
>>>>>>>>>>>>      => partitions count: 4
>>>>>>>>>>>>      => partition fields count: 4
>>>>>>>>>>>>
>>>>>>>>>>>> Re: Num data files. It has a single manifest keeping track of
>>>>>>>>>>>> 4308 files. The total record count is 11.4 million.
>>>>>>>>>>>>
>>>>>>>>>>>> Re: Columns. You are right that this table has many columns:
>>>>>>>>>>>> although it has only 20 top-level columns, the number of leaf
>>>>>>>>>>>> columns is in the thousands. The schema is heavy on structs (in
>>>>>>>>>>>> the thousands) and has deep levels of nesting. I know Iceberg
>>>>>>>>>>>> keeps *column_sizes, value_counts, null_value_counts* for all
>>>>>>>>>>>> leaf fields and additionally *lower-bounds, upper-bounds* for
>>>>>>>>>>>> native and struct types (not yet for map KVs and arrays). The
>>>>>>>>>>>> length in bytes of the schema is 109M as compared to 687K for the
>>>>>>>>>>>> non-stats dataset.
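>>>>>>>>>>>>
>>>>>>>>>>>> A rough back-of-envelope from those numbers (the leaf-field count
>>>>>>>>>>>> below is only a guess at "a couple thousand"):
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>> 109,028,885 bytes / 4,308 data files ≈ 25 KB of manifest metadata per file
>>>>>>>>>>>> 25 KB / ~2,000 leaf fields           ≈ 13 bytes per field for sizes,
>>>>>>>>>>>>                                        counts and bounds, per data file
>>>>>>>>>>>> ```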
>>>>>>>>>>>>
>>>>>>>>>>>> Re: Turning off stats. I am looking to keep stats because, for
>>>>>>>>>>>> our datasets with a much larger number of data files, we want to
>>>>>>>>>>>> leverage Iceberg's ability to skip entire files based on them.
>>>>>>>>>>>> This is one of the big incentives for us to use Iceberg.
>>>>>>>>>>>>
>>>>>>>>>>>> Re: Changing the way we keep stats. Avro is a block-splittable
>>>>>>>>>>>> format and is friendly with parallel compute frameworks like
>>>>>>>>>>>> Spark. So would it make sense, for instance, to add an option to
>>>>>>>>>>>> have a Spark job / Futures handle split planning? In a larger
>>>>>>>>>>>> context, 109M is not that much metadata given that Iceberg is
>>>>>>>>>>>> meant for datasets where the metadata itself is big-data scale.
>>>>>>>>>>>> I'm curious how folks with larger metadata (in the GBs) are
>>>>>>>>>>>> optimizing this today.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> -Gautam.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Apr 19, 2019 at 12:40 AM Ryan Blue <
>>>>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for bringing this up! My initial theory is that this
>>>>>>>>>>>>> table has a ton of stats data that you have to read. That could 
>>>>>>>>>>>>> happen in a
>>>>>>>>>>>>> couple of cases.
>>>>>>>>>>>>>
>>>>>>>>>>>>> First, you might have large values in some columns. Parquet
>>>>>>>>>>>>> will suppress its stats if values are larger than 4k and those 
>>>>>>>>>>>>> are what
>>>>>>>>>>>>> Iceberg uses. But that could still cause you to store two 1k+ 
>>>>>>>>>>>>> objects for
>>>>>>>>>>>>> each large column (lower and upper bounds). With a lot of data 
>>>>>>>>>>>>> files, that
>>>>>>>>>>>>> could add up quickly. The solution here is to implement #113
>>>>>>>>>>>>> <https://github.com/apache/incubator-iceberg/issues/113> so
>>>>>>>>>>>>> that we don't store the actual min and max for string or binary 
>>>>>>>>>>>>> columns,
>>>>>>>>>>>>> but instead a truncated value that is just above or just below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second case is when you have a lot of columns. Each column
>>>>>>>>>>>>> stores both a lower and upper bound, so 1,000 columns could 
>>>>>>>>>>>>> easily take 8k
>>>>>>>>>>>>> per file. If this is the problem, then maybe we want to have a 
>>>>>>>>>>>>> way to turn
>>>>>>>>>>>>> off column stats. We could also think of ways to change the way 
>>>>>>>>>>>>> stats are
>>>>>>>>>>>>> stored in the manifest files, but that only helps if we move to a 
>>>>>>>>>>>>> columnar
>>>>>>>>>>>>> format to store manifests, so this is probably not a short-term 
>>>>>>>>>>>>> fix.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you can share a bit more information about this table, we
>>>>>>>>>>>>> can probably tell which one is the problem. I'm guessing it is 
>>>>>>>>>>>>> the large
>>>>>>>>>>>>> values problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 18, 2019 at 11:52 AM Gautam <
>>>>>>>>>>>>> gautamkows...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have been testing Iceberg reads with and without stats built
>>>>>>>>>>>>>> into the Iceberg dataset manifests, and found that there's a
>>>>>>>>>>>>>> huge jump in network traffic with the latter.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my test I am comparing two Iceberg datasets, one with and
>>>>>>>>>>>>>> the other without stats collected in the Iceberg manifests. In
>>>>>>>>>>>>>> particular, the difference between the writers used for the two
>>>>>>>>>>>>>> datasets is this PR:
>>>>>>>>>>>>>> https://github.com/apache/incubator-iceberg/pull/63/files which
>>>>>>>>>>>>>> uses Iceberg's writers for writing Parquet data. I captured
>>>>>>>>>>>>>> tcpdump from query scans run on these two datasets. The
>>>>>>>>>>>>>> partition being scanned contains 1 manifest, 1 Parquet data
>>>>>>>>>>>>>> file and ~3700 rows in both datasets. There's a 30x jump in
>>>>>>>>>>>>>> network traffic to the remote filesystem (ADLS) when I switch
>>>>>>>>>>>>>> to the stats-based Iceberg dataset. Both queries used the same
>>>>>>>>>>>>>> Iceberg reader code to access both datasets.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> root@d69e104e7d40:/usr/local/spark#  tcpdump -r
>>>>>>>>>>>>>> iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap | grep
>>>>>>>>>>>>>> perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
>>>>>>>>>>>>>> reading from file
>>>>>>>>>>>>>> iceberg_geo1_metrixx_qc_postvalues_batch_query.pcap, link-type 
>>>>>>>>>>>>>> EN10MB
>>>>>>>>>>>>>> (Ethernet)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *8844*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> root@d69e104e7d40:/usr/local/spark# tcpdump -r
>>>>>>>>>>>>>> iceberg_scratch_pad_demo_11_batch_query.pcap | grep
>>>>>>>>>>>>>> perfanalysis.adlus15.projectcabostore.net | grep ">" | wc -l
>>>>>>>>>>>>>> reading from file
>>>>>>>>>>>>>> iceberg_scratch_pad_demo_11_batch_query.pcap, link-type EN10MB 
>>>>>>>>>>>>>> (Ethernet)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *269708*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a consequence, the query response times are affected
>>>>>>>>>>>>>> drastically (illustrated below). I must confess that I am on a
>>>>>>>>>>>>>> slow internet connection via VPN to the remote FS, but the
>>>>>>>>>>>>>> dataset without stats took just 1m 49s while the dataset with
>>>>>>>>>>>>>> stats took 26m 48s to read the same amount of data. Most of
>>>>>>>>>>>>>> that time on the latter dataset was spent in split planning:
>>>>>>>>>>>>>> manifest reading and stats evaluation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> all=> select count(*)  from
>>>>>>>>>>>>>> iceberg_geo1_metrixx_qc_postvalues where batchId =
>>>>>>>>>>>>>> '4a6f95abac924159bb3d7075373395c9';
>>>>>>>>>>>>>>  count(1)
>>>>>>>>>>>>>> ----------
>>>>>>>>>>>>>>      3627
>>>>>>>>>>>>>> (1 row)
>>>>>>>>>>>>>> *Time: 109673.202 ms (01:49.673)*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> all=>  select count(*) from iceberg_scratch_pad_demo_11
>>>>>>>>>>>>>> where _ACP_YEAR=2018 and _ACP_MONTH=01 and _ACP_DAY=01 and 
>>>>>>>>>>>>>> batchId =
>>>>>>>>>>>>>> '6d50eeb3e7d74b4f99eea91a27fc8f15';
>>>>>>>>>>>>>>  count(1)
>>>>>>>>>>>>>> ----------
>>>>>>>>>>>>>>      3808
>>>>>>>>>>>>>> (1 row)
>>>>>>>>>>>>>> *Time: 1608058.616 ms (26:48.059)*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Has anyone faced this? I'm wondering if there's some caching
>>>>>>>>>>>>>> or parallelism option here that can be leveraged. I would
>>>>>>>>>>>>>> appreciate some guidance. If there isn't a straightforward fix
>>>>>>>>>>>>>> and others feel this is a problem, I can raise an issue and
>>>>>>>>>>>>>> look into it further.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> -Gautam.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
