spark partition discovery vs iceberg partition discovery implementation

2019-04-18 Thread suds
I am working on a Spark project and came across an interesting convention Spark uses (already known in Hive): https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#partition-discovery In Spark, if I partition a dataset, the partition columns do not exist in the Parquet schema and hence are absent from the final data files.
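The convention described above can be illustrated with a small sketch: in Hive-style layouts, partition column values live only in directory segments like `country=US/date=2019-04-18`, and Spark's partition discovery recovers them by parsing the path rather than reading them from the file. The path and column names here are made up for illustration.

```python
import re

def parse_hive_partitions(path):
    """Recover partition column values from a Hive-style directory path.

    This mirrors (in simplified form) what Spark's partition discovery
    does: the partition columns exist only as `key=value` path segments,
    not inside the Parquet files themselves.
    """
    return dict(re.findall(r"([^/=]+)=([^/=]+)", path))

print(parse_hive_partitions(
    "s3://bucket/events/country=US/date=2019-04-18/part-0000.parquet"))
# {'country': 'US', 'date': '2019-04-18'}
```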

Re: spark partition discovery vs iceberg partition discovery implementation

2019-04-18 Thread Ryan Blue
Iceberg stores all table columns in the underlying data files. It does not store derived partition values in the data files. If you're partitioning by date(ts), it won't store that date ordinal. If you're partitioning by identity(date_col), it will store date_col. When reading data, values from th
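To make the `date(ts)` example concrete: Iceberg's day transform derives a date ordinal (days since the Unix epoch) from a timestamp column, and it is this derived ordinal, not a stored column, that drives partitioning. The sketch below models the transform directly rather than calling the Iceberg API.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def day_transform(ts):
    """Model of Iceberg's day() partition transform: map a timestamp to
    a date ordinal (whole days since 1970-01-01 UTC). Iceberg keeps the
    source ts column in the data files, but not this derived ordinal."""
    return (ts - EPOCH).days

ts = datetime(2019, 4, 18, 15, 30, tzinfo=timezone.utc)
print(day_transform(ts))  # 18004
```

Two rows with the same day ordinal land in the same partition even though their raw timestamps differ, which is why storing the ordinal itself would be redundant.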

Reading dataset with stats making lots of network traffic..

2019-04-18 Thread Gautam
Hello folks, I have been testing Iceberg reading with and without stats built into the Iceberg dataset manifest and found that there's a huge jump in network traffic with the latter. In my test I am comparing two Iceberg datasets, both written in Iceberg format. One with and the other without stats

Re: Reading dataset with stats making lots of network traffic..

2019-04-18 Thread Ryan Blue
Thanks for bringing this up! My initial theory is that this table has a ton of stats data that you have to read. That could happen in a couple of cases. First, you might have large values in some columns. Parquet will suppress its stats if values are larger than 4k and those are what Iceberg uses.
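The suppression behavior mentioned above can be sketched as follows. This is a simplified model of the idea (one size threshold, whole-column suppression), not Parquet's actual implementation, and the threshold constant is an assumption based on the 4k figure in the message.

```python
PARQUET_MAX_STATS_SIZE = 4096  # assumed threshold, per the "4k" above

def column_stats(values, max_size=PARQUET_MAX_STATS_SIZE):
    """Simplified model of Parquet-style stats suppression: keep min/max
    bounds for a column, but drop them entirely if any value is too
    large for the bounds to be worth storing."""
    encoded = [str(v).encode("utf-8") for v in values]
    if any(len(b) > max_size for b in encoded):
        return None  # suppressed: oversized values make bounds huge
    return {"min": min(values), "max": max(values)}

print(column_stats(["apple", "banana"]))  # {'min': 'apple', 'max': 'banana'}
print(column_stats(["x" * 5000, "y"]))    # None (a value exceeds 4 KB)
```

When stats are not suppressed, large string values still inflate the manifest, since every data file carries a lower and upper bound per column.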

Re: spark partition discovery vs iceberg partition discovery implementation

2019-04-18 Thread suds
Thank you for reply and great explanation! On Thu, Apr 18, 2019 at 8:54 AM Ryan Blue wrote: > Iceberg stores all table columns in the underlying data files. It does not > store derived partition values in the data files. If you're partitioning by > date(ts), it won't store that date ordinal. If

Re: Reading dataset with stats making lots of network traffic..

2019-04-18 Thread Gautam
Correction, partition count = 4308. > Re: Changing the way we keep stats. Avro is a block-splittable format and is friendly with parallel compute frameworks like Spark. Here I am trying to say that we don't need to change the format to columnar, right? The current format is already friendly for pa

Re: Reading dataset with stats making lots of network traffic..

2019-04-18 Thread Gautam
Ah, my bad. I missed adding the schema details. Here are some details on the dataset with stats: Iceberg Schema Columns: 20 Spark Schema fields: 20 Snapshot Summary: {added-data-files=4308, added-records=11494037, changed-partition-count=4308, total-records=11494037, total-data-files=43
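A back-of-the-envelope reading of that snapshot summary helps explain the traffic: 4308 data files across 4308 changed partitions is one file per partition, i.e. many small files, and with 20 columns the manifest carries a min/max bound pair per column per file. This arithmetic is my own illustration, not from the thread.

```python
# Figures taken from the snapshot summary quoted above.
records = 11_494_037
data_files = 4_308
columns = 20

# One file per partition -> small files.
print(round(records / data_files))  # ~2668 records per data file

# Each file entry keeps lower/upper bounds per column, so the manifest
# holds on the order of data_files * columns bound pairs to download.
print(data_files * columns)  # 86160 min/max entry pairs
```

That order-of-magnitude count of per-column stats entries is consistent with manifest reads dominating network traffic when stats are enabled.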