Re: [DISCUSS] Write-audit-publish support

2019-07-19 Thread RD
I think this could be useful. When we ingest data from Kafka, we do a predefined set of checks on the data. We can potentially utilize something like this to check for sanity before publishing. How is the auditing process supposed to find the new snapshot, since it is not accessible from the table

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-19 Thread Gautam
Hey guys, sorry about the delay on this. I just got back to getting a basic working implementation in Iceberg for vectorization on primitive types. *Here's what I have so far:* I have added `ParquetValueReader` implementations for some basic primitive types that build the respective Ar

Re: lower_bounds and floating point numbers: NaNs and negative zero

2019-07-19 Thread Jim Apple
Putting aside for a moment the question of hashing -0 and +0, I wonder if this could be addressed by ordering floating point numbers using the totalOrder predicate, but when there is a NaN in a file, omit the field it is in from manifest_entry.data_file.{sort_columns, lower_bounds, upper_bounds}
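A minimal sketch of the idea in the preview above, assuming 64-bit doubles: order values with an IEEE 754 totalOrder-style key (so -0 sorts before +0), and omit the bounds entirely when any NaN is present. The helper names `total_order_key` and `column_bounds` are hypothetical and are not part of Iceberg.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose ordering matches IEEE 754 totalOrder."""
    # Reinterpret the 64-bit pattern of the double as an unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits & (1 << 63):
        # Sign bit set: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so these values sort after all sign-bit-set values.
    return bits | (1 << 63)


def column_bounds(values):
    """Return (lower, upper) under the total order, or None to omit the bounds.

    Assumption for illustration: bounds are omitted when the column is empty
    or contains any NaN, as the message suggests for NaN.
    """
    if not values or any(math.isnan(v) for v in values):
        return None
    lower = min(values, key=total_order_key)
    upper = max(values, key=total_order_key)
    return lower, upper
```

With this key, `column_bounds([0.0, -0.0, 2.5])` picks -0.0 as the lower bound, while any input containing a NaN yields `None`.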

[DISCUSS] Write-audit-publish support

2019-07-19 Thread Ryan Blue
Hi everyone, At Netflix, we have a pattern for building ETL jobs where we write data, then audit the result before publishing the data that was written to a final table. We call this WAP for write, audit, publish. We’ve added support in our Iceberg branch. A WAP write creates a new table snapshot

Re: Metadata compression

2019-07-19 Thread Arina Yelchiyeva
Agreed that this is metadata only for Iceberg and should not be read by other systems; it was just an example. The main point is that having gz in the middle is confusing. I guess the expectation is that if a file ends with a json suffix, it is JSON. Maybe another option is to remove “.json” from metadata

Re: Metadata compression

2019-07-19 Thread Ryan Blue
The intent here was to make it easier to identify the format of a file, but if this makes the files incompatible with other systems, maybe we should change it back. I think the argument against changing it back is that I wouldn't expect people to read these files with systems like Drill. Instead, w

Metadata compression

2019-07-19 Thread Arina Yelchiyeva
Hi all, Recent changes in metadata compression started adding “.gz” after the metadata file name, not at the end as before. Before: v1.metadata.json.gz Now: v1.gz.metadata.json It looks like this was done intentionally, but to me it looks rather confusing. Since gz is an indication of a compressed file and

Re: Sort Spec

2019-07-19 Thread Anton Okolnychyi
I think we are all on the same page. By that statement, I meant that we should not assume the current sort order is always applied to all files in the table, as that would require rewriting data immediately when we change the sort order. Also, different parts of the table can be ordered differen