I think this could be useful. When we ingest data from Kafka, we do a
predefined set of checks on the data. We could use something like this to
sanity-check the data before publishing.
How is the auditing process supposed to find the new snapshot, since it is
not accessible from the table?
Hey Guys,
Sorry about the delay on this. I just got back to getting a basic
working implementation of vectorization on primitive types in Iceberg.
*Here's what I have so far:*
I have added `ParquetValueReader` implementations for some basic primitive
types that build the respective Ar
Putting aside for a moment the question of hashing -0 and +0, I wonder if this
could be addressed by ordering floating-point numbers using the totalOrder
predicate but, when there is a NaN in a file, omitting the field it is in from
manifest_entry.data_file.{sort_columns, lower_bounds, upper_bounds}
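
To make the suggestion above concrete, here is a minimal sketch of ordering doubles by the IEEE 754 totalOrder predicate and skipping bounds when a NaN is present. The helper names `total_order_key` and `column_bounds` are hypothetical and are not part of Iceberg; the sketch only illustrates the idea for a single column of 64-bit floats.

```python
import math
import struct


def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order follows IEEE 754 totalOrder."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Negative sign bit: flip every bit so larger magnitudes sort first.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Non-negative: set the sign bit so these sort above all negatives.
    return bits | 0x8000000000000000


def column_bounds(values):
    """Return (lower, upper) under totalOrder, or None if bounds should be omitted.

    Hypothetical helper: bounds are omitted for an empty column or when any
    value is NaN, mirroring the proposal to drop the field from the stats.
    """
    if not values or any(math.isnan(v) for v in values):
        return None
    return (min(values, key=total_order_key), max(values, key=total_order_key))
```

Under this ordering -0.0 sorts strictly below +0.0, so `column_bounds([1.5, 0.0, -0.0])` reports -0.0 as the lower bound, while any column containing a NaN yields no bounds at all.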
Hi everyone,
At Netflix, we have a pattern for building ETL jobs where we write data,
then audit the result before publishing the data that was written to a
final table. We call this WAP for write, audit, publish.
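
As a rough illustration of the write, audit, publish flow described above, here is a toy in-memory sketch. `ToyTable`, `audit`, and `write_audit_publish` are hypothetical names made up for this example; this is not the Iceberg API or the Netflix implementation, and the audit rule (non-empty rows with a non-null `id`) is just an assumed stand-in for real checks.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """In-memory stand-in for a table with staged and published snapshots."""

    snapshots: dict = field(default_factory=dict)  # snapshot id -> list of rows
    current_id: Optional[int] = None  # the snapshot that readers see

    def stage(self, rows):
        """Write: record a new snapshot without making it current."""
        snapshot_id = len(self.snapshots) + 1
        self.snapshots[snapshot_id] = list(rows)
        return snapshot_id

    def publish(self, snapshot_id):
        """Publish: make a staged snapshot the current table state."""
        self.current_id = snapshot_id

    def read(self):
        """Readers only see the published snapshot."""
        return self.snapshots.get(self.current_id, [])


def audit(rows):
    """Assumed audit rule: at least one row, and every row has a non-null id."""
    return len(rows) > 0 and all(r.get("id") is not None for r in rows)


def write_audit_publish(table, rows):
    """Stage the rows, audit the staged snapshot, and publish only on success."""
    snapshot_id = table.stage(rows)
    if audit(table.snapshots[snapshot_id]):
        table.publish(snapshot_id)
        return True
    return False
```

In this sketch a failed audit leaves the staged snapshot in place but unpublished, so readers keep seeing the last good state.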
We’ve added support in our Iceberg branch. A WAP write creates a new table
snapshot
Agreed that this metadata is only for Iceberg and should not be read by other
systems; it was just an example.
The main point is that having gz in the middle is confusing. I guess the
expectation is that if a file ends with a json suffix, it is JSON.
Maybe another option is to remove ".json" from metadata
The intent here was to make it easier to identify the format of a file, but
if this makes the files incompatible with other systems maybe we should
change it back.
I think the argument against changing it back is that I wouldn't expect
people to read these files with systems like Drill. Instead, w
Hi all,
Recent changes to metadata compression started adding “.gz” after the metadata
file name, not at the end as before.
Before: v1.metadata.json.gz
Now: v1.gz.metadata.json
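
For anyone who has to handle both conventions in tooling, here is a small sketch that recognizes either name. `parse_metadata_file_name` is a hypothetical helper, not something in Iceberg, and it assumes the only suffixes in play are the two shown above plus the uncompressed `.metadata.json`.

```python
def parse_metadata_file_name(name: str):
    """Return (version, compressed) for a metadata file name, or None if unrecognized.

    The compressed suffixes are checked first because "v1.gz.metadata.json"
    also ends with ".metadata.json".
    """
    suffixes = (
        (".metadata.json.gz", True),   # old convention
        (".gz.metadata.json", True),   # new convention
        (".metadata.json", False),     # uncompressed
    )
    for suffix, compressed in suffixes:
        if name.endswith(suffix):
            return name[: -len(suffix)], compressed
    return None
```

With this, `v1.metadata.json.gz` and `v1.gz.metadata.json` both parse to version `v1` with compression enabled.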
It looks like this was done intentionally, but to me it is rather confusing.
Since gz is an indication of a compressed file and
I think we are all on the same page. By that statement, I meant that we should
not assume the current sort order is always applied to all files in the table,
as that would require rewriting data immediately when we change the sort order.
Also, different parts of the table can be ordered differen