It works for me too. .. Owen
> On Jul 3, 2019, at 11:27, Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>
> Works for me too.
>
>> On 3 Jul 2019, at 19:09, Erik Wright <erik.wri...@shopify.com.INVALID> wrote:
>>
>> That works for me.
>>
>> On Wed, Jul 3, 2019 at 2:01 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>> How about 9 AM PDT on Friday, 5 July then?
>>>
>>>> On Wed, Jul 3, 2019 at 10:55 AM Owen O'Malley <owen.omal...@gmail.com> wrote:
>>>> I'd like to call in, but I'm out Thursday. Friday would work except 11 AM to 1 PM PDT.
>>>>
>>>> .. Owen
>>>>
>>>>> On Wed, Jul 3, 2019 at 10:42 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>> I'm available Thursday and Friday this week as well, but it's a holiday in the US, so some people may be out. If there are no objections from anyone that would like to attend, then I'm up for that.
>>>>>
>>>>>> On Wed, Jul 3, 2019 at 10:40 AM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
>>>>>> I apologize for the delay on my side. I'll still have to go through the last emails. I am available on Thursday/Friday this week, and it would be great to sync.
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>>
>>>>>>> On 3 Jul 2019, at 01:29, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>>>>
>>>>>>> Sorry I didn't get back to this thread last week. Let's try to have a video call to sync up on this next week. What days would work for everyone?
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>>> On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <erik.wri...@shopify.com> wrote:
>>>>>>>> With regards to operation values, currently they are:
>>>>>>>> - append: data files were added and no files were removed.
>>>>>>>> - replace: data files were rewritten with the same data; i.e., compaction, changing the data file format, or relocating data files.
>>>>>>>> - overwrite: data files were deleted and added in a logical overwrite operation.
>>>>>>>> - delete: data files were removed and their contents logically deleted.
>>>>>>>>
>>>>>>>> If deletion files (with or without data files) are appended to the dataset, will we consider that an `append` operation? If so, if deletion and/or data files are appended, and whole files are also deleted, will we consider that an `overwrite`?
>>>>>>>>
>>>>>>>> Given that the only apparent purpose of the operation field is to optimize snapshot expiration, the above seems to meet its needs. An incremental reader can also skip `replace` snapshots, but no others. Once it decides to read a snapshot, I don't think there's any difference in how it processes the data for the append/overwrite/delete cases.
>>>>>>>>
>>>>>>>>> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>>> I don't see that we need [sequence numbers] for file/offset deletes, since they apply to a specific file. They're not harmful, but they don't seem relevant.
>>>>>>>>>
>>>>>>>>> These delete files will probably contain a path and an offset and could contain deletes for multiple files. In that case, the sequence number can be used to eliminate delete files that don't need to be applied to a particular data file, just like the column equality deletes. Likewise, it can be used to drop the delete files when there are no data files with an older sequence number.
>>>>>>>>>
>>>>>>>>> I don't understand the purpose of the min sequence number, nor what the "min data seq" is.
>>>>>>>>>
>>>>>>>>> Min sequence number would be used for pruning delete files without reading all the manifests to find out if there are old data files. If no manifest with data for a partition contains a file older than some sequence number N, then any delete file with a sequence number < N can be removed.
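[Editor's note: the pruning rule described above can be sketched in a few lines. The class and field names below are hypothetical stand-ins for manifest entries, not Iceberg's actual metadata model.]

```python
from dataclasses import dataclass

# Hypothetical stand-ins for entries tracked in manifests.
@dataclass
class DataFile:
    path: str
    sequence_number: int

@dataclass
class DeleteFile:
    path: str
    sequence_number: int

def prune_delete_files(data_files, delete_files):
    """Drop every delete file whose sequence number is below the minimum
    data sequence number: no remaining data file is old enough for that
    delete to apply to it."""
    if not data_files:
        return []
    min_data_seq = min(f.sequence_number for f in data_files)
    return [d for d in delete_files if d.sequence_number >= min_data_seq]

# With data files at sequence numbers 5 and 7, a delete file at sequence
# number 3 predates all live data and can be dropped:
data = [DataFile("a.parquet", 5), DataFile("b.parquet", 7)]
deletes = [DeleteFile("d1.avro", 3), DeleteFile("d2.avro", 5), DeleteFile("d3.avro", 8)]
kept = prune_delete_files(data, deletes)  # keeps d2 and d3 only
```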
>>>>>>>> OK, so the minimum sequence number is an attribute of manifest files. Sounds good. It can likely permit us to optimize compaction operations as well (i.e., you can easily limit the operation to a subset of manifest files as long as they are the oldest ones).
>>>>>>>>
>>>>>>>>> The "min data seq" is the minimum sequence number of a data file. That seems like what we actually want for the pruning I described above.
>>>>>>>>
>>>>>>>> I would expect a data file (appended rows or deletions by column value) to have a single sequence number that applies to the whole file. Even a delete-by-file-and-offset file can make do with only a single sequence number (which must be larger than the sequence numbers of all deleted files). Why do we need a "minimum" data sequence per file?
>>>>>>>>
>>>>>>>>> Off the top of my head, [supporting non-key deletes] requires adding additional information to the manifest file, indicating the columns that are used for the deletion. Only equality would be supported; if multiple columns were used, they would be combined with boolean-and. I don't see anything too tricky about it.
>>>>>>>>>
>>>>>>>>> Yes, exactly. I actually phrased it wrong initially. I think it would be simple to extend the equality deletes to do this. We just need a way to have global scope, not just partition scope.
>>>>>>>>
>>>>>>>> I don't think anything special needs to be done with regards to scoping/partitioning of delete files. When scanning one or more data files, one must also consider any and all deletion files that could apply to them. The only way to prune deletion files from consideration is:
>>>>>>>> - All of your data files have at least one partition column in common.
>>>>>>>> - The deletion file is also partitioned on that column (at least).
>>>>>>>> - The value sets of the data files do not overlap the value sets of the deletion files in that column.
>>>>>>>>
>>>>>>>> So, given a dataset of sessions that is partitioned by device form factor and date, for example, you could have a delete (user_id=9876) in a deletion file that is not partitioned, and it would be "in scope" for all of those data files.
>>>>>>>>
>>>>>>>> If you had the same dataset partitioned by hash(user_id), and your deletes were _also_ partitioned by hash(user_id), you would be able to prune those deletes while scanning the sessions.
>>>>>>>>
>>>>>>>>> If we add this on a per-deletion-file basis, it is not clear if there is any relevance in preserving the concept of a unique row ID.
>>>>>>>>>
>>>>>>>>> Agreed. That's why I've been steering us away from the debate about whether keys are unique or not. Either way, a natural key delete must delete all of the records it matches.
>>>>>>>>>
>>>>>>>>> I would assume that the maximum sequence number should appear in the table metadata.
>>>>>>>>>
>>>>>>>>> Agreed.
>>>>>>>>>
>>>>>>>>> [W]ould you make it optional to assign a sequence number to a snapshot? "Replace" snapshots would not need one.
>>>>>>>>>
>>>>>>>>> The only requirement is that it is monotonically increasing. If one isn't used, we don't have to increment. I'd say it is up to the implementation to decide. I would probably increment it every time to avoid errors.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
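[Editor's note: the three pruning conditions Erik lists lend themselves to a small predicate. This is a sketch under an assumed, simplified representation of a partition as a dict of column name to value; it is not Iceberg's actual API.]

```python
def delete_may_apply(data_partition, delete_partition):
    """Return False only when the delete file provably cannot contain
    deletes for the data file: both are partitioned on at least one
    common column, and their values differ on some shared column."""
    shared = set(data_partition) & set(delete_partition)
    if not shared:
        # An unpartitioned (global) delete file is in scope everywhere.
        return True
    # Prune only when the two files provably live in disjoint partitions.
    return all(data_partition[c] == delete_partition[c] for c in shared)

# A user_id=9876 delete in an unpartitioned delete file is in scope for
# every sessions data file, as in the example above:
delete_may_apply({"form_factor": "mobile", "date": "2019-06-21"}, {})  # True
# With both sides partitioned by hash(user_id), non-matching buckets prune:
delete_may_apply({"user_hash": 3}, {"user_hash": 7})  # False
```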