One thing to account for is that row count stats are spread across the various 
levels of stats. If a record is logically deleted, then rowcount = rowcount - 1.
So when using any level of stats to compute the row count, how do we account for 
logical deletes?
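
To make the question concrete, here is a minimal sketch of one way the counts could stay consistent. It assumes a hypothetical per-page counter of logically deleted rows (num_deleted), which is not part of the Parquet format, and it derives the row group and file counts from the pages instead of storing them at each level. All class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageStats:
    num_rows: int          # rows physically stored in the page
    num_deleted: int = 0   # hypothetical: rows flagged by a delete marker

    @property
    def live_rows(self) -> int:
        return self.num_rows - self.num_deleted


@dataclass
class RowGroupStats:
    pages: List[PageStats] = field(default_factory=list)

    @property
    def num_rows(self) -> int:
        return sum(p.num_rows for p in self.pages)

    @property
    def live_rows(self) -> int:
        return sum(p.live_rows for p in self.pages)


@dataclass
class FileStats:
    row_groups: List[RowGroupStats] = field(default_factory=list)

    @property
    def num_rows(self) -> int:
        return sum(rg.num_rows for rg in self.row_groups)

    @property
    def live_rows(self) -> int:
        return sum(rg.live_rows for rg in self.row_groups)


# One row group with two pages; logically delete one record in the first page.
row_group = RowGroupStats([PageStats(100), PageStats(50)])
stats = FileStats([row_group])
row_group.pages[0].num_deleted += 1

print(stats.num_rows)   # physical rows, unchanged by the logical delete
print(stats.live_rows)  # rows a reader should actually return
```

If each level keeps its own stored count instead, a logical delete would either have to update every level or readers would have to subtract the deleted count themselves.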
Eric

-----Original Message-----
From: Atri Sharma [mailto:atri.j...@gmail.com] 
Sent: Tuesday, December 5, 2017 10:48 AM
To: dev@parquet.apache.org
Subject: Re: Regarding PARQUET-1155

Thanks.

I think a configurable purger that replaces pages (like HBase compaction, 
as mentioned earlier in the thread) should suffice, with the compaction 
frequency made configurable.

Do we use the full page replacement technique to replace records in any 
scenario today?

Regards,

Atri

On Tue, Dec 5, 2017 at 9:17 PM, lukas nalezenec <lu...@apache.org> wrote:
> Hi,
> I think that a delete marker is a good idea.
> I attended basic GDPR training and I think that it meets EU law 
> requirements.
>
> Lukas
>
> 2017-12-05 11:37 GMT+01:00 Atri Sharma <atri.j...@gmail.com>:
>
>> Agreed.
>>
>> I have come up with a patch that adds metadata to the page header 
>> marking tuples as deleted. The visibility checks will need to 
>> consult the page header before returning read results.
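
As an illustration of the visibility check described here, the following is a rough sketch and not the patch itself. It assumes the page header carries a hypothetical set of deleted row positions (deleted_rows); the names PageHeader, Page, mark_deleted and read_visible are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Set


@dataclass
class PageHeader:
    num_values: int
    # Hypothetical delete markers: positions of logically deleted rows.
    deleted_rows: Set[int] = field(default_factory=set)


@dataclass
class Page:
    header: PageHeader
    values: List[Any]  # values after decoding and decompression


def mark_deleted(page: Page, row: int) -> None:
    """Flag a row as deleted in the header without touching the stored values."""
    if not 0 <= row < page.header.num_values:
        raise IndexError(f"row {row} is outside the page")
    page.header.deleted_rows.add(row)


def read_visible(page: Page) -> Iterator[Any]:
    """Consult the header and skip rows that carry a delete marker."""
    for position, value in enumerate(page.values):
        if position not in page.header.deleted_rows:
            yield value


page = Page(PageHeader(num_values=4), ["a", "b", "c", "d"])
mark_deleted(page, 1)
visible = list(read_visible(page))
print(visible)      # ['a', 'c', 'd']
print(page.values)  # the deleted value is still physically present
```

The last line shows why pruning is still needed: the marker only hides the row from readers that honor it, and the value stays in the page until the data is rewritten.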
>>
>> The pruning still needs to be implemented.
>>
>> On Tue, Dec 5, 2017 at 3:20 AM, Eric Owhadi <eric.owh...@esgyn.com> wrote:
>> > Maybe the EU requirement provides a deadline for the delete. So one 
>> > can imagine implementing a "logical delete" and, on a monthly basis 
>> > (assuming that is the EU deadline for compliance), performing a 
>> > physical delete by reloading the data without the logically deleted 
>> > records. It is like the HBase major compaction concept?
>> > Eric
>> >
>> >
>> > -----Original Message-----
>> > From: Wes McKinney [mailto:wesmck...@gmail.com]
>> > Sent: Monday, December 4, 2017 3:38 PM
>> > To: dev@parquet.apache.org
>> > Subject: Re: Regarding PARQUET-1155
>> >
>> > hi Atri -- even if we could, I am not sure this would meet the
>> > requirements of the EU law, since the "deleted" data could still be 
>> > read by an adversary even if a Parquet implementation like parquet-mr 
>> > did not permit it
>> >
>> > - Wes
>> >
>> > On Mon, Dec 4, 2017 at 11:55 AM, Atri Sharma <atri.j...@gmail.com> wrote:
>> >> I see, thanks.
>> >>
>> >> Could we not introduce the concept of a delete marker, where we 
>> >> mark the deleted records in the page header?
>> >>
>> >> On Mon, Dec 4, 2017 at 10:23 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>> I don't think this is possible due to the encoding and 
>> >>> compression schemes.
>> >>>
>> >>> For example, suppose that you had the following data
>> >>>
>> >>> 1
>> >>> 1
>> >>> 1
>> >>> 4
>> >>> 4
>> >>> 4
>> >>> 4
>> >>>
>> >>> This would be dictionary-encoded and compressed to semantically 
>> >>> look like
>> >>>
>> >>> dictionary: 1, 4
>> >>> data page: (3, 0) (4, 1)
>> >>>
>> >>> The encoded data page (using the hybrid bit-packing / RLE 
>> >>> encoding scheme) would furthermore be compressed. Editing records in 
>> >>> general would change the size of the compressed and encoded data 
>> >>> stream, so you could not edit the page without rewriting the file.
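
The example above can be reproduced with a small sketch. It is a deliberate simplification: the pairs are (run length, dictionary index), plain run-length encoding stands in for Parquet's hybrid bit-packing / RLE scheme, and compression is left out. The helper names dictionary_encode and rle_encode are made up for the illustration.

```python
from itertools import groupby
from typing import List, Tuple


def dictionary_encode(values: List[int]) -> Tuple[List[int], List[int]]:
    """Return (dictionary, indices) where each value is replaced by its dictionary index."""
    dictionary = sorted(set(values))
    index_of = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index_of[value] for value in values]


def rle_encode(indices: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal indices into (run length, index) pairs."""
    return [(len(list(run)), index) for index, run in groupby(indices)]


original = [1, 1, 1, 4, 4, 4, 4]
dictionary, indices = dictionary_encode(original)
runs = rle_encode(indices)
print(dictionary)  # [1, 4]
print(runs)        # [(3, 0), (4, 1)]

# Edit a single record in the middle of the second run: 4 becomes 1.
edited = [1, 1, 1, 4, 1, 4, 4]
_, edited_indices = dictionary_encode(edited)
edited_runs = rle_encode(edited_indices)
print(edited_runs)  # [(3, 0), (1, 1), (1, 0), (2, 1)]
```

A one-record edit turns two runs into four, so the encoded stream no longer fits in the bytes it occupied before, and that is before compression is applied on top.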
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Mon, Dec 4, 2017 at 11:46 AM, Atri Sharma <atri.j...@gmail.com> wrote:
>> >>>> Hi Wes,
>> >>>>
>> >>>> Thanks for your response.
>> >>>>
>> >>>> My main use case is that I want to introduce updatability to 
>> >>>> Parquet records without going the route of replacing the entire page.
>> >>>>
>> >>>> Is that something that has already been discussed please?
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> Atri
>> >>>>
>> >>>> On Mon, Dec 4, 2017 at 10:10 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>>>> hi Atri,
>> >>>>>
>> >>>>> From a prior discussion on the mailing list, it is not clear 
>> >>>>> that this is a problem that concerns either Parquet format or 
>> >>>>> the implementations in the Apache Parquet project. If data must 
>> >>>>> be edited or deleted, then the point-of-truth Parquet files 
>> >>>>> must be scanned and overwritten with the offending records deleted.
>> >>>>> Modifying files in place is not feasible due to the compression 
>> >>>>> and encoding schemes (dictionary, run-length encoding) used in 
>> >>>>> the Parquet format. Let me know if I am misunderstanding the use case.
>> >>>>>
>> >>>>> Thanks
>> >>>>> Wes
>> >>>>>
>> >>>>> On Mon, Dec 4, 2017 at 11:30 AM, Atri Sharma <atri.j...@gmail.com> wrote:
>> >>>>>> Hi Folks,
>> >>>>>>
>> >>>>>> Any update?
>> >>>>>>
>> >>>>>> On Fri, Dec 1, 2017 at 9:23 AM, Atri Sharma <atri.j...@gmail.com> wrote:
>> >>>>>>> https://issues.apache.org/jira/browse/PARQUET-1155
>> >>>>>>>
>> >>>>>>> Anybody working on it? Can I take it up?
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Regards,
>> >>>>>>
>> >>>>>> Atri
>> >>>>>> l'apprenant
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Regards,
>> >>>>
>> >>>> Atri
>> >>>> l'apprenant
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >>
>> >> Atri
>> >> l'apprenant
>>
>>
>>
>> --
>> Regards,
>>
>> Atri
>> l'apprenant
>>



--
Regards,

Atri
l'apprenant
