Re: Update / Delete records in Parquet

2019-05-03 Thread Chetan Khatri
Agreed on delta.io. I am exploring both options.



Re: Update / Delete records in Parquet

2019-05-01 Thread Vitaliy Pisarev
Ankit, you should take a look at delta.io (https://delta.io/), which was
recently open sourced by Databricks.
Full DML support is on the way.



Re: Update / Delete records in Parquet

2019-04-23 Thread Khare, Ankit
Hi Chetan,

I also agree that Parquet would not be the best option for this use case. I had
a similar use case:

50 different tables to be downloaded from MSSQL.

Source: MSSQL
Destination: Apache Kudu (since it supports change data capture use cases very
well)

We used the StreamSets CDC module to connect to MSSQL and deliver the CDC data
to Apache Kudu.

Total records: 3 B

Thanks
Ankit




Re: Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
Hello Jason, thank you for the reply. My use case is this: the first time, I do
a full load with transformations/aggregations/joins and write the result to
Parquet (as staging). From then on, my source is MSSQL Server, and I want to
pull only the records that were changed / updated and apply those updates to
the Parquet data as well, if possible without side effects.
https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/work-with-change-tracking-sql-server?view=sql-server-2017
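
The incremental flow described above could be sketched roughly as follows. This is a minimal, engine-agnostic illustration in plain Python, not Spark or SQL Server code: the `apply_changes` helper, the `id` key column, and the sample rows are all hypothetical. The `I` / `U` / `D` operation codes mirror the `SYS_CHANGE_OPERATION` values that SQL Server change tracking reports. Since Parquet files cannot be edited in place, the merged result is treated as a new snapshot that would overwrite the previous staging output.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, object]
# (operation, key, row): operation is "I", "U" or "D"; row is None for deletes.
Change = Tuple[str, int, Optional[Row]]


def apply_changes(
    snapshot: Iterable[Row], changes: Iterable[Change], key: str = "id"
) -> List[Row]:
    """Return a new snapshot with the changes applied; inputs are not modified."""
    # Index the existing rows by key (copying each row) so changes can be applied.
    merged: Dict[object, Row] = {row[key]: dict(row) for row in snapshot}
    for op, k, row in changes:
        if op == "D":
            merged.pop(k, None)          # delete; ignore keys that are already gone
        elif op in ("I", "U"):
            if row is None:
                raise ValueError(f"operation {op!r} for key {k!r} needs a row")
            merged[k] = dict(row)        # insert or update behaves as an upsert
        else:
            raise ValueError(f"unknown operation: {op!r}")
    # Sort by key so the rewritten snapshot has a deterministic order.
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical staging data from the first full load.
staging = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
# Hypothetical changes pulled from the source since the last sync.
changes = [
    ("U", 2, {"id": 2, "amount": 25}),
    ("D", 3, None),
    ("I", 4, {"id": 4, "amount": 40}),
]

new_staging = apply_changes(staging, changes)
print(new_staging)
```

In a real pipeline the same idea would be expressed with the engine's own join or merge operations, and the new snapshot would be written to a fresh location before replacing the old one, so that a failed run does not leave the staging data half-written.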



Re: Update / Delete records in Parquet

2019-04-22 Thread Jason Nerothin
Hi Chetan,

Do you have to use Parquet?

It just feels like it might be the wrong sink for a high-frequency change
scenario.

What are you trying to accomplish?

Thanks,
Jason



-- 
Thanks,
Jason


Update / Delete records in Parquet

2019-04-22 Thread Chetan Khatri
Hello All,

If I am doing an incremental (delta) load and would like to update / delete
records in Parquet: I understand that Parquet files are immutable, so records
can't be updated or deleted in place, and in theory only append / overwrite is
possible. But I can see utility tools that claim to add value here.

https://github.com/Factual/parquet-rewriter

Please shed some light on this.

Thanks
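
One general way to read "update / delete" on immutable files is copy-on-write: read the old file, write a new file without the unwanted rows, then swap it in. The sketch below illustrates only that idea and makes several assumptions: it uses a JSON-lines file as a stand-in for Parquet so that it runs with the standard library alone, the helper names (`write_rows`, `read_rows`, `rewrite_without`) are hypothetical, and it is not a description of how any particular tool works.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

Row = Dict[str, object]


def write_rows(path: str, rows: List[Row]) -> None:
    """Write rows as JSON lines (stand-in for writing a Parquet file)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_rows(path: str) -> List[Row]:
    """Read all rows back from a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def rewrite_without(path: str, should_delete: Callable[[Row], bool]) -> int:
    """Replace the file with a copy that omits matching rows; return rows removed."""
    rows = read_rows(path)
    kept = [r for r in rows if not should_delete(r)]
    # Write the new copy next to the original so the final swap stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_rows(tmp_path, kept)
        os.replace(tmp_path, path)  # swap the new file in; the old one is never edited
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return len(rows) - len(kept)


# Usage with hypothetical data in a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "staging.jsonl")
    write_rows(path, [{"id": 1}, {"id": 2}, {"id": 3}])
    removed = rewrite_without(path, lambda r: r["id"] == 2)
    remaining = read_rows(path)

print(removed, remaining)
```

The cost of this approach grows with the size of the file rather than the number of changed rows, which is one reason a frequently changing dataset may be a poor fit for plain Parquet files.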