-----Original Message-----
From: Divya Gehlot [mailto:divya.htco...@gmail.com]
Sent: Thursday, July 27, 2017 1:56 AM
To: user@drill.apache.org
Subject: Re: append data to already existing table saved in parquet format
Hi Paul,
Let me try your approach of CTAS and save to the partition directory structure.
Thanks for the suggestion.
Thanks,
Divya
On 27 July 2017 at 11:57, Paul Rogers wrote:
Hi All,
Saurabh, you are right. But since Parquet does not allow appending to existing files, we have to do the logical equivalent, which is to create a new Parquet file. For it to be part of the same “table”, it must be part of an existing partition structure, as Divya described.
The trick here …
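A minimal sketch of what that CTAS-per-partition approach could look like (the workspace, directory layout, and staging file below are hypothetical):

USE dfs.tmp;
-- Each new load becomes its own subdirectory under the table root
-- (names here are invented for illustration).
CREATE TABLE `events/2017/07/27` AS
SELECT * FROM dfs.`/staging/events_2017_07_27.json`;
-- A query against the root directory reads all partitions, old and new.
SELECT COUNT(*) FROM dfs.tmp.`events`;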
But append only means you are adding event records to a table (forget the layout for a while). That means you have to write to the end of the table. If the writes are too many, you have to batch them and then convert them into a columnar format.
This to me sounds like a Kafka workflow where you keep …
Yes Paul, I am looking for the insert-into-partition feature.
That way we just have to create the file for that particular partition when new data comes in, or when an update is required.
Otherwise, every time data comes in we have to run the view and recreate the parquet files for the whole data set, which …
Hi Divya,
It seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The idea would be to write new Parquet files into an existing partition structure. That feature has not yet been started, so the workarounds provided might help you for now.
- Paul
Does Drill provide that kind of functionality? Theoretically, yes; CTAS should work, but your cluster has to be sized accordingly. I would never put something into such a pipeline without adequate testing, and I would always consider a lambda architecture to ensure that if this path were to fail (with Drill o…
The data size is not big for any given hour, but it will grow over time: say I have data for 2 years, with data coming in on an hourly basis; recreating the parquet table every time is not a feasible solution.
Likewise, in Hive you create the partition and insert the data into the partition accordin…
I always recommend against using CTAS as a shortcut for a large ETL-type workload. You will need to size your Drill cluster accordingly. Consider using Hive or Spark instead.
What are the source file formats? For every hour, what are the size and the number of rows for that data? Are you doing any …
I am not aware of any clean way to do this. However, if your data is partitioned based on directories, then you can use the hack below, which leverages temporary tables [1]. Essentially, you back up your partition to a temp table, then overwrite it by taking the union of the new partition data and the existing …
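A rough sketch of that temp-table workaround (the workspace, partition path, and staging file are invented for illustration; the old and new schemas must line up for the union):

-- Back up the partition that needs new rows (hypothetical names).
CREATE TEMPORARY TABLE events_20170727_backup AS
SELECT * FROM dfs.tmp.`events/2017/07/27`;
-- Rebuild the partition as the union of the old rows and the new batch.
DROP TABLE dfs.tmp.`events/2017/07/27`;
CREATE TABLE dfs.tmp.`events/2017/07/27` AS
SELECT * FROM events_20170727_backup
UNION ALL
SELECT * FROM dfs.`/staging/new_hourly_batch.json`;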
Drill doesn't have support for an INSERT INTO command. You could try using the CTAS command to write to a specific partition directory, maybe? Also look at CTAS auto partitioning [1].
[1] https://drill.apache.org/docs/partition-by-clause/
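As a hedged example of that auto-partitioning clause (the table and column names are made up):

-- Drill writes separate Parquet files per distinct event_date value,
-- which later enables partition pruning on that column.
CREATE TABLE dfs.tmp.`events_by_day`
PARTITION BY (event_date) AS
SELECT event_date, user_id, payload
FROM dfs.`/staging/raw_events`;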
On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot wrote:
Hi,
I am new to Apache Drill.
I have data coming in every hour, and when I searched I couldn't find an insert-into-partition command in Apache Drill.
How can we insert data into a particular partition without rewriting the whole data set?
Appreciate the help.
Thanks,
Divya