Hi All,

Saurabh, you are right. But, since Parquet does not allow appending to existing 
files, we have to do the logical equivalent, which is to create a new Parquet 
file. For it to be part of the same “table,” it must sit within an existing 
partition structure, as Divya described.

The trick here is to choose a proper time grain. Too fine and you end up with 
a very large number of files, and performance will suffer. (Once a second, for 
example, is too frequent: that is 86,400 files per day.) Too coarse and people 
don’t get near-real-time results. But with the hourly grain Divya is using, 
the number of files per directory will not be too large (24 per day), and each 
file should be of a reasonable size.

Using Kafka to batch data would be a fine idea.

Of course, this is still not as good as Saurabh's former project, Druid, which 
builds aggregated cubes on the fly and has a lambda architecture to allow 
querying both immediate and historical data. Still, Divya’s design can work 
fine for some use cases when latency is not an issue and data volume is 
reasonable.

It would help if Drill had INSERT INTO support. But, I wonder, can it be made 
to work with Drill today? Perhaps the query can simply include the proper 
target directory in the CTAS statement. That is, data for 2017-07-25 02:00 
would go into “2017/07/25/0200.parquet”, say: do-it-yourself partitioning. I 
hope Drill won’t care how the Parquet files got into the directories, only 
that the directories have the expected structure. (Is this accurate? Haven’t 
tried it myself…)
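
If it does work, the CTAS might look something like the sketch below. To be 
clear, this is only a sketch: the dfs.tmp workspace, the `events` and 
`staging` names, and the columns are placeholders I am making up, not anything 
from Divya’s actual setup.

-- Hypothetical: write one hour of new data into its own partition directory.
-- All table, workspace, and column names here are made up for illustration.
CREATE TABLE dfs.tmp.`events/2017/07/25/0200` AS
SELECT col1, col2, event_time
FROM dfs.tmp.`staging`
WHERE event_time >= TIMESTAMP '2017-07-25 02:00:00'
  AND event_time <  TIMESTAMP '2017-07-25 03:00:00';

-- Downstream queries then scan the whole "table" and can prune on the
-- implicit directory columns:
SELECT col1, col2
FROM dfs.tmp.`events`
WHERE dir0 = '2017' AND dir1 = '07' AND dir2 = '25';

If Drill really does treat these hand-built directories like any other 
partition layout, then the hourly job only ever touches one small directory.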

With single-threaded, hourly updates, there is no worry about name collisions 
and the other tricky issues that INSERT INTO will have to solve.

Divya, have you tried this solution?

Thanks,

- Paul

> On Jul 26, 2017, at 7:32 PM, Saurabh Mahapatra <saurabhmahapatr...@gmail.com> 
> wrote:
> 
> But append-only means you are adding event records to a table (forget the 
> layout for a while). That means you have to write to the end of a table. If 
> the writes are too many, you have to batch them and then convert them into a 
> columnar format. 
> 
> This to me sounds like a Kafka workflow where you keep ingesting event 
> data, then batch process it (or stream process it). Writing or appending to 
> a columnar store when your data is in a row-like format does not sound 
> efficient at all. I have not seen such a design in systems that actually 
> work. I know there are query engines that try to do that, but the use is 
> limited. You cannot scale. 
> 
> I always think of Parquet or a columnar data store as the repository of 
> historical data that came from the OLTP world. You do not want to touch it 
> once you have created it. You want a strategy where you batch the recent 
> data, create the historical data, and move on. 
> 
> My 2 cents.
> 
> Saurabh
> 
> On Jul 26, 2017, at 6:58 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> 
> Yes Paul, I am looking for the insert-into-partition feature.
> That way we just have to create the file for that particular partition
> when new data comes in, or when any update is required.
> Otherwise, every time data comes in we have to run the view and recreate the
> Parquet files for the whole data set, which is very time consuming,
> especially when the data is being visualized in a near-real-time dashboard.
> 
> Thanks,
> Divya
> 
>> On 27 July 2017 at 08:40, Paul Rogers <prog...@mapr.com> wrote:
>> 
>> Hi Divya,
>> 
>> Seems that you are asking for an “INSERT INTO” feature (DRILL-3534). The
>> idea would be to create new Parquet files within an existing partition
>> structure. That feature has not yet been started. So, the workarounds
>> provided might help you for now.
>> 
>> - Paul
>> 
>>> On Jul 26, 2017, at 8:46 AM, Saurabh Mahapatra <saurabhmahapatr...@gmail.com>
>>> wrote:
>>> 
>>> Does Drill provide that kind of functionality? Theoretically yes. CTAS
>>> should work, but your cluster has to be sized, and I would never put
>>> something in such a pipeline without adequate testing. I would always
>>> consider a lambda architecture to ensure that if this path were to fail
>>> (with Drill or any other combination of tools), you can recover from the
>>> failure. Each failure that you have puts you behind. If you have several
>>> failures, you will be backlogged and need a mechanism to catch up.
>>> 
>>> For data growth, you would need to go back to the source of the data and
>>> estimate the row cardinality. If this is coming from an OLTP system, then
>>> it is related to the volume of transactions in the business process. If
>>> you do not understand that load, your system will eventually start failing
>>> in the future, with Drill or otherwise.
>>> 
>>> Sizing and testing. Just do it.
>>> 
>>> Thanks,
>>> Saurabh
>>> 
>>> 
>>> 
>>> On Wed, Jul 26, 2017 at 2:52 AM, Divya Gehlot <divya.htco...@gmail.com>
>>> wrote:
>>> 
>>>> The data size is not big for every hour, but the data size will grow with
>>>> time: say I have data for 2 years and data is coming in on an hourly
>>>> basis; recreating the Parquet table every time is not a feasible solution.
>>>> In Hive, likewise, you create the partition and insert the data into the
>>>> partition accordingly.
>>>> I was looking for that kind of solution.
>>>> Does Drill provide that kind of functionality?
>>>> 
>>>> Thanks,
>>>> Divya
>>>> 
>>>> 
>>>> On 26 July 2017 at 15:04, Saurabh Mahapatra <saurabhmahapatr...@gmail.com>
>>>> wrote:
>>>> 
>>>>> I always recommend against using CTAS as a shortcut for an ETL-type
>>>>> large workload. You will need to size your Drill cluster accordingly.
>>>>> Consider using Hive or Spark instead.
>>>>> 
>>>>> What are the source file formats? For every hour, what is the size and
>>>>> the number of rows for that data? Are you doing any aggregations? And
>>>>> what is the lag between the streaming data and the data available for
>>>>> analytics that you are willing to tolerate?
>>>>> 
>>>>> On Tue, Jul 25, 2017 at 11:27 PM, rahul challapalli <
>>>>> challapallira...@gmail.com> wrote:
>>>>> 
>>>>>> I am not aware of any clean way to do this. However, if your data is
>>>>>> partitioned based on directories, then you can use the below hack, which
>>>>>> leverages temporary tables [1]. Essentially, you back up your partition
>>>>>> to a temp table, then overwrite it with the union of the new partition
>>>>>> data and the existing partition data. This way we are not overwriting
>>>>>> the entire table.
>>>>>> 
>>>>>> create temporary table mytable_2017 (col1, col2....) as select col1,
>>>>>> col2, ......from mytable where dir0 = '2017';
>>>>>> drop table `mytable/2017`;
>>>>>> create table `mytable/2017` as
>>>>>>  select col1, col2 .........from new_partition_data
>>>>>>  union
>>>>>>  select col1, col2 ......... from mytable_2017;
>>>>>> drop table mytable_2017;
>>>>>> 
>>>>>> Caveat: temporary tables get dropped automatically if the session ends
>>>>>> or the drillbit crashes. In the above sequence, if the connection
>>>>>> between the client and drillbit gets dropped (there are known issues
>>>>>> causing this) after executing the "DROP" statement, then your partition
>>>>>> data is lost forever. And since Drill doesn't support transactions, the
>>>>>> mentioned approach is dangerous.
>>>>>> 
>>>>>> [1] https://drill.apache.org/docs/create-temporary-table-as-cttas/
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jul 25, 2017 at 10:52 PM, Divya Gehlot <divya.htco...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> I am new to Apache Drill.
>>>>>>> I have data coming in every hour, and when I searched I couldn't find
>>>>>>> an insert-into-partition command in Apache Drill.
>>>>>>> How can we insert data into a particular partition without rewriting
>>>>>>> the whole data set?
>>>>>>> 
>>>>>>> 
>>>>>>> Appreciate the help.
>>>>>>> Thanks,
>>>>>>> Divya
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 
