Rahul

CTAS plus some file moves are what you need. Run a query against the new
file to force the metadata cache to be updated.
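
As a rough sketch of that sequence (not a tested recipe), the Python below
builds the CTAS statement and a cheap follow-up query, and submits them
through Drill's REST endpoint. The host and port, the use of the dfs.tmp
workspace, and the helper names (ctas_sql, touch_sql, run_sql) are all
assumptions made for illustration, and it assumes source_dir holds only the
files to be merged.

```python
import json
import urllib.request

# Assumption: a Drillbit running locally with the REST interface on port 8047.
DRILL_URL = "http://localhost:8047/query.json"


def ctas_sql(source_dir, target_table):
    """Build a CTAS statement that rewrites every file under source_dir
    into a new table in the (assumed writable) dfs.tmp workspace."""
    return (
        f"CREATE TABLE dfs.tmp.`{target_table}` AS "
        f"SELECT * FROM dfs.`{source_dir}`"
    )


def touch_sql(path):
    """Build a cheap query against path, to be run after the file moves
    so that Drill reads the new files at least once."""
    return f"SELECT * FROM dfs.`{path}` LIMIT 1"


def run_sql(sql, url=DRILL_URL):
    """Submit one SQL statement to Drill over REST and return the JSON reply."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

The intended order would be run_sql(ctas_sql(...)), then moving the output
files into place with ordinary file operations, then run_sql(touch_sql(...))
against the final location.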

Also, consider not building the weekly files. You might measure their
impact but I would expect no gain and possibly some loss of performance due
to less parallelism. In fact, you should test whether the monthly files
help you. It might be better to delay creating them for 30-90 days to make
queries faster by having more files. The utility of parallelism has a limit,
so you should test some scenarios.

For doing updates on multiple files atomically, you may like the 'cp -rl'
command. It makes a copy of a directory using hard links. You can modify
this new directory without touching the original. You can replace the
original directory using an atomic rename.
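
A minimal sketch of the same idea in Python, for the case where the batch job
is not a shell script. The function names and the ".old" suffix are made up
for illustration. Two caveats that follow from how hard links and rename
work: files in the linked copy share data with the originals, so they should
be added or removed rather than edited in place, and POSIX rename cannot
replace a non-empty directory, so this sketch moves the old directory aside
first and the swap is two renames rather than one.

```python
import os
import shutil


def hardlink_copy(src, dst):
    """Copy the directory tree src to dst, hard-linking files instead of
    copying their contents (roughly what 'cp -rl src dst' does)."""
    shutil.copytree(src, dst, copy_function=os.link)


def swap_in(staged, live):
    """Put the staged directory in place of the live one.

    Each rename is atomic on a single filesystem, but there is a short
    window between the two in which 'live' does not exist.
    """
    retired = live + ".old"  # hypothetical name for the directory being replaced
    os.rename(live, retired)
    os.rename(staged, live)
    shutil.rmtree(retired)
```

Typical use would be hardlink_copy(live, staged), then dropping the merged
file into staged and deleting the daily files there, then swap_in(staged,
live).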


On Oct 31, 2017 07:08, "Rahul Raj" <rahul....@option3consulting.com> wrote:

> Hi,
>
> I have a few questions on modeling a time series use case with parquet and
> drill. I have seen the topic discussed at
> https://issues.apache.org/jira/browse/DRILL-3534.
>
> My requirements are:
>
> * Keep the parquet files partitioned by year and month
> * For the current month, the data needs to be further partitioned by Week
> and Day
> * End of the running week, 7 daily parquets will be merged to a single
> weekly file
> * Similarly, weekly files will be merged to form a monthly file during
> month end
>
> I will have a web application to generate the daily data and to ensure the
> batch runs/ atomic writes/locking etc.
>
> What are the possible ways to merge parquet files? Another CTAS?
>
> Is it possible to use parquet-tools (part of Parquet-MR) to merge multiple
> parquets (java -jar ./parquet-tools-<VERSION>.jar <command> <input-directory>
> <output-file>) and then let drill query the results? Will it impact the
> drill metadata caching mechanism?
>
> Regards,
> Rahul
