Re: Feature/Question

Ted Dunning Thu, 27 May 2021 09:57:10 -0700

Akshay,

I don't understand why you can't use Drill to create the parquet files.
Can you say more?


Is there a language constraint? A process constraint?

As I hear it, you are asking "I don't want to use Drill to create parquet,
I want to use something else". The problem is that there are tons of other
ways. I start out not understanding your needs (because I think Drill is the
easiest way for me to create parquet files) and so I have no idea which
direction you are headed.

Just a little more definition could help me (and others) help you.

On Thu, May 27, 2021 at 8:18 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
abhasi...@bloomberg.net> wrote:

> Hi Drill Team,
>
> I have another question - is there a Python parquet module you provide/support
> which I can use to create the .parquet & .parquet.crc files that Drill
> creates?
>
> I currently have a drill cluster & I want to use it for reading the data
> but not creating the parquet files.
>
> I'm aware of other modules, but I want to preserve the speed &
> optimization of drill - so particularly looking at the module which drill
> uses to convert files to parquet & parquet.crc.
>
> My end goal here is to have a drill cluster reading data from s3 & a
> separate process to convert data to parquet & parquet.crc files & upload it
> to s3.
>
> Best,
> Akshay
>
> From: ted.dunn...@gmail.com At: 04/27/21 17:37:43 UTC-4:00
> To: Akshay Bhasin (BLOOMBERG/ 731 LEX ) <abhasi...@bloomberg.net>
> Cc: dev@drill.apache.org
> Subject: Re: Feature/Question
>
>
> Akshay,
>
> That's great news!
>
> On Tue, Apr 27, 2021 at 1:10 PM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
> abhasi...@bloomberg.net> wrote:
>
>> Hi Ted,
>>
>> Thanks for reaching out. Yes - the below worked successfully.
>>
>> I was able to create different objects in s3 like 'XXX/YYY/filename',
>> 'XXX/ZZZ/filename' and able to query like
>> SELECT * FROM XXX.
>>
>> Thanks !
>>
>> Best,
>> Akshay
>>
>> From: ted.dunn...@gmail.com At: 04/21/21 17:21:42
>> To: Akshay Bhasin (BLOOMBERG/ 731 LEX ) <abhasi...@bloomberg.net>,
>> dev@drill.apache.org
>> Subject: Re: Feature/Question
>>
>>
>> Akshay,
>>
>> Yes. It is possible to do what you want from a few different angles.
>>
>> As you have noted, S3 doesn't have directories. Not really. On the other
>> hand, people simulate this using naming schemes and S3 has some support for
>> this.
>>
>> One of the simplest ways to deal with this is to create a view that
>> explicitly mentions every S3 object that you have in your table. The
>> contents of this view can get a bit cumbersome, but that shouldn't be a
>> problem since users never need to know. You will need to set up a scheduled
>> action to update this view occasionally, but that is pretty simple.
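A minimal sketch of how the scheduled action could generate such a view, assuming the object keys have already been listed by some other means. The function names, the `s3` plugin name, and the view name in the usage line are illustrative assumptions, not anything Drill provides, and whether SELECT * combined with UNION ALL behaves well depends on the objects sharing a schema.

```python
from typing import Iterable


def quote_path(key: str) -> str:
    """Wrap an object key in backticks, as Drill does for path identifiers."""
    if "`" in key:
        raise ValueError(f"backtick not supported in key: {key!r}")
    return f"`{key}`"


def build_view_sql(view_name: str, keys: Iterable[str], plugin: str = "s3") -> str:
    """Return CREATE OR REPLACE VIEW text with one SELECT per object key.

    view_name and plugin are hypothetical; use whatever writable workspace
    and storage plugin name your cluster is configured with.
    """
    ordered = sorted(keys)  # stable order so regenerated views are comparable
    if not ordered:
        raise ValueError("at least one object key is required")
    selects = [f"SELECT * FROM {plugin}.{quote_path(k)}" for k in ordered]
    body = "\nUNION ALL\n".join(selects)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{body}"


if __name__ == "__main__":
    # Keys taken from the naming pattern used elsewhere in this thread.
    print(build_view_sql("dfs.tmp.`all_objects`",
                         ["XXX/YYY/filename", "XXX/ZZZ/filename"]))
```

The generated text would then be submitted to Drill by whatever client the scheduled job already uses; that step is left out here on purpose.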
>>
>> The other way is to use a naming scheme with a delimiter such as /. This
>> is described at
>> https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
>> If you do that and have files named (for instance) foo/a.json,
>> foo/b.json, foo/c.json and you query
>>
>>     select * from s3.`foo`
>>
>> you should see the contents of a.json, b.json and c.json. See here for
>> commentary
>> <https://stackoverflow.com/questions/44785065/apache-drill-how-to-query-all-files-in-an-s3-bucket>
>>
>> I haven't tried this, however, so I am simply going on the reports of
>> others. If this works for you, please report your success back here.
>>
>>
>>
>>
>>
>> On Wed, Apr 21, 2021 at 11:34 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <
>> abhasi...@bloomberg.net> wrote:
>>
>>> Hi Drill Community,
>>>
>>> I'm Akshay and I'm using Drill for a project I'm working on.
>>>
>>> There is a particular use case I want to implement - I want to know
>>> if it's possible.
>>>
>>> 1) Currently, we have a partition of file system and we create a view on
>>> top of it. For example, we have below directory structure -
>>>
>>> /home/product/product_name/year/month/day/*parquet
>>> /home/product/product_name_2/year/month/day/*parquet
>>> /home/product/product_name_3/year/month/day/*parquet
>>>
>>> Now, we create a view over it -
>>> Create view temp AS SELECT `dir0` AS prod, `dir1` as year, `dir2` as
>>> month, `dir3` as day, * from dfs.`/home/product`;
>>>
>>> Then, we can query all the data dynamically -
>>> SELECT * from temp LIMIT 5;
>>>
>>> 2) Now I want to replicate this behavior on s3. I want to ask if it's
>>> possible - I was able to create a logical directory, but s3 inherently does
>>> not support directories, only objects.
>>>
>>> Therefore, I was curious to know whether this is supported or whether
>>> there is a way to do it. I was unable to find any documentation on your
>>> website related to partitioning data on s3.
>>>
>>> Thanks for your help.
>>> Best,
>>> Akshay
>>
>>
>>
>
