Re: Batch load of unstructured data in Drill

2016-12-08 Thread Alexander Reshetov
Hi Stefán, Yes, I'm considering this option now (since there are no better options). I've hit a limitation, though: you cannot query across a directory when the schema differs between files. Error: UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes. On Fri, Dec 9, 2016 at …
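A minimal sketch of the failure mode, with hypothetical paths and column names: an aggregate over a directory whose JSON files disagree on schema can trip this error. Casting the grouped columns to one type sometimes helps, but it is not a general fix; the directory-structure approach discussed in this thread is the more reliable workaround.

    -- Hypothetical layout: /data/events/a.json has columns (id, ts),
    -- b.json has (id, ts, extra). An aggregate over the whole directory
    -- may fail with:
    --   UNSUPPORTED_OPERATION ERROR: Hash aggregate does not support schema changes
    SELECT id, COUNT(*) AS cnt
    FROM dfs.`/data/events`
    GROUP BY id;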

Re: Batch load of unstructured data in Drill

2016-12-08 Thread Stefán Baxter
Hi, Have you considered batching them up into a nicely defined directory structure and using directory pruning as part of your queries? I ask because our batch processes do that. Data is arranged into Hour, Day, Month, Quarter, Year structures (which we then roll up in different ways, based on v…
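A sketch of what that can look like, assuming a hypothetical year/month/day layout under the dfs storage plugin; dir0, dir1 and dir2 are Drill's implicit directory columns, and filtering on them lets the planner prune the sub-directories that are not needed:

    -- Hypothetical layout: /data/events/2016/12/08/part-0001.json, ...
    SELECT t.user_id, t.amount
    FROM dfs.`/data/events` t
    WHERE dir0 = '2016'   -- year
      AND dir1 = '12'     -- month
      AND dir2 = '08';    -- day: only this sub-directory is scanned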

Re: Batch load of unstructured data in Drill

2016-12-08 Thread John Omernik
Sure... I believe you could CTAS from your JSON directory into a temporary Parquet directory and then move the resulting files into the final Parquet directory, i.e. Drill query: Create table `.mytempparq` as select * from `.mytempjson`; filesystem command: mv ./mytempparq/* ./myfinalparq. It would be gr…
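Spelled out a little more fully, as a sketch with hypothetical workspace and path names (dfs.tmp is assumed to be a writable workspace):

    -- 1. Drill: CTAS from the staging JSON directory into a temporary Parquet table
    CREATE TABLE dfs.tmp.`mytempparq` AS
    SELECT * FROM dfs.`/staging/mytempjson`;

    -- 2. Filesystem (outside Drill): move the generated Parquet files into the
    --    final directory that production queries point at
    --    mv /tmp/mytempparq/* /data/myfinalparq/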

Re: Batch load of unstructured data in Drill

2016-12-08 Thread Alexander Reshetov
By the way, is it possible to append data to a Parquet data source? I'm looking for a way to append new rows to existing data so that every query execution sees the new rows. It's certainly possible with plain JSON, but I want a more efficient binary format that gives quicker reads (…

Re: Batch load of unstructured data in Drill

2016-12-08 Thread Alexander Reshetov
Hi John, Thanks, I tried a directory containing several Parquet sub-directories. It works and looks in Drill like one Parquet data source. Not exactly what I want, but it's a good workaround. Thanks again. On Wed, Dec 7, 2016 at 4:39 PM, John Omernik wrote: > Alexander - > > When I have someth…
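A sketch of that workaround with hypothetical paths: each batch is written as its own Parquet sub-directory, and queries against the parent directory treat all of them as a single data source, which stands in for appending:

    -- Each incoming batch becomes a new sub-directory under /data/final
    CREATE TABLE dfs.`/data/final/batch_0002` AS
    SELECT * FROM dfs.`/staging/new_batch.json`;

    -- Queries over the parent directory see every batch as one table
    SELECT COUNT(*) FROM dfs.`/data/final`;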

Re: Batch load of unstructured data in Drill

2016-12-07 Thread Alexander Reshetov
Hi Stefán, Yes, thanks, I know about the CTAS option and it works fine, and it's much faster than reading the JSON directly. I'm looking for a way to load batch data from other sources, for example from the Kafka Connect Sink module. On Wed, Dec 7, 2016 at 4:33 PM, Stefán Baxter wrote: > Hi Alexander, > > …

Re: Batch load of unstructured data in Drill

2016-12-07 Thread John Omernik
Alexander - When I have something like this, especially when the output will be extremely large, I use CTAS into Parquet files. That said, I think you are looking more at the ETL process for JSON. So, ignoring the CTAS to Parquet for now, if you have a bunch of JSON files that will be loaded incr…

Re: Batch load of unstructured data in Drill

2016-12-07 Thread Stefán Baxter
Hi Alexander, Drill allows you to both a) query the data directly in JSON format and b) convert it to Parquet (have a look at the CTAS function). Hope that helps, -Stefán On Wed, Dec 7, 2016 at 1:08 PM, Alexander Reshetov < alexander.v.reshe...@gmail.com> wrote: > Hello, > > I want to load batch…
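Both options as a minimal Drill SQL sketch, with hypothetical paths:

    -- a) Query the JSON directly (a single file or a directory of files)
    SELECT * FROM dfs.`/data/raw/events.json` LIMIT 10;

    -- b) Convert it to Parquet with CTAS (Parquet is Drill's default CTAS output format)
    CREATE TABLE dfs.tmp.`events_parquet` AS
    SELECT * FROM dfs.`/data/raw/events.json`;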

Batch load of unstructured data in Drill

2016-12-07 Thread Alexander Reshetov
Hello, I want to load batches of unstructured data into Drill, mostly JSON data. Is there a batch API or any other option for doing so? Thanks.