Re: Drill Questions

2018-02-07 Thread Joel Pfaff
Hello, I think the term "repartitioning" causes confusion, since it can be understood in two different ways: * changing the number of partitions independently of the content * regrouping data with the same value in a given column into the same partitions (potentially changing the number of
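
The two readings of "repartitioning" named in this message can be illustrated with a small pure-Python sketch. It is not Spark's or Drill's implementation; the function names `rebalance` and `regroup_by_column`, the sample rows, and the CRC32-based bucketing are all assumptions made for illustration.

```python
import zlib


def rebalance(rows, num_partitions):
    """Reading 1: change the partition count without looking at the content."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        # Round-robin assignment; the row's values play no role.
        partitions[index % num_partitions].append(row)
    return partitions


def regroup_by_column(rows, column, num_partitions):
    """Reading 2: rows sharing a value in `column` land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # A stable hash of the column value selects the target partition.
        bucket = zlib.crc32(str(row[column]).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return partitions


# Hypothetical sample data: 12 rows with three distinct key values.
rows = [{"key": "abc"[i % 3], "n": i} for i in range(12)]
balanced = rebalance(rows, 4)
grouped = regroup_by_column(rows, "key", 4)
```

In `balanced`, every partition has the same size but a given key is spread across several partitions; in `grouped`, each key lives in exactly one partition, and partition sizes may be uneven.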

Drill Questions

2018-02-06 Thread Saurabh Mahapatra
Posting this for John Humphreys who posted this in the MapR community but I think this may benefit all users: https://community.mapr.com/thread/22719-re-how-can-i-partition-data-in-drill 1. If I had Spark re-partition a data frame based on a column, and then saved the data frame to parquet

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-05 Thread Shankar Mane
Yes, I went through the benchmarks and started testing this one. I have tested it using Hadoop MapReduce, and it seems BZ worked faster than GZ. As far as I know, GZ is non-splittable and BZ is splittable. Hadoop MR takes advantage of this splittable property and launches multiple mappers and

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-05 Thread Khurram Faraaz
Shankar, this is expected behavior: bzip2 decompression is four to twelve times slower than gzip decompression. You can look at the comparison benchmark here for numbers - http://tukaani.org/lzma/benchmarks.html On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane wrote: > Please find
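
The decompression-speed claim in this message can be checked locally with a short sketch that uses only Python's standard `gzip` and `bz2` modules. It does not involve Drill; the helper name `time_decompression`, the synthetic JSON-lines payload, and its field values are assumptions, and the measured ratio will vary with data and hardware.

```python
import bz2
import gzip
import time


def time_decompression(payload: bytes, repeats: int = 3) -> dict:
    """Return the best-of-N decompression time in seconds for each codec."""
    codecs = {
        "gzip": (gzip.compress(payload), gzip.decompress),
        "bzip2": (bz2.compress(payload), bz2.decompress),
    }
    timings = {}
    for name, (blob, decompress) in codecs.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            restored = decompress(blob)
            best = min(best, time.perf_counter() - start)
        # Sanity check: the codec must round-trip the payload exactly.
        assert restored == payload
        timings[name] = best
    return timings


# Synthetic stand-in for "one JSON message per line" (hypothetical values).
sample = b"\n".join(
    b'{"channelid": %d, "serverTime": %d}' % (i % 50, 1469400000 + i)
    for i in range(20000)
)
result = time_decompression(sample)
print(result)
```

This measures single-threaded decompression only, so it says nothing about the parallelism a splittable format can allow in a framework that splits its input.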

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-04 Thread Khurram Faraaz
OK, so query planning took less than one second for both aggregate queries. It looks like most of the time is being spent in query execution. On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane wrote: > Please find the query plan for both queries. FYI: I am not seeing > any planning difference between

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-04 Thread Shankar Mane
Please find the query plan for both queries. FYI: I am not seeing any planning difference between these 2 queries except Cost. / Query on GZ / 0: jdbc:drill:> explain plan for select channelid, count(serverTime) from dfs.`/t

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-04 Thread Khurram Faraaz
Can you please run an explain plan on the two aggregate queries? That way we can tell where most of the time is being spent: in the query planning phase or in query execution. Please share the query plans and the time taken for those explain plan statements. On Mon,

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-01 Thread Shankar Mane
It is plain JSON (one JSON message per line). Size of each JSON message = ~4 KB; number of JSON messages = ~5 million. store.parquet.compression = snappy (I don't think this parameter is used, as I am only running SELECT queries). On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz wrote: > What is the data format with

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-01 Thread Khurram Faraaz
What is the data format within those .gz and .bz2 files? Is it Parquet, JSON, or plain text (CSV)? Also, what was the config parameter `store.parquet.compression` set to when you ran your test? - Khurram On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane wrote: > Awaiting for response.. > > O

Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-07-31 Thread Shankar Mane
Awaiting a response. On 30-Jul-2016 3:20 PM, "Shankar Mane" wrote: > > I am comparing query speed between GZ and BZ2. > > Below are the 2 files and their sizes (these 2 files contain the same data): > kafka_3_25-Jul-2016-12a.json.gz = 1.8G > kafka_3_25-Jul-2016-12a.json.bz2= 1.1G > > > > Results:

[Drill-Questions] Speed difference between GZ and BZ2

2016-07-30 Thread Shankar Mane
I am comparing query speed between GZ and BZ2. Below are the 2 files and their sizes (these 2 files contain the same data): kafka_3_25-Jul-2016-12a.json.gz = 1.8G kafka_3_25-Jul-2016-12a.json.bz2= 1.1G Results: 0: jdbc:drill:> select channelid, count(serverTime) from dfs.`/tmp/stest-gz/kafka_3_25-J