Hello,
I think there is some confusion around the name "repartitioning", since it can be
understood in two different ways:
* changing the number of partitions independently of the content
* regrouping data with the same value in a given column into the same
partitions (potentially changing the number of partitions)
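For illustration, a minimal PySpark sketch of the two meanings (the input
path and the column name `channelid` are made up here, borrowed from the
queries later in this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("/tmp/events.json")  # hypothetical input path

# Meaning 1: change only the number of partitions; which rows end up
# together is not based on any column value.
df_n = df.repartition(8)

# Meaning 2: hash-partition on a column, so rows with the same value of
# `channelid` land in the same partition (the partition count may also
# change, governed by spark.sql.shuffle.partitions).
df_col = df.repartition("channelid")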
Posting this for John Humphreys, who asked it in the MapR community; I
think it may benefit all users:
https://community.mapr.com/thread/22719-re-how-can-i-partition-data-in-drill
1. If I had Spark re-partition a data frame based on a column, and then
saved the data frame to parquet
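The question is cut off in the archive, but the setup it describes can be
sketched roughly like this in PySpark (paths and column name are
illustrative assumptions, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("/tmp/events.json")  # hypothetical input path

# Repartition on a column and save: rows with the same `channelid` hash
# into the same partition, and hence into the same output file(s).
df.repartition("channelid").write.mode("overwrite").parquet("/tmp/out_repart")

# A related alternative: partitionBy writes one directory per distinct
# value, which query engines can then use for partition pruning.
df.write.partitionBy("channelid").mode("overwrite").parquet("/tmp/out_partby")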
Yes, I went through the benchmarks and started testing this myself.
I tested this using Hadoop MapReduce, and BZ2 did run faster than GZ there.
As far as I know, GZ is non-splittable and BZ2 is splittable; Hadoop MR
takes advantage of the splittable property and launches multiple mappers
that process the splits in parallel.
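The effect of splittability is easy to see from Spark as well, since it
reads through the same Hadoop input formats; a small sketch (file names
are hypothetical):

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# gzip is not splittable: the whole file is a single input split, so one
# task has to decompress and scan it end to end.
print(sc.textFile("/tmp/kafka.json.gz").getNumPartitions())   # 1

# bzip2 is splittable: a large file is divided into several splits that
# can be decompressed and scanned in parallel.
print(sc.textFile("/tmp/kafka.json.bz2").getNumPartitions())  # > 1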
Shankar,
This is expected behavior: bzip2 decompression is four to twelve times
slower than gzip decompression.
You can look at the comparison benchmark here for numbers -
http://tukaani.org/lzma/benchmarks.html
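A quick way to sanity-check the claim locally is a micro-benchmark with
Python's standard gzip and bz2 modules; the exact ratio depends heavily on
the data, so treat this as an illustration only:

import bz2
import gzip
import time

# Synthetic, moderately compressible input; real JSON logs will differ.
data = bytes(range(256)) * (64 << 10)  # 16 MB

gz_blob = gzip.compress(data)
bz_blob = bz2.compress(data)

t0 = time.perf_counter()
gzip.decompress(gz_blob)
t_gz = time.perf_counter() - t0

t0 = time.perf_counter()
bz2.decompress(bz_blob)
t_bz = time.perf_counter() - t0

print(f"gzip:  {t_gz:.3f}s")
print(f"bzip2: {t_bz:.3f}s ({t_bz / t_gz:.1f}x slower)")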
On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane wrote:
OK, so query planning took less than one second for both aggregate queries.
It looks like most of the time is being spent in query execution.
On Thu, Aug 4, 2016 at 5:13 PM, Shankar Mane wrote:
Please find the query plan for both queries. FYI: I am not seeing
any planning difference between these two queries except the cost.
/* Query on GZ */
0: jdbc:drill:> explain plan for select channelid, count(serverTime) from
dfs.`/t
Can you please run an explain plan over the two aggregate queries? That way
we can tell where most of the time is being spent: in the query planning
phase, or in query execution. Please share the query plans and the time
taken for those explain plan statements.
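For reference, an explain plan statement can also be submitted
programmatically through Drill's REST endpoint; a rough sketch (the host,
port, and exact query text are assumptions pieced together from the
fragments in this thread):

import requests

# Drill accepts queries over REST at /query.json (default port 8047).
resp = requests.post(
    "http://localhost:8047/query.json",
    json={
        "queryType": "SQL",
        "query": "explain plan for "
                 "select channelid, count(serverTime) "
                 "from dfs.`/tmp/stest-gz` group by channelid",
    },
)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)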
It is plain JSON (one JSON record per line).
Each JSON message is ~4 KB in size.
Number of JSON messages: ~5 million.
store.parquet.compression = snappy (I don't think this parameter gets
used, since I am only running SELECT queries.)
On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz wrote:
What is the data format within those .gz and .bz2 files? Is it Parquet,
JSON, or plain text (CSV)?
Also, what was the config parameter `store.parquet.compression` set to
when you ran your test?
- Khurram
On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane wrote:
Awaiting a response.
On 30-Jul-2016 3:20 PM, "Shankar Mane" wrote:
I am comparing query speed between GZ and BZ2.
Below are the two files and their sizes (both files contain the same data):
kafka_3_25-Jul-2016-12a.json.gz  = 1.8G
kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G
Results:
0: jdbc:drill:> select channelid, count(serverTime) from
dfs.`/tmp/stest-gz/kafka_3_25-J