Hi Pedro
These are much more accurate performance numbers.
In total I have 5,671,287 rows. Each row was stored in JSON. The JSON is
very complicated and can be up to 4 KB per row.
I randomly picked 30 partitions.
My "big" files are at most 64 MB.
execution time    src    coalesce(num)    file size    num files
34 min
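Roughly what the test job looked like (just a sketch, not the exact code: the HDFS paths, the coalesce value of 30 and the timing wrapper are placeholders, and it assumes the Spark 2.x SparkSession API):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-read-test").getOrCreate()

// Read the ~5.7M JSON rows and rewrite them as a small, fixed number of
// files so each output file stays around the 64 MB mark.
val df = spark.read.json("hdfs:///data/rows")          // placeholder path
df.coalesce(30)                                        // placeholder num
  .write.mode("overwrite")
  .json("hdfs:///data/rows_coalesced")                 // placeholder path

// Time a full scan of the rewritten data with count().
val start = System.nanoTime()
val n = spark.read.json("hdfs:///data/rows_coalesced").count()
println(s"rows = $n, elapsed = ${(System.nanoTime() - start) / 1e9 / 60} min")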
Hi Pedro
I did some experiments, using one of our relatively small data sets. The
data set is loaded into 3 or 4 DataFrames. I then call count().
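Something along these lines (a sketch only; the paths and the number of frames are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("count-test").getOrCreate()

// Load the small data set into a few DataFrames, one per input path.
val paths = Seq("hdfs:///small/ds1", "hdfs:///small/ds2", "hdfs:///small/ds3")  // placeholders
val frames = paths.map(p => spark.read.json(p))

// Calling count() forces a full read of each frame; time the whole pass.
val start = System.nanoTime()
frames.foreach(df => println(s"rows = ${df.count()}"))
println(f"total = ${(System.nanoTime() - start) / 1e9}%.1f s")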
Looks like using bigger files and reading from HDFS is a good solution for
reading data. I guess I'll need to do something similar to this to deal