Re: use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-30 Thread Andy Davidson
Hi Pedro, these are much more accurate performance numbers. In total I have 5,671,287 rows. Each row is stored as JSON. The JSON is very complicated and can be up to 4 KB per row. I randomly picked 30 partitions. My "big" files are at most 64 MB.

execution time | src | coalesce(num) | file size | num files
34 min | …
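To make the approach in this message concrete, here is a minimal sketch of consolidating many small streaming-output JSON files into fewer, larger files on HDFS using coalesce(num). The paths, the partition count, and the use of the Spark 2.x SparkSession API are assumptions for illustration, not details taken from the thread (on Spark 1.x this would go through SQLContext instead).

    // Hypothetical sketch: read many small JSON files written by a Spark
    // Streaming job and rewrite them as fewer, larger files on HDFS.
    import org.apache.spark.sql.SparkSession

    object ConsolidateSmallFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ConsolidateSmallFiles")
          .getOrCreate()

        // Read the many small JSON files produced by the streaming job
        // (path is an assumption).
        val df = spark.read.json("hdfs:///streaming/output/small-json/*")

        // Reduce the number of partitions so each output file is larger;
        // num would be tuned so each file lands near a ~64 MB target.
        val num = 30
        df.coalesce(num)
          .write
          .json("hdfs:///consolidated/big-json/")

        spark.stop()
      }
    }

coalesce(num) only merges existing partitions without a full shuffle, which is why it is a cheap way to control the number (and hence the size) of the output files.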

use big files and read from HDFS was: performance problem when reading lots of small files created by spark streaming.

2016-07-29 Thread Andy Davidson
Hi Pedro, I did some experiments using one of our relatively small data sets. The data set is loaded into 3 or 4 data frames, and I then call count(). It looks like using bigger files and reading from HDFS is a good solution for reading data. I guess I'll need to do something similar to this to deal …
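The experiment described above (load a few data frames from HDFS, then call count()) could look roughly like the following sketch. The HDFS paths and the timing helper are assumptions added for illustration; only the read-then-count pattern comes from the message.

    // Minimal sketch: load several data frames from HDFS and time count().
    import org.apache.spark.sql.SparkSession

    object CountBenchmark {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CountBenchmark")
          .getOrCreate()

        // Simple wall-clock timing helper.
        def timed[T](label: String)(body: => T): T = {
          val start = System.nanoTime()
          val result = body
          val seconds = (System.nanoTime() - start) / 1e9
          println(f"$label took $seconds%.1f s")
          result
        }

        // Hypothetical HDFS paths for the consolidated data sets.
        val paths = Seq(
          "hdfs:///consolidated/dataset1/",
          "hdfs:///consolidated/dataset2/",
          "hdfs:///consolidated/dataset3/"
        )

        paths.foreach { p =>
          val df = spark.read.json(p)
          timed(s"count($p)")(println(s"$p rows: ${df.count()}"))
        }

        spark.stop()
      }
    }

Because count() forces a full scan of each data set, its wall-clock time gives a rough read-throughput comparison between the many-small-files layout and the consolidated big-file layout.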