How to read large files from a directory?

2017-05-09 Thread ashwini anand
just 50 lines of each file. Please find the code at the link below: https://gist.github.com/ashwini-anand/0e468da9b4ab7863dff14833d34de79e The size of each file in the directory can be very large in my case, so using the wholeTextFiles API would be inefficient here. Right now
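When only the first 50 lines of each file are needed, one alternative to wholeTextFiles is to parallelize the list of file paths and read just those lines on the executors. Below is a minimal sketch, assuming the files sit on a filesystem every executor can open directly (local mode or a shared mount); the input directory and the first_lines helper are illustrative, not from the original post:

import os
from itertools import islice

from pyspark import SparkContext

sc = SparkContext(appName="firstLinesOnly")

input_dir = "/data/input"  # hypothetical directory

# Distribute the file names, not the file contents.
paths = [os.path.join(input_dir, name) for name in os.listdir(input_dir)]

def first_lines(path, n=50):
    # Read only the first n lines instead of materialising the whole
    # file the way wholeTextFiles would.
    with open(path) as fh:
        return path, list(islice(fh, n))

results = sc.parallelize(paths).map(first_lines).collect()

For files on HDFS or another distributed store the same idea applies, but the open() call would have to go through the corresponding filesystem client rather than the local filesystem.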

How to read large files from a directory?

2017-05-09 Thread ashwini anand
I am reading each file of a directory using wholeTextFiles. After that, I call a function on each element of the RDD using map. The whole program uses just 50 lines of each file. The code is as below:

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result =
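The snippet above is cut off in the archive; a hedged reconstruction of the pattern it describes (wholeTextFiles followed by a map over (fileName, contents) pairs that keeps only 50 lines) might look like the following, where the input path and the returned value are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="processFiles")

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    contents = fileNameContentsPair[1]
    # Keep only the first 50 lines of each file, as described in the post.
    result = contents.splitlines()[:50]
    return fileName, result

# wholeTextFiles yields (fileName, fullContents) pairs, which is why it
# becomes expensive when the individual files are large.
pairs = sc.wholeTextFiles("hdfs:///data/input")  # placeholder path
output = pairs.map(processFiles).collect()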

How does partitioning happen for binary files in Spark?

2017-04-06 Thread ashwini anand
By looking into the source code, I found that for textFile(), the partitioning is computed by the computeSplitSize() function in the FileInputFormat class. This function takes into consideration the minPartitions value passed by the user. As per my understanding, the same thing for binaryFiles() is
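For reference, both textFile() and binaryFiles() accept a minPartitions hint in the public PySpark API, and the resulting split can be inspected with getNumPartitions(). A small sketch (the paths are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="partitionCheck")

# Both APIs take a minPartitions hint; how it is honoured internally
# differs (textFile goes through FileInputFormat.computeSplitSize()).
text_rdd = sc.textFile("hdfs:///data/text", minPartitions=8)
binary_rdd = sc.binaryFiles("hdfs:///data/binary", minPartitions=8)

print("textFile partitions:", text_rdd.getNumPartitions())
print("binaryFiles partitions:", binary_rdd.getNumPartitions())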