Please create a GitHub repo and upload the code there...
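
In the meantime, one workaround for the memory problem is to parallelize the list of file paths rather than the file contents, and have each task open its own file and stop after 50 rows, so no whole file is ever held in memory. Below is a minimal sketch of that idea. It assumes the files live on a filesystem every executor can read (local disk or NFS; for HDFS or S3 you would need a suitable client inside the task), and the name first50Rows and the glob pattern are illustrative, not from your code.

import sys
import csv
import glob
from itertools import islice
from pyspark import SparkContext

def first50Rows(path):
    # Hypothetical helper: open the file lazily inside the task and
    # stop after 50 rows, so the file is streamed, never fully loaded.
    result = "\n\n" + path
    try:
        with open(path) as f:
            reader = csv.reader(f, strict=True)
            for row in islice(reader, 50):
                # do some processing and build the result string
                pass
    except csv.Error:
        return "\n\n" + path + " error occurred\n\n"
    return result

if __name__ == "__main__":
    inputDir = sys.argv[1]
    outputFile = sys.argv[2]
    sc = SparkContext(appName="SomeApp")
    # Distribute the paths, not the contents; each task then reads
    # only the first 50 lines of the file it is given.
    paths = glob.glob(inputDir + "/*")
    resultRDD = sc.parallelize(paths).map(first50Rows)
    resultRDD.saveAsTextFile(outputFile)

This keeps per-task memory bounded by 50 CSV rows instead of a whole file; the trade-off is one small task per file and no data locality, which is usually acceptable when the bottleneck is file size rather than file count.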
Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2017-05-09 8:47 GMT+02:00 ashwini anand <aayan...@gmail.com>:

> I am reading each file of a directory using wholeTextFiles. After that I
> am calling a function on each element of the RDD using map. The whole
> program uses just the first 50 lines of each file. The code is as below:
>
> import sys
> import csv
> import StringIO
> from pyspark import SparkContext
>
> def processFiles(fileNameContentsPair):
>     fileName = fileNameContentsPair[0]
>     result = "\n\n" + fileName
>     resultEr = "\n\n" + fileName
>     input = StringIO.StringIO(fileNameContentsPair[1])
>     reader = csv.reader(input, strict=True)
>     try:
>         i = 0
>         for row in reader:
>             if i == 50:
>                 break
>             # do some processing and get result string
>             i = i + 1
>     except csv.Error as e:
>         resultEr = resultEr + "error occurred\n\n"
>         return resultEr
>     return result
>
> if __name__ == "__main__":
>     inputFile = sys.argv[1]
>     outputFile = sys.argv[2]
>     sc = SparkContext(appName="SomeApp")
>     resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
>     resultRDD.saveAsTextFile(outputFile)
>
> Each file in the directory can be very large in my case, and because of
> this the wholeTextFiles API will be inefficient here: right now
> wholeTextFiles loads the full file content into memory. Can we make
> wholeTextFiles load only the first 50 lines of each file? Apart from
> wholeTextFiles, the other solution I can think of is iterating over each
> file of the directory one by one, but that also seems inefficient. I am
> new to Spark. Please let me know if there is any efficient way to do this.
>
> ------------------------------
> View this message in context: How to read large size files from a
> directory?
> <http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-large-size-files-from-a-directory-tp28669.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.