Please create a GitHub repo and upload the code there...

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman

2017-05-09 8:47 GMT+02:00 ashwini anand <aayan...@gmail.com>:

> I am reading each file of a directory using wholeTextFiles. After that I
> am calling a function on each element of the RDD using map. The whole
> program uses just the first 50 lines of each file. The code is as below:
>
>     import csv
>     import sys
>     import StringIO
>     from pyspark import SparkContext
>
>     def processFiles(fileNameContentsPair):
>         fileName = fileNameContentsPair[0]
>         result = "\n\n" + fileName
>         resultEr = "\n\n" + fileName
>         input = StringIO.StringIO(fileNameContentsPair[1])
>         reader = csv.reader(input, strict=True)
>         try:
>             i = 0
>             for row in reader:
>                 if i == 50:
>                     break
>                 # do some processing and get result string
>                 i = i + 1
>         except csv.Error as e:
>             resultEr = resultEr + "error occurred\n\n"
>             return resultEr
>         return result
>
>     if __name__ == "__main__":
>         inputFile = sys.argv[1]
>         outputFile = sys.argv[2]
>         sc = SparkContext(appName="SomeApp")
>         resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
>         resultRDD.saveAsTextFile(outputFile)
>
> The size of each file in the directory can be very large in my case, and
> for that reason the wholeTextFiles API will be inefficient here: right now
> wholeTextFiles loads the full file content into memory. Can we make
> wholeTextFiles load only the first 50 lines of each file? Apart from using
> wholeTextFiles, the other solution I can think of is iterating over each
> file of the directory one by one, but that also seems inefficient. I am
> new to Spark. Please let me know if there is an efficient way to do this.
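For what it's worth, one way around loading whole files is to skip
wholeTextFiles entirely: parallelize the list of file paths and let each
task open its own file and stop after 50 lines, so the rest of each file is
never read. A minimal sketch, assuming the files live on a filesystem that
every executor can open directly (local or NFS paths rather than HDFS); the
helper name and the glob pattern are illustrative, not from the original code:

    import csv
    import glob
    import sys
    from itertools import islice

    from pyspark import SparkContext

    def processFirstLines(path, maxLines=50):
        # Open the file lazily; islice stops the reader after maxLines rows,
        # so the rest of the file is never pulled into memory.
        result = "\n\n" + path
        try:
            with open(path) as f:
                reader = csv.reader(f, strict=True)
                for row in islice(reader, maxLines):
                    pass  # do some processing and build the result string
        except csv.Error:
            return "\n\n" + path + " error occurred\n\n"
        return result

    if __name__ == "__main__":
        inputDir = sys.argv[1]
        outputDir = sys.argv[2]
        sc = SparkContext(appName="First50Lines")
        # One RDD element per file path; each executor reads only what it needs.
        paths = glob.glob(inputDir + "/*")
        resultRDD = sc.parallelize(paths).map(processFirstLines)
        resultRDD.saveAsTextFile(outputDir)

The trade-off: glob.glob runs on the driver, so it only sees the driver's
filesystem, and the open() inside each task must resolve on the workers too.
For files on HDFS you would need an input source that supports partial reads
instead of plain open().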
