Hi,

We are using PySpark 1.3, and the input is text files located on HDFS, with the following structure:

    day1/
        file1.txt
        file2.txt
    day2/
        file1.txt
        file2.txt
    ...
Questions:

1) What is the way to provide multiple files, located in multiple folders (on HDFS), as input to a PySpark job? The textFile method works fine for a single file or folder, but how can I use it with multiple folders? Is there a way to pass an array or list of files?

2) What is the meaning of the partition parameter in the textFile method?

    sc = SparkContext(appName="TAD")
    lines = sc.textFile(<my input>, 1)

Thanks,
Oleg
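To make question 1 concrete, here is a sketch of what I am hoping to do. The `hdfs:///data/day1`-style paths are placeholders matching the layout above, and I am only guessing that a comma-separated string of paths might be accepted, since textFile takes a single string argument:

```python
# Placeholder folder paths mirroring the day1/day2 layout above.
folders = ["hdfs:///data/day1", "hdfs:///data/day2"]

# One idea: join the folders into a single comma-separated string,
# since textFile expects a string path argument.
input_path = ",".join(folders)
# input_path is now "hdfs:///data/day1,hdfs:///data/day2"

# Is something like this supported?
# lines = sc.textFile(input_path)
```

If comma-separated paths are not supported, is there another idiom (a wildcard pattern, or a union of per-folder RDDs) that achieves the same thing?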