Hi,
   We are using PySpark 1.3, and the input is text files located on HDFS.

File structure:
    <day1>
        file1.txt
        file2.txt
    <day2>
        file1.txt
        file2.txt
    ...

Questions:

   1) What is the way to provide multiple files located in multiple
folders (on HDFS) as input to a PySpark job?
Using the textFile method works fine for a single file or folder, but how
can I do it with multiple folders?
Is there a way to pass an array or a list of paths? (See the sketch just
below.)
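
For example, something along these lines is what I have in mind, though I
am not sure whether textFile accepts a comma-separated list or a glob (the
paths below are made up):

  # Hypothetical: join several day folders into one comma-separated string
  # and pass it to textFile in a single call.
  paths = ["hdfs:///data/day1", "hdfs:///data/day2"]
  lines = sc.textFile(",".join(paths))

  # Or a glob pattern covering all day folders at once.
  lines = sc.textFile("hdfs:///data/day*/*.txt")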

   2) What is the meaning of the partition parameter in the textFile
method?

  from pyspark import SparkContext

  sc = SparkContext(appName="TAD")
  # The second argument below is the parameter I am asking about.
  lines = sc.textFile(<my input>, 1)
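
My guess is that it sets a minimum number of partitions for the resulting
RDD, along these lines (hypothetical path):

  # If the second argument is a minimum partition count, asking for 4
  # should split the input into at least 4 partitions.
  lines = sc.textFile("hdfs:///data/day1", 4)
  print(lines.getNumPartitions())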

Thanks
Oleg.
