Hello,
I used the approach that you suggested:
lines = sc.textFile("/input/lprs/2015_05_15/file4.csv, /input/lprs/2015_05_14/file3.csv, /input/lprs/2015_05_13/file2.csv, /input/lprs/2015_05_12/file1.csv")
but it doesn't work for me:
py4j.protocol.Py4JJavaError: An error occurred
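For reference, SparkContext.textFile does accept a single comma-separated string of paths; one guess, which the truncated traceback above neither confirms nor rules out, is that the spaces after the commas end up inside the path names. A minimal sketch with the whitespace stripped, reusing the same HDFS paths:

from pyspark import SparkContext

sc = SparkContext(appName="multi-file-input")

# Build a comma-separated path string with no whitespace around the commas,
# so each entry is an exact HDFS path.
paths = ",".join([
    "/input/lprs/2015_05_15/file4.csv",
    "/input/lprs/2015_05_14/file3.csv",
    "/input/lprs/2015_05_13/file2.csv",
    "/input/lprs/2015_05_12/file1.csv",
])
lines = sc.textFile(paths)
print(lines.count())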
Hi Oleg,
For 1, RDD#union will help. You can iterate over the folders and union the
resulting RDDs as you go; a rough sketch follows.
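A minimal sketch of that approach, assuming the folder names are known up front (the folder list below is illustrative, not taken from the thread):

from functools import reduce
from pyspark import SparkContext

sc = SparkContext(appName="union-folders")

# Hypothetical list of dated HDFS folders; substitute the real directories.
folders = [
    "/input/lprs/2015_05_12",
    "/input/lprs/2015_05_13",
    "/input/lprs/2015_05_14",
    "/input/lprs/2015_05_15",
]

# Read each folder into its own RDD, then fold them together with union().
rdds = [sc.textFile(folder) for folder in folders]
lines = reduce(lambda a, b: a.union(b), rdds)

sc.union(rdds) would do the same thing in a single call.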
For 2, it seems it won't work in a deterministic way according to this discussion:
http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontex
Hi,
We are using PySpark 1.3 and the input is text files located on HDFS.
file structure:
file1.txt
file2.txt
file1.txt
file2.txt
...
Questions:
1) What is the way to provide multiple files as input for a PySpark job