Spark has sc.wholeTextFiles(), which returns an RDD of tuples. The first element
of each tuple is the file name and the second element is the file content.
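For example, a minimal sketch (assuming a SparkContext `sc` is available and the files live under a directory `path/`, which is an assumed name):

```scala
// Each element of the RDD is (fileName, fileContent).
val files = sc.wholeTextFiles("path/*")

// Tokenize each file's content, keeping the file name with every word,
// yielding (fileName, word) pairs.
val pairs = files.flatMap { case (fileName, content) =>
  content.split("\\s+").map(word => (fileName, word))
}
```

Note that wholeTextFiles reads each file entirely into one record, so it suits many small files better than a few very large ones.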
> So my question is: supposing all the files are in a directory and I read them
using sc.textFile("path/*"), how can I tell which file each piece of data came
from?
Maybe the input_file_name() function can help you:
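A rough sketch of that approach with the DataFrame API (assuming a SparkSession `spark` and the same assumed `path/` directory):

```scala
import org.apache.spark.sql.functions.{input_file_name, explode, split, col}

// Read all files as a Dataset of lines, tagging each row with the file
// it was read from via input_file_name().
val lines = spark.read.textFile("path/*")
  .withColumn("fileName", input_file_name())

// Split each line into words and pair every word with its source file.
val words = lines
  .select(col("fileName"), explode(split(col("value"), "\\s+")).as("word"))
```

This keeps the single glob-based read while still recovering the per-record file name.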
You can create your own data source that does exactly this.
Why is the file name important if the file content is the same?
> On 24. Sep 2018, at 13:53, Soheil Pourbafrani wrote:
>
> Hi, My text data are in the form of text file. In the processing logic, I
> need to know each word is from which …
Hi, my text data are in the form of text files. In the processing logic, I
need to know which file each word comes from. Actually, I need to tokenize the
words and create (fileName, word) pairs. The naive solution is to call
sc.textFile for each file, keep the file name in a variable, and create the
pairs, but
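The naive per-file approach can be sketched like this (a sketch only; assumes a SparkContext `sc`, and the file names are purely hypothetical):

```scala
// Hypothetical list of input files; the names are illustrative only.
val fileNames = Seq("path/a.txt", "path/b.txt")

// One sc.textFile call per file, tagging every word with its source file,
// then a union of all the per-file RDDs.
val pairs = fileNames
  .map { fileName =>
    sc.textFile(fileName)
      .flatMap(line => line.split("\\s+"))
      .map(word => (fileName, word))
  }
  .reduce(_ union _)
```

With many files this builds one RDD per file before the union, which is presumably why it feels like the naive solution.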