Hi,
I am sorry for the beginner's question, but...
I have a Spark Java job that reads a file (c:\my-input.csv), processes it,
and writes an output file (my-output.csv).
Now I want to run it on Hadoop in a distributed environment.
1) Should my input file be one big file or separate smaller files?
2) if
Hi, I am a beginner too, but from what I have learned, Hadoop works better
with big files, at least the size of an HDFS block (64 MB or 128 MB) or
even larger. I think you need to aggregate all the files into one big file
and then copy it to HDFS with this command:
hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE
hadoop just
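In case a concrete example helps: here is a minimal sketch of what the Spark
side might look like once the file is on HDFS, using Spark's Java RDD API.
The paths reuse the placeholders from above, and the filter step is just a
hypothetical stand-in for your own processing logic.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MyCsvJob {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MyCsvJob");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read from HDFS instead of the local c:\ path; Spark splits the
        // file into partitions (roughly one per HDFS block) and processes
        // them in parallel across the cluster.
        JavaRDD<String> lines = sc.textFile("hdfs:///YOUR_ROUTE_ON_HDFS/MYFILE");

        // Hypothetical processing step: keep only non-empty lines.
        JavaRDD<String> processed = lines.filter(line -> !line.isEmpty());

        // Note: the output is written as a directory of part-* files,
        // one per partition, not as a single my-output.csv.
        processed.saveAsTextFile("hdfs:///YOUR_ROUTE_ON_HDFS/my-output");

        sc.stop();
    }
}

You would then package this as a jar and launch it on the cluster with
spark-submit.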