Hi, I am a beginner too, but from what I have learned, Hadoop works better with big files, at least 64 MB or 128 MB (the usual HDFS block sizes) or even more. HDFS stores every file as blocks of that size and spreads the blocks across the datanodes, so a big file is automatically split among the servers, while lots of tiny files just waste block and task overhead. I think you need to aggregate all the files into one new big file.
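
If your input is a pile of small local CSV files, the simplest aggregation is just concatenating them before the upload. Here is a minimal sketch in plain Java; the directory name "input-parts" and the output file name are made up for the example, and if your CSVs carry header rows you would also need to skip the duplicated headers:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class MergeCsv {
        public static void main(String[] args) throws IOException {
            Path merged = Paths.get("my-input-merged.csv");
            try (OutputStream out = Files.newOutputStream(merged);
                 Stream<Path> parts = Files.list(Paths.get("input-parts"))) {
                // Append every part file to one output file, so HDFS ends up
                // storing a few big blocks instead of many tiny files.
                // Assumes every part file already ends with a newline.
                parts.sorted().forEach(p -> {
                    try {
                        Files.copy(p, out);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
            }
        }
    }
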
Then copy the big file to HDFS with this command:

    hadoop fs -put MYFILE /YOUR_ROUTE_ON_HDFS/MYFILE

It just copies MYFILE into the Hadoop distributed file system. About your second question, there is a small Spark sketch below your quoted message.

May I recommend something I have done? Go to BigDataUniversity.com and take the Hadoop Fundamentals I course. It is free and very well documented.

Regards,
Alonso Isidoro Roman

2014-03-03 12:10 GMT+01:00 goi cto <goi....@gmail.com>:

> Hi,
>
> I am sorry for the beginner's question, but...
> I have Spark Java code which reads a file (c:\my-input.csv), processes it,
> and writes an output file (my-output.csv).
> Now I want to run it on Hadoop in a distributed environment.
> 1) Should my input file be one big file or separate smaller files?
> 2) If we are using smaller files, how does my code need to change to
> process all of the input files?
>
> Will Hadoop just copy the files to different servers or will it also split
> their content among servers?
>
> Any example will be great!
> --
> Eran | CTO
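
P.S. Even if you keep the smaller files, your Spark code barely changes: textFile accepts a directory or a glob and reads every matching file into one RDD. A minimal sketch with the Java API; the master, app name, and HDFS path here are only placeholders, not from your code:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MultiFileExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "multi-file-example");
            // A glob (or a bare directory path) pulls in all matching files.
            // Spark creates roughly one partition per HDFS block, which is
            // another reason big files behave better than many small ones.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/eran/input/*.csv");
            System.out.println("total lines: " + lines.count());
            sc.stop();
        }
    }
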