Hi, I'm new to Spark. I've begun reading up on Spark's RDDs as well as Spark SQL. My question is about how to build out the RDDs and what the best practices are. I have data on HDFS in Avro format, broken down into one file per hour. Do I need to create a separate RDD for each file, or, if I use Spark SQL, a separate SchemaRDD per file?
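To make the question concrete, here is a rough sketch of the two approaches I'm weighing, using a made-up directory layout of /data/events/YYYY-MM-DD/HH.avro (the paths and app name are just for illustration):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("avro-analytics"))

// Approach 1: one RDD per hourly file -- the per-file approach I'm asking about.
val hour00 = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
                           AvroInputFormat[GenericRecord]](
  "hdfs:///data/events/2014-10-01/00.avro")

// Approach 2: one RDD for the whole day, using a glob over all 24 hourly files.
val day = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable,
                        AvroInputFormat[GenericRecord]](
  "hdfs:///data/events/2014-10-01/*.avro")

// Unwrap to the underlying Avro records for analysis.
// (Hadoop input formats reuse record objects, so copy them before caching/collecting.)
val records = day.map { case (wrapper, _) => wrapper.datum() }
```

Is the glob-style load in the second approach the right way to go, or is there a reason to keep the hourly files as separate RDDs?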
I want to be able to pull, let's say, an entire day of data into Spark and run some analytics on it, then possibly a week, a month, etc. If there is documentation on this procedure, or on best practices for building RDDs, please point me to it.

Thanks for your time,
Sam