If it's for auditing, I'd recommend pushing the files out somewhere reasonably external. Amazon S3 works well for this kind of thing, and you don't have to worry too much about backups and the like.
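For example, pushing each original file to S3 under a date-stamped "audit" prefix might look something like this. This is a minimal sketch using boto3; the bucket name and key layout are my own assumptions, not anything from this thread:

```python
from datetime import date

def audit_key(filename, day=None):
    """Build a date-stamped S3 key so audit files are easy to browse and expire."""
    day = day or date.today()
    return "audit/%s/%s" % (day.isoformat(), filename)

def upload_for_audit(path, bucket="my-audit-bucket"):
    """Upload one original file to S3 for auditing (bucket name is hypothetical)."""
    import boto3  # imported lazily so the key helper is usable without AWS set up
    s3 = boto3.client("s3")
    key = audit_key(path.rsplit("/", 1)[-1])
    s3.upload_file(path, bucket, key)
    return key
```

A nice side effect of the date-stamped prefix is that you can attach an S3 lifecycle rule to transition or expire old audit data instead of managing backups yourself.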
> On 3 Jan 2015, at 5:07 pm, Srinivasa T N <seen...@gmail.com> wrote:
>
> Hi Wilm,
>    The reason is that for some auditing purpose, I want to store the original
> files also.
>
> Regards,
> Seenu.
>
>> On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher <wilm.schumac...@gmail.com>
>> wrote:
>> Hi,
>>
>> perhaps I totally misunderstood your problem, but why "bother" with
>> cassandra for storing in the first place?
>>
>> If your MR for hadoop is only run once for each file (as you wrote
>> above), why not copy the data directly to hdfs, run your MR job and use
>> cassandra as sink?
>>
>> As hdfs and yarn are more or less completely independent you could
>> perhaps use the "master" as ResourceManager (yarn) AND NameNode and
>> DataNode (hdfs) and launch your MR job directly and as mentioned use
>> Cassandra as sink for the reduced data. By this you won't need dedicated
>> hardware, as you only need the hdfs once, process and delete the files
>> afterwards.
>>
>> Best wishes,
>>
>> Wilm
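The workflow Wilm describes (stage the file in HDFS, run the MR job with Cassandra as the sink, then delete the HDFS copy) can be sketched as a small driver. The paths, the job jar name, and the HDFS directory below are all placeholders, not real project details:

```python
import subprocess

def process_batch(local_path, hdfs_dir="/tmp/ingest", jar="job.jar",
                  run=subprocess.check_call):
    """Copy a file into HDFS, run the MR job (whose reducer writes to
    Cassandra), then remove the HDFS copy. Paths and jar are placeholders."""
    hdfs_path = "%s/%s" % (hdfs_dir, local_path.rsplit("/", 1)[-1])
    cmds = [
        ["hdfs", "dfs", "-put", local_path, hdfs_path],  # 1. stage input in HDFS
        ["hadoop", "jar", jar, hdfs_path],               # 2. MR job; Cassandra is the sink
        ["hdfs", "dfs", "-rm", hdfs_path],               # 3. input no longer needed
    ]
    for cmd in cmds:
        run(cmd)
    return cmds
```

Since the files are deleted right after each job, HDFS only needs transient capacity, which is what lets the NameNode/DataNode/ResourceManager share one "master" box as suggested above.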