Re: Storing large files for later processing through hadoop
On 03.01.2015 at 07:07, Srinivasa T N wrote:
> Hi Wilm,
> The reason is that for some auditing purpose, I want to store the
> original files also.

Well, then I would use an HDFS cluster for storage, as that seems to be exactly what you need. If you co-locate the HDFS DataNodes and YARN's ResourceManager you could also save a lot of hardware, or the costs for external services. Doing that is not generally recommended, but in your special case it should work, since you only use HDFS to store the XML for exactly that purpose.

But I'm more familiar with Hadoop, HDFS and HBase than with Cassandra, so perhaps I'm biased. And what Jacob proposed could be a solution, too. Spares a lot of nerves ;).

Best wishes,

Wilm
Re: Storing large files for later processing through hadoop
If it's for auditing, I'd recommend pushing the files out somewhere reasonably external. Amazon S3 works well for this type of thing, and you don't have to worry too much about backups and the like.

Sent from iPhone

> On 3 Jan 2015, at 5:07 pm, Srinivasa T N wrote:
>
> Hi Wilm,
> The reason is that for some auditing purpose, I want to store the original
> files also.
>
> Regards,
> Seenu.
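A minimal sketch of that S3 hand-off, assuming the AWS SDK for Java and a hypothetical bucket, key layout and file path (none of these names come from the thread):

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    import java.io.File;

    public class AuditUpload {
        public static void main(String[] args) {
            // Credentials and region come from the default provider chain
            // (environment variables, ~/.aws/credentials, instance profile, ...).
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Hypothetical bucket and key layout for the audit copies.
            String bucket = "xml-audit-archive";
            File xml = new File("/data/incoming/feed-2015-01-02.xml");
            String key = "raw-xml/2015/01/02/" + xml.getName();

            // putObject streams the file from disk; for very large files a
            // multipart upload via TransferManager would be more robust.
            s3.putObject(bucket, key, xml);
        }
    }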
Re: Storing large files for later processing through hadoop
Hi Wilm,
   The reason is that for some auditing purpose, I want to store the original files also.

Regards,
Seenu.
Re: Storing large files for later processing through hadoop
Hi,

perhaps I totally misunderstood your problem, but why "bother" with Cassandra for storing in the first place?

If your Hadoop MR job is only run once for each file (as you wrote above), why not copy the data directly to HDFS, run your MR job and use Cassandra as the sink?

As HDFS and YARN are more or less completely independent, you could perhaps use the "master" as ResourceManager (YARN) AND NameNode and DataNode (HDFS), launch your MR job directly and, as mentioned, use Cassandra as the sink for the reduced data. This way you won't need dedicated hardware, as you only need HDFS once: process the files and delete them afterwards.

Best wishes,

Wilm
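A minimal sketch of the "copy to HDFS, process, then delete" flow described above, using the Hadoop FileSystem API; the NameNode address and the paths are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StageXmlInHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point at the NameNode on the single "master" node (assumed host/port).
            conf.set("fs.defaultFS", "hdfs://master:8020");
            FileSystem fs = FileSystem.get(conf);

            // Stage the raw XML for the one-off MR run ...
            Path local = new Path("/data/incoming/feed-2015-01-02.xml");
            Path staged = new Path("/staging/xml/feed-2015-01-02.xml");
            fs.copyFromLocalFile(local, staged);

            // ... run the MR job with Cassandra as the sink, then the staged
            // copy can simply be removed again:
            // fs.delete(staged, false);
        }
    }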
Re: Storing large files for later processing through hadoop
> Since the hadoop MR streaming job requires the file to be processed to be
> present in HDFS, I was thinking whether it can get it directly from mongodb
> instead of me manually fetching it and placing it in a directory before
> submitting the hadoop job?

Hadoop M/R can get data directly from Cassandra. See CqlInputFormat.

~mck
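A rough sketch of the job wiring for reading straight out of Cassandra with CqlInputFormat (Cassandra's Hadoop integration); the contact point, keyspace and table are placeholders, and the exact key/value types the mapper receives depend on the Cassandra version:

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
    import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReadFromCassandraJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read-from-cassandra");
            job.setJarByClass(ReadFromCassandraJob.class);
            Configuration conf = job.getConfiguration();

            // Placeholder cluster contact point, partitioner, keyspace and table.
            ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
            ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
            ConfigHelper.setInputColumnFamily(conf, "audit", "raw_xml");
            // Page through wide rows in manageable slices.
            CqlConfigHelper.setInputCQLPageRowSize(conf, "100");

            job.setInputFormatClass(CqlInputFormat.class);
            // job.setMapperClass(...) / job.setReducerClass(...) as usual.

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }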
Re: Storing large files for later processing through hadoop
I agree that Cassandra is a columnar store. Storing the raw XML file, parsing it with Hadoop and writing back the extracted values happens only once. The extracted data, on which all further operations will be done, fits Cassandra's time-series style of storage well, and that is why I am trying to make it do something it was not designed for.

Regards,
Seenu.
Re: Storing large files for later processing through hadoop
> Can this split and combine be done automatically by cassandra when
> inserting/fetching the file without application being bothered about it?

There are client libraries which offer recipes for this, but in general, no.

You're trying to do something with Cassandra that it's not designed to do. You can get there from here, but you're not going to have a good time. If you need a document store, you should use a NoSQL solution designed with that in mind (Cassandra is a columnar store). If you need a distributed filesystem, you should use one of those.

If you do want to continue forward and do this with Cassandra, then you should definitely not do it on the same cluster that handles normal clients, as the kind of workload you'd be subjecting this cluster to is going to cause all sorts of trouble for normal clients, particularly with respect to GC pressure, compaction and streaming problems, and many other consequences of vastly exceeding recommended limits.
Re: Storing large files for later processing through hadoop
On Fri, Jan 2, 2015 at 5:54 PM, mck wrote:
>
> You could manually chunk them down to 64 MB pieces.
>

Can this split and combine be done automatically by Cassandra when inserting/fetching the file, without the application being bothered about it?

> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
> > the file from cassandra to HDFS when I want to process it in hadoop cluster?
>
> We keep HDFS as a volatile filesystem simply for hadoop internals. No
> need for backups of it, no need to upgrade data, and we're free to wipe
> it whenever hadoop has been stopped.
> ~mck

Since the hadoop MR streaming job requires the file to be processed to be present in HDFS, I was thinking whether it can get it directly from mongodb instead of me manually fetching it and placing it in a directory before submitting the hadoop job?

>> There was a datastax project before in being able to replace HDFS with
>> Cassandra, but I don't think it's alive anymore.

I think you are referring to the Brisk project (http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/) but I don't know its current status.

Can I use http://gerrymcnicol.azurewebsites.net/ for my task at hand?

Regards,
Seenu.
Re: Storing large files for later processing through hadoop
> 1) The FAQ … informs that I can have only files of around 64 MB …

See http://wiki.apache.org/cassandra/CassandraLimitations

"A single column value may not be larger than 2GB; in practice, "single
digits of MB" is a more reasonable limit, since there is no streaming or
random access of blob values."

CASSANDRA-16 only covers pushing those objects through compaction. Getting the objects in and out of the heap during normal requests is still a problem.

You could manually chunk them down to 64 MB pieces.

> 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
> the file from cassandra to HDFS when I want to process it in hadoop cluster?

We keep HDFS as a volatile filesystem simply for hadoop internals. No need for backups of it, no need to upgrade data, and we're free to wipe it whenever hadoop has been stopped. Otherwise all our hadoop jobs still read from and write to Cassandra. Cassandra is our "big data" platform, with hadoop/spark just providing additional aggregation abilities. I think this is the effective way, rather than trying to completely gut out HDFS.

There was a datastax project before in being able to replace HDFS with Cassandra, but I don't think it's alive anymore.

~mck
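A minimal sketch of the manual chunking mck mentions, using the DataStax Java driver; the keyspace, table schema, chunk size and file path are all assumptions (and, per the limits quoted above, single-digit-MB chunks are the safer choice in practice):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.UUID;

    public class ChunkedBlobWriter {
        // Assumed chunk size; small enough to keep individual column values modest.
        private static final int CHUNK_SIZE = 4 * 1024 * 1024;

        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("audit")) {

                // Assumed schema:
                //   CREATE TABLE xml_chunks (file_id uuid, chunk_no int, data blob,
                //                            PRIMARY KEY (file_id, chunk_no));
                PreparedStatement insert = session.prepare(
                    "INSERT INTO xml_chunks (file_id, chunk_no, data) VALUES (?, ?, ?)");

                UUID fileId = UUID.randomUUID();
                byte[] buf = new byte[CHUNK_SIZE];
                int chunkNo = 0;

                try (InputStream in = new FileInputStream("/data/incoming/feed-2015-01-02.xml")) {
                    int read;
                    while ((read = in.read(buf)) > 0) {
                        // Copy only the bytes actually read, then write one chunk row.
                        ByteBuffer data = ByteBuffer.wrap(Arrays.copyOf(buf, read));
                        session.execute(insert.bind(fileId, chunkNo++, data));
                    }
                }
            }
        }
    }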
Storing large files for later processing through hadoop
Hi All,
   The problem I am trying to address is: store the raw files (the files are in XML format and around 700 MB in size) in Cassandra, later fetch them and process them in a Hadoop cluster, and populate the processed data back into Cassandra. Regarding this, I wanted a few clarifications:

1) The FAQ (https://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage) says that I can only have files of around 64 MB, but at the same time points to the JIRA issue https://issues.apache.org/jira/browse/CASSANDRA-16, which was resolved back in version 0.6. So, in the present version of Cassandra (2.0.11), is there any limit on the size of the file in a column and, if so, what is it?

2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch the file from Cassandra to HDFS when I want to process it in the Hadoop cluster?

Regards,
Seenu.