Re: RF=1 w/ hadoop jobs
On Mon, 2011-09-05 at 21:52 +0200, Patrik Modesto wrote:
> I'm not sure about 0.8.x and 0.7.9 (to be released today with your
> patch) but 0.7.8 will fail even with RF>1 when there is a Hadoop
> TaskTracker without local Cassandra. So increasing RF is not a
> solution.

This isn't true (or at least not the intention). If you increase RF then yes, the task will fail, but it will get re-run on the next replica. So the job takes longer but should still work.

~mck

-- 
"This is my simple religion. There is no need for temples; no need for complicated philosophy. Our own brain, our own heart is our temple; the philosophy is kindness." The Dalai Lama
| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no | Java XSS Filter |
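For context, the re-run behaviour described above comes from Hadoop's task-attempt retries; a hedged sketch (the property name is the 0.20-era one, and the default attempt limit of 4 is from memory):

```java
// Sketch, not Cassandra/Hadoop source: a failed map attempt is retried,
// by default up to mapred.map.max.attempts (4) times. With RF>1 a retry
// can land on a tasktracker whose local Cassandra node holds another
// replica of the split, so the job slows down but still completes.
Configuration conf = new Configuration();
conf.setInt("mapred.map.max.attempts", 8);  // give flaky replicas more chances
```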
Re: RF=1 w/ hadoop jobs
On Mon, Sep 5, 2011 at 09:39, Mick Semb Wever wrote:
> I've entered a jira issue covering this request.
> https://issues.apache.org/jira/browse/CASSANDRA-3136
>
> Would you mind attaching your patch to the issue.
> (No review of it will happen anywhere else.)

I see Jonathan didn't change his mind, as the ticket was resolved "won't fix". So that's it. I'm not going to attach the patch until another core Cassandra developer is interested in the use-cases mentioned there.

I'm not sure about 0.8.x and 0.7.9 (to be released today with your patch), but 0.7.8 will fail even with RF>1 when there is a Hadoop TaskTracker without local Cassandra. So increasing RF is not a solution.

Regards,
Patrik
Re: RF=1 w/ hadoop jobs
On Fri, 2011-09-02 at 09:28 +0200, Patrik Modesto wrote:
> We use Cassandra as a storage for web-pages; we store the HTML, all
> URLs that have the same HTML data, and some computed data. We run Hadoop
> MR jobs to compute lexical and thematic data for each page and to
> export the data to binary files for later use. A URL gets to
> Cassandra on user request (a pageview), so if we delete a URL, it comes
> back quickly if the page is active. Because of that, and because there
> is lots of data, we have the keyspace set to RF=1. We can drop the
> whole keyspace and it will regenerate quickly and contain only
> fresh data, so we don't care about losing a node.

I've entered a jira issue covering this request.
https://issues.apache.org/jira/browse/CASSANDRA-3136

Would you mind attaching your patch to the issue?
(No review of it will happen anywhere else.)

~mck

-- 
"Innovators and creative geniuses cannot be reared in schools. They are precisely the men who defy what the school has taught them." - Ludwig von Mises
| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no | Java XSS Filter |
Re: RF=1 w/ hadoop jobs
On Fri, Sep 2, 2011 at 08:54, Mick Semb Wever wrote:
> Patrik: is it possible to describe the use-case you have here?

Sure. We use Cassandra as a storage for web-pages; we store the HTML, all URLs that have the same HTML data, and some computed data. We run Hadoop MR jobs to compute lexical and thematic data for each page and to export the data to binary files for later use. A URL gets to Cassandra on user request (a pageview), so if we delete a URL, it comes back quickly if the page is active. Because of that, and because there is lots of data, we have the keyspace set to RF=1. We can drop the whole keyspace and it will regenerate quickly and contain only fresh data, so we don't care about losing a node.

But Hadoop does care; to be specific, the Cassandra ColumnFamilyInputFormat and ColumnFamilyRecordReader are the problem parts. If I stop one Cassandra node, all MR jobs that read/write Cassandra fail. In our case it doesn't matter, we can skip the range of URLs. The MR jobs run in a tight loop, so when the node is back with its data, we use them. It's not only about some HW crash; it also makes maintenance quite difficult. To stop a Cassandra node, you have to stop the tasktracker there too, which is unfortunate as there are other MR jobs that don't need Cassandra and could happily run.

Regards,
P.
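For reference, a job like the one described is typically wired up along these lines with the 0.7-era API (a sketch from memory; the exact ConfigHelper method names and the SlicePredicate setup may differ slightly between releases, and "WebPages"/"Pages" are made-up names):

```java
// Sketch of a 0.7-era Hadoop job reading a Cassandra column family.
Job job = new Job(getConf(), "page-analysis");
job.setInputFormatClass(ColumnFamilyInputFormat.class);

Configuration conf = job.getConfiguration();
ConfigHelper.setRpcPort(conf, "9160");
ConfigHelper.setInitialAddress(conf, "cassandra-seed.example.com");
ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(conf, "WebPages", "Pages");

// Only fetch the columns the job actually needs.
SlicePredicate predicate = new SlicePredicate()
        .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("html")));
ConfigHelper.setInputSlicePredicate(conf, predicate);
```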
Re: RF=1 w/ hadoop jobs
On Fri, 2011-09-02 at 08:20 +0200, Patrik Modesto wrote:
> As Jonathan already explained himself: "ignoring unavailable ranges is a
> misfeature, imo"

Generally it's not what one would want, I think. But I can see the case where data is to be treated as volatile and ignoring unavailable ranges may be acceptable.

For example, if you're searching for something or some pattern and one hit is enough. If you get the hit, it's a positive result regardless of whether ranges were ignored; if you don't, and you *know* there was a range ignored along the way, you can re-run the job later. The worst-case scenario here is no worse than the job always failing on you. Although some indication of ranges ignored is required.

Another example is when you're just trying to extract a small random sample (like a pig SAMPLE) of data out of Cassandra.

Patrik: is it possible to describe the use-case you have here?

~mck

-- 
"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man." - George Bernard Shaw
| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no | Java XSS Filter |
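The "skip but flag it" idea above can be sketched with a toy model (plain Java, no Cassandra classes; the names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy illustration (not Cassandra code): process the token ranges that
 * are available, but remember that some were skipped so the caller can
 * decide whether a re-run is needed later.
 */
public class SkipRanges {
    static List<String> processAvailable(List<String> ranges,
                                         List<String> downRanges,
                                         boolean[] skippedFlag) {
        List<String> processed = new ArrayList<>();
        for (String r : ranges) {
            if (downRanges.contains(r)) {
                skippedFlag[0] = true;   // record the gap instead of failing
            } else {
                processed.add(r);        // the map task would run over this range
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        boolean[] skipped = new boolean[1];
        List<String> done = processAvailable(
                List.of("(0,100]", "(100,200]", "(200,300]"),
                List.of("(100,200]"),    // range whose only replica is down
                skipped);
        System.out.println(done);        // the two ranges that were reachable
        System.out.println(skipped[0]);  // true -> caller may re-run the job later
    }
}
```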
Re: RF=1 w/ hadoop jobs
Hi,

On Thu, Sep 1, 2011 at 12:36, Mck wrote:
>> It's available here: http://pastebin.com/hhrr8m9P (for version 0.7.8)
>
> I'm interested in this patch and see its usefulness but no one will act
> until you attach it to an issue. (I think a new issue is appropriate
> here).

I'm glad someone finds my patch useful. As Jonathan already explained himself: "ignoring unavailable ranges is a misfeature, imo", I think opening a new ticket without support from more users is useless ATM. Please test the patch, and if you like it, then there is time for a ticket.

Regards,
P.
Re: RF=1 w/ hadoop jobs
On Thu, 2011-08-18 at 08:54 +0200, Patrik Modesto wrote:
> But there is another problem with Hadoop-Cassandra: if there is no
> node available for a range of keys, it fails with a RuntimeException. For
> example, having a keyspace with RF=1 and a node down, all MapReduce
> tasks fail.

CASSANDRA-2388 is related but not the same. Before 0.8.4 the behaviour was that if the local Cassandra node didn't have the split's data, the tasktracker would connect to another Cassandra node where the split's data could be found. So even before 0.8.4, with RF=1, you would have your Hadoop job fail. Although I've reopened CASSANDRA-2388 (and reverted the code locally) because the new behaviour in 0.8.4 leads to abysmal tasktracker throughput (for me, task allocation doesn't seem to honour data-locality according to split.getLocations()).

> I've reworked my previous patch, that was addressing this
> issue, and now there are ConfigHelper methods to enable/disable
> ignoring unavailable ranges.
> It's available here: http://pastebin.com/hhrr8m9P (for version 0.7.8)

I'm interested in this patch and see its usefulness, but no one will act until you attach it to an issue. (I think a new issue is appropriate here.)

~mck
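The data-locality complaint can be sketched as a toy model (plain Java, not Hadoop internals; the names are illustrative): a scheduler that honours split.getLocations() prefers an idle tracker that is also a replica for the split, and only falls back to a remote tracker when no local one is free.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Toy model (not Hadoop source): locality-aware task allocation.
 * splitLocations plays the role of InputSplit.getLocations().
 */
public class LocalityPick {
    static String pickTracker(List<String> splitLocations, List<String> freeTrackers) {
        for (String t : freeTrackers) {
            if (splitLocations.contains(t)) {
                return t;            // data-local: the task reads from its own node
            }
        }
        return freeTrackers.get(0);  // remote fallback: throughput suffers
    }

    public static void main(String[] args) {
        String picked = pickTracker(
                Arrays.asList("node3"),                     // replicas for this split
                Arrays.asList("node1", "node2", "node3"));  // idle tasktrackers
        System.out.println(picked);  // node3
    }
}
```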