Dean,

We moved away from Hadoop and M/R, and instead we are using Storm as our compute grid. We queue keys in Kafka, then Storm distributes the work to the grid. It's working well so far, but we haven't taken it to prod yet. Data is read from Cassandra using a Cassandra-bolt.
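To make the shape of that pipeline concrete, here is a minimal sketch in plain Python. It is not Storm, Kafka, or the Cassandra-bolt: an in-memory `queue.Queue` stands in for the Kafka topic of keys, threads stand in for grid workers, and `FAKE_STORE`, `read_row`, `worker`, and `run_grid` are hypothetical names invented for illustration.

```python
import queue
import threading

# Hypothetical stand-in for a Cassandra column family: row key -> row data.
FAKE_STORE = {f"row-{i}": {"value": i} for i in range(10)}


def read_row(key):
    """Stand-in for the read a Cassandra-bolt would perform for one key."""
    return FAKE_STORE[key]


def worker(work_queue, results, lock):
    """Pull keys off the queue until a None sentinel arrives."""
    while True:
        key = work_queue.get()
        if key is None:
            break
        row = read_row(key)
        # The "compute" step is arbitrary here: double the stored value.
        with lock:
            results[key] = row["value"] * 2


def run_grid(keys, num_workers=3):
    """Queue the keys, let the workers drain them, and collect the results."""
    work_queue = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(work_queue, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for key in keys:
        work_queue.put(key)
    # One sentinel per worker so every thread exits.
    for _ in threads:
        work_queue.put(None)
    for thread in threads:
        thread.join()
    return results
```

Only keys travel through the queue; each worker fetches the row it needs itself, which is the same division of labor as queueing keys and letting a bolt do the read.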
If you end up using Storm, let me know. We have an unreleased version of the bolt that you probably want to use. (We're waiting on Nathan/Storm to fix some classpath loading issues.)

RE: a customer virtual keyspace Partitioner, point well taken.

-brian

---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> • healthmarketscience.com

On 10/2/12 9:33 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:

>Well, I think I know the direction we may follow so we can
>1. Have Virtual CF's
>2. Be able to map/reduce ONE Virtual CF
>
>Well, not map/reduce exactly but really really close. We use PlayOrm with
>its partitioning, so I am now thinking what we will do is have a compute
>grid where we can have each node doing a findAll query into the
>partitions it is responsible for. In this way, I think we can have 1000's of
>virtual CF's inside ONE CF and then PlayOrm does its query and retrieves
>the rows for that partition of one virtual CF.
>
>Anyone know of a compute grid we can dish out work to? That would be my
>only missing piece (well, that and the PlayOrm virtual CF feature, but I
>can add that within a week probably, though I am on vacation this Thursday
>to Monday).
>
>Later,
>Dean
>
>
>On 10/2/12 6:35 AM, "Hiller, Dean" <dean.hil...@nrel.gov> wrote:
>
>>So basically, with moving towards the 1000's of CF's all being put in one
>>CF, our performance is going to tank on map/reduce, correct? I mean, from
>>what I remember we could do map/reduce on a single CF, but by stuffing
>>1000's of virtual CF's into one CF, our map/reduce will have to read in
>>all 999 virtual CF's rows that we don't want just to map/reduce the ONE
>>CF.
>>
>>Map/reduce is VERY VERY SLOW when reading in 1000 times more rows :( :(.
>>
>>Is this correct? This really sounds like highly undesirable behavior.
>>There needs to be a way for people with 1000's of CF's to also run
>>map/reduce on any one CF. Doing map/reduce on 1000 times the number of
>>rows will be 1000 times slower... and of course, we will most likely get up
>>to 20,000 tables from my most recent projections... our last test load, we
>>ended up with 8k+ CF's. Since I kept two other keyspaces, Cassandra
>>started getting really REALLY slow when we got up to 15k+ CF's in the
>>system, though I didn't look into why.
>>
>>I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
>>map/reduce "just" the virtual CF!!!!! Ugh.
>>
>>Thanks,
>>Dean
>>
>>On 10/1/12 3:38 PM, "Ben Hood" <0x6e6...@gmail.com> wrote:
>>
>>>On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <b...@alumni.brown.edu>
>>>wrote:
>>>> It's just a convenient way of prefixing:
>>>>
>>>>http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html
>>>
>>>So given that it is possible to use a CF per tenant, should we assume
>>>that at sufficient scale there is less overhead to prefix
>>>keys than there is to manage multiple CFs?
>>>
>>>Ben
>>
>
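
For readers following the thread, the prefixing idea and the scan cost Dean is worried about can be sketched roughly as below. This is an illustration only, not Hector's or PlayOrm's API: the `:` separator, the dict standing in for one physical CF, and the names `prefixed_key`, `build_store`, and `scan_one_virtual_cf` are all assumptions made for the example.

```python
SEPARATOR = ":"  # assumed delimiter between the virtual CF name and the row key


def prefixed_key(virtual_cf, row_key):
    """Build the physical row key by prefixing it with the virtual CF name."""
    return f"{virtual_cf}{SEPARATOR}{row_key}"


def build_store(num_virtual_cfs, rows_per_cf):
    """Pack many virtual CFs into one dict that stands in for a single CF."""
    return {
        prefixed_key(f"cf{c}", f"r{r}"): r
        for c in range(num_virtual_cfs)
        for r in range(rows_per_cf)
    }


def scan_one_virtual_cf(store, virtual_cf):
    """Full scan that keeps only one virtual CF's rows.

    Returns the matching rows (with the prefix stripped) and the number of
    rows that had to be read to find them.
    """
    prefix = virtual_cf + SEPARATOR
    rows_read = 0
    matches = {}
    for key, value in store.items():
        rows_read += 1
        if key.startswith(prefix):
            matches[key[len(prefix):]] = value
    return matches, rows_read
```

With 1000 virtual CFs of equal size packed into one CF, a job that scans everything reads 1000 rows for every row it keeps. That is the slowdown described above, and it is why a per-partition query (or a job driven by a queue of known keys) avoids the problem.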