Re: Cassandra & Spark
Spark can read HDFS directly, so data locality matters there. Spark can't read Cassandra data directly, though; it can only connect through the API. So I think you don't need to install them on the same node.

On Sat, Aug 25, 2018 at 3:16 PM, Affan Syed wrote:

> Tobias,
>
> This is very interesting. Can I inquire a bit more on why you have both C*
> and Kudu in the system?
>
> Wouldn't keeping just Kudu work (wasn't that its initial purpose)? Does it
> have something to do with its production readiness? I ask as we have a
> similar concern as well.
>
> Finally, how are your dashboard apps talking to Kudu? Is there a backend
> that talks via Impala, or do you have some calls to bash-level scripts
> communicating over some file system?
>
> - Affan
>
> On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:
>
>> Hi
>>
>> What I wanted was a dashboard with graphs/diagrams, and it should not take
>> minutes for the page to load.
>>
>> Thus, it was a problem to have Spark with Cassandra without solving the
>> parallelization to such an extent that I could have the diagrams rendered
>> in seconds.
>>
>> Now with Kudu we get some decent results rendering the diagrams/graphs.
>>
>> The way we transfer data from Cassandra, which is the production system
>> storage, to Kudu is through an Apache Kafka topic (or many topics, actually),
>> and then we have an application which ingests the data into Kudu:
>>
>> Other Systems --> Domain Storage App(s) --> Cassandra --> KAFKA -->
>> KuduIngestion App --> Kudu <-- Dashboard App(s)
>>
>> If you want to play with really fast analytics, then perhaps consider
>> looking at Apache Ignite
>>
>> https://ignite.apache.org
>>
>> which then acts as a layer between Cassandra and your applications storing
>> into Cassandra (an in-memory data grid, I think it is called).
>>
>> Basically, think of it as a big cache.
>>
>> It is an in-memory thingy ☺
>>
>> And then you can run some super fast queries.
>>
>> -Tobias
>>
>> *From: *DuyHai Doan
>> *Date: *Thursday, 8 June 2017 at 15:42
>> *To: *Tobias Eriksson
>> *Cc: *한 승호, "user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject: *Re: Cassandra & Spark
>>
>> Interesting
>>
>> Tobias, when you said "Instead we transferred the data to Apache Kudu",
>> did you transfer all Cassandra data into Kudu with a single migration
>> and then tap into Kudu for aggregation, or did you run a data import every
>> day/week/month from Cassandra into Kudu?
>>
>> From my point of view, the difficulty is not to have a static set of data
>> and run aggregation on it; there are a lot of alternatives out there. The
>> difficulty is to be able to run analytics on a live/production/changing
>> dataset with all the data movement & updates that it implies.
>>
>> Regards
>>
>> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:
>>
>> Hi
>>
>> Something to consider before moving to Apache Spark and Cassandra:
>>
>> I have a background where we have tons of data in Cassandra, and we
>> wanted to use Apache Spark to run various jobs.
>>
>> We loved what we could do with Spark, BUT...
>>
>> We soon realized that we wanted to run multiple jobs in parallel.
>>
>> Some jobs would take 30 minutes and some 45 seconds.
>>
>> By default, Spark is set up to take up all the resources there are;
>> this can be tweaked by using Mesos or YARN.
>>
>> But even with Mesos and YARN we found it complicated to run multiple jobs
>> in parallel.
>>
>> So eventually we ended up throwing out Spark.
>>
>> Instead we transferred the data to Apache Kudu, and then we ran our
>> analysis on Kudu, and what a difference!
>>
>> "My two cents!"
>>
>> -Tobias
>>
>> *From: *한 승호
>> *Date: *Thursday, 8 June 2017 at 10:25
>> *To: *"user@cassandra.apache.org"
>> *Subject: *Cassandra & Spark
>>
>> Hello,
>>
>> I am Seung-ho and I work as a Data Engineer in Korea. I need some advice.
>>
>> My company has recently been considering replacing its RDBMS-based system
>> with Cassandra and Hadoop.
>>
>> The purpose of this system is to analyze Cassandra and HDFS data with
>> Spark.
>>
>> It seems many use cases put emphasis on data locality; for instance,
>> both Cassandra and the Spark executor should be on the same node.
>>
>> The thing is, my company's data analyst team wants to analyze
>> heterogeneous data sources, Cassandra and HDFS, using Spark.
>>
>> So, I wonder what the best practices would be for using Cassandra and
>> Hadoop in such a case.
>>
>> Plan A: Both HDFS and Cassandra with NodeManager (Spark executor) on the
>> same node
>>
>> Plan B: Cassandra + NodeManager / HDFS + NodeManager, each on separate
>> nodes but in the same cluster
>>
>> Which would be better or more correct, or is there a better way?
>>
>> I appreciate your advice in advance :)
>>
>> Best Regards,
>>
>> Seung-Ho Han
>>
>> Mail for Windows 10
Re: Kill queries
There is no way to kill a running query. The only option is to turn all the nodes off and on again. Cassandra doesn't support a kill-query feature.

On Monday, January 23, 2017, Cogumelos Maravilha wrote:

> Hi,
>
> I'm using cqlsh --request-timeout=1, but because I have more than
> 600.000.000 rows, sometimes I get blocked and I kill cqlsh. But what
> about the running query in Cassandra? How can I check that?
>
> Thanks in advance.
Re: MailBox Impl
How about this: http://www.elasticinbox.com/

On Friday, July 19, 2013, Vegard Berget wrote:

> Hi,
>
> 1) Counters will probably work for this. Our experience with counters is
> that they are very accurate. But read up on how repair, inconsistencies,
> etc. are handled.
>
> 2) You cannot, as far as I know at least, have a TTL on part of a counter.
> What you can do, depending on how accurate it needs to be, is to have
> counters per hour (or something like that) and add them together for the
> last 10 days (which of course can be done async and stored):
>
> MailboxId as row key; Year, Month, Day, Hour as column key.
>
> I don't know if this solves anything for you, but maybe you can use part
> of that idea for something?
>
> .vegard,
>
> - Original Message -
> From: user@cassandra.apache.org
> To: "user@cassandra.apache.org"
> Cc:
> Sent: Thu, 18 Jul 2013 21:30:08 +
> Subject: MailBox Impl
>
> > Hi - We are planning on using Cassandra for an IMAP-based implementation.
> > There are some questions that we are stuck with -
> >
> > 1) Each user will have a pre-defined mailbox size (say 10 MB). We need to
> > maintain a field to check if the mailbox size exceeds the predefined
> > size. Will using the counter column family be appropriate?
> >
> > 2) Also, we need to have retention for only 10 days. After 10 days, the
> > previous days' data will be removed. We plan to have a TTL defined per
> > message. But if we do that, how does the counter in question 1 get
> > updated with the space freed by the deletions?
> >
> > 3) Or do we NOT use TTL and manage the deletions within the application
> > itself?
> >
> > Thanks,
> > Kanwar
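
Editor's sketch of the per-hour counter idea from Vegard's reply, in case it helps to see it spelled out. This is only an illustration: a plain Python dict stands in for the counter column family, and the names `buckets`, `record_message` and `mailbox_usage` are made up for this example, not part of any Cassandra API. The row key is the mailbox id, the column key is (year, month, day, hour), and usage is the sum of the buckets that fall inside the retention window.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# In-memory stand-in for a counter column family (assumption for illustration):
# row key = mailbox id, column key = (year, month, day, hour), value = bytes.
buckets = defaultdict(lambda: defaultdict(int))


def hour_key(ts):
    """Column key for the hourly bucket that contains timestamp ts."""
    return (ts.year, ts.month, ts.day, ts.hour)


def record_message(mailbox_id, received_at, size_bytes):
    """Increment the hourly counter for the hour the message arrived in."""
    buckets[mailbox_id][hour_key(received_at)] += size_bytes


def mailbox_usage(mailbox_id, now, retention_days=10):
    """Add up the hourly counters whose bucket starts inside the window."""
    cutoff = now - timedelta(days=retention_days)
    total = 0
    for (year, month, day, hour), size in buckets[mailbox_id].items():
        bucket_start = datetime(year, month, day, hour)
        if cutoff <= bucket_start <= now:
            total += size
    return total
```

Because whole hourly buckets drop out of the window at once, the result is only accurate to the hour, which is the trade-off Vegard mentions ("depending on how accurate it needs to be"). Expired buckets still have to be deleted separately; the sum simply ignores them.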
Re: how to stop hinted handoff
If you don't use write consistency level ANY, it will be automatically turned off. Just use write consistency level ONE.

2012/10/9 Manu Zhang

> Hi, all
>
> I tried out the client_only example as another local node, 127.0.0.2, and
> then it went down. Now the node (127.0.0.1) has started hinted handoff to
> itself. How do I stop that?
>
> Thanks!
Re: improving cassandra-vs-mongodb-vs-couchdb-vs-redis
Don't trust NoSQL benchmarks. They aren't lies, but NoSQL systems perform differently in different environments. Run the benchmark in your real environment and choose based on that. Thank you.

2011/12/28 Igor Lino

> You are totally right. I'm far from being an expert on the subject, but
> the comparison felt inconsistent and incomplete. (I could not express that
> in my 1st email, so as not to bias the opinion.)
>
> Do you know of any similar comparison which is not biased towards some
> particular technology or solution? (So not coming from
> http://cassandra.apache.org/)
> I want to understand how superior Cassandra is in its latest release
> against its closest competitors, ideally with the opinion of experts.
>
> On Wed, Dec 28, 2011 at 12:14 AM, Edward Capriolo wrote:
>
> > This is not really a comparison of anything, because each NoSQL has its
> > own bullet points like:
> >
> > Boats
> >   great for traveling on water
> > Cars
> >   great for traveling on land
> >
> > So the conclusion I should gather is?
> >
> > Also, as for the Cassandra bullet points, they are really thin (and
> > wrong). Such as:
> >
> > Cassandra:
> > Best used: When you write more than you read (logging). If every
> > component of the system must be in Java. ("No one gets fired for choosing
> > Apache's stuff.")
> >
> > I view that as:
> > Nonsensical, inaccurate, and anecdotal.
> >
> > Also I notice on the other side (and not trying to pick on HBase, but):
> >
> > hbase:
> > No single point of failure
> > Random access performance is like MySQL
> >
> > HBase has several SPOFs, and its random access performance is definitely
> > NOT 'like MySQL'.
> >
> > Cassandra ACTUALLY has no SPOF, but as the author mentions, he/she does
> > not like Cassandra, so that fact was left out.
> >
> > From what I can see of the writeup, it is obviously inaccurate in
> > numerous places (without even reading the entire thing).
> >
> > Also, when comparing these technologies, very subtle differences in
> > design have profound effects on operation and performance. Thus someone
> > trying to paper over 6 technologies and compare them with a few bullet
> > points is really doing the world an injustice.
> >
> > On Tue, Dec 27, 2011 at 5:01 PM, Igor Lino wrote:
> >
> > > Hi!
> > >
> > > I was trying to get an understanding of the real strengths of
> > > Cassandra against other competitors. It's actually not that simple and
> > > depends a lot on the details of the actual requirements.
> > >
> > > Reading the following comparison:
> > > http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
> > >
> > > It felt like the description of Cassandra painted a limiting
> > > picture of its capabilities. Is there any Cassandra expert who could
> > > improve that summary? Is there any important thing missing? Or is there
> > > a more fitting common use case for Cassandra than what Mr. Kovacs has
> > > given?
> > > (I believe/think that a Cassandra expert can improve that generic
> > > description.)
> > >
> > > Thanks,
> > > Igor