Re: Query on Data Modelling of a specific usecase
Hi Jon,

Thanks for your guidance.

The table mentioned above can have a very different scale depending on the report. One report may have 1 row, a second half a million rows, a third 1 million rows, and a fourth 10 million rows. This is time-series data, which was the main reason for modelling it in Cassandra.

We preferred a separate table for each report because there is no use case for querying across reports, and light reports will also be faster that way. I can plan to reduce the number of tables drastically by combining the lighter reports into one table at the application level.

Could you suggest an optimal design for a single table at a scale of 10 million to 1 billion rows, for the queries mentioned?

Thanks,
Naresh Yadav

On Wed, Apr 19, 2017 at 9:26 PM, Jon Haddad <jonathan.had...@gmail.com> wrote:

> How much data do you plan to store in each table?
>
> I'll be honest, this doesn't sound like a Cassandra use case at first
> glance. 1 table per report x 1000 is going to be a bad time. Odds are
> with different queries, you'll need multiple views, so let's call that a
> handful of tables per report. Sounds to me like you need CSV (for small
> reports) or Parquet + a file system (for large ones).
>
> Jon
Re: Query on Data Modelling of a specific usecase
Looking for a Cassandra expert's recommendation on the above use case; please reply.
Query on Data Modelling of a specific usecase
Hi all,

This is my existing table, configured on apache-cassandra-3.0.9:

CREATE TABLE report_id1 (
    mc_id text,
    tag_id text,
    e_date timestamp,
    value text,
    PRIMARY KEY ((mc_id, tag_id), e_date)
);

I create a table dynamically for each report from the application. I need to support up to 1000 reports, which means 1000 such tables.

- The number of unique mc_id values in a report ranges from 5 to 100.
- For each mc_id, the number of unique tag_id values in a report ranges from 100 to 1 million.
- For each (mc_id, tag_id) pair, the number of unique e_date values ranges from 10 to 5000.

Current queries to answer:

1) SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN ('a','b','c') AND e_date='2017-04-16 23:59:59';

2) SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN ('a','b','c') AND e_date >= '2017-04-01 00:00:00' AND e_date <= '2017-04-16 23:59:59';

3) SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='2017-04-16 23:59:59';
   With the current design this works only with ALLOW FILTERING.

4) SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >= '2017-04-01 00:00:00' AND e_date <= '2017-04-16 23:59:59';
   With the current design this works only with ALLOW FILTERING.

I am looking for a better design for this case, keeping in mind the dynamic-tables use case and the queries listed.

Thanks in advance,
Naresh
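The cardinalities quoted in this message can be turned into a rough sizing estimate. The following is only a sketch in plain Python, not anything run against Cassandra; the function name `report_size` is hypothetical, and it assumes one partition per (mc_id, tag_id) pair as in the PRIMARY KEY above. The upper bounds are unlikely to all occur together in one report, so the largest figure is a theoretical ceiling rather than an expected size.

```python
# Rough sizing for PRIMARY KEY ((mc_id, tag_id), e_date), using the ranges
# quoted in the message. "report_size" is a hypothetical helper name.

def report_size(mc_ids, tags_per_mc, dates_per_tag):
    """Return (partitions, rows, partitions_touched_when_only_mc_id_is_fixed)."""
    partitions = mc_ids * tags_per_mc      # one partition per (mc_id, tag_id)
    rows = partitions * dates_per_tag      # one row per e_date inside a partition
    # Queries 3 and 4 restrict mc_id only, so every tag_id partition
    # belonging to that mc_id has to be visited.
    return partitions, rows, tags_per_mc


smallest = report_size(5, 100, 10)
largest = report_size(100, 1_000_000, 5000)
print("smallest report:", smallest)
print("largest report: ", largest)
```

Under these assumptions each partition stays small (at most 5000 rows), but a query that fixes only mc_id may have to visit up to a million partitions, which is consistent with queries 3 and 4 needing ALLOW FILTERING.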
Re: Tag filtering data model
We had a similar use case. After a lot of trials with Cassandra, we finally created a schema in Apache Solr with doc_id (unique key) and tags (indexed) to answer the search query "get me the matching docs for any given number of tags", and that solved our use case. We had millions of docs, and a doc can have hundreds of tags.

Please share your final conclusion if you crack this problem within Cassandra alone; I would be interested to know your solution.

On Fri, Sep 11, 2015 at 1:23 PM, Artur Siekielski wrote:

> I store documents submitted by users, with optional tags (lists of
> strings):
>
> CREATE TABLE doc (
>     user_id uuid,
>     date text, // part of partition key, to distribute data better
>     doc_id uuid,
>     tags list<text>,
>     contents text,
>     PRIMARY KEY ((user_id, date), doc_id)
> );
>
> What is the best way to implement tag filtering? A user can select a list
> of tags and get documents with the tags. I thought about:
>
> 1) Full denormalization - include tags in the primary key and insert a doc
> for each subset of specified tags. This will however lead to large disk
> space usage, because there are 2**n subsets (for 10 tags and a 1MB doc
> 1000MB would be written).
>
> 2) Secondary index on 'tags' collection, and using queries like:
> SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS ? AND tags
> CONTAINS ? ...
>
> Since I will supply partition key value, I assume there will be no
> problems with contacting multiple nodes. But how well will it work for
> hundreds of thousands of results? I think intersection of tag matches needs
> to be performed in memory so it will not scale well.
>
> 3) Partial denormalization - do inserts for each single tag and then
> manually compute intersection. However in the worst case it can lead to
> scanning almost the whole table.
>
> 4) Full denormalization but without contents. I would get correct doc_ids
> fast, then I would need to use '... WHERE doc_id IN ?' with potentially a
> very large list of doc_ids.
>
> What's Cassandra's way to implement this?
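To make the trade-off between options 1 and 3 in the quoted message concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code: `full_denormalization_copies` and `TagIndex` are hypothetical names, and the dictionary stands in for a table keyed by a single tag, with the intersection computed on the client as option 3 describes.

```python
from collections import defaultdict


def full_denormalization_copies(n_tags):
    """Option 1: one copy of the document is written per subset of its tags."""
    return 2 ** n_tags


class TagIndex:
    """Option 3: one index entry per (tag, doc_id); intersect on the client."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def add(self, doc_id, tags):
        for tag in tags:
            self._docs_by_tag[tag].add(doc_id)

    def find(self, tags):
        # Fetch one posting set per requested tag, smallest first, so the
        # intermediate result never grows beyond the rarest tag's matches.
        postings = sorted(
            (self._docs_by_tag.get(tag, set()) for tag in tags), key=len
        )
        if not postings:
            return set()
        result = set(postings[0])
        for docs in postings[1:]:
            result &= docs
        return result


index = TagIndex()
index.add("doc-1", ["cassandra", "modelling"])
index.add("doc-2", ["cassandra"])
print(full_denormalization_copies(10))           # copies written for 10 tags
print(index.find(["cassandra", "modelling"]))    # docs carrying both tags
```

The worst case mentioned in option 3 shows up here as well: if every requested tag is common, each posting set is large and the client reads most of the index before intersecting.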
Re: Help me on Cassandra Data Modelling
Any inputs on my last email, please?
Re: Help me on Cassandra Data Modelling
Thanks, Jonathan, for guiding me. I just want to confirm my understanding:

CREATE COLUMNFAMILY tagcombinations (
    partialtags text,
    tagcombinationid text,
    tagcombinationtags set<text>,
    PRIMARY KEY ((partialtags), tagcombinationid)
);

If I need to store two tag combinations, TC1 as (India, Pencil) and TC2 as (India, Pen), then the data will be stored as:

partialtags  | tagcombinationid | tagcombinationtags
India        | TC1              | India,Pencil
India        | TC2              | India,Pen
Pencil       | TC1              | India,Pencil
Pen          | TC2              | India,Pen
India,Pencil | TC1              | India,Pencil
India,Pen    | TC2              | India,Pen

I hope I have understood the idea properly; please confirm the design.

Thanks
Naresh

On Mon, Jan 27, 2014 at 7:05 PM, Jonathan Lacefield jlacefi...@datastax.com wrote:

> Hello,
>
> The trick with this data model is to get to a partition-based and/or
> cluster-based access pattern so C* returns results quickly. In C* you
> want to model your tables based on your query access patterns, and
> remember that writes are cheap and fast in C*.
>
> So, try something like the following:
>
> 1 table with a partition key = Tag String
>   Tag String = tag or set of tags
>   Cluster based on tag combination (probably desc order)
>
> This will allow you to select any combination that includes the tag or
> set of tags.
> This will duplicate data, as you will store 1 tag combination in every
> tag partition, i.e. if a tag combination has 2 parts, then you will have
> 2 rows.
>
> Hope this helps.
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
> http://www.linkedin.com/in/jlacefield
> http://www.datastax.com/what-we-offer/products-services/training/virtual-training
>
> On Mon, Jan 27, 2014 at 7:24 AM, Naresh Yadav nyadav@gmail.com wrote:
>
>> Hi all,
>>
>> I urgently need help modelling this use case in Cassandra.
>>
>> I have the concepts of tags and tag combinations. For example, U.S.A and
>> Pen are two tags, and if they come together in some definition then a
>> tag combination (U.S.A-Pen) is registered for that.
>>
>> *tags* (U.S.A, Pen, Pencil, India, Shampoo)
>> *tagcombinations* (U.S.A-Pen, India-Pencil, U.S.A-Pencil, India-Pen,
>> India-Pen-Shampoo)
>>
>> - millions of tags
>> - billions of tag combinations
>> - one tag combination generally has 2-8 tags
>> - every day we get lakhs of new tag combinations to write
>>
>> Query I need to support: in how many tagcombinationids does one tag, or
>> a set of tags, appear?
>>
>> If I query for Pen, India then it should return two tag combinations
>> (India-Pen, India-Pen-Shampoo). The query will be fired by the
>> application in real time.
>>
>> I am new to Cassandra and need to deliver fast, so please give your
>> inputs.
>>
>> Thanks
>> Naresh
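The row layout described in this message (one row per partial tag set) can be sketched as follows. This is an illustrative Python model only: `partial_tag_rows`, `build_table` and `lookup` are hypothetical helpers, the dictionary stands in for the tagcombinations column family, and joining the sorted tags with commas is an assumed convention for building the partialtags key.

```python
from collections import defaultdict
from itertools import combinations


def partial_tag_rows(combination_id, tags):
    """Yield (partialtags, tagcombinationid), one row per non-empty tag subset."""
    ordered = sorted(tags)  # a canonical order so lookups build the same key
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            yield ",".join(subset), combination_id


def build_table(tag_combinations):
    """Model the column family as {partialtags: set of tagcombinationids}."""
    table = defaultdict(set)
    for combination_id, tags in tag_combinations.items():
        for partial_key, cid in partial_tag_rows(combination_id, tags):
            table[partial_key].add(cid)
    return table


def lookup(table, tags):
    """Single-partition read: the key is the sorted, comma-joined tag set."""
    return table.get(",".join(sorted(tags)), set())


table = build_table({"TC1": {"India", "Pencil"}, "TC2": {"India", "Pen"}})
print(sorted(table))                      # the six partialtags keys
print(lookup(table, ["Pen", "India"]))    # combinations containing both tags
```

Each combination with n tags produces 2^n - 1 rows in this model, so the two-tag examples above give three rows each, while an eight-tag combination would give 255.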
Re: Help me on Cassandra Data Modelling
Yes Thunder, you are right. I simplified things by moving the tag search (partial/exact) into a separate column family, tagcombination, which will act as an index for all tag-based searches; my original metricresult table will store tagcombinationid and time in columns. Otherwise it was getting complicated and I was not getting good results.

Yes, I agree with you about duplicating storage with the tagcombination column family. If I have a billion real tag combinations with 8 tags each, then I am duplicating 2^8 combinations for each one to support partial matching, which will make this a very heavy table. With individual keys I will not be able to support searching by a set of tags. Please suggest an alternative solution.

Also, one of my colleagues suggested a totally different approach, but I am not able to map it onto Cassandra. According to him, we store all possible tags as columns, and for each combination we just mark 0s and 1s for whichever tags appear in that combination. So the data (TC1 as India, Pencil and TC2 as India, Pen) will look like:

      India  Pencil  Pen
TC1   1      1       0
TC2   1      0       1

I am not able to design an optimal column family for this in Cassandra. If I design it as is, then for a search on India, Pen I will select the India and Pen columns, but that will touch each and every row, because I am not able to apply the criterion of matching 1s only. I believe there can be a better design that makes use of this good idea.

Please help me on this.

Thanks
Naresh

On Mon, Jan 27, 2014 at 11:30 PM, Thunder Stumpges thunder.stump...@gmail.com wrote:

> Hey Naresh,
>
> You asked a similar question a week or two ago. It looks like you have
> simplified your needs quite a bit. Were you able to adjust your
> requirements or separate the issue? You had a complicated time dimension
> before, as well as a single query for multiple AND cases on tags.
>
>> c) Give data for Metric=Sales AND Tag=U.S.A
>>    O/P: 5 rows
>> d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
>>    O/P: 1 row
>
> I agree with Jonathan on the model for this simplified use case. However,
> looking at how you are storing each partial tag combination as well as
> individual tags in the partitioning key, you will be severely duplicating
> your storage. You might want to just store individual keys in the
> partitioning key.
>
> Good luck,
> Thunder
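The 0/1 matrix idea from the colleague can be modelled with bitmasks, which also shows why it scans every row. This is a sketch in plain Python under assumptions: the bit assignments in `TAG_BITS` and the helper names are hypothetical, and nothing here is a Cassandra feature.

```python
# Each tag gets one bit; a tag combination is the OR of its tags' bits.
# The bit positions are arbitrary and chosen only for this illustration.
TAG_BITS = {"India": 0b001, "Pencil": 0b010, "Pen": 0b100}


def to_mask(tags):
    mask = 0
    for tag in tags:
        mask |= TAG_BITS[tag]
    return mask


ROWS = {
    "TC1": to_mask(["India", "Pencil"]),   # India=1, Pencil=1, Pen=0
    "TC2": to_mask(["India", "Pen"]),      # India=1, Pencil=0, Pen=1
}


def matching_combinations(tags):
    """Return ids whose mask has a 1 for every requested tag."""
    wanted = to_mask(tags)
    # Every row is examined: there is no key that narrows the search,
    # which is the full-scan problem described in the message.
    return sorted(cid for cid, bits in ROWS.items() if (bits & wanted) == wanted)


print(matching_combinations(["India", "Pen"]))
print(matching_combinations(["India"]))
```

The check itself is cheap, but the loop runs over all combinations, so with billions of rows this would only be practical if something else first narrowed the candidate set.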
Best design for a usecase ??
Hi,

I need to design a table which will give a UUID to a set of tags. Each tag itself has a unique UUID.

*TagCombination* table

TC1 - India, Pen
TC2 - Shampoo, U.K
TC3 - Team1, Product1, Location1
TC4 - Office1, India, Pen

I can have *billions* of such unique combinations and there can be *millions* of unique tags, but each combination will have 2 to 10 tags at most. As data comes in daily, a new combination is registered if it does not already exist.

*Queries on this table:*

1. Give me the list of tags for Tagcombination Id=TC1.
2. In which Tagcombination Ids does a given set of tags appear? If I say India, Pen then it appears in TC1 and TC4.

There can be an exact match or a partial match on tags to get the TC ids.

Please suggest a design for this so that the table can handle big data.

Thanks
Naresh
Re: Best design for a usecase ??
Just to add: there will be lakhs of select queries on this table to get the tagcombinationid for a partial set of tags.
Getting indexoutbound exception for a specific query on cassandra trunk
I took the latest source code of Cassandra trunk to evaluate the performance of the new indexing-on-collections feature (https://issues.apache.org/jira/browse/CASSANDRA-4511) for my use case.

If you configure a table like this, with the commands in the given order:

CREATE TABLE testcollectionindex (
    userid text,
    timeunitid text,
    periodid text,
    periodlabel text,
    periodtags text,
    unit text,
    datatags text,
    datatagsset set<text>,
    value double,
    PRIMARY KEY ((unit, periodid), datatags)
);

INSERT INTO testcollectionindex (periodlabel, datatags, datatagsset, value, timeunitid, periodid, unit)
VALUES ('Feb-2010', 'India|Pen|Store1', {'India', 'Pen', 'Store1'}, 10, 'Month', 'Period2', 'Number');

CREATE INDEX testcollectionindexdatatagsset ON testcollectionindex (datatagsset);

SELECT * FROM testcollectionindex WHERE datatagsset CONTAINS 'Store1';

Output (works perfectly):

 unit   | periodid | datatags         | datatagsset                | periodlabel
--------+----------+------------------+----------------------------+-------------
 Number | Period2  | India|Pen|Store1 | {'India', 'Pen', 'Store1'} | Feb-2010

(1 rows)

SELECT * FROM testcollectionindex WHERE periodid='Period2' AND unit='Number' AND datatagsset CONTAINS 'Store1';

THIS QUERY DOES NOT WORK. I get an RPC timeout error, and the server logs show an index-out-of-bounds exception (http://pastebin.com/f7qmRc0R).

Debugging the code for this query, I get

SliceQueryFilter [reversed=false, slices=[[, ]], count=2147483647, toGroup = 1]

and because of that it throws java.lang.ArrayIndexOutOfBoundsException: 0 in CompositesIndexOnCollectionKey.java, method makeIndexColumnNameBuilder().

Note: I also tested this query on the 03-Dec-2013 source code snapshot of Cassandra and get the same exception there.

Could someone please help me with this, so that I can proceed and reach a conclusion on this newly supported Cassandra feature?

Thanks
Naresh
Probable release date for cassandra 2.1 ??
Hi,

I am looking for the feature CASSANDRA-4511 (https://issues.apache.org/jira/browse/CASSANDRA-4511), which allows indexes on collections.

Any idea about the release date of Cassandra 2.1?

Until it is released, I am thinking of taking the 2.1 source code and building it on my machine to test the required feature. Please suggest instructions for that.

Thanks
Naresh
Re: Help on Designing Cassandra table for my usecase
@Thunder

I just came to know about CASSANDRA-4511 (https://issues.apache.org/jira/browse/CASSANDRA-4511), which allows indexes on collections and will be part of release 2.1.

I hope that in that case my problem will be solved by changing the table you designed so that the tag column is a set<text>, and defining a secondary index on it.

Is there any risk of performance problems with this design, keeping in mind the huge amount of data?

Naresh

On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav nyadav@gmail.com wrote:

> @Thunder, thanks for suggesting a design, but my main problem is
> indexing/querying the dynamic Tag on each row. That is the main context
> of each row, and most queries will include it.
>
> As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
> able to store exactly the same data, and all the queries also worked, so
> Blur allows dynamic indexing of the tag column. BUT by moving away from
> Cassandra I am losing its strengths, and because of that I am not
> confident in this decision, as the data will be huge in my case.
>
> Please guide me on this with better suggestions.
>
> Thanks
> Naresh
>
> On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges thunder.stump...@gmail.com wrote:
>
>> Well, I think you have essentially time-series data, which C* should
>> handle well; however, I think your Tag column is going to cause trouble.
>> C* does have collection columns, but they are not indexable nor usable
>> in a WHERE clause. Your example has both the uniqueness of the data
>> (primary key) and query filtering on potentially multiple Tag columns.
>> That is not supported in C* AFAIK. If it were a single Tag, that could
>> possibly be a column that is indexed.
>>
>> Ignoring that issue with the many different Tags, you could model the
>> table as:
>>
>> CREATE TABLE metric_data (
>>     metric text,
>>     time text,
>>     period text,
>>     tag text,
>>     value int,
>>     PRIMARY KEY ((metric, time), period, tag)
>> );
>>
>> That would make a composite partitioning key on metric and time, meaning
>> you'd always have to pass those (or else randomly page via TOKEN through
>> all rows). After specifying metric and time, you could optionally also
>> specify period and/or tag, and results would be ordered (clustered) by
>> period. This would satisfy your queries a, b, and d but not c (as you
>> did not specify time). If Time was a granularity column, does it even
>> make sense to return records across differing time values? What does it
>> mean to return the 4 month rows and 1 year row in your example? Could
>> you issue N queries in this case (where N is a small number, one for
>> each of your time granularities)?
>>
>> I'm not sure how close that gets you, or if you can re-work your concept
>> of Tag at all.
>>
>> Good luck.
>> Thunder
>>
>> On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger hkro...@gmail.com wrote:
>>
>>> To my eye that looks like something that traditional analytics systems
>>> do. You can check out e.g. Acunu Analytics, which uses Cassandra as a
>>> backend.
>>>
>>> Cheers,
>>> Hannu
Re: Help on Designing Cassandra table for my usecase
@Vivek, thanks for pointing that out. Other than the primary key, I am defining only one secondary index, on tags. In my case the same tags will certainly repeat across periods for a metric such as Sales, and the same set of tags can also appear across metrics (Sales, Cost) to some extent, though not always.

Thanks
Naresh

On Fri, Jan 10, 2014 at 6:05 PM, Vivek Mishra mishra.v...@gmail.com wrote:

> @Naresh
> Too many indices, or indices with high cardinality, should be discouraged
> and are always performance issues.
> A set will not contain duplicate values.
>
> -Vivek
Help on Designing Cassandra table for my usecase
Hi all,

I have a use case with huge data which I am not able to design in Cassandra.

Table name: MetricResult

Sample data:

    Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pen, Value=10
    Metric=Sales, Time=Month, Period=Jan-10, Tag=U.S.A, Tag=Pencil, Value=20
    Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pen, Value=30
    Metric=Sales, Time=Month, Period=Feb-10, Tag=U.S.A, Tag=Pencil, Value=10
    Metric=Sales, Time=Month, Period=Feb-10, Tag=India, Value=90
    Metric=Sales, Time=Year, Period=2010, Tag=U.S.A, Value=70
    Metric=Cost, Time=Year, Period=2010, Tag=CPU, Value=8000
    Metric=Cost, Time=Year, Period=2010, Tag=RAM, Value=4000
    Metric=Cost, Time=Year, Period=2011, Tag=CPU, Value=9000
    Metric=Resource, Time=Week, Period=Week1-2013, Value=100

So in the above case I have:

    Time-series data, i.e. the Time and Period columns
    Dynamic columns, i.e. the Tag column
    Indexing on dynamic columns, i.e. the Tag column
    Aggregations: SUM, AVERAGE
    If the same value comes again for a Metric, Time, Period, Tag, then overwrite it

Queries I need to support:

    a) Give data for Metric=Sales AND Time=Month. O/P: 5 rows
    b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10. O/P: 2 rows
    c) Give data for Metric=Sales AND Tag=U.S.A. O/P: 5 rows
    d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen. O/P: 1 row

This table can have TBs of data, and a single Metric, Period can have millions of rows. Please suggest how to design/model this table in Cassandra. If there is some limitation in Cassandra, then suggest the best technology to handle this.

Thanks
Naresh
Re: Help on Designing Cassandra table for my usecase
@Thunder, thanks for the guidance.

Queries will be fired by the application on this table when users log in and browse the application, and also from mobile apps through a web service. Responses need to be quick, as users will be analysing this data on the fly. Writes also need to be fast, as there is a time limit within which we need to show this data to users every day. Aggregation we can build in the application, outside Cassandra. But we are not clear what table we should design in Cassandra for the queries we need. Please give guidance on a possible design to handle dynamic tag indexing for these queries.

Thanks
Naresh

On Thu, Jan 9, 2014 at 9:41 PM, Thunder Stumpges thunder.stump...@gmail.com wrote:

> This sort of work sounds much more like a Hadoop/Hive/Pig type of analysis. What are your latency requirements on queries? Are they ad-hoc or part of an application? What is the case where you would need to change an existing value? If it is write once, then Hadoop/Hive is great; if it changes randomly, then not so much.
>
> Cassandra has the limitation that it does not support aggregation; that must be done by a client. In my experience it is really suited to quickly writing lots of data and looking it back up in a random-IO manner if you already know the key you are looking for.
>
> If you have the high-speed write and rewrite needs, but also the full data analytical requirements, there are plugins for using C* as a backing store for Pig/Hive. It is a little finicky to get working depending on all your versions, but it does work fairly well in my limited experience.
>
> Perhaps with a little better understanding of your workload needs others can chime in too.
>
> Good luck.
> -Thunder
Re: Help on Designing Cassandra table for my usecase
@Thunder, it will be write-once 80% of the time, but there can be cases where the client makes a correction in the data, and then we need to overwrite it.

Thanks
Naresh
Re: Help on Designing Cassandra table for my usecase
@Thunder, thanks for suggesting a design, but my main problem is indexing/querying the dynamic Tag on each row. That is the main context of each row, and most of the queries will include it.

As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am able to store the exact same data, and all the queries also worked, so Blur allows dynamic indexing of the tag column. BUT by moving away from Cassandra I am losing its strengths, and because of that I am not confident in this decision, as the data will be huge in my case.

Please guide me on this with better suggestions.

Thanks
Naresh

On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges thunder.stump...@gmail.com wrote:

> Well I think you have essentially time-series data, which C* should handle well; however, I think your Tag column is going to cause trouble. C* does have collection columns, but they are not indexable nor usable in a WHERE clause. Your example has both the uniqueness of the data (primary key) and query filtering on potentially multiple Tag columns. That is not supported in C* AFAIK. If it were a single Tag, that could possibly be an indexed column.
>
> Ignoring that issue with the many different Tags, you could model the table as:
>
>     CREATE TABLE metric_data (
>         metric text,
>         time text,
>         period text,
>         tag text,
>         value int,
>         PRIMARY KEY ((metric, time), period, tag)
>     );
>
> That would make a composite partitioning key on metric and time, meaning you'd always have to pass those (or else randomly page via TOKEN through all rows). After specifying metric and time, you could optionally also specify period and/or tag, and results would be ordered (clustered) by period. This would satisfy your queries a, b, and d but not c (as you did not specify time).
>
> If Time is a granularity column, does it even make sense to return records across differing time values? What does it mean to return the 4 month rows and 1 year row in your example? Could you issue N queries in this case (where N is the small number of your time granularities)?
>
> I'm not sure how close that gets you, or if you can re-work your concept of Tag at all. Good luck.
>
> Thunder
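The access pattern described in Thunder's reply can be sketched in plain Python, without a cluster. This is only an in-memory illustration under stated assumptions, not Cassandra code: the `partitions` dictionary stands in for the `(metric, time)` partition key, the tags of a row are kept as a tuple (an assumption, since the proposed table has a single `tag` column), and `read_partition` and `query_by_tag` are hypothetical helper names. Query (c) is answered the way the reply suggests, with one read per time granularity and the tag filter applied in the client.

```python
from collections import defaultdict

# Sample rows from the original question: (metric, time, period, tags, value).
# Keeping the tags as a tuple is an assumption made for this sketch only.
ROWS = [
    ("Sales", "Month", "Jan-10", ("U.S.A", "Pen"), 10),
    ("Sales", "Month", "Jan-10", ("U.S.A", "Pencil"), 20),
    ("Sales", "Month", "Feb-10", ("U.S.A", "Pen"), 30),
    ("Sales", "Month", "Feb-10", ("U.S.A", "Pencil"), 10),
    ("Sales", "Month", "Feb-10", ("India",), 90),
    ("Sales", "Year", "2010", ("U.S.A",), 70),
    ("Cost", "Year", "2010", ("CPU",), 8000),
    ("Cost", "Year", "2010", ("RAM",), 4000),
    ("Cost", "Year", "2011", ("CPU",), 9000),
    ("Resource", "Week", "Week1-2013", (), 100),
]

# Partition key (metric, time) -> rows of that partition, ordered by (period, tags).
partitions = defaultdict(list)
for metric, time, period, tags, value in ROWS:
    partitions[(metric, time)].append((period, tags, value))
for partition_rows in partitions.values():
    partition_rows.sort()


def read_partition(metric, time, period=None):
    """One partition read: metric and time are mandatory, period is optional."""
    rows = partitions.get((metric, time), [])
    return [row for row in rows if period is None or row[0] == period]


def query_by_tag(metric, tag, granularities):
    """Query (c): one read per time granularity, tag filtered in the client."""
    hits = []
    for time in granularities:
        hits += [row for row in read_partition(metric, time) if tag in row[1]]
    return hits


query_a = read_partition("Sales", "Month")
query_b = read_partition("Sales", "Month", "Jan-10")
query_c = query_by_tag("Sales", "U.S.A", ["Month", "Year", "Week"])
```

The row counts of `query_a`, `query_b` and `query_c` can be compared with the O/P figures in the original question. Note that the tag filter here runs in the client after a whole partition has been read, which is the cost the thread is worried about when a partition holds millions of rows.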
Re: Setting up a multi-node cluster
You would need to configure rpc_address with the hostname/IP as well, on both nodes.

Naresh

On Wed, Aug 28, 2013 at 10:15 AM, Dinesh dinesh.gad...@gmail.com wrote:

> Hi,
>
> I am trying to set up a two-node Cassandra cluster. I am able to start the first node, but am now seeing the following exception while starting the second node:
>
> ERROR 17:31:34,315 Exception encountered during startup
> java.lang.IllegalStateException: Unable to contact any seeds!
>     at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:947)
>     at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:716)
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
> java.lang.IllegalStateException: Unable to contact any seeds!
>     at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:947)
>     at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:716)
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
> Exception encountered during startup: Unable to contact any seeds!
> ERROR 17:31:34,322 Exception in thread Thread[StorageServiceShutdownHook,5,main]
> java.lang.NullPointerException
>     at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
>     at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
>     at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
>
> My yaml configuration files have these settings modified.
>
> First node yaml:
>
>     initial_token: -9223372036854775808   # generated this using tokengen tool
>     seeds: 10.96.19.207                   # which is the IP of first node
>     listen_address: 10.96.19.207          # which is the IP of first node itself
>     rpc_address: 0.0.0.0
>
> Second node yaml:
>
>     initial_token: 0
>     seeds: 10.96.19.207                   # which is the IP of first node
>     listen_address: 10.96.10.223          # which is the IP of second node
>     rpc_address: 0.0.0.0
>
> Can anyone please help me understand what went wrong with my configuration?
>
> Regards
> Dinesh
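Since the error is "Unable to contact any seeds!", one quick check worth running from the second node is whether a TCP connection to the seed can be opened at all. The sketch below uses only Python's standard `socket` module; `can_connect` is a hypothetical helper name, and port 7000 is an assumption (it is Cassandra's default `storage_port`, and the thread does not say whether it was changed). The thread does not establish that connectivity was the actual cause, so this only helps to rule it in or out.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Example, to be run on the second node (10.96.10.223 in the yaml above):
#   can_connect("10.96.19.207", 7000)
# 7000 is assumed to be the storage_port; adjust it if cassandra.yaml differs.
```

A `False` result would point at the network, a firewall, or the seed's `listen_address` rather than at the token or `rpc_address` settings.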
Re: Cassandra HANGS after some writes
Just to update everyone: as per expert advice I tried running this on a Linux machine, but I still have the exact same problem on Linux, with no difference in performance. I tried with the default yaml and a heap size of 8GB. Now please advise me on Cassandra Linux optimizations; I will try those.

Naresh

On Wed, Aug 14, 2013 at 10:43 PM, Robert Coli rc...@eventbrite.com wrote:

> On Tue, Aug 13, 2013 at 10:39 PM, Naresh Yadav nyadav@gmail.com wrote:
>
>> I made one single change in default cassandra.yaml, just to experiment.
>>
>>     native_transport_min_threads: 1
>>     native_transport_max_threads: 1
>>
>> With max one single thread for native protocol requests I noticed some improvement. Earlier, with the default yaml, most of the time it was failing after 10K combinations, BUT with this it worked storing 30K combinations out of 1 lakh.
>
> Most of the time, when your node fails, it is because Java Garbage Collection is failing. This is usually because you are writing faster than you can flush to disk and also collect the garbage.
>
> Usually people inspect the GC logs and/or use something like jconsole to inspect the JMX interface of the JVM and the Cassandra metrics exposed there.
>
> =Rob
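If the node really is being written to faster than it can flush, as Rob suggests, one thing a client can try is to cap the number of writes in flight. Nobody in the thread proposes this, so treat it as an assumption-laden sketch: `run_throttled` and `store` are hypothetical names, `store` stands in for whatever function writes one combination, and the limit of 8 is an arbitrary placeholder to be tuned by experiment.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_throttled(items, store, max_in_flight=8, workers=100):
    """Call store(item) for every item with at most max_in_flight calls running."""
    gate = threading.BoundedSemaphore(max_in_flight)

    def guarded(item):
        # Worker threads block here until one of the in-flight slots is free.
        with gate:
            return store(item)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the results in the same order as the input items.
        return list(pool.map(guarded, items))
```

The pool size stays at the 100 threads used in the test, while the semaphore decides how many of them may be talking to Cassandra at once, so the limit can be changed without restructuring the client.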
Cassandra HANGS after some writes
Hi All,

I have a single-node Cassandra, accessed through CQL using the DataStax Java driver 1.0.1, and Cassandra version 1.2.6.

*Infrastructure:*
16GB machine with 8GB heap given to Cassandra, i7 processor.
DEFAULT cassandra.yaml, no change done by me.
-Xms1G^ -Xmx12G^, no other change in cassandra.bat

*Problem:*
Cassandra freezes after some writes, and I see no action on the Cassandra console for an hour. All Native_Transport threads are also killed. My program keeps running and NO ERROR comes. When I connect with cql, that works.
At the start it creates 16 NativeTransport threads, and after 10-15 minutes the total number of threads goes to 128. Just before it hangs, when I look at the Native_Transport threads with JConsole, I find most of them in this state:
http://pastebin.com/DeShpHtP

*Load on Cassandra:*
I am running a use case which stores Combinations (my project terminology) in Cassandra. Currently I am testing storing 2.5 lakh combinations with 100 parallel threads, each thread storing one combination. In the real case I need to support many CRORES, but that would need different hardware and a multi-node cluster.

Storing ONE combination takes around 2 sec and involves:
527 INSERT INTO queries
506 UPDATE queries
954 SELECT queries

100 parallel threads store 100 combinations in parallel.

*MY CASSANDRA LOGS:*
http://pastebin.com/CnNvA9x3
Please look at the last 100-200 lines of the log, because that is the time it froze.

PLEASE HELP ME OUT, I AM NOT ABLE TO PROCEED FROM 1 week...
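To put the reported load in perspective, the figures in the message can be turned into a rough statement rate. This is back-of-the-envelope arithmetic only, and it assumes that the "around 2 sec" per combination still holds while all 100 threads run at once, which the message does not confirm.

```python
# Figures as reported in the message above.
INSERTS, UPDATES, SELECTS = 527, 506, 954
SECONDS_PER_COMBINATION = 2      # "around 2sec"
PARALLEL_THREADS = 100
TOTAL_COMBINATIONS = 250_000     # 2.5 lakh

statements_per_combination = INSERTS + UPDATES + SELECTS
# Assumption: every thread finishes one combination every 2 seconds.
combinations_per_second = PARALLEL_THREADS / SECONDS_PER_COMBINATION
statements_per_second = statements_per_combination * combinations_per_second
total_statements = statements_per_combination * TOTAL_COMBINATIONS
minutes_for_full_run = TOTAL_COMBINATIONS / combinations_per_second / 60

print(statements_per_combination)      # 1987
print(statements_per_second)           # 99350.0
print(total_statements)                # 496750000
print(round(minutes_for_full_run, 1))  # 83.3
```

Under that assumption the single node is asked to serve close to 100,000 statements per second, roughly half of them reads, for well over an hour.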
Re: Cassandra HANGS after some writes
Thanks Alain, will avoid caps. I am a newbie to Cassandra, just started using it 2 weeks back.

Here are JConsole screenshots taken just 5 mins after Cassandra froze:
http://i.imgur.com/3oUBjKU.png
http://i.imgur.com/2O4PrKb.png
http://i.imgur.com/zxhFzr1.png
4:05 is the time Cassandra froze; that is why there is a decline in the number of threads
http://i.imgur.com/ScgAciv.png

Uploaded the complete system.log of Cassandra up to the freeze:
http://www.scribd.com/doc/159949231/Cassandrasystem-log

Observation:
In my use case I am storing 1 lakh combinations (527 insert, 506 update, 954 select each), in parallel with 100 threads, in batches of 1000. Sometimes it works till 1000 batch then hangs, but sometimes it completes 1 then hangs, and once it even worked for more than a lakh. With the same hardware and the same Cassandra settings, I see random behaviour in performance.

Thanks
Naresh

On Tue, Aug 13, 2013 at 3:48 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

> Hi Naresh.
>
> First thing, there is no need for caps in here. People reading this ML are here to help when they have enough time and skills to do so. So please, chill out and do not use caps to show how desperate you are.
>
> Concerning your problem, the only abnormal thing I was able to find in your logs is:
>
> ERROR [NonPeriodicTasks:1] 2013-08-13 01:52:42,106 SSTableDeletingTask.java (line 72) Unable to delete \var\lib\cassandra\data\system\schema_columnfamilies\system-schema_columnfamilies-ic-241-Data.db (it will be removed on server restart; we'll also retry after GC)
>
> I don't think this should keep C* hanging.
>
> Do you have something in the kernel logs? Do you monitor any metrics like disk throughput / heap used / CPU load / iowait, which are known to be bottlenecks / pertinent metrics?
>
> Alain
Re: Cassandra HANGS after some writes
Hi Alex,

Yes, I am testing in a development environment on Windows 7 64-bit. I left the default yaml, and Cassandra then created a var folder with data, log and cache folders in it. I tried putting the commit log on a different hard disk, but that did not solve the problem. I guess this problem is somehow related to a deadlock in the Native Transport threads; that is why Cassandra is hanging indefinitely.

Naresh

On Tue, Aug 13, 2013 at 7:21 PM, Alexis Rodríguez arodrig...@inconcertcc.com wrote:

> Naresh, are you deploying Cassandra on Windows?
>
> If that is the case, you may need to change the data and commitlog directories in cassandra.yaml. Also you should check the log directories.
>
> See section 2.1 of http://wiki.apache.org/cassandra/GettingStarted
Re: Cassandra HANGS after some writes
Hi all,

I started with Cassandra a few weeks back and I am in a development environment; it will take months to reach production, as everything is still in development. But I will spend time and set up one machine with Ubuntu, and will check whether a similar problem comes up or not. Also, I have started hands-on with Hadoop, so Linux would be a must for me in production.

Till then, if anybody can give me some pointers to try on Windows in parallel, that would help: most of my team is not familiar with the Linux environment, which is why we started on Windows.

Thanks
Naresh

On Tue, Aug 13, 2013 at 9:37 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

> @Kanwar Sangha
>
> "Cassandra on windows ? Please install Linux !"
>
> Useful comment; please spare your time and stop that trolling. He surely has his reasons to use Windows (I suppose it is a dev constraint or choice). Anyway, C* is available on Windows, so it should work. Comments like "Windows sucks, go Linux or macOS" are not going to solve his issue. If Cassandra can't be run on Windows, just don't package Cassandra for Windows.
>
> We can just recommend Naresh *not* to use Windows as the OS for his production nodes.
>
> Alain
>
> 2013/8/13 Kanwar Sangha kan...@mavenir.com
>
>> Cassandra on windows ? Please install Linux !
>>
>> From: Romain HARDOUIN [mailto:romain.hardo...@urssaf.fr]
>> Sent: 13 August 2013 10:17
>> To: user@cassandra.apache.org
>> Subject: Re: Cassandra HANGS after some writes
>>
>>> Naresh, my two cents is that you should run Cassandra on a Linux VM. Issues are easier to diagnose/pinpoint. Windows is a bit obscure to many people here.
>>>
>>> Cheers
>>>
>>> Alexis Rodríguez arodrig...@inconcertcc.com wrote on 13/08/2013 16:50:42:
>>>
>>>> From: Alexis Rodríguez arodrig...@inconcertcc.com
>>>> To: user@cassandra.apache.org
>>>> Date: 13/08/2013 16:51
>>>> Subject: Re: Cassandra HANGS after some writes
>>>>
>>>> Naresh,
>>>>
>>>> Windows is not my cup of tea. Maybe someone else has more experience using Redmond's prodigy child.
>>>>
>>>> cheers, and good luck
Re: Cassandra HANGS after some writes
I made one single change in default cassandra.yaml, just to experiment.

    native_transport_min_threads: 1
    native_transport_max_threads: 1

With max one single thread for native protocol requests I noticed some improvement. Earlier, with the default yaml, most of the time it was failing after 10K combinations, BUT with this it worked storing 30K combinations out of 1 lakh.

Please guide me further on this hint.

Naresh

On Tue, Aug 13, 2013 at 11:06 PM, Robert Coli rc...@eventbrite.com wrote:

> On Tue, Aug 13, 2013 at 10:34 AM, Andrew Cobley a.e.cob...@dundee.ac.uk wrote:
>
>> Has anyone ever done any performance comparisons of Linux vs a headless Windows server?
>
> No, but given the number of Linux-specific optimizations in Cassandra, I would expect this to be no contest.
>
> =Rob