Re: Query on Data Modelling of a specific usecase

2017-04-20 Thread Naresh Yadav
Hi Jon,

Thanks for your guidance.

In the table mentioned above, the scale differs from report to report:

One report may have 1 row.
A second report may have half a million rows.
A third report may have 1 million rows.
A fourth report may have 10 million rows.

This is time-series data, which was the main reason for modelling it in
Cassandra. We preferred a separate table for each report because there is no
use case for querying across reports, and light reports will also run faster.
I can drastically reduce the number of tables by combining the lighter
reports into one table at the application level.

Could you suggest an optimal table design for the queries mentioned,
assuming a single table at a scale of 10 million to 1 billion rows?

Thanks,
Naresh Yadav

On Wed, Apr 19, 2017 at 9:26 PM, Jon Haddad <jonathan.had...@gmail.com>
wrote:

> How much data do you plan to store in each table?
>
> I’ll be honest, this doesn’t sound like a Cassandra use case at first
> glance.  1 table per report x 1000 is going to be a bad time.  Odds are
> with different queries, you’ll need multiple views, so let’s call that a
> handful of tables per report.  Sounds to me like you need CSV (for small
> reports) or Parquet + a file system (for large ones).
>
> Jon


Re: Query on Data Modelling of a specific usecase

2017-04-19 Thread Naresh Yadav
Looking for a Cassandra expert's recommendation on the use case above;
please reply.



Query on Data Modelling of a specific usecase

2017-04-17 Thread Naresh Yadav
Hi all,

This is my existing table configured on apache-cassandra-3.0.9:

CREATE TABLE report_id1 (
   mc_id text,
   tag_id text,
   e_date timestamp,
   value text,
   PRIMARY KEY ((mc_id, tag_id), e_date)
);

I create a table dynamically for each report from the application. I need to
support up to 1000 reports, which means 1000 such tables.
The number of unique mc_id values in a report ranges from 5 to 100.
For each mc_id, the number of unique tag_id values in a report ranges from
100 to 1 million.
For each (mc_id, tag_id) pair, the number of unique e_date values ranges
from 10 to 5000.

Current queries to answer:
1) SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
e_date='16Apr2017 23:59:59';
2) SELECT * FROM report_id1 WHERE mc_id='x' AND tag_id IN('a','b','c') AND
e_date >='01Apr2017 00:00:00' AND e_date <='16Apr2017 23:59:59';

3) SELECT * FROM report_id1 WHERE mc_id='x' AND e_date='16Apr2017 23:59:59';
   With the current design this works only with ALLOW FILTERING.
4) SELECT * FROM report_id1 WHERE mc_id='x' AND e_date >='01Apr2017
00:00:00' AND e_date <='16Apr2017 23:59:59';
   With the current design this works only with ALLOW FILTERING.
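
To make the difference between queries 1-2 and queries 3-4 concrete, here is
a rough Python sketch. It is only a toy in-memory model, not how Cassandra
stores data: a dict keyed by (mc_id, tag_id) stands in for the partitions,
ISO-8601 strings stand in for e_date, and the sample values are made up. It
counts how many partitions each kind of query has to look at.

```python
from collections import defaultdict

# Toy stand-in for the table: partition key (mc_id, tag_id) -> list of
# (e_date, value) rows. ISO-8601 strings are used so that string order
# matches time order. This is an illustration, not Cassandra's storage engine.
partitions = defaultdict(list)


def insert(mc_id, tag_id, e_date, value):
    partitions[(mc_id, tag_id)].append((e_date, value))


def query_with_tags(mc_id, tag_ids, start, end):
    """Queries 1 and 2: the full partition key is known for every tag."""
    touched = 0
    rows = []
    for tag_id in tag_ids:
        touched += 1  # one direct partition lookup per tag
        for e_date, value in partitions.get((mc_id, tag_id), []):
            if start <= e_date <= end:
                rows.append((mc_id, tag_id, e_date, value))
    return sorted(rows), touched


def query_without_tags(mc_id, start, end):
    """Queries 3 and 4: tag_id is unknown, so every partition is inspected."""
    touched = 0
    rows = []
    for (m, tag_id), part in partitions.items():
        touched += 1  # nothing can be skipped without the full key
        if m != mc_id:
            continue
        for e_date, value in part:
            if start <= e_date <= end:
                rows.append((m, tag_id, e_date, value))
    return sorted(rows), touched


# Hypothetical sample data: 2 mc_ids x 3 tags x 2 dates.
for mc in ("x", "y"):
    for tag in ("a", "b", "c"):
        insert(mc, tag, "2017-04-01T00:00:00", "v1")
        insert(mc, tag, "2017-04-16T23:59:59", "v2")

START, END = "2017-04-01T00:00:00", "2017-04-16T23:59:59"
rows_q2, touched_q2 = query_with_tags("x", ["a", "b"], START, END)
rows_q4, touched_q4 = query_without_tags("x", START, END)
print(len(rows_q2), touched_q2)  # 4 2
print(len(rows_q4), touched_q4)  # 6 6
```

In this model the tag-restricted query touches one partition per requested
tag, while the query without tag_id walks every partition in the table, which
is the cost that ALLOW FILTERING hides.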

I am looking for a better design for this case, keeping in mind the
dynamic-tables use case and the queries listed.

Thanks in advance,
Naresh


Re: Tag filtering data model

2015-09-16 Thread Naresh Yadav
We had a similar use case. After many trials with Cassandra, we finally
created a schema in Apache Solr with doc_id (unique key) and tags (indexed)
to answer the search query "get me the matching docs for any given number of
tags", and that solved our use case. We had millions of docs, and a doc can
have hundreds of tags.

Please share your final conclusion if you crack this problem within
Cassandra alone; I would be interested to know your solution.

On Fri, Sep 11, 2015 at 1:23 PM, Artur Siekielski wrote:

> I store documents submitted by users, with optional tags (lists of
> strings):
>
> CREATE TABLE doc (
>   user_id uuid,
>   date text, // part of partition key, to distribute data better
>   doc_id uuid,
>   tags list<text>,
>   contents text,
>   PRIMARY KEY((user_id, date), doc_id)
> );
>
> What is the best way to implement tag filtering? A user can select a list
> of tags and get documents with the tags. I thought about:
>
> 1) Full denormalization - include tags in the primary key and insert a doc
> for each subset of specified tags. This will however lead to large disk
> space usage, because there are 2**n subsets (for 10 tags and a 1MB doc
> 1000MB would be written).
>
> 2) Secondary index on 'tags' collection, and using queries like:
> SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS ? AND tags
> CONTAINS ? ...
>
> Since I will supply partition key value, I assume there will be no
> problems with contacting multiple nodes. But how well will it work for
> hundreds of thousands of results? I think intersection of tag matches needs
> to be performed in memory so it will not scale well.
>
> 3) Partial denormalization - do inserts for each single tag and then
> manually compute intersection. However in the worst case it can lead to
> scanning almost the whole table.
>
> 4) Full denormalization but without contents. I would get correct doc_ids
> fast, then I would need to use '... WHERE doc_id IN ?' with potentially a
> very large list of doc_ids.
>
>
> What's Cassandra's way to implement this?
>
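
For a feel of the numbers in option 1 above, here is a small Python sketch of
the subset blow-up. It assumes one copy of the document is written per
non-empty subset of its tags (2**n - 1 copies; whether the empty subset is
also stored is not stated in the thread), and the function names are made up
for illustration.

```python
from itertools import combinations


def tag_subsets(tags):
    """Return every non-empty subset of tags, each as a sorted tuple."""
    ordered = sorted(tags)
    return [
        subset
        for size in range(1, len(ordered) + 1)
        for subset in combinations(ordered, size)
    ]


def write_amplification_mb(num_tags, doc_size_mb):
    """Total MB written if the document is copied once per non-empty subset."""
    return (2 ** num_tags - 1) * doc_size_mb


print(tag_subsets(["pen", "india"]))
# [('india',), ('pen',), ('india', 'pen')]
print(write_amplification_mb(10, 1))  # 1023
```

With 10 tags and a 1MB doc this gives 1023MB, in line with the roughly
1000MB mentioned in the quoted mail.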


Re: Help me on Cassandra Data Modelling

2014-01-28 Thread Naresh Yadav
Please share your inputs on my last email, if any.


Re: Help me on Cassandra Data Modelling

2014-01-27 Thread Naresh Yadav
Thanks, Jonathan, for guiding me. I just want to confirm my understanding:

CREATE COLUMNFAMILY tagcombinations (
 partialtags text,
 tagcombinationid text,
 tagcombinationtags set<text>,
 PRIMARY KEY ((partialtags), tagcombinationid)
);
If I need to store two tag combinations, TC1 as (India, Pencil) and TC2 as
(India, Pen), then the data will be stored as:

partialtags    | TC1            | TC2
---------------+----------------+-----------
India          | India,Pencil   | India,Pen
Pencil         | India,Pencil   |
Pen            |                | India,Pen
India,Pencil   | India,Pencil   |
India,Pen      |                | India,Pen


I hope I have understood the idea properly; please confirm the design.

Thanks
Naresh


On Mon, Jan 27, 2014 at 7:05 PM, Jonathan Lacefield jlacefi...@datastax.com
 wrote:

 Hello,

   The trick with this data model is to get to partition based, and/or
 cluster based access pattern so C* returns results quickly.  In C* you want
 to model your tables based on your query access patterns and remember that
 writes are cheap and fast in C*.

   So, try something like the following:

   1 Table with a Partition Key = Tag String
   - Tag String = Tag or set of Tags
   - Cluster based on tag combination (probably desc order)
   - This will allow you to select any combination that includes Tag
     or set of Tags
   - This will duplicate data as you will store 1 tag combination in
     every Tag partition, i.e. if a tag combination has 2 parts, then you
     will have 2 rows

   Hope this helps.

 Jonathan Lacefield
 Solutions Architect, DataStax


 On Mon, Jan 27, 2014 at 7:24 AM, Naresh Yadav nyadav@gmail.com wrote:

 Hi all,

 I urgently need help modelling this use case in Cassandra.

 I have the concepts of tags and tagcombinations.
 For example, U.S.A and Pen are two tags, and if they occur together in some
 definition, a tagcombination (U.S.A-Pen) is registered for them.

 tags (U.S.A, Pen, Pencil, India, Shampoo)
 tagcombinations (U.S.A-Pen, India-Pencil, U.S.A-Pencil, India-Pen,
 India-Pen-Shampoo)

 - millions of tags
 - billions of tagcombinations
 - one tagcombination generally has 2-8 tags
 - every day we get lakhs of new tagcombinations to write

 Query to support:
 Given one tag or a set of tags, find the tagcombination IDs in which it
 appears. If I query for Pen,India, it should return two tagcombinations
 (India-Pen, India-Pen-Shampoo). The query will be fired by the application
 in real time.

 I am new to Cassandra and need to deliver fast, so please give your inputs.

 Thanks
 Naresh





Re: Help me on Cassandra Data Modelling

2014-01-27 Thread Naresh Yadav
Yes Thunder, you are right. I simplified the problem by moving tag search
(partial/exact) into a separate column family, tagcombination, which acts as
an index for all tag-based searches. My original metricresult table will
store the tagcombinationid and time in columns; otherwise it was getting
complicated and I was not getting good results.

I agree with you about the duplicated storage in the tagcombination column
family. If I have a billion real tagcombinations with 8 tags each, then I am
storing up to 2^8 partial combinations for each one to support partial
matches, which will make this a very heavy table. With individual keys I
will not be able to support search with a set of tags. Please suggest an
alternative solution.

One of my colleagues also suggested a totally different approach, but I am
not able to map it onto Cassandra. According to him, we store all possible
tags as columns and, for each combination, mark with 0s and 1s which tags
appear in that combination. So the data (TC1 as India, Pencil and TC2 as
India, Pen) will look like:

      India  Pencil  Pen
TC1   1      1       0
TC2   1      0       1

I am not able to design an optimal column family for this in Cassandra. If I
design it as is, then for a search on India, Pen I will select the India and
Pen columns, but that will touch each and every row because I cannot apply
the criterion of matching only the 1s. I believe there can be a better design
that makes use of this good idea.
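
Outside Cassandra, the 0/1 idea can be sketched in Python as one bitmask per
combination. This is only an illustration of the matching rule, assuming a
fixed tag order; the names TAGS, BIT, to_mask and search are made up here.

```python
# Column order is fixed up front; each tag owns one bit position.
TAGS = ["India", "Pencil", "Pen"]
BIT = {tag: 1 << i for i, tag in enumerate(TAGS)}


def to_mask(tags):
    """Encode a set of tags as an integer with one bit set per tag."""
    mask = 0
    for tag in tags:
        mask |= BIT[tag]
    return mask


rows = {
    "TC1": to_mask(["India", "Pencil"]),
    "TC2": to_mask(["India", "Pen"]),
}


def search(tags):
    """Return the combinations whose mask has every requested bit set."""
    wanted = to_mask(tags)
    # Every row is examined here: this is the full scan mentioned above.
    return [tc for tc, mask in rows.items() if (mask & wanted) == wanted]


print(search(["India", "Pen"]))  # ['TC2']
print(search(["India"]))         # ['TC1', 'TC2']
```

The match itself is a cheap bitwise test, but the loop still visits every
row, so on its own this does not remove the scan.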

Please help me on this..

Thanks
Naresh



On Mon, Jan 27, 2014 at 11:30 PM, Thunder Stumpges 
thunder.stump...@gmail.com wrote:

 Hey Naresh,

 You asked a similar question a week or two ago. It looks like you have
 simplified your needs quite a bit. Were you able to adjust your
 requirements or separate the issue? You had a complicated time dimension
 before, as well as a single query for multiple AND cases on tags.

 
 c)Give data for Metric=Sales AND Tag=U.S.A
O/P : 5 rows
 d)Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
O/P :1 row



 I agree with Jonathan on the model for this simplified use case. However
 looking at how you are storing each partial tag combination as well as
 individual tags in the partitioning key, you will be severely duplicating
 your storage. You might want to just store individual keys in the
 partitioning key.
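
 A rough Python sketch of what storing only individual tags could look like,
 with a multi-tag search answered by intersecting the per-tag results on the
 client. The client-side intersection is an assumption added for
 illustration, not something stated in this thread, and a plain dict stands
 in for the table.

```python
from collections import defaultdict

# One "partition" per single tag, holding the ids of the combinations that
# contain it. A plain dict stands in for the Cassandra table.
tag_index = defaultdict(set)


def register(combination_id, tags):
    """Write the combination id once under each of its tags."""
    for tag in tags:
        tag_index[tag].add(combination_id)


def find_combinations(tags):
    """Return the ids of combinations that contain all of the given tags."""
    tags = list(tags)
    if not tags:
        return set()
    # Start from the smallest partition to keep intermediate sets small.
    parts = sorted((tag_index.get(t, set()) for t in tags), key=len)
    result = set(parts[0])
    for part in parts[1:]:
        result &= part
    return result


# Sample combinations from the original question; the id is the joined name.
for combo in ["U.S.A-Pen", "India-Pencil", "U.S.A-Pencil", "India-Pen",
              "India-Pen-Shampoo"]:
    register(combo, combo.split("-"))

print(sorted(find_combinations(["Pen", "India"])))
# ['India-Pen', 'India-Pen-Shampoo']
```

 A combination with n tags is written n times here instead of once per
 subset, at the price of reading one partition per queried tag; a very
 common tag can still make that partition large.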

 Good luck,
 Thunder




Best design for a usecase ??

2014-01-21 Thread Naresh Yadav
Hi,

I need to design a table which assigns a UUID to a set of tags.
Each tag itself has a unique UUID.

TagCombination table
TC1  -  India, Pen
TC2  -  Shampoo, U.K
TC3  -  Team1, Product1, Location1
TC4  -  Office1, India, Pen

I can have billions of such unique combinations and there can be millions of
unique tags, but each combination will have 2 to 10 tags at most.

As data arrives daily, a new combination is registered if it does not
already exist.

Queries on this table:
1. Give me the list of tags for TagCombination Id=TC1.
2. Which TagCombination Ids contain a given set of tags?
   For example, India,Pen appears in TC1 and TC4.
   The match on tags can be exact or partial.

Please suggest a design for this so that the table can handle big data.

Thanks
Naresh


Re: Best design for a usecase ??

2014-01-21 Thread Naresh Yadav
Just to add: on this table there will be lakhs of SELECT queries to get the
tagcombinationid for a partial set of tags.



Getting indexoutbound exception for a specific query on cassandra trunk

2014-01-16 Thread Naresh Yadav
I took the latest source code of Cassandra trunk to evaluate the performance
of the new indexing-on-collections feature
(https://issues.apache.org/jira/browse/CASSANDRA-4511) for my use case.

If you configure a table with the following commands in the given order:

CREATE TABLE testcollectionindex(userid text, timeunitid text, periodid
text, periodlabel text, periodtags text, unit text, datatags text,
datatagsset set<text>, value double, PRIMARY KEY((unit,periodid), datatags));

INSERT INTO testcollectionindex(periodlabel, datatags, datatagsset,
value,timeunitid,periodid, unit) VALUES('Feb-2010', 'India|Pen|Store1',
{'India', 'Pen', 'Store1'}, 10,'Month','Period2','Number');

CREATE INDEX testcollectionindexdatatagsset ON testcollectionindex
(datatagsset);

SELECT * FROM testcollectionindex WHERE datatagsset CONTAINS 'Store1';
Output (works perfectly):
 unit   | periodid | datatags         | datatagsset                | periodlabel
--------+----------+------------------+----------------------------+-------------
 Number |  Period2 | India|Pen|Store1 | {'India', 'Pen', 'Store1'} |    Feb-2010

(1 rows)

SELECT * FROM testcollectionindex WHERE periodid='Period2' AND
unit='Number' AND datatagsset CONTAINS 'Store1';

This query does not work. I get an RPC timeout error, and the server logs
show an index-out-of-bounds exception (http://pastebin.com/f7qmRc0R).

Debugging the code for this query, I get SliceQueryFilter [reversed=false,
slices=[[, ]], count=2147483647, toGroup = 1], and because of that it throws
java.lang.ArrayIndexOutOfBoundsException: 0 in the
CompositesIndexOnCollectionKey.java method makeIndexColumnNameBuilder().

Note: I also tested this query on the 03-Dec-2013 source code snapshot of
Cassandra and got the same exception there.

Could someone please help me with this, so that I can proceed and reach a
conclusion on this newly supported Cassandra feature?

Thanks
Naresh


Probable release date for cassandra 2.1 ??

2014-01-10 Thread Naresh Yadav
Hi,

I am looking for the feature CASSANDRA-4511
(https://issues.apache.org/jira/browse/CASSANDRA-4511),
which allows indexes on collections.
Any idea about the release date of Cassandra 2.1?

Until it is released, I am thinking of taking the 2.1 source code and
building it on my machine to test the required feature. Please suggest
instructions for that.


Thanks
Naresh


Re: Help on Designing Cassandra table for my usecase

2014-01-10 Thread Naresh Yadav
@Thunder
I just came to know about CASSANDRA-4511
(https://issues.apache.org/jira/browse/CASSANDRA-4511),
which allows indexes on collections and will be part of release 2.1.
I hope that in that case my problem will be solved by changing the table you
designed so that the tag column is a set<text> with a secondary index
defined on it. Is there any risk of performance problems with this design,
keeping in mind the huge data volume?


Naresh

On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav nyadav@gmail.com wrote:

 @Thunder, thanks for suggesting a design, but my main problem is
 indexing/querying the dynamic Tag on each row; that is the main context of
 each row, and most queries will include it.

 As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
 able to store exactly the same data, and all the queries worked, so Blur
 allows dynamic indexing of the tag column. But by moving away from
 Cassandra I am losing its strengths, and because of that I am not confident
 in this decision, as the data will be huge in my case.

 Please guide me on this with better suggestions.

 Thanks
 Naresh

 On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges 
 thunder.stump...@gmail.com wrote:

 Well I think you have essentially time-series data, which C* should
 handle well, however I think your Tag column is going to cause troubles.
 C* does have collection columns, but they are not indexable nor usable in
 WHERE clause. Your example has both the uniqueness of the data (primary
 key) and query filtering on potentially multiple Tag columns. That is not
 supported in C* AFAIK. If it were a single Tag, that could be a column that
 is Indexed possibly.

 Ignoring that issue with the many different Tags, You could model the
 table as:

 CREATE TABLE metric_data (
   metric text,
   time text,
   period text,
   tag text,
   value int,
   PRIMARY KEY( (metric,time), period, tag)
 )

 That would make a composite partitioning key on metric and time meaning
 you'd always have to pass those (or else randomly page via TOKEN through
 all rows). After specifying metric and time, you could optionally also
 specify period and/or tag, and results would be ordered (clustered) by
 period. This would satisfy your queries a,b, and d but not c (as you did
 not specify time). If Time was a granularity column, does it even make
 sense to return records across differing time values? What does it mean to
 return the 4 month rows and 1 year row in your example? Could you issue N
 queries in this case (where N is a small number of each of your time
 granularities) ?

 I'm not sure how close that gets you, or if you can re-work your concept
 of Tag at all.
 Good luck.
 Thunder



 On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger hkro...@gmail.com wrote:

 To my eye that looks something what the traditional analytics systems
 do. You can check out e.g. Acunu Analytics which uses Cassandra as a
 backend.

 Cheers,
 Hannu


 2014/1/9 Naresh Yadav nyadav@gmail.com

 Hi all,

 I have a use case with huge data which i am not able to design in
 cassandra.

 Table name : MetricResult

 Sample Data :

 Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pen,    Value=10
 Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pencil, Value=20
 Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pen,    Value=30
 Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pencil, Value=10
 Metric=Sales,    Time=Month, Period=Feb-10,     Tag=India,             Value=90
 Metric=Sales,    Time=Year,  Period=2010,       Tag=U.S.A,             Value=70
 Metric=Cost,     Time=Year,  Period=2010,       Tag=CPU,               Value=8000
 Metric=Cost,     Time=Year,  Period=2010,       Tag=RAM,               Value=4000
 Metric=Cost,     Time=Year,  Period=2011,       Tag=CPU,               Value=9000
 Metric=Resource, Time=Week,  Period=Week1-2013,                        Value=100
 So in the above case I have:
  - Time-series data, i.e. the Time and Period columns
  - Dynamic columns, i.e. the Tag column
  - Indexing on dynamic columns, i.e. the Tag column
  - Aggregations: SUM, AVERAGE
  - If the same Metric, Time, Period, Tag arrives again, its value is
    overwritten

 Queries I need to support:
 a) Give data for Metric=Sales AND Time=Month
    O/P: 5 rows
 b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
    O/P: 2 rows
 c) Give data for Metric=Sales AND Tag=U.S.A
    O/P: 5 rows
 d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
    O/P: 1 row
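
 The expected row counts for queries a) to d) can be checked against the
 sample data with a small in-memory filter in Python. This is only a sanity
 check of the semantics (tags treated as a set per row, every requested tag
 must be present); it says nothing about how Cassandra would execute these
 queries, and the select helper is made up for illustration.

```python
# Sample data from the mail: (metric, time, period, tags, value).
ROWS = [
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pen"}, 10),
    ("Sales", "Month", "Jan-10", {"U.S.A", "Pencil"}, 20),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pen"}, 30),
    ("Sales", "Month", "Feb-10", {"U.S.A", "Pencil"}, 10),
    ("Sales", "Month", "Feb-10", {"India"}, 90),
    ("Sales", "Year", "2010", {"U.S.A"}, 70),
    ("Cost", "Year", "2010", {"CPU"}, 8000),
    ("Cost", "Year", "2010", {"RAM"}, 4000),
    ("Cost", "Year", "2011", {"CPU"}, 9000),
    ("Resource", "Week", "Week1-2013", set(), 100),
]


def select(metric=None, time=None, period=None, tags=()):
    """Return rows matching every given field and containing all given tags."""
    wanted = set(tags)
    return [
        row for row in ROWS
        if (metric is None or row[0] == metric)
        and (time is None or row[1] == time)
        and (period is None or row[2] == period)
        and wanted <= row[3]
    ]


print(len(select(metric="Sales", time="Month")))                    # a) 5
print(len(select(metric="Sales", time="Month", period="Jan-10")))   # b) 2
print(len(select(metric="Sales", tags=["U.S.A"])))                  # c) 5
print(len(select(metric="Sales", period="Jan-10",
                 tags=["U.S.A", "Pen"])))                           # d) 1
```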


 This table can have TBs of data, and a single Metric,Period can have
 millions of rows.

 Please suggest how to design/model this table in Cassandra. If there is
 some limitation in Cassandra, then suggest the best technology to handle
 this.


 Thanks
 Naresh









Re: Help on Designing Cassandra table for my usecase

2014-01-10 Thread Naresh Yadav
@Vivek, thanks for pointing that out. Other than the primary key, I am
defining only one secondary index, on tags. In my case the same tags will
certainly repeat across periods for a metric such as Sales, and the same set
of tags can also appear across metrics (Sales, Cost) to some extent, though
not always.


Thanks
Naresh


On Fri, Jan 10, 2014 at 6:05 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 @Naresh
 Too many indices or indices with high cardinality should be discouraged
 and are always performance issues. A set will not contain duplicate values.

 -Vivek


 On Fri, Jan 10, 2014 at 5:48 PM, Naresh Yadav nyadav@gmail.comwrote:

 @Thunder
 I just came to know about 
 (CASSANDRA-4511https://issues.apache.org/jira/browse/CASSANDRA-4511)
 which allows Index on Collections and that will be part of release 2.1.
 I hope in that case my problem will be solved by changing your designed
 table with tag column as settext and defining secondary index on it. Is
 there any risk of performance problem of this design keeping in mind huge
 data ???


 Naresh

 On Fri, Jan 10, 2014 at 10:26 AM, Naresh Yadav nyadav@gmail.comwrote:

 @Thunder thanks for suggesting design but my main problem is
 indexing/quering dynamic Tag on each row that is main context of each row
 and most of queries will include that..

 As an alternative to cassandra, i tried Apache Blur, in blur table i am
 able to store exact same data and all queries also worked..so blur  allows
 dynamic indexing  of tag column BUT moving away from cassandra, i am
 loosing its strength because of that i am not confident on this decision as
 data will be huge in my case.

 Please guide me on this with better suggestions.

 Thanks
 Naresh

 On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges 
 thunder.stump...@gmail.com wrote:

 Well I think you have essentially time-series data, which C* should
 handle well, however I think your Tag column is going to cause troubles.
 C* does have collection columns, but they are not indexable nor usable in
 WHERE clause. Your example has both the uniqueness of the data (primary
 key) and query filtering on potentially multiple Tag columns. That is not
 supported in C* AFAIK.If it were a single Tag, that could be a column that
 is Indexed possibly.

 Ignoring that issue with the many different Tags, You could model the
 table as:

 CREATE TABLE metric_data (
   metric text,
   time text,
   period text,
   tag text,
   value int,
   PRIMARY KEY( (metric,time), period, tag)
 )

 That would make a composite partitioning key on metric and time meaning
 you'd always have to pass those (or else randomly page via TOKEN through
 all rows). After specifying metric and time, you could optionally also
 specify period and/or tag, and results would be ordered (clustered) by
 period. This would satisfy your queries a,b, and d but not c (as you did
 not specify time). If Time was a granularity column, does it even make
 sense to return records across differing time values? What does it mean to
 return the 4 month rows and 1 year row in your example? Could you issue N
 queries in this case (where N is a small number of each of your time
 granularities) ?

 I'm not sure how close that gets you, or if you can re-work your
 concept of Tag at all.
 Good luck.
 Thunder



 On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger hkro...@gmail.com wrote:

 To my eye that looks like what traditional analytics systems
 do. You can check out e.g. Acunu Analytics, which uses Cassandra as a
 backend.

 Cheers,
 Hannu



Help on Designing Cassandra table for my usecase

2014-01-09 Thread Naresh Yadav
Hi all,

I have a use case with huge data which I am not able to model in Cassandra.

Table name : MetricResult

Sample Data :

Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pen,    Value=10
Metric=Sales,    Time=Month, Period=Jan-10,     Tag=U.S.A, Tag=Pencil, Value=20
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pen,    Value=30
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=U.S.A, Tag=Pencil, Value=10
Metric=Sales,    Time=Month, Period=Feb-10,     Tag=India,             Value=90
Metric=Sales,    Time=Year,  Period=2010,       Tag=U.S.A,             Value=70
Metric=Cost,     Time=Year,  Period=2010,       Tag=CPU,               Value=8000
Metric=Cost,     Time=Year,  Period=2010,       Tag=RAM,               Value=4000
Metric=Cost,     Time=Year,  Period=2011,       Tag=CPU,               Value=9000
Metric=Resource, Time=Week,  Period=Week1-2013,                        Value=100

So in the above case I need:
 Time-series data, i.e. the Time and Period columns
 Dynamic columns, i.e. the Tag column
 Indexing on dynamic columns, i.e. the Tag column
 Aggregations: SUM, AVERAGE
 Overwrite: if a value arrives again for the same Metric, Time, Period and Tag,
 it replaces the existing one

Queries I need to support:
a) Give data for Metric=Sales AND Time=Month
   O/P : 5 rows
b) Give data for Metric=Sales AND Time=Month AND Period=Jan-10
   O/P : 2 rows
c) Give data for Metric=Sales AND Tag=U.S.A
   O/P : 5 rows
d) Give data for Metric=Sales AND Period=Jan-10 AND Tag=U.S.A AND Tag=Pen
   O/P : 1 row
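
To make the expected results above concrete, here is a minimal in-memory sketch in Python. It is not Cassandra code and says nothing about how the table should be modelled; it only reproduces the sample data, the overwrite rule, and the row counts for queries a to d. The names `rows`, `upsert` and `query` are made up for this illustration.

```python
# In-memory reference model of the MetricResult sample data (not Cassandra code).
# A row is keyed by (metric, time, period, set of tags); writing the same key
# again replaces the value, which mirrors the "overwrite" requirement.
rows = {}

def upsert(metric, time, period, tags, value):
    rows[(metric, time, period, frozenset(tags))] = value

def query(metric, time=None, period=None, tags=()):
    """Return matching (key, value) pairs; None or () means 'not restricted'."""
    wanted = set(tags)
    return [
        (key, value)
        for key, value in rows.items()
        if key[0] == metric
        and (time is None or key[1] == time)
        and (period is None or key[2] == period)
        and wanted <= key[3]          # every requested tag must be on the row
    ]

sample = [
    ("Sales", "Month", "Jan-10", ["U.S.A", "Pen"], 10),
    ("Sales", "Month", "Jan-10", ["U.S.A", "Pencil"], 20),
    ("Sales", "Month", "Feb-10", ["U.S.A", "Pen"], 30),
    ("Sales", "Month", "Feb-10", ["U.S.A", "Pencil"], 10),
    ("Sales", "Month", "Feb-10", ["India"], 90),
    ("Sales", "Year", "2010", ["U.S.A"], 70),
    ("Cost", "Year", "2010", ["CPU"], 8000),
    ("Cost", "Year", "2010", ["RAM"], 4000),
    ("Cost", "Year", "2011", ["CPU"], 9000),
    ("Resource", "Week", "Week1-2013", [], 100),
]
for record in sample:
    upsert(*record)

print(len(query("Sales", time="Month")))                             # a) 5
print(len(query("Sales", time="Month", period="Jan-10")))            # b) 2
print(len(query("Sales", tags=["U.S.A"])))                           # c) 5
print(len(query("Sales", period="Jan-10", tags=["U.S.A", "Pen"])))   # d) 1

# Same Metric, Time, Period and Tags again: the value is replaced, not added.
upsert("Sales", "Month", "Jan-10", ["U.S.A", "Pen"], 15)
print(len(rows))                                                     # still 10
```

The full scan in `query` is exactly what a real store cannot afford at TB scale, which is why the choice of key and index matters for the design question.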


This table can hold terabytes of data, and a single Metric and Period can have
millions of rows.

Please suggest how to design/model this table in Cassandra. If Cassandra has a
limitation here, please suggest the best technology to handle this.


Thanks
Naresh


Re: Help on Designing Cassandra table for my usecase

2014-01-09 Thread Naresh Yadav
@Thunder thanks for the guidance. Queries will be fired by the application on
this table when users log in and browse the application, and also from mobile
apps through a web service. Responses need to be quick, as users will be
analysing this data on the fly. Writes also need to be fast, as there
is a time limit within which we must show this data to users every day.

We can build aggregation in the application, outside Cassandra. But we are not
clear what table we should design in Cassandra for the queries we
need. Please give guidance on a possible design to handle dynamic tag
indexing for these queries.
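
As a rough illustration of "aggregation in the application", the sketch below computes SUM and AVERAGE in Python over rows that have already been fetched. The row shape (dicts with `period` and `value`) and the helper name `aggregate` are assumptions made for this example, not anything defined in the thread.

```python
from collections import defaultdict

def aggregate(rows, group_by):
    """Group rows (dicts) by one column and return {group: (sum, average)}."""
    totals = defaultdict(lambda: [0, 0])   # group -> [running sum, row count]
    for row in rows:
        acc = totals[row[group_by]]
        acc[0] += row["value"]
        acc[1] += 1
    return {group: (s, s / n) for group, (s, n) in totals.items()}

# Rows as they might come back for Metric=Sales AND Time=Month (query a).
fetched = [
    {"period": "Jan-10", "value": 10},
    {"period": "Jan-10", "value": 20},
    {"period": "Feb-10", "value": 30},
    {"period": "Feb-10", "value": 10},
    {"period": "Feb-10", "value": 90},
]

for period, (total, average) in aggregate(fetched, "period").items():
    print(period, total, average)
```

This keeps the database doing only keyed reads, at the cost of shipping every matching row to the client before it can be summed.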

Thanks
Naresh


On Thu, Jan 9, 2014 at 9:41 PM, Thunder Stumpges thunder.stump...@gmail.com
 wrote:

 This sort of work sounds much more like a Hadoop/Hive/Pig type of analysis.

 What are your latency requirements on queries? Are they ad-hoc or part of
 an application? What is the case where you would need to change an existing
 value? If it is write once, then Hadoop/Hive is great, if it changes
 randomly, then not so much.

 Cassandra has the limitation that it does not support aggregation; that must
 be done by a client. In my experience it is really suited to quickly writing
 lots of data and looking it back up in a random-I/O manner when you
 already know the key you are looking for.

 If you have the high speed write and rewrite needs, but also the full
 data analytical requirements, there are plugins for using C* as a backing
 store for Pig/Hive. It is a little finicky to get working depending on all
 your versions but does work fairly well in my limited experience.

 Perhaps with a little better understanding of your workload needs others
 can chime in too. Good luck.

 -Thunder





Re: Help on Designing Cassandra table for my usecase

2014-01-09 Thread Naresh Yadav
@Thunder It will be write-once 80% of the time, but there can be cases where the
client makes a correction to the data and then we need to overwrite it.

Thanks
Naresh









Re: Help on Designing Cassandra table for my usecase

2014-01-09 Thread Naresh Yadav
@Thunder thanks for suggesting a design, but my main problem is
indexing/querying the dynamic Tags on each row; they are the main context of
each row, and most queries will include them.

As an alternative to Cassandra, I tried Apache Blur. In a Blur table I am
able to store exactly the same data, and all the queries worked, so Blur allows
dynamic indexing of the tag column. BUT by moving away from Cassandra I am
losing its strengths, so I am not confident in this decision, as the
data will be huge in my case.

Please guide me on this with better suggestions.

Thanks
Naresh

On Fri, Jan 10, 2014 at 2:33 AM, Thunder Stumpges 
thunder.stump...@gmail.com wrote:

 Well, I think you essentially have time-series data, which C* should handle
 well; however, I think your Tag column is going to cause trouble. C* does
 have collection columns, but they are neither indexable nor usable in a
 WHERE clause. Your example uses Tag both for the uniqueness of the data
 (primary key) and for query filtering on potentially multiple Tag values.
 That is not supported in C* AFAIK. If it were a single Tag, that could
 possibly be an indexed column.

 Ignoring that issue with the many different Tags, You could model the
 table as:

 CREATE TABLE metric_data (
   metric text,
   time text,
   period text,
   tag text,
   value int,
   PRIMARY KEY( (metric,time), period, tag)
 );

 That would make a composite partition key on metric and time, meaning
 you'd always have to pass both (or else page through all rows via TOKEN).
 After specifying metric and time, you could optionally also specify
 period, or period and tag, and results would be ordered (clustered) by
 period and then tag. This would satisfy your queries a and b, and d once
 Time is added to it, but not c (as you did not specify time). If Time is
 a granularity column, does it even make sense to return records across
 differing time values? What does it mean to return the 4 month rows and
 1 year row in your example? Could you issue N queries in this case
 (where N is the small number of time granularities you have)?
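
The restriction rule described here can be sketched as a small Python check. This is a simplified model only: it assumes equality restrictions, no secondary indexes and no ALLOW FILTERING, and the function name `can_serve` is made up for the illustration. Under these assumptions queries a and b fit the proposed key, c and d as posted do not (neither restricts Time), and d fits once Time is added, still ignoring the multiple-Tag issue.

```python
# Simplified model of which WHERE clauses the proposed key can serve directly.
# PRIMARY KEY ((metric, time), period, tag)
PARTITION_KEY = ("metric", "time")
CLUSTERING = ("period", "tag")

def can_serve(restricted):
    """True if equality restrictions on these columns address one partition
    and use the clustering columns as a left-to-right prefix."""
    cols = set(restricted)
    # Every partition key column must be restricted.
    if not all(c in cols for c in PARTITION_KEY):
        return False
    # The remaining restricted columns must be a prefix of the clustering order.
    rest = cols - set(PARTITION_KEY)
    prefix = set()
    for c in CLUSTERING:
        if c not in rest:
            break
        prefix.add(c)
    return rest == prefix

queries = {
    "a": ["metric", "time"],
    "b": ["metric", "time", "period"],
    "c": ["metric", "tag"],
    "d": ["metric", "period", "tag"],
    "d with Time added": ["metric", "time", "period", "tag"],
}
for name, cols in queries.items():
    print(name, can_serve(cols))
```

For query c, the "issue N queries" idea would amount to looping over the known Time values and merging the results on the client, although restricting tag without period would still need filtering or a different clustering order.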

 I'm not sure how close that gets you, or if you can re-work your concept
 of Tag at all.
 Good luck.
 Thunder



 On Thu, Jan 9, 2014 at 10:45 AM, Hannu Kröger hkro...@gmail.com wrote:

 To my eye that looks like what traditional analytics systems do.
 You can check out e.g. Acunu Analytics, which uses Cassandra as a backend.

 Cheers,
 Hannu








Re: Setting up a multi-node cluster

2013-08-27 Thread Naresh Yadav
You would also need to configure rpc_address with the hostname/IP on both
nodes.
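
Separately from rpc_address, it may be worth confirming that the second node can open a TCP connection to the seed at all, since the quoted error is "Unable to contact any seeds!". The Python sketch below is only a generic reachability check, not a Cassandra tool. The helper name `can_reach` is made up, and port 7000 is an assumption (Cassandra's default storage_port); use whatever your cassandra.yaml actually sets.

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

# Example, run from the second node (10.96.19.207 is the seed IP in the thread;
# 7000 is assumed to be the storage_port in use):
#   print(can_reach("10.96.19.207", 7000))
```

If this returns False from the second node, a firewall or a listen_address mismatch on the seed would be a likely place to look before touching anything else.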

Naresh

On Wed, Aug 28, 2013 at 10:15 AM, Dinesh dinesh.gad...@gmail.com wrote:

 Hi,

 I am trying to set up a two-node Cassandra cluster.

 I am able to start the first node, but I see the following exception while
 starting the second node:

 ERROR 17:31:34,315 Exception encountered during startup
 java.lang.IllegalStateException: Unable to contact any seeds!
     at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:947)
     at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:716)
     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
 java.lang.IllegalStateException: Unable to contact any seeds!
     at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:947)
     at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:716)
     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:447)
     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:490)
 Exception encountered during startup: Unable to contact any seeds!
 ERROR 17:31:34,322 Exception in thread Thread[StorageServiceShutdownHook,5,main]
 java.lang.NullPointerException
     at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
     at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
     at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)




 =
 My yaml configuration files have these modified


 first node yaml
 ---
 initial_token: -9223372036854775808 # generated this using tokengen tool
 seeds: 10.96.19.207 # which is the IP of first node
 listen_address: 10.96.19.207 # which is the IP of first node itself
 rpc_address: 0.0.0.0

 second node yaml
 
 initial_token: 0
 seeds: 10.96.19.207 # which is the IP of first node
 listen_address: 10.96.10.223 # which is the IP of second node
 rpc_address: 0.0.0.0


 ==

 Can anyone please help me what went wrong with my configuration?

 Regards
 Dinesh






Re: Cassandra HANGS after some writes

2013-08-16 Thread Naresh Yadav
Just to update everyone: as per the expert advice, I tried running this on a
Linux machine, but I still have exactly the same problem on Linux, with no
difference in performance. I tried with the default yaml and a heap size of
8GB. Now please advise me on Cassandra Linux optimizations and I will try those.

Naresh

On Wed, Aug 14, 2013 at 10:43 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Aug 13, 2013 at 10:39 PM, Naresh Yadav nyadav@gmail.com wrote:

 I made one single change in default cassandra.yaml, just to experiment.

 native_transport_min_threads: 1
 native_transport_max_threads: 1

 with max one single thread for native protocol requests i noticed some
 improvement, earlier with default yaml most of time it was failing after
 *10K* combinations BUT with this it worked storing *30K* combinations
 out of 1 lakh..


 Most of the time, when your node fails, it is because Java Garbage
 Collection is failing. This is usually because you are writing faster than
 you can flush to disk and also collect the garbage. Usually people inspect
 the GC logs and/or use something like jconsole to inspect the JMX interface
 of the JVM and the Cassandra metrics exposed there.
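
If the node really is being written to faster than it can flush, one client-side experiment is to cap how many writes are in flight at once. The Python sketch below shows the general pattern with a bounded semaphore. It is not something proposed in the thread and does not use any Cassandra driver; `store_one`, `run_throttled` and `fake_store` are hypothetical names, and `fake_store` stands in for whatever stores a single combination.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_throttled(store_one, items, max_in_flight=8):
    """Call store_one(item) for every item, with at most max_in_flight
    calls submitted and unfinished at any moment."""
    gate = threading.BoundedSemaphore(max_in_flight)

    def release(_future):
        gate.release()                  # free a slot when a call finishes

    futures = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for item in items:
            gate.acquire()              # producer blocks while all slots are busy
            future = pool.submit(store_one, item)
            future.add_done_callback(release)
            futures.append(future)
        return [f.result() for f in futures]

def fake_store(combination_id):
    # Placeholder for the real work of storing one combination.
    return combination_id * 2

print(run_throttled(fake_store, range(5), max_in_flight=2))
```

Lowering `max_in_flight` trades throughput for a steadier load on the server, which is roughly what the single native transport thread experiment did from the server side.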

 =Rob




Cassandra HANGS after some writes

2013-08-13 Thread Naresh Yadav
Hi All,

I have a single-node Cassandra setup, accessed via CQL using the DataStax Java
driver 1.0.1, with Cassandra version 1.2.6.

*Infrastructure:* 16GB machine with 8GB heap given to Cassandra, i7
processor. DEFAULT cassandra.yaml, no change made by me.
-Xms1G^
 -Xmx12G^ no other change in cassandra.bat

*Problem:*
Cassandra freezes after some writes, and I see no activity on the Cassandra
console for an hour. All Native_Transport threads are also killed. My
program keeps running and NO ERROR is reported. When I connect with cql, that
works. At start it creates 16 NativeTransport threads, and after
10-15 minutes the total thread count goes to 128. Just before it hangs, when I
look at the Native_Transport threads in JConsole, I find most of them in
this state:

http://pastebin.com/DeShpHtP

*Load on Cassandra:*
I am running a use case which stores Combinations (my project terminology) in
Cassandra. I am currently testing storing 2.5 lakh combinations with 100
parallel threads, each thread storing one combination. In the real case I need
to support many CRORES, but that would need different hardware and a multi-node
cluster.

Storing ONE combination takes around 2 sec and involves:

527 INSERT INTO queries

506 UPDATE queries

954 SELECT queries

100 parallel threads store 100 combinations at a time.
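
The figures above imply a fairly heavy request rate for a single node. The short Python calculation below works it out, under the assumption that the 2-second figure still holds with all 100 threads running, which the freeze suggests it may not.

```python
# Back-of-the-envelope load estimate from the numbers in this message.
inserts, updates, selects = 527, 506, 954
queries_per_combination = inserts + updates + selects      # 1987

threads = 100                    # combinations stored in parallel
seconds_per_combination = 2      # assumed to hold under full concurrency
combinations = 250_000           # 2.5 lakh

total_queries = combinations * queries_per_combination
queries_per_second = threads * queries_per_combination / seconds_per_combination

print(queries_per_combination)   # 1987
print(total_queries)             # 496750000
print(queries_per_second)        # 99350.0
```

So the test asks one node for roughly 100,000 statements per second, about half of them reads, and close to half a billion statements in total.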
*MY CASSANDRA LOGS:*

http://pastebin.com/CnNvA9x3

Please look at the last 100-200 lines of the log, because that is when it froze.


PLEASE HELP ME OUT, I HAVE NOT BEEN ABLE TO PROCEED FOR 1 WEEK...


Re: Cassandra HANGS after some writes

2013-08-13 Thread Naresh Yadav
Thanks Alain, I will avoid caps. I am a newbie to Cassandra; I just started
using it 2 weeks back.

Here are JConsole screenshots taken just 5 minutes after Cassandra froze:

http://i.imgur.com/3oUBjKU.png
http://i.imgur.com/2O4PrKb.png
http://i.imgur.com/zxhFzr1.png   4:05 is the time Cassandra froze,
which is why the number of threads declines
http://i.imgur.com/ScgAciv.png
I uploaded the complete Cassandra system.log up to the freeze:
http://www.scribd.com/doc/159949231/Cassandrasystem-log

Observation: in my use case I am storing 1 lakh
combinations (527 inserts, 506 updates, 954 selects each), in parallel with 100
threads, in batches of 1000.
Sometimes it works till the 1000 batch and then hangs, but sometimes it completes
1 then hangs, and once it even worked for more than a lakh.
With the same hardware and the same Cassandra settings I see random
performance behaviour.

Thanks
Naresh

On Tue, Aug 13, 2013 at 3:48 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi Naresh.

 First thing, there is no need for caps here. People reading this ML are
 here to help when they have enough time and skill to do so. So please,
 chill out and do not use caps to show how desperate you are.

 Concerning your problem, the only abnormal thing I was able to find in
 your logs is


1. ERROR [NonPeriodicTasks:1] 2013-08-13 01:52:42,106
SSTableDeletingTask.java (line 72) Unable to delete

 \var\lib\cassandra\data\system\schema_columnfamilies\system-schema_columnfamilies-ic-241-Data.db
(it will be removed on server restart; we'll also retry after GC)


 I don't think this should keep C* hanging.

 Do you have anything in the kernel logs?

 Do you monitor any metrics like disk throughput / heap used / CPU
 load / iowait, which are known bottlenecks / pertinent metrics?

 Alain






Re: Cassandra HANGS after some writes

2013-08-13 Thread Naresh Yadav
Hi Alex,

Yes, I am testing in a development environment on Windows 7 64-bit.
I left the default yaml, so Cassandra created a var folder and created the data,
log, and cache folders in it. I tried putting the commit log on a different hard
disk, but that did not solve the problem. I guess this problem is somehow
related to a deadlock in the Native Transport threads, which is why Cassandra is
hanging indefinitely.

Naresh

On Tue, Aug 13, 2013 at 7:21 PM, Alexis Rodríguez 
arodrig...@inconcertcc.com wrote:

 Naresh, are you deploying cassandra in windows?

 If that is the case you may need to change the data and commitlog
 directories in cassandra.yaml. Also you should check the log directories.

 See the section 2.1  http://wiki.apache.org/cassandra/GettingStarted











Re: Cassandra HANGS after some writes

2013-08-13 Thread Naresh Yadav
Hi all,

I started with Cassandra a few weeks back and I am in a development environment;
it will take months to reach production, as everything is in development. But I
will spend time setting up one machine with Ubuntu and will check whether a
similar problem occurs. Also, I have started hands-on work with Hadoop, so Linux
would be a must for me in production.

Till then, if anybody can give me some pointers to try on Windows in parallel,
please do; most of my team is not familiar
with the Linux environment, which is why we started on Windows.

Thanks
Naresh

On Tue, Aug 13, 2013 at 9:37 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 @Kanwar Sangha

 Cassandra on windows ? Please install Linux !

 Useful comment, please spare your time and stop that troll.

 He surely has his reasons to use Windows (I suppose it is a dev constraint
 or choice). Anyway, C* is available for Windows, so it should work. Comments
 like "Windows sucks, go Linux or macOS" are not going to solve his issue.
 If Cassandra can't be run on Windows, just don't package Cassandra for
 Windows.

 We can only recommend that Naresh *not* use Windows as the OS for his
 production nodes.

 Alain


 2013/8/13 Kanwar Sangha kan...@mavenir.com

  Cassandra on windows ? Please install Linux ! 


 *From:* Romain HARDOUIN [mailto:romain.hardo...@urssaf.fr]
 *Sent:* 13 August 2013 10:17
 *To:* user@cassandra.apache.org
 *Subject:* Re: Cassandra HANGS after some writes


 Naresh,

 My two cents is that you should run Cassandra on a Linux VM.
 Issues are more easy to diagnose/pinpoint. Windows is a bit obscure to
 many people here.

 Cheers

 Alexis Rodríguez arodrig...@inconcertcc.com wrote on 13/08/2013
 16:50:42:

  From: Alexis Rodríguez arodrig...@inconcertcc.com
  To: user@cassandra.apache.org,
  Date: 13/08/2013 16:51
  Subject: Re: Cassandra HANGS after some writes
 
  Naresh,
 
  Windows is not my cup of tea. May be someone else has more
  experience using the Redmond's prodigy child.
 
  cheers, and good luck 





Re: Cassandra HANGS after some writes

2013-08-13 Thread Naresh Yadav
I made one single change to the default cassandra.yaml, just to experiment:

native_transport_min_threads: 1
native_transport_max_threads: 1

With a maximum of one thread for native protocol requests I noticed some
improvement. Earlier, with the default yaml, most of the time it was failing
after *10K* combinations, BUT with this change it stored *30K* combinations out
of 1 lakh.

Please guide me further on this hint...

Naresh

On Tue, Aug 13, 2013 at 11:06 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Aug 13, 2013 at 10:34 AM, Andrew Cobley
 a.e.cob...@dundee.ac.uk wrote:

 Has anyone ever done any performance comparisons of linux vs a headless
 windows server ?


 No, but given the number of linux specific optimizations in Cassandra, I
 would expect this to be no contest.

 =Rob