When a node is dead in a Cassandra cluster

2015-09-21 Thread Shenghua(Daniel) Wan
Hi,
When a node is dead, is it supposed to remain in the ring? When I find a
node is lost and check with nodetool and OpsCenter, I still see the
lost node in the token ring. When I call describe_ring, the lost node is
also returned. Is this how it is supposed to be? Why doesn't the C* server
hide the lost nodes from the clients?

Thanks a lot!

-- 

Regards,
Shenghua (Daniel) Wan


Re: CqlInputFormat and retired CqlPagingInputFormat create lots of connections to query the server

2015-01-28 Thread Shenghua(Daniel) Wan
That's the C* default setting. My version is 2.0.11. Check your cassandra.yaml.
On Jan 28, 2015 4:53 PM, "Huiliang Zhang"  wrote:

> If you are using replication factor 1 and 3 Cassandra nodes, 256 virtual
> nodes should be evenly distributed over the 3 nodes, so there are 256
> virtual nodes in total. But in your experiment you saw 3*257 mappers. Is
> that because of the setting cassandra.input.split.size=3? It has nothing to
> do with the node number being 3. Otherwise, I am confused why there are 256
> virtual nodes on every Cassandra node.
>
> On Wed, Jan 28, 2015 at 12:29 AM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> I did another experiment to verify that indeed 3*257 mappers (1 of the 257
>> ranges is effectively null) were created.
>>
>> Thanks mck for the information!
>>
>> On Wed, Jan 28, 2015 at 12:17 AM, mck  wrote:
>>
>>> Shenghua,
>>>
>>> > The problem is the user might only want all the data via a "select *"-like
>>> > statement. It seems that 257 connections to query the rows are necessary.
>>> > However, is there any way to avoid 257 concurrent connections?
>>>
>>>
>>> Your reasoning is correct.
>>> The number of connections should be tunable via the
>>> "cassandra.input.split.size" property. See
>>> ConfigHelper.setInputSplitSize(..)
>>>
>>> The problem is that vnodes completely trash this, since the splits
>>> returned don't span across vnodes.
>>> There's an issue out for this –
>>> https://issues.apache.org/jira/browse/CASSANDRA-6091
>>>  but part of the problem is that the thrift stuff involved here is
>>>  getting rewritten¹ to be pure cql.
>>>
>>> In the meantime you can override CqlInputFormat and manually re-merge
>>> splits where the location sets match, so as to better honour
>>> inputSplitSize and return a more reasonable number of connections.
>>> We do this, using code similar to this patch
>>> https://github.com/michaelsembwever/cassandra/pull/2/files
>>>
>>> ~mck
>>>
>>> ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358
>>>
>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


Re: CqlInputFormat and retired CqlPagingInputFormat create lots of connections to query the server

2015-01-28 Thread Shenghua(Daniel) Wan
I did another experiment to verify that indeed 3*257 mappers (1 of the 257
ranges is effectively null) were created.

Thanks mck for the information!
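The re-merging workaround mck describes below can be sketched roughly as
follows. This is a simplified, hypothetical sketch: the Split class is a
stand-in for Cassandra's ColumnFamilySplit, not the real class, and the
merge policy is just "greedily combine adjacent splits whose replica sets
match, up to the requested input split size".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of re-merging input splits whose replica ("location") sets match,
// so that fewer Hadoop mappers (and hence fewer connections) are created.
public class SplitMerger {

    // Simplified stand-in for ColumnFamilySplit: a token range, its
    // estimated size, and the replicas that own it.
    public static class Split {
        final long start, end, length;
        final Set<String> locations;
        public Split(long start, long end, long length, Set<String> locations) {
            this.start = start;
            this.end = end;
            this.length = length;
            this.locations = locations;
        }
    }

    // Greedily merge consecutive splits with identical location sets while
    // the combined length stays within inputSplitSize.
    public static List<Split> merge(List<Split> splits, long inputSplitSize) {
        List<Split> merged = new ArrayList<>();
        Split current = null;
        for (Split s : splits) {
            if (current != null
                    && current.locations.equals(s.locations)
                    && current.length + s.length <= inputSplitSize) {
                // same replicas and still under the target size: extend
                current = new Split(current.start, s.end,
                                    current.length + s.length, current.locations);
            } else {
                if (current != null) merged.add(current);
                current = s;
            }
        }
        if (current != null) merged.add(current);
        return merged;
    }
}
```

On a 1-node cluster, where all 256 vnode ranges share the same location set,
this kind of merge collapses them back toward a handful of splits.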

On Wed, Jan 28, 2015 at 12:17 AM, mck  wrote:

> Shenghua,
>
> > The problem is the user might only want all the data via a "select *"-like
> > statement. It seems that 257 connections to query the rows are necessary.
> > However, is there any way to avoid 257 concurrent connections?
>
>
> Your reasoning is correct.
> The number of connections should be tunable via the
> "cassandra.input.split.size" property. See
> ConfigHelper.setInputSplitSize(..)
>
> The problem is that vnodes completely trash this, since the splits
> returned don't span across vnodes.
> There's an issue out for this –
> https://issues.apache.org/jira/browse/CASSANDRA-6091
>  but part of the problem is that the thrift stuff involved here is
>  getting rewritten¹ to be pure cql.
>
> In the meantime you can override CqlInputFormat and manually re-merge
> splits where the location sets match, so as to better honour
> inputSplitSize and return a more reasonable number of connections.
> We do this, using code similar to this patch
> https://github.com/michaelsembwever/cassandra/pull/2/files
>
> ~mck
>
> ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358
>



-- 

Regards,
Shenghua (Daniel) Wan


Re: CqlInputFormat and retired CqlPagingInputFormat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
For clarification, please check out the source code I got from C* v2.0.11,
in AbstractColumnFamilyInputFormat.getSplits(JobContext context),
lines 125 and 168:

    // canonical ranges and nodes holding replicas
    List<TokenRange> masterRangeNodes = getRangeMap(conf);

    for (TokenRange range : masterRangeNodes)
    {
        if (jobRange == null)
        {
            // for each range, pick a live owner and ask it to
            // compute bite-sized splits
            splitfutures.add(executor.submit(new SplitCallable(range, conf)));
        }
    }

My understanding of this part of the source code is that for each token
range, it will create a connection to the server.
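A self-contained model of that fan-out: one callable is submitted per token
range, and each one "opens" its own connection. Here the connection is just a
counter; in the real code each SplitCallable opens a thrift connection to a
live replica and asks it for bite-sized splits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Models getSplits(): one task per token range, one connection per task.
public class SplitFanOut {
    public static int connectionsOpened(int numRanges) throws Exception {
        final AtomicInteger connections = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(16);
        List<Future<?>> splitFutures = new ArrayList<>();
        for (int range = 0; range < numRanges; range++) {
            splitFutures.add(executor.submit(new Callable<Integer>() {
                public Integer call() {
                    // stand-in for "pick a live owner and ask it to
                    // compute bite-sized splits"
                    return connections.incrementAndGet();
                }
            }));
        }
        for (Future<?> f : splitFutures) f.get();
        executor.shutdown();
        return connections.get();
    }
}
```

So 257 ranges means 257 split computations, each with its own connection,
which matches the log output in this thread.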


On Tue, Jan 27, 2015 at 11:21 PM, Huiliang Zhang  wrote:

> In that case, each node will have 256/3 connections at most. Still 256
> mappers. Someone please correct me if I am wrong.
>
> On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> Hi, Huiliang,
>> Great to hear from you, again!
>> Imagine you have 3 nodes, replication factor=1, and the default number of
>> tokens. You will have 3*256 mappers... In that case, you will soon run out
>> of mappers or reach the limit.
>>
>>
>> On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang 
>> wrote:
>>
>>> Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
>>> will not share connections, so at least 256 connections are needed to read
>>> it all. But all 256 connections should not be set up at the same time unless
>>> you have 256 mappers running at the same time.
>>>
>>> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
>>> wansheng...@gmail.com> wrote:
>>>
>>>> By default, each C* node is set with 256 tokens. On a local 1-node C*
>>>> server, my Hadoop job creates 256 connections to the server. Is there any
>>>> way to control this behavior? e.g. reduce the number of connections to a
>>>> pre-configured cap.
>>>>
>>>> I debugged the C* source code and found the client asks for partition
>>>> ranges, i.e. virtual nodes. The client was then told by the server that
>>>> there were 257 ranges, corresponding to 257 column family splits.
>>>>
>>>> Here is a snapshot of my logs:
>>>>
>>>> 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
>>>> ColumnFamilySplit((9121856086738887846, '-9223372036854775808] 
>>>> @[localhost])
>>>> ...
>>>> 257 splits in total.
>>>>
>>>> The problem is the user might only want all the data via a "select *"-like
>>>> statement. It seems that 257 connections to query the rows are necessary.
>>>> However, is there any way to avoid 257 concurrent connections?
>>>>
>>>> My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which
>>>> has the same behavior.
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>>
>>>> Regards,
>>>> Shenghua (Daniel) Wan
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: CqlInputFormat and retired CqlPagingInputFormat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
I mean that when the number of nodes grows, there are more virtual nodes in
total. For each vnode (or partition range), a connection will be created.
For 3 nodes, 256 tokens each, and replication factor=1 for simplicity, there
will be 3*256 virtual nodes, and therefore that many connections. Let me
know if there is any incorrect reasoning here. Thanks.
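That count, as a trivial back-of-envelope sketch (assuming RF=1; a higher
replication factor gives each range more candidate replicas but does not
change the number of ranges):

```java
// Rough range count for a vnode cluster: each node contributes num_tokens
// vnodes, so a full-scan job with RF=1 sees about nodes * numTokens token
// ranges, and hence that many splits/mappers/connections.
public class VnodeCount {
    public static int totalRanges(int nodes, int numTokensPerNode) {
        return nodes * numTokensPerNode;
    }
}
```

With the defaults in this thread: 1 node gives 256 ranges (the 257 observed
includes the wrap-around range), and 3 nodes give 768.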

On Tue, Jan 27, 2015 at 11:21 PM, Huiliang Zhang  wrote:

> In that case, each node will have 256/3 connections at most. Still 256
> mappers. Someone please correct me if I am wrong.
>
> On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> Hi, Huiliang,
>> Great to hear from you, again!
>> Imagine you have 3 nodes, replication factor=1, and the default number of
>> tokens. You will have 3*256 mappers... In that case, you will soon run out
>> of mappers or reach the limit.
>>
>>
>> On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang 
>> wrote:
>>
>>> Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
>>> will not share connections, so at least 256 connections are needed to read
>>> it all. But all 256 connections should not be set up at the same time unless
>>> you have 256 mappers running at the same time.
>>>
>>> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
>>> wansheng...@gmail.com> wrote:
>>>
>>>> By default, each C* node is set with 256 tokens. On a local 1-node C*
>>>> server, my Hadoop job creates 256 connections to the server. Is there any
>>>> way to control this behavior? e.g. reduce the number of connections to a
>>>> pre-configured cap.
>>>>
>>>> I debugged the C* source code and found the client asks for partition
>>>> ranges, i.e. virtual nodes. The client was then told by the server that
>>>> there were 257 ranges, corresponding to 257 column family splits.
>>>>
>>>> Here is a snapshot of my logs:
>>>>
>>>> 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
>>>> ColumnFamilySplit((9121856086738887846, '-9223372036854775808] 
>>>> @[localhost])
>>>> ...
>>>> 257 splits in total.
>>>>
>>>> The problem is the user might only want all the data via a "select *"-like
>>>> statement. It seems that 257 connections to query the rows are necessary.
>>>> However, is there any way to avoid 257 concurrent connections?
>>>>
>>>> My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which
>>>> has the same behavior.
>>>>
>>>> Thank you.
>>>>
>>>> --
>>>>
>>>> Regards,
>>>> Shenghua (Daniel) Wan
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: CqlInputFormat and retired CqlPagingInputFormat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
Hi, Huiliang,
Great to hear from you, again!
Imagine you have 3 nodes, replication factor=1, and the default number of
tokens. You will have 3*256 mappers... In that case, you will soon run out
of mappers or reach the limit.


On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang  wrote:

> Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
> will not share connections, so at least 256 connections are needed to read
> it all. But all 256 connections should not be set up at the same time unless
> you have 256 mappers running at the same time.
>
> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> By default, each C* node is set with 256 tokens. On a local 1-node C*
>> server, my Hadoop job creates 256 connections to the server. Is there any
>> way to control this behavior? e.g. reduce the number of connections to a
>> pre-configured cap.
>>
>> I debugged the C* source code and found the client asks for partition ranges,
>> i.e. virtual nodes. The client was then told by the server that there were
>> 257 ranges, corresponding to 257 column family splits.
>>
>> Here is a snapshot of my logs:
>>
>> 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
>> ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
>> ...
>> 257 splits in total.
>>
>> The problem is the user might only want all the data via a "select *"-like
>> statement. It seems that 257 connections to query the rows are necessary.
>> However, is there any way to avoid 257 concurrent connections?
>>
>> My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
>> the same behavior.
>>
>> Thank you.
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: Re: full-table scan - extracting all data from C*

2015-01-27 Thread Shenghua(Daniel) Wan
Cool. What about performance? e.g. how many records in how much time?

On Tue, Jan 27, 2015 at 10:16 PM, Xu Zhongxing 
wrote:

> For Java driver, there is no special API actually, just
>
> ResultSet rs = session.execute("select * from ...");
> for (Row r : rs) {
>...
> }
>
> For Spark, the code skeleton is:
>
> val rdd = sc.cassandraTable("ks", "table")
>
> then call the various standard Spark APIs to process the table in parallel.
>
> I have not used CqlInputFormat.
>
> At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan" 
> wrote:
>
> Hi, Zhongxing,
> I am also interested in your table size. I am trying to dump tens of millions
> of records from C* using map-reduce-related APIs like CqlInputFormat.
> You mentioned the Java driver. Could you suggest which API you used?
> Thanks.
>
> On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing 
> wrote:
>
>> Both Java driver "select * from table" and Spark sc.cassandraTable() work
>> well.
>> I use both of them frequently.
>>
>> At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:
>>
>>  Hi –
>>
>>
>>
>> Over the last few weeks, I have seen several emails on this mailing list
>> from people trying to extract all data from C*, so that they can import
>> that data into other analytical tools that provide much richer analytics
>> functionality than C*. Extracting all data from C* is a full-table scan,
>> which is not the ideal use case for C*. However, people don’t have much
>> choice if they want to do ad-hoc analytics on the data in C*.
>> Unfortunately, I don’t think C* comes with any built-in tools that make
>> this task easy for a large dataset. Please correct me if I am wrong. Cqlsh
>> has a COPY TO command, but it doesn’t really work if you have a large
>> amount of data in C*.
>>
>>
>>
>> I am aware of a couple of approaches for extracting all data from a table
>> in C*:
>>
>> 1)  Iterate through all the C* partitions (physical rows) using the
>> Java Driver and CQL.
>>
>> 2)  Extract the data directly from SSTables files.
>>
>>
>>
>> Either approach can be used with Hadoop or Spark to speed up the
>> extraction process.
>>
>>
>>
>> I wanted to do a quick survey and find out how many people on this
>> mailing list have successfully used approach #1 or #2 for extracting large
>> datasets (terabytes) from C*. Also, if you have used some other techniques,
>> it would be great if you could share your approach with the group.
>>
>>
>>
>> Mohammed
>>
>>
>>
>>
>
>
> --
>
> Regards,
> Shenghua (Daniel) Wan
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: full-table scan - extracting all data from C*

2015-01-27 Thread Shenghua(Daniel) Wan
Recently I surveyed this topic and you may want to take a look at
https://github.com/fullcontact/hadoop-sstable
and
https://github.com/Netflix/aegisthus


On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

> Both Java driver "select * from table" and Spark sc.cassandraTable() work
> well.
> I use both of them frequently.
>
> At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:
>
>  Hi –
>
>
>
> Over the last few weeks, I have seen several emails on this mailing list
> from people trying to extract all data from C*, so that they can import
> that data into other analytical tools that provide much richer analytics
> functionality than C*. Extracting all data from C* is a full-table scan,
> which is not the ideal use case for C*. However, people don’t have much
> choice if they want to do ad-hoc analytics on the data in C*.
> Unfortunately, I don’t think C* comes with any built-in tools that make
> this task easy for a large dataset. Please correct me if I am wrong. Cqlsh
> has a COPY TO command, but it doesn’t really work if you have a large
> amount of data in C*.
>
>
>
> I am aware of a couple of approaches for extracting all data from a table in
> C*:
>
> 1)  Iterate through all the C* partitions (physical rows) using the
> Java Driver and CQL.
>
> 2)  Extract the data directly from SSTables files.
>
>
>
> Either approach can be used with Hadoop or Spark to speed up the
> extraction process.
>
>
>
> I wanted to do a quick survey and find out how many people on this mailing
> list have successfully used approach #1 or #2 for extracting large datasets
> (terabytes) from C*. Also, if you have used some other techniques, it would
> be great if you could share your approach with the group.
>
>
>
> Mohammed
>
>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: full-table scan - extracting all data from C*

2015-01-27 Thread Shenghua(Daniel) Wan
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions
of records from C* using map-reduce-related APIs like CqlInputFormat.
You mentioned the Java driver. Could you suggest which API you used? Thanks.

On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

> Both Java driver "select * from table" and Spark sc.cassandraTable() work
> well.
> I use both of them frequently.
>
> At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:
>
>  Hi –
>
>
>
> Over the last few weeks, I have seen several emails on this mailing list
> from people trying to extract all data from C*, so that they can import
> that data into other analytical tools that provide much richer analytics
> functionality than C*. Extracting all data from C* is a full-table scan,
> which is not the ideal use case for C*. However, people don’t have much
> choice if they want to do ad-hoc analytics on the data in C*.
> Unfortunately, I don’t think C* comes with any built-in tools that make
> this task easy for a large dataset. Please correct me if I am wrong. Cqlsh
> has a COPY TO command, but it doesn’t really work if you have a large
> amount of data in C*.
>
>
>
> I am aware of a couple of approaches for extracting all data from a table in
> C*:
>
> 1)  Iterate through all the C* partitions (physical rows) using the
> Java Driver and CQL.
>
> 2)  Extract the data directly from SSTables files.
>
>
>
> Either approach can be used with Hadoop or Spark to speed up the
> extraction process.
>
>
>
> I wanted to do a quick survey and find out how many people on this mailing
> list have successfully used approach #1 or #2 for extracting large datasets
> (terabytes) from C*. Also, if you have used some other techniques, it would
> be great if you could share your approach with the group.
>
>
>
> Mohammed
>
>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


CqlInputFormat and retired CqlPagingInputFormat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
By default, each C* node is set with 256 tokens. On a local 1-node C*
server, my Hadoop job creates 256 connections to the server. Is there any
way to control this behavior? e.g. reduce the number of connections to a
pre-configured cap.

I debugged the C* source code and found the client asks for partition ranges,
i.e. virtual nodes. The client was then told by the server that there were 257
ranges, corresponding to 257 column family splits.

Here is a snapshot of my logs:

15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
...
257 splits in total.

The problem is the user might only want all the data via a "select *"-like
statement. It seems that 257 connections to query the rows are necessary.
However, is there any way to avoid 257 concurrent connections?

My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
the same behavior.

Thank you.

-- 

Regards,
Shenghua (Daniel) Wan