Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-28 Thread mck
Shenghua,
 
 The problem is the user might only want all the data via a select *
 like statement. It seems that 257 connections to query the rows are necessary.
 However, is there any way to prohibit 257 concurrent connections?


Your reasoning is correct.
The number of connections should be tunable via the
cassandra.input.split.size property. See
ConfigHelper.setInputSplitSize(..)

The problem is that vnodes completely trashes this, since splits
returned don't span across vnodes.
There's an issue out for this –
https://issues.apache.org/jira/browse/CASSANDRA-6091
 but part of the problem is that the thrift stuff involved here is
 getting rewritten¹ to be pure cql.

In the meantime you override the CqlInputFormat and manually re-merge
splits together, where location sets match, so to better honour
inputSplitSize and to return to a more reasonable number of connections.
We do this, using code similar to this patch
https://github.com/michaelsembwever/cassandra/pull/2/files

~mck

¹ https://issues.apache.org/jira/browse/CASSANDRA-8358


Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-28 Thread Shenghua(Daniel) Wan
I did another experiment to verify indeed 3*257 (1 of 257 ranges is null
effectively) mappers were created.

Thanks mcm for the information !

On Wed, Jan 28, 2015 at 12:17 AM, mck m...@apache.org wrote:

 Shenghua,

  The problem is the user might only want all the data via a select *
  like statement. It seems that 257 connections to query the rows are
 necessary.
  However, is there any way to prohibit 257 concurrent connections?


 Your reasoning is correct.
 The number of connections should be tunable via the
 cassandra.input.split.size property. See
 ConfigHelper.setInputSplitSize(..)

 The problem is that vnodes completely trashes this, since splits
 returned don't span across vnodes.
 There's an issue out for this –
 https://issues.apache.org/jira/browse/CASSANDRA-6091
  but part of the problem is that the thrift stuff involved here is
  getting rewritten¹ to be pure cql.

 In the meantime you override the CqlInputFormat and manually re-merge
 splits together, where location sets match, so to better honour
 inputSplitSize and to return to a more reasonable number of connections.
 We do this, using code similar to this patch
 https://github.com/michaelsembwever/cassandra/pull/2/files

 ~mck

 ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358




-- 

Regards,
Shenghua (Daniel) Wan


Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-28 Thread Huiliang Zhang
If you are using replication factor 1 and 3 cassandra nodes, 256 virtual
nodes should be evenly distributed on 3 nodes. So there are totally 256
virtual nodes. But in your experiment, you saw 3*257 mapper. Is that
because of the setting cassandra.input.split.size=3? It is nothing with
node number=3. Otherwise, I am confused why there are 256 virtual nodes on
every cassandra node.

On Wed, Jan 28, 2015 at 12:29 AM, Shenghua(Daniel) Wan 
wansheng...@gmail.com wrote:

 I did another experiment to verify indeed 3*257 (1 of 257 ranges is null
 effectively) mappers were created.

 Thanks mcm for the information !

 On Wed, Jan 28, 2015 at 12:17 AM, mck m...@apache.org wrote:

 Shenghua,

  The problem is the user might only want all the data via a select *
  like statement. It seems that 257 connections to query the rows are
 necessary.
  However, is there any way to prohibit 257 concurrent connections?


 Your reasoning is correct.
 The number of connections should be tunable via the
 cassandra.input.split.size property. See
 ConfigHelper.setInputSplitSize(..)

 The problem is that vnodes completely trashes this, since splits
 returned don't span across vnodes.
 There's an issue out for this –
 https://issues.apache.org/jira/browse/CASSANDRA-6091
  but part of the problem is that the thrift stuff involved here is
  getting rewritten¹ to be pure cql.

 In the meantime you override the CqlInputFormat and manually re-merge
 splits together, where location sets match, so to better honour
 inputSplitSize and to return to a more reasonable number of connections.
 We do this, using code similar to this patch
 https://github.com/michaelsembwever/cassandra/pull/2/files

 ~mck

 ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358




 --

 Regards,
 Shenghua (Daniel) Wan



Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-28 Thread Shenghua(Daniel) Wan
That's c* default setting. My version is 2.0.11. Check your Cassandra.yaml.
On Jan 28, 2015 4:53 PM, Huiliang Zhang zhl...@gmail.com wrote:

 If you are using replication factor 1 and 3 cassandra nodes, 256 virtual
 nodes should be evenly distributed on 3 nodes. So there are totally 256
 virtual nodes. But in your experiment, you saw 3*257 mapper. Is that
 because of the setting cassandra.input.split.size=3? It is nothing with
 node number=3. Otherwise, I am confused why there are 256 virtual nodes on
 every cassandra node.

 On Wed, Jan 28, 2015 at 12:29 AM, Shenghua(Daniel) Wan 
 wansheng...@gmail.com wrote:

 I did another experiment to verify indeed 3*257 (1 of 257 ranges is null
 effectively) mappers were created.

 Thanks mcm for the information !

 On Wed, Jan 28, 2015 at 12:17 AM, mck m...@apache.org wrote:

 Shenghua,

  The problem is the user might only want all the data via a select *
  like statement. It seems that 257 connections to query the rows are
 necessary.
  However, is there any way to prohibit 257 concurrent connections?


 Your reasoning is correct.
 The number of connections should be tunable via the
 cassandra.input.split.size property. See
 ConfigHelper.setInputSplitSize(..)

 The problem is that vnodes completely trashes this, since splits
 returned don't span across vnodes.
 There's an issue out for this –
 https://issues.apache.org/jira/browse/CASSANDRA-6091
  but part of the problem is that the thrift stuff involved here is
  getting rewritten¹ to be pure cql.

 In the meantime you override the CqlInputFormat and manually re-merge
 splits together, where location sets match, so to better honour
 inputSplitSize and to return to a more reasonable number of connections.
 We do this, using code similar to this patch
 https://github.com/michaelsembwever/cassandra/pull/2/files

 ~mck

 ¹ https://issues.apache.org/jira/browse/CASSANDRA-8358




 --

 Regards,
 Shenghua (Daniel) Wan





Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
I mean when the number of nodes grow, there are more virtual nodes in
total. For each vnode (or a partition range), a connection will be created.
For 3 node, 256 tokens each, replication factor=1 for simplicity, there
will be 3*256 virtual nodes, and therefore that many connections. Let me
know if there is any incorrect reasoning here. Thanks.

On Tue, Jan 27, 2015 at 11:21 PM, Huiliang Zhang zhl...@gmail.com wrote:

 In that case, each node will have 256/3 connections at most. Still 256
 mappers. Someone please correct me if I am wrong.

 On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan 
 wansheng...@gmail.com wrote:

 Hi, Huiliang,
 Great to hear from you, again!
 Image you have 3 nodes, replication factor=1, and using default number of
 tokens. You will have 3*256 mappers... In that case, you will be soon out
 of mappers or reach the limit.


 On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang zhl...@gmail.com
 wrote:

 Hi Shenghua, as I understand, each range is assigned to a mapper. Mapper
 will not share connections. So, it needs at least 256 connections to read
 all. But all 256 connections should not be set up at the same time unless
 you have 256 mappers running at the same time.

 On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan 
 wansheng...@gmail.com wrote:

 By default, each C* node is set with 256 tokens. On a local 1-node C*
 server, my hadoop drop creates 256 connections to the server. Is there any
 way to control this behavior? e.g. reduce the number of connections to a
 pre-configured gap.

 I debugged C* source code and found the client asks for partition
 ranges, or virtual nodes. Then the client was told by server there were 257
 ranges, corresponding to 257 column family splits.

 Here is a snapshot of my logs

 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
 ColumnFamilySplit((9121856086738887846, '-9223372036854775808] 
 @[localhost])
 ...
 totally 257 splits.

 The problem is the user might only want all the data via a select *
 like statement. It seems that 257 connections to query the rows are
 necessary. However, is there any way to prohibit 257 concurrent
 connections?

 My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which
 has same behavior.

 Thank you.

 --

 Regards,
 Shenghua (Daniel) Wan





 --

 Regards,
 Shenghua (Daniel) Wan





-- 

Regards,
Shenghua (Daniel) Wan


Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
Hi, Huiliang,
Great to hear from you, again!
Image you have 3 nodes, replication factor=1, and using default number of
tokens. You will have 3*256 mappers... In that case, you will be soon out
of mappers or reach the limit.


On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang zhl...@gmail.com wrote:

 Hi Shenghua, as I understand, each range is assigned to a mapper. Mapper
 will not share connections. So, it needs at least 256 connections to read
 all. But all 256 connections should not be set up at the same time unless
 you have 256 mappers running at the same time.

 On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan 
 wansheng...@gmail.com wrote:

 By default, each C* node is set with 256 tokens. On a local 1-node C*
 server, my hadoop drop creates 256 connections to the server. Is there any
 way to control this behavior? e.g. reduce the number of connections to a
 pre-configured gap.

 I debugged C* source code and found the client asks for partition ranges,
 or virtual nodes. Then the client was told by server there were 257 ranges,
 corresponding to 257 column family splits.

 Here is a snapshot of my logs

 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
 ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
 ...
 totally 257 splits.

 The problem is the user might only want all the data via a select *
 like statement. It seems that 257 connections to query the rows are
 necessary. However, is there any way to prohibit 257 concurrent
 connections?

 My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
 same behavior.

 Thank you.

 --

 Regards,
 Shenghua (Daniel) Wan





-- 

Regards,
Shenghua (Daniel) Wan


Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-27 Thread Huiliang Zhang
In that case, each node will have 256/3 connections at most. Still 256
mappers. Someone please correct me if I am wrong.

On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan 
wansheng...@gmail.com wrote:

 Hi, Huiliang,
 Great to hear from you, again!
 Image you have 3 nodes, replication factor=1, and using default number of
 tokens. You will have 3*256 mappers... In that case, you will be soon out
 of mappers or reach the limit.


 On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang zhl...@gmail.com wrote:

 Hi Shenghua, as I understand, each range is assigned to a mapper. Mapper
 will not share connections. So, it needs at least 256 connections to read
 all. But all 256 connections should not be set up at the same time unless
 you have 256 mappers running at the same time.

 On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan 
 wansheng...@gmail.com wrote:

 By default, each C* node is set with 256 tokens. On a local 1-node C*
 server, my hadoop drop creates 256 connections to the server. Is there any
 way to control this behavior? e.g. reduce the number of connections to a
 pre-configured gap.

 I debugged C* source code and found the client asks for partition
 ranges, or virtual nodes. Then the client was told by server there were 257
 ranges, corresponding to 257 column family splits.

 Here is a snapshot of my logs

 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
 ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
 ...
 totally 257 splits.

 The problem is the user might only want all the data via a select *
 like statement. It seems that 257 connections to query the rows are
 necessary. However, is there any way to prohibit 257 concurrent
 connections?

 My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
 same behavior.

 Thank you.

 --

 Regards,
 Shenghua (Daniel) Wan





 --

 Regards,
 Shenghua (Daniel) Wan



Re: cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
For clarification, please checkout the source code I got from C* v2.0.11

in AbstractColumnFamilyInputFormat  getSplits(JobContext context)
line 125 and 168

// cannonical ranges and nodes holding replicas
ListTokenRange masterRangeNodes = getRangeMap(conf);

 for (TokenRange range : masterRangeNodes)
{
if (jobRange == null)
{
// for each range, pick a live owner and ask it to
compute bite-sized splits
splitfutures.add(executor.submit(new
SplitCallable(range, conf)));
}

My understanding for this part of source code is for each token range, it
will create a connection to the server.


On Tue, Jan 27, 2015 at 11:21 PM, Huiliang Zhang zhl...@gmail.com wrote:

 In that case, each node will have 256/3 connections at most. Still 256
 mappers. Someone please correct me if I am wrong.

 On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan 
 wansheng...@gmail.com wrote:

 Hi, Huiliang,
 Great to hear from you, again!
 Image you have 3 nodes, replication factor=1, and using default number of
 tokens. You will have 3*256 mappers... In that case, you will be soon out
 of mappers or reach the limit.


 On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang zhl...@gmail.com
 wrote:

 Hi Shenghua, as I understand, each range is assigned to a mapper. Mapper
 will not share connections. So, it needs at least 256 connections to read
 all. But all 256 connections should not be set up at the same time unless
 you have 256 mappers running at the same time.

 On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan 
 wansheng...@gmail.com wrote:

 By default, each C* node is set with 256 tokens. On a local 1-node C*
 server, my hadoop drop creates 256 connections to the server. Is there any
 way to control this behavior? e.g. reduce the number of connections to a
 pre-configured gap.

 I debugged C* source code and found the client asks for partition
 ranges, or virtual nodes. Then the client was told by server there were 257
 ranges, corresponding to 257 column family splits.

 Here is a snapshot of my logs

 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
 ColumnFamilySplit((9121856086738887846, '-9223372036854775808] 
 @[localhost])
 ...
 totally 257 splits.

 The problem is the user might only want all the data via a select *
 like statement. It seems that 257 connections to query the rows are
 necessary. However, is there any way to prohibit 257 concurrent
 connections?

 My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which
 has same behavior.

 Thank you.

 --

 Regards,
 Shenghua (Daniel) Wan





 --

 Regards,
 Shenghua (Daniel) Wan





-- 

Regards,
Shenghua (Daniel) Wan


cqlinputformat and retired cqlpagingingputformat creates lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
By default, each C* node is set with 256 tokens. On a local 1-node C*
server, my hadoop drop creates 256 connections to the server. Is there any
way to control this behavior? e.g. reduce the number of connections to a
pre-configured gap.

I debugged C* source code and found the client asks for partition ranges,
or virtual nodes. Then the client was told by server there were 257 ranges,
corresponding to 257 column family splits.

Here is a snapshot of my logs

15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
...
totally 257 splits.

The problem is the user might only want all the data via a select * like
statement. It seems that 257 connections to query the rows are necessary.
However, is there any way to prohibit 257 concurrent connections?

My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
same behavior.

Thank you.

-- 

Regards,
Shenghua (Daniel) Wan