Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
For clarification, please check out the source code I got from C* v2.0.11,

in AbstractColumnFamilyInputFormat.getSplits(JobContext context), around
lines 125 and 168:

// canonical ranges and nodes holding replicas
List<TokenRange> masterRangeNodes = getRangeMap(conf);

for (TokenRange range : masterRangeNodes)
{
    if (jobRange == null)
    {
        // for each range, pick a live owner and ask it to compute bite-sized splits
        splitfutures.add(executor.submit(new SplitCallable(range, conf)));
    }
}

My understanding of this part of the source code is that, for each token range,
a connection to the server will be created.
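
For what it's worth, that fan-out could in principle be bounded; a sketch (not
the actual C* code; the pool size of 8 is illustrative), reusing the same
SplitCallable and TokenRange types from the excerpt above:

import java.util.*;
import java.util.concurrent.*;

// Cap the number of SplitCallables running (and holding connections) at once.
ExecutorService executor = Executors.newFixedThreadPool(8);
List<Future<List<InputSplit>>> splitfutures = new ArrayList<>();
for (TokenRange range : masterRangeNodes)
    splitfutures.add(executor.submit(new SplitCallable(range, conf)));
// At most 8 callables are active at any moment; the rest queue in the pool.
executor.shutdown();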


On Tue, Jan 27, 2015 at 11:21 PM, Huiliang Zhang  wrote:

> In that case, each node will have 256/3 connections at most. Still 256
> mappers. Someone please correct me if I am wrong.
>
> On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> Hi, Huiliang,
>> Great to hear from you again!
>> Imagine you have 3 nodes, replication factor=1, and the default number of
>> tokens. You will have 3*256 mappers... In that case, you will soon run out
>> of mappers or reach the limit.
>>
>>
>> On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang 
>> wrote:
>>
>>> Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
>>> do not share connections, so at least 256 connections are needed to read
>>> it all. But all 256 connections should not be set up at the same time unless
>>> you have 256 mappers running at the same time.
>>>
>>> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
>>> wansheng...@gmail.com> wrote:
>>>
 By default, each C* node is set with 256 tokens. On a local 1-node C*
 server, my hadoop job creates 256 connections to the server. Is there any
 way to control this behavior, e.g. to reduce the number of connections to a
 pre-configured cap?

 I debugged the C* source code and found that the client asks for the
 partition ranges, or virtual nodes. The server then reported 257 ranges,
 corresponding to 257 column family splits.

 Here is a snippet of my logs:

 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
 ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
 ...
 257 splits in total.

 The problem is that the user might only want all the data via a "select *"-like
 statement. It seems that 257 connections to query the rows are
 necessary. However, is there any way to prohibit 257 concurrent
 connections?

 My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
 the same behavior.

 Thank you.

 --

 Regards,
 Shenghua (Daniel) Wan

>>>
>>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
I mean that when the number of nodes grows, there are more virtual nodes in
total. For each vnode (or partition range), a connection will be created.
For 3 nodes with 256 tokens each and replication factor=1 for simplicity, there
will be 3*256 = 768 virtual nodes, and therefore that many connections. Let me
know if there is any incorrect reasoning here. Thanks.

On Tue, Jan 27, 2015 at 11:21 PM, Huiliang Zhang  wrote:

> In that case, each node will have 256/3 connections at most. Still 256
> mappers. Someone please correct me if I am wrong.
>
> On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> Hi, Huiliang,
>> Great to hear from you again!
>> Imagine you have 3 nodes, replication factor=1, and the default number of
>> tokens. You will have 3*256 mappers... In that case, you will soon run out
>> of mappers or reach the limit.
>>
>>
>> On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang 
>> wrote:
>>
>>> Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
>>> do not share connections, so at least 256 connections are needed to read
>>> it all. But all 256 connections should not be set up at the same time unless
>>> you have 256 mappers running at the same time.
>>>
>>> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
>>> wansheng...@gmail.com> wrote:
>>>
 By default, each C* node is set with 256 tokens. On a local 1-node C*
 server, my hadoop job creates 256 connections to the server. Is there any
 way to control this behavior, e.g. to reduce the number of connections to a
 pre-configured cap?

 I debugged the C* source code and found that the client asks for the
 partition ranges, or virtual nodes. The server then reported 257 ranges,
 corresponding to 257 column family splits.

 Here is a snippet of my logs:

 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
 ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
 ...
 257 splits in total.

 The problem is that the user might only want all the data via a "select *"-like
 statement. It seems that 257 connections to query the rows are
 necessary. However, is there any way to prohibit 257 concurrent
 connections?

 My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
 the same behavior.

 Thank you.

 --

 Regards,
 Shenghua (Daniel) Wan

>>>
>>>
>>
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-27 Thread Huiliang Zhang
In that case, each node will have 256/3 connections at most. Still 256
mappers. Someone please correct me if I am wrong.

On Tue, Jan 27, 2015 at 11:04 PM, Shenghua(Daniel) Wan <
wansheng...@gmail.com> wrote:

> Hi, Huiliang,
> Great to hear from you again!
> Imagine you have 3 nodes, replication factor=1, and the default number of
> tokens. You will have 3*256 mappers... In that case, you will soon run out
> of mappers or reach the limit.
>
>
> On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang  wrote:
>
>> Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
>> do not share connections, so at least 256 connections are needed to read
>> it all. But all 256 connections should not be set up at the same time unless
>> you have 256 mappers running at the same time.
>>
>> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
>> wansheng...@gmail.com> wrote:
>>
>>> By default, each C* node is set with 256 tokens. On a local 1-node C*
>>> server, my hadoop job creates 256 connections to the server. Is there any
>>> way to control this behavior, e.g. to reduce the number of connections to a
>>> pre-configured cap?
>>>
>>> I debugged the C* source code and found that the client asks for the
>>> partition ranges, or virtual nodes. The server then reported 257 ranges,
>>> corresponding to 257 column family splits.
>>>
>>> Here is a snippet of my logs:
>>>
>>> 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
>>> ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
>>> ...
>>> 257 splits in total.
>>>
>>> The problem is that the user might only want all the data via a "select *"-like
>>> statement. It seems that 257 connections to query the rows are
>>> necessary. However, is there any way to prohibit 257 concurrent
>>> connections?
>>>
>>> My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
>>> the same behavior.
>>>
>>> Thank you.
>>>
>>> --
>>>
>>> Regards,
>>> Shenghua (Daniel) Wan
>>>
>>
>>
>
>
> --
>
> Regards,
> Shenghua (Daniel) Wan
>


Re: full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
This is hard to answer; performance depends on the context.
You could tune various parameters.

At 2015-01-28 14:43:38, "Shenghua(Daniel) Wan"  wrote:

Cool. What about performance? E.g., how many records in how much time?


On Tue, Jan 27, 2015 at 10:16 PM, Xu Zhongxing  wrote:

For the Java driver, there is no special API actually; just:


ResultSet rs = session.execute("select * from ...");
for (Row r : rs) {
   ...
}


For Spark, the code skeleton is:


val rdd = sc.cassandraTable("ks", "table")


then call the various standard Spark APIs to process the table in parallel.


I have not used CqlInputFormat.


At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan"  wrote:
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions of
records from C* using MapReduce-related APIs like CqlInputFormat.
You mentioned the Java driver. Could you suggest which API you used? Thanks.


On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:


Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTables files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 






--



Regards,
Shenghua (Daniel) Wan





--



Regards,
Shenghua (Daniel) Wan

FW: How to use cqlsh to access Cassandra DB if the client_encryption_options is enabled

2015-01-27 Thread Lu, Boying
Hi, All,

Does anyone know the answer?

Thanks a lot

Boying


From: Lu, Boying
Sent: January 6, 2015 11:21
To: user@cassandra.apache.org
Subject: How to use cqlsh to access Cassandra DB if the 
client_encryption_options is enabled

Hi, All,

I turned on the client_encryption_options like this:
client_encryption_options:
enabled: true
keystore:  path-to-my-keystore-file
keystore_password:  my-keystore-password
truststore: path-to-my-truststore-file
truststore_password:  my-truststore-password
…

I can use the following cassandra-cli command to access the DB:
cassandra-cli -ts path-to-my-truststore-file -tspw my-truststore-password
-tf org.apache.cassandra.thrift.SSLTransportFactory

But when I tried to access the DB with cqlsh like this:
SSL_CERTFILE=path-to-my-truststore cqlsh -t cqlshlib.ssl.ssl_transport_factory

I got the following error:
Connection error: Could not connect to localhost:9160: [Errno 0] _ssl.c:332: 
error::lib(0):func(0):reason(0)

I guess the reason may be that I didn't provide the truststore password, but
cqlsh doesn't provide such an option.

Does anyone know how to resolve this issue?
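
(One possible avenue, sketched with placeholders: newer cqlsh builds read SSL
settings from a cqlshrc file and expect a PEM certificate rather than the JKS
truststore itself. Whether your cqlsh version supports the [ssl] section is an
assumption to verify.)

; ~/.cassandra/cqlshrc
[connection]
factory = cqlshlib.ssl.ssl_transport_factory

[ssl]
certfile = /path/to/cassandra.pem   ; PEM export of the node certificate
validate = true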

Thanks

Boying



Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
Hi, Huiliang,
Great to hear from you again!
Imagine you have 3 nodes, replication factor=1, and the default number of
tokens. You will have 3*256 mappers... In that case, you will soon run out
of mappers or reach the limit.


On Tue, Jan 27, 2015 at 10:59 PM, Huiliang Zhang  wrote:

> Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
> do not share connections, so at least 256 connections are needed to read
> it all. But all 256 connections should not be set up at the same time unless
> you have 256 mappers running at the same time.
>
> On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan <
> wansheng...@gmail.com> wrote:
>
>> By default, each C* node is set with 256 tokens. On a local 1-node C*
>> server, my hadoop job creates 256 connections to the server. Is there any
>> way to control this behavior, e.g. to reduce the number of connections to a
>> pre-configured cap?
>>
>> I debugged the C* source code and found that the client asks for the
>> partition ranges, or virtual nodes. The server then reported 257 ranges,
>> corresponding to 257 column family splits.
>>
>> Here is a snippet of my logs:
>>
>> 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
>> ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
>> ...
>> 257 splits in total.
>>
>> The problem is that the user might only want all the data via a "select *"-like
>> statement. It seems that 257 connections to query the rows are
>> necessary. However, is there any way to prohibit 257 concurrent
>> connections?
>>
>> My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
>> the same behavior.
>>
>> Thank you.
>>
>> --
>>
>> Regards,
>> Shenghua (Daniel) Wan
>>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-27 Thread Huiliang Zhang
Hi Shenghua, as I understand it, each range is assigned to a mapper. Mappers
do not share connections, so at least 256 connections are needed to read
it all. But all 256 connections should not be set up at the same time unless
you have 256 mappers running at the same time.

On Tue, Jan 27, 2015 at 9:34 PM, Shenghua(Daniel) Wan  wrote:

> By default, each C* node is set with 256 tokens. On a local 1-node C*
> server, my hadoop job creates 256 connections to the server. Is there any
> way to control this behavior, e.g. to reduce the number of connections to a
> pre-configured cap?
>
> I debugged the C* source code and found that the client asks for the
> partition ranges, or virtual nodes. The server then reported 257 ranges,
> corresponding to 257 column family splits.
>
> Here is a snippet of my logs:
>
> 15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
> ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
> ...
> 257 splits in total.
>
> The problem is that the user might only want all the data via a "select *"-like
> statement. It seems that 257 connections to query the rows are necessary.
> However, is there any way to prohibit 257 concurrent connections?
>
> My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
> the same behavior.
>
> Thank you.
>
> --
>
> Regards,
> Shenghua (Daniel) Wan
>


Re: Re: full-table scan - extracting all data from C*

2015-01-27 Thread Shenghua(Daniel) Wan
Cool. What about performance? E.g., how many records in how much time?

On Tue, Jan 27, 2015 at 10:16 PM, Xu Zhongxing 
wrote:

> For the Java driver, there is no special API actually; just:
>
> ResultSet rs = session.execute("select * from ...");
> for (Row r : rs) {
>...
> }
>
> For Spark, the code skeleton is:
>
> val rdd = sc.cassandraTable("ks", "table")
>
> then call the various standard Spark APIs to process the table in parallel.
>
> I have not used CqlInputFormat.
>
> At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan" 
> wrote:
>
> Hi, Zhongxing,
> I am also interested in your table size. I am trying to dump tens of millions
> of records from C* using MapReduce-related APIs like CqlInputFormat.
> You mentioned the Java driver. Could you suggest which API you used?
> Thanks.
>
> On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing 
> wrote:
>
>> Both Java driver "select * from table" and Spark sc.cassandraTable() work
>> well.
>> I use both of them frequently.
>>
>> At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:
>>
>>  Hi –
>>
>>
>>
>> Over the last few weeks, I have seen several emails on this mailing list
>> from people trying to extract all data from C*, so that they can import
>> that data into other analytical tools that provide much richer analytics
>> functionality than C*. Extracting all data from C* is a full-table scan,
>> which is not the ideal use case for C*. However, people don’t have much
>> choice if they want to do ad-hoc analytics on the data in C*.
>> Unfortunately, I don’t think C* comes with any built-in tools that make
>> this task easy for a large dataset. Please correct me if I am wrong. Cqlsh
>> has a COPY TO command, but it doesn’t really work if you have a large
>> amount of data in C*.
>>
>>
>>
>> I am aware of a couple of approaches for extracting all data from a table
>> in C*:
>>
>> 1)  Iterate through all the C* partitions (physical rows) using the
>> Java Driver and CQL.
>>
>> 2)  Extract the data directly from SSTables files.
>>
>>
>>
>> Either approach can be used with Hadoop or Spark to speed up the
>> extraction process.
>>
>>
>>
>> I wanted to do a quick survey and find out how many people on this
>> mailing list have successfully used approach #1 or #2 for extracting large
>> datasets (terabytes) from C*. Also, if you have used some other techniques,
>> it would be great if you could share your approach with the group.
>>
>>
>>
>> Mohammed
>>
>>
>>
>>
>
>
> --
>
> Regards,
> Shenghua (Daniel) Wan
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re:Re: full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
For the Java driver, there is no special API actually; just:


ResultSet rs = session.execute("select * from ...");
for (Row r : rs) {
   ...
}
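
If the table is large, the page size is the main knob worth knowing; a sketch
with the DataStax Java driver 2.x (the value 1000 is illustrative; the driver
default fetch size is 5000):

import com.datastax.driver.core.*;

// session is an open Session, as in the snippet above
Statement stmt = new SimpleStatement("select * from ks.mytable");
stmt.setFetchSize(1000);               // rows fetched per page
ResultSet rs = session.execute(stmt);
for (Row r : rs) {
    // the driver fetches subsequent pages transparently while iterating
}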


For Spark, the code skeleton is:


val rdd = sc.cassandraTable("ks", "table")


then call the various standard Spark APIs to process the table in parallel.


I have not used CqlInputFormat.


At 2015-01-28 13:38:20, "Shenghua(Daniel) Wan"  wrote:
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions of
records from C* using MapReduce-related APIs like CqlInputFormat.
You mentioned the Java driver. Could you suggest which API you used? Thanks.


On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:


Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTables files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 






--



Regards,
Shenghua (Daniel) Wan

Re:full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
The table has several billion rows.
I think the table size is irrelevant here. The Cassandra driver does paging well,
and Spark handles data partitioning well, too.


At 2015-01-28 10:45:17, "Mohammed Guller"  wrote:


How big is your table? How much data does it have?

 

Mohammed

 

From: Xu Zhongxing [mailto:xu_zhong_x...@163.com]
Sent: Tuesday, January 27, 2015 5:34 PM
To: user@cassandra.apache.org
Subject: Re:full-table scan - extracting all data from C*

 

Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 

I use both of them frequently.


At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:



Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTables files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 

Re: full-table scan - extracting all data from C*

2015-01-27 Thread Shenghua(Daniel) Wan
Recently I surveyed this topic and you may want to take a look at
https://github.com/fullcontact/hadoop-sstable
and
https://github.com/Netflix/aegisthus


On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

> Both Java driver "select * from table" and Spark sc.cassandraTable() work
> well.
> I use both of them frequently.
>
> At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:
>
>  Hi –
>
>
>
> Over the last few weeks, I have seen several emails on this mailing list
> from people trying to extract all data from C*, so that they can import
> that data into other analytical tools that provide much richer analytics
> functionality than C*. Extracting all data from C* is a full-table scan,
> which is not the ideal use case for C*. However, people don’t have much
> choice if they want to do ad-hoc analytics on the data in C*.
> Unfortunately, I don’t think C* comes with any built-in tools that make
> this task easy for a large dataset. Please correct me if I am wrong. Cqlsh
> has a COPY TO command, but it doesn’t really work if you have a large
> amount of data in C*.
>
>
>
> I am aware of a couple of approaches for extracting all data from a table
> in C*:
>
> 1)  Iterate through all the C* partitions (physical rows) using the
> Java Driver and CQL.
>
> 2)  Extract the data directly from SSTables files.
>
>
>
> Either approach can be used with Hadoop or Spark to speed up the
> extraction process.
>
>
>
> I wanted to do a quick survey and find out how many people on this mailing
> list have successfully used approach #1 or #2 for extracting large datasets
> (terabytes) from C*. Also, if you have used some other techniques, it would
> be great if you could share your approach with the group.
>
>
>
> Mohammed
>
>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


Re: full-table scan - extracting all data from C*

2015-01-27 Thread Shenghua(Daniel) Wan
Hi, Zhongxing,
I am also interested in your table size. I am trying to dump tens of millions of
records from C* using MapReduce-related APIs like CqlInputFormat.
You mentioned the Java driver. Could you suggest which API you used? Thanks.

On Tue, Jan 27, 2015 at 5:33 PM, Xu Zhongxing  wrote:

> Both Java driver "select * from table" and Spark sc.cassandraTable() work
> well.
> I use both of them frequently.
>
> At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:
>
>  Hi –
>
>
>
> Over the last few weeks, I have seen several emails on this mailing list
> from people trying to extract all data from C*, so that they can import
> that data into other analytical tools that provide much richer analytics
> functionality than C*. Extracting all data from C* is a full-table scan,
> which is not the ideal use case for C*. However, people don’t have much
> choice if they want to do ad-hoc analytics on the data in C*.
> Unfortunately, I don’t think C* comes with any built-in tools that make
> this task easy for a large dataset. Please correct me if I am wrong. Cqlsh
> has a COPY TO command, but it doesn’t really work if you have a large
> amount of data in C*.
>
>
>
> I am aware of a couple of approaches for extracting all data from a table
> in C*:
>
> 1)  Iterate through all the C* partitions (physical rows) using the
> Java Driver and CQL.
>
> 2)  Extract the data directly from SSTables files.
>
>
>
> Either approach can be used with Hadoop or Spark to speed up the
> extraction process.
>
>
>
> I wanted to do a quick survey and find out how many people on this mailing
> list have successfully used approach #1 or #2 for extracting large datasets
> (terabytes) from C*. Also, if you have used some other techniques, it would
> be great if you could share your approach with the group.
>
>
>
> Mohammed
>
>
>
>


-- 

Regards,
Shenghua (Daniel) Wan


cqlinputformat and retired cqlpaginginputformat create lots of connections to query the server

2015-01-27 Thread Shenghua(Daniel) Wan
By default, each C* node is set with 256 tokens. On a local 1-node C*
server, my hadoop job creates 256 connections to the server. Is there any
way to control this behavior, e.g. to reduce the number of connections to a
pre-configured cap?

I debugged the C* source code and found that the client asks for the
partition ranges, or virtual nodes. The server then reported 257 ranges,
corresponding to 257 column family splits.

Here is a snippet of my logs:

15/01/27 18:02:20 DEBUG hadoop.AbstractColumnFamilyInputFormat: adding
ColumnFamilySplit((9121856086738887846, '-9223372036854775808] @[localhost])
...
257 splits in total.

The problem is that the user might only want all the data via a "select *"-like
statement. It seems that 257 connections to query the rows are necessary.
However, is there any way to prohibit 257 concurrent connections?
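
(For reference, the knobs that do exist on the Hadoop side look like the sketch
below, using org.apache.cassandra.hadoop.ConfigHelper. Note that
setInputSplitSize caps rows per split; in 2.0 the split count still tracks the
vnode count, so it does not merge the 257 ranges. Keyspace and column family
names are illustrative.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;

Job job = new Job(new Configuration());
job.setInputFormatClass(CqlInputFormat.class);
ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), "ks", "cf");
// Caps rows per split; with vnodes it does not reduce the number of splits.
ConfigHelper.setInputSplitSize(job.getConfiguration(), 64 * 1024);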

My C* version is 2.0.11 and I also tried CqlPagingInputFormat, which has
the same behavior.

Thank you.

-- 

Regards,
Shenghua (Daniel) Wan


RE: Re:full-table scan - extracting all data from C*

2015-01-27 Thread Mohammed Guller
How big is your table? How much data does it have?

Mohammed

From: Xu Zhongxing [mailto:xu_zhong_x...@163.com]
Sent: Tuesday, January 27, 2015 5:34 PM
To: user@cassandra.apache.org
Subject: Re:full-table scan - extracting all data from C*

Both Java driver "select * from table" and Spark sc.cassandraTable() work well.
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller" <moham...@glassbeam.com> wrote:

Hi -

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don't have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don't think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn't really 
work if you have a large amount of data in C*.

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTables files.

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

Mohammed



Re:full-table scan - extracting all data from C*

2015-01-27 Thread Xu Zhongxing
Both Java driver "select * from table" and Spark sc.cassandraTable() work well. 
I use both of them frequently.

At 2015-01-28 04:06:20, "Mohammed Guller"  wrote:


Hi –

 

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don’t have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don’t think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn’t really 
work if you have a large amount of data in C*.

 

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTables files.

 

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.

 

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

 

Mohammed

 

full-table scan - extracting all data from C*

2015-01-27 Thread Mohammed Guller
Hi -

Over the last few weeks, I have seen several emails on this mailing list from 
people trying to extract all data from C*, so that they can import that data 
into other analytical tools that provide much richer analytics functionality 
than C*. Extracting all data from C* is a full-table scan, which is not the 
ideal use case for C*. However, people don't have much choice if they want to 
do ad-hoc analytics on the data in C*. Unfortunately, I don't think C* comes 
with any built-in tools that make this task easy for a large dataset. Please 
correct me if I am wrong. Cqlsh has a COPY TO command, but it doesn't really 
work if you have a large amount of data in C*.

I am aware of a couple of approaches for extracting all data from a table in C*:

1)  Iterate through all the C* partitions (physical rows) using the Java 
Driver and CQL.

2)  Extract the data directly from SSTables files.

Either approach can be used with Hadoop or Spark to speed up the extraction 
process.
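
(A sketch of approach #1 with the DataStax Java driver 2.x, scanning by token
range so the work can be split across workers; the keyspace, table, and key
names are illustrative:)

import com.datastax.driver.core.*;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();
// A real job would split (Long.MIN_VALUE, Long.MAX_VALUE] into N sub-ranges
// (Murmur3Partitioner tokens are signed 64-bit) and hand one to each worker.
PreparedStatement ps = session.prepare(
    "SELECT * FROM ks.mytable WHERE token(pk) > ? AND token(pk) <= ?");
ResultSet rs = session.execute(ps.bind(Long.MIN_VALUE, Long.MAX_VALUE));
for (Row row : rs) {
    // export the row
}
cluster.close();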

I wanted to do a quick survey and find out how many people on this mailing list 
have successfully used approach #1 or #2 for extracting large datasets 
(terabytes) from C*. Also, if you have used some other techniques, it would be 
great if you could share your approach with the group.

Mohammed



Re: Controlling the MAX SIZE of sstables after compaction

2015-01-27 Thread Mikhail Strebkov
It is open sourced but works only with C* 1.x as far as I know.

Mikhail

On Tuesday, January 27, 2015, Mohammed Guller 
wrote:

>  I believe Aegisthus is open sourced.
>
>
>
> Mohammed
>
>
>
> *From:* Jan [mailto:cne...@yahoo.com]
> *Sent:* Monday, January 26, 2015 11:20 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Controlling the MAX SIZE of sstables after compaction
>
>
>
> Parth  et al;
>
>
>
> the folks at Netflix seem to have built a solution for your problem.
>
> The Netflix Tech Blog: Aegisthus - A Bulk Data Pipeline out of Cassandra
>
> By Charles Smith and Jeff Magnusson (techblog.netflix.com)
>
> You may want to chase Jeff Magnusson and check whether the solution is open sourced.
>
>
> Please report back to this forum if you get an answer to the problem.
>
>
>
> hope this helps.
>
> Jan
>
>
>
> C* Architect
>
>
>
> On Monday, January 26, 2015 11:25 AM, Robert Coli  > wrote:
>
>
>
> On Sun, Jan 25, 2015 at 10:40 PM, Parth Setya  > wrote:
>
> 1. Is there a way to configure the size of sstables created after
> compaction?
>
>
>
> No, won't-fix: https://issues.apache.org/jira/browse/CASSANDRA-4897.
>
>
>
> You could use the "sstablesplit" utility on your One Big SSTable to split
> it into files of your preferred size.
>
>
>
>  2. Is there a better approach to generate the report?
>
>
>
> The major compaction isn't too bad, but something that understands
> SSTables as an input format would be preferable to sstable2json.
>
>
>
>  3. What are the flaws with this approach?
>
>
>
> sstable2json is slow and transforms your data to JSON.
>
>
>
> =Rob
>
>
>


RE: Controlling the MAX SIZE of sstables after compaction

2015-01-27 Thread Mohammed Guller
I believe Aegisthus is open sourced.

Mohammed

From: Jan [mailto:cne...@yahoo.com]
Sent: Monday, January 26, 2015 11:20 AM
To: user@cassandra.apache.org
Subject: Re: Controlling the MAX SIZE of sstables after compaction

Parth  et al;

the folks at Netflix seem to have built a solution for your problem.
The Netflix Tech Blog: Aegisthus - A Bulk Data Pipeline out of Cassandra
By Charles Smith and Jeff Magnusson (techblog.netflix.com)

You may want to chase Jeff Magnusson and check whether the solution is open sourced.
Please report back to this forum if you get an answer to the problem.

hope this helps.
Jan

C* Architect

On Monday, January 26, 2015 11:25 AM, Robert Coli <rc...@eventbrite.com> wrote:

On Sun, Jan 25, 2015 at 10:40 PM, Parth Setya <setya.pa...@gmail.com> wrote:
1. Is there a way to configure the size of sstables created after compaction?

No, won't-fix: https://issues.apache.org/jira/browse/CASSANDRA-4897.

You could use the "sstablesplit" utility on your One Big SSTable to split it 
into files of your preferred size.
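
For example (a sketch; sstablesplit must be run with the node stopped, and the
path and the 50 MB target size are illustrative):

sstablesplit --size 50 /var/lib/cassandra/data/ks/cf/ks-cf-jb-1-Data.db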

2. Is there a better approach to generate the report?

The major compaction isn't too bad, but something that understands SSTables as 
an input format would be preferable to sstable2json.

3. What are the flaws with this approach?

sstable2json is slow and transforms your data to JSON.

=Rob



Re: Using Cassandra for geospatial search

2015-01-27 Thread Alexandre Dutra
Hello,

The following session, recorded during the Cassandra Europe Summit 2014,
might also be of interest to you:

http://youtu.be/RQnw-tfVXb4

--
Alexandre Dutra

On Mon, Jan 26, 2015 at 11:07 PM, Jabbar Azam  wrote:

> There is also a YouTube video http://youtu.be/rqEylNsw2Ns explaining the
> implementation of geohashes in Cassandra.
>
> On Mon, 26 Jan 2015 21:34 DuyHai Doan  wrote:
>
>> Nice slides, the key idea is the
>> http://en.wikipedia.org/wiki/Z-order_curve
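
For the curious, the Z-order interleaving at the heart of that approach is only
a few lines of bit twiddling; a sketch in Java, assuming latitude/longitude
already quantized to 16-bit cell indices:

// Interleave two 16-bit cell indices into one 32-bit Z-value (Morton code).
static int zOrder(int x, int y) {          // x, y in [0, 65535]
    int z = 0;
    for (int i = 0; i < 16; i++) {
        z |= (x & (1 << i)) << i;          // bit i of x -> bit 2i
        z |= (y & (1 << i)) << (i + 1);    // bit i of y -> bit 2i+1
    }
    return z;
}

Nearby cells share Z-value prefixes, which is what makes the value usable as a
range-queryable key in Cassandra.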
>>
>> On Mon, Jan 26, 2015 at 9:28 PM, Jabbar Azam  wrote:
>>
>>> Hello,
>>>
>>> You'll find this useful
>>> http://www.slideshare.net/mobile/mmalone/working-with-dimensional-data-in-distributed-hash-tables
>>>
>>> It's how SimpleGeo used geohashing and Cassandra for geolocation.
>>>
>>> On Mon, 26 Jan 2015 15:48 SEGALIS Morgan  wrote:
>>>
 Hi everyone,

 I wanted to know if someone has feedback on using the geohash algorithm
 with Cassandra?

 I will have to create a "nearby" functionality soon, and I really
 would like to do it with Cassandra for its scalability; otherwise the
 smart choice would apparently be MongoDB.

 Can Cassandra be used to do geospatial search (with some kind of
 radius) while being fast and scalable?

 Thanks.

 --
 Morgan SEGALIS

>>>
>>


Re: Upgrade 2.0.11 --> Long "INDEX LOAD TIME"

2015-01-27 Thread Alain RODRIGUEZ
FTR, it looks like "nodetool upgradesstables -a" addresses this issue.

It is still good to know that before running this command, any restart will
hang for a long time.

Hoping this will help someone, someday :).

C*heers !

2015-01-26 11:29 GMT+01:00 Alain RODRIGUEZ :

> Hi guys,
>
> We migrated a cluster to 2.0.11 (from 1.2.18), in a rolling-upgrade fashion.
>
> Now any time I restart a node, it needs about 30 minutes to start (350 GB
> average). I used the debug level and saw that we have a lot of "INDEX LOAD
> TIME" entries lasting 200+ secs. It reminds me of when I switched
> index_interval to a new value and Cassandra had to rebuild indexes.
>
> Yet I did not change anything, and from what I read, index_interval per
> table (2.0) is using the index_interval defined in cassandra.yaml (1.2).
> Plus, it should have rebuilt indexes only the first time I restarted, but I
> experience this any time I restart a node (any node, after multiple restarts).
>
> Did any of you face this issue? If not, any clue on what might be
> happening?
>
> I am trying to upgrade SSTables to see if it helps somehow.
>
> C*heers,
>
> Alain
>