Distinct Counter Proposal for Cassandra
Hi All, Let's assume we have a use case where we need to count the number of columns for a given key. Say the key is a URL and the column name is an IP address or any other cardinality identifier. The straightforward implementation seems simple: just insert the IP addresses as columns under the key defined by the URL and use get_count to count them back. The problem is that for large rows (with too many IP addresses in them), get_count has to deserialize the whole row to calculate the count. As the user guides also note, it's not an O(1) operation and it's quite costly. However, this problem has better solutions if you don't have a strict requirement for the count to be exact. There are streaming algorithms that provide good cardinality estimates within a predefined error rate; the most popular one seems to be the (Hyper)LogLog algorithm, and an optimal one was developed recently, please check http://dl.acm.org/citation.cfm?doid=1807085.1807094 If you want to take a look at a Java implementation of LogLog, Clearspring has both LogLog and a space-optimized HyperLogLog available at https://github.com/clearspring/stream-lib I don't see a reason why this can't be implemented in Cassandra. The distributed nature of all these algorithms can easily be adapted to Cassandra's model. I think most of us would love to see some cardinality-estimating columns in Cassandra. Regards, Utku
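For illustration, the idea can be sketched in a few dozen lines. This is a toy HyperLogLog, not Cassandra's or Clearspring's implementation; the register count and bias constant are chosen for brevity:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: m = 2**p registers, each storing the maximum
    'rank' (leading-zero count + 1) seen among keys hashing to it."""

    def __init__(self, p=6):
        self.p = p
        self.m = 1 << p                     # 64 registers
        self.registers = [0] * self.m
        self.alpha = 0.709                  # bias-correction constant for m = 64

    def add(self, item):
        # 64-bit hash; md5 is just a convenient deterministic stand-in here
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h & (self.m - 1)              # low p bits choose the register
        w = h >> self.p                     # remaining bits determine the rank
        rank = (64 - self.p) - w.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        e = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:     # small-range (linear counting) correction
            e = self.m * math.log(self.m / zeros)
        return e
```

Note that re-adding a duplicate never changes a register (same hash, same rank), which is exactly why the sketch counts distinct values, and merging two sketches is just a register-wise max, which is what makes the approach friendly to a distributed model like Cassandra's.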
Re: Distinct Counter Proposal for Cassandra
Hi Yuki, I think I should have used the word discussion instead of proposal in the subject. I have a rough design in mind, but I don't think it's ripe enough to formalize yet. I'll try to simplify it and open a JIRA ticket. But first I'm wondering whether there would be any excitement in the community for such a feature. Regards, Utku On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita mor.y...@gmail.com wrote: You can open a JIRA ticket at https://issues.apache.org/jira/browse/CASSANDRA with your proposal. Just for the input: I had once implemented a HyperLogLog counter for internal use in Cassandra, but it turned out I didn't need it, so I just put it in a gist. You can find it here: https://gist.github.com/2597943 The above implementation and most of the others (including stream-lib) implement the optimized version of the algorithm, which counts up to 10^9, so it may need some work. Another alternative is the self-learning bitmap (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values. Yuki
Re: last record rowId
As far as I can tell, this functionality doesn't exist. However, you could insert the rowId as a column within a separate row and request the latest column; I think that would work for you. Note that every insert would then need a get request first, which I suspect would be a performance issue. Regards, Utku On Wed, Jun 15, 2011 at 11:14 AM, karim abbouh karim_...@yahoo.fr wrote: In my Java application, when we insert we always need to know the last rowId in order to insert the new record at rowId+1, so we have to save this rowId in a file. Is there another way to know the last record's rowId? thanks B.R
Re: Corrupted Counter Columns
Hello, Actually I did not have the patience to investigate further; I had to drop the CF and start from scratch. Even though there were no writes to those particular columns, while reading at CL.ONE there was a 50% chance that: - The query returned the correct value (51664) - The query returned a nonsense value (18651001) (I say this is nonsense because there were no more than 52K increment requests, and all increments are +1 increments.) After starting from scratch, I'm writing with CL.ONE and reading with CL.QUORUM. Things seem to work fine. On Fri, May 27, 2011 at 1:59 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, May 26, 2011 at 2:21 PM, Utku Can Topçu u...@topcu.gen.tr wrote: Hello, I'm using 0.8.0-rc1, with RF=2 and 4 nodes. Strangely, counters are corrupted. Say the actual value should be 51664; the value that cassandra sometimes outputs is either 51664 or 18651001. What does sometimes mean in that context? Is it that some queries return the former and some others the latter? Does the returned value keep alternating despite no writes coming in, or does it at least stabilize to one of those values? Could you give more details on how this manifests itself. Does it depend on which node you connect to for the request, for instance, and does querying at QUORUM solve it? And I have no idea how to diagnose the problem or reproduce it. Can you help me fix this issue? Regards, Utku
Re: expiring + counter column?
How about implementing a freezing mechanism on counter columns? If there are no more increments within freeze seconds after the last increment (it would be on the order of a day or so), the column would lock itself and stop accepting increments. After this freeze period, the TTL should work fine: the column would be gone forever after freeze + ttl seconds. On Sat, May 28, 2011 at 2:57 AM, Jonathan Ellis jbel...@gmail.com wrote: No. See comments to https://issues.apache.org/jira/browse/CASSANDRA-2103 On Fri, May 27, 2011 at 7:29 PM, Yang tedd...@gmail.com wrote: is this combination feature available, or on track? thanks Yang -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
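The freeze-then-expire idea above can be sketched with explicit timestamps. This is a hypothetical model of the proposed semantics, not Cassandra code; `freeze` and `ttl` are the two windows described in the mail:

```python
class FreezingCounter:
    """Counter that stops accepting increments `freeze` seconds after
    the last successful increment, and expires `ttl` seconds after that."""

    def __init__(self, freeze, ttl):
        self.freeze = freeze
        self.ttl = ttl
        self.value = 0
        self.last_increment = None

    def increment(self, now, delta=1):
        # once the freeze window has passed, the counter is locked
        if self.last_increment is not None and now > self.last_increment + self.freeze:
            return False
        self.value += delta
        self.last_increment = now
        return True

    def is_expired(self, now):
        # the column is gone forever after freeze + ttl seconds of inactivity
        if self.last_increment is None:
            return False
        return now > self.last_increment + self.freeze + self.ttl
```

The point of the freeze window is that once the counter is locked, its TTL clock can no longer be reset by a late increment, sidestepping the TTL-vs-increment interaction discussed in CASSANDRA-2103.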
Corrupted Counter Columns
Hello, I'm using 0.8.0-rc1, with RF=2 and 4 nodes. Strangely, counters are corrupted. Say the actual value should be 51664; the value that cassandra sometimes outputs is either 51664 or 18651001. And I have no idea how to diagnose the problem or reproduce it. Can you help me fix this issue? Regards, Utku
Re: Corrupted Counter Columns
Some additional information on the settings: I'm using CL.ONE for both reading and writing, and replicate_on_write is true on the Counters CF. I think the problem occurs after a restart, when the commitlogs are read. On Thu, May 26, 2011 at 2:21 PM, Utku Can Topçu u...@topcu.gen.tr wrote: Hello, I'm using 0.8.0-rc1, with RF=2 and 4 nodes. Strangely, counters are corrupted. Say the actual value should be 51664; the value that cassandra sometimes outputs is either 51664 or 18651001. And I have no idea how to diagnose the problem or reproduce it. Can you help me fix this issue? Regards, Utku
CounterColumn increments gone after restart
Hi guys, I have a strange problem with 0.8.0-rc1. I'm not quite sure if this is the way it should be, but: - I create a ColumnFamily named Counters - do a few increments on a column - kill cassandra - start cassandra When I look at the counter column, the value is 1. See the following pastebin please: http://pastebin.com/9jYdDiRY
Re: CounterColumn increments gone after restart
see the ticket https://issues.apache.org/jira/browse/CASSANDRA-2642 please On Thu, May 12, 2011 at 3:28 PM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi guys, I have a strange problem with 0.8.0-rc1. I'm not quite sure if this is the way it should be, but: - I create a ColumnFamily named Counters - do a few increments on a column - kill cassandra - start cassandra When I look at the counter column, the value is 1. See the following pastebin please: http://pastebin.com/9jYdDiRY
Do counter columns support TTL
Hi All, I'm experimenting and developing using counters. However, I've come to a use case where I need counters to expire and get deleted after a certain time of inactivity (i.e. have the counter column deleted one hour after the last increment). As far as I can tell, counter columns don't have TTL in the thrift interface; is this because of a limitation of the counter implementation? Regards, Utku
Re: Commercial support for cassandra
http://wiki.apache.org/cassandra/ThirdPartySupport On Thu, Feb 17, 2011 at 12:20 AM, Sal Fuentes fuente...@gmail.com wrote: They also offer great training sessions. Have a look at their site for more information: http://www.datastax.com/about-us On Wed, Feb 16, 2011 at 3:13 PM, Michael Widmann michael.widm...@gmail.com wrote: riptano - contact matt pfeil mike 2011/2/17 A J s5a...@gmail.com By any chance are there companies that provide support for Cassandra ? Consult on setup and configuration and annual support packages ? -- bayoda.com - Professional Online Backup Solutions for Small and Medium Sized Companies -- Salvador Fuentes Jr.
Re: Do counter columns support TTL
Can anyone confirm that this patch works with the current trunk? On Thu, Feb 17, 2011 at 4:16 PM, Sylvain Lebresne sylv...@datastax.com wrote: https://issues.apache.org/jira/browse/CASSANDRA-2103 On Thu, Feb 17, 2011 at 4:05 PM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi All, I'm experimenting and developing using counters. However, I've come to a use case where I need counters to expire and get deleted after a certain time of inactivity (i.e. have the counter column deleted one hour after the last increment). As far as I can tell, counter columns don't have TTL in the thrift interface; is this because of a limitation of the counter implementation? Regards, Utku
Re: Do counter columns support TTL
And I think this patch would still be useful and legitimate if the TTL of the initial increment is taken into account. On Thu, Feb 17, 2011 at 6:11 PM, Utku Can Topçu u...@topcu.gen.tr wrote: Yes, I've read the discussion. My use case is similar to the contributor's use case; that's the reason I asked whether it works or not (with the flaw, of course). On Thu, Feb 17, 2011 at 5:41 PM, Jonathan Ellis jbel...@gmail.com wrote: If you read the discussion on that ticket, the point is that the approach is fundamentally flawed. On Thu, Feb 17, 2011 at 10:16 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Can anyone confirm that this patch works with the current trunk? On Thu, Feb 17, 2011 at 4:16 PM, Sylvain Lebresne sylv...@datastax.com wrote: https://issues.apache.org/jira/browse/CASSANDRA-2103 On Thu, Feb 17, 2011 at 4:05 PM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi All, I'm experimenting and developing using counters. However, I've come to a use case where I need counters to expire and get deleted after a certain time of inactivity (i.e. have the counter column deleted one hour after the last increment). As far as I can tell, counter columns don't have TTL in the thrift interface; is this because of a limitation of the counter implementation? Regards, Utku -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Implementing a LRU in Cassandra
Dear Aaron, Thank you for your suggestion, I'll be evaluating it. Since all my other use cases are implemented in Cassandra, I had the question in my mind of whether it was possible to implement the sorted set in Cassandra :) The problem here is that within a few hours I might be resolving more than 2M pages. It seems using redis would also cause a problem on deletion, whereas in cassandra I could rely on the expiration of the columns. It looks like the sorted set won't support partitioning, and thus won't be scalable at the end of the day. Regards, Utku On Thu, Feb 10, 2011 at 9:54 AM, aaron morton aa...@thelastpickle.com wrote: FWIW and depending on the size of the data, I would consider using sorted sets in redis http://redis.io/commands#sorted_set Where the member is the page url and the weight is the timestamp, use ZRANGE to get back the top 1,000 entries in the set. Would that work for you? Aaron On 9 Feb 2011, at 23:58, Utku Can Topçu wrote: Hi All, I'm sure people here have tried to solve similar questions. Say I'm tracking pages, and I want to access the least recently used 1,000 unique pages (i.e. column names). How can I achieve this? Using a row with, say, ttl=60 seconds would solve the problem of accessing the least recently used unique pages in the last minute. Thanks for any comments and help. Regards, Utku
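Aaron's sorted-set suggestion can be emulated in plain code to see the access pattern. This is an illustrative stand-in for redis ZADD/ZRANGE semantics, not the redis client API:

```python
class SortedSet:
    """Minimal stand-in for a redis sorted set: one score per member,
    range queries returned in score order."""

    def __init__(self):
        self.scores = {}

    def zadd(self, member, score):
        # re-adding an existing member just updates its score,
        # which is what makes the LRU trick work
        self.scores[member] = score

    def zrange(self, start, stop):
        ordered = sorted(self.scores, key=self.scores.get)
        return ordered[start:stop + 1]      # inclusive stop, like redis
```

With page URLs as members and last-seen timestamps as scores, `zrange(0, 999)` returns the 1,000 least recently seen pages, matching the query in the original mail.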
Re: Super Slow Multi-gets
Dear Bill, How about the size of the rows in the Messages CF? Are they too big? Could bandwidth overhead be a factor? Regards, Utku On Thu, Feb 10, 2011 at 5:00 PM, Bill Speirs bill.spe...@gmail.com wrote: I have a 7 node setup with a replication factor of 1 and a read consistency of 1. I have two column families: Messages, which stores millions of rows with a UUID for the row key, and DateIndex, which stores thousands of rows with a String as the row key. I perform 2 look-ups for my queries: 1) Fetch the row from DateIndex that includes the date I'm looking for. This returns 1,000 columns where the column names are the UUIDs of the messages. 2) Do a multi-get (Hector client) using those 1,000 row keys I got from the first query. Query 1 is taking ~300ms to fetch 1,000 columns from a single row... respectable. However, query 2 is taking over 50s to perform 1,000 row look-ups! Also, when I scale down to 100 row look-ups for query 2, the time scales in a similar fashion, down to 5s. Am I doing something wrong here? It seems like taking 5s to look up 100 rows in a distributed hash table is way too slow. Thoughts? Bill-
Re: Super Slow Multi-gets
Bill, It still sounds really strange. Can you reproduce it and note down the steps? I'm sure people here would be pleased to repeat it. Regards, Utku On Fri, Feb 11, 2011 at 5:34 AM, Mark Guzman segfa...@hasno.info wrote: I assume this should be set on all of the servers? Is there anything in particular one would look for in the log results? On Feb 10, 2011, at 4:37 PM, Aaron Morton wrote: Assuming cassandra 0.7, in log4j-server.properties make it look like this... log4j.rootLogger=DEBUG,stdout,R A On 11 Feb, 2011, at 10:30 AM, Bill Speirs bill.spe...@gmail.com wrote: I switched my implementation to use a thread pool of 10 threads, each multi-getting 10 keys/rows. This reduces my time from 50s to 5s for fetching all 1,000 messages. I started looking through the Cassandra source to find where the parallel requests are actually made, and I believe it's in org.apache.cassandra.service.StorageProxy.java fetchRows; is this correct? I noticed a number of logger.debug calls; what do I need to set in my log4j.properties file to see these messages, as they would probably help me determine what is taking so long? Currently my log4j.properties file looks like this and I'm not seeing the messages: log4j.appender.stdout=org.apache.log4j.ConsoleAppender log4j.appender.stdout.layout=org.apache.log4j.SimpleLayout log4j.category.org.apache=DEBUG, stdout log4j.category.me.prettyprint=DEBUG, stdout Thanks... Bill- On Thu, Feb 10, 2011 at 12:53 PM, Bill Speirs bill.spe...@gmail.com wrote: Each message row is well under 1K, so I don't think it is the network... plus all boxes are on a fast LAN. Bill-
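Bill's workaround (a thread pool where each worker multi-gets a small chunk of keys) can be sketched generically. `fetch_rows` below is a hypothetical stand-in for whatever multi-get call your client library provides:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(keys, size):
    """Split the key list into fixed-size chunks."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def parallel_multiget(fetch_rows, keys, workers=10, chunk_size=10):
    """Issue many small multi-gets in parallel and merge the results.
    `fetch_rows` takes a list of keys and returns a {key: row} dict."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(fetch_rows, chunked(keys, chunk_size)):
            results.update(partial)
    return results
```

The design tradeoff is the same one discussed in the thread: smaller chunks give more parallelism across coordinator nodes, at the cost of more round trips per request.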
Implementing a LRU in Cassandra
Hi All, I'm sure people here have tried to solve similar questions. Say I'm tracking pages, and I want to access the least recently used 1,000 unique pages (i.e. column names). How can I achieve this? Using a row with, say, ttl=60 seconds would solve the problem of accessing the least recently used unique pages in the last minute. Thanks for any comments and help. Regards, Utku
Re: Hadoop Integration doesn't work when one node is down
I've created an issue, was this what you were asking Jonathan? https://issues.apache.org/jira/browse/CASSANDRA-1927 On Mon, Jan 3, 2011 at 12:24 AM, Jonathan Ellis jbel...@gmail.com wrote: Can you create one? On Sun, Jan 2, 2011 at 4:39 PM, mck m...@apache.org wrote: Is this a bug or feature or a misuse? i can confirm this bug. on a 3 node cluster testing environment with RF 3. (and no issue exists for it AFAIK). ~mck -- Simplicity is the ultimate sophistication Leonardo Da Vinci's (William of Ockham) | www.semb.wever.org | www.sesat.no | www.finn.no| http://xss-http-filter.sf.net -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Replacing nodes of the cluster in 0.7.0-RC1
Since no reply came in a few days, I tried my proposed steps and it all worked fine. Just to let you know.
Replacing nodes of the cluster in 0.7.0-RC1
Hi All, I'm currently not happy with the hardware and the operating system of our 4-node cassandra cluster. I'm planning to move the cluster to a different hardware/OS architecture. For this purpose I'm planning to bring up 4 new nodes, so that each new node will be a replacement for a node in the current cluster. I should also note that the IP addresses will be changing; as far as I remember, cassandra caused problems when there was an IP change back in version 0.6. So what steps should I take to achieve this? Will a straightforward approach like this work? * drain all nodes * copy the data files to the new hosts * change configuration: seeds, datadir, tokens, etc. * bring up the cluster Regards, Utku
Detecting failed nodes and restarting
Hi All, The question is really simple: is there anyone out there using a set of scripts in production that detect failures of cassandra processes and restart them, or take other required actions? If so, how could we implement a generic solution for this problem? Regards, Utku
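One generic approach is a cron-driven watchdog that probes the thrift port and restarts the service when the probe fails. A minimal sketch follows; the host, port, and restart command are assumptions to adapt to your environment:

```python
import socket
import subprocess

def is_up(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watchdog(host="127.0.0.1", port=9160,
             restart_cmd=("service", "cassandra", "restart")):
    """Restart cassandra if its thrift port (9160 by default) is down.
    The restart command is a hypothetical init-script invocation."""
    if not is_up(host, port):
        subprocess.call(restart_cmd)
```

A port probe only proves the process is accepting connections, not that it is healthy; a more thorough check would issue a trivial read through the client API.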
Deleting the datadir for system keyspace in 0.7
Hello All, I'm wondering, before restarting a node in a cluster: if I delete the system keyspace, what data would I be losing? Would I be losing anything? Regards, Utku
Re: Deleting the datadir for system keyspace in 0.7
So, can the practice of deleting the system datadir and setting the token in the configuration (so that we're not losing it) be treated as a safe(!) operation if we're OK with losing the hints? Or are there other things to be aware of? Regards, Utku On Mon, Nov 15, 2010 at 3:25 PM, Jonathan Ellis jbel...@gmail.com wrote: ... but blowing away your saved token is a great way to lose data if you don't know what you're doing. On Mon, Nov 15, 2010 at 8:17 AM, Gary Dusbabek gdusba...@gmail.com wrote: Mostly these things: stored schema information, cached cluster info, the token, hints. Everything but the hints can be replaced. Gary. On Mon, Nov 15, 2010 at 06:29, Utku Can Topçu u...@topcu.gen.tr wrote: Hello All, I'm wondering, before restarting a node in a cluster: if I delete the system keyspace, what data would I be losing? Would I be losing anything? Regards, Utku -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Cassandra Hadoop Integration not compatible with Hadoop 0.21.0
When I try to read a CF from Hadoop, just after submitting the job I get this error:

Exception in thread main java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.cassandra.hadoop.ColumnFamilyInputFormat.getSplits(ColumnFamilyInputFormat.java:88)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:401)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:418)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:338)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:976)

However, the same code works fine with hadoop 0.20.2. Is there a prospective patch for this issue? Regards, Utku
Re: Time to wait for CF to be consistent after stopping writes.
Gary, Thank you for your comments. I also have another question in mind: if on all nodes nodetool cfstats shows that the memtable size is 0, can I safely assume that all values are consistent? Regards, Utku On Wed, Oct 27, 2010 at 3:24 PM, Gary Dusbabek gdusba...@gmail.com wrote: On Wed, Oct 27, 2010 at 05:08, Utku Can Topçu u...@topcu.gen.tr wrote: Hi, For a columnfamily in a keyspace which has RF=3, I'm issuing writes with ConsistencyLevel.ONE. In the configuration I have: - memtable_flush_after_mins: 30 - memtable_throughput_in_mb: 32 I'm writing to this columnfamily continuously for about 1 hour and then stop writing. So the question is: how long should I wait after stopping writes to that particular CF so that all writes take place and the data contained in the CF is consistent? There is no way to determine this precisely. Depending on your nodes and network it could be as short as a few milliseconds or much longer. Which metrics should I be checking to ensure that the CF is now consistent? Execute a read using ConsistencyLevel.ALL. If the value is not yet consistent, read repair will ensure that it soon will be. Another approach is to write using ConsistencyLevel.ALL, although that would decrease your write throughput. And additionally, if I was using ConsistencyLevel.QUORUM or ConsistencyLevel.ALL, would it make a difference? Precisely. Would reducing RF=3 to RF=1 make my life on this decision easier? It would make determining consistency better, but RF=1 isn't going to be very fault tolerant. Gary.
Time to wait for CF to be consistent after stopping writes.
Hi, For a columnfamily in a keyspace which has RF=3, I'm issuing writes with ConsistencyLevel.ONE. In the configuration I have: - memtable_flush_after_mins: 30 - memtable_throughput_in_mb: 32 I'm writing to this columnfamily continuously for about 1 hour and then stop writing. So the question is: how long should I wait after stopping writes to that particular CF so that all writes take place and the data contained in the CF is consistent? Which metrics should I be checking to ensure that the CF is now consistent? And additionally, if I was using ConsistencyLevel.QUORUM or ConsistencyLevel.ALL, would it make a difference? Would reducing RF=3 to RF=1 make my life on this decision easier? Regards, Utku
Reading a keyrange when using RP
If I'm not mistaken, cassandra has been providing support for key-range queries also on RP. However, when I try to define a key range such as start: key100, end: key200, I get an error like: InvalidRequestException(why:start key's md5 sorts after end key's md5. this is not allowed; you probably should not specify end key at all, under RandomPartitioner) How can I use cassandra to get a key range under RP? Best Regards, Utku
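The error comes from RandomPartitioner ordering rows by the md5 of the key rather than by the key itself, so a lexical range like key100..key200 is meaningless on the ring. A small demonstration (illustrative only):

```python
import hashlib

def token(key):
    """RandomPartitioner-style token: the md5 of the key as a big integer."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = ["key100", "key150", "key200"]
lexical = sorted(keys)
by_token = sorted(keys, key=token)
# the two orders generally disagree, which is why a lexical start/end
# pair is rejected: a valid range under RP must be expressed in token order
```

To scan all rows under RP you page through the ring using token ranges (or fetch everything and filter client-side); lexical key-range queries need OrderPreservingPartitioner instead.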
creating and dropping columnfamilies as a use case
Hi All, In the current project I'm working on, I have a use case for hourly analysis of the rows. Since the 0.7x branch supports creating and dropping columnfamilies on the fly, my proposed use case is: * Create a CF at the very beginning of every hour * At the end of the 1-hour period, analyze the data stored in the CF with Hadoop * Drop the CF afterwards Can you foresee any problems in continuously creating and dropping columnfamilies? Regards, Utku
using jna.jar Unknown mlockall error 0
Hi, To continue with memory optimizations, I've been trying to use JNA. However, when I copy jna.jar to the lib directory, I get the warning below. I'm currently running cassandra 0.6.5. WARN [main] 2010-10-08 09:16:18,924 FBUtilities.java (line 595) Unknown mlockall error 0 Should I be worried about this warning? Does it mean JNA might not be working? Regards, Utku
Re: using jna.jar Unknown mlockall error 0
I'm running an Ubuntu 9.10 linux box. On Fri, Oct 8, 2010 at 11:33 AM, Roger Schildmeijer schildmei...@gmail.com wrote: On Fri, Oct 8, 2010 at 11:27 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi, To continue with memory optimizations, I've been trying to use JNA. However, when I copy jna.jar to the lib directory, I get the warning below. I'm currently running cassandra 0.6.5. WARN [main] 2010-10-08 09:16:18,924 FBUtilities.java (line 595) Unknown mlockall error 0 A return value of 0 usually indicates that the operation returned successfully (at least on most modern POSIX systems). What OS are you using? Should I be worried about this warning? Does it mean JNA might not be working? Regards, Utku WBR Roger Schildmeijer
Re: using jna.jar Unknown mlockall error 0
Thanks Nicolas, I've just tried running as root and the warning did not show up. Do we need to run cassandra as root in order to use JNA? Regards, Utku On Fri, Oct 8, 2010 at 11:45 AM, Nicolas Mathieu nico...@gmail.com wrote: If I'm not wrong, when I run cassandra as root I don't get that mlockall error 0. Maybe there is another solution anyway. nico008 On 08/10/2010 11:33, Roger Schildmeijer wrote: On Fri, Oct 8, 2010 at 11:27 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi, To continue with memory optimizations, I've been trying to use JNA. However, when I copy jna.jar to the lib directory, I get the warning below. I'm currently running cassandra 0.6.5. WARN [main] 2010-10-08 09:16:18,924 FBUtilities.java (line 595) Unknown mlockall error 0 A return value of 0 usually indicates that the operation returned successfully (at least on most modern POSIX systems). What OS are you using? Should I be worried about this warning? Does it mean JNA might not be working? Regards, Utku WBR Roger Schildmeijer
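If I'm not mistaken, mlockall needs either the CAP_IPC_LOCK capability or a raised locked-memory limit, which is why it succeeds as root. A common alternative to running cassandra as root is raising the memlock limit for the user that runs it; a sketch, assuming the user is named cassandra (adapt the user name and path to your setup):

```
# /etc/security/limits.conf
cassandra  soft  memlock  unlimited
cassandra  hard  memlock  unlimited
```

You can check the effective limit with `ulimit -l` in the shell that starts cassandra; the default on many distributions is a small value like 32 or 64 KB, far too little to lock the JVM heap.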
Re: Tuning cassandra to use less memory
Hi Oleg, I've also been looking into these after some research. I've been tackling the problem by: 1. Raising the default max and min heap from 1G to 1500M. 2. Not using row caches; the key caches are set to 1000 (they were 200K as the default). 3. Lowering the memtable throughput to 32MB. 4. Using a 32-bit JVM. - Additionally, I've changed the disk access mode to mmap_index_only (this was the suggested mode). - I've also stopped using OPP and switched to RP. In our system there are currently 4 nodes, and there's one active keyspace containing 12 active standard columnfamilies. The nodes are still swapping, even though swappiness is set to zero right now; after swapping comes the OOM. I'm not sure what else to do, but 1.7 G does not seem to fit our needs. Do you think so? Regards, Utku On Wed, Oct 6, 2010 at 12:47 PM, Oleg Anastasyev olega...@gmail.com wrote: Hi All, We're currently starting to get OOM exceptions in our cluster. I'm trying to push the limits of our machines. Currently we have 1.7 G memory (ec2-medium). I'm wondering if, by tweaking some of cassandra's configuration settings, it is possible to make it live in peace with less memory. 1. What is the current java heap size on your nodes? Is it the default 1Gb? Try to configure more. 2. Do you use row or key caches? Try to lower their sizes in the configuration. 3. What is the memtable throughput mb threshold? You can try to lower it. 4. Do you use a 32-bit or 64-bit VM? For 1.7Gb RAM a 32-bit VM is enough and it uses less RAM, so you can give more to the Java heap.
Re: A proposed use case, any comments and experience is appreciated
What I understand from behaving like a deleted column is: they'll be there for at most GCGraceSeconds? On Mon, Oct 4, 2010 at 3:51 PM, Jonathan Ellis jbel...@gmail.com wrote: Expiring columns are 0.7 only. An expired column behaves like a deleted column until it is compacted away. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: A proposed use case, any comments and experience is appreciated
Hi Jonathan, Thank you for mentioning the expiring columns feature. I didn't know it existed; that's really great news. First of all, does the current 0.6 branch support it? If not, is the patch available for 0.6.5 somehow? And about the deletion issue: if all the columns in a row expire, when will the row be deleted? Will I see the row in my map inputs somehow, and for how long? Regards, Utku On Mon, Oct 4, 2010 at 3:30 PM, Jonathan Ellis jbel...@gmail.com wrote: A simpler approach might be to insert expiring columns into a 2nd CF with a TTL of one hour. On Mon, Oct 4, 2010 at 5:12 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hey All, I'm planning to run Map/Reduce on one of the ColumnFamilies. The keys are formed in such a fashion that they are indexed in descending order by time, so I'll be analyzing the data for every hour iteratively. Since the current Hadoop integration does not support partial ColumnFamily analysis, I feel that I'll need to dump the data of the last hour, put it on the Hadoop cluster, and do my analysis on the flat text file. Can you think of any better way of getting the data of a keyrange into a Hadoop cluster for analysis? Regards, Utku -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Hardware change of a node in the cluster
Hey All, Recently I tried to upgrade (a hardware upgrade) one of the nodes in my Cassandra cluster from ec2-small to ec2-large. However, there were problems: since the IP of the new instance was different from the previous instance, the other nodes did not recognize it in the ring. So what would be the best practice for a complete hardware change of one node in the cluster while keeping the data that it has? Regards, Utku
Best strategy for adding new nodes to the cluster
Hi All, We're currently running a Cassandra cluster with a replication factor of 3, consisting of 4 nodes. The current situation is: - The nodes are all identical (AWS small instances). - The data directory is on the partition (/mnt) which has 150 GB capacity, and each node has around 90 GB of load, so 60 GB of free space is left per node. So adding a new node to the cluster seems likely to cause problems for us: I think the node that streams the data to the new bootstrapping node will not have enough disk space to anticompact its data. What would be the best practice for such scenarios? Regards, Utku
Having different 0.6.x instances in one Cassandra cluster
Hi All, I'm planning to use the current 0.6.4 stable release to create an image that will be the base for nodes in our Cassandra cluster. However, the 0.6.5 release is on the way. When 0.6.5 is released, will it be possible to have some of the nodes stay on 0.6.4 while new nodes run 0.6.5? Best Regards, Utku
Lucene CassandraDirectory Implementation
Hi All, I was browsing through the Lucene JIRA and came across the issue titled "A Column-Oriented Cassandra-Based Lucene Directory" at https://issues.apache.org/jira/browse/LUCENE-2456 Has anyone had a chance to test it? If so, do you think it's an efficient implementation as a replacement for the FSDirectory? Best Regards, Utku
Cassandra Data Model Design Visualization
Hey Guys, I've been designing an application that consists of more than 20 ColumnFamilies. Each ColumnFamily has some columns referencing keys in other ColumnFamilies, and some keys are combinations of keys/columns from other ColumnFamilies. I guess most people use this kind of approach when building a design for an application. Are there any decent visualization schemes for designing Cassandra ColumnFamilies? Best Regards, Utku
Implementing Counter on Cassandra
Hey Guys, In a project I'm currently involved in, I need to have some columns holding incremented data. The easy approach for implementing a counter with increments, as far as I can tell, is read - increment - insert; however, this is not an atomic operation and can easily be corrupted over time. Do you have any best practices for implementing an atomic counter on Cassandra? Best Regards, Utku
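To make the corruption concrete, here is a minimal sketch (plain Java, not Cassandra code) of the lost-update race inherent in read - increment - insert. The HashMap stands in for a column, and the interleaving of two clients is simulated sequentially so the lost update is reproducible.

```java
import java.util.HashMap;
import java.util.Map;

// Two clients both read the counter before either writes back,
// so both compute the same new value and one increment is silently lost.
public class LostUpdateDemo {
    static long runRace() {
        Map<String, Long> column = new HashMap<>();
        column.put("counter", 5L);

        long readByA = column.get("counter"); // client A reads 5
        long readByB = column.get("counter"); // client B also reads 5

        column.put("counter", readByA + 1);   // A writes back 6
        column.put("counter", readByB + 1);   // B overwrites with 6: A's increment is lost

        return column.get("counter");         // 6, although two increments ran
    }

    public static void main(String[] args) {
        System.out.println(runRace()); // prints 6, not 7
    }
}
```

Avoiding this requires coordination that plain column writes don't give you: either an external lock/CAS around the read-modify-write, or a data model where each increment is written as its own column and summed at read time.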
Getting keys in a range sorted with respect to last access time
Hey All, First of all, I'll start with some questions on the default behavior of the get_range_slices method defined in the Thrift API. Given a keyrange with start-key kstart and end-key kend, assuming kstart < kend: * Is it true that I'll get the range [kstart, kend) (kstart inclusive, kend exclusive)? * What's the default order of the rows in the result list (assuming I am using an OPP)? * (How) can we reverse the sorting order? * What would be the behavior in the case kstart > kend? Will I get an empty result list? Secondly, I have a use case where I need to access the most recently updated rows. How can this be done? By writing a new partitioner? Best Regards, Utku
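The distinctions being asked about can be sketched in plain Java (this is not the Thrift API and does not answer what get_range_slices actually does), using a TreeMap as a stand-in for rows sorted by an OPP. It also shows a common modeling workaround for "most recently updated rows": a secondary index keyed by an inverted, zero-padded timestamp, so that ascending key order is descending time order. All names are illustrative.

```java
import java.util.TreeMap;

// TreeMap as a stand-in for an OPP-sorted row set: half-open ranges,
// reversed iteration, and an inverted-timestamp recency index.
public class RangeSliceSketch {
    static String sliceKeys() {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("a", "1"); rows.put("b", "2"); rows.put("c", "3");
        // a half-open range [kstart, kend): start inclusive, end exclusive
        return rows.subMap("a", true, "c", false).keySet().toString();
    }

    static String reversedKeys() {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("a", "1"); rows.put("b", "2"); rows.put("c", "3");
        // reversed order comes from iterating the descending view
        return rows.descendingKeySet().toString();
    }

    static String mostRecent() {
        // inverted timestamp key: the most recently updated row sorts first;
        // zero-padding keeps lexicographic order equal to numeric order
        TreeMap<String, String> byRecency = new TreeMap<>();
        byRecency.put(String.format("%019d", Long.MAX_VALUE - 2_000L), "rowX"); // updated at t=2000
        byRecency.put(String.format("%019d", Long.MAX_VALUE - 1_000L), "rowY"); // updated at t=1000
        return byRecency.firstEntry().getValue();
    }

    public static void main(String[] args) {
        System.out.println(sliceKeys());    // [a, b]
        System.out.println(reversedKeys()); // [c, b, a]
        System.out.println(mostRecent());   // rowX
    }
}
```

The recency index sidesteps writing a new partitioner: each update also writes a row into the index CF, and reading the first few keys of that CF yields the latest-touched rows.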
Re: Anyone using hadoop/MapReduce integration currently?
Hi Jeremy, Why are you using Cassandra versus using data stored in HDFS or HBase? - I'm thinking of using it for realtime streaming of user data. While streaming the requests, I'm also using Lucandra to index the data in realtime. It's a better option than HBase or native HDFS flat files because of the low latency in writes. Is there anything holding you back from using it (if you would like to use it but currently cannot)? - My answer to this would be: the current integration only supports using the whole range of the CF as input for the map phase; it would be much better if the InputFormat had support for a KeyRange. Best Regards, Utku On Tue, May 25, 2010 at 6:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: I'll be doing a presentation on Cassandra's (0.6+) Hadoop integration next week. Is anyone currently using MapReduce or the initial Pig integration? (If you're unaware of such integration, see http://wiki.apache.org/cassandra/HadoopSupport) If so, could you post to this thread on how you're using it or planning on using it (if not covered by the shroud of secrecy)? e.g. What is the use case? Why are you using Cassandra versus using data stored in HDFS or HBase? Are you using a separate Hadoop cluster to run the MR jobs on, or perhaps are you running the JobTracker and TaskTrackers on Cassandra nodes? Is there anything holding you back from using it (if you would like to use it but currently cannot)? Thanks!
Re: Real-time Web Analysis tool using Cassandra. Doubts...
What makes Cassandra a poor choice is the fact that you can't use a keyrange as input to the map phase in Hadoop. On Wed, May 12, 2010 at 4:37 PM, Jonathan Ellis jbel...@gmail.com wrote: On Tue, May 11, 2010 at 1:52 PM, Paulo Gabriel Poiati paulogpoi...@gmail.com wrote: - First of all, my first thought is to have two CFs: one for raw client requests (~10 million++ per day) and another for metrics aggregated over some defined interval of time like 1min, 5min, 15min... Is this a good approach? Sure. - Is it a good idea to use an OrderPreservingPartitioner, to maintain the order of my requests in the raw data CF? Or is the overhead too big? The problem with OPP isn't overhead (it is lower-overhead than RP) but the tendency to have hotspots in sequentially-written data. - Initially the cluster will contain only three nodes; is that a problem (too few, maybe)? You'll have to do some load testing to see. - I think the best way to do the aggregation job is through a Hadoop MapReduce job, right? Is there any other way to consider? Map/Reduce is usually better than rolling your own because it parallelizes for you. - Is Cassandra really suitable for this? Maybe HBase is better in this case? Nothing here makes me think Cassandra is a poor choice. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
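On the 1min/5min/15min aggregation point above, the usual trick is to derive the row key for the aggregate CF by rounding each event timestamp down to its interval boundary, so every event in the same interval lands in the same row. A small sketch (plain Java; the key format is illustrative, not from any particular schema):

```java
// Round an event timestamp down to its interval boundary and build a
// row key for the aggregate CF, e.g. "5min:1273586400000".
public class BucketKeys {
    static String bucketKey(long epochMillis, int minutes) {
        long intervalMillis = minutes * 60_000L;
        long bucketStart = (epochMillis / intervalMillis) * intervalMillis;
        return minutes + "min:" + bucketStart;
    }

    public static void main(String[] args) {
        // 90 s after the epoch falls in the second 1-min bucket...
        System.out.println(bucketKey(90_000L, 1)); // 1min:60000
        // ...but still in the first 5-min bucket
        System.out.println(bucketKey(90_000L, 5)); // 5min:0
    }
}
```

The aggregation job (MapReduce or otherwise) then just increments columns under the bucket key, and reads for a dashboard become single-row lookups.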
Distributed export and import into cassandra
Hey All, I have a simple sample use case: the aim is to export the columns in a column family into flat files, with the keys in the range from k1 to k2. Since every node in the cluster holds some portion of the data, is it possible to make each node dump its own local data volume to a flat file? Best Regards, Utku
ColumnFamilyOutputFormat?
Hey All, I've been looking at the documentation and related articles about Cassandra and Hadoop integration, and I'm only seeing ColumnFamilyInputFormat for now. What if I want to write directly to Cassandra after a reduce? What comes to my mind is: in the Reducer's setup I'd initialize a Cassandra client so that, rather than emitting the results to the MR framework, it would be possible to output them to Cassandra in a simple way. Can you think of any other high-level solutions, like an OutputFormat or so? Best Regards, Utku
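The "client in the Reducer's setup" idea above usually ends up as a batching pattern: buffer mutations during reduce calls and flush them in groups, with a final flush in cleanup. Here is a framework-free sketch of that pattern; the Consumer stands in for whatever batch call the client exposes (e.g. Thrift's batch_mutate), and all names are illustrative, not an actual OutputFormat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers (key, column, value) triples and hands them to a pluggable
// "flusher" in fixed-size batches, the way a reducer-held Cassandra client
// would group writes instead of sending one RPC per value.
public class BatchingWriter {
    private final int batchSize;
    private final Consumer<List<String[]>> flusher;
    private final List<String[]> buffer = new ArrayList<>();
    int flushCount = 0;

    BatchingWriter(int batchSize, Consumer<List<String[]>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    void write(String key, String column, String value) {
        buffer.add(new String[] { key, column, value });
        if (buffer.size() >= batchSize) flush();
    }

    // Call once more from the reducer's cleanup() to drain the remainder.
    void flush() {
        if (buffer.isEmpty()) return;
        flusher.accept(new ArrayList<>(buffer));
        buffer.clear();
        flushCount++;
    }

    public static void main(String[] args) {
        List<String[]> sent = new ArrayList<>();
        BatchingWriter w = new BatchingWriter(10, sent::addAll);
        for (int i = 0; i < 25; i++) w.write("url" + i, "count", "1");
        w.flush(); // drain the last partial batch of 5
        System.out.println(sent.size() + " mutations in " + w.flushCount + " batches");
    }
}
```

A real OutputFormat would wrap exactly this kind of writer behind the Hadoop RecordWriter interface, which is why the pattern ports over cleanly once one exists.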
Re: ColumnFamilyInputFormat KeyRange scans on a CF
Do you mean running the get_range_slices from a single? Yes, that would be reasonable for a relatively small key range; but when it comes to analyzing a really big range in a really big data collection (i.e. like the one we are currently populating), having a way to distribute the reads among the cluster seems the only reasonable solution. In the current situation, the best option might be distributing the range among ColumnFamilies (say, 1 CF for each day) and emptying each CF, in order to reuse it for another day's range, after analyzing its data. Can you suggest a workaround for this? On Fri, Apr 30, 2010 at 3:22 PM, Jonathan Ellis jbel...@gmail.com wrote: Sounds like doing this w/o m/r with get_range_slices is a reasonable way to go. On Thu, Apr 29, 2010 at 6:04 PM, Utku Can Topçu u...@topcu.gen.tr wrote: I'm currently writing collected data continuously to Cassandra, with keys starting with a timestamp and a unique identifier (like 2009.01.01.00.00.00.RANDOM) so I can query in time ranges. I'm thinking of running periodic MapReduce jobs that go through a designated time period; I might want to analyze the data only between 2009.01 and 2009.02. I had done this previously with HBase, but I thought Cassandra would be a better choice for continuously storing data in a safe manner. I guess this briefly explains my intended use case. Best Regards, Utku On Thu, Apr 29, 2010 at 11:32 PM, Jonathan Ellis jbel...@gmail.com wrote: It's technically possible but 0.6 does not support this, no. What is the use case you are thinking of? On Thu, Apr 29, 2010 at 11:14 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi, I've been trying to use Cassandra as a kind of supplementary input source for Hadoop MapReduce jobs. The default usage of the ColumnFamilyInputFormat does a full ColumnFamily scan to use as map input within the MapReduce framework. However, I believe it should be possible to give a keyrange so that only the specified range is scanned. Is this by any means possible?
Best Regards, Utku -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: ColumnFamilyInputFormat KeyRange scans on a CF
I meant, in the first sentence: running the get_range_slices from a single point. On Fri, Apr 30, 2010 at 4:08 PM, Utku Can Topçu u...@topcu.gen.tr wrote: Do you mean running the get_range_slices from a single? Yes, that would be reasonable for a relatively small key range; but when it comes to analyzing a really big range in a really big data collection (i.e. like the one we are currently populating), having a way to distribute the reads among the cluster seems the only reasonable solution. In the current situation, the best option might be distributing the range among ColumnFamilies (say, 1 CF for each day) and emptying each CF, in order to reuse it for another day's range, after analyzing its data. Can you suggest a workaround for this?
TimedOutException when using the ColumnFamilyInputFormat
Hey All, I'm trying to run some tests on Cassandra and Hadoop integration. I'm basically following the word count example at https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/src/WordCount.java using the ColumnFamilyInputFormat. Currently I have a one-node Cassandra and Hadoop setup on the same machine. I'm having problems if there is more than one map task running on the same node; please find a copy of the error message below. If I limit the map tasks per tasktracker to 1, the MapReduce job works fine without any problems at all. Do you think it's a known issue, or am I doing something wrong in the implementation?

---error
10/04/29 13:47:37 INFO mapred.JobClient: Task Id : attempt_201004291109_0024_m_00_1, Status : FAILED
java.lang.RuntimeException: TimedOutException()
        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: TimedOutException()
        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
        at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
        at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
        ... 11 more
---
Best Regards, Utku
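One knob worth checking for timeouts like the one above in the 0.6 era is the server-side RPC timeout; raising it gives each get_range_slices call in the record reader more headroom (reducing the number of rows fetched per call on the Hadoop side helps too). The element name below is from memory of the 0.6 storage-conf.xml, so verify it against your config:

```xml
<!-- storage-conf.xml: allow range scans more time before TimedOutException -->
<RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
```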
Re: ColumnFamilyInputFormat KeyRange scans on a CF
I'm currently writing collected data continuously to Cassandra, with keys starting with a timestamp and a unique identifier (like 2009.01.01.00.00.00.RANDOM) so I can query in time ranges. I'm thinking of running periodic MapReduce jobs that go through a designated time period; I might want to analyze the data only between 2009.01 and 2009.02. I had done this previously with HBase, but I thought Cassandra would be a better choice for continuously storing data in a safe manner. I guess this briefly explains my intended use case. Best Regards, Utku On Thu, Apr 29, 2010 at 11:32 PM, Jonathan Ellis jbel...@gmail.com wrote: It's technically possible but 0.6 does not support this, no. What is the use case you are thinking of? On Thu, Apr 29, 2010 at 11:14 AM, Utku Can Topçu u...@topcu.gen.tr wrote: Hi, I've been trying to use Cassandra as a kind of supplementary input source for Hadoop MapReduce jobs. The default usage of the ColumnFamilyInputFormat does a full ColumnFamily scan to use as map input within the MapReduce framework. However, I believe it should be possible to give a keyrange so that only the specified range is scanned. Is this by any means possible? Best Regards, Utku -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
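What makes the key scheme above attractive is that a fixed-width timestamp prefix turns a time period into a plain lexicographic string range, so "the data between 2009.01 and 2009.02" is just a key range. A small sketch in plain Java (the TreeSet stands in for OPP-sorted keys; names are illustrative):

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Keys like "2009.01.01.00.00.00.RANDOM" sort chronologically because the
// timestamp prefix is fixed-width, so selecting a month is a string range.
public class TimePrefixedKeys {
    // [fromMonth, toMonth): e.g. ("2009.01", "2009.02") selects all of January
    static SortedSet<String> monthRange(TreeSet<String> keys, String fromMonth, String toMonth) {
        return keys.subSet(fromMonth, toMonth);
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>();
        keys.add("2008.12.31.23.59.59.aaa");
        keys.add("2009.01.15.12.00.00.bbb");
        keys.add("2009.02.01.00.00.00.ccc");
        System.out.println(monthRange(keys, "2009.01", "2009.02"));
        // only the January key is selected
    }
}
```

This is exactly the range a keyrange-aware InputFormat would hand to its splits; without one, the same subSet logic has to run client-side over get_range_slices pages.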