date:20110204

Is there any way to sort the columns named as integers in the descending order ?


Regards
-Aditya

Moving data

2011-02-04 Thread Morey, Gary

I have several large SQL Server 2005 tables.  I need to load the data in
these tables into Cassandra.  FYI, the Cassandra installation is on a
linux server running CentOS.

 

Can anyone suggest the best way to accomplish this?  I am a newbie to
Cassandra, so any advice would be greatly appreciated.

 

Best,

 

Gary

Re: Column Sorting of integer names

create a ReversedIntegerType.

On Fri, Feb 4, 2011 at 5:15 AM, Aditya Narayan ady...@gmail.com wrote:
 Is there any way to sort the columns named as integers in the descending 
 order ?


 Regards
 -Aditya




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Unavalible Exception

ruslan usifov ruslan.usifov at gmail.com writes:

 
 HelloWhy i can get Unavalible Exception on live cluster (all nodes is up and
never shutdown)PS: v 0.7.0


Can the nodes see each other? Check Cassandra logs for messages regarding other
nodes.

Oleg

Re: Unavalible Exception

2011-02-04 Thread ruslan usifov

2011/2/4 Oleg Proudnikov ol...@cloudorange.com

 ruslan usifov ruslan.usifov at gmail.com writes:

 
  HelloWhy i can get Unavalible Exception on live cluster (all nodes is up
 and
 never shutdown)PS: v 0.7.0


 Can the nodes see each other? Check Cassandra logs for messages regarding
 other
 nodes.


Yes they can, nodetool ring show well configured ring, and ther is nothing
in logs (no WARN or ERROR)

Re: Unavalible Exception

ruslan usifov ruslan.usifov at gmail.com writes:

 
 
 2011/2/4 Oleg Proudnikov olegp at cloudorange.com
 ruslan usifov ruslan.usifov at gmail.com writes:
 
  HelloWhy i can get Unavalible Exception on live cluster (all nodes is up
andnever shutdown)PS: v 0.7.0
 Can the nodes see each other? Check Cassandra logs for messages regarding 
 other
 nodes.
 
 
 Yes they can, nodetool ring show well configured ring, and ther is nothing in
logs (no WARN or ERROR) 
 
 
 
 

Try searching for InetAddress as INFO

Re: Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up

2011-02-04 Thread Ryan King

On Thu, Feb 3, 2011 at 9:12 PM, Aklin_81 asdk...@gmail.com wrote:
 Thanks Matthew  Ryan,

 The main inspiration behind me trying to generate Ids in sequential
 manner is to reduce the size of the userId, since I am using it for
 heavy denormalization. UUIDs are 16 bytes long, but I can also have a
 unique Id in just 4 bytes, and since this is just a one time process
 when the user signs-up, it makes sense to try cutting down the space
 requirements, if it is feasible without any downsides(!?).

 I am also using userIds to attach to Id of the other data of the user
 on my application. If I could reduce the userId size that I can also
 reduce the size of other Ids, I could drastically cut down the space
 requirements.


 [Sorry for this question is not directly related to cassandra but I
 think Cassandra factors here because of its  tuneable consistency]

Don't generate these ids in cassandra. Use something like snowflake,
flickr's ticket servers [2] or zookeeper sequential nodes.

-ryan


1. http://github.com/twitter/snowflake
2. 
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

Re: Move

Looks like https://issues.apache.org/jira/browse/CASSANDRA-1992, fixed
for 0.7.1.

On Fri, Feb 4, 2011 at 12:18 AM, Stu King s...@stuartrexking.com wrote:
 I am running a move on one node in a 5 node cluster. There are no writes to
 the cluster during the move.
 I am seeing an exception on one of the nodes (not the node which I am doing
 the move on).
 The exception stack is
 ERROR [CompactionExecutor:1] 2011-02-04 08:10:46,855 PrecompactedRow.java
 (line 82) Skipping row DecoratedKey(656517988577125179070965247963445,
 555345524e414d452e6a6f746173696c766573747265) in
 /var/lib/cassandra/data/Wenzani/UUID_UUID_SUPER-e-408-Data.db
 java.io.EOFException
 at java.io.RandomAccessFile.readFully(RandomAccessFile.java:416)
 at
 org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:280)
 at
 org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94)
 at
 org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:364)
 at
 org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:313)
 at
 org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
 at
 org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:137)
 at org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78)
 at
 org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:138)
 at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:107)
 at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:42)
 at
 org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
 at
 org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
 at
 org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
 at
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:323)
 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:122)
 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:92)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:636)
 Output from nodetool ring.
 Address         Status State   Load            Owns    Token


 105916716988735575505223832861775432335
 1.1.1.2   Up     Normal  34.29 GB        45.36%
  12956529933298582072612274413196299151
 1.1.1.3    Up     Normal  34.46 GB        11.41%
  32366675628954067180152712803029297247
 1.1.1.4   Up     Normal  48.96 GB        11.40%
  51756081624280481651195537730585467204
 1.1.1.5  Up     Normal  22 GB           22.78%
  90515859237527157456212262236145255573
 1.1.1.6   Up     Leaving 13.34 GB        9.05%
 105916716988735575505223832861775432335
 1.1.1.6 is the node which I executed the move on. It seems to be locked in
 the Leaving state. Is this normal until the move completes?
 There is almost no activity in the logs and very little cpu usage across the
 cluster.
 Is this expected for a move?
 Cheers
 Stu



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: How to delete bulk data from cassandra 0.6.3

You should use truncate instead. (Then remove the snapshot truncate creates.)

On Fri, Feb 4, 2011 at 2:05 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote:
 Hi All

 Is there any way i can delete column families data (not removing column
 families ) from Cassandra without effecting ring integrity.What if  i delete
 some column families data in linux with rm command  ?

 --
 S.Ali Ahsan

 Senior System Engineer

 e-Business (Pvt) Ltd

 49-C Jail Road, Lahore, P.O. Box 676
 Lahore 54000, Pakistan

 Tel: +92 (0)42 3758 7140 Ext. 128

 Mobile: +92 (0)345 831 8769

 Fax: +92 (0)42 3758 0027

 Email: ali.ah...@panasiangroup.com



 www.ebusiness-pg.com

 www.panasiangroup.com

 Confidentiality: This e-mail and any attachments may be confidential
 and/or privileged. If you are not a named recipient, please notify the
 sender immediately and do not disclose the contents to another person
 use it for any purpose or store or copy the information in any medium.
 Internet communications cannot be guaranteed to be timely, secure, error
 or virus-free. We do not accept liability for any errors or omissions.





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: get_range_slices and tombstones

You can't create a row with no columns without tombstones being
involved somehow. :)

There's no distinction between a row with no columns because the
individual columns were removed, and a row with no columns because
the row was removed.  the latter is just a more efficient expression
of the former.

On Fri, Feb 4, 2011 at 2:26 AM, Patrik Modesto patrik.mode...@gmail.com wrote:
 Hi!

 I'm getting tombstones from get_range_slices(). I know that's normal.
 But is there a way to know that a key is tombstone? I know tombstone
 has no columns but I can create a row without any columns that would
 look like a tombstone in get_range_slices().

 Regards,
 Patrik




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: How to delete bulk data from cassandra 0.6.3

2011-02-04 Thread roshandawrani

I thought truncate() was not available before 0.7 (in 0.6.3)was it?

---
Sent from BlackBerry

-Original Message-
From: Jonathan Ellis jbel...@gmail.com
Date: Fri, 4 Feb 2011 08:58:35 
To: useruser@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Subject: Re: How to delete bulk data from cassandra 0.6.3

You should use truncate instead. (Then remove the snapshot truncate creates.)

On Fri, Feb 4, 2011 at 2:05 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote:
 Hi All

 Is there any way i can delete column families data (not removing column
 families ) from Cassandra without effecting ring integrity.What if  i delete
 some column families data in linux with rm command  ?

 --
 S.Ali Ahsan

 Senior System Engineer

 e-Business (Pvt) Ltd

 49-C Jail Road, Lahore, P.O. Box 676
 Lahore 54000, Pakistan

 Tel: +92 (0)42 3758 7140 Ext. 128

 Mobile: +92 (0)345 831 8769

 Fax: +92 (0)42 3758 0027

 Email: ali.ah...@panasiangroup.com

 www.ebusiness-pg.com

 www.panasiangroup.com

 Confidentiality: This e-mail and any attachments may be confidential
 and/or privileged. If you are not a named recipient, please notify the
 sender immediately and do not disclose the contents to another person
 use it for any purpose or store or copy the information in any medium.
 Internet communications cannot be guaranteed to be timely, secure, error
 or virus-free. We do not accept liability for any errors or omissions.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: How to delete bulk data from cassandra 0.6.3

In that case, you should shut down the server before removing data files.

On Fri, Feb 4, 2011 at 9:01 AM,  roshandawr...@gmail.com wrote:
 I thought truncate() was not available before 0.7 (in 0.6.3)was it?

 ---
 Sent from BlackBerry

 -Original Message-
 From: Jonathan Ellis jbel...@gmail.com
 Date: Fri, 4 Feb 2011 08:58:35
 To: useruser@cassandra.apache.org
 Reply-To: user@cassandra.apache.org
 Subject: Re: How to delete bulk data from cassandra 0.6.3

 You should use truncate instead. (Then remove the snapshot truncate creates.)

 On Fri, Feb 4, 2011 at 2:05 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote:
 Hi All

 Is there any way i can delete column families data (not removing column
 families ) from Cassandra without effecting ring integrity.What if  i delete
 some column families data in linux with rm command  ?

 --
 S.Ali Ahsan

 Senior System Engineer

 e-Business (Pvt) Ltd

 49-C Jail Road, Lahore, P.O. Box 676
 Lahore 54000, Pakistan

 Tel: +92 (0)42 3758 7140 Ext. 128

 Mobile: +92 (0)345 831 8769

 Fax: +92 (0)42 3758 0027

 Email: ali.ah...@panasiangroup.com



 www.ebusiness-pg.com

 www.panasiangroup.com

 Confidentiality: This e-mail and any attachments may be confidential
 and/or privileged. If you are not a named recipient, please notify the
 sender immediately and do not disclose the contents to another person
 use it for any purpose or store or copy the information in any medium.
 Internet communications cannot be guaranteed to be timely, secure, error
 or virus-free. We do not accept liability for any errors or omissions.





 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

CF Read and Write Latency Histograms

Hi All,

I suspect that Write and Read Latency column headers need to be swapped. I am
running a bulk load with no reads on this CF but I see Read column with values
while the Write column has zeros only. The MBean shows the values correctly.

Thank you,
Oleg

Re: Using Cassandra to store files

I am also looking to possible solutions to store pdfs  word documents.

But why wont you store in them in the filesystem instead of a database
unless your files are too small in which case it would be recommended
to use a database.

-Aditya


On Fri, Feb 4, 2011 at 5:30 PM, Daniel Doubleday
daniel.double...@gmx.net wrote:
 We are doing this with cassandra.
 But we cache a lot. We get around 20 writes/s and 1k reads/s (~ 100Mbit/s)
 for that particular CF but only 1% of them hit our cassandra cluster (5
 nodes, rf=3).

 /Daniel
 On Feb 4, 2011, at 9:37 AM, Brendan Poole wrote:

 Hi Daniel

 When you say We are doing this do you mean via NFS or Cassandra.

 Thanks

 Brendan





 Signature.jpg Brendan Poole
  Systems Developer
   NewLaw Solicitors
  Helmont House
  Churchill Way
  Cardiff
  brendan.po...@new-law.co.uk
  029 2078 4283
  www.new-law.co.uk

 


 From: Daniel Doubleday [mailto:daniel.double...@gmx.net]
 Sent: 03 February 2011 17:21
 To: user@cassandra.apache.org
 Subject: Re: Using Cassandra to store files

 Hundreds of thousands doesn't sound too bad. Good old NFS would do with an
 ok directory structure.
 We are doing this. Our documents are pretty small though (a few kb). We have
 around 40M right now with around 300GB total.
 Generally the problem is that much data usually means that cassandra becomes
 io bound during repairs and compactions even if your hot dataset would fit
 in the page cache. There are efforts to overcome this and 0.7 will help with
 repair problems but for the time being you have to have quite some headroom
 in terms of io performance to handle these situations.
 Here is a related post:
 http://comments.gmane.org/gmane.comp.db.cassandra.user/11190

 On Feb 3, 2011, at 1:33 PM, Brendan Poole wrote:

 Hi

 Would anyone recommend using Cassandra for storing hundreds of thousands of
 documents in Word/PDF format? The manual says it can store documents under
 64MB with no issue but was wondering if anyone is using it for this specific
 perpose.  Would it be efficient/reliable and is there anything I need to
 bear in mind?

 Thanks in advance

 Signature.jpg Brendan Poole
  Systems Developer
   NewLaw Solicitors
  Helmont House
  Churchill Way
  Cardiff
  brendan.po...@new-law.co.uk
  029 2078 4283
  www.new-law.co.uk



 P Please consider the environment before printing this e-mail
 Important - The information contained in this email (and any attached files)
 is confidential and may be legally privileged and protected by law.

 The intended recipient is authorised to access it. If you are not the
 intended recipient, please notify the sender immediately and delete or
 destroy all copies. You must not disclose the contents of this email to
 anyone. Unauthorised use, dissemination, distribution, publication or
 copying of this communication is prohibited.

 NewLaw Solicitors does not accept any liability for any inaccuracies or
 omissions in the contents of this email that may have arisen as a result of
 transmission. This message and any attachments are believed to be free of
 any virus or defect that might affect any computer system into which it is
 received and opened. However, it is the responsibility of the recipient to
 ensure that it is virus free; therefore, no responsibility is accepted for
 any loss or damage in any way arising from its use.

 NewLaw Solicitors is the trading name of NewLaw Legal Ltd, a limited company
 registered in England and Wales with registered number 07200038.
 NewLaw Legal Ltd is regulated by the Solicitors Regulation Authority whose
 website is http://www.sra.org.uk

 The registered office of NewLaw Legal Ltd is at Helmont House, Churchill
 Way, Cardiff, CF10 2HE. Tel: 0845 756 6870, Fax: 0845 756 6871, Email:
 i...@new-law.co.uk. www.new-law.co.uk.

 We use the word ‘partner’ to refer to a shareowner or director of the
 company, or an employee or consultant of the company who is a lawyer with
 equivalent standing and qualifications. A list of the directors is displayed
 at the above address, together with a list of those persons who are
 designated as partners.



 P Please consider the environment before printing this e-mail
 Important - The information contained in this email (and any attached files)
 is confidential and may be legally privileged and protected by law.

 The intended recipient is authorised to access it. If you are not the
 intended recipient, please notify the sender immediately and delete or
 destroy all copies. You must not disclose the contents of this email to
 anyone. Unauthorised use, dissemination, distribution, publication or
 copying of this communication is prohibited.

 NewLaw Solicitors does not accept any liability for any inaccuracies or
 omissions in the contents of this email that may have arisen as a result of
 transmission. This message and any attachments are believed to be free of
 any virus or

Re: CF Read and Write Latency Histograms

Can you create a ticket?

On Fri, Feb 4, 2011 at 9:41 AM, Oleg Proudnikov ol...@cloudorange.com wrote:
 Hi All,

 I suspect that Write and Read Latency column headers need to be swapped. I am
 running a bulk load with no reads on this CF but I see Read column with values
 while the Write column has zeros only. The MBean shows the values correctly.

 Thank you,
 Oleg








-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Using Cassandra to store files

2011-02-04 Thread buddhasystem


Even when storage is in NFS, Cassandra can still be quite useful as a file
catalog. Your physical storage can change, move etc. Therefore, it's a good
idea to provide mapping of logical names to physical store points (which in
fact can be many). This is a standard technique used in mass storage.

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Using-Cassandra-to-store-files-tp5988698p5993357.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.

Re: Using Cassandra to store files

yes, definitely a database for mapping ofcourse!

On Fri, Feb 4, 2011 at 11:17 PM, buddhasystem potek...@bnl.gov wrote:

 Even when storage is in NFS, Cassandra can still be quite useful as a file
 catalog. Your physical storage can change, move etc. Therefore, it's a good
 idea to provide mapping of logical names to physical store points (which in
 fact can be many). This is a standard technique used in mass storage.

 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Using-Cassandra-to-store-files-tp5988698p5993357.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.

Re: Moving data

I'm afraid there is no short answer.

The long answer is,

1) Read about Cassandra data modeling at
http://wiki.apache.org/cassandra/ArticlesAndPresentations.  It is not
as simple as one table equals one columnfamily.
2) Write a program to read your data out of SQL Server and write it
into Cassandra, preferably with multiple threads

On Fri, Feb 4, 2011 at 6:00 AM, Morey, Gary gary.mo...@xerox.com wrote:
 I have several large SQL Server 2005 tables.  I need to load the data in
 these tables into Cassandra.  FYI, the Cassandra installation is on a linux
 server running CentOS.



 Can anyone suggest the best way to accomplish this?  I am a newbie to
 Cassandra, so any advice would be greatly appreciated.



 Best,



 Gary



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Using Cassandra to store files

2011-02-04 Thread sridhar basam

For the  number of file the OP has why not just use a traditional filesystem
and solr to index the pdf data. You get to search inside of the files for
relevant information?

 Sri

On Fri, Feb 4, 2011 at 12:47 PM, buddhasystem potek...@bnl.gov wrote:


 Even when storage is in NFS, Cassandra can still be quite useful as a file
 catalog. Your physical storage can change, move etc. Therefore, it's a good
 idea to provide mapping of logical names to physical store points (which in
 fact can be many). This is a standard technique used in mass storage.

 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Using-Cassandra-to-store-files-tp5988698p5993357.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.

Re: Moving data

2011-02-04 Thread buddhasystem


FWIW, I'm working on migrating a large amount of data out of Oracle into my
test cluster. The data has been warehoused as CSV files on Amazon S3. Having
that in place allows me to not put extra load on the production service when
doing many repeated tests. I then parse the data using CSV Python module
and, as Jonathan says, use threads to batch upload data into Cassandra.
Notable points: since the data is relatively sparse (i.e. many zeros for
integers and empty strings for strings etc), I establish a default value
dictionary, and don't write these to Cassandra at all -- they can be
reconstructed as needed when reading back.

Also, make sure you wrap Cassandra writes etc into exceptions. When load is
high, you might get timeouts at TSocket level etc.

-- 
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Moving-data-tp5992669p5993443.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.

RE: CF Read and Write Latency Histograms

2011-02-04 Thread David Dabbs

Is this 0.7?

-Original Message-
From: Oleg Proudnikov [mailto:ol...@cloudorange.com] 
Sent: Friday, February 04, 2011 11:42 AM
To: user@cassandra.apache.org
Subject: CF Read and Write Latency Histograms

Hi All,

I suspect that Write and Read Latency column headers need to be swapped. I
am
running a bulk load with no reads on this CF but I see Read column with
values
while the Write column has zeros only. The MBean shows the values correctly.

Thank you,
Oleg

Re: CF Read and Write Latency Histograms

David Dabbs dmdabbs at gmail.com writes:

 
 Is this 0.7?
 

Yes

Re: Using a synchronized counter that keeps track of no of users on the application using it to allot UserIds/ keys to the new users after sign up

2011-02-04 Thread Aklin_81

Thanks so much Ryan for the links; I'll definitely take them into
consideration.

Just another thought which came to my mind:-
perhaps it may be beneficial to store(or duplicate) some of the data
like the Login credentials particularly userId to User's Name
mapping, etc (which is very heavily read), in a fast MyISAM table.
This could solve the problem of keys though auto-generated unique
sequential primary keys. I could use the same keys for Cassandra rows
for that user. And also since Cassandra reads are relatively slow, it
makes sense to store data like userId to Name mapping in MyISAM as
this data would be required after almost all queries to the database.

Regards
-Asil

On Fri, Feb 4, 2011 at 10:14 PM, Ryan King r...@twitter.com wrote:
On Thu, Feb 3, 2011 at 9:12 PM, Aklin_81 asdk...@gmail.com wrote:
Thanks Matthew Ryan,

The main inspiration behind me trying to generate Ids in sequential
manner is to reduce the size of the userId, since I am using it for
heavy denormalization. UUIDs are 16 bytes long, but I can also have a
unique Id in just 4 bytes, and since this is just a one time process
when the user signs-up, it makes sense to try cutting down the space
requirements, if it is feasible without any downsides(!?).

I am also using userIds to attach to Id of the other data of the user
on my application. If I could reduce the userId size that I can also
reduce the size of other Ids, I could drastically cut down the space
requirements.

[Sorry for this question is not directly related to cassandra but I
think Cassandra factors here because of its tuneable consistency]

Don't generate these ids in cassandra. Use something like snowflake,
flickr's ticket servers [2] or zookeeper sequential nodes.

-ryan

1. http://github.com/twitter/snowflake
2.
http://code.flickr.com/blog/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

read latency in cassandra

2011-02-04 Thread Dan Kuebrich

Hi all,

It often takes more than two seconds to load:

- one row of ~450 events comprising ~600k
- cluster size of 1
- client is pycassa 1.04
- timeout on recv
- cold read (I believe)
- load generally  0.5 on a 4-core machine, 2 EC2 instance store drives for
cassandra
- cpu wait generally  1%

Often the following sequence occurs:

1. First attempt times out after 2 sec
2. Second attempt loads fine on immediate retry

So, I assume it's an issue about cache miss and going to disk.  Is 2 seconds
the normal I went to disk latency for cassandra?  What should we look to
tune, if anything? I don't think keeping everything in-memory is an option
for us given dataset size and access pattern (hot set is stuff being
currently written, stuff being accessed is likely to be older).

I didn't notice this problem with cassandra 0.6.8 and pycassa 0.3.

Thanks,
dan

Re: How to monitor Cassandra's throughput?

The issue has been resolved, the fix is on Hector's GitHub.


Oleg Proudnikov olegp at cloudorange.com writes:

 
 I have posted on Hector ML:
 
 http://thread.gmane.org/gmane.comp.db.hector.user/1690
 
 Oleg

RE: Tracking down read latency

2011-02-04 Thread David Dabbs


Thank you both for your advice. See my updated iostats below.


From: sridhar.ba...@gmail.com [mailto:sridhar.ba...@gmail.com] On Behalf Of
sridhar basam
Sent: Thursday, February 03, 2011 10:58 AM
To: user@cassandra.apache.org
Subject: Re: Tracking down read latency

The data provided is also a average value since boot time. Run the -x as
suggested below but run it via a interval of around 5 seconds. You very well
could be having i/o issue, it is hard to tell from the overall average
value you provided. Collect iostat -x 5 during the times when you see slow
reads and see how busy the disks are.

Sridhar

On Thu, Feb 3, 2011 at 3:21 AM, Peter Schuller wrote:
 $ iostat

As rcoli already mentioned you don't seen to have an I/O problem, but
as a point of general recommendation: When determining whether you are
blocking on disk I/O, pretty much *always* use iostat -x rather than
the much less useful default mode of iostat. The %util and queue
wait/average time columns are massively useful/important; without them
one is much more blind as to whether or not storage devices are
actually saturated.

 Peter Schuller


Our data is on sdb, commit logs on sdc.
So do I read this correctly that we're 'await'ing 6+millis on average for
data drive (sdb) 
requests to be serviced?


$iostat -x 5

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.590.000.220.940.00   98.25

Device: rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
avgqu-sz   await  svctm  %util
sda   0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda1  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda2  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda3  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sdb  11.20 0.00 42.00  0.00  4993.60 0.00   118.90
0.286.77   5.22  21.92
sdb1 11.20 0.00 42.00  0.00  4993.60 0.00   118.90
0.286.77   5.22  21.92
sdc   0.0031.00  0.00  1.40 0.00   259.20   185.14
0.000.14   0.14   0.02
sdc1  0.0031.00  0.00  1.40 0.00   259.20   185.14
0.000.14   0.14   0.02

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.560.000.181.080.00   98.17

Device: rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
avgqu-sz   await  svctm  %util
sda   0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda1  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda2  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda3  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sdb   8.80 0.00 49.40  0.00  5936.00 0.00   120.16
0.336.62   5.22  25.78
sdb1  8.80 0.00 49.40  0.00  5936.00 0.00   120.16
0.336.62   5.22  25.78
sdc   0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sdc1  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.990.000.221.080.00   97.71

Device: rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
avgqu-sz   await  svctm  %util
sda   0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda1  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda2  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda3  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sdb  11.40 0.00 46.20  0.00  5147.20 0.00   111.41
0.306.55   5.58  25.80
sdb1 11.40 0.00 46.20  0.00  5147.20 0.00   111.41
0.306.55   5.58  25.80
sdc   0.00 7.40  0.00  0.80 0.0065.6082.00
0.000.25   0.25   0.02
sdc1  0.00 7.40  0.00  0.80 0.0065.6082.00
0.000.25   0.25   0.02

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0.680.000.230.950.00   98.13

Device: rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz
avgqu-sz   await  svctm  %util
sda   0.00 0.80  0.00  0.80 0.0012.7716.00
0.000.25   0.25   0.02
sda1  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sda2  0.00 0.80  0.00  0.80 0.0012.7716.00
0.000.25   0.25   0.02
sda3  0.00 0.00  0.00  0.00 0.00 0.00 0.00
0.000.00   0.00   0.00
sdb   5.19 0.00 38.12  0.00  4356.09 0.00   114.26
0.266.70   5.91

Re: How to delete bulk data from cassandra 0.6.3

2011-02-04 Thread Ali Ahsan

So do we need to write a script ? or its some thing i can do as a system 
admin without involving and developer.If yes please guide me in this case.





On 02/04/2011 10:36 PM, Jonathan Ellis wrote:

In that case, you should shut down the server before removing data files.

On Fri, Feb 4, 2011 at 9:01 AM,roshandawr...@gmail.com  wrote:

I thought truncate() was not available before 0.7 (in 0.6.3)was it?

---
Sent from BlackBerry

-Original Message-
From: Jonathan Ellisjbel...@gmail.com
Date: Fri, 4 Feb 2011 08:58:35
To: useruser@cassandra.apache.org
Reply-To: user@cassandra.apache.org
Subject: Re: How to delete bulk data from cassandra 0.6.3

You should use truncate instead. (Then remove the snapshot truncate creates.)

On Fri, Feb 4, 2011 at 2:05 AM, Ali Ahsanali.ah...@panasiangroup.com  wrote:

Hi All

Is there any way i can delete column families data (not removing column
families ) from Cassandra without effecting ring integrity.What if  i delete
some column families data in linux with rm command  ?

--
S.Ali Ahsan

Senior System Engineer

e-Business (Pvt) Ltd

49-C Jail Road, Lahore, P.O. Box 676
Lahore 54000, Pakistan

Tel: +92 (0)42 3758 7140 Ext. 128

Mobile: +92 (0)345 831 8769

Fax: +92 (0)42 3758 0027

Email: ali.ah...@panasiangroup.com



www.ebusiness-pg.com

www.panasiangroup.com

Confidentiality: This e-mail and any attachments may be confidential
and/or privileged. If you are not a named recipient, please notify the
sender immediately and do not disclose the contents to another person
use it for any purpose or store or copy the information in any medium.
Internet communications cannot be guaranteed to be timely, secure, error
or virus-free. We do not accept liability for any errors or omissions.





--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com







--
S.Ali Ahsan

Senior System Engineer

e-Business (Pvt) Ltd

49-C Jail Road, Lahore, P.O. Box 676
Lahore 54000, Pakistan

Tel: +92 (0)42 3758 7140 Ext. 128

Mobile: +92 (0)345 831 8769

Fax: +92 (0)42 3758 0027

Email: ali.ah...@panasiangroup.com



www.ebusiness-pg.com

www.panasiangroup.com

Confidentiality: This e-mail and any attachments may be confidential
and/or privileged. If you are not a named recipient, please notify the
sender immediately and do not disclose the contents to another person
use it for any purpose or store or copy the information in any medium.
Internet communications cannot be guaranteed to be timely, secure, error
or virus-free. We do not accept liability for any errors or omissions.

Re: read latency in cassandra

2011-02-04 Thread aaron morton

What operation are you calling ? Are you trying to read the entire row back?

How many SSTables do you have for the CF? Does your data have a lot of 
overwrites ? Have you modified the default compaction settings ?

Do you have row cache enabled ? 

How long does the second request take ?

Can you use JConsole to check the read latency for the CF?

Sorry for all the questions, the answer to your initial question is mmm, that 
does not sound right. It will depend on

Aaron

On 5 Feb 2011, at 08:13, Dan Kuebrich wrote:

 Hi all,
 
 It often takes more than two seconds to load:
 
 - one row of ~450 events comprising ~600k
 - cluster size of 1
 - client is pycassa 1.04
 - timeout on recv
 - cold read (I believe)
 - load generally  0.5 on a 4-core machine, 2 EC2 instance store drives for 
 cassandra
 - cpu wait generally  1%
 
 Often the following sequence occurs:
 
 1. First attempt times out after 2 sec
 2. Second attempt loads fine on immediate retry
 
 So, I assume it's an issue about cache miss and going to disk.  Is 2 seconds 
 the normal I went to disk latency for cassandra?  What should we look to 
 tune, if anything? I don't think keeping everything in-memory is an option 
 for us given dataset size and access pattern (hot set is stuff being 
 currently written, stuff being accessed is likely to be older).
 
 I didn't notice this problem with cassandra 0.6.8 and pycassa 0.3.
 
 Thanks,
 dan

Re: Tracking down read latency

2011-02-04 Thread sridhar basam

On Fri, Feb 4, 2011 at 2:44 PM, David Dabbs dmda...@gmail.com wrote:


 Our data is on sdb, commit logs on sdc.
 So do I read this correctly that we're 'await'ing 6+millis on average for
 data drive (sdb)
 requests to be serviced?


That is right. Those numbers look pretty good for rotational media. What
sort of read latencies do you see? Have you also looked into GC.

 Sridhar

New Generation Size guidelines

Hi All,

I have a 3 server cluster with RF=2. My heap is 2G out of a 4G RAM. The servers
have 4 cores. I used default heap settings. The Eden space ended up around 60M
and the Survivor spaces are around 7M. This feels a little bit low for a process
that creates so much short-lived garbage. I just wanted to get your thoughts on
this. Space used in the Old Generation stays in a short range 1.2G-1.6G but when
the activity is low and I force GC it drops too 120M. It feels like there is a
lot of garbage that does not have a chance to get collected. The server is
running a batch load and its CPUs are 10-40% busy. The higher value is at 1.6G.
Yet I am reluctant to push my data load because I do hit OOMs.

The amount of data loaded so far is small - around 100G in total.

Should I increase my Gew Generation size?

Thank you,
Oleg

Re: New Generation Size guidelines

2011-02-04 Thread Ryan King

On Fri, Feb 4, 2011 at 1:45 PM, Oleg Proudnikov ol...@cloudorange.com wrote:

 Hi All,

 I have a 3 server cluster with RF=2. My heap is 2G out of a 4G RAM. The 
 servers
 have 4 cores. I used default heap settings. The Eden space ended up around 60M
 and the Survivor spaces are around 7M. This feels a little bit low for a 
 process
 that creates so much short-lived garbage. I just wanted to get your thoughts 
 on
 this. Space used in the Old Generation stays in a short range 1.2G-1.6G but 
 when
 the activity is low and I force GC it drops too 120M. It feels like there is a
 lot of garbage that does not have a chance to get collected. The server is
 running a batch load and its CPUs are 10-40% busy. The higher value is at 
 1.6G.
 Yet I am reluctant to push my data load because I do hit OOMs.

 The amount of data loaded so far is small - around 100G in total.

Almost certainly yes.

-ryan

Re: Problems with Python Stress Test

2011-02-04 Thread Sameer Farooqui

Brandon,

Thanks for the response. I have also noticed that stress.py's progress
interval gets thrown off in low memory situations.

What did you mean by contrib/stress on 0.7 instead.  I don't see that dir
in the src version of 0.7.

- Sameer


On Thu, Feb 3, 2011 at 5:22 PM, Brandon Williams dri...@gmail.com wrote:

 On Thu, Feb 3, 2011 at 7:02 PM, Sameer Farooqui 
 cassandral...@gmail.comwrote:

 Hi guys,

 I was playing around with the stress.py test this week and noticed a few
 things.

 1) Progress-interval does not always work correctly. I set it to 5 in the
 example below, but am instead getting varying intervals:


 Generally indicates that the client machine is being overloaded in my
 experience.

 2) The key_rate and op_rate doesn't seem to be calculated correctly. Also,
 what is the difference between the interval_key_rate and the
 interval_op_rate? For example in the example above, the first row shows 6662
 keys inserted in 5 seconds and 6662 / 5 = 1332, which matches the
 interval_op_rate.


 There should be no difference unless you're doing range slices, but IPC
 timing makes them vary somewhat.

 3) If I write x KB to Cassandra with py_stress, the used disk space doesn't
 grow by x after the test. In the example below I tried to write 500,000 keys
 * 32 bytes * 5 columns = 78,125 kilobytes of data to the database. When I
 checked the amount of disk space used after the test it actually grew by
 2,684,920 - 2,515,864 = 169,056 kilobytes. Is this because perhaps the
 commit log got duplicate copies of the data as the SSTables?


 Commitlogs could be part of it, you're not factoring in the column names,
 and then there's index and bloom filter overhead.

 Use contrib/stress on 0.7 instead.

 -Brandon

Re: Sorting in time order without using TimeUUID type column names

2011-02-04 Thread aaron morton

IMHO If you know the time of the event use store the time as a long, rather 
than a UUID. It will make it easier to get back to a 
time and make it easier for you to compare columns. TimeUUIDS has a pseudo 
random part as well as the time part, it could be set to a constant. By why 
bother if you know the absolute time.

I'm not sure what the ReminderCountOfThisUser is for, and as Sylvain says there 
is no need for the user name if this is in a row just for the user. 

Hope that helps.
Aaron
 
On 4 Feb 2011, at 01:32, Aditya Narayan wrote:

 If I use : TimestampOfDueTimeInFuture: UserId : ReminderCountOfThisUser
 as key pattern for the rows of reminders, then I am storing the key,
 just as it is, as the column name and thus column values  need not
 contain a link to the row containing the reminder details.
 
 I think UserId would be required along with timestamp in the key
 pattern to provide uniqueness to the key as there may be several
 reminders generated by users on the application, at the same time.
 
 But my question is about whether it is really advisable to even
 generate the keys like this pattern ... instead of going with
 timeuuids ?
 Are there are any downsides which I am not perhaps not aware of ?
 
 
 
 On Thu, Feb 3, 2011 at 5:43 PM, Sylvain Lebresne sylv...@datastax.com wrote:
 On Thu, Feb 3, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote:
 
 Hey all,
 
 I want to store some columns that are reminders to the users on my
 application, in time sorted order in a row(timeline row of the user).
 
 Would it be recommended to store these reminder columns in the
 timeline row with column names like: combination of timestamp(of time
 when the reminder gets due) + UserId+ Reminders Count of that user;
 Column Name= TimestampOfDueTimeInFuture: UserId :
 ReminderCountOfThisUser
 
 If you have one row by user (which is a good idea), why keep the UserId in
 the column name ?
 
 
 Then what comparator could I use to sort them in order of the their
 due time ? This comparator should be able to sort no. in descending
 order.(I guess ascii type would do the opposite order) (Reminders need
 to be sorted in the timeline in the order of their due time.)
 
 *The* solution is write a custom comparator.
 Have a look at http://www.datastax.com/docs/0.7/data_model/column_families
 and http://www.sodeso.nl/?p=421 for instance.
 
 As a side note, the fact that the comparator sort in ascending order when
 you
 need descending order would be that much of a problem, since you can always
 do slice queries in reversed order. But even then, asciiType is not a very
 satisfying solution as you would have to be careful about the padding of
 your
 timestamp for it to work correctly. So again, custom comparator is the way
 to go.
 
 Basically I am trying to avoid 16 bytes long timeUUID first because
 they are too long and the above defined key pattern is guaranteeing me
 a unique key/Id for the reminder row always.
 
 
 Thanks
 Aditya Narayan
 
 --
 Sylvain

Re: Unavalible Exception

2011-02-04 Thread aaron morton

Please provide some information the client you are using, the client side error 
stack, the command you are running, the output from nodetool ring 

Aaron
 
On 5 Feb 2011, at 05:10, Oleg Proudnikov wrote:

 ruslan usifov ruslan.usifov at gmail.com writes:
 
 
 
 2011/2/4 Oleg Proudnikov olegp at cloudorange.com
 ruslan usifov ruslan.usifov at gmail.com writes:
 
 HelloWhy i can get Unavalible Exception on live cluster (all nodes is up
 andnever shutdown)PS: v 0.7.0
 Can the nodes see each other? Check Cassandra logs for messages regarding 
 other
 nodes.
 
 
 Yes they can, nodetool ring show well configured ring, and ther is nothing in
 logs (no WARN or ERROR) 
 
 
 
 
 
 Try searching for InetAddress as INFO

Re: Sorting in time order without using TimeUUID type column names

Thanks Aaron,

Yes I can put the column names without using the userId in the
timeline row, and when I want to retrieve the row corresponding to
that column name, I will attach the userId to get the row key.

Yes I'll store it as a long  I guess I'll have to write with a custom
comparator type (ReversedIntegerType) to sort those longs in
descending order.

Regards
Aditya


On Sat, Feb 5, 2011 at 6:24 AM, aaron morton aa...@thelastpickle.com wrote:
 IMHO If you know the time of the event use store the time as a long, rather 
 than a UUID. It will make it easier to get back to a
 time and make it easier for you to compare columns. TimeUUIDS has a pseudo 
 random part as well as the time part, it could be set to a constant. By why 
 bother if you know the absolute time.

 I'm not sure what the ReminderCountOfThisUser is for, and as Sylvain says 
 there is no need for the user name if this is in a row just for the user.

 Hope that helps.
 Aaron

 On 4 Feb 2011, at 01:32, Aditya Narayan wrote:

 If I use : TimestampOfDueTimeInFuture: UserId : ReminderCountOfThisUser
 as key pattern for the rows of reminders, then I am storing the key,
 just as it is, as the column name and thus column values  need not
 contain a link to the row containing the reminder details.

 I think UserId would be required along with timestamp in the key
 pattern to provide uniqueness to the key as there may be several
 reminders generated by users on the application, at the same time.

 But my question is about whether it is really advisable to even
 generate the keys like this pattern ... instead of going with
 timeuuids ?
 Are there are any downsides which I am not perhaps not aware of ?



 On Thu, Feb 3, 2011 at 5:43 PM, Sylvain Lebresne sylv...@datastax.com 
 wrote:
 On Thu, Feb 3, 2011 at 11:27 AM, Aditya Narayan ady...@gmail.com wrote:

 Hey all,

 I want to store some columns that are reminders to the users on my
 application, in time sorted order in a row(timeline row of the user).

 Would it be recommended to store these reminder columns in the
 timeline row with column names like: combination of timestamp(of time
 when the reminder gets due) + UserId+ Reminders Count of that user;
 Column Name= TimestampOfDueTimeInFuture: UserId :
 ReminderCountOfThisUser

 If you have one row by user (which is a good idea), why keep the UserId in
 the column name ?


 Then what comparator could I use to sort them in order of the their
 due time ? This comparator should be able to sort no. in descending
 order.(I guess ascii type would do the opposite order) (Reminders need
 to be sorted in the timeline in the order of their due time.)

 *The* solution is write a custom comparator.
 Have a look at http://www.datastax.com/docs/0.7/data_model/column_families
 and http://www.sodeso.nl/?p=421 for instance.

 As a side note, the fact that the comparator sort in ascending order when
 you
 need descending order would be that much of a problem, since you can always
 do slice queries in reversed order. But even then, asciiType is not a very
 satisfying solution as you would have to be careful about the padding of
 your
 timestamp for it to work correctly. So again, custom comparator is the way
 to go.

 Basically I am trying to avoid 16 bytes long timeUUID first because
 they are too long and the above defined key pattern is guaranteeing me
 a unique key/Id for the reminder row always.


 Thanks
 Aditya Narayan

 --
 Sylvain

Re: Unavalible Exception

2011-02-04 Thread David King

We're going to need *way* more information than this

On 03 Feb 2011, at 20:03, ruslan usifov wrote:

 Hello
 
 Why i can get Unavalible Exception on live cluster (all nodes is up and never 
 shutdown)
 
 PS: v 0.7.0

Merging the rows of two column families(with similar attributes) into one ??

2011-02-04 Thread Ertio Lew

I read somewhere that more no of column families is not a good idea as
it consumes more memory and more compactions to occur  thus I am
trying to reduce the no. of column families by adding the rows of
other Column families(with similar attributes) as separate rows into
one.

I have two kinds of data for two separate features on my application.
If I store them in two different column families then both of them
will have similar attributes like same comparator type  sorting
needs. Thus I can also merge both of them in one column family, just
by adding the rows of another to this one(increasing the no of rows).
However some rows of 1st kind of data are very frequently used and
rows of 2nd data are less freq. used. But I dont think this will be a
problem as I am not merging two rows into one, but just adding them as
separate rows in the column family.
1st kind of data has wider rows and 2nd kind of data has very less wide rows.

But the caching requirements may be different as they cater to two
different features.(but I think it is even advantageous since
resources are free to be utilized by any data that's more frequently
used)


Is it recommended to merge these two column families into one ?? Thoughts ?

--

Ertio

Re: Unavalible Exception

Start with grep -i down system.log on each machine

On Fri, Feb 4, 2011 at 7:37 PM, David King dk...@ketralnis.com wrote:
 We're going to need *way* more information than this

 On 03 Feb 2011, at 20:03, ruslan usifov wrote:

 Hello

 Why i can get Unavalible Exception on live cluster (all nodes is up and 
 never shutdown)

 PS: v 0.7.0





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Pig not reading all cassandra data

2011-02-04 Thread Matt Kennedy

Found the culprit.  There is a new feature in Pig 0.8 that will try to reduce 
the number of splits used to speed up the whole job.  Since the 
ColumnFamilyInputFormat lists the input size as zero, this feature eliminates 
all of the splits except for one.  

The workaround is to disable this feature for jobs that use CassandraStorage by 
setting -Dpig.splitCombination=false in the pig_cassandra script.

Hope somebody finds this useful, you wouldn't believe how many dead-ends I ran 
down trying to figure this out.

-Matt 
On Feb 2, 2011, at 4:34 PM, Matthew E. Kennedy wrote:

 
 I noticed in the jobtracker log that when the pig job kicks off, I get the 
 following info message:
 
 2011-02-02 09:13:07,269 INFO org.apache.hadoop.mapred.JobInProgress: Input 
 size for job job_201101241634_0193 = 0. Number of splits = 1
 
 So I looked at the job.split file that is created for the Pig job and 
 compared it to the job.split file created for the map-reduce job.  The map 
 reduce file contains an entry for each split, whereas the  job.split file for 
 the Pig job contains just the one split.
 
 I added some code to the ColumnFamilyInputFormat to output what it thinks it 
 sees as it should be creating input splits for the pig jobs, and the call to 
 getSplits() appears to be returning the correct list of splits.  I can't 
 figure out where it goes wrong though when the splits should be written to 
 the job.split file.
 
 Does anybody know the specific class responsible for creating that file in a 
 Pig job, and why it might be affected by using the pig CassandraStorage 
 module?
 
 Is anyone else successfully running Pig jobs against a 0.7 cluster?
 
 Thanks,
 Matt

Re: Pig not reading all cassandra data