Hadoop/Cassandra for data transformation (rather than analysis)?

2013-08-10 Thread Jan Algermissen
Hi,

I have a specific use case to address with Cassandra and I can't get my head 
around whether using Hadoop on top creates any significant benefit or not.

Situation:

I have product data, and each product 'contains' a number of articles (100 per
product), representing individual colors, sizes, etc.

My plan is to store each product in Cassandra as a wide row containing all the
articles of that product. I chose this design because sometimes I need to work
with all the articles in a product and sometimes I just need to pick one of
them per product.
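
To illustrate, a rough sketch of that layout with pycassa (keyspace, column
family, keys, and article values are all placeholders):

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Catalog', ['localhost:9160'])   # placeholder keyspace
products = ColumnFamily(pool, 'Products')              # placeholder CF

# one wide row per product, one column per article
products.insert('product-42', {
    'article-001': 'red;S;9.90',    # placeholder serialization
    'article-002': 'red;M;9.90',
})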

My understanding is that picking a certain article (column) from a wide row is
cheap (because it works within a single row) and that any other approach would
require a scan over essentially all the rows (not good).
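
In pycassa terms, that per-row selection would be column slices on a single
key, e.g. (reusing the 'products' column family from the sketch above):

# the whole product (all ~100 article columns of one row)
all_articles = products.get('product-42', column_count=200)

# exactly one article, without reading the rest of the row
one_article = products.get('product-42', columns=['article-002'])

# a contiguous slice of articles
some_articles = products.get('product-42',
                             column_start='article-001',
                             column_finish='article-050')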

So, after selecting one or some or all of the articles from every single wide
row (product), the input to my data processing is essentially a bunch of
articles.

The final output of the overall processing will be an export file (XML or CSV)
containing one line (or element) per article. There is no 'cross-article'
analysis going on; it is really sort of one-in/one-out.

I am looking at Hadoop because I see MapReduce as a nice fit, given the
independence of the per-article transformation into an output 'line'.

What I am worried about is whether Hadoop will actually give me a real benefit:
while there will be processing (mostly string operations) going on to create
lines from articles, the output still needs to be pulled over the wire to some
place to create the single output file.

I wonder whether it would not work equally well to pull the necessary
per-article data from Cassandra and create the output file in a single process
(in my case a Java web app). As I do not have billions of input records (but a
maximum of 10 million), the added benefit of scaling out the per-line
processing is probably not worth the additional setup and operations effort of
Hadoop.
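
For illustration, the single-process variant I have in mind would be something
like this (sketched with pycassa rather than Java; the transform function and
all names are placeholders):

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('Catalog', ['localhost:9160'])   # placeholder keyspace
products = ColumnFamily(pool, 'Products')              # placeholder CF

def article_to_line(product_key, article_id, raw_value):
    # placeholder for the real per-article string processing
    return '%s,%s,%s' % (product_key, article_id, raw_value)

with open('export.csv', 'w') as out:
    # get_range() streams all wide rows, paging buffer_size keys at a time
    for product_key, articles in products.get_range(buffer_size=100):
        for article_id, raw_value in articles.items():
            out.write(article_to_line(product_key, article_id, raw_value) + '\n')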

Any idea how I could make a judgement call here?

Another question: I read in a C* 1.1-related slide deck that Hadoop output to
CFS is only possible with DSE, not with DSC, and that with DSC the Hadoop
output would go to HDFS. Is that correct? For homogeneity, I would certainly
want to store the output files in CFS, too.

Sorry that this was a bit of a longer question/explanation.

Jan


Custom data type class in pycassa

2013-08-10 Thread Vladimir Prudnikov
Hi all,
I use pycassa and I want to store lists and tuples in Cassandra by serializing
them with MessagePack. A custom data type seems to be what I need. Here is the
data type I created:
##
import msgpack
from pycassa.types import CassandraType

class MyListType(CassandraType):
    @staticmethod
    def pack(value):
        return msgpack.packb(value)

    @staticmethod
    def unpack(value):
        return msgpack.unpackb(value)
##
Now I'm creating a new column family and passing an instance of this class in
column_validation_classes, but without success. It raises:
InvalidRequestException(why=Unable to find abstract-type class
'org.apache.cassandra.db.marshal.MyListType').
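
For reference, this is roughly the call I make (keyspace, column family, and
column names are placeholders):

from pycassa.system_manager import SystemManager

sys_mgr = SystemManager('localhost:9160')
# passing the custom type as a validator is what triggers the
# InvalidRequestException quoted above
sys_mgr.create_column_family(
    'MyKeyspace', 'MyCF',
    column_validation_classes={'my_list': MyListType()})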

What am I doing wrong? How do I do this properly?
Thanks.
-- 
Vladimir Prudnikov


Handling quorum write failures

2013-08-10 Thread Mikhail Tsaplin
Hi.

According to the DataStax documentation about atomicity in Cassandra, a QUORUM
write that succeeded on only one node will not be rolled back (check the
Atomicity chapter there:
http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/dml/dml_about_transactions_c.html).
So when I perform a QUORUM write on a cluster with RF=3 and only one replica
accepts the write (the others fail or time out), I will get a write error
status even though one node holds a successful write. This produces two cases:

   1. the write will be propagated to the other nodes when they come back
   online;
   2. the write can be lost completely if the node that accepted it breaks
   down for good before propagating it.
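
For concreteness, the kind of write I mean (a pycassa sketch; the keyspace,
column family, and recovery hook are all hypothetical):

from pycassa import ConsistencyLevel
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.cassandra.ttypes import TimedOutException

pool = ConnectionPool('Bank', ['node1:9160'])   # placeholder keyspace
log = ColumnFamily(pool, 'TransferLog',         # placeholder CF
                   write_consistency_level=ConsistencyLevel.QUORUM)

def handle_uncertain_write(transfer_id):
    # hypothetical recovery hook: e.g. record the id for later
    # read-back and repair
    print('uncertain write: %s' % transfer_id)

try:
    log.insert('transfer-777', {'from': 'A', 'to': 'B', 'amount': '100.00'})
except TimedOutException:
    # the coordinator gave up waiting for a quorum of acks; the write may
    # still sit on a minority of replicas and will not be rolled back,
    # so it is neither surely applied nor surely lost
    handle_uncertain_write('transfer-777')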

What are the best ways to deal with this kind of failure in, let's say,
hypothetical funds-transfer logging?