Re: Size of a single Data Row?

Ralph Soika Sun, 10 Jun 2018 07:55:40 -0700

Hi Eevee,

thanks for your response. Low latency is not an issue because I read only in rare cases and I also write only rarely. But for me it is important to have high data consistency across a decentralized cluster, and Cassandra fulfills that perfectly. Hadoop is much more complex to set up compared to Cassandra.

Extracting the XML is not an option because it is mostly an unstructured set of field/value pairs.

But I still stumble over the point of a clustering key. What if I shift the date column into a second table?


    CREATE TABLE documents (
       id text,
       data text,
       PRIMARY KEY (id)
    );

    CREATE TABLE documents_created (
       created text,
       id text,
       PRIMARY KEY (created,id)
    );

So my 'big table' holds only the unique ID as the primary key. Is this table design more performant? I am trying to keep things simple.
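
A minimal sketch of how the two tables would be used together, assuming the schema above; the id value 'doc-4711' is only a placeholder:

    -- Step 1: check which documents exist for a given date.
    -- This reads only the small key rows, never the large data cells.
    SELECT id FROM documents_created WHERE created = '2018-06-10';

    -- Step 2: load exactly one document by its id.
    SELECT data FROM documents WHERE id = 'doc-4711';

With this layout each document is its own partition in 'documents', so a partition is only as large as its single data cell, while a 'documents_created' partition holds just the ids for one day. The price is that every document has to be written to both tables.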



Best regards

Ralph





On 10.06.2018 at 14:24, Evelyn Smith wrote:

Hi Ralph,
Yes, having partitions of 100 MB will seriously hit your performance. But usually the issue here is for people handling large numbers of transactions and aiming for low latency. My understanding is that a column value of up to 2 GB is the maximum. After that the system would start to fail, but well before that you are going to be seeing a significant performance hit (for most use cases).
I think an important question for you is: are you going to be reading these files from Cassandra regularly? It sounds like something S3 or Hadoop might be more appropriate for.
The other option is, if your XML files have some format, you could extract the data from them and store it that way.
One final point, I’m pretty sure a TEXT type won’t hold a 10 MB file, let alone a 1 GB file; I think the max size is like 64K characters.
Regards,
Eevee.
On 10 Jun 2018, at 7:54 pm, Ralph Soika <ralph.so...@imixs.com> wrote:
Hi,
I have a general question concerning the Cassandra technology. I have already read 2 books, but I am more and more confused about the question of whether Cassandra is the right technology. My goal is to store business data from a workflow engine in Cassandra. I want to use Cassandra as a kind of archive service because of its fault-tolerant and decentralized approach.
But here are two things which are confusing me. On the one hand the project claims that a single column value can be 2 GB (1 MB is recommended). On the other hand people explain that a partition should not be larger than 100 MB.
I plan only one single simple table:

    CREATE TABLE documents (
       created text,
       id text,
       data text,
       PRIMARY KEY (created,id)
    );
'created' is the partition key holding the date in ISO format (YYYY-MM-DD). The 'id' is a clustering key and is unique.
But my 'data' column holds an XML document with business data. This cell contains a lot of unstructured data and also media data. The data cell will be between 1 and 10 MB. BUT it can also hold more than 100 MB and less than 2 GB in some cases.
Is Cassandra able to handle this kind of table? Or is Cassandra in the end not recommended for this kind of data?
For example, I would like to ask whether data for a specific date is available:
    SELECT created, id FROM documents WHERE created = '2018-06-10';
I select without the data column and just ask if data exists. Is the performance automatically poor only because the data cell (not part of the primary key) of some rows is greater than 100 MB? Or is Cassandra running out of heap space in any case? It is perfectly clear that it makes no sense to select multiple cells which each contain over 100 MB of data in one single query. But this is a fundamental problem and has nothing to do with Cassandra. My Java application running in Wildfly would also not be able to handle a result with multiple GB of data. But I would expect that I can select a set of keys just to decide whether to load one single data cell.
Cassandra seems like a great system. But many people seem to claim that it is only suitable for mapping a user status list à la Facebook? Is this true? Thanks for your comments in advance.
===
Ralph


--

*Imixs Software Solutions GmbH*
*Web:* www.imixs.com <http://www.imixs.com> *Phone:* +49 (0)89-452136 16
*Office:* Agnes-Pockels-Bogen 1, 80992 München
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsführer: Gaby Heinle u. Ralph Soika

*Imixs* is an open source company, read more: www.imixs.org<http://www.imixs.org>
