Re: Performance problem with large wide row inserts using CQL

Peter Lin Thu, 20 Feb 2014 13:33:06 -0800

Hi Ed,


you're definitely not mad. I've seen this all over the place. We have
several large retail customers and they all suffer the EAV horror. Having
built EAV horrors in the past, and being guilty of inflicting that pain on
people, I think mixing static and dynamic is "Totally Freaking awesome!"
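
A minimal sketch of what "mixing static and dynamic" means on the client
side, written in Python for brevity. It assumes a made-up one-byte type tag
in front of each serialized name and value; the tags and the function names
(encode, decode, write_row, read_row) are hypothetical and are not the
Thrift, Hector, or CQL wire format. The sample values mirror the insert
example quoted further down in this thread.

```python
import struct

# Hypothetical one-byte type tags; NOT Cassandra's or Hector's wire format.
_PACKERS = {
    int: (b"i", lambda v: struct.pack(">q", v), lambda b: struct.unpack(">q", b)[0]),
    float: (b"d", lambda v: struct.pack(">d", v), lambda b: struct.unpack(">d", b)[0]),
    str: (b"s", lambda v: v.encode("utf-8"), lambda b: b.decode("utf-8")),
}
_BY_TAG = {tag: unpack for tag, _pack, unpack in _PACKERS.values()}


def encode(value):
    """Serialize a value to bytes, prefixed with a tag naming its type."""
    tag, pack, _unpack = _PACKERS[type(value)]
    return tag + pack(value)


def decode(raw):
    """Recover the typed value from tagged bytes."""
    return _BY_TAG[raw[:1]](raw[1:])


def write_row(static, dynamic):
    """Static columns keep fixed names; dynamic names and values are both encoded."""
    row = {("static", name): encode(value) for name, value in static.items()}
    row.update(
        {("dynamic", encode(name)): encode(value) for name, value in dynamic.items()}
    )
    return row


def read_row(row):
    """Split a stored row back into typed static and dynamic columns."""
    static, dynamic = {}, {}
    for (kind, name), raw in row.items():
        if kind == "static":
            static[name] = decode(raw)
        else:
            dynamic[decode(name)] = decode(raw)
    return static, dynamic


row = write_row(
    {"staticColumn1": "text1", "staticColumn2": "text2"},
    {20: 30.55, "dynamicColumn": 3500},
)
static, dynamic = read_row(row)
```

The point of the tag is that the reader does not need to know in advance
which type a dynamic name or value carries; a real framework would also have
to cover more types and schema validation.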

I know many large shops buy mainframes just to make EAV queries fast. Big
retail shops can fork over millions for a big box, but everyone else
"probably" shouldn't. I'm totally biased; to me, the ability to use both in
a single columnFamily is the gem of Cassandra. Without it, we're stuck
using old techniques that create a painful nightmare. I've had to fix old
systems using EAV and it's so painful just to figure out what properties a
damn record has.
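
To make the EAV pain concrete, here is a small Python sketch. The product
data, the attribute names, and the helper names (pivot, lookup) are invented
for illustration. It contrasts rebuilding objects from
entity-attribute-value rows with reading one row that holds the common
fields (price, category) next to the product-specific ones.

```python
from collections import defaultdict

# Hypothetical EAV table: one (entity, attribute, value) tuple per property.
eav_rows = [
    ("sku-1", "price", "19.99"),
    ("sku-1", "category", "shirts"),
    ("sku-1", "color", "blue"),
    ("sku-2", "price", "5.00"),
    ("sku-2", "category", "mugs"),
    ("sku-2", "capacity_ml", "350"),
]


def pivot(rows):
    """Rebuild one object per entity by scanning and grouping every EAV row."""
    products = defaultdict(dict)
    for entity, attribute, value in rows:
        products[entity][attribute] = value
    return dict(products)


# The same data stored as one row per product: common fields plus
# whatever product-specific columns that product happens to have.
wide_rows = {
    "sku-1": {"price": "19.99", "category": "shirts", "color": "blue"},
    "sku-2": {"price": "5.00", "category": "mugs", "capacity_ml": "350"},
}


def lookup(rows, sku):
    """Single-row read: no pivot, and the row itself lists its properties."""
    return rows[sku]


rebuilt = pivot(eav_rows)
mug = lookup(wide_rows, "sku-2")
```

With the EAV layout every read pays for the grouping step, and finding out
which properties a record has means scanning its attribute rows. With one
row per product, the column names of that row answer the question directly.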







On Thu, Feb 20, 2014 at 4:03 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Peter,
>
> I must meet you and shake your hand. I was actually having a debate with a
> number of people about a week back claiming there was "no reason to mix
> static and dynamic". We do it all the time. I am glad someone else besides
> me "gets it" and I am not totally mad.
>
> Ed
>
>
>
> On Thu, Feb 20, 2014 at 3:26 PM, Peter Lin <wool...@gmail.com> wrote:
>
>>
>> Hi Duyhai,
>>
>> yes, I am talking about mixing static and dynamic columns in a single
>> column family. Let me give you an example from retail.
>>
>> Say you're Amazon and you sell over 10K different products. How do you
>> store all those products with all the different properties like color,
>> size, dimensions, etc.? With relational databases people use EAV (entity
>> attribute value) tables. This means that when querying for data, the
>> system has to reconstruct the object by pivoting a bunch of rows and
>> flattening them out to populate the Java object. Typically there are
>> common fields to a product like SKU, price, and category.
>>
>> Using both static and dynamic columns, data can be stored in 1 row and
>> queried by 1 row. Anyone who has used the EAV approach to build product
>> databases will tell you how much that sucks. Another example is from auto
>> insurance. Typically a policy database will allow 1 or more types of items
>> for property insurance. Property insurance is home/auto insurance.
>>
>> Each insurance carrier supports a different number of insurable items,
>> coverages and endorsements. Many systems use the same EAV approach, but the
>> problem is bigger. Typically a commercial auto policy may have hundreds of
>> drivers and vehicles. Each policy may have dozens or hundreds of coverages
>> and endorsements. It is common for an auto insurance model to have hundreds
>> of coverages and endorsements with different properties. Using the old ORM
>> approach, it's usually mapped table-per-class. Problem is, that results in
>> query explosion for polymorphic queries. This is a known problem with
>> polymorphic queries using traditional techniques.
>>
>> Given that Cassandra + thrift gives developers the ability to store
>> dynamic columns of different types, it solves the performance issues
>> inherent in the EAV technique.
>>
>> The point I was trying to make in my first response is that going with
>> pure CQL makes it much harder to take advantage of the COOL features of
>> Cassandra. It does require building a framework to make it "mostly"
>> transparent to developers, but it is worth it in my opinion to learn and
>> understand both thrift and cql. I use annotations in my framework and
>> delegates to handle the serialization. This way, the developer only needs
>> to annotate the class and the framework handles serialization and
>> deserialization.
>>
>>
>>
>>
>>
>> On Thu, Feb 20, 2014 at 3:05 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>>> "Developers can use whatever type they want for the name or value in a
>>> dynamic column and the framework will handle it appropriately."
>>>
>>>  What do you mean by "dynamic" column? If you want to be able to insert
>>> an arbitrary number of columns in one physical row, CQL3 clustering is
>>> there and does the job pretty well.
>>>
>>>  If by "dynamic" you mean a column whose validation type can change at
>>> runtime (like the dynamic composite type:
>>> http://hector-client.github.io/hector/build/html/content/composite_with_templates.html)
>>> then why don't you just use the blob type and serialize it yourself on the
>>> client side?
>>>
>>>  More practically, in your previous example:
>>>
>>>   - insert into myColumnFamily(staticColumn1, staticColumn2, 20 as int,
>>> dynamicColumn as string) into ('text1','text2',30.55 as double, 3500 as
>>> long)
>>>
>>>  I can't see a really sensible use case where you need to mix static and
>>> dynamic columns in the same column family. If you need to save a domain
>>> model, use a skinny row with a fixed number of columns known beforehand. If
>>> you want to store a time series or timeline of data, the wide row is there.
>>>
>>>
>>> On Thu, Feb 20, 2014 at 8:55 PM, Peter Lin <wool...@gmail.com> wrote:
>>>
>>>>
>>>> my apologies Sylvain, I didn't mean to misquote you. I still feel that
>>>> even if someone is only going to use CQL, it is "worth it" to learn thrift.
>>>>
>>>> In the interest of discussion, I looked at both jira tickets and I
>>>> don't see how that makes it so a developer can specify the name and value
>>>> type for a dynamic column.
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-6561
>>>> https://issues.apache.org/jira/browse/CASSANDRA-4851
>>>>
>>>> Am I missing something? If the grammar for insert statements doesn't
>>>> give users the ability to declare the name and value type, it means the
>>>> developer has to default name and value to bytes. In their code, they have
>>>> to handle that manually or build their own framework. I built my own
>>>> framework, which handles this for me. Developers can use whatever type
>>>> they want for the name or value in a dynamic column and the framework will
>>>> handle it appropriately.
>>>>
>>>> To me, developers should take time to learn both and use both. I
>>>> realize it's more work to understand both and take time to read the code.
>>>> Not everyone is crazy enough to spend time reading the cassandra code base
>>>> or spend hundreds of hours studying hector and other cassandra clients. I
>>>> will say this: if I hadn't spent time studying cassandra and reading Hector
>>>> code, I wouldn't have been able to help one of DataStax's customers port
>>>> Hector to .Net. I also wouldn't have been able to port Hector to C#
>>>> natively in 3 months.
>>>>
>>>> Rather than recommend people be lazy, it would be more useful to list
>>>> the pros/cons. To my knowledge, there isn't a good writeup on the pros/cons
>>>> of thrift and cql on cassandra.apache.org. I don't know if the
>>>> DataStax docs have a detailed write-up of it; do they?
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Feb 20, 2014 at 12:46 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>
>>>>> On Thu, Feb 20, 2014 at 6:26 PM, Peter Lin <wool...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> I disagree with the sentiment that "thrift is not worth the trouble".
>>>>>>
>>>>>
>>>>> Way to quote only part of my sentence and get mental on it. My full
>>>>> sentence was "it's probably not worth the trouble to start with thrift if
>>>>> you're gonna use CQL later".
>>>>>
>>>>>
>>>>>>
>>>>>> CQL and all SQL inspired dialects limit one's ability to use
>>>>>> arbitrarily typed data in dynamic columns. With thrift it's easy and
>>>>>> straightforward. With CQL there is no way to tell Cassandra the type of
>>>>>> the name and value for a dynamic column. You can only set the default
>>>>>> type. That means using a "pure cql" approach you can't deviate from the
>>>>>> default type. Cassandra will throw an exception indicating the type is
>>>>>> different than the default type.
>>>>>>
>>>>>
>>>>>> Until such time that CQL abandons the shackles of SQL and adds the
>>>>>> ability to indicate the column and value type. Something like this
>>>>>>
>>>>>
>>>>>> insert into myColumnFamily(staticColumn1, staticColumn2, 20 as int,
>>>>>> dynamicColumn as string) into ('text1','text2',30.55 as double, 3500 as
>>>>>> long)
>>>>>>
>>>>>> This is one area where Thrift is superior to CQL. Having said that,
>>>>>> it's valid to use Cassandra "as if" it was a relational database, but
>>>>>> then you'd miss out on some of the unique features.
>>>>>>
>>>>>
>>>>> Man, if I had a nickel every time someone came on that mailing list
>>>>> pretending that something was possible with thrift and not CQL ... I will
>>>>> claim this: with CASSANDRA-6561 and CASSANDRA-4851 that just got in, there
>>>>> is *nothing* that thrift can do that CQL cannot. But well, what do I know
>>>>> about Cassandra.
>>>>>
>>>>> --
>>>>> Sylvain
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 20, 2014 at 12:12 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>
>>>>>>> On Thu, Feb 20, 2014 at 2:16 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> For what it is worth, your schema is simple and uses compact storage.
>>>>>>>> Thus you really don't need anything in cassandra 2.0 as far as I can
>>>>>>>> tell. You might be happier with a stable release like 1.2.something and
>>>>>>>> just hector or astyanax. You are really dealing with many issues you
>>>>>>>> should not have to just to prototype a simple cassandra app.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Of course, if everyone was using that reasoning, no-one would ever
>>>>>>> test new features and report problems/suggest improvements. So thanks to
>>>>>>> anyone like Rüdiger who actually tries stuff and takes the time to
>>>>>>> report problems when they think they encounter one. Keep at it, *you*
>>>>>>> are the one helping Cassandra to get better every day.
>>>>>>>
>>>>>>> And you are also right Rüdiger that it's probably not worth the
>>>>>>> trouble to start with thrift if you're gonna use CQL later. And you
>>>>>>> definitely should use CQL, it is Cassandra's future.
>>>>>>>
>>>>>>> --
>>>>>>> Sylvain
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Thursday, February 20, 2014, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Wed, Feb 19, 2014 at 9:38 PM, Rüdiger Klaehn <rkla...@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> I have cloned the cassandra repo, applied the patch, and built it.
>>>>>>>> >> But when I want to run the benchmark I get an exception. See below.
>>>>>>>> >> I tried with a non-managed dependency to
>>>>>>>> >> cassandra-driver-core-2.0.0-rc3-SNAPSHOT-jar-with-dependencies.jar,
>>>>>>>> >> which I compiled from source because I read that that might help.
>>>>>>>> >> But that did not make a difference.
>>>>>>>> >>
>>>>>>>> >> So currently I don't know how to give the patch a try. Any ideas?
>>>>>>>> >>
>>>>>>>> >> cheers,
>>>>>>>> >>
>>>>>>>> >> Rüdiger
>>>>>>>> >>
>>>>>>>> >> Exception in thread "main" java.lang.IllegalArgumentException: replicate_on_write is not a column defined in this metadata
>>>>>>>> >>     at com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273)
>>>>>>>> >>     at com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279)
>>>>>>>> >>     at com.datastax.driver.core.Row.getBool(Row.java:117)
>>>>>>>> >>     at com.datastax.driver.core.TableMetadata$Options.<init>(TableMetadata.java:474)
>>>>>>>> >>     at com.datastax.driver.core.TableMetadata.build(TableMetadata.java:107)
>>>>>>>> >>     at com.datastax.driver.core.Metadata.buildTableMetadata(Metadata.java:128)
>>>>>>>> >>     at com.datastax.driver.core.Metadata.rebuildSchema(Metadata.java:89)
>>>>>>>> >>     at com.datastax.driver.core.ControlConnection.refreshSchema(ControlConnection.java:259)
>>>>>>>> >>     at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:214)
>>>>>>>> >>     at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:161)
>>>>>>>> >>     at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
>>>>>>>> >>     at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:890)
>>>>>>>> >>     at com.datastax.driver.core.Cluster$Manager.newSession(Cluster.java:910)
>>>>>>>> >>     at com.datastax.driver.core.Cluster$Manager.access$200(Cluster.java:806)
>>>>>>>> >>     at com.datastax.driver.core.Cluster.connect(Cluster.java:158)
>>>>>>>> >>     at cassandra.CassandraTestMinimized$delayedInit$body.apply(CassandraTestMinimized.scala:31)
>>>>>>>> >>     at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>>>>>>>> >>     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>>>>>>>> >>     at scala.App$$anonfun$main$1.apply(App.scala:71)
>>>>>>>> >>     at scala.App$$anonfun$main$1.apply(App.scala:71)
>>>>>>>> >>     at scala.collection.immutable.List.foreach(List.scala:318)
>>>>>>>> >>     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>>>>>>>> >>     at scala.App$class.main(App.scala:71)
>>>>>>>> >>     at cassandra.CassandraTestMinimized$.main(CassandraTestMinimized.scala:5)
>>>>>>>> >>     at cassandra.CassandraTestMinimized.main(CassandraTestMinimized.scala)
>>>>>>>> >
>>>>>>>> > I believe you've tried the cassandra trunk branch? trunk is
>>>>>>>> > basically the future Cassandra 2.1 and the driver is currently
>>>>>>>> > unhappy because the replicate_on_write option has been removed in
>>>>>>>> > that version. I'm supposed to have fixed that on the driver 2.0
>>>>>>>> > branch like 2 days ago so maybe you're also using a slightly old
>>>>>>>> > version of the driver sources in there? Or maybe I've screwed up my
>>>>>>>> > fix, I'll double check. But anyway, it would be overall simpler to
>>>>>>>> > test with the cassandra-2.0 branch of Cassandra, with which you
>>>>>>>> > shouldn't run into that.
>>>>>>>> > --
>>>>>>>> > Sylvain
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sorry this was sent from mobile. Will do less grammar and spell
>>>>>>>> check than usual.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
