You can definitely read all of columns in a single SELECT. And the n-INSERTS can be batched and will insert fewer cells in the storage engine than the previous approach.
-- Jack Krupansky On Sun, Feb 14, 2016 at 7:31 PM, Gianluca Borello <[email protected]> wrote: > Thank you for your reply. > > Your advice is definitely sound, although it still seems suboptimal to me > because: > > 1) It requires N INSERT queries from the application code (where N is the > number of columns) > > 2) It requires N SELECT queries from my application code (where N is the > number of columns I need to read at any given time, which is determined at > runtime). I can't even use the IN operator (e.g. WHERE column_number IN (1, > 2, 3, ...)) because I am already using a non-EQ relation on the timestamp > key and Cassandra restricts me to only one non-EQ relation. > > In summary, I can (and will) adapt my code to use a similar approach > despite everything, but the goal of my message was mainly to understand why > the jira issues I linked above are not full of dozens of "+1" comments. > > To me this really feels like a terrible performance issue that should be > fixed by default (or in the very worst case clearly documented), even after > understanding the motivation for reading all the columns in the CQL row. > > Thanks > > On Sun, Feb 14, 2016 at 3:05 PM, Jack Krupansky <[email protected]> > wrote: > >> You could add the column number as an additional clustering key. And then >> you can actually use COMPACT STORAGE for even more efficient storage and >> access (assuming there is only a single non-PK data column, the blob >> value.) You can then access (read or write) an individual column/blob or a >> slice of them. >> >> -- Jack Krupansky >> >> On Sun, Feb 14, 2016 at 5:22 PM, Gianluca Borello <[email protected]> >> wrote: >> >>> Hi >>> >>> I've just painfully discovered a "little" detail in Cassandra: Cassandra >>> touches all columns on a CQL select (related issues >>> https://issues.apache.org/jira/browse/CASSANDRA-6586, >>> https://issues.apache.org/jira/browse/CASSANDRA-6588, >>> https://issues.apache.org/jira/browse/CASSANDRA-7085). >>> >>> My data model is fairly simple: I have a bunch of "sensors" reporting a >>> blob of data (~10-100KB) periodically. When reading, 99% of the times I'm >>> interested in a subportion of that blob of data across an arbitrary period >>> of time. What I do is simply splitting those blobs of data in about 30 >>> logical units and write them in a CQL table such as: >>> >>> create table data ( >>> id bigint, >>> ts bigint, >>> column1 blob, >>> column2 blob, >>> column3 blob, >>> ... >>> column29 blob, >>> column30 blob >>> primary key (id, ts) >>> >>> id is a combination of the sensor id and a time bucket, in order to not >>> get the row too wide. Essentially, I thought this was a very legit data >>> model that helps me keep my application code very simple (because I can >>> work on a single table, I can write a split sensor blob in a single CQL >>> query and I can read a subset of the columns very efficiently with one >>> single CQL query). >>> >>> What I didn't realize is that Cassandra seems to always process all the >>> columns of the CQL row, regardless of the fact that my query asks just one >>> column, and this has dramatic effect on the performance of my reads. >>> >>> I wrote a simple isolated test case where I test how long it takes to >>> read one *single* column in a CQL table composed of several columns (at >>> each iteration I add and populate 10 new columns), each filled with 1MB >>> blobs: >>> >>> 10 columns: 209 ms >>> 20 columns: 339 ms >>> 30 columns: 510 ms >>> 40 columns: 670 ms >>> 50 columns: 884 ms >>> 60 columns: 1056 ms >>> 70 columns: 1527 ms >>> 80 columns: 1503 ms >>> 90 columns: 1600 ms >>> 100 columns: 1792 ms >>> >>> In other words, even if the result set returned is exactly the same >>> across all these iteration, the response time increases linearly with the >>> size of the other columns, and this is really causing a lot of problems in >>> my application. >>> >>> By reading the JIRA issues, it seems like this is considered a very >>> minor optimization not worth the effort of fixing, so I'm asking: is my use >>> case really so anomalous that the horrible performance that I'm >>> experiencing are to be considered "expected" and need to be fixed with some >>> painful column family splitting and messy application code? >>> >>> Thanks >>> >> >> >
