Content management (large blobs such as images and video) can be done with
Cassandra, but it is tricky and great care is needed. As with any Cassandra
app, you need to model your data based on how you intend to query and
access the data. You can certainly access large amounts of data with
Cassandra, but you need to choose the chunk size carefully, balancing the
size of an individual request against the total number of requests and the
degree of parallelism between them. That
means balancing between designing for locality for efficient single
requests, and designing for distributed storage for efficient parallel
requests. Unfortunately, there is no magic answer for what the chunk size
should be. In this case it may be driven by what fraction of the image can
be efficiently rendered as well as transferred over the network, coupled
with how much load there may be on each server node due to the number of
clients making requests. Only a proof-of-concept implementation can give
you the answers that you need.
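
For illustration only, a chunked layout might look something like the
sketch below (the table and column names are invented, not taken from any
real schema). With the chunk index inside the partition key, an image's
chunks are spread around the cluster; moving it into the clustering key
would instead favor single-node locality:

    CREATE TABLE images.image_chunks (
        image_id uuid,     -- the image this chunk belongs to
        chunk_index int,   -- position of the chunk within the image
        chunk_data blob,   -- the chunk payload, e.g. on the order of 1 MB
        PRIMARY KEY ((image_id, chunk_index))
    );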

I would surmise that your optimal chunk size for image processing might be
on the order of 1 MB. Maybe smaller, like 100 KB, or maybe bigger, like
2 MB or 5 MB, but I don't imagine that individual Cassandra requests of
more than a few MB each are going to work out well. And if
you carefully design your partition key so that chunks for an image will be
distributed around the cluster, then multiple chunks can be fetched in
parallel without causing a single request to tie up server resources for
too long. And even if requests come in to the same server, there is some
benefit to each being dispatched separately in parallel, so that the
results of one can be shipped back to the client even as the other(s) are
still being processed.
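
As a rough sketch of that parallel access pattern with the DataStax Java
driver (assuming the illustrative image_chunks table above; the image id
and chunk count below are placeholders), you could issue all chunk reads
asynchronously and collect them as they complete:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class ChunkFetcher {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("images");
            PreparedStatement ps = session.prepare(
                "SELECT chunk_data FROM image_chunks WHERE image_id = ? AND chunk_index = ?");

            UUID imageId = UUID.randomUUID(); // placeholder; a real id comes from metadata
            int chunkCount = 20;              // placeholder; a real count comes from metadata

            // Fire all chunk requests in parallel; each hits its own partition.
            List<ResultSetFuture> futures = new ArrayList<>();
            for (int i = 0; i < chunkCount; i++) {
                futures.add(session.executeAsync(ps.bind(imageId, i)));
            }
            // Collect the chunks; ones that finish early could be shipped to
            // the client while the rest are still in flight.
            for (ResultSetFuture future : futures) {
                Row row = future.getUninterruptibly().one();
                if (row != null) {
                    // hand row.getBytes("chunk_data") to the renderer here
                }
            }
            cluster.close();
        }
    }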

A single large request is an anti-pattern for Cassandra, as is a large
number of very small requests (to satisfy one client operation).

As mentioned before, you may need to come up with a unit of storage larger
than a single or partial scan line of an image; some sort of bucket or
chunk sized to optimize transfer is best.

You will also have to look at how many total requests you will need to
render an image for the client. I would surmise that you wouldn't want that
to take more than a few dozen Cassandra operations in a small number of
seconds.

-- Jack Krupansky

On Thu, Mar 19, 2015 at 8:57 AM, Kai Wang <dep...@gmail.com> wrote:

> With your reading path and data model, it doesn't matter how many nodes
> you have. All data with the same image_caseid is physically located on one
> node (well, on RF nodes, but only one of those will try to serve your
> query). Instead of taking advantage of Cassandra, you are creating hot
> spots on both reads and writes. The first step I would take is to use
> image_caseid-Area as the partition key. This breaks the query into small
> parallel ones against partitions on different nodes.
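>
> Since Area is a float, one common way to realize this (purely an
> illustration; the bucket width is an assumption, and you would keep your
> other payload columns as they are) is to bucket Area so the client can
> enumerate the buckets:
>
>     CREATE TABLE images.results_by_bucket (
>         image_caseid varchar,
>         area_bucket int,    -- e.g. (int)(Area / 10), computed at write time
>         Area float,
>         uuid uuid,
>         PRIMARY KEY ((image_caseid, area_bucket), Area, uuid)
>     );
>
> A query for 20 < Area < 100 then becomes a handful of small per-bucket
> queries that can run in parallel against different nodes.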
>
> On Wed, Mar 18, 2015 at 6:12 AM, Mehak Mehta <meme...@cs.stonybrook.edu>
> wrote:
>
>> Yes, I have a cluster of 10 nodes in total, but I am currently testing
>> with just one node.
>> The total data across all nodes will exceed 5 billion rows, but I may
>> have more memory on the other nodes.
>>
>> On Wed, Mar 18, 2015 at 6:06 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>>> 4 GB also seems small for the kind of load you are trying to handle
>>> (billions of rows).
>>>
>>> I would also try adding more nodes to the cluster.
>>>
>>> On Wed, Mar 18, 2015 at 2:53 PM, Ali Akhtar <ali.rac...@gmail.com>
>>> wrote:
>>>
>>>> Yeah, it may be that the process is being limited by swap. This page:
>>>>
>>>>
>>>> https://gist.github.com/aliakhtar/3649e412787034156cbb#file-cassandra-install-sh-L42
>>>>
>>>> Lines 42-48 list a few settings you could try for increasing or
>>>> reducing the memory limits (assuming you're on Linux).
>>>>
>>>> Also, are you using an SSD? If so, make sure the I/O scheduler is noop
>>>> or deadline.
>>>>
>>>> On Wed, Mar 18, 2015 at 2:48 PM, Mehak Mehta <meme...@cs.stonybrook.edu
>>>> > wrote:
>>>>
>>>>> Currently the Cassandra Java process is taking 1% of CPU (8% is being
>>>>> used in total) and 14.3% of memory (out of 4 GB total).
>>>>> As you can see, there is not much load from other processes.
>>>>>
>>>>> Should I try changing the default memory parameters in the Cassandra
>>>>> settings?
>>>>>
>>>>> On Wed, Mar 18, 2015 at 5:33 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> What's your memory/CPU usage at? And how much RAM and CPU do you have
>>>>>> on this server?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 18, 2015 at 2:31 PM, Mehak Mehta <
>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>
>>>>>>> Currently there is only a single node, which I am calling directly,
>>>>>>> with around 150000 rows. The full data will be around billions of
>>>>>>> rows per node.
>>>>>>> The code works only for a fetch size of 100/200, and each consecutive
>>>>>>> fetch takes around 5-10 seconds.
>>>>>>>
>>>>>>> I have a parallel script inserting the data while I am reading it.
>>>>>>> When I stopped the script, it worked for 500/1000 but not more than
>>>>>>> that.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 18, 2015 at 5:08 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If even 500-1000 isn't working, then your Cassandra node might not
>>>>>>>> be up.
>>>>>>>>
>>>>>>>> 1) Try running nodetool status from a shell on your Cassandra
>>>>>>>> server, and make sure the nodes are up.
>>>>>>>>
>>>>>>>> 2) Are you calling this on the same server where Cassandra is
>>>>>>>> running? It's trying to connect to localhost. If you're running it
>>>>>>>> on a different server, try passing in the direct IP of your
>>>>>>>> Cassandra server.
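>>>>>>>>
>>>>>>>> For example, a minimal connectivity check might look like this (the
>>>>>>>> IP below is only a placeholder for your node's address):
>>>>>>>>
>>>>>>>>     import com.datastax.driver.core.Cluster;
>>>>>>>>     import com.datastax.driver.core.Session;
>>>>>>>>
>>>>>>>>     public class ConnectTest {
>>>>>>>>         public static void main(String[] args) {
>>>>>>>>             // Connect to the node's reachable IP, not localhost.
>>>>>>>>             Cluster cluster = Cluster.builder()
>>>>>>>>                     .addContactPoint("10.0.0.5")  // placeholder address
>>>>>>>>                     .build();
>>>>>>>>             Session session = cluster.connect();
>>>>>>>>             // If this prints a version, connectivity is fine.
>>>>>>>>             System.out.println(session.execute(
>>>>>>>>                     "SELECT release_version FROM system.local").one());
>>>>>>>>             cluster.close();
>>>>>>>>         }
>>>>>>>>     }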
>>>>>>>>
>>>>>>>> On Wed, Mar 18, 2015 at 2:05 PM, Mehak Mehta <
>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>
>>>>>>>>> The data won't change much, but the queries will be different.
>>>>>>>>> I am not working on the rendering tool myself, so I don't know many
>>>>>>>>> details about it.
>>>>>>>>>
>>>>>>>>> Also, as you suggested, I tried to fetch the data in pages of 500
>>>>>>>>> or 1000 with the Java driver's auto-pagination.
>>>>>>>>> It fails when the number of records is high (around 100000) with
>>>>>>>>> the following error:
>>>>>>>>>
>>>>>>>>> Exception in thread "main"
>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All
>>>>>>>>> host(s) tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out
>>>>>>>>> waiting for server response))
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 18, 2015 at 4:47 AM, Ali Akhtar <ali.rac...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> How often does the data change?
>>>>>>>>>>
>>>>>>>>>> I would still recommend caching of some kind, but without knowing
>>>>>>>>>> more details (how often the data is changing, what you're doing
>>>>>>>>>> with the 1m rows after getting them, etc.) I can't recommend a
>>>>>>>>>> solution.
>>>>>>>>>>
>>>>>>>>>> I did see your other thread. I would also vote for Elasticsearch
>>>>>>>>>> or Solr; they are better suited to the kind of analytics you seem
>>>>>>>>>> to be doing. Cassandra is more for storing data; it isn't all that
>>>>>>>>>> great for complex queries / analytics.
>>>>>>>>>>
>>>>>>>>>> If you want to stick with Cassandra, you might have better luck
>>>>>>>>>> if you made your range columns part of the primary key, so
>>>>>>>>>> something like PRIMARY KEY (caseId, x, y).
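>>>>>>>>>>
>>>>>>>>>> As a sketch (only the key-related columns are shown, and the table
>>>>>>>>>> name is made up), that would let you range-restrict x within a
>>>>>>>>>> case, and y within a fixed x:
>>>>>>>>>>
>>>>>>>>>>     CREATE TABLE images.results_by_xy (
>>>>>>>>>>         caseid varchar,
>>>>>>>>>>         x double,
>>>>>>>>>>         y double,
>>>>>>>>>>         uuid uuid,  -- kept so rows stay unique
>>>>>>>>>>         PRIMARY KEY ((caseid), x, y, uuid)
>>>>>>>>>>     );
>>>>>>>>>>
>>>>>>>>>>     SELECT * FROM results_by_xy
>>>>>>>>>>     WHERE caseid = 'some-case' AND x >= 10 AND x <= 20;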
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 18, 2015 at 1:41 PM, Mehak Mehta <
>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> The rendering tool renders a portion of a very large image. It
>>>>>>>>>>> may fetch different data each time from billions of rows, so I
>>>>>>>>>>> don't think I can cache such large results, since the same
>>>>>>>>>>> results will rarely be fetched again.
>>>>>>>>>>>
>>>>>>>>>>> Also, do you know how I can do 2D range queries using Cassandra?
>>>>>>>>>>> Some other users suggested using Solr, but is there any way I can
>>>>>>>>>>> achieve that without using any other technology?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 18, 2015 at 4:33 AM, Ali Akhtar <
>>>>>>>>>>> ali.rac...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry, meant to say "that way when you have to render, you can
>>>>>>>>>>>> just display the latest cache."
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 18, 2015 at 1:30 PM, Ali Akhtar <
>>>>>>>>>>>> ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I would probably do this in a background thread and cache the
>>>>>>>>>>>>> results; that way, when you have to render, you can just cache
>>>>>>>>>>>>> the latest results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't know why Cassandra doesn't seem to be able to fetch
>>>>>>>>>>>>> large batch sizes; I've also run into these timeouts, but
>>>>>>>>>>>>> reducing the batch size to 2k seemed to work for me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 1:24 PM, Mehak Mehta <
>>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have a UI which needs this data for rendering, so the
>>>>>>>>>>>>>> efficiency of pulling this data matters a lot; it should be
>>>>>>>>>>>>>> fetched within a minute. Is there a way to achieve such
>>>>>>>>>>>>>> efficiency?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 4:06 AM, Ali Akhtar <
>>>>>>>>>>>>>> ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Perhaps just fetch them in batches of 1000 or 2000? For 1m
>>>>>>>>>>>>>>> rows, it seems like the difference would only be a few minutes. 
>>>>>>>>>>>>>>> Do you have
>>>>>>>>>>>>>>> to do this all the time, or only once in a while?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:34 PM, Mehak Mehta <
>>>>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, it works for 1000 but not more than that.
>>>>>>>>>>>>>>>> How can I fetch all the rows efficiently this way?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:29 AM, Ali Akhtar <
>>>>>>>>>>>>>>>> ali.rac...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Have you tried a smaller fetch size, such as 2k - 5k?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 12:22 PM, Mehak Mehta <
>>>>>>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have tried with a fetch size of 10000 and it is still
>>>>>>>>>>>>>>>>>> not giving any results.
>>>>>>>>>>>>>>>>>> My expectation was that Cassandra could handle a million
>>>>>>>>>>>>>>>>>> rows easily.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Is there any mistake in the way I am defining the keys or
>>>>>>>>>>>>>>>>>> querying them?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Mehak
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 3:02 AM, Jens Rantil <
>>>>>>>>>>>>>>>>>> jens.ran...@tink.se> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Try setting the fetch size before querying. Assuming you
>>>>>>>>>>>>>>>>>>> don't set it too high, and you don't have too many
>>>>>>>>>>>>>>>>>>> tombstones, that should do it.
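>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> With the Java driver, that would be roughly the sketch
>>>>>>>>>>>>>>>>>>> below (it assumes an existing session, and the page size
>>>>>>>>>>>>>>>>>>> of 1000 is just a starting point to tune):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     import com.datastax.driver.core.*;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     Statement stmt = new SimpleStatement(
>>>>>>>>>>>>>>>>>>>             "SELECT * FROM results WHERE image_caseid = 'some-case'");
>>>>>>>>>>>>>>>>>>>     stmt.setFetchSize(1000);  // rows per page; lower it if you see timeouts
>>>>>>>>>>>>>>>>>>>     for (Row row : session.execute(stmt)) {
>>>>>>>>>>>>>>>>>>>         // the driver pulls further pages transparently as you iterate
>>>>>>>>>>>>>>>>>>>     }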
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Jens
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> –
>>>>>>>>>>>>>>>>>>> Sent from Mailbox <https://www.dropbox.com/mailbox>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Mar 18, 2015 at 2:58 AM, Mehak Mehta <
>>>>>>>>>>>>>>>>>>> meme...@cs.stonybrook.edu> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I have a requirement to fetch a million rows as the
>>>>>>>>>>>>>>>>>>>> result of my query, which is giving timeout errors.
>>>>>>>>>>>>>>>>>>>> I am fetching results by selecting on clustering
>>>>>>>>>>>>>>>>>>>> columns, so why are the queries taking so long? I can
>>>>>>>>>>>>>>>>>>>> change the timeout settings, but I need the data to be
>>>>>>>>>>>>>>>>>>>> fetched faster as per my requirements.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My table definition is:
>>>>>>>>>>>>>>>>>>>> CREATE TABLE images.results (
>>>>>>>>>>>>>>>>>>>>     uuid uuid,
>>>>>>>>>>>>>>>>>>>>     analysis_execution_id varchar,
>>>>>>>>>>>>>>>>>>>>     analysis_execution_uuid uuid,
>>>>>>>>>>>>>>>>>>>>     x double, y double,
>>>>>>>>>>>>>>>>>>>>     loc varchar,
>>>>>>>>>>>>>>>>>>>>     w double, h double,
>>>>>>>>>>>>>>>>>>>>     normalized varchar, type varchar,
>>>>>>>>>>>>>>>>>>>>     filehost varchar, filename varchar,
>>>>>>>>>>>>>>>>>>>>     image_uuid uuid, image_uri varchar,
>>>>>>>>>>>>>>>>>>>>     image_caseid varchar,
>>>>>>>>>>>>>>>>>>>>     image_mpp_x double, image_mpp_y double,
>>>>>>>>>>>>>>>>>>>>     image_width double, image_height double,
>>>>>>>>>>>>>>>>>>>>     objective double, cancer_type varchar,
>>>>>>>>>>>>>>>>>>>>     Area float,
>>>>>>>>>>>>>>>>>>>>     submit_date timestamp,
>>>>>>>>>>>>>>>>>>>>     points list<double>,
>>>>>>>>>>>>>>>>>>>>     PRIMARY KEY ((image_caseid), Area, uuid)
>>>>>>>>>>>>>>>>>>>> );
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Here each row is uniquely identified by its uuid, but
>>>>>>>>>>>>>>>>>>>> since my data is generally queried by image_caseid, I
>>>>>>>>>>>>>>>>>>>> have made that the partition key.
>>>>>>>>>>>>>>>>>>>> I am currently using the DataStax Java API to fetch the
>>>>>>>>>>>>>>>>>>>> results, but the query is taking a lot of time,
>>>>>>>>>>>>>>>>>>>> resulting in timeout errors:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Exception in thread "main"
>>>>>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All
>>>>>>>>>>>>>>>>>>>> host(s) tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out
>>>>>>>>>>>>>>>>>>>> waiting for server response))
>>>>>>>>>>>>>>>>>>>>   at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
>>>>>>>>>>>>>>>>>>>>   at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:289)
>>>>>>>>>>>>>>>>>>>>   at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:205)
>>>>>>>>>>>>>>>>>>>>   at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
>>>>>>>>>>>>>>>>>>>>   at QueryDB.queryArea(TestQuery.java:59)
>>>>>>>>>>>>>>>>>>>>   at TestQuery.main(TestQuery.java:35)
>>>>>>>>>>>>>>>>>>>> Caused by:
>>>>>>>>>>>>>>>>>>>> com.datastax.driver.core.exceptions.NoHostAvailableException: All
>>>>>>>>>>>>>>>>>>>> host(s) tried for query failed (tried: localhost/127.0.0.1:9042
>>>>>>>>>>>>>>>>>>>> (com.datastax.driver.core.exceptions.DriverException: Timed out
>>>>>>>>>>>>>>>>>>>> waiting for server response))
>>>>>>>>>>>>>>>>>>>>   at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:108)
>>>>>>>>>>>>>>>>>>>>   at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:179)
>>>>>>>>>>>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>>>>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>>>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Also, when I try the same query on the console, even
>>>>>>>>>>>>>>>>>>>> with a limit of 2000 rows, it fails:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> cqlsh:images> select count(*) from results where
>>>>>>>>>>>>>>>>>>>> image_caseid='TCGA-HN-A2NL-01Z-00-DX1' and Area<100 and 
>>>>>>>>>>>>>>>>>>>> Area>20 limit 2000;
>>>>>>>>>>>>>>>>>>>> errors={}, last_host=127.0.0.1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>>>>>>>>> Mehak
