Wide rows splitting

2017-09-17 Thread Adam Smith
Dear community,

I have a table with inlinks to URLs, i.e. many URLs point to
http://google.com, fewer URLs point to http://somesmallweb.page.

It has very wide and very skinny rows - the distribution is following a
power law. I do not know a priori how many columns a row has. Also, I can't
identify a schema to introduce a good partitioning.

Currently, I am thinking about introducing splits: the pk would be (URL,
splitnumber), where the number of splits is initially 1 and hash URL mod
the number of splits determines the splitnumber on insert. I would need a
separate table to maintain the number of splits, and a
spark-cassandra-connector job that counts the columns and increases/doubles
the number of splits on demand. This means that I would then have to move
e.g. (URL1,0) -> (URL1,1) when the number of splits becomes 2.
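
A minimal sketch of the split scheme described above, for illustration only. It assumes the hash is taken over the inlink (the column) rather than over the target URL, since hashing the target URL alone would send every column of a row to the same split; that choice, the MD5-based hash and all function names are assumptions, not part of the original proposal.

```python
import hashlib


def split_for(inlink: str, num_splits: int) -> int:
    """Map an inlink to a split index in [0, num_splits).

    A stable hash is used because Python's built-in hash() for strings
    is randomized per process and would not be reproducible across jobs.
    """
    digest = hashlib.md5(inlink.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_splits


def partition_key(url: str, inlink: str, num_splits: int) -> tuple:
    """Build the (URL, splitnumber) partition key for one inlink."""
    return (url, split_for(inlink, num_splits))


def moved_on_doubling(inlinks, old_splits: int) -> list:
    """Return the inlinks whose split index changes when splits double."""
    new_splits = old_splits * 2
    return [
        link for link in inlinks
        if split_for(link, old_splits) != split_for(link, new_splits)
    ]


if __name__ == "__main__":
    links = [f"http://site{i}.example/page" for i in range(10)]
    for link in links:
        print(partition_key("http://google.com", link, 2))
    print(len(moved_on_doubling(links, 2)), "of", len(links), "would move")
```

With doubling, an entry in split s either stays in s or moves to s + (old number of splits), because h mod 2n is always either h mod n or (h mod n) + n. Under a uniform hash, roughly half of a row's columns would have to be rewritten per doubling.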

Would you do the same? Is there a better way?

Thanks!
Adam


C* as fluent data storage, 10MB/sec/node?

2018-11-28 Thread Adam Smith
Hi All,

I need to use C* somehow as fluent data storage - maybe this is different
from the queue antipattern? Lots of data comes in (10MB/sec/node), remains
for e.g. 1 hour and should then be evicted. It is not critical if data
occasionally disappears/gets lost.
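
As a rough back-of-the-envelope check of the figures above: the sketch below only multiplies the stated ingest rate by the stated retention time. It assumes that 10MB/sec/node is what each node actually stores (replicas included) and it ignores compression and compaction overhead; the variable names are made up for illustration.

```python
# Figures taken from the message above.
INGEST_MB_PER_SEC_PER_NODE = 10   # "10MB/sec/node"
RETENTION_SECONDS = 60 * 60       # data "remains for e.g. 1 hour"

# Steady-state volume of not-yet-expired data held by one node.
# Assumption: the ingest figure is what lands on the node after replication.
live_mb_per_node = INGEST_MB_PER_SEC_PER_NODE * RETENTION_SECONDS
live_gb_per_node = live_mb_per_node / 1000  # decimal GB

print(f"live data per node: {live_mb_per_node} MB (~{live_gb_per_node:.0f} GB)")
```

The on-disk footprint will be higher than this while expired data is still waiting to be dropped.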

Thankful for any advice!

Is this possible nowadays without suffering too much from compaction? I
would not have range tombstones and, depending on the solution, would only
use point deletes (PK+CK). There is only one CK, which could also be empty.

1) The data is usually 1 MB. Can I just update with empty data? PK + CK
would remain, but I would not care about that. Would this create
tombstones, or is it equivalent to a DELETE?

2) Like 1), and then later set a TTL, so that only a small amount of data
has to be deleted? And hopefully only a small compaction?

3) Simply setting a TTL of 1h and hoping for the best, because my worries
are unfounded?

4) Any optimization strategies like setting the RF to 1? Which compaction
strategy is advised?

5) Are there any recent performance benchmarks for one of the scenarios?

What else could I do?

Thanks a lot!
Adam


Re: C* as fluent data storage, 10MB/sec/node?

2018-11-28 Thread Adam Smith
Thanks for the excellent advice, this was extremely helpful! Did not know
about TWCS... curing a lot of headache.

Adam

On Wed., 28 Nov. 2018 at 20:47, Jeff Jirsa wrote:

> Probably fine as long as there’s some concept of time in the partition key
> to keep them from growing unbounded.
>
> Use TWCS, TTLs and something like 5-10 minute buckets. Don’t use RF=1, but
> you can write at CL ONE. TWCS will largely just drop whole sstables as they
> expire (especially with 3.11 and the more aggressive expiration logic there)
>
>
>
> --
> Jeff Jirsa
>
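
The advice above (a time component in the partition key, TWCS, TTLs, 5-10 minute buckets) could look roughly like the following sketch. The keyspace, table and column names are hypothetical, and using the same 10-minute size for both the partition-key bucket and the TWCS window is an assumption; the reply does not say which of the two the "buckets" refer to. The CQL is held in strings so the block runs without a cluster.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 10 * 60   # assumed: upper end of the "5-10 minute buckets"
TTL_SECONDS = 60 * 60      # data should be evicted after about 1 hour


def time_bucket(ts: datetime, bucket_seconds: int = BUCKET_SECONDS) -> datetime:
    """Round a timezone-aware timestamp down to the start of its bucket."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % bucket_seconds, tz=timezone.utc)


# Hypothetical schema: the bucket is part of the partition key so that no
# partition grows unbounded, and the table-level TTL expires rows after 1 hour.
CREATE_TABLE_CQL = f"""
CREATE TABLE demo.fluent_data (
    source  text,
    bucket  timestamp,
    id      timeuuid,
    payload blob,
    PRIMARY KEY ((source, bucket), id)
) WITH compaction = {{
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'MINUTES',
    'compaction_window_size': '10'
}}
AND default_time_to_live = {TTL_SECONDS};
"""

# Writes would be issued at consistency level ONE via the driver of choice.
INSERT_CQL = (
    "INSERT INTO demo.fluent_data (source, bucket, id, payload) "
    "VALUES (?, ?, ?, ?)"
)

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("bucket for", now.isoformat(), "is", time_bucket(now).isoformat())
    print(CREATE_TABLE_CQL)
```

Because every row in a window expires at about the same time, TWCS can drop whole sstables once they are fully expired instead of compacting tombstones away.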
>