Data modeling a write-intensive comment storage cluster

2014-01-25 Thread fxmy wang
Greetings List,
I'm a new guy whose only experience is with RDBMSs, so please
enlighten me if I'm doing something silly.

So I'm trying to use Riak for storing video comments: each one is small, but
there is a huge number of them.
Prerequisites:

- One bucket per video.
- Keys will consist of a timestamp and a userID.
- Values will be plain text containing a short comment and some tags.
  They should not be larger than 10KB.
- Values are seldom modified.
- Write-intensive: some hot videos may have ~100,000 people watching at the
  same time.
- There will be multiple Erlang PB clients doing writes.

Then here are my questions:
1) To get better write throughput, is it right to set w=1?
2) What's the best way to query these comments? In this use case I don't
need to retrieve all the comments in a bucket, just the latest few
hundred (if there are that many), ordered by the time they were posted.

Right now I'm thinking of using link-walking and keeping track of the
latest comment, so I can trace backwards to get the latest 500 comments
(for example). When a new comment comes in, I point its link to the old
latest comment, then update the latest-comment mark.

So in the scenario above, is it possible that after one client has written
on nodeA and modified the latest-mark, another client on nodeB does not yet
see the change and points its link to the old comment, resulting in a
"branch" in the chain?
If this can happen, what can be done to avoid it? Are there any
better ways to store and query those comments? Any reply is appreciated.
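
The branching scenario can be reproduced without a cluster. What follows is a minimal Python sketch, not Riak code: a plain dict stands in for the bucket, and the `latest` key and `prev` field are hypothetical names for the latest-comment mark and the backward link. The stale read is forced by letting both clients read the mark before either one writes.

```python
# Toy model of the "latest mark" chain; a dict stands in for the bucket.
store = {"latest": "c1", "c1": {"text": "first", "prev": None}}

def read_latest(store):
    """Step 1 of a post: read the current latest-comment mark."""
    return store["latest"]

def write_comment(store, key, text, seen_latest):
    """Steps 2 and 3: link the new comment to the mark that was read, then move the mark."""
    store[key] = {"text": text, "prev": seen_latest}
    store["latest"] = key

def walk_back(store, limit=500):
    """Follow prev links from the latest mark, newest first."""
    keys, key = [], store["latest"]
    while key is not None and len(keys) < limit:
        keys.append(key)
        key = store[key]["prev"]
    return keys

# Two clients read the mark before either has written: both see c1.
seen_a = read_latest(store)
seen_b = read_latest(store)
write_comment(store, "c2", "from client A", seen_a)
write_comment(store, "c3", "from client B", seen_b)

chain = walk_back(store)
orphaned = [k for k in store if k != "latest" and k not in chain]
print(chain)     # keys reachable from the mark
print(orphaned)  # keys that were written but fell off the chain
```

Both c2 and c3 end up pointing at c1, and only the comment whose writer moved the mark last stays reachable from it. That is the "branch" asked about above.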

B.R.
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Data modeling a write-intensive comment storage cluster

2014-01-26 Thread fxmy wang
Thanks for the response, Jeremiah.



> > Then here are my questions:
> > 1) To get better writing throughput, is it right to set the w=1?
>
> This will improve perceived throughput at the client, but it won't improve 
> throughput at the server.

Thank you for clarifying this for me :D
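
For anyone else reading along, the difference between client-side and server-side throughput can be put into a toy Python model. It assumes three replicas per object and made-up per-replica write times, and it does not talk to Riak; `put_latency` and `server_work` are hypothetical helper names.

```python
def put_latency(replica_times_ms, w):
    """Client-visible latency: the reply is sent once w replicas have acknowledged."""
    return sorted(replica_times_ms)[w - 1]

def server_work(replica_times_ms):
    """Work done by the cluster: every replica still performs the write, whatever w is."""
    return sum(replica_times_ms)

replica_times_ms = [4, 9, 30]  # hypothetical write times for the 3 replicas

for w in (1, 2, 3):
    print(w, put_latency(replica_times_ms, w), server_work(replica_times_ms))
```

Lowering w shortens the wait seen by the client, but the last column stays the same: the cluster does the same amount of writing either way.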

> > 2) What's the best way to query these comments? In this use case I don't
> > need to retrieve all the comments in a bucket, just the latest few
> > hundred (if there are that many), ordered by the time they were posted.
> >
> > Right now I'm thinking of using link-walking and keeping track of the
> > latest comment, so I can trace backwards to get the latest 500 comments
> > (for example). When a new comment comes in, I point its link to the old
> > latest comment, then update the latest-comment mark.
> >
>
> I wouldn't use link-walking. IIRC this uses MapReduce under the covers. You 
> could use a single key to store the most recent comment.

What's bad about MapReduce?
There will be another cache layer on top of the cluster, so read
operations will be relatively infrequent. That's why I chose
link-walking.

> You can get the most recent n keys using secondary index queries on the 
> $bucket index, sorting, and pagination.

I'm not sure what you mean here =.=
How can I query the most recent n keys using 2i? Should I put a timestamp
(say, rounded to the hour) in a 2i entry on each incoming comment, and then
query 2i by hour segment? That seems a little blind, because some videos
could go a long time before being commented on again. Querying by time
segment feels like shooting in the dark to me :\

And the docs say the list-keys operation should not be used in production,
so that's a no-go either :\
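
One possible reading of the "sorting and pagination" suggestion, sketched in Python as a toy model rather than a Riak query: if keys are built so that their lexicographic order is newest-first, the first page of a sorted key listing is already the most recent n, with no guessing at time segments. The inverted-timestamp key layout, `MAX_TS`, `comment_key`, and `page` are all assumptions made for illustration; `page` only imitates a paginated query that takes a maximum result count and a continuation.

```python
import bisect

MAX_TS = 10**13 - 1  # assumed upper bound for millisecond timestamps (13 digits)

def comment_key(ts_ms, user_id):
    """Key whose lexicographic order is newest-first: zero-padded inverted timestamp + user."""
    return f"{MAX_TS - ts_ms:013d}_{user_id}"

def page(sorted_keys, max_results, continuation=None):
    """Return up to max_results keys after the continuation key, plus the next continuation."""
    start = 0 if continuation is None else bisect.bisect_right(sorted_keys, continuation)
    chunk = sorted_keys[start:start + max_results]
    more = start + max_results < len(sorted_keys)
    return chunk, (chunk[-1] if chunk and more else None)

# Five comments posted at increasing times (hypothetical data).
posts = [(1390600000000 + i * 1000, f"user{i}") for i in range(5)]
index = sorted(comment_key(ts, uid) for ts, uid in posts)

first, cont = page(index, 2)
second, cont2 = page(index, 2, cont)
print([k.split("_")[1] for k in first])   # the two newest posters
print([k.split("_")[1] for k in second])  # the next two
```

Whether this matches what was meant by querying the $bucket index is a guess on my part; the sketch only shows that sorted keys plus a continuation are enough to page backwards from the newest comment.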


> > So in the scenario above, is it possible that after one client has written
> > on nodeA and modified the latest-mark, another client on nodeB does not yet
> > see the change and points its link to the old comment, resulting in a
> > "branch" in the chain?
> > If this can happen, what can be done to avoid it? Are there any
> > better ways to store and query those comments? Any reply is appreciated.
>
> You can avoid siblings by serializing all of your writes through a single 
> writer. That's not a great idea since you lose many of Riak's benefits.
> You could also use a CRDT with a register type. These tend toward the last 
> writer.

My goal is to form a kind of single-line relationship, ordered by
timestamp, across the keys under highly concurrent write pressure.
Through this relationship I can easily pick out the last
hundreds/thousands of comments.
As Jeremiah said, serializing all writes through a single writer
avoids siblings entirely. Note that we don't have key clashes
here: every comment has a unique key. What we want is the
single-line relationship. So how about this:

Multiple Erlang PB clients just do the writes and don't care about the
lining up.
Post-commit hooks notify one special globally registered
process (which would be running in the Riak cluster?) that "here
comes a new comment, line it up when appropriate".
Is this feasible? And if it is, how should I prepare for the cluster
partition and rejoin scenario when the network fails?

> The point is that you need to decide how you want to deal with this type of 
> scenario - it's going to happen. In a worst case; you lose a write briefly.

Hopefully the method above could avoid this :)

Everyone, please share your thoughts. _(:3JZ)_

B.R.



Question about link-walk results returned by erlang_pb_client

2014-03-14 Thread fxmy wang
Hi, list,

This should be a trivial question and I think I'm definitely missing
something (and feeling stupid :\).

So when I do a chained link walk through the HTTP interface like
this (copied from the link walking docs):

> curl -v localhost:8091/riak/people/davethomas/_,friend,1/_,friend,_/


The output is quite verbose, including Bucket/Key/Value, etc.

> --JCgqdOHsL4BdXPCb0cuQDnLTxOH
> Content-Type: multipart/mixed; boundary=LpfqXc9urbAJJNFH7aGGPBiAtnX
>
> --LpfqXc9urbAJJNFH7aGGPBiAtnX
> X-Riak-Vclock: a85hYGBgzGDKBVIcc+04TgWFOj/NYEpkzGNlyNCadoYvCwA=
> Location: /riak/people/timoreilly
> Content-Type: text/plain
> Link: ; riaktag="friend", ; rel="up"
> Etag: 3DmGNeyDj2hUlLR2UhJvMr
> Last-Modified: Thu, 13 Mar 2014 13:11:04 GMT
>
> I am an excellent public speaker.
> --LpfqXc9urbAJJNFH7aGGPBiAtnX--
>
> --JCgqdOHsL4BdXPCb0cuQDnLTxOH
> Content-Type: multipart/mixed; boundary=IcBLyeIFObvJlJGyXuhTty5cRSs
>
> --IcBLyeIFObvJlJGyXuhTty5cRSs
> X-Riak-Vclock: a85hYGBgzGDKBVIcR4M2cgeFOkdkMCUy5rEyzNSadoYvCwA=
> Location: /riak/people/dhh
> Content-Type: text/plain
> Link: ; rel="up"
> Etag: 4qbA2ZufXNgzFRb8PlSLUO
> Last-Modified: Thu, 13 Mar 2014 13:11:53 GMT
>
> I drive a Zonda.
> --IcBLyeIFObvJlJGyXuhTty5cRSs--
>
> --JCgqdOHsL4BdXPCb0cuQDnLTxOH--


But when I retried it through the erlang_pb_client:

> riakc_pb_socket:mapred(Pid, [{<<"people">>, <<"davethomas">>}],
>     [{link, <<"people">>, <<"friend">>, true},
>      {link, <<"people">>, <<"friend">>, true}]).


The return value is just Bucket/Key/link-tag pairs, without ObjectValue or
other metadata.

> {ok,[{0,[[<<"people">>,<<"timoreilly">>,<<"friend">>]]},
>      {1,[[<<"people">>,<<"dhh">>,<<"friend">>]]}]}

Is this intended or not?
If so, what's the best way to get these ObjectValues through one single
pass of link-walking?


Cheers,


Re: Question about link-walk results returned by erlang_pb_client

2014-03-16 Thread fxmy wang
Hi Alex,

Thank you for the clarification (and the warning :) ).

Just out of curiosity: using erlang_pb_client, is there a good way
to retrieve *all* the objects that have been walked in one shot (like
the HTTP interface does)?

And if there's not, then how is the HTTP map-reduce implemented so
that it can return all the objects along the way?

Best Regards :)
fxmy

2014-03-15 0:30 GMT+08:00 Alex Moore :
> Hi fxmy,
>
> > The return value is just Bucket/Key/link-tag pairs, without ObjectValue or
> > other metadata.
> >
> > {ok,[{0,[[<<"people">>,<<"timoreilly">>,<<"friend">>]]},
> >      {1,[[<<"people">>,<<"dhh">>,<<"friend">>]]}]}
> >
> > Is this intended or not?
>
> This is intended.  You are running a MapReduce job made up only of "link"
> phases, which do the link walking for you.
> Link phases return only a list of bucket/key/tag entries, not the objects.
>
> > If so, what's the best way to get these ObjectValues through one single pass
> > of link-walking?
>
> If you want to grab everything in one shot you would have to feed the link
> stage's output into a map stage to grab the actual objects:
>
> {ok, RiakObj} = riakc_pb_socket:mapred(Pid,
>     [{<<"people">>, <<"timoreilly">>}],
>     [{link, <<"people">>, <<"friend">>, false},
>      {link, <<"people">>, <<"friend">>, false},
>      {map, {modfun, riak_kv_mapreduce, map_identity}, none, true}]).
>
> This should give you the entire object for the results of the last link
> phase, namely Dave Thomas's.
>
> I should warn you, though, that while link walking and map reduce let you do
> things like this in one shot, you should be cautious about using them in
> production, since a bad query can kill performance.
>
> Thanks,
> Alex
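
How the link phases, the final map phase, and the trailing true/false (keep) flag fit together can be mimicked with a toy Python model. It is not the riakc API: `link_phase`, `map_identity`, and `mapred` here are hypothetical stand-ins, the links between the three people are inferred from the results quoted in this thread, and the value stored for davethomas is a placeholder. The walk starts from davethomas, as in the original query.

```python
# Toy data mirroring the thread: people and their "friend" links.
objects = {
    ("people", "davethomas"): {"value": "(placeholder)",
                               "links": [("people", "timoreilly", "friend")]},
    ("people", "timoreilly"): {"value": "I am an excellent public speaker.",
                               "links": [("people", "dhh", "friend")]},
    ("people", "dhh"): {"value": "I drive a Zonda.", "links": []},
}

def link_phase(inputs, bucket, tag):
    """Follow matching links from each input; emit [bucket, key, tag] entries, not objects."""
    out = []
    for b, k, *_ in inputs:
        for lb, lk, lt in objects[(b, k)]["links"]:
            if lb == bucket and lt == tag:
                out.append([lb, lk, lt])
    return out

def friend(inputs):
    """One link phase that follows "friend" links into the people bucket."""
    return link_phase(inputs, "people", "friend")

def map_identity(inputs):
    """Fetch the stored object for each input entry, like an identity map phase."""
    return [objects[(b, k)] for b, k, *_ in inputs]

def mapred(inputs, phases):
    """Run phases in order, feeding each output to the next; report only phases with keep=True."""
    kept = []
    for i, (fn, keep) in enumerate(phases):
        inputs = fn(inputs)
        if keep:
            kept.append((i, inputs))
    return kept

start = [("people", "davethomas")]
links_only = mapred(start, [(friend, True), (friend, True)])
with_map = mapred(start, [(friend, False), (friend, False), (map_identity, True)])
print(links_only)
print(with_map)
```

With both link phases kept, the result has the same shape as the `{0, ...}`, `{1, ...}` output quoted above. With only the map phase kept, the only thing returned is the object reached at the end of the walk.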
>
