Thanks Aaron. I think, it has answered my question. I have a few more and
would be great if you can clarify them
1. Is the number of shards per-table fixed during table-creation or we can
dynamically allocate shards?
2. Assuming I have 10k shards with each shard-size=2GB, for a total of 20
TB table size. I typically use RowId = UserId and there are approx 3
million users, in our system
How do I ensure that when a user issues a query, I should not end-up
searching all these 10k shards, but rather search only a very small set of
shards?
3. Are there any advantages of running shard-server and data-nodes {HDFS}
in the same machine?
--
Ravi
On Thu, Sep 19, 2013 at 2:36 AM, Aaron McCurry <[email protected]> wrote:
> I will attempt to answer below:
>
> On Wed, Sep 18, 2013 at 1:54 AM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Thanks a bunch for a concise and quick reply. Few more questions
> >
> > 1. Any pointers/links on how you plan to tackle the availability problem?
> >
> > Lets say we store-forward hints to the failed shard-server. Won't the
> HDFS
> > index-files differ in shard replicas?
> >
>
> I am in the process of documenting the strategy and will be adding it to
> JIRA soon. The way I am planning on solving this problem doesn't involve
> storing the indexes in more than once in HDFS (which of course is
> replicated).
>
>
> >
> > 2. I did not phrase my question on cross-join correctly. Let me clarify
> >
> > RowKey = 123
> >
> > RecId = 1000
> > Family = "ACCOUNTS"
> > Col-Name = "NAME"
> > Col-Value = "ABC"
> > ......
> >
> > RecId = 1001
> > Family = "CONTACTS"
> > Col-Name = "NAME"
> > Col-Value = "XYZ"
> > Col-Name = "ACCOUNTS-NAME" [FK to RecId=1000]
> > Col-Value = "1000"
> > .......
> >
> > Lets say the user specifies the search query as
> > key=123 AND name:(ABC OR XYZ)
> >
> > Initially I will apply this query to each of the Family types, namely
> > "ACCOUNTS", "CONTACTS" etc.... and get their RecIds..
> >
> > After this, I will have to filter "CONTACTS" family results, based on
> > RecIds received from "ACCOUNTS" [Join within records of different family,
> > based on FK]
> >
> > Is something like this achievable? Can I design it differently to satisfy
> > my requirements?
> >
>
> I may not fully understand your scenario.
>
> As I understand your example above:
>
> Row {
> "id" => "123",
> "records" => [
> Record {
> "recordId" => "1000", "family" => "accounts",
> "columns" => [Column {"name" => "abc"}]
> },
> Record {
> "recordId" => "1001", "family" => "contacts",
> "columns" => [Column {"name" => "abc"}]
> }
> ]
> }
>
> Let me go through some example queries that we support:
>
> +<accounts.name:abc> +<contacts.name:abc>
>
> Another way of writing it would be:
>
> <accounts.name:abc> AND <contacts.name:abc>
>
> Would yield a hit on the Row, there aren't any FKs in Blur.
>
> However if there are some interesting queries that be done with more
> examples:
>
> Row {
> "id" => "123",
> "records" => [
> Record {
> "recordId" => "1000", "family" => "accounts",
> "columns" => [Column {"name" => "abc"}]
> },
> Record {
> "recordId" => "1001", "family" => "contacts",
> "columns" => [Column {"name" => "abc"}]
> }
> ]
> }
>
> Row {
> "id" => "456",
> "records" => [
> Record {
> "recordId" => "1000", "family" => "accounts",
> "columns" => [Column {"name" => "abc"}]
> },
> Record {
> "recordId" => "1001", "family" => "contacts",
> "columns" => [Column {"name" => "abc"}]
> },
> Record {
> "recordId" => "1002", "family" => "contacts",
> "columns" => [Column {"name" => "def"}]
> }
> ]
> }
>
>
> Row {
> "id" => "789",
> "records" => [
> Record {
> "recordId" => "1000", "family" => "accounts",
> "columns" => [Column {"name" => "abc"}]
> },
> Record {
> "recordId" => "1002", "family" => "contacts",
> "columns" => [Column {"name" => "def"}]
> }
> ]
> }
>
> For the given query: "<accounts.name:abc> AND <contacts.name:abc>" would
> yield 2 Row hits. 123 and 456
> For the given query: "<accounts.name:abc> AND <contacts.name:def>" would
> yield 2 Row hits. 456 and 789
> For the given query: "<contacts.name:abc> AND <contacts.name:def>" would
> yield 1 Row hit of 456. NOTICE that the family is the same here
> "contacts".
>
> Also in Blur you can turn off the Row query and just query the records.
> This would be your typical Document like access.
>
> I fear that this has not answered your question, so if it hasn't please let
> me know.
>
> Thanks!
>
> Aaron
>
>
>
> >
> > --
> > Ravi
> >
> >
> >
> > On Tue, Sep 17, 2013 at 7:01 PM, Aaron McCurry <[email protected]>
> wrote:
> >
> > > First off let me say welcome! Hopefully I can answer your questions
> > inline
> > > below.
> > >
> > >
> > > On Tue, Sep 17, 2013 at 6:52 AM, Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > > > I am quite new to Blur and need some help with the following
> questions
> > > >
> > > > 1. Lets say I have a replication_factor=3 for all HDFS indexes. In
> case
> > > one
> > > > of the server hosting HDFS indexes goes down [temporary or
> take-down],
> > > what
> > > > will happen to writes? Some kind-of HintedHandoff [as in Cassandra]
> is
> > > > supported?
> > > >
> > >
> > > When there is a Blur Shard Server failure state in ZooKeeper will
> change
> > > and the other shard servers will take action to bring the down shard(s)
> > > online. This is similar to the HBase region model. While the shard(s)
> > are
> > > being relocated (which really means being reopened from HDFS) writes to
> > the
> > > shard(s) being moved are not available. However the bulk load
> capability
> > > is always available as long as HDFS is available, this can be used
> > through
> > > Hadoop MapReduce.
> > >
> > >
> > > >
> > > > To re-phrase, what is the Consistency Vs Availability trade-off in
> > Blur,
> > > > with replication_factor>1 for HDFS indexes?
> > > >
> > >
> > > Of the two Consistency is favored over Availability, however we are
> > > starting development (in 0.3.0) to increase availability during
> failures.
> > >
> > >
> > > >
> > > > 2. Since HDFSInputStream is used underneath, will this result in too
> > much
> > > > of data-transfer back-and-forth? A case of multi-segment-merge or
> even
> > > > wild-card search could trigger it.
> > > >
> > >
> > > Blur uses an in process file system cache (Block Cache is the term used
> > in
> > > the code) to reduce the IO from HDFS. During index merges data that is
> > not
> > > in the Block Cache is read from HDFS and the output is written back to
> > > HDFS. Overall once an index is hot (been online for some time) the IO
> > for
> > > any given search is fairly small assuming that the cluster has enough
> > > memory configured in the Block Cache.
> > >
> > >
> > > >
> > > > 3. Does Blur also support foreign-key like semantics to search across
> > > > column-families as well as delete using row_id?
> > > >
> > >
> > > Blur supports something called Row Queries that allow for searches
> across
> > > column families within single Rows. Take a look at this page for a
> > better
> > > explanation:
> > >
> > > http://incubator.apache.org/blur/docs/0.2.0/data-model.html#querying
> > >
> > > And yes Blur supports deletes by Row check out:
> > >
> > > http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Fn_Blur_mutate
> > > and
> > >
> http://incubator.apache.org/blur/docs/0.2.0/Blur.html#Struct_RowMutation
> > >
> > > Hopefully this can answer so of your questions. Let us know if you
> have
> > > any more.
> > >
> > > Thanks,
> > > Aaron
> > >
> > >
> > >
> > >
> > > >
> > > > --
> > > > Ravi
> > > >
> > >
> >
>