Hi, thanks for the example. That makes it easier to consider some options.
In my opinion you have three basic options.

1.) Leading source I

As I assumed, "source" seems to be the leading concept: every "job" has to
have a "source". So you could pack the "jobs" into the "source" row. You
could make a column family "data" with the meta data of the source, and a
column family "jobs" where you put the jobs.

Advantage: clean design. E.g. kicking all jobs from one source is easy,
scanning for sources is easy, and scanning for jobs is easy (just
concatenate the job columns per source). Getting all jobs of a source is
easy too: just one get on the row, fetching the "jobs" column family. In
this design a "key" for a job would look like "<sourceID>-<timestamp>" to
target one job exactly. (Small remark: with this design jobs and sources
are still somewhat separated. If you just need the meta data, you only
fetch the column family "data" and you're good to go.)

Disadvantage: a "job" has to be represented by a byte array (as it is a
cell), e.g. JSON. Thus you have to parse it every time you need a specific
job.

So a row could look like this:

"sourceXYZ" => {
  data : { "description" : "foo bar" , "type" : "typeX" } ,
  jobs : { <timestamp1> : "{ 'job-data1' : 'foo' , 'job-data2' : 'bar' }" ,
           <timestamp2> : "{ 'job-data1' : 'foz' , 'job-data2' : 'baz' }" ,
           ... }
}

If you want to delete a specific job, just delete it from the column
family; no need to update the source data. If you want to add a job to a
source, you can just add it. And if there are "free hanging" jobs, jobs
where the source is not a URL, you could make just one row, "useradded"
or something like that.

2.) Leading source II

Like above, but you represent a job by a key into a second table:

"sourceXYZ" => {
  data : { "description" : "foo bar" , "type" : "typeX" } ,
  jobs : { <timestamp1> : "job34" ,
           <timestamp2> : "job56" ,
           ...
         }
}

"job34" => { data : { "job-data1" : "foo" , "job-data2" : "bar" } }

Advantage: like above, but additionally no JSON parsing is needed.
Disadvantage: a "full get" of one source needs one "getRow" plus one
"getRows" (or multiple single gets).

3.) Two separate tables

Like your plan: one table for the sources, one table for the jobs, with
the key of the "source" stored in the jobs table.

Advantage: simple to manipulate single jobs (add, delete, etc.).
Disadvantage: more complicated for larger operations (e.g. kicking all
jobs of one source).

====

However, I think option 2 is the best way of the above: CPU-time saving
and the most flexible. E.g. if at some point a re-evaluation of a source
has to be done, you could simply use a row lock to prevent race
conditions. All other, rarer operations (e.g. "kick all sources of type
'typeY'") can be done with a simple MapReduce.

This would be my recommendation. But perhaps someone else has another
idea.

Best
Wilm

Am 25.11.2014 um 06:43 schrieb jatinpreet:
> Thanks Wilm,
>
> Let me try to explain my scenario in more detail. Let me talk about two
> specific entities, Jobs and Sources.
>
> *Source- *A URL that is source of some data. It also contains other
> meta-info like description, type etc. So, the required columns are,
> source_name, url, description, type.
>
> *Job- *An independent entity created with data from the selected sources.
> Apart from job information, we need to keep a track of which sources were
> selected for this job, and this list is editable, hence addition/removal are
> possible. The columns needed in job are, job_name, description,
> source_{source-rowkey} and so on.
>
> I was considering following options,
>
> 1. Create a JSON of each source and dump it into the value field of
> source_{timestamp} column. But I need to be able to list all of the
> available sources before creating a job. This would mean scanning all jobs
> and finding just the unique sources from the all the lists. This seems like
> an overkill.
> Another problem with this approach is that I would have to write my own
> custom filters if I need to filter jobs on basis of source.
>
> 2. Create a new table for sources and keep the rowkeys of the sources in job
> rows. This turns out to be somewhat like foreign keys thoguh which
> understandably is awkward for HBase. But now I have the option of scanning
> the sources table for listing purposes.
> And this is where my question originated. When I need to fetch sources for a
> particular job I could just filter them based on job key column from source
> table. This would mean a long scan on all rows of the source table.
> Another option is, to fetch the list of source rowkeys from job row and then
> directly hit the source table for these specific rowkeys.
> If this option sustains, which of the above methods if more prudent.
>
> This example might not seem to be based on huge data but I do expect
> millions of jobs to be created. Also, this is a common pattern which I need
> to implement in other parts of HBase tables too.
>
> Thanks,
> Jatin
>
> --
> View this message in context:
> http://apache-hbase.679495.n3.nabble.com/HBase-entity-relationship-tp4066296p4066327.html
> Sent from the HBase User mailing list archive at Nabble.com.
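PS: to make option 2 from the reply above concrete, here is a toy in-memory
sketch in Python. This is deliberately NOT HBase client code (a real
implementation would use puts, deletes, and gets against two HBase tables);
plain dicts stand in for rows so the key layout and the cost of each
operation are easy to see. The table names ("sources", "jobs"), the example
timestamps, and the helper function names are assumptions made up for
illustration only.

```python
# Toy model of option 2: a "sources" table whose "jobs" column family maps
# timestamp qualifiers to row keys of a second "jobs" table.
# Plain Python dicts stand in for HBase rows; names here are illustrative.

sources = {
    "sourceXYZ": {
        "data": {"description": "foo bar", "type": "typeX"},
        # column qualifier = timestamp, cell value = job-row key
        "jobs": {1001: "job34", 1002: "job56"},
    }
}

jobs = {
    "job34": {"data": {"job-data1": "foo", "job-data2": "bar"}},
    "job56": {"data": {"job-data1": "foz", "job-data2": "baz"}},
}

def add_job(source_id, ts, job_id, job_data):
    """Add a job: one put to the jobs table, one put to the source row."""
    jobs[job_id] = {"data": job_data}
    sources[source_id]["jobs"][ts] = job_id

def delete_job(source_id, ts):
    """Delete a job: drop the column from the source row, then the job row.
    No other columns of the source need to be touched."""
    job_id = sources[source_id]["jobs"].pop(ts)
    jobs.pop(job_id, None)

def full_get(source_id):
    """'Full get' of one source: one row get on the sources table plus one
    multi-get on the jobs table (the disadvantage noted for option 2)."""
    row = sources[source_id]
    return row["data"], [jobs[jid]["data"] for jid in row["jobs"].values()]

add_job("sourceXYZ", 1003, "job78", {"job-data1": "qux"})
meta, job_list = full_get("sourceXYZ")
print(meta["type"], len(job_list))        # typeX 3
delete_job("sourceXYZ", 1001)
print(len(sources["sourceXYZ"]["jobs"]))  # 2
```

Note how add and delete each touch exactly one cell per table, and no JSON
parsing is needed anywhere; that is the trade the reply recommends over
option 1.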