Hi Chris,

Thanks a lot for the detailed response. I'll definitely try this design and see how it performs.
Anand

On 28 August 2013 13:56, Chris Perluss <tradersan...@gmail.com> wrote:

> It might help to pick a granularity level. For example, let's suppose you
> pick a granularity level of 0.1.
>
> Any piece of the song you receive should be broken down into segments of
> 0.1, and they need to be aligned on 0.1.
>
> Example: you receive a piece of the song from 0.65 to 0.85.
> You would break this into three segments:
> 0.65 to 0.70
> 0.70 to 0.80
> 0.80 to 0.85
>
> These three segments would get written to three different rows. The row
> key would be the song identifier followed by the segment number. The first
> row would be "songId-0.6", the second "songId-0.7", and the third
> "songId-0.8".
>
> The first row is "songId-0.6" and not "songId-0.65" because you want all
> pieces of the song between 0.6 and 0.7 to end up in the same row. You do
> this by rounding down to the start of the segment's range.
>
> When writing the three example segments to HBase there will be two
> scenarios.
>
> The first scenario is that you have an entire segment to be saved. In the
> above example this is the case for your piece that spans the 0.7 to 0.8
> segment. Since you have the entire segment, you don't have to combine it
> with any existing data, so you can simply do a Put and overwrite any
> partial data that might happen to exist in that row. If you configure your
> column family to store only one version for each cell, then this will
> perform "deduping" for that segment because it will keep only your new,
> complete version of that segment.
>
> The other scenario is that you receive part of a segment. In this case
> you will need to read in the row corresponding to your segment, combine
> your new partial segment with any existing partial segment, then put the
> combined segment back into HBase.
> In the above example this applies to the 0.65 to 0.7 segment (and the 0.8
> to 0.85 segment).
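The splitting, row-key rounding, and combine-on-read steps Chris describes can be sketched in plain Python. This is only an illustration of the logic, not HBase client code; the function names and the integer "tick" scaling are my own, and a production row key would want a fixed-width encoding so keys sort correctly.

```python
def split_into_segments(start, end, granularity=0.1, scale=100):
    """Break an uploaded piece [start, end) into sub-pieces aligned on
    multiples of `granularity`, e.g. 0.65-0.85 -> 0.65-0.70, 0.70-0.80,
    0.80-0.85. Positions are converted to integer ticks (hundredths by
    default) to avoid floating-point drift at the boundaries."""
    g = round(granularity * scale)
    cur, hi = round(start * scale), round(end * scale)
    segments = []
    while cur < hi:
        nxt = min((cur // g + 1) * g, hi)  # next aligned boundary, capped at end
        segments.append((cur / scale, nxt / scale))
        cur = nxt
    return segments

def row_key(song_id, piece_start, granularity=0.1, scale=100):
    """Row key = song id plus the segment start rounded DOWN to the
    granularity, so every piece inside 0.6-0.7 lands in "songId-0.6"."""
    g = round(granularity * scale)
    aligned = (round(piece_start * scale) // g) * g
    return f"{song_id}-{aligned / scale:.1f}"

def combine(pieces, new):
    """Merge a newly uploaded partial piece into the sorted, disjoint
    partial pieces already stored in a row, coalescing any overlap:
    [(0.63, 0.67)] combined with (0.65, 0.70) -> [(0.63, 0.70)]."""
    merged = []
    for a, b in sorted(pieces + [new]):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged
```

With these pieces, each incoming upload would be split, each full segment written with a plain Put, and each partial segment read, run through `combine`, and written back.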
> When you read the row at "songId-0.6", if there is already data there you
> will need to combine it with your new data. E.g. if you found 0.63 to 0.67,
> you would combine it with 0.65 to 0.70 and end up with 0.63 to 0.70. Then
> write this segment back to HBase. If you have versions set to 1, then this
> bigger segment will replace the smaller segment you had before, thus
> "deduping" that particular segment.
>
> If you think overlapping segments will potentially be uploaded at the same
> time, then you will need to implement an optimistic locking model using
> checkAndPut. I would do this by defining one column to contain the song
> data and another column to contain a row version. I can go into more
> detail if requested.
>
> Here's the benefit of this design:
> 1. Each row will have approximately the same size (KB of song data). E.g.
> you don't have to worry about someone uploading a 2-hour-long epic folk
> metal song (I'm looking at you, Moonsorrow!) and thus creating a cell too
> big for HBase to handle efficiently. This 2-hour song will be broken up
> over lots of rows.
> 2. You can tune the row size by changing the granularity (before you go
> into production!).
> 3. For each upload request you will only need to Get a maximum of two
> segments from HBase in order to append to a partial segment. The only
> segments you need to Get would be the partial segment at the beginning of
> the upload and the partial segment at the end of the upload. All segments
> between these two are complete segments and thus can just Put their entire
> contents into the right row.
> 4. Since you only Get two segments, you will only read in a few hundred KB
> of data in order to perform the update (the amount read in depends on your
> granularity). This is true no matter how much of the file has already been
> uploaded.
> In a non-segmented storage scenario where you stored the entire
> file in one cell, if 30 MB had already been uploaded then a request to
> upload an additional 100 KB would require reading in all 30 MB and writing
> all 30.1 MB back to HBase.
> 5. You can easily and efficiently retrieve a completed song by performing a
> Scan using the songId, i.e. Scan(rowStart="songId-", rowEnd="songId.").
> "." is the next ASCII char after "-".
>
> Hope this helps!
>
> On Aug 27, 2013 10:13 PM, "Anand Nalya" <anand.na...@gmail.com> wrote:
> >
> > The segments are completely random. The segments can have anything from
> > no overlap to exact duplicates.
> >
> > Anand
> >
> > On 27 August 2013 19:49, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > bq. Will hbase do some sort of deduplication?
> > >
> > > I don't think so.
> > >
> > > What is the granularity of segment overlap? In the above example, it
> > > seems to be 0.5.
> > >
> > > Cheers
> > >
> > > On Tue, Aug 27, 2013 at 7:12 AM, Anand Nalya <anand.na...@gmail.com>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I have a use case in which I need to store segments of mp3 files in
> > > > HBase. A song may come to the application in different overlapping
> > > > segments. For example, a 5-minute song can have the following
> > > > segments: 0-1, 0.5-2, 2-4, 3-5. As seen, some of the data is
> > > > duplicated (3-4 is present in the last 2 segments).
> > > >
> > > > What would be the ideal way of removing this duplicate storage? Will
> > > > Snappy compression help here, or do I need to write some logic over
> > > > HBase? Also, what if I store a single segment multiple times? Will
> > > > HBase do some sort of deduplication?
> > > >
> > > > Regards,
> > > > Anand
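The Scan trick in point 5 relies on the lexicographic byte ordering of row keys. A quick sketch, using Python string comparison as a stand-in for HBase's byte-wise ordering (the sample keys here are illustrative):

```python
# "." is the next ASCII character after "-", so the half-open range
# [ "songId-", "songId." ) covers exactly the keys starting with "songId-".
start_row = "songId-"
stop_row = "songId" + chr(ord("-") + 1)   # -> "songId."

row_keys = sorted([
    "songId-0.6", "songId-0.7", "songId-0.8",  # segments of our song
    "songId2-0.0",                             # different song sharing the prefix
    "otherSong-0.3",
])
# An HBase Scan returns rows with start_row <= key < stop_row (stop exclusive).
hits = [k for k in row_keys if start_row <= k < stop_row]
```

Because '2' sorts after '.', the key "songId2-0.0" falls outside the stop row, so the scan picks up only the segments of the song you asked for.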