Hi Chris,

Thanks a lot for the detailed response. I'll definitely try this design and see how it performs.
Anand

On 28 August 2013 13:56, Chris Perluss <tradersan...@gmail.com> wrote:

> It might help to pick a granularity level. For example, let's suppose you
> pick a granularity level of 0.1.
>
> Any piece of the song you receive should be broken down into segments of
> 0.1, and they need to be aligned on 0.1.
>
> Example: you receive a piece of the song from 0.65 to 0.85.
> You would break this into three segments:
> 0.65 to 0.70
> 0.70 to 0.80
> 0.80 to 0.85
>
> These three segments would get written to three different rows. The row
> key would be the song identifier followed by the segment number. The first
> row would be "songId-0.6", the second "songId-0.7", and the third
> "songId-0.8".
>
> The first row is "songId-0.6" and not "songId-0.65" because you want all
> pieces of the song between 0.6 and 0.7 to end up in the same row. You do
> this by rounding down to the start of the segment's range.
>
> When writing the three example segments to HBase there will be two
> scenarios.
>
> The first scenario is that you have an entire segment to be saved. In the
> above example this is the case for your piece that spans the 0.7 to 0.8
> segment. Since you have the entire segment, you don't have to combine it
> with any existing data, so you can simply do a Put and overwrite any
> partial data that might happen to exist in that row. If you configure your
> column family to store only one version for each cell, then this will
> perform "deduping" for that segment because it will keep only your new,
> complete version of that segment.
>
> The other scenario is that you receive part of a segment. In this case
> you will need to read in the row corresponding to your segment, combine
> your new partial segment with any existing partial segment, then put the
> combined segment back into HBase.
> In the above example this applies to the 0.65 to 0.7 segment (and the 0.8
> to 0.85 segment).
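The splitting, row-key rounding, and combine-on-read steps Chris describes can be sketched in plain Python. This is only an illustration of the logic, not HBase client code; the function names and the integer "tick" scaling are my own, and a production row key would want a fixed-width encoding so keys sort correctly.

```python
def split_into_segments(start, end, granularity=0.1, scale=100):
    """Break an uploaded piece [start, end) into sub-pieces aligned on
    multiples of `granularity`, e.g. 0.65-0.85 -> 0.65-0.70, 0.70-0.80,
    0.80-0.85. Positions are converted to integer ticks (hundredths by
    default) to avoid floating-point drift at the boundaries."""
    g = round(granularity * scale)
    cur, hi = round(start * scale), round(end * scale)
    segments = []
    while cur < hi:
        nxt = min((cur // g + 1) * g, hi)  # next aligned boundary, capped at end
        segments.append((cur / scale, nxt / scale))
        cur = nxt
    return segments

def row_key(song_id, piece_start, granularity=0.1, scale=100):
    """Row key = song id plus the segment start rounded DOWN to the
    granularity, so every piece inside 0.6-0.7 lands in "songId-0.6"."""
    g = round(granularity * scale)
    aligned = (round(piece_start * scale) // g) * g
    return f"{song_id}-{aligned / scale:.1f}"

def combine(pieces, new):
    """Merge a newly uploaded partial piece into the sorted, disjoint
    partial pieces already stored in a row, coalescing any overlap:
    [(0.63, 0.67)] combined with (0.65, 0.70) -> [(0.63, 0.70)]."""
    merged = []
    for a, b in sorted(pieces + [new]):
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged
```

With these pieces, each incoming upload would be split, each full segment written with a plain Put, and each partial segment read, run through `combine`, and written back.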
> When you read the row at "songId-0.6", if there is already data there you
> will need to combine it with your new data. E.g. if you found 0.63 to 0.67,
> you would combine it with 0.65 to 0.70 and end up with 0.63 to 0.70. Then
> write this segment back to HBase. If you have versions set to 1, then this
> bigger segment will replace the smaller segment you had before, thus
> "deduping" that particular segment.
>
> If you think overlapping segments will potentially be uploaded at the same
> time, then you will need to implement an optimistic locking model using
> checkAndPut. I would do this by defining one column to contain the song
> data and another column to contain a row version. I can go into more
> detail if requested.
>
> Here's the benefit of this design:
> 1. Each row will have approximately the same size (KB of song data). E.g.
> you don't have to worry about someone uploading a 2-hour-long epic folk
> metal song (I'm looking at you, Moonsorrow!) and thus creating a cell too
> big for HBase to handle efficiently. This 2-hour song will be broken up
> over lots of rows.
> 2. You can tune the row size by changing the granularity (before you go
> into production!).
> 3. For each upload request you will only need to Get a maximum of two
> segments from HBase in order to append to a partial segment. The only
> segments you need to Get would be the partial segment at the beginning of
> the upload and the partial segment at the end of the upload. All segments
> between these two are complete segments and thus can just Put their entire
> contents into the right row.
> 4. Since you only Get two segments, you will only read in a few hundred KB
> of data in order to perform the update (the amount read in depends on your
> granularity). This is true no matter how much of the file has already been
> uploaded.
> In a non-segmented storage scenario where you stored the entire
> file in one cell, if 30 MB had already been uploaded then a request to
> upload an additional 100 KB would require reading in all 30 MB and writing
> all 30.1 MB back to HBase.
> 5. You can easily and efficiently retrieve a completed song by performing a
> Scan using the songId, i.e. Scan(rowStart="songId-", rowEnd="songId.").
> "." is the next ASCII char after "-".
>
> Hope this helps!
>
> On Aug 27, 2013 10:13 PM, "Anand Nalya" <anand.na...@gmail.com> wrote:
> >
> > The segments are completely random. The segments can have anything from
> > no overlap to exact duplicates.
> >
> > Anand
> >
> > On 27 August 2013 19:49, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > > bq. Will hbase do some sort of deduplication?
> > >
> > > I don't think so.
> > >
> > > What is the granularity of segment overlap? In the above example, it
> > > seems to be 0.5.
> > >
> > > Cheers
> > >
> > > On Tue, Aug 27, 2013 at 7:12 AM, Anand Nalya <anand.na...@gmail.com>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I have a use case in which I need to store segments of mp3 files in
> > > > HBase. A song may come to the application in different overlapping
> > > > segments. For example, a 5-minute song can have the following
> > > > segments: 0-1, 0.5-2, 2-4, 3-5. As seen, some of the data is
> > > > duplicated (3-4 is present in the last 2 segments).
> > > >
> > > > What would be the ideal way of removing this duplicate storage? Will
> > > > Snappy compression help here, or do I need to write some logic over
> > > > HBase? Also, what if I store a single segment multiple times? Will
> > > > HBase do some sort of deduplication?
> > > >
> > > > Regards,
> > > > Anand
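The Scan trick in point 5 relies on the lexicographic byte ordering of row keys. A quick sketch, using Python string comparison as a stand-in for HBase's byte-wise ordering (the sample keys here are illustrative):

```python
# "." is the next ASCII character after "-", so the half-open range
# [ "songId-", "songId." ) covers exactly the keys starting with "songId-".
start_row = "songId-"
stop_row = "songId" + chr(ord("-") + 1)   # -> "songId."

row_keys = sorted([
    "songId-0.6", "songId-0.7", "songId-0.8",  # segments of our song
    "songId2-0.0",                             # different song sharing the prefix
    "otherSong-0.3",
])
# An HBase Scan returns rows with start_row <= key < stop_row (stop exclusive).
hits = [k for k in row_keys if start_row <= k < stop_row]
```

Because '2' sorts after '.', the key "songId2-0.0" falls outside the stop row, so the scan picks up only the segments of the song you asked for.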