Re: Storing images in Hbase
Could HCatalog be an option ?

On Jan 26, 2013 at 21:56, Jack Levin magn...@gmail.com wrote:
AFAIK, the namenode would not like tracking 20 billion small files :) -jack

On Sat, Jan 26, 2013 at 6:00 PM, S Ahmed sahmed1...@gmail.com wrote:
That's pretty amazing. What I am confused about is: why did you go with HBase and not straight into HDFS?

On Fri, Jan 25, 2013 at 2:41 AM, Jack Levin magn...@gmail.com wrote:
Two people, including myself; it's fairly hands-off. It took about 3 months to tune it right, but we had multiple years of experience with datanodes and Hadoop in general, so that was a good boost. We have 4 HBase clusters today, the image store being the largest.

On Jan 24, 2013 2:14 PM, S Ahmed sahmed1...@gmail.com wrote:
Jack, out of curiosity, how many people manage the HBase-related servers? Does it require constant monitoring, or is it fairly hands-off now? (Or a bit of both: the early days were about getting things right and learning, and now it's purring along.)

On Wed, Jan 23, 2013 at 11:53 PM, Jack Levin magn...@gmail.com wrote:
It's best to keep some RAM for caching of the filesystem; besides, we also run the datanode, which takes heap as well. Now, keep in mind that even if you specify a heap of, say, 5GB, if your server opens threads to communicate with other systems via RPC (which HBase does a lot), you will actually use HEAP + Nthreads * thread_stack_size. There is a good Sun Microsystems document about it. (I don't have the link handy.) -Jack

On Mon, Jan 21, 2013 at 5:10 PM, Varun Sharma va...@pinterest.com wrote:
Thanks for the useful information. I wonder why you use only a 5G heap when you have an 8G machine? Is there a reason not to use all of it (the DataNode typically takes 1G of RAM)?

On Sun, Jan 20, 2013 at 11:49 AM, Jack Levin magn...@gmail.com wrote:
I forgot to mention that I also have this setup:

  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>33554432</value>
    <description>Flush more often. Default: 67108864</description>
  </property>

This parameter works on a per-region amount, so if any of my 400 (currently) regions on a regionserver has 30MB+ in its memstore, HBase will flush it to disk. Here are some metrics from a regionserver:

requests=2, regions=370, stores=370, storefiles=1390, storefileIndexSize=304, memstoreSize=2233, compactionQueueSize=0, flushQueueSize=0, usedHeap=3516, maxHeap=4987, blockCacheSize=790656256, blockCacheFree=255245888, blockCacheCount=2436, blockCacheHitCount=218015828, blockCacheMissCount=13514652, blockCacheEvictedCount=2561516, blockCacheHitRatio=94, blockCacheHitCachingRatio=98

Note that the memstore is only 2G; this particular regionserver's heap is set to 5G. And last but not least, it's very important to have a good GC setup:

export HBASE_OPTS="$HBASE_OPTS -verbose:gc -Xms5000m \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+HeapDumpOnOutOfMemoryError -Xloggc:$HBASE_HOME/logs/gc-hbase.log \
  -XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=8 \
  -XX:+UseParNewGC \
  -XX:NewSize=128m -XX:MaxNewSize=128m \
  -XX:-UseAdaptiveSizePolicy \
  -XX:+CMSParallelRemarkEnabled \
  -XX:-TraceClassUnloading"

-Jack

On Thu, Jan 17, 2013 at 3:29 PM, Varun Sharma va...@pinterest.com wrote:
Hey Jack, thanks for the useful information. By the flush size being 15%, do you mean the memstore flush size? 15% would mean close to 1G; have you seen any issues with flushes taking too long? Thanks, Varun

On Sun, Jan 13, 2013 at 8:17 AM, Jack Levin magn...@gmail.com wrote:
That's right, the memstore size, not the flush size, is increased. Filesize is 10G. Overall, the write cache is 60% of heap and the read cache is 20%. Flush size is 15%. 64 maxlogs at 128MB. One namenode server, plus one secondary that can be promoted. On the way to HBase, images are written to a queue, so that we can take HBase down for maintenance and still do inserts later.
ImageShack has ‘perma cache’ servers that allow writes and serving of data even when HBase is down for hours; consider it a 4th replica outside of Hadoop. -Jack

*From:* Mohit Anchlia mohitanch...@gmail.com
*Sent:* January 13, 2013 7:48 AM
*To:* user@hbase.apache.org
*Subject:* Re: Storing images in Hbase

Thanks, Jack, for sharing this information. This definitely makes sense when using that type of caching layer. You mentioned increasing the write cache; I am assuming you had to increase the following parameters in addition to increasing the memstore size: hbase.hregion.max.filesize and hbase.hregion.memstore.flush.size.
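Jack's sizing rules above lend themselves to quick arithmetic. A minimal sketch follows; the 5000 MB heap, the 60/20/15 percentages, and the 32 MB per-region flush size come from the thread, while the thread count (600) and the 1 MiB per-thread stack are illustrative assumptions:

```python
HEAP_MIB = 5000  # -Xms5000m, from the GC settings above

# 1) Resident footprint ~= heap + N RPC threads * per-thread stack
#    (Jack's "HEAP + Nthreads * stack" rule). A 1 MiB stack is the
#    common -Xss default -- an assumption, not stated in the thread.
def footprint_mib(heap_mib, n_threads, stack_mib=1):
    return heap_mib + n_threads * stack_mib

# 2) Jack's heap split: 60% write cache (memstore), 20% read cache
#    (block cache), flushes triggered around 15% of heap.
memstore_total = HEAP_MIB * 0.60   # ~3000 MiB
block_cache    = HEAP_MIB * 0.20   # ~1000 MiB
flush_trigger  = HEAP_MIB * 0.15   # ~750 MiB -- Varun's "close to 1G"

# 3) Why the 32 MB per-region flush.size matters: with 400 regions, the
#    worst-case memstore total would far exceed the 60% global budget,
#    so small per-region flushes keep usage well below it.
worst_case_memstore = 400 * 32  # MiB

print(footprint_mib(HEAP_MIB, 600))          # 5600 -> why only a 5G heap on an 8G box
print(memstore_total, worst_case_memstore)   # 3000.0 12800
```

The footprint line is the answer to Varun's question: with the datanode's ~1G plus thread stacks, an 8G machine cannot safely give HBase all 8G of heap.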
Re: Storing images in Hbase
I've never tried it; HBase worked out nicely for this task, and the caching and all is a bonus for files. -jack

On Mon, Jan 28, 2013 at 2:01 AM, Adrien Mogenet adrien.moge...@gmail.com wrote:
Could HCatalog be an option ?
Re: Storing images in Hbase
On Fri, Jan 11, 2013 at 9:47 AM, Jack Levin magn...@gmail.com wrote:
We buffer all accesses to HBase with a Varnish SSD-based caching layer, so the impact for reads is negligible.
We have a 70-node cluster, 8 GB of RAM per node, relatively weak nodes (Intel Core 2 Duo), with 10-12 TB of disk per server. We are inserting 600,000 images per day. We have relatively little compaction activity, as we made our write cache much larger than the read cache, so we don't experience region file fragmentation as much. -Jack

On Fri, Jan 11, 2013 at 9:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
I think it really depends on the volume of the traffic, the data distribution per region, how and when file compaction occurs, and the number of nodes in the cluster. In my experience, when it comes to blob data where you are serving tens of thousands of write and read requests per second, it's very difficult to manage HBase without very hard operations and maintenance in play. Jack earlier mentioned they have 1 billion images; it would be interesting to know what they see in terms of compaction and number of requests per second. I'd be surprised if, on a high-volume site, it can be done without any caching layer on top to alleviate the I/O spikes that occur because of GC and compactions.

On Fri, Jan 11, 2013 at 7:27 AM, Mohammad Tariq donta...@gmail.com wrote:
IMHO, if the image files are not too huge, HBase can efficiently serve the purpose. You can store some additional info along with the file, depending upon your search criteria, to make the search faster. Say you want to fetch images by type: you can store the image in one column and its extension in another column (jpg, tiff, etc.). BTW, what exactly is the problem you are facing? You have written "But I still can't do it"? Warm Regards, Tariq https://mtariq.jux.com/
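The cluster figures in this thread support some back-of-envelope math. In the sketch below, the 1 MB average image size and 3x HDFS replication are assumptions (Jack only says file sizes run "up to 10MB"); the node count, daily insert rate, and 1-billion-image total come from the thread:

```python
# Figures from the thread: 70 nodes, 600,000 images/day, ~1 billion images.
IMAGES_PER_DAY = 600_000
AVG_IMAGE_MB = 1.0   # assumed average
REPLICATION = 3      # HDFS default; the 'perma cache' is a 4th copy outside Hadoop

writes_per_sec = IMAGES_PER_DAY / 86_400
daily_growth_tb = IMAGES_PER_DAY * AVG_IMAGE_MB * REPLICATION / 1_000_000
total_store_tb = 1_000_000_000 * AVG_IMAGE_MB / 1_000_000  # pre-replication

print(f"{writes_per_sec:.1f} inserts/sec cluster-wide")
print(f"{daily_growth_tb:.2f} TB/day raw growth")
print(f"{total_store_tb:.0f} TB logical store")  # ~1 PB at these assumptions
```

At these assumptions the write rate is under 10 inserts/sec cluster-wide, which helps explain why a well-tuned cluster of modest nodes copes, and the logical store lands near the "close to a PB" Lars estimates later in the thread.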
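The Varnish SSD caching tier Jack describes could look roughly like the following, written in modern VCL 4 syntax. Everything here (the URL pattern, the TTL, the idea that image URLs live under /img/) is a hypothetical illustration, not ImageShack's actual configuration:

```vcl
sub vcl_backend_response {
    # Cache successful image responses for a long time; stored images
    # are effectively immutable, so a long TTL is safe.
    if (bereq.url ~ "^/img/" && beresp.status == 200) {
        set beresp.ttl = 7d;
    }
}
```

With a cache like this in front, most reads never reach HBase at all, which is why Jack can say the read-side impact of compactions and GC is negligible.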
Re: Storing images in Hbase
You bring up a very common consideration, I think. For static content, such as images, a cache can help offload read load from the datastore; this fits into this conversation. For dynamic content, an external cache may not be helpful, as you say, although the blockcache within HBase will help if you are assembling content dynamically from repeated queries, some of which bring in the same data over and over.

On Mon, Jan 28, 2013 at 12:23 PM, yiyu jia jia.y...@gmail.com wrote:
Let's say I have a web application which uses HBase as the data source at the backend. I have a cache configured in my reverse proxy, which sits in front of my web server, and the cache is keyed on URL patterns or parameters. In this case, cached data will be delivered to the client whenever the input parameter/URL is the same, so the layers behind the web server will not be hit. If this is the case, I would say a cache between HBase and HDFS will not be helpful.

--
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
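The URL-keyed caching yiyu describes can be sketched in a few lines. The key scheme and the in-memory dict below are illustrative stand-ins for a real reverse proxy such as Varnish; the point is only that an identical path + query string yields one backend (HBase) hit:

```python
from urllib.parse import urlsplit

_cache = {}

def cache_key(url):
    # Key on path + query only, so the same resource behind any
    # hostname maps to one cache entry (an assumed policy).
    parts = urlsplit(url)
    return (parts.path, parts.query)

def fetch(url, backend):
    key = cache_key(url)
    if key not in _cache:
        _cache[key] = backend(url)  # cache miss: go to the datastore
    return _cache[key]

backend_hits = []
def backend(url):
    backend_hits.append(url)
    return b"image-bytes"

fetch("http://a/img/1.jpg?s=small", backend)
fetch("http://b/img/1.jpg?s=small", backend)  # same path+query: served from cache
print(len(backend_hits))  # 1 -- the backend was hit only once
```

This is exactly why, for static content, the blockcache inside HBase sees little traffic: repeats are absorbed before they reach it.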
Re: Storing images in Hbase
Hi Andy, thanks a lot for sharing. Yes, I am not talking about static content caching, which might be called an internal CDN today. I am asking about techniques for configuring caches on different layers while avoiding duplicate caching across those layers. Thanks and regards, Yiyu

On Mon, Jan 28, 2013 at 4:13 PM, Andrew Purtell apurt...@apache.org wrote:
You bring up a very common consideration I think. For static content, such as images, then a cache can help offload read load from the datastore.
Re: Storing images in Hbase
In that case, hypothetically speaking, you could disable the HBase blockcache on the table containing static content and rely on an external reverse proxy tier, while enabling the HBase blockcache on the tables you use as part of generating dynamic content.

On Mon, Jan 28, 2013 at 1:44 PM, yiyu jia jia.y...@gmail.com wrote:
I am asking about techniques for configuring caches on different layers while avoiding duplicate caching across those layers.
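In the HBase shell of that era, Andy's suggestion might look like the following. The table and family names are placeholders, and the disable/enable pair is included because online schema change was off by default in contemporary releases:

```
disable 'images'
alter 'images', {NAME => 'd', BLOCKCACHE => 'false'}
enable 'images'
```

Tables serving dynamic content are simply left with the default BLOCKCACHE => 'true', so heap is spent only where repeated in-HBase reads actually benefit.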
Re: Storing images in Hbase
Hi Jack, thanks so much for sharing! Do you have any comments on storing video in HDFS? Thanks and regards, Yiyu

On Sat, Jan 26, 2013 at 9:56 PM, Jack Levin magn...@gmail.com wrote:
AFAIK, namenode would not like tracking 20 billion small files :) -jack
Re: Storing images in Hbase
On Fri, Jan 11, 2013 at 8:30 PM, Michael Segel michael_se...@hotmail.com wrote:
That's a viable option. HDFS reads are faster than HBase, but it would require first hitting the index in HBase, which points to the file, and then fetching the file. It could be faster... we found storing binary data in a sequence file, indexed in HBase, to be faster than HBase itself; however, YMMV, and HBase has been improved since we did that project.

On Jan 10, 2013, at 10:56 PM, shashwat shriparv dwivedishash...@gmail.com wrote:
Hi Kavish, I have a better idea for you: copy your image files into a single file on HDFS, and when a new image comes, append it to the existing file, then keep and update the metadata and the offset in HBase. Because if you put bigger images in HBase, it will lead to some issues. ∞ Shashwat Shriparv

On Fri, Jan 11, 2013 at 9:21 AM, lars hofhansl la...@apache.org wrote:
Interesting. That's close to a PB, if my math is correct. Is there a write-up about this somewhere? Something that we could link from the HBase homepage? -- Lars

----- Original Message -----
From: Jack Levin magn...@gmail.com
To: user@hbase.apache.org
Cc: Andrew Purtell apurt...@apache.org
Sent: Thursday, January 10, 2013 9:24 AM
Subject: Re: Storing images in Hbase

We stored about 1 billion images in HBase, with file sizes up to 10MB. It's been running for close to 2 years without issues and serves delivery of images for Yfrog and ImageShack. If you have any questions about the setup, I would be glad to answer them. -Jack
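Shashwat's append-and-index scheme can be sketched with a byte buffer standing in for the big HDFS file and a dict standing in for the HBase metadata table. All names here are illustrative; the point is one index lookup followed by one ranged read:

```python
import io

class BlobPack:
    """Append-only blob file with an (offset, length) index per key."""

    def __init__(self):
        self.data = io.BytesIO()  # stand-in for the single HDFS file
        self.index = {}           # stand-in for HBase: key -> (offset, length)

    def put(self, key, payload: bytes):
        offset = self.data.seek(0, io.SEEK_END)  # append at end of file
        self.data.write(payload)
        self.index[key] = (offset, len(payload))

    def get(self, key) -> bytes:
        offset, length = self.index[key]  # one HBase lookup...
        self.data.seek(offset)
        return self.data.read(length)     # ...then one ranged HDFS read

pack = BlobPack()
pack.put("img1", b"\xff\xd8jpeg-bytes")
pack.put("img2", b"\x89PNGpng-bytes")
print(pack.get("img1"))  # b'\xff\xd8jpeg-bytes'
```

This is essentially what Segel describes with sequence files: the blobs live in HDFS where sequential reads are cheap, and HBase only stores the small, fixed-size pointers, keeping the namenode's file count low.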
Re: Storing images in Hbase
. And last but not least, its very important to have good GC setup: export HBASE_OPTS=$HBASE_OPTS -verbose:gc -Xms5000m -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -Xloggc:$HBASE_HOME/logs/gc-hbase.log \ -XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=8 \ -XX:+UseParNewGC \ -XX:NewSize=128m -XX:MaxNewSize=128m \ -XX:-UseAdaptiveSizePolicy \ -XX:+CMSParallelRemarkEnabled \ -XX:-TraceClassUnloading -Jack On Thu, Jan 17, 2013 at 3:29 PM, Varun Sharma va...@pinterest.com wrote: Hey Jack, Thanks for the useful information. By flush size being 15 %, do you mean the memstore flush size ? 15 % would mean close to 1G, have you seen any issues with flushes taking too long ? Thanks Varun On Sun, Jan 13, 2013 at 8:17 AM, Jack Levin magn...@gmail.com wrote: That's right, Memstore size , not flush size is increased. Filesize is 10G. Overall write cache is 60% of heap and read cache is 20%. Flush size is 15%. 64 maxlogs at 128MB. One namenode server, one secondary that can be promoted. On the way to hbase images are written to a queue, so that we can take Hbase down for maintenance and still do inserts later. ImageShack has ‘perma cache’ servers that allows writes and serving of data even when hbase is down for hours, consider it 4th replica outside of hadoop Jack *From:* Mohit Anchlia mohitanch...@gmail.com *Sent:* January 13, 2013 7:48 AM *To:* user@hbase.apache.org *Subject:* Re: Storing images in Hbase Thanks Jack for sharing this information. This definitely makes sense when using the type of caching layer. You mentioned about increasing write cache, I am assuming you had to increase the following parameters in addition to increase the memstore size: hbase.hregion.max.filesize hbase.hregion.memstore.flush.size On Fri, Jan 11, 2013 at 9:47 AM, Jack Levin magn...@gmail.com wrote: We buffer all accesses to HBASE with Varnish SSD based caching layer. So the impact for reads is negligible. 
We have a 70-node cluster with 8 GB of RAM per node, relatively weak nodes (Intel Core 2 Duo), and 10-12TB of disk per server, inserting 600,000 images per day. We have relatively little compaction activity because we made our write cache much larger than our read cache, so we don't experience region file fragmentation as much. -Jack

On Fri, Jan 11, 2013 at 9:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

I think it really depends on the volume of traffic, the data distribution per region, how and when file compactions occur, and the number of nodes in the cluster. In my experience, when it comes to blob data where you are serving tens of thousands or more write and read requests per second, it's very difficult to manage HBase without very hard operations and maintenance in play. Jack earlier mentioned they have 1 billion images; it would be interesting to know what they see in terms of compactions and requests per second. I'd be surprised if a high-volume site could do it without any caching layer on top to alleviate the IO spikes that occur because of GC and compactions.

On Fri, Jan 11, 2013 at 7:27 AM, Mohammad Tariq donta...@gmail.com wrote:

IMHO, if the image files are not too huge, HBase can serve the purpose efficiently. You can store some additional info along with the file, depending on your search criteria, to make searches faster. Say you want to fetch images by type: you can store the image in one column and its extension (jpg, tiff, etc.) in another column. BTW, what exactly is the problem you are facing? You have written "But I still cant do it"? Warm Regards, Tariq https://mtariq.jux.com/

On Fri, Jan 11, 2013 at 8:30 PM, Michael Segel michael_se...@hotmail.com wrote:

That's a viable option. HDFS reads are faster than HBase, but it would require first hitting the index in HBase
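Tariq's layout (image bytes in one column, extension in another) can be sketched as a small row-building helper. This illustrates the data model only: the column names and the hashed rowkey prefix are assumptions for the sketch, not anything specified in the thread, and no HBase client is involved.

```python
import hashlib

# Sketch of the suggested layout: the blob under one column, the
# extension under another, so a type lookup never has to read the blob.
# "image:data" / "image:ext" are illustrative column names (assumption).

def image_row(image_id: str, data: bytes, ext: str):
    # Short hash prefix so sequential ids spread across regions
    # (a common rowkey-salting trick; the thread does not specify one).
    prefix = hashlib.md5(image_id.encode()).hexdigest()[:4]
    rowkey = f"{prefix}-{image_id}"
    columns = {"image:data": data, "image:ext": ext.encode()}
    return rowkey, columns

rowkey, cols = image_row("img0001", b"\x89PNG...", "png")
```

In a real setup, the resulting rowkey and columns would go into an HBase Put, or into a queue in front of HBase as Jack describes, so inserts survive maintenance windows.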
Re: Storing images in Hbase
I forgot to mention that I also have this setup:

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>33554432</value>
  <description>Flush more often. Default: 67108864</description>
</property>

This parameter applies per region, so if any of my 400 (current) regions on a regionserver accumulates 30MB+ in its memstore, HBase will flush it to disk. Here are some metrics from a regionserver:

requests=2, regions=370, stores=370, storefiles=1390, storefileIndexSize=304, memstoreSize=2233, compactionQueueSize=0, flushQueueSize=0, usedHeap=3516, maxHeap=4987, blockCacheSize=790656256, blockCacheFree=255245888, blockCacheCount=2436, blockCacheHitCount=218015828, blockCacheMissCount=13514652, blockCacheEvictedCount=2561516, blockCacheHitRatio=94, blockCacheHitCachingRatio=98

Note that the total memstore is only about 2GB, while this particular regionserver's heap is set to 5GB. And last but not least, it's very important to have a good GC setup:

export HBASE_OPTS="$HBASE_OPTS -verbose:gc -Xms5000m \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+HeapDumpOnOutOfMemoryError -Xloggc:$HBASE_HOME/logs/gc-hbase.log \
  -XX:MaxTenuringThreshold=15 -XX:SurvivorRatio=8 \
  -XX:+UseParNewGC \
  -XX:NewSize=128m -XX:MaxNewSize=128m \
  -XX:-UseAdaptiveSizePolicy \
  -XX:+CMSParallelRemarkEnabled \
  -XX:-TraceClassUnloading"

-Jack

On Thu, Jan 17, 2013 at 3:29 PM, Varun Sharma va...@pinterest.com wrote: Hey Jack, thanks for the useful information. By flush size being 15%, do you mean the memstore flush size? 15% would mean close to 1G. Have you seen any issues with flushes taking too long? Thanks, Varun
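To illustrate the per-region flush behavior Jack describes (any region whose memstore crosses the 33554432-byte threshold gets flushed to disk on its own), here is a hypothetical Python sketch; this is a toy model for the mailing-list discussion, not HBase code, and all names are illustrative:

```python
FLUSH_SIZE = 33554432  # hbase.hregion.memstore.flush.size: 32 MiB per region

class RegionServer:
    def __init__(self):
        self.memstores = {}   # region name -> bytes buffered in memory
        self.flushed = []     # log of (region, bytes) flush events

    def write(self, region, nbytes):
        """Buffer a write; flush the region if its memstore crosses the threshold."""
        total = self.memstores.get(region, 0) + nbytes
        if total >= FLUSH_SIZE:
            self.flushed.append((region, total))  # memstore written out as a storefile
            total = 0
        self.memstores[region] = total

rs = RegionServer()
for _ in range(40):                 # forty 1 MiB images into one region
    rs.write("region-007", 1 << 20)
print(rs.flushed)                   # [('region-007', 33554432)]
print(rs.memstores["region-007"])   # 8388608 -- 8 MiB still buffered
```

Because the threshold is per region, a server with hundreds of regions flushes small files often, which matches the "Flush more often" description in the config above.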
Re: Storing images in Hbase
Hey Jack,

Thanks for the useful information. By flush size being 15%, do you mean the memstore flush size? 15% would mean close to 1G. Have you seen any issues with flushes taking too long?

Thanks,
Varun

On Sun, Jan 13, 2013 at 8:17 AM, Jack Levin magn...@gmail.com wrote: That's right: memstore size, not flush size, is increased. Filesize is 10G. Overall write cache is 60% of heap and read cache is 20%; flush size is 15%. 64 maxlogs at 128MB. One namenode server, and one secondary that can be promoted.
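As a rough sanity check on the ratios Jack quotes (write cache 60% of heap, read cache 20%, flush size 15%), and assuming the 5000 MB regionserver heap from the GC settings elsewhere in the thread, the budget works out as below. This is plain arithmetic, not an HBase API:

```python
heap_mb = 5000  # -Xms5000m, per the GC options quoted in the thread

write_cache_mb = heap_mb * 0.60   # global memstore limit ("write cache")
read_cache_mb  = heap_mb * 0.20   # block cache ("read cache")
flush_mb       = heap_mb * 0.15   # flush size as a fraction of heap

print(write_cache_mb, read_cache_mb, flush_mb)  # 3000.0 1000.0 750.0
```

So on a 5 GB heap, "15%" is about 750 MB rather than the 1 GB Varun estimates; the remaining ~5% of heap is left for RPC threads and everything else.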
Re: Storing images in Hbase
Thanks, Jack, for sharing this information. This definitely makes sense when using that type of caching layer. You mentioned increasing the write cache; I am assuming you had to increase the following parameters in addition to increasing the memstore size: hbase.hregion.max.filesize and hbase.hregion.memstore.flush.size.

On Fri, Jan 11, 2013 at 9:47 AM, Jack Levin magn...@gmail.com wrote: We buffer all accesses to HBase with a Varnish SSD-based caching layer, so the impact for reads is negligible. We have a 70-node cluster, 8 GB of RAM per node, relatively weak nodes (Intel Core 2 Duo), with 10-12TB of disk per server, inserting 600,000 images per day. We have relatively little compaction activity, as we made our write cache much larger than our read cache - so we don't experience region file fragmentation as much. -Jack
RE: Storing images in Hbase
That's right: memstore size, not flush size, is increased. Filesize is 10G. Overall write cache is 60% of heap and read cache is 20%; flush size is 15%. 64 maxlogs at 128MB. One namenode server, and one secondary that can be promoted. On the way to HBase, images are written to a queue, so that we can take HBase down for maintenance and still do inserts later. ImageShack has ‘perma cache’ servers that allow writes and serving of data even when HBase is down for hours; consider it a 4th replica outside of Hadoop.

Jack

From: Mohit Anchlia mohitanch...@gmail.com
Sent: January 13, 2013 7:48 AM
To: user@hbase.apache.org
Subject: Re: Storing images in Hbase

Thanks, Jack, for sharing this information. This definitely makes sense when using that type of caching layer. You mentioned increasing the write cache; I am assuming you had to increase the following parameters in addition to increasing the memstore size: hbase.hregion.max.filesize and hbase.hregion.memstore.flush.size.
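The insert path Jack describes - writes land in a queue in front of HBase, so the store can be taken down for maintenance without losing inserts - can be sketched as a hypothetical in-memory buffer-and-drain loop. The real setup presumably used a durable queue; the class and names here are illustrative only:

```python
from collections import deque

class BufferedImageWriter:
    """Toy model of the queue-in-front-of-HBase insert path."""
    def __init__(self):
        self.queue = deque()   # pending inserts while HBase is unreachable
        self.store = {}        # stand-in for the HBase table
        self.hbase_up = True

    def insert(self, key, image_bytes):
        self.queue.append((key, image_bytes))  # always enqueue first
        self.drain()

    def drain(self):
        while self.queue and self.hbase_up:
            key, blob = self.queue.popleft()
            self.store[key] = blob             # the actual Put would happen here

w = BufferedImageWriter()
w.insert("img1", b"...")
w.hbase_up = False        # take HBase down for maintenance
w.insert("img2", b"...")  # buffered, not lost
w.hbase_up = True
w.drain()                 # maintenance over; the backlog drains
print(sorted(w.store))    # ['img1', 'img2']
```

The ‘perma cache’ servers then cover the read side of the same outage, which is why Jack can treat the pair as a fourth replica outside Hadoop.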
Re: Storing images in Hbase
That's a viable option. HDFS reads are faster than HBase, but it would require first hitting the index in HBase, which points to the file, and then fetching the file. It could be faster... we found storing binary data in a sequence file indexed in HBase to be faster than HBase alone; however, YMMV, and HBase has improved since we did that project.

On Jan 10, 2013, at 10:56 PM, shashwat shriparv dwivedishash...@gmail.com wrote: Hi Kavish, I have a better idea for you: copy your image files into a single file on HDFS, and when a new image comes, append it to the existing file, then keep and update the metadata and the offset in HBase. If you put bigger images directly into HBase, it will lead to issues.

∞ Shashwat Shriparv
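The pattern Michael and Shashwat describe - append blobs to one large file and keep each image's (offset, length) in an index - can be sketched as follows. This is a hypothetical illustration using an in-memory buffer in place of an HDFS SequenceFile, with a plain dict standing in for the HBase index table:

```python
import io

class BlobStore:
    """Append-only blob file plus an offset/length index (HBase stand-in)."""
    def __init__(self):
        self.data = io.BytesIO()  # stand-in for the single HDFS file
        self.index = {}           # image key -> (offset, length), kept in HBase

    def put(self, key, blob):
        offset = self.data.seek(0, io.SEEK_END)  # append at end of file
        self.data.write(blob)
        self.index[key] = (offset, len(blob))

    def get(self, key):
        offset, length = self.index[key]  # one small index lookup...
        self.data.seek(offset)
        return self.data.read(length)     # ...then one ranged read

store = BlobStore()
store.put("cat.jpg", b"\xff\xd8cat-bytes")
store.put("dog.jpg", b"\xff\xd8dog-bytes")
print(store.get("cat.jpg"))  # b'\xff\xd8cat-bytes'
```

The trade-off is exactly the one Michael notes: every read costs an index hit plus a file fetch, but the blob data itself never passes through HBase compactions.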
Re: Storing images in Hbase
IMHO, if the image files are not too huge, HBase can efficiently serve the purpose. You can store some additional info along with the file, depending on your search criteria, to make searches faster. Say you want to fetch images by type: you can store the image in one column and its extension in another (jpg, tiff, etc.). BTW, what exactly is the problem you are facing? You have written "But I still can't do it"?

Warm Regards,
Tariq
https://mtariq.jux.com/

On Fri, Jan 11, 2013 at 8:30 PM, Michael Segel michael_se...@hotmail.com wrote: That's a viable option. HDFS reads are faster than HBase, but it would require first hitting the index in HBase, which points to the file, and then fetching the file.
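Tariq's schema suggestion - blob in one column, a small extension column beside it so type lookups never touch the image bytes - can be modeled with a toy in-memory table. This is not the HBase client API; row keys and qualifiers are illustrative:

```python
# Toy row model: rowkey -> {column qualifier: value}, mimicking one column
# family with an 'img:data' blob column and a small 'img:ext' column.
table = {
    "photo-001": {"img:data": b"...jpeg bytes...", "img:ext": "jpg"},
    "photo-002": {"img:data": b"...tiff bytes...", "img:ext": "tiff"},
    "photo-003": {"img:data": b"...jpeg bytes...", "img:ext": "jpg"},
}

def keys_with_extension(table, ext):
    """Scan only the small 'img:ext' column to find rows of a given type."""
    return sorted(k for k, row in table.items() if row["img:ext"] == ext)

print(keys_with_extension(table, "jpg"))  # ['photo-001', 'photo-003']
```

In real HBase the same effect comes from scanning with the metadata column selected, so the filter reads a few bytes per row instead of megabyte blobs.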
Re: Storing images in Hbase
I think it really depends on the volume of traffic, the data distribution per region, how and when file compaction occurs, and the number of nodes in the cluster. In my experience, when it comes to blob data where you are serving tens of thousands of write and read requests/sec, it's very difficult to manage HBase without very hard operations and maintenance in play. Jack earlier mentioned they have 1 billion images; it would be interesting to know what they see in terms of compaction and requests per second. I'd be surprised if a high-volume site could do it without any caching layer on top to alleviate the IO spikes that occur because of GC and compactions.

On Fri, Jan 11, 2013 at 7:27 AM, Mohammad Tariq donta...@gmail.com wrote: IMHO, if the image files are not too huge, HBase can efficiently serve the purpose.
Re: Storing images in Hbase
We buffer all accesses to HBase with a Varnish SSD-based caching layer, so the impact for reads is negligible. We have a 70-node cluster, 8 GB of RAM per node, relatively weak nodes (Intel Core 2 Duo), with 10-12TB of disk per server, inserting 600,000 images per day. We have relatively little compaction activity, as we made our write cache much larger than our read cache - so we don't experience region file fragmentation as much.

-Jack

On Fri, Jan 11, 2013 at 9:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I think it really depends on the volume of traffic, the data distribution per region, how and when file compaction occurs, and the number of nodes in the cluster.
Re: Storing images in Hbase
http://img338.imageshack.us/img338/6831/screenshot20130111at949.png this shows how often we flush, and how large are the region files. We do have bloomfilters turn up, that we don't incur extra seeks across multiple RS files. -Jack On Fri, Jan 11, 2013 at 9:47 AM, Jack Levin magn...@gmail.com wrote: We buffer all accesses to HBASE with Varnish SSD based caching layer. So the impact for reads is negligible. We have 70 node cluster, 8 GB of RAM per node, relatively weak nodes (intel core 2 duo), with 10-12TB per server of disks. Inserting 600,000 images per day. We have relatively little of compaction activity as we made our write cache much larger than read cache - so we don't experience region file fragmentation as much. -Jack On Fri, Jan 11, 2013 at 9:40 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I think it really depends on volume of the traffic, data distribution per region, how and when files compaction occurs, number of nodes in the cluster. In my experience when it comes to blob data where you are serving 10s of thousand+ requests/sec writes and reads then it's very difficult to manage HBase without very hard operations and maintenance in play. Jack earlier mentioned they have 1 billion images, It would be interesting to know what they see in terms of compaction, no of requests per sec. I'd be surprised that in high volume site it can be done without any Caching layer on the top to alleviate IO spikes that occurs because of GC and compactions. On Fri, Jan 11, 2013 at 7:27 AM, Mohammad Tariq donta...@gmail.com wrote: IMHO, if the image files are not too huge, Hbase can efficiently serve the purpose. You can store some additional info along with the file depending upon your search criteria to make the search faster. Say if you want to fetch images by the type, you can store images in one column and its extension in another column(jpg, tiff etc). BTW, what exactly is the problem which you are facing. You have written But I still cant do it? 
Warm Regards, Tariq https://mtariq.jux.com/
On Fri, Jan 11, 2013 at 8:30 PM, Michael Segel michael_se...@hotmail.com wrote: That's a viable option. HDFS reads are faster than HBase, but it would require first hitting the index in HBase, which points to the file, and then fetching the file. It could be faster... we found storing binary data in a sequence file indexed in HBase to be faster than HBase alone; however, YMMV, and HBase has been improved since we did that project.
On Jan 10, 2013, at 10:56 PM, shashwat shriparv dwivedishash...@gmail.com wrote: Hi Kavish, I have a better idea for you: copy your image files into a single file on HDFS, and when a new image comes, append it to the existing file, and keep the metadata and the offset updated in HBase. Because if you put bigger images in HBase it will lead to issues. ∞ Shashwat Shriparv
On Fri, Jan 11, 2013 at 9:21 AM, lars hofhansl la...@apache.org wrote: Interesting. That's close to a PB if my math is correct. Is there a write-up about this somewhere? Something that we could link from the HBase homepage? -- Lars
----- Original Message ----- From: Jack Levin magn...@gmail.com To: user@hbase.apache.org Cc: Andrew Purtell apurt...@apache.org Sent: Thursday, January 10, 2013 9:24 AM Subject: Re: Storing images in Hbase
We stored about 1 billion images in HBase, with file sizes up to 10 MB. It's been running for close to 2 years without issues and serves delivery of images for Yfrog and ImageShack. If you have any questions about the setup, I would be glad to answer them. -Jack
On Sun, Jan 6, 2013 at 1:09 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I have done extensive testing and have found that blobs don't belong in databases but are best left out on the file system. Andrew outlined the issues that you'll face, not to mention the IO issues when compaction occurs over large files.
On Sun, Jan 6, 2013 at 12:52 PM, Andrew Purtell apurt...@apache.org wrote: I meant this to say "a few really large values". On Sun, Jan 6, 2013 at 12:49 PM, Andrew Purtell apurt...@apache.org wrote: Consider if the split threshold is 2 GB but your one row contains 10 GB as a really large value. -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
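For readers wondering how Jack's "write cache much larger than read cache" tuning is expressed in config: in HBase of that era these were the relevant hbase-site.xml properties. This is an illustrative sketch, not Jack's actual config; the 0.6/0.2 split mirrors the heap fractions he mentions elsewhere in the thread.

```xml
<!-- Illustrative hbase-site.xml fragment (0.9x-era property names);
     values are examples, not the poster's actual settings. -->
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.6</value>
  <description>Write cache: max fraction of heap usable by all memstores combined.</description>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.2</value>
  <description>Read cache: fraction of heap given to the block cache.</description>
</property>
```

The bloom filters Jack mentions are enabled per column family, e.g. from the HBase shell with something like alter 'images', {NAME => 'img', BLOOMFILTER => 'ROW'} (table and family names here are hypothetical).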
Re: Storing images in Hbase
It would be nice to have a blog post around this. On 11/01/2013 at 0:51, lars hofhansl wrote: Interesting. That's close to a PB if my math is correct. Is there a write-up about this somewhere? Something that we could link from the HBase homepage? -- Lars
Re: Storing images in Hbase
We stored about 1 billion images in HBase, with file sizes up to 10 MB. It's been running for close to 2 years without issues and serves delivery of images for Yfrog and ImageShack. If you have any questions about the setup, I would be glad to answer them. -Jack
Re: Storing images in Hbase
Jack, yes, it would be very interesting to know your setup details. Could you please provide more information? Or we can take this off the list if you like… Thank you! Sincerely, Leonid Fedotov
Re: Storing images in Hbase
Jack, Leonid, I request that you both please continue the discussion through the thread itself if possible. I would like to know about Jack's setup; I too find it quite interesting. Many thanks. Warm Regards, Tariq https://mtariq.jux.com/
Re: Storing images in Hbase
This is a very interesting setup to analyze. I'm working on a similar problem with HBase, so any help is welcome. On 10/01/2013 at 16:39, Doug Meil wrote: +1. This question comes up often enough on the dist-list that it's worth getting some pointers on record. -- Marcos Ortíz Valmaseda Blog: http://marcosluis2186.posterous.com Twitter: @marcosluis2186 http://twitter.com/marcosluis2186
Re: Storing images in Hbase
I'm voting for continuing here as well… So, the location is up to Jack. :) Thank you! Sincerely, Leonid Fedotov
Re: Storing images in Hbase
Thanks Leonid. Warm Regards, Tariq https://mtariq.jux.com/
Re: Storing images in Hbase
Been there, done that... it's kind of an interesting problem. Someone earlier said that HBase isn't good for images. It works pretty well; again, it depends on the use case. Your schema is also going to play a role, and you're going to have to tune things a little differently, because when you pull an image you're pulling a larger chunk of data, and you want to make sure you can fit a decent number of images within a region. How are you planning on using the images? Are you going to run an M/R job and see if you can't spot landmarks and businesses in a photo? Language translations? Or is it just a repository?
Re: Storing images in Hbase
Hi Kavish, I have a better idea for you: copy your image files into a single file on HDFS, and when a new image comes, append it to the existing file, and keep the metadata and the offset updated in HBase. Because if you put bigger images in HBase it will lead to issues. ∞ Shashwat Shriparv
Re: Storing images in Hbase
Hi there, thank you, and happy new year. I had the same problem and wrote a Python module [0] for thumbor [1]. I use the Thrift interface for HBase to store image blobs. As already said, you have to keep image blobs quite small (for web latency reasons you have to keep them small anyway), around 100 KB, so HBase should keep good performance. BTW, StumbleUpon stores all its assets in HBase: http://bb10.com/java-hadoop-hbase-user/2012-03/msg00054.html [0] https://github.com/dhardy92/thumbor_hbase [1] https://github.com/globocom/thumbor/wiki Cheers, -- Damien
On Jan 6, 2013 at 04:46, kavishahuja kavishah...@yahoo.com wrote: Hello everybody, first of all, a happy new year to everyone!! I need a small bit of help regarding pushing images into Apache HBase. I know it's about converting objects into bytes and then saving those bytes into HBase rows, but I still can't do it. Kindly help!! Regards, Kavish
Re: Storing images in Hbase
There are a lot of great discussions on Quora on this topic. http://www.quora.com/Apache-Hadoop/Is-HBase-appropriate-for-indexed-blob-storage-in-HDFS http://www.quora.com/Is-it-possible-to-use-HDFS-HBase-to-serve-images http://www.quora.com/What-is-a-good-choice-for-storing-blob-like-files-in-a-distributed-environment
Re: Storing images in Hbase
Also YFrog / ImageShack serves all of its assets out of HBase too, so for reasonably sized images some are having success. See http://www.slideshare.net/jacque74/hug-hbase-presentation -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: Storing images in Hbase
What's the performance penalty of saving a very large value in a KeyValue in HBase? Splits, scans, etc. Sent from my iPad
Re: Storing images in Hbase
To add to Andy's point: storing images in HBase is fine as long as the size of each image isn't huge. A couple of MBs per row in HBase does just fine. But once you start getting into tens of MBs, there are more optimal solutions you can explore, and HBase might not be the best bet. Amandeep
Re: Storing images in Hbase
What do you mean by very large?

One possible source of performance concern is that HBase RPC does not do positioned/chunked/partial reads, so both on the RegionServer and the client the entirety of the value data will be in the heap. A lot of really large objects brought in this way under high concurrency can cause excessive GC from fragmentation, or OOME conditions if the heap isn't adequately sized. The recommendation of ~10 MB max is to mitigate these effects. There is nothing scientific about that number though; it's a rule of thumb. I've built HBase applications with a max value size of 100 MB and they performed adequately. (Larger objects were split into 100 MB chunks and keyed as $rowkey$chunk, where $chunk was an integer serialized with Bytes.toInt().)

Another concern is a consequence of the fact that a row cannot be split. This means that if the data in a single row grows significantly larger than the region split threshold, you will have this one region sized differently from the others, and this can lead to unexpected behavior. Consider if the split threshold is 2 GB but your one row contains 10 GB as a really large value. This is undesirable because HBase expects housekeeping on a given region to be more or less equal to the others: compaction, etc.

From the application POV, if you have a few really big value-size outliers, then these could be like land mines if the app is short-scanning over table data. Gets or Scans including such values will have widely varying latency from the others. But that would be an application design problem.

-- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
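Andy's $rowkey$chunk scheme can be sketched as below. This is my illustration, not his code; chunk_keys and reassemble are hypothetical helper names. (In HBase's Bytes utility class, Bytes.toBytes(int) is what writes the 4-byte big-endian integer; Bytes.toInt() is the inverse. struct.pack mimics that encoding here.)

```python
import struct

CHUNK_SIZE = 100 * 1024 * 1024  # 100 MB, as in the message above

def chunk_keys(rowkey: bytes, blob: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a large value into fixed-size chunks, each keyed as
    rowkey + 4-byte big-endian chunk index (like Bytes.toBytes(int))."""
    n_chunks = (len(blob) + chunk_size - 1) // chunk_size
    return [
        (rowkey + struct.pack(">i", i),
         blob[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_chunks)
    ]

def reassemble(cells):
    """Rebuild the blob from (key, value) cells in any order.

    Sorting on the trailing 4 bytes works because big-endian encoding
    makes lexicographic byte order match numeric order for
    non-negative indexes; it is also why a Scan over the rowkey prefix
    returns the chunks in the right order."""
    return b"".join(v for _, v in sorted(cells, key=lambda kv: kv[0][-4:]))
```

A nice property of this keying is that all chunks of one object sort adjacently under the rowkey prefix, so a single Scan retrieves them, while each individual cell stays under the per-value size that keeps RPC heap usage sane.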
Re: Storing images in Hbase
I have done extensive testing and have found that blobs don't belong in databases but are best left out on the file system. Andrew outlined the issues that you'll face, not to mention the IO issues when compaction occurs over large files.