Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
Is this map creation happening on the client side?

But how does the client know which RegionServer will contain that row key for
a put operation without first asking the hbase:meta table?
Does the HBase client first get the key ranges of each RegionServer and then
group Put objects by RegionServer?

On Fri, Jul 17, 2015 at 7:48 PM, Ted Yu yuzhih...@gmail.com wrote:

 Internally AsyncProcess uses a Map which is keyed by server name:

 Map<ServerName, MultiAction<Row>> actionsByServer =
     new HashMap<ServerName, MultiAction<Row>>();

 Here MultiAction groups the Puts in your example that are destined for the
 same server.

 Cheers


Re: spark streaming job to hbase write

2015-07-17 Thread Ted Yu
It resorts to the following method for finding the region location:

  private RegionLocations locateRegionInMeta(TableName tableName, byte[] row,
      boolean useCache, boolean retry, int replicaId) throws IOException {

Note: useCache is true in this call path, meaning the client-side cache is
consulted first to reduce RPCs to the server hosting hbase:meta.
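
For illustration only (this is not the internal code path, just the public API
view of the same lookup), a client can see which server hosts a row via
RegionLocator. A minimal sketch, assuming an HBase 1.x client and a
hypothetical table "t1":

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HRegionLocation;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.RegionLocator;
  import org.apache.hadoop.hbase.util.Bytes;

  public class LocateRowExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           RegionLocator locator = conn.getRegionLocator(TableName.valueOf("t1"))) {
        // First lookup goes to hbase:meta; the location is then cached client side.
        HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("row-key"));
        System.out.println("Row is served by: " + loc.getServerName());
        // A repeat lookup for the same region is answered from the cache
        // unless a reload is explicitly requested (second argument true).
        HRegionLocation cached = locator.getRegionLocation(Bytes.toBytes("row-key"), false);
        System.out.println("Cached location: " + cached.getServerName());
      }
    }
  }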

Cheers

On Fri, Jul 17, 2015 at 7:41 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 Is this map creation happening on the client side?

 But how does the client know which RegionServer will contain that row key for
 a put operation without first asking the hbase:meta table?
 Does the HBase client first get the key ranges of each RegionServer and then
 group Put objects by RegionServer?


Re: spark streaming job to hbase write

2015-07-17 Thread Shushant Arora
Thanks!

My key is random (hexadecimal), so hot-spotting should not occur.

Is there any concept of a bulk put? Say I want to issue one put request for a
batch of 1000 that hits a region server, instead of an individual put for each
key.


Does HTable.put(List<Put>) handle batching of Puts based on the RegionServer
they will finally land on? Say in my batch there are 10 Puts: 5 for RS1, 3 for
RS2, and 2 for RS3. Does it handle that?
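
For context, a minimal sketch of this kind of batched write from a Spark
Streaming job (assumptions: an HBase 1.x client, a hypothetical table "t1"
with column family "cf", and an RDD of row keys passed in from foreachRDD for
each micro-batch):

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.spark.api.java.JavaRDD;

  public class BatchedHBaseWrite {
    // Call this from foreachRDD for every micro-batch; each partition is
    // written as a single put(List<Put>) call.
    static void writeBatch(JavaRDD<String> keys) {
      keys.foreachPartition((Iterator<String> rows) -> {
        Configuration conf = HBaseConfiguration.create();
        // One connection per partition keeps the sketch simple; a real job
        // would reuse or pool connections.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("t1"))) {
          List<Put> batch = new ArrayList<>();
          while (rows.hasNext()) {
            Put p = new Put(Bytes.toBytes(rows.next()));
            p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            batch.add(p);
          }
          // The HBase client groups these Puts by region server before sending.
          table.put(batch);
        }
      });
    }
  }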









On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel michael_se...@hotmail.com
wrote:

 You ask an interesting question…

 Let’s set aside Spark and look at the overall ingestion pattern.

 It’s really an ingestion pattern where your input into the system is from a
 queue.

 Are the events discrete or continuous? (This is kinda important.)

 If the events are continuous then more than likely you’re going to be
 ingesting data where the key is somewhat sequential. If you use put(), you
 end up with hot spotting. And you’ll end up with regions half full.
 So you would be better off batching up the data and doing bulk imports.

 If the events are discrete, then you’ll want to use put() because the odds
 are you will not be using a sequential key. (You could, but I’d suggest
 that you rethink your primary key)

 Depending on the rate of ingestion, you may want to do a manual flush. (It
 depends on the velocity of data to be ingested and your use case )
 (Remember what caching occurs and where when dealing with HBase.)

 A third option… Depending on how you use the data, you may want to avoid
 storing the data in HBase, and only use HBase as an index to where you
 store the data files for quick access.  Again it depends on your data
 ingestion flow and how you intend to use the data.

 So really this is less a spark issue than an HBase issue when it comes to
 design.

 HTH

 -Mike


Re: spark streaming job to hbase write

2015-07-17 Thread Ted Yu
Internally AsyncProcess uses a Map which is keyed by server name:

Map<ServerName, MultiAction<Row>> actionsByServer =
    new HashMap<ServerName, MultiAction<Row>>();

Here MultiAction groups the Puts in your example that are destined for the
same server.
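
Conceptually (this is not the AsyncProcess code itself), that grouping is
equivalent to bucketing Puts by the server hosting each row's region. A
simplified sketch using the public RegionLocator API, with all names
hypothetical:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.hadoop.hbase.ServerName;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.RegionLocator;

  public class GroupPutsByServer {
    // Buckets Puts by the region server currently hosting each row's region.
    static Map<ServerName, List<Put>> group(RegionLocator locator, List<Put> puts)
        throws java.io.IOException {
      Map<ServerName, List<Put>> byServer = new HashMap<>();
      for (Put p : puts) {
        ServerName server = locator.getRegionLocation(p.getRow()).getServerName();
        byServer.computeIfAbsent(server, s -> new ArrayList<>()).add(p);
      }
      return byServer;
    }
  }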

Cheers

On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora shushantaror...@gmail.com
wrote:

 Thanks!

 My key is random (hexadecimal), so hot-spotting should not occur.

 Is there any concept of a bulk put? Say I want to issue one put request for a
 batch of 1000 that hits a region server, instead of an individual put for each
 key.


 Does HTable.put(List<Put>) handle batching of Puts based on the RegionServer
 they will finally land on? Say in my batch there are 10 Puts: 5 for RS1, 3 for
 RS2, and 2 for RS3. Does it handle that?

Re: spark streaming job to hbase write

2015-07-16 Thread Michael Segel
You ask an interesting question… 

Let’s set aside Spark and look at the overall ingestion pattern.

It’s really an ingestion pattern where your input into the system is from a
queue.

Are the events discrete or continuous? (This is kinda important.) 

If the events are continuous, then more than likely you’re going to be
ingesting data where the key is somewhat sequential. If you use put(), you end
up with hot-spotting, and you’ll end up with regions half full.
So you would be better off batching up the data and doing bulk imports.
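
For reference, a bulk import along those lines usually means writing HFiles
with a MapReduce job and then handing them to the region servers. A rough
sketch only, assuming an HBase 1.x client, a hypothetical table "t1" with
column family "cf", and CSV input lines of the form "key,value":

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.RegionLocator;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class BulkLoadSketch {
    // Turns one "key,value" line into a Put keyed by the row key.
    public static class ToPutMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws java.io.IOException, InterruptedException {
        String[] parts = line.toString().split(",", 2);
        byte[] row = Bytes.toBytes(parts[0]);
        Put put = new Put(row);
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
        ctx.write(new ImmutableBytesWritable(row), put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = Job.getInstance(conf, "hfile-bulk-load-sketch");
      job.setJarByClass(BulkLoadSketch.class);
      job.setMapperClass(ToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("t1"));
           RegionLocator locator = conn.getRegionLocator(TableName.valueOf("t1"))) {
        // Configures the partitioner, reducer and output format so the HFiles
        // line up with the table's current region boundaries.
        HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
        job.waitForCompletion(true);
      }
      // The generated HFiles are then moved into the table with, for example:
      //   hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hfile-dir> t1
    }
  }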

If the events are discrete, then you’ll want to use put() because the odds are 
you will not be using a sequential key. (You could, but I’d suggest that you 
rethink your primary key) 

Depending on the rate of ingestion, you may want to do a manual flush. (It
depends on the velocity of the data to be ingested and your use case.)
(Remember what caching occurs and where when dealing with HBase.)
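
One way to control when buffered writes actually go to the region servers (a
sketch only, assuming an HBase 1.x client and a hypothetical table "t1";
older clients express the same idea with HTable.setAutoFlush(false) plus
flushCommits()):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.BufferedMutator;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ManualFlushExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           // BufferedMutator buffers mutations client side until flushed or
           // until its write buffer fills up.
           BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("t1"))) {
        for (int i = 0; i < 1000; i++) {
          Put p = new Put(Bytes.toBytes("row-" + i));
          p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
          mutator.mutate(p);
        }
        // Explicit, manual flush of everything buffered so far.
        mutator.flush();
      }
    }
  }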

A third option… Depending on how you use the data, you may want to avoid
storing the data in HBase, and only use HBase as an index to where you store
the data files for quick access. Again, it depends on your data ingestion flow
and how you intend to use the data.

So really this is less a Spark issue than an HBase issue when it comes to
design.

HTH

-Mike
 On Jul 15, 2015, at 11:46 AM, Shushant Arora shushantaror...@gmail.com 
 wrote:
 
 Hi
 
 I have a requirement of writing to an HBase table from a Spark Streaming app
 after some processing.
 Is the HBase put operation the only way of writing to HBase, or is there any
 specialised connector or RDD in Spark for HBase writes?

 Should bulk load to HBase from a streaming app be avoided if the output of
 each batch interval is just a few MBs?
 
 Thanks
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming job to hbase write

2015-07-15 Thread Shushant Arora
Hi

I have a requirement of writing to an HBase table from a Spark Streaming app
after some processing.
Is the HBase put operation the only way of writing to HBase, or is there any
specialised connector or RDD in Spark for HBase writes?

Should bulk load to HBase from a streaming app be avoided if the output of
each batch interval is just a few MBs?

Thanks


Re: spark streaming job to hbase write

2015-07-15 Thread Todd Nist
There are connector packages listed on the Spark Packages web site:

http://spark-packages.org/?q=hbase

HTH.

-Todd

On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora shushantaror...@gmail.com
wrote:

 Hi

 I have a requirement of writing to an HBase table from a Spark Streaming app
 after some processing.
 Is the HBase put operation the only way of writing to HBase, or is there any
 specialised connector or RDD in Spark for HBase writes?

 Should bulk load to HBase from a streaming app be avoided if the output of
 each batch interval is just a few MBs?

 Thanks