Re: spark streaming job to hbase write
Is this map creation happening on the client side? If so, how does the client know which region server will hold a given row key for a put operation without first asking the hbase:meta table? Does the HBase client first fetch the key ranges of each region server and then group the Put objects by region server?

On Fri, Jul 17, 2015 at 7:48 PM, Ted Yu yuzhih...@gmail.com wrote:
Re: spark streaming job to hbase write
It resorts to the following method for finding the region location:

private RegionLocations locateRegionInMeta(TableName tableName, byte[] row, boolean useCache, boolean retry, int replicaId) throws IOException

Note: useCache is true in this call path, meaning the client-side cache is consulted to reduce RPCs to the server hosting hbase:meta.

Cheers

On Fri, Jul 17, 2015 at 7:41 AM, Shushant Arora shushantaror...@gmail.com wrote:
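The client-side cache Ted mentions can be pictured as a sorted map from each region's start key to its location, queried with a floor lookup per row key: a row belongs to the region with the greatest start key less than or equal to the row. A minimal self-contained sketch of that lookup (plain Java with String keys and made-up "rs1"/"rs2"/"rs3" server names standing in for real HRegionLocation objects — not the actual HBase client code, which uses byte[] keys and a concurrent map):

```java
import java.util.TreeMap;

public class RegionCacheSketch {
    // Maps each region's start key to the server hosting it.
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    void cacheRegion(String startKey, String server) {
        startKeyToServer.put(startKey, server);
    }

    // A row belongs to the region with the greatest start key <= row.
    String locate(String row) {
        return startKeyToServer.floorEntry(row).getValue();
    }

    public static void main(String[] args) {
        RegionCacheSketch cache = new RegionCacheSketch();
        cache.cacheRegion("", "rs1");      // first region starts at the empty key
        cache.cacheRegion("aaaa", "rs2");
        cache.cacheRegion("cccc", "rs3");
        System.out.println(cache.locate("ab12")); // in ["aaaa","cccc") -> rs2
        System.out.println(cache.locate("ffff")); // in ["cccc", end)   -> rs3
    }
}
```

Only on a cache miss (or after a stale-location error) does the real client go back to hbase:meta, which is why useCache=true keeps the RPC count down.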
Re: spark streaming job to hbase write
Thanks! My key is random (hexadecimal), so a hot spot should not be created. Is there any concept of a bulk put? Say I want to make one put request for a batch of 1000 that hits a region server, instead of an individual put per key:

HTable.put(List<Put> puts)

Does this handle batching of the puts based on the region server they will finally land on? Say my batch has 10 puts: 5 for RS1, 3 for RS2 and 2 for RS3. Does it handle that?

On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel michael_se...@hotmail.com wrote:
Re: spark streaming job to hbase write
Internally AsyncProcess uses a Map which is keyed by server name:

Map<ServerName, MultiAction<Row>> actionsByServer = new HashMap<ServerName, MultiAction<Row>>();

Here MultiAction would group the Puts in your example which are destined for the same server.

Cheers

On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora shushantaror...@gmail.com wrote:
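The grouping Ted describes can be sketched without the HBase client: given each put's destination server, collect the puts per server and issue one batched call per server. A self-contained illustration (plain Java; the row keys, the "rs1"/"rs2" server names, and the first-hex-digit locator are invented for the example and are not real HBase API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class GroupBySketch {
    // Groups row keys by their destination server, mimicking
    // AsyncProcess's actionsByServer map.
    static Map<String, List<String>> groupByServer(
            List<String> rowKeys, Function<String, String> locate) {
        Map<String, List<String>> byServer = new HashMap<>();
        for (String row : rowKeys) {
            byServer.computeIfAbsent(locate.apply(row), s -> new ArrayList<>())
                    .add(row);
        }
        return byServer;
    }

    public static void main(String[] args) {
        // Hypothetical locator: the first hex digit decides the server.
        Function<String, String> locate =
                row -> row.charAt(0) <= '7' ? "rs1" : "rs2";
        List<String> rows = List.of("1a", "9f", "3c", "8b", "2d");
        Map<String, List<String>> grouped = groupByServer(rows, locate);
        System.out.println(grouped.get("rs1")); // [1a, 3c, 2d]
        System.out.println(grouped.get("rs2")); // [9f, 8b]
    }
}
```

So in Shushant's example of 10 puts spread across three region servers, one call to put(List&lt;Put&gt;) resolves to (at most) three per-server batches, not ten individual RPCs.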
Re: spark streaming job to hbase write
You ask an interesting question… Let's set aside Spark and look at the overall ingestion pattern. It's really an ingestion pattern where your input into the system is from a queue. Are the events discrete or continuous? (This is kinda important.)

If the events are continuous, then more than likely you're going to be ingesting data where the key is somewhat sequential. If you use put(), you end up with hot spotting, and you'll end up with regions half full. So you would be better off batching up the data and doing bulk imports.

If the events are discrete, then you'll want to use put() because the odds are you will not be using a sequential key. (You could, but I'd suggest that you rethink your primary key.) Depending on the rate of ingestion, you may want to do a manual flush. (It depends on the velocity of the data to be ingested and your use case.) (Remember what caching occurs and where when dealing with HBase.)

A third option… Depending on how you use the data, you may want to avoid storing the data in HBase, and only use HBase as an index to where you store the data files, for quick access. Again it depends on your data ingestion flow and how you intend to use the data.

So really this is less a Spark issue than an HBase issue when it comes to design.

HTH

-Mike

On Jul 15, 2015, at 11:46 AM, Shushant Arora shushantaror...@gmail.com wrote:
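Michael's hot-spotting point comes from sequential keys all sorting into the same (last) region. One common mitigation — not something Michael prescribes here, just a standard technique for context — is to salt the row key with a small stable prefix so consecutive keys spread across N key ranges. A minimal sketch in plain Java (the bucket count of 8 is arbitrary for the example):

```java
public class SaltedKeySketch {
    static final int BUCKETS = 8; // arbitrary bucket count for the example

    // Prefix a sequential key with a stable one-digit salt so that
    // consecutive keys spread across BUCKETS distinct key ranges.
    static String salt(String sequentialKey) {
        int bucket = Math.abs(sequentialKey.hashCode() % BUCKETS);
        return bucket + "-" + sequentialKey;
    }

    public static void main(String[] args) {
        // Consecutive event ids no longer sort into one contiguous range.
        for (long id = 1000; id < 1005; id++) {
            System.out.println(salt(Long.toString(id)));
        }
    }
}
```

The trade-off is that range scans over the original key order now need BUCKETS parallel scans, so this only fits workloads that are write-heavy or point-read-heavy.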
spark streaming job to hbase write
Hi, I have a requirement of writing to an HBase table from a Spark streaming app after some processing. Is the HBase put operation the only way of writing to HBase, or is there any specialised connector or Spark RDD for HBase writes? Should bulk load to HBase from a streaming app be avoided if the output of each batch interval is just a few MBs? Thanks
Re: spark streaming job to hbase write
There are three connector packages listed on the spark-packages web site: http://spark-packages.org/?q=hbase

HTH.

-Todd

On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora shushantaror...@gmail.com wrote: