It resorts to the following method for finding the region location:

  private RegionLocations locateRegionInMeta(TableName tableName, byte[] row,
      boolean useCache, boolean retry, int replicaId) throws IOException {

Note: useCache is true in this call path, meaning the client-side cache
would be consulted to reduce RPCs to the server hosting hbase:meta.

Cheers
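As an illustration of that cached lookup, here is a minimal sketch using
the public RegionLocator API of the HBase 1.x client (the table name and
row key below are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HRegionLocation;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.RegionLocator;
  import org.apache.hadoop.hbase.util.Bytes;

  public class LocateRow {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           RegionLocator locator =
               conn.getRegionLocator(TableName.valueOf("t1"))) {
        // The first lookup may RPC to the server hosting hbase:meta;
        // later lookups for rows in the same region should be answered
        // from the client-side location cache.
        HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("00af"));
        System.out.println("row 00af is served by " + loc.getServerName());
      }
    }
  }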
On Fri, Jul 17, 2015 at 7:41 AM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Is this map creation happening on the client side?
>
> But how does it know which RS will contain that row key in a put
> operation without first asking the hbase:meta table? Does the HBase
> client first get the ranges of keys of each RegionServer and then group
> Put objects based on RegionServers?
>
> On Fri, Jul 17, 2015 at 7:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Internally AsyncProcess uses a Map which is keyed by server name:
>>
>>   Map<ServerName, MultiAction<Row>> actionsByServer =
>>       new HashMap<ServerName, MultiAction<Row>>();
>>
>> Here MultiAction would group the Puts in your example which are
>> destined for the same server.
>>
>> Cheers
>>
>> On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Thanks!
>>>
>>> My key is random (hexadecimal), so a hot spot should not be created.
>>>
>>> Is there any concept of a bulk put? Say I want to raise one put
>>> request for a batch of 1000, which will hit a region server, instead
>>> of an individual put for each key.
>>>
>>> Does HTable.put(List<Put>) handle batching of Puts based on the
>>> RegionServer they will finally land on? Say in my batch there are 10
>>> Puts: 5 for RS1, 3 for RS2 and 2 for RS3. Does it handle that?
>>>
>>> On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <
>>> michael_se...@hotmail.com> wrote:
>>>
>>>> You ask an interesting question…
>>>>
>>>> Let's set aside Spark and look at the overall ingestion pattern.
>>>> It's really an ingestion pattern where your input into the system
>>>> comes from a queue.
>>>>
>>>> Are the events discrete or continuous? (This is kinda important.)
>>>>
>>>> If the events are continuous then more than likely you're going to
>>>> be ingesting data where the key is somewhat sequential. If you use
>>>> put(), you end up with hot spotting, and you'll end up with regions
>>>> half full. So you would be better off batching up the data and
>>>> doing bulk imports.
>>>>
>>>> If the events are discrete, then you'll want to use put(), because
>>>> the odds are you will not be using a sequential key. (You could,
>>>> but I'd suggest that you rethink your primary key.)
>>>>
>>>> Depending on the rate of ingestion, you may want to do a manual
>>>> flush. (It depends on the velocity of the data to be ingested and
>>>> your use case.) (Remember what caching occurs and where when
>>>> dealing with HBase.)
>>>>
>>>> A third option… Depending on how you use the data, you may want to
>>>> avoid storing the data in HBase, and only use HBase as an index to
>>>> where you store the data files, for quick access. Again, it depends
>>>> on your data ingestion flow and how you intend to use the data.
>>>>
>>>> So really this is less a Spark issue than an HBase issue when it
>>>> comes to design.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <
>>>> shushantaror...@gmail.com> wrote:
>>>> >
>>>> > Hi
>>>> >
>>>> > I have a requirement of writing to an HBase table from a Spark
>>>> > streaming app after some processing.
>>>> > Is the HBase put operation the only way of writing to HBase, or
>>>> > is there any specialised connector or RDD of Spark for HBase
>>>> > writes?
>>>> >
>>>> > Should bulk load to HBase from a streaming app be avoided if the
>>>> > output of each batch interval is just a few MBs?
>>>> >
>>>> > Thanks
>>>>
>>>> The opinions expressed here are mine; while they may reflect a
>>>> cognitive thought, that is purely accidental.
>>>> Use at your own risk.
>>>> Michael Segel
>>>> michael_segel (AT) hotmail.com
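To make the grouping described above concrete, here is a sketch that
buckets a batch of Puts by the server hosting each row's region, in the
spirit of the actionsByServer map quoted in the thread. It is an
illustration of the idea, not HBase's actual AsyncProcess code; this is
the grouping that HTable.put(List<Put>) performs for you internally.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.hadoop.hbase.ServerName;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.RegionLocator;

  public class GroupPutsByServer {
    // Bucket each Put under the server hosting its row's region.
    static Map<ServerName, List<Put>> groupByServer(
        RegionLocator locator, List<Put> puts) throws IOException {
      Map<ServerName, List<Put>> actionsByServer = new HashMap<>();
      for (Put put : puts) {
        ServerName server =
            locator.getRegionLocation(put.getRow()).getServerName();
        actionsByServer.computeIfAbsent(server, s -> new ArrayList<>())
            .add(put);
      }
      return actionsByServer;
    }
  }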
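On the original Spark streaming question, a common pattern with the 1.x
client is to write each partition of a micro-batch through a
BufferedMutator, which buffers Puts client-side and ships them to the
RegionServers in batches. A sketch (the Spark foreachPartition plumbing
is omitted, and the table and column names are made up):

  import java.util.Iterator;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.BufferedMutator;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PartitionWriter {
    // Invoked per partition, e.g. from rdd.foreachPartition(...) in the
    // streaming job; each element is assumed to be a [rowKey, value] pair.
    static void writePartition(Iterator<String[]> rows) throws Exception {
      try (Connection conn =
               ConnectionFactory.createConnection(HBaseConfiguration.create());
           BufferedMutator mutator =
               conn.getBufferedMutator(TableName.valueOf("events"))) {
        while (rows.hasNext()) {
          String[] kv = rows.next();
          Put put = new Put(Bytes.toBytes(kv[0]));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"),
              Bytes.toBytes(kv[1]));
          mutator.mutate(put); // buffered, flushed in batches
        }
      } // close() flushes any remaining buffered Puts
    }
  }

In practice the Connection is expensive to create and would be shared per
executor rather than opened per partition; it is opened inline here only
to keep the sketch self-contained.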