Re: Dealing with data locality in the HBase Java API

Michael Segel Thu, 05 Mar 2015 09:15:49 -0800

The better answer is that you don’t worry about data locality.

> On Mar 4, 2015, at 12:32 PM, Andrew Purtell <apurt...@apache.org> wrote:
> 
> Spark supports creating RDDs using Hadoop input and output formats (
> https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.rdd.HadoopRDD)
> . You can use our TableInputFormat (
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html)
> or TableOutputFormat (
> https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html).
> These divide work up according to the contours of the keyspace and provide
> information to the framework on how to optimally place tasks on the cluster
> for data locality. You may not need to do anything special. InputFormats
> like TableInputFormat hand over an array of InputSplit (
> https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapreduce/InputSplit.html)
> to the framework so it can optimize task placement. Hadoop MapReduce takes
> advantage of this information. I looked at Spark's HadoopRDD implementation
> and it appears to make use of this information when partitioning the RDD.
> 
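
To make the placement idea in the paragraph above concrete, here is a minimal, self-contained Java sketch. It is a toy model under stated assumptions, not Hadoop, HBase, or Spark code: the `Split` class, the `place` method, and the host names are hypothetical stand-ins for the location hints an InputSplit hands to the framework and for what a scheduler might do with them. Real schedulers are considerably more involved.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of locality-aware task placement. Not a Hadoop, HBase, or Spark class. */
final class LocalityPlacementSketch {

    /** Hypothetical stand-in for an input split: a row range plus the hosts that serve it. */
    static final class Split {
        final String startRow;
        final String endRow;
        final List<String> hosts;

        Split(String startRow, String endRow, List<String> hosts) {
            this.startRow = startRow;
            this.endRow = endRow;
            this.hosts = hosts;
        }
    }

    /**
     * Assigns each split to a worker. A worker that is also one of the split's
     * hosts is preferred; otherwise the least-loaded worker is used.
     */
    static Map<Split, String> place(List<Split> splits, List<String> workers) {
        if (workers.isEmpty()) {
            throw new IllegalArgumentException("at least one worker is required");
        }
        Map<String, Integer> load = new LinkedHashMap<>();
        for (String worker : workers) {
            load.put(worker, 0);
        }
        Map<Split, String> assignment = new LinkedHashMap<>();
        for (Split split : splits) {
            String chosen = null;
            // First choice: a worker running on a host that holds the data.
            for (String host : split.hosts) {
                if (load.containsKey(host)) {
                    chosen = host;
                    break;
                }
            }
            // Fallback: no local worker, so pick the least-loaded one.
            if (chosen == null) {
                for (String worker : workers) {
                    if (chosen == null || load.get(worker) < load.get(chosen)) {
                        chosen = worker;
                    }
                }
            }
            load.merge(chosen, 1, Integer::sum);
            assignment.put(split, chosen);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
                new Split("a", "m", Arrays.asList("node1")),
                new Split("m", "z", Arrays.asList("node9")));
        Map<Split, String> result = place(splits, Arrays.asList("node1", "node2"));
        for (Map.Entry<Split, String> e : result.entrySet()) {
            System.out.println("[" + e.getKey().startRow + ", " + e.getKey().endRow
                    + ") -> " + e.getValue());
        }
    }
}
```

With workers node1 and node2, the split hosted on node1 is placed on node1, while the split whose only host (node9) runs no worker falls back to the least-loaded worker, node2.
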
> You might also want to take a look at Ted Malaska's SparkOnHBase:
> https://github.com/tmalaska/SparkOnHBase
> 
> 
> On Tue, Mar 3, 2015 at 9:46 PM, Gokul Balakrishnan <royal...@gmail.com>
> wrote:
> 
>> Hello,
>> 
>> I'm fairly new to HBase so would be grateful for any assistance.
>> 
>> My project is as follows: use HBase as an underlying data store for an
>> analytics cluster (powered by Apache Spark).
>> 
>> In doing this, I'm wondering how I may set about leveraging the locality of
>> the HBase data during processing (in other words, if the Spark instance is
>> running on a node that also houses HBase data, how to make use of the local
>> data first).
>> 
>> Is there some form of metadata offered by the Java API which I could then
>> use to organise the data into (virtual) groups based on the locality to be
>> passed forward to Spark? It could be something that *identifies on which
>> node a particular row resides*. I found [1] but I'm not sure if this is
>> what I'm looking for. Could someone please point me in the right direction?
>> 
>> [1] https://issues.apache.org/jira/browse/HBASE-12361
>> 
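
As a rough illustration of what "identifying the node for a row" involves, the sketch below models the lookup in plain Java. It rests on one assumption worth stating loudly: each region covers a contiguous, sorted range of row keys beginning at its start key and is served by a single host, so the region for a row is the one with the greatest start key not exceeding that row. The class `RegionLookupSketch` and its methods are hypothetical and are not part of the HBase client API; real HBase keys are byte arrays, and `String` ordering is used here only for brevity. In the HBase 1.0 client, `RegionLocator.getRegionLocation(byte[] row)` is, as far as I know, the call that answers this question against a live cluster.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy row-to-host lookup over sorted region start keys. Not an HBase class. */
final class RegionLookupSketch {

    // Maps each region's start key to the host assumed to serve that region.
    private final TreeMap<String, String> hostByStartKey = new TreeMap<>();

    /** Registers a region that begins at startKey (inclusive) and is served by host. */
    void addRegion(String startKey, String host) {
        hostByStartKey.put(startKey, host);
    }

    /** Returns the host of the region whose start key is the greatest one <= rowKey. */
    String hostForRow(String rowKey) {
        Map.Entry<String, String> region = hostByStartKey.floorEntry(rowKey);
        if (region == null) {
            throw new IllegalStateException("no region covers row " + rowKey);
        }
        return region.getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.addRegion("", "node1");   // first region starts at the empty key
        table.addRegion("m", "node2");
        table.addRegion("t", "node3");
        System.out.println(table.hostForRow("mango"));
    }
}
```

Grouping rows by the host this lookup returns gives the kind of locality-based "virtual groups" the question describes, although an input format that already reports split locations may make that unnecessary.
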
>> Thanks so much!
>> Gokul Balakrishnan.
>> 
> 
> 
> 
> -- 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com
