The better answer is that you don’t worry about data locality.

> On Mar 4, 2015, at 12:32 PM, Andrew Purtell <apurt...@apache.org> wrote:
>
> Spark supports creating RDDs using Hadoop input and output formats
> (https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.rdd.HadoopRDD).
> You can use our TableInputFormat
> (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html)
> or TableOutputFormat
> (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html).
> These divide work up according to the contours of the keyspace and provide
> information to the framework on how to optimally place tasks on the cluster
> for data locality. You may not need to do anything special. InputFormats
> like TableInputFormat hand over an array of InputSplit
> (https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/mapreduce/InputSplit.html)
> to the framework so it can optimize task placement. Hadoop MapReduce takes
> advantage of this information. I looked at Spark's HadoopRDD implementation
> and it appears to make use of this information when partitioning the RDD.
>
> You might also want to take a look at Ted Malaska's SparkOnHBase:
> https://github.com/tmalaska/SparkOnHBase
>
>
> On Tue, Mar 3, 2015 at 9:46 PM, Gokul Balakrishnan <royal...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I'm fairly new to HBase so would be grateful for any assistance.
>>
>> My project is as follows: use HBase as an underlying data store for an
>> analytics cluster (powered by Apache Spark).
>>
>> In doing this, I'm wondering how I may set about leveraging the locality of
>> the HBase data during processing (in other words, if the Spark instance is
>> running on a node that also houses HBase data, how to make use of the local
>> data first).
>>
>> Is there some form of metadata offered by the Java API which I could then
>> use to organise the data into (virtual) groups based on the locality to be
>> passed forward to Spark? It could be something that *identifies on which
>> node a particular row resides*. I found [1] but I'm not sure if this is
>> what I'm looking for. Could someone please point me in the right direction?
>>
>> [1] https://issues.apache.org/jira/browse/HBASE-12361
>>
>> Thanks so much!
>> Gokul Balakrishnan.
>>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
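
For readers who want to see what the quoted suggestion looks like in practice, here is a minimal sketch in Scala. It is not taken from the thread. It assumes Spark 1.x with the HBase client and server jars on the classpath, an hbase-site.xml that points at a running cluster, and a hypothetical table named "analytics_events". It uses newAPIHadoopRDD rather than HadoopRDD directly, because the TableInputFormat in the mapreduce package implements the newer Hadoop InputFormat API. The loop only prints the location hints Spark received for each partition, so you can check for yourself whether they line up with your region servers.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseLocalitySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-locality-sketch"))

    // Loads hbase-site.xml from the classpath (assumed to be present).
    val hbaseConf = HBaseConfiguration.create()
    // "analytics_events" is a hypothetical table name used for illustration.
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "analytics_events")

    // Each InputSplit produced by TableInputFormat becomes one RDD partition.
    val rdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Print the hosts Spark would prefer for each partition. These hints
    // come from the splits; nothing locality-specific is configured here.
    rdd.partitions.foreach { p =>
      val hosts = rdd.preferredLocations(p).mkString(", ")
      println(s"partition ${p.index}: $hosts")
    }

    // Any action triggers the scan; the scheduler uses the hints above
    // when it can, and falls back to other executors when it cannot.
    println(s"rows scanned: ${rdd.count()}")

    sc.stop()
  }
}
```

Whether tasks actually run on the preferred hosts depends on where Spark executors are running and on the scheduler's locality wait settings, so treat the printed hosts as hints rather than guarantees.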