Hi Luqman,

I just responded to another query on the list about phoenix-spark that may help shed some light. In addition, the preferred locations the phoenix-spark connector exposes are determined in the general PhoenixInputFormat MapReduce code [1].
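For what it's worth, here is a rough pure-Java sketch of the idea behind that code: derive one split per HBase region, carrying the region's hosting server as its preferred location. The RegionInfo/TableSplit types below are hypothetical stand-ins for illustration only, not the real HBase/Phoenix classes; in the actual connector you would call PhoenixInputFormat.getSplits() and read each InputSplit's getLocations().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: in Phoenix this information comes from the HBase
// client's region metadata, not a class like this.
class RegionInfo {
    final byte[] startKey;
    final byte[] endKey;
    final String hostingServer; // region server currently serving this region
    RegionInfo(byte[] startKey, byte[] endKey, String hostingServer) {
        this.startKey = startKey;
        this.endKey = endKey;
        this.hostingServer = hostingServer;
    }
}

// Hypothetical stand-in for a MapReduce InputSplit / Presto ConnectorSplit.
class TableSplit {
    final byte[] startKey;
    final byte[] endKey;
    final List<String> preferredHosts;
    TableSplit(byte[] startKey, byte[] endKey, List<String> preferredHosts) {
        this.startKey = startKey;
        this.endKey = endKey;
        this.preferredHosts = preferredHosts;
    }
}

public class SplitSketch {
    // One split per region: scanning a split touches a single region, and
    // scheduling it on that region's server keeps the read local.
    static List<TableSplit> splitsForTable(List<RegionInfo> regions) {
        List<TableSplit> splits = new ArrayList<>();
        for (RegionInfo r : regions) {
            splits.add(new TableSplit(r.startKey, r.endKey,
                    List.of(r.hostingServer)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Two regions splitting the key space at 0x40 (placeholder hosts).
        List<RegionInfo> regions = List.of(
                new RegionInfo(new byte[]{}, new byte[]{0x40}, "rs1.example.com"),
                new RegionInfo(new byte[]{0x40}, new byte[]{}, "rs2.example.com"));
        for (TableSplit s : splitsForTable(regions)) {
            System.out.println(s.preferredHosts);
        }
    }
}
```

A Presto split manager could follow the same shape: enumerate the table's regions (or Phoenix's per-scan chunks), emit one split per chunk, and expose the hosting server as the split's address hint so workers read locally where possible.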
I'm not very familiar with PrestoDB, but if it's able to load data using a general Hadoop InputFormat, the PhoenixInputFormat would be a good place to start looking.

Josh

[1] https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-hive/src/main/java/org/apache/phoenix/hive/mapreduce/PhoenixInputFormat.java#L177-L178

On Thu, Aug 17, 2017 at 5:46 AM, Luqman Ghani <lgsa...@gmail.com> wrote:
> Hi,
>
> We are evaluating the possibility of writing a custom connector for
> Phoenix to access tables stored in HBase. However, we need some help.
>
> The connector for Presto should be able to read from the HBase cluster
> using parallel connections. For that, the connector has a
> "ConnectorSplitManager" which needs to be implemented. To quote from here
> <https://prestodb.io/docs/current/develop/connectors.html>:
> "
> The split manager partitions the data for a table into the individual
> chunks that Presto will distribute to workers for processing. For example,
> the Hive connector lists the files for each Hive partition and creates one
> or more splits per file. For data sources that don't have partitioned data,
> a good strategy here is to simply return a single split for the entire
> table. This is the strategy employed by the Example HTTP connector.
> "
>
> I want to know if there's a way to implement the split manager so that the
> data in HBase can be accessed by parallel connections. I was trying to
> follow the code for the Phoenix-Spark connector
> <https://github.com/apache/phoenix/blob/master/phoenix-spark/src/main/scala/org/apache/phoenix/spark/PhoenixRDD.scala>
> to see how it decides getPreferredLocations when creating splits, but
> couldn't understand it.
>
> Any hints or code directions will be helpful.
>
> Regards,
> Luqman