(Cross-posting from the HBase user list, as I didn't receive a reply there.)

Hello,
I'm completely new to Spark and evaluating setting up a cluster, either on YARN or standalone. Our idea for the general workflow is to create a concatenated DataFrame from historical pickle/Parquet files (whichever is faster) plus current data stored in HBase. I'm aware of the benefit of short-circuit reads if the historical files are stored in HDFS, but I'm more concerned about resource contention between Spark and HBase during data loading.

My question is: would running Spark on the same nodes as the RegionServers provide a benefit when using hbase-connectors (https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a mechanism in the connector to "pass through" a short-circuit read to Spark, or would data always bounce from HDFS -> RegionServer -> Spark?

Thanks in advance,
Aaron
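P.S. For concreteness, here is a rough sketch of the load I have in mind (PySpark). The HDFS path, HBase table name, and column mapping are all hypothetical placeholders, and I'm assuming the connector's documented datasource format and options:

```python
# Sketch only -- paths, table name, and column mapping are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-parquet-concat").getOrCreate()

# Historical data from Parquet files on HDFS
# (this is the read that could benefit from short-circuit reads).
historical = spark.read.parquet("hdfs:///data/historical/")

# Current data via the hbase-connectors Spark datasource
# (this is the path I'm asking about: HDFS -> RegionServer -> Spark?).
current = (
    spark.read
    .format("org.apache.hadoop.hbase.spark")
    .option("hbase.table", "events")
    .option("hbase.columns.mapping",
            "id STRING :key, value STRING cf:value")
    .load()
)

# Concatenate the two sources by column name.
combined = historical.unionByName(current)
```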