Hi Mich,

Thanks for your reply. In hindsight I realize I didn't provide enough 
information about the infrastructure for the question to be answered properly. 
We are currently running a Hadoop cluster whose nodes each run the following 
services:

- HDFS NameNode (3.3.4)
- YARN NodeManager (3.3.4)
- HBase RegionServer (2.4.15)
- LLAP on YARN (3.1.3)

To answer your questions directly: putting Spark on the Hadoop nodes was the 
first idea I had, in order to colocate Spark with HBase for reads (and, to 
answer your second question, HBase shares nodes with Hadoop). However, what 
currently happens is that when a Hive query reads from or writes to HBase, we 
see resource contention as HBase threads "spill over" onto vcores that are in 
theory reserved for YARN. We tolerate this so that both LLAP and HBase benefit 
from short-circuit reads, but for Spark I was hoping to find out whether that 
same data locality benefit would exist when reading from HBase, or whether it 
would be better to incur the cost of inter-server, intra-VPC traffic in order 
to avoid resource contention between Spark and HBase during data loading.

Regarding HBase being the speed layer and Parquet files being the batch layer: 
I was looking at both of them as the batch layer, but the role HBase plays is 
to reduce the amount of scanning and joining needed to support our use case. 
Basically, we receive events that number in the thousands, and they need to be 
matched to events that number in the hundreds of millions; both sides share a 
UUIDv4, so instead of matching those rows in an MR-style job, we run simple 
inserts into HBase with the UUIDv4 as the row key (a simplified sketch of this 
is below). The Parquet files would end up holding data from HBase that is past 
the window in which we can still receive events for that UUIDv4, i.e. static 
data. I'm happy to draw up a diagram, but hopefully these details are enough 
for an understanding of the question.
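
Here's that sketch, just to make the pattern concrete. The table name and 
column family are placeholders rather than our real schema, and it's written 
against the plain HBase 2.x client API rather than our actual pipeline code:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object EventUpsert {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()          // picks up hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("events"))  // "events" is a placeholder name
    try {
      val uuid = "8f14e45f-ceea-467f-a0e6-b7aee3b0f1a2"   // the UUIDv4 shared by both event streams
      val put = new Put(Bytes.toBytes(uuid))              // UUIDv4 as the row key
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
        Bytes.toBytes("""{"field":"value"}"""))           // "d:payload" is a placeholder column
      table.put(put)                                      // a keyed write instead of a scan/join
    } finally {
      table.close()
      connection.close()
    }
  }
}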

To attempt to summarize: would running Spark on YARN on the same machines where 
both HDFS and HBase are running provide data locality benefits when Spark reads 
from HBase, or are those benefits negligible, making it a better idea to put 
Spark in a standalone cluster?

Thanks for your time,
Aaron

On Thu, 2023-01-05 at 19:00 +0000, Mich Talebzadeh wrote:
A few questions:

  *   As I understand it, you already have a Hadoop cluster. Are you going to 
put your Spark on the Hadoop nodes?
  *   Where is your HBase cluster? Is it sharing nodes with Hadoop, or does it 
have its own cluster?

I looked at that link and it does not say much. Essentially you want to use 
HBase for the speed layer, and your inactive data is stored in Parquet files on 
HDFS. So that is your batch layer, so to speak.

Have a look at this article of mine, Real Time Processing of Trade Data with 
Kafka, Flume, Spark, Hbase and MongoDB 
<https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/>; 
it's a bit dated but still valid.


It helps if you provide an architectural diagram of your proposed solution.


You then need to do a PoC to see how it looks.


HTH
 
view my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 5 Jan 2023 at 09:35, Aaron Grubb <aa...@kaden.ai> wrote:
(cross-posting from the HBase user list as I didn't receive a reply there)

Hello,

I'm completely new to Spark and evaluating whether to set up a cluster on YARN 
or standalone. Our idea for the general workflow is to create a concatenated 
DataFrame from historical pickle/parquet files (whichever is faster) and 
current data stored in HBase. I'm aware of the benefit of short-circuit reads 
if the historical files are stored in HDFS, but I'm more concerned about 
resource contention between Spark and HBase during data loading. My question 
is: would running Spark on the same nodes provide a benefit when using 
hbase-connectors 
(https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a 
mechanism in the connector to "pass through" a short-circuit read to Spark, or 
would data always bounce from HDFS -> RegionServer -> Spark?
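
To make the workflow concrete, this is roughly the read path I have in mind, 
written as if run in spark-shell with the hbase-connectors Spark module and 
hbase-site.xml on the classpath (table name, column mapping and HDFS path are 
placeholders, not a tested configuration):

// "Hot" data still in HBase, via the hbase-connectors DataFrame source
val hot = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "events")
  .option("hbase.columns.mapping", "uuid STRING :key, payload STRING d:payload")
  .option("hbase.spark.use.hbasecontext", false)  // let the source create its own HBase connection
  .load()

// "Cold" data already aged out to Parquet on HDFS
val cold = spark.read.parquet("hdfs:///data/events_archive")

// Concatenated view for downstream processing (schemas assumed to match)
val events = cold.unionByName(hot)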

Thanks in advance,
Aaron
