Hi Aaron,

Thanks for the details.

It is common practice to run on-premise Spark on Hadoop clusters.
<https://spark.apache.org/faq.html#:~:text=How%20does%20Spark%20relate%20to,Hive%2C%20and%20any%20Hadoop%20InputForm>
This comes from the notion of data locality, which in simple terms means
doing the computation on the node where the data resides. As you are
already aware, Spark is a cluster computing system, not a storage system
like HDFS or HBase. Spark is used to process the data stored in such
distributed systems. When a Spark application processes data stored in
HDFS, for example Parquet files, Spark will attempt to place computation
tasks alongside the HDFS blocks.
With HDFS, the Spark driver asks the NameNode which DataNodes (ideally
local) hold the various blocks of a file or directory, along with their
locations (represented as InputSplits), and then schedules the work on
the Spark workers.
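
To make the locality idea concrete, here is a toy sketch in plain Python.
It is not Spark's scheduler; the function assign_tasks and the host and
block names are hypothetical, and only the labels NODE_LOCAL and ANY are
borrowed from Spark's locality levels. It assumes one task per block and
a fixed number of free task slots per host.

```python
def assign_tasks(block_locations, free_slots_by_host):
    """Toy model: one task per block, preferring a host that holds a replica.

    block_locations:    dict of block id -> list of hosts holding a replica
    free_slots_by_host: dict of host -> number of free task slots
    Returns a dict of block id -> (host, locality label).
    """
    free = dict(free_slots_by_host)  # copy so the caller's dict is untouched
    assignment = {}
    for block_id, replica_hosts in block_locations.items():
        # First choice: a host that stores the block and has a free slot.
        local = next((h for h in replica_hosts if free.get(h, 0) > 0), None)
        if local is not None:
            free[local] -= 1
            assignment[block_id] = (local, "NODE_LOCAL")
            continue
        # Fallback: any host with a free slot; the block crosses the network.
        remote = next((h for h, n in free.items() if n > 0), None)
        if remote is None:
            raise RuntimeError("no free task slots left")
        free[remote] -= 1
        assignment[block_id] = (remote, "ANY")
    return assignment


# Hypothetical layout: node3 stores blk_3 but runs no Spark executor.
blocks = {"blk_1": ["node1", "node2"], "blk_2": ["node1"], "blk_3": ["node3"]}
slots = {"node1": 2, "node2": 1, "node4": 1}
assignment = assign_tasks(blocks, slots)
for block_id, (host, level) in assignment.items():
    print(block_id, host, level)
```

In this layout blk_1 and blk_2 run where their data lives, while blk_3 has
to be read over the network because no executor runs on node3. That is the
case you avoid by co-locating Spark with the DataNodes.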

Moving on, Spark on Hadoop also communicates with Hive. It uses an
efficient API to talk to Hive without the need for JDBC drivers, which is
another advantage here.

Spark can talk to HBase through the Spark-HBase connector
<https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md>,
which provides HBaseContext for Spark to interact with HBase. HBaseContext
pushes the configuration to the Spark executors and allows each Spark
executor to hold its own HBase connection.
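
The pattern of shipping the configuration once and reusing one connection
for many rows can be sketched in plain Python. This is only an
illustration under stated assumptions: FakeConnection and read_partition
are hypothetical stand-ins, not the connector's API, the quorum value is
made up, and one connection per partition stands in here for one
connection per executor.

```python
class FakeConnection:
    """Hypothetical stand-in for an HBase connection; counts how many open."""

    opened = 0

    def __init__(self, config):
        FakeConnection.opened += 1
        self.config = config

    def get(self, row_key):
        # A real client would fetch the row from a RegionServer here.
        return {"key": row_key, "quorum": self.config["hbase.zookeeper.quorum"]}

    def close(self):
        pass


def read_partition(config, row_keys):
    """Open one connection for the whole partition and reuse it for every row."""
    conn = FakeConnection(config)
    try:
        return [conn.get(key) for key in row_keys]
    finally:
        conn.close()


# Illustrative values only: the same config is handed to every partition.
config = {"hbase.zookeeper.quorum": "zk1,zk2,zk3"}
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
results = [read_partition(config, part) for part in partitions]
print(FakeConnection.opened, "connections for", sum(len(r) for r in results), "rows")
```

The point of the sketch is the ratio: connections scale with the number of
partitions (or executors), not with the number of rows read.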


With regard to your question:


 Would running Spark on YARN on the same machines where both HDFS and HBase
are running provide localization benefits when Spark reads from HBase, or
are localization benefits negligible and it's a better idea to put Spark in
a standalone cluster?


As per my previous points, I believe it does: HBaseContext pushes the
configuration to the Spark executors and allows each Spark executor to
hold its own HBase connection. Putting Spark on a standalone cluster will
add to the cost and, IMO, will not achieve much.


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 5 Jan 2023 at 22:53, Aaron Grubb <aa...@kaden.ai> wrote:

> Hi Mich,
>
> Thanks for your reply. In hindsight I realize I didn't provide enough
> information about the infrastructure for the question to be answered
> properly. We are currently running a Hadoop cluster with nodes that have
> the following services:
>
> - HDFS NameNode (3.3.4)
> - YARN NodeManager (3.3.4)
> - HBase RegionServer (2.4.15)
> - LLAP on YARN (3.1.3)
>
> So to answer your questions directly, putting Spark on the Hadoop nodes is
> the first idea that I had in order to colocate Spark with HBase for reads
> (HBase is sharing nodes with Hadoop to answer the second question).
> However, what currently happens is, when a Hive query runs that either
> reads from or writes to HBase, there ends up being resource contention as
> HBase threads "spill over" onto vcores that are in theory reserved for
> YARN. We tolerate this in order for both LLAP and HBase to benefit from
> short circuited reads, but when it comes to Spark, I was hoping to find out
> if that same localization benefit would exist when reading from HBase, or
> if it would be better to incur the cost of inter-server, intra-VPC traffic
> in order to avoid resource contention between Spark and HBase during data
> loading. Regarding HBase being the speed layer and Parquet files being the
> batch layer, I was more looking at both of them as the batch layer, but the
> role HBase plays is it reduces the amount of data scanning and joining
> needed to support our use case. Basically we receive events that number in
> the thousands, and those events need to be matched to events that number in
> the hundreds of millions, but they both share a UUIDv4, so instead of
> matching those rows in a MR-style job, we run simple inserts into HBase
> with the UUIDv4 as the table key. The parquet files would end up being data
> from HBase that are past the window for us to receive more events for that
> UUIDv4, i.e. static data. I'm happy to draw up a diagram but hopefully
> these details are enough for an understanding of the question.
>
> To attempt to summarize, would running Spark on YARN on the same machines
> where both HDFS and HBase are running provide localization benefits when
> Spark reads from HBase, or are localization benefits negligible and it's a
> better idea to put Spark in a standalone cluster?
>
> Thanks for your time,
> Aaron
>
> On Thu, 2023-01-05 at 19:00 +0000, Mich Talebzadeh wrote:
>
> A few questions:
>
>    - As I understand it, you already have a Hadoop cluster. Are you going
>    to put Spark on your Hadoop nodes?
>    - Where is your HBase cluster? Is it sharing nodes with Hadoop, or does
>    it have its own cluster?
>
> I looked at that link and it does not say much. Essentially you want to
> use HBase as the speed layer, while your inactive data is stored in Parquet
> files on HDFS. That is your batch layer, so to speak.
>
> Have a look at this article of mine, Real Time Processing of Trade Data
> with Kafka, Flume, Spark, Hbase and MongoDB
> <https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/>,
> a bit dated but still valid.
>
> It helps if you provide an Architectural diagram of your proposed solution.
>
>
> You then need to do a PoC to see how it looks.
>
>
> HTH
>
>
> On Thu, 5 Jan 2023 at 09:35, Aaron Grubb <aa...@kaden.ai> wrote:
>
> (cross-posting from the HBase user list as I didn't receive a reply there)
>
> Hello,
>
> I'm completely new to Spark and am evaluating setting up a cluster either
> on YARN or standalone. Our idea for the general workflow is to create a
> concatenated dataframe from historical pickle/parquet files (whichever is
> faster) and current data stored in HBase. I'm aware of the benefit of
> short-circuit reads if the historical files are stored in HDFS, but I'm more
> concerned about resource contention between Spark and HBase during data
> loading. My question is: would running Spark on the same nodes provide a
> benefit when using hbase-connectors (
> https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a
> mechanism in the connector to "pass through" a short-circuit read to Spark,
> or would data always bounce from HDFS -> RegionServer -> Spark?
>
> Thanks in advance,
> Aaron
>
>
>
