A few questions:

   - As I understand it, you already have a Hadoop cluster. Are you going
   to run Spark on the same Hadoop nodes?
   - Where is your HBase cluster? Is it sharing nodes with Hadoop, or does
   it have its own cluster?

I looked at that link and it does not say much. Essentially you want to use
HBase for the speed layer, while your inactive data is stored in Parquet
files on HDFS; that is your batch layer, so to speak.
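
As a minimal sketch of that read path (the table name, column mapping,
HDFS path and schema below are my assumptions, not yours), the
hbase-connectors DataFrame source lets you union the two layers in one job:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lambda-read-poc")
  .getOrCreate()

// Batch layer: historical (inactive) data held as Parquet on HDFS.
val batchDF = spark.read.parquet("hdfs:///data/trades/history")

// Speed layer: current data read from HBase through hbase-connectors.
// The column mapping must mirror the Parquet schema for the union below.
val speedDF = spark.read
  .format("org.apache.hadoop.hbase.spark")
  .option("hbase.table", "trades")
  .option("hbase.columns.mapping",
    "trade_id STRING :key, price DOUBLE cf:price, ts LONG cf:ts")
  .option("hbase.spark.use.hbasecontext", false) // rely on hbase-site.xml
  .load()

// Serve queries against the union of both layers.
val combinedDF = batchDF.unionByName(speedDF)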

Have a look at this article of mine, Real Time Processing of Trade Data with
Kafka, Flume, Spark, Hbase and MongoDB
<https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/>;
it is a bit dated but still valid.

It helps if you provide an architectural diagram of your proposed solution.


You then need to do a PoC to see how it looks.
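
Even a crude timing harness in the PoC will show whether co-locating Spark
with HBase buys you anything. A sketch, reusing the two DataFrames from the
snippet above, run once co-located and once on separate nodes:

// Crude timing helper: full scan of each layer, wall-clock seconds.
def time[T](label: String)(block: => T): T = {
  val t0 = System.nanoTime()
  val result = block
  println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.1f s")
  result
}

time("Parquet (batch layer) scan") { batchDF.count() }
time("HBase (speed layer) scan") { speedDF.count() }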


HTH

View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 5 Jan 2023 at 09:35, Aaron Grubb <aa...@kaden.ai> wrote:

> (cross-posting from the HBase user list as I didn't receive a reply there)
>
> Hello,
>
> I'm completely new to Spark and am evaluating setting up a cluster either
> on YARN or standalone. Our idea for the general workflow is to create a
> concatenated DataFrame using historical pickle/parquet files (whichever is
> faster) and current data stored in HBase. I'm aware of the benefit of
> short-circuit reads if the historical files are stored in HDFS, but I'm
> more concerned about resource contention between Spark and HBase during
> data loading. My question is: would running Spark on the same nodes
> provide a benefit when using hbase-connectors (
> https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a
> mechanism in the connector to "pass through" a short-circuit read to
> Spark, or would data always bounce from HDFS -> RegionServer -> Spark?
>
> Thanks in advance,
> Aaron
>
