Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Mich Talebzadeh
Hi Aaron,

Thanks for the details.

It is general practice, when running Spark on premises, to use Hadoop
clusters.

This comes from the notion of data locality, which in simple terms means
doing the computation on the node where the data resides. As you are
already aware, Spark is a cluster computing system, not a storage system
like HDFS or HBase; it is used to process the data stored in such
distributed systems. If a Spark application is processing data stored in
HDFS, for example Parquet files, Spark will attempt to place computation
tasks alongside the HDFS blocks. With HDFS, the Spark driver contacts the
NameNode for the DataNodes (ideally local) containing the various blocks
of a file or directory as well as their locations (represented as
InputSplits), and then schedules the work onto the Spark workers.
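
For illustration, a minimal PySpark sketch of that point; the path below is
a placeholder. When the executors run on the same nodes as the DataNodes
holding the Parquet blocks, the scheduler can assign tasks NODE_LOCAL, which
you can verify per task in the Spark UI.

# Hedged sketch: assumes an active SparkSession `spark`; the HDFS path is hypothetical.
df = spark.read.parquet("hdfs:///data/events/")
df.count()  # check the Stages tab -> "Locality Level" for NODE_LOCAL tasks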

Moving on, Spark on Hadoop communicates with Hive through an efficient API,
without the need for JDBC drivers, which is another advantage here.

Spark can talk to HBase through the Spark-HBase connector, which provides an
HBaseContext for Spark to interact with HBase. HBaseContext pushes the
configuration to the Spark executors and allows each executor to have an
HBase connection.
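
As a rough illustration, here is a minimal PySpark sketch of reading an HBase
table through the hbase-connectors Spark datasource. The table name and column
mapping are hypothetical, and the option names should be checked against the
connector version you build; treat this as a sketch, not a definitive recipe.

# Hedged sketch: table and column mapping are placeholders; verify option names
# against the hbase-connectors documentation for your version.
df = (spark.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "events")
      .option("hbase.columns.mapping",
              "uuid STRING :key, payload STRING cf1:payload")
      .load())
df.filter(df.uuid == "some-uuid-v4").show()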


With regard to your question:


 Would running Spark on YARN on the same machines where both HDFS and HBase
are running provide localization benefits when Spark reads from HBase, or
are localization benefits negligible and it's a better idea to put Spark in
a standalone cluster?


As per my previous points, I believe it does: HBaseContext pushes the
configuration to the Spark executors and allows each executor to have an
HBase connection. Putting Spark on a standalone cluster will add to the cost
and IMO will not achieve much.


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 5 Jan 2023 at 22:53, Aaron Grubb  wrote:

> Hi Mich,
>
> Thanks for your reply. In hindsight I realize I didn't provide enough
> information about the infrastructure for the question to be answered
> properly. We are currently running a Hadoop cluster with nodes that have
> the following services:
>
> - HDFS NameNode (3.3.4)
> - YARN NodeManager (3.3.4)
> - HBase RegionServer (2.4.15)
> - LLAP on YARN (3.1.3)
>
> So to answer your questions directly, putting Spark on the Hadoop nodes is
> the first idea that I had in order to colocate Spark with HBase for reads
> (HBase is sharing nodes with Hadoop to answer the second question).
> However, what currently happens is, when a Hive query runs that either
> reads from or writes to HBase, there ends up being resource contention as
> HBase threads "spill over" onto vcores that are in theory reserved for
> YARN. We tolerate this in order for both LLAP and HBase to benefit from
> short circuited reads, but when it comes to Spark, I was hoping to find out
> if that same localization benefit would exist when reading from HBase, or
> if it would be better to incur the cost of inter-server, intra-VPC traffic
> in order to avoid resource contention between Spark and HBase during data
> loading. Regarding HBase being the speed layer and Parquet files being the
> batch layer, I was more looking at both of them as the batch layer, but the
> role HBase plays is it reduces the amount of data scanning and joining
> needed to support our use case. Basically we receive events that number in
> the thousands, and those events need to be matched to events that number in
> the hundreds of millions, but they both share a UUIDv4, so instead of
> matching those rows in a MR-style job, we run simple inserts into HBase
> with the UUIDv4 as the table key. The parquet files would end up being data
> from HBase that are past the window for us to receive more events for that
> UUIDv4, i.e. static data. I'm happy to draw up a diagram but hopefully
> these details are enough for an understanding of the question.
>
> To attempt to summarize, would running Spark on YARN on the same machines
> where both HDFS and HBase are running provide localization benefits when
> Spark reads from HBase, or are localization benefits negligible and it's a
> better idea to put Spark in a standalone cluster?
>
> Thanks for your time,
> Aaron
>
> On Thu, 2023-01-05 at 19:00 +, Mich Talebzadeh wrote:
>
> Few questions
>
>- As I understand you already 

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Aaron Grubb
Hi Mich,

Thanks for your reply. In hindsight I realize I didn't provide enough 
information about the infrastructure for the question to be answered properly. 
We are currently running a Hadoop cluster with nodes that have the following 
services:

- HDFS NameNode (3.3.4)
- YARN NodeManager (3.3.4)
- HBase RegionServer (2.4.15)
- LLAP on YARN (3.1.3)

So to answer your questions directly, putting Spark on the Hadoop nodes is the 
first idea that I had in order to colocate Spark with HBase for reads (HBase is 
sharing nodes with Hadoop to answer the second question). However, what 
currently happens is, when a Hive query runs that either reads from or writes 
to HBase, there ends up being resource contention as HBase threads "spill over" 
onto vcores that are in theory reserved for YARN. We tolerate this in order for 
both LLAP and HBase to benefit from short circuited reads, but when it comes to 
Spark, I was hoping to find out if that same localization benefit would exist 
when reading from HBase, or if it would be better to incur the cost of 
inter-server, intra-VPC traffic in order to avoid resource contention between 
Spark and HBase during data loading. Regarding HBase being the speed layer and 
Parquet files being the batch layer, I was more looking at both of them as the 
batch layer, but the role HBase plays is it reduces the amount of data scanning 
and joining needed to support our use case. Basically we receive events that 
number in the thousands, and those events need to be matched to events that 
number in the hundreds of millions, but they both share a UUIDv4, so instead of 
matching those rows in a MR-style job, we run simple inserts into HBase with 
the UUIDv4 as the table key. The parquet files would end up being data from 
HBase that are past the window for us to receive more events for that UUIDv4, 
i.e. static data. I'm happy to draw up a diagram but hopefully these details 
are enough for an understanding of the question.

To attempt to summarize, would running Spark on YARN on the same machines where 
both HDFS and HBase are running provide localization benefits when Spark reads 
from HBase, or are localization benefits negligible and it's a better idea to 
put Spark in a standalone cluster?

Thanks for your time,
Aaron

On Thu, 2023-01-05 at 19:00 +, Mich Talebzadeh wrote:
Few questions

  *   As I understand it, you already have a Hadoop cluster. Are you going to put 
your Spark workers on the Hadoop nodes?
  *   Where is your HBase cluster? Is it sharing nodes with Hadoop or does it have 
its own cluster?

I looked at that link and it does not say much. Essentially you want to use 
HBase as the speed layer while your inactive data is stored in Parquet files on 
HDFS, so that is your batch layer, so to speak.

Have a look at this article of mine, Real Time Processing of Trade Data with 
Kafka, Flume, Spark, HBase and MongoDB; it is a bit dated but still valid.


It helps if you provide an Architectural diagram of your proposed solution.


You then need to do a PoC to see how it looks.


HTH
 
   view my Linkedin profile

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 5 Jan 2023 at 09:35, Aaron Grubb <aa...@kaden.ai> wrote:
(cross-posting from the HBase user list as I didn't receive a reply there)

Hello,

I'm completely new to Spark and evaluating setting up a cluster either in YARN 
or standalone. Our idea for the general workflow is create a concatenated 
dataframe using historical pickle/parquet files (whichever is faster) and 
current data stored in HBase. I'm aware of the benefit of short circuit reads 
if the historical files are stored in HDFS but I'm more concerned about 
resource contention between Spark and HBase during data loading. My question 
is, would running Spark on the same nodes provide a benefit when using 
hbase-connectors 
(https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a 
mechanism in the connector to "pass through" a short circuit read to Spark, or 
would data always bounce from HDFS -> RegionServer -> Spark?

Thanks in advance,
Aaron



Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Mich Talebzadeh
Few questions

   - As I understand it, you already have a Hadoop cluster. Are you going to
   put your Spark workers on the Hadoop nodes?
   - Where is your HBase cluster? Is it sharing nodes with Hadoop or does it
   have its own cluster?

I looked at that link and it does not say much. Essentially you want to use
HBase as the speed layer while your inactive data is stored in Parquet files on
HDFS, so that is your batch layer, so to speak.

Have a look at this article of mine, Real Time Processing of Trade Data with
Kafka, Flume, Spark, HBase and MongoDB; it is a bit dated but still valid.


It helps if you provide an Architectural diagram of your proposed solution.


You then need to do a PoC to see how it looks.


HTH

   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 5 Jan 2023 at 09:35, Aaron Grubb  wrote:

> (cross-posting from the HBase user list as I didn't receive a reply there)
>
> Hello,
>
> I'm completely new to Spark and evaluating setting up a cluster either in
> YARN or standalone. Our idea for the general workflow is create a
> concatenated dataframe using historical pickle/parquet files (whichever is
> faster) and current data stored in HBase. I'm aware of the benefit of short
> circuit reads if the historical files are stored in HDFS but I'm more
> concerned about resource contention between Spark and HBase during data
> loading. My question is, would running Spark on the same nodes provide a
> benefit when using hbase-connectors (
> https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a
> mechanism in the connector to "pass through" a short circuit read to Spark,
> or would data always bounce from HDFS -> RegionServer -> Spark?
>
> Thanks in advance,
> Aaron
>


Re: Got Error Creating permanent view in Postgresql through Pyspark code

2023-01-05 Thread ayan guha
Hi

What you are trying to do does not make sense. I suggest you review how views
work in SQL. IMHO you are better off creating a table.

Ayan
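
For illustration, a minimal PySpark sketch of that alternative: write the
result out as a regular Postgres table over JDBC rather than creating a view
from Spark. The connection URL, credentials and table name below are
placeholders, and the PostgreSQL JDBC driver must be on the Spark classpath.

# Hedged sketch: host, database, credentials and table name are hypothetical.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dbhost:5432/mydb")
   .option("dbtable", "public.temp1")
   .option("user", "spark_user")
   .option("password", "")
   .option("driver", "org.postgresql.Driver")
   .mode("overwrite")
   .save())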

On Fri, 6 Jan 2023 at 12:20 am, Stelios Philippou 
wrote:

> Vajiha,
>
> I don't see your query working as you hope it will.
>
> spark.sql will execute a query at the database level.
>
> To retrieve the temp view you need to go through the session, i.e.
>
> session.sql("SELECT * FROM TEMP_VIEW")
>
> You might need to retrieve the data in a collection and iterate over them
> to do batch insertion using spark.sql("INSERT ...");
>
> Hope this helps
>
> Stelios
>
>
> --
> Hi Stelios Philippou,
> I need to create a view in a Postgresql DB using pyspark code. But I'm
> unable to create a view, although I am able to create a table through pyspark
> code.
> I need to know whether I can create a view in a postgresql database through
> Pyspark code or not. Thanks for your reply.
>
> Pyspark Code:
> df.createOrReplaceTempView("TEMP_VIEW")
> spark.sql("CREATE VIEW TEMP1 AS SELECT * FROM TEMP_VIEW")
>
> On Wed, 4 Jan 2023 at 15:10, Vajiha Begum S A <
> vajihabegu...@maestrowiz.com> wrote:
>
>>
>> I have tried to create a permanent view in a Postgresql DB through Pyspark
>> code, but I have received the below error message. Kindly help me to create
>> a permanent view in the database. How shall I create a permanent view
>> using Pyspark code? Please do reply.
>>
>> *Error Message::*
>> *Exception has occurred: Analysis Exception*
>> Not allowed to create a permanent view `default`.`TEMP1` by referencing a
>> temporary view TEMP_VIEW. Please create a temp view instead by CREATE TEMP
>> VIEW
>>
>>
>> Regards,
>> Vajiha
>> Research Analyst
>> MW Solutions
>>
> --
Best Regards,
Ayan Guha


Re: GPU Support

2023-01-05 Thread Sean Owen
Spark itself does not use GPUs, but you can write and run code on Spark
that uses GPUs. You'd typically use software like Tensorflow that uses CUDA
to access the GPU.

On Thu, Jan 5, 2023 at 7:05 AM K B M Kaala Subhikshan <
kbmkaalasubhiks...@gmail.com> wrote:

> Is the Gigabyte GeForce RTX 3080 GPU supported for running machine learning
> in Spark?
>


Re: Got Error Creating permanent view in Postgresql through Pyspark code

2023-01-05 Thread Stelios Philippou
Vajiha,

I don't see your query working as you hope it will.

spark.sql will execute a query at the database level.

To retrieve the temp view you need to go through the session, i.e.

session.sql("SELECT * FROM TEMP_VIEW")

You might need to retrieve the data in a collection and iterate over them
to do batch insertion using spark.sql("INSERT ...");

Hope this helps

Stelios


--
Hi Stelios Philippou,
I need to create a view in a Postgresql DB using pyspark code. But I'm
unable to create a view, although I am able to create a table through pyspark
code.
I need to know whether I can create a view in a postgresql database through
Pyspark code or not. Thanks for your reply.

Pyspark Code:
df.createOrReplaceTempView("TEMP_VIEW")
spark.sql("CREATE VIEW TEMP1 AS SELECT * FROM TEMP_VIEW")

On Wed, 4 Jan 2023 at 15:10, Vajiha Begum S A 
wrote:

>
> I have tried to create a permanent view in a Postgresql DB through Pyspark
> code, but I have received the below error message. Kindly help me to create
> a permanent view in the database. How shall I create a permanent view
> using Pyspark code? Please do reply.
>
> *Error Message::*
> *Exception has occurred: Analysis Exception*
> Not allowed to create a permanent view `default`.`TEMP1` by referencing a
> temporary view TEMP_VIEW. Please create a temp view instead by CREATE TEMP
> VIEW
>
>
> Regards,
> Vajiha
> Research Analyst
> MW Solutions
>


GPU Support

2023-01-05 Thread K B M Kaala Subhikshan
Is the Gigabyte GeForce RTX 3080 GPU supported for running machine learning in
Spark?


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
And two single quotes together ('') look like a single double quote (").

Mvg/Regards
Saurabh Gulati

From: Saurabh Gulati 
Sent: 05 January 2023 12:24
To: Sean Owen 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

It's the same input, except that the headers are also being read with the csv reader.

Mvg/Regards
Saurabh Gulati

From: Sean Owen 
Sent: 04 January 2023 15:12
To: Saurabh Gulati 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That does not appear to be the same input you used in your example. What are the 
contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:
Hi @Sean Owen
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
And also, I don't understand why there is a distinction in outputs from 
df.show() and df.select("c").show()

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen <sro...@gmail.com>
Sent: 04 January 2023 14:25
To: Saurabh Gulati <saurabh.gul...@fedex.com>
Cc: Mich Talebzadeh <mich.talebza...@gmail.com>; User <user@spark.apache.org>
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:

@Sean Owen Also see the example below with quotes:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
It's the same input, except that the headers are also being read with the csv reader.

Mvg/Regards
Saurabh Gulati

From: Sean Owen 
Sent: 04 January 2023 15:12
To: Saurabh Gulati 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That does not appear to be the same input you used in your example. What are the 
contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:
Hi @Sean Owen
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
And also, I don't understand why there is a distinction in outputs from 
df.show() and df.select("c").show()

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen <sro...@gmail.com>
Sent: 04 January 2023 14:25
To: Saurabh Gulati <saurabh.gul...@fedex.com>
Cc: Mich Talebzadeh <mich.talebza...@gmail.com>; User <user@spark.apache.org>
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:

@Sean Owen Also see the example below with quotes:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
Yes, there are other ways to solve this, but I am trying to understand why there is a 
difference in behaviour between df.show() and df.select("c").show()

Mvg/Regards
Saurabh Gulati

From: Shay Elbaz 
Sent: 04 January 2023 14:54
To: Saurabh Gulati ; Sean Owen 

Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used 
within the data

If you have found a parser that works, simply read the data as text files, 
apply the parser manually, and convert to a DataFrame (if needed at all).
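
For illustration, a minimal PySpark sketch of that approach, assuming the file
is small enough to read whole (wholeTextFiles keeps each file in memory, which
also preserves quoted newlines). The path and column handling are placeholders.

# Hedged sketch: parse the raw file(s) with Python's csv module, then build a DataFrame.
import csv
from io import StringIO

raw = spark.sparkContext.wholeTextFiles("/tmp/test.csv")  # (path, content) pairs

def parse(content):
    return list(csv.reader(StringIO(content), delimiter=","))

rows = raw.flatMap(lambda kv: parse(kv[1]))
header = rows.first()                       # assumes the first row is the header
data = rows.filter(lambda r: r != header)
df = spark.createDataFrame(data, header)
df.select("c").show(truncate=False)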

From: Saurabh Gulati 
Sent: Wednesday, January 4, 2023 3:45 PM
To: Sean Owen 
Cc: Mich Talebzadeh ; User 
Subject: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within 
the data




Hi @Sean Owen
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
And also, I don't understand why there is a distinction in outputs from 
df.show() and df.select("c").show()

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen 
Sent: 04 January 2023 14:25
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati <saurabh.gul...@fedex.com> wrote:

@Sean Owen Also see the example below with quotes:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Aaron Grubb
(cross-posting from the HBase user list as I didn't receive a reply there)

Hello,

I'm completely new to Spark and evaluating setting up a cluster either in YARN 
or standalone. Our idea for the general workflow is create a concatenated 
dataframe using historical pickle/parquet files (whichever is faster) and 
current data stored in HBase. I'm aware of the benefit of short circuit reads 
if the historical files are stored in HDFS but I'm more concerned about 
resource contention between Spark and HBase during data loading. My question 
is, would running Spark on the same nodes provide a benefit when using 
hbase-connectors 
(https://github.com/apache/hbase-connectors/tree/master/spark)? Is there a 
mechanism in the connector to "pass through" a short circuit read to Spark, or 
would data always bounce from HDFS -> RegionServer -> Spark?

Thanks in advance,
Aaron


Re: How to set a config for a single query?

2023-01-05 Thread Khalid Mammadov
Hi

I believe there is a feature in Spark specifically for this purpose: you can
create a new Spark session and set those configs on it.
Note that it's not the same as creating separate driver processes with
separate sessions; here you still have the same SparkContext, which works as
the backend for two or more Spark sessions and does all the heavy lifting.

*spark.newSession()*

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.newSession.html#pyspark.sql.SparkSession.newSession
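
For illustration, a minimal PySpark sketch (the original question was Scala,
but the API is the same): each session carries its own SQL configuration, so
the setting below only affects queries issued through the new session. The
queries themselves are placeholders.

# Hedged sketch: `isolated` shares the SparkContext with `spark` but has its own conf,
# so this setting does not leak to queries run on `spark` from other threads.
isolated = spark.newSession()
isolated.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

query_a_df = isolated.sql("SELECT ...")  # hypothetical query A, runs without coalescing
query_b_df = spark.sql("SELECT ...")     # other queries keep the original setting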

Hope this helps
Khalid


On Wed, 4 Jan 2023, 00:25 Felipe Pessoto,  wrote:

> Hi,
>
>
>
> In Scala, is it possible to set a config value for a single query?
>
>
>
> I could set/unset the value, but it won’t work for multithreading
> scenarios.
>
>
>
> Example:
>
>
>
> spark.sql.adaptive.coalescePartitions.enabled = false
>
> queryA_df.collect
>
> spark.sql.adaptive.coalescePartitions.enabled=original value
>
> queryB_df.collect
>
> queryC_df.collect
>
> queryD_df.collect
>
>
>
>
>
> If I execute that block of code multiple times using multiple threads, I
> can end up executing Query A with coalescePartitions.enabled=true, and
> Queries B, C and D with the config set to false, because another thread
> could set it between the executions.
>
>
>
> Is there any good alternative to this?
>
>
>
> Thanks.
>