Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing.
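
For example, a minimal sketch of that pattern (assuming an existing
SparkSession named spark, a hypothetical output path, and the <model>_table
naming from your pseudocode):

    models = ["model1", "model2", "model3"]
    for model in models:
        # Only the table name and filter value vary; the query body stays in one place.
        df = spark.sql(f"SELECT * FROM {model}_table WHERE model = '{model}'")
        df.write.mode("overwrite").parquet(f"/output/{model}")  # hypothetical destination

The loop only builds and submits query plans on the driver; the data itself is
still processed in parallel by the executors.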

On Fri, Jan 6, 2023, 3:20 PM Joris Billen 
wrote:

> Hello Community,
> I am working in pyspark with sparksql and have a very complex, very similar
> list of dataframes that I'll have to execute several times for all the
> "models" I have.
> Suppose the code is exactly the same for all models; only the table it
> reads from and some values in the where statements will have the model name
> in them.
> My question is how to prevent repetitive code.
> So instead of doing something like this (this is pseudocode, in reality it
> makes use of lots of complex dataframes), which also would require me to
> change the code every time I change it in the future:
>
> dfmodel1 = sqlContext.sql("SELECT  FROM model1_table WHERE model = 'model1'").write()
> dfmodel2 = sqlContext.sql("SELECT  FROM model2_table WHERE model = 'model2'").write()
> dfmodel3 = sqlContext.sql("SELECT  FROM model3_table WHERE model = 'model3'").write()
>
>
> For loops in spark sound like a bad idea (but that is mainly in terms of
> data, maybe nothing against looping over sql statements). Is it allowed
> to do something like this?
>
>
> spark-submit withloops.py model1,model2,model3
>
> code withloops.py
> models = sys.argv[1].split(",")
> qry = """SELECT  FROM {} WHERE model = '{}'"""
> for i in models:
>     from_table = i + "_table"
>     sqlContext.sql(qry.format(from_table, i)).write()
>
>
>
> I was trying to look up how to refactor pyspark code to avoid redundancy
> but didn't find any relevant links.
>
>
>
> Thanks for input!
>


[pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Joris Billen
Hello Community,
I am working in pyspark with sparksql and have a very complex, very similar list
of dataframes that I'll have to execute several times for all the "models" I
have.
Suppose the code is exactly the same for all models; only the table it reads
from and some values in the where statements will have the model name in them.
My question is how to prevent repetitive code.
So instead of doing something like this (this is pseudocode, in reality it makes
use of lots of complex dataframes), which also would require me to change the
code every time I change it in the future:

dfmodel1 = sqlContext.sql("SELECT  FROM model1_table WHERE model = 'model1'").write()
dfmodel2 = sqlContext.sql("SELECT  FROM model2_table WHERE model = 'model2'").write()
dfmodel3 = sqlContext.sql("SELECT  FROM model3_table WHERE model = 'model3'").write()


For loops in spark sound like a bad idea (but that is mainly in terms of data, 
maybe nothing against looping over sql statements). Is it allowed to do 
something like this?


spark-submit withloops.py model1,model2,model3

code withloops.py
models = sys.argv[1].split(",")
qry = """SELECT  FROM {} WHERE model = '{}'"""
for i in models:
    from_table = i + "_table"
    sqlContext.sql(qry.format(from_table, i)).write()



I was trying to look up how to refactor pyspark code to avoid redundancy but
didn't find any relevant links.
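
One way to factor it out (just a sketch, assuming an existing SparkSession
named spark and a made-up output location) is to wrap the per-model work in a
function and loop over the models passed on the command line:

    import sys
    from pyspark.sql import SparkSession

    def process_model(spark, model, output_base="/tmp/output"):  # output_base is hypothetical
        # Only the table name and the filter value change per model.
        table = f"{model}_table"
        df = spark.sql(f"SELECT * FROM {table} WHERE model = '{model}'")
        df.write.mode("overwrite").parquet(f"{output_base}/{model}")

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("withloops").getOrCreate()
        for model in sys.argv[1].split(","):
            process_model(spark, model)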



Thanks for input!


Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
So I think now that my problem is Spark-related after all. It looks like my
bootstrap script installs SciPy just fine in a regular environment, but
somehow interaction with PySpark breaks it.
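
To narrow that down, a small diagnostic sketch (assuming an existing
SparkSession named spark) can compare what the driver imports with what the
executors import, since the Python environment prepared by the EMR bootstrap
and the one the PySpark executors actually use can diverge:

    def versions(_):
        # Report which NumPy/SciPy builds this Python process actually loads.
        import numpy, scipy
        yield (numpy.__version__, numpy.__file__, scipy.__version__, scipy.__file__)

    print("driver:   ", list(versions(None)))
    print("executors:", spark.sparkContext.parallelize(range(2), 2)
                             .mapPartitions(versions).collect())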

On Fri, Jan 6, 2023 at 12:39 PM Bjørn Jørgensen 
wrote:

> Create a Dockerfile
>
> FROM fedora
>
> RUN sudo yum install -y python3-devel
> RUN sudo pip3 install -U Cython && \
> sudo pip3 install -U pybind11 && \
> sudo pip3 install -U pythran && \
> sudo pip3 install -U numpy && \
> sudo pip3 install -U scipy
>
>
>
>
>
>
> docker build --pull --rm -f "Dockerfile" -t fedoratest:latest "."
>

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
Create a Dockerfile

FROM fedora

RUN sudo yum install -y python3-devel
RUN sudo pip3 install -U Cython && \
sudo pip3 install -U pybind11 && \
sudo pip3 install -U pythran && \
sudo pip3 install -U numpy && \
sudo pip3 install -U scipy






docker build --pull --rm -f "Dockerfile" -t fedoratest:latest "."

Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM fedora
latest: Pulling from library/fedora
Digest:
sha256:3487c98481d1bba7e769cf7bcecd6343c2d383fdd6bed34ec541b6b23ef07664
Status: Image is up to date for fedora:latest
 ---> 95b7a2603d3a
Step 2/3 : RUN sudo yum install -y python3-devel
 ---> Running in a7c648ae7014
Fedora 37 - x86_64  7.5 MB/s |  64 MB 00:08

Fedora 37 openh264 (From Cisco) - x86_64418  B/s | 2.5 kB 00:06

Fedora Modular 37 - x86_64  471 kB/s | 3.0 MB 00:06

Fedora 37 - x86_64 - Updates3.0 MB/s |  20 MB 00:06

Fedora Modular 37 - x86_64 - Updates179 kB/s | 1.1 MB 00:06

Last metadata expiration check: 0:00:01 ago on Fri Jan  6 17:37:59 2023.
Dependencies resolved.

 Package              Architecture  Version         Repository   Size
 =====================================================================
 Installing:
  python3-devel       x86_64        3.11.1-1.fc37   updates     269 k
 Upgrading:
  python3             x86_64        3.11.1-1.fc37   updates      27 k
  python3-libs        x86_64        3.11.1-1.fc37   updates     9.6 M
 Installing dependencies:
  libpkgconf          x86_64        1.8.0-3.fc37    fedora       36 k
  pkgconf             x86_64        1.8.0-3.fc37    fedora       41 k
  pkgconf-m4          noarch        1.8.0-3.fc37    fedora       14 k
  pkgconf-pkg-config  x86_64        1.8.0-3.fc37    fedora       10 k
 Installing weak dependencies:
  python3-pip         noarch        22.2.2-3.fc37   updates     3.1 M
  python3-setuptools  noarch        62.6.0-2.fc37   fedora      1.6 M

Transaction Summary

Install  7 Packages
Upgrade  2 Packages

Total download size: 15 M
Downloading Packages:
(1/9): pkgconf-m4-1.8.0-3.fc37.noarch.rpm   2.9 kB/s |  14 kB 00:05

(2/9): libpkgconf-1.8.0-3.fc37.x86_64.rpm   7.1 kB/s |  36 kB 00:05

(3/9): pkgconf-1.8.0-3.fc37.x86_64.rpm  8.2 kB/s |  41 kB 00:05

(4/9): pkgconf-pkg-config-1.8.0-3.fc37.x86_64.r 143 kB/s |  10 kB 00:00

(5/9): python3-devel-3.11.1-1.fc37.x86_64.rpm   458 kB/s | 269 kB 00:00

(6/9): python3-3.11.1-1.fc37.x86_64.rpm 442 kB/s |  27 kB 00:00

(7/9): python3-setuptools-62.6.0-2.fc37.noarch. 2.1 MB/s | 1.6 MB 00:00

(8/9): python3-pip-22.2.2-3.fc37.noarch.rpm 4.0 MB/s | 3.1 MB 00:00

(9/9): python3-libs-3.11.1-1.fc37.x86_64.rpm7.2 MB/s | 9.6 MB 00:01


Total   1.8 MB/s |  15 MB 00:08

Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing         :                                            1/1
  Upgrading         : python3-libs-3.11.1-1.fc37.x86_64          1/11
  Upgrading         : python3-3.11.1-1.fc37.x86_64               2/11
  Installing        : python3-setuptools-62.6.0-2.fc37.noarch    3/11
  Installing        : python3-pip-22.2.2-3.fc37.noarch           4/11
  Installing        : pkgconf-m4-1.8.0-3.fc37.noarch             5/11
  Installing        : libpkgconf-1.8.0-3.fc37.x86_64             6/11
  Installing        : pkgconf-1.8.0-3.fc37.x86_64                7/11
  Installing        : pkgconf-pkg-config-1.8.0-3.fc37.x86_64     8/11
  Installing        : python3-devel-3.11.1-1.fc37.x86_64         9/11
  Cleanup           : python3-3.11.0-1.fc37.x86_64              10/11
  Cleanup           : python3-libs-3.11.0-1.fc37.x86_64         11/11
  Running scriptlet : python3-libs-3.11.0-1.fc37.x86_64         11/11
  Verifying         : libpkgconf-1.8.0-3.fc37.x86_64             1/11
  Verifying         : pkgconf-1.8.0-3.fc37.x86_64                2/11
  Verifying         : pkgconf-m4-1.8.0-3.fc37.noarch             3/11
  Verifying         : pkgconf-pkg-config-1.8.0-3.fc37.x86_64     4/11
  Verifying         : python3-setuptools-62.6.0-2.fc37.noarch    5/11
  Verifying         : python3-devel-3.11.1-1.fc37.x86_64         6/11
  Verifying         : python3-pip-22.2.2-3.fc37.noarch           7/11
  Verifying         : python3-3.11.1-1.fc37.x86_64               8/11
  Verifying         : python3-3.11.0-1.fc37.x86_64               9/11
  Verifying         : python3-libs-3.11.1-1.fc37.x86_64         10/11
  Verifying         : python3-libs-3.11.0-1.fc37.x86_64         11/11

Upgraded:
  python3-3.11.1-1.fc37.x86_64 python3-libs-3.11.1-1.fc37.x86_64

Installed:
  libpkgconf-1.8.0-3.fc37.x86_64

  pkgconf-1.8.0-3.fc37.x86_64

  pkgconf-m4-1.8.0-3.fc37.noarch

  pkgconf-pkg-config-1.8.0-3.fc37.x86_64

  python3-devel-3.11.1-1.fc37.x86_64

  python3-pip-22.2.2-3.fc37.noarch

  python3-setupt

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Mich Talebzadeh
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp







On Fri, 6 Jan 2023 at 17:13, Oliver Ruebenacker 
wrote:

> Thank you for the link. I already tried most of what was suggested there,
> but without success.
>
> On Fri, Jan 6, 2023 at 11:35 AM Bjørn Jørgensen 
> wrote:
>
>>
>>
>>
>> https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
>>
>>
>>
>>
>> On Fri, Jan 6, 2023 at 16:01, Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   I'm trying to install SciPy using a bootstrap script and then use it
>>> to calculate a new field in a dataframe, running on AWS EMR.
>>>
>>>   Although the SciPy website states that only NumPy is needed, when I
>>> tried to install SciPy using pip, pip kept failing, complaining about
>>> missing software, until I ended up with this bootstrap script:
>>>
>>>
>>>
>>>
>>>
>>>
>>> sudo yum install -y python3-devel
>>> sudo pip3 install -U Cython
>>> sudo pip3 install -U pybind11
>>> sudo pip3 install -U pythran
>>> sudo pip3 install -U numpy
>>> sudo pip3 install -U scipy
>>>
>>>   At this point, the bootstrap seems to be successful, but then at this
>>> line:
>>>
>>> *from scipy.stats import norm*
>>>
>>>   I get the following error:
>>>
>>> *ValueError: numpy.ndarray size changed, may indicate binary
>>> incompatibility. Expected 88 from C header, got 80 from PyObject*
>>>
>>>   Any advice on how to proceed? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>


Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
Thank you for the link. I already tried most of what was suggested there,
but without success.

On Fri, Jan 6, 2023 at 11:35 AM Bjørn Jørgensen 
wrote:

>
>
>
> https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
>
>
>
>
> On Fri, Jan 6, 2023 at 16:01, Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   I'm trying to install SciPy using a bootstrap script and then use it to
>> calculate a new field in a dataframe, running on AWS EMR.
>>
>>   Although the SciPy website states that only NumPy is needed, when I
>> tried to install SciPy using pip, pip kept failing, complaining about
>> missing software, until I ended up with this bootstrap script:
>>
>>
>>
>>
>>
>>
>> sudo yum install -y python3-devel
>> sudo pip3 install -U Cython
>> sudo pip3 install -U pybind11
>> sudo pip3 install -U pythran
>> sudo pip3 install -U numpy
>> sudo pip3 install -U scipy
>>
>>   At this point, the bootstrap seems to be successful, but then at this
>> line:
>>
>> *from scipy.stats import norm*
>>
>>   I get the following error:
>>
>> *ValueError: numpy.ndarray size changed, may indicate binary
>> incompatibility. Expected 88 from C header, got 80 from PyObject*
>>
>>   Any advice on how to proceed? Thanks!
>>
>>  Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute


Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp




On Fri, Jan 6, 2023 at 16:01, Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to install SciPy using a bootstrap script and then use it to
> calculate a new field in a dataframe, running on AWS EMR.
>
>   Although the SciPy website states that only NumPy is needed, when I
> tried to install SciPy using pip, pip kept failing, complaining about
> missing software, until I ended up with this bootstrap script:
>
>
>
>
>
>
> sudo yum install -y python3-devel
> sudo pip3 install -U Cython
> sudo pip3 install -U pybind11
> sudo pip3 install -U pythran
> sudo pip3 install -U numpy
> sudo pip3 install -U scipy
>
>   At this point, the bootstrap seems to be successful, but then at this
> line:
>
> *from scipy.stats import norm*
>
>   I get the following error:
>
> *ValueError: numpy.ndarray size changed, may indicate binary
> incompatibility. Expected 88 from C header, got 80 from PyObject*
>
>   Any advice on how to proceed? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>


[PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
 Hello,

  I'm trying to install SciPy using a bootstrap script and then use it to
calculate a new field in a dataframe, running on AWS EMR.

  Although the SciPy website states that only NumPy is needed, when I tried
to install SciPy using pip, pip kept failing, complaining about missing
software, until I ended up with this bootstrap script:






sudo yum install -y python3-devel
sudo pip3 install -U Cython
sudo pip3 install -U pybind11
sudo pip3 install -U pythran
sudo pip3 install -U numpy
sudo pip3 install -U scipy

  At this point, the bootstrap seems to be successful, but then at this
line:

*from scipy.stats import norm*

  I get the following error:

*ValueError: numpy.ndarray size changed, may indicate binary
incompatibility. Expected 88 from C header, got 80 from PyObject*

  Any advice on how to proceed? Thanks!

 Best, Oliver

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute


Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-06 Thread Aaron Grubb
Hi Mich,

Thanks a lot for the insight, it was very helpful.

Aaron

On Thu, 2023-01-05 at 23:44 +, Mich Talebzadeh wrote:
Hi Aaron,

Thanks for the details.

It is general practice when running Spark on premise to use Hadoop clusters.
This comes from the notion of data locality, which in simple terms means doing
the computation on the node where the data resides. As you are already aware,
Spark is a cluster computing system; it is not a storage system like HDFS or
HBase. Spark is used to process the data stored in such distributed systems.
If a Spark application is processing data stored in HDFS, for example Parquet
files, Spark will attempt to place computation tasks alongside the HDFS blocks.
With HDFS, the Spark driver contacts the NameNode to find the DataNodes
(ideally local) containing the various blocks of a file or directory as well
as their locations (represented as InputSplits), and then schedules the work
onto the Spark workers.

Moving on, Spark on Hadoop communicates with Hive through an efficient API,
without the need for JDBC drivers, which is another advantage here.

Spark can talk to HBase through the Spark-HBase connector, which provides
HBaseContext for Spark to interact with HBase. HBaseContext pushes the
configuration to the Spark executors and allows each Spark executor to have
its own HBase connection.
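
For reference, a rough sketch of what a DataFrame read through the
hbase-connectors Spark module can look like (the table name and column mapping
below are made up, and exact option names can vary between connector versions):

    # Assumes the hbase-spark connector jars are on the classpath and that
    # hbase-site.xml is visible to the driver and executors.
    df = (spark.read
          .format("org.apache.hadoop.hbase.spark")
          .option("hbase.columns.mapping",
                  "id STRING :key, value STRING cf1:value")  # hypothetical mapping
          .option("hbase.table", "events")                    # hypothetical table
          .load())
    df.filter("id = 'some-uuid'").show()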


With regard to your question:


 Would running Spark on YARN on the same machines where both HDFS and HBase are 
running provide localization benefits when Spark reads from HBase, or are 
localization benefits negligible and it's a better idea to put Spark in a 
standalone cluster?


As per my previous points, I believe it does: HBaseContext pushes the
configuration to the Spark executors and allows each executor to have an HBase
connection. Putting Spark on a standalone cluster will add to the cost and,
IMO, will not achieve much.


HTH

 




On Thu, 5 Jan 2023 at 22:53, Aaron Grubb <aa...@kaden.ai> wrote:
Hi Mich,

Thanks for your reply. In hindsight I realize I didn't provide enough 
information about the infrastructure for the question to be answered properly. 
We are currently running a Hadoop cluster with nodes that have the following 
services:

- HDFS NameNode (3.3.4)
- YARN NodeManager (3.3.4)
- HBase RegionServer (2.4.15)
- LLAP on YARN (3.1.3)

So to answer your questions directly, putting Spark on the Hadoop nodes is the 
first idea that I had in order to colocate Spark with HBase for reads (HBase is 
sharing nodes with Hadoop to answer the second question). However, what 
currently happens is, when a Hive query runs that either reads from or writes 
to HBase, there ends up being resource contention as HBase threads "spill over" 
onto vcores that are in theory reserved for YARN. We tolerate this in order for 
both LLAP and HBase to benefit from short-circuit reads, but when it comes to 
Spark, I was hoping to find out if that same localization benefit would exist 
when reading from HBase, or if it would be better to incur the cost of 
inter-server, intra-VPC traffic in order to avoid resource contention between 
Spark and HBase during data loading. Regarding HBase being the speed layer and 
Parquet files being the batch layer, I was more looking at both of them as the 
batch layer, but the role HBase plays is it reduces the amount of data scanning 
and joining needed to support our use case. Basically we receive events that 
number in the thousands, and those events need to be matched to events that 
number in the hundreds of millions, but they both share a UUIDv4, so instead of 
matching those rows in an MR-style job, we run simple inserts into HBase with 
the UUIDv4 as the table key. The parquet files would end up being data from 
HBase that are past the window for us to receive more events for that UUIDv4, 
i.e. static data. I'm happy to draw up a diagram but hopefully these details 
are enough for an understanding of the question.

To attempt to summarize, would running Spark on YARN on the same machines where 
both HDFS and HBase are running provide localization benefits when Spark reads 
from HBase, or ar