Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing.
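
For example, a minimal sketch of that pattern (assuming an existing
SparkSession named spark, a hypothetical output path, and the <model>_table
naming from your pseudocode):

    models = ["model1", "model2", "model3"]
    for model in models:
        # Only the table name and filter value vary; the query body stays in one place.
        df = spark.sql(f"SELECT * FROM {model}_table WHERE model = '{model}'")
        df.write.mode("overwrite").parquet(f"/output/{model}")  # hypothetical destination

The loop only builds and submits query plans on the driver; the data itself is
still processed in parallel by the executors.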

On Fri, Jan 6, 2023, 3:20 PM Joris Billen 
wrote:

> Hello Community,
> I am working in pyspark with sparksql and have a very complex, very similar
> list of dataframes that I'll have to execute several times for all the
> "models" I have.
> Suppose the code is exactly the same for all models; only the table it
> reads from and some values in the where statements will have the model name
> in them.
> My question is how to prevent repetitive code.
> So instead of doing something like this (this is pseudocode, in reality it
> makes use of lots of complex dataframes), which also would require me to
> change the code every time I change it in the future:
>
> dfmodel1 = sqlContext.sql("SELECT  FROM model1_table WHERE model = 'model1'").write()
> dfmodel2 = sqlContext.sql("SELECT  FROM model2_table WHERE model = 'model2'").write()
> dfmodel3 = sqlContext.sql("SELECT  FROM model3_table WHERE model = 'model3'").write()
>
>
> For loops in spark sound like a bad idea (but that is mainly in terms of
> data, maybe nothing against looping over sql statements). Is it allowed
> to do something like this?
>
>
> spark-submit withloops.py model1,model2,model3
>
> code withloops.py
> models = sys.argv[1].split(",")
> qry = """SELECT  FROM {} WHERE model = '{}'"""
> for i in models:
>     from_table = i + "_table"
>     sqlContext.sql(qry.format(from_table, i)).write()
>
>
>
> I was trying to look up how to refactor pyspark code to avoid redundancy
> but didn't find any relevant links.
>
>
>
> Thanks for input!
>


[pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Joris Billen
Hello Community,
I am working in pyspark with sparksql and have a very complex, very similar list
of dataframes that I'll have to execute several times for all the "models" I
have.
Suppose the code is exactly the same for all models; only the table it reads
from and some values in the where statements will have the model name in them.
My question is how to prevent repetitive code.
So instead of doing something like this (this is pseudocode, in reality it makes
use of lots of complex dataframes), which also would require me to change the
code every time I change it in the future:

dfmodel1 = sqlContext.sql("SELECT  FROM model1_table WHERE model = 'model1'").write()
dfmodel2 = sqlContext.sql("SELECT  FROM model2_table WHERE model = 'model2'").write()
dfmodel3 = sqlContext.sql("SELECT  FROM model3_table WHERE model = 'model3'").write()


For loops in spark sound like a bad idea (but that is mainly in terms of data, 
maybe nothing against looping over sql statements). Is it allowed to do 
something like this?


spark-submit withloops.py model1,model2,model3

code withloops.py
models = sys.argv[1].split(",")
qry = """SELECT  FROM {} WHERE model = '{}'"""
for i in models:
    from_table = i + "_table"
    sqlContext.sql(qry.format(from_table, i)).write()



I was trying to look up how to refactor pyspark code to avoid redundancy but
didn't find any relevant links.
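
One way to factor it out (just a sketch, assuming an existing SparkSession
named spark and a made-up output location) is to wrap the per-model work in a
function and loop over the models passed on the command line:

    import sys
    from pyspark.sql import SparkSession

    def process_model(spark, model, output_base="/tmp/output"):  # output_base is hypothetical
        # Only the table name and the filter value change per model.
        table = f"{model}_table"
        df = spark.sql(f"SELECT * FROM {table} WHERE model = '{model}'")
        df.write.mode("overwrite").parquet(f"{output_base}/{model}")

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("withloops").getOrCreate()
        for model in sys.argv[1].split(","):
            process_model(spark, model)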



Thanks for input!


Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
So I think now that my problem is Spark-related after all. It looks like my
bootstrap script installs SciPy just fine in a regular environment, but
somehow interaction with PySpark breaks it.
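
To narrow that down, a small diagnostic sketch (assuming an existing
SparkSession named spark) can compare what the driver imports with what the
executors import, since the Python environment prepared by the EMR bootstrap
and the one the PySpark executors actually use can diverge:

    def versions(_):
        # Report which NumPy/SciPy builds this Python process actually loads.
        import numpy, scipy
        yield (numpy.__version__, numpy.__file__, scipy.__version__, scipy.__file__)

    print("driver:   ", list(versions(None)))
    print("executors:", spark.sparkContext.parallelize(range(2), 2)
                             .mapPartitions(versions).collect())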

On Fri, Jan 6, 2023 at 12:39 PM Bjørn Jørgensen 
wrote:

> Create a Dockerfile
>
> FROM fedora
>
> RUN sudo yum install -y python3-devel
> RUN sudo pip3 install -U Cython && \
> sudo pip3 install -U pybind11 && \
> sudo pip3 install -U pythran && \
> sudo pip3 install -U numpy && \
> sudo pip3 install -U scipy
>
>
>
>
>
>
> docker build --pull --rm -f "Dockerfile" -t fedoratest:latest "."
>

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
Create a Dockerfile

FROM fedora

RUN sudo yum install -y python3-devel
RUN sudo pip3 install -U Cython && \
sudo pip3 install -U pybind11 && \
sudo pip3 install -U pythran && \
sudo pip3 install -U numpy && \
sudo pip3 install -U scipy






docker build --pull --rm -f "Dockerfile" -t fedoratest:latest "."

Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM fedora
latest: Pulling from library/fedora
Digest:
sha256:3487c98481d1bba7e769cf7bcecd6343c2d383fdd6bed34ec541b6b23ef07664
Status: Image is up to date for fedora:latest
 ---> 95b7a2603d3a
Step 2/3 : RUN sudo yum install -y python3-devel
 ---> Running in a7c648ae7014
Fedora 37 - x86_64  7.5 MB/s |  64 MB 00:08

Fedora 37 openh264 (From Cisco) - x86_64418  B/s | 2.5 kB 00:06

Fedora Modular 37 - x86_64  471 kB/s | 3.0 MB 00:06

Fedora 37 - x86_64 - Updates3.0 MB/s |  20 MB 00:06

Fedora Modular 37 - x86_64 - Updates179 kB/s | 1.1 MB 00:06

Last metadata expiration check: 0:00:01 ago on Fri Jan  6 17:37:59 2023.
Dependencies resolved.

 Package              Architecture  Version         Repository   Size
 =====================================================================
 Installing:
  python3-devel       x86_64        3.11.1-1.fc37   updates     269 k
 Upgrading:
  python3             x86_64        3.11.1-1.fc37   updates      27 k
  python3-libs        x86_64        3.11.1-1.fc37   updates     9.6 M
 Installing dependencies:
  libpkgconf          x86_64        1.8.0-3.fc37    fedora       36 k
  pkgconf             x86_64        1.8.0-3.fc37    fedora       41 k
  pkgconf-m4          noarch        1.8.0-3.fc37    fedora       14 k
  pkgconf-pkg-config  x86_64        1.8.0-3.fc37    fedora       10 k
 Installing weak dependencies:
  python3-pip         noarch        22.2.2-3.fc37   updates     3.1 M
  python3-setuptools  noarch        62.6.0-2.fc37   fedora      1.6 M

Transaction Summary

Install  7 Packages
Upgrade  2 Packages

Total download size: 15 M
Downloading Packages:
(1/9): pkgconf-m4-1.8.0-3.fc37.noarch.rpm   2.9 kB/s |  14 kB 00:05

(2/9): libpkgconf-1.8.0-3.fc37.x86_64.rpm   7.1 kB/s |  36 kB 00:05

(3/9): pkgconf-1.8.0-3.fc37.x86_64.rpm  8.2 kB/s |  41 kB 00:05

(4/9): pkgconf-pkg-config-1.8.0-3.fc37.x86_64.r 143 kB/s |  10 kB 00:00

(5/9): python3-devel-3.11.1-1.fc37.x86_64.rpm   458 kB/s | 269 kB 00:00

(6/9): python3-3.11.1-1.fc37.x86_64.rpm 442 kB/s |  27 kB 00:00

(7/9): python3-setuptools-62.6.0-2.fc37.noarch. 2.1 MB/s | 1.6 MB 00:00

(8/9): python3-pip-22.2.2-3.fc37.noarch.rpm 4.0 MB/s | 3.1 MB 00:00

(9/9): python3-libs-3.11.1-1.fc37.x86_64.rpm7.2 MB/s | 9.6 MB 00:01


Total   1.8 MB/s |  15 MB 00:08

Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing         :                                            1/1
  Upgrading         : python3-libs-3.11.1-1.fc37.x86_64          1/11
  Upgrading         : python3-3.11.1-1.fc37.x86_64               2/11
  Installing        : python3-setuptools-62.6.0-2.fc37.noarch    3/11
  Installing        : python3-pip-22.2.2-3.fc37.noarch           4/11
  Installing        : pkgconf-m4-1.8.0-3.fc37.noarch             5/11
  Installing        : libpkgconf-1.8.0-3.fc37.x86_64             6/11
  Installing        : pkgconf-1.8.0-3.fc37.x86_64                7/11
  Installing        : pkgconf-pkg-config-1.8.0-3.fc37.x86_64     8/11
  Installing        : python3-devel-3.11.1-1.fc37.x86_64         9/11
  Cleanup           : python3-3.11.0-1.fc37.x86_64              10/11
  Cleanup           : python3-libs-3.11.0-1.fc37.x86_64         11/11
  Running scriptlet : python3-libs-3.11.0-1.fc37.x86_64         11/11
  Verifying         : libpkgconf-1.8.0-3.fc37.x86_64             1/11
  Verifying         : pkgconf-1.8.0-3.fc37.x86_64                2/11
  Verifying         : pkgconf-m4-1.8.0-3.fc37.noarch             3/11
  Verifying         : pkgconf-pkg-config-1.8.0-3.fc37.x86_64     4/11
  Verifying         : python3-setuptools-62.6.0-2.fc37.noarch    5/11
  Verifying         : python3-devel-3.11.1-1.fc37.x86_64         6/11
  Verifying         : python3-pip-22.2.2-3.fc37.noarch           7/11
  Verifying         : python3-3.11.1-1.fc37.x86_64               8/11
  Verifying         : python3-3.11.0-1.fc37.x86_64               9/11
  Verifying         : python3-libs-3.11.1-1.fc37.x86_64         10/11
  Verifying         : python3-libs-3.11.0-1.fc37.x86_64         11/11

Upgraded:
  python3-3.11.1-1.fc37.x86_64 python3-libs-3.11.1-1.fc37.x86_64

Installed:
  libpkgconf-1.8.0-3.fc37.x86_64

  pkgconf-1.8.0-3.fc37.x86_64

  pkgconf-m4-1.8.0-3.fc37.noarch

  pkgconf-pkg-config-1.8.0-3.fc37.x86_64

  python3-devel-3.11.1-1.fc37.x86_64

  python3-pip-22.2.2-3.fc37.noarch

  python3-setupt

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Mich Talebzadeh
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp







On Fri, 6 Jan 2023 at 17:13, Oliver Ruebenacker 
wrote:

> Thank you for the link. I already tried most of what was suggested there,
> but without success.
>
> On Fri, Jan 6, 2023 at 11:35 AM Bjørn Jørgensen 
> wrote:
>
>>
>>
>>
>> https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
>>
>>
>>
>>
>> On Fri, Jan 6, 2023 at 16:01, Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   I'm trying to install SciPy using a bootstrap script and then use it
>>> to calculate a new field in a dataframe, running on AWS EMR.
>>>
>>>   Although the SciPy website states that only NumPy is needed, when I
>>> tried to install SciPy using pip, pip kept failing, complaining about
>>> missing software, until I ended up with this bootstrap script:
>>>
>>>
>>>
>>>
>>>
>>>
>>> sudo yum install -y python3-devel
>>> sudo pip3 install -U Cython
>>> sudo pip3 install -U pybind11
>>> sudo pip3 install -U pythran
>>> sudo pip3 install -U numpy
>>> sudo pip3 install -U scipy
>>>
>>>   At this point, the bootstrap seems to be successful, but then at this
>>> line:
>>>
>>> *from scipy.stats import norm*
>>>
>>>   I get the following error:
>>>
>>> *ValueError: numpy.ndarray size changed, may indicate binary
>>> incompatibility. Expected 88 from C header, got 80 from PyObject*
>>>
>>>   Any advice on how to proceed? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>


Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
Thank you for the link. I already tried most of what was suggested there,
but without success.

On Fri, Jan 6, 2023 at 11:35 AM Bjørn Jørgensen 
wrote:

>
>
>
> https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
>
>
>
>
> On Fri, Jan 6, 2023 at 16:01, Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   I'm trying to install SciPy using a bootstrap script and then use it to
>> calculate a new field in a dataframe, running on AWS EMR.
>>
>>   Although the SciPy website states that only NumPy is needed, when I
>> tried to install SciPy using pip, pip kept failing, complaining about
>> missing software, until I ended up with this bootstrap script:
>>
>>
>>
>>
>>
>>
>> sudo yum install -y python3-devel
>> sudo pip3 install -U Cython
>> sudo pip3 install -U pybind11
>> sudo pip3 install -U pythran
>> sudo pip3 install -U numpy
>> sudo pip3 install -U scipy
>>
>>   At this point, the bootstrap seems to be successful, but then at this
>> line:
>>
>> *from scipy.stats import norm*
>>
>>   I get the following error:
>>
>> *ValueError: numpy.ndarray size changed, may indicate binary
>> incompatibility. Expected 88 from C header, got 80 from PyObject*
>>
>>   Any advice on how to proceed? Thanks!
>>
>>  Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute


Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp




On Fri, Jan 6, 2023 at 16:01, Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   I'm trying to install SciPy using a bootstrap script and then use it to
> calculate a new field in a dataframe, running on AWS EMR.
>
>   Although the SciPy website states that only NumPy is needed, when I
> tried to install SciPy using pip, pip kept failing, complaining about
> missing software, until I ended up with this bootstrap script:
>
>
>
>
>
>
> sudo yum install -y python3-devel
> sudo pip3 install -U Cython
> sudo pip3 install -U pybind11
> sudo pip3 install -U pythran
> sudo pip3 install -U numpy
> sudo pip3 install -U scipy
>
>   At this point, the bootstrap seems to be successful, but then at this
> line:
>
> *from scipy.stats import norm*
>
>   I get the following error:
>
> *ValueError: numpy.ndarray size changed, may indicate binary
> incompatibility. Expected 88 from C header, got 80 from PyObject*
>
>   Any advice on how to proceed? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute
>


[PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
 Hello,

  I'm trying to install SciPy using a bootstrap script and then use it to
calculate a new field in a dataframe, running on AWS EMR.

  Although the SciPy website states that only NumPy is needed, when I tried
to install SciPy using pip, pip kept failing, complaining about missing
software, until I ended up with this bootstrap script:






sudo yum install -y python3-devel
sudo pip3 install -U Cython
sudo pip3 install -U pybind11
sudo pip3 install -U pythran
sudo pip3 install -U numpy
sudo pip3 install -U scipy

  At this point, the bootstrap seems to be successful, but then at this
line:

*from scipy.stats import norm*

  I get the following error:

*ValueError: numpy.ndarray size changed, may indicate binary
incompatibility. Expected 88 from C header, got 80 from PyObject*

  Any advice on how to proceed? Thanks!

 Best, Oliver

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network, Flannick Lab, Broad Institute


Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-06 Thread Aaron Grubb
Hi Mich,

Thanks a lot for the insight, it was very helpful.

Aaron

On Thu, 2023-01-05 at 23:44 +, Mich Talebzadeh wrote:
Hi Aaron,

Thanks for the details.

It is general practice when running Spark on premise to use Hadoop clusters.
This comes from the notion of data locality, which in simple terms means doing
the computation on the node where the data resides. As you are already aware,
Spark is a cluster computing system; it is not a storage system like HDFS or
HBase. Spark is used to process the data stored in such distributed systems.
If a Spark application is processing data stored in HDFS, for example Parquet
files, Spark will attempt to place computation tasks alongside the HDFS blocks.
With HDFS, the Spark driver contacts the NameNode to find the DataNodes
(ideally local) containing the various blocks of a file or directory as well
as their locations (represented as InputSplits), and then schedules the work
onto the Spark workers.

Moving on, Spark on Hadoop communicates with Hive through an efficient API,
without the need for JDBC drivers, which is another advantage here.

Spark can talk to HBase through the Spark-HBase connector, which provides
HBaseContext for Spark to interact with HBase. HBaseContext pushes the
configuration to the Spark executors and allows each Spark executor to have
its own HBase connection.
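
For reference, a rough sketch of what a DataFrame read through the
hbase-connectors Spark module can look like (the table name and column mapping
below are made up, and exact option names can vary between connector versions):

    # Assumes the hbase-spark connector jars are on the classpath and that
    # hbase-site.xml is visible to the driver and executors.
    df = (spark.read
          .format("org.apache.hadoop.hbase.spark")
          .option("hbase.columns.mapping",
                  "id STRING :key, value STRING cf1:value")  # hypothetical mapping
          .option("hbase.table", "events")                    # hypothetical table
          .load())
    df.filter("id = 'some-uuid'").show()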


With regard to your question:


 Would running Spark on YARN on the same machines where both HDFS and HBase are 
running provide localization benefits when Spark reads from HBase, or are 
localization benefits negligible and it's a better idea to put Spark in a 
standalone cluster?


As per my previous points, I believe it does: HBaseContext pushes the
configuration to the Spark executors and allows each executor to have an HBase
connection. Putting Spark on a standalone cluster will add to the cost and,
IMO, will not achieve much.


HTH

 




On Thu, 5 Jan 2023 at 22:53, Aaron Grubb <aa...@kaden.ai> wrote:
Hi Mich,

Thanks for your reply. In hindsight I realize I didn't provide enough 
information about the infrastructure for the question to be answered properly. 
We are currently running a Hadoop cluster with nodes that have the following 
services:

- HDFS NameNode (3.3.4)
- YARN NodeManager (3.3.4)
- HBase RegionServer (2.4.15)
- LLAP on YARN (3.1.3)

So to answer your questions directly, putting Spark on the Hadoop nodes is the 
first idea that I had in order to colocate Spark with HBase for reads (HBase is 
sharing nodes with Hadoop to answer the second question). However, what 
currently happens is, when a Hive query runs that either reads from or writes 
to HBase, there ends up being resource contention as HBase threads "spill over" 
onto vcores that are in theory reserved for YARN. We tolerate this in order for 
both LLAP and HBase to benefit from short-circuit reads, but when it comes to 
Spark, I was hoping to find out if that same localization benefit would exist 
when reading from HBase, or if it would be better to incur the cost of 
inter-server, intra-VPC traffic in order to avoid resource contention between 
Spark and HBase during data loading. Regarding HBase being the speed layer and 
Parquet files being the batch layer, I was more looking at both of them as the 
batch layer, but the role HBase plays is it reduces the amount of data scanning 
and joining needed to support our use case. Basically we receive events that 
number in the thousands, and those events need to be matched to events that 
number in the hundreds of millions, but they both share a UUIDv4, so instead of 
matching those rows in an MR-style job, we run simple inserts into HBase with 
the UUIDv4 as the table key. The parquet files would end up being data from 
HBase that are past the window for us to receive more events for that UUIDv4, 
i.e. static data. I'm happy to draw up a diagram but hopefully these details 
are enough for an understanding of the question.

To attempt to summarize, would running Spark on YARN on the same machines where 
both HDFS and HBase are running provide localization benefits when Spark reads 
from HBase, or ar