How can I sync 2 Hive clusters

2021-07-29 Thread igyu
I want to read data from Hive cluster1
and write the data to Hive cluster2.

How can I do it?

Note: cluster1 and cluster2 both have Kerberos enabled.
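
A rough, untested sketch of one possible approach (all hostnames, realms, database and table names below are placeholders): point the SparkSession at cluster1's metastore, ask Spark to obtain delegation tokens for cluster2's HDFS as well, write the data to cluster2's warehouse path, and register the table on cluster2 separately (for example with beeline):

# Sketch only: all hostnames, paths and names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-cluster1-to-cluster2")
    # Metastore of the source cluster (cluster1)
    .config("hive.metastore.uris", "thrift://metastore.cluster1.example.com:9083")
    # Ask Spark to also fetch Kerberos delegation tokens for cluster2's HDFS
    # (on Spark 2.x the key is spark.yarn.access.hadoopFileSystems)
    .config("spark.kerberos.access.hadoopFileSystems",
            "hdfs://namenode.cluster2.example.com:8020")
    .enableHiveSupport()
    .getOrCreate()
)

# Read from cluster1 and copy the rows to cluster2's filesystem; a
# CREATE EXTERNAL TABLE over that path then has to be run on cluster2.
df = spark.table("source_db.source_table")
(df.write
   .mode("overwrite")
   .parquet("hdfs://namenode.cluster2.example.com:8020/warehouse/target_db/target_table"))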



igyu


Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-29 Thread Artemis User
Has anyone had any experience with running Spark-Rapids on a GPU-powered 
cluster (https://github.com/NVIDIA/spark-rapids)?  I am very interested 
in knowing:


1. What is the hardware/software platform and the type of Spark cluster
   you are using to run Spark-Rapids?
2. How easy was the installation process?
3. Are you running Scala or PySpark or both with Spark-Rapids?
4. How does the performance you've seen compare with running a CPU-only cluster?
5. Any pros/cons of using Spark-Rapids?

Thanks a lot in advance!

-- ND



Hacking my way through Kubernetes docker file

2021-07-29 Thread Mich Talebzadeh
You may recall that I raised a few questions here and on Stacktrace
regarding two items, both related to running PySpark inside Kubernetes.

The challenges were:


   1. Loading third-party packages like tensorflow, numpy and pyyaml into a
   running job in k8s
   2. Reading from a yaml file to load initialisation variables into the
   PySpark job


The option of using --archives pyspark_venv.tar.gz#pyspark_venv etc. did not
work: it could not unzip and untar the file. Moreover, the archive is only
loaded after the SparkContext is created, and the job was crashing all the time.

Having thought about this and talked to a fellow forum member, I decided, on
their advice, to add the packages to the Dockerfile that generates the PySpark
image. So basically this:

File: $SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile

RUN pip install pyyaml numpy cx_Oracle pyspark tensorflow
When the image is built, you get:

Step 8/20 : RUN pip install pyyaml numpy cx_Oracle pyspark tensorflow
 ---> Running in 9efc3cb25f25

The next challenge was to add config.yml somewhere in the image where I could
pick it up. I also wanted to be consistent, so that whatever the run mode
(k8s, YARN, local etc.), I could pick up this config.yml file seamlessly and
would not need to change the code.

On-prem this file was read from the directory
/home/hduser/dba/bin/python/DSBQ/conf
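
For what it is worth, the reading side is just PyYAML. A minimal sketch (the
helper below is illustrative, not the actual job code) that works unchanged in
local, YARN and k8s modes because the path is the same everywhere:

# Sketch: load initialisation variables from config.yml with PyYAML.
# The path matches the on-prem directory and the one baked into the image.
import yaml

CONF_DIR = "/home/hduser/dba/bin/python/DSBQ/conf"

def load_config(name="config.yml"):
    with open(f"{CONF_DIR}/{name}") as f:
        return yaml.safe_load(f)

settings = load_config()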

I also wanted to be able to edit the file inside the container as
spark_uid=185, so I needed to install vim as well:

RUN ["apt-get","install","-y","vim"]
RUN mkdir -p /home/hduser/dba/bin/python/DSBQ/conf
RUN chmod g+w /home/hduser/dba/bin/python/DSBQ/conf
COPY config.yml /home/hduser/dba/bin/python/DSBQ/conf/config.yml
RUN chmod g+w /home/hduser/dba/bin/python/DSBQ/conf/config.yml

Note that for this to work you will need to copy config.yml (no soft link) to
somewhere under $SPARK_HOME so that it gets copied into the Docker image and
persists there.

Once the image is generated, you can log in to it like below:

docker run -it be686602970d bash
185@5b4c23427cfd:/opt/spark/work-dir$ ls -l /home/hduser/dba/bin/python/DSBQ/conf
total 12
-rw-rw-r--. 1 root root 4433 Jul 29 19:32 config.yml
-rw-rw-r--. 1 root root  824 Jul 29 20:10 config_test.yml

Now you can edit those two files, as the user is a member of the root group:

185@5b4c23427cfd:/opt/spark/work-dir$ id
uid=185(185) gid=0(root) groups=0(root)

You can also log in to the container as root:

docker run -u 0 -it be686602970d bash
root@addbe3ffb9fa:/opt/spark/work-dir# id
uid=0(root) gid=0(root) groups=0(root)

Hope this helps someone. It is a hack but it works for now.




Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Mich Talebzadeh
Yes indeed, very good points by Artemis User.

Just to add if I may: why choose Spark? Generally, a parallel architecture
comes into play when the data is significantly larger than what can be handled
on a single machine; that is when the use of Spark becomes meaningful. In
cases where the (generated) data size is going to be very large (which is more
often the norm than the exception these days), the data cannot be processed
and stored in Pandas dataframes, as these dataframes hold their data in RAM.
The whole dataset cannot simply be collected from storage such as HDFS or
cloud storage either, because that would take significant time and space and
probably would not fit in a single machine's RAM.
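
As a rough illustration (the file path and column names below are made up),
the same aggregation in Pandas has to load the whole file into a single
machine's RAM, whereas the PySpark equivalent distributes both the read and
the aggregation across the executors:

# Illustration only; the path and columns are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Pandas: the entire dataset must fit in the RAM of one machine.
pdf = pd.read_parquet("/data/sales.parquet")
pandas_result = pdf.groupby("region")["amount"].sum()

# PySpark: the read and the aggregation are partitioned across executors.
spark = SparkSession.builder.appName("pandas-vs-pyspark").getOrCreate()
sdf = spark.read.parquet("hdfs:///data/sales.parquet")
sdf.groupBy("region").agg(F.sum("amount").alias("amount")).show()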

HTH







On Thu, 29 Jul 2021 at 15:12, Artemis User  wrote:

> PySpark still uses the Spark dataframe underneath (it wraps Java code).  Use
> PySpark when you have to deal with big data ETL and analytics so you can
> leverage the distributed architecture in Spark.  If your job is simple, the
> dataset is relatively small, and it doesn't require distributed processing,
> use Pandas.
>
> -- ND
>
> On 7/29/21 9:02 AM, ashok34...@yahoo.com.INVALID wrote:
>
> Hello team
>
> Someone asked me about well developed Python code with a Pandas dataframe
> and comparing that to PySpark.
>
> Under what situations would one choose PySpark instead of Python and Pandas?
>
> Appreciate
>
>
> AK
>
>
>
>
>


Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Artemis User
PySpark still uses the Spark dataframe underneath (it wraps Java code). Use
PySpark when you have to deal with big data ETL and analytics so you can
leverage the distributed architecture in Spark. If your job is simple, the
dataset is relatively small, and it doesn't require distributed processing,
use Pandas.


-- ND

On 7/29/21 9:02 AM, ashok34...@yahoo.com.INVALID wrote:

Hello team

Someone asked me about well developed Python code with a Pandas
dataframe and comparing that to PySpark.


Under what situations would one choose PySpark instead of Python and Pandas?

Appreciate


AK





Re: Connection Reset by Peer : failed to remove cached rdd

2021-07-29 Thread Artemis User
Can you please post the error log/exception messages?  There is not
enough info to help diagnose what the real problem is.


On 7/29/21 8:55 AM, Big data developer need help relat to spark gateway 
roles in 2.0 wrote:


Hi Team,

We are facing an issue in production where we frequently get

"Still have 1 request outstanding when connection with the hostname was
closed"

and "connection reset by peer" errors, as well as warnings about failing to
remove a cached rdd or failing to remove a broadcast variable.

Please help us with how to mitigate this. Our current settings are:

Executor memory: 12g
Network timeout: 60
Heartbeat interval: 25
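
For reference, a best-guess mapping of the settings above onto Spark
configuration keys (values as quoted; both spark.network.timeout and
spark.executor.heartbeatInterval take time units, and the heartbeat interval
should stay well below the network timeout):

# Sketch of the quoted settings expressed as Spark configs; adjust to taste.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "12g")
    .config("spark.network.timeout", "60s")             # default is 120s
    .config("spark.executor.heartbeatInterval", "25s")  # must be << network timeout
    .getOrCreate()
)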


 




Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread ashok34...@yahoo.com.INVALID
Hello team
Someone asked me about well developed Python code with a Pandas dataframe and
comparing that to PySpark.
Under what situations would one choose PySpark instead of Python and Pandas?
Appreciate

AK
 

Connection Reset by Peer : failed to remove cached rdd

2021-07-29 Thread Big data developer need help relat to spark gateway roles in 2 . 0
Hi Team,

We are facing an issue in production where we frequently get "Still have 1
request outstanding when connection with the hostname was closed" and
"connection reset by peer" errors, as well as warnings about failing to remove
a cached rdd or failing to remove a broadcast variable.

Please help us with how to mitigate this. Our current settings are:

Executor memory: 12g
Network timeout: 60
Heartbeat interval: 25

	








Re: Spark Architecture Question

2021-07-29 Thread Pasha Finkelshteyn
Hi Renganathan,

Not quite. It strongly depends on your usage of UDFs, defined in any
manner, whether as UDF objects or plain lambdas. If you have any, they may
and will be called on the executors too, so the runtime has to be available
there.
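
For example (a made-up snippet; the function and column names are arbitrary),
a plain Python UDF like the one below is executed in Python worker processes
on the executors, which is why a Python runtime is needed on every worker node:

# Hypothetical example: the Python function below is serialized to the
# executors and runs inside Python worker processes there, not on the driver.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-on-executors").getOrCreate()

@udf(returnType=StringType())
def shout(s):
    return s.upper() if s is not None else None

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df.select(shout("word").alias("shouted")).show()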

On 21/07/29 05:17, Renganathan Mutthiah wrote:
> Hi,
> 
> I have read in many materials (including from the book: Spark - The
> Definitive Guide) that Spark is a compiler.
> 
> In my understanding, our program is used only up to the point of DAG
> generation. This portion can be written in any language: Java, Scala, R or
> Python. After that (executing the DAG), the engine runs in Scala only. This
> leads to Spark being called a compiler.
> 
> If the above is true, we need to install R / Python only on the driver
> machine. The R / Python runtime is not needed on worker nodes. Am I correct?
> 
> Thanks!

-- 
Regards,
Pasha

Big Data Tools @ JetBrains




Spark Architecture Question

2021-07-29 Thread Renganathan Mutthiah
Hi,

I have read in many materials (including from the book: Spark - The
Definitive Guide) that Spark is a compiler.

In my understanding, our program is used only up to the point of DAG
generation. This portion can be written in any language: Java, Scala, R or
Python. After that (executing the DAG), the engine runs in Scala only. This
leads to Spark being called a compiler.

If the above is true, we need to install R / Python only on the driver
machine. The R / Python runtime is not needed on worker nodes. Am I correct?

Thanks!