How can I sync 2 hive cluster

2021-07-29 Thread igyu
I want to read data from Hive cluster1 and write it to Hive cluster2. How can I do it? Note: Kerberos is enabled on both cluster1 and cluster2. igyu

Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-29 Thread Artemis User
Has anyone had any experience with running Spark-Rapids on a GPU-powered cluster (https://github.com/NVIDIA/spark-rapids)?  I am very interested in knowing: 1. What is the hardware/software platform and the type of Spark cluster you are using to run Spark-Rapids? 2. How easy was the

Hacking my way through Kubernetes docker file

2021-07-29 Thread Mich Talebzadeh
You may recall that I raised a few questions here and in Stacktrace regarding two items, both related to running PySpark inside Kubernetes. The challenges were: 1. Load third-party packages like tensorflow, numpy, pyyaml in a running job in k8s 2. How to read from a yaml file to load

Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Mich Talebzadeh
Yes indeed, very good points by Artemis User. Just to add, if I may: why choose Spark? Generally, a parallel architecture comes into play when the data is too large to be handled on a single machine; that is when using Spark becomes meaningful. In cases where (the

Re: Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread Artemis User
PySpark still uses the Spark dataframe underneath (it wraps Java code). Use PySpark when you have to deal with big data ETL and analytics, so you can leverage the distributed architecture in Spark. If your job is simple, the dataset is relatively small, and it doesn't require distributed processing, use

Re: Connection Reset by Peer : failed to remove cached rdd

2021-07-29 Thread Artemis User
Can you please post the error log/exception messages? There is not enough info to help diagnose what the real problem is. On 7/29/21 8:55 AM, Big data developer need help relat to spark gateway roles in 2.0 wrote: Hi Team, We are facing an issue in production where we are getting frequent

Well balanced Python code with Pandas compared to PySpark

2021-07-29 Thread ashok34...@yahoo.com.INVALID
Hello team, Someone asked me about well-developed Python code using Pandas dataframes and how it compares to PySpark. Under what situations would one choose PySpark instead of Python and Pandas? Appreciated, AK

Connection Reset by Peer : failed to remove cached rdd

2021-07-29 Thread Big data developer need help relat to spark gateway roles in 2 . 0
Hi Team, We are facing an issue in production where we are getting frequent "Still have 1 request outstanding when connection with the hostname was closed" and "connection reset by peer" errors, as well as warnings such as "failed to remove cache rdd" or "failed to remove broadcast variable". Please help us how to

Re: Spark Architecture Question

2021-07-29 Thread Pasha Finkelshteyn
Hi Renganathan, Not quite. It strongly depends on your usage of UDFs defined in any manner, whether as UDF objects or just lambdas. If you have any, they may and will be called on executors too. On 21/07/29 05:17, Renganathan Mutthiah wrote: > Hi, > > I have read in many materials (including from the

Spark Architecture Question

2021-07-29 Thread Renganathan Mutthiah
Hi, I have read in many materials (including the book Spark: The Definitive Guide) that Spark is a compiler. In my understanding, our program is used only until the point of DAG generation. This portion can be written in any language: Java, Scala, R, Python. Post that (executing the DAG), the