Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread ayan guha
Hi - interesting stuff. My stance has always been: Spark native functions, then pandas, then native Python - in that order. To OP - did you try the code? What kind of perf are you seeing? Just curious, why do you think UDFs are bad? On Sat, 10 Apr 2021 at 2:36 am, Sean Owen wrote: > Actually, good

Spark Hbase Hive error in EMR

2021-04-09 Thread KhajaAsmath Mohammed
Hi, I am trying to connect to an HBase table which is exposed in Hive as an external table. I am getting the below exception. Am I missing anything to pass here? 21/04/09 18:08:11 INFO ZooKeeper: Client environment:user.dir=/ 21/04/09 18:08:11 INFO ZooKeeper: Initiating client connection,

Re: GPU job in Spark 3

2021-04-09 Thread Sean Owen
(I apologize, I totally missed that this should use GPUs because of RAPIDS. Ignore my previous. But yeah it's more a RAPIDS question.) On Fri, Apr 9, 2021 at 12:09 PM HaoZ wrote: > Hi Martin, > > I tested the local mode in Spark on Rapids Accelerator and it works fine > for > me. > The only

Re: GPU job in Spark 3

2021-04-09 Thread HaoZ
Hi Martin, I tested the local mode in Spark on Rapids Accelerator and it works fine for me. The only possible issue is CUDA 11.2; however, the supported CUDA version as per https://nvidia.github.io/spark-rapids/docs/download.html is 11.0. Here is a quick test using Spark local mode. Note: When
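[Editor's sketch of what such a local-mode setup typically looks like, pieced together from the spark-rapids getting-started docs; the jar path is a placeholder, not from this thread:]

    from pyspark.sql import SparkSession

    # Enable the RAPIDS plugin in local mode (plugin class name per the
    # spark-rapids documentation)
    spark = (SparkSession.builder
             .master("local[*]")
             .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
             .config("spark.rapids.sql.enabled", "true")
             # .config("spark.jars", "/path/to/rapids-4-spark.jar")  # placeholder path
             .getOrCreate())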

Re: GPU job in Spark 3

2021-04-09 Thread Tom Graves
Hey Martin, I would encourage you to file issues in the spark-rapids repo for questions with that plugin: https://github.com/NVIDIA/spark-rapids/issues I'm assuming the query ran and you looked at the SQL UI or the .explain() output and it was on CPU and not GPU? I am assuming you have the
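[For reference, a quick way to run the check Tom describes; the "Gpu" operator prefix is documented spark-rapids behavior, and df stands for any query of yours:]

    # Operators the RAPIDS plugin moved to the GPU show up with a "Gpu"
    # prefix (e.g. GpuProject, GpuFilter) in the physical plan.
    df.explain()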

Re: possible bug

2021-04-09 Thread Mich Talebzadeh
Spark 3.1.1

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Sean Owen
Actually, good question, I'm not sure. I don't think that Spark would vectorize these operations over rows. Whereas in a pandas UDF, given a DataFrame, you can apply operations like sin to 1000s of values at once in native code via numpy. It's trivially 'vectorizable' and I've seen good wins over,
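[To make that concrete, a rough sketch of a vectorized pandas UDF for the haversine distance discussed in this thread; the column names (lon1, lat1, lon2, lat2) and the 6371 km Earth radius are illustrative assumptions, not from the thread:]

    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def haversine_km(lon1: pd.Series, lat1: pd.Series,
                     lon2: pd.Series, lat2: pd.Series) -> pd.Series:
        # numpy applies sin/cos to thousands of rows per batch in native code
        lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
        a = (np.sin((lat2 - lat1) / 2) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * np.arcsin(np.sqrt(a))

    df = df.withColumn("distance_km", haversine_km("lon1", "lat1", "lon2", "lat2"))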

Re: possible bug

2021-04-09 Thread Mich Talebzadeh
I ran this one on RHES 7.6 with 64GB of memory and it hit OOM:
>>> data = list(range(rows))
>>> rdd = sc.parallelize(data, rows)
>>> assert rdd.getNumPartitions() == rows
>>> rdd0 = rdd.filter(lambda x: False)
>>> assert rdd0.getNumPartitions() == rows
>>> rdd00 = rdd0.coalesce(1)
>>> data = rdd00.collect()

Re: GPU job in Spark 3

2021-04-09 Thread Sean Owen
I don't see anything in this job that would use a GPU? On Fri, Apr 9, 2021 at 11:19 AM Martin Somers wrote: > > Hi Everyone !! > > I'm trying to get an on-premise GPU instance of Spark 3 running on my Ubuntu > box, and I am following: > >

Re: possible bug

2021-04-09 Thread Sean Owen
OK so it's '7 threads overwhelming off heap mem in the JVM' kind of thing. Or running afoul of ulimits in the OS. On Fri, Apr 9, 2021 at 11:19 AM Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > Hi Sean! > > So the "coalesce" without shuffle will create a CoalescedRDD which during

Re: possible bug

2021-04-09 Thread Attila Zsolt Piros
Hi Sean! So the "coalesce" without shuffle will create a CoalescedRDD which during its computation delegates to the parent RDD partitions. As the CoalescedRDD contains only 1 partition, we are talking about 1 task and 1 task context. The next stop is PythonRunner. Here the python workers at least
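[A one-line sketch of the alternative this analysis implies, added for illustration rather than taken from the thread: asking coalesce for a shuffle keeps the upstream parallelism, so only the final stage runs with one partition:]

    # Without shuffle, the single coalesced task computes all parent partitions
    # itself; with shuffle=True a stage boundary is inserted instead.
    rdd00 = rdd0.coalesce(1, shuffle=True)   # equivalent to rdd0.repartition(1)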

GPU job in Spark 3

2021-04-09 Thread Martin Somers
Hi Everyone !! I'm trying to get an on-premise GPU instance of Spark 3 running on my Ubuntu box, and I am following: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#example-join-operation Anyone with any insight into why a Spark job isn't being run on the GPU -

Re: possible bug

2021-04-09 Thread Mich Talebzadeh
Interesting, unittest not pytest :) What is data in [11] reused compared to 5 -- list()? HTH

Re: possible bug

2021-04-09 Thread Sean Owen
Yeah I figured it's not something fundamental to the task or Spark. The error is very odd, never seen that. Do you have a theory on what's going on there? I don't! On Fri, Apr 9, 2021 at 10:43 AM Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > Hi! > > I looked into the code and find

Re: possible bug

2021-04-09 Thread Attila Zsolt Piros
Hi! I looked into the code and found a way to improve it. With the improvement your test runs just fine: Welcome to Spark version 3.2.0-SNAPSHOT, Using Python version 3.8.1

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread ayan guha
Hi Sean - absolutely open to suggestions. My impression was that using Spark native functions should provide similar perf as the Scala ones, because the serialization penalty should not be there, unlike with native Python udfs. Is that understanding wrong? On Fri, 9 Apr 2021 at 10:55 pm, Rao Bandaru wrote: > Hi

Re: How to use spark streaming data to plot live line chart

2021-04-09 Thread Mich Talebzadeh
OK, are you using PySpark for this, with PyCharm or whatever? Then convert that DF into a Pandas DF and do the plot. Check Google for the needed packages.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
summary_df = spark.sql(f"""SELECT datetaken, salesvolume as volumeOfSales
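[Filling in the shape of that approach as a sketch; the table name "sales" below is a placeholder, while datetaken and salesvolume are the columns from Mich's snippet:]

    import matplotlib.pyplot as plt

    summary_df = spark.sql(
        """SELECT datetaken, salesvolume AS volumeOfSales FROM sales""")  # placeholder table
    pdf = summary_df.toPandas()   # collect the (small) aggregate to the driver
    pdf.plot(x="datetaken", y="volumeOfSales", kind="line")
    plt.show()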

RE: How to use spark streaming data to plot live line chart

2021-04-09 Thread Muhammed Favas
Hi, No, I am using normal Spark streaming using the DStream API. Regards, Favas From: Mich Talebzadeh Sent: Friday, April 9, 2021 18:18 To: Muhammed Favas Cc: user@spark.apache.org Subject: Re: How to use spark streaming data to plot live line chart Hi, Within the event driven architecture

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Rao Bandaru
Hi All, yes, I need to add the below scenario-based code to the executing Spark job; while executing this it took a lot of time to complete. Please suggest the best way to meet the below requirement without using a UDF. Thanks, Ankamma Rao B From: Sean Owen Sent: Friday,

Re: How to use spark streaming data to plot live line chart

2021-04-09 Thread Mich Talebzadeh
Hi, Within the event-driven architecture are you using Spark Structured Streaming with foreachBatch to pick up the streaming data? HTH
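[For context, a minimal sketch of the foreachBatch pattern Mich is referring to; streaming_df and the plotting body are placeholders:]

    # Each micro-batch arrives as a regular DataFrame that can be converted
    # to pandas and pushed to a live chart.
    def plot_batch(batch_df, batch_id):
        pdf = batch_df.toPandas()   # safe only for small per-batch results
        # ... update the live line chart from pdf here ...

    query = (streaming_df.writeStream
             .foreachBatch(plot_batch)
             .start())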

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Sean Owen
Note that this can be significantly faster with a pandas UDF, because you can vectorize the operations. On Fri, Apr 9, 2021, 7:32 AM ayan guha wrote: > Hi > > We are using a haversine distance function for this, and wrapping it in > udf. > > from pyspark.sql.functions import acos, cos, sin, lit,

How to use spark streaming data to plot live line chart

2021-04-09 Thread Muhammed Favas
Hi, I have an application that collects streaming data and transforms it into a dataframe. Now I want to plot a live line chart using this data, each time a new set of data arrives in the Spark RDD. Please suggest the best solution to implement this. Regards, Favas

Re: [Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread ayan guha
Hi We are using a haversine distance function for this, and wrapping it in a udf:
from pyspark.sql.functions import acos, cos, sin, lit, toRadians, udf
from pyspark.sql.types import *
def haversine_distance(long_x, lat_x, long_y, lat_y):
    return acos( sin(toRadians(lat_x)) *
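[Since the body above is built entirely from Spark column functions, here is a sketch of the same spherical-law-of-cosines distance applied directly, with no udf wrapper at all; completing the truncated formula is an assumption based on the standard identity, and the column names are illustrative:]

    from pyspark.sql.functions import acos, cos, sin, toRadians

    # Distance in km on a sphere of radius 6371 km, computed from native
    # column expressions so no Python serialization is involved.
    def haversine_distance(long_x, lat_x, long_y, lat_y):
        return acos(
            sin(toRadians(lat_x)) * sin(toRadians(lat_y))
            + cos(toRadians(lat_x)) * cos(toRadians(lat_y))
              * cos(toRadians(long_x) - toRadians(long_y))
        ) * 6371.0

    df = df.withColumn("distance_km",
                       haversine_distance("lon1", "lat1", "lon2", "lat2"))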

[Spark SQL]: to calculate distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe

2021-04-09 Thread Rao Bandaru
Hi All, I have a requirement to calculate the distance between four coordinates (Latitude1, Longitude1, Latitude2, Longitude2) in the pyspark dataframe with the help of "from geopy import distance", without using a UDF (user-defined function). Please help how to achieve this scenario and do the

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-09 Thread Mich Talebzadeh
Hi, Regarding your point: I won't be able to defend this request by telling Spark users the previous major release was and still is more stable than the latest major release ... With the benefit of hindsight, version 3.1.1 was released recently, and the definition of stable (from a practical

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-09 Thread Maziyar Panahi
Thanks Mich, I will ask all of our users to use pyspark 3.0.x and will change all the notebooks/scripts to switch back from 3.1.1 to 3.0.2. That being said, I won't be able to defend this request by telling Spark users the previous major release was and still is more stable than the latest