Re: pyspark dataframe join with two different data type

2024-05-17 Thread Karthick Nk
Hi All, I have tried to get the same result with pyspark and with a SQL query by creating a tempView, and I was able to achieve it that way, whereas I have to do it in the pyspark code itself. Could you help on this? incoming_data = [["a"], ["b"], ["d"]] column_names = ["c

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk
Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1; if I explode df_1 I am getting only the data column. But the result should contain all the columns from df_1, with distinct results like below. Results in *df:* +-------+ |column1| +-------+ |

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
You can use a combination of explode and distinct before joining. from pyspark.sql import SparkSession from pyspark.sql.functions import explode # Create a SparkSession spark = SparkSession.builder \ .appName("JoinExample") \ .getOrCreate() sc = spark.sparkContext # Set the log level to
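
A minimal sketch of the explode/distinct approach described above, with illustrative column names (df_1 holds an array column, df_2 a plain string column):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, col

    spark = SparkSession.builder.appName("JoinExample").getOrCreate()

    df_1 = spark.createDataFrame([(["a", "b"],), (["d"],)], ["data"])    # array of strings
    df_2 = spark.createDataFrame([("a",), ("b",), ("d",)], ["column1"])  # plain strings

    # Explode the array so each element becomes a row, then drop duplicates
    exploded = df_1.select(explode(col("data")).alias("value")).distinct()

    # Inner join on the scalar value keeps only strings present in the arrays
    df_2.join(exploded, df_2.column1 == exploded.value, "inner").select("column1").show()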

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Karthick Nk
Hi All, Does anyone have any idea or suggestion for an alternate way to achieve this scenario? Thanks. On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote: > Right now, with the structure of your data, it isn't possible. > > The rows aren't duplicates of each other. "a" and "b" both exist in

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Karthick Nk
Hi Mich, Thanks for the solution, but I am getting duplicate results by using array_contains. I have explained the scenario below; could you help me with how we can achieve this? I have tried different ways but was not able to achieve it. For example data = [ ["a"], ["b"], ["d"], ]

Traceback is missing content in pyspark when invoked with UDF

2024-05-01 Thread Indivar Mishra
error > Traceback (most recent call last): > > File "temp.py", line 28, in <module> > > fun() > > File "temp.py", line 25, in fun > > df.select(col("Seqno"), >> errror_func_udf(col("Name")).alias("Name"

Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
and they cannot focus for long hours. I let him explore Python on his >> Windows 10 laptop and download it himself. In the following video Christian >> explains to his mother what he started to do just before going to bed. BTW, >> when he says 32M he means 32-bit. I leave i

Re: Python for the kids and now PySpark

2024-04-27 Thread Farshid Ashouri
focus for long hours. I let him explore Python on his > Windows 10 laptop and download it himself. In the following video Christian > explains to his mother what he started to do just before going to bed. BTW, > when he says 32M he means 32-bit. I leave it to you to judge :) Now the > i

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-19 Thread Varun Shah
Hi @Mich Talebzadeh, community, Where can I find such insights on the Spark Architecture? I found a few sites below which do cover internals: 1. https://github.com/JerryLead/SparkInternals 2. https://books.japila.pl/apache-spark-internals/overview/ 3.

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Mon, Mar 18, 2024 at 1:16 PM Mich Talebzadeh wrote: > > "I may need something like that for synthetic data for testing. Any way to > do that ?" > > Have a look at this. > > https://github.com/joke2k/faker > No I was not actually referring to data that can be faked. I want data to actually

pyspark - Use Spark to generate a large dataset on the fly

2024-03-18 Thread Sreyan Chakravarty
Hi, I have a specific problem where I have to get the data from REST APIs and store it, and then do some transformations on it and then write to a RDBMS table. I am wondering if Spark will help in this regard. I am confused as to how do I store the data while I actually acquire it on the driver

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Mich Talebzadeh
Yes, transformations are indeed executed on the worker nodes, but they are only performed when necessary, usually when an action is called. This lazy evaluation helps in optimizing the execution of Spark jobs by allowing Spark to optimize the execution plan and perform optimizations such as

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-18 Thread Sreyan Chakravarty
On Fri, Mar 15, 2024 at 3:10 AM Mich Talebzadeh wrote: > > No Data Transfer During Creation: --> Data transfer occurs only when an > action is triggered. > Distributed Processing: --> DataFrames are distributed for parallel > execution, not stored entirely on the driver node. > Lazy Evaluation

Re: pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Mich Talebzadeh
Hi, When you create a DataFrame from Python objects using spark.createDataFrame, here it goes: *Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to executors at this point. *The role of lazy Evaluation:* Spark
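
A minimal sketch illustrating the sequence described above (names are illustrative): the rows live in driver memory when the DataFrame is built, and nothing is shipped to executors until an action runs:

    from pyspark.sql import SparkSession, Row
    import datetime

    spark = SparkSession.builder.appName("LocalCreate").getOrCreate()

    courses = [Row(course_id=1, course_title="Spark", created=datetime.date(2024, 1, 1))]

    # Building the DataFrame only records a plan over the driver-local rows
    df = spark.createDataFrame(courses)

    # The action is what triggers distribution and execution on the executors
    df.filter(df.course_id == 1).show()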

pyspark - Where are Dataframes created from Python objects stored?

2024-03-14 Thread Sreyan Chakravarty
I am trying to understand Spark Architecture. For Dataframes that are created from python objects, i.e. that are *created in memory, where are they stored?* Take the following example: from pyspark.sql import Row import datetime courses = [ { 'course_id': 1, 'course_title':

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-12 Thread Mich Talebzadeh
the `get` method which is used for dictionaries. > > But weirdly, for query.lastProgress and query.recentProgress, they should > return StreamingQueryProgress but instead they returned a dict. So the > `get` method works there. > > I think PySpark should improve on this part. > > Mich Talebza

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread 刘唯
, for query.lastProgress and query.recentProgress, they should return StreamingQueryProgress but instead they returned a dict. So the `get` method works there. I think PySpark should improve on this part. Mich Talebzadeh wrote on Mon, Mar 11, 2024 at 05:51: > Hi, > > Thank you for your advice > > This is t

Data ingestion into elastic failing using pyspark

2024-03-11 Thread Karthick Nk
Hi @all, I am using a pyspark program to write the data into an elastic index by using an upsert operation (sample code snippet below). def writeDataToES(final_df): write_options = { "es.nodes": elastic_host, "es.net.ssl": "false", "es.nodes.wan.only"

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-11 Thread Mich Talebzadeh
Hi, Thank you for your advice This is the amended code def onQueryProgress(self, event): print("onQueryProgress") # Access micro-batch data microbatch_data = event.progress #print("microbatch_data received") # Check if data is received
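
A minimal, self-contained sketch of that listener pattern (not the exact code from this thread); the dict fallback reflects the behaviour discussed here, where progress objects can surface as plain dicts depending on the PySpark version:

    from pyspark.sql import SparkSession
    from pyspark.sql.streaming import StreamingQueryListener

    spark = SparkSession.builder.appName("ProgressDemo").getOrCreate()

    class ProgressListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            pass

        def onQueryProgress(self, event):
            progress = event.progress
            # StreamingQueryProgress exposes camelCase attributes; a plain dict needs .get
            if isinstance(progress, dict):
                rate = progress.get("processedRowsPerSecond")
            else:
                rate = progress.processedRowsPerSecond
            print(f"processedRowsPerSecond: {rate}")

        def onQueryTerminated(self, event):
            pass

    spark.streams.addListener(ProgressListener())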

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
*now -> not 刘唯 wrote on Sun, Mar 10, 2024 at 22:04: > Have you tried using microbatch_data.get("processedRowsPerSecond")? > Camel case now snake case > > Mich Talebzadeh wrote on Sun, Mar 10, 2024 at 11:46: > >> >> There is a paper from Databricks on this subject >> >> >>

Re: Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread 刘唯
Have you tried using microbatch_data.get("processedRowsPerSecond")? Camel case now snake case Mich Talebzadeh wrote on Sun, Mar 10, 2024 at 11:46: > > There is a paper from Databricks on this subject > > > https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html > > But

Bug in How to Monitor Streaming Queries in PySpark

2024-03-10 Thread Mich Talebzadeh
There is a paper from Databricks on this subject https://www.databricks.com/blog/2022/05/27/how-to-monitor-streaming-queries-in-pyspark.html But having tested it, there seems to be a bug there that I reported to Databricks forum as well (in answer to a user question) I have come to a conclusion

Re: Creating remote tables using PySpark

2024-03-08 Thread Mich Talebzadeh
leOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:334) >> at >> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:404) >> at >> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:377) >> at >> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48) >> at >> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192) >> >> My assumption is that its trying to look on my local machine for >> /data/hive/warehouse and failing because on the remote box I can see those >> folders. >> >> So the question is, if you're not backing it with hadoop or something do >> you have to mount the drive in the same place on the computer running the >> pyspark? Or am I missing a config option somewhere? >> >> Thanks! >> >

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay that was some caching issue. Now there is a shared mount point between the place the pyspark code is executed and the spark nodes it runs. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my assumpt

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
duceCommitProtocol.scala:192) > > My assumption is that its trying to look on my local machine for > /data/hive/warehouse and failing because on the remote box I can see those > folders. > > So the question is, if you're not backing it with hadoop or something do > you have to mount the drive in the same place on the computer running the > pyspark? Or am I missing a config option somewhere? > > Thanks! >

Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
uceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:192) My assumption is that its trying to look on my local machine for /data/hive/warehouse and failing because on the remote box I can see those folders. So the question is, if you're not backing it with hadoop or something do you have to mount the drive in the same place on the computer running the pyspark? Or am I missing a config option somewhere? Thanks!

Working with a text file that is both compressed by bz2 followed by zip in PySpark

2024-03-04 Thread Mich Talebzadeh
I have downloaded Amazon reviews for sentiment analysis from here. The file is not particularly large (just over 500MB) but comes in the following format: test.ft.txt.bz2.zip So it is a text file that is compressed by bz2 followed by zip. Now I would like to do all these operations in PySpark
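
One way to sketch this (an assumption on my part, not taken from the thread, and with placeholder paths): unpack the outer zip with Python first, since Spark has no built-in zip codec, then let Spark read the inner .bz2 directly, because bzip2-compressed text is handled natively:

    import zipfile
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Bz2ZipRead").getOrCreate()

    # Spark cannot read zip archives directly, so unpack the outer layer locally
    with zipfile.ZipFile("/tmp/test.ft.txt.bz2.zip") as zf:
        zf.extractall("/tmp/amazon_reviews")   # leaves test.ft.txt.bz2 behind

    # bzip2 is a codec Spark/Hadoop understands out of the box for text sources
    df = spark.read.text("/tmp/amazon_reviews/test.ft.txt.bz2")
    df.show(5, truncate=False)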

Re: pyspark dataframe join with two different data type

2024-02-29 Thread Mich Talebzadeh
This is what you want, how to join two DFs with a string column in one and an array of strings in the other, keeping only rows where the string is present in the array. from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.functions import expr spark =
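
A minimal sketch of that join (column names are illustrative: df_1 carries the array column "data", df_2 the string column "column1"); the expr-based condition keeps only rows whose string appears inside the array:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    spark = SparkSession.builder.appName("ArrayJoin").getOrCreate()

    df_1 = spark.createDataFrame([(["a", "b"],), (["d"],)], ["data"])
    df_2 = spark.createDataFrame([("a",), ("b",), ("d",)], ["column1"])

    # Inner join where the string column is contained in the array column
    result = df_2.join(df_1, expr("array_contains(data, column1)"), "inner")
    result.show(truncate=False)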

pyspark dataframe join with two different data type

2024-02-29 Thread Karthick Nk
Hi All, I have two dataframes with the below structure, and I have to join them - the scenario is that the join column is a string in one dataframe and an array of strings in the other df, so we have to inner join the two DFs and get the data if the string value is present in any of the arrays of strings

Best option to process single kafka stream in parallel: PySpark Vs Dask

2024-01-11 Thread lab22
I am creating a setup to process packets from a single kafka topic in parallel. For example, I have 3 containers (let's take 4 cores) on one vm, and from 1 kafka topic stream I create 10 jobs depending on packet source. These packets have a small workload. 1. I can install dask in each

Re: Pyspark UDF as a data source for streaming

2024-01-08 Thread Mich Talebzadeh
ingestion and analytics. My use case revolves around a scenario where data is generated through REST API requests in real time with Pyspark. The Flask REST API efficiently captures and processes this data, saving it to a sink of your choice like a data warehouse or kafka. HTH Mich Talebzadeh, Dad
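
For the Spark side of that pipeline, a minimal sketch of consuming the Kafka topic as a streaming DataFrame (broker address and topic name are placeholders, and the spark-sql-kafka connector package must be on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("KafkaSource").getOrCreate()

    stream_df = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "api_events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Kafka delivers key/value as binary; cast before any downstream parsing
    query = (
        stream_df.selectExpr("CAST(value AS STRING) AS json_payload")
        .writeStream.format("console")
        .outputMode("append")
        .start()
    )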

Re: Pyspark UDF as a data source for streaming

2023-12-29 Thread Mich Talebzadeh
c 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович > wrote: > >> Yes, it's actual data. >> >> >> >> Best regards, >> >> Stanislav Porotikov >> >> >> >> *From:* Mich Talebzadeh >> *Sent:* Wednesday, December 27, 2023

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
Hi Stanislav, On the Pyspark DF can you run the following df.printSchema() and send the output please HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

RE: Pyspark UDF as a data source for streaming

2023-12-28 Thread Поротиков Станислав Вячеславович
Ok. Thank you very much! Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Thursday, December 28, 2023 5:14 PM To: Hyukjin Kwon Cc: Поротиков Станислав Вячеславович ; user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming You can work around this issue

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
танислав Вячеславович > wrote: > >> Yes, it's actual data. >> >> >> >> Best regards, >> >> Stanislav Porotikov >> >> >> >> *From:* Mich Talebzadeh >> *Sent:* Wednesday, December 27, 2023 9:43 PM >> *Cc:* user@spark.a

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
rotikov > > > > *From:* Mich Talebzadeh > *Sent:* Wednesday, December 27, 2023 9:43 PM > *Cc:* user@spark.apache.org > *Subject:* Re: Pyspark UDF as a data source for streaming > > > > Is this generated data actual data or you are testing the application? > >

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Yes, it's actual data. Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 9:43 PM Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Is this generated data actual data or you are testing the application? Sounds like a form

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 6:17 PM To: Поротиков Станислав Вячеславович Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Ok so you want to generate some random data and load it into Kafka on a regular interval and the rest

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
otikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 6:17 PM To: Поротиков Станислав Вячеславович Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Ok so you want to generate some random data and load it into Kafka on a regular interval and the rest

Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
liable for any monetary damages arising from such loss, damage or destruction. On Wed, 27 Dec 2023 at 12:16, Поротиков Станислав Вячеславович wrote: > Hello! > > Is it possible to write pyspark UDF, generated data to streaming dataframe? > > I want to get some data from REST API

Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Hello! Is it possible to write a pyspark UDF that feeds generated data into a streaming dataframe? I want to get some data from REST API requests in real time and I am considering saving this data to a dataframe, and then putting it to Kafka. I can't figure out how to create a streaming dataframe from generated data. I am new

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-11 Thread Михаил Кулаков
Hey Enrico it does help to understand it, thanks for explaining. Regarding this comment > PySpark and Scala should behave identically here Is it ok that Scala and PySpark optimization works differently in this case? On Tue, Dec 5, 2023 at 20:08, Enrico Minack wrote: > Hi M

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-05 Thread Enrico Minack
Hi Michail, with spark.conf.set("spark.sql.planChangeLog.level", "WARN") you can see how Spark optimizes the query plan. In PySpark, the plan is optimized into Project ...   +- CollectMetrics 2, [count(1) AS count(1)#200L]   +- LocalTableScan , [col1#125, col2#126L,
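
A minimal sketch of the observe pattern under discussion, together with the config Enrico mentions for logging the optimizer's plan rewrites (names and values are illustrative):

    from pyspark.sql import SparkSession, Observation
    from pyspark.sql.functions import count, lit

    spark = SparkSession.builder.appName("ObserveDemo").getOrCreate()
    spark.conf.set("spark.sql.planChangeLog.level", "WARN")  # log each plan change the optimizer applies

    obs = Observation("row_count")
    df = spark.range(10).observe(obs, count(lit(1)).alias("count(1)"))

    df.collect()       # observation metrics are only populated once an action runs
    print(obs.get)     # e.g. {'count(1)': 10}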

Re: [PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-04 Thread Enrico Minack
")) df2.show() +----+----+----+----+ |col1|col2|col3|col4| +----+----+----+----+ +----+----+----+----+ o1.get Map[String,Any] = Map(count(1) -> 2) o2.get Map[String,Any] = Map(count(1) -> 0) Pyspark and Scala should behave identically here. I will investigate. Cheers, Enrico Am

[PySpark][Spark Dataframe][Observation] Why empty dataframe join doesn't let you get metrics from observation?

2023-12-02 Thread Михаил Кулаков
Hey folks, I am actively using the observe method on my spark jobs and noticed interesting behavior: Here is an example of working and non-working code: https://gist.github.com/Coola4kov/8aeeb05abd39794f8362a3cf1c66519c In a few words, if I'm joining a dataframe after some filter rules and it became

How to configure authentication from a pySpark client to a Spark Connect server ?

2023-11-05 Thread Xiaolong Wang
Hi, Our company is currently introducing the Spark Connect server to production. Most of the issues have been solved, yet I don't know how to configure authentication from a pySpark client to the Spark Connect server. I noticed that there are some interceptor configs on the Scala client side

Re: Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Mich Talebzadeh
regexp: invalid escape sequence: '\m'* > > I tracked it down to *site-packages/pyspark/ml/util.py* line 578 > > metadataPath = os.path.join(path,"metadata") > > which seems innocuous but what's happening is because I'm on Windows, > os.path.join is appending double bac

Parser error when running PySpark on Windows connecting to GCS

2023-11-04 Thread Richard Smith
to /site-packages/pyspark/ml/util.py/ line 578 metadataPath = os.path.join(path,"metadata") which seems innocuous but what's happening is because I'm on Windows, os.path.join is appending double backslash, whilst the gcs path uses forward slashes like Linux. I hacked the code to expl

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Mich Talebzadeh
The fact that you have 60 partitions or brokers in kaka is not directly correlated to Spark Structured Streaming (SSS) executors by itself. See below. Spark starts with 200 partitions. However, by default, Spark/PySpark creates partitions that are equal to the number of CPU cores in the node
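
A minimal sketch of the knobs involved (values are illustrative, not recommendations): the default shuffle parallelism versus an explicit repartition, with coalesce as the shuffle-free alternative also raised in this thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("RepartitionTuning").getOrCreate()

    # Shuffle parallelism defaults to 200 unless overridden
    spark.conf.set("spark.sql.shuffle.partitions", "64")

    df = spark.range(1_000_000)

    # repartition(N) forces a full shuffle to exactly N partitions;
    # coalesce(N) only merges existing partitions and avoids a shuffle
    print(df.repartition(32).rdd.getNumPartitions())   # 32
    print(df.coalesce(8).rdd.getNumPartitions())       # 8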

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-05 Thread Perez
You can try the 'optimize' command of delta lake. That will help you for sure. It merges small files. Also, it depends on the file format. If you are working with Parquet then still small files should not cause any issues. P. On Thu, Oct 5, 2023 at 10:55 AM Shao Yang Hong wrote: > Hi

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi Raghavendra, Yes, we are trying to reduce the number of files in delta as well (the small file problem [0][1]). We already have a scheduled app to compact files, but the number of files is still large, at 14K files per day. [0]:

[PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi all on user@spark: We are looking for advice and suggestions on how to tune the .repartition() parameter. We are using Spark Streaming on our data pipeline to consume messages and persist them to a Delta Lake (https://delta.io/learn/getting-started/). We read messages from a Kafka topic,

Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, What is the purpose for which you want to use repartition() .. to reduce the number of files in delta? Also note that there is an alternative option of using coalesce() instead of repartition(). -- Raghavendra On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote: > Hi all on user@spark: >

using facebook Prophet + pyspark for forecasting - Dataframe has less than 2 non-NaN rows

2023-09-29 Thread karan alang
Hello - Anyone used Prophet + pyspark for forecasting ? I'm trying to backfill forecasts, and running into issues (error - Dataframe has less than 2 non-NaN rows) I'm removing all records with NaN values, yet getting this error. details are in stackoverflow link -> https://stackoverflow.

[PySpark][Spark logs] Is it possible to dynamically customize Spark logs?

2023-09-25 Thread Ayman Rekik
Hello, What would be the right way, if any, to inject a runtime variable into Spark logs? So that, for example, if Spark (driver/worker) logs some info/warning/error message, the variable will be output there (in order to help filter logs for the sake of monitoring and troubleshooting).

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
; > On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong > wrote: > >> Hi, >> >> Are there any plans to upload PySpark 3.5.0 to PyPI ( >> https://pypi.org/project/pyspark/)? It's still 3.4.1. >> >> Thanks, >> Kezhi >> >> >>

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( > https://pypi

PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Hi, Are there any plans to upload PySpark 3.5.0 to PyPI ( https://pypi.org/project/pyspark/)? It's still 3.4.1. Thanks, Kezhi

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Sean Owen
nd all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destructio

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-20 Thread Mich Talebzadeh
al content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Tue, 19 Sept 2023 at 21:50, Sean Owen wrote: > Pyspark follows SQL databases here. stddev is stddev_samp, and sample > standard deviation is

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Mich Talebzadeh
Hi Helen, Assuming you want to calculate stddev_samp, Spark correctly points STDDEV to STDDEV_SAMP. In below replace sales with your table name and AMOUNT_SOLD with the column you want to do the calculation SELECT

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Bjørn Jørgensen
df.select(stddev_samp("value").alias("sample_stddev")).show() +-----------------+ |    sample_stddev| +-----------------+ |5.320025062597606| +-----------------+ In MS Excel 365 Norwegian [image: image.png] =STDAVVIKA(B1:B10) =STDAV.S(B1:B10) They both print 5,32002506 Whi

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample standard deviation is the calculation with the Bessel correction, n-1 in the denominator. stddev_pop is simply standard deviation, with n in the denominator. On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote: > Hi! > &
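
A minimal sketch contrasting the two estimators described above (values are illustrative); stddev is an alias for stddev_samp (n-1 denominator, matching Excel's STDEV.S), while stddev_pop divides by n (matching STDEV.P):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import stddev, stddev_samp, stddev_pop

    spark = SparkSession.builder.appName("StddevDemo").getOrCreate()
    df = spark.createDataFrame([(float(v),) for v in range(1, 11)], ["value"])

    df.select(
        stddev("value").alias("stddev"),            # alias for stddev_samp
        stddev_samp("value").alias("stddev_samp"),  # Bessel-corrected sample estimate
        stddev_pop("value").alias("stddev_pop"),    # population form
    ).show()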

Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Helene Bøe
Hi! I am applying the stddev function (so actually stddev_samp), however when comparing with the sample standard deviation in Excel the results do not match. I cannot find in your documentation any more specifics on how the sample standard deviation is calculated, so I cannot compare the

Managing python modules in docker for PySpark?

2023-08-16 Thread Mich Talebzadeh
Hi, This is a bit of an old hat but worth getting opinions on it. Current options that I believe apply are: 1. Installing them individually via pip in the docker build process 2. Installing them together via pip in the build process via requirements.txt 3. Installing them to a

Re: [PySpark][UDF][PickleException]

2023-08-10 Thread Bjørn Jørgensen
I pasted your text to ChatGPT and this is what I got back: Your problem arises due to how Apache Spark serializes Python objects to be used in Spark tasks. When a User-Defined Function (UDF) is defined, Spark uses Python's `pickle` library to serialize the Python function and any required objects
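
A minimal sketch of the usual workaround (the padding logic and names are illustrative, not the original UDF): do the work in NumPy if convenient, but convert the result back to plain Python types before returning, so it can be pickled against the declared return schema:

    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, DoubleType

    spark = SparkSession.builder.appName("PadUDF").getOrCreate()

    @udf(returnType=ArrayType(ArrayType(DoubleType())))
    def pad_2d(rows):
        arr = np.array(rows, dtype=float)
        padded = np.zeros((arr.shape[0], 4))       # pad each row out to length 4
        padded[:, : arr.shape[1]] = arr
        # Hand plain Python floats back to Spark, not numpy.float64 values
        return [[float(x) for x in row] for row in padded]

    df = spark.createDataFrame([([[1.0, 2.0], [3.0, 4.0]],)], ["matrix"])
    df.select(pad_2d("matrix").alias("padded")).show(truncate=False)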

[PySpark][UDF][PickleException]

2023-08-10 Thread Sanket Sharma
Hi, I've been trying to debug a Spark UDF for a couple of days now but I can't seem to figure out what is going on. The UDF essentially pads a 2D array to a certain fixed length. When the code uses NumPy, it fails with a PickleException. When I rewrite it using plain python, it works like a charm:

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
igdata/appcache/application_1691548913900_0002/container_1691548913900_0002_01_01/pyspark.zip/pyspark/context.py:350: RuntimeWarning: Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path: If I use HDFS file: spark-submit --maste

Re: [PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread Mich Talebzadeh
er --py-files > /tmp/app-submodules.zip app.py > > The YARN application ran successfully, but have a warning log message: > > /opt/hadoop-tmp-dir/nm-local-dir/usercache/bigdata/appcache/application_1691548913900_0002/container_1691548913900_0002_01_01/pyspark.zip/pyspark/context.py:350:

[PySpark] Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path:

2023-08-09 Thread lnxpgn
-local-dir/usercache/bigdata/appcache/application_1691548913900_0002/container_1691548913900_0002_01_01/pyspark.zip/pyspark/context.py:350: RuntimeWarning: Failed to add file [file:///tmp/app-submodules.zip] specified in 'spark.submit.pyFiles' to Python path: If I use HDFS file: spark-submit

Re: PySpark error java.lang.IllegalArgumentException

2023-07-10 Thread elango vaidyanathan
;>> >>> >>> Hi all, >>> >>> I am reading a parquet file like this and it gives >>> java.lang.IllegalArgumentException. >>> However i can work with other parquet files (such as nyc taxi parquet >>> files) without any issue.

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Brian Huynh
and it gives java.lang.IllegalArgumentException. However i can work with other parquet files (such as nyc taxi parquet files) without any issue. I have copied the full error log as well. Can you please check once and let me know how to fix this? import pyspark from pyspark.sql import SparkS

Re: PySpark error java.lang.IllegalArgumentException

2023-07-07 Thread Khalid Mammadov
;> files) without any issue. I have copied the full error log as well. Can you >> please check once and let me know how to fix this? >> >> import pyspark >> >> from pyspark.sql import SparkSession >> >> spark=SparkSession.builder.appName(&q

Re: PySpark error java.lang.IllegalArgumentException

2023-07-05 Thread elango vaidyanathan
taxi parquet > files) without any issue. I have copied the full error log as well. Can you > please check once and let me know how to fix this? > > import pyspark > > from pyspark.sql import SparkSession > > spark=SparkSession.builder.appName("testPyspark").config("

PySpark error java.lang.IllegalArgumentException

2023-07-03 Thread elango vaidyanathan
this? import pyspark from pyspark.sql import SparkSession spark=SparkSession.builder.appName("testPyspark").config("spark.executor.memory", "20g").config("spark.driver.memory", "50g").getOrCreate() df=spark.read.parquet("/data/202301/account_cycle

[PySpark] Intermittent Spark session initialization error on M1 Mac

2023-06-27 Thread BeoumSuk Kim
Hi, When I launch pyspark CLI on my M1 Macbook (standalone mode), I intermittently get the following error and the Spark session doesn't get initialized. 7~8 times out of 10, it doesn't have the issue, but it intermittently fails. And, this occurs only when I specify `spark.jars.packages` option

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
i() > or > p_df = DF.to_koalas() > > > > https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html > > Then you will have your pyspark df as a pandas API on Spark df. > > On Tue, 20 Jun 2023 at 22:16, Mich Talebzadeh < > m

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
Then you will have your pyspark df as a pandas API on Spark df. On Tue, 20 Jun 2023 at 22:16, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > OK thanks > > So the issue seems to be creating a Panda DF from Spark DF (I do it for > plotting with something like > > import

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
nsen >> wrote: >> >>> Pandas API on spark is an API so that users can use spark as they use >>> pandas. This was known as koalas. >>> >>> Is this limitation still valid for Pandas? >>> For pandas, yes. But what I did show was pandas API on spar

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
> pandas. This was known as koalas. >> >> Is this limitation still valid for Pandas? >> For pandas, yes. But what I did show was pandas API on spark so it's spark. >> >> Additionally when we convert from Panda DF to Spark DF, what process is >> involved under the

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
> involved under the bonnet? > I guess pyarrow and drop the index column. > > Have a look at > https://github.com/apache/spark/tree/master/python/pyspark/pandas > > On Tue, 20 Jun 2023 at 19:05, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> Whenev

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
is involved under the bonnet? I guess pyarrow and drop the index column. Have a look at https://github.com/apache/spark/tree/master/python/pyspark/pandas On Tue, 20 Jun 2023 at 19:05, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > Whenever someone mentions Pandas I automatica

Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
damage or destruction. On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen wrote: > This is pandas API on spark > > from pyspark import pandas as ps > df = ps.read_excel("testexcel.xlsx") > [image: image.png] > this will convert it to pyspark > [image: image.png] > > t

Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
This is pandas API on spark from pyspark import pandas as ps df = ps.read_excel("testexcel.xlsx") [image: image.png] this will convert it to pyspark [image: image.png] On Tue, 20 Jun 2023 at 13:42, John Paul Jayme wrote: > Good day, > > > > I have a task to read excel
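
A minimal sketch of that approach (the file name is a placeholder, and the openpyxl package has to be available for ps.read_excel to parse .xlsx):

    from pyspark import pandas as ps

    psdf = ps.read_excel("testexcel.xlsx")   # pandas-on-Spark DataFrame
    sdf = psdf.to_spark()                    # plain PySpark DataFrame from here on

    sdf.printSchema()
    sdf.show(5)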

Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession. See the link you cite. It is part of the pyspark pandas API On Tue, Jun 20, 2023, 5:42 AM John Paul Jayme wrote: > Good day, > > > > I have a task to read excel files in databricks but I cannot seem to > proceed. I am referencing

How to read excel file in PySpark

2023-06-20 Thread John Paul Jayme
Good day, I have a task to read excel files in databricks but I cannot seem to proceed. I am referencing the API documents - read_excel , but there is an error sparksession object has

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
s can be created from a DataFrame >>> (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView). >>> >>> When I want a *permanent* Spark View I need to specify it via Spark SQL >>> (CREATE VIEW AS SELECT ...). >>> >>> Sometimes it is easier

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread Mich Talebzadeh
t via Spark SQL >> (CREATE VIEW AS SELECT ...). >> >> Sometimes it is easier to specify the desired logic of the View through >> Spark/PySpark DataFrame API. >> Therefore, I'd like to suggest to implement a new PySpark method that >> allows creating

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-04 Thread keen
ws can be created from a DataFrame > (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView). > > When I want a *permanent* Spark View I need to specify it via Spark SQL > (CREATE VIEW AS SELECT ...). > > Sometimes it is easier to specify the desired logic of the View through &

[Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-01 Thread keen
logic of the View through Spark/PySpark DataFrame API. Therefore, I'd like to suggest to implement a new PySpark method that allows creating a *permanent* Spark View from a DataFrame (df.createOrReplaceView). see also: https://community.databricks.com/s/question/0D53f1PANVgCAP/is-there-a-way
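
For contrast, a minimal sketch of what exists today (table and view names are illustrative, and a persistent metastore is assumed for the permanent view): temp and global-temp views come from the DataFrame API, while a permanent view still has to be expressed as SQL over catalog objects:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ViewDemo").enableHiveSupport().getOrCreate()

    df = spark.table("sales")

    df.createOrReplaceTempView("sales_tmp")           # session-scoped
    df.createOrReplaceGlobalTempView("sales_gtmp")    # shared via the global_temp database

    # A permanent view must reference catalog tables (not temp views) and is SQL-only
    spark.sql("CREATE OR REPLACE VIEW sales_recent AS SELECT * FROM sales WHERE year >= 2023")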

cannot load model using pyspark

2023-05-23 Thread second_co...@yahoo.com.INVALID
path exist. ---Py4JJavaError Traceback (most recent call last)Cell In[16], line 1> 1 spark.sparkContext.textFile("s3a://a)bucket/models/random_forest_zepp/bestModel/metadata", 1).getNumPartitions()File /spark/python/lib/pyspark.zip/pyspark/rdd.py:599, in RDD.getNumP

Pyspark cluster mode on standalone deployment

2023-05-14 Thread خالد القحطاني
Hi, Can I deploy my Pyspark application on a standalone cluster with cluster mode? I believe it was not possible to do that, but I searched all the documentation and I did not find it. My Spark standalone cluster version is 3.3.1

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-06 Thread Mich Talebzadeh
that case. Perhaps, I'll trigger > a Lambda to rename/combine the files after PySpark writes them. > > Cheers, > Marco. > > On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh > wrote: > >> you can try >> >> df2.coalesce(1).write.mode("overwrite").json("

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich, Thank you. Ah, I want to avoid bringing all data to the driver node. That is my understanding of what will happen in that case. Perhaps, I'll trigger a Lambda to rename/combine the files after PySpark writes them. Cheers, Marco. On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh wrote

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
ing on. Perhaps the Spark > 'part' files should not be thought of as files, but rather pieces of a > conceptual file. If that is true, then your approach (of which I'm well > aware) makes sense. Question: what are some good methods, tools, for > combining the parts into a single, well-named file? I

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
sense. Question: what are some good methods, tools, for combining the parts into a single, well-named file? I imagine that is outside of the scope of PySpark, but any advice is welcome. Thank you, Marco. On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh wrote: > AWS S3, or Google gs are had

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
AWS S3, or Google gs are hadoop compatible file systems (HCFS) , so they do sharding to improve read performance when writing to HCFS file systems. Let us take your code for a drive import findspark findspark.init() from pyspark.sql import SparkSession from pyspark.sql.functions import struct
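
A minimal sketch of the pattern this thread converges on (paths and names are placeholders): coalesce to a single partition, write, then rename the lone part file yourself, since Spark does not let you choose the output file name; on S3/GCS the rename would be done with the object-store API (or a Lambda) instead of os.rename:

    import glob
    import os
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SingleFileWrite").getOrCreate()
    df2 = spark.range(5)

    out_dir = "/tmp/orders_json"
    df2.coalesce(1).write.mode("overwrite").json(out_dir)

    # Locate the single part file and give it a friendly name (local filesystem only)
    part_file = glob.glob(os.path.join(out_dir, "part-*.json"))[0]
    os.rename(part_file, os.path.join(out_dir, "orders.json"))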
