Whenever someone mentions Pandas, I automatically think of it as an Excel
sheet for Python.

OK, my point below needs some qualification.

Why Spark here? Generally, a parallel architecture comes into play when the
data is too large to be handled on a single machine, and that is when the
use of Spark becomes meaningful. In cases where the (generated) data size is
going to be very large (often the norm rather than the exception these
days), the data cannot be processed and stored in Pandas DataFrames, because
those DataFrames hold their data in RAM. Nor can the whole dataset be
collected from storage such as HDFS or cloud storage, because that would
take significant time and space and would probably not fit in a single
machine's RAM (in this case, the driver's memory).
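
To illustrate the contrast roughly (a minimal sketch; the Parquet path and
the column name are only placeholders):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pandas materialises the whole file in local (driver) memory.
pdf = pd.read_parquet("/data/events")

# Spark reads lazily and keeps the data partitioned across the executors;
# only the small aggregated result comes back to the driver.
sdf = spark.read.parquet("/data/events")
sdf.groupBy("event_type").count().show()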

Is this limitation still valid for Pandas? Additionally, when we convert
from a Pandas DataFrame to a Spark DataFrame, what process is involved under
the bonnet?
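
For context, my rough mental model of that conversion as a minimal sketch
(plain PySpark; the Arrow setting and the column names here are
illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With Arrow enabled, the pandas -> Spark conversion ships columnar
# batches instead of pickling the data row by row.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# The pandas data lives on the driver; createDataFrame serialises it and
# distributes the resulting partitions to the executors.
sdf = spark.createDataFrame(pdf)
sdf.show()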

Thanks

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom



On Tue, 20 Jun 2023 at 13:07, Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

> This is the pandas API on Spark:
>
> from pyspark import pandas as ps
> df = ps.read_excel("testexcel.xlsx")
> This will convert it to a PySpark DataFrame:
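> (A minimal sketch of that step, using the df created above; to_spark()
> returns a regular PySpark DataFrame.)
>
> sdf = df.to_spark()
> sdf.printSchema()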
>
> On Tue, 20 Jun 2023 at 13:42, John Paul Jayme
> <john.ja...@tdcx.com.invalid> wrote:
>
>> Good day,
>>
>>
>>
>> I have a task to read Excel files in Databricks but I cannot seem to
>> proceed. I am referencing the API documentation - read_excel
>> <https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_excel.html>
>> - but I get the error "'SparkSession' object has no attribute
>> 'read_excel'". Can you advise?
>>
>>
>>
>> *JOHN PAUL JAYME*
>> Data Engineer
>>
>> m. +639055716384  w. www.tdcx.com
>>
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
