Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK, thanks for the info.

Regards,

Mich Talebzadeh

Unsubscribe

2023-06-20 Thread Bhargava Sukkala
-- 
Thanks,
Bhargava Sukkala.
Cell no: 216-278-1066
MS in Business Analytics,
Arizona State University.


Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
Yes, p_df = DF.toPandas() is THE pandas, the one you know.

Change p_df = DF.toPandas() to

p_df = DF.pandas_on_spark()
or
p_df = DF.to_pandas_on_spark()
or
p_df = DF.pandas_api()
or
p_df = DF.to_koalas()

https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html

Then you will have converted your PySpark DF to the pandas API on Spark.
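
A minimal sketch of the contrast (assuming an existing SparkSession named spark; of the methods listed above, pandas_api() is the name in recent Spark releases, the others being older aliases):

import matplotlib.pyplot as plt

DF = spark.range(100)          # toy PySpark DataFrame

# toPandas() collects every row to the driver as plain pandas
p_df = DF.toPandas()
p_df.plot()                    # plain pandas plotting via matplotlib
plt.show()

# pandas_api() exposes the same data through the pandas API on Spark --
# it stays distributed across the executors
ps_df = DF.pandas_api()
print(ps_df.head())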


Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
OK, thanks.

So the issue seems to be creating a pandas DF from a Spark DF (I do it for plotting, with something like):

import matplotlib.pyplot as plt
p_df = DF.toPandas()
p_df.plot()  # pandas DataFrames expose .plot(), not .plt()
plt.show()

I guess that stays in the driver.


Mich Talebzadeh





Re: Shuffle data on pods which get decommissioned

2023-06-20 Thread Mich Talebzadeh
If one executor fails, Spark moves the processing over to another executor. However, if the shuffle data is lost, it re-executes the processing that generated that data, and might have to go back to the source. Does this mean that only the tasks the dead executor was running at the time need to be rerun? If I am correct, Spark uses RDD lineage to figure out what needs to be re-executed. Remember we are talking about executor failure here, not node failure.

I don't know the details of how Spark determines which tasks to rerun, but my guess is that in a multi-stage job it might have to rerun earlier stages as well. For example, if you have done a groupBy, you will have two stages. After the first stage, the data is shuffled by hashing the groupBy key, so that data for the same key value lands in the same partition. Now, if one of those partitions is lost during execution of the second stage, I am guessing Spark will have to go back and re-execute all the tasks in the first stage.
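
A sketch of that two-stage shape (assuming a SparkSession named spark):

from pyspark.sql import functions as F

df = spark.range(1000000).withColumn("key", F.col("id") % 10)
# The groupBy forces a shuffle: stage 1 writes map outputs hashed by "key",
# stage 2 reads them so that each key's rows land in one partition
counts = df.groupBy("key").count()
counts.show()
# If an executor holding shuffle output is lost, Spark uses lineage to
# recompute the missing partitions, rerunning stage-1 tasks as needed.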

HTH

Mich Talebzadeh






Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
No, a pandas on Spark DF is distributed.



Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Thanks, but if you create a Spark DF from a pandas DF, that Spark DF is not distributed and remains on the driver. I recall we had this conversation a while back. I don't think anything has changed.

Happy to be corrected.

Mich Talebzadeh






Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
The pandas API on Spark is an API that lets users work with Spark as they work with pandas. It was previously known as Koalas.

> Is this limitation still valid for Pandas?
For pandas, yes. But what I showed was the pandas API on Spark, so it is Spark.

> Additionally when we convert from Panda DF to Spark DF, what process is involved under the bonnet?
I guess PyArrow, plus dropping the index column.

Have a look at
https://github.com/apache/spark/tree/master/python/pyspark/pandas
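
A sketch of that conversion path (the Arrow transfer is opt-in via a Spark config flag; without it, the conversion falls back to slower row-wise serialization):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Opt in to Arrow-accelerated pandas <-> Spark transfers
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"x": [1, 2, 3]})
sdf = spark.createDataFrame(pdf)   # note: the pandas index is not carried over
sdf.show()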


-- 
Bjørn Jørgensen


Shuffle data on pods which get decommissioned

2023-06-20 Thread Nikhil Goyal
Hi folks,
When running Spark on K8s, what would happen to shuffle data if an executor
is terminated or lost? Since there is no shuffle service, does all the work
done by that executor get recomputed?

Thanks
Nikhil


Re: How to read excel file in PySpark

2023-06-20 Thread Mich Talebzadeh
Whenever someone mentions pandas I automatically think of it as an Excel sheet for Python.

OK, my point below needs some qualification.

Why Spark here? Generally, a parallel architecture comes into play when the data size is significantly large and cannot be handled on a single machine; that is when the use of Spark becomes meaningful. In cases where the (generated) data size is going to be very large (the norm rather than the exception these days), the data cannot be processed and stored in pandas data frames, because these data frames hold everything in RAM. The whole dataset then cannot be collected from storage such as HDFS or cloud storage, because that would take significant time and space and probably would not fit in a single machine's RAM (in this case, the driver memory).
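
For illustration (hypothetical paths; plain pandas materialises the whole file in one machine's RAM, while the pandas API on Spark keeps partitions on the executors):

import pandas as pd
from pyspark import pandas as ps

# Plain pandas: everything loads into local RAM
pdf = pd.read_parquet("/data/small_sample.parquet")

# pandas API on Spark: same call shape, but the data stays distributed
psdf = ps.read_parquet("s3a://bucket/very_large_table/")
print(psdf.shape)   # computed distributedly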

Is this limitation still valid for pandas? Additionally, when we convert
from a pandas DF to a Spark DF, what process is involved under the bonnet?

Thanks

Mich Talebzadeh






Re: How to read excel file in PySpark

2023-06-20 Thread Bjørn Jørgensen
This is the pandas API on Spark:

from pyspark import pandas as ps
df = ps.read_excel("testexcel.xlsx")

This will convert it to PySpark.
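
A short end-to-end sketch of that approach (using the file name from the thread, and noting that pandas needs an Excel engine such as openpyxl installed to read .xlsx):

from pyspark import pandas as ps

# Read the sheet into a distributed pandas-on-Spark DataFrame
psdf = ps.read_excel("testexcel.xlsx")

# Convert to a plain PySpark DataFrame when the Spark SQL API is needed
sdf = psdf.to_spark()
sdf.printSchema()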


-- 
Bjørn Jørgensen


Re: How to read excel file in PySpark

2023-06-20 Thread Sean Owen
It is indeed not part of SparkSession; see the link you cite. It is part of
the PySpark pandas API.
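
A one-line illustration of the distinction (hypothetical file name):

from pyspark import pandas as ps

# spark.read_excel("sales.xlsx")   # AttributeError: SparkSession has no read_excel
df = ps.read_excel("sales.xlsx")   # read_excel lives in the pandas API on Spark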



How to read excel file in PySpark

2023-06-20 Thread John Paul Jayme
Good day,

I have a task to read Excel files in Databricks but I cannot seem to proceed. I am referencing the API documents - read_excel - but there is an error: 'SparkSession' object has no attribute 'read_excel'. Can you advise?

JOHN PAUL JAYME
Data Engineer
m. +639055716384  w. www.tdcx.com