Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sean Owen
The standalone koalas project should have the same functionality for older Spark versions: https://koalas.readthedocs.io/en/latest/ You should be moving to Spark 3 though; 2.x is EOL. On Wed, Feb 23, 2022 at 9:06 AM Sid wrote: > Cool. Here, the problem is I have to run the Spark jobs on Glue
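For the Spark 2.4.x case discussed here, a minimal sketch of the standalone Koalas route might look like the following. It assumes the `koalas` package and an Excel engine for pandas (for example `openpyxl`) are installed on the cluster; the helper name `excel_to_spark_koalas` is hypothetical, and how the package gets installed on Glue is not covered.

```python
def excel_to_spark_koalas(path, sheet_name=0):
    """Read one sheet of an Excel file with standalone Koalas (hypothetical helper)."""
    # Imported lazily so this module can be loaded where Koalas is not installed.
    import databricks.koalas as ks  # standalone Koalas package, for Spark < 3.2

    kdf = ks.read_excel(path, sheet_name=sheet_name)  # Koalas DataFrame
    return kdf.to_spark()  # plain pyspark.sql.DataFrame
```

Calling `excel_to_spark_koalas("file.xlsx")` would return a regular Spark DataFrame that the rest of the job can use.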

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Cool. The problem is that I have to run the Spark jobs on Glue ETL, which supports Spark 2.4.3, and I don't think this distributed support for pandas was added in that version. AFAIK, it was added in version 3.2. So how can I do it in Spark 2.4.3? Correct me if I'm wrong. On Wed, Feb

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Bjørn Jørgensen
You will. The pandas API on Spark, imported with `from pyspark import pandas as ps`, is not pandas but an API that uses PySpark under the hood. On Wed, 23 Feb 2022 at 15:54, Sid wrote: > Hi Bjørn, > > Thanks for your reply. This doesn't help while loading huge datasets. > Won't be able to achieve

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sean Owen
This isn't pandas, it's pandas on Spark. It's distributed. On Wed, Feb 23, 2022 at 8:55 AM Sid wrote: > Hi Bjørn, > > Thanks for your reply. This doesn't help while loading huge datasets. > Won't be able to achieve spark functionality while loading the file in > distributed manner. > > Thanks,

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Hi Bjørn, Thanks for your reply. This doesn't help when loading huge datasets. I won't be able to use Spark's functionality to load the file in a distributed manner. Thanks, Sid On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen wrote: > from pyspark import pandas as ps > > > ps.read_excel?

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Bjørn Jørgensen
from pyspark import pandas as ps
ps.read_excel?
"Support both `xls` and `xlsx` file extensions from a local filesystem or URL"
pdf = ps.read_excel("file")
df = pdf.to_spark()
On Wed, 23 Feb 2022 at 14:57, Sid wrote: > Hi Gourav, > > Thanks for your time. > > I am worried about the
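The two calls above can be wrapped into a small helper. This is only a sketch, assuming Spark 3.2 or later (where `pyspark.pandas` ships with Spark) and an Excel engine for pandas such as `openpyxl` on the cluster; the name `excel_to_spark` is hypothetical.

```python
def excel_to_spark(path, sheet_name=0):
    """Read one sheet of an Excel file and return it as a Spark DataFrame."""
    # Imported lazily so this module can be loaded where pyspark is not installed.
    from pyspark import pandas as ps  # pandas API on Spark (Spark 3.2+)

    psdf = ps.read_excel(path, sheet_name=sheet_name)  # pandas-on-Spark DataFrame
    return psdf.to_spark()  # plain pyspark.sql.DataFrame
```

For example, `df = excel_to_spark("file.xlsx")` followed by `df.printSchema()` would show the inferred columns.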

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
Hi Gourav, Thanks for your time. I am worried about the distribution of data in case of a huge dataset file. Is Koalas still a better option to go ahead with? If yes, how can I use it with Glue ETL jobs? Do I have to pass some kind of external jars for it? Thanks, Sid On Wed, Feb 23, 2022 at

Re: Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Gourav Sengupta
Hi, this looks like a very specific, narrowly scoped problem. Do you think you can load the data into a pandas DataFrame and load it back into Spark using a pandas UDF? Koalas is now natively integrated with Spark; see if you can use those features. Regards, Gourav On Wed, Feb 23,
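A simpler variant of this idea, sketched below, reads the file with plain pandas on the driver and hands the result to Spark with `createDataFrame`. Note that this is not a pandas UDF: the whole file has to fit in driver memory, so it only suits files of moderate size. The helper name `excel_to_spark_via_pandas` is hypothetical, and pandas plus an Excel engine such as `openpyxl` are assumed to be installed.

```python
def excel_to_spark_via_pandas(spark, path, sheet_name=0):
    """Read an Excel sheet on the driver with pandas, then convert it to Spark."""
    # Imported lazily so this module can be loaded where pandas is not installed.
    import pandas as pd

    pdf = pd.read_excel(path, sheet_name=sheet_name)  # in-memory pandas DataFrame
    return spark.createDataFrame(pdf)  # data is distributed only after this step
```

In the pyspark shell, where a session named `spark` already exists, this would be called as `excel_to_spark_via_pandas(spark, "file.xlsx")`.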

Loading .xlsx and .xlx files using pyspark

2022-02-23 Thread Sid
I have an Excel file which unfortunately cannot be converted to CSV format, and I am trying to load it using the pyspark shell. I tried invoking the pyspark session below with the jars provided. pyspark --jars