The standalone koalas project should have the same functionality for older
Spark versions:
https://koalas.readthedocs.io/en/latest/
You should be moving to Spark 3 though; 2.x is EOL.
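For reference, a minimal sketch of that route on a Spark 2.4.x cluster such as Glue, assuming the standalone `koalas` package is installed (`pip install koalas`); the path is hypothetical, and this is an illustration rather than a tested Glue job:

```python
# A minimal sketch, assuming the standalone koalas package
# (`pip install koalas`) on Spark 2.4.x; pyspark.pandas itself
# only shipped with Spark 3.2.
import databricks.koalas as ks

kdf = ks.read_excel("s3://my-bucket/report.xlsx")  # hypothetical path
sdf = kdf.to_spark()  # hand back a plain Spark DataFrame
```

On Glue this would be supplied as an extra Python library for the job, not as a jar.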
On Wed, Feb 23, 2022 at 9:06 AM Sid wrote:
Cool. Here, the problem is that I have to run the Spark jobs on Glue ETL, which
supports Spark 2.4.3, and I don't think this distributed support was added for
pandas in that version. AFAIK, it was added in version 3.2. So how can I do it
in Spark 2.4.3? Correct me if I'm wrong.
You will. The pandas API on Spark, imported with `from pyspark import
pandas as ps`, is not pandas but an API that uses PySpark under the hood.
On Wed, Feb 23, 2022 at 3:54 PM Sid wrote:
This isn't pandas; it's pandas on Spark. It's distributed.
On Wed, Feb 23, 2022 at 8:55 AM Sid wrote:
Hi Bjørn,
Thanks for your reply. This doesn't help when loading huge datasets; we won't
be able to get Spark's functionality of loading the file in a distributed
manner.
Thanks,
Sid
On Wed, Feb 23, 2022 at 7:38 PM Bjørn Jørgensen wrote:
from pyspark import pandas as ps

ps.read_excel?  # IPython help; the docstring notes:
# "Support both `xls` and `xlsx` file extensions from a local filesystem or URL"

pdf = ps.read_excel("file")
df = pdf.to_spark()  # convert back to a regular Spark DataFrame
On Wed, Feb 23, 2022 at 2:57 PM Sid wrote:
Hi Gourav,
Thanks for your time.
I am worried about the distribution of data in the case of a huge dataset file.
Is Koalas still a better option to go ahead with? If yes, how can I use it
with Glue ETL jobs? Do I have to pass some kind of external jars for it?
Thanks,
Sid
Hi,
this looks like a very specific problem, exact in its scope.
Do you think you can load the data into a pandas dataframe and load it
back to Spark using a pandas UDF?
Koalas is now natively integrated with Spark; try to see if you can use
those features.
Regards,
Gourav
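A minimal sketch of the pandas route suggested above, wrapped in a helper; `load_excel_to_spark` is a hypothetical name, and the pandas read happens entirely on the driver, so this only suits files that fit in driver memory:

```python
import pandas as pd

def load_excel_to_spark(spark, path):
    """Read an Excel file driver-side with pandas, then distribute it."""
    pdf = pd.read_excel(path)          # needs an engine such as openpyxl
    return spark.createDataFrame(pdf)  # hand the rows to Spark as a DataFrame
```

The trade-off is exactly the one Sid raises: the load itself is not distributed, only what happens after `createDataFrame`.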
On Wed, Feb 23,
I have an Excel file which unfortunately cannot be converted to CSV format,
and I am trying to load it using the pyspark shell.
I tried invoking the below pyspark session with the jars provided.
pyspark --jars
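The command above cuts off before the jar coordinates, so the exact connector is not named in the thread. One commonly used option is the crealytics spark-excel connector (an assumption on my part); a sketch assuming that package and the `spark` session the pyspark shell provides:

```python
# A sketch assuming the com.crealytics spark-excel connector (not
# named in the truncated message above), e.g. a shell started with:
#   pyspark --packages com.crealytics:spark-excel_2.11:<version>
# `spark` is the SparkSession created by the pyspark shell.
df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")
      .load("file.xlsx"))  # hypothetical file name
df.show()
```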