Since you cannot create the DF directly, you may try to first create an
RDD of tuples from the file and then convert the RDD to a DF using the
toDF() transformation. Perhaps you can bypass the issue this way.
Another thing that I have seen in the example is that you are using ""
as an escape character.
Can you check if this may cause any issues?
Regards,
Apostolos
On 26/5/22 16:31, Sid wrote:
Thanks for opening the issue, Bjørn. However, could you help me address
the problem for now with some kind of alternative? I have actually been
stuck on this since yesterday.
Thanks,
Sid
On Thu, 26 May 2022, 18:48 Bjørn Jørgensen, <bjornjorgen...@gmail.com>
wrote:
Yes, it looks like a bug that we also have in pandas API on spark.
So I have opened a JIRA
<https://issues.apache.org/jira/browse/SPARK-39304> for this.
On Thu, 26 May 2022 at 11:09, Sid <flinkbyhe...@gmail.com> wrote:
Hello Everyone,
I have posted a question finally with the dataset and the
column names.
PFB link:
https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark
Thanks,
Sid
On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen
<bjornjorgen...@gmail.com> wrote:
Sid, dump one of your files.
https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
On Wed, 25 May 2022 at 23:04, Sid <flinkbyhe...@gmail.com> wrote:
I have 10 columns, but in the dataset I observed that some records have
11 columns of data (for the additional column it is marked as null).
How do I handle this?
Thanks,
Sid
On Thu, May 26, 2022 at 2:22 AM Sid
<flinkbyhe...@gmail.com> wrote:
How can I do that? Any examples or links, please.
So this works well with pandas, I suppose. It's just that I need to
convert back to the Spark DataFrame by providing a schema, but since we
are using a lower Spark version, and pandas won't work in a distributed
way in the lower versions, I was wondering if Spark could handle this
in a much better way.
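As an illustration of why pandas copes here: its parser honors quoted multi-line fields, after which the DataFrame could be handed back to Spark (the createDataFrame step is commented out below, since it needs a live SparkSession; the sample data is made up):

```python
import io
import pandas as pd

# A quoted field containing a newline is still parsed as one value.
raw = '1,"multi\nline field",3\n2,plain,4\n'
pdf = pd.read_csv(io.StringIO(raw), header=None)
print(pdf.shape)  # (2, 3)

# With a SparkSession available, the round-trip back to Spark would be:
# df = spark.createDataFrame(pdf)
```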
Thanks,
Sid
On Thu, May 26, 2022 at 2:19 AM Gavin Ray
<ray.gavi...@gmail.com> wrote:
Forgot to reply-all last message, whoops. Not
very good at email.
You need to normalize the CSV with a parser that can escape commas
inside of strings. Not sure if Spark has an option for this?
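For what it's worth, Python's standard csv module is one such parser; a double-quoted field may contain commas and even newlines and is still read as a single value (the sample string below is made up):

```python
import csv
import io

# Commas and newlines inside double quotes stay within one field.
raw = 'a,"hello, world",c\n1,"multi\nline",3\n'
rows = list(csv.reader(io.StringIO(raw)))
print(rows)  # [['a', 'hello, world', 'c'], ['1', 'multi\nline', '3']]
```

Spark's CSV reader does have quote, escape, and multiLine options, but they only help if the producer actually quoted the fields in the first place.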
On Wed, May 25, 2022 at 4:37 PM Sid
<flinkbyhe...@gmail.com> wrote:
Thank you so much for your time.
I have data like below, which I tried to load by setting multiple
options while reading the file, but I am not able to consolidate the
9th column's data within itself.
[image: image.png]
I tried the below code:
df = spark.read.option("header", "true") \
         .option("multiline", "true") \
         .option("inferSchema", "true") \
         .option("quote", '"') \
         .option("delimiter", ",") \
         .csv("path")
What else can I do?
Thanks,
Sid
On Thu, May 26, 2022 at 1:46 AM Apostolos
N. Papadopoulos <papad...@csd.auth.gr> wrote:
Dear Sid,
Can you please give us more info? Is it true that every line may have a
different number of columns? Is there any rule followed by every line
of the file? From the information you have sent I cannot fully
understand the "schema" of your data.
Regards,
Apostolos
On 25/5/22 23:06, Sid wrote:
> Hi Experts,
>
> I have the below CSV data that is getting generated automatically. I
> can't change the data manually.
>
> The data looks like below:
>
> 2020-12-12,abc,2000,,INR,
> 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
> 2020-12-09,fgh,,software_developer,I only manage the development part.
>
> Since I don't have much experience with the other domains.
>
> It is handled by the other people.,INR
> 2020-12-12,abc,2000,,USD,
>
> The third record is the problem, since the value contains newlines
> entered by the user while filling up the form. So, how do I handle
> this?
>
> There are 6 columns and 4 records in total. These are sample records.
>
> Should I load it as an RDD and then maybe eliminate the newlines
> using a regex? Or how should it be done? With ".\n"?
>
> Any suggestions?
>
> Thanks,
> Sid
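The RDD-plus-regex idea in the quoted message can be sketched in plain Python first: merge physical lines into logical records before parsing. The data is the sample from the thread; the date-prefix heuristic is an assumption about this dataset, not a general rule:

```python
import re

# Condensed copy of the sample records from the message above.
raw = (
    "2020-12-12,abc,2000,,INR,\n"
    "2020-12-09,cde,3000,he is a manager,DOLLARS,nothing\n"
    "2020-12-09,fgh,,software_developer,I only manage the development part.\n"
    "\n"
    "Since I don't have much experience with the other domains.\n"
    "\n"
    "It is handled by the other people.,INR\n"
    "2020-12-12,abc,2000,,USD,"
)

# Heuristic (an assumption): a new record always starts with a
# yyyy-mm-dd date; any other line continues the previous record.
date_start = re.compile(r"^\d{4}-\d{2}-\d{2},")
records = []
for line in raw.splitlines():
    if date_start.match(line):
        records.append(line)
    else:
        records[-1] += " " + line

print(len(records))  # 4
```

The same per-partition merge could then be applied to an RDD of lines before splitting on commas.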
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge
+47 480 94 297
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol