Hi Ye,
This is super helpful! It wasn't obvious to me from the documentation
that this property needed to be set in the yarn-site.xml file, as all other
configurations on the main Spark configuration page are set through Spark
conf. It was particularly confusing because this property, like m
Sid, dump one of your files.
https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
On Wed, 25 May 2022 at 23:04, Sid wrote:
I have 10 columns with me but in the dataset, I observed that some records
have 11 columns of data(for the additional column it is marked as null).
But, how do I handle this?
Thanks,
Sid
On Thu, May 26, 2022 at 2:22 AM Sid wrote:
How can I do that? Any examples or links, please. This works well with
pandas, I suppose. It's just that I would need to convert back to a Spark
data frame by providing a schema, but since we are using a lower Spark
version, and pandas won't work in a distributed way in the lower versions,
therefore,
Forgot to reply-all last message, whoops. Not very good at email.
You need to normalize the CSV with a parser that can escape commas inside
of strings. Not sure if Spark has an option for this?
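One way to do this outside Spark is a quick pre-pass with Python's csv module: rows whose free-text column contains unquoted commas end up with extra fields, and the overflow can be folded back into a single quoted field before handing the file to Spark. A minimal sketch, assuming a 6-column layout with the free-text column at index 3 (both are illustrative; adjust to the real schema):

```python
import csv
import io

EXPECTED_COLS = 6   # assumption: width of the real schema
TEXT_COL = 3        # assumption: index of the free-text column that swallows commas

def normalize(fields, expected=EXPECTED_COLS, text_col=TEXT_COL):
    """Fold overflow fields back into the free-text column."""
    if len(fields) <= expected:
        return fields
    extra = len(fields) - expected
    merged = ",".join(fields[text_col:text_col + extra + 1])
    return fields[:text_col] + [merged] + fields[text_col + extra + 1:]

raw = "2020-12-09,fgh,,manager, of dev,DOLLARS,nothing\n"
fields = next(csv.reader(io.StringIO(raw)))  # 7 fields due to the stray comma
fixed = normalize(fields)                    # back to 6 fields

# csv.writer quotes the merged field, so the comma survives a round trip
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_MINIMAL).writerow(fixed)
```

Once every row has the same width and the free-text field is quoted, Spark's CSV reader should parse it with the default `quote` option.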
On Wed, May 25, 2022 at 4:37 PM Sid wrote:
Thank you so much for your time.
I have data like below, which I tried to load by setting multiple options
while reading the file, but I am not able to consolidate the 9th-column
data within itself.
[image: image.png]
I tried the below code:
df = spark.read.option("header", "true").o
Dear Sid,
Can you please give us more info? Is it true that every line may have a
different number of columns? Is there any rule followed by
every line of the file? From the information you have sent I cannot
fully understand the "schema" of your data.
Regards,
Apostolos
On 25/5/22 23:06
Hi Experts,
I have the below CSV data, which is generated automatically. I can't
change the data manually.
The data looks like below:
2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.
Since I do
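The ragged-width problem in the sample above can be seen concretely by counting fields per line with Python's csv module (a sketch; the sample lines are copied from the data in the message):

```python
import csv
import io

sample = """\
2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.
"""

# Count fields per record: the lines disagree in width, which is why a
# fixed-schema CSV read shifts or drops columns on some rows.
widths = [len(row) for row in csv.reader(io.StringIO(sample))]
```

The first two lines parse to 6 fields (the trailing comma in line one yields an empty last field), while the third parses to 5, so the file cannot be read with one rigid schema without a normalization pass or a permissive parse mode.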
Hi,
After some heavy usage, the Spark thrift server is not releasing memory,
so I enabled GC logging. Here is the output from the GC log:
```
2022-05-25T06:34:50.577+: 581403.163: [Full GC (System.gc())
8322M->8224M(26312M), 10.8605086 secs]
[Eden: 104.0M(14784.0M)->0.0B(14784.0M) Survivors:
```
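Worth noting: the `(System.gc())` tag in that log line means this full GC was triggered by an explicit `System.gc()` call (RMI's distributed GC is a common source in long-running JVMs), not by heap pressure. One thing to experiment with (a sketch, not a verified fix for this server) is suppressing explicit GC via standard HotSpot flags passed through Spark's extra-Java-options config:

```
# e.g. in spark-defaults.conf for the process hosting the thrift server;
# the flag names are standard HotSpot options
spark.driver.extraJavaOptions=-XX:+DisableExplicitGC
# or, to keep explicit GC but make it run concurrently instead of stop-the-world:
# spark.driver.extraJavaOptions=-XX:+ExplicitGCInvokesConcurrent
```

Whether this helps depends on whether the memory growth is real leakage or just heap the JVM has not returned to the OS.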
To me, it seems like the data being processed on the 2 systems is not
identical. Can't think of any other reason why the single task stage will
get a different number of input records in the two cases. 700 GB of input
to a single task is not good, and seems to be the bottleneck.
On Wed, 25 May 2022,