Jacek,
It turns out this was the RPC connection from the driver to the master (port
7077) being closed: Istio was killing it because of an idle-timeout setting
someone had left at one hour.
I was able to reproduce this by running lsof on the driver for port 7077
and then killing that process.
I think that submitting the Spark job on behalf of user01 will solve the
problem.
You may also try setting the sticky bit on the /data/user01/rdd folder if you
want to allow multiple users to write to /data/user01/rdd at the same time,
but I would not recommend allowing multiple users to write to the same
directory.
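For reference, the sticky-bit idea above can be sketched on a local filesystem with Python's stdlib (a minimal sketch using a temp directory as a stand-in for the shared folder; on HDFS the equivalent would be `hdfs dfs -chmod 1777 /data/user01/rdd`):

```python
import os
import stat
import tempfile

# Stand-in for a shared directory like /data/user01/rdd.
shared = tempfile.mkdtemp()

# rwx for everyone plus the sticky bit: anyone may create files,
# but only a file's owner (or root) may delete or rename them.
os.chmod(shared, 0o1777)

mode = os.stat(shared).st_mode
print(bool(mode & stat.S_ISVTX))  # → True
```

This is why the sticky bit is commonly used on world-writable directories such as /tmp.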
Thank you for the first answer to my question.
Unfortunately, I have to make totally different tables,
and it is not possible to make only one table via UGI.
---
Below is the sample code I wrote (the truncated line is completed here as a
sketch; the run() body is a placeholder):
org.apache.hadoop.security.UserGroupInformation.createRemoteUser("user01")
  .doAs(new java.security.PrivilegedExceptionAction[Unit] {
    override def run(): Unit = { /* create the table as user01 */ }
  })
Spark is overkill for this problem; use sklearn.
But I'd suspect that you are using just one partition for such a small data
set, and getting no parallelism from Spark.
You can repartition your input into many more partitions, but it's unlikely
to get much faster than in-core sklearn for this task.
Hi Team,
We are trying to use the ML Gradient Boosting Tree classification algorithm
and found that its training performance is very poor.
We would like to see whether we can improve the training time, since it is
taking 2 days to train for a
Assuming that all tables have the same schema, you can make one global
table partitioned by some column, then apply specific UGO permissions/ACLs
per partition subdirectory.
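The per-partition-permissions idea can be sketched on a local filesystem like this (hypothetical table and partition names; on HDFS you would instead run something like `hdfs dfs -setfacl -m user:user01:rwx <partition dir>` per subdirectory):

```python
import os
import stat
import tempfile

# One table root, one subdirectory per partition, each with its own mode.
table_root = tempfile.mkdtemp()

partitions = {
    "owner=user01": 0o700,  # private to its owner
    "owner=user02": 0o750,  # owner + group readable
}

for part, mode in partitions.items():
    path = os.path.join(table_root, part)
    os.makedirs(path)
    os.chmod(path, mode)

for part in sorted(partitions):
    mode = stat.S_IMODE(os.stat(os.path.join(table_root, part)).st_mode)
    print(part, oct(mode))
```

The point is that one partitioned table gives you a single schema and query path, while ownership boundaries are enforced at the directory level.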
> On 25 Mar 2021, at 15:13, Kwangsun Noh wrote:
Hi, Spark users.
Currently I have to make multiple tables in hdfs using spark api.
The tables need to be created by different users.
For example, table01 is owned by user01, table02 is owned by user02 like
below.
path | owner:group | permission
Pretty easy if you do it efficiently
gunzip --to-stdout csvfile.gz | bzip2 > csvfile.bz2
Just create a simple bash file to do it and print timings
cat convert_file.sh
#!/bin/bash
GZFILE="csvfile.gz"
FILE_NAME=$(basename "$GZFILE" .gz)
BZFILE="$FILE_NAME.bz2"
echo "$(date) === Started"
gunzip --to-stdout "$GZFILE" | bzip2 > "$BZFILE"
echo "$(date) === Finished"
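The same gz-to-bz2 conversion can also be sketched with Python's stdlib, streaming in chunks so the file never has to fit in memory (hypothetical file names, and a tiny generated input standing in for the real CSV):

```python
import bz2
import gzip
import shutil

# Create a small gzip input to convert (stand-in for the real csvfile.gz).
with gzip.open("csvfile.gz", "wt") as f:
    f.write("a,b,c\n1,2,3\n")

# Stream-decompress gzip and recompress as bzip2, chunk by chunk;
# equivalent to: gunzip --to-stdout csvfile.gz | bzip2 > csvfile.bz2
with gzip.open("csvfile.gz", "rb") as src, bz2.open("csvfile.bz2", "wb") as dst:
    shutil.copyfileobj(src, dst)

# Verify the round trip.
with bz2.open("csvfile.bz2", "rt") as f:
    print(f.read())  # → a,b,c / 1,2,3
```

bz2 is block-compressed, which is what makes the output splittable across Spark partitions, unlike gzip.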
Hi Mich,
Yes you are right. We were getting gz files and this is causing the issue. I
will be changing it to bzip or other splittable formats and try running it
again today.
Thanks,
Asmath
Sent from my iPhone
> On Mar 25, 2021, at 6:51 AM, Mich Talebzadeh
> wrote:
Hi Asmath,
Have you actually managed to run this with a single file? Because Spark (as
brought up a few times already) will pull the whole of the GZ file into a
single partition, and you can get an out-of-memory error.
HTH