Re: Application Timeout

2021-03-25 Thread Brett Spark
Jacek, turns out that this was the RPC connection from the driver to the master (port 7077) being closed. Istio was closing it because of a silly one-hour idle timeout setting it had. I was able to re-create this by running lsof on the driver for port 7077 and then killing that process.

Re: Is it possible to use Multiple UGIs in One Spark Context?

2021-03-25 Thread Yuri Oleynikov (יורי אולייניקוב)
I think that submitting the Spark job on behalf of user01 will solve the problem. You may also try to set a sticky bit on the /data/user01/rdd folder if you want to allow multiple users to write to /data/user01/rdd at the same time, but I'd not recommend allowing multiple users to write to the same dir.
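For reference, a minimal sketch of the sticky-bit suggestion via the Hadoop FileSystem API (the path and the wide-open 1777 mode are assumptions taken from this thread, roughly equivalent to hdfs dfs -chmod 1777 /data/user01/rdd):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.fs.permission.{FsAction, FsPermission}

    val fs = FileSystem.get(new Configuration())
    // rwxrwxrwx plus the sticky bit: everyone may write, but only a file's owner may delete it
    fs.setPermission(new Path("/data/user01/rdd"),
      new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.ALL, true))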

Re: Is it possible to use Multiple UGIs in One Spark Context?

2021-03-25 Thread Kwangsun Noh
Thank you for the first answer to my question. Unfortunately, I have to make totally different tables, and it is not possible to make only one table via UGI. --- below is the sample code I wrote. org.apache.hadoop.security.UserGroupInformation.createRemoteUser("user01").doAs(new
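The snippet above is cut off by the digest; a minimal sketch of the full doAs pattern it appears to be using might look like the following (the table path and stand-in data are assumptions, not code from the original mail, and whether HDFS actually records user01 as the owner depends on the cluster's security configuration):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("multi-ugi-sketch").getOrCreate()
    val df = spark.range(10).toDF("id")   // stand-in data

    // Run the write as user01; repeat with createRemoteUser("user02") for table02, and so on.
    UserGroupInformation.createRemoteUser("user01").doAs(
      new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = df.write.parquet("/data/user01/table01")
      }
    )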

Re: FW: Email to Spark Org please

2021-03-25 Thread Sean Owen
Spark is overkill for this problem; use sklearn. But I'd suspect that you are using just 1 partition for such a small data set, and getting no parallelism from Spark. Repartition your input into many more partitions, but it's unlikely to get much faster than in-core sklearn for this task.
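As a rough illustration of the repartitioning advice (a spark-shell sketch; the input path and the partition count of 64 are placeholders, not from the original mail):

    val train = spark.read.option("header", "true").csv("/path/to/training.csv")
      .repartition(64)                       // spread the small data set across more tasks
    println(train.rdd.getNumPartitions)      // confirm you are no longer on a single partition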

FW: Email to Spark Org please

2021-03-25 Thread Williams, David (Risk Value Stream)
Classification: Public Hi Team, We are trying to utilize the ML Gradient Boosting Tree Classification algorithm and found that its performance is very poor during training. We would like to see if we can improve the performance timings, since it is taking 2 days for training for a
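For context, a minimal self-contained sketch of training a Spark ML GBT classifier (the tiny stand-in data set and the maxIter/maxDepth values are illustrative assumptions; those two parameters, along with how the input is partitioned, tend to dominate training time):

    import org.apache.spark.ml.classification.GBTClassifier
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("gbt-sketch").getOrCreate()
    import spark.implicits._

    // Tiny stand-in data; the real features and labels come from the poster's pipeline.
    val training = Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.2)),
      (1.0, Vectors.dense(2.2, 0.9))
    ).toDF("label", "features")

    val gbt = new GBTClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(20)    // number of boosting iterations
      .setMaxDepth(5)
    val model = gbt.fit(training)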

Re: Is it possible to use Multiple UGIs in One Spark Context?

2021-03-25 Thread Yuri Oleynikov (יורי אולייניקוב)
Assuming that all tables have the same schema, you can make one global table partitioned by some column, then apply specific UGO permissions/ACLs per partition subdirectory.
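A rough sketch of that idea (the owner column, base path, and 750 permission bits are assumptions; changing the actual HDFS owner of each subdirectory would additionally need a chown by a superuser):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.fs.permission.{FsAction, FsPermission}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("partitioned-table-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in data with an "owner" column used purely for partitioning.
    val df = Seq((1, "user01"), (2, "user02")).toDF("id", "owner")
    df.write.partitionBy("owner").parquet("/data/global_table")

    // Tighten each partition subdirectory, e.g. /data/global_table/owner=user01 -> 750
    val fs = FileSystem.get(new Configuration())
    fs.setPermission(new Path("/data/global_table/owner=user01"),
      new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE))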

Is it possible to use Multiple UGIs in One Spark Context?

2021-03-25 Thread Kwangsun Noh
Hi, Spark users. Currently I have to make multiple tables in HDFS using the Spark API. The tables need to be made by different users. For example, table01 is owned by user01 and table02 is owned by user02, like below. path | owner:group | permission

Re: Rdd - zip with index

2021-03-25 Thread Mich Talebzadeh
Pretty easy if you do it efficiently:

gunzip --to-stdout csvfile.gz | bzip2 > csvfile.bz2

Just create a simple bash file to do it and print timings:

cat convert_file.sh
#!/bin/bash
GZFILE="csvfile.gz"
FILE_NAME=`basename $GZFILE .gz`
BZFILE="$FILE_NAME.bz2"
echo `date` " ""=== Started

Re: Rdd - zip with index

2021-03-25 Thread KhajaAsmath Mohammed
Hi Mich, Yes you are right. We were getting gz files and this was causing the issue. I will change it to bzip2 or another splittable format and try running it again today. Thanks, Asmath

Re: Rdd - zip with index

2021-03-25 Thread Mich Talebzadeh
Hi Asmath, Have you actually managed to run this single file? Because Spark (as brought up a few times already) will pull the whole of the GZ file into a single partition, and you can get an out-of-memory error. HTH
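A quick spark-shell check of the splittability point above (file paths are placeholders):

    val gz  = spark.read.csv("/data/input/csvfile.gz")
    val bz2 = spark.read.csv("/data/input/csvfile.bz2")
    println(gz.rdd.getNumPartitions)    // always 1: gzip is not splittable
    println(bz2.rdd.getNumPartitions)   // > 1 for a large file: bzip2 is splittable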