Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
Hello, I'm working on an ETL that transforms CSV files describing file systems into Parquet, so I can work on them easily to extract information. I'm using Mr. Powers' framework Daria to do so. I have quite varied inputs and a lot of transformations, and the framework helps organize the code.

Re: Identify bottleneck

2019-12-18 Thread Enrico Minack
How many withColumn statements do you have? Note that it is better to use a single select rather than lots of withColumn calls. This also makes drops redundant. Reading 25m CSV lines and writing to Parquet in 5 minutes on 32 cores is really slow. Can you try this on a single machine, i.e. run with "
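For illustration, a minimal Scala sketch of the difference Enrico describes (the column names are hypothetical, not from the original thread): each withColumn adds another projection to the plan, while a single select lists only the wanted columns, making the final drop redundant.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("etl").getOrCreate()
val df = spark.read.option("header", "true").csv("/path/to/input.csv")

// Many withColumn calls: each one adds another projection to the plan.
val viaWithColumn = df
  .withColumn("size_mb", col("size_bytes") / (1024 * 1024))
  .withColumn("owner_lc", lower(col("owner")))
  .drop("size_bytes", "owner")

// Single select: one projection, and the drop is redundant because
// only the wanted columns are listed.
val viaSelect = df.select(
  (col("size_bytes") / (1024 * 1024)).as("size_mb"),
  lower(col("owner")).as("owner_lc")
)
```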

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
There are 15 withColumn statements and one drop at the end to remove the old columns. I wish I could write it as a single SQL statement, but that's not reasonable for maintainability. I will try on a local instance and let you know. Thanks for the help.
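One way to keep 15 transformations maintainable without 15 withColumn calls is to keep each expression as a named Column value and pass them all to one select. A sketch under the same hypothetical column names as above (df is the input DataFrame):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Each transformation stays a separate, readable, individually testable value.
val sizeMb: Column  = (col("size_bytes") / (1024 * 1024)).as("size_mb")
val ownerLc: Column = lower(col("owner")).as("owner_lc")
val depth: Column   = (size(split(col("path"), "/")) - 1).as("depth")

// One projection replaces the chain of withColumn calls plus the final drop.
val result = df.select(sizeMb, ownerLc, depth)
```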

Re: Identify bottleneck

2019-12-18 Thread Chris Teoh
Please look at the Spark UI and confirm you are indeed getting more than 1 partition in your dataframe. Text files are usually not splittable, so you may just be doing all the work in a single partition. If that is the case, it may be worthwhile considering calling the repartition method to distrib
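A quick way to verify Chris's suspicion from code rather than the UI (the partition count below is an arbitrary example value):

```scala
// Check how many partitions the DataFrame actually has.
println(s"partitions = ${df.rdd.getNumPartitions}")

// If everything landed in a single partition, spread the work across the
// cluster before the expensive transformations and the Parquet write.
val distributed = df.repartition(64)  // 64 is an arbitrary example value
```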

Re: Identify bottleneck

2019-12-18 Thread Enrico Minack
Good points, but single-line CSV files are splittable (not multi-line CSV, though), especially at the mentioned size. And bz2 is also splittable, though reading speed is much slower than uncompressed CSV. If your csv.bz2 files are not splittable, then repartitioning does not improve the situation
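The single-line vs multi-line distinction Enrico draws maps to Spark's CSV multiLine reader option; a sketch (paths hypothetical):

```scala
// Default: each record is one physical line, so Spark can split the file
// across tasks at line boundaries.
val splittable = spark.read.option("header", "true").csv("/data/files.csv")

// With multiLine, quoted fields may contain newlines, so Spark must read
// each file as a whole: one file yields at most one partition.
val notSplittable = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("/data/files.csv")
```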

Understanding life cycle of RpcEndpoint: CoarseGrainedExecutorBackend

2019-12-18 Thread S
I am trying to understand the lifecycle of an RpcEndpoint. Here is my understanding: after negotiating containers from the ClusterManager, the master starts the CoarseGrainedExecutorBackend on the worker, which connects back to the CoarseGrainedSchedulerBackend's DriverEndpoint, which sends requests
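For reference, the endpoint contract being asked about looks roughly like this; it is Spark's internal private[spark] API, paraphrased here with stubbed types so the sketch stands alone, and its documented life cycle is constructor -> onStart -> receive* -> onStop:

```scala
// Sketch of Spark's internal RpcEndpoint contract (not a public API).
trait RpcEnv
trait RpcCallContext { def reply(response: Any): Unit }

trait RpcEndpoint {
  val rpcEnv: RpcEnv

  // Invoked before the endpoint handles any messages.
  def onStart(): Unit = {}

  // Handles one-way messages (fire-and-forget sends).
  def receive: PartialFunction[Any, Unit] = PartialFunction.empty

  // Handles messages that expect a reply (ask pattern).
  def receiveAndReply(ctx: RpcCallContext): PartialFunction[Any, Unit] =
    PartialFunction.empty

  // Invoked once the endpoint is stopped; clean up resources here.
  def onStop(): Unit = {}
}
```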

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
I can confirm that the job is able to use multiple cores on multiple nodes at the same time and that I have several tasks running at the same time. Depending on my CSV, it takes from 5 partitions up to several hundred partitions. Regarding the job running locally on one node: it took more than 20 minutes, ans

Re: Identify bottleneck

2019-12-18 Thread Antoine DUBOIS
Also, the framework allows executing all the modifications at the same time as one big request (but I won't paste it here; it would not really be relevant
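For context, spark-daria encourages writing each step as a DataFrame => DataFrame function composed with Dataset.transform; because evaluation is lazy, Catalyst compiles the whole chain into a single query plan anyway. A sketch with hypothetical steps:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical custom transformations in the spark-daria style.
def withSizeMb()(df: DataFrame): DataFrame =
  df.withColumn("size_mb", col("size_bytes") / (1024 * 1024))

def withOwnerLc()(df: DataFrame): DataFrame =
  df.withColumn("owner_lc", lower(col("owner")))

// The chain stays readable step by step, and lazy evaluation means
// Spark still optimizes it as one query.
val result = df.transform(withSizeMb()).transform(withOwnerLc())
```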

How to submit a jar from Remote Server

2019-12-18 Thread Praveen Kumar Ramachandran
I'm running a Spark server at 192.172.12.100:7070 (standalone Spark) and a REST service at 192.168.50.121:8080 (Java). I'm supposed to execute the spark-submit shell script under SPARK_HOME located on the Spark server from the REST service. Could you suggest any solution (now I am using Jenkins to e
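One option that avoids shelling out to spark-submit by hand is Spark's public SparkLauncher API; a minimal sketch, assuming the REST service host has a Spark distribution available (paths and class names are hypothetical):

```scala
import org.apache.spark.launcher.SparkLauncher

val handle = new SparkLauncher()
  .setSparkHome("/opt/spark")               // SPARK_HOME on the submitting host
  .setMaster("spark://192.172.12.100:7077") // standalone master URL (7077 is the default)
  .setAppResource("/jobs/my-etl.jar")       // hypothetical application jar
  .setMainClass("com.example.MyEtl")        // hypothetical main class
  .startApplication()                       // returns a SparkAppHandle for monitoring
```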

Re: How to submit a jar from Remote Server

2019-12-18 Thread tianlangstudio
What about SparkJobServer https://github.com/spark-jobserver/spark-jobserver and Apache Livy https://github.com/apache/incubator-livy ?
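For completeness, Livy's documented batches endpoint takes a JSON payload and runs spark-submit on the Livy server itself, so the REST service needs no local Spark installation. A minimal sketch (host, jar, and class are hypothetical; 8998 is Livy's default port):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Submit a batch job via Livy's POST /batches API.
val payload =
  """{"file": "/jobs/my-etl.jar", "className": "com.example.MyEtl"}"""

val conn = new URL("http://livy-host:8998/batches")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
println(s"Livy responded: ${conn.getResponseCode}")  // 201 Created on success
```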