Hello
I'm working on an ETL that reads CSV files describing file systems and transforms them into Parquet, so I can work on them more easily to extract information.
I'm using Mr. Powers' Daria framework (spark-daria) to do so. I have quite different inputs and a lot of transformations, and the framework helps organize the code.
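For context, this is roughly the transformation-chaining style the framework encourages. A minimal sketch using plain Spark's Dataset.transform, which spark-daria builds on; the function, column names and rawDf here are made up:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical custom transformations over a file-system listing.
def withSizeMb(df: DataFrame): DataFrame =
  df.withColumn("size_mb", col("size_bytes") / (1024 * 1024))

def withExtension(df: DataFrame): DataFrame =
  df.withColumn("ext", lower(regexp_extract(col("path"), "\\.([^./]+)$", 1)))

// Chaining keeps each step small and individually testable.
val result = rawDf.transform(withSizeMb).transform(withExtension)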
How many withColumn statements do you have? Note that it is better to
use a single select, rather than lots of withColumn. This also makes
drops redundant.
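For illustration, a handful of withColumn calls plus a trailing drop can usually be collapsed into one select. A sketch with made-up column names, assuming df is the input DataFrame:

import org.apache.spark.sql.functions._

// Before: each withColumn adds another projection over the previous plan,
// and drop is needed to get rid of the raw column afterwards.
val viaWithColumn = df
  .withColumn("size_mb", col("size_bytes") / (1024 * 1024))
  .withColumn("ext", lower(regexp_extract(col("path"), "\\.([^./]+)$", 1)))
  .drop("size_bytes")

// After: a single projection; only the listed columns survive, so no drop.
val viaSelect = df.select(
  col("path"),
  (col("size_bytes") / (1024 * 1024)).as("size_mb"),
  lower(regexp_extract(col("path"), "\\.([^./]+)$", 1)).as("ext")
)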
Reading 25m CSV lines and writing to Parquet in 5 minutes on 32 cores is really slow. Can you try this on a single machine, i.e. run with […]
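For example, assuming a local master is what was meant, something along these lines:

import org.apache.spark.sql.SparkSession

// Assumption: local[32] runs everything in one JVM with 32 threads,
// mirroring the 32 cores mentioned above.
val spark = SparkSession.builder()
  .master("local[32]")
  .appName("fs-csv-to-parquet-local-test")
  .getOrCreate()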
There are 15 withColumn statements and one drop at the end to remove the old columns.
I wish I could write it as a single SQL statement, but that's not reasonable for maintenance purposes.
I will try it on a local instance and let you know.
Thanks for the help.
De: "Enrico Minack"
À: user@spark.apach
Please look at the Spark UI and confirm you are indeed getting more than 1 partition in your dataframe. Text files are usually not splittable, so you may just be doing all the work in a single partition.
If that is the case, it may be worthwhile considering calling the repartition method to distribute the work across more partitions.
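For instance, a quick way to check and, if needed, spread the work (the path and partition count here are placeholders, and spark is an existing SparkSession):

// How many partitions did the CSV read produce?
val df = spark.read.option("header", "true").csv("data/fs-listing.csv")
println(df.rdd.getNumPartitions)

// If everything landed in one partition, spread it before the heavy
// transformations.
val distributed = if (df.rdd.getNumPartitions == 1) df.repartition(32) else df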
Good points, but single-line CSV files are splittable (not multi-line CSV, though), especially at the mentioned size. And bz2 is also splittable, though reading speed is much slower than for uncompressed CSV.
If your csv.bz2 files are not splittable, then repartitioning does not improve the situation.
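As a side note, whether Spark treats the CSV as line-delimited (and hence splittable) is controlled by the multiLine read option. A sketch, with a hypothetical path:

// Default (multiLine = false): records are single lines, so the file can
// be split across tasks. multiLine = true forces whole-file reads and
// loses splittability.
val listing = spark.read
  .option("header", "true")
  .option("multiLine", "false")
  .csv("data/fs-listing.csv")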
I am trying to understand the lifecycle of an RpcEndpoint.
Here is my understanding: after negotiating containers from the ClusterManager, the master starts the CoarseGrainedExecutorBackend on the worker, which connects back to the CoarseGrainedSchedulerBackend's DriverEndpoint, which sends requests […]
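For what it's worth, the scaladoc on org.apache.spark.rpc.RpcEndpoint summarizes the lifecycle as: constructor -> onStart -> receive* -> onStop. Schematically (illustration only, since RpcEndpoint is private[spark] and user code cannot actually implement it):

class MyEndpoint(override val rpcEnv: RpcEnv) extends RpcEndpoint {
  override def onStart(): Unit = ()  // called before any message is processed

  // One-way messages, sent via endpointRef.send(...)
  override def receive: PartialFunction[Any, Unit] = {
    case msg => println(s"got $msg")
  }

  // Request/response messages, sent via endpointRef.ask(...)
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case msg => context.reply(msg)
  }

  override def onStop(): Unit = ()  // called once the endpoint is stopped
}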
I can confirm that the job is able to use multiple cores on multiple nodes at the same time, and that I have several tasks running at the same time.
Depending on my CSV, it takes from 5 partitions up to several hundred.
Regarding the job running locally on one node, it took more than 20 minutes, and […]
Also, the framework allows executing all the modifications at the same time as one big request (but I won't paste it here, it would not be really relevant).
De: "Antoine DUBOIS"
À: "Enrico Minack"
Cc: "Chris Teoh" , "user @spark"
Envoyé: Mercredi 18 Décembre 2019 14:59:12
Objet: Re: Identi
I'm running a Spark server at 192.172.12.100:7070 (standalone Spark) and a REST service at 192.168.50.121:8080 (Java).
I'm supposed to execute the spark-submit shell script under SPARK_HOME, located on the Spark server, from the REST service. Could you suggest any solution? (Right now I am using Jenkins to execute it.)
What about SparkJobServer (https://github.com/spark-jobserver/spark-jobserver) or Apache Livy (https://github.com/apache/incubator-livy)?
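To sketch the Livy route: it exposes a REST endpoint (by default on port 8998) that your REST service can call instead of shelling out to spark-submit. A minimal illustration using the JDK 11 HTTP client; the host, jar path and class name are placeholders:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// POST /batches asks Livy to spark-submit the given jar on your behalf.
val body = """{"file": "hdfs:///jobs/my-etl.jar", "className": "com.example.MyEtl"}"""
val request = HttpRequest.newBuilder()
  .uri(URI.create("http://livy-host:8998/batches"))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(body))
  .build()
val response = HttpClient.newHttpClient()
  .send(request, HttpResponse.BodyHandlers.ofString())
println(response.body())  // JSON describing the created batch session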
TianlangStudio