Is df.write.mode(...).parquet("hdfs://..") also an action? Checking the doc 
shows that my Spark job doesn't call any of the listed actions directly, but 
the saveAsXXX functions look similar to the 
df.write.mode("overwrite").parquet("hdfs://path/to/parquet-file") call my 
Spark job makes. So I am wondering whether that is the reason my Spark job's 
driver consumes such an amount of memory.

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#actions
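
For reference, here is roughly the call my job makes (a minimal sketch; the 
DataFrame name and path are placeholders):

    import org.apache.spark.sql.SaveMode

    // mode() accepts SaveMode.Overwrite or the string "overwrite";
    // parquet() triggers a job, much like an action on an RDD would.
    df.write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs://namenode:8020/path/to/parquet-file")  // placeholder path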

My Spark job's driver program consumes too much memory, so I want to avoid 
that by writing the data to HDFS on the executor side, instead of having that 
data sent back to the driver program (and written to HDFS from there). Our 
worker servers have more memory than the machine that runs the driver 
program, so if the executors can write to HDFS directly, the driver memory 
for my Spark job can be reduced.
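
If I understand correctly, a partitioned JDBC read combined with a direct 
write should keep the data on the executors end to end: each executor pulls 
its own slice of the table and writes its own partitions to HDFS. A minimal 
sketch, where the connection details, column names, and bounds are all 
made-up placeholders:

    import spark.implicits._  // for the $"..." column syntax

    // Hypothetical connection details; adjust for your database.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb")  // placeholder
      .option("dbtable", "my_table")                         // placeholder
      .option("user", "user")
      .option("password", "password")
      // Partitioned read: Spark issues one bounded query per partition,
      // so each executor fetches its slice of the table independently.
      .option("partitionColumn", "id")  // must be a numeric column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "16")
      .load()

    // No collect(): rows go from the executors straight to HDFS.
    df.filter($"id" > 0)
      .write
      .mode("overwrite")
      .parquet("hdfs://namenode:8020/path/to/output")  // placeholder path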

Alternatively, does Spark support streaming reads from a database (i.e. Spark 
Streaming + Spark SQL)?

Thanks for your reply.



------- Original Message -------
On 7 September 2018 4:15 PM, Apostolos N. Papadopoulos <papad...@csd.auth.gr> 
wrote:

> Dear James,
>
> -   Check the Spark documentation to see which actions return a lot of
>     data back to the driver. One of these is collect(); take(x) and
>     reduce() are actions as well.
>
>     Before executing collect(), find out the size of your RDD/DF.
>
> -   I cannot understand the phrase "hdfs directly from the executor". You
>     can specify an hdfs file as your input, and you can also use hdfs to
>     store your output.
>
>     regards,
>
>     Apostolos
>
>     On 07/09/2018 05:04 PM, James Starks wrote:
>
>
> > I have a Spark job that reads data from a database. By increasing the
> > submit parameter '--driver-memory 25g' the job works without a problem
> > locally, but not in the prod env because the prod master does not have
> > enough capacity.
> > So I have a few questions:
> > -  Which functions, such as collect(), would cause the data to be sent
> > back to the driver program?
> >   My job so far merely uses `as`, `filter`, and `map`.
> >
> > -   Is it possible to write data (in parquet format, for instance) to
> >     hdfs directly from the executor? If so, how can I do that (any code
> >     snippet, doc to reference, or keyword to search for, since I can't
> >     find anything with e.g. `spark direct executor hdfs write`)?
> >
> >
> > Thanks
>
> --
>
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol


