Wouldn't these be separately submitted jobs for separate workloads? You can of course dynamically change each job submitted to have whatever packages you like, from whatever is orchestrating. A single job doing everything doesn't sound right.
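For illustration, a rough sketch of that pattern, assuming the orchestrator can shell out to spark-submit; the job script names and the Maven coordinate below are made-up placeholders, not anything from the issue:

```python
# Rough sketch: every task becomes its own spark-submit invocation, so each job
# gets exactly the packages it needs on a fresh JVM. Script names and the
# Delta coordinate are placeholders.
import subprocess

def run_job(script, packages=None):
    cmd = ["spark-submit"]
    if packages:
        # --packages takes a comma-separated list of Maven coordinates
        cmd += ["--packages", ",".join(packages)]
    cmd.append(script)
    subprocess.run(cmd, check=True)

run_job("plain_etl_job.py")
run_job("delta_job.py", packages=["io.delta:delta-core_2.12:1.1.0"])
```

The cost is a fresh driver JVM per job, but the classpath never has to change while a JVM is running.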
On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła <ravwojd...@gmail.com> wrote:

> Because I can't (and should not) know ahead of time which jobs will be
> executed, that's the job of the orchestration layer (and it can be dynamic).
> I know I can specify multiple packages. Also not worried about memory.
>
> On Thu, 10 Mar 2022 at 13:54, Artemis User <arte...@dtechspace.com> wrote:
>
>> If changing packages or jars isn't your concern, why not just specify ALL
>> packages that you would need for the Spark environment? You know you can
>> define multiple packages under the packages option. This shouldn't cause
>> memory issues since the JVM uses dynamic class loading...
>>
>> On 3/9/22 10:03 PM, Rafał Wojdyła wrote:
>>
>> Hi Artemis,
>> Thanks for your input, to answer your questions:
>>
>> > You may want to ask yourself why it is necessary to change the jar
>> > packages during runtime.
>>
>> I have a long running orchestrator process which executes multiple spark
>> jobs, currently on a single VM/driver; some of those jobs might
>> require extra packages/jars (please see the example in the issue).
>>
>> > Changing packages doesn't mean reloading the classes.
>>
>> AFAIU this is unrelated.
>>
>> > There is no way to reload the same class unless you customize the
>> > classloader of Spark.
>>
>> AFAIU this is an implementation detail.
>>
>> > I also don't think it is necessary to implement a warning or error
>> > message when changing the configuration since it doesn't do any harm
>>
>> To reiterate: right now the API allows changing the configuration of the
>> context without that configuration taking effect. See examples of confused
>> users here:
>> * https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
>> * https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1
>>
>> I'm curious if you have any opinion about the "hard-reset" workaround,
>> copy-pasting from the issue:
>>
>> ```
>> s: SparkSession = ...
>>
>> # Hard reset:
>> s.stop()
>> s._sc._gateway.shutdown()
>> s._sc._gateway.proc.stdin.close()
>> SparkContext._gateway = None
>> SparkContext._jvm = None
>> ```
>>
>> Cheers - Rafal
>>
>> On 2022/03/09 15:39:58 Artemis User wrote:
>> > This is indeed a JVM issue, not a Spark issue. You may want to ask
>> > yourself why it is necessary to change the jar packages during runtime.
>> > Changing packages doesn't mean reloading the classes. There is no way to
>> > reload the same class unless you customize the classloader of Spark. I
>> > also don't think it is necessary to implement a warning or error message
>> > when changing the configuration since it doesn't do any harm. Spark
>> > uses lazy binding so you can do a lot of such "unharmful" things.
>> > Developers will have to understand the behavior of each API before
>> > using it.
>> >
>> >
>> > On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
>> > > Sean,
>> > > I understand you might be sceptical about adding this functionality
>> > > into (py)spark, I'm curious:
>> > > * would an error/warning on updating configuration that currently
>> > > cannot take effect (requires a restart of the JVM) be reasonable?
>> > > * what do you think about the workaround in the issue?
>> > > Cheers - Rafal
>> > >
>> > > On Wed, 9 Mar 2022 at 14:24, Sean Owen <sr...@gmail.com> wrote:
>> > >
>> > > Unfortunately this opens a lot more questions and problems than it
>> > > solves. What if you take something off the classpath, for example?
>> > > Change a class?
>> > >
>> > > On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła <ra...@gmail.com> wrote:
>> > >
>> > > Thanks Sean,
>> > > To be clear, if you prefer to change the label on this issue
>> > > from bug to something else, feel free to do so, no strong opinions
>> > > on my end. What happens to the classpath, whether spark uses
>> > > some classloader magic, is probably an implementation detail.
>> > > That said, it's definitely not intuitive that you can change
>> > > the configuration and get the context (with the updated
>> > > config) without any warnings/errors. Also, what would you
>> > > recommend as a workaround or solution to this problem? Any
>> > > comments about the workaround in the issue? Keep in mind that
>> > > I can't restart the long running orchestration process (a Python
>> > > process, if that matters).
>> > > Cheers - Rafal
>> > >
>> > > On Wed, 9 Mar 2022 at 13:15, Sean Owen <sr...@gmail.com> wrote:
>> > >
>> > > That isn't a bug - you can't change the classpath once the
>> > > JVM is executing.
>> > >
>> > > On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła <ra...@gmail.com> wrote:
>> > >
>> > > Hi,
>> > > My use case is that I have a long running process
>> > > (orchestrator) with multiple tasks, some of which might
>> > > require extra spark dependencies. It seems once the
>> > > spark context is started it's not possible to update
>> > > `spark.jars.packages`? I have reported an issue at
>> > > https://issues.apache.org/jira/browse/SPARK-38438,
>> > > together with a workaround ("hard reset of the
>> > > cluster"). I wonder if anyone has a solution for this?
>> > > Cheers - Rafal
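For anyone skimming the thread, a minimal sketch of the behaviour Rafał describes, assuming a local Spark 3.x session; the Delta coordinate is only a hypothetical stand-in for "some extra package":

```python
# Minimal sketch of the issue: once a session exists, asking for a new static
# setting such as spark.jars.packages returns the existing session, and the
# package never makes it onto the classpath. The coordinate is a placeholder.
from pyspark.sql import SparkSession

s1 = SparkSession.builder.master("local[1]").getOrCreate()

s2 = (SparkSession.builder
      .config("spark.jars.packages", "io.delta:delta-core_2.12:1.1.0")
      .getOrCreate())

assert s1 is s2  # same session object; the new package never takes effect
```

Whether any warning is logged here depends on the Spark version, which is exactly the confusion raised above.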