OK, I see the confusion in terminology. However, what was suggested should still work. A Luigi worker in this case would function like a Spark client, responsible for submitting a Spark application (a "job" in Luigi's terms). In other words, you just define all the jars needed by any of your jobs in your SparkContext (or, to make things easier, define them in the spark-defaults.conf file or simply place them in Spark's jars directory). This should work in every case, especially when you don't know in advance which "job" (more precisely, application or task) needs which jars.
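For instance, a minimal sketch of that approach in PySpark (the package coordinates below are just placeholders for whatever your jobs actually need):

```python
from pyspark.sql import SparkSession

# One driver configured up front with every package any downstream job
# might need; spark.jars.packages accepts a comma-separated list.
# The same value could instead go into conf/spark-defaults.conf.
spark = (
    SparkSession.builder
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-avro_2.12:3.2.1,"
        "org.postgresql:postgresql:42.3.3",
    )
    .getOrCreate()
)
```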

For other questions unrelated to this discussion, I'd suggest starting a new thread to make things clear. Thanks!

On 3/11/22 1:09 PM, Rafał Wojdyła wrote:
I don't know why I don't see my last message in the thread here: https://lists.apache.org/thread/5wgdqp746nj4f6ovdl42rt82wc8ltkcn I also don't receive Artemis's messages in my mail; I can only see them in the thread web UI, which is very confusing. On top of that, when I click on "reply via your own email client" in the web UI, I get: Bad Request Error 400

Anyway, to answer your last comment, Artemis:

> I guess there are several misconceptions here:

There's no confusion on my side; all of that makes sense. When I said "worker" in that comment I meant the scheduler worker (not the Spark worker), which in the Spark realm would play the role of the client. Everything else you said is undoubtedly correct, but unrelated to the issue/problem at hand.

Sean, Artemis - I appreciate your feedback about the infra setup, but it's beside the point of this issue.

Let me describe a simpler setup/example with the same problem, say:
 1. I have a Jupyter notebook
 2. I use local (driver-only) Spark mode
 3. I start the driver, process some data, and store it in a pandas dataframe
 4. now say I want to add a package to the Spark driver (or increase the JVM memory, etc.)

There's currently no way to do step 4 without restarting the notebook process, which holds the "reference" to the Spark driver/JVM. If I restart the Jupyter notebook I would lose all the data in memory (e.g. the pandas data); of course I can save that data to disk first, but that's beside the point.
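To make step 4 concrete, here is a minimal sketch of what happens today (the avro package is just an arbitrary example, and the assert reflects the current behaviour as I understand it):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
pdf = spark.range(10).toPandas()  # data now lives in the notebook's Python process

# Step 4: try to add a package / bump driver memory on the fly.
# getOrCreate() hands back the existing session, and since the driver
# JVM is already running, neither setting can take effect.
spark2 = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.1")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)
assert spark2 is spark  # same session, same JVM, unchanged classpath/memory
```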

I understand you don't want to provide this functionality in Spark, nor warn users about changes to the Spark configuration that won't actually take effect - as a user I wish I could get at least a warning in that case, but I respect your decision. It seems like the workaround of shutting down the JVM works in this case; I would much appreciate your feedback about **that specific workaround**. Any reason not to use it?
Cheers - Rafal

On Thu, 10 Mar 2022 at 18:50, Rafał Wojdyła <ravwojd...@gmail.com> wrote:

    If you have a long-running Python orchestrator worker (e.g. a Luigi
    worker), and say it gets a DAG of A -> B -> C, and say the worker
    first creates a Spark driver for A (which doesn't need extra
    jars/packages), then it gets B, which is also a Spark job but
    needs an extra package; it won't be able to create a new Spark
    driver with extra packages since it's "not possible" to create a
    new driver JVM. I would argue it's the same scenario if you have
    multiple Spark jobs that need different amounts of memory or
    anything else that requires a JVM restart. Of course I can use the
    workaround to shut down the driver/JVM - do you have any feedback
    about that workaround (see my previous comment or the issue)?
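    To illustrate, a hypothetical Luigi-flavoured sketch of that DAG
    (task names and the extra package coordinates are made up):

    ```python
    import luigi
    from pyspark.sql import SparkSession

    class A(luigi.Task):
        def run(self):
            # First Spark job in this worker process: this call starts
            # the one and only driver JVM.
            spark = SparkSession.builder.getOrCreate()
            ...  # job A, no extra packages needed

    class B(luigi.Task):
        def requires(self):
            return A()

        def run(self):
            # Job B needs an extra package, but the worker already
            # holds a running JVM, so this config cannot take effect
            # without the hard-reset workaround discussed in the issue.
            spark = (
                SparkSession.builder
                .config("spark.jars.packages", "org.example:extra-format:1.0")
                .getOrCreate()
            )
            ...  # job B
    ```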

    On Thu, 10 Mar 2022 at 18:12, Sean Owen <sro...@gmail.com> wrote:

        Wouldn't these be separately submitted jobs for separate
        workloads? You can of course dynamically change each job
        submitted to have whatever packages you like, from whatever is
        orchestrating. A single job doing everything doesn't sound right.

        On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła
        <ravwojd...@gmail.com> wrote:

            Because I can't (and should not) know ahead of time which
            jobs will be executed; that's the job of the orchestration
            layer (and it can be dynamic). I know I can specify multiple
            packages. Also, I'm not worried about memory.

            On Thu, 10 Mar 2022 at 13:54, Artemis User
            <arte...@dtechspace.com> wrote:

                If changing packages or jars isn't your concern, why
                not just specify ALL the packages that you would need
                for the Spark environment?  You know you can define
                multiple packages under the packages option.  This
                shouldn't cause memory issues since the JVM uses
                dynamic class loading...

                On 3/9/22 10:03 PM, Rafał Wojdyła wrote:
                Hi Artemis,
                Thanks for your input, to answer your questions:

                > You may want to ask yourself why it is necessary to
                > change the jar packages during runtime.

                I have a long-running orchestrator process which
                executes multiple Spark jobs, currently on a single
                VM/driver; some of those jobs might require extra
                packages/jars (please see the example in the issue).

                > Changing packages doesn't mean reloading the classes.

                AFAIU this is unrelated.

                > There is no way to reload the same class unless you
                > customize the classloader of Spark.

                AFAIU this is an implementation detail.

                > I also don't think it is necessary to implement a
                > warning or error message when changing the
                > configuration since it doesn't do any harm

                To reiterate: right now the API allows changing the
                configuration of the context without that configuration
                taking effect. See examples of confused users here:
                 * https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
                 * https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1

                I'm curious if you have any opinion about the
                "hard-reset" workaround, copy-pasting from the issue:

                ```
                s: SparkSession = ...

                # Hard reset:
                s.stop()                           # stop the session and its SparkContext
                s._sc._gateway.shutdown()          # shut down the Py4J gateway
                s._sc._gateway.proc.stdin.close()  # closing stdin makes the launcher JVM exit
                SparkContext._gateway = None       # clear the cached gateway/JVM so the
                SparkContext._jvm = None           # next context launches a fresh JVM
                ```
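                For completeness, a sketch of what could follow the hard
                reset above - building a fresh session whose restart-only
                settings now take effect (the package coordinates are only
                an example):

                ```
                s2 = (
                    SparkSession.builder
                    .config("spark.jars.packages",
                            "org.apache.spark:spark-avro_2.12:3.2.1")
                    .config("spark.driver.memory", "8g")
                    .getOrCreate()  # launches a brand-new driver JVM
                )
                ```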

                Cheers - Rafal

                On 2022/03/09 15:39:58 Artemis User wrote:
                > This is indeed a JVM issue, not a Spark issue.  You may want to
                > ask yourself why it is necessary to change the jar packages
                > during runtime.  Changing packages doesn't mean reloading the
                > classes.  There is no way to reload the same class unless you
                > customize the classloader of Spark.  I also don't think it is
                > necessary to implement a warning or error message when changing
                > the configuration since it doesn't do any harm.  Spark uses lazy
                > binding so you can do a lot of such "unharmful" things.
                > Developers will have to understand the behaviors of each API
                > before using them.
                >
                >
                > On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
                > > Sean,
                > > I understand you might be sceptical about adding this
                > > functionality into (py)spark, I'm curious:
                > > * would an error/warning on an update to configuration that is
                > > currently effectively impossible (requires a restart of the
                > > JVM) be reasonable?
                > > * what do you think about the workaround in the issue?
                > > Cheers - Rafal
                > >
                > > On Wed, 9 Mar 2022 at 14:24, Sean Owen <sr...@gmail.com> wrote:
                > >
                > >     Unfortunately this opens a lot more questions and problems
                > >     than it solves. What if you take something off the
                > >     classpath, for example? Change a class?
                > >
                > >     On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
                > >     <ra...@gmail.com> wrote:
                > >
                > >         Thanks Sean,
                > >         To be clear, if you prefer to change the label on this
                > >         issue from bug to something else, feel free to do so,
                > >         no strong opinions on my end. What happens to the
                > >         classpath, whether Spark uses some classloader magic,
                > >         is probably an implementation detail. That said, it's
                > >         definitely not intuitive that you can change the
                > >         configuration and get the context (with the updated
                > >         config) without any warnings/errors. Also, what would
                > >         you recommend as a workaround or solution to this
                > >         problem? Any comments about the workaround in the
                > >         issue? Keep in mind that I can't restart the
                > >         long-running orchestration process (a Python process,
                > >         if that matters).
                > >         Cheers - Rafal
                > >
                > >         On Wed, 9 Mar 2022 at 13:15, Sean Owen
                > >         <sr...@gmail.com> wrote:
                > >
                > >             That isn't a bug - you can't change the classpath
                > >             once the JVM is executing.
                > >
                > >             On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
                > >             <ra...@gmail.com> wrote:
                > >
                > >                 Hi,
                > >                 My use case is that I have a long-running
                > >                 process (orchestrator) with multiple tasks,
                > >                 some of which might require extra Spark
                > >                 dependencies. It seems that once the Spark
                > >                 context is started it's not possible to update
                > >                 `spark.jars.packages`? I have reported an issue
                > >                 at https://issues.apache.org/jira/browse/SPARK-38438,
                > >                 together with a workaround ("hard reset of the
                > >                 cluster"). I wonder if anyone has a solution
                > >                 for this?
                > >                 Cheers - Rafal
                > >
                >

