Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
OK, I see the confusion in terminology. However, what was suggested should still work. A Luigi worker in this case would function like a Spark client, responsible for submitting a Spark application (or "job" in Luigi's terms). In other words, you just define all necessary jars for all your jobs in your SparkContext (or, to make things easier, define them in the spark-defaults.conf file, or just place them in Spark's jars directory). This should work 100%, especially when you don't know in advance which "job" (which should properly be called an application or task) needs which jars. For other questions unrelated to this discussion, I'd suggest starting a new thread to keep things clear. Thanks!

On 3/11/22 1:09 PM, Rafał Wojdyła wrote:
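The "declare everything up front" approach suggested above can be sketched as follows. This is a minimal illustration, assuming a hypothetical per-job registry of Maven coordinates; the job names and coordinate strings are made-up examples, not anything from the issue:

```python
# Union of the Maven coordinates needed by every known job, joined into
# the single comma-separated value that spark.jars.packages expects.
# Job names and coordinates below are hypothetical examples.
JOB_PACKAGES = {
    "job_a": [],  # no extra dependencies
    "job_b": ["org.apache.spark:spark-avro_2.12:3.2.1"],
    "job_c": ["org.apache.hadoop:hadoop-aws:3.3.1"],
}

def all_packages(registry):
    """Collect the de-duplicated, sorted union of all jobs' packages."""
    coords = sorted({pkg for pkgs in registry.values() for pkg in pkgs})
    return ",".join(coords)

# The resulting value is passed once, at context creation time, e.g.
#   SparkSession.builder.config("spark.jars.packages", all_packages(JOB_PACKAGES))
# or set as spark.jars.packages in conf/spark-defaults.conf.
print(all_packages(JOB_PACKAGES))
```

Because the value is fixed at SparkContext initialization, this sidesteps the whole question of changing packages on a live context.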
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
I don't know why I don't see my last message in the thread here: https://lists.apache.org/thread/5wgdqp746nj4f6ovdl42rt82wc8ltkcn
I also don't get messages from Artemis in my mail; I can only see them in the thread web UI, which is very confusing. On top of that, when I click on "reply via your own email client" in the web UI, I get: Bad Request Error 400.

Anyway, to answer your last comment, Artemis:

> I guess there are several misconceptions here:

There's no confusion on my side; all of that makes sense. When I said "worker" in that comment I meant the scheduler worker, not the Spark worker, which in Spark's realm would be the client. Everything else you said is undoubtedly correct, but unrelated to the issue at hand.

Sean, Artemis - I appreciate your feedback about the infra setup, but it's beside the problem behind this issue. Let me describe a simpler setup/example with the same problem, say:
1. I have a Jupyter notebook
2. I use local/driver Spark mode only
3. I start the driver, process some data, and store it in a pandas DataFrame
4. Now say I want to add a package to the Spark driver (or increase the JVM memory, etc.)

There's currently no way to do step 4 without restarting the notebook process, which holds the "reference" to the Spark driver/JVM. If I restart the Jupyter notebook I lose all the data in memory (e.g. the pandas data); of course I can save that data to disk, but that's beside the point. I understand you don't want to provide this functionality in Spark, nor warn users about changes in Spark configuration that won't actually take effect - as a user I wish I could at least get a warning in that case, but I respect your decision. It seems the workaround to shut down the JVM works in this case; I would much appreciate your feedback about **that specific workaround**. Any reason not to use it?
Cheers - Rafal

On Thu, 10 Mar 2022 at 18:50, Rafał Wojdyła wrote:
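The warning asked for above could, in principle, be approximated on the client side. A rough sketch of the idea - not Spark's API, just an illustration with plain dicts; the set of "JVM-bound" keys is an assumption and not exhaustive:

```python
import warnings

# Settings that are fixed once the driver JVM has started; this set is
# illustrative, not exhaustive.
JVM_BOUND_SETTINGS = {
    "spark.jars.packages",
    "spark.driver.memory",
    "spark.driver.extraClassPath",
}

def check_conf_update(live_conf, requested):
    """Return the requested settings that will silently not take effect
    on a running context, emitting a warning for each one."""
    ignored = []
    for key, value in requested.items():
        if key in JVM_BOUND_SETTINGS and live_conf.get(key) != value:
            warnings.warn(
                f"{key} cannot change on a running context "
                "(it requires a JVM restart)"
            )
            ignored.append(key)
    return ignored
```

With a check like this, setting `spark.jars.packages` on a live context would at least be visibly ignored rather than silently ignored.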
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
I guess there are several misconceptions here:
1. The worker doesn't create the driver; the client does.
2. Regardless of job scheduling, all jobs of the same task/application run under the same SparkContext, which is created by the driver. Therefore, you need to specify ALL dependency jars for ALL jobs when the single SparkContext is initialized.
3. The SparkContext object lives on and won't change as long as the application is alive.
4. Please see this Spark doc page as a reference: https://spark.apache.org/docs/latest/cluster-overview.html

On 3/10/22 1:50 PM, Rafał Wojdyła wrote:
> If you have a long-running Python orchestrator worker (e.g. a Luigi worker), and say it gets a DAG of A -> B -> C, and say the worker first creates a Spark driver for A (which doesn't need extra jars/packages), then it gets B, which is also a Spark job but needs an extra package - it won't be able to create a new Spark driver with extra packages, since it's "not possible" to create a new driver JVM. I would argue it's the same scenario if you have multiple Spark jobs that need different amounts of memory, or anything else that requires a JVM restart. Of course I can use the workaround to shut down the driver/JVM - do you have any feedback about that workaround (see my previous comment or the issue)?
>
> On Thu, 10 Mar 2022 at 18:12, Sean Owen wrote:
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
Wouldn't these be separately submitted jobs for separate workloads? You can of course dynamically change each job submitted to have whatever packages you like, from whatever is orchestrating. A single job doing everything doesn't sound right.

On Thu, Mar 10, 2022, 12:05 PM Rafał Wojdyła wrote:
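The "separately submitted jobs" approach above - one driver JVM per workload, with packages chosen by the orchestrator at submit time - might look roughly like this. The script names and package coordinate are hypothetical examples:

```python
# Build a spark-submit command line per job, so each job gets its own
# driver JVM with exactly the packages and memory it needs. The
# orchestrator (Luigi, Airflow, etc.) would run the result with
# subprocess.run(). Script names below are hypothetical.
def spark_submit_argv(script, packages=None, driver_memory="2g"):
    argv = ["spark-submit", "--driver-memory", driver_memory]
    if packages:
        argv += ["--packages", ",".join(packages)]
    argv.append(script)
    return argv

# Job A needs no extra packages; job B needs one extra package.
print(" ".join(spark_submit_argv("job_a.py")))
print(" ".join(spark_submit_argv(
    "job_b.py", packages=["org.apache.spark:spark-avro_2.12:3.2.1"])))
```

Since every submission starts a fresh JVM, per-job `--packages` and `--driver-memory` take effect without any restart gymnastics in the orchestrator process itself.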
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
Because I can't (and should not) know ahead of time which jobs will be executed - that's the job of the orchestration layer (and it can be dynamic). I know I can specify multiple packages. I'm also not worried about memory.

On Thu, 10 Mar 2022 at 13:54, Artemis User wrote:
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
If changing packages or jars isn't your concern, why not just specify ALL the packages that you would need for the Spark environment? You know you can define multiple packages under the packages option. This shouldn't cause memory issues, since the JVM uses dynamic class loading...

On 3/9/22 10:03 PM, Rafał Wojdyła wrote:
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
Hi Artemis,
Thanks for your input; to answer your questions:

> You may want to ask yourself why it is necessary to change the jar packages during runtime.

I have a long-running orchestrator process which executes multiple Spark jobs, currently on a single VM/driver; some of those jobs might require extra packages/jars (please see the example in the issue).

> Changing package doesn't mean to reload the classes.

AFAIU this is unrelated.

> There is no way to reload the same class unless you customize the classloader of Spark.

AFAIU this is an implementation detail.

> I also don't think it is necessary to implement a warning or error message when changing the configuration since it doesn't do any harm

To reiterate: right now the API allows changing the configuration of the context without that configuration taking effect. See examples of confused users here:
* https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
* https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1

I'm curious if you have any opinion about the "hard-reset" workaround; copy-pasting from the issue:

```
s: SparkSession = ...

# Hard reset:
s.stop()
s._sc._gateway.shutdown()
s._sc._gateway.proc.stdin.close()
SparkContext._gateway = None
SparkContext._jvm = None
```

Cheers - Rafal

On 2022/03/09 15:39:58 Artemis User wrote:
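For what it's worth, the hard-reset workaround quoted in the thread can be wrapped into a helper so an orchestrator can tear down the gateway between jobs. This is a sketch of that same workaround, not a supported API: `_sc`, `_gateway`, and `_jvm` are pyspark-private attributes and may change between versions, so the function is written against duck-typed arguments rather than importing pyspark:

```python
def hard_reset(session, spark_context_cls):
    """Stop the session and tear down the Py4J gateway so a fresh driver
    JVM (with new packages/memory settings) can be started afterwards.
    Mirrors the workaround from SPARK-38438; _sc and _gateway are
    pyspark-private attributes and may change between versions."""
    session.stop()                       # stop the SparkSession/SparkContext
    gateway = session._sc._gateway
    gateway.shutdown()                   # shut down the Py4J gateway
    gateway.proc.stdin.close()           # let the launched JVM process exit
    spark_context_cls._gateway = None    # clear the cached gateway/JVM so the
    spark_context_cls._jvm = None        # next context launches a new JVM
```

After `hard_reset(spark, SparkContext)`, a subsequent `SparkSession.builder.config("spark.jars.packages", ...).getOrCreate()` should start a fresh JVM that picks up the new settings, at the cost of losing all state in the old context.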
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
This is indeed a JVM issue, not a Spark issue. You may want to ask yourself why it is necessary to change the jar packages at runtime. Changing spark.jars.packages doesn't mean the classes get reloaded: there is no way to reload an already-loaded class unless you customize Spark's classloader. I also don't think it is necessary to implement a warning or error message when changing the configuration, since it doesn't do any harm. Spark uses lazy binding, so you can do a lot of such "unharmful" things. Developers will have to understand the behavior of each API before using it.
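The getOrCreate semantics behind this point can be illustrated without Spark at all. Below is a minimal stand-in sketch (the FakeContext class is hypothetical, not PySpark code) of a context whose configuration is captured only once, when the singleton is first created, so that later "updates" are silently dropped until the context is torn down and rebuilt:

```python
# Minimal stand-in for SparkContext.getOrCreate semantics (hypothetical
# class, not PySpark): configuration is read only when the singleton is
# first created; config passed to later calls is silently ignored.
class FakeContext:
    _active = None  # the active singleton, like the active SparkContext

    def __init__(self, conf):
        # Conf is frozen here, at startup -- analogous to the JVM
        # classpath being fixed when the JVM is launched.
        self.conf = dict(conf)

    @classmethod
    def get_or_create(cls, conf):
        if cls._active is None:
            cls._active = cls(conf)
        # An existing context is returned as-is; the new conf is dropped.
        return cls._active

    @classmethod
    def stop(cls):
        # The "hard reset": tear down the singleton so the next
        # get_or_create picks up the new configuration.
        cls._active = None


ctx1 = FakeContext.get_or_create({"spark.jars.packages": "pkg-a"})
ctx2 = FakeContext.get_or_create({"spark.jars.packages": "pkg-a,pkg-b"})
assert ctx2 is ctx1
assert ctx2.conf["spark.jars.packages"] == "pkg-a"  # update ignored

FakeContext.stop()
ctx3 = FakeContext.get_or_create({"spark.jars.packages": "pkg-a,pkg-b"})
assert ctx3.conf["spark.jars.packages"] == "pkg-a,pkg-b"  # now applied
```

This is only an analogy: with the real SparkContext, stopping the Python-side context is not always enough in local mode, since the JVM that pyspark launched keeps running, which is why the issue's workaround is called a "hard reset".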
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
Sean,
I understand you might be sceptical about adding this functionality into (py)spark. I'm curious:
* would an error/warning on updating configuration that is currently effectively impossible to apply (it requires a restart of the JVM) be reasonable?
* what do you think about the workaround in the issue?
Cheers - Rafal
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
Unfortunately this opens a lot more questions and problems than it solves. What if you take something off the classpath, for example? Change a class?
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
Thanks Sean,
To be clear, if you prefer to change the label on this issue from bug to something else, feel free to do so; no strong opinions on my end. What happens to the classpath, and whether Spark uses some classloader magic, is probably an implementation detail. That said, it's definitely not intuitive that you can change the configuration and get the context back (with the updated config) without any warnings or errors. Also, what would you recommend as a workaround or solution to this problem? Any comments about the workaround in the issue? Keep in mind that I can't restart the long-running orchestration process (a Python process, if that matters).
Cheers - Rafal
Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?
That isn't a bug - you can't change the classpath once the JVM is executing.

On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła wrote:

> Hi,
> My use case is that I have a long-running process (orchestrator) with
> multiple tasks; some tasks might require extra Spark dependencies. It
> seems that once the Spark context is started it's not possible to update
> `spark.jars.packages`? I have reported an issue at
> https://issues.apache.org/jira/browse/SPARK-38438, together with a
> workaround ("hard reset of the cluster"). I wonder if anyone has a
> solution for this?
> Cheers - Rafal
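One mitigation suggested later in this thread is to declare every package any task might need before the JVM starts, for example in conf/spark-defaults.conf, so nothing has to change at runtime. A sketch of such a fragment (the package coordinates and memory value are illustrative examples, not taken from this thread):

```
# conf/spark-defaults.conf -- read once, when the JVM is launched
spark.jars.packages   org.apache.spark:spark-avro_2.12:3.2.1,org.postgresql:postgresql:42.3.3
spark.driver.memory   4g
```

The same effect can be had per submission with `spark-submit --packages ...`, or by placing the jars directly in Spark's jars directory.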