Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Rafał Wojdyła
Hi Artemis,
Thanks for your input. To answer your questions:

> You may want to ask yourself why it is necessary to change the jar
packages during runtime.

I have a long-running orchestrator process, which executes multiple Spark
jobs, currently on a single VM/driver; some of those jobs might
require extra packages/jars (please see the example in the issue).

> Changing package doesn't mean to reload the classes.

AFAIU this is unrelated.

> There is no way to reload the same class unless you customize the
classloader of Spark.

AFAIU this is an implementation detail.

> I also don't think it is necessary to implement a warning or error
message when changing the configuration since it doesn't do any harm

To reiterate: right now the API allows you to change the configuration of an
existing context without that configuration taking effect. See examples of
confused users here:
 *
https://stackoverflow.com/questions/41886346/spark-2-1-0-session-config-settings-pyspark
 *
https://stackoverflow.com/questions/53606756/how-to-set-spark-driver-memory-in-client-mode-pyspark-version-2-3-1

I'm curious if you have any opinion about the "hard-reset" workaround,
copy-pasted from the issue:

```
from pyspark import SparkContext
from pyspark.sql import SparkSession

s: SparkSession = ...

# Hard reset:
s.stop()                           # stop the SparkSession/SparkContext
s._sc._gateway.shutdown()          # shut down the py4j gateway
s._sc._gateway.proc.stdin.close()  # closing stdin signals the gateway JVM to exit
SparkContext._gateway = None       # clear the cached gateway ...
SparkContext._jvm = None           # ... so the next session launches a fresh JVM
```
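
For completeness, a minimal sketch of what the hard reset enables (the package
coordinate below is purely illustrative, not from the issue): once the old
gateway/JVM is gone, a new session can be built with different packages.

```
# Hypothetical follow-up to the hard reset above; the Avro coordinate is only an example.
from pyspark.sql import SparkSession

s = (
    SparkSession.builder
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.1")
    .getOrCreate()  # launches a fresh JVM, so the new packages are resolved
)
```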

Cheers - Rafal

On 2022/03/09 15:39:58 Artemis User wrote:
> This is indeed a JVM issue, not a Spark issue.  You may want to ask
> yourself why it is necessary to change the jar packages during runtime.
> Changing package doesn't mean to reload the classes. There is no way to
> reload the same class unless you customize the classloader of Spark.  I
> also don't think it is necessary to implement a warning or error message
> when changing the configuration since it doesn't do any harm.  Spark
> uses lazy binding so you can do a lot of such "unharmful" things.
> Developers will have to understand the behavior of each API before
> using it.
>
>
> On 3/9/22 9:31 AM, Rafał Wojdyła wrote:
> >  Sean,
> > I understand you might be sceptical about adding this functionality
> > into (py)spark, I'm curious:
> > * would an error/warning on updating configuration that currently can't
> > take effect (requires a JVM restart) be reasonable?
> > * what do you think about the workaround in the issue?
> > Cheers - Rafal
> >
> > On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:
> >
> > Unfortunately this opens a lot more questions and problems than it
> > solves. What if you take something off the classpath, for example?
> > change a class?
> >
> > On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
> >  wrote:
> >
> > Thanks Sean,
> > To be clear, if you prefer to change the label on this issue
> > from bug to sth else, feel free to do so, no strong opinions
> > on my end. What happens to the classpath, whether spark uses
> > some classloader magic, is probably an implementation detail.
> > That said, it's definitely not intuitive that you can change
> > the configuration and get the context (with the updated
> > config) without any warnings/errors. Also what would you
> > recommend as a workaround or solution to this problem? Any
> > comments about the workaround in the issue? Keep in mind that
> > I can't restart the long running orchestration process (python
> > process if that matters).
> > Cheers - Rafal
> >
> > On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:
> >
> > That isn't a bug - you can't change the classpath once the
> > JVM is executing.
> >
> > On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
> >  wrote:
> >
> > Hi,
> > My use case is that, I have a long running process
> > (orchestrator) with multiple tasks, some tasks might
> > require extra spark dependencies. It seems once the
> > spark context is started it's not possible to update
> > `spark.jars.packages`? I have reported an issue at
> > https://issues.apache.org/jira/browse/SPARK-38438,
> > together with a workaround ("hard reset of the
> > cluster"). I wonder if anyone has a solution for this?
> > Cheers - Rafal
> >
>

>


Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread Sean Owen
You can run Spark in local mode and not require any standalone master or
worker.
Are you sure you're not using local mode? are you sure the daemons aren't
running?
What is the Spark master you pass?
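A quick way to check from the PySpark shell (a sketch, assuming the usual
`spark`/`sc` objects are available):

```
# Prints the effective master URL; "local[*]" means no standalone master/worker is involved.
print(spark.sparkContext.master)
print(spark.sparkContext.uiWebUrl)
```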

On Wed, Mar 9, 2022 at 7:35 PM  wrote:

> What I tried to say is, I didn't start the Spark master/worker at all for a
> standalone deployment.
>
> But I can still log in to pyspark to run the job. I don't know why.
>
> $ ps -efw|grep spark
> $ netstat -ntlp
>
> both the output above have no spark related info.
> And this machine is managed by myself; I know how to start Spark
> correctly. But I haven't started them yet, and I can still log in to pyspark
> to run the jobs. For example:
>
> >>> df = sc.parallelize([("t1",1),("t2",2)]).toDF(["name","number"])
> >>> df.show()
> +----+------+
> |name|number|
> +----+------+
> |  t1|     1|
> |  t2|     2|
> +----+------+
>
>
> do you know why?
> Thank you.
> frakass.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread capitnfrakass
What I tried to say is, I didn't start the Spark master/worker at all for a
standalone deployment.


But I can still log in to pyspark to run the job. I don't know why.

$ ps -efw|grep spark
$ netstat -ntlp

both the output above have no spark related info.
And this machine is managed by myself; I know how to start Spark
correctly. But I haven't started them yet, and I can still log in to pyspark
to run the jobs. For example:



>>> df = sc.parallelize([("t1",1),("t2",2)]).toDF(["name","number"])
>>> df.show()

+----+------+
|name|number|
+----+------+
|  t1|     1|
|  t2|     2|
+----+------+


do you know why?
Thank you.
frakass.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: RebaseDateTime with dynamicAllocation

2022-03-09 Thread Andreas Weise
Okay, found the root cause. Our k8s image got some changes, including a
mess with some jar dependencies around com.fasterxml.jackson ...

Sorry for the inconvenience.
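
In case it is useful to others, here is a rough sketch (driver side only, via
the py4j gateway) for checking which jars the Jackson classes were actually
loaded from:

```
# Prints the jar each Jackson class was loaded from in the driver JVM (a diagnostic sketch).
jvm = spark.sparkContext._jvm
for name in ("com.fasterxml.jackson.databind.ObjectMapper",
             "com.fasterxml.jackson.module.scala.DefaultScalaModule"):
    cls = jvm.java.lang.Class.forName(name)
    print(name, "->", cls.getProtectionDomain().getCodeSource().getLocation())
```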

Some earlier log in the driver contained that info...

[2022-03-09 21:54:25,163] ({task-result-getter-3}
Logging.scala[logWarning]:69) - Lost task 1.0 in stage 0.0 (TID 1)
(10.131.0.221 executor 1): java.lang.ExceptionInInitializerError
at
org.apache.spark.sql.catalyst.util.RebaseDateTime.lastSwitchJulianTs(RebaseDateTime.scala)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseTimestamp(ParquetVectorUpdaterFactory.java:1067)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseInt96(ParquetVectorUpdaterFactory.java:1088)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.access$1500(ParquetVectorUpdaterFactory.java:43)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$BinaryToSQLTimestampRebaseUpdater.decodeSingleDictionaryId(ParquetVectorUpdaterFactory.java:860)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdater.decodeDictionaryIds(ParquetVectorUpdater.java:75)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:216)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:298)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
at
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_1$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at
org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
at
org.apache.spark.RangePartitioner$.$anonfun$sketch$1(Partitioner.scala:306)
at
org.apache.spark.RangePartitioner$.$anonfun$sketch$1$adapted(Partitioner.scala:304)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
at
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Scala
module 2.12.3 requires Jackson Databind version >= 2.12.0 and < 2.13.0
at
com.fasterxml.jackson.module.scala.JacksonModule.setupModule(JacksonModule.scala:61)
at
com.fasterxml.jackson.module.scala.JacksonModule.setupModule$(JacksonModule.scala:46)
at
com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:17)
at
com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:751)
at
org.apache.spark.sql.catalyst.util.RebaseDateTime$.loadRebaseRecords(RebaseDateTime.scala:277)
at
org.apache.spark.sql.catalyst.util.RebaseDateTime$.<init>(RebaseDateTime.scala:300)
at
org.apache.spark.sql.catalyst.util.RebaseDateTime$.<clinit>(RebaseDateTime.scala)
... 38 more

On Wed, Mar 9, 2022 at 5:30 PM Andreas Weise 
wrote:

> Full trace doesn't provide any 

Re: RebaseDateTime with dynamicAllocation

2022-03-09 Thread Andreas Weise
Full trace doesn't provide any further details. It looks like this:

Py4JJavaError: An error occurred while calling o337.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage
18.0 (TID 220) (10.128.6.170 executor 13): java.lang.NoClassDefFoundError:
Could not initialize class
org.apache.spark.sql.catalyst.util.RebaseDateTime$ at
org.apache.spark.sql.catalyst.util.RebaseDateTime.lastSwitchJulianTs(RebaseDateTime.scala)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseTimestamp(ParquetVectorUpdaterFactory.java:1067)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseInt96(ParquetVectorUpdaterFactory.java:1088)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.access$1500(ParquetVectorUpdaterFactory.java:43)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$BinaryToSQLTimestampRebaseUpdater.decodeSingleDictionaryId(ParquetVectorUpdaterFactory.java:860)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdater.decodeDictionaryIds(ParquetVectorUpdater.java:75)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:216)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:298)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
at
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_1$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source) at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131) at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
at
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
at scala.Option.foreach(Option.scala:407) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
at 

Re: CPU usage from Event log

2022-03-09 Thread Artemis User
I am not sure which columns/properties you are referring to, but the 
event log in Spark deals with application-level "events", not JVM-level 
metrics. To retrieve JVM metrics, you need to use the REST API 
provided by Spark. Please see 
https://spark.apache.org/docs/latest/monitoring.html for details.
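
As a rough illustration (host, port, and the choice of the first application
are placeholders; on EMR you would typically point this at the Spark history
server rather than a live driver UI):

```
# Pulls per-executor metrics (task time, GC time, peak memory) from the monitoring REST API.
import requests

base = "http://localhost:4040/api/v1"  # or the history server (commonly port 18080) for finished apps
app_id = requests.get(f"{base}/applications").json()[0]["id"]

for ex in requests.get(f"{base}/applications/{app_id}/executors").json():
    print(ex["id"], ex["totalDuration"], ex["totalGCTime"], ex.get("peakMemoryMetrics"))
```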


On 3/9/22 10:21 AM, Prasad Bhalerao wrote:

Hi,

I am trying to calculate the CPU utilization of an Executor (JVM-level CPU 
usage) using the event log. Can someone please help me with this?


1) Which column/properties to select
2) the correct formula to derive cpu usage

Has anyone done anything similar to this?

We have many pipelines and they are using very large EMR clusters. I 
am trying to find out the CPU utilization and memory utilization of 
the nodes. This will help me find out whether the clusters are 
underutilized and reduce the nodes.


Is there a better way to get these stats without changing the code?


Thanks,
Prasad



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: RebaseDateTime with dynamicAllocation

2022-03-09 Thread Sean Owen
Doesn't quite seem the same. What is the rest of the error -- why did the
class fail to initialize?

On Wed, Mar 9, 2022 at 10:08 AM Andreas Weise 
wrote:

> Hi,
>
> When playing around with spark.dynamicAllocation.enabled I face the
> following error after the first round of executors have been killed.
>
> Py4JJavaError: An error occurred while calling o337.showString. :
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage
> 18.0 (TID 220) (10.128.6.170 executor 13): java.lang.NoClassDefFoundError:
> Could not initialize class
> org.apache.spark.sql.catalyst.util.RebaseDateTime$ at
> org.apache.spark.sql.catalyst.util.RebaseDateTime.lastSwitchJulianTs(RebaseDateTime.scala)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseTimestamp(ParquetVectorUpdaterFactory.java:1067)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseInt96(ParquetVectorUpdaterFactory.java:1088)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.access$1500(ParquetVectorUpdaterFactory.java:43)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$BinaryToSQLTimestampRebaseUpdater.decodeSingleDictionaryId(ParquetVectorUpdaterFactory.java:860)
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdater.decodeDictionaryIds(ParquetVectorUpdater.java:75)
> at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:216)
> at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:298)
> at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
> at
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
> Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_1$(Unknown
> Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
> Source) at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source) at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> at
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:131) at
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
>
> We tested on Spark 3.2.1 k8s with these dynamicAllocation settings:
>
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.maxExecutors=4
> spark.dynamicAllocation.minExecutors=1
> spark.dynamicAllocation.executorIdleTimeout=30s
> spark.dynamicAllocation.shuffleTracking.enabled=true
> spark.dynamicAllocation.shuffleTracking.timeout=30s
> spark.decommission.enabled=true
>
> Might be related to SPARK-34772 /
> https://www.mail-archive.com/commits@spark.apache.org/msg50240.html but
> as this was fixed in 3.2.0, it might be worth another issue?
>
> Best regards
> Andreas
>


RebaseDateTime with dynamicAllocation

2022-03-09 Thread Andreas Weise
Hi,

When playing around with spark.dynamicAllocation.enabled I face the
following error after the first round of executors have been killed.

Py4JJavaError: An error occurred while calling o337.showString. :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage
18.0 (TID 220) (10.128.6.170 executor 13): java.lang.NoClassDefFoundError:
Could not initialize class
org.apache.spark.sql.catalyst.util.RebaseDateTime$ at
org.apache.spark.sql.catalyst.util.RebaseDateTime.lastSwitchJulianTs(RebaseDateTime.scala)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseTimestamp(ParquetVectorUpdaterFactory.java:1067)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.rebaseInt96(ParquetVectorUpdaterFactory.java:1088)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory.access$1500(ParquetVectorUpdaterFactory.java:43)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$BinaryToSQLTimestampRebaseUpdater.decodeSingleDictionaryId(ParquetVectorUpdaterFactory.java:860)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdater.decodeDictionaryIds(ParquetVectorUpdater.java:75)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:216)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:298)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
at
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
at
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_1$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source) at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131) at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We tested on Spark 3.2.1 k8s with these dynamicAllocation settings:

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=4
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.executorIdleTimeout=30s
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.shuffleTracking.timeout=30s
spark.decommission.enabled=true
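
For reference, the equivalent settings expressed when building the session from
PySpark would look roughly like this (a sketch; values copied from the list above):

```
# Same dynamic allocation settings applied at session build time (values as listed above).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "4")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.executorIdleTimeout", "30s")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.timeout", "30s")
    .config("spark.decommission.enabled", "true")
    .getOrCreate()
)
```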

Might be related to SPARK-34772 /
https://www.mail-archive.com/commits@spark.apache.org/msg50240.html but as
this was fixed in 3.2.0, it might be worth another issue?

Best regards
Andreas


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Artemis User
This is indeed a JVM issue, not a Spark issue.  You may want to ask 
yourself why it is necessary to change the jar packages during runtime.  
Changing package doesn't mean to reload the classes. There is no way to 
reload the same class unless you customize the classloader of Spark.  I 
also don't think it is necessary to implement a warning or error message 
when changing the configuration since it doesn't do any harm.  Spark 
uses lazy binding so you can do a lot of such "unharmful" things.  
Developers will have to understand the behavior of each API before 
using it.



On 3/9/22 9:31 AM, Rafał Wojdyła wrote:

 Sean,
I understand you might be sceptical about adding this functionality 
into (py)spark, I'm curious:
* would an error/warning on updating configuration that currently can't 
take effect (requires a JVM restart) be reasonable?

* what do you think about the workaround in the issue?
Cheers - Rafal

On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:

Unfortunately this opens a lot more questions and problems than it
solves. What if you take something off the classpath, for example?
change a class?

On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła
 wrote:

Thanks Sean,
To be clear, if you prefer to change the label on this issue
from bug to sth else, feel free to do so, no strong opinions
on my end. What happens to the classpath, whether spark uses
some classloader magic, is probably an implementation detail.
That said, it's definitely not intuitive that you can change
the configuration and get the context (with the updated
config) without any warnings/errors. Also what would you
recommend as a workaround or solution to this problem? Any
comments about the workaround in the issue? Keep in mind that
I can't restart the long running orchestration process (python
process if that matters).
Cheers - Rafal

On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:

That isn't a bug - you can't change the classpath once the
JVM is executing.

On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła
 wrote:

Hi,
My use case is that, I have a long running process
(orchestrator) with multiple tasks, some tasks might
require extra spark dependencies. It seems once the
spark context is started it's not possible to update
`spark.jars.packages`? I have reported an issue at
https://issues.apache.org/jira/browse/SPARK-38438,
together with a workaround ("hard reset of the
cluster"). I wonder if anyone has a solution for this?
Cheers - Rafal



Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread Artemis User

To be specific:

1. Check the log files on both master and worker and see if there are
   any errors.
2. If you are not running your browser on the same machine as the
   Spark cluster, please use the host's external IP instead of the
   localhost IP when launching the worker.

Hope this helps...
-- ND

On 3/9/22 9:23 AM, Sean Owen wrote:

Did it start successfully? What do you mean ports were not opened?

On Wed, Mar 9, 2022 at 3:02 AM  wrote:

Hello

I have Spark 3.2.0 deployed on localhost in standalone mode.
I didn't even run the start master and worker commands:

     start-master.sh
     start-worker.sh spark://127.0.0.1:7077 


And the ports (such as 7077) were not opened there.
But I can still log in to pyspark to run the jobs.

Why does this happen?

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org





CPU usage from Event log

2022-03-09 Thread Prasad Bhalerao
Hi,

I am trying to calculate the CPU utilization of an Executor (JVM-level CPU
usage) using the event log. Can someone please help me with this?

1) Which column/properties to select
2) the correct formula to derive cpu usage

Has anyone done anything similar to this?

We have many pipelines and they are using very large EMR clusters. I am
trying to find out the CPU utilization and memory utilization of the nodes.
This will help me find out whether the clusters are underutilized and reduce
the nodes.

Is there a better way to get these stats without changing the code?


Thanks,
Prasad


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Rafał Wojdyła
Sean,
I understand you might be sceptical about adding this functionality into
(py)spark, I'm curious:
* would an error/warning on updating configuration that currently can't
take effect (requires a JVM restart) be reasonable?
* what do you think about the workaround in the issue?
Cheers - Rafal

On Wed, 9 Mar 2022 at 14:24, Sean Owen  wrote:

> Unfortunately this opens a lot more questions and problems than it solves.
> What if you take something off the classpath, for example? change a class?
>
> On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła  wrote:
>
>> Thanks Sean,
>> To be clear, if you prefer to change the label on this issue from bug to
>> sth else, feel free to do so, no strong opinions on my end. What happens to
>> the classpath, whether spark uses some classloader magic, is probably an
>> implementation detail. That said, it's definitely not intuitive that you
>> can change the configuration and get the context (with the updated config)
>> without any warnings/errors. Also what would you recommend as a workaround
>> or solution to this problem? Any comments about the workaround in the
>> issue? Keep in mind that I can't restart the long running orchestration
>> process (python process if that matters).
>> Cheers - Rafal
>>
>> On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:
>>
>>> That isn't a bug - you can't change the classpath once the JVM is
>>> executing.
>>>
>>> On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła 
>>> wrote:
>>>
 Hi,
 My use case is that, I have a long running process (orchestrator) with
 multiple tasks, some tasks might require extra spark dependencies. It seems
 once the spark context is started it's not possible to update
 `spark.jars.packages`? I have reported an issue at
 https://issues.apache.org/jira/browse/SPARK-38438, together with a
 workaround ("hard reset of the cluster"). I wonder if anyone has a solution
 for this?
 Cheers - Rafal

>>>


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Sean Owen
Unfortunately this opens a lot more questions and problems than it solves.
What if you take something off the classpath, for example? change a class?

On Wed, Mar 9, 2022 at 8:22 AM Rafał Wojdyła  wrote:

> Thanks Sean,
> To be clear, if you prefer to change the label on this issue from bug to
> sth else, feel free to do so, no strong opinions on my end. What happens to
> the classpath, whether spark uses some classloader magic, is probably an
> implementation detail. That said, it's definitely not intuitive that you
> can change the configuration and get the context (with the updated config)
> without any warnings/errors. Also what would you recommend as a workaround
> or solution to this problem? Any comments about the workaround in the
> issue? Keep in mind that I can't restart the long running orchestration
> process (python process if that matters).
> Cheers - Rafal
>
> On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:
>
>> That isn't a bug - you can't change the classpath once the JVM is
>> executing.
>>
>> On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła 
>> wrote:
>>
>>> Hi,
>>> My use case is that, I have a long running process (orchestrator) with
>>> multiple tasks, some tasks might require extra spark dependencies. It seems
>>> once the spark context is started it's not possible to update
>>> `spark.jars.packages`? I have reported an issue at
>>> https://issues.apache.org/jira/browse/SPARK-38438, together with a
>>> workaround ("hard reset of the cluster"). I wonder if anyone has a solution
>>> for this?
>>> Cheers - Rafal
>>>
>>


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Rafał Wojdyła
Thanks Sean,
To be clear, if you prefer to change the label on this issue from bug to
sth else, feel free to do so, no strong opinions on my end. What happens to
the classpath, whether spark uses some classloader magic, is probably an
implementation detail. That said, it's definitely not intuitive that you
can change the configuration and get the context (with the updated config)
without any warnings/errors. Also what would you recommend as a workaround
or solution to this problem? Any comments about the workaround in the
issue? Keep in mind that I can't restart the long running orchestration
process (python process if that matters).
Cheers - Rafal

On Wed, 9 Mar 2022 at 13:15, Sean Owen  wrote:

> That isn't a bug - you can't change the classpath once the JVM is
> executing.
>
> On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła  wrote:
>
>> Hi,
>> My use case is that, I have a long running process (orchestrator) with
>> multiple tasks, some tasks might require extra spark dependencies. It seems
>> once the spark context is started it's not possible to update
>> `spark.jars.packages`? I have reported an issue at
>> https://issues.apache.org/jira/browse/SPARK-38438, together with a
>> workaround ("hard reset of the cluster"). I wonder if anyone has a solution
>> for this?
>> Cheers - Rafal
>>
>


Re: spark jobs don't require the master/worker to startup?

2022-03-09 Thread Sean Owen
Did it start successfully? What do you mean ports were not opened?

On Wed, Mar 9, 2022 at 3:02 AM  wrote:

> Hello
>
> I have Spark 3.2.0 deployed on localhost in standalone mode.
> I didn't even run the start master and worker commands:
>
>  start-master.sh
>  start-worker.sh spark://127.0.0.1:7077
>
>
> And the ports (such as 7077) were not opened there.
> But I can still log in to pyspark to run the jobs.
>
> Why does this happen?
>
> Thanks.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Sean Owen
That isn't a bug - you can't change the classpath once the JVM is executing.

On Wed, Mar 9, 2022 at 7:11 AM Rafał Wojdyła  wrote:

> Hi,
> My use case is that, I have a long running process (orchestrator) with
> multiple tasks, some tasks might require extra spark dependencies. It seems
> once the spark context is started it's not possible to update
> `spark.jars.packages`? I have reported an issue at
> https://issues.apache.org/jira/browse/SPARK-38438, together with a
> workaround ("hard reset of the cluster"). I wonder if anyone has a solution
> for this?
> Cheers - Rafal
>


[SPARK-38438] pyspark - how to update spark.jars.packages on existing default context?

2022-03-09 Thread Rafał Wojdyła
Hi,
My use case is that I have a long-running process (orchestrator) with
multiple tasks, and some tasks might require extra Spark dependencies. It seems
that once the Spark context is started it's not possible to update
`spark.jars.packages`? I have reported an issue at
https://issues.apache.org/jira/browse/SPARK-38438, together with a
workaround ("hard reset of the cluster"). I wonder if anyone has a solution
for this?
Cheers - Rafal


spark jobs don't require the master/worker to startup?

2022-03-09 Thread capitnfrakass

Hello

I have Spark 3.2.0 deployed on localhost in standalone mode.
I didn't even run the start master and worker commands:

start-master.sh
start-worker.sh spark://127.0.0.1:7077


And the ports (such as 7077) were not opened there.
But I can still log in to pyspark to run the jobs.

Why does this happen?

Thanks.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org