Hi,

First comes some background, then I have some questions.

*Background*
I'm trying out Zeppelin 0.8.2 based on the Docker image. My Dockerfile
looks like this:

```Dockerfile
FROM apache/zeppelin:0.8.2


# Install a couple of extra tools (vim, pip for Python 3)
RUN apt-get -y update &&\
    DEBIAN_FRONTEND=noninteractive \
        apt-get -y install vim python3-pip

RUN python3 -m pip install -U pyspark

ENV PYSPARK_PYTHON python3
ENV PYSPARK_DRIVER_PYTHON python3
```
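
As a quick sanity check that the pip-installed pyspark and the
PYSPARK_PYTHON setting are actually picked up, a paragraph along these
lines can be run first (sys.version and sc.pythonVer are standard; nothing
here is specific to the problem below):

```Zeppelin paragraph
%pyspark

import sys

# Python used by the driver process
print(sys.version)

# Python version the SparkContext was created with
print(sc.pythonVer)
```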

When I run a paragraph like this

```Zeppelin paragraph
%pyspark

print(sc)
print()
print(dir(sc))
print()
print(sc.master)
print()
print(sc.defaultParallelism)
```

I get the following output

```output
<SparkContext master=local appName=Zeppelin> ['PACKAGE_EXTENSIONS',
'__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__',
'__eq__', '__exit__', '__format__', '__ge__', '__getattribute__',
'__getnewargs__', '__gt__', '__hash__', '__init__', '__le__', '__lt__',
'__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_accumulatorServer', '_active_spark_context', '_batchSize',
'_callsite', '_checkpointFile', '_conf', '_dictToJavaMap', '_do_init',
'_ensure_initialized', '_gateway', '_getJavaStorageLevel',
'_initialize_context', '_javaAccumulator', '_jsc', '_jvm', '_lock',
'_next_accum_id', '_pickled_broadcast_vars', '_python_includes',
'_repr_html_', '_temp_dir', '_unbatched_serializer', 'accumulator',
'addFile', 'addPyFile', 'appName', 'applicationId', 'binaryFiles',
'binaryRecords', 'broadcast', 'cancelAllJobs', 'cancelJobGroup',
'defaultMinPartitions', 'defaultParallelism', 'dump_profiles', 'emptyRDD',
'environment', 'getConf', 'getLocalProperty', 'getOrCreate', 'hadoopFile',
'hadoopRDD', 'master', 'newAPIHadoopFile', 'newAPIHadoopRDD',
'parallelize', 'pickleFile', 'profiler_collector', 'pythonExec',
'pythonVer', 'range', 'runJob', 'sequenceFile', 'serializer',
'setCheckpointDir', 'setJobGroup', 'setLocalProperty', 'setLogLevel',
'setSystemProperty', 'show_profiles', 'sparkHome', 'sparkUser',
'startTime', 'statusTracker', 'stop', 'textFile', 'uiWebUrl', 'union',
'version', 'wholeTextFiles'] local 1
```

This happens even though the "master" property in the Spark interpreter
settings is set to "local[*]". I'd like to use all cores on my machine. To
achieve that I have to explicitly create a "spark.master" property in the
Spark interpreter with the value "local[*]"; then I get

```new output
<SparkContext master=local[*] appName=Zeppelin> ['PACKAGE_EXTENSIONS',
'__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__',
'__eq__', '__exit__', '__format__', '__ge__', '__getattribute__',
'__getnewargs__', '__gt__', '__hash__', '__init__', '__le__', '__lt__',
'__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_accumulatorServer', '_active_spark_context', '_batchSize',
'_callsite', '_checkpointFile', '_conf', '_dictToJavaMap', '_do_init',
'_ensure_initialized', '_gateway', '_getJavaStorageLevel',
'_initialize_context', '_javaAccumulator', '_jsc', '_jvm', '_lock',
'_next_accum_id', '_pickled_broadcast_vars', '_python_includes',
'_repr_html_', '_temp_dir', '_unbatched_serializer', 'accumulator',
'addFile', 'addPyFile', 'appName', 'applicationId', 'binaryFiles',
'binaryRecords', 'broadcast', 'cancelAllJobs', 'cancelJobGroup',
'defaultMinPartitions', 'defaultParallelism', 'dump_profiles', 'emptyRDD',
'environment', 'getConf', 'getLocalProperty', 'getOrCreate', 'hadoopFile',
'hadoopRDD', 'master', 'newAPIHadoopFile', 'newAPIHadoopRDD',
'parallelize', 'pickleFile', 'profiler_collector', 'pythonExec',
'pythonVer', 'range', 'runJob', 'sequenceFile', 'serializer',
'setCheckpointDir', 'setJobGroup', 'setLocalProperty', 'setLogLevel',
'setSystemProperty', 'show_profiles', 'sparkHome', 'sparkUser',
'startTime', 'statusTracker', 'stop', 'textFile', 'uiWebUrl', 'union',
'version', 'wholeTextFiles'] local[*] 8
```
This is what I want.
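
To see exactly which configuration the running context ended up with, a
paragraph like the following can be used to dump the effective Spark conf
(getConf()/getAll() are standard pyspark methods; this is just a
convenience sketch):

```Zeppelin paragraph
%pyspark

# Print the effective Spark configuration as the SparkContext sees it;
# "spark.master" shows which master value actually won.
for key, value in sorted(sc.getConf().getAll()):
    print(key, "=", value)
```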

*The Questions*

   - Why is the interpreter's "master" property not reflected in the created SparkContext?
   - How do I add the spark.master property to the Docker image? (A sketch of what I have in mind is below.)
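
For the second question, this is the kind of Dockerfile addition I have in
mind. It is an untested sketch: it assumes the image keeps its configuration
under /zeppelin/conf (adjust if ZEPPELIN_HOME points elsewhere) and that the
Spark interpreter honors the MASTER setting from conf/zeppelin-env.sh, as
the zeppelin-env.sh template suggests.

```Dockerfile
# Untested sketch: ask the Spark interpreter to use all local cores.
# Assumes ZEPPELIN_HOME is /zeppelin and that conf/zeppelin-env.sh is
# sourced at startup; adjust the path if the image differs.
RUN echo 'export MASTER=local[*]' >> /zeppelin/conf/zeppelin-env.sh
```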


Any hint or support you can provide would be greatly appreciated.

Yours Sincerely,
Patrik Iselind
