Hi all,

Recently, I deployed a very simple Apache Beam pipeline to get some
insights into how it behaved executing in Dataproc as opposed to on my
local machine. After running it, I quickly realized that none of the
DoFn or transform-level logging appeared within the job logs in the
Google Cloud Console as I would have expected, and I'm not entirely
sure what might be missing.


All of the high level logging messages are emitted as expected:


// This works
log.info("Testing logging operations")

pipeline
    .apply(Create.of(...))
    .apply(ParDo.of(LoggingDoFn))


The LoggingDoFn class here is a very basic transform that simply logs
each of the values it encounters, as seen below:


object LoggingDoFn : DoFn<String, ...>() {
    private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)

    @ProcessElement
    fun processElement(c: ProcessContext) {
        // This is never emitted within the logs
        log.info("Attempting to parse ${c.element()}")
    }
}


As detailed in the comments, I can see the logging messages emitted
outside of the processElement() calls, but nothing from within them
(presumably because processElement() is executed by the Spark runner
out on the workers). Is there a way to easily surface those messages
from the inner transform as well?
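
In case serialization of the logger field is a factor here, this is a
minimal variant of the same DoFn with the logger held in a companion
object instead. It's purely a sketch for discussion (the class name and
the String output type are made up for the example), and I haven't
verified that it behaves any differently:

import org.apache.beam.sdk.transforms.DoFn
import org.slf4j.LoggerFactory

class LoggingDoFnVariant : DoFn<String, String>() {
    companion object {
        // Held statically so the logger itself is never serialized
        // along with the DoFn instance shipped to the workers
        private val log = LoggerFactory.getLogger(LoggingDoFnVariant::class.java)
    }

    @ProcessElement
    fun processElement(c: ProcessContext) {
        log.info("Attempting to parse ${c.element()}")
        c.output(c.element()) // pass the element through unchanged
    }
}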


When viewing the logs related to this job, we can see the higher-level
logging present, but no mention of an "Attempting to parse ..." message
from the DoFn:


[Screenshot of the job logs in the Cloud Console: https://i.stack.imgur.com/gjela.png]


The job itself is being executed by the following gcloud command, which
explicitly defines the driver log levels, but perhaps there's another
layer of logging or configuration that needs to be added:


gcloud dataproc jobs submit spark \
    --jar=gs://some_bucket/deployment/example.jar \
    --project example-project \
    --cluster example-cluster \
    --region us-example \
    --driver-log-levels com.example=DEBUG \
    -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
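
Since --driver-log-levels appears to only affect the driver process, I
also wondered whether the executors need a log4j configuration of their
own, along these lines (an untested guess on my part; the
log4j.properties file and its location are hypothetical):

gcloud dataproc jobs submit spark \
    --jar=gs://some_bucket/deployment/example.jar \
    --project example-project \
    --cluster example-cluster \
    --region us-example \
    --files gs://some_bucket/deployment/log4j.properties \
    --properties spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
    -- --runner=SparkRunner --output=gs://some_bucket/deployment/out

Does something along those lines make sense here, or is the worker-side
logging wired up differently under the Spark runner?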


To summarize, log messages are not being emitted to the Google Cloud
Console for work that the Spark runner distributes to the workers
(e.g. the processElement() calls). I'm unsure if it's a
configuration-related issue or something else entirely.


Any advice would be appreciated, and I'd be glad to provide additional
details however I can.


Thanks,


Rion
