Something like this, maybe?

import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.execution.LogicalRDD

val df: DataFrame = ???
val spark = df.sparkSession

// Attach a side effect to each partition without changing the rows.
val rddOfInternalRows = df.queryExecution.toRdd.mapPartitions { iter =>
  // This runs on the executors, so build the logger inside the closure
  // rather than capturing a non-serializable one from the driver.
  org.slf4j.LoggerFactory.getLogger("status").info("Test")
  iter
}

// Rebuild a logical plan around the instrumented RDD, preserving the schema.
val attributes = df.schema.map(f =>
  AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
)
val logicalPlan = LogicalRDD(attributes, rddOfInternalRows)(spark)
val rowEncoder = RowEncoder(df.schema)

// Caveat: this Dataset constructor is private[sql], so this snippet has to
// live in a class under the org.apache.spark.sql package.
val resultingDataFrame = new Dataset[Row](spark, logicalPlan, rowEncoder)
resultingDataFrame
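If you only need a notification when an action on a Dataset finishes (rather than a mid-plan hook), Spark also exposes QueryExecutionListener, which fires on the driver after each action with the query execution and its duration. A minimal sketch, assuming a `spark` session is already in scope:

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  // Called after an action (count, collect, save, ...) succeeds.
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    println(s"Action '$funcName' finished in ${durationNs / 1e6} ms")
  }

  // Called if the action throws.
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
    println(s"Action '$funcName' failed: ${exception.getMessage}")
  }
})
```

This stays completely independent of the processing itself, but the granularity is per-action, not per-operator, so it won't tell you where inside a plan you currently are.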

On Mon, Aug 14, 2017 at 2:15 PM, Lukas Bradley <lukasbrad...@gmail.com>
wrote:

> We have had issues with gathering status on long-running jobs.  We have
> attempted to draw parallels between the Spark UI/Monitoring API and our
> code base.  Due to the separation between code and the execution plan, even
> having a guess as to where we are in the process is difficult.  The
> Job/Stage/Task information is too abstracted from our code to be easily
> digested by non-Spark engineers on our team.
>
> Is there a "hook" to which I can attach a piece of code that is triggered
> when a point in the plan is reached?  This could be when a SQL command
> completes, or when a new Dataset is created, anything really...
>
> It seems Dataset.checkpoint() offers an excellent snapshot position during
> execution, but I'm concerned I'm short-circuiting the optimal execution of
> the full plan.  I really want these trigger functions to be completely
> independent of the actual processing itself.  I'm not looking to extract
> information from a Dataset, RDD, or anything else.  I essentially want to
> write independent output for status.
>
> If this doesn't exist, is there any desire on the dev team for me to
> investigate this feature?
>
> Thank you for any and all help.
>
