Re: [Spark Core] Is it possible to insert a function directly into the Logical Plan?
What about accumulators?

> On 14. Aug 2017, at 20:15, Lukas Bradley wrote:
>
> We have had issues with gathering status on long-running jobs. We have
> attempted to draw parallels between the Spark UI/Monitoring API and our code
> base. Due to the separation between code and the execution plan, even having
> a guess as to where we are in the process is difficult. The Job/Stage/Task
> information is too abstracted from our code to be easily digested by
> non-Spark engineers on our team.
>
> Is there a "hook" to which I can attach a piece of code that is triggered
> when a point in the plan is reached? This could be when a SQL command
> completes, or when a new Dataset is created, anything really...
>
> It seems Dataset.checkpoint() offers an excellent snapshot position during
> execution, but I'm concerned I'm short-circuiting the optimal execution of
> the full plan. I really want these trigger functions to be completely
> independent of the actual processing itself. I'm not looking to extract
> information from a Dataset, RDD, or anything else. I essentially want to
> write independent output for status.
>
> If this doesn't exist, is there any desire on the dev team for me to
> investigate this feature?
>
> Thank you for any and all help.
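The accumulator suggestion above could be sketched roughly as follows. A named accumulator incremented inside a transformation gives the driver (and the Spark UI, which displays named accumulators per stage) a live, approximate progress counter, without any data flowing back through the result. The local session, RDD, and names here are illustrative only; note also that accumulator updates made inside transformations may be re-applied if a task is retried, so treat the value as a status signal rather than an exact count.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; a real job would reuse its existing session.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("progress-sketch")
  .getOrCreate()

// Named accumulators show up in the Spark UI next to the stage updating them.
val processed = spark.sparkContext.longAccumulator("rows processed")

val doubled = spark.sparkContext.parallelize(1 to 1000, numSlices = 4).map { x =>
  processed.add(1) // side channel only: status flows out, no data flows in
  x * 2
}

doubled.count() // the driver could poll processed.value while this action runs
println(s"processed so far: ${processed.value}")

spark.stop()
```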
Re: [Spark Core] Is it possible to insert a function directly into the Logical Plan?
Something like this, maybe?

import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.execution.LogicalRDD

val df: DataFrame = ???
val spark = df.sparkSession

// Wrap the physical RDD so arbitrary code runs as each partition is consumed.
val rddOfInternalRows = df.queryExecution.toRdd.mapPartitions { iter =>
  log.info("Test") // `log` is whatever logger is in scope
  iter
}

val attributes = df.schema.map(f =>
  AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
)
val logicalPlan = LogicalRDD(attributes, rddOfInternalRows)(spark)
val rowEncoder = RowEncoder(df.schema)

// Rebuild a DataFrame on top of the wrapped RDD.
val resultingDataFrame = new Dataset[Row](spark, logicalPlan, rowEncoder)
resultingDataFrame

On Mon, Aug 14, 2017 at 2:15 PM, Lukas Bradley wrote:
> [original message quoted in full above]
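For the narrower case of "triggered when a SQL command completes", Spark also exposes a driver-side hook, QueryExecutionListener, which fires after each Dataset action without modifying the plan at all. A sketch, again using an illustrative local session:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("listener-sketch")
  .getOrCreate()

// Driver-side hook: runs after Dataset actions, independent of the data itself.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"status: action '$funcName' completed in ${durationNs / 1000000} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"status: action '$funcName' failed: ${exception.getMessage}")
})

val n = spark.range(100).count() // an action such as count() triggers onSuccess
spark.stop()
```

This reports per-query status only, not progress within a long-running query, so it complements rather than replaces the accumulator approach above.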
[Spark Core] Is it possible to insert a function directly into the Logical Plan?
We have had issues with gathering status on long-running jobs. We have attempted to draw parallels between the Spark UI/Monitoring API and our code base. Due to the separation between code and the execution plan, even having a guess as to where we are in the process is difficult. The Job/Stage/Task information is too abstracted from our code to be easily digested by non-Spark engineers on our team.

Is there a "hook" to which I can attach a piece of code that is triggered when a point in the plan is reached? This could be when a SQL command completes, or when a new Dataset is created, anything really...

It seems Dataset.checkpoint() offers an excellent snapshot position during execution, but I'm concerned I'm short-circuiting the optimal execution of the full plan. I really want these trigger functions to be completely independent of the actual processing itself. I'm not looking to extract information from a Dataset, RDD, or anything else. I essentially want to write independent output for status.

If this doesn't exist, is there any desire on the dev team for me to investigate this feature?

Thank you for any and all help.

Lukas