Re: [Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-14 Thread Jörn Franke
What about accumulators?
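
A minimal sketch of that idea, assuming a Spark 2.x SparkSession and a
stand-in RDD (the accumulator name and increment point are illustrative).
A named accumulator can be polled from the driver while the job runs and
also shows up per stage in the Spark UI:

import org.apache.spark.sql.SparkSession

val spark: SparkSession = ???                            // assumed existing session
val rowsProcessed = spark.sparkContext.longAccumulator("rowsProcessed")

val data = spark.sparkContext.parallelize(1 to 1000000)  // stand-in for the real job
val doubled = data.map { x =>
  rowsProcessed.add(1)                                   // executor side: bump per record (or per partition)
  x * 2
}
doubled.count()                                          // the action actually runs the job
println(rowsProcessed.value)                             // driver side: read the progress counter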

> On 14. Aug 2017, at 20:15, Lukas Bradley  wrote:
> 
> We have had issues gathering status on long-running jobs.  We have 
> attempted to draw parallels between the Spark UI/Monitoring API and our code 
> base.  Due to the separation between code and the execution plan, even 
> guessing where we are in the process is difficult.  The Job/Stage/Task 
> information is too abstracted from our code to be easily digested by 
> non-Spark engineers on our team.
> 
> Is there a "hook" to which I can attach a piece of code that is triggered 
> when a point in the plan is reached?  This could be when a SQL command 
> completes, or when a new Dataset is created, anything really...  
> 
> It seems Dataset.checkpoint() offers an excellent snapshot position during 
> execution, but I'm concerned I'm short-circuiting the optimal execution of 
> the full plan.  I really want these trigger functions to be completely 
> independent of the actual processing itself.  I'm not looking to extract 
> information from a Dataset, RDD, or anything else.  I essentially want to 
> write independent output for status.  
> 
> If this doesn't exist, is there any desire on the dev team for me to 
> investigate this feature?
> 
> Thank you for any and all help.


Re: [Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-14 Thread Vadim Semenov
Something like this, maybe?


import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.execution.LogicalRDD
import org.slf4j.LoggerFactory

val df: DataFrame = ???
val spark = df.sparkSession

// Wrap the DataFrame's underlying RDD[InternalRow] so that a status message is
// emitted on the executors as each partition starts being processed.
val rddOfInternalRows = df.queryExecution.toRdd.mapPartitions { iter =>
  val log = LoggerFactory.getLogger("partition-status")  // any executor-side logger works here
  log.info("Test")
  iter
}

// Rebuild a DataFrame on top of the wrapped RDD, preserving the original schema.
val attributes = df.schema.map(f =>
  AttributeReference(f.name, f.dataType, f.nullable, f.metadata)()
)
val logicalPlan = LogicalRDD(attributes, rddOfInternalRows)(spark)
val rowEncoder = RowEncoder(df.schema)
// Note: this Dataset constructor is internal (private[sql]), so the snippet has
// to be compiled in a package under org.apache.spark.sql.
val resultingDataFrame = new Dataset[Row](spark, logicalPlan, rowEncoder)
resultingDataFrame
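
(Note that everything above is lazy: the log call runs on the executors, once
per partition, only when an action is eventually executed against
resultingDataFrame, so building the wrapped DataFrame does not trigger any
extra job.)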

On Mon, Aug 14, 2017 at 2:15 PM, Lukas Bradley wrote:

> We have had issues gathering status on long-running jobs.  We have
> attempted to draw parallels between the Spark UI/Monitoring API and our
> code base.  Due to the separation between code and the execution plan, even
> guessing where we are in the process is difficult.  The Job/Stage/Task
> information is too abstracted from our code to be easily digested by
> non-Spark engineers on our team.
>
> Is there a "hook" to which I can attach a piece of code that is triggered
> when a point in the plan is reached?  This could be when a SQL command
> completes, or when a new Dataset is created, anything really...
>
> It seems Dataset.checkpoint() offers an excellent snapshot position during
> execution, but I'm concerned I'm short-circuiting the optimal execution of
> the full plan.  I really want these trigger functions to be completely
> independent of the actual processing itself.  I'm not looking to extract
> information from a Dataset, RDD, or anything else.  I essentially want to
> write independent output for status.
>
> If this doesn't exist, is there any desire on the dev team for me to
> investigate this feature?
>
> Thank you for any and all help.
>


[Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-14 Thread Lukas Bradley
We have had issues gathering status on long-running jobs.  We have
attempted to draw parallels between the Spark UI/Monitoring API and our
code base.  Due to the separation between code and the execution plan, even
guessing where we are in the process is difficult.  The Job/Stage/Task
information is too abstracted from our code to be easily digested by
non-Spark engineers on our team.

Is there a "hook" to which I can attach a piece of code that is triggered
when a point in the plan is reached?  This could be when a SQL command
completes, or when a new Dataset is created, anything really...
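
To make the ask concrete, a hook of roughly this shape already exists for the
"SQL command completes" case in the form of QueryExecutionListener (a minimal
sketch, assuming Spark 2.x; the println bodies are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark: SparkSession = ???   // assumed existing session

// Fires on the driver each time an action (count, collect, save, ...) on a
// Dataset/DataFrame finishes.
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"action '$funcName' finished in ${durationNs / 1000000} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"action '$funcName' failed: ${exception.getMessage}")
})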

It seems Dataset.checkpoint() offers an excellent snapshot position during
execution, but I'm concerned I'm short-circuiting the optimal execution of
the full plan.  I really want these trigger functions to be completely
independent of the actual processing itself.  I'm not looking to extract
information from a Dataset, RDD, or anything else.  I essentially want to
write independent output for status.

If this doesn't exist, is there any desire on the dev team for me to
investigate this feature?

Thank you for any and all help.

