Re: Spark 2.2 streaming with append mode: empty output

2017-08-14 Thread Tathagata Das
In append mode, the aggregation outputs a row only when the watermark has been crossed and the corresponding aggregate is *final*, that is, will not be updated any more. See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking On Mon,

Apache Spark on Kubernetes: New Release for Spark 2.2

2017-08-14 Thread Erik Erlandson
The Apache Spark on Kubernetes Community Development Project is pleased to announce the latest release of Apache Spark with native Scheduler Backend for Kubernetes! Features provided in this release include: - Cluster-mode submission of Spark jobs to a Kubernetes cluster - Support

Spark 2.2 streaming with append mode: empty output

2017-08-14 Thread Ashwin Raju
Hi, I am running Spark 2.2 and trying out structured streaming. I have the following code: from pyspark.sql import functions as F df=frame \ .withWatermark("timestamp","1 minute") \ .groupby(F.window("timestamp","1 day"),*groupby_cols) \ .agg(f.sum('bytes')) query =

Re: [Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-14 Thread Jörn Franke
What about accumulators ? > On 14. Aug 2017, at 20:15, Lukas Bradley wrote: > > We have had issues with gathering status on long running jobs. We have > attempted to draw parallels between the Spark UI/Monitoring API and our code > base. Due to the separation between

Re: [Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-14 Thread Vadim Semenov
Something like this, maybe? import org.apache.spark.sql.Dataset import org.apache.spark.sql.catalyst.expressions.AttributeReference import org.apache.spark.sql.execution.LogicalRDD import org.apache.spark.sql.catalyst.encoders.RowEncoder val df: DataFrame = ??? val spark = df.sparkSession val

[Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-14 Thread Lukas Bradley
We have had issues with gathering status on long running jobs. We have attempted to draw parallels between the Spark UI/Monitoring API and our code base. Due to the separation between code and the execution plan, even having a guess as to where we are in the process is difficult. The

DAGScheduler - two runtimes

2017-08-14 Thread Kaepke, Marc
Hi everyone, I’m a Spark newbie and have one question: What is the difference between the duration measures in my log/ console output? 17/08/14 12:48:58 INFO DAGScheduler: ResultStage 232 (reduce at VertexRDDImpl.scala:90) finished in 0.026 s 17/08/14 12:48:58 INFO DAGScheduler: Job 13

WARN HIVE: Failed to access metastore. This class should not accessed in runtime

2017-08-14 Thread Ascot Moss
Hi, I got following error when start spark thriftserver: WARN HIVE: Failed to access metastore. This class should not accessed in runtime. org.apache.hadoop.hive.ql.metadata.HiveException: Java.lang.RuntimeException: Unable to instantiate