Recompute Spark outputs intelligently

2017-12-15 Thread Ashwin Raju
Hi, We have a batch-processing application that reads log files over multiple days, does transformations and aggregations on them using Spark, and saves various intermediate outputs to Parquet. These jobs take many hours to run. This pipeline is deployed at many customer sites with some site
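
[A minimal sketch of one common approach to the "recompute intelligently" question: skip a day's work when its Parquet output already exists. The path layout and the helper name output_exists are hypothetical illustrations, not from the original thread.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("incremental-recompute").getOrCreate()

    def output_exists(path):
        """Return True if a readable Parquet dataset already exists at `path`."""
        try:
            spark.read.parquet(path).limit(1).collect()
            return True
        except Exception:  # e.g. AnalysisException when the path is missing
            return False

    for day in ["2017-12-01", "2017-12-02"]:
        out_path = "/data/output/dt=%s" % day      # assumed layout
        if output_exists(out_path):
            continue                               # already computed, skip
        logs = spark.read.text("/data/logs/dt=%s" % day)
        # ... transformations and aggregations would go here ...
        logs.write.mode("overwrite").parquet(out_path)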

Re: Spark 2.2 streaming with append mode: empty output

2017-08-15 Thread Ashwin Raju
> ... be updated any more. > See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking > > On Mon, Aug 14, 2017 at 4:09 PM, Ashwin Raju <ther...@gmail.com> wrote: > >> Hi, >> >> I am running Spa
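
[For context, the guide section linked above explains why the question below sees no output: in append mode, a windowed aggregate row is emitted only after the watermark passes the end of its window. A minimal, self-contained sketch using the built-in rate source; names and options are illustrative, not from the thread.]

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

    # The rate source generates (timestamp, value) rows, handy for a demo.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    agg = (events
        .withWatermark("timestamp", "1 minute")
        .groupBy(F.window("timestamp", "1 day"))
        .agg(F.sum("value").alias("total")))

    # In append mode, a window's row is emitted only once the watermark
    # (max seen event time minus 1 minute) passes the window's end, so a
    # 1-day window produces nothing until the following day's data arrives.
    query = (agg.writeStream
        .outputMode("append")
        .format("console")
        .start())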

Spark 2.2 streaming with append mode: empty output

2017-08-14 Thread Ashwin Raju
Hi, I am running Spark 2.2 and trying out structured streaming. I have the following code: from pyspark.sql import functions as F df = frame \ .withWatermark("timestamp", "1 minute") \ .groupBy(F.window("timestamp", "1 day"), *groupby_cols) \ .agg(F.sum('bytes')) query =
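
[The snippet is cut off at "query ="; below is one plausible completion under stated assumptions. The sink type, paths, and checkpoint location are guesses for illustration, not from the original post.]

    query = (df.writeStream
        .outputMode("append")              # rows appear only after the watermark
        .format("parquet")                 # passes each 1-day window's end
        .option("path", "/data/agg")       # assumed output path
        .option("checkpointLocation", "/data/agg_chk")
        .start())

    # If intermediate per-window results are wanted on every trigger,
    # .outputMode("update") emits changed aggregate rows instead of waiting;
    # note that update mode cannot write to a file sink.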

Reusing dataframes for streaming (spark 1.6)

2017-08-08 Thread Ashwin Raju
Hi, We've built a batch application on Spark 1.6.1. I'm looking into how to run the same code as a streaming (DStream-based) application, using PySpark. In the batch application, we have a sequence of transforms that read from files, do DataFrame operations, then write to files. I was
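
[One common pattern for this, as a hedged sketch rather than what the thread necessarily settled on: convert each micro-batch RDD to a DataFrame inside foreachRDD and call the same transform function the batch job uses. The function names, schema, and paths below are illustrative assumptions.]

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="dstream-reuse")
    ssc = StreamingContext(sc, batchDuration=60)
    sqlContext = SQLContext(sc)

    def transform(df):
        # the same DataFrame logic the batch pipeline applies
        return df.groupBy("host").count()

    lines = ssc.textFileStream("/data/incoming")  # assumed input directory

    def process(time, rdd):
        if rdd.isEmpty():
            return
        # build a one-column DataFrame from each micro-batch (assumed schema)
        df = sqlContext.createDataFrame(
            rdd.map(lambda l: (l.split(" ")[0],)), ["host"])
        transform(df).write.mode("append").parquet("/data/out")  # assumed sink

    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()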