Hi,
We have a batch processing application that reads log files spanning multiple
days, performs transformations and aggregations on them using Spark, and saves
various intermediate outputs to Parquet. These jobs take many hours to run.
This pipeline is deployed at many customer sites, with some sites [...]

> [...] be updated any more.
> See http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
>
> On Mon, Aug 14, 2017 at 4:09 PM, Ashwin Raju <ther...@gmail.com> wrote:
Hi,
I am running Spark 2.2 and trying out structured streaming. I have the
following code:
from pyspark.sql import functions as F

df = frame \
    .withWatermark("timestamp", "1 minute") \
    .groupBy(F.window("timestamp", "1 day"), *groupby_cols) \
    .agg(F.sum("bytes"))
query =
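To make the watermark semantics in the snippet above concrete, here is a minimal plain-Python sketch of a 1-day tumbling window aggregation that drops events arriving more than one minute behind the maximum event time seen so far. The helper names (`window_start`, `aggregate`) and the tuple layout are illustrative assumptions, not Spark APIs:

```python
from collections import defaultdict

WINDOW = 24 * 60 * 60      # 1 day, in seconds
WATERMARK_DELAY = 60       # 1 minute, in seconds

def window_start(ts):
    # Assign an event timestamp to the start of its 1-day tumbling window.
    return ts - (ts % WINDOW)

def aggregate(events):
    """Sum 'bytes' per (window, key), dropping events that fall behind the
    watermark (max event time seen so far, minus the allowed delay)."""
    totals = defaultdict(int)
    max_event_time = 0
    for ts, key, nbytes in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - WATERMARK_DELAY
        if ts < watermark:
            # Too late: Spark would no longer update this window's state.
            continue
        totals[(window_start(ts), key)] += nbytes
    return dict(totals)
```

For example, `aggregate([(100, "a", 10), (200, "a", 5), (50, "b", 7)])` keeps the first two events but drops the third, because by then the watermark has advanced to 140 and the event at time 50 is behind it.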
Hi,
We've built a batch application on Spark 1.6.1. I'm looking into how to run
the same code as a streaming (DStream-based) application. This is using
PySpark.
In the batch application, we have a sequence of transforms that read from
file, do dataframe operations, then write to file. I was