Spark lists paths after `write` - how to avoid refreshing the file index?

2019-02-14 Thread peay
Hello, I have a piece of code that looks roughly like this: df = spark.read.parquet("s3://bucket/data.parquet/name=A", "s3://bucket/data.parquet/name=B") df_out = df. # Do stuff to transform df df_out.write.partitionBy("name").parquet("s3://bucket/data.parquet") I specify explicit…
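
A minimal sketch of the pattern described above, assuming the same s3://bucket/data.parquet layout; the "basePath" read option keeps the "name" partition column available in the loaded DataFrame, and mode("append") stands in for whatever write mode the job actually uses:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read only the partitions of interest; "basePath" tells Spark where the
    # partitioned dataset starts so the "name" column is still inferred.
    df = (spark.read
          .option("basePath", "s3://bucket/data.parquet")
          .parquet("s3://bucket/data.parquet/name=A",
                   "s3://bucket/data.parquet/name=B"))

    df_out = df  # placeholder for the actual transformations

    # Write back under the same root, partitioned by "name".
    (df_out.write
     .mode("append")
     .partitionBy("name")
     .parquet("s3://bucket/data.parquet"))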

Re: Watermarking without aggregation with Structured Streaming

2018-09-30 Thread peay
… code though and would like to know if someone can shed more light on this. Regards, Chandan. On Sat, Sep 22, 2018 at 7:43 PM peay wrote: Hello, I am trying to use watermarking without aggregat…

Watermarking without aggregation with Structured Streaming

2018-09-22 Thread peay
Hello, I am trying to use watermarking without aggregation, to filter out records that are just too late, instead of appending them to the output. My understanding is that aggregation is required for `withWatermark` to have any effect. Is that correct? I am looking for something along the
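
There is no stateless "filter by watermark" operator, but a hedged sketch of one possible workaround is below: dropDuplicates combined with withWatermark is a stateful operation, so rows arriving later than the watermark are discarded rather than emitted. The rate source, column names, and the 10-minute threshold are illustrative assumptions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical streaming source; any streaming DataFrame with an
    # event-time column would do.
    events = (spark.readStream
              .format("rate")
              .load()
              .withColumnRenamed("timestamp", "event_time"))

    # withWatermark on its own has no effect without a stateful operator.
    # dropDuplicates is stateful, so records older than the watermark are
    # dropped instead of being passed downstream.
    filtered = (events
                .withWatermark("event_time", "10 minutes")
                .dropDuplicates(["value", "event_time"]))

    query = (filtered.writeStream
             .format("console")
             .outputMode("append")
             .start())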

Spark on YARN in client-mode: do we need 1 vCore for the AM?

2018-05-18 Thread peay
Hello, I run a Spark cluster on YARN, and we have a bunch of client-mode applications we use for interactive work. Whenever we start one of these, an application master container is started. My understanding is that this is mostly an empty shell, used to request further containers or get
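
For reference, a sketch of the client-mode settings that size that application-master container; spark.yarn.am.cores and spark.yarn.am.memory apply only in client mode, and the values below are illustrative assumptions rather than recommendations:

    from pyspark.sql import SparkSession

    # In client mode the AM only runs the ExecutorLauncher, so it can stay small.
    spark = (SparkSession.builder
             .master("yarn")
             .config("spark.yarn.am.cores", "1")      # vCores requested for the AM
             .config("spark.yarn.am.memory", "512m")  # AM container memory (client mode only)
             .getOrCreate())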

Saving dataframes with partitionBy: append partitions, overwrite within each

2017-09-29 Thread peay
Hello, I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet") to write a dataset while splitting by day. I would like to run a Spark job to process, e.g., a month: dataset.parquet/day=2017-01-01/... ... and then run another Spark job to add another month using the same
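
A hedged sketch of how this is typically handled in later Spark releases (2.3+): with dynamic partition overwrite, mode("overwrite") replaces only the day= partitions present in the DataFrame being written and leaves the rest of the dataset alone. The path and column names follow the snippet above; the config name comes from the Spark documentation, not this thread:

    # Only overwrite the partitions that appear in data_frame; other day=
    # directories under dataset.parquet are left untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (data_frame.write
     .mode("overwrite")
     .partitionBy("day")
     .save("dataset.parquet"))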

Configuration for unit testing and sql.shuffle.partitions

2017-09-12 Thread peay
Hello, I am running unit tests with Spark DataFrames, and I am looking for configuration tweaks that would make tests faster. Usually, I use a local[2] or local[4] master. Something that has been bothering me is that most of my stages end up using 200 partitions, independently of whether I
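
A typical test-session sketch along those lines; the exact values are arbitrary, the point being that spark.sql.shuffle.partitions defaults to 200 and can be lowered so a local[2]/local[4] master is not scheduling 200 tiny tasks per shuffle:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .config("spark.sql.shuffle.partitions", "4")  # default is 200
             .config("spark.ui.enabled", "false")          # no UI needed in tests
             .getOrCreate())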

Re: map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-05 Thread peay
Re: map/foreachRDD equivalent for pyspark Structured Streaming, 3 May 2017 10:05 UTC, from tathagata.das1...@gmail.com to peay <p...@protonmail.com>, user@spark.apache.org: You can apply any kind of aggregation on windows. There ar…
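
For illustration, a minimal sketch of the kind of windowed aggregation that reply refers to; streaming_df, event_time, and key are assumed names:

    from pyspark.sql import functions as F

    # Tumbling 5-minute windows over the event-time column, grouped by key.
    windowed_counts = (streaming_df
                       .withWatermark("event_time", "10 minutes")
                       .groupBy(F.window("event_time", "5 minutes"), "key")
                       .count())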

map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-03 Thread peay
Hello, I would like to get started on Spark Streaming with a simple window. I've got some existing Spark code that takes a dataframe, and outputs a dataframe. This includes various joins and operations that are not supported by structured streaming yet. I am looking to essentially map/apply
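
A hedged sketch of what later Spark releases (2.4+) offer for this: foreachBatch hands each micro-batch to a function as an ordinary DataFrame, which is the closest structured-streaming analogue to foreachRDD; streaming_df and the output path are assumptions:

    def process_batch(batch_df, batch_id):
        # batch_df is a plain DataFrame, so existing batch joins and
        # operations can be applied here unchanged.
        transformed = batch_df  # placeholder for the existing transformations
        transformed.write.mode("append").parquet("s3://bucket/output")

    query = (streaming_df.writeStream
             .foreachBatch(process_batch)
             .start())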