mapGroupsWithState in Python

2018-01-28 Thread ayan guha
Hi, I want to write something in Structured Streaming: 1. I have a dataset which has 3 columns: id, last_update_timestamp, attribute. 2. I am receiving the data through Kinesis. I want to deduplicate records based on last_updated. In batch, it looks like: spark.sql("select * from (Select *,
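The deduplication described above (keep only the most recent record per id) can be sketched in plain Python as a keyed state that is updated batch by batch. This is only an illustration of the logic, not Spark or Structured Streaming API; the `Record` type, the `update_latest` helper, and the integer timestamps are assumptions made for the example.

```python
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    id: str
    last_update_timestamp: int  # assumed to be comparable, e.g. epoch seconds
    attribute: str


def update_latest(state: Dict[str, Record], batch: Iterable[Record]) -> Dict[str, Record]:
    """Fold one micro-batch into per-id state, keeping the newest record per id."""
    for rec in batch:
        current = state.get(rec.id)
        # Replace the stored record only if the incoming one is strictly newer.
        if current is None or rec.last_update_timestamp > current.last_update_timestamp:
            state[rec.id] = rec
    return state


# Hypothetical data: two batches arriving one after the other.
state: Dict[str, Record] = {}
update_latest(state, [Record("a", 1, "x"), Record("b", 5, "y")])
update_latest(state, [Record("a", 3, "z"), Record("b", 2, "stale")])
```

After both batches, `state` holds the newer record for "a" and ignores the late, older record for "b". A stateful streaming operator would keep the same kind of per-key state between micro-batches.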

How and when the types of the result set are figured out in Spark?

2018-01-28 Thread kant kodali
Hi All, I would like to know how and when the types of the result set are figured out in Spark. For example, say I have the following dataframe.
*inputdf*
col1 | col2 | col3
---
1 | 2 | 5
2 | 3 | 6
Now say I do something like the below (pseudo SQL): resultdf = select

unsubscribe

2018-01-28 Thread 韩盼
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Spark Dataframe Writer _temporary directory

2018-01-28 Thread Richard Primera
In a situation where multiple workflows write different partitions of the same table. Example: 10 different processes are writing Parquet or ORC files for different partitions of the same table foo, at

Re: write parquet with statistics min max with binary field

2018-01-28 Thread Stephen Joung
After setting `parquet.strings.signed-min-max.enabled` to `true` in `ShowMetaCommand.java`, `parquet-tools meta` shows min and max.
@@ -57,8 +57,9 @@ public class ShowMetaCommand extends ArgsOnlyCommand {
String[] args = options.getArgs();
String input = args[0];

Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Dongjoon Hyun
Hi, Nicolas. Yes. In Apache Spark 2.3, there are new sub-improvements for SPARK-20901 (Feature parity for ORC with Parquet). For your questions, the following three are related. 1. spark.sql.orc.impl="native" By default, `native` ORC implementation (based on the latest ORC 1.4.1) is added.

Re: Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-28 Thread Nicolas Paris
Hi, thanks for this work. Will this affect both: 1) spark.read.format("orc").load("...") 2) spark.sql("select ... from my_orc_table_in_hive")?
On 10 Jan 2018 at 20:14, Dongjoon Hyun wrote:
> Hi, All.
>
> Vectorized ORC Reader is now supported in Apache Spark 2.3.

Re: S3 token times out during data frame "write.csv"

2018-01-28 Thread Jörn Franke
He is using CSV and either ORC or Parquet would be fine.
> On 28. Jan 2018, at 06:49, Gourav Sengupta wrote:
>
> Hi,
>
> There is definitely a parameter while creating temporary security credential to mention the number of minutes those credentials will be active.