Need to convert Dataset to HashMap

2018-09-27 Thread rishmanisation
I am writing a data-profiling application that needs to iterate over a large .gz file (imported as a Dataset). Each key-value pair in the HashMap will map a row value to the number of times it occurs in the column. There is one HashMap per column, and they are all added to a JSON at the
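The per-column counting described above can be sketched in plain Python with `collections.Counter` (the sample rows and column names below are hypothetical, not from the thread). Note that in Spark itself, running `df.groupBy(col).count()` per column keeps the counting distributed instead of collecting everything into driver-side HashMaps:

```python
from collections import Counter

# Hypothetical sample rows; in the real application these would come
# from the Dataset read out of the .gz file.
rows = [
    {"country": "US", "status": "active"},
    {"country": "US", "status": "inactive"},
    {"country": "DE", "status": "active"},
]

def column_value_counts(rows):
    """Build one (value -> occurrence count) map per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            # One Counter per column, keyed by the row value.
            counts.setdefault(col, Counter())[value] += 1
    return counts

profile = column_value_counts(rows)
# profile["country"] -> Counter({"US": 2, "DE": 1})
```

Each Counter serializes naturally to a JSON object, which matches the "one hashmap per column, all added to a JSON" shape the poster describes.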

Looking for feedback on a proposal - native support of session window

2018-09-27 Thread Jungtaek Lim
Hi users, I'm Jungtaek Lim, one of the contributors to the streaming part. Recently I proposed a new feature: native support of session window [1]. While it also tackles the edge cases map/flatMapGroupsWithState don't cover for session windows, its major benefit is better usability of session
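The core idea of a session window is that a session stays open as long as events keep arriving within a gap timeout, and a new session starts once the gap is exceeded. A minimal driver-side sketch of that grouping rule (not the proposed Spark API, just the concept):

```python
def sessionize(timestamps, gap):
    """Group event timestamps into sessions: a new session starts
    whenever the gap since the previous event exceeds `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            # Within the gap timeout: extend the current session.
            sessions[-1].append(t)
        else:
            # Gap exceeded (or first event): open a new session.
            sessions.append([t])
    return sessions

# Events at t=1,2,10,11 with a gap of 5 form two sessions.
sessions = sessionize([1, 2, 10, 11], gap=5)  # [[1, 2], [10, 11]]
```

This is also why session windows are awkward to express with fixed/tumbling windows and why map/flatMapGroupsWithState is the current workaround: the window boundaries depend on the data itself rather than on the clock.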

PySpark: batch_df in ForeachBatch - aggregation

2018-09-27 Thread mmuru
Hi, Using the master branch, I tried to perform SQL aggregation on batch_df in foreachBatch; only the SQL API methods work, but not Spark SQL queries on the temp table (registered as a table or view via createOrReplaceTempView). Is this supported? I really appreciate your help.

Data source V2 in Spark 2.4.0

2018-09-27 Thread AssafMendelson
Hi all, I understood from previous threads that the Data Source V2 API will see some changes in Spark 2.4.0; however, I can't seem to find what these changes are. Is there documentation summarizing them? The only mention I can find is this pull request:

Unsubscribe

2018-09-27 Thread Hasunie Adikari
-- AM Hasunie Sandanathala Adikari Tel: 0713095876

RE: spark.lapply

2018-09-27 Thread Junior Alvarez
Around 500 KB each time I call the function (~150 times).

From: Felix Cheung
Sent: 26 September 2018 14:57
To: Junior Alvarez; user@spark.apache.org
Subject: Re: spark.lapply

It looks like the native R process is terminated from a buffer overflow. Do you know how much data is involved?
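The ~500 KB figure matters because spark.lapply serializes the function's arguments for each call before shipping them to the worker's R process. A hedged Python analogue (using pickle in place of R's serialize(), so the numbers are illustrative only) shows how quickly a seemingly small argument list reaches that size:

```python
import pickle

def serialized_size_bytes(obj):
    """Approximate serialized payload size of `obj`, analogous to the
    per-call payload spark.lapply must ship to the R worker."""
    return len(pickle.dumps(obj))

# A list of 100,000 small integers already serializes to a few
# hundred KB, comparable to the ~500 KB payloads in this thread.
payload = list(range(100_000))
size = serialized_size_bytes(payload)
```

Measuring the payload like this (or with R's own `object.size()`/`length(serialize(x, NULL))`) is a quick way to answer Felix's "how much data is involved?" question before tuning buffer or partition settings.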