Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
Hi, Responding to your queries: I am using Spark 2.2.1. I have tried with dynamic resource allocation both turned on and off and have encountered the same behaviour. The way the data is being read is that filepaths (for each independent data set) are passed to a method, then the method does the

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jörn Franke
Additionally, what I meant by modularization is that jobs that really have nothing to do with each other should be in separate Python programs.

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Thakrar, Jayesh
Disclaimer - I use Spark with Scala and not Python. But I am guessing that Jorn's reference to modularization is to ensure that you do the processing inside methods/functions and call those methods sequentially. I believe that as long as an RDD/dataset variable is in scope, its memory may not be released.
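
A minimal PySpark sketch of that function-scoped approach (the paths and the transformation are hypothetical placeholders, not the poster's code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def process_dataset(input_path, output_path):
        # All DataFrame references live only inside this function,
        # so they fall out of scope as soon as it returns.
        df = spark.read.json(input_path)
        result = df.filter("actionType is not null")   # placeholder transformation
        result.write.mode("overwrite").parquet(output_path)

    # Call the method sequentially, once per independent data set.
    for in_path, out_path in [("/data/ds1", "/out/ds1"), ("/data/ds2", "/out/ds2")]:
        process_dataset(in_path, out_path)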

Re: spark partitionBy with partitioned column in json output

2018-06-04 Thread Jay
The partitionBy clause is used to create Hive-style folders so that you can point a Hive partitioned table at the data. What are you using partitionBy for? What is the use case?
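
For context, a minimal sketch of what such a write looks like (the input and output paths are illustrative, not from the original post):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("/in/actions.json")   # hypothetical input

    # partitionBy produces Hive-style directories such as
    #   /out/actions/bucket=B01/part-....json
    #   /out/actions/bucket=B02/part-....json
    # which a Hive partitioned table can be pointed at directly.
    df.write.partitionBy("bucket").json("/out/actions")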

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jay
Can you tell us what version of Spark you are using and whether Dynamic Allocation is enabled? Also, how are the files being read? Is it a single read of all files using a file-matching regex, or are you running different threads in the same pyspark job?

Re: spark partitionBy with partitioned column in json output

2018-06-04 Thread Lalwani, Jayesh
Purna, this behavior is by design. If you provide partitionBy, Spark removes the partition columns from the written data.
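
A hedged sketch of one common workaround (my addition, not from this thread): duplicate the partition column before writing, so the value still appears inside each JSON record as well as in the directory name.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.json("/in/actions.json")   # hypothetical input

    # Spark drops the partitionBy column from the written files (it lives in the
    # directory name instead), so keep a copy under another name if you need the
    # value inside the JSON records themselves.
    out = df.withColumn("bucket_value", col("bucket"))
    out.write.partitionBy("bucket").json("/out/actions")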

spark partitionBy with partitioned column in json output

2018-06-04 Thread purna pradeep
I'm reading the below JSON in Spark:
{"bucket": "B01", "actionType": "A1", "preaction": "NULL", "postaction": "NULL"}
{"bucket": "B02", "actionType": "A2", "preaction": "NULL", "postaction": "NULL"}
{"bucket": "B03", "actionType": "A3", "preaction": "NULL", "postaction": "NULL"}
val

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
Thanks a lot for the insight. Actually I have exactly the same transformations for all the datasets, hence only one Python script. Now, do you suggest that I run a separate spark-submit for each of the different datasets, given that I have exactly the same transformations?

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jörn Franke
Yes, if they are independent with different transformations, then I would create a separate Python program for each. Especially with big data processing frameworks, one should avoid putting everything into one big monolithic application.
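
A minimal sketch of that idea (the per-dataset script name and paths are hypothetical): each dataset gets its own spark-submit, so each run's driver and executor memory is released when that application exits.

    import subprocess

    datasets = ["/data/ds1", "/data/ds2", "/data/ds3"]   # hypothetical paths

    for path in datasets:
        # Each invocation is an independent Spark application; all of its
        # memory is freed when the application finishes.
        subprocess.run(["spark-submit", "process_one_dataset.py", path], check=True)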

Apply Core Java Transformation UDF on DataFrame

2018-06-04 Thread Chetan Khatri
All, I would like to apply a core Java transformation UDF on a DataFrame created from a table or flat files, and return a new DataFrame object. Any suggestions, with respect to Spark internals? Thanks.
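
One possible direction, sketched from PySpark (assuming Spark 2.3+ and a hypothetical Java class com.example.UpperCaseUdf implementing org.apache.spark.sql.api.java.UDF1<String, String>); this is an assumption about the setup, not a confirmed answer:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()

    # The compiled Java UDF jar must be on the classpath (e.g. spark-submit --jars).
    # Register it under a SQL-callable name with its return type.
    spark.udf.registerJavaFunction("upper_case", "com.example.UpperCaseUdf", StringType())

    df = spark.table("my_table")   # hypothetical source table
    df.selectExpr("upper_case(name) AS name_upper").show()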

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
Hi, Thanks for the input. I was trying to get the functionality working first, hence I was using local mode. I will definitely be running on a cluster, but later. Sorry for my naivety, but can you please elaborate on the modularity concept that you mentioned and how it will affect whatever I am already

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jörn Franke
Why don’t you modularize your code and write an independent Python program for each process, submitted via Spark? I am not sure, though, whether Spark local mode makes sense here. If you don’t have a cluster, then a normal Python program can be much better.

[PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Shuporno Choudhury
Hi everyone, I am trying to run a PySpark job on some data sets sequentially [basically 1. Read data into a dataframe, 2. Perform some join/filter/aggregation, 3. Write the modified data in parquet format to a target location]. Now, while running this PySpark code across *multiple independent data sets

A code example of Catalyst optimization

2018-06-04 Thread Jean Georges Perrin
Hi there, I am looking for an example of optimization through Catalyst that can be demonstrated via code. Typically, you load some data into a dataframe, you do something, you do the opposite operation, and, when you collect, it’s super fast because nothing really happened to the data.
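
A small PySpark sketch of that kind of demonstration (my own example, not from the thread): add a derived column, then undo it and drop it; the optimized plan shows Catalyst prunes the whole computation away, so the action only counts the ids.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

    df = spark.range(100000000)                                    # cheap synthetic data
    something = df.withColumn("x", col("id") * 2)                  # "do something"
    undone = something.withColumn("x", col("x") / 2).drop("x")     # "do the opposite"

    undone.explain(True)    # optimized logical plan collapses to a bare Range: x is never computed
    print(undone.count())   # fast, because nothing really happened to the data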

Re: [External] Re: Sorting in Spark on multiple partitions

2018-06-04 Thread Jörn Franke
I think there is also a misunderstanding of how repartition works. It keeps the existing number of partitions, but hash-partitions according to userid. That means each partition is likely to contain different user ids. That would also explain your observed behavior. However, without having the full
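
A small sketch of that point (the column names are assumptions about the poster's data): hash repartitioning by userid does not give one user per partition, so the sort within each partition has to include the user id to keep each user's rows together.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/in/events")     # hypothetical input with userid/eventdate columns

    repartitioned = df.repartition("userid")  # hash-partitions rows by userid
    # Many distinct userids hash into the same partition, so sort within each
    # partition by (userid, eventdate) to keep each user's rows ordered together.
    ordered = repartitioned.sortWithinPartitions("userid", "eventdate")
    ordered.write.parquet("/out/events_sorted")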

Re: [External] Re: Sorting in Spark on multiple partitions

2018-06-04 Thread Jörn Franke
How do you load the data? How do you write it? I fear that without the full source code it will be difficult to troubleshoot the issue. Which Spark version? The use case is not yet 100% clear to me. You want to set the row with the oldest/newest date to true? I would just use top or something similar
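
For the flag-the-newest-row part, a hedged sketch using a window function (an alternative I am naming explicitly; it is not necessarily what was meant by "top", and the column names are assumptions):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, row_number

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/in/events")   # hypothetical input with userid/eventdate columns

    # Rank each user's rows by date, newest first, and flag only the newest one.
    w = Window.partitionBy("userid").orderBy(col("eventdate").desc())
    flagged = df.withColumn("is_latest", row_number().over(w) == 1)
    flagged.write.parquet("/out/events_flagged")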

is there a way to create a static dataframe inside mapGroups?

2018-06-04 Thread kant kodali
Hi All, Is there a way to create a static dataframe inside mapGroups, given that mapGroups gives an Iterator of rows? I just want to take that iterator and populate a static dataframe so I can run raw SQL queries on it. Thanks!

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
Yes, the issue is with the array type only; I have confirmed that. I exploded the array to a struct but am still getting the same error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct <> struct at the 21th column

Re: Spark structured streaming generate output path runtime

2018-06-04 Thread Swapnil Chougule
Thanks Jayesh. It worked for me. ~Swapnil On Fri, 1 Jun 2018, 7:10 pm Lalwani, Jayesh, wrote: > This will not work the way you have implemented it. The code that you have here will be called only once before the streaming query is started. Once the streaming query starts, this code is not

Re: testing frameworks

2018-06-04 Thread Spico Florin
Hello! Thank you very much for your helpful answer and for the very good job done in spark-testing-base. I managed to perform unit testing with the spark-testing-base library as in the provided article and also got inspired from

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Jorge Machado
Have you tried to narrow down the problem so that we can be 100% sure that it lies in the array types? Just exclude them for the sake of testing. If we know 100% that it is this array stuff, try to explode those columns into simple types. Jorge Machado

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
I am ordering the columns before doing the union, so I think that should not be an issue: String[] columns_original_order = baseDs.columns(); String[] columns = baseDs.columns(); Arrays.sort(columns); baseDs = baseDs.selectExpr(columns);
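
For reference, a minimal PySpark sketch (my addition, assuming Spark 2.3+) of unionByName, which resolves columns by name instead of position and thus removes column order as a variable; it is not the poster's Java code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    base = spark.read.parquet("/in/base")     # hypothetical inputs with identical schemas
    extra = spark.read.parquet("/in/extra")

    # unionByName matches columns by name, so a differing column order cannot cause
    # a positional type mismatch; any remaining error points at the types themselves.
    combined = base.unionByName(extra)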

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Jorge Machado
Try the same union with a dataframe without the array types. It could be something strange there, like the ordering or so. Jorge Machado

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-04 Thread Pranav Agrawal
Schema is exactly the same, not sure why it is failing though.
root
 |-- booking_id: integer (nullable = true)
 |-- booking_rooms_room_category_id: integer (nullable = true)
 |-- booking_rooms_room_id: integer (nullable = true)
 |-- booking_source: integer (nullable = true)
 |-- booking_status: