Re: Update / Delete records in Parquet

2019-05-01 Thread Vitaliy Pisarev
Ankit, you should take a look at delta.io, which was recently open-sourced by Databricks. Full DML support is on the way. From: "Khare, Ankit" Date: Tuesday, 23 April 2019 at 11:35 To: Chetan Khatri , Jason Nerothin Cc: user Subject: Re: Update / Delete records in Parquet
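For reference, a hedged sketch of the kind of DML the Delta Lake Python API exposes once a table has been written in Delta format; the path, predicates and column names below are hypothetical, and availability depends on the Delta Lake version in use:

    from delta.tables import DeltaTable

    # Assumes `spark` is an existing SparkSession with the Delta Lake package on
    # the classpath, and a table previously written with .format("delta").
    dt = DeltaTable.forPath(spark, "/data/events_delta")  # hypothetical path

    # Delete rows matching a predicate.
    dt.delete("event_date < '2019-01-01'")

    # Update a column for rows matching a predicate; values are SQL expression strings.
    dt.update(condition="status = 'stale'", set={"status": "'archived'"})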

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
u have access to the Spark UI – what is the peak memory that you see > for the executors? > > The UI will also give you the time spent on GC by each executor. > > So even if you completely eliminated all GC, that’s the max time you can > potentially save.

Re: Testing Apache Spark applications

2018-11-15 Thread Vitaliy Pisarev
Hard to answer in a succinct manner but I'll give it a shot. Cucumber is a tool for writing *Behaviour* Driven Tests (closely related to behaviour driven development, BDD). It is not a mere *technical* approach to testing but a mindset, a way of working, and a different (different, whether it is

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
t, see how > cores are being used? > > Regards, > Shahbaz > > On Thu, Nov 15, 2018 at 10:58 PM Vitaliy Pisarev < > vitaliy.pisa...@biocatch.com> wrote: > >> Oh, regarding shuffle.partitions being 30k, I don't know. I inherited >> the workload from an

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
Oh, regarding shuffle.partitions being 30k, I don't know. I inherited the workload from an engineer who is no longer around and am trying to make sense of things in general. On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev < vitaliy.pisa...@biocatch.com> wrote: > The ques

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
– are you trying to > maximize utilization, hoping that high parallelism will reduce > your total runtime? > > > > > > *From: *Vitaliy Pisarev > *Date: *Thursday, November 15, 2018 at 10:07 AM > *To: * > *Cc: *user , David Markovitz < > dudu.markov...@microso

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
I. > > SparkContext.setJobGroup(….) > > SparkContext.setJobDescription(….) > > > > *From: *Vitaliy Pisarev > *Date: *Thursday, November 15, 2018 at 8:51 AM > *To: *user > *Cc: *David Markovitz > *Subject: *How to address seemingly low core utilization on a spark
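For context, a minimal PySpark sketch of how those two hooks might be used to label jobs so that gaps in the timeline can be attributed in the Spark UI; the group id and description are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Tag the jobs triggered after this point so they are easy to find in the UI.
    sc.setJobGroup("join-phase", "Join profiles with events")
    sc.setJobDescription("Join profiles with events")

    # ...trigger an action here, e.g. df.count(), and look for the label in the UI.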

How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
I have a workload that runs on a cluster of 300 cores. Below is a plot of the amount of active tasks over time during the execution of this workload: [image: plot of active tasks over time] What I deduce is that there are substantial intervals where the cores are heavily under-utilised. What actions can I take to:

Re: How to address seemingly low core utilization on a spark workload?

2018-11-15 Thread Vitaliy Pisarev
the gaps, where there is no spark activity? > > Best regards, > > David (דודו) Markovitz > > Technology Solutions Professional, Data Platform > > Microsoft Israel

Re: Pyspark Partitioning

2018-10-04 Thread Vitaliy Pisarev
Groupby is an operator you would use if you wanted to *aggregate* the values that are grouped by the specified key. In your case you want to retain access to the values. You need to repartition the DataFrame by the key and then you can map the partitions. Of course you need to be careful of potential skews in the
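A minimal PySpark sketch of that approach, assuming a DataFrame df with a key column named "uid" (both names are hypothetical):

    # Co-locate all rows that share a key in the same partition, then process each
    # partition as a plain Python iterator instead of aggregating.
    repartitioned = df.repartition("uid")

    def handle_partition(rows):
        for row in rows:
            yield (row["uid"], row)  # full access to every value, no aggregation

    result = repartitioned.rdd.mapPartitions(handle_partition)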

Re: No space left on device

2018-08-22 Thread Vitaliy Pisarev
change have you changed from caching to persistent data frames? > > > Regards, > Gourav Sengupta > > > > On Tue, Aug 21, 2018 at 12:04 PM Vitaliy Pisarev < > vitaliy.pisa...@biocatch.com> wrote: > >> The other time when I encountered this I solved it by throwing

Re: No space left on device

2018-08-21 Thread Vitaliy Pisarev
The other time when I encountered this I solved it by throwing more resources at it (stronger cluster). I was not able to understand the root cause though. I'll be happy to hear deeper insight as well. On Mon, Aug 20, 2018 at 7:08 PM, Steve Lewis wrote: > > We are trying to run a job that has

Optimizing a join with bucketing

2018-07-26 Thread Vitaliy Pisarev
I am joining two entities. One of them weighs ~0.5 TB, the other ~16 GB. Both are stored in Parquet. Another trait of the problem is that the "smaller" entity does not change, so I figured I'd pre-bucket it to improve performance. * What are the guidelines for deciding the best
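A minimal sketch of the pre-bucketing step in PySpark, assuming the static ~16 GB side is held in a DataFrame small_df, the large side in big_df, and the join key is "id" (names and the bucket count are hypothetical):

    # Bucket and sort the static side once; later equi-joins on "id" against data
    # bucketed the same way can avoid shuffling this side.
    (small_df.write
        .bucketBy(200, "id")      # bucket count is workload-dependent
        .sortBy("id")
        .saveAsTable("small_bkt"))

    # At query time:
    result = spark.table("small_bkt").join(big_df, "id")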

Error when joining on two bucketed tables

2018-06-25 Thread Vitaliy Pisarev
What I did: I have two datasets I need to join. One of the datasets does not change, so I bucket it once and save it in a table. It looks something like: spark.table("profiles").write.bucketBy(500, "uid").saveAsTable("profiles_bkt"). Now I have another dataset that I bucket "online":
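For illustration, a hedged sketch of bucketing both sides the same way before the join; the DataFrame names are hypothetical, and the key point is that both tables share the bucketing column and bucket count:

    # Both sides bucketed on "uid" with the same number of buckets.
    profiles_df.write.bucketBy(500, "uid").saveAsTable("profiles_bkt")
    events_df.write.bucketBy(500, "uid").saveAsTable("events_bkt")

    joined = spark.table("profiles_bkt").join(spark.table("events_bkt"), "uid")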

Re: Does joining table in Spark multiplies selected columns of smaller table?

2018-04-08 Thread Vitaliy Pisarev
e large data in separate files on HDFS and just maintain a file > name in the table. > > On 8. Apr 2018, at 19:52, Vitaliy Pisarev <vitaliy.pisa...@biocatch.com> > wrote: > > I have two tables in spark: > > T1 > |--x1 > |--x2 > > T2 > |--z1 > |--z2 >

Does joining table in Spark multiplies selected columns of smaller table?

2018-04-08 Thread Vitaliy Pisarev
I have two tables in spark:

T1
|--x1
|--x2

T2
|--z1
|--z2

- T1 is much larger than T2
- The values in column z2 are *very large*
- There is a many-to-one relationship between T1 and T2 (via the x2 and z1 columns).

I perform the following query: select T1.x1, T2.z2 from
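A hedged PySpark sketch of the join described above (t1 and t2 are hypothetical DataFrames holding T1 and T2); in a many-to-one join, every matching T1 row carries its own copy of T2.z2 in the result:

    joined = (
        t1.join(t2, t1["x2"] == t2["z1"])
          .select(t1["x1"], t2["z2"])
    )
    # Each T1 row that matches a given T2 row repeats that row's z2 value, so a
    # large z2 payload is materialized once per matching T1 row in the output.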

How does extending an existing parquet with columns affect impala/spark performance?

2018-04-03 Thread Vitaliy Pisarev
This is not strictly a Spark question but I'll give it a shot: I have an existing setup of Parquet files that are being queried from Impala and from Spark. I intend to add some 30 relatively 'heavy' columns to the Parquet files. Each column would store an array of structs. Each struct can have from 5 to
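For concreteness, a hedged sketch of what one such 'heavy' column might look like as a Spark SQL type; the field names are hypothetical:

    from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                                   StructField, StructType)

    # One of the ~30 new columns: an array of small structs.
    heavy_column = StructField(
        "events",                                  # hypothetical column name
        ArrayType(StructType([
            StructField("kind", StringType()),
            StructField("count", IntegerType()),
        ])),
    )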

Best practices for optimizing the structure of parquet schema

2018-03-29 Thread Vitaliy Pisarev
There is a lot of talk that in order to really benefit from fast queries over Parquet and HDFS, we need to make sure the data is stored in a manner that is friendly to compression. Unfortunately, I did not find any specific guidelines or tips online that describe the dos and don'ts of designing

Why doesn't spark use broadcast join?

2018-03-29 Thread Vitaliy Pisarev
I am looking at the physical plan for the following query: SELECT f1,f2,f3,... FROM T1 LEFT ANTI JOIN T2 ON T1.id = T2.id WHERE f1 = 'bla' AND f2 = 'bla2' AND some_date >= date_sub(current_date(), 1) LIMIT 100 An important detail: the table 'T1' can be very large (hundreds of
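A hedged sketch of forcing the broadcast with an explicit hint in PySpark; t1 and t2 are hypothetical DataFrames holding T1 and T2, the hint only makes sense if T2 actually fits in memory, and the automatic behaviour is governed by spark.sql.autoBroadcastJoinThreshold:

    from pyspark.sql.functions import broadcast

    result = (
        t1.join(broadcast(t2), "id", "left_anti")
          .where("f1 = 'bla' AND f2 = 'bla2' AND some_date >= date_sub(current_date(), 1)")
          .limit(100)
    )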

Accessing a file that was passed via --files to spark submit

2018-03-18 Thread Vitaliy Pisarev
I am submitting a script to spark-submit and passing it a file using the --files property. Later on I need to read it in a worker. I don't understand what API I should use to do that. I figured I'd try just: with open('myfile'): but this did not work. I am able to pass the file using the addFile
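A minimal sketch of the lookup usually needed here: files shipped with --files (or sc.addFile) are resolved on the worker through SparkFiles rather than opened by a bare relative path; the surrounding function and RDD names are hypothetical:

    from pyspark import SparkFiles

    def process_partition(rows):
        # Resolve the file's local path on whichever node this partition runs on.
        path = SparkFiles.get("myfile")
        with open(path) as f:
            config = f.read()
        for row in rows:
            yield (row, len(config))  # placeholder use of the file's contents

    # result = some_rdd.mapPartitions(process_partition)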

Re: [EXT] Debugging a local spark executor in pycharm

2018-03-14 Thread Vitaliy Pisarev
t offer the > step-through capability. > > > > Best of luck! > > M > > -- > > Michael Mansour > > Data Scientist > > Symantec CASB > > *From: *Vitaliy Pisarev <vitaliy.pisa...@biocatch.com> > *Date: *Sunday, March 11, 2018 at 8:46 AM > *To: *

Debugging a local spark executor in pycharm

2018-03-11 Thread Vitaliy Pisarev
I want to step through the work of a spark executor running locally on my machine, from Pycharm. I am running explicit functionality, in the form of dataset.foreachPartition(f) and I want to see what is going on inside f. Is there a straightforward way to do it or do I need to resort to remote

Do values adjacent to exploded columns get duplicated?

2018-03-07 Thread Vitaliy Pisarev
This is a fairly basic question but I did not find an answer to it anywhere online. Suppose I have the following data frame (a and b are column names):

a | b
---
1 | [x1,x2,x3,x4]  # this is an array

Now I explode column b and logically get: a | b
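A hedged PySpark sketch of the behaviour in question; exploding b repeats the adjacent value of a once per array element:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, ["x1", "x2", "x3", "x4"])], ["a", "b"])

    # Yields four rows, each with a = 1 and one of x1..x4 in column b.
    df.select(col("a"), explode(col("b")).alias("b")).show()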