Running spark with javaagent configuration

2019-05-15 Thread Anton Puzanov
Hi everyone, I want to run my Spark application with a javaagent; specifically, I want to use newrelic with my application. When I run spark-submit I must pass --conf "spark.driver.extraJavaOptions=-javaagent=" My problem is that I can't specify the full path, as I run in cluster mode and I don't
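
One commonly suggested workaround (a sketch, not a tested recipe for this exact setup): ship the agent jar with --files so YARN localizes it into each container's working directory, then reference it by relative path; note the JVM syntax is -javaagent:&lt;jar&gt;. The local paths, class name, and application jar below are placeholders:

    spark-submit --master yarn --deploy-mode cluster \
      --files /local/path/newrelic.jar,/local/path/newrelic.yml \
      --conf "spark.driver.extraJavaOptions=-javaagent:newrelic.jar" \
      --class com.example.MyApp my-app.jar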

Spark dynamic allocation with special executor configuration

2019-02-25 Thread Anton Puzanov
Hello everyone, Spark has a dynamic resource allocation scheme where, when resources are available, the Spark manager will automatically add executors to the application. Spark's default configuration is for executors to allocate the entire worker node they are running on, but this is configurable; my
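
Dynamic allocation and executor sizing are independent settings, so the two can be combined. A minimal sketch in Java (the numbers are placeholders, not recommendations):

    import org.apache.spark.SparkConf;

    SparkConf conf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")      // required by dynamic allocation
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "20")
        .set("spark.executor.cores", "2")                  // executors no longer take the whole node
        .set("spark.executor.memory", "4g");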

Spark pools

2019-02-24 Thread Anton Puzanov
Hello everyone, I have been looking into Spark pools and have two questions I would really like answered. 1. Are pools available when Yarn is used as the resource manager? 2. Do pools define static partitioning of the cluster? I mean, if I define two pools (using an XML file) with equal
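
For context: fair-scheduler pools schedule jobs within a single application (one SparkContext), independently of the cluster manager, so they also work under Yarn; partitioning the cluster across applications is what Yarn queues do. A sketch of assigning jobs to a pool, assuming a fairscheduler.xml that defines pool1:

    // requires spark.scheduler.mode=FAIR and spark.scheduler.allocation.file pointing at the XML
    spark.sparkContext().setLocalProperty("spark.scheduler.pool", "pool1");
    spark.sql("SELECT ...").count();                                      // runs in pool1
    spark.sparkContext().setLocalProperty("spark.scheduler.pool", null);  // back to the default pool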

Re: Split a row into multiple rows Java

2018-08-01 Thread Anton Puzanov
You can always use array + explode; I don't know if it's the most elegant/optimal solution (would be happy to hear from the experts). Code example: //create data Dataset test= spark.createDataFrame(Arrays.asList(new InternalData("bob", "b1", 1,2,3), new InternalData("alive", "c1", 3,4,6),
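
Spelled out, the array + explode idea looks roughly like this (the column names v1, v2, v3 are assumptions):

    import static org.apache.spark.sql.functions.*;

    // one row in, three rows out: each array element becomes its own row
    Dataset<Row> exploded = test
        .withColumn("value", explode(array(col("v1"), col("v2"), col("v3"))))
        .drop("v1", "v2", "v3");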

How to make Yarn dynamically allocate resources for Spark

2018-08-01 Thread Anton Puzanov
Hi everyone, I have a cluster managed with Yarn that runs Spark jobs; the components were installed using Ambari (2.6.3.0-235). I have 6 hosts, each with 6 cores, and I use the Fair scheduler. I want Yarn to automatically add/remove executor cores, but no matter what I do it doesn't work. Relevant Spark
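
A frequent cause of this symptom (a guess, since the thread is truncated): dynamic allocation scales the number of executors rather than the cores of a running executor, and it silently does nothing unless the external shuffle service runs on every NodeManager. The usual two-sided setup, sketched (the exact Ambari 2.6 steps may differ):

    # yarn-site.xml on each NodeManager (spark-<version>-yarn-shuffle.jar must be on its classpath)
    yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
    yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService

    # Spark side
    spark.dynamicAllocation.enabled = true
    spark.shuffle.service.enabled = true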

Using window function works extremely slowly

2018-01-22 Thread Anton Puzanov
I am trying to use the Spark SQL built-in window function: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String) I run it with step = 1 second and window = 3 minutes (a ratio of 180) and it runs extremely slowly compared to other
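
Worth noting: window(timeColumn, windowDuration, slideDuration) with a 3-minute window sliding every second assigns each row to roughly 180 overlapping windows, so the input is multiplied ~180x before aggregation, which by itself explains much of the slowdown. A sketch of such a query (column names are assumptions):

    import static org.apache.spark.sql.functions.*;

    // every event is copied into ~180 windows (180 s window / 1 s slide)
    Dataset<Row> counts = events
        .groupBy(window(col("eventTime"), "3 minutes", "1 second"), col("key"))
        .count();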

Current way of using functions.window with Java

2018-01-02 Thread Anton Puzanov
I am writing a sliding-window analytics program and use the functions.window function ( https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String,%20java.lang.String) ). The code looks like this: Column slidingWindow =
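
The truncated line presumably continues along these lines (a sketch; the timestamp and userId columns are assumptions):

    import static org.apache.spark.sql.functions.*;

    Column slidingWindow = window(col("timestamp"), "3 minutes", "1 minute");
    Dataset<Row> agg = df.groupBy(slidingWindow, col("userId")).count();
    agg.select(col("window.start"), col("window.end"), col("count")).show();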

Predicate Pushdown Doesn't Work With Data Source API

2017-08-28 Thread Anton Puzanov
Hi everyone, I am trying to improve the performance of data loading from disk. For that I have implemented my own RDD, and now I am trying to increase performance further with predicate pushdown. I have used many sources, including the documentation and
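
For reference, in the Spark 2.x Data Source API, Catalyst only offers predicates to relations that implement interfaces such as PrunedFilteredScan; a bare custom RDD never sees them. A heavily abbreviated Java sketch (readPruned is a hypothetical helper):

    import org.apache.spark.rdd.RDD;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.sources.*;
    import org.apache.spark.sql.types.StructType;

    public class MyRelation extends BaseRelation implements PrunedFilteredScan {
        private final SQLContext ctx;
        private final StructType schema;

        public MyRelation(SQLContext ctx, StructType schema) {
            this.ctx = ctx;
            this.schema = schema;
        }

        @Override public SQLContext sqlContext() { return ctx; }
        @Override public StructType schema() { return schema; }

        // Catalyst hands over the filters it can push down; use them to skip data on disk
        @Override
        public RDD<Row> buildScan(String[] requiredColumns, Filter[] filters) {
            for (Filter f : filters) {
                if (f instanceof EqualTo) {
                    EqualTo eq = (EqualTo) f;
                    // e.g. read only the blocks where eq.attribute() can equal eq.value()
                }
            }
            return readPruned(requiredColumns, filters); // hypothetical pruned reader
        }

        private RDD<Row> readPruned(String[] cols, Filter[] filters) {
            return null; // placeholder for the actual disk-reading logic
        }
    }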