Re: Train ML models on each partition

2019-05-09 Thread Dillon Dukek
Hi Qian, The way that I have gotten around this type of problem in the past is to do a groupBy on the dimensions that you want to build a model for, and then initialize and train a model with a package like scikit-learn for each group inside something like a grouped map pandas UDF. If you need these…
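
To make the idea concrete, here is a minimal sketch of the grouped map pandas UDF approach, assuming Spark 2.4+ with PyArrow and scikit-learn available; the column names (dimension, x, y) and the linear model are hypothetical stand-ins for whatever the real data contains.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from sklearn.linear_model import LinearRegression

spark = SparkSession.builder.getOrCreate()

# One output row per group: the group key plus whatever the model produces.
result_schema = StructType([
    StructField("dimension", StringType()),
    StructField("coefficient", DoubleType()),
])

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def train_model(pdf):
    # Each invocation receives all rows for one group as a pandas DataFrame,
    # so an ordinary single-node scikit-learn fit works here.
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({
        "dimension": [pdf["dimension"].iloc[0]],
        "coefficient": [float(model.coef_[0])],
    })

# Hypothetical data: "dimension" is the grouping key, (x, y) the training pairs.
df = spark.createDataFrame(
    [("a", 1.0, 2.0), ("a", 2.0, 4.1), ("b", 1.0, 3.0), ("b", 2.0, 5.9)],
    ["dimension", "x", "y"],
)
models = df.groupBy("dimension").apply(train_model)
models.show()
```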

Re: Unable to broadcast a very large variable

2019-04-12 Thread Dillon Dukek
That's fine. The other points that I mentioned still apply. On Thu, Apr 11, 2019 at 4:52 PM V0lleyBallJunki3 wrote: > I am not using pyspark. The job is written in Scala

Re: Unable to broadcast a very large variable

2019-04-10 Thread Dillon Dukek
You will probably need to do a couple of things. First, increase the "spark.sql.broadcastTimeout" setting. Second, when you broadcast a variable it gets replicated once per executor, not once per machine, so you will need to increase your executor size and allow more cores…
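
As a rough illustration of those settings (the values are assumptions and depend entirely on the cluster), they can be supplied when the session is built:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Give the broadcast more time than the 300s default to build and ship.
    .config("spark.sql.broadcastTimeout", "1200")
    # The broadcast is replicated once per executor, not once per machine,
    # so fewer, larger executors with more cores mean fewer copies in memory.
    .config("spark.executor.memory", "16g")
    .config("spark.executor.cores", "8")
    .getOrCreate()
)
```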

Re: Troubleshooting Spark OOM

2019-01-09 Thread Dillon Dukek
> …smaller but more numerous executors are preferred over larger and fewer ones. Changing the GC algorithm: http://orastack.com/spark-scaling-to-large-datasets.html Here are a few tips… On Wed,
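
A hedged sketch of the kind of tuning the quoted reply is pointing at; the numbers are placeholders, not recommendations, and the G1 flag is just one common choice of GC algorithm:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # More, smaller executors rather than a few very large ones.
    .config("spark.executor.instances", "20")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    # Change the GC algorithm on the executor JVMs, e.g. to G1.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)
```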

Re: Troubleshooting Spark OOM

2019-01-09 Thread Dillon Dukek
Hi William, Just to get started, can you describe the Spark version you are using and the language? It doesn't sound like you are using pyspark; however, problems arising from that can be different, so I just want to be sure. Also, can you talk through the scenario under which you are dealing…

Re: Shuffle write explosion

2018-11-05 Thread Dillon Dukek
What is your function in mapToPair doing? -Dillon On Mon, Nov 5, 2018 at 1:41 PM Taylor Cox wrote: > At first glance, I wonder if your tables are partitioned? There may not be enough parallelism happening. You can also pass in the number of partitions and/or a custom partitioner to help…
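
For reference, a small sketch of the two knobs the quoted reply mentions, using made-up pair data; the partition count of 200 is arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

# Passing an explicit partition count spreads the shuffle over more tasks.
summed = pairs.reduceByKey(lambda x, y: x + y, numPartitions=200)

# A custom partitioner controls which partition each key lands in, which can
# help when the default hash partitioning leaves some partitions very large.
custom = pairs.partitionBy(200, partitionFunc=lambda key: hash(key) % 200)

print(summed.getNumPartitions(), custom.getNumPartitions())
```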

Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Dillon Dukek
> …shows up in the Storage tab of either Zeppelin or spark-shell. However, I have several running applications in production that do show the data in cache. I am using Scala and Spark 2.2.1 in EMR. Any workarounds to see the data in cache? On Mon,

Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-15 Thread Dillon Dukek
In your program, persist the smaller table and use count to force it to materialize. Then, in the Spark UI, go to the Storage tab. The size of your table as Spark sees it should be displayed there. Out of curiosity, what version / language of Spark are you using? On Mon, Oct 15, 2018 at 11:53 AM
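
A minimal sketch of that check; the table name is a hypothetical placeholder:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumption: "small_dim_table" stands in for whichever table is the smaller side.
small_df = spark.table("small_dim_table")

small_df.persist(StorageLevel.MEMORY_ONLY)
small_df.count()  # forces materialization so the Storage tab reports a size
```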

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
> …and it just saves processing time in terms of allocating containers. That said, I am still trying to understand how we determine one YARN container = one executor in Spark. Regards, Gourav On Tue, Oct 9, 2018 at 9:04 PM Dillon Dukek wrote: …

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
> …my head on this one. But how do we know that 1 YARN container = 1 executor? Regards, Gourav Sengupta On Tue, Oct 9, 2018 at 8:53 PM Dillon Dukek wrote: >> Can you send how you are launching your streaming process? Also what environment is this cl…

Re: Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread Dillon Dukek
Can you share how you are launching your streaming process? Also, what environment is this cluster running in (EMR, GCP, self-managed, etc.)? On Tue, Oct 9, 2018 at 10:21 AM kant kodali wrote: > Hi All, > I am using Spark 2.3.1 and using YARN as a cluster manager. > I currently got > 1) 6…
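
For context, a hypothetical launch configuration of the kind the question is asking about; on YARN, each Spark executor runs inside its own YARN container (plus one container for the application master), so the requested executor count and size determine how many containers YARN hands out. All names and numbers below are assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName("streaming-job")  # hypothetical application name
    # 6 executors -> 6 executor containers, plus 1 container for the AM.
    .config("spark.executor.instances", "6")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)
```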

Re: [Spark SQL]: Java Spark Classes With Attributes of Type Set In Datasets

2018-09-25 Thread Dillon Dukek
Actually, walking through it in a debug terminal, it appears that the deserializer can properly transform the data on read to an ArrayType, but the serializer doesn't know what to do when we try to go back out from the internal Spark representation. tags, if…

Re: DirectFileOutputCommitter in Spark 2.3.1

2018-09-19 Thread Dillon Dukek
I believe you need to set mapreduce.fileoutputcommitter.algorithm.version to 2. On Wed, Sep 19, 2018 at 10:45 AM Priya Ch wrote: > Hello Team, > I am trying to write a DataSet as a parquet file in Append mode, partitioned by a few columns. However, since the job is time consuming, I would like to…
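
A hedged sketch combining the suggested committer setting with the write pattern described in the question; the output path and partition columns are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    # Algorithm version 2 commits task output directly to the final location,
    # skipping the slow sequential rename step done by the default (version 1).
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()
)

# Assumption: a stand-in DataFrame; "year"/"month" and the path are placeholders.
df = spark.range(1000).withColumn("year", F.lit(2018)).withColumn("month", F.lit(9))

(df.write
   .mode("append")
   .partitionBy("year", "month")
   .parquet("/tmp/output"))
```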