Re: Spark driver getting out of memory

2016-07-24 Thread Raghava Mutharaju
Saurav, We have the same issue. Our application runs fine on 32 nodes with 4 cores each and 256 partitions, but gives an OOM on the driver when run on 64 nodes with 512 partitions. Did you get to know the reason behind this behavior, or the relation between the number of partitions and driver RAM?

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
args to take it automatically on OutOfMemory error. Analyze it and > share your finding :) > > On Wed, Jun 22, 2016 at 4:33 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: >> Ok. Would you be able to shed more light on what exact meta data it manages
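
The JVM options being referred to here can be wired into a job through Spark's extra-Java-options settings. A minimal sketch, assuming the standard configuration keys; the dump path is made up, and in client mode the driver-side flag has to be passed at launch (spark-submit --driver-java-options or spark-defaults.conf) rather than in application code, since the driver JVM is already running by then.

    import org.apache.spark.{SparkConf, SparkContext}

    // Take a heap dump automatically when an executor hits an OutOfMemoryError.
    // The dump directory is an assumption; point it at a writable path on each worker.
    val oomOpts = "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark-heapdumps"

    val conf = new SparkConf()
      .setAppName("oom-heapdump-sketch")               // master URL supplied at submit time
      .set("spark.executor.extraJavaOptions", oomOpts)
      // For the driver itself, pass the same flags via spark-submit --driver-java-options.

    val sc = new SparkContext(conf)
    // The resulting .hprof files can then be opened in a heap analyzer such as Eclipse MAT.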

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
fair amount of meta data to manage scheduling across all > your executors. I assume with 64 nodes you have more executors as well. > A simple way to test is to increase driver memory. > > On Wed, Jun 22, 2016 at 10:10 AM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: >
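
A minimal sketch of the settings worth raising to test whether driver-side scheduling metadata is the culprit; the values are assumptions, not taken from the thread. Note that spark.driver.memory is only honoured before the driver JVM starts, so in practice it is set with spark-submit --driver-memory or in conf/spark-defaults.conf; it is written out below only to name the relevant keys.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("driver-memory-test")
      .set("spark.driver.memory", "8g")           // head-room for per-task scheduler metadata (launch-time setting)
      .set("spark.driver.maxResultSize", "2g")    // fail with a clear error instead of an OOM when results grow
      .set("spark.ui.retainedStages", "200")      // keep fewer completed stages/jobs in driver memory
      .set("spark.ui.retainedJobs", "200")

    val sc = new SparkContext(conf)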

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/> > <http://in.linkedin.com/in/sonalgoyal> > > On Wed, Jun 22, 2016 at 9:57 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: >> Hello All

OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
Hello All, We have a Spark cluster where the driver and master are running on the same node. We are using the Spark Standalone cluster manager. If the number of nodes (and partitions) is increased, the same dataset that used to run to completion on a smaller number of nodes is now giving an out of

Spark 2.0.0-snapshot: IllegalArgumentException: requirement failed: chunks must be non-empty

2016-05-13 Thread Raghava Mutharaju
(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) On Fri, May 13, 2016 at 6:33 AM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > Thank you for the response. > > I use

Re: sbt for Spark build with Scala 2.11

2016-05-13 Thread Raghava Mutharaju
> You might be missing some modules/profiles for your build. What command did > you use to build? > > On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: >> Hello All, >> I built Spark from the source code available a

sbt for Spark build with Scala 2.11

2016-05-12 Thread Raghava Mutharaju
Hello All, I built Spark from the source code available at https://github.com/apache/spark/. Although I haven't specified the "-Dscala-2.11" option (to build with Scala 2.11), from the build messages I see that it ended up using Scala 2.11. Now, for my application's sbt build, what should be the spark
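
For reference, a sketch of what the application's build.sbt could look like against such a locally built Spark. The version string, the Scala minor version, and the assumption that the Spark build was published to the local repository (for example with build/sbt publishLocal or build/mvn install) are all guesses, not something stated in this thread.

    // build.sbt (sketch)
    name := "my-spark-app"

    version := "0.1"

    scalaVersion := "2.11.8"     // must match the Scala version the Spark build used

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.0-SNAPSHOT" % "provided"
    )

    // Needed if the snapshot was installed into the local Maven repository:
    resolvers += Resolver.mavenLocal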

Re: partitioner aware subtract

2016-05-10 Thread Raghava Mutharaju
zipPartitions(rdd2){ (leftItr, rightItr) => > leftItr.filter(p => !rightItr.contains(p)) > } > sum.foreach(println) > > Regards, > Rishitesh Mishra, > SnappyData (http://www.snappydata.io/) > https://in.linkedin.com/in/rishiteshmishra > On Mo
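
For reference, a cleaned-up, self-contained version of the zipPartitions idea quoted above; the data and partition count are illustrative. One caveat: an Iterator can only be traversed once, so the right-hand iterator is materialized into a Set before filtering, otherwise it would be drained after the first left-hand element.

    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    val sc = new SparkContext(new SparkConf().setAppName("zip-subtract-sketch").setMaster("local[2]"))
    val part = new HashPartitioner(8)   // same partitioner (and partition count) for both RDDs

    val rdd1 = sc.parallelize(Seq((1, 10), (2, 20), (3, 30), (4, 40))).partitionBy(part)
    val rdd2 = sc.parallelize(Seq((2, 20), (4, 40))).partitionBy(part)

    // With identical partitioners, matching pairs sit in the same partition index,
    // so the subtraction can be done partition-by-partition without a shuffle.
    val diff = rdd1.zipPartitions(rdd2) { (leftItr, rightItr) =>
      val right = rightItr.toSet               // materialize: iterators are single-pass
      leftItr.filter(p => !right.contains(p))
    }

    diff.collect().foreach(println)            // expected: (1,10) and (3,30)
    sc.stop()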

Re: partitioner aware subtract

2016-05-09 Thread Raghava Mutharaju
ayan guha <guha.a...@gmail.com> wrote: > How about outer join? > On 9 May 2016 13:18, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote: > >> Hello All, >> >> We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key >>
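
A sketch in the spirit of the outer-join suggestion, written with cogroup (the primitive underneath Spark's joins), which removes from rdd1 every (key, value) pair that also appears in rdd2 and, unlike a plain join, does not duplicate left-hand values when a key repeats. Data and partition count are illustrative.

    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    val sc = new SparkContext(new SparkConf().setAppName("cogroup-subtract-sketch").setMaster("local[2]"))
    val part = new HashPartitioner(8)

    val rdd1 = sc.parallelize(Seq((1, 10), (2, 20), (2, 25), (3, 30))).partitionBy(part)
    val rdd2 = sc.parallelize(Seq((2, 20), (3, 99))).partitionBy(part)

    // Both inputs already share `part`, so cogroup does not need an extra shuffle.
    val diff = rdd1.cogroup(rdd2).flatMap { case (k, (leftVals, rightVals)) =>
      val right = rightVals.toSet
      leftVals.collect { case v if !right.contains(v) => (k, v) }
    }

    diff.collect().foreach(println)   // expected: (1,10), (2,25), (3,30)
    sc.stop()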

partitioner aware subtract

2016-05-08 Thread Raghava Mutharaju
Hello All, We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key (the number of partitions is the same for both RDDs). We would like to subtract rdd2 from rdd1. The subtract code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala seems to

Re: executor delay in Spark

2016-04-28 Thread Raghava Mutharaju
only > complicate your trace debugging. > > I've attached a python script that dumps relevant info from the Spark > JSON logs into a CSV for easier analysis in your language of choice; > hopefully it can aid in finer-grained debugging (the headers of the > fields it prints are liste

Re: executor delay in Spark

2016-04-24 Thread Raghava Mutharaju
this other than creating a dummy task to synchronize the executors, but > hopefully someone from there can suggest other possibilities. > > Mike > On Apr 23, 2016 5:53 AM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote: > >> Mike, >>

Re: executor delay in Spark

2016-04-22 Thread Raghava Mutharaju
to synchronize the executors, but > hopefully someone from there can suggest other possibilities. > Mike > On Apr 23, 2016 5:53 AM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com> wrote: > >> Mike, >> >> It turns out the executor delay, as

executor delay in Spark

2016-04-22 Thread Raghava Mutharaju
3. Make the tasks longer, i.e. with some silly computational work. > > Mike > > On 4/17/16, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: > > Yes, it's the same data. > > 1) The number of partitions is the same (8, which is an argument to the > >
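
An illustration of point 3 (entirely made up, not the poster's code): padding each task with throw-away computation so that tasks run long enough for their start times and executor assignments to stand out in the Spark UI and event logs.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("long-tasks-sketch"))  // master URL set at submit time

    // Throw-away CPU work; tune the iteration count so each task takes a few seconds.
    def busyWork(iterations: Int): Long = {
      var acc = 0L
      var i = 0
      while (i < iterations) { acc += i % 7; i += 1 }
      acc
    }

    // 8 partitions to mirror the setup in this thread; the data itself is irrelevant.
    val padded = sc.parallelize(1 to 1000, 8).map { x => busyWork(5000000); x }
    padded.count()
    sc.stop()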

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
<http://in.linkedin.com/in/sonalgoyal> > > On Mon, Apr 18, 2016 at 6:26 PM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: > >> Mike, >> >> We tried that. This map task is actually part of a larger set of >> operations. I

Re: strange HashPartitioner behavior in Spark

2016-04-18 Thread Raghava Mutharaju
>> the following to confirm: >> 1. Examine the starting times of the tasks alongside their >> executor >> 2. Make a "dummy" stage execute before your real stages to >> synchronize the executors by creating and materializing an
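
A sketch of point 2 (illustrative, not the poster's code): create and materialize a throw-away RDD before the real stages so that all executors have registered and been exercised before the work being measured begins.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("dummy-sync-stage"))  // master URL set at submit time

    // "Dummy" stage: enough partitions to reach every core, plus an action to force execution,
    // so late-registering executors join in before the real job starts. 64 is an assumption;
    // anything at least equal to the total number of cores in the cluster works.
    sc.parallelize(1 to 100000, 64).map(_ * 2).count()

    // ... the real stages run from here on, against an already-synchronized set of executors ...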

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
>>> If the data file is the same then it should have a similar distribution of >>> keys. A few queries: >>> 1. Did you compare the number of partitions in both cases? >>> 2. Did you compare the resource allocation for Spark Shell vs Scala >>> Program

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
and > Executors when you run via a Scala program? > > On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: > >> Hello All, >> >> We are using HashPartitioner in the following way on a 3 node cluster (1 >> master and 2 wor
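
In the spirit of these questions, a small sketch of how the partition count and per-partition record counts can be printed and compared between the spark-shell run and the compiled program; `u` is assumed to be the partitioned pair RDD from the original question below.

    // `u` is the hash-partitioned RDD built in the question below (assumed to be in scope).
    println(s"partitions  = ${u.partitions.length}")
    println(s"partitioner = ${u.partitioner}")

    // Records per partition: a skewed or unexpected HashPartitioner layout shows up here.
    u.mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n records") }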

strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
Hello All, We are using HashPartitioner in the following way on a 3 node cluster (1 master and 2 worker nodes). val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) } }).partitionBy(new
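
The snippet above is cut off at the partitioner; a completed, self-contained version is sketched below. The partition count of 8 is taken from the related executor-delay thread above and may not be what was actually used; the HDFS path and the "x|y" input format are kept from the question.

    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    val sc = new SparkContext(new SparkConf().setAppName("hash-partitioner-sketch"))

    // Parse "x|y" lines into (y, x) pairs and hash-partition them on the key.
    val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
      .map[(Int, Int)](line => line.split("\\|") match {
        case Array(x, y) => (y.toInt, x.toInt)
      })
      .partitionBy(new HashPartitioner(8))   // 8 partitions is an assumption (see the executor-delay thread)

    println(u.partitioner)                    // Some(org.apache.spark.HashPartitioner@...) once partitioned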

DataFrames - Kryo registration issue

2016-03-10 Thread Raghava Mutharaju
Hello All, If Kryo serialization is enabled, doesn't Spark take care of registering the built-in classes, i.e., aren't we supposed to register only our custom classes? When using DataFrames, this does not seem to be the case. I had to register the following classes
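
For context, a sketch of how registration is usually wired up; the classes listed are placeholders, not the DataFrame-internal ones the message goes on to list (that part is cut off above). With spark.kryo.registrationRequired enabled, Kryo raises an error naming each unregistered class, which is how those extra classes surface.

    import org.apache.spark.{SparkConf, SparkContext}

    case class MyRecord(id: Long, label: String)    // placeholder for an application class

    val conf = new SparkConf()
      .setAppName("kryo-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail fast on any unregistered class instead of silently writing full class names.
      .set("spark.kryo.registrationRequired", "true")
      // Register the application's own classes (placeholders here):
      .registerKryoClasses(Array(
        classOf[MyRecord],
        classOf[Array[MyRecord]]
      ))

    val sc = new SparkContext(conf)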

Dataset takes more memory compared to RDD

2016-02-12 Thread Raghava Mutharaju
Hello All, I implemented an algorithm using both the RDD and the Dataset APIs (in Spark 1.6). The Dataset version takes a lot more memory than the RDDs. Is this normal? Even for very small input data, it is running out of memory and I get a Java heap exception. I tried the Kryo serializer by
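
A small sketch (Spark 1.6 API; the class, data, and context setup are illustrative) comparing the two ways the same case class can be encoded in a Dataset: the default product encoder and an explicit Kryo encoder. Caching both and checking the Storage tab in the UI gives a direct memory comparison against the equivalent cached RDD.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Encoders, SQLContext}

    case class Triple(s: Long, p: Long, o: Long)    // illustrative record type

    val sc = new SparkContext(new SparkConf().setAppName("ds-vs-rdd-memory").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val data = (1L to 100000L).map(i => Triple(i, i % 7, i % 13))

    val rdd       = sc.parallelize(data)                                   // plain RDD baseline
    val dsDefault = sqlContext.createDataset(data)                         // default (tungsten) encoder
    val dsKryo    = sqlContext.createDataset(data)(Encoders.kryo[Triple])  // explicit Kryo encoder

    rdd.cache().count()
    dsDefault.cache().count()
    dsKryo.cache().count()
    // Compare the three cached sizes side by side in the Spark UI "Storage" tab.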

Re: Dataset joinWith condition

2016-02-10 Thread Raghava Mutharaju
a.x" > > Please use expr("...") > e.g. if your DataSet has two columns, you can write: > ds.select(expr("_2 / _1").as[Int]) > > where _1 refers to first column and _2 refers to second. > > On Tue, Feb 9, 2016 at 3:31 PM, Raghava Mutharaju < > m.vijaya

Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
Hello All, the joinWith() method in Dataset takes a condition of type Column. Without converting a Dataset to a DataFrame, how can we get a specific column? For example: case class Pair(x: Long, y: Long). A and B are Datasets of type Pair, and I want to join A.x with B.y: A.joinWith(B, A.toDF().col("x") ==

Re: Dataset joinWith condition

2016-02-09 Thread Raghava Mutharaju
checkAnswer( > ds1.joinWith(ds2, $"a.value" === $"b.value", "inner"), > > On Tue, Feb 9, 2016 at 7:07 AM, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote: > >> Hello All, >> >> joinWith() method
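
Pulling the thread together, a self-contained sketch (Spark 1.6; the data values are made up) of joining two Datasets of Pair on different columns without converting to DataFrames, using the alias plus $-column approach from the test snippet above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Pair(x: Long, y: Long)

    val sc = new SparkContext(new SparkConf().setAppName("joinWith-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Alias each Dataset so its columns can be referenced unambiguously in the condition.
    val a = sqlContext.createDataset(Seq(Pair(1L, 2L), Pair(3L, 4L))).as("a")
    val b = sqlContext.createDataset(Seq(Pair(2L, 7L), Pair(9L, 3L))).as("b")

    // Join A.x with B.y; the result is a Dataset[(Pair, Pair)], keeping both sides intact.
    val joined = a.joinWith(b, $"a.x" === $"b.y", "inner")
    joined.show()   // expected single match: ([3,4], [9,3])
    sc.stop()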