Saurav,
We have the same issue. Our application runs fine on 32 nodes with 4 cores
each and 256 partitions, but gives an OOM on the driver when run on 64 nodes
with 512 partitions. Did you get to know the reason behind this behavior, or
the relation between the number of partitions and driver RAM?
args to take it automatically on OutOfMemory error. Analyze it and
> share your finding :)
>
>
>
> On Wed, Jun 22, 2016 at 4:33 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Ok. Would you be able to shed more light on what exact meta data it manages?
> The driver keeps a fair amount of meta data to manage scheduling across all
> your executors. I assume with 64 nodes you have more executors as well.
> Simple way to test is to increase driver memory.
>
> On Wed, Jun 22, 2016 at 10:10 AM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Wed, Jun 22, 2016 at 9:57 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
Hello All,
We have a Spark cluster where the driver and master run on the same
node. We are using the Spark Standalone cluster manager. If the number of nodes
(and partitions) is increased, the same dataset that used to run to
completion on a smaller number of nodes now gives an out-of-memory error on
the driver. The tail of the stack trace:
(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
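As suggested up-thread, the quickest experiment is simply giving the driver more memory. A minimal sketch (master URL, sizes, class, and jar name are placeholders, not taken from this thread):

```shell
# Placeholders throughout; adjust to your cluster and application.
spark-submit \
  --master spark://master:7077 \
  --driver-memory 8g \
  --conf spark.driver.maxResultSize=4g \
  --class com.example.MyApp \
  my-app.jar
```

If results collected back to the driver grow with the partition count, spark.driver.maxResultSize may also be worth raising alongside the heap.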
On Fri, May 13, 2016 at 6:33 AM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:
> Thank you for the response.
>
> I use
> you might be missing some modules/profiles for your build. What command did
> you use to build?
>
> On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Hello All,
>>
Hello All,
I built Spark from the source code available at
https://github.com/apache/spark/. Although I haven't specified the
"-Dscala-2.11" option (to build with Scala 2.11), from the build messages I
see that it ended up using Scala 2.11. Now, for my application's sbt build,
what should the Spark dependency be?
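For reference, a minimal build.sbt sketch for that situation (the version numbers are assumptions; match them to the Spark you built):

```scala
// build.sbt sketch; versions are placeholders.
scalaVersion := "2.11.8"

// %% appends the Scala binary suffix (_2.11) automatically from scalaVersion.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
```

"provided" keeps the Spark jars out of your assembly, since the cluster supplies them at runtime.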
> val sum = rdd1.zipPartitions(rdd2) { (leftItr, rightItr) =>
>   val right = rightItr.toSet // materialize once; an Iterator can be traversed only once
>   leftItr.filter(p => !right.contains(p))
> }
> sum.foreach(println)
>
>
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Mon at ...:27 AM, ayan guha <guha.a...@gmail.com> wrote:
> How about outer join?
> On 9 May 2016 13:18, "Raghava Mutharaju" <m.vijayaragh...@gmail.com>
> wrote:
>
Hello All,
We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key (number
of partitions are same for both the RDDs). We would like to subtract rdd2
from rdd1.
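The per-partition filtering that the zipPartitions suggestion earlier in the thread relies on can be sketched without a cluster (plain iterators; the method name is mine):

```scala
// Per-partition subtraction: everything in `left` that is not in `right`.
// Materialize `right` into a Set first -- an Iterator can be traversed only
// once, so calling contains() on it repeatedly would silently exhaust it.
def subtractPartition[T](left: Iterator[T], right: Iterator[T]): Iterator[T] = {
  val rightSet = right.toSet
  left.filter(x => !rightSet.contains(x))
}
```

Inside Spark this body would be the function passed to rdd1.zipPartitions(rdd2), which is only correct here because both RDDs are hash partitioned by the same partitioner, so matching keys land in matching partitions.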
The subtract code at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
seems to
only
> complicate your trace debugging.
>
> I've attached a python script that dumps relevant info from the Spark
> JSON logs into a CSV for easier analysis in your language of choice;
> hopefully it can aid in finer-grained debugging (the headers of the
> fields it prints are listed
this other than creating a dummy task to synchronize the executors, but
> hopefully someone from there can suggest other possibilities.
>
> Mike
> On Apr 23, 2016 5:53 AM, "Raghava Mutharaju" <m.vijayaragh...@gmail.com>
> wrote:
>
>> Mike,
>>
>> It turns out the executor delay, as
> 3. Make the tasks longer, i.e. with some silly computational work.
>
> Mike
>
>
> On 4/17/16, Raghava Mutharaju <m.vijayaragh...@gmail.com> wrote:
> > Yes, it's the same data.
> >
> > 1) The number of partitions is the same (8, which is an argument to the
> >
> On Mon, Apr 18, 2016 at 6:26 PM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Mike,
>>
>> We tried that. This map task is actually part of a larger set of
>> operations. I
> >> the following to confirm:
> >> 1. Examine the starting times of the tasks alongside their
> >> executor
> >> 2. Make a "dummy" stage execute before your real stages to
> >> synchronize the executors by creating and materializing an
>>> If the data file is the same then it should have a similar distribution of
>>> keys. A few queries:
>>>
>>> 1. Did you compare the number of partitions in both the cases?
>>> 2. Did you compare the resource allocation for the Spark shell vs. the Scala
>>> program?
> and Executors when you run via the Scala program?
>
> On Mon, Apr 18, 2016 at 4:41 AM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
Hello All,
We are using HashPartitioner in the following way on a 3 node cluster (1
master and 2 worker nodes).
val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int,
Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt,
x.toInt) } }).partitionBy(new HashPartitioner(8))
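The map function above, isolated so it can be checked without a cluster (the method name is mine):

```scala
// Split "x|y" on the pipe and emit (y, x) as an (Int, Int) pair,
// i.e. the second field becomes the key that HashPartitioner hashes on.
def parseLine(line: String): (Int, Int) =
  line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) }
```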
Hello All,
If Kryo serialization is enabled, doesn't Spark take care of registering the
built-in classes, i.e., are we supposed to register only our custom
classes?
When using DataFrames, this does not seem to be the case. I had to register
the following classes
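For context, the registration being discussed looks roughly like this (the class name is a placeholder, and requiring registration is optional):

```scala
import org.apache.spark.SparkConf

// Hypothetical custom class standing in for whatever the application ships.
case class MyRecord(id: Long, label: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // With this flag on, Kryo throws instead of silently falling back, which is
  // how unregistered classes (including DataFrame internals) surface as errors.
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[Array[MyRecord]]))
```

Leaving spark.kryo.registrationRequired at its default of false avoids the errors at the cost of Kryo writing full class names for unregistered types.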
Hello All,
I implemented an algorithm using both the RDDs and the Dataset API (in
Spark 1.6). The Dataset version takes a lot more memory than the RDDs. Is this
normal? Even for very small input data, it runs out of memory and I
get a java heap exception.
I tried the Kryo serializer by
>
> Please use expr("...")
> e.g. if your DataSet has two columns, you can write:
> ds.select(expr("_2 / _1").as[Int])
>
> where _1 refers to the first column and _2 to the second.
>
> On Tue, Feb 9, 2016 at 3:31 PM, Raghava Mutharaju <
> m.vijaya
Hello All,
The joinWith() method in Dataset takes a condition of type Column. Without
converting a Dataset to a DataFrame, how can we get a specific column?
For example: case class Pair(x: Long, y: Long)
A and B are Datasets of type Pair, and I want to join A.x with B.y:
A.joinWith(B, A.toDF().col("x") === B.toDF().col("y"))
>
> checkAnswer(
> ds1.joinWith(ds2, $"a.value" === $"b.value", "inner"),
>
> On Tue, Feb 9, 2016 at 7:07 AM, Raghava Mutharaju <
> m.vijayaragh...@gmail.com> wrote:
>
>> Hello All,
>>
>> joinWith() method