Inquiry about Processing Speed

2023-09-27 Thread Haseeb Khalid
Dear Support Team,

I hope this message finds you well. My name is Haseeb Khalid, and I am
reaching out to discuss a scenario related to processing speed in Apache
Spark.

I have been utilizing these technologies in our projects, and we have
encountered a specific use case where we are seeking to optimize processing
speed. Given the critical nature of this requirement, I would greatly
appreciate the opportunity to discuss this with a knowledgeable
representative from your team.

I am particularly interested in understanding any best practices,
configuration tweaks, or architectural considerations that can be employed
to enhance processing speed in our specific scenario.

Would it be possible to schedule a call or exchange emails to delve deeper
into this matter? I am available at your convenience and can accommodate
any preferred mode of communication.

I genuinely value the expertise of the Apache Spark communities and believe
that your insights will be instrumental in achieving our objectives.

Thank you very much for your time and consideration. I look forward to
hearing from you soon.

-- 

Thanks & Best Regards,

*Haseeb Khalid*

*Senior Data Analyst*

+92 306 4436 790


Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-03-08 Thread Muhammad Haseeb Javed
I was talking about the Kafka binary if using to run the Kafka server
(broker) with. The version for that binary is kafka_2.10-0.8.2.1 and Spark
is 2.0.2 is built with 2.11. So I am using the Kafka Connector that Spark
is using internally to communicate with the broker is also built with Scala
2.11. So can this version mismatch be the cause of the issue?

On Wed, Feb 22, 2017 at 8:44 PM, Cody Koeninger  wrote:

> If you're talking about the version of scala used to build the broker,
> that shouldn't matter.
> If you're talking about the version of scala used for the kafka client
> dependency, it shouldn't have compiled at all to begin with.
>
> On Wed, Feb 22, 2017 at 12:11 PM, Muhammad Haseeb Javed
> <11besemja...@seecs.edu.pk> wrote:
> > I just noticed that Spark version that I am using (2.0.2) is built with
> > Scala 2.11. However I am using Kafka 0.8.2 built with Scala 2.10. Could
> this
> > be the reason why we are getting this error?
> >
> > On Mon, Feb 20, 2017 at 5:50 PM, Cody Koeninger 
> wrote:
> >>
> >> So there's no reason to use checkpointing at all, right?  Eliminate
> >> that as a possible source of problems.
> >>
> >> Probably unrelated, but this also isn't a very good way to benchmark.
> >> Kafka producers are threadsafe, there's no reason to create one for
> >> each partition.
> >>
> >> On Mon, Feb 20, 2017 at 4:43 PM, Muhammad Haseeb Javed
> >> <11besemja...@seecs.edu.pk> wrote:
> >> > This is the code that I have been trying is giving me this error. No
> >> > complicated operation being performed on the topics as far as I can
> see.
> >> >
> >> > class Identity() extends BenchBase {
> >> >
> >> >
> >> >   override def process(lines: DStream[(Long, String)], config:
> >> > SparkBenchConfig): Unit = {
> >> >
> >> > val reportTopic = config.reporterTopic
> >> >
> >> > val brokerList = config.brokerList
> >> >
> >> >
> >> > lines.foreachRDD(rdd => rdd.foreachPartition( partLines => {
> >> >
> >> >   val reporter = new KafkaReporter(reportTopic, brokerList)
> >> >
> >> >   partLines.foreach{ case (inTime , content) =>
> >> >
> >> > val outTime = System.currentTimeMillis()
> >> >
> >> > reporter.report(inTime, outTime)
> >> >
> >> > if(config.debugMode) {
> >> >
> >> >   println("Event: " + inTime + ", " + outTime)
> >> >
> >> > }
> >> >
> >> >   }
> >> >
> >> > }))
> >> >
> >> >   }
> >> >
> >> > }
> >> >
> >> >
> >> > On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger 
> >> > wrote:
> >> >>
> >> >> That's an indication that the beginning offset for a given batch is
> >> >> higher than the ending offset, i.e. something is seriously wrong.
> >> >>
> >> >> Are you doing anything at all odd with topics, i.e. deleting and
> >> >> recreating them, using compacted topics, etc?
> >> >>
> >> >> Start off with a very basic stream over the same kafka topic that
> just
> >> >> does foreach println or similar, with no checkpointing at all, and
> get
> >> >> that working first.
> >> >>
> >> >> On Mon, Feb 20, 2017 at 12:10 PM, Muhammad Haseeb Javed
> >> >> <11besemja...@seecs.edu.pk> wrote:
> >> >> > Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10
> >> >> >
> >> >> > On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed
> >> >> > <11besemja...@seecs.edu.pk> wrote:
> >> >> >>
> >> >> >> I am PhD student at Ohio State working on a study to evaluate
> >> >> >> streaming
> >> >> >> frameworks (Spark Streaming, Storm, Flink) using the the Intel
> >> >> >> HiBench
> >> >> >> benchmarks. But I think I am having a problem  with Spark. I have
> >> >> >> Spark
> >> >> >> Streaming application which I am trying to run on a 5 node cluster
> >> >> >> (including master). I have 2 zookeeper and 4 kafka brokers.
> However,
> >> >>

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-22 Thread Muhammad Haseeb Javed
I just noticed that Spark version that I am using (2.0.2) is built with
Scala 2.11. However I am using Kafka 0.8.2 built with Scala 2.10. Could
this be the reason why we are getting this error?

On Mon, Feb 20, 2017 at 5:50 PM, Cody Koeninger  wrote:

> So there's no reason to use checkpointing at all, right?  Eliminate
> that as a possible source of problems.
>
> Probably unrelated, but this also isn't a very good way to benchmark.
> Kafka producers are threadsafe, there's no reason to create one for
> each partition.
>
> On Mon, Feb 20, 2017 at 4:43 PM, Muhammad Haseeb Javed
> <11besemja...@seecs.edu.pk> wrote:
> > This is the code that I have been trying is giving me this error. No
> > complicated operation being performed on the topics as far as I can see.
> >
> > class Identity() extends BenchBase {
> >
> >
> >   override def process(lines: DStream[(Long, String)], config:
> > SparkBenchConfig): Unit = {
> >
> > val reportTopic = config.reporterTopic
> >
> > val brokerList = config.brokerList
> >
> >
> > lines.foreachRDD(rdd => rdd.foreachPartition( partLines => {
> >
> >   val reporter = new KafkaReporter(reportTopic, brokerList)
> >
> >   partLines.foreach{ case (inTime , content) =>
> >
> > val outTime = System.currentTimeMillis()
> >
> > reporter.report(inTime, outTime)
> >
> > if(config.debugMode) {
> >
> >   println("Event: " + inTime + ", " + outTime)
> >
> > }
> >
> >   }
> >
> > }))
> >
> >   }
> >
> > }
> >
> >
> > On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger 
> wrote:
> >>
> >> That's an indication that the beginning offset for a given batch is
> >> higher than the ending offset, i.e. something is seriously wrong.
> >>
> >> Are you doing anything at all odd with topics, i.e. deleting and
> >> recreating them, using compacted topics, etc?
> >>
> >> Start off with a very basic stream over the same kafka topic that just
> >> does foreach println or similar, with no checkpointing at all, and get
> >> that working first.
> >>
> >> On Mon, Feb 20, 2017 at 12:10 PM, Muhammad Haseeb Javed
> >> <11besemja...@seecs.edu.pk> wrote:
> >> > Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10
> >> >
> >> > On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed
> >> > <11besemja...@seecs.edu.pk> wrote:
> >> >>
> >> >> I am PhD student at Ohio State working on a study to evaluate
> streaming
> >> >> frameworks (Spark Streaming, Storm, Flink) using the the Intel
> HiBench
> >> >> benchmarks. But I think I am having a problem  with Spark. I have
> Spark
> >> >> Streaming application which I am trying to run on a 5 node cluster
> >> >> (including master). I have 2 zookeeper and 4 kafka brokers. However,
> >> >> whenever I run a Spark Streaming application I encounter the
> following
> >> >> error:
> >> >>
> >> >> java.lang.IllegalArgumentException: requirement failed: numRecords
> must
> >> >> not be negative
> >> >> at scala.Predef$.require(Predef.scala:224)
> >> >> at
> >> >>
> >> >> org.apache.spark.streaming.scheduler.StreamInputInfo.<
> init>(InputInfoTracker.scala:38)
> >> >> at
> >> >>
> >> >> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(
> DirectKafkaInputDStream.scala:165)
> >> >> at
> >> >>
> >> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> >> >> at
> >> >>
> >> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> >> >> at
> >> >> scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> >> >> at
> >> >>
> >> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> >> >> at
> >> >>
> >> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> >> &g

Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Muhammad Haseeb Javed
This is the code that I have been trying is giving me this error. No
complicated operation being performed on the topics as far as I can see.

class Identity() extends BenchBase {


  override def process(lines: DStream[(Long, String)], config:
SparkBenchConfig): Unit = {

val reportTopic = config.reporterTopic

val brokerList = config.brokerList


lines.foreachRDD(rdd => rdd.foreachPartition( partLines => {

  val reporter = new KafkaReporter(reportTopic, brokerList)

  partLines.foreach{ case (inTime , content) =>

val outTime = System.currentTimeMillis()

reporter.report(inTime, outTime)

if(config.debugMode) {

  println("Event: " + inTime + ", " + outTime)

}

  }

}))

  }

}

On Mon, Feb 20, 2017 at 3:10 PM, Cody Koeninger  wrote:

> That's an indication that the beginning offset for a given batch is
> higher than the ending offset, i.e. something is seriously wrong.
>
> Are you doing anything at all odd with topics, i.e. deleting and
> recreating them, using compacted topics, etc?
>
> Start off with a very basic stream over the same kafka topic that just
> does foreach println or similar, with no checkpointing at all, and get
> that working first.
>
> On Mon, Feb 20, 2017 at 12:10 PM, Muhammad Haseeb Javed
> <11besemja...@seecs.edu.pk> wrote:
> > Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10
> >
> > On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed
> > <11besemja...@seecs.edu.pk> wrote:
> >>
> >> I am PhD student at Ohio State working on a study to evaluate streaming
> >> frameworks (Spark Streaming, Storm, Flink) using the the Intel HiBench
> >> benchmarks. But I think I am having a problem  with Spark. I have Spark
> >> Streaming application which I am trying to run on a 5 node cluster
> >> (including master). I have 2 zookeeper and 4 kafka brokers. However,
> >> whenever I run a Spark Streaming application I encounter the following
> >> error:
> >>
> >> java.lang.IllegalArgumentException: requirement failed: numRecords must
> >> not be negative
> >> at scala.Predef$.require(Predef.scala:224)
> >> at
> >> org.apache.spark.streaming.scheduler.StreamInputInfo.<
> init>(InputInfoTracker.scala:38)
> >> at
> >> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(
> DirectKafkaInputDStream.scala:165)
> >> at
> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> >> at
> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> >> at scala.util.DynamicVariable.withValue(DynamicVariable.
> scala:58)
> >> at
> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> >> at
> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> >> at
> >> org.apache.spark.streaming.dstream.DStream.
> createRDDWithLocalProperties(DStream.scala:415)
> >> at
> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1.apply(DStream.scala:335)
> >> at
> >> org.apache.spark.streaming.dstream.DStream$$anonfun$
> getOrCompute$1.apply(DStream.scala:333)
> >> at scala.Option.orElse(Option.scala:289)
> >>
> >> The application starts fine, but as soon as the Kafka producers start
> >> emitting the stream data I start receiving the aforementioned error
> >> repeatedly.
> >>
> >> I have tried removing Spark Streaming checkpointing files as has been
> >> suggested in similar posts on the internet. However, the problem
> persists
> >> even if I start a Kafka topic and its corresponding consumer Spark
> Streaming
> >> application for the first time. Also the problem could not be offset
> related
> >> as I start the topic for the first time.
> >>
> >> Although the application seems to be processing the stream properly as I
> >> can see by the benchmark numbers generated. However, the numbers are
> way of
> >> from what I got for Storm and Flink, suspecting me to believe that
> there is
> >> something wrong with the pipeline and Spark is not able to process the
> >> stream as cleanly as it should. Any help in this regard would be really
> >> appreciated.
> >
> >
>


Re: Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Muhammad Haseeb Javed
Update: I am using Spark 2.0.2 and  Kafka 0.8.2 with Scala 2.10

On Mon, Feb 20, 2017 at 1:06 PM, Muhammad Haseeb Javed <
11besemja...@seecs.edu.pk> wrote:

> I am PhD student at Ohio State working on a study to evaluate streaming
> frameworks (Spark Streaming, Storm, Flink) using the the Intel HiBench
> benchmarks. But I think I am having a problem  with Spark. I have Spark
> Streaming application which I am trying to run on a 5 node cluster
> (including master). I have 2 zookeeper and 4 kafka brokers. However,
> whenever I run a Spark Streaming application I encounter the following
> error:
>
> java.lang.IllegalArgumentException: requirement failed: numRecords must not 
> be negative
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
> at 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
>
> The application starts fine, but as soon as the Kafka producers start
> emitting the stream data I start receiving the aforementioned error
> repeatedly.
>
> I have tried removing Spark Streaming checkpointing files as has been
> suggested in similar posts on the internet. However, the problem persists
> even if I start a Kafka topic and its corresponding consumer Spark
> Streaming application for the first time. Also the problem could not be
> offset related as I start the topic for the first time.
> Although the application seems to be processing the stream properly as I
> can see by the benchmark numbers generated. However, the numbers are way of
> from what I got for Storm and Flink, suspecting me to believe that there is
> something wrong with the pipeline and Spark is not able to process the
> stream as cleanly as it should. Any help in this regard would be really
> appreciated.
>


Why does Spark Streaming application with Kafka fail with “requirement failed: numRecords must not be negative”?

2017-02-20 Thread Muhammad Haseeb Javed
I am PhD student at Ohio State working on a study to evaluate streaming
frameworks (Spark Streaming, Storm, Flink) using the the Intel HiBench
benchmarks. But I think I am having a problem  with Spark. I have Spark
Streaming application which I am trying to run on a 5 node cluster
(including master). I have 2 zookeeper and 4 kafka brokers. However,
whenever I run a Spark Streaming application I encounter the following
error:

java.lang.IllegalArgumentException: requirement failed: numRecords
must not be negative
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.streaming.scheduler.StreamInputInfo.(InputInfoTracker.scala:38)
at 
org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:165)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)

The application starts fine, but as soon as the Kafka producers start
emitting the stream data I start receiving the aforementioned error
repeatedly.

I have tried removing Spark Streaming checkpointing files as has been
suggested in similar posts on the internet. However, the problem persists
even if I start a Kafka topic and its corresponding consumer Spark
Streaming application for the first time. Also the problem could not be
offset related as I start the topic for the first time.
Although the application seems to be processing the stream properly as I
can see by the benchmark numbers generated. However, the numbers are way of
from what I got for Storm and Flink, suspecting me to believe that there is
something wrong with the pipeline and Spark is not able to process the
stream as cleanly as it should. Any help in this regard would be really
appreciated.


Wrap an RDD with a ShuffledRDD

2015-11-08 Thread Muhammad Haseeb Javed
I am working on a modified Spark core and have a Broadcast variable which I
deserialize to obtain an RDD along with its set of dependencies, as is done
in ShuffleMapTask, as following:

val taskBinary: Broadcast[Array[Byte]]var (rdd, dep) =
ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
  ByteBuffer.wrap(taskBinary.value),
Thread.currentThread.getContextClassLoader)

However, I want to wrap this rdd by a ShuffledRDD because I need to apply a
custom partitioner to it ,and I am doing this by:

var wrappedRDD = new ShuffledRDD[_ ,_, _](rdd[_ <: Product2[Any,
Any]], context.getCustomPartitioner())

but it results in an error:

Error:unbound wildcard type rdd = new ShuffledRDD[_ ,_, _ ](rdd[_ <:
Product2[Any, Any]], context.getCustomPartitioner())
..^

The problem is that I don't know how to replace these wildcards with any
inferred type as I its supposed to be dynamic and I have no idea what would
be the inferred type of the original rdd. Any idea how I could resolved
this?


What is the abstraction for a Worker process in Spark code

2015-10-12 Thread Muhammad Haseeb Javed
I understand that each executor that is processing a Spark job is emulated
in Spark code by the Executor class in Executor.scala and
CoarseGrainedExecutorBackend is the abstraction which facilitates
communication between an Executor and the Driver. But what is the
abstraction for a Worker process in Spark code which would a reference to
all the Executors running in it.


Building spark-examples takes too much time using Maven

2015-08-26 Thread Muhammad Haseeb Javed
I checked out the master branch and started playing around with the
examples. I want to build a jar  of the examples as I wish run them using
the modified spark jar that I have. However, packaging spark-examples takes
too much time as maven tries to download the jar dependencies rather than
use the jar that are already present int the system as I extended and
packaged spark itself locally?


Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Muhammad Haseeb Javed
Thanks Andrew for a detailed response,

So the reason why key value pairs with same keys are always found in a
single buckets in Hash based shuffle but not in Sort is because in
sort-shuffle each mapper writes a single partitioned file, and it is up to
the reducer to fetch correct partitions from the the files ?

On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or  wrote:

> Hi Muhammad,
>
> On a high level, in hash-based shuffle each mapper M writes R shuffle
> files, one for each reducer where R is the number of reduce partitions.
> This results in M * R shuffle files. Since it is not uncommon for M and R
> to be O(1000), this quickly becomes expensive. An optimization with
> hash-based shuffle is consolidation, where all mappers run in the same core
> C write one file per reducer, resulting in C * R files. This is a strict
> improvement, but it is still relatively expensive.
>
> Instead, in sort-based shuffle each mapper writes a single partitioned
> file. This allows a particular reducer to request a specific portion of
> each mapper's single output file. In more detail, the mapper first fills up
> an internal buffer in memory and continually spills the contents of the
> buffer to disk, then finally merges all the spilled files together to form
> one final output file. This places much less stress on the file system and
> requires much fewer I/O operations especially on the read side.
>
> -Andrew
>
>
>
> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
> 11besemja...@seecs.edu.pk>:
>
>> I did check it out and although I did get a general understanding of the
>> various classes used to implement Sort and Hash shuffles, however these
>> slides lack details as to how they are implemented and why sort generally
>> has better performance than hash
>>
>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran 
>> wrote:
>>
>>> Have a look at this presentation.
>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>>> of help to you.
>>>
>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>> 11besemja...@seecs.edu.pk> wrote:
>>>
>>>> What are the major differences between how Sort based and Hash based
>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>> than Hash?
>>>> Any talks that discuss both shuffles in detail, how they are
>>>> implemented and the performance gains ?
>>>>
>>>
>>>
>>
>


Re: Difference between Sort based and Hash based shuffle

2015-08-16 Thread Muhammad Haseeb Javed
I did check it out and although I did get a general understanding of the
various classes used to implement Sort and Hash shuffles, however these
slides lack details as to how they are implemented and why sort generally
has better performance than hash

On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran 
wrote:

> Have a look at this presentation.
> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of
> help to you.
>
> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
> 11besemja...@seecs.edu.pk> wrote:
>
>> What are the major differences between how Sort based and Hash based
>> shuffle operate and what is it that cause Sort Shuffle to perform better
>> than Hash?
>> Any talks that discuss both shuffles in detail, how they are implemented
>> and the performance gains ?
>>
>
>


Difference between Sort based and Hash based shuffle

2015-08-15 Thread Muhammad Haseeb Javed
What are the major differences between how Sort based and Hash based
shuffle operate and what is it that cause Sort Shuffle to perform better
than Hash?
Any talks that discuss both shuffles in detail, how they are implemented
and the performance gains ?


Re: Actor not found for: ActorSelection

2015-07-28 Thread Haseeb
The problem was that I was trying to start the example app in standalone
cluster mode by passing in
*-Dspark.master=spark://myhost:7077* as an argument to the JVM. I launched
the example app locally using -*Dspark.master=local* and it worked.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Actor-not-found-for-ActorSelection-tp24035p24037.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Actor not found for: ActorSelection

2015-07-28 Thread Haseeb
I just cloned the master repository of Spark from Github. I am running it on
OSX 10.9, Spark 1.4.1 and Scala 2.10.4

I just tried to run the SparkPi example program using IntelliJ Idea but get
the error : akka.actor.ActorNotFound: Actor not found for:
ActorSelection[Anchor(akka.tcp://sparkMaster@myhost:7077/)

I did checkout a similar  post

  
but found no solution.

Find the complete stack trace below. Any help would be really appreciated.

2015-07-28 22:16:45,888 INFO  [main] spark.SparkContext
(Logging.scala:logInfo(59)) - Running Spark version 1.5.0-SNAPSHOT
2015-07-28 22:16:47,125 WARN  [main] util.NativeCodeLoader
(NativeCodeLoader.java:(62)) - Unable to load native-hadoop library
for your platform... using builtin-java classes where applicable
2015-07-28 22:16:47,753 INFO  [main] spark.SecurityManager
(Logging.scala:logInfo(59)) - Changing view acls to: mac
2015-07-28 22:16:47,755 INFO  [main] spark.SecurityManager
(Logging.scala:logInfo(59)) - Changing modify acls to: mac
2015-07-28 22:16:47,756 INFO  [main] spark.SecurityManager
(Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(mac); users with modify
permissions: Set(mac)
2015-07-28 22:16:49,454 INFO  [sparkDriver-akka.actor.default-dispatcher-2]
slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2015-07-28 22:16:49,695 INFO  [sparkDriver-akka.actor.default-dispatcher-2]
Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2015-07-28 22:16:50,167 INFO  [sparkDriver-akka.actor.default-dispatcher-2]
Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening
on addresses :[akka.tcp://sparkDriver@192.168.2.105:49981]
2015-07-28 22:16:50,215 INFO  [main] util.Utils (Logging.scala:logInfo(59))
- Successfully started service 'sparkDriver' on port 49981.
2015-07-28 22:16:50,372 INFO  [main] spark.SparkEnv
(Logging.scala:logInfo(59)) - Registering MapOutputTracker
2015-07-28 22:16:50,596 INFO  [main] spark.SparkEnv
(Logging.scala:logInfo(59)) - Registering BlockManagerMaster
2015-07-28 22:16:50,948 INFO  [main] storage.DiskBlockManager
(Logging.scala:logInfo(59)) - Created local directory at
/private/var/folders/8k/jfw576r50m97rlk5qpj1n4l8gn/T/blockmgr-309db4d1-d129-43e5-a90e-12cf51ad491f
2015-07-28 22:16:51,198 INFO  [main] storage.MemoryStore
(Logging.scala:logInfo(59)) - MemoryStore started with capacity 491.7 MB
2015-07-28 22:16:51,707 INFO  [main] spark.HttpFileServer
(Logging.scala:logInfo(59)) - HTTP File server directory is
/private/var/folders/8k/jfw576r50m97rlk5qpj1n4l8gn/T/spark-f28e24e7-b798-4365-8209-409d8b27ad2f/httpd-ce32c41d-b618-49e9-bec1-f409454f3679
2015-07-28 22:16:51,777 INFO  [main] spark.HttpServer
(Logging.scala:logInfo(59)) - Starting HTTP Server
2015-07-28 22:16:52,091 INFO  [main] server.Server
(Server.java:doStart(272)) - jetty-8.1.14.v20131031
2015-07-28 22:16:52,116 INFO  [main] server.AbstractConnector
(AbstractConnector.java:doStart(338)) - Started
SocketConnector@0.0.0.0:49982
2015-07-28 22:16:52,116 INFO  [main] util.Utils (Logging.scala:logInfo(59))
- Successfully started service 'HTTP file server' on port 49982.
2015-07-28 22:16:52,249 INFO  [main] spark.SparkEnv
(Logging.scala:logInfo(59)) - Registering OutputCommitCoordinator
2015-07-28 22:16:54,253 INFO  [main] server.Server
(Server.java:doStart(272)) - jetty-8.1.14.v20131031
2015-07-28 22:16:54,315 INFO  [main] server.AbstractConnector
(AbstractConnector.java:doStart(338)) - Started
SelectChannelConnector@0.0.0.0:4040
2015-07-28 22:16:54,317 INFO  [main] util.Utils (Logging.scala:logInfo(59))
- Successfully started service 'SparkUI' on port 4040.
2015-07-28 22:16:54,386 INFO  [main] ui.SparkUI (Logging.scala:logInfo(59))
- Started SparkUI at http://192.168.2.105:4040
2015-07-28 22:16:54,924 WARN  [main] metrics.MetricsSystem
(Logging.scala:logWarning(71)) - Using default name DAGScheduler for source
because spark.app.id is not set.
2015-07-28 22:16:55,132 INFO  [appclient-register-master-threadpool-0]
client.AppClient$ClientEndpoint (Logging.scala:logInfo(59)) - Connecting to
master spark://myhost:7077...
2015-07-28 22:16:55,392 WARN  [sparkDriver-akka.actor.default-dispatcher-14]
client.AppClient$ClientEndpoint (Logging.scala:logWarning(71)) - Could not
connect to myhost:7077: akka.remote.EndpointAssociationException:
Association failed with [akka.tcp://sparkMaster@myhost:7077]
2015-07-28 22:16:55,412 WARN  [sparkDriver-akka.actor.default-dispatcher-14]
remote.ReliableDeliverySupervisor (Slf4jLogger.scala:apply$mcV$sp(71)) -
Association with remote system [akka.tcp://sparkMaster@myhost:7077] has
failed, address is now gated for [5000] ms. Reason: [Association failed with
[akka.tcp://sparkMaster@myhost:7077]] Caused by: [myhost: unknown error]
2015-07-28 22:16:55,447 WARN  [appclient-register-master-threadpool-0]
client.AppClient$ClientEndpoint (Loggi