Spark hangs on collect (stuck on scheduler delay)

2015-08-16 Thread Sagi r
Hi,
I'm building a Spark application in which I load some data from an
Elasticsearch cluster (using the latest elasticsearch-hadoop connector) and
then perform some calculations on the Spark cluster.

In one case, I use collect on the RDD as soon as it is created (loaded from
ES).
However, it sometimes hangs on one (and sometimes more than one) node and
doesn't continue.
In the web UI, I can see that one node is stuck on scheduler delay and
prevents the job from continuing
(while the others have finished).

Do you have any idea what is going on here?

The data that is being loaded is fairly small, and only gets mapped once to
domain objects before being collected.

Thank you






Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-16 Thread canan chen
Thanks Andrew.



On Sun, Aug 16, 2015 at 1:53 PM, Andrew Or and...@databricks.com wrote:

 Hi Canan, TestSQLContext is no longer a singleton but now a class. It was
 never meant to be a fully public API, but if you wish to use it you can
 just instantiate a new one:

 val sqlContext = new TestSQLContext

 or just create a new SQLContext from a SparkContext.
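 For illustration, a minimal sketch of the second option (the app name and
 master are placeholders):

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.sql.SQLContext

   val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[*]"))
   val sqlContext = new SQLContext(sc)  // plain SQLContext instead of TestSQLContext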

 -Andrew

 2015-08-15 20:33 GMT-07:00 canan chen ccn...@gmail.com:

 I am not sure how other people set up their Spark debugging environment (I
 mean for the master branch). Can anyone share their experience?


 On Sun, Aug 16, 2015 at 10:40 AM, canan chen ccn...@gmail.com wrote:

 I imported the Spark source code into Intellij and want to run SparkPi in
 Intellij, but I get the following weird compilation error. I googled it and
 sbt clean doesn't work for me. I am not sure whether anyone else has met
 this issue too; any help is appreciated.

 Error:scalac:
  while compiling:
 /Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
 during phase: jvm
  library version: version 2.10.4
 compiler version: version 2.10.4
   reconstructed args: -nobootcp -javabootclasspath : -deprecation
 -feature -classpath






Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-16 Thread mohanaugust
JavaPairReceiverInputDStream<String, byte[]> messages =
KafkaUtils.createStream(...);
JavaPairDStream<String, byte[]> filteredMessages =
filterValidMessages(messages);

JavaDStream<String> useCase1 = calculateUseCase1(filteredMessages);
JavaDStream<String> useCase2 = calculateUseCase2(filteredMessages);
JavaDStream<String> useCase3 = calculateUseCase3(filteredMessages);
JavaDStream<String> useCase4 = calculateUseCase4(filteredMessages);
...

I retrieve messages from Kafka, filter them, and use the same messages for
multiple use-cases. Here useCase1 to 4 are independent of each other and can
be calculated in parallel. However, when I look at the logs, I see that the
calculations are happening sequentially. How can I make them run in
parallel? Any suggestion would be helpful.






Re: Error: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-16 Thread Rishi Yadav
try --jars rather than --class to submit jar.
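--jars takes a comma-separated list of extra jars, e.g. something like this
(the HBase jar names and paths are illustrative):

 ~/Desktop/spark/bin/spark-submit --class org.example.main.scalaConsumer \
   --jars /path/to/hbase-client.jar,/path/to/hbase-common.jar \
   scalConsumer-0.0.1-SNAPSHOT.jar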



On Fri, Aug 14, 2015 at 6:19 AM, Stephen Boesch java...@gmail.com wrote:

 The NoClassDefFoundError differs from ClassNotFoundException: it
 indicates an error while initializing that class, but the class is found on
 the classpath. Please provide the full stack trace.

 2015-08-14 4:59 GMT-07:00 stelsavva stel...@avocarrot.com:

 Hello, I am just starting out with Spark Streaming and HBase/Hadoop. I'm
 writing a simple app to read from Kafka and store to HBase, and I am having
 trouble submitting my job to Spark.

 I've downloaded Apache Spark 1.4.1, pre-built for Hadoop 2.6.

 I am building the project with mvn package

 and submitting the jar file with

  ~/Desktop/spark/bin/spark-submit --class org.example.main.scalaConsumer
 scalConsumer-0.0.1-SNAPSHOT.jar

 And then I am getting the error you see in the subject line. Is this a
 problem with my Maven dependencies? Do I need to install Hadoop locally?
 And if so, how can I add the Hadoop classpath to the Spark job?








Spark can't fetch the added jar to http server

2015-08-16 Thread t4ng0
Hi 

I have been trying to run a standalone application using spark-submit.
Spark starts the HTTP server and adds the jar file to it, but it is then
unable to fetch that jar file. I am running the Spark cluster on localhost.
If anyone can help me find what I am missing here, thanks in advance.

LOGS: 
 Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 15/08/16 15:20:53 INFO
SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(manvendratomar); users with
modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO Utils:
Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO SparkEnv:
Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
Created local directory at
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
INFO Utils: Successfully started service 'HTTP file server' on port 63986.
15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on port
4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
target/scala-2.11/spark_matrix_2.11-1.0.jar at
http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block manager
localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
stored as values in memory (estimated size 153.6 KB, free 265.3 MB) 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=157248,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input paths
to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
(reduce at IndexedRowMatrix.scala:65) with 1 output partitions
(allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
15:20:56 INFO MemoryStore: ensureFreeSpace(2249) called with curMem=175569,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block
broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free
265.2 MB) 15/08/16 15:20:56 INFO BlockManagerInfo: Added broadcast_1_piece0
in memory on localhost:63987 (size: 2.2 KB, free: 265.4 MB) 15/08/16
15:20:56 INFO SparkContext: 

Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-16 Thread Stephen Boesch
I am building spark with the following options - most notably the
**scala-2.11**:

 . dev/switch-to-scala-2.11.sh
mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
-Dmaven.javadoc.skip=true clean package


The build goes pretty far but fails in one of the minor modules *repl*:

[INFO]

[ERROR] Failed to execute goal on project spark-repl_2.11: Could not
resolve dependencies
for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
 Could not   find artifact org.scala-lang:jline:jar:2.11.7 in central
 (https://repo1.maven.org/maven2) - [Help 1]

Upon investigation: from Scala 2.11.5 onward the Scala-specific build of
jline is no longer required; the default jline distribution is used.

And in fact the repl pom only declares a jline dependency for the Scala 2.10
profile:

<profile>
  <id>scala-2.10</id>
  <activation>
    <property><name>!scala-2.11</name></property>
  </activation>
  <properties>
    <scala.version>2.10.4</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
    <jline.version>${scala.version}</jline.version>
    <jline.groupid>org.scala-lang</jline.groupid>
  </properties>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>${jline.groupid}</groupId>
        <artifactId>jline</artifactId>
        <version>${jline.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</profile>

So then it is not clear why this error is occurring. Pointers appreciated.


SparkPi is getting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-16 Thread xiaohe lan
Hi,

I am trying to run SparkPi in Intellij and am getting a NoClassDefFoundError.
Has anyone else seen this issue before?

Exception in thread main java.lang.NoClassDefFoundError:
scala/collection/Seq
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more

Process finished with exit code 1

Thanks,
Xiaohe


Re: Difference between Sort based and Hash based shuffle

2015-08-16 Thread Muhammad Haseeb Javed
I did check it out, and although I got a general understanding of the
various classes used to implement the sort and hash shuffles, the slides
lack details on how they are implemented and why sort generally performs
better than hash.

On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran ravikiranmag...@gmail.com
wrote:

 Have a look at this presentation.
 http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of
 help to you.

 On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed 
 11besemja...@seecs.edu.pk wrote:

 What are the major differences between how sort-based and hash-based
 shuffle operate, and what is it that causes the sort shuffle to perform
 better than hash?
 Any talks that discuss both shuffles in detail, how they are implemented,
 and the performance gains?





Spark executor lost because of time out even after setting quite long time out value 1000 seconds

2015-08-16 Thread unk1102
Hi, I have written a Spark job which seems to work fine for almost an hour,
and after that executors start getting lost because of a timeout. I see the
following log statement:

15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no
recent heartbeats: 1051638 ms exceeds timeout 100 ms 

I don't see any errors, but I see the above warning, and because of it the
executor gets removed by YARN, and I then see "Rpc client disassociated"
errors, IOException: connection refused, and FetchFailedException.

After an executor gets removed I see it being added again and it starts
working, and then some other executors fail again. My question is: is it
normal for executors to get lost? What happens to the tasks the lost
executors were working on? My Spark job keeps running for a long time,
around 4-5 hours. I have a very good cluster with 1.2 TB of memory and a
good number of CPU cores. To solve the above timeout issue I tried to
increase spark.akka.timeout to 1000 seconds, but no luck. I am using the
following command to run my Spark job. Please guide me, I am new to Spark.
I am using Spark 1.4.1. Thanks in advance.

/spark-submit --class com.xyz.abc.MySparkJob  --conf
spark.executor.extraJavaOptions=-XX:MaxPermSize=512M --driver-java-options
-XX:MaxPermSize=512m --driver-memory 4g --master yarn-client
--executor-memory 25G --executor-cores 8 --num-executors 5 --jars
/path/to/spark-job.jar
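For reference, a sketch of raising the heartbeat-related settings via
SparkConf (this assumes spark.network.timeout, not spark.akka.timeout, is
what the heartbeat check reads here; the values are just examples):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MySparkJob")
    .set("spark.network.timeout", "1000s")           // network / heartbeat timeout
    .set("spark.executor.heartbeatInterval", "60s")  // how often executors send heartbeats
  val sc = new SparkContext(conf)

The same properties can also be passed with --conf on spark-submit.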






Re: Executors on multiple nodes

2015-08-16 Thread Sandy Ryza
Hi Mohit,

It depends on whether dynamic allocation is turned on.  If not, the number
of executors is specified by the user with the --num-executors option.  If
dynamic allocation is turned on, refer to the doc for details:
https://spark.apache.org/docs/1.4.0/job-scheduling.html#dynamic-resource-allocation
.
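For illustration, a minimal sketch of turning dynamic allocation on through
SparkConf (property names as in the linked doc; on YARN the external shuffle
service also has to be enabled, and the min/max values are just examples):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")   // required by dynamic allocation
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "20")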

-Sandy


On Sat, Aug 15, 2015 at 6:40 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:

 I am running on Yarn and do have a question on how spark runs executors on
 different data nodes. Is that primarily decided based on number of
 receivers?

 What do I need to do to ensure that multiple nodes are being used for data
 processing?



Re: SparkPi is getting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-16 Thread Jeff Zhang
Check the examples module's dependencies (right-click examples and click Open
Module Settings); by default scala-library is scoped as provided, and you need
to change it to compile in order to run SparkPi in Intellij. As I remember, you
also need to change the guava and jetty related libraries to compile too.

On Mon, Aug 17, 2015 at 2:14 AM, xiaohe lan zombiexco...@gmail.com wrote:

 Hi,

 I am trying to run SparkPi in Intellij and am getting a NoClassDefFoundError.
 Has anyone else seen this issue before?

 Exception in thread main java.lang.NoClassDefFoundError:
 scala/collection/Seq
 at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
 Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 6 more

 Process finished with exit code 1

 Thanks,
 Xiaohe




-- 
Best Regards

Jeff Zhang


Re: Spark Master HA on YARN

2015-08-16 Thread Jeff Zhang
To make it clear, Spark Standalone is similar to Yarn in that it is a simple
cluster management system.

Spark Master  ---   Yarn Resource Manager
Spark Worker  ---   Yarn Node Manager

On Mon, Aug 17, 2015 at 4:59 AM, Ruslan Dautkhanov dautkha...@gmail.com
wrote:

 There is no Spark master in YARN mode. It's standalone mode terminology.
 In YARN cluster mode, Spark's Application Master (Spark Driver runs in it)
 will be restarted
 automatically by RM up to yarn.resourcemanager.am.max-retries
 times (default is 2).

 --
 Ruslan Dautkhanov

 On Fri, Jul 17, 2015 at 1:29 AM, Bhaskar Dutta bhas...@gmail.com wrote:

 Hi,

 Is Spark master high availability supported on YARN (yarn-client mode)
 analogous to
 https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
 ?

 Thanks
 Bhaskie





-- 
Best Regards

Jeff Zhang


Re: Spark can't fetch application jar after adding it to HTTP server

2015-08-16 Thread Rishi Yadav
Can you tell us more about your environment? I understand you are running it
on a single machine, but is a firewall enabled?

On Sun, Aug 16, 2015 at 5:47 AM, t4ng0 manvendra.tom...@gmail.com wrote:

 Hi

 I am new to Spark and trying to run a standalone application using
 spark-submit. From what I could understand from the logs, Spark can't
 fetch the jar file after adding it to the HTTP server. Do I need to
 configure proxy settings for Spark individually, if that is the problem?
 Otherwise please help me, thanks in advance.

 PS: I am attaching the logs here.

  Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
 SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
 NativeCodeLoader: Unable to load native-hadoop library for your platform...
 using builtin-java classes where applicable 15/08/16 15:20:53 INFO
 SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
 INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
 15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
 acls disabled; users with view permissions: Set(manvendratomar); users with
 modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
 Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
 15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO
 Utils:
 Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
 INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO
 SparkEnv:
 Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
 Created local directory at

 /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
 15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
 MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is

 /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
 15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
 INFO Utils: Successfully started service 'HTTP file server' on port 63986.
 15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
 15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on
 port
 4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
 http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
 target/scala-2.11/spark_matrix_2.11-1.0.jar at
 http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
 1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
 on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started
 service
 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
 15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
 15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
 15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block
 manager
 localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
 15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
 15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
 maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
 stored as values in memory (estimated size 153.6 KB, free 265.3 MB)
 15/08/16
 15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with
 curMem=157248,
 maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
 broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
 265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
 in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
 15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
 partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input
 paths
 to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
 IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
 (reduce at IndexedRowMatrix.scala:65) with 1 output partitions
 (allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
 ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
 DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
 DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
 Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
 IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
 INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
 maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
 stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
 15:20:56 INFO MemoryStore: ensureFreeSpace(2249) called with 

Re: Spark Master HA on YARN

2015-08-16 Thread Ruslan Dautkhanov
There is no Spark master in YARN mode. It's standalone mode terminology.
In YARN cluster mode, Spark's Application Master (Spark Driver runs in it)
will be restarted
automatically by RM up to yarn.resourcemanager.am.max-retries
times (default is 2).

--
Ruslan Dautkhanov

On Fri, Jul 17, 2015 at 1:29 AM, Bhaskar Dutta bhas...@gmail.com wrote:

 Hi,

 Is Spark master high availability supported on YARN (yarn-client mode)
 analogous to
 https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
 ?

 Thanks
 Bhaskie



Example code to spawn multiple threads in driver program

2015-08-16 Thread unk1102
Hi, I have a Spark driver program with one loop that iterates around 2000
times, and on each of those two thousand iterations it executes a job in
YARN. Since the loop does the jobs serially, I want to introduce parallelism.
If I create 2000 tasks/runnables/callables in my Spark driver program, will
they get executed in parallel in the YARN cluster? Please guide me; it would
be great if you could share example code for running multiple threads in a
driver program. I am new to Spark, thanks in advance.
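For what it's worth, a minimal sketch of one common pattern (SparkContext is
thread-safe, so independent actions can be submitted from a bounded pool of
driver threads; the input paths and the per-iteration work are placeholders):

  import java.util.concurrent.Executors
  import scala.concurrent.{Await, ExecutionContext, Future}
  import scala.concurrent.duration.Duration

  // Bounded pool: far fewer threads than iterations; the cluster decides how
  // much actually runs at once.
  implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

  val futures = (1 to 2000).map { i =>
    Future {
      // Each future submits an independent Spark action on the shared SparkContext sc.
      sc.textFile(s"/path/to/input-$i").count()
    }
  }

  val counts = futures.map(f => Await.result(f, Duration.Inf))
  ec.shutdown()

Whether the jobs actually overlap also depends on available resources and the
scheduler configuration (e.g. the FAIR scheduler).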






Understanding the two jobs run with spark sql join

2015-08-16 Thread Todd
Hi, I have a basic Spark SQL join run in local mode. I checked the UI and see
that two jobs are run. Their DAG graphs are pasted at the end.
I have several questions here:
1. It looks like Job0 and Job1 have the same DAG stages, but stage 3 and
stage 4 are skipped. I would like to ask what Job0 and Job1 each do, why they
have the same DAG graph, and why stage 3 and stage 4 are skipped.
2. Job0 has only 5 tasks. What controls the number of tasks in Job0?
3. Job0 has 5 tasks and Job1 has 199 tasks. I thought that the number of tasks
of Job1 is controlled by spark.sql.shuffle.partitions, which is 200 by
default. So why does it show 199 here?
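(A minimal sketch of changing that setting, in case it is useful for question
3; sqlContext here stands for an existing SQLContext:)

  sqlContext.setConf("spark.sql.shuffle.partitions", "100")
  // or, equivalently, through SQL:
  sqlContext.sql("SET spark.sql.shuffle.partitions=100")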



Job0:



Job1:



Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-16 Thread Hemant Bhanawat
In Spark, every action (foreach, collect, etc.) gets converted into a Spark
job, and jobs are executed sequentially.

You may want to refactor your code in calculateUseCase1..4 to run just
transformations (map, flatMap) and call a single action at the end.
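A rough sketch of that idea in Scala (the map bodies are stand-ins for the
real per-use-case calculations, and `filtered` stands for the filtered
message stream from the original code):

  import org.apache.spark.streaming.dstream.DStream

  def buildPipeline(filtered: DStream[String]): Unit = {
    // Transformations only: no action inside the per-use-case steps.
    val useCase1 = filtered.map(m => "uc1: " + m)
    val useCase2 = filtered.map(m => "uc2: " + m)
    val useCase3 = filtered.map(m => "uc3: " + m)
    val useCase4 = filtered.map(m => "uc4: " + m)

    // A single output operation at the end over the combined results.
    useCase1.union(useCase2).union(useCase3).union(useCase4).foreachRDD { rdd =>
      rdd.foreach(println)
    }
  }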

On Sun, Aug 16, 2015 at 3:19 PM, mohanaugust mohanaug...@gmail.com wrote:

 JavaPairReceiverInputDStream<String, byte[]> messages =
 KafkaUtils.createStream(...);
 JavaPairDStream<String, byte[]> filteredMessages =
 filterValidMessages(messages);

 JavaDStream<String> useCase1 = calculateUseCase1(filteredMessages);
 JavaDStream<String> useCase2 = calculateUseCase2(filteredMessages);
 JavaDStream<String> useCase3 = calculateUseCase3(filteredMessages);
 JavaDStream<String> useCase4 = calculateUseCase4(filteredMessages);
 ...

 I retrieve messages from Kafka, filter them, and use the same messages for
 multiple use-cases. Here useCase1 to 4 are independent of each other and can
 be calculated in parallel. However, when I look at the logs, I see that the
 calculations are happening sequentially. How can I make them run in
 parallel? Any suggestion would be helpful.


