Subscribe

2015-08-16 Thread Rishitesh Mishra



GraphX issue on Spark 1.3

2015-08-16 Thread dizzy5112
Hi, I'm using Spark 1.3 and trying some sample code:


when I run:

all works well, but with

it falls over and I get a whole heap of errors:

Is anyone else experiencing this? I've tried different graphs and always end
up with the same results.

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/grpah-x-issue-spark-1-3-tp24292.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-16 Thread Hemant Bhanawat
In Spark, every action (foreach, collect, etc.) gets converted into a Spark
job, and jobs are executed sequentially.

You may want to refactor your code in the calculateUseCaseN methods to run
only transformations (map, flatMap) and call a single action at the end.
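
To make the sequential-jobs point concrete, here is a minimal Scala sketch (not
from the thread; the thread's code uses the Java API, and parse, isUseCase1 and
isUseCase2 are hypothetical helpers; 'sc' is the usual SparkContext):

// Each action launches its own Spark job; by default the driver runs them one after another.
val base = sc.textFile("input.txt").map(parse)

val a = base.filter(isUseCase1).count()   // job 1
val b = base.filter(isUseCase2).count()   // job 2 starts only after job 1 finishes

// Refactored as suggested: keep everything as lazy transformations and finish
// with a single action, so one job computes both results in one pass.
val (countA, countB) = base
  .map(x => (if (isUseCase1(x)) 1L else 0L, if (isUseCase2(x)) 1L else 0L))
  .reduce { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) }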

On Sun, Aug 16, 2015 at 3:19 PM, mohanaugust  wrote:

> JavaPairReceiverInputDStream messages =
> KafkaUtils.createStream(...);
> JavaPairDStream filteredMessages =
> filterValidMessages(messages);
>
> JavaDStream useCase1 = calculateUseCase1(filteredMessages);
> JavaDStream useCase2 = calculateUseCase2(filteredMessages);
> JavaDStream useCase3 = calculateUseCase3(filteredMessages);
> JavaDStream useCase4 = calculateUseCase4(filteredMessages);
> ...
>
> I retrieve messages from Kafka, filter them, and use the same messages for
> multiple use-cases. Here useCase1 to useCase4 are independent of each other and can
> be calculated in parallel. However, when I look at the logs, I see that the
> calculations are happening sequentially. How can I make them run in
> parallel? Any suggestion would be helpful.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Parallel-Processing-of-messages-from-Kafka-Java-tp24284.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Understanding the two jobs run with spark sql join

2015-08-16 Thread Todd
Hi, I have a basic Spark SQL join run in local mode. I checked the UI and
see that two jobs are run. Their DAG graphs are pasted at the end.
I have several questions here:
1. It looks like Job0 and Job1 have the same DAG stages, but stage 3 and
stage 4 are skipped. What do Job0 and Job1 each do, why do they have the
same DAG graph, and why are stage 3 and stage 4 skipped?
2. Job0 has only 5 tasks. What controls the number of tasks in Job0?
3. Job0 has 5 tasks and Job1 has 199 tasks. I thought the number of tasks
in Job1 was controlled by spark.sql.shuffle.partitions, which is 200 by default. So why
does it show 199 here?
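
For reference, a minimal sketch (not from the thread) of how that post-shuffle parallelism is usually adjusted:

// Controls the number of partitions, and hence tasks, used after shuffles in Spark SQL joins and aggregations.
sqlContext.setConf("spark.sql.shuffle.partitions", "200")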



Job0: [DAG graph image not included]

Job1: [DAG graph image not included]



Re: Spark Master HA on YARN

2015-08-16 Thread Jeff Zhang
To make it clear, Spark Standalone, like YARN, is a simple cluster
management system.

Spark Master  <--->   Yarn Resource Manager
Spark Worker  <--->   Yarn Node Manager

On Mon, Aug 17, 2015 at 4:59 AM, Ruslan Dautkhanov 
wrote:

> There is no Spark master in YARN mode; that is Standalone-mode terminology.
> In YARN cluster mode, Spark's Application Master (the Spark driver runs in it)
> will be restarted automatically by the RM up to yarn.resourcemanager.am.max-retries
> times (the default is 2).
>
> --
> Ruslan Dautkhanov
>
> On Fri, Jul 17, 2015 at 1:29 AM, Bhaskar Dutta  wrote:
>
>> Hi,
>>
>> Is Spark master high availability supported on YARN (yarn-client mode)
>> analogous to
>> https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
>> ?
>>
>> Thanks
>> Bhaskie
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: SparkPi is getting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-16 Thread Jeff Zhang
Check the examples module's dependencies (right-click examples and click Open
Module Settings). By default scala-library has "provided" scope; you need to change
it to "compile" to run SparkPi in IntelliJ. As I remember, you also need to
change the guava and jetty related libraries to "compile" too.

On Mon, Aug 17, 2015 at 2:14 AM, xiaohe lan  wrote:

> Hi,
>
> I am trying to run SparkPi in IntelliJ and getting a NoClassDefFoundError.
> Has anyone else seen this issue before?
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> scala/collection/Seq
> at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 6 more
>
> Process finished with exit code 1
>
> Thanks,
> Xiaohe
>



-- 
Best Regards

Jeff Zhang


Re: Can't find directory after resetting REPL state

2015-08-16 Thread Kevin Jung
Thanks Ted, it may be a bug. Here is the JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-10039

Kevin

--- Original Message ---
Sender : Ted Yu
Date : 2015-08-16 11:29 (GMT+09:00)
Title : Re: Can't find directory after resetting REPL state

I tried with master branch and got the following:


http://pastebin.com/2nhtMFjQ



FYI


On Sat, Aug 15, 2015 at 1:03 AM, Kevin Jung  wrote:

The Spark shell can't find the base directory of the class server after running the ":reset"
command.

scala> :reset
scala> 1
uncaught exception during compilation: java.lang.AssertionError
java.lang.AssertionError: assertion failed: Tried to find '$line33' in
'/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory
~~~impossible to run any further commands~~~

I figured out that the reset() method in SparkIMain tries to delete virtualDirectory and
then create it again. But virtualDirectory.create() makes a file, not a directory.
Does anyone face the same problem under Spark 1.4.0?

Kevin





Re: Spark Master HA on YARN

2015-08-16 Thread Ruslan Dautkhanov
There is no Spark master in YARN mode; that is Standalone-mode terminology.
In YARN cluster mode, Spark's Application Master (the Spark driver runs in it)
will be restarted automatically by the RM up to yarn.resourcemanager.am.max-retries
times (the default is 2).
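
A minimal sketch (not from the thread) of how the number of AM/driver attempts is usually
capped from the Spark side; spark.yarn.maxAppAttempts cannot exceed the cluster-wide
yarn.resourcemanager.am.max-attempts set by the YARN admins:

import org.apache.spark.SparkConf

// Request up to 4 attempts for this application's AM (and hence the driver in yarn-cluster mode).
val conf = new SparkConf()
  .setAppName("yarn-ha-sketch")
  .set("spark.yarn.maxAppAttempts", "4")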

--
Ruslan Dautkhanov

On Fri, Jul 17, 2015 at 1:29 AM, Bhaskar Dutta  wrote:

> Hi,
>
> Is Spark master high availability supported on YARN (yarn-client mode)
> analogous to
> https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
> ?
>
> Thanks
> Bhaskie
>


Re: Spark can't fetch application jar after adding it to HTTP server

2015-08-16 Thread Rishi Yadav
Can you tell us more about your environment? I understand you are running it
on a single machine, but is a firewall enabled?

On Sun, Aug 16, 2015 at 5:47 AM, t4ng0  wrote:

> Hi
>
> I am new to Spark and trying to run a standalone application using
> spark-submit. From what I could understand from the logs, Spark can't
> fetch the jar file after adding it to the HTTP server. Do I need to
> configure proxy settings for Spark individually if that is the problem?
> Otherwise please help me, thanks in advance.
>
> PS: i am attaching logs here.
>
>  Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
> SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
> NativeCodeLoader: Unable to load native-hadoop library for your platform...
> using builtin-java classes where applicable 15/08/16 15:20:53 INFO
> SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
> INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
> 15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: Set(manvendratomar); users with
> modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
> Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
> 15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO
> Utils:
> Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
> INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO
> SparkEnv:
> Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
> Created local directory at
>
> /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
> 15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
> MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is
>
> /private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
> 15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
> INFO Utils: Successfully started service 'HTTP file server' on port 63986.
> 15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on
> port
> 4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
> http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
> target/scala-2.11/spark_matrix_2.11-1.0.jar at
> http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
> 1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
> on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started
> service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
> 15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
> 15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
> 15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block
> manager
> localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
> 15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
> 15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
> maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
> stored as values in memory (estimated size 153.6 KB, free 265.3 MB)
> 15/08/16
> 15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with
> curMem=157248,
> maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
> broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
> 265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
> in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
> 15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
> partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input
> paths
> to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
> IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
> (reduce at IndexedRowMatrix.scala:65) with 1 output partitions
> (allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
> ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
> DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
> DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
> Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
> IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
> INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
> maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
> stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
> 15:20:56 I

Example code to spawn multiple threads in driver program

2015-08-16 Thread unk1102
Hi, I have a Spark driver program with a loop that iterates around
2,000 times, and on each iteration it executes a job in YARN. Since the loop
runs the jobs serially, I want to introduce parallelism. If I create 2,000
tasks/runnables/callables in my Spark driver program, will they get executed in
parallel on the YARN cluster? Please guide me; it would be great if you could share
example code for running multiple threads in a driver program. I am new
to Spark, thanks in advance.
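
For reference, a minimal sketch (not from the thread) of one common approach: the
SparkContext is thread-safe, so jobs submitted from different driver threads can be
scheduled concurrently. Here 'sc' is the usual SparkContext and runOneIteration is a
hypothetical stand-in for whatever each loop iteration computes, ending in an action:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// A bounded pool instead of 2000 bare threads; each Future submits one Spark job.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

val results = (1 to 2000).map { i =>
  Future { runOneIteration(sc, i) }   // hypothetical helper; each call ends in an action
}
Await.result(Future.sequence(results), Duration.Inf)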



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Example-code-to-spawn-multiple-threads-in-driver-program-tp24290.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark executor lost because of timeout even after setting quite a long timeout value of 1000 seconds

2015-08-16 Thread unk1102
Hi, I have written a Spark job which seems to work fine for almost an hour,
and after that executors start getting lost because of timeouts. I see the
following log statement:

15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no
recent heartbeats: 1051638 ms exceeds timeout 1000000 ms

I don't see any errors, but I do see the above warning, and because of it the executor
gets removed by YARN and I see "Rpc client disassociated" errors,
IOException: connection refused, and FetchFailedException.

After an executor gets removed I see it being added again and starting to
work, and then some other executors fail again. My question is: is it normal
for executors to get lost? What happens to the tasks the lost executors were
working on? My Spark job keeps running since it is long, around 4-5 hours.
I have a very good cluster with 1.2 TB of memory and a good number of CPU cores. To
solve the above timeout issue I tried to increase the timeout spark.akka.timeout to
1000 seconds, but no luck. I am using the following command to run my Spark
job. Please guide me, I am new to Spark. I am using Spark 1.4.1. Thanks in
advance.

/spark-submit --class com.xyz.abc.MySparkJob  --conf
"spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" --driver-java-options
-XX:MaxPermSize=512m --driver-memory 4g --master yarn-client
--executor-memory 25G --executor-cores 8 --num-executors 5 --jars
/path/to/spark-job.jar
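
For context, a minimal sketch (not from the thread) of the timeout-related settings that
usually govern "no recent heartbeats" executor removal on Spark 1.4.x; the values below
are only examples:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.network.timeout", "1000s")           // umbrella timeout covering heartbeats and RPC
  .set("spark.executor.heartbeatInterval", "30s")  // executors should report in well within that timeout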



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-executor-lost-because-of-time-out-even-after-setting-quite-long-time-out-value-1000-seconds-tp24289.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkPi is getting java.lang.NoClassDefFoundError: scala/collection/Seq

2015-08-16 Thread xiaohe lan
Hi,

I am trying to run SparkPi in IntelliJ and getting a NoClassDefFoundError.
Has anyone else seen this issue before?

Exception in thread "main" java.lang.NoClassDefFoundError:
scala/collection/Seq
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.lang.ClassNotFoundException: scala.collection.Seq
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more

Process finished with exit code 1

Thanks,
Xiaohe


Spark on scala 2.11 build fails due to incorrect jline dependency in REPL

2015-08-16 Thread Stephen Boesch
I am building spark with the following options - most notably the
**scala-2.11**:

 . dev/switch-to-scala-2.11.sh
mvn -Phive -Pyarn -Phadoop-2.6 -Dhadoop2.6.2 -Pscala-2.11 -DskipTests
-Dmaven.javadoc.skip=true clean package


The build goes pretty far but fails in one of the minor modules *repl*:

[INFO]

[ERROR] Failed to execute goal on project spark-repl_2.11: Could not
resolve dependencies
for project org.apache.spark:spark-repl_2.11:jar:1.5.0-SNAPSHOT:
 Could not   find artifact org.scala-lang:jline:jar:2.11.7 in central
 (https://repo1.maven.org/maven2) -> [Help 1]

Upon investigation: from Scala 2.11.5 onward, the Scala-specific version of jline is no
longer required; the default jline distribution is used.

And in fact the repl pom only declares a dependency on jline for the Scala 2.10
profile:


<profile>
  <id>scala-2.10</id>
  <activation>
    <property>
      <name>!scala-2.11</name>
    </property>
  </activation>
  <properties>
    <scala.version>2.10.4</scala.version>
    <scala.binary.version>2.10</scala.binary.version>
    <jline.version>${scala.version}</jline.version>
    <jline.groupid>org.scala-lang</jline.groupid>
  </properties>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>${jline.groupid}</groupId>
        <artifactId>jline</artifactId>
        <version>${jline.version}</version>
      </dependency>
    </dependencies>
  </dependencyManagement>
</profile>


So then it is not clear why this error is occurring. Pointers appreciated.


Re: Difference between Sort based and Hash based shuffle

2015-08-16 Thread Muhammad Haseeb Javed
I did check it out, and although I got a general understanding of the
various classes used to implement the Sort and Hash shuffles, the
slides lack details about how they are implemented and why Sort generally
performs better than Hash.

On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran 
wrote:

> Have a look at this presentation.
> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of
> help to you.
>
> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
> 11besemja...@seecs.edu.pk> wrote:
>
>> What are the major differences between how the Sort-based and Hash-based
>> shuffles operate, and what is it that causes the Sort shuffle to perform better
>> than the Hash shuffle?
>> Any talks that discuss both shuffles in detail, how they are implemented,
>> and the performance gains?
>>
>
>


Spark can't fetch the jar added to the HTTP server

2015-08-16 Thread t4ng0
Hi,

I have been trying to run a standalone application using spark-submit, but
somehow, although Spark started the HTTP server and added the jar file to it, it is
unable to fetch the jar file. I am running the Spark cluster on localhost.
If anyone can help me find what I am missing here, thanks in advance.

LOGS: 
 Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 15/08/16 15:20:53 INFO
SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(manvendratomar); users with
modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO Utils:
Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO SparkEnv:
Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
Created local directory at
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
INFO Utils: Successfully started service 'HTTP file server' on port 63986.
15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on port
4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
target/scala-2.11/spark_matrix_2.11-1.0.jar at
http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block manager
localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
stored as values in memory (estimated size 153.6 KB, free 265.3 MB) 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=157248,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input paths
to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
(reduce at IndexedRowMatrix.scala:65) with 1 output partitions
(allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
15:20:56 INFO MemoryStore: ensureFreeSpace(2249) called with curMem=175569,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block
broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free
265.2 MB) 15/08/16 15:20:56 INFO BlockManagerInfo: Added broadcast_1_piece0
in memory on localhost:63987 (size: 2.2 KB, free: 265.4 MB) 15/08/16
15:20:56 INFO SparkContext: Creat

Re: Error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

2015-08-16 Thread Rishi Yadav
Try --jars rather than --class to submit the jar.



On Fri, Aug 14, 2015 at 6:19 AM, Stephen Boesch  wrote:

> The NoClassDefFoundError differs from ClassNotFoundException: it
> indicates an error while initializing that class, but the class is found on
> the classpath. Please provide the full stack trace.
>
> 2015-08-14 4:59 GMT-07:00 stelsavva :
>
>> Hello, I am just starting out with Spark Streaming and HBase/Hadoop. I'm
>> writing a simple app to read from Kafka and store to HBase, and I am having
>> trouble submitting my job to Spark.
>>
>> I've downloaded Apache Spark 1.4.1 pre-built for Hadoop 2.6.
>>
>> I am building the project with mvn package
>>
>> and submitting the jar file with
>>
>>  ~/Desktop/spark/bin/spark-submit --class org.example.main.scalaConsumer
>> scalConsumer-0.0.1-SNAPSHOT.jar
>>
>> And then I am getting the error you see in the subject line. Is this a
>> problem with my Maven dependencies? Do I need to install Hadoop locally?
>> And if so, how can I add the Hadoop classpath to the Spark job?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Error-Exception-in-thread-main-java-lang-NoClassDefFoundError-org-apache-hadoop-hbase-HBaseConfiguran-tp24266.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to submit an application using spark-submit

2015-08-16 Thread t4ng0
Hi,

I have been trying to run a standalone application using spark-submit, but
somehow, although Spark started the HTTP server and added the jar file to it, it is
unable to fetch the jar file. I am running the Spark cluster on localhost.
If anyone can help me find what I am missing here, thanks in advance.

LOGS:
 Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 15/08/16 15:20:53 INFO
SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(manvendratomar); users with
modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO Utils:
Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO SparkEnv:
Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
Created local directory at
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
INFO Utils: Successfully started service 'HTTP file server' on port 63986.
15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on port
4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
target/scala-2.11/spark_matrix_2.11-1.0.jar at
http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block manager
localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
stored as values in memory (estimated size 153.6 KB, free 265.3 MB) 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=157248,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input paths
to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
(reduce at IndexedRowMatrix.scala:65) with 1 output partitions
(allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
15:20:56 INFO MemoryStore: ensureFreeSpace(2249) called with curMem=175569,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block
broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free
265.2 MB) 15/08/16 15:20:56 INFO BlockManagerInfo: Added broadcast_1_piece0
in memory on localhost:63987 (size: 2.2 KB, free: 265.4 MB) 15/08/16
15:20:56 INFO SparkContext: Created

Re: Re: Can't understand the size of raw RDD and its DataFrame

2015-08-16 Thread Rishi Yadav
DataFrames, in simple terms, are RDDs combined with a schema. In reality they
are much more than that and provide a very fine level of optimization;
check out Project Tungsten.

In your case it was one column, as you chose. By default, a DataFrame keeps the same
columns as the RDD (the same as the fields of the case class, if you created the RDD from a
case class).


Author: Spark Cookbook (Packt)


On Sat, Aug 15, 2015 at 10:01 PM, Todd  wrote:

> I thought that the df only contains one column, and actually contains only
> one resulting row (select avg(age) from theTable).
> So I would think that it would take less space; it looks like my understanding is
> wrong?
>
>
>
>
>
> At 2015-08-16 12:34:31, "Rishi Yadav"  wrote:
>
> Why are you expecting the footprint of the DataFrame to be lower when it contains
> more information (RDD + schema)?
>
> On Sat, Aug 15, 2015 at 6:35 PM, Todd  wrote:
>
>> Hi,
>> With the following code snippet, I cached the raw RDD (which is already in
>> memory, but just for illustration) and its DataFrame.
>> I thought that the df cache would take less space than the rdd
>> cache, which turns out to be wrong: from the UI I see that the rdd cache takes
>> 168 B, while the df cache takes 272 B.
>> What data is cached when df.cache is called and the data is actually cached?
>> It looks like the df only cached the avg(age), which should be much smaller
>> in size.
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>> import org.apache.spark.sql.SQLContext
>>
>> case class Student(name: String, age: Int)  // assumed definition, not shown in the original mail
>>
>> val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
>> val sc = new SparkContext(conf)
>> val sqlContext = new SQLContext(sc)
>> import sqlContext.implicits._
>> val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
>> rdd.cache()
>> rdd.toDF().registerTempTable("TBL_STUDENT")
>> val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
>> df.cache()
>> df.show()
>>
>>
>


Spark can't fetch application jar after adding it to HTTP server

2015-08-16 Thread t4ng0
Hi 

I am new to Spark and trying to run a standalone application using
spark-submit. From what I could understand from the logs, Spark can't
fetch the jar file after adding it to the HTTP server. Do I need to
configure proxy settings for Spark individually if that is the problem?
Otherwise please help me, thanks in advance.

PS: i am attaching logs here. 

 Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties 15/08/16 15:20:52 INFO
SparkContext: Running Spark version 1.4.1 15/08/16 15:20:53 WARN
NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable 15/08/16 15:20:53 INFO
SecurityManager: Changing view acls to: manvendratomar 15/08/16 15:20:53
INFO SecurityManager: Changing modify acls to: manvendratomar 15/08/16
15:20:53 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(manvendratomar); users with
modify permissions: Set(manvendratomar) 15/08/16 15:20:53 INFO Slf4jLogger:
Slf4jLogger started 15/08/16 15:20:53 INFO Remoting: Starting remoting
15/08/16 15:20:54 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@10.133.201.51:63985] 15/08/16 15:20:54 INFO Utils:
Successfully started service 'sparkDriver' on port 63985. 15/08/16 15:20:54
INFO SparkEnv: Registering MapOutputTracker 15/08/16 15:20:54 INFO SparkEnv:
Registering BlockManagerMaster 15/08/16 15:20:54 INFO DiskBlockManager:
Created local directory at
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/blockmgr-ed753ee2-726e-45ae-97a2-0c82923262ef
15/08/16 15:20:54 INFO MemoryStore: MemoryStore started with capacity 265.4
MB 15/08/16 15:20:54 INFO HttpFileServer: HTTP File server directory is
/private/var/folders/6z/2j2qj3rn3z32tymm_9ybz960gn/T/spark-bd1c72c8-5d27-4751-ae6f-f298451e4f66/httpd-68c0afb2-92ef-4ddd-9ffc-639ccac81d6a
15/08/16 15:20:54 INFO HttpServer: Starting HTTP Server 15/08/16 15:20:54
INFO Utils: Successfully started service 'HTTP file server' on port 63986.
15/08/16 15:20:54 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/16 15:20:54 INFO Utils: Successfully started service 'SparkUI' on port
4040. 15/08/16 15:20:54 INFO SparkUI: Started SparkUI at
http://10.133.201.51:4040 15/08/16 15:20:54 INFO SparkContext: Added JAR
target/scala-2.11/spark_matrix_2.11-1.0.jar at
http://10.133.201.51:63986/jars/spark_matrix_2.11-1.0.jar with timestamp
1439718654603 15/08/16 15:20:54 INFO Executor: Starting executor ID driver
on host localhost 15/08/16 15:20:54 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 63987.
15/08/16 15:20:54 INFO NettyBlockTransferService: Server created on 63987
15/08/16 15:20:54 INFO BlockManagerMaster: Trying to register BlockManager
15/08/16 15:20:54 INFO BlockManagerMasterEndpoint: Registering block manager
localhost:63987 with 265.4 MB RAM, BlockManagerId(driver, localhost, 63987)
15/08/16 15:20:54 INFO BlockManagerMaster: Registered BlockManager 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(157248) called with curMem=0,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block broadcast_0
stored as values in memory (estimated size 153.6 KB, free 265.3 MB) 15/08/16
15:20:55 INFO MemoryStore: ensureFreeSpace(14257) called with curMem=157248,
maxMem=278302556 15/08/16 15:20:55 INFO MemoryStore: Block
broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free
265.2 MB) 15/08/16 15:20:55 INFO BlockManagerInfo: Added broadcast_0_piece0
in memory on localhost:63987 (size: 13.9 KB, free: 265.4 MB) 15/08/16
15:20:55 INFO SparkContext: Created broadcast 0 from textFile at
partition.scala:20 15/08/16 15:20:56 INFO FileInputFormat: Total input paths
to process : 1 15/08/16 15:20:56 INFO SparkContext: Starting job: reduce at
IndexedRowMatrix.scala:65 15/08/16 15:20:56 INFO DAGScheduler: Got job 0
(reduce at IndexedRowMatrix.scala:65) with 1 output partitions
(allowLocal=false) 15/08/16 15:20:56 INFO DAGScheduler: Final stage:
ResultStage 0(reduce at IndexedRowMatrix.scala:65) 15/08/16 15:20:56 INFO
DAGScheduler: Parents of final stage: List() 15/08/16 15:20:56 INFO
DAGScheduler: Missing parents: List() 15/08/16 15:20:56 INFO DAGScheduler:
Submitting ResultStage 0 (MapPartitionsRDD[6] at map at
IndexedRowMatrix.scala:65), which has no missing parents 15/08/16 15:20:56
INFO MemoryStore: ensureFreeSpace(4064) called with curMem=171505,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block broadcast_1
stored as values in memory (estimated size 4.0 KB, free 265.2 MB) 15/08/16
15:20:56 INFO MemoryStore: ensureFreeSpace(2249) called with curMem=175569,
maxMem=278302556 15/08/16 15:20:56 INFO MemoryStore: Block
broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free
265.2 MB) 15/08/16 15:20:56 INFO BlockManagerInfo: Added broadcast_1_piece0
in memory on localhost:63987 (size: 2.2 KB, fre

Re: Executors on multiple nodes

2015-08-16 Thread Sandy Ryza
Hi Mohit,

It depends on whether dynamic allocation is turned on.  If not, the number
of executors is specified by the user with the --num-executors option.  If
dynamic allocation is turned on, refer to the doc for details:
https://spark.apache.org/docs/1.4.0/job-scheduling.html#dynamic-resource-allocation
.
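
As an illustration (not from the thread), the settings involved in the dynamic case;
the static case is simply --num-executors N (i.e. spark.executor.instances):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // the external shuffle service must run on each NodeManager
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")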

-Sandy


On Sat, Aug 15, 2015 at 6:40 AM, Mohit Anchlia 
wrote:

> I am running on Yarn and do have a question on how spark runs executors on
> different data nodes. Is that primarily decided based on number of
> receivers?
>
> What do I need to do to ensure that multiple nodes are being used for data
> processing?
>


Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-16 Thread mohanaugust
JavaPairReceiverInputDStream messages =
KafkaUtils.createStream(...);
JavaPairDStream filteredMessages =
filterValidMessages(messages);

JavaDStream useCase1 = calculateUseCase1(filteredMessages);
JavaDStream useCase2 = calculateUseCase2(filteredMessages);
JavaDStream useCase3 = calculateUseCase3(filteredMessages);
JavaDStream useCase4 = calculateUseCase4(filteredMessages);
...

I retrieve messages from Kafka, filter them, and use the same messages for
multiple use-cases. Here useCase1 to useCase4 are independent of each other and can
be calculated in parallel. However, when I look at the logs, I see that the
calculations are happening sequentially. How can I make them run in
parallel? Any suggestion would be helpful.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Parallel-Processing-of-messages-from-Kafka-Java-tp24284.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: TestSQLContext compilation error when running SparkPi in IntelliJ?

2015-08-16 Thread canan chen
Thanks Andrew.



On Sun, Aug 16, 2015 at 1:53 PM, Andrew Or  wrote:

> Hi Canan, TestSQLContext is no longer a singleton but is now a class. It was
> never meant to be a fully public API, but if you wish to use it you can
> just instantiate a new one:
>
> val sqlContext = new TestSQLContext
>
> or just create a new SQLContext from a SparkContext.
>
> -Andrew
>
> 2015-08-15 20:33 GMT-07:00 canan chen :
>
>> I am not sure how other people set up their Spark debugging environment (I mean for the
>> master branch). Can anyone share their experience?
>>
>>
>> On Sun, Aug 16, 2015 at 10:40 AM, canan chen  wrote:
>>
>>> I imported the Spark source code into IntelliJ and want to run SparkPi in
>>> IntelliJ, but I am getting the following weird compilation error. I googled it and
>>> sbt clean doesn't work for me. I am not sure whether anyone else has met
>>> this issue too; any help is appreciated.
>>>
>>> Error:scalac:
>>>  while compiling:
>>> /Users/root/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala
>>> during phase: jvm
>>>  library version: version 2.10.4
>>> compiler version: version 2.10.4
>>>   reconstructed args: -nobootcp -javabootclasspath : -deprecation
>>> -feature -classpath
>>>
>>
>>
>


Spark hangs on collect (stuck on scheduler delay)

2015-08-16 Thread Sagi r
Hi,
I'm building a Spark application in which I load some data from an
Elasticsearch cluster (using the latest elasticsearch-hadoop connector) and
continue to perform some calculations on the Spark cluster.

In one case, I use collect on the RDD as soon as it is created (loaded from
ES).
However, it sometimes hangs on one (and sometimes more than one) node and doesn't
continue.
In the web UI, I can see that one node is stuck on scheduler delay and
prevents the job from continuing
(while the others have finished).

Do you have any idea what is going on here?

The data that is being loaded is fairly small, and only gets mapped once to
domain objects before being collected.

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-on-collect-stuck-on-scheduler-delay-tp24283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org