Spark Standard Application to Test

2015-02-25 Thread danilopds
Hello,
I am preparing some tests to run on Spark in order to vary properties and
check the variations in the results.

For this, I need to use a standard benchmark application in my environment,
like the well-known Hadoop apps: TeraSort
(https://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html)
and especially Terrier (http://terrier.org/), or something similar. I do not
need WordCount or Grep because I have already used them.

Can anyone suggest something?
Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Standard-Application-to-Test-tp21803.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]

2015-02-05 Thread danilopds
I solved the question with this code:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each "height weight" line into a dense vector
val data = sc.textFile("/opt/testAppSpark/data/height-weight.txt").map { line =>
  Vectors.dense(line.split(' ').map(_.toDouble))
}.cache()

// Train K-means with 3 clusters and 20 iterations
val cluster = KMeans.train(data, 3, 20)

// Pair each point (as a readable string) with its predicted cluster
val vectorsAndClusterIdx = data.map { point =>
  val prediction = cluster.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect()



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Show-an-element-in-RDD-Int-Iterable-Array-Double-tp21521p21522.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




MLlib - Show an element in RDD[(Int, Iterable[Array[Double]])]

2015-02-05 Thread danilopds
Hi,
I'm learning Spark and testing the Spark MLlib library with the K-means
algorithm.

So,
I created a file height-weight.txt like this:
65.0 220.0
73.0 160.0
59.0 110.0
61.0 120.0
...

And the code (executed in spark-shell):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("/opt/testAppSpark/data/height-weight.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
val numCluster = 3
val numIterations = 30
val cluster = KMeans.train(parsedData, numCluster, numIterations)
// Group the raw points by the cluster K-means assigns to each one
val groups = data.map(_.split(' ').map(_.toDouble)).groupBy { arr =>
  cluster.predict(Vectors.dense(arr))
}
groups.collect

When I typed groups.collect, I received output like:
res29: Array[(Int, Iterable[Array[Double]])] =
Array((0,CompactBuffer([D@12c6123d, [D@9d76c6c, [D@1e0f2b80, [D@75f0efea,
[D@1d172824, [D@5b4c6267, [D@73d08704)), (2,CompactBuffer([D@7f505302,
[D@7279e99a, [D@21d7b82d, [D@597ca3b6, [D@5e02fa0)),
(1,CompactBuffer([D@4156b463, [D@235cf118, [D@2ad870cb, [D@67d53566,
[D@5ea4f0cb, [D@1ebccff8, [D@7df9b28b, [D@1439044a)))

Typing groups at the command line I see:
res1: org.apache.spark.rdd.RDD[(Int, Iterable[Array[Double]])] =
ShuffledRDD[28] at groupBy at <console>:24

How can I see the results?
Thanks.
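
The [D@... entries are just the default toString of Array[Double], so the
values are there but unreadable. A minimal sketch of one way to print them,
assuming the groups RDD defined above (the mkString formatting is only
illustrative):

groups.collect().foreach { case (clusterId, points) =>
  // points is an Iterable[Array[Double]]; mkString makes the values visible
  val rendered = points.map(_.mkString("[", ", ", "]")).mkString(" ")
  println(s"cluster $clusterId: $rendered")
}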







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Show-an-element-in-RDD-Int-Iterable-Array-Double-tp21521.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark metrics for ganglia

2014-12-15 Thread danilopds
Thanks tsingfu,

I used this configuration, based on your post (with Ganglia in unicast mode):
# Enable GangliaSink for all instances 
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink 
*.sink.ganglia.host=10.0.0.7
*.sink.ganglia.port=8649 
*.sink.ganglia.period=15
*.sink.ganglia.unit=seconds 
*.sink.ganglia.ttl=1 
*.sink.ganglia.mode=unicast

Then,
I now get the following error:
ERROR metrics.MetricsSystem: Sink class
org.apache.spark.metrics.sink.GangliaSink  cannot be instantialized
java.lang.ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink
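
A likely cause, noted here as an assumption since the thread does not confirm
it: the GangliaSink class only ships in assemblies built with the
spark-ganglia-lgpl profile (or SPARK_GANGLIA_LGPL=true for sbt builds), so a
stock assembly throws ClassNotFoundException no matter what metrics.properties
says. A quick check, with the assembly path being illustrative:

jar tf lib/spark-assembly-*.jar | grep GangliaSink

If nothing is printed, the assembly needs to be rebuilt with the Ganglia
profile enabled.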





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-metrics-for-ganglia-tp14335p20690.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Can not see any spark metrics on ganglia-web

2014-12-04 Thread danilopds
I used the command below because I'm using Spark 1.0.2 built with sbt, and it
worked.

SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true SPARK_GANGLIA_LGPL=true sbt/sbt
assembly



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p20384.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark metrics for ganglia

2014-12-04 Thread danilopds
Hello Samudrala,

Did you solve this issue about viewing metrics in Ganglia?
I have the same problem.

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-metrics-for-ganglia-tp14335p20385.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: A question about streaming throughput

2014-10-15 Thread danilopds
Ok,
I understand.

But in both cases the data are on the same processing node.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/A-question-about-streaming-throughput-tp16416p16501.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




A question about streaming throughput

2014-10-14 Thread danilopds
Hi,
I'm learning about Apache Spark Streaming and I'm doing some tests.

Now,
I have a modified version of the NetworkWordCount app that performs a
reduceByKeyAndWindow with a window of 10 seconds sliding every 5 seconds.

I'm also using this function to measure the rate in records/second:

words.foreachRDD(rdd => {
  val count = rdd.count()
  println("Current rate: " + (count / 1) + " records/second")
})

Then,
on my computer with 4 cores and 8 GB (running local[4]) I get this
average result:
Current rate: 130,000

Running locally with my computer as master and worker I get:
Current rate: 25,000

And running on an Azure cloud VM with 4 cores and 7 GB, the result is:
Current rate: 10,000

I read the Spark Streaming paper
(http://www.eecs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf),
and in its performance evaluation a similar application reached 250,000
records/second.

To send data to the socket I'm using an application similar to this one:
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-code-to-simulate-a-network-socket-data-source-td3431.html#a13814

So,
can anyone suggest something to improve these rates?
(I increased the executor memory but didn't get better results.)

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/A-question-about-streaming-throughput-tp16416.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread danilopds
Hi tsingfu,

I want to see metrics in Ganglia too,
but I don't understand this step:
./make-distribution.sh --tgz --skip-java-test -Phadoop-2.3 -Pyarn -Phive
-Pspark-ganglia-lgpl 

Are you installing Hadoop, YARN, Hive AND Ganglia?

What if I want to install just Ganglia?
Can you suggest something?

Thanks!
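
For what it's worth, the -P flags in that command are Maven build profiles:
they compile the corresponding support into the Spark assembly rather than
installing Hadoop, YARN, Hive or Ganglia themselves. A minimal sketch of a
build that only adds the Ganglia sink (reusing the flags from the command
quoted above):

./make-distribution.sh --tgz --skip-java-test -Pspark-ganglia-lgpl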



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p15631.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Can not see any spark metrics on ganglia-web

2014-10-02 Thread danilopds
Ok Krishna Sankar,

In relation to this information on the Spark monitoring webpage:
"For sbt users, set the SPARK_GANGLIA_LGPL environment variable before
building. For Maven users, enable the -Pspark-ganglia-lgpl profile."

Do you know what I need to do to build it with sbt?
Thanks.
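
A minimal sketch of the sbt variant described in that quote, assuming a Spark
1.0.x source tree (the same command appears in the 2014-12-04 message earlier
in this archive):

SPARK_GANGLIA_LGPL=true sbt/sbt assembly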



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p15636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark Monitoring with Ganglia

2014-10-01 Thread danilopds
Hi,
I need to monitor some aspects of my cluster, like network and resources.

Ganglia looks like a good option for what I need.
Then I found out that Spark has support for Ganglia.

On the Spark monitoring webpage there is this information:
"To install the GangliaSink you'll need to perform a custom build of Spark."

I found the directory /extras/spark-ganglia-lgpl in my Spark tree, but I don't
know how to install it.

How can I install Ganglia to monitor a Spark cluster?
How do I do this custom build?

Thanks!
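
Once Spark itself has been rebuilt with the Ganglia profile
(SPARK_GANGLIA_LGPL for sbt, -Pspark-ganglia-lgpl for Maven, as quoted
elsewhere in this archive), the sink is enabled through conf/metrics.properties.
A minimal sketch, assuming a gmond listening at ganglia-host:8649 (host and
port are placeholders):

*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=ganglia-host
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds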




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Monitoring-with-Ganglia-tp15538.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Question About Submit Application

2014-10-01 Thread danilopds
I'll do this test and reply with the result afterwards.

Thank you Marcelo.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072p15539.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Question About Submit Application

2014-09-24 Thread danilopds
Hello,
I'm learning about Spark Streaming and I'm really excited.
Today I was testing packaging some apps and submitting them to a standalone
cluster on my computer locally.
It worked fine.

So,
I created one virtual machine with a bridged network and tried to submit the
app again, this time from my local PC to the VM.

But I had some errors, like these:
WARN AppClient$ClientActor: Could not connect to
akka.tcp://sparkMaster@spark-01:7077:
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkMaster@spark-01:7077]
14/09/24 16:28:48 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient memory

Can I submit an application from another computer to a remote master with
workers?
Can anybody suggest something?

Thanks a lot.
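
For reference, a minimal sketch of a remote submission against a standalone
master, assuming the master really is listening as spark-01:7077 and with the
class and jar names as placeholders (one common cause of the "Association
failed" error is the master host not resolving the same way from the
submitting machine as it does inside the cluster):

./bin/spark-submit --class MyStreamingApp --master spark://spark-01:7077 \
  /path/to/my-streaming-app.jar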



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark Streaming Twitter Example Error

2014-09-24 Thread danilopds
I solved this question using the SBT plugin sbt-assembly.
It's very good!

Bye.
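
A minimal sketch of that sbt-assembly setup, assuming a plugin version from
that era (the version and file name are illustrative, not confirmed by the
thread):

// project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

Running sbt assembly then produces a fat jar that bundles the twitter4j
classes, and that jar is what gets passed to spark-submit instead of the plain
package jar (older plugin versions also require adding assemblySettings to
build.sbt).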



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Twitter-Example-Error-tp12600p15073.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Memory/Network Intensive Workload

2014-09-24 Thread danilopds
Thank you for the suggestion!
Bye.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-Network-Intensive-Workload-tp8501p15074.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Question About Submit Application

2014-09-24 Thread danilopds
One more piece of information:

when I submitted the application from my local PC to my VM,
the VM was both the master and the worker, and my local PC wasn't part of the
cluster.

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072p15076.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: streaming: code to simulate a network socket data source

2014-09-09 Thread danilopds
Hello Diana,

How can I include this implementation in my code, in terms of starting this
task together with NetworkWordCount?

In my case, I have a directory with several files.

Then,
I include this line:
  StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles)

But the program stays in my loop over the files, and only afterwards returns
to NetworkWordCount.

Can you suggest something to start these tasks in parallel?
Thanks!
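
A minimal sketch of one way to run the generator concurrently with the
streaming job, assuming streamingGenerator blocks until it has sent all files
(StreamingDataGenerator, NetPort, BytesSecond and DirFiles are the names from
the post above):

// Run the socket data generator in a background thread so the
// StreamingContext can be started, and connect to it, in parallel.
val generatorThread = new Thread(new Runnable {
  def run(): Unit = {
    StreamingDataGenerator.streamingGenerator(NetPort, BytesSecond, DirFiles)
  }
})
generatorThread.setDaemon(true)
generatorThread.start()

// ... then build the DStream (e.g. ssc.socketTextStream on NetPort),
// and call ssc.start() and ssc.awaitTermination() as usual.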



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-code-to-simulate-a-network-socket-data-source-tp3431p13814.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: streaming: code to simulate a network socket data source

2014-09-09 Thread danilopds
I ran this code separately, but the program blocks at this line:
val socket = listener.accept()

Do you have any suggestion?
Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-code-to-simulate-a-network-socket-data-source-tp3431p13817.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Records - Input Byte

2014-09-08 Thread danilopds
Hi,

I was reading the Spark Streaming paper:
"Discretized Streams: Fault-Tolerant Streaming Computation at Scale"

So,
I read that the performance evaluation used 100-byte input records in the Grep
and WordCount tests.

I don't have much experience, and I'd like to know how I can control this size
in my own records (like words in an input file).
Can anyone suggest something to start with?

Thanks!
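
A minimal sketch of one way to produce fixed-size records, assuming plain
ASCII text so that characters and bytes coincide (the 100-byte size and the
padding character are only illustrative):

val recordSize = 100

// Pad or truncate each line so every record sent to the socket (or written
// to the input file) is exactly recordSize bytes.
def toFixedSizeRecord(line: String): String = {
  val padded = line + (" " * recordSize)
  padded.substring(0, recordSize)
}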



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Records-Input-Byte-tp13733.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark Streaming Twitter Example Error

2014-08-21 Thread danilopds
Hi!

I'm getting started with Spark Streaming development, and I'm learning from
the examples available in the Spark directory. There are several applications
and I want to make modifications.

I can execute TwitterPopularTags normally with the command:
./bin/run-example TwitterPopularTags auth

So, 
I moved the source code to a separate folder with the structure:
./src/main/scala/

With the files:
-TwitterPopularTags
-TwitterUtils
-StreamingExamples
-TwitterInputDStream

But when I run the command:
./bin/spark-submit --class TwitterPopularTags --master local[4]
/MY_DIR/TwitterTest/target/scala-2.10/simple-project_2.10-1.0.jar auth

I receive the following error:
Exception in thread "main" java.lang.NoClassDefFoundError:
twitter4j/auth/Authorization
at TwitterUtils$.createStream(TwitterUtils.scala:42)
at TwitterPopularTags$.main(TwitterPopularTags.scala:65)
at TwitterPopularTags.main(TwitterPopularTags.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: twitter4j.auth.Authorization
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 10 more

This is my sbt build file:
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.2"

libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"

libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

Can anybody help me?
Thanks a lot!
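
A likely cause, offered as an assumption: sbt package only puts the project's
own classes into simple-project_2.10-1.0.jar, so the twitter4j classes are
missing at runtime. One workaround sketch is to pass the dependency jars
explicitly (the paths are illustrative; depending on the Spark version,
--driver-class-path may also be needed for classes used on the driver side):

./bin/spark-submit --class TwitterPopularTags --master local[4] \
  --jars /path/to/twitter4j-core-3.0.3.jar,/path/to/twitter4j-stream-3.0.3.jar \
  /MY_DIR/TwitterTest/target/scala-2.10/simple-project_2.10-1.0.jar auth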



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Twitter-Example-Error-tp12600.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Memory/Network Intensive Workload

2014-06-29 Thread danilopds
Hello,
I'm studying the Spark platform and I'd like to run experiments on its
extension, Spark Streaming.

So,
I guess that memory- and network-intensive workloads are a good option.
Can anyone suggest a few typical Spark Streaming workloads that are
network/memory intensive?

Other suggestions for good Spark Streaming workloads would be interesting
too.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-Network-Intensive-Workload-tp8501.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Interconnect benchmarking

2014-06-27 Thread danilopds
Hi,
According to the research paper below by Matei Zaharia, Spark's creator:
http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf

He says on page 10 that:
"Grep is network-bound due to the cost to replicate the input data to
multiple nodes."

So,
I guess it can be a good initial recommendation.
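
A minimal sketch of such a streaming Grep workload, assuming a socket text
source on localhost:9999 and a placeholder pattern (both are illustrative, not
from the thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingGrep")
val ssc = new StreamingContext(conf, Seconds(1))

// Count the lines of each batch that match the pattern; replicating the
// socket input across nodes is what makes this workload network-heavy.
val lines = ssc.socketTextStream("localhost", 9999)
lines.filter(_.contains("ERROR")).count().print()

ssc.start()
ssc.awaitTermination()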

But I would like to know other workloads too.
Best Regards.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Interconnect-benchmarking-tp8467p8470.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.