Re: Installing Spark on Mac
Hi Aida,

The build detected Maven version 3.0.3; Spark requires at least 3.3.3. Update Maven and try again. (On a Mac, "brew install maven" is one way to get a newer version; the quoted log shows build/mvn picking up the old /usr/bin/mvn from your PATH.)

On 08/Mar/2016 14:06, "Aida" wrote:

> Hi all,
>
> Thanks everyone for your responses; really appreciate it.
>
> Eduardo - I tried your suggestions but ran into some issues, please see below:
>
> ukdrfs01:Spark aidatefera$ cd spark-1.6.0
> ukdrfs01:spark-1.6.0 aidatefera$ build/mvn -DskipTests clean package
> Using `mvn` from path: /usr/bin/mvn
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
> [INFO] Scanning for projects...
> [INFO] Reactor Build Order:
> [INFO]
> [INFO] Spark Project Parent POM
> [INFO] Spark Project Test Tags
> [INFO] Spark Project Launcher
> [INFO] Spark Project Networking
> [INFO] Spark Project Shuffle Streaming Service
> [INFO] Spark Project Unsafe
> [INFO] Spark Project Core
> [INFO] Spark Project Bagel
> [INFO] Spark Project GraphX
> [INFO] Spark Project Streaming
> [INFO] Spark Project Catalyst
> [INFO] Spark Project SQL
> [INFO] Spark Project ML Library
> [INFO] Spark Project Tools
> [INFO] Spark Project Hive
> [INFO] Spark Project Docker Integration Tests
> [INFO] Spark Project REPL
> [INFO] Spark Project Assembly
> [INFO] Spark Project External Twitter
> [INFO] Spark Project External Flume Sink
> [INFO] Spark Project External Flume
> [INFO] Spark Project External Flume Assembly
> [INFO] Spark Project External MQTT
> [INFO] Spark Project External MQTT Assembly
> [INFO] Spark Project External ZeroMQ
> [INFO] Spark Project External Kafka
> [INFO] Spark Project Examples
> [INFO] Spark Project External Kafka Assembly
> [INFO]
> [INFO] Building Spark Project Parent POM 1.6.0
> [INFO]
> [INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @ spark-parent_2.10 ---
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4:enforce (enforce-versions) @ spark-parent_2.10 ---
> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message:
> Detected Maven Version: 3.0.3 is not in the allowed range 3.3.3.
> [INFO]
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ................... FAILURE [0.821s]
> [INFO] Spark Project Test Tags .................... SKIPPED
> [INFO] Spark Project Launcher ..................... SKIPPED
> [INFO] Spark Project Networking ................... SKIPPED
> [INFO] Spark Project Shuffle Streaming Service .... SKIPPED
> [INFO] Spark Project Unsafe ....................... SKIPPED
> [INFO] Spark Project Core ......................... SKIPPED
> [INFO] Spark Project Bagel ........................ SKIPPED
> [INFO] Spark Project GraphX ....................... SKIPPED
> [INFO] Spark Project Streaming .................... SKIPPED
> [INFO] Spark Project Catalyst ..................... SKIPPED
> [INFO] Spark Project SQL .......................... SKIPPED
> [INFO] Spark Project ML Library ................... SKIPPED
> [INFO] Spark Project Tools ........................ SKIPPED
> [INFO] Spark Project Hive ......................... SKIPPED
> [INFO] Spark Project Docker Integration Tests ..... SKIPPED
> [INFO] Spark Project REPL ......................... SKIPPED
> [INFO] Spark Project Assembly ..................... SKIPPED
> [INFO] Spark Project External Twitter ............. SKIPPED
> [INFO] Spark Project External Flume Sink .......... SKIPPED
> [INFO] Spark Project External Flume ............... SKIPPED
> [INFO] Spark Project External Flume Assembly ...... SKIPPED
> [INFO] Spark Project External MQTT ................ SKIPPED
> [INFO] Spark Project External MQTT Assembly ....... SKIPPED
> [INFO] Spark Project External ZeroMQ .............. SKIPPED
> [INFO] Spark Project External Kafka ............... SKIPPED
> [INFO] Spark Project Examples ..................... SKIPPED
> [INFO] Spark Project External Kafka Assembly ...... SKIPPED
> [INFO]
> [INFO] BUILD FAILURE
> [INFO] Total time: 1.745s
> [INFO] Finished at: Tue Mar 08 18:01:48 GMT 2016
> [INFO] Final Memory: 19M/183M
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:1.4:enforce (enforce-versions) on project spark-parent_2.10: Some Enforcer rules have failed. Look above for specific messages explaining why
Re: Installing Spark on Mac
Hi Aida,

Run only "build/mvn -DskipTests clean package".

BR

Eduardo Costa Alfaia
Ph.D. Student in Telecommunications Engineering
Università degli Studi di Brescia
Tel: +39 3209333018

On 3/4/16, 16:18, "Aida" <aida1.tef...@gmail.com> wrote:

> Hi all,
>
> I am a complete novice and was wondering whether anyone would be willing to provide me with a step-by-step guide on how to install Spark on a Mac, in standalone mode by the way.
>
> I downloaded a prebuilt version, the second version from the top. However, I have not installed Hadoop and am not planning to at this stage.
>
> I also downloaded Scala from the Scala website; do I need to download anything else?
>
> I am very eager to learn more about Spark but am unsure about the best way to do it.
>
> I would be happy for any suggestions or ideas.
>
> Many thanks,
>
> Aida
Re: Accessing Web UI
Hi,

try http://OAhtvJ5MCA:8080

BR

On 2/19/16, 07:18, "vasbhat" wrote:

> OAhtvJ5MCA
Re: Using SPARK packages in Spark Cluster
Hi Gourav,

I did a test as you said and it is working for me. I am using Spark in local mode, master and worker on the same machine. I ran the example in spark-shell --packages com.databricks:spark-csv_2.10:1.3.0 without errors.

BR

From: Gourav Sengupta
Date: Monday, February 15, 2016 at 10:03
To: Jorge Machado
Cc: Spark Group
Subject: Re: Using SPARK packages in Spark Cluster

Hi Jorge / All,

Please go through this link: http://spark.apache.org/docs/latest/spark-standalone.html. The link tells you how to start a SPARK cluster in local mode. If you have not started or worked with a SPARK cluster in local mode, kindly do not attempt to answer this question.

My question is how to use packages like https://github.com/databricks/spark-csv when I am using a SPARK cluster in local mode.

Regards,
Gourav Sengupta

On Mon, Feb 15, 2016 at 1:55 PM, Jorge Machado wrote:

Hi Gourav,

I did not understand your problem... the --packages option should not make any difference whether you are running standalone or on YARN, for example. Give us an example of what packages you are trying to load and what error you are getting. If you want to use the libraries on spark-packages.org without --packages, why do you not use Maven?

Regards

On 12/02/2016, at 13:22, Gourav Sengupta wrote:

Hi,

I am creating a SparkContext in a SPARK standalone cluster as mentioned here: http://spark.apache.org/docs/latest/spark-standalone.html using the following code:

sc.stop()
conf = SparkConf().set('spark.driver.allowMultipleContexts', False) \
    .setMaster("spark://hostname:7077") \
    .set('spark.shuffle.service.enabled', True) \
    .set('spark.dynamicAllocation.enabled', 'true') \
    .set('spark.executor.memory', '20g') \
    .set('spark.driver.memory', '4g') \
    .set('spark.default.parallelism', (multiprocessing.cpu_count() - 1))
conf.getAll()
sc = SparkContext(conf=conf)

(we should definitely be able to optimise the configuration, but that is not the point here)

I am not able to use packages, a list of which is mentioned here: http://spark-packages.org, using this method. Whereas if I use the standard "pyspark --packages" option, the packages load just fine.

I will be grateful if someone could kindly let me know how to load packages when starting a cluster as mentioned above.

Regards,
Gourav Sengupta
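A note for anyone reading this thread in the archive: with spark-shell or spark-submit, --packages downloads the package and its dependencies and puts them on the classpath. The configuration property behind --packages is spark.jars.packages; setting it on the conf may work when the application is still launched through spark-submit, though I have not verified that path. Once the package is on the classpath, using it looks like the following minimal Java sketch against the Spark 1.x API (the input path is a made-up placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CsvExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("CsvExample");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // spark-csv registers the "com.databricks.spark.csv" data source
    DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")       // first line of the file is a header
        .option("inferSchema", "true")  // guess column types instead of all strings
        .load("cars.csv");              // placeholder path

    df.show();
    sc.stop();
  }
}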
unsubscribe email
Hi Guys,

How can I unsubscribe the address e.costaalf...@studenti.unibs.it? It is an alias of my email e.costaalf...@unibs.it and is registered on the mailing list.

Thanks

Eduardo Costa Alfaia
PhD Student Telecommunication Engineering
Università degli Studi di Brescia - UNIBS
Re: Similar code in Java
Thanks Ted.

On Feb 10, 2015, at 20:06, Ted Yu <yuzhih...@gmail.com> wrote:

Please take a look at:
examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaDirectKafkaWordCount.java
which was checked in yesterday.

On Sat, Feb 7, 2015 at 10:53 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi Ted,

I've seen the code. I am using JavaKafkaWordCount.java, but I would like to reproduce in Java what I've done in Scala. Is it possible to do in Java the same thing that the Scala code does? Principally this code below, or something like it:

val KafkaDStreams = (1 to numStreams) map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, storageLevel = StorageLevel.MEMORY_ONLY).map(_._2)
}

On Feb 7, 2015, at 19:32, Ted Yu <yuzhih...@gmail.com> wrote:

Can you take a look at:
./examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
./external/kafka/src/test/java/org/apache/spark/streaming/kafka/JavaKafkaStreamSuite.java

Cheers

On Sat, Feb 7, 2015 at 9:45 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi Guys,

How could I do in Java the Scala code below?

val KafkaDStreams = (1 to numStreams) map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, storageLevel = StorageLevel.MEMORY_ONLY).map(_._2)
}
val unifiedStream = ssc.union(KafkaDStreams)
val sparkProcessingParallelism = 1
unifiedStream.repartition(sparkProcessingParallelism)

Thanks Guys
Similar code in Java
Hi Guys,

How could I do in Java the Scala code below?

val KafkaDStreams = (1 to numStreams) map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicMap, storageLevel = StorageLevel.MEMORY_ONLY).map(_._2)
}
val unifiedStream = ssc.union(KafkaDStreams)
val sparkProcessingParallelism = 1
unifiedStream.repartition(sparkProcessingParallelism)

Thanks Guys
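For the archive, here is one possible Java translation of the snippet above, written against the Spark 1.x streaming Kafka API. The context setup, ZooKeeper address, topic name and stream count are placeholder assumptions:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import kafka.serializer.StringDecoder;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class MultiStreamKafka {
  public static void main(String[] args) {
    JavaStreamingContext jssc =
        new JavaStreamingContext("local[4]", "MultiStreamKafka", new Duration(1000));

    Map<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("zookeeper.connect", "localhost:2181"); // placeholder
    kafkaParams.put("group.id", "mygroup");                 // placeholder

    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    topicMap.put("test", 1); // topic -> number of consumer threads

    int numStreams = 3;

    // Equivalent of (1 to numStreams) map { _ => KafkaUtils.createStream(...).map(_._2) }
    List<JavaDStream<String>> streams = new ArrayList<JavaDStream<String>>(numStreams);
    for (int i = 0; i < numStreams; i++) {
      JavaPairDStream<String, String> stream = KafkaUtils.createStream(
          jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
          kafkaParams, topicMap, StorageLevel.MEMORY_ONLY());
      streams.add(stream.map(new Function<Tuple2<String, String>, String>() {
        @Override
        public String call(Tuple2<String, String> tuple) {
          return tuple._2(); // keep only the message value
        }
      }));
    }

    // Equivalent of ssc.union(KafkaDStreams) plus the repartition
    JavaDStream<String> unifiedStream =
        jssc.union(streams.get(0), streams.subList(1, streams.size()));
    int sparkProcessingParallelism = 1;
    JavaDStream<String> partitioned = unifiedStream.repartition(sparkProcessingParallelism);

    partitioned.print();
    jssc.start();
    jssc.awaitTermination();
  }
}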
Error KafkaStream
Hi Guys,

I'm getting this error in KafkaWordCount:

TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234): java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.examples.streaming.KafkaWordCount$$anonfun$4$$anonfun$apply$1.apply(KafkaWordCount.scala:7

Any idea what it could be? Below is the piece of code:

val kafkaStream = {
  val kafkaParams = Map[String, String](
    "zookeeper.connect" -> "achab3:2181",
    "group.id" -> "mygroup",
    "zookeeper.connect.timeout.ms" -> "1",
    "kafka.fetch.message.max.bytes" -> "400",
    "auto.offset.reset" -> "largest")
  val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
  //val lines = KafkaUtils.createStream[String, String, DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicpMap, storageLevel = StorageLevel.MEMORY_ONLY_SER).map(_._2)
  val KafkaDStreams = (1 to numStreams).map { _ =>
    KafkaUtils.createStream[String, String, DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicpMap, storageLevel = StorageLevel.MEMORY_ONLY_SER).map(_._2)
  }
  val unifiedStream = ssc.union(KafkaDStreams)
  unifiedStream.repartition(sparkProcessingParallelism)
}

Thanks Guys
Re: Error KafkaStream
I don't think so, Sean.

On Feb 5, 2015, at 16:57, Sean Owen <so...@cloudera.com> wrote:

Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same issue?

On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi Guys,

I'm getting this error in KafkaWordCount:

TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234): java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.examples.streaming.KafkaWordCount$$anonfun$4$$anonfun$apply$1.apply(KafkaWordCount.scala:7

Any idea what it could be? Below is the piece of code:

val kafkaStream = {
  val kafkaParams = Map[String, String](
    "zookeeper.connect" -> "achab3:2181",
    "group.id" -> "mygroup",
    "zookeeper.connect.timeout.ms" -> "1",
    "kafka.fetch.message.max.bytes" -> "400",
    "auto.offset.reset" -> "largest")
  val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
  //val lines = KafkaUtils.createStream[String, String, DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicpMap, storageLevel = StorageLevel.MEMORY_ONLY_SER).map(_._2)
  val KafkaDStreams = (1 to numStreams).map { _ =>
    KafkaUtils.createStream[String, String, DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicpMap, storageLevel = StorageLevel.MEMORY_ONLY_SER).map(_._2)
  }
  val unifiedStream = ssc.union(KafkaDStreams)
  unifiedStream.repartition(sparkProcessingParallelism)
}

Thanks Guys
Re: Error KafkaStream
Hi Shao,

When I changed to StringDecoder I got this compile error:

[error] /sata_disk/workspace/spark-1.2.0/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:78: not found: type StringDecoder
[error]     KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicMap, storageLevel = StorageLevel.MEMORY_ONLY_SER).map(_._2)
[error]                             ^
[error] /sata_disk/workspace/spark-1.2.0/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:85: value split is not a member of Nothing
[error]     val words = unifiedStream.flatMap(_.split(" "))
[error]                                         ^
[error] /sata_disk/workspace/spark-1.2.0/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:86: value reduceByKeyAndWindow is not a member of org.apache.spark.streaming.dstream.DStream[(Nothing, Long)]
[error]     val wordCounts = words.map(x => (x, 1L)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(20), Seconds(10), 2)
[error]                                              ^
[error] three errors found
[error] (examples/compile:compile) Compilation failed

On Feb 6, 2015, at 02:11, Shao, Saisai <saisai.s...@intel.com> wrote:

Hi,

I think you should change the `DefaultDecoder` of your type parameter into `StringDecoder`; it seems you want to decode the message into String. `DefaultDecoder` returns Array[Byte], not String, so the class cast here will meet an error.

Thanks
Jerry

-----Original Message-----
From: Eduardo Costa Alfaia [mailto:e.costaalf...@unibs.it]
Sent: Friday, February 6, 2015 12:04 AM
To: Sean Owen
Cc: user@spark.apache.org
Subject: Re: Error KafkaStream

I don't think so, Sean.

On Feb 5, 2015, at 16:57, Sean Owen <so...@cloudera.com> wrote:

Is SPARK-4905 / https://github.com/apache/spark/pull/4371/files the same issue?

On Thu, Feb 5, 2015 at 7:03 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi Guys,

I'm getting this error in KafkaWordCount:

TaskSetManager: Lost task 0.0 in stage 4095.0 (TID 1281, 10.20.10.234): java.lang.ClassCastException: [B cannot be cast to java.lang.String
at org.apache.spark.examples.streaming.KafkaWordCount$$anonfun$4$$anonfun$apply$1.apply(KafkaWordCount.scala:7

Any idea what it could be? Below is the piece of code:

val kafkaStream = {
  val kafkaParams = Map[String, String](
    "zookeeper.connect" -> "achab3:2181",
    "group.id" -> "mygroup",
    "zookeeper.connect.timeout.ms" -> "1",
    "kafka.fetch.message.max.bytes" -> "400",
    "auto.offset.reset" -> "largest")
  val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
  //val lines = KafkaUtils.createStream[String, String, DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicpMap, storageLevel = StorageLevel.MEMORY_ONLY_SER).map(_._2)
  val KafkaDStreams = (1 to numStreams).map { _ =>
    KafkaUtils.createStream[String, String, DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topicpMap, storageLevel = StorageLevel.MEMORY_ONLY_SER).map(_._2)
  }
  val unifiedStream = ssc.union(KafkaDStreams)
  unifiedStream.repartition(sparkProcessingParallelism)
}

Thanks Guys
KafkaWordCount
Hi Guys,

I would like to put in the KafkaWordCount Scala code the Kafka parameter:

val kafkaParams = Map("fetch.message.max.bytes" -> "400")

I've put this variable like this:

val KafkaDStreams = (1 to numStreams) map { _ =>
  KafkaUtils.createStream(ssc, kafkaParams, zkQuorum, group, topicpMap).map(_._2)
}

However I've got these errors:

[error] (jssc: org.apache.spark.streaming.api.java.JavaStreamingContext, zkQuorum: String, groupId: String, topics: java.util.Map[String,Integer], storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream[String,String]

and

[error] (ssc: org.apache.spark.streaming.StreamingContext, zkQuorum: String, groupId: String, topics: scala.collection.immutable.Map[String,Int], storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, String)]

Thanks
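The two signatures in the error message show why this does not compile: the overloads that take zkQuorum and group do not accept a kafkaParams map. The overload that does accept one also needs explicit decoder classes and a storage level; in Scala that is the createStream[K, V, KD, VD](ssc, kafkaParams, topics, storageLevel) variant seen elsewhere in these threads. A sketch of it through the Java API, for consistency with the other examples in this digest (jssc and topicMap as in the sketch under "Similar code in Java" above; all values are placeholders):

// imports as in the MultiStreamKafka sketch earlier in this digest
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("zookeeper.connect", "localhost:2181"); // replaces the zkQuorum argument
kafkaParams.put("group.id", "mygroup");                 // replaces the group argument
kafkaParams.put("fetch.message.max.bytes", "400");      // the consumer property in question

JavaPairDStream<String, String> stream = KafkaUtils.createStream(
    jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
    kafkaParams, topicMap, StorageLevel.MEMORY_ONLY());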
Error Compiling
Hi Guys,

Any idea how to solve this error?

[error] /sata_disk/workspace/spark-1.1.1/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:76: missing parameter type for expanded function ((x$6, x$7) => x$6.$plus(x$7))
[error]     val wordCounts = words.map(x => (x, 1L)).reduceByWindow(_ + _, _ - _, Minutes(1), Seconds(2), 2)

Thanks
JavaKafkaWordCount
Hi Guys,

I am doing some tests with JavaKafkaWordCount. My cluster is composed of 8 workers and 1 driver with spark-1.1.0, I am using Kafka too, and I have some questions.

1 - When I launch the command:

bin/spark-submit --class org.apache.spark.examples.streaming.JavaKafkaWordCount --master spark://computer8:7077 --driver-memory 1g --executor-memory 2g --executor-cores 2 examples/target/scala-2.10/spark-examples-1.1.0-hadoop1.0.4.jar computer49:2181 test-consumer-group test 2

I see in the Spark web UI that only 1 worker works. Why?

2 - In Kafka I can see the same thing:

Group                Topic  Pid  Offset  logSize  Lag  Owner
test-consumer-group  test   0    147092  147092   0    test-consumer-group_computer1-1416319543858-817b566f-0
test-consumer-group  test   1    232183  232183   0    test-consumer-group_computer1-1416319543858-817b566f-0
test-consumer-group  test   2    186805  186805   0    test-consumer-group_computer1-1416319543858-817b566f-0
test-consumer-group  test   3    0       0        0    test-consumer-group_computer1-1416319543858-817b566f-1
test-consumer-group  test   4    0       0        0    test-consumer-group_computer1-1416319543858-817b566f-1
test-consumer-group  test   5    0       0        0    test-consumer-group_computer1-1416319543858-817b566f-1

I would like to understand this behavior. Is it normal? Am I doing something wrong?

Thanks Guys
Kafka examples
Hi guys,

Were the Kafka examples in the master branch removed?

Thanks
Spark and Kafka
Hi Guys,

I am doing some tests with Spark Streaming and Kafka and I have seen something strange. I modified JavaKafkaWordCount to use reduceByKeyAndWindow and to print on screen the accumulated word counts. In the beginning Spark works very well: in each interval the word counts increase. But after 12 or 13 seconds the results repeat continually, even though my producer program keeps sending words to Kafka. Does anyone have any idea about this?

---
Time: 1415272266000 ms
---
(accompanied them,6)
(merrier,5)
(it possessed,5)
(the treacherous,5)
(Quite,12)
(offer,273)
(rabble,58)
(exchanging,16)
(Genoa,18)
(merchant,41)
...
---
Time: 1415272267000 ms
---
(accompanied them,12)
(merrier,12)
(it possessed,12)
(the treacherous,11)
(Quite,24)
(offer,602)
(rabble,132)
(exchanging,35)
(Genoa,36)
(merchant,84)
...
---
Time: 1415272268000 ms
---
(accompanied them,17)
(merrier,18)
(it possessed,17)
(the treacherous,17)
(Quite,35)
(offer,889)
(rabble,192)
(the bed,1)
(exchanging,51)
(Genoa,54)
...
---
Time: 1415272269000 ms
---
(accompanied them,17)
(merrier,18)
(it possessed,17)
(the treacherous,17)
(Quite,35)
(offer,889)
(rabble,192)
(the bed,1)
(exchanging,51)
(Genoa,54)
...
---
Time: 141527227 ms
---
(accompanied them,17)
(merrier,18)
(it possessed,17)
(the treacherous,17)
(Quite,35)
(offer,889)
(rabble,192)
(the bed,1)
(exchanging,51)
(Genoa,54)
...
Re: Spark and Kafka
This is my window:

reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
);

On Nov 6, 2014, at 18:37, Gwen Shapira <gshap...@cloudera.com> wrote:

What's the window size? If the window is around 10 seconds and you are sending data at a very stable rate, this is expected.

On Thu, Nov 6, 2014 at 9:32 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi Guys,

I am doing some tests with Spark Streaming and Kafka and I have seen something strange. I modified JavaKafkaWordCount to use reduceByKeyAndWindow and to print on screen the accumulated word counts. In the beginning Spark works very well: in each interval the word counts increase. But after 12 or 13 seconds the results repeat continually, even though my producer program keeps sending words to Kafka. Does anyone have any idea about this?

---
Time: 1415272266000 ms
---
(accompanied them,6)
(merrier,5)
(it possessed,5)
(the treacherous,5)
(Quite,12)
(offer,273)
(rabble,58)
(exchanging,16)
(Genoa,18)
(merchant,41)
...
---
Time: 1415272267000 ms
---
(accompanied them,12)
(merrier,12)
(it possessed,12)
(the treacherous,11)
(Quite,24)
(offer,602)
(rabble,132)
(exchanging,35)
(Genoa,36)
(merchant,84)
...
---
Time: 1415272268000 ms
---
(accompanied them,17)
(merrier,18)
(it possessed,17)
(the treacherous,17)
(Quite,35)
(offer,889)
(rabble,192)
(the bed,1)
(exchanging,51)
(Genoa,54)
...
---
Time: 1415272269000 ms
---
(accompanied them,17)
(merrier,18)
(it possessed,17)
(the treacherous,17)
(Quite,35)
(offer,889)
(rabble,192)
(the bed,1)
(exchanging,51)
(Genoa,54)
...
---
Time: 141527227 ms
---
(accompanied them,17)
(merrier,18)
(it possessed,17)
(the treacherous,17)
(Quite,35)
(offer,889)
(rabble,192)
(the bed,1)
(exchanging,51)
(Genoa,54)
...
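One detail worth checking with this pattern, though not necessarily the cause of the repeating counts: the two-function form of reduceByKeyAndWindow keeps state across batches, and Spark Streaming requires a checkpoint directory to be set for it. A one-line sketch, with a placeholder path:

// Required when reduceByKeyAndWindow is given an inverse function;
// on a real cluster the path would normally point at HDFS.
jssc.checkpoint("/tmp/spark-checkpoint");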
Spark Kafka Performance
Hi Guys,

Could anyone explain to me how Kafka works with Spark? I am using JavaKafkaWordCount.java as a test, and the command line is:

./run-example org.apache.spark.streaming.examples.JavaKafkaWordCount spark://192.168.0.13:7077 computer49:2181 test-consumer-group unibs.it 3

As a producer I am using this command:

rdkafka_cachesender -t unibs.nec -p 1 -b 192.168.0.46:9092 -f output.txt -l 100 -n 10

rdkafka_cachesender is a program I developed which sends the content of output.txt to Kafka, where -l is the length of each send (upper bound) and -n is the number of lines to send in a row. Below is the throughput calculated by the program:

File is 2235755 bytes
throughput (b/s) = 699751388
throughput (b/s) = 723542382
throughput (b/s) = 662989745
throughput (b/s) = 505028200
throughput (b/s) = 471263416
throughput (b/s) = 446837266
throughput (b/s) = 409856716
throughput (b/s) = 373994467
throughput (b/s) = 366343097
throughput (b/s) = 373240017
throughput (b/s) = 386139016
throughput (b/s) = 373802209
throughput (b/s) = 369308515
throughput (b/s) = 366935820
throughput (b/s) = 365175388
throughput (b/s) = 362175419
throughput (b/s) = 358356633
throughput (b/s) = 357219124
throughput (b/s) = 352174125
throughput (b/s) = 348313093
throughput (b/s) = 355099099
throughput (b/s) = 348069777
throughput (b/s) = 348478302
throughput (b/s) = 340404276
throughput (b/s) = 339876031
throughput (b/s) = 339175102
throughput (b/s) = 327555252
throughput (b/s) = 324272374
throughput (b/s) = 322479222
throughput (b/s) = 319544906
throughput (b/s) = 317201853
throughput (b/s) = 317351399
throughput (b/s) = 315027978
throughput (b/s) = 313831014
throughput (b/s) = 310050384
throughput (b/s) = 307654601
throughput (b/s) = 305707061
throughput (b/s) = 307961102
throughput (b/s) = 296898200
throughput (b/s) = 296409904
throughput (b/s) = 294609332
throughput (b/s) = 293397843
throughput (b/s) = 293194876
throughput (b/s) = 291724886
throughput (b/s) = 290031314
throughput (b/s) = 289747022
throughput (b/s) = 289299632

The throughput goes down after some seconds and does not maintain the initial values:

throughput (b/s) = 699751388
throughput (b/s) = 723542382
throughput (b/s) = 662989745

Another question is about Spark: about 15 seconds after I started the Spark command, Spark began to repeat the same word counts, while my program continued to send words to Kafka, so the counts should have kept growing. I have attached the log from Spark.

My case is: ComputerA (rdkafka_cachesender) -> ComputerB (Kafka brokers + ZooKeeper) -> ComputerC (Spark)

If I haven't explained it well, send me a reply.

Thanks Guys
Spark's Behavior 2
Hi TD,

I have sent more information now, using 8 workers. The gap is now 27 seconds. Have you seen it?

Thanks

BR
Re: Spark's behavior
Ok Andrew, thanks. I sent information about a test with 8 workers, and the gap has grown.

On May 4, 2014, at 2:31, Andrew Ash <and...@andrewash.com> wrote:

From the logs, I see that the print() starts printing stuff 10 seconds after the context is started. And that 10 seconds is taken by the initial empty job (50 map + 20 reduce tasks) that Spark Streaming starts to ensure all the executors have started. Somehow the first empty task takes 7-8 seconds to complete. See if this can be reproduced by running a simple, empty job in spark-shell (in the same cluster) and seeing if the first task takes 7-8 seconds.

Either way, I didn't see the 30 second gap, but a 10 second gap. And that does not seem to be a persistent problem, as after those 10 seconds the data is being received and processed.

TD
Re: Spark's behavior
Hi TD,

Thanks for the reply. This last experiment I did with one computer, in local mode, but I think the time gap grows when I add more computers. I will do it again now with 8 workers and 1 word source and see what goes on. I will measure the time too, as suggested by Andrew.

On May 3, 2014, at 1:19, Tathagata Das <tathagata.das1...@gmail.com> wrote:

From the logs, I see that the print() starts printing stuff 10 seconds after the context is started. And that 10 seconds is taken by the initial empty job (50 map + 20 reduce tasks) that Spark Streaming starts to ensure all the executors have started. Somehow the first empty task takes 7-8 seconds to complete. See if this can be reproduced by running a simple, empty job in spark-shell (in the same cluster) and seeing if the first task takes 7-8 seconds.

Either way, I didn't see the 30 second gap, but a 10 second gap. And that does not seem to be a persistent problem, as after those 10 seconds the data is being received and processed.

TD

On Fri, May 2, 2014 at 2:14 PM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi TD,

I got more information today using Spark 1.0 RC3 and the situation remains the same:

[image: PastedGraphic-1.png]

The lines begin after 17 sec:

14/05/02 21:52:25 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140502215225-0005/0 on hostPort computer8.ant-net:57229 with 2 cores, 2.0 GB RAM
14/05/02 21:52:25 INFO AppClient$ClientActor: Executor updated: app-20140502215225-0005/0 is now RUNNING
14/05/02 21:52:25 INFO ReceiverTracker: ReceiverTracker started
14/05/02 21:52:26 INFO ForEachDStream: metadataCleanupDelay = -1
14/05/02 21:52:26 INFO SocketInputDStream: metadataCleanupDelay = -1
14/05/02 21:52:26 INFO SocketInputDStream: Slide time = 1000 ms
14/05/02 21:52:26 INFO SocketInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
14/05/02 21:52:26 INFO SocketInputDStream: Checkpoint interval = null
14/05/02 21:52:26 INFO SocketInputDStream: Remember duration = 1000 ms
14/05/02 21:52:26 INFO SocketInputDStream: Initialized and validated org.apache.spark.streaming.dstream.SocketInputDStream@5433868e
14/05/02 21:52:26 INFO ForEachDStream: Slide time = 1000 ms
14/05/02 21:52:26 INFO ForEachDStream: Storage level = StorageLevel(false, false, false, false, 1)
14/05/02 21:52:26 INFO ForEachDStream: Checkpoint interval = null
14/05/02 21:52:26 INFO ForEachDStream: Remember duration = 1000 ms
14/05/02 21:52:26 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@1ebdbc05
14/05/02 21:52:26 INFO SparkContext: Starting job: collect at ReceiverTracker.scala:270
14/05/02 21:52:26 INFO RecurringTimer: Started timer for JobGenerator at time 1399060346000
14/05/02 21:52:26 INFO JobGenerator: Started JobGenerator at 1399060346000 ms
14/05/02 21:52:26 INFO JobScheduler: Started JobScheduler
14/05/02 21:52:26 INFO DAGScheduler: Registering RDD 3 (reduceByKey at ReceiverTracker.scala:270)
14/05/02 21:52:26 INFO ReceiverTracker: Stream 0 received 0 blocks
14/05/02 21:52:26 INFO DAGScheduler: Got job 0 (collect at ReceiverTracker.scala:270) with 20 output partitions (allowLocal=false)
14/05/02 21:52:26 INFO DAGScheduler: Final stage: Stage 0 (collect at ReceiverTracker.scala:270)
14/05/02 21:52:26 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/05/02 21:52:26 INFO JobScheduler: Added jobs for time 1399060346000 ms
14/05/02 21:52:26 INFO JobScheduler: Starting job streaming job 1399060346000 ms.0 from job set of time 1399060346000 ms
14/05/02 21:52:26 INFO JobGenerator: Checkpointing graph for time 1399060346000 ms
14/05/02 21:52:26 INFO DStreamGraph: Updating checkpoint data for time 1399060346000 ms
---
Time: 1399060346000 ms
---
14/05/02 21:52:26 INFO JobScheduler: Finished job streaming job 1399060346000 ms.0 from job set of time 1399060346000 ms
14/05/02 21:52:26 INFO JobScheduler: Total delay: 0.325 s for time 1399060346000 ms (execution: 0.024 s)
14/05/02 21:52:42 INFO JobScheduler: Added jobs for time 1399060362000 ms
14/05/02 21:52:42 INFO JobGenerator: Checkpointing graph for time 1399060362000 ms
14/05/02 21:52:42 INFO DStreamGraph: Updating checkpoint data for time 1399060362000 ms
14/05/02 21:52:42 INFO DStreamGraph: Updated checkpoint data for time 1399060362000 ms
14/05/02 21:52:42 INFO JobScheduler: Starting job streaming job 1399060362000 ms.0 from job set of time 1399060362000 ms
14/05/02 21:52:42 INFO SparkContext: Starting job: take at DStream.scala:593
14/05/02 21:52:42 INFO DAGScheduler: Got job 2 (take at DStream.scala:593) with 1 output partitions (allowLocal=true)
14/05/02 21:52:42 INFO DAGScheduler: Final stage: Stage 3 (take at DStream.scala:593)
14/05/02 21:52:42 INFO DAGScheduler: Parents of final stage: List()
14/05/02 21:52:42 INFO
Spark's behavior
Hi TD,

In my tests with Spark Streaming, I'm using a modified JavaNetworkWordCount and a program that I wrote that sends words to the Spark worker; I use TCP as transport. I verified that after starting, Spark connects to my source, which actually starts sending, but the first word count is advertised approximately 30 seconds after the context creation. So I'm wondering where the 30 seconds of data already sent by the source is stored. Is this normal Spark behaviour? I saw the same behaviour using the shipped JavaNetworkWordCount application.

Many thanks.
Re: Spark's behavior
Hi TD,

We are not using the streaming context with master local; we have 1 master, 8 workers and 1 word source. The command line we are using is:

bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount spark://192.168.0.13:7077

On Apr 30, 2014, at 0:09, Tathagata Das <tathagata.das1...@gmail.com> wrote:

Is your batch size 30 seconds by any chance? Assuming not, please check whether you are creating the streaming context with master local[n] where n < 2. With local or local[1], the system has only one processing slot, which is occupied by the receiver, leaving no room for processing the received data. It could be that after 30 seconds the server disconnects, the receiver terminates, releasing the single slot for the processing to proceed.

TD

On Tue, Apr 29, 2014 at 2:28 PM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi TD,

In my tests with Spark Streaming, I'm using a modified JavaNetworkWordCount and a program that I wrote that sends words to the Spark worker; I use TCP as transport. I verified that after starting, Spark connects to my source, which actually starts sending, but the first word count is advertised approximately 30 seconds after the context creation. So I'm wondering where the 30 seconds of data already sent by the source is stored. Is this normal Spark behaviour? I saw the same behaviour using the shipped JavaNetworkWordCount application.

Many thanks.
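A minimal sketch of the point TD is making, against the Spark 1.x Java API (host and port are placeholders): when running with a local master, at least two slots are needed, one for the receiver and one for processing.

import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// "local[2]": one slot for the socket receiver, one for processing the batches.
// With "local" or "local[1]" the receiver occupies the only slot and no data is processed.
JavaStreamingContext jssc =
    new JavaStreamingContext("local[2]", "NetworkWordCount", new Duration(1000));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);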
Re: reduceByKeyAndWindow Java
Hi TD,

Could you explain this part of the code to me?

.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
);

Thanks

On 4/4/14, 22:56, Tathagata Das wrote:

I haven't really compiled the code, but it looks good to me. Why? Is there any problem you are facing?

TD

On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi guys,

I would like to know if this part of the code is right to use with a window:

JavaPairDStream<String, Integer> wordCounts = words.map(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    }).reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
);

Thanks
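For later readers of this thread, here is the same call annotated; the variable name pairs is mine and stands for the JavaPairDStream<String, Integer> produced by the map above. The two-function form does incremental window aggregation: as the window slides, counts from the batch entering the window are added by the first function, and counts from the batch sliding out are removed by the second, instead of recomputing the whole window every second. Because it keeps state across batches, a checkpoint directory must be set on the streaming context.

// Incremental windowed count: window length 5 minutes, sliding every 1 second.
JavaPairDStream<String, Integer> windowedCounts = pairs.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {   // add counts of data entering the window
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {   // subtract counts of data leaving the window
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),   // window length: 5 minutes
    new Duration(1 * 1000));       // slide interval: 1 second

jssc.checkpoint("/tmp/spark-checkpoint"); // required by this form; placeholder path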
Driver Out of Memory
Hi Guys,

I would like to understand why the driver's free RAM goes down. Does the processing occur only in the workers?

Thanks

# Start of tests

computer1 (Worker/Source Stream)
23:57:18 up 12:03, 1 user, load average: 0.03, 0.31, 0.44
                    total   used   free  shared  buffers  cached
Mem:                 3945   1084   2860       0       44     827
-/+ buffers/cache:           212   3732
Swap:                   0      0      0

computer8 (Driver/Master)
23:57:18 up 11:53, 5 users, load average: 0.43, 1.19, 1.31
                    total   used   free  shared  buffers  cached
Mem:                 5897   4430   1466       0      384    2662
-/+ buffers/cache:          1382   4514
Swap:                   0      0      0

computer10 (Worker/Source Stream)
23:57:18 up 12:02, 1 user, load average: 0.55, 1.34, 0.98
                    total   used   free  shared  buffers  cached
Mem:                 5897    564   5332       0       18     358
-/+ buffers/cache:           187   5709
Swap:                   0      0      0

computer11 (Worker/Source Stream)
23:57:18 up 12:02, 1 user, load average: 0.07, 0.19, 0.29
                    total   used   free  shared  buffers  cached
Mem:                 3945    603   3342       0       54     355
-/+ buffers/cache:           193   3751
Swap:                   0      0      0

# After 2 minutes

computer1
00:06:41 up 12:12, 1 user, load average: 3.11, 1.32, 0.73
                    total   used   free  shared  buffers  cached
Mem:                 3945   2950    994       0       46    1095
-/+ buffers/cache:          1808   2136
Swap:                   0      0      0

computer8 (Driver/Master)
00:06:41 up 12:02, 5 users, load average: 1.16, 0.71, 0.96
                    total   used   free  shared  buffers  cached
Mem:                 5897   5191    705       0      385    2792
-/+ buffers/cache:          2014   3882
Swap:                   0      0      0

computer10
00:06:41 up 12:11, 1 user, load average: 2.02, 1.07, 0.89
                    total   used   free  shared  buffers  cached
Mem:                 5897   2567   3329       0       21     647
-/+ buffers/cache:          1898   3998
Swap:                   0      0      0

computer11
00:06:42 up 12:12, 1 user, load average: 3.96, 1.83, 0.88
                    total   used   free  shared  buffers  cached
Mem:                 3945   3542    402       0       57    1099
-/+ buffers/cache:          2385   1559
Swap:                   0      0      0
Explain Add Input
Hi all,

Could anyone explain to me the lines below?

computer1 - worker
computer8 - driver (master)

14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5 KB, free: 540.3 MB)
14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(1292780) called with curMem=49555672, maxMem=825439027
14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314800 stored as bytes to memory (size 1262.5 KB, free 738.7 MB)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314800 in memory on computer8.ant-net:49743 (size: 1262.5 KB, free: 738.7 MB)

Why does Spark add the same input on computer8, which is the driver (master)?

Thanks guys
RAM high consume
Hi all,

I am doing some tests using JavaNetworkWordCount and I have some questions about the machines' performance; my tests take approximately 2 minutes. Why does the free RAM decrease so significantly? I have done tests with 2 and 3 machines and got the same behavior. What should I do to get better performance in this case?

# Start of test

computer1
                    total   used   free  shared  buffers  cached
Mem:                 3945    711   3233       0        3     430
-/+ buffers/cache:           276   3668
Swap:                   0      0      0
14:42:50 up 73 days, 3:32, 2 users, load average: 0.00, 0.06, 0.21

14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314400 in memory on computer1.ant-net:60820 (size: 826.1 KB, free: 542.9 MB)
14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(845956) called with curMem=47278100, maxMem=825439027
14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314400 stored as bytes to memory (size 826.1 KB, free 741.3 MB)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314400 in memory on computer8.ant-net:49743 (size: 826.1 KB, free: 741.3 MB)
14/04/04 14:24:56 INFO BlockManagerMaster: Updated info of block input-0-1396614314400
14/04/04 14:24:56 INFO TaskSetManager: Finished TID 272 in 84 ms on computer1.ant-net (progress: 0/1)
14/04/04 14:24:56 INFO TaskSchedulerImpl: Remove TaskSet 43.0 from pool
14/04/04 14:24:56 INFO DAGScheduler: Completed ResultTask(43, 0)
14/04/04 14:24:56 INFO DAGScheduler: Stage 43 (take at DStream.scala:594) finished in 0.088 s
14/04/04 14:24:56 INFO SparkContext: Job finished: take at DStream.scala:594, took 1.872875734 s
---
Time: 1396614289000 ms
---
(Santiago,1)
(liveliness,1)
(Sun,1)
(reapers,1)
(offer,3)
(BARBER,3)
(shrewdness,1)
(truism,1)
(hits,1)
(merchant,1)

# End of test

computer1
                    total   used   free  shared  buffers  cached
Mem:                 3945   2209   1735       0        5     773
-/+ buffers/cache:          1430   2514
Swap:                   0      0      0
14:46:05 up 73 days, 3:35, 2 users, load average: 2.69, 1.07, 0.55

14/04/04 14:26:57 INFO TaskSetManager: Starting task 183.0:0 as TID 696 on executor 0: computer1.ant-net (PROCESS_LOCAL)
14/04/04 14:26:57 INFO TaskSetManager: Serialized task 183.0:0 as 1981 bytes in 0 ms
14/04/04 14:26:57 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 81 to sp...@computer1.ant-net:44817
14/04/04 14:26:57 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 81 is 212 bytes
14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614336600 on disk on computer1.ant-net:60820 (size: 1441.7 KB)
14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614435200 in memory on computer1.ant-net:60820 (size: 1295.7 KB, free: 589.3 KB)
14/04/04 14:26:57 INFO TaskSetManager: Finished TID 696 in 56 ms on computer1.ant-net (progress: 0/1)
14/04/04 14:26:57 INFO TaskSchedulerImpl: Remove TaskSet 183.0 from pool
14/04/04 14:26:57 INFO DAGScheduler: Completed ResultTask(183, 0)
14/04/04 14:26:57 INFO DAGScheduler: Stage 183 (take at DStream.scala:594) finished in 0.057 s
14/04/04 14:26:57 INFO SparkContext: Job finished: take at DStream.scala:594, took 1.575268894 s
---
Time: 1396614359000 ms
---
(hapless,9)
(reapers,8)
(amazed,113)
(feebleness,7)
(offer,148)
(rabble,27)
(exchanging,7)
(merchant,20)
(incentives,2)
(quarrel,48)
...

Thanks Guys
Driver increase memory utilization
Hi Guys,

Could anyone help me understand this driver behavior when I start JavaNetworkWordCount?

computer8
16:24:07 up 121 days, 22:21, 12 users, load average: 0.66, 1.27, 1.55
                    total   used   free  shared  buffers  cached
Mem:                 5897   4341   1555       0      227    2798
-/+ buffers/cache:          1315   4581
Swap:                   0      0      0

After 2 minutes:

computer8
16:23:08 up 121 days, 22:20, 12 users, load average: 0.80, 1.43, 1.62
                    total   used   free  shared  buffers  cached
Mem:                 5897   5866     30       0      230    3255
-/+ buffers/cache:          2380   3516
Swap:                   0      0      0

Thanks
Parallelism level
Hi all,

I have put this line in my spark-env.sh:

-Dspark.default.parallelism=20

Is this parallelism level correct? The machine's processor is a dual core.

Thanks
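For reference, the same property can also be set on a SparkConf instead of in spark-env.sh. A minimal Java sketch; the value 4 is only an example, following the Spark tuning guide's rough guideline of 2-3 tasks per CPU core:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Default number of partitions used by shuffle operations when not set explicitly.
// On a dual-core machine, a value around 4-6 follows the 2-3 tasks-per-core guideline.
SparkConf conf = new SparkConf()
    .setAppName("ParallelismExample")
    .set("spark.default.parallelism", "4");
JavaSparkContext sc = new JavaSparkContext(conf);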
RAM Increase
Hi Guys,

Could anyone explain this behavior to me? After 2 minutes of tests:

computer1 - worker
computer10 - worker
computer8 - driver (master)

computer1
18:24:31 up 73 days, 7:14, 1 user, load average: 3.93, 2.45, 1.14
                    total   used   free  shared  buffers  cached
Mem:                 3945   3925     19       0       18    1368
-/+ buffers/cache:          2539   1405
Swap:                   0      0      0

computer10
18:22:38 up 44 days, 21:26, 2 users, load average: 3.05, 2.20, 1.03
                    total   used   free  shared  buffers  cached
Mem:                 5897   5292    604       0       46    2707
-/+ buffers/cache:          2538   3358
Swap:                   0      0      0

computer8
18:24:13 up 122 days, 22 min, 13 users, load average: 1.10, 0.93, 0.82
                    total   used   free  shared  buffers  cached
Mem:                 5897   5841     55       0      113    2747
-/+ buffers/cache:          2980   2916
Swap:                   0      0      0

Thanks Guys
Re: Parallelism level
What do you advise me, Nicholas?

On 4/4/14, 19:05, Nicholas Chammas wrote:

If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help.

On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi all,

I have put this line in my spark-env.sh:

-Dspark.default.parallelism=20

Is this parallelism level correct? The machine's processor is a dual core.

Thanks
Re: reduceByKeyAndWindow Java
Hi Tathagata,

You are right, this code compiles, but I am having some problems with high memory consumption. I sent some emails about it today, but no response so far.

Thanks

On 4/4/14, 22:56, Tathagata Das wrote:

I haven't really compiled the code, but it looks good to me. Why? Is there any problem you are facing?

TD

On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:

Hi guys,

I would like to know if this part of the code is right to use with a window:

JavaPairDStream<String, Integer> wordCounts = words.map(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    }).reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
);

Thanks
Print line in JavaNetworkWordCount
Hi Guys,

I would like to print the content inside `lines` in:

JavaDStream<String> lines = ssc.socketTextStream(args[1], Integer.parseInt(args[2]));
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  @Override
  public Iterable<String> call(String x) {
    return Lists.newArrayList(x.split(" "));
  }
});

Is it possible? How could I do it?

Thanks
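Two options, sketched against the Spark 1.x Java API: print() shows the first elements of every batch on the driver, and foreachRDD gives full control over each batch (the limit of 20 below is an arbitrary example):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Quick look: prints the first 10 elements of each batch to the driver's stdout.
lines.print();

// Full control over each batch; avoid collect() on large streams,
// since everything would land on the driver.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
  @Override
  public Void call(JavaRDD<String> rdd) {
    for (String line : rdd.take(20)) {
      System.out.println(line);
    }
    return null;
  }
});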
Re: Change print() in JavaNetworkWordCount
Thank you very much, Sourav.

BR

On 3/26/14, 17:29, Sourav Chandra wrote:

def print() {
  def foreachFunc = (rdd: RDD[T], time: Time) => {
    val total = rdd.collect().toList
    println("---")
    println("Time: " + time)
    println("---")
    total.foreach(println)
    // val first11 = rdd.take(11)
    // println("---")
    // println("Time: " + time)
    // println("---")
    // first11.take(10).foreach(println)
    // if (first11.size > 10) println("...")
    println()
  }
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
}
Change print() in JavaNetworkWordCount
Hi Guys,

I think I have already asked this question, but I don't remember if anyone answered me. I would like to change, in the print() function, the number of words and frequencies that are sent to the driver's screen; the default value is 10. Could anyone help me with this?

Best Regards
Log Analyze
Hi Guys,

Could anyone help me understand this piece of log, in particular the exception at the end? Why did this happen?

Thanks

14/03/10 16:55:20 INFO SparkContext: Starting job: first at NetworkWordCount.scala:87
14/03/10 16:55:20 INFO JobScheduler: Finished job streaming job 1394466892000 ms.0 from job set of time 1394466892000 ms
14/03/10 16:55:20 INFO JobScheduler: Total delay: 28.537 s for time 1394466892000 ms (execution: 4.479 s)
14/03/10 16:55:20 INFO JobScheduler: Starting job streaming job 1394466893000 ms.0 from job set of time 1394466893000 ms
14/03/10 16:55:20 INFO JobGenerator: Checkpointing graph for time 1394466892000 ms
14/03/10 16:55:20 INFO DStreamGraph: Updating checkpoint data for time 1394466892000 ms
14/03/10 16:55:20 INFO DStreamGraph: Updated checkpoint data for time 1394466892000 ms
14/03/10 16:55:20 INFO CheckpointWriter: Saving checkpoint for time 1394466892000 ms to file 'hdfs://computer8:54310/user/root/INPUT/checkpoint-1394466892000'
14/03/10 16:55:20 INFO DAGScheduler: Registering RDD 496 (combineByKey at ShuffledDStream.scala:42)
14/03/10 16:55:20 INFO DAGScheduler: Got job 39 (first at NetworkWordCount.scala:87) with 1 output partitions (allowLocal=true)
14/03/10 16:55:20 INFO DAGScheduler: Final stage: Stage 77 (first at NetworkWordCount.scala:87)
14/03/10 16:55:20 INFO DAGScheduler: Parents of final stage: List(Stage 78)
14/03/10 16:55:20 INFO DAGScheduler: Missing parents: List(Stage 78)
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed input-1-1394466782400 on computer10.ant-net:34062 in memory (size: 5.9 MB, free: 502.2 MB)
14/03/10 16:55:20 INFO DAGScheduler: Submitting Stage 78 (MapPartitionsRDD[496] at combineByKey at ShuffledDStream.scala:42), which has no missing parents
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-1-1394466816600 in memory on computer10.ant-net:34062 (size: 4.4 MB, free: 497.8 MB)
14/03/10 16:55:20 INFO DAGScheduler: Submitting 15 missing tasks from Stage 78 (MapPartitionsRDD[496] at combineByKey at ShuffledDStream.scala:42)
14/03/10 16:55:20 INFO TaskSchedulerImpl: Adding task set 78.0 with 15 tasks
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:9 as TID 539 on executor 2: computer1.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:9 as 4144 bytes in 1 ms
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:10 as TID 540 on executor 1: computer10.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:10 as 4144 bytes in 0 ms
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:11 as TID 541 on executor 0: computer11.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:11 as 4144 bytes in 0 ms
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed input-0-1394466874200 on computer1.ant-net:51406 in memory (size: 2.9 MB, free: 460.0 MB)
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed input-0-1394466874400 on computer1.ant-net:51406 in memory (size: 4.1 MB, free: 468.2 MB)
14/03/10 16:55:20 INFO TaskSetManager: Starting task 78.0:12 as TID 542 on executor 1: computer10.ant-net (PROCESS_LOCAL)
14/03/10 16:55:20 INFO TaskSetManager: Serialized task 78.0:12 as 4144 bytes in 1 ms
14/03/10 16:55:20 WARN TaskSetManager: Lost TID 540 (task 78.0:10)
14/03/10 16:55:20 INFO CheckpointWriter: Deleting hdfs://computer8:54310/user/root/INPUT/checkpoint-1394466892000
14/03/10 16:55:20 INFO CheckpointWriter: Checkpoint for time 1394466892000 ms saved to file 'hdfs://computer8:54310/user/root/INPUT/checkpoint-1394466892000', took 3633 bytes and 93 ms
14/03/10 16:55:20 INFO DStreamGraph: Clearing checkpoint data for time 1394466892000 ms
14/03/10 16:55:20 INFO DStreamGraph: Cleared checkpoint data for time 1394466892000 ms
14/03/10 16:55:20 INFO BlockManagerMasterActor$BlockManagerInfo: Removed input-2-1394466789000 on computer11.ant-net:58332 in memory (size: 3.9 MB, free: 536.0 MB)
14/03/10 16:55:20 WARN TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block input-2-1394466794200 not found
    at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:45)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:32)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:72)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at
Re: Explain About Logs NetworkWordcount.scala
Yes TD, I can use tcpdump to check whether the data is being accepted by the receiver and whether it is arriving in the IP packets.

Thanks

On 3/8/14, 4:19, Tathagata Das wrote:

I am not sure how to debug this without any more information about the source. Can you monitor on the receiver side that data is being accepted by the receiver but not reported?

TD

On Wed, Mar 5, 2014 at 7:23 AM, eduardocalfaia <e.costaalf...@unibs.it> wrote:

Hi TD,

I have seen in the web UI that the stage's result number has been zero, and there is nothing in the GC Time field.

http://apache-spark-user-list.1001560.n3.nabble.com/file/n2306/CaptureStage.png