Realtime Data Visualization Tool for Spark
Hi

I have streaming data which needs to be processed and sent for visualization. I am planning to use Spark Streaming for this, but I am a little confused about choosing a visualization tool. I read somewhere that D3.js can be used, but I wanted to know which tool is best for visualization when dealing with a streaming application (something that can be easily integrated). If someone has a link describing integration between D3.js (or any other visualization tool) and a Spark Streaming application, then please share. That would be a great help.

Thanks and Regards
Shashi
Spark Job failing with exit status 15
I am trying to run a simple word count job in Spark, but I am getting an exception while running the job.

For more detailed output, check application tracking page: http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/ Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1446699275562_0006_02_01
Exit code: 15
Stack trace: ExitCodeException exitCode=15:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 15
Failing this attempt. Failing the application.
    ApplicationMaster host: N/A
    ApplicationMaster RPC port: -1
    queue: root.cloudera
    start time: 1446910483956
    final status: FAILED
    tracking URL: http://quickstart.cloudera:8088/cluster/app/application_1446699275562_0006
    user: cloudera
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:626)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:651)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I checked the log with the following command:

yarn logs -applicationId application_1446699275562_0006

Here is the log:

15/11/07 07:35:09 ERROR yarn.ApplicationMaster: User class threw exception: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1053)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
    at org.com.td.sparkdemo.spark.WordCount$.main(WordCount.scala:23)
    at org.com.td.sparkdemo.spark.WordCount.main(WordCount.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
15/11/07 07:35:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists)
15/11/07 07:35:14 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 10 ms. Please check earlier log output for errors. Failing the application.

The exception clearly indicates that the WordCountOutput directory already exists, but I made sure that the directory was not there before running the job. Why am I getting this error even though the directory was not there before running my job?
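Note that the failing container id (container_1446699275562_0006_02_01) belongs to the second ApplicationMaster attempt; a plausible explanation is that the first attempt created WordCountOutput before failing, so YARN's retry then hit the FileAlreadyExistsException. A minimal sketch of a guard against this, assuming the HDFS paths from the log and a hypothetical input file:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val output = new Path("hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput")

    // Delete output left behind by a previous (failed) attempt so a retry can succeed.
    val fs = FileSystem.get(output.toUri, sc.hadoopConfiguration)
    if (fs.exists(output)) fs.delete(output, true)

    sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(output.toString)
  }
}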
Re: Spark Job failing with exit status 15
Hi

I am using Spark 1.3.0. The command that I use is below.

/spark-submit --class org.com.td.sparkdemo.spark.WordCount \
  --master yarn-cluster \
  target/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Thanks
Shashi

On Sun, Nov 8, 2015 at 11:33 PM, Ted Yu wrote:
> Which release of Spark were you using ?
>
> Can you post the command you used to run WordCount ?
>
> Cheers
>
> On Sat, Nov 7, 2015 at 7:59 AM, Shashi Vishwakarma <
> shashi.vish...@gmail.com> wrote:
>
>> I am trying to run simple word count job in spark but I am getting
>> exception while running job.
>>
>> ...
Securing Spark Job on Cluster
Hi All

I am dealing with a Spark requirement where the client (a banking client, where security is a major concern) needs all Spark processing to happen securely.

For example, all communication happening between the Spark client and server (driver and executor communication) should be on a secure channel. And when Spark spills to disk based on storage level (Mem+Disk), the data should not be written in unencrypted form on the local disk, or there should be some workaround to prevent the spill.

I did some research but could not find any concrete solution. Let me know if someone has done this. Any guidance would be a great help.

Thanks
Shashi
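For the in-transit piece, a hedged sketch of the knobs Spark exposes (exact coverage varies by version: spark.authenticate enables shared-secret authentication, SASL encryption covers the block transfer service, and spark.ssl.* covers the HTTP endpoints):

# spark-defaults.conf (a sketch, not a complete hardening guide)
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
spark.ssl.enabled                        true   # also requires keystore/truststore settings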
Re: Securing Spark Job on Cluster
Kerberos is not an Apache project. Kerberos provides a way to do authentication but does not provide data security.

On Fri, Apr 28, 2017 at 3:24 PM, veera satya nv Dantuluri <
dvsnva...@gmail.com> wrote:
> Hi Shashi,
>
> Based on your requirement for securing data, we can use Apache Kerberos,
> or we could use the security feature in Spark.
>
> > On Apr 28, 2017, at 8:45 AM, Shashi Vishwakarma <
> > shashi.vish...@gmail.com> wrote:
> >
> > ...
Re: Securing Spark Job on Cluster
Agreed, Jörn. Disk encryption is one option that will help to secure data, but how do I know at which locations Spark is spilling temp files, shuffle data and application data?

Thanks
Shashi

On Fri, Apr 28, 2017 at 3:54 PM, Jörn Franke wrote:
> You can use disk encryption as provided by the operating system.
> Additionally, you may think about shredding disks after they are not used
> anymore.
>
> > On 28. Apr 2017, at 14:45, Shashi Vishwakarma wrote:
> >
> > ...
Re: Securing Spark Job on Cluster
Yes, I am using HDFS. Just trying to understand a couple of points. There would be two kinds of encryption required:

1. Data in motion - this can be achieved by enabling SSL -
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_spark-component-guide/content/spark-encryption.html
2. Data at rest - HDFS encryption can be applied.

Apart from this, when Spark executes a job, every disk available on every node would need to be encrypted. I can have multiple disks on each node, and encrypting all of them could be a costly operation. Therefore I was trying to identify which folders Spark can spill data to during job execution; once those are identified, only those specific disks need to be encrypted.

Thanks
Shashi

On Fri, Apr 28, 2017 at 4:34 PM, Jörn Franke wrote:
> Why don't you use whole disk encryption?
> Are you using HDFS?
>
> On 28. Apr 2017, at 16:57, Shashi Vishwakarma wrote:
>
> Agreed, Jörn. Disk encryption is one option that will help to secure data
> but how do I know at which location Spark is spilling temp file, shuffle
> data and application data ?
>
> ...
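On the spill-location question, a hedged pointer: Spark writes shuffle files and spilled blocks under the directories listed in spark.local.dir, while on YARN it uses the NodeManager's local directories instead. A minimal sketch, with illustrative mount points:

# spark-defaults.conf -- point scratch space at encrypted volumes (paths illustrative)
spark.local.dir  /encrypted/disk1/spark-tmp,/encrypted/disk2/spark-tmp

On a YARN cluster the equivalent is to place yarn.nodemanager.local-dirs (in yarn-site.xml) on encrypted volumes, since Spark-on-YARN uses those directories rather than spark.local.dir.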
Spark Shuffle Encryption
Hi

I was doing research on encrypting Spark shuffle data and found that Spark 2.1 has that feature: https://issues.apache.org/jira/browse/SPARK-5682

Does anyone have more documentation around it? How do I use this feature in a real production environment, keeping in mind that I need to secure the Spark job?

Thanks
Shashi
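A minimal sketch of the settings introduced by SPARK-5682 in Spark 2.1, as a hedged starting point (the key size and algorithm shown are the documented defaults; I/O encryption also relies on spark.authenticate being enabled so a secret can be distributed to executors):

spark.authenticate                    true
spark.io.encryption.enabled           true
spark.io.encryption.keySizeBits       128
spark.io.encryption.keygen.algorithm  HmacSHA1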
Spark Streaming Design Suggestion
Hi

I have to design a Spark Streaming application for the use case below, and I am looking for the best possible approach.

I have an application which pushes data into 1000+ different topics, each with a different purpose. Spark Streaming will receive data from each topic and, after processing, will write it back to a corresponding output topic.

Ex.

Input Type 1 Topic --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic --> Spark Streaming --> Output Type N Topic and so on.

I need to answer the following questions.

1. Is it a good idea to launch 1000+ Spark Streaming applications, one per topic? Or should I have one streaming application for all topics, since the processing logic is going to be the same?
2. If one streaming context, then how will I determine which RDD belongs to which Kafka topic, so that after processing I can write it back to its corresponding output topic?
3. The client may add/delete topics from Kafka; how do I handle this dynamically in Spark Streaming?
4. How do I restart the job automatically on failure?

Any other issues you see here?

I highly appreciate your response.

Thanks
Shashi
Re: Spark Streaming Design Suggestion
I agree, Jörn and Satish. I think I should start grouping similar kinds of messages into a single topic, with some kind of id attached that the Spark Streaming application can read. I can try to reduce the number of topics significantly, but in the end I can still expect 50+ topics in the cluster.

Do you think creating parallel DStreams will help here? Refer to the link below.
https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

Thanks
Shashi

On Wed, Jun 14, 2017 at 8:12 AM, satish lalam wrote:
> Agree with Jörn. Dynamically creating/deleting topics is nontrivial to
> manage.
> With the limited knowledge about your scenario - it appears that you are
> using topics as some kind of message type enum.
> If that is the case - you might be better off with one (or just a few)
> topics and have a messagetype field in the kafka event itself.
> Your streaming job can then match-case incoming events on this field to
> choose the right processor for respective events.
>
> On Tue, Jun 13, 2017 at 1:47 PM, Jörn Franke wrote:
>
>> I do not fully understand the design here.
>> Why not send all to one topic with some application id in the message,
>> and you write to one topic also indicating the application id.
>>
>> Can you elaborate a little bit more on the use case?
>>
>> Especially applications deleting/creating topics dynamically can be a
>> nightmare to operate.
>>
>> > On 13. Jun 2017, at 22:03, Shashi Vishwakarma wrote:
>> >
>> > ...
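On question 2 from the original post: with the spark-streaming-kafka-0-10 direct stream, each ConsumerRecord carries its source topic, so a single application can subscribe to many topics and route records back out per topic. A minimal sketch, assuming illustrative topic and broker names and a hypothetical sendToKafka producer helper (left commented out):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object TopicRouter {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("TopicRouter"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",            // illustrative broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "topic-router")

    val inputTopics = Seq("input-type-1", "input-type-2") // illustrative topic names
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](inputTopics, kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { r =>
          // r.topic() identifies the source topic of every record,
          // so one stream can route to the matching output topic.
          val outputTopic = r.topic().replace("input", "output")
          // sendToKafka(outputTopic, r.value())  // hypothetical producer helper
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}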