Realtime Data Visualization Tool for Spark

2015-09-11 Thread Shashi Vishwakarma
Hi

I have streaming data which needs to be processed and sent for
visualization. I am planning to use Spark Streaming for this, but I am a
little confused about choosing a visualization tool. I read somewhere that
D3.js can be used, but I wanted to know which tool works best for
visualization with a streaming application (something that can be easily
integrated).

If someone has a link describing how to integrate D3.js (or any other
visualization tool) with a Spark Streaming application, then please share
it. That would be a great help.
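
To make the question a bit more concrete, the kind of integration I have in
mind is roughly the following: the streaming job pushes a small summary of
each micro-batch to an HTTP endpoint, and a D3.js page polls that endpoint
and redraws the chart. A rough sketch only - the socket source, dashboard
URL and the postJson helper are placeholders, not something I already have
working:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StreamingToDashboard {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingToDashboard")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Example source: a text socket; in practice this could be Kafka, Flume, etc.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)

    // Push a small summary of each micro-batch to an HTTP endpoint that the
    // D3.js page reads from. top() brings only a few records to the driver.
    counts.foreachRDD { rdd =>
      val top = rdd.top(20)(Ordering.by[(String, Long), Long](_._2))
      val json = top.map { case (w, c) => s"""{"word":"$w","count":$c}""" }
                    .mkString("[", ",", "]")
      postJson("http://dashboard-host:8080/update", json)  // placeholder endpoint
    }

    ssc.start()
    ssc.awaitTermination()
  }

  // Placeholder helper: POST a JSON payload to the dashboard backend.
  def postJson(url: String, body: String): Unit = {
    val conn = new java.net.URL(url).openConnection()
      .asInstanceOf[java.net.HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes("UTF-8"))
    conn.getResponseCode  // force the request; ignore the response body
    conn.disconnect()
  }
}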


Thanks and Regards
Shashi


Spark Job failing with exit status 15

2015-11-07 Thread Shashi Vishwakarma
I am trying to run a simple word count job in Spark, but I am getting an
exception while running the job.

For more detailed output, check the application tracking page:
http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/
Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1446699275562_0006_02_01
Exit code: 15
Stack trace: ExitCodeException exitCode=15:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 15. Failing this attempt.
Failing the application.
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: root.cloudera
 start time: 1446910483956
 final status: FAILED
 tracking URL: http://quickstart.cloudera:8088/cluster/app/application_1446699275562_0006
 user: cloudera
Exception in thread "main" org.apache.spark.SparkException: Application finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:626)
        at org.apache.spark.deploy.yarn.Client$.main(Client.scala:651)
        at org.apache.spark.deploy.yarn.Client.main(Client.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I checked the logs with the following command:

yarn logs -applicationId application_1446699275562_0006

Here is the log:

15/11/07 07:35:09 ERROR yarn.ApplicationMaster: User class threw exception: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists
        at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1053)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:954)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:863)
        at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1290)
        at org.com.td.sparkdemo.spark.WordCount$.main(WordCount.scala:23)
        at org.com.td.sparkdemo.spark.WordCount.main(WordCount.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)
15/11/07 07:35:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput already exists)
15/11/07 07:35:14 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 10 ms. Please check earlier log output for errors. Failing the application.

The exception clearly indicates that the WordCountOutput directory already
exists, but I made sure that the directory was not there before running the
job.

Why am I getting this error even though the directory did not exist before
I ran my job?
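
For what it is worth, the failing container id ends in attempt _02, which
suggests this is the second application attempt - so perhaps the first
attempt created the directory before failing, and the retry then hit
FileAlreadyExistsException. One workaround I am considering is to delete the
output path (if present) right before writing; a rough sketch, with the
input path below just a placeholder:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val output = "hdfs://quickstart.cloudera:8020/user/cloudera/WordCountOutput"

    // Remove a leftover output directory (e.g. from a failed first attempt)
    // so a retried attempt does not fail with FileAlreadyExistsException.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val outPath = new Path(output)
    if (fs.exists(outPath)) fs.delete(outPath, true)

    sc.textFile("hdfs://quickstart.cloudera:8020/user/cloudera/input.txt")  // placeholder input
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(output)

    sc.stop()
  }
}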


Re: Spark Job failing with exit status 15

2015-11-08 Thread Shashi Vishwakarma
Hi

I am using Spark 1.3.0. The command I use is below.

/spark-submit --class org.com.td.sparkdemo.spark.WordCount \
--master yarn-cluster \
target/spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Thanks
Shashi

On Sun, Nov 8, 2015 at 11:33 PM, Ted Yu  wrote:

> Which release of Spark were you using ?
>
> Can you post the command you used to run WordCount ?
>
> Cheers
>

Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Hi All

I am dealing with a Spark requirement where the client (a banking client,
for whom security is a major concern) needs all Spark processing to happen
securely.

For example, all communication between the Spark client and server (driver
and executor communication) should happen over a secure channel. Even when
Spark spills to disk based on the storage level (memory + disk), the data
should not be written to the local disk unencrypted, or there should be
some workaround to prevent the spill.

I did some research but could not find a concrete solution. Let me know if
someone has done this.

Any guidance would be a great help.

Thanks
Shashi


Re: Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Kerberos is not an Apache project. Kerberos provides a way to do
authentication, but it does not by itself provide data encryption.

On Fri, Apr 28, 2017 at 3:24 PM, veera satya nv Dantuluri <
dvsnva...@gmail.com> wrote:

> Hi Shashi,
>
> Based on your requirement for securing data, we can use Apache Kerberos, or
> we could use the security features in Spark.
>
>


Re: Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Agreed, Jörn. Disk encryption is one option that will help secure the data,
but how do I know at which locations Spark spills temp files, shuffle data
and application data?

Thanks
Shashi

On Fri, Apr 28, 2017 at 3:54 PM, Jörn Franke  wrote:

> You can use disk encryption as provided by the operating system.
> Additionally, you may think about shredding disks after they are not used
> anymore.
>


Re: Securing Spark Job on Cluster

2017-04-28 Thread Shashi Vishwakarma
Yes, I am using HDFS. Just trying to understand a couple of points.

There are two kinds of encryption that would be required.

1. Data in Motion - This could be achieved by enabling SSL -
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_spark-component-guide/content/spark-encryption.html

2. Data at Rest - HDFS Encryption can be applied.

Apart from this, when Spark executes a job, each disk available on every
node needs to be encrypted.

I can have multiple disks on each node, and encrypting all of them could be
a costly operation. Therefore I was trying to identify which folders Spark
can spill data to during job execution.

Once these locations are identified, those specific disks can be encrypted.
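
From what I have read (please correct me if this is wrong), shuffle files
and on-disk spill go under the directories listed in spark.local.dir, and on
YARN these are overridden by the NodeManager local directories
(yarn.nodemanager.local-dirs), so those are the mount points I would put on
encrypted volumes. A rough spark-defaults.conf sketch of what I have in
mind - the keystore paths, passwords and mount points below are
placeholders:

spark.authenticate                       true
# Encrypt driver <-> executor RPC traffic (SASL)
spark.authenticate.enableSaslEncryption  true
# SSL for the HTTP endpoints (UI, broadcast/file server)
spark.ssl.enabled                        true
spark.ssl.keyStore                       /etc/security/ssl/spark-keystore.jks
spark.ssl.keyStorePassword               ********
spark.ssl.trustStore                     /etc/security/ssl/spark-truststore.jks
spark.ssl.trustStorePassword             ********
# Point local scratch space at encrypted mounts (ignored on YARN, where
# yarn.nodemanager.local-dirs is used instead)
spark.local.dir                          /encrypted/disk1/spark,/encrypted/disk2/spark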

Thanks
Shashi




On Fri, Apr 28, 2017 at 4:34 PM, Jörn Franke  wrote:

> Why don't you use whole disk encryption?
> Are you using HDFS?
>


Spark Shuffle Encryption

2017-05-12 Thread Shashi Vishwakarma
Hi

I was doing research on encrypting Spark shuffle data and found that Spark
2.1 has this feature.

https://issues.apache.org/jira/browse/SPARK-5682

Does anyone have more documentation around it? How should I use this
feature in a real production environment, keeping in mind that I need to
secure the Spark job?
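
From the JIRA and the Spark 2.1 configuration docs, the relevant properties
appear to be the ones below (please verify the names and defaults against
the docs for your exact version):

spark.authenticate                    true
# Encrypt data Spark writes to local disk (shuffle files and shuffle spills)
spark.io.encryption.enabled           true
spark.io.encryption.keySizeBits       128
spark.io.encryption.keygen.algorithm  HmacSHA1

The docs recommend enabling RPC authentication/encryption as well, since the
per-application IO encryption key is distributed to executors over RPC. Also
note that this covers Spark's own temporary/shuffle data; encrypting
application data at rest (e.g. HDFS output) is still handled separately, for
example with HDFS transparent encryption.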

Thanks
Shashi


Spark Streaming Design Suggestion

2017-06-13 Thread Shashi Vishwakarma
Hi

I have to design a Spark Streaming application for the use case below. I am
looking for the best possible approach.

I have an application that pushes data into 1000+ different topics, each
with a different purpose. Spark Streaming will receive data from each topic
and, after processing, write it back to the corresponding output topic.

Ex.

Input Type 1 Topic  --> Spark Streaming --> Output Type 1 Topic
Input Type 2 Topic  --> Spark Streaming --> Output Type 2 Topic
Input Type 3 Topic  --> Spark Streaming --> Output Type 3 Topic
.
.
.
Input Type N Topic  --> Spark Streaming --> Output Type N Topic  and so on.

I need to answer the following questions.

1. Is it a good idea to launch 1000+ Spark Streaming applications, one per
topic? Or should I have one streaming application for all topics, since the
processing logic is going to be the same?
2. If there is one streaming context, how will I determine which RDD belongs
to which Kafka topic, so that after processing I can write it back to its
corresponding output topic? (See the sketch after this list.)
3. The client may add or delete topics in Kafka; how do I handle this
dynamically in Spark Streaming?
4. How do I restart the job automatically on failure?
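
For question 2, the approach I am considering is to read the topic name off
each record, assuming the spark-streaming-kafka-0-10 direct stream. A rough
sketch - broker address, topic names and the processing/producing logic are
placeholders, not tested code:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TopicRouter {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("TopicRouter"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",            // placeholder broker list
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "topic-router")

    val inputTopics = Array("input-type-1", "input-type-2")  // placeholder topic names
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](inputTopics, kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { record =>
          // record.topic() tells you which input topic the message came from,
          // so the output topic can be derived from it.
          val outputTopic = record.topic().replace("input", "output")
          val processed   = record.value().toUpperCase   // placeholder processing logic
          println(s"$outputTopic -> $processed")  // placeholder: replace with a KafkaProducer send
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}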

Any other issues you see here?

I highly appreciate your response.

Thanks
Shashi


Re: Spark Streaming Design Suggestion

2017-06-14 Thread Shashi Vishwakarma
I agree, Jörn and Satish. I think I should start grouping similar kinds of
messages into a single topic with some kind of id attached, which can be
read by the Spark Streaming application.

I can try reducing the number of topics to something significantly lower,
but in the end I can still expect 50+ topics in the cluster. Do you think
creating parallel DStreams will help here?

Refer to the link below.

streaming-programming-guide.html
<https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving>
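
In case it helps the discussion, the "parallel receiving" pattern from that
section of the guide looks roughly like this with the receiver-based Kafka
API - the ZooKeeper quorum, group and topic names below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ParallelReceivers {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ParallelReceivers"), Seconds(10))

    // Several receiver-based Kafka streams reading in parallel, then unioned
    // so the rest of the pipeline is written only once.
    val numStreams = 5
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", Map("merged-topic" -> 1))
    }
    val unified = ssc.union(kafkaStreams)

    unified.map(_._2).count().print()   // placeholder processing

    ssc.start()
    ssc.awaitTermination()
  }
}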

Thanks
Shashi


On Wed, Jun 14, 2017 at 8:12 AM, satish lalam 
wrote:

> Agree with Jörn. Dynamically creating/deleting Topics is nontrivial to
> manage.
> With the limited knowledge about your scenario - it appears that you are
> using topics as some kind of message type enum.
> If that is the case - you might be better off with one (or just a few
> topics) and have a messagetype field in kafka event itself.
> Your streaming job can then match-case incoming events on this field to
> choose the right processor for respective events.
>
> On Tue, Jun 13, 2017 at 1:47 PM, Jörn Franke  wrote:
>
>> I do not fully understand the design here.
>> Why not send everything to one topic with some application id in the
>> message, and write to one output topic that also indicates the application id?
>>
>> Can you elaborate a little bit more on the use case?
>>
>> Especially applications deleting/creating topics dynamically can be a
>> nightmare to operate
>>