Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
I am new to spark and I keep hearing that RDD's can be persisted to memory or disk after each checkpoint. I wonder why RDD's are persisted in memory? In case of node failure how would you access memory to reconstruct the RDD? persisting to disk makes sense because it's like persisting to a Network f
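For reference, a minimal spark-shell sketch of the persistence API under discussion (the input path is a placeholder): persist() keeps a dataset around for reuse, and the chosen StorageLevel decides whether partitions that do not fit in memory are recomputed, spilled to local disk, or both. Partitions lost with a failed node are rebuilt from the lineage, not read back from the dead node's memory.

    import org.apache.spark.storage.StorageLevel

    val events = sc.textFile("hdfs:///data/events")      // placeholder input
    val parsed = events.map(_.split(","))
    parsed.persist(StorageLevel.MEMORY_AND_DISK)          // keep partitions in RAM, spill the overflow to local disk
    parsed.count()                                        // first action materializes and caches the partitions
    parsed.filter(_.length > 3).count()                   // reuses the cache; missing partitions are recomputed from lineage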

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
so when do we ever need to persist RDD on disk? given that we don't need to worry about RAM(memory) as virtual memory will just push pages to the disk when memory becomes scarce. On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote: Hi Kant Kodali, Based on the input paramet

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
en to choose the persistency level. http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose Thanks, Sreekanth Jella From: kant kodali Sent: Tuesday, August 23, 2016 2:42 PM To: srikanth.je...@gmail.com Cc: user@spark.apache.org Subject: Re: Are RDD's ever pers

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
different RDD save apis for that. Sent from my iPhone On Aug 23, 2016, at 12:26 PM, kant kodali < kanth...@gmail.com > wrote: ok now that I understand RDD can be stored to the disk. My last question on this topic would be this. Storing RDD to disk is nothing but storing JVM byte code to disk (in c

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
ioned this data will be serialized before persisting to disk.. Thanks, Sreekanth Jella From: kant kodali Sent: Tuesday, August 23, 2016 3:59 PM To: Nirav Cc: RK Aduri ; srikanth.je...@gmail.com ; user@spark.apache.org Subject: Re: Are RDD's ever persisted to disk? Storing RDD to disk

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
reconstruct an RDD from its lineage in that case. so this sounds very contradictory to me after reading the spark paper. On Tue, Aug 23, 2016 1:28 PM, kant kodali kanth...@gmail.com wrote: @srkanth are you sure? the whole point of RDD's is to store transformations but not the data as the spark

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
le to store a serialized representation in memory because it may be more compact. This is not the same as saving/writing an RDD to persistent storage as text or JSON or whatever. On Tue, Aug 23, 2016 at 9:28 PM, kant kodali wrote: > @srkanth are you sure? the whole point of RDD's
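A short sketch of the distinction being made (paths are placeholders, and an existing rdd is assumed): a serialized storage level caches a compact copy that only this application can reuse, whereas the save APIs write the data out so any later job or tool can read it.

    import org.apache.spark.storage.StorageLevel

    rdd.persist(StorageLevel.MEMORY_ONLY_SER)           // serialized in-memory cache, visible only to this SparkContext
    rdd.saveAsTextFile("hdfs:///out/as-text")            // plain text files, readable by anything
    rdd.saveAsObjectFile("hdfs:///out/as-java-objects")  // Java-serialized records, readable by another Spark job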

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
, Aug 23, 2016 2:39 PM, RK Aduri rkad...@collectivei.com wrote: I just had a glance. AFAIK, that has nothing to do with RDDs. It’s a pickler used to serialize and deserialize the python code. On Aug 23, 2016, at 2:23 PM, kant kodali < kanth...@gmail.com > wrote: @Sean well this makes sense but I

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
apache/spark spark - Mirror of Apache Spark github.com On Tue, Aug 23, 2016 4:17 PM, kant kodali kanth...@gmail.com wrote: @RK you may want to look more deeply if you are curious. the code starts from here apache/spark spark - Mirror of Apache Spark github.com and it goes here where it is

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
...@collectivei.com wrote: Can you come up with your complete analysis? A snapshot of what you think the code is doing. May be that would help us understand what exactly you were trying to convey. On Aug 23, 2016, at 4:21 PM, kant kodali < kanth...@gmail.com > wrote: apache/spark spark - Mirror of Apache

What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-24 Thread kant kodali
What do I lose if I run spark without using HDFS or Zookeeper? Which of them is almost a must in practice?

How to compute a net (difference) given a bi-directional stream of numbers using spark streaming?

2016-08-24 Thread kant kodali
Hi Guys, I am new to spark but I am wondering how do I compute the difference given a bidirectional stream of numbers using spark streaming? To put it more concretely, say Bank A is sending money to Bank B and Bank B is sending money to Bank A throughout the day such that at any given time we want to
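One way to keep such a running net is updateStateByKey; the sketch below assumes each record arrives as "from,to,amount" text on a socket (host, port and checkpoint path are placeholders) and that both transfer directions are folded onto a single key with opposite signs.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    ssc.checkpoint("hdfs:///checkpoints/netting")        // stateful operations need a checkpoint directory

    val net = ssc.socketTextStream("localhost", 9999)
      .map { line =>
        val Array(from, to, amount) = line.split(",")
        // B->A transfers count as negative A->B, so one key carries the net
        if (from < to) (s"$from->$to", amount.toDouble) else (s"$to->$from", -amount.toDouble)
      }
      .updateStateByKey[Double]((amounts, state) => Some(state.getOrElse(0.0) + amounts.sum))

    net.print()                                          // at every batch, the value is the current net per pair of banks
    ssc.start()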

Re: quick question

2016-08-24 Thread kant kodali
object in your dashboard code and receive the data in realtime and update the dashboard. You can use Node.js in your dashboard ( socket.io ). I am sure there are other ways too. Does that help? Sivakumaran S On 25-Aug-2016, at 6:30 AM, kant kodali < kanth...@gmail.com > wrote: so I would need to

Re: quick question

2016-08-25 Thread kant kodali
it should be On 25-Aug-2016, at 7:08 AM, kant kodali < kanth...@gmail.com > wrote: @Sivakumaran when you say create a web socket object in your spark code I assume you meant a spark "task" opening websocket connection from one of the worker machines to some node.js server in that cas

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
case be liable for any monetary damages arising from such loss, damage or destruction. On 24 August 2016 at 21:54, kant kodali < kanth...@gmail.com > wrote: What do I lose if I run spark without using HDFS or Zookeeper? Which of them is almost a must in practice?

Re: quick question

2016-08-25 Thread kant kodali
format the data in the way your client (dashboard) requires it and write it to the websocket. Is your driver code in Python? The link Kevin has sent should start you off. Regards, Sivakumaran On 25-Aug-2016, at 11:53 AM, kant kodali < kanth...@gmail.com > wrote: yes for now it will be Spark Str

Re: quick question

2016-08-25 Thread kant kodali
twork/tutorials/obe/java/HomeWebsocket/WebsocketHome.html#section7 ) Regards, Sivakumaran S On 25-Aug-2016, at 8:09 PM, kant kodali < kanth...@gmail.com > wrote: Your assumption is right (thats what I intend to do). My driver code will be in Java. The link sent by Kevin is a API reference to w

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
n't think many apps exist that don't read/write data. The premise here is not just replication, but partitioning data across compute resources. With a distributed file system, your big input exists across a bunch of machines and you can send the work to the pieces of data. On Thu, Aug 25,

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
also uses ZK for leader election. There seems to be some effort in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806 On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote: @Ofir @Sean very good points. @Mike We dont use Kafka or

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
. S3 or NFS will not able to provide that. On 26 Aug 2016 07:49, "kant kodali" < kanth...@gmail.com > wrote: yeah so its seems like its work in progress. At very least Mesos took the initiative to provide alternatives to ZK. I am just really looking forward for this. https://iss

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread kant kodali
ZFS linux port has got very stable these days given LLNL maintains the linux port and they also use it as a FileSystem for their super computer (The supercomputer is one of the top in the nation is what I heard) On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote: How about

How to install spark with s3 on AWS?

2016-08-26 Thread kant kodali
Hi guys, Are there any instructions on how to setup spark with S3 on AWS? Thanks!

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
: On 25 Aug 2016, at 22:49, kant kodali < kanth...@gmail.com > wrote: yeah so its seems like its work in progress. At very least Mesos took the initiative to provide alternatives to ZK. I am just really looking forward for this. https://issues.apache.org/jira/browse/MESOS-3797 I worr

unable to start slaves from master (SSH problem)

2016-08-26 Thread kant kodali
Hi, I am unable to start spark slaves from my master node. When I run ./start-all.sh on my master node it brings up the master but fails for the slaves saying "permission denied public key", but I did add the master's id_rsa.pub to my slaves' authorized_keys and I checked manually from my m

Re: How to install spark with s3 on AWS?

2016-08-26 Thread kant kodali
"fs.s3.awsAccessKeyId", AccessKey) hadoopConf.set("fs.s3.awsSecretAccessKey", SecretKey) var jobInput = sc.textFile("s3://path to bucket") Thanks On Fri, Aug 26, 2016 at 5:16 PM, kant kodali < kanth...@gmail.com > wrote: Hi guys, Are there any instructions on how to setup spark with S3 on AWS? Thanks!
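Pieced together, the suggestion amounts to something like the spark-shell sketch below (bucket, path and the way the keys are supplied are placeholders, and the matching hadoop-aws / AWS SDK jars have to be on the classpath):

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    // with newer Hadoop builds the s3a keys are the usual alternative:
    //   fs.s3a.access.key / fs.s3a.secret.key and an s3a:// URI

    val jobInput = sc.textFile("s3://my-bucket/path/to/input")
    jobInput.count()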

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-26 Thread kant kodali
for any monetary damages arising from such loss, damage or destruction. On 26 August 2016 at 12:58, kant kodali < kanth...@gmail.com > wrote: @Steve your arguments make sense however there is a good majority of people who have extensive experience with zookeeper prefer to avoid zookeeper and

Re: unable to start slaves from master (SSH problem)

2016-08-26 Thread kant kodali
Fixed. I just had to log out and log back in on the master node for some reason. On Fri, Aug 26, 2016 5:32 AM, kant kodali kanth...@gmail.com wrote: Hi, I am unable to start spark slaves from my master node. When I run ./start-all.sh on my master node it brings up the master but fails for slaves

is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
is there a HTTP2 (v2) endpoint for Spark Streaming?

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
y http/2? #curious [1] http://bahir.apache.org/ Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Aug 26, 2016 at 9:42 PM, kant kodali < kanth...@gmail.com > wro

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
es these days communicate using HTTP. HTTP2 for better performance. On Fri, Aug 26, 2016 2:47 PM, kant kodali kanth...@gmail.com wrote: HTTP2 for fully pipelined out of order execution. other words I should be able to send multiple requests through same TCP connection and by out of order execut

Re: What do I lose if I run spark without using HDFS or Zookeeper?

2016-08-27 Thread kant kodali
I see an example on how I can tell spark cluster to use Cassandra for checkpointing and others if at all. On Fri, Aug 26, 2016 9:50 AM, Steve Loughran ste...@hortonworks.com wrote: On 26 Aug 2016, at 12:58, kant kodali < kanth...@gmail.com > wrote: @Steve your arguments make sense howev

can I use cassandra for checkpointing during a spark streaming job

2016-08-29 Thread kant kodali
I understand that I cannot use spark streaming window operation without checkpointing to HDFS but Without window operation I don't think we can do much with spark streaming. so since it is very essential can I use Cassandra as a distributed storage? If so, can I see an example on how I can tell sp

java.lang.RuntimeException: java.lang.AssertionError: assertion failed: A ReceiverSupervisor has not been attached to the receiver yet.

2016-08-29 Thread kant kodali
java.lang.RuntimeException: java.lang.AssertionError: assertion failed: A ReceiverSupervisor has not been attached to the receiver yet. Maybe you are starting some computation in the receiver before the Receiver.onStart() has been called. import org.apache.spark.SparkConf; import org.apache.spark.st
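That assertion usually means store() was reached before Spark had called onStart() and attached the supervisor, for example because the receiving loop is started in the receiver's constructor. A minimal receiver sketch (host and port are placeholders) that only starts its worker thread inside onStart():

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class LineReceiver(host: String, port: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      // no work in the constructor: the supervisor is attached before onStart() runs
      override def onStart(): Unit = new Thread("line-receiver") {
        override def run(): Unit = receive()
      }.start()

      override def onStop(): Unit = {}   // the thread exits once the socket closes

      private def receive(): Unit = {
        val reader = new BufferedReader(
          new InputStreamReader(new Socket(host, port).getInputStream))
        var line = reader.readLine()
        while (!isStopped() && line != null) {
          store(line)                    // safe here because onStart() has already run
          line = reader.readLine()
        }
        restart("Connection closed, restarting receiver")
      }
    }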

How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

2016-08-29 Thread kant kodali
How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

Re: Scala Vs Python

2016-09-01 Thread kant kodali
c'mon man this is a no-brainer. Dynamically typed languages for Large Code Bases or Large Scale Distributed Systems make absolutely no sense. I can write a 10 page essay on why that wouldn't work so great. you might be wondering why would spark have it then? well probably because of its ease of use for ML

Re: any idea what this error could be?

2016-09-03 Thread kant kodali
ssException complaining about non-matching serialVersionUIDs. Shouldn't that be caused by different jars on executors and driver? On 03.09.2016 at 1:04 PM, "Tal Grynbaum" wrote: My guess is that you're running out of memory somewhere. Try to increase the driver memory and/or e

Re: any idea what this error could be?

2016-09-03 Thread kant kodali
@Fridtjof you are right! changing it to this Fixed it! compile group: 'org.apache.spark' name: 'spark-core_2.11' version: '2.0.0' compile group: 'org.apache.spark' name: 'spark-streaming_2.11' version: '2.0.0' On Sat, Sep 3, 2016 12:30 PM

seeing this message repeatedly.

2016-09-03 Thread kant kodali
Hi Guys, I am running my driver program on my local machine and my spark cluster is on AWS. The big question is I don't know what are the right settings to get around this public and private ip thing on AWS? my spark-env.sh currently has the following lines export SPARK_PUBLIC_DNS="52.44.36.2

Re: seeing this message repeatedly.

2016-09-03 Thread kant kodali
not sure how to fix this or what I am missing? Any help would be great.Thanks! On Sat, Sep 3, 2016 5:39 PM, kant kodali kanth...@gmail.com wrote: Hi Guys, I am running my driver program on my local machine and my spark cluster is on AWS. The big question is I don't know what are the rig

Not sure why Filter on DStream doesn't get invoked?

2016-09-10 Thread kant kodali
Hi All, I am trying to simplify how to frame my question so below is my code. I see that BAR gets printed but not FOO and I am not sure why? my batch interval is 1 second (something I pass in when I create a spark context). any idea? I have bunch of events and I want to store the number of events
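One likely explanation, hedged since the original code isn't shown in this snippet: DStream transformations are lazy, so a filter with no output operation downstream never runs, and a println placed inside it executes on the executors, not in the driver console. A small sketch (the source is a placeholder):

    val events = ssc.socketTextStream("localhost", 9999)

    val foo = events.filter { e =>
      println("FOO " + e)   // runs on the executors; check executor stderr, not the driver console
      e.startsWith("FOO")
    }
    foo.print()             // without an output operation (print/foreachRDD/saveAs...), the filter is never invoked
    ssc.start()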

ideas on de duplication for spark streaming?

2016-09-23 Thread kant kodali
Hi Guys, I have bunch of data coming in to my spark streaming cluster from a message queue(not kafka). And this message queue guarantees at least once delivery only so there is potential that some of the messages that come in to the spark streaming cluster are actually duplicates and I am trying t
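One common approach for an at-least-once source is to key each message by a unique id and drop ids already seen, sketched below with mapWithState (the (id, payload) layout, the 30-minute redelivery window and the existing messages DStream are assumptions; stateful streams need checkpointing enabled):

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // messages: DStream[(String, String)] of (messageId, payload)
    val dedupSpec = StateSpec.function {
      (id: String, payload: Option[String], state: State[Boolean]) =>
        if (state.exists) {
          None                          // id seen before: drop the duplicate
        } else {
          state.update(true)            // remember the id
          payload.map(p => (id, p))     // first occurrence: pass it through
        }
    }.timeout(Minutes(30))              // forget ids once redelivery can no longer happen

    val unique = messages.mapWithState(dedupSpec).flatMap(o => o)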

What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini batch is something that batches in the given time frame whereas real time streaming is more like do something as the data arrives but my biggest question is why not have mini batc

Re: What is the difference between mini-batch vs real time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
ion. On 27 September 2016 at 08:12, kant kodali wrote: What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini batch is something that batches in the given time frame whereas real time streaming is more like do something as the data

java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-05 Thread kant kodali
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2701) at java.lang.Class.getDeclaredMethod(Class.java:2128) at java.io.ObjectStre

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-05 Thread kant kodali
I am running locally so they all are on one host On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com wrote: Are all spark and scala versions the same? By "all" I mean the master, worker and driver instances.

How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-06 Thread kant kodali
How to Disable or do minimal Logging for apache spark client Driver program? I couldn't find this information on docs. By Driver program I mean the java program where I initialize spark context. It produces a lot of INFO messages but I would like to know only when there is an error or an Exception such
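Two common ways to quiet it down, sketched below: adjust log4j programmatically in the driver before the context is created, or copy conf/log4j.properties.template to conf/log4j.properties and lower the root level. The package names shown are the usual noisy ones, not an exhaustive list.

    import org.apache.log4j.{Level, Logger}

    // programmatic: keep only warnings and errors from Spark/Akka in the driver output
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)

    // file-based alternative, in conf/log4j.properties:
    //   log4j.rootCategory=WARN, console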

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-06 Thread kant kodali
u referring to spark local mode? It is possible to also run spark applications in "distributed mode" (i.e. standalone, yarn or mesos) just from the command line, however that will require using spark's launcher interface and bundling your application in a jar. On Thu, Oct 6, 2016

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-07 Thread kant kodali
Oct 5, 2016 at 3:35 PM, kant kodali wrote: I am running locally so they all are on one host On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com wrote: Are all spark and scala versions the same? By "all" I mean the master, worker and driver instances.

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-07 Thread kant kodali
perfect! That fixes it all! On Fri, Oct 7, 2016 1:29 AM, Denis Bolshakov bolshakov.de...@gmail.com wrote: You need to have spark-sql, now you are missing it. On 7 Oct 2016 at 11:12, "kant kodali" wrote: Here are the jar files on my classpath after doing a grep for

How to resubmit the job after it is done?

2016-10-07 Thread kant kodali
I am currently not using spark streaming. I have a ETL pipeline and I want to just resubmit the job after it is done. Like a typical cron job. is that possible?

Re: How to Disable or do minimal Logging for apache spark client Driver program?

2016-10-07 Thread kant kodali
"standalone" in the past. --Jakob On Thu, Oct 6, 2016 at 10:30 PM, kant kodali wrote: Hi Jakob, It is the biggest question for me too since I seem to be on a different page than everyone else whenever I say "I am also using spark standalone mode and I don

This Exception has been really hard to trace

2016-10-09 Thread kant kodali
I tried SpanBy but it looks like there is a strange error happening no matter which way I try. Like the one described here for the Java solution. http://qaoverflow.com/question/how-to-use-spanby-in-java/ java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$Serializ

Re: This Exception has been really hard to trace

2016-10-09 Thread kant kodali
Hi Reynold, Actually, I did that well before posting my question here. Thanks, kant On Sun, Oct 9, 2016 8:48 PM, Reynold Xin r...@databricks.com wrote: You should probably check with DataStax who build the Cassandra connector for Spark. On Sun, Oct 9, 2016 at 8:13 PM, kant kodali wrote

Re: This Exception has been really hard to trace

2016-10-10 Thread kant kodali
'META-INF/*.DSA' zip64 true } This successfully creates the jar but the error still persists. On Sun, Oct 9, 2016 11:44 PM, Shixiong(Ryan) Zhu shixi...@databricks.com wrote: Seems the runtime Spark is different from the compiled one. You should mark the Spark components "provided"

Re: ClassCastException while running a simple wordCount

2016-10-10 Thread kant kodali
+1 Wooho I have the same problem. I have been trying hard to fix this. On Mon, Oct 10, 2016 3:23 AM, vaibhav thapliyal vaibhav.thapliyal...@gmail.com wrote: Hi, If I change the parameter inside the setMaster()  to "local", the program runs. Is there something wrong with the cluster installa

Re: ClassCastException while running a simple wordCount

2016-10-11 Thread kant kodali
d lead to the bug you're seeing. A common >>> reason for a mismatch is if the SPARK_HOME environment variable is set. >>> This will cause the spark-submit script to use the launcher determined by >>> that environment variable, regardless of the directory from which it

spark streaming client program needs to be restarted after few hours of idle time. how can I fix it?

2016-10-18 Thread kant kodali
Hi Guys, My Spark Streaming Client program works fine as long as the receiver receives the data but say my receiver has no more data to receive for a few hours like (4-5 hours) and then it starts receiving the data again at that point spark client program doesn't seem to process any data. It n

java.lang.OutOfMemoryError: unable to create new native thread

2016-10-28 Thread kant kodali
"dag-scheduler-event-loop" java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker( ForkJoinPool.java:1672) at scala.co

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-29 Thread kant kodali
Another thing I forgot to mention is that it happens after running for several hours say (4 to 5 hours) I am not sure why it is creating so many threads? any way to control them? On Fri, Oct 28, 2016 at 12:47 PM, kant kodali wrote: > "dag-scheduler-event-loop" java.lang.Out

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-30 Thread kant kodali
On Sun, Oct 30, 2016 at 5:22 PM, Chan Chor Pang wrote: > /etc/security/limits.d/90-nproc.conf > Hi, I am using Ubuntu 16.04 LTS. I have this directory /etc/security/limits.d/ but I don't have any files underneath it. This error happens after running for 4 to 5 hours. I wonder if this is a GC is

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-30 Thread kant kodali
ther way > the jvm process will still not able to create new thread. > > btw the default limit for centos is 1024 > > > On 10/31/16 9:51 AM, kant kodali wrote: > > > On Sun, Oct 30, 2016 at 5:22 PM, Chan Chor Pang > wrote: > >> /etc/security/limits.d/90-nproc.con

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-30 Thread kant kodali
when I did this cat /proc/sys/kernel/pid_max I got 32768 On Sun, Oct 30, 2016 at 6:36 PM, kant kodali wrote: > I believe for ubuntu it is unlimited but I am not 100% sure (I just read > somewhere online). I ran ulimit -a and this is what I get > > core file size (blocks,

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-30 Thread kant kodali
gt; other user. > > > On 10/31/16 10:38 AM, kant kodali wrote: > > when I did this > > cat /proc/sys/kernel/pid_max > > I got 32768 > > On Sun, Oct 30, 2016 at 6:36 PM, kant kodali wrote: > >> I believe for ubuntu it is unlimited but I am not 100% sure (

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-30 Thread kant kodali
still suspecting the user, > as the user who run spark-submit is not necessary the pid for the JVM > process > > can u make sure when you "ps -ef | grep {your app id} " the PID is root? > On 10/31/16 11:21 AM, kant kodali wrote: > > The java process is run by the root

why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
when I do ps -elfT | grep "spark-driver-program.jar" | wc -l the result is around 32K. Why does it create so many threads and how can I limit this?

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
m to not create so many? On Mon, Oct 31, 2016 at 3:25 AM, Sean Owen wrote: > ps -L [pid] is what shows threads. I am not sure this is counting what you > think it does. My shell process has about a hundred threads, and I can't > imagine why one would have thousands unless your app spa

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
pend on your driver program. Do you spawn any threads in > it? Could you share some more information on the driver program, spark > version and your environment? It would greatly help others to help you > > On Mon, Oct 31, 2016 at 3:47 AM, kant kodali wrote: > > The source of my p

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
I am also under the assumption that *onStart *function of the Receiver is only called only once by Spark. please correct me if I am wrong. On Mon, Oct 31, 2016 at 11:35 AM, kant kodali wrote: > My driver program runs a spark streaming job. And it spawns a thread by > itself only

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
which types of threads are leaking? > > On Mon, Oct 31, 2016 at 11:50 AM, kant kodali wrote: > >> I am also under the assumption that *onStart *function of the Receiver is >> only called only once by Spark. please correct me if I am wrong. >> >> On Mon, Oct 31, 201

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
if the leak threads are in the driver side. > > Does it happen in the driver or executors? > > On Mon, Oct 31, 2016 at 12:20 PM, kant kodali wrote: > >> Hi Ryan, >> >> Ahh My Receiver.onStop method is currently empty. >> >> 1) I have a hard time seeing why th

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
e `jstack` to find out > the name of leaking threads? > > On Mon, Oct 31, 2016 at 12:35 PM, kant kodali wrote: > >> Hi Ryan, >> >> It happens on the driver side and I am running on a client mode (not the >> cluster mode). >> >> Thanks! >> >

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-31 Thread kant kodali
te: > Have you tried to get number of threads in a running process using `cat > /proc//status` ? > > On Sun, Oct 30, 2016 at 11:04 PM, kant kodali wrote: > >> yes I did run ps -ef | grep "app_name" and it is root. >> >> >> >> On Sun, Oct 30, 20

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-31 Thread kant kodali
Here is a UI of my thread dump. http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng== On Mon, Oct 31, 2016 at 10:32 PM, kant kodali wrote: > Hi Vadim, > > Thank you so much this w

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-10-31 Thread kant kodali
Here is a UI of my thread dump. http://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTYvMTEvMS8tLWpzdGFja19kdW1wX3dpbmRvd19pbnRlcnZhbF8xbWluX2JhdGNoX2ludGVydmFsXzFzLnR4dC0tNi0xNy00Ng== On Mon, Oct 31, 2016 at 7:10 PM, kant kodali wrote: > Hi Ryan, > > I think you are right. Thi

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
This question looks very similar to mine but I don't see any answer. http://markmail.org/message/kkxhi5jjtwyadzxt On Mon, Oct 31, 2016 at 11:24 PM, kant kodali wrote: > Here is a UI of my thread dump. > > http://fastthread.io/my-thread-report.jsp?p=c2

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
ubly ended queues in the ForkJoinPool. Thanks! On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen wrote: > Possibly https://issues.apache.org/jira/browse/SPARK-17396 ? > > On Tue, Nov 1, 2016 at 2:11 AM kant kodali wrote: > >> Hi Ryan, >> >> I think you are right. This ma

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
@Sean It looks like this problem can happen with other RDD's as well. Not just unionRDD On Tue, Nov 1, 2016 at 2:52 AM, kant kodali wrote: > Hi Sean, > > The comments seem very relevant although I am not sure if this pull > request https://github.com/apache/spark/pull/14985 w

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0 On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu wrote: > Dstream "Window" uses "union" to combine multiple RDDs in one window into > a single RDD. > > On Tue, Nov 1, 2016 at 2:59 AM kant kodali

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
Looks like upgrading to Spark 2.0.1 fixed it! The thread count now when I do cat /proc/pid/status is about 84 as opposed to a 1000 in the span of 2 mins in Spark 2.0.0 On Tue, Nov 1, 2016 at 11:40 AM, Shixiong(Ryan) Zhu wrote: > Yes, try 2.0.1! > > On Tue, Nov 1, 2016 at 11:25 AM, ka

random idea

2016-11-02 Thread kant kodali
Hi Guys, I have a random idea and it would be great to receive some input. Can we have a HTTP2 Based receiver for Spark Streaming? I am wondering why not build micro services using Spark when needed? I can see it is not meant for that but I like to think it can be possible. To be more concrete, h

Re: Custom receiver for WebSocket in Spark not working

2016-11-02 Thread kant kodali
I don't see a store() call in your receive(). Search for store() in here http://spark.apache.org/docs/latest/streaming-custom-receivers.html On Wed, Nov 2, 2016 at 10:23 AM, Cassa L wrote: > Hi, > I am using spark 1.6. I wrote a custom receiver to read from WebSocket. > But when I start my spa

How do I specify StorageLevel in KafkaUtils.createDirectStream?

2016-11-03 Thread kant kodali
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent(), ConsumerStrategies.Subscribe(topics, kafkaParams));
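For context: the 0.10 direct stream has no receiver (executors read from Kafka as tasks run), so unlike the old receiver-based createStream it takes no StorageLevel argument. A Scala sketch of the same call, with placeholder broker, topic and group values:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("my-topic"), kafkaParams))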

How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
How to use Spark SQL to connect to Cassandra from Spark-Shell? Any examples ? I use Java 8. Thanks! kant

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
ithub.com/datastax/spark-cassandra-connector > > > Yong > > > > ------ > *From:* kant kodali > *Sent:* Friday, November 11, 2016 11:04 AM > *To:* user @spark > *Subject:* How to use Spark SQL to connect to Cassandra from Spark-Shell? > > H

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
https://academy.datastax.com/courses/ds320-analytics-apache-spark/spark-sql-spark-sql-basics On Fri, Nov 11, 2016 at 8:11 AM, kant kodali wrote: > Hi, > > This is spark-cassandra-connector > <https://github.com/datastax/spark-cassandra-connector> but I am looking > more for

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
Wait I cannot create CassandraSQLContext from spark-shell. is this only for enterprise versions? Thanks! On Fri, Nov 11, 2016 at 8:14 AM, kant kodali wrote: > https://academy.datastax.com/courses/ds320-analytics- > apache-spark/spark-sql-spark-sql-basics > > On Fri, Nov 11, 201
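For later readers: with the open-source connector the DataFrame reader covers this from a plain spark-shell, without CassandraSQLContext. The package version, keyspace and table below are assumptions/placeholders.

    // launched with the connector on the classpath, e.g.:
    //   ./bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.0 \
    //                     --conf spark.cassandra.connection.host=127.0.0.1

    val words = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test", "table" -> "words"))
      .load()

    words.createOrReplaceTempView("words")
    spark.sql("SELECT * FROM words WHERE word = 'spark'").show()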

How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread kant kodali
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#json-datasets "Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file." val df = spark.sql("SELECT

How does predicate push down really help?

2016-11-16 Thread kant kodali
How does predicate push down really help? in the following cases val df1 = spark.sql("select * from users where age > 30") vs val df1 = spark.sql("select * from users") df.filter("age > 30")
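Whether the filter actually reaches the data source can be checked from the plans; a quick sketch (for a source that supports pushdown, such as Parquet or JDBC, the physical plan lists the pushed filters):

    val eager = spark.sql("select * from users where age > 30")
    val late  = spark.sql("select * from users").filter("age > 30")

    eager.explain(true)   // prints parsed, analyzed, optimized and physical plans
    late.explain(true)    // Catalyst pushes the filter down here as well, so the optimized plans should match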

Re: How does predicate push down really help?

2016-11-16 Thread kant kodali
robably better) example would be something like having two table > A and B which are joined by some common key. Then a filtering is done on > the key. Moving the filter to be before the join would probably make > everything faster as filter is a faster operation than a join. > &

Re: How does predicate push down really help?

2016-11-17 Thread kant kodali
time taken by the > filter itself. > > > > BTW. You can see the differences between the original plan and the > optimized plan by calling explain(true) on the dataframe. This would show > you what was parsed, how the optimization worked and what was physically > run.

Re: How does predicate push down really help?

2016-11-17 Thread kant kodali
Thanks for the effort and clear explanation. On Thu, Nov 17, 2016 at 12:07 AM, kant kodali wrote: > Yes thats how I understood it with your first email as well but the key > thing here sounds like some datasources may not have operators such as > filter and so on in which case Spark St

Another Interesting Question on SPARK SQL

2016-11-17 Thread kant kodali
​ Which parts in the diagram above are executed by DataSource connectors and which parts are executed by Tungsten? or to put it in another way which phase in the diagram above does Tungsten leverages the Datasource connectors (such as say cassandra connector ) ? My understanding so far is that con

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread kant kodali
e mapping your JSON payloads to > tractable data structures will depend on business logic. > > The strategy of pulling out a blob into its on rdd and feeding it into the > JSON loader should work for any data source once you have your data > strategy figured out. > > On Wed,

Re: Configure spark.kryoserializer.buffer.max at runtime does not take effect

2016-11-17 Thread kant kodali
yeah I feel like this is a bug since you can't really modify the settings once you were given spark session or spark context. so the work around would be to use --conf. In your case it would be like this ./spark-shell --conf spark.kryoserializer.buffer.max=1g On Thu, Nov 17, 2016 at 1:59 PM, Ko

How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-17 Thread kant kodali
Hi All, I would like to flatten JSON blobs into a Data Frame using Spark/Spark SQL inside Spark-Shell. val df = spark.sql("select body from test limit 3"); // body is a json encoded blob column val df2 = df.select(df("body").cast(StringType).as("body")) when I do df2.show // shows the 3 rows

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-18 Thread kant kodali
This seems to work import org.apache.spark.sql._ val rdd = df2.rdd.map { case Row(j: String) => j } spark.read.json(rdd).show() However I wonder if there is any inefficiency here, since I have to apply this function to a billion rows.

How to expose Spark-Shell in the production?

2016-11-18 Thread kant kodali
How to expose Spark-Shell in the production? 1) Should we expose it on Master Nodes or Executor nodes? 2) Should we simply give access to those machines and the Spark-Shell binary? what is the recommended way? Thanks!

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-21 Thread kant kodali
main/scala/org/apache/spark/sql/functions.scala#L2902> > function that I think will do what you want. > > On Fri, Nov 18, 2016 at 2:29 AM, kant kodali wrote: > >> This seem to work >> >> import org.apache.spark.sql._ >> val rdd = df2.rdd.map { ca
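The function referenced there appears to be from_json, available from Spark 2.1; a sketch continuing the df2 from earlier in the thread, with an assumed two-field schema standing in for the real payload:

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("user", StringType)
      .add("amount", DoubleType)

    val flattened = df2
      .select(from_json(col("body"), schema).as("j"))   // parse each blob, no detour through an RDD
      .select("j.*")

    flattened.show()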

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-23 Thread kant kodali
On Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust wrote: > The first release candidate should be coming out this week. You can > subscribe to the dev list if you want to follow the release schedule. > > On Mon, Nov 21, 2016 at 9:34 PM, kant kodali wrote: > >> Hi Michael, >&

Spark Shell doesn't seem to use spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Hi All, Spark Shell doesn't seem to use spark workers but Spark Submit does. I had the workers' IPs listed under the conf/slaves file. I am trying to count the number of rows in Cassandra using spark-shell so I do the following on the spark master val df = spark.sql("SELECT test from hello") // This has abo
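One thing worth checking, since the launch command isn't shown in this snippet: spark-shell defaults to local[*] unless a master URL is supplied on the command line or in conf/spark-defaults.conf, in which case the workers listed in conf/slaves are never contacted. A sketch:

    // ./bin/spark-shell --master spark://<master-host>:7077
    sc.master   // should report spark://<master-host>:7077, not local[*]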
