Re: How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-02-03 Thread kant kodali
s address of master as a parameter. That > slave will contact master and register itself. > > On Jan 25, 2017 4:12 AM, "kant kodali" <kanth...@gmail.com> wrote: > >> Hi, >> >> How do I dynamically add nodes to spark standalone cluster and be ab

Re: How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-02-03 Thread kant kodali
Sorry, I should just do this: ./start-slave.sh spark://x.x.x.x:7077,y.y.y.y:7077,z.z.z.z:7077, but what about export SPARK_MASTER_HOST="x.x.x.x y.y.y.y z.z.z.z"? Don't I need to have that on my worker node? Thanks! On Fri, Feb 3, 2017 at 4:57 PM, kant kodali <kanth...@gmail.com&
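
A minimal sketch of how that comma-separated HA master URL is used, with the same placeholder addresses as above: an application, like a worker started via start-slave.sh, simply lists every master, while SPARK_MASTER_HOST is, as far as I can tell, only read by a master to pick its own bind address, so workers should not need it.

    import org.apache.spark.sql.SparkSession

    // Sketch only: the same comma-separated master list passed to start-slave.sh
    // also works as an application's master URL, letting it fail over to whichever
    // master ZooKeeper elects as leader.
    val spark = SparkSession.builder()
      .appName("ha-masters-example")
      .master("spark://x.x.x.x:7077,y.y.y.y:7077,z.z.z.z:7077")
      .getOrCreate()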

can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
can I use Spark Standalone with HDFS but no YARN? Thanks!

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
On Fri, Feb 3, 2017 at 10:27 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > yes > > On Fri, Feb 3, 2017 at 10:08 PM, kant kodali <kanth...@gmail.com> wrote: > >> can I use Spark Standalone with HDFS but no YARN? >> >> Thanks! >> > >

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
t;m...@clearstorydata.com> wrote: > yes > > On Fri, Feb 3, 2017 at 10:08 PM, kant kodali <kanth...@gmail.com> wrote: > >> can I use Spark Standalone with HDFS but no YARN? >> >> Thanks! >> > >

Re: why does the Spark web UI keep changing its port?

2017-01-23 Thread kant kodali
he Master, whose default port is 8080 (not 4040). The default > port for the app's UI is 4040. > > On Mon, Jan 23, 2017 at 11:47 AM, kant kodali <kanth...@gmail.com> wrote: > > I am not sure why Spark web UI keeps changing its port every time I > restart > > a cluster? how

Re: why does the Spark web UI keep changing its port?

2017-01-23 Thread kant kodali
s own UI which runs (starting on) port 4040. > > On Mon, Jan 23, 2017 at 12:05 PM, kant kodali <kanth...@gmail.com> wrote: > > I am using standalone mode so wouldn't be 8080 for my app web ui as well? > > There is nothing running on 4040 in my cluster. > > > > http://spa

why does the Spark web UI keep changing its port?

2017-01-23 Thread kant kodali
I am not sure why the Spark web UI keeps changing its port every time I restart a cluster. How can I make it always run on one port? I did make sure there is no process running on 4040 (Spark's default web UI port), however it still starts at 8080. Any ideas? MasterWebUI: Bound MasterWebUI to 0.0.0.0,
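
For context, the 8080 in that MasterWebUI log line is the standalone master's own UI, while each application gets a separate UI defaulting to 4040 (or the next free port). A hedged sketch of pinning the application UI port from code:

    import org.apache.spark.sql.SparkSession

    // spark.ui.port controls the per-application UI (default 4040). The master
    // daemon's UI on 8080 is configured separately (SPARK_MASTER_WEBUI_PORT).
    val spark = SparkSession.builder()
      .appName("fixed-ui-port")
      .config("spark.ui.port", "4040")
      .getOrCreate()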

what would be the recommended production requirements?

2017-01-23 Thread kant kodali
Hi, I am planning to go to production using Spark standalone mode with the following configuration, and I would like to know if I am missing something; any other suggestions are welcome. 1) Three Spark standalone masters deployed on different nodes, using Apache ZooKeeper for leader election.

Re: why does the Spark web UI keep changing its port?

2017-01-23 Thread kant kodali
s on what you mean by "job". Which is why I prefer "app", which > is clearer (something you submit using "spark-submit", for example). > > But really, I'm not sure what you're asking now. > > On Mon, Jan 23, 2017 at 12:15 PM, kant kodali <kanth...@gmai

How many Spark masters and ZooKeeper servers do I need to tolerate one failure in one DC and two AZs?

2017-01-24 Thread kant kodali
How many Spark masters and ZooKeeper servers do I need to tolerate one failure in one DC that has two availability zones? Note: the one failure that I want to tolerate can be in either availability zone. Here is my understanding so far; please correct me if I am wrong. For ZooKeeper I would

Re: question on spark streaming based on event time

2017-01-29 Thread kant kodali
2016/events/a-deep-dive-into- > structured-streaming/ > > On Sat, Jan 28, 2017 at 7:05 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> I read through the documentation on Spark Streaming based on event time >> and how spark handles lags w.r.t p

do I need to run spark standalone master with supervisord?

2017-01-25 Thread kant kodali
Do I need to run the Spark standalone master with a process supervisor such as supervisord or systemd? Does the Spark standalone master abort itself if ZooKeeper tells it that it is not the master anymore? Thanks!

Re: java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
Never mind, figured it out. I compiled my client jar with 2.0.2 while the Spark deployed on my machines was 2.0.1. Communication problems between the dev team and the ops team :) On Fri, Jan 20, 2017 at 3:03 PM, kant kodali <kanth...@gmail.com> wrote: > Is this because of versioning issue? can't wai

is this something to worry about? HADOOP_HOME or hadoop.home.dir are not set

2017-01-20 Thread kant kodali
Hi, I am running Spark standalone with no storage. When I use spark-submit to submit my job I get the following exception, and I wonder if this is something to worry about? *java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set*

Re: java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
Is this because of a versioning issue? Can't wait for the JDK 9 module system; I am not sure if Spark plans to leverage it. On Fri, Jan 20, 2017 at 1:30 PM, kant kodali <kanth...@gmail.com> wrote: > I get the following exception. I am using Spark 2.0.1 and Sca

java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
I get the following exception. I am using Spark 2.0.1 and Scala 2.11.8. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 13, 172.31.20.212): java.io.InvalidClassException:

spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
I have a bunch of .index and .data files like that filling up my disk. I am not sure what the fix is. I am running Spark 2.0.2 in standalone mode. Thanks!

Re: spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
Oh sorry, it's actually in the documentation. I should just set spark.worker.cleanup.enabled = true. On Wed, Jan 25, 2017 at 11:30 AM, kant kodali <kanth...@gmail.com> wrote: > I have bunch of .index and .data files like that fills up my disk. I am > not sure what the fix is? I am r

How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-01-24 Thread kant kodali
Hi, how do I dynamically add nodes to a Spark standalone cluster and be able to discover them? Does ZooKeeper do service discovery? What is the standard tool for these things? Thanks, kant

Re: How many Spark masters and ZooKeeper servers do I need to tolerate one failure in one DC and two AZs?

2017-01-24 Thread kant kodali
the two availability zones that are available in my DC. On Tue, Jan 24, 2017 at 5:37 PM, kant kodali <kanth...@gmail.com> wrote: > How many spark masters and zookeeper servers do I need to tolerate one > failure in one DC that has two availability zones ? Note: The one failure &

question on spark streaming based on event time

2017-01-28 Thread kant kodali
Hi all, I read through the documentation on Spark Streaming based on event time and how Spark handles lags w.r.t. processing time and so on, but what if the lag between event time and processing time is too long? In other words, what should I do if I am receiving yesterday's data (the timestamp
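
A sketch of how Structured Streaming (from Spark 2.1) bounds this with a watermark; the events DataFrame and its eventTime column are hypothetical. Records that arrive further behind the watermark than the stated delay can simply be dropped instead of reopening day-old windows, which is one answer to the yesterday's-data case:

    import org.apache.spark.sql.functions.{col, window}

    // Accept data up to one day late; anything older may be discarded rather
    // than update long-closed windows.
    val counts = events
      .withWatermark("eventTime", "1 day")
      .groupBy(window(col("eventTime"), "10 minutes"))
      .count()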

I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread kant kodali
Hi, val df = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "hello", "keyspace" -> "test" )).load() This line works fine; I can see it actually pulled the table schema from Cassandra. However, when I do df.count I get the error below. I am using the following

Re: I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread kant kodali
gmail.com> > wrote: > >> Hey, >> >> Can you try with the 2.11 spark-cassandra-connector? You just reported >> that you use spark-cassandra-connector*_2.10*-2.0.0-RC1.jar >> >> Best, >> Anastasios >> >> On Fri, Feb 17, 2017 at 6:40 PM, kant

How do I increase readTimeoutMillis parameter in Spark-shell?

2017-02-17 Thread kant kodali
How do I increase the readTimeoutMillis parameter in spark-shell? Because in the middle of cassandraCount the job aborts with the following exception: java.io.IOException: Exception during execution of SELECT count(*) FROM "test"."hello" WHERE token("cid") > ? AND token("cid") <= ? ALLOW FILTERING:
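
readTimeoutMillis is a DataStax Java driver socket option; with the spark-cassandra-connector it is normally raised through a Spark property rather than on the driver directly. A sketch assuming the 2.0.x connector's spark.cassandra.read.timeout_ms property (worth confirming against the connector docs for the version in use); the same setting can be passed to spark-shell with --conf:

    import org.apache.spark.sql.SparkSession

    // Assumed property name from the connector reference; the value is in milliseconds.
    val spark = SparkSession.builder()
      .config("spark.cassandra.connection.host", "x.x.x.x")  // placeholder host
      .config("spark.cassandra.read.timeout_ms", "300000")   // 5 minutes
      .getOrCreate()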

question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
When I submit a job using spark-shell I get something like this: [Stage 0:> (36814 + 4) / 220129]. Now all I want is to increase the number of parallel tasks from 4 to 16, so I exported an env variable called SPARK_WORKER_CORES=16 in conf/spark-env.sh. I thought that should do
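
For what it's worth, the "+ 4" in that progress line is the number of tasks running at once, which roughly tracks the total executor cores the application was actually granted rather than what a worker could offer. SPARK_WORKER_CORES has to be set in conf/spark-env.sh on each worker and the workers restarted; the application-side knobs are sketched below with illustrative values:

    import org.apache.spark.sql.SparkSession

    // spark.cores.max caps the total cores this application may take from the
    // standalone cluster; spark.executor.cores splits them across executors.
    val spark = SparkSession.builder()
      .config("spark.cores.max", "16")
      .config("spark.executor.cores", "4")
      .getOrCreate()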

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
Standalone. On Fri, Feb 17, 2017 at 5:01 PM, Alex Kozlov <ale...@gmail.com> wrote: > What Spark mode are you running the program in? > > On Fri, Feb 17, 2017 at 4:55 PM, kant kodali <kanth...@gmail.com> wrote: > >> when I submit a job using spark shell I get some

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
executor > per Spark slave, and DECREASING the executor-cores in standalone makes > the total # of executors go up. Just my 2¢. > > On Fri, Feb 17, 2017 at 5:20 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi Satish, >> >> I am using spark 2.0.2. And n

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
> > > On Fri, Feb 17, 2017 at 5:01 PM, Alex Kozlov <ale...@gmail.com> wrote: > > What Spark mode are you running the program in? > > > > On Fri, Feb 17, 2017 at 4:55 PM, kant kodali <kanth...@gmail.com> wrote: > > when I submit a job using spark sh

Does anyone have any experience using Spark in the banking industry?

2017-01-18 Thread kant kodali
Does anyone have any experience using Spark in the banking industry? I have a couple of questions. 1. Most banks seem to care about the number of pending transactions at any given time, and I wonder if this is processing time or event time? I am just trying to understand how this is normally done in the

Are we still dependent on Guava jar in Spark 2.1.0 as well?

2017-02-26 Thread kant kodali
Are we still dependent on the Guava jar in Spark 2.1.0 as well (given the Guava jar incompatibility issues)?

How to compute a net (difference) given a bi-directional stream of numbers using spark streaming?

2016-08-24 Thread kant kodali
Hi guys, I am new to Spark, but I am wondering how I compute the difference given a bidirectional stream of numbers using Spark Streaming? To make it more concrete, say Bank A is sending money to Bank B and Bank B is sending money to Bank A throughout the day, such that at any given time we want
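
One way to frame this with classic DStreams is to normalise every transfer into a signed amount per bank pair and keep a running sum as state; the sign of the running total is then the direction of the net at any instant. A sketch only, with the stream name and key illustrative, and noting that updateStateByKey also needs a checkpoint directory:

    // transfers: DStream[(String, Double)], keyed by a bank pair such as "A<->B",
    // with the value +amount for A -> B and -amount for B -> A.
    val net = transfers.updateStateByKey[Double] { (amounts: Seq[Double], state: Option[Double]) =>
      Some(state.getOrElse(0.0) + amounts.sum)
    }
    net.print()  // running net per bank pair, emitted every batch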

Re: quick question

2016-08-25 Thread kant kodali
ket object in your dashboard code and receive the data in realtime and update the dashboard. You can use Node.js in your dashboard ( socket.io ). I am sure there are other ways too. Does that help? Sivakumaran S On 25-Aug-2016, at 6:30 AM, kant kodali < kanth...@gmail.com > wrote: so I would need to

Re: quick question

2016-08-25 Thread kant kodali
it should be On 25-Aug-2016, at 7:08 AM, kant kodali < kanth...@gmail.com > wrote: @Sivakumaran when you say create a web socket object in your spark code I assume you meant a spark "task" opening websocket connection from one of the worker machines to some node.js server in that cas

Re: any idea what this error could be?

2016-09-03 Thread kant kodali
2016, 11:42 kant kodali <kanth...@gmail.com> wrote: I am running this on aws. On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com wrote: I am running spark in stand alone mode. I guess this error when I run my driver program..I am using spark 2.0.0. any idea

Re: any idea what this error could be?

2016-09-03 Thread kant kodali
@Fridtjof you are right! Changing it to this fixed it! compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0' compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0' On Sat, Sep 3, 2016 12:30 PM, kant kodali kanth...@gmail.com wrote: I increased

seeing this message repeatedly.

2016-09-03 Thread kant kodali
Hi guys, I am running my driver program on my local machine and my Spark cluster is on AWS. The big question is I don't know what the right settings are to get around this public and private IP thing on AWS. My spark-env.sh currently has the following lines: export

Re: seeing this message repeatedly.

2016-09-03 Thread kant kodali
how to fix this or what I am missing? Any help would be great.Thanks! On Sat, Sep 3, 2016 5:39 PM, kant kodali kanth...@gmail.com wrote: Hi Guys, I am running my driver program on my local machine and my spark cluster is on AWS. The big question is I don't know what are the right settings to

Re: Scala Vs Python

2016-09-01 Thread kant kodali
C'mon man, this is a no-brainer. Dynamically typed languages for large code bases or large-scale distributed systems make absolutely no sense. I can write a 10-page essay on why that wouldn't work so great. You might be wondering why Spark would have it then? Well, probably because of its ease of use for

java.lang.RuntimeException: java.lang.AssertionError: assertion failed: A ReceiverSupervisor has not been attached to the receiver yet.

2016-08-29 Thread kant kodali
java.lang.RuntimeException: java.lang.AssertionError: assertion failed: A ReceiverSupervisor has not been attached to the receiver yet. Maybe you are starting some computation in the receiver before the Receiver.onStart() has been called.

How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

2016-08-29 Thread kant kodali
How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?

Not sure why Filter on DStream doesn't get invoked?

2016-09-10 Thread kant kodali
Hi all, I am trying to simplify how to frame my question, so below is my code. I see that BAR gets printed but not FOO and I am not sure why. My batch interval is 1 second (something I pass in when I create the Spark context). Any idea? I have a bunch of events and I want to store the number of events
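
One point worth keeping next to this question: DStream transformations are lazy, so a filter whose result never reaches an output operation (print, foreachRDD, saveAs...) is never executed, and logging inside the filter function runs on the executors rather than in the driver console. A minimal sketch under those assumptions, with lines and ssc hypothetical:

    val foos = lines.filter { line =>
      println("FOO")    // executes on the executors, so it appears in executor logs
      line.contains("foo")
    }
    foos.print()        // without an output operation the filter never runs
    ssc.start()
    ssc.awaitTermination()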

ideas on de duplication for spark streaming?

2016-09-24 Thread kant kodali
Hi guys, I have a bunch of data coming in to my Spark Streaming cluster from a message queue (not Kafka). This message queue guarantees at-least-once delivery only, so there is potential that some of the messages that come in to the Spark Streaming cluster are actually duplicates, and I am trying
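
With at-least-once delivery, a common approach is to key every record by a unique message id and drop keys that already have state, for example with mapWithState. A sketch only: the keyedEvents stream, the String payload, and the timeout are placeholders, and the state itself has to be checkpointed:

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // keyedEvents: DStream[(String, String)] keyed by a unique message id.
    val dedupSpec = StateSpec.function {
      (id: String, event: Option[String], state: State[Boolean]) =>
        if (state.exists) None                 // already seen: treat as a duplicate
        else { state.update(true); event }     // first occurrence: pass it through
    }.timeout(Minutes(30))                     // forget ids after a while

    val deduped = keyedEvents.mapWithState(dedupSpec).filter(_.isDefined).map(_.get)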

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-05 Thread kant kodali
I am running locally so they all are on one host On Wed, Oct 5, 2016 3:12 PM, Jakob Odersky ja...@odersky.com wrote: Are all spark and scala versions the same? By "all" I mean the master, worker and driver instances.

Re: java.lang.NoClassDefFoundError: org/apache/spark/sql/Dataset

2016-10-07 Thread kant kodali
perfect! That fixes it all! On Fri, Oct 7, 2016 1:29 AM, Denis Bolshakov bolshakov.de...@gmail.com wrote: You need to have spark-sql, now you are missing it. 7 Окт 2016 г. 11:12 пользователь "kant kodali" <kanth...@gmail.com> написал: Here are the jar files on my class

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-27 Thread kant kodali
an example on how I can tell spark cluster to use Cassandra for checkpointing and others if at all. On Fri, Aug 26, 2016 9:50 AM, Steve Loughran ste...@hortonworks.com wrote: On 26 Aug 2016, at 12:58, kant kodali < kanth...@gmail.com > wrote: @Steve your arguments make sense h

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-26 Thread kant kodali
arising from such loss, damage or destruction. On 26 August 2016 at 12:58, kant kodali < kanth...@gmail.com > wrote: @Steve your arguments make sense however there is a good majority of people who have extensive experience with zookeeper prefer to avoid zookeeper and given the ease of consul

Re: How to install spark with s3 on AWS?

2016-08-26 Thread kant kodali
s3.awsAccessKeyId",AccessKey) hadoopConf.set("fs.s3.awsSecretAccessKey",SecretKey) var jobInput = sc.textFile("s3://path to bucket") Thanks On Fri, Aug 26, 2016 at 5:16 PM, kant kodali < kanth...@gmail.com > wrote: Hi guys, Are there any instructions on how to setup spark with S3 on AWS? Thanks!

unable to start slaves from master (SSH problem)

2016-08-26 Thread kant kodali
Hi, I am unable to start the Spark slaves from my master node. When I run ./start-all.sh on my master node it brings up the master but fails for the slaves, saying "permission denied public key" for the slaves, but I did add the master's id_rsa.pub to my slaves' authorized_keys and I checked manually from my

How to install spark with s3 on AWS?

2016-08-26 Thread kant kodali
Hi guys, Are there any instructions on how to setup spark with S3 on AWS? Thanks!

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-26 Thread kant kodali
: On 25 Aug 2016, at 22:49, kant kodali < kanth...@gmail.com > wrote: yeah so its seems like its work in progress. At very least Mesos took the initiative to provide alternatives to ZK. I am just really looking forward for this. https://issues.apache.org/jira/browse/MESOS-3797 I worry abo

Re: unable to start slaves from master (SSH problem)

2016-08-26 Thread kant kodali
Fixed. I just had to log out and log back in to the master node for some reason. On Fri, Aug 26, 2016 5:32 AM, kant kodali kanth...@gmail.com wrote: Hi, I am unable to start spark slaves from my master node. when I run ./start-all.sh in my master node it brings up the master and but fails for slaves

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
p/2? #curious [1] http://bahir.apache.org/ Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Fri, Aug 26, 2016 at 9:42 PM, kant kodali < kanth...@gmail.com &

is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
is there a HTTP2 (v2) endpoint for Spark Streaming?

Re: is there a HTTP2 (v2) endpoint for Spark Streaming?

2016-08-26 Thread kant kodali
ays communicate using HTTP. HTTP2 for better performance. On Fri, Aug 26, 2016 2:47 PM, kant kodali kanth...@gmail.com wrote: HTTP2 for fully pipelined out of order execution. other words I should be able to send multiple requests through same TCP connection and by out of order execution I m

Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
I am new to Spark and I keep hearing that RDD's can be persisted to memory or disk after each checkpoint. I wonder why RDD's are persisted in memory? In case of node failure, how would you access memory to reconstruct the RDD? Persisting to disk makes sense because it's like persisting to a Network
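
For reference, both behaviours sit behind the same API: persist() takes a StorageLevel, so an RDD can be kept in memory, spilled to local disk, or both, while checkpoint() is the separate mechanism that writes to reliable storage and truncates the lineage so recovery no longer recomputes from the source. A small sketch, with rdd and sc being the usual handles and the path illustrative:

    import org.apache.spark.storage.StorageLevel

    // Keep partitions in memory, spilling to local disk when they don't fit.
    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // Checkpointing writes to a reliable filesystem and cuts the lineage.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    cached.checkpoint()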

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
this data will be serialized before persisting to disk.. Thanks, Sreekanth Jella From: kant kodali Sent: Tuesday, August 23, 2016 3:59 PM To: Nirav Cc: RK Aduri ; srikanth.je...@gmail.com ; user@spark.apache.org Subject: Re: Are RDD's ever persisted to disk? Storing RDD to disk is nothing

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
. There are different RDD save apis for that. Sent from my iPhone On Aug 23, 2016, at 12:26 PM, kant kodali < kanth...@gmail.com > wrote: ok now that I understand RDD can be stored to the disk. My last question on this topic would be this. Storing RDD to disk is nothing but storing JVM byte code to disk (i

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
erialized representation in memory because it may be more compact. This is not the same as saving/writing an RDD to persistent storage as text or JSON or whatever. On Tue, Aug 23, 2016 at 9:28 PM, kant kodali <kanth...@gmail.com> wrote: > @srkanth are you sure? the whole point of

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
to reconstruct an RDD from its lineage in that case. so this sounds very contradictory to me after reading the spark paper. On Tue, Aug 23, 2016 1:28 PM, kant kodali kanth...@gmail.com wrote: @srkanth are you sure? the whole point of RDD's is to store transformations but not the data as the spark paper

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
, Aug 23, 2016 2:39 PM, RK Aduri rkad...@collectivei.com wrote: I just had a glance. AFAIK, that is nothing do with RDDs. It’s a pickler used to serialize and deserialize the python code. On Aug 23, 2016, at 2:23 PM, kant kodali < kanth...@gmail.com > wrote: @Sean well this makes sense but I

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
apache/spark spark - Mirror of Apache Spark github.com On Tue, Aug 23, 2016 4:17 PM, kant kodali kanth...@gmail.com wrote: @RK you may want to look more deeply if you are curious. the code starts from here apache/spark spark - Mirror of Apache Spark github.com and it goes here where

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
...@collectivei.com wrote: Can you come up with your complete analysis? A snapshot of what you think the code is doing. May be that would help us understand what exactly you were trying to convey. On Aug 23, 2016, at 4:21 PM, kant kodali < kanth...@gmail.com > wrote: apache/spark spark - Mirror of Apache

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
So when do we ever need to persist an RDD on disk, given that we don't need to worry about RAM (memory), as virtual memory will just push pages to the disk when memory becomes scarce? On Tue, Aug 23, 2016 11:23 AM, srikanth.je...@gmail.com wrote: Hi Kant Kodali, Based on the input parameter

Re: Are RDD's ever persisted to disk?

2016-08-23 Thread kant kodali
en to choose the persistency level. http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose Thanks, Sreekanth Jella From: kant kodali Sent: Tuesday, August 23, 2016 2:42 PM To: srikanth.je...@gmail.com Cc: user@spark.apache.org Subject: Re: Are RDD's ever persisted

can I use cassandra for checkpointing during a spark streaming job

2016-08-29 Thread kant kodali
I understand that I cannot use the Spark Streaming window operation without checkpointing to HDFS, but without the window operation I don't think we can do much with Spark Streaming. So since it is very essential, can I use Cassandra as the distributed storage? If so, can I see an example of how I can tell
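
As far as I know, the streaming checkpoint directory is resolved through the Hadoop FileSystem API, so the target needs a Hadoop-compatible connector (HDFS, S3, a shared filesystem, and so on); Cassandra is not exposed that way by the connector. A one-line sketch with a hypothetical HDFS path:

    // Needed before stateful operations such as updateStateByKey or windowed
    // reduces with inverse functions.
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")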

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-25 Thread kant kodali
exist that don't read/write data. The premise here is not just replication, but partitioning data across compute resources. With a distributed file system, your big input exists across a bunch of machines and you can send the work to the pieces of data. On Thu, Aug 25, 2016 at 7:57 PM, kant kodali <

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-25 Thread kant kodali
also uses ZK for leader election. There seems to be some effort in supporting etcd, but it's in progress: https://issues.apache.org/jira/browse/MESOS-1806 On Thu, Aug 25, 2016 at 1:55 PM, kant kodali < kanth...@gmail.com > wrote: @Ofir @Sean very good points. @Mike We dont use Kafka or Hive

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-25 Thread kant kodali
able for any monetary damages arising from such loss, damage or destruction. On 24 August 2016 at 21:54, kant kodali < kanth...@gmail.com > wrote: What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is almost a must in practice?

Re: quick question

2016-08-25 Thread kant kodali
format the data in the way your client (dashboard) requires it and write it to the websocket. Is your driver code in Python? The link Kevin has sent should start you off. Regards, Sivakumaran On 25-Aug-2016, at 11:53 AM, kant kodali < kanth...@gmail.com > wrote: yes for now it will be Spark Str

Re: quick question

2016-08-25 Thread kant kodali
twork/tutorials/obe/java/HomeWebsocket/WebsocketHome.html#section7 ) Regards, Sivakumaran S On 25-Aug-2016, at 8:09 PM, kant kodali < kanth...@gmail.com > wrote: Your assumption is right (thats what I intend to do). My driver code will be in Java. The link sent by Kevin is a API reference to w

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-25 Thread kant kodali
or NFS will not able to provide that. On 26 Aug 2016 07:49, "kant kodali" < kanth...@gmail.com > wrote: yeah so its seems like its work in progress. At very least Mesos took the initiative to provide alternatives to ZK. I am just really looking forward for this. https://issues.a

Re: What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-25 Thread kant kodali
ZFS linux port has got very stable these days given LLNL maintains the linux port and they also use it as a FileSystem for their super computer (The supercomputer is one of the top in the nation is what I heard) On Thu, Aug 25, 2016 4:58 PM, kant kodali kanth...@gmail.com wrote: How about

What do I lose if I run Spark without using HDFS or ZooKeeper?

2016-08-24 Thread kant kodali
What do I lose if I run Spark without using HDFS or ZooKeeper? Which of them is almost a must in practice?

Re: What is the difference between mini-batch and real-time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
On 27 September 2016 at 08:12, kant kodali <kanth...@gmail.com> wrote: What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini batch is something that batches in the given time frame whereas real time streaming is more l

What is the difference between mini-batch and real-time streaming in practice (not theory)?

2016-09-27 Thread kant kodali
What is the difference between mini-batch and real-time streaming in practice (not theory)? In theory, I understand mini-batch is something that batches in a given time frame, whereas real-time streaming is more like doing something as the data arrives, but my biggest question is why not have mini

java.lang.OutOfMemoryError: unable to create new native thread

2016-10-28 Thread kant kodali
"dag-scheduler-event-loop" java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker( ForkJoinPool.java:1672) at

Re: java.lang.OutOfMemoryError: unable to create new native thread

2016-10-29 Thread kant kodali
Another thing I forgot to mention is that it happens after running for several hours, say 4 to 5 hours. I am not sure why it is creating so many threads? Any way to control them? On Fri, Oct 28, 2016 at 12:47 PM, kant kodali <kanth...@gmail.com> wrote: > "dag-schedu

spark streaming client program needs to be restarted after a few hours of idle time. How can I fix it?

2016-10-18 Thread kant kodali
Hi guys, my Spark Streaming client program works fine as long as the receiver receives data, but say my receiver has no more data to receive for a few hours (like 4-5 hours) and then it starts receiving data again; at that point the Spark client program doesn't seem to process any data. It

Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
Apache Spark or the Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel. Here is my code using

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-23 Thread kant kodali
Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <mich...@databricks.com> wrote: > The first release candidate should be coming out this week. You can > subscribe to the dev list if you want to follow the release schedule. > > On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth...@g

Spark Shell doesn't seem to use Spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Hi all, Spark Shell doesn't seem to use the Spark workers but Spark Submit does. I have the worker IPs listed in the conf/slaves file. I am trying to count the number of rows in Cassandra using spark-shell, so I do the following on the Spark master: val df = spark.sql("SELECT test from hello") // This has

Re: Spark Shell doesn't seem to use Spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Somehow the table scan to do the count of a billion rows in Cassandra is not being done in parallel. On Wed, Nov 23, 2016 at 12:45 PM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > > Spark Shell doesnt seem to use spark workers but Spark Submit does. I had > the workers

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
According to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md I tried the following but it still looks like it is taking forever sc.cassandraTable(keyspace, table).cassandraCount On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.

Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
I have the following code. I invoke spark-shell as follows: ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864 Code: scala> val df = spark.sql("SELECT test from hello") //
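
One detail worth flagging: spark.cassandra.input.split.size_in_mb is measured in megabytes, so 67108864 asks for splits of roughly 64 TB each, which would explain the scan collapsing into very few tasks; a small value such as 64 lets the connector cut the table into many Spark partitions that the 12 cores can work on in parallel. A hedged sketch of the same session built in code:

    import org.apache.spark.sql.SparkSession

    // The split size is interpreted in MB; smaller splits mean more, smaller tasks.
    val spark = SparkSession.builder()
      .config("spark.cassandra.connection.host", "170.99.99.134")
      .config("spark.cassandra.input.split.size_in_mb", "64")
      .getOrCreate()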

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
with what you are doing but might help you find > the root of the cause) > > On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com> wrote: > >> I have the following code >> >> I invoke spark-shell as follows >> >> ./spark-shell --conf spark.cas

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
Take a look at this https://github.com/brianmhess/cassandra-count Now It is just matter of incorporating it into spark-cassandra-connector I guess. On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote: > According to this link https://github.com/datastax/ > spa

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
some accurate numbers here. so it took me 1hr:30 mins to count 698705723 rows (~700 Million) and my code is just this sc.cassandraTable("cuneiform", "blocks").cassandraCount On Thu, Nov 24, 2016 at 10:48 AM, kant kodali <kanth...@gmail.com> wrote: > Take a lo

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
ing on your use case you may want to go > for that on hive2+tez+ldap or spark. > > On 24 Nov 2016, at 20:52, kant kodali <kanth...@gmail.com> wrote: > > some accurate numbers here. so it took me 1hr:30 mins to count 698705723 > rows (~700 Million) > > and my code is just this >

What do I set rolling log to avoid filling up the disk?

2016-11-28 Thread kant kodali
Hi all, files like the one below are just filling up the disk quickly. I am using a standalone cluster, so what setting do I need to change to turn this into a rolling log or something, to avoid filling up the disk? spark/work/app-20161128185548/1/stderr Thanks, kant
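
Those files under spark/work/ are the per-application executor logs. The standalone docs expose rolling for them through the spark.executor.logs.rolling.* properties (and spark.worker.cleanup.* removes old application directories). A sketch with illustrative values, set on the application whose executors produce the logs:

    import org.apache.spark.sql.SparkSession

    // Roll executor stdout/stderr by size and keep a bounded number of old files.
    val spark = SparkSession.builder()
      .config("spark.executor.logs.rolling.strategy", "size")
      .config("spark.executor.logs.rolling.maxSize", "134217728")     // bytes per file
      .config("spark.executor.logs.rolling.maxRetainedFiles", "5")
      .getOrCreate()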

Re: Third party library

2016-11-26 Thread kant kodali
Yes, this is a Java JNI question, nothing to do with Spark really. java.lang.UnsatisfiedLinkError typically would mean the way you set up LD_LIBRARY_PATH is wrong, unless you tell us that it is working for other cases but not this one. On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin

Re: Apache Spark or Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
; HDFS and then read it using spark.read.json . > > Cheers, > Anastasios > > > > On Sat, Nov 26, 2016 at 9:34 AM, kant kodali <kanth...@gmail.com> wrote: > >> up vote >> down votefavorite >> <http://stackoverflow.com/questions/40797231/apache-spark

Re: Third party library

2016-11-26 Thread kant kodali
y checked these out. Some basic questions that come to > my mind are: > 1) is this library "foolib" or "foo-C-library" available on the worker > node? > 2) if yes, is it accessible by the user/program (rwx)? > > Thanks, > Vasu. > > On Nov 26,

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-21 Thread kant kodali
blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2902> > function that I think will do what you want. > > On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth...@gmail.com> wrote: > >> This seem to work >> >> import org.apache.spark.sql

Another Interesting Question on SPARK SQL

2016-11-17 Thread kant kodali
Which parts in the diagram above are executed by DataSource connectors and which parts are executed by Tungsten? Or to put it another way, in which phase in the diagram above does Tungsten leverage the DataSource connectors (such as, say, the Cassandra connector)? My understanding so far is that

How does predicate push down really help?

2016-11-16 Thread kant kodali
How does predicate push down really help in the following cases? val df1 = spark.sql("select * from users where age > 30") vs val df2 = spark.sql("select * from users"); df2.filter("age > 30")
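
A quick way to see whether the two forms end up in the same place is to compare their plans: when the source supports pushdown (Parquet, or the Cassandra connector for supported predicates), both should list the age predicate under PushedFilters in the physical plan. Sketch:

    val a = spark.sql("select * from users where age > 30")
    val b = spark.sql("select * from users").filter("age > 30")

    // Look for a PushedFilters entry mentioning age in both physical plans.
    a.explain(true)
    b.explain(true)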

Re: How does predicate push down really help?

2016-11-16 Thread kant kodali
t; > > Another (probably better) example would be something like having two table > A and B which are joined by some common key. Then a filtering is done on > the key. Moving the filter to be before the join would probably make > everything faster as filter is a faster operation tha

Re: How does predicate push down really help?

2016-11-17 Thread kant kodali
the join would be much more than any extra time taken by the > filter itself. > > > > BTW. You can see the differences between the original plan and the > optimized plan by calling explain(true) on the dataframe. This would show > you what was parsed, how the optimization worked and wha

Re: How does predicate push down really help?

2016-11-17 Thread kant kodali
Thanks for the effort and clear explanation. On Thu, Nov 17, 2016 at 12:07 AM, kant kodali <kanth...@gmail.com> wrote: > Yes thats how I understood it with your first email as well but the key > thing here sounds like some datasources may not have operators such as > filter and

How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread kant kodali
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#json-datasets "Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file." val df =
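
The function the later replies point at is most likely from_json, added in Spark 2.1, which parses a string column against a schema without a round trip through an RDD of String. A sketch with hypothetical column names and schema:

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

    // "payload" is the JSON-encoded blob column; the schema below is illustrative.
    val schema = new StructType()
      .add("id", StringType)
      .add("amount", DoubleType)

    val flattened = df
      .withColumn("parsed", from_json(col("payload"), schema))
      .select("parsed.*")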
