question on Write Ahead Log (Spark Streaming )

2017-03-08 Thread kant kodali
Hi All, I am using a receiver-based approach, and I understand that the Spark Streaming APIs will convert the data received from a receiver into blocks, and these blocks that are in memory are also stored in the WAL if one enables it. My upstream source, which is not Kafka, can also replay, by which I mean if
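
For reference, a minimal sketch of enabling the WAL for a receiver-based job (the config key is the documented one; the checkpoint path is a hypothetical example):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("wal-example")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true") // received blocks are also written to the WAL
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs:///checkpoints/wal-example") // the WAL lives under the checkpoint directory, so this must be fault-tolerant storage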

Re: How to unit test spark streaming?

2017-03-07 Thread kant kodali
Agreed with the statement in quotes below: whether one wants to do unit tests or not, it is good practice to write code that way. But I think the more painful and tedious task is to mock/emulate all the nodes, such as Spark workers/master/HDFS/input source stream and all that. I wish there were

How to unit test spark streaming?

2017-03-07 Thread kant kodali
Hi All, how do I unit test Spark Streaming, or Spark in general? How do I test the results of my transformations? Also, more importantly, don't we need to spawn master and worker JVMs on either one or multiple nodes? Thanks! kant
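
One common answer, sketched below under the assumption that ScalaTest is on the test classpath: you do not need to spawn separate master/worker JVMs, because a local[*] master runs the whole "cluster" inside the test JVM.

  import org.apache.spark.sql.SparkSession
  import org.scalatest.FunSuite

  class TransformationSpec extends FunSuite {
    test("uppercases every record") {
      val spark = SparkSession.builder()
        .master("local[2]")            // embedded master: driver and executors share this JVM
        .appName("unit-test")
        .getOrCreate()
      import spark.implicits._

      val result = Seq("a", "b").toDS().map(_.toUpperCase).collect()
      assert(result.sameElements(Array("A", "B")))
      spark.stop()
    }
  }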

Re: [ANNOUNCE] Apache Bahir 2.1.0 Released

2017-03-05 Thread kant kodali
How about an HTTP/2/REST connector for Spark? Is that something we can expect? Thanks! On Wed, Feb 22, 2017 at 4:07 AM, Christian Kadner wrote: > The Apache Bahir community is pleased to announce the release > of Apache Bahir 2.1.0 which provides the following extensions for >

Are we still dependent on Guava jar in Spark 2.1.0 as well?

2017-02-26 Thread kant kodali
Are we still dependent on the Guava jar in Spark 2.1.0 as well (given the Guava jar incompatibility issues)?

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
executor > per Spark slave, and DECREASING the executor-cores in standalone makes > the total # of executors go up. Just my 2¢. > > On Fri, Feb 17, 2017 at 5:20 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi Satish, >> >> I am using spark 2.0.2. And n

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
> > > On Fri, Feb 17, 2017 at 5:01 PM, Alex Kozlov <ale...@gmail.com> wrote: > > What Spark mode are you running the program in? > > > > On Fri, Feb 17, 2017 at 4:55 PM, kant kodali <kanth...@gmail.com> wrote: > > when I submit a job using spark sh

Re: question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
Standalone. On Fri, Feb 17, 2017 at 5:01 PM, Alex Kozlov <ale...@gmail.com> wrote: > What Spark mode are you running the program in? > > On Fri, Feb 17, 2017 at 4:55 PM, kant kodali <kanth...@gmail.com> wrote: > >> when I submit a job using spark shell I get some

question on SPARK_WORKER_CORES

2017-02-17 Thread kant kodali
When I submit a job using spark-shell I get something like this: [Stage 0:> (36814 + 4) / 220129]. Now all I want is to increase the number of parallel tasks running from 4 to 16, so I exported an env variable SPARK_WORKER_CORES=16 in conf/spark-env.sh. I thought that should do
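
For reference, a minimal sketch of requesting cores from the application side (property names as in the standalone-mode docs); SPARK_WORKER_CORES only changes what a worker offers after it is restarted, while the number of tasks running in parallel is bounded by the cores the application is actually granted:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("parallelism-example")
    .set("spark.cores.max", "16")      // total cores this application may claim across the cluster
    .set("spark.executor.cores", "4")  // cores per executor; standalone can then start several executors per worker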

How do I increase readTimeoutMillis parameter in Spark-shell?

2017-02-17 Thread kant kodali
How do I increase the readTimeoutMillis parameter in spark-shell? Because in the middle of cassandraCount the job aborts with the following exception: java.io.IOException: Exception during execution of SELECT count(*) FROM "test"."hello" WHERE token("cid") > ? AND token("cid") <= ? ALLOW FILTERING:
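
A minimal sketch of one way to raise it, assuming the spark-cassandra-connector property name spark.cassandra.read.timeout_ms (which maps to the driver's readTimeoutMillis); the same setting can be passed to spark-shell with --conf:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.cassandra.connection.host", "127.0.0.1")   // hypothetical host
    .set("spark.cassandra.read.timeout_ms", "300000")      // 5-minute per-read timeout instead of the default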

Re: I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread kant kodali
gmail.com> > wrote: > >> Hey, >> >> Can you try with the 2.11 spark-cassandra-connector? You just reported >> that you use spark-cassandra-connector*_2.10*-2.0.0-RC1.jar >> >> Best, >> Anastasios >> >> On Fri, Feb 17, 2017 at 6:40 PM, kant

I am not sure why I am getting java.lang.NoClassDefFoundError

2017-02-17 Thread kant kodali
Hi, val df = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "hello", "keyspace" -> "test" )).load() This line works fine; I can see it actually pulled the table schema from Cassandra. However, when I do df.count I get the error below. I am using the following

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
<m...@clearstorydata.com> wrote: > yes > > On Fri, Feb 3, 2017 at 10:08 PM, kant kodali <kanth...@gmail.com> wrote: > >> can I use Spark Standalone with HDFS but no YARN? >> >> Thanks! >> > >

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
On Fri, Feb 3, 2017 at 10:27 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > yes > > On Fri, Feb 3, 2017 at 10:08 PM, kant kodali <kanth...@gmail.com> wrote: > >> can I use Spark Standalone with HDFS but no YARN? >> >> Thanks! >> > >

can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
Can I use Spark Standalone with HDFS but no YARN? Thanks!

Re: How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-02-03 Thread kant kodali
Sorry, I should just do this: ./start-slave.sh spark://x.x.x.x:7077,y.y.y.y:7077,z.z.z.z:7077 but what about export SPARK_MASTER_HOST="x.x.x.x y.y.y.y z.z.z.z"? Don't I need to have that on my worker node? Thanks! On Fri, Feb 3, 2017 at 4:57 PM, kant kodali <kanth...@gmail.com&

Re: How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-02-03 Thread kant kodali
s address of master as a parameter. That > slave will contact master and register itself. > > On Jan 25, 2017 4:12 AM, "kant kodali" <kanth...@gmail.com> wrote: > >> Hi, >> >> How do I dynamically add nodes to spark standalone cluster and be ab

Re: question on spark streaming based on event time

2017-01-29 Thread kant kodali
2016/events/a-deep-dive-into- > structured-streaming/ > > On Sat, Jan 28, 2017 at 7:05 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> I read through the documentation on Spark Streaming based on event time >> and how spark handles lags w.r.t p

question on spark streaming based on event time

2017-01-28 Thread kant kodali
Hi All, I read through the documentation on Spark Streaming based on event time and how Spark handles lags w.r.t. processing time and so on, but what if the lag between event time and processing time is too long? In other words, what should I do if I am receiving yesterday's data (the timestamp

do I need to run spark standalone master with supervisord?

2017-01-25 Thread kant kodali
Do I need to run the Spark standalone master with a process supervisor such as supervisord or systemd? Does the Spark standalone master abort itself if ZooKeeper tells it that it is not the master anymore? Thanks!

Re: spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
Oh sorry, it's actually in the documentation. I should just set spark.worker.cleanup.enabled = true On Wed, Jan 25, 2017 at 11:30 AM, kant kodali <kanth...@gmail.com> wrote: > I have bunch of .index and .data files like that fills up my disk. I am > not sure what the fix is? I am r

spark intermediate data fills up the disk

2017-01-25 Thread kant kodali
I have a bunch of .index and .data files that fill up my disk, and I am not sure what the fix is. I am running Spark 2.0.2 in standalone mode. Thanks!

Re: How many spark masters and do I need to tolerate one failure in one DC and two AZ?

2017-01-24 Thread kant kodali
the two availability zones that are available in my DC. On Tue, Jan 24, 2017 at 5:37 PM, kant kodali <kanth...@gmail.com> wrote: > How many spark masters and zookeeper servers do I need to tolerate one > failure in one DC that has two availability zones ? Note: The one failure &

How many spark masters and do I need to tolerate one failure in one DC and two AZ?

2017-01-24 Thread kant kodali
How many Spark masters and ZooKeeper servers do I need to tolerate one failure in one DC that has two availability zones? Note: the one failure that I want to tolerate can be in either availability zone. Here is my understanding so far; please correct me if I am wrong. For ZooKeeper I would

How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-01-24 Thread kant kodali
Hi, how do I dynamically add nodes to a Spark standalone cluster and be able to discover them? Does ZooKeeper do service discovery? What is the standard tool for these things? Thanks, kant

what would be the recommended production requirements?

2017-01-23 Thread kant kodali
Hi, I am planning to go to production using Spark standalone mode with the following configuration, and I would like to know if I am missing something; any other suggestions are welcome. 1) Three Spark standalone masters deployed on different nodes and using Apache ZooKeeper for leader election.

Re: why does spark web UI keeps changing its port?

2017-01-23 Thread kant kodali
s on what you mean by "job". Which is why I prefer "app", which > is clearer (something you submit using "spark-submit", for example). > > But really, I'm not sure what you're asking now. > > On Mon, Jan 23, 2017 at 12:15 PM, kant kodali <kanth...@gmai

Re: why does spark web UI keeps changing its port?

2017-01-23 Thread kant kodali
s own UI which runs (starting on) port 4040. > > On Mon, Jan 23, 2017 at 12:05 PM, kant kodali <kanth...@gmail.com> wrote: > > I am using standalone mode so wouldn't be 8080 for my app web ui as well? > > There is nothing running on 4040 in my cluster. > > > > http://spa

Re: why does spark web UI keeps changing its port?

2017-01-23 Thread kant kodali
he Master, whose default port is 8080 (not 4040). The default > port for the app's UI is 4040. > > On Mon, Jan 23, 2017 at 11:47 AM, kant kodali <kanth...@gmail.com> wrote: > > I am not sure why Spark web UI keeps changing its port every time I > restart > > a cluster? how

why does spark web UI keeps changing its port?

2017-01-23 Thread kant kodali
I am not sure why the Spark web UI keeps changing its port every time I restart a cluster. How can I make it always run on one port? I did make sure there is no process running on 4040 (the Spark default web UI port), however it still starts at 8080. Any ideas? MasterWebUI: Bound MasterWebUI to 0.0.0.0,
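
As the replies note, the master UI and the per-application UI are different servers. A minimal sketch of pinning the application UI (spark.ui.port is the documented property); the standalone master's own UI is configured separately:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.ui.port", "4040")   // per-application (driver) UI port
  // the standalone master's UI (default 8080) is set separately, e.g. SPARK_MASTER_WEBUI_PORT in conf/spark-env.sh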

Re: java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
Nvm, figured it out. I compiled my client jar with 2.0.2 while the Spark deployed on my machines was 2.0.1. Communication problems between dev team and ops team :) On Fri, Jan 20, 2017 at 3:03 PM, kant kodali <kanth...@gmail.com> wrote: > Is this because of versioning issue? can't wai

Re: java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
Is this because of a versioning issue? Can't wait for the JDK 9 module system; I am not sure if Spark plans to leverage it. On Fri, Jan 20, 2017 at 1:30 PM, kant kodali <kanth...@gmail.com> wrote: > I get the following exception. I am using Spark 2.0.1 and Sca

java.io.InvalidClassException: org.apache.spark.executor.TaskMetrics

2017-01-20 Thread kant kodali
I get the following exception. I am using Spark 2.0.1 and Scala 2.11.8. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 13, 172.31.20.212): java.io.InvalidClassException:

is this something to worry about? HADOOP_HOME or hadoop.home.dir are not set

2017-01-20 Thread kant kodali
Hi, I am running Spark standalone with no storage. When I use spark-submit to submit my job I get the following exception, and I wonder if this is something to worry about? *java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set*

Anyone has any experience using spark in the banking industry?

2017-01-18 Thread kant kodali
Does anyone have experience using Spark in the banking industry? I have a couple of questions. 1. Most banks seem to care about the number of pending transactions at any given time, and I wonder if this is processing time or event time? I am just trying to understand how this is normally done in the

What can mesos or yarn do that spark standalone cannot do?

2017-01-15 Thread kant kodali
Hi, what can Mesos or YARN do that Spark standalone cannot? Thanks!

Re: Gradle dependency problem with spark

2016-12-21 Thread kant kodali
ens to have the functionality that both dependencies want, and hope > that exists. Spark should shade Guava at this point but doesn't mean that > you won't hit this problem from transitive dependencies. > > On Fri, Dec 16, 2016 at 11:05 AM kant kodali <kanth...@gmail.com> wrote: &

theory question

2016-12-17 Thread kant kodali
Given a set of transformations, does Spark create multiple DAGs and pick one by some metric, such as a higher degree of concurrency or something else, as the typical task-graph model in parallel computing suggests? Or does it simply build one DAG by going through

Re: Do we really need mesos or yarn? or is standalone sufficent?

2016-12-16 Thread kant kodali
ll you > need more. Otherwise, try YARN or MESOS depending on the rest of your > components. > > > > 2cents > > > > Saif > > > > *From:* kant kodali [mailto:kanth...@gmail.com] > *Sent:* Friday, December 16, 2016 3:14 AM > *To:* user @spark > *Subject:* Do

Re: Gradle dependency problem with spark

2016-12-16 Thread kant kodali
AM, kant kodali <kanth...@gmail.com> wrote: > Hi Guys, > > Here is the simplified version of my problem. I have the following problem > and I new to gradle > > > dependencies { > compile group: 'org.apache.spark', name: 'spark-core_2.11', version:

Gradle dependency problem with spark

2016-12-16 Thread kant kodali
Hi guys, here is a simplified version of my problem; I am new to Gradle. dependencies { compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.2' compile group: 'com.github.brainlag', name: 'nsq-client', version: '1.0.0.RC2' } I took

Do we really need mesos or yarn? or is standalone sufficent?

2016-12-16 Thread kant kodali
Do we really need Mesos or YARN, or is standalone sufficient for production systems? I understand the difference, but I don't know the capabilities of a standalone cluster. Does anyone have experience deploying standalone in production?

Re: Wrting data from Spark streaming to AWS Redshift?

2016-12-11 Thread kant kodali
@shyla a side question: what can Redshift do that Spark cannot? Trying to understand your use case. On Fri, Dec 9, 2016 at 8:47 PM, ayan guha wrote: > Ideally, saving data to external sources should not be any different. give > the write options as stated in the

few basic questions on structured streaming

2016-12-08 Thread kant kodali
Hi All, I read the documentation on Structured Streaming based on event time and I have the following questions. 1. What happens if an event arrives a few days late? It looks like we have an unbounded table with sorted time intervals as keys, but I assume Spark doesn't keep several days' worth of data in
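
For reference, a minimal sketch of how late data is handled once watermarking is available (Spark 2.1+): a watermark tells Spark how late events may be, so state for older windows can be dropped. The events DataFrame and its eventTime column are assumptions for illustration.

  import org.apache.spark.sql.functions.{col, window}

  val counts = events                                  // streaming DataFrame with a timestamp column "eventTime" (assumed)
    .withWatermark("eventTime", "1 day")               // events arriving later than this may be dropped instead of kept as state
    .groupBy(window(col("eventTime"), "10 minutes"))
    .count()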

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread kant kodali
= df.select(df("body").cast(StringType).as("body")) > val df2 = Seq("""{"a": 1}""").toDF("body") > val schema = spark.read.json(df2.as[String].rdd).schema > df2.select(from_json(col("body"), schema)).show() > &

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-12-05 Thread kant kodali
use to build the static schema code automatically > <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1128172975083446/2840265927289860/latest.html> > . > > Would that work for you? If not, why not? > > On Wed, Nov 23, 2016 at 2:48 A

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Ephemeral storage on SSD will be very painful to maintain, especially with large datasets; we will pretty soon have somewhere in the PB range. I am thinking of leveraging something like the below, but I am not sure how much performance gain we could get out of that. https://github.com/stec-inc/EnhanceIO On Sat, Dec

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
hmm GCE pretty much seems to follow the same model as AWS. On Sat, Dec 3, 2016 at 1:22 AM, kant kodali <kanth...@gmail.com> wrote: > GCE seems to have better options. Any one had any experience with GCE? > > On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra < > manish.ma

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
colocation for various use cases. > > AWS emphmerial is not good for reliable storage file system, EBS is the > expensive alternative :) > > On Sat, Dec 3, 2016 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote: > >> Thanks Sean! Just for the record I am currently seeing 95 M

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Forgot to mention: my entire cluster is in one DC, so if it is across multiple DCs then colocating does make sense in theory as well. On Sat, Dec 3, 2016 at 1:12 AM, kant kodali <kanth...@gmail.com> wrote: > Thanks Sean! Just for the record I am currently seeing 95 MB/s RX (Receive >

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
ugh that's not as > much an issue in this context. > > On Sat, Dec 3, 2016 at 8:42 AM kant kodali <kanth...@gmail.com> wrote: > >> wait, how is that a benefit? isn't that a bad thing if you are saying >> colocating leads to more latency and overall execution

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
tion time is longer > > On Dec 3, 2016 7:39 AM, "kant kodali" <kanth...@gmail.com> wrote: > >> >> I wonder what benefits do I really I get If I colocate my spark worker >> process and Cassandra server process on each node? >> >> I understand the

What benefits do we really get out of colocation?

2016-12-02 Thread kant kodali
I wonder what benefits I really get if I colocate my Spark worker process and Cassandra server process on each node. I understand the concept of moving compute towards the data instead of moving data towards the computation, but it sounds more like one is trying to optimize for network latency.

quick question

2016-12-01 Thread kant kodali
Assume I am running a Spark client program in client mode and a Spark cluster in standalone mode. I want some clarification on the following: 1. building the DAG, 2. the DAG scheduler, 3. the task scheduler. I want to know which of the above parts is done by the Spark client and which of the above parts are done by

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
worker instance) On Thu, Dec 1, 2016 at 12:55 AM, kant kodali <kanth...@gmail.com> wrote: > My batch interval is 1s > slide interval is 1s > window interval is 1 minute > > I am using a standalone alone cluster. I don't have any storage layer like > HDFS. so I dont k

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
? Is it a disk block? If so, what is its default size? And finally, why does the following error happen so often? java.lang.Exception: Could not compute split, block input-0-1480539568000 not found On Thu, Dec 1, 2016 at 12:42 AM, kant kodali <kanth...@gmail.com> wrote: > I also use t

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread kant kodali
I also use this super(StorageLevel.MEMORY_AND_DISK_2()); inside my receiver On Wed, Nov 30, 2016 at 10:44 PM, kant kodali <kanth...@gmail.com> wrote: > Here is another transformation that might cause the error but it has to be > one of these two since I only have two tra

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-11-30 Thread kant kodali
stringIntegerJavaPairRDD.collect().forEach((Tuple2<String, Long> KV) -> { String status = KV._1(); Long count = KV._2(); map.put(status, count);

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-11-30 Thread kant kodali
Bytes()); } }); Thanks, kant On Wed, Nov 30, 2016 at 2:11 PM, Marco Mistroni <mmistr...@gmail.com> wrote: > Could you paste reproducible snippet code? > Kr > > On 30 Nov 2016 9:08 pm, "kant kodali" <kanth...@gmail.com> wrote: > >> I have lot of these exceptions happening >> >> java.lang.Exception: Could not compute split, block input-0-1480539568000 >> not found >> >> >> Any ideas what this could be? >> >

java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-11-30 Thread kant kodali
I have a lot of these exceptions happening: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found Any ideas what this could be?

Can I have two different receivers for my Spark client program?

2016-11-30 Thread kant kodali
Hi All, I am wondering if it makes sense to have two receivers inside my Spark client program? The use case is as follows. 1) We have to support a feed from Kafka, so this will be direct receiver #1; we need to perform batch inserts from the Kafka feed to Cassandra. 2) A gRPC receiver where we

What do I set rolling log to avoid filling up the disk?

2016-11-28 Thread kant kodali
Hi All, files like the one below are quickly filling up the disk. I am using a standalone cluster, so what setting do I need to change to turn this into a rolling log, or something else to avoid filling up the disk? spark/work/app-20161128185548/1/stderr Thanks, kant
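
A minimal sketch of the executor-log rolling settings (property names as in the Spark configuration docs), set on the application's SparkConf or via --conf:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.logs.rolling.strategy", "size")        // roll by size (or "time")
    .set("spark.executor.logs.rolling.maxSize", "134217728")    // roll stderr/stdout at ~128 MB
    .set("spark.executor.logs.rolling.maxRetainedFiles", "5")   // keep only the newest 5 rolled files per executor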

Re: Third party library

2016-11-26 Thread kant kodali
y checked these out. Some basic questions that come to > my mind are: > 1) is this library "foolib" or "foo-C-library" available on the worker > node? > 2) if yes, is it accessible by the user/program (rwx)? > > Thanks, > Vasu. > > On Nov 26,

Re: Third party library

2016-11-26 Thread kant kodali
program in >> scala/java. >> >> Regards, >> Vineet >> >> On Sat, Nov 26, 2016 at 11:43 AM, kant kodali <kanth...@gmail.com> wrote: >> >>> Yes this is a Java JNI question. Nothing to do with Spark really. >>> >>> java.lan

Re: Apache Spark or Spark-Cassandra-Connector doesnt look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
; HDFS and then read it using spark.read.json . > > Cheers, > Anastasios > > > > On Sat, Nov 26, 2016 at 9:34 AM, kant kodali <kanth...@gmail.com> wrote: > >> up vote >> down votefavorite >> <http://stackoverflow.com/questions/40797231/apache-spark

Re: Third party library

2016-11-26 Thread kant kodali
Yes, this is a Java JNI question, nothing to do with Spark really. java.lang.UnsatisfiedLinkError would typically mean the way you set up LD_LIBRARY_PATH is wrong, unless you tell us that it is working for other cases but not this one. On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin

Apache Spark or Spark-Cassandra-Connector doesnt look like it is reading multiple partitions in parallel.

2016-11-26 Thread kant kodali
Apache Spark or the Spark-Cassandra-Connector doesn't look like it is reading multiple partitions in parallel. Here is my code using

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
ing on your use case you may want to go > for that on hive2+tez+ldap or spark. > > On 24 Nov 2016, at 20:52, kant kodali <kanth...@gmail.com> wrote: > > some accurate numbers here. so it took me 1hr:30 mins to count 698705723 > rows (~700 Million) > > and my code is just this >

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
Some accurate numbers here: it took me 1 hr 30 mins to count 698,705,723 rows (~700 million), and my code is just this: sc.cassandraTable("cuneiform", "blocks").cassandraCount On Thu, Nov 24, 2016 at 10:48 AM, kant kodali <kanth...@gmail.com> wrote: > Take a lo

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
Take a look at this: https://github.com/brianmhess/cassandra-count Now it is just a matter of incorporating it into the spark-cassandra-connector, I guess. On Thu, Nov 24, 2016 at 1:01 AM, kant kodali <kanth...@gmail.com> wrote: > According to this link https://github.com/datastax/ > spa

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
According to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md I tried the following, but it still looks like it is taking forever: sc.cassandraTable(keyspace, table).cassandraCount On Thu, Nov 24, 2016 at 12:56 AM, kant kodali <kanth...@gmail.

Re: Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
with what you are doing but might help you find > the root of the cause) > > On Thu, Nov 24, 2016 at 9:03 AM, kant kodali <kanth...@gmail.com> wrote: > >> I have the following code >> >> I invoke spark-shell as follows >> >> ./spark-shell --conf spark.cas

Apache Spark SQL is taking forever to count billion rows from Cassandra?

2016-11-24 Thread kant kodali
I have the following code. I invoke spark-shell as follows: ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864 Code: scala> val df = spark.sql("SELECT test from hello") //

Re: Spark Shell doesnt seem to use spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Somehow the table scan to count a billion rows in Cassandra is not being done in parallel. On Wed, Nov 23, 2016 at 12:45 PM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > > Spark Shell doesnt seem to use spark workers but Spark Submit does. I had > the workers

Spark Shell doesnt seem to use spark workers but Spark Submit does.

2016-11-23 Thread kant kodali
Hi All, spark-shell doesn't seem to use the Spark workers but spark-submit does. I had the worker IPs listed in the conf/slaves file. I am trying to count the number of rows in Cassandra using spark-shell, so I do the following on the Spark master: val df = spark.sql("SELECT test from hello") // This has

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-23 Thread kant kodali
Tue, Nov 22, 2016 at 2:42 PM, Michael Armbrust <mich...@databricks.com> wrote: > The first release candidate should be coming out this week. You can > subscribe to the dev list if you want to follow the release schedule. > > On Mon, Nov 21, 2016 at 9:34 PM, kant kodali <kanth...@g

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-21 Thread kant kodali
blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2902> > function that I think will do what you want. > > On Fri, Nov 18, 2016 at 2:29 AM, kant kodali <kanth...@gmail.com> wrote: > >> This seem to work >> >> import org.apache.spark.sql

How to expose Spark-Shell in the production?

2016-11-18 Thread kant kodali
How do we expose spark-shell in production? 1) Should we expose it on master nodes or executor nodes? 2) Should we simply give access to those machines and the spark-shell binary? What is the recommended way? Thanks!

Re: How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-18 Thread kant kodali
This seems to work: import org.apache.spark.sql._ val rdd = df2.rdd.map { case Row(j: String) => j } spark.read.json(rdd).show() However, I wonder if there is any inefficiency here, since I have to apply this function to a billion rows.
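
The replies point at from_json (Spark 2.1+) as the way to avoid the extra RDD round-trip; a minimal sketch, assuming df2 has a string column named body:

  import org.apache.spark.sql.functions.{col, from_json}
  import spark.implicits._

  val schema = spark.read.json(df2.select("body").as[String].rdd).schema   // one-off schema inference over the blobs
  val flattened = df2
    .select(from_json(col("body"), schema).as("parsed"))
    .select("parsed.*")
  flattened.show()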

How do I flatten JSON blobs into a Data Frame using Spark/Spark SQL

2016-11-17 Thread kant kodali
Hi All, I would like to flatten JSON blobs into a DataFrame using Spark/Spark SQL inside spark-shell. val df = spark.sql("select body from test limit 3"); // body is a JSON-encoded blob column val df2 = df.select(df("body").cast(StringType).as("body")) When I do df2.show // shows the 3

Re: Configure spark.kryoserializer.buffer.max at runtime does not take effect

2016-11-17 Thread kant kodali
Yeah, I feel like this is a bug, since you can't really modify the settings once you were given a Spark session or Spark context, so the workaround would be to use --conf. In your case it would be like this: ./spark-shell --conf spark.kryoserializer.buffer.max=1g On Thu, Nov 17, 2016 at 1:59 PM,

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread kant kodali
out. > > On Wed, Nov 16, 2016 at 4:39 PM, kant kodali <kanth...@gmail.com> wrote: > >> 1. I have a Cassandra Table where one of the columns is blob. And this >> blob contains a JSON encoded String however not all the blob's across the >> Cassandra table for that co

Another Interesting Question on SPARK SQL

2016-11-17 Thread kant kodali
Which parts in the diagram above are executed by DataSource connectors and which parts are executed by Tungsten? Or, to put it another way, in which phase of the diagram above does Tungsten leverage the DataSource connectors (such as, say, the Cassandra connector)? My understanding so far is that

Re: How does predicate push down really help?

2016-11-17 Thread kant kodali
Thanks for the effort and clear explanation. On Thu, Nov 17, 2016 at 12:07 AM, kant kodali <kanth...@gmail.com> wrote: > Yes thats how I understood it with your first email as well but the key > thing here sounds like some datasources may not have operators such as > filter and

Re: How does predicate push down really help?

2016-11-17 Thread kant kodali
the join would be much more than any extra time taken by the > filter itself. > > > > BTW. You can see the differences between the original plan and the > optimized plan by calling explain(true) on the dataframe. This would show > you what was parsed, how the optimization worked and wha

Re: How does predicate push down really help?

2016-11-16 Thread kant kodali
t; > > Another (probably better) example would be something like having two table > A and B which are joined by some common key. Then a filtering is done on > the key. Moving the filter to be before the join would probably make > everything faster as filter is a faster operation tha

How does predicate push down really help?

2016-11-16 Thread kant kodali
How does predicate push down really help in the following cases? val df1 = spark.sql("select * from users where age > 30") vs. val df1 = spark.sql("select * from users"); df1.filter("age > 30")
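
A quick way to see the difference, as suggested in the replies, is to compare the plans; with a pushdown-capable source the predicate shows up in the scan's PushedFilters instead of as a separate Filter over a full scan. A minimal sketch using the thread's own query:

  val q1 = spark.sql("select * from users where age > 30")
  val q2 = spark.sql("select * from users").filter("age > 30")
  q1.explain(true)   // prints parsed, analyzed, optimized and physical plans
  q2.explain(true)   // after optimization both usually push the same predicate down to the source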

How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread kant kodali
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#json-datasets "Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file." val df =

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
Wait, I cannot create a CassandraSQLContext from spark-shell. Is this only for enterprise versions? Thanks! On Fri, Nov 11, 2016 at 8:14 AM, kant kodali <kanth...@gmail.com> wrote: > https://academy.datastax.com/courses/ds320-analytics- > apache-spark/spark-sql-spark-sql-basics > &

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
https://academy.datastax.com/courses/ds320-analytics-apache-spark/spark-sql-spark-sql-basics On Fri, Nov 11, 2016 at 8:11 AM, kant kodali <kanth...@gmail.com> wrote: > Hi, > > This is spark-cassandra-connector > <https://github.com/datastax/spark-cassandra-connector>

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
ad the document on https://github.com/datastax/spark-cassandra-connector > > > Yong > > > > ------ > *From:* kant kodali <kanth...@gmail.com> > *Sent:* Friday, November 11, 2016 11:04 AM > *To:* user @spark > *Subject:* How to use Spark SQL to

How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
How do I use Spark SQL to connect to Cassandra from spark-shell? Any examples? I use Java 8. Thanks! kant

How do I specify StorageLevel in KafkaUtils.createDirectStream?

2016-11-03 Thread kant kodali
JavaInputDStream<ConsumerRecord<String, String>> directKafkaStream = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent(), ConsumerStrategies.Subscribe(topics, kafkaParams));

Re: Custom receiver for WebSocket in Spark not working

2016-11-02 Thread kant kodali
I don't see a store() call in your receive(). Search for store() in here http://spark.apache.org/docs/latest/streaming-custom-receivers.html On Wed, Nov 2, 2016 at 10:23 AM, Cassa L wrote: > Hi, > I am using spark 1.6. I wrote a custom receiver to read from WebSocket. > But
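
A minimal sketch of a custom receiver, per the custom-receivers guide linked above; the WebSocket read itself is a hypothetical placeholder, the key point being that records only enter Spark when store() is called:

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class WebSocketReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
    def onStart(): Unit = {
      new Thread("websocket-receiver") {
        override def run(): Unit = {
          while (!isStopped()) {
            val msg = readNextMessage()   // hypothetical blocking read from the socket
            store(msg)                    // hands the record to Spark; without this, nothing is ingested
          }
        }
      }.start()
    }
    def onStop(): Unit = { /* close the socket here */ }
    private def readNextMessage(): String = ???   // placeholder for the WebSocket read
  }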

random idea

2016-11-02 Thread kant kodali
Hi guys, I have a random idea and it would be great to receive some input. Can we have an HTTP/2-based receiver for Spark Streaming? I am wondering, why not build microservices using Spark when needed? I can see it is not meant for that, but I like to think it could be possible. To be more concrete,

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
ue, Nov 1, 2016 at 11:25 AM, kant kodali <kanth...@gmail.com> wrote: > >> AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0 >> >> On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>>

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
AH!!! Got it! Should I use 2.0.1 then ? I don't see 2.1.0 On Tue, Nov 1, 2016 at 10:14 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com > wrote: > Dstream "Window" uses "union" to combine multiple RDDs in one window into > a single RDD. > > On Tue, Nov

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
@Sean It looks like this problem can happen with other RDD's as well. Not just unionRDD On Tue, Nov 1, 2016 at 2:52 AM, kant kodali <kanth...@gmail.com> wrote: > Hi Sean, > > The comments seem very relevant although I am not sure if this pull > request https://github.com/apach

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
queues in the ForkJoinPool. Thanks! On Tue, Nov 1, 2016 at 2:19 AM, Sean Owen <so...@cloudera.com> wrote: > Possibly https://issues.apache.org/jira/browse/SPARK-17396 ? > > On Tue, Nov 1, 2016 at 2:11 AM kant kodali <kanth...@gmail.com> wrote: > >> Hi Rya

Re: why spark driver program is creating so many threads? How can I limit this number?

2016-11-01 Thread kant kodali
This question looks very similar to mine but I don't see any answer. http://markmail.org/message/kkxhi5jjtwyadzxt On Mon, Oct 31, 2016 at 11:24 PM, kant kodali <kanth...@gmail.com> wrote: > Here is a UI of my thread dump. > > http://fastthread.io/my-thread-report.jsp?p=c
