Re: is RocksDB backend available in 3.0 preview?

2020-04-22 Thread kant kodali
this project is > most widely known implementation for RocksDB backend state store - > https://github.com/chermenin/spark-states > > On Wed, Apr 22, 2020 at 11:32 AM kant kodali wrote: > >> Hi All, >> >> 1. is RocksDB backend available in 3.0 preview? >> 2. if

is RocksDB backend available in 3.0 preview?

2020-04-21 Thread kant kodali
Hi All, 1. is RocksDB backend available in 3.0 preview? 2. if RocksDB can store intermediate results of a stream-stream join, can I run streaming join queries forever? By forever I mean until I run out of disk. Or, put another way, can I run the stream-stream join queries for years if necessary
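For reference, a state store backend is swapped in through a single config. A hedged sketch using the third-party spark-states project mentioned in the reply above; the provider class name is from memory of that project and should be verified against its README:

```scala
// Spark itself (2.x / 3.0 preview) ships only the default HDFS-backed state
// store provider; a RocksDB-backed one can be plugged in from the third-party
// chermenin/spark-states package. The fully-qualified class name below is an
// assumption -- check the project's README for the version you use.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.sql.streaming.stateStore.providerClass",
    "ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider")
  .getOrCreate()
```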

Re: How does spark sql evaluate case statements?

2020-04-17 Thread kant kodali
Thanks! On Thu, Apr 16, 2020 at 9:57 PM ZHANG Wei wrote: > Are you looking for this: > https://spark.apache.org/docs/2.4.0/api/sql/#when ? > > The code generated will look like this in a `do { ... } while (false)` > loop: > > do { > ${cond.code} > if (!${cond.isNull} && ${cond.value})

How does spark sql evaluate case statements?

2020-04-06 Thread kant kodali
Hi All, I have the following query and I was wondering if spark sql evaluates the same condition twice in the case statement below? I did .explain(true) and all I get is a table scan, so I'm not sure whether spark sql evaluates the same condition twice. If it does, is there a way to return multiple values
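One way to return multiple values from a single CASE branch is to wrap them in a struct, so the condition is written only once per branch. A hedged sketch; the table and column names are illustrative:

```scala
// Each CASE branch returns a struct of several values instead of one scalar,
// so the predicate appears a single time per branch in the query text.
// `payments` and its columns are hypothetical.
spark.sql("""
  SELECT id,
         CASE WHEN amount > 100 THEN named_struct('tier', 'high', 'fee', 0.02)
              ELSE                   named_struct('tier', 'low',  'fee', 0.05)
         END AS result
  FROM payments
""").select("id", "result.tier", "result.fee")
```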

Connected components using GraphFrames is significantly slower than GraphX?

2020-02-16 Thread kant kodali
Hi All, Trying to understand why the connected components algorithm runs much slower than the GraphX equivalent? The GraphX code creates 16 stages. GraphFrame graphFrame = GraphFrame.fromEdges(edges); Dataset connectedComponents = graphFrame.connectedComponents().setAlgorithm("graphx").run(); and the

Re: How to programmatically pause and resume Spark/Kafka structured streaming?

2019-08-06 Thread kant kodali
Spark doesn't expose that. On Mon, Aug 5, 2019 at 10:54 PM Gourav Sengupta wrote: > Hi, > > exactly my question, I was also looking for ways to gracefully exit spark > structured streaming. > > > Regards, > Gourav > > On Tue, Aug 6, 2019 at 3:43 AM kant kodali wrote:

How to programmatically pause and resume Spark/Kafka structured streaming?

2019-08-05 Thread kant kodali
Hi All, I am trying to see if there is a way to pause a spark stream that processes data from Kafka such that my application can take some actions while the stream is paused and resume when the application completes those actions. Thanks!
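As the reply above notes, Spark doesn't expose a pause API; the usual workaround is stop-plus-restart from the same checkpoint. A minimal sketch, assuming a streaming DataFrame `streamDf` and a hypothetical `doMaintenance()` action:

```scala
// There is no pause() on StreamingQuery. Stopping the query and starting a
// new one against the same checkpointLocation resumes from the last
// committed Kafka offsets. Paths and doMaintenance() are illustrative.
def startQuery() = streamDf.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/ckpt") // committed offsets live here
  .start()

val query = startQuery()
query.stop()               // graceful "pause": completes the in-flight batch
doMaintenance()            // application-specific work while nothing runs
val resumed = startQuery() // "resume" from the checkpointed offsets
```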

java_method udf is not visible in the API documentation

2019-06-27 Thread kant kodali
Hi All, I see it here https://spark.apache.org/docs/2.3.1/api/sql/index.html#java_method But I don't see it here https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html

Re: Does Spark SQL have match_recognize?

2019-05-29 Thread kant kodali
Nope, not at all. On Sun, May 26, 2019 at 8:15 AM yeikel valdes wrote: > Isn't match_recognize just a filter? > > df.filter(predicate)? > > > On Sat, 25 May 2019 12:55:47 -0700 * kanth...@gmail.com > * wrote > > Hi All, > > Does Spark SQL have match_recognize? I am not sure why CEP

Does Spark SQL have match_recognize?

2019-05-25 Thread kant kodali
Hi All, Does Spark SQL have match_recognize? I am not sure why CEP seems to be neglected; I believe it is one of the most useful concepts in financial applications! Is there a plan to support it? Thanks!

How to control batch size while reading from hdfs files?

2019-03-22 Thread kant kodali
Hi All, What determines the batch size while reading from a file from HDFS? I am trying to read files from HDFS and ingest into Kafka using Spark Structured Streaming 2.3.1. I get an error saying the Kafka batch size is too big and that I need to increase max.request.size. Sure I can increase it but

what is the difference between udf execution and map(someLambda)?

2019-03-17 Thread kant kodali
Hi All, I am wondering what is the difference between UDF execution and map(someLambda)? you can assume someLambda ~= UDF. Any performance difference? Thanks!

Re: Is there a way to validate the syntax of raw spark sql query?

2019-03-05 Thread kant kodali
used. > If there is an issue with your SQL syntax then the method throws below > exception that you can catch > > org.apache.spark.sql.catalyst.parser.ParseException > > > Hope this helps! > > > > Akshay Bhardwaj > +91-97111-33849 > > > On Fri, Mar 1, 2019 a

Is there a way to validate the syntax of raw spark sql query?

2019-03-01 Thread kant kodali
Hi All, Is there a way to validate the syntax of a raw spark SQL query? For example, I would like to know if there is any isValid API call spark provides? val query = "select * from table"if(isValid(query)) { sparkSession.sql(query) } else { log.error("Invalid Syntax")} I tried the
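Following the approach in the reply above (catching ParseException), an `isValid` helper can be sketched like this; note it only checks syntax, since unresolved tables or columns surface later as AnalysisException:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.parser.ParseException

// Parse the SQL text into an unresolved logical plan without executing it.
// A syntax error throws ParseException; semantic errors (missing tables,
// bad column names) are only caught later, at analysis time.
def isValid(spark: SparkSession, query: String): Boolean =
  try {
    spark.sessionState.sqlParser.parsePlan(query)
    true
  } catch {
    case _: ParseException => false
  }
```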

Re: How to track batch jobs in spark ?

2018-12-06 Thread kant kodali
from output of list command >> yarn application --kill >> >> On Wed, Dec 5, 2018 at 1:42 PM kant kodali wrote: >> >>> Hi All, >>> >>> How to track batch jobs in spark? For example, is there some id or token >>> i can get after I s

How to track batch jobs in spark ?

2018-12-05 Thread kant kodali
Hi All, How to track batch jobs in spark? For example, is there some id or token I can get after I spawn a batch job and use it to track the progress or to kill the batch job itself? For Streaming, we have StreamingQuery.id() Thanks!
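One built-in mechanism for this on the batch side is job groups: tag all jobs spawned by an action with a group id, then monitor or cancel by that id. A hedged sketch; the group name and output path are illustrative:

```scala
// All Spark jobs triggered on this thread after setJobGroup carry the group
// id, which can be used to cancel them from another thread.
val sc = spark.sparkContext
sc.setJobGroup("nightly-etl", "nightly batch load", interruptOnCancel = true)
df.write.parquet("/data/out") // jobs spawned here belong to "nightly-etl"

// From a monitoring thread: kill everything tagged with that group.
sc.cancelJobGroup("nightly-etl")
```

Progress can similarly be polled via `sc.statusTracker`.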

Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread kant kodali
When most people compare two programming languages, 99% of the time it seems to boil down to syntax sugar. Performance-wise I doubt Scala is ever faster than Java, given that Scala uses the heap more than Java. I had also written some pointless micro-benchmarking code like (Random String

Does Spark have a plan to move away from sun.misc.Unsafe?

2018-10-24 Thread kant kodali
Hi All, Does Spark have a plan to move away from sun.misc.Unsafe to VarHandles ? I am trying to find a JIRA issue for this? Thanks!

What exactly is Function shipping?

2018-10-16 Thread kant kodali
Hi All, Everyone talks about how easy function shipping is in Scala. I immediately go "wait a minute": isn't it just object serialization and deserialization, which has existed in Java for a long time? Or am I missing something profound here? Thanks!

Does spark.streaming.concurrentJobs still exist?

2018-10-09 Thread kant kodali
Does spark.streaming.concurrentJobs still exist? spark.streaming.concurrentJobs (default: 1) is the number of concurrent jobs, i.e. threads in streaming-job-executor thread pool

Spark on YARN not utilizing all the YARN containers available

2018-10-09 Thread kant kodali
Hi All, I am using Spark 2.3.1 and using YARN as a cluster manager. I currently got 1) 6 YARN containers(executors=6) with 4 executor cores for each container. 2) 6 Kafka partitions from one topic. 3) You can assume every other configuration is set to whatever the default values are. Spawned a

How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2?

2018-10-03 Thread kant kodali
Hi All, How to do a broadcast join using raw Spark SQL 2.3.1 or 2.3.2? Thanks
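Spark 2.2+ accepts broadcast hints directly in SQL text (BROADCAST, BROADCASTJOIN, and MAPJOIN are synonyms), so no Dataset API call is needed. A sketch with illustrative table names:

```scala
// The hint tells the planner to broadcast the named relation to every
// executor, forcing a broadcast hash join regardless of the
// spark.sql.autoBroadcastJoinThreshold size estimate.
spark.sql("""
  SELECT /*+ BROADCAST(small) */ big.id, small.label
  FROM big JOIN small ON big.id = small.id
""")
```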

can Spark 2.4 work on JDK 11?

2018-09-25 Thread kant kodali
Hi All, can Spark 2.4 work on JDK 11? I feel like there are a lot of features added in JDK 9, 10, and 11 that can make the deployment process a whole lot better, and of course some more syntax sugar similar to Scala. Thanks!

can I model any arbitrary data structure as an RDD?

2018-09-25 Thread kant kodali
Hi All, I am wondering if I can model any arbitrary data structure as an RDD? For example, can I model Red-black trees, Suffix Trees, Radix Trees, Splay Trees, Fibonacci heaps, Tries, Linked Lists, etc. as RDDs? If so, how? To implement a custom RDD I have to implement compute and getPartitions

Is there any open source framework that converts Cypher to SparkSQL?

2018-09-14 Thread kant kodali
Hi All, Is there any open source framework that converts Cypher to SparkSQL? Thanks!

How do I generate current UTC timestamp in raw spark sql?

2018-08-28 Thread kant kodali
Hi All, How do I generate a current UTC timestamp using spark sql? When I do current_timestamp() it is giving me local time. to_utc_timestamp(current_time(), ) takes the timezone in the second parameter and I see no udf that can give me the current timezone. when I do
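Since Spark 2.x has no SQL function that returns the session's current timezone, one workaround is to pass the JVM's default zone in from the driver. A hedged sketch:

```scala
// Fetch the JVM default timezone id on the driver and splice it into the
// raw SQL; to_utc_timestamp then shifts the local wall-clock time to UTC.
val tz = java.util.TimeZone.getDefault.getID // e.g. "America/Los_Angeles"
spark.sql(s"SELECT to_utc_timestamp(current_timestamp(), '$tz') AS utc_now")
```

If the session timezone is set explicitly via `spark.sql.session.timeZone`, that value should be used instead of the JVM default.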

Re: How to Create one DB connection per executor and close it after the job is done?

2018-07-30 Thread kant kodali
ect in Scala, of which only one instance will be > created on each JVM / Executor. E.g. > > object MyDatabseSingleton { > var dbConn = ??? > } > > On Sat, 28 Jul 2018, 08:34 kant kodali, wrote: > >> Hi All, >> >> I understand creating a connection forEac

How to Create one DB connection per executor and close it after the job is done?

2018-07-28 Thread kant kodali
Hi All, I understand creating a connection forEachPartition but I am wondering can I create one DB connection per executor and close it after the job is done? any sample code would help. you can imagine I am running a simple batch processing application. Thanks!
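As suggested in the reply, a Scala `object` gives one lazily-created instance per JVM, which on a cluster means one per executor. A hedged sketch; the JDBC URL is illustrative, and note there is no per-job hook to close the connection, only a JVM shutdown hook:

```scala
// One lazy connection per executor JVM, shared by every task and partition
// scheduled on that executor. The URL is a placeholder.
object ExecutorConnection {
  lazy val conn: java.sql.Connection = {
    val c = java.sql.DriverManager.getConnection("jdbc:postgresql://db:5432/app")
    sys.addShutdownHook(c.close()) // closed when the executor JVM exits
    c
  }
}

df.foreachPartition { rows: Iterator[org.apache.spark.sql.Row] =>
  val c = ExecutorConnection.conn // same instance across partitions here
  rows.foreach { row => /* write row via c */ }
}
```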

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread kant kodali
ssages to kafka, might be easier on you to > just explode the rows and let Spark do the rest for you. > > > > *From: *kant kodali > *Date: *Tuesday, July 24, 2018 at 1:04 PM > *To: *Silvio Fiorito > *Cc: *Arun Mahadevan , chandan prakash < > chandanbaran...@gmail.com>

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread kant kodali
chBatch method that will give you a > DataFrame and let you write to the sink as you wish. > > > > *From: *kant kodali > *Date: *Monday, July 23, 2018 at 4:43 AM > *To: *Arun Mahadevan > *Cc: *chandan prakash , Tathagata Das < > tathagata.das1...@gmail.com>, "ymaha

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-23 Thread kant kodali
understand each row has a topic column but can we write one row to multiple topics? On Thu, Jul 12, 2018 at 11:00 AM, Arun Mahadevan wrote: > What I meant was the number of partitions cannot be varied with > ForeachWriter v/s if you were to write to each sink using independent > queries. Maybe

Re: Do GraphFrames support streaming?

2018-07-15 Thread kant kodali
ackend) then when the next stream arrive join them - create > graph and store the next stream together with the existing stream on disk > etc. > > On 14. Jul 2018, at 17:19, kant kodali wrote: > > The question now would be can it be done in streaming fashion? Are you > talking about

Re: Can I specify watermark using raw sql alone?

2018-07-15 Thread kant kodali
I don't see a withWatermark function to use in raw SQL. I am currently using Spark 2.3.1 On Sat, Jul 14, 2018 at 4:19 PM, kant kodali wrote: > Hi All, > > Can I specify watermark using raw sql alone? other words without using > .withWatermark from > Dataset API. > > Thanks! >

Can I specify watermark using raw sql alone?

2018-07-14 Thread kant kodali
Hi All, Can I specify a watermark using raw sql alone? In other words, without using .withWatermark from the Dataset API. Thanks!
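There is no SQL-level watermark syntax in Spark 2.3; a common workaround is to apply the watermark through the Dataset API and expose the result as a view that raw SQL then queries. A sketch, assuming a streaming DataFrame `streamDf` with an `event_time` column:

```scala
// Watermark is attached via the API once; everything downstream,
// including raw SQL against the view, inherits it.
streamDf
  .withWatermark("event_time", "10 minutes")
  .createOrReplaceTempView("events")

spark.sql("""
  SELECT window(event_time, '5 minutes') AS w, count(*) AS cnt
  FROM events
  GROUP BY window(event_time, '5 minutes')
""")
```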

Re: Do GraphFrames support streaming?

2018-07-14 Thread kant kodali
for > incremental graph updates. > > On 14. Jul 2018, at 15:59, kant kodali wrote: > > "You want to update incrementally an existing graph and run incrementally > a graph algorithm suitable for this - you have to implement yourself as > far as I am aware" > > I

Re: Do GraphFrames support streaming?

2018-07-14 Thread kant kodali
run an algorithms -> look at Janusgraph > You want to update incrementally an existing graph and run incrementally a > graph algorithm suitable for this - you have to implement yourself as far > as I am aware > > > On 29. Apr 2018, at 11:43, kant kodali wrote: > > > > Do GraphFrames support streaming? >

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-09 Thread kant kodali
the node automatically is going to unhealthy state and INFO logs don't tell me why. On Sun, Jul 8, 2018 at 7:36 PM, kant kodali wrote: > @yohann Thanks for shining some light! It is making more sense now. > > I think you are correct when you stated: "Your application master is

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread kant kodali
these numbers are not part of resource manager logs? On Sun, Jul 8, 2018 at 8:09 AM, kant kodali wrote: > yarn.scheduler.capacity.maximum-am-resource-percent by default is set to > 0.1 and I tried changing it to 1.0 and still no luck. same problem > persists. The master here is yarn and I ju

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread kant kodali
s allowed to provide. > You might take a look at https://hadoop.apache.org/ > docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html and > search for maximum-am-resource-percent. > > Regards, > > *Yohann Jardin* > Le 7/8/2018 à 4:40 PM, kant kodali a écrit : > > Hi

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread kant kodali
Hi, It's on a local MacBook Pro that has 16GB RAM, a 512GB disk, and 8 vCPUs! I am not running any code since I can't even spawn spark-shell with yarn as master as described in my previous email. I just want to run simple word count using yarn as master. Thanks! Below is the resource manager

spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread kant kodali
Hi All, I am trying to run a simple word count using YARN as a cluster manager. I am currently using Spark 2.3.1 and Apache hadoop 2.7.3. When I spawn spark-shell like below it gets stuck in ACCEPTED state forever. ./bin/spark-shell --master yarn --deploy-mode client I set my

Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread kant kodali
> > Best Regards, > Ryan > > On Mon, Jul 2, 2018 at 3:56 AM, kant kodali wrote: > >> Hi All, >> >> I get the below error quite often when I do an stream-stream inner join >> on two data frames. After running through several experiments stream-stream >&

Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-02 Thread kant kodali
Hi All, I get the below error quite often when I do a stream-stream inner join on two data frames. After running through several experiments, stream-stream joins don't look stable enough for production yet. any advice on this? Thanks! java.util.ConcurrentModificationException: KafkaConsumer is

Re: Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter?

2018-06-21 Thread kant kodali
aalDornick/spark/blob/master/ > sql/core/src/main/scala/org/apache/spark/sql/execution/ > streaming/JdbcSink.scala > > https://github.com/GaalDornick/spark/blob/master/ > sql/core/src/main/scala/org/apache/spark/sql/execution/ > streaming/JDBCSinkLog.scala > > > > > &

Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter?

2018-06-20 Thread kant kodali
Hi All, Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter? I see the following code in this link and I can see that database name can be passed in the connection string, however, I

is spark stream-stream joins in update mode targeted for 2.4?

2018-06-18 Thread kant kodali
Hi All, Is spark stream-stream joins in update mode targeted for 2.4? Thanks!

is there a way to parse and modify raw spark sql query?

2018-06-05 Thread kant kodali
Hi All, is there a way to parse and modify raw spark sql query? For example, given the following query spark.sql("select hello from view") I want to modify the query or logical plan such that I can get the result equivalent to the below query. spark.sql("select foo, hello from view") Any

is there a way to create a static dataframe inside mapGroups?

2018-06-04 Thread kant kodali
Hi All, Is there a way to create a static dataframe inside mapGroups? given that mapGroups gives Iterator of rows. I just want to take that iterator and populate a static dataframe so I can run raw sql queries on the static dataframe. Thanks!

is it possible to create one KafkaDirectStream (Dstream) per topic?

2018-05-20 Thread kant kodali
Hi All, I have 5 Kafka topics and I am wondering if it is even possible to create one KafkaDirectStream (DStream) per topic within the same JVM, i.e. using only one SparkContext? Thanks!

What to consider when implementing a custom streaming sink?

2018-05-15 Thread kant kodali
Hi All, I am trying to implement a custom sink and I have a few questions, mainly on output modes. 1) How does spark let the sink know that a new row is an update of an existing row? does it look at all the values of all columns of the new row and an existing row for an equality match or does it

I cannot use spark 2.3.0 and kafka 0.9?

2018-05-04 Thread kant kodali
Hi All, This link seems to suggest I can't use Spark 2.3.0 and a Kafka 0.9 broker. Is that correct? https://spark.apache.org/docs/latest/streaming-kafka-integration.html Thanks!

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-04 Thread kant kodali
:55 AM, Arun Mahadevan <ar...@apache.org> wrote: > I think you need to group by a window (tumbling) and define watermarks > (put a very low watermark or even 0) to discard the state. Here the window > duration becomes your logical batch. > > - Arun > > From: kant kodali

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread kant kodali
SQL so not using FlatMapsGroupWithState. And if that is not available then is it fair to say there is no declarative way to do stateless aggregations? On Thu, May 3, 2018 at 1:24 AM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I was under an assumption that one needs to ru

question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread kant kodali
Hi All, I was under an assumption that one needs to run groupBy(window(...)) to run any stateful operations, but looks like that is not the case, since any aggregation query like "select count(*) from some_view" is also stateful since it stores the result of the count from the previous batch.

what is the query language used for graphX?

2018-05-02 Thread kant kodali
Hi All, what is the query language used for graphX? are there any plans to introduce gremlin or is that idea being dropped and go with Spark SQL? Thanks!

is there a minOffsetsTrigger in spark structured streaming 2.3.0?

2018-04-29 Thread kant kodali
Hi All, just like maxOffsetsTrigger is there a minOffsetsTrigger in spark structured streaming 2.3.0? Thanks!

Re: A naive ML question

2018-04-29 Thread kant kodali
I think matshow in numpy/matplotlib > does this). > > On Sat, 28 Apr 2018 at 21:34, kant kodali <kanth...@gmail.com> wrote: > >> Hi, >> >> I mean a transaction goes typically goes through different states like >> STARTED, PENDING, CANCELLED, COMPLETED,

Do GraphFrames support streaming?

2018-04-29 Thread kant kodali
Do GraphFrames support streaming?

Re: A naive ML question

2018-04-28 Thread kant kodali
describes > basically an action at a certain point of time. Do you mean how a financial > product evolved over time given a set of a transactions? > > > On 28. Apr 2018, at 12:46, kant kodali <kanth...@gmail.com> wrote: > > > > Hi All, > > > > I have a bun

A naive ML question

2018-04-28 Thread kant kodali
Hi All, I have a bunch of financial transactional data and I was wondering if there is any ML model that can give me a graph structure for this data? In other words, show how a transaction has evolved over time? Any suggestions or references would help. Thanks!

Re: is it ok to make I/O calls in UDF? other words is it a standard practice ?

2018-04-23 Thread kant kodali
n be applied for large datasets but may be for small lookup files. Thanks > > On Mon, Apr 23, 2018 at 4:28 PM kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> Is it ok to make I/O calls in UDF? other words is it a standard practice? >> >> Thanks! >> >

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread kant kodali
gt; 2018년 4월 19일 (목) 오전 9:43, Arun Mahadevan <ar...@apache.org>님이 작성: > >> The below expr might work: >> >> df.groupBy($"id").agg(max(struct($"amount", >> $"my_timestamp")).as("data")).select($"id", $"data.*&quo

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread kant kodali
ate operations is not > there yet. > > Thanks, > Arun > > From: kant kodali <kanth...@gmail.com> > Date: Tuesday, April 17, 2018 at 11:41 AM > To: Tathagata Das <tathagata.das1...@gmail.com> > Cc: "user @spark" <user@spark.apache.org> > Subject

Re: can we use mapGroupsWithState in raw sql?

2018-04-17 Thread kant kodali
nction. That does not fit in with the SQL language structure. > > On Mon, Apr 16, 2018 at 7:34 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> can we use mapGroupsWithState in raw SQL? or is it in the roadmap? >> >> Thanks! >> >> >> >

can we use mapGroupsWithState in raw sql?

2018-04-16 Thread kant kodali
Hi All, can we use mapGroupsWithState in raw SQL? or is it in the roadmap? Thanks!

How to select the max row for every group in spark structured streaming 2.3.0 without using order by or mapGroupWithState?

2018-04-15 Thread kant kodali
How to select the max row for every group in spark structured streaming 2.3.0 without using order by or mapGroupWithState? *Input:* id | amount | my_timestamp --- 1 | 5 | 2018-04-01T01:00:00.000Z 1 | 10 | 2018-04-01T01:10:00.000Z 2
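The accepted trick in the replies below is `max(struct(...))`: a struct is ordered by its fields left to right, so putting the ordering column first selects the whole latest row per group without ORDER BY or mapGroupsWithState. A sketch over the columns shown in the input:

```scala
import org.apache.spark.sql.functions.{max, struct}
import spark.implicits._ // for the $"..." column syntax

// max over a struct compares field-by-field, so with my_timestamp first
// this yields, per id, the row with the greatest timestamp.
df.groupBy($"id")
  .agg(max(struct($"my_timestamp", $"amount")).as("latest"))
  .select($"id", $"latest.my_timestamp", $"latest.amount")
```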

Re: Does partition by and order by work only in the stateful case?

2018-04-14 Thread kant kodali
ed in streaming. > > On Thu, Apr 12, 2018 at 7:34 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> Does partition by and order by works only in stateful case? >> >> For example: >> >> select row_number() over (partition by id ord

when can we expect multiple aggregations to be supported in spark structured streaming?

2018-04-14 Thread kant kodali
Hi All, when can we expect multiple aggregations to be supported in spark structured streaming? For example, id | amount | my_timestamp -- 1 | 5 | 2018-04-01T01:00:00.000Z 1 | 10 | 2018-04-01T01:10:00.000Z 2 | 20

Does partition by and order by work only in the stateful case?

2018-04-12 Thread kant kodali
Hi All, Does partition by and order by work only in the stateful case? For example: select row_number() over (partition by id order by timestamp) from table gives me *SEVERE: Exception occured while submitting the query: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException:

Re: is there a way of register python UDF using java API?

2018-04-02 Thread kant kodali
UserDefinedPythonFunction(java.lang.String name, org.apache.spark.api.python.PythonFunction func, org.apache.spark.sql.types.DataType dataType, int pythonEvalType, boolean udfDeterministic) { /* compiled code */ } On Sun, Apr 1, 2018 at 3:46 PM, kant kodali <kanth...@gmail.com> wrote: > Hi All, >

is there a way of register python UDF using java API?

2018-04-01 Thread kant kodali
Hi All, All of our Spark code is in Java and I am wondering if there is a way to register Python UDFs using the Java API such that the registered UDFs can be used from raw Spark SQL. If there is any other way to achieve this goal please suggest! Thanks

Re: Does Spark run on Java 10?

2018-04-01 Thread kant kodali
-14220 > > On a side note, if some coming version of Scala 2.11 becomes full Java > 9/10 compliant it could work. > > Hope, this helps. > > Thanks, > Muthu > > On Sun, Apr 1, 2018 at 6:57 AM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> Does anybody got Spark running on Java 10? >> >> Thanks! >> >> >> >

Does Spark run on Java 10?

2018-04-01 Thread kant kodali
Hi All, Does anybody got Spark running on Java 10? Thanks!

Re: What do I need to set to see the number of records and processing time for each batch in SPARK UI?

2018-03-27 Thread kant kodali
For example in this blog <https://databricks.com/blog/2015/07/08/new-visualizations-for-understanding-apache-spark-streaming-applications.html> post. Looking at figure 1 and figure 2 I wonder What I need to do to see those graphs in spark 2.3.0? On Mon, Mar 26, 2018 at 7:10 AM, kant kodali

What do I need to set to see the number of records and processing time for each batch in SPARK UI?

2018-03-26 Thread kant kodali
Hi All, I am using spark 2.3.0 and I am wondering what do I need to set to see the number of records and processing time for each batch in the SPARK UI? The default UI doesn't seem to show this. Thanks!

Re: Is there a mutable dataframe in spark structured streaming 2.3.0?

2018-03-23 Thread kant kodali
y if this is the case? Thanks! On Thu, Mar 22, 2018 at 7:48 AM, kant kodali <kanth...@gmail.com> wrote: > Thanks all! > > On Thu, Mar 22, 2018 at 2:08 AM, Jorge Machado <jom...@me.com> wrote: > >> DataFrames are not mutable. >> >> Jorge Machado >>

Re: Is there a mutable dataframe in spark structured streaming 2.3.0?

2018-03-22 Thread kant kodali
e of days back, kindly go through the mail > chain with "*Multiple Kafka Spark Streaming Dataframe Join query*" as > subject, TD and Chris has cleared my doubts, it would help you too. > > Thanks, > Aakash. > > On Thu, Mar 22, 2018 at 7:50 AM, kant kodali <kant

Is there a mutable dataframe in spark structured streaming 2.3.0?

2018-03-21 Thread kant kodali
Hi All, Is there a mutable dataframe in spark structured streaming 2.3.0? I am currently reading from Kafka and if I cannot parse the messages that I get from Kafka I want to write them to say some "dead_queue" topic. I wonder what is the best way to do this? Thanks!
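DataFrames are immutable (as the replies note), but the dead-queue pattern doesn't need mutation: derive two frames from the same source, one with rows that parsed and one with rows that didn't. A hedged sketch; the topic, servers, and schema are illustrative:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// from_json yields null for malformed records, which splits the stream
// into parsed rows and dead-letter candidates without any mutation.
val schema = new StructType().add("id", LongType).add("payload", StringType)

val raw = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder
  .option("subscribe", "events")                  // placeholder
  .load()
  .withColumn("parsed", from_json(col("value").cast("string"), schema))

val good = raw.filter(col("parsed").isNotNull).select("parsed.*")
val dead = raw.filter(col("parsed").isNull)
  .select("key", "value") // write this stream to the "dead_queue" topic
```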

Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
g on improving this by building a generic mechanism into the > Streaming DataSource V2 so that the engine can do admission control on the > amount of data returned in a source independent way. > > On Tue, Mar 20, 2018 at 2:58 PM, kant kodali <kanth...@gmail.com> wrote: > >&

Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
tml#spark-streaming> > settings > may be what you’re looking for: > >- spark.streaming.backpressure.enabled >- spark.streaming.backpressure.initialRate >- spark.streaming.receiver.maxRate >- spark.streaming.kafka.maxRatePerPartition > > ​ > > On Mon, Mar 19, 2018 at 5:27

Re: select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
messages exist). > > Hope that makes sense - > > > On Mon, Mar 19, 2018 at 13:36 kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> I have 10 million records in my Kafka and I am just trying to >> spark.sql(select count(*) from kafka_view). I am re

select count * doesnt seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
Hi All, I have 10 million records in my Kafka and I am just trying to spark.sql(select count(*) from kafka_view). I am reading from kafka and writing to kafka. My writeStream is set to "update" mode and trigger interval of one second ( Trigger.ProcessingTime(1000)). I expect the counts to be

is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1?

2018-03-16 Thread kant kodali
Hi All, is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1? Thanks, kant

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Do I need to set SPARK_DIST_CLASSPATH or SPARK_CLASSPATH? The latest version of spark (2.3) only has SPARK_CLASSPATH. On Wed, Mar 14, 2018 at 11:37 AM, kant kodali <kanth...@gmail.com> wrote: > Hi, > > I am not using emr. And yes I restarted several times. > > On Wed, Ma

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
mazon.com/premiumsupport/knowledge- > center/restart-service-emr/ > > > > Femi > > > > *From: *kant kodali <kanth...@gmail.com> > *Date: *Wednesday, March 14, 2018 at 6:16 AM > *To: *Femi Anthony <femib...@gmail.com> > *Cc: *vermanurag <anurag.ve...@fnma

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
ed, Mar 14, 2018 at 5:32 AM, kant kodali <kanth...@gmail.com> wrote: > >> Tried this >> >> ./spark-shell --master yarn --deploy-mode client --executor-memory 4g >> >> >> Same issue. Keeps going forever.. >> >> >> 18/03/14 09:31:25 INFO

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
> > On Wed, Mar 14, 2018 at 3:25 AM, kant kodali <kanth...@gmail.com> wrote: > >> I am using spark 2.3.0 and hadoop 2.7.3. >> >> Also I have done the following and restarted all. But I still >> see ACCEPTED: waiting for AM container to be allocated, launche

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
, Mar 14, 2018 at 12:12 AM, kant kodali <kanth...@gmail.com> wrote: > any idea? > > On Wed, Mar 14, 2018 at 12:12 AM, kant kodali <kanth...@gmail.com> wrote: > >> I set core-site.xml, hdfs-site.xml, yarn-site.xml as per this website >> <https://dwbi.o

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
any idea? On Wed, Mar 14, 2018 at 12:12 AM, kant kodali <kanth...@gmail.com> wrote: > I set core-site.xml, hdfs-site.xml, yarn-site.xml as per this website > <https://dwbi.org/etl/bigdata/183-setup-hadoop-cluster> and these are the > only three files I changed Do I

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
I set core-site.xml, hdfs-site.xml, yarn-site.xml as per this website and these are the only three files I changed Do I need to set or change anything in mapred-site.xml (As of now I have not touched mapred-site.xml)? when I do yarn -node

Re: How to run spark shell using YARN

2018-03-12 Thread kant kodali
nning-on-yarn.html > > On Mon, Mar 12, 2018 at 4:42 PM, kant kodali <kanth...@gmail.com> wrote: > > Hi All, > > > > I am trying to use YARN for the very first time. I believe I configured > all > > the resource manager and name node fine. And then I run the below command

How to run spark shell using YARN

2018-03-12 Thread kant kodali
Hi All, I am trying to use YARN for the very first time. I believe I configured all the resource manager and name node fine. And then I run the below command ./spark-shell --master yarn --deploy-mode client *I get the below output and it hangs there forever *(I had been waiting over 10 minutes)

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-10 Thread kant kodali
mode, which allow quite a large range of use cases, including > multiple cascading joins. > > TD > > > > On Thu, Mar 8, 2018 at 9:18 AM, Gourav Sengupta <gourav.sengu...@gmail.com > > wrote: > >> super interesting. >> >> On Wed, Mar 7

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-07 Thread kant kodali
append mode? Anyways the moment it is in master I am ready to test so JIRA tickets on this would help to keep track. please let me know. Thanks! On Tue, Mar 6, 2018 at 9:16 PM, kant kodali <kanth...@gmail.com> wrote: > Sorry I meant Spark 2.4 in my previous email > > On Tue, Mar 6,

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
Sorry I meant Spark 2.4 in my previous email On Tue, Mar 6, 2018 at 9:15 PM, kant kodali <kanth...@gmail.com> wrote: > Hi TD, > > I agree I think we are better off either with a full fix or no fix. I am > ok with the complete fix being available in master or some branch. I gu

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread kant kodali
trictly worse than no > fix. > > TD > > > > On Thu, Feb 22, 2018 at 2:32 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi TD, >> >> I pulled your commit that is listed on this ticket >> https://issues.apache.org/jira/browse/SPARK-234

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
n if you had processed the stream in real-time a week ago. >> This is fundamentally necessary for achieving the deterministic processing >> that Structured Streaming guarantees. >> >> Regarding the picture, the "time" is actually "event-time". My apologies >>

How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
I read through the spark structured streaming documentation and I wonder how does spark structured streaming determine an event has arrived late? Does it compare the event-time with the processing time? [image] Taking the above

Is there a way to query dataframe views directly without going through scheduler?

2018-02-26 Thread kant kodali
Hi All, I wonder if there is a way to query data frame views directly without going through the scheduler? for example, say I have the following code DataSet kafkaDf = session.readStream().format("kafka").load(); kafkaDf.createOrReplaceTempView("table") Now can I query the view "table" without going

What happens if I can't fit data into memory while doing stream-stream join.

2018-02-23 Thread kant kodali
Hi All, I am experimenting with Spark 2.3.0 stream-stream join feature to see if I can leverage it to replace some of our existing services. Imagine I have 3 worker nodes with *each node* having (16GB RAM and 100GB SSD). My input dataset which is in Kafka is about 250GB per day. Now I want to do
