Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
T,TIME, row_number() over (partition by train order by > time desc) r > from TrainTable > ) x where r=1 > > All the constructs supported in dataframe functions. > > On Wed, Aug 30, 2017 at 1:08 PM, kant kodali <kanth...@gmail.com> wrote: > >> yes in a relational
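
A minimal Java sketch of the window-function approach quoted above, for a static table (table, column, and path names are illustrative). Note that in Spark 2.1.1 this construct is accepted on a batch view but rejected on a streaming Dataset as a non-time-based window, which is what the rest of this thread turns on.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("latest-per-train").getOrCreate();

// Batch view of the data; the source is illustrative.
Dataset<Row> trainTimes = spark.read().parquet("/data/train_times");
trainTimes.createOrReplaceTempView("TrainTable");

// Rank rows within each train by time, newest first, and keep only rank 1.
Dataset<Row> latestPerTrain = spark.sql(
        "SELECT train, dest, time FROM ("
      + "  SELECT train, dest, time,"
      + "         row_number() OVER (PARTITION BY train ORDER BY time DESC) AS r"
      + "  FROM TrainTable) x "
      + "WHERE r = 1");
latestPerTrain.show();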

Re: Spark 2.2 structured streaming with mapGroupsWithState + window functions

2017-08-30 Thread kant kodali
+1 Is this ticket related https://issues.apache.org/jira/browse/SPARK-21641 ? On Mon, Aug 28, 2017 at 7:06 AM, daniel williams wrote: > Hi all, > > I've been looking heavily into Spark 2.2 to solve a problem I have by > specifically using mapGroupsWithState. What

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
override these methods right so comparisons and retrieval from external store can work? Thanks! On Wed, Aug 30, 2017 at 1:39 AM, kant kodali <kanth...@gmail.com> wrote: > Hi TD, > > Thanks for the explanation and for the clear pseudo code and an example! > > mapGroupsWithState i

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
en you should set watermark on the timestamp column. >>> >>> *trainTimesDataset* >>> * .withWatermark("**activity_timestamp", "5 days")* >>> * .groupBy(window(activity_timestamp, "24 hours", "24 hours"), "train", >&

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
By("train", "dest").max("time")* > > > *SQL*: *"select train, dest, max(time) from trainTimesView group by > train, dest"* // after calling > *trainTimesData.createOrReplaceTempView(trainTimesView)* > > > On Tue, Aug 29, 201

Do we always need to go through spark-submit?

2017-08-30 Thread kant kodali
Hi All, I understand spark-submit sets up its own class loader and other things, but I am wondering if it is possible to just compile the code and run it using "java -jar mysparkapp.jar"? Thanks, kant
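
A minimal sketch of one way this can work: build a fat jar with Spark and its dependencies on the classpath and set the master programmatically, so the driver can be launched with plain java -jar. spark-submit remains the supported path for cluster deploys; all names below are illustrative.

import org.apache.spark.sql.SparkSession;

public class MySparkApp {
    public static void main(String[] args) {
        // Master is set in code instead of via spark-submit flags.
        SparkSession spark = SparkSession.builder()
                .appName("my-spark-app")
                .master("local[*]")            // or spark://host:7077 for a standalone cluster
                .getOrCreate();

        spark.range(100).groupBy().count().show();   // trivial job to prove it runs
        spark.stop();
    }
}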

Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread kant kodali
Hi All, I was wondering if we need to checkpoint both read and write streams when reading from Kafka and inserting into a target store? For example: sparkSession.readStream().option("checkpointLocation", "hdfsPath").load() vs dataSet.writeStream().option("checkpointLocation", "hdfsPath")
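
A minimal Java sketch of the usual pattern (broker, topic, paths are illustrative, with a Parquet sink standing in for the target store): the checkpoint location is given once, on the streaming writer, one directory per query.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CheckpointSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate();

        Dataset<Row> source = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "input_topic")
                .load();                                  // no checkpoint option needed on the reader

        StreamingQuery query = source.writeStream()
                .format("parquet")                        // stand-in for the target store
                .option("path", "hdfs:///data/output")
                .option("checkpointLocation", "hdfs:///checkpoints/my_query")  // one per query
                .start();

        query.awaitTermination();
    }
}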

How to convert Row to JSON in Java?

2017-09-09 Thread kant kodali
Hi All, How to convert Row to JSON in Java? It would be nice to have a .toJson() method in the Row class. Thanks, kant

Re: How to convert Row to JSON in Java?

2017-09-09 Thread kant kodali
toJSON on Row object. On Sat, Sep 9, 2017 at 4:18 PM, Felix Cheung <felixcheun...@hotmail.com> wrote: > toJSON on Dataset/DataFrame? > > -- > *From:* kant kodali <kanth...@gmail.com> > *Sent:* Saturday, September 9, 2017 4:15:49 PM > *To:

Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
Hi All, I have the following code and I am not sure what's wrong with it? I cannot call dataset.toJSON() (which returns a Dataset)? I am using spark 2.2.0 so I am wondering if there is any workaround? Dataset ds = newDS.toJSON().map(()->{some function that returns a string}); StreamingQuery

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
xcheun...@hotmail.com> wrote: > What is newDS? > If it is a Streaming Dataset/DataFrame (since you have writeStream there) > then there seems to be an issue preventing toJSON to work. > > ------ > *From:* kant kodali <kanth...@gmail.com> > *Sent:* Sa

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
mentation will convert the Dataset to an RDD, which is not supported > by streaming queries. > > On Sat, Sep 9, 2017 at 4:40 PM, kant kodali <kanth...@gmail.com> wrote: > >> yes it is a streaming dataset. so what is the problem with following code? >> >> Dataset

Re: Spark 2.2 structured streaming with mapGroupsWithState + window functions

2017-09-05 Thread kant kodali
Hi Daniel, I am thinking you could use groupByKey & mapGroupsWithState to send whatever updates ("updated state") you want and then use .groupBy(window). will that work as expected? Thanks, Kant On Mon, Aug 28, 2017 at 7:06 AM, daniel williams wrote: > Hi all, > >
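
A hedged Java sketch of the groupByKey + mapGroupsWithState idea (a running count per key). The broker, topic, field handling, and checkpoint path are made up, and it assumes a main() declared to throw Exception. In 2.2 the query must run in update output mode, and chaining a further groupBy(window(...)) aggregation after this step is exactly the limitation this thread is about.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("mgws-sketch").getOrCreate();

Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load()
        .selectExpr("CAST(key AS STRING) AS k", "CAST(value AS STRING) AS v");

// Keep a running count per key across micro-batches in GroupState.
Dataset<String> updates = events
        .groupByKey((MapFunction<Row, String>) row -> row.getString(0), Encoders.STRING())
        .mapGroupsWithState(
                (MapGroupsWithStateFunction<String, Row, Long, String>) (key, rows, state) -> {
                    long count = state.exists() ? state.get() : 0L;
                    while (rows.hasNext()) { rows.next(); count++; }
                    state.update(count);               // carried into the next micro-batch
                    return key + "," + count;
                },
                Encoders.LONG(),                       // state encoder
                Encoders.STRING());                    // output encoder

updates.writeStream()
        .outputMode("update")                          // required for mapGroupsWithState
        .format("console")
        .option("checkpointLocation", "/tmp/chk-mgws")
        .start()
        .awaitTermination();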

Non-time-based windows are not supported on streaming DataFrames/Datasets;;

2017-09-06 Thread kant kodali
Hi All, I get the following exception when I run the query below. Not sure what the cause is? "Non-time-based windows are not supported on streaming DataFrames/Datasets;" My TIME_STAMP field is of string type. dataset .sqlContext() .sql("select count(distinct ID), sum(AMOUNT) from

is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-06 Thread kant kodali
Hi All, I am wondering if it is ok to have multiple SparkSessions in one spark structured streaming app? Basically, I want to create 1) a Spark session for reading from Kafka and 2) another Spark session for storing the mutations of a dataframe/dataset to a persistent table as I get the mutations

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-12 Thread kant kodali
@Dan shouldn't you be using Datasets/DataFrames? I heard it is recommended to use Datasets and DataFrames rather than DStreams, since DStreams are in maintenance mode. On Mon, Sep 11, 2017 at 7:41 AM, Cody Koeninger wrote: > If you want an "easy" but not particularly performant

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-12 Thread kant kodali
ect(to_json(struct(col("*" > > On Sat, Sep 9, 2017 at 6:27 PM, kant kodali <kanth...@gmail.com> wrote: > >> Thanks Ryan! In this case, I will have Dataset so is there a way to >> convert Row to Json string? >> >> Thanks >> >> On Sat
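
The approach quoted above, spelled out in Java: given a Dataset<Row> ds (column names do not matter), pack every column into a struct and render it as a JSON string. Unlike Dataset.toJSON(), this stays inside the SQL engine, so it also works on streaming Datasets; null-valued fields are simply omitted from the output rather than breaking the conversion.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// One string column named "value", each row holding the JSON form of the original row.
Dataset<Row> jsonRows = ds.select(to_json(struct(col("*"))).as("value"));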

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-12 Thread kant kodali
I have about 100 fields in my dataset and some of them have "null" in it. Does to_json fail to convert if that is the case? Thanks! On Tue, Sep 12, 2017 at 12:32 PM, kant kodali <kanth...@gmail.com> wrote: > Hi Michael, > > Interestingly that doesn't seem to quite wo

Re: How to convert Row to JSON in Java?

2017-09-11 Thread kant kodali
rg/apache/spark/sql/Row.html#getValuesMap-scala.collection.Seq->. >>> I found this post <https://stackoverflow.com/a/41602178/8356352> >>> useful, it is in Scala but should be a good starting point. >>> An alternative approach is combine the 'struct' and 'to_json' f

Re: Does Spark SQL uses Calcite?

2017-08-20 Thread kant kodali
ct https://docs.databricks.com/ > spark/latest/data-sources/sql-databases.html > > Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :) > > On Aug 19, 2017, at 5:27 PM, kant kodali <kanth...@gmail.com> wrote: > > Hi Russell, > > I went through th

Is watermark always set using processing time or event time or both?

2017-09-01 Thread kant kodali
Is watermark always set using processing time or event time or both?

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
>> *trainTimesDataset* >> * .withWatermark("**activity_timestamp", "5 days")* >> * .groupBy(window(activity_timestamp, "24 hours", "24 hours"), "train", >> "dest")* >> * .max("time")* >

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
> Why removing the destination from the window wont work? Like this: >> >> *trainTimesDataset* >> * .withWatermark("**activity_timestamp", "5 days")* >> * .groupBy(window(activity_timestamp, "24 hours", "24 hours"), "train"

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
ultimate flexible tool. However, you have to do > bookkeeping of "last 24 hours" and calculate the max yourself. :) > > Hope this helps. > > On Wed, Aug 30, 2017 at 10:58 AM, kant kodali <kanth...@gmail.com> wrote: > >> I think I understand *groupByKey/**mapGr

How to convert Array of Json rows into Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread kant kodali
I have a Dataset ds which consists of json rows. *Sample Json Row (This is just an example of one row in the dataset)* [ {"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]} {"name": "bar", "address": {"state": "OH", "country":
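
A hedged Java sketch of one way to flatten this, assuming a Dataset<String> ds whose single column "value" holds one JSON array per row, and assuming your 2.2 build's from_json accepts an array-of-struct schema (field names are taken from the sample row; if from_json rejects the array schema, an RDD-based parse or get_json_object is the fallback).

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema of one element of the JSON array, built from the sample row above.
StructType docSchema = new StructType()
        .add("subject", DataTypes.StringType)
        .add("year", DataTypes.IntegerType);
StructType elementSchema = new StructType()
        .add("name", DataTypes.StringType)
        .add("address", new StructType()
                .add("state", DataTypes.StringType)
                .add("country", DataTypes.StringType))
        .add("docs", DataTypes.createArrayType(docSchema));

// Parse the array, explode one row per element, then explode the nested docs array.
Dataset<Row> flattened = ds
        .select(explode(from_json(col("value"), DataTypes.createArrayType(elementSchema))).as("e"))
        .select(col("e.name").as("name"),
                col("e.address.state").as("state"),
                col("e.address.country").as("country"),
                explode(col("e.docs")).as("doc"))
        .select(col("name"), col("state"), col("country"),
                col("doc.subject").as("subject"), col("doc.year").as("year"));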

Does Spark 2.2.0 support Dataset<List<Map<String,Object>>> ?

2017-10-09 Thread kant kodali
Hi All, I am wondering if spark supports Dataset<List<Map<String,Object>>>? When I do the following it says no map function available: Dataset<List<Map<String,Object>>> resultDs = ds.map(lambda, Encoders.bean(List.class)); Thanks!

Given a Avro Schema object is there a way to get StructType in Java?

2017-12-15 Thread kant kodali
Hi All, Given an Avro Schema object, is there a way to get the StructType that represents the schema in Java? Thanks!

is Union or Join Supported for Spark Structured Streaming Queries in 2.2.0?

2017-12-13 Thread kant kodali
Hi All, I have messages in a queue that might be coming in with a few different schemas, e.g. msg1 schema1, msg2 schema2, msg3 schema3, msg4 schema1, and I want to put all of them in one data frame. Is it possible with structured streaming? I am using Spark 2.2.0 Thanks!

Can we pass the Calcite streaming sql queries to spark sql?

2017-11-09 Thread kant kodali
Can we pass the Calcite streaming sql queries to spark sql? https://calcite.apache.org/docs/stream.html#references

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
Reads from Kafka and outputs to Kafka. so I check the output from Kafka. On Tue, Dec 5, 2017 at 1:26 PM, Qiao, Richard <richard.q...@capitalone.com> wrote: > Where do you check the output result for both case? > > Sent from my iPhone > > > On Dec 5, 2017, at 1

Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
Hi All, I have a simple stateless transformation using Dstreams (stuck with the old API for one of the applications). The pseudo code is roughly like this: dstream.map().reduce().forEachRdd(rdd -> { rdd.collect().forEach(); // Is this necessary? It executes fine but is a bit slow }) I
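
A hedged Java sketch of what the replies converge on: no collect() is needed just to emit results — it drags every record to the driver — so write each partition from the executor it lives on. The broker address and topic are made up, and "out" stands in for the transformed JavaDStream<String> from the pipeline above; opening a producer per partition per batch is the simplest form, and a lazily shared producer per executor is the usual refinement.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

out.foreachRDD(rdd ->
        rdd.foreachPartition(records -> {
            Properties props = new Properties();
            props.put("bootstrap.servers", "host1:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (records.hasNext()) {
                    // Each executor sends only its own partition's records.
                    producer.send(new ProducerRecord<>("topicA", records.next()));
                }
            }
        })
);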

Re: Is Databricks REST API open source ?

2017-12-02 Thread kant kodali
got it! Thanks! On Sat, Dec 2, 2017 at 9:46 PM, Holden Karau <hol...@pigscanfly.ca> wrote: > That API is not open source. There are some other options as separate > projects you can check out (like Livy,spark-jobserver, etc). > > On Sat, Dec 2, 2017 at 8:30 PM kant kodali &

Is Databricks REST API open source ?

2017-12-02 Thread kant kodali
Hi All, Is the REST API (https://docs.databricks.com/api/index.html), where I can submit spark jobs over REST, open source? Thanks!

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
rdd.toDF() > dataFrame.write > .format("kafka") > .option("kafka.bootstrap.servers", > "host1:port1,host2:port2") > .option("topic", "topic1") >

Re: Do I need to do .collect inside forEachRDD

2017-12-06 Thread kant kodali
producer.send(new ProducerRecord<>("topicA", gson.toJson(map))); }); jssc.start(); jssc.awaitTermination(); On Wed, Dec 6, 2017 at 1:43 AM, Gerard Maas <gerard.m...@gmail.com> wrote: > Hi Kant, > > > but would your answer on .co

Re: Do I need to do .collect inside forEachRDD

2017-12-07 Thread kant kodali
.toJson(map))); > // send smaller json in a task > > } > > } > > }); > > When you do it, make sure kafka producer (seek kafka sink for it) and > gson’s environment setup correctly in executors. > > > > If after this, there is still OOM, l

sparkSession.sql("sql query") vs df.sqlContext().sql(this.query) ?

2017-12-06 Thread kant kodali
Hi All, I have the following snippets of code and I wonder what is the difference between these two and which one should I use? I am using spark 2.2. Dataset df = sparkSession.readStream() .format("kafka") .load(); df.createOrReplaceTempView("table"); df.printSchema(); *Dataset

What should LivyUrl be set to when running locally?

2017-12-01 Thread kant kodali
Hi All, I am running both spark and livy locally so imagine everything on a local machine. what should my livyUrl be set to? I don't see that in the example. Thanks!

Re: What should LivyUrl be set to when running locally?

2017-12-01 Thread kant kodali
nvm, I see it. It's http://localhost:8998 On Fri, Dec 1, 2017 at 3:28 PM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I am running both spark and livy locally so imagine everything on a local > machine. > what should my livyUrl be set to? I don't see that in the example. > > Thanks! >

Does spark sql has timezone support?

2017-10-26 Thread kant kodali
Hi All, Does spark sql have timezone support? Thanks, kant

JIRA Ticket 21641

2017-10-26 Thread kant kodali
Hi All, I am just wondering if anyone had a chance to look at this ticket? https://issues.apache.org/jira/browse/SPARK-21641 I am not expecting it to be resolved quickly; however, I would like to know if this is something that will be implemented or not (since I see no comments in the ticket)

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-04 Thread kant kodali
:55 AM, Arun Mahadevan <ar...@apache.org> wrote: > I think you need to group by a window (tumbling) and define watermarks > (put a very low watermark or even 0) to discard the state. Here the window > duration becomes your logical batch. > > - Arun > > From: kant kodali

I cannot use spark 2.3.0 and kafka 0.9?

2018-05-04 Thread kant kodali
Hi All, This link seems to suggest I can't use Spark 2.3.0 with a Kafka 0.9 broker. Is that correct? https://spark.apache.org/docs/latest/streaming-kafka-integration.html Thanks!

A naive ML question

2018-04-28 Thread kant kodali
Hi All, I have a bunch of financial transactional data and I was wondering if there is any ML model that can give me a graph structure for this data? In other words, show how a transaction evolved over time. Any suggestions or references would help. Thanks!

is it possible to create one KafkaDirectStream (Dstream) per topic?

2018-05-20 Thread kant kodali
Hi All, I have 5 Kafka topics and I am wondering if it is even possible to create one KafkaDirectStream (DStream) per topic within the same JVM, i.e. using only one SparkContext? Thanks!
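
A hedged Java sketch using the spark-streaming-kafka-0-10 API (broker address and topic names are illustrative, and jssc is an existing JavaStreamingContext): several direct streams can be created against the same context, one per topic, each with its own group.id so the underlying consumers do not clash.

import java.util.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

List<String> topics = Arrays.asList("topicA", "topicB", "topicC", "topicD", "topicE");
List<JavaInputDStream<ConsumerRecord<String, String>>> streams = new ArrayList<>();
for (String topic : topics) {
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "my-app-" + topic);   // one consumer group per stream
    streams.add(KafkaUtils.createDirectStream(
            jssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(Collections.singletonList(topic), kafkaParams)));
}
// Each entry in "streams" can now get its own transformations before jssc.start().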

What to consider when implementing a custom streaming sink?

2018-05-15 Thread kant kodali
Hi All, I am trying to implement a custom sink and I have a few questions, mainly on output modes. 1) How does spark let the sink know that a new row is an update of an existing row? Does it look at all the values of all columns of the new row and an existing row for an equality match, or does it

is spark stream-stream joins in update mode targeted for 2.4?

2018-06-18 Thread kant kodali
Hi All, Are spark stream-stream joins in update mode targeted for 2.4? Thanks!

is there a way to create a static dataframe inside mapGroups?

2018-06-04 Thread kant kodali
Hi All, Is there a way to create a static dataframe inside mapGroups, given that mapGroups gives an Iterator of rows? I just want to take that iterator and populate a static dataframe so I can run raw sql queries on the static dataframe. Thanks!

is there a way to parse and modify raw spark sql query?

2018-06-05 Thread kant kodali
Hi All, is there a way to parse and modify a raw spark sql query? For example, given the following query spark.sql("select hello from view") I want to modify the query or logical plan such that I can get a result equivalent to the query below: spark.sql("select foo, hello from view") Any

Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-02 Thread kant kodali
Hi All, I get the below error quite often when I do a stream-stream inner join on two data frames. After running through several experiments, stream-stream joins don't look stable enough for production yet. Any advice on this? Thanks! java.util.ConcurrentModificationException: KafkaConsumer is

Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter?

2018-06-20 Thread kant kodali
Hi All, Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter? I see the following code in this link and I can see that database name can be passed in the connection string, however, I
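
There is no built-in streaming JDBC sink in 2.2/2.3, so ForeachWriter is the usual route. A hedged Java sketch follows; the JDBC URL, credentials, table, and column layout are made up, and a production version would batch inserts and use the partitionId/version arguments for idempotent re-writes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

public class JdbcForeachWriter extends ForeachWriter<Row> {
    private transient Connection conn;
    private transient PreparedStatement stmt;

    @Override
    public boolean open(long partitionId, long version) {
        try {
            conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "pass");
            stmt = conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)");
            return true;                       // returning false skips this partition/epoch
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    public void process(Row row) {
        try {
            stmt.setLong(1, row.getLong(0));
            stmt.setString(2, row.getString(1));
            stmt.executeUpdate();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close(Throwable errorOrNull) {
        try { if (conn != null) conn.close(); } catch (Exception ignored) { }
    }
}

// usage: ds.writeStream().foreach(new JdbcForeachWriter()).option("checkpointLocation", "...").start();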

what is the query language used for graphX?

2018-05-02 Thread kant kodali
Hi All, what is the query language used for GraphX? Are there any plans to introduce Gremlin, or is that idea being dropped in favor of Spark SQL? Thanks!

question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread kant kodali
Hi All, I was under the assumption that one needs to run groupBy(window(...)) to run any stateful operations, but it looks like that is not the case, since an aggregation query like "select count(*) from some_view" is also stateful because it stores the result of the count from the previous batch.

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread kant kodali
SQL so not using flatMapGroupsWithState. And if that is not available, then is it fair to say there is no declarative way to do stateless aggregations? On Thu, May 3, 2018 at 1:24 AM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I was under an assumption that one needs to ru

Re: A naive ML question

2018-04-28 Thread kant kodali
describes > basically an action at a certain point of time. Do you mean how a financial > product evolved over time given a set of a transactions? > > > On 28. Apr 2018, at 12:46, kant kodali <kanth...@gmail.com> wrote: > > > > Hi All, > > > > I have a bun

Do GraphFrames support streaming?

2018-04-29 Thread kant kodali
Do GraphFrames support streaming?

is there a minOffsetsTrigger in spark structured streaming 2.3.0?

2018-04-29 Thread kant kodali
Hi All, just like maxOffsetsTrigger is there a minOffsetsTrigger in spark structured streaming 2.3.0? Thanks!

Re: A naive ML question

2018-04-29 Thread kant kodali
I think matshow in numpy/matplotlib > does this). > > On Sat, 28 Apr 2018 at 21:34, kant kodali <kanth...@gmail.com> wrote: > >> Hi, >> >> I mean a transaction goes typically goes through different states like >> STARTED, PENDING, CANCELLED, COMPLETED,

Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread kant kodali
> > Best Regards, > Ryan > > On Mon, Jul 2, 2018 at 3:56 AM, kant kodali wrote: > >> Hi All, >> >> I get the below error quite often when I do an stream-stream inner join >> on two data frames. After running through several experiments stream-stream >&

How to use schema from one of the columns of a dataset to parse another column and create a flattened dataset using Spark Streaming 2.2.0?

2017-12-23 Thread kant kodali
Hi All, How to use the value (schema) of one of the columns of a dataset to parse another column and create a flattened dataset using Spark Streaming 2.2.0? I have the following *source data frame* that I create from reading messages from Kafka: col1: string col2: json string col1|

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
ending lines to the same > file then changes will not be picked up. > > Regards, > Gourav Sengupta > > On Tue, Jan 16, 2018 at 12:20 AM, kant kodali <kanth...@gmail.com> wrote: > >> Hi, >> >> I am not sure I understand. any examples ? >> >> On Mon,

can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi All, I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0? For example, can I have stream1 reading from Kafka and writing to HDFS, and stream2 reading from HDFS and writing back to Kafka, such that stream2 will be pulling the latest updates written by stream1? Thanks!
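
A hedged Java sketch of that topology (paths, schema fields, broker, and topic are illustrative, and it assumes a main() that may throw). The file source only picks up new files dropped into the directory; appends to existing files are not seen, which is the caveat raised in the replies.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("hdfs-to-kafka").getOrCreate();

// File sources need an explicit schema for streaming reads.
StructType schema = new StructType()
        .add("id", DataTypes.StringType)
        .add("amount", DataTypes.LongType);

Dataset<Row> fromHdfs = spark.readStream()
        .schema(schema)
        .json("hdfs:///streams/stage1_output");      // directory stream1 writes into

fromHdfs.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "stage2_topic")
        .option("checkpointLocation", "hdfs:///checkpoints/stage2")
        .start()
        .awaitTermination();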

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
ectory. > Updating the files is not supported. > > kr, Gerard. > > On Mon, Jan 15, 2018 at 11:41 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> I am wondering if HDFS can be a streaming source like Kafka in Spark >> 2.2.0? For example can I

is there a way to write a Streaming Dataframe/Dataset to Cassandra with auto mapping?

2018-01-19 Thread kant kodali
Hi All, I was wondering if there is a way to write a Streaming Dataframe/Dataset to Cassandra with auto mapping? By auto mapping I mean mapping DataSet/Dataframe schema to Cassandra Table schema? I can for example get Dataframe.dtypes() and then map Spark SQL types to CQL types but I was

Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
egerType" but this string representation "IntegerType" doesnt seem to be very useful because I cannot do DataType.fromJson("IntegerType") This will throw an error. so I am not quite sure how to construct a DataType given its String representation ? On Thu, Jan 25, 2018 at 4:22 PM,

how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
Hi All, I have a datatype "IntegerType" represented as a String and now I want to create a DataType object out of that. I couldn't find anything in the DataType or DataTypes API on how to do that. Thanks!
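
A hedged Java sketch of one workable approach: map the class-style names onto the DataTypes constants yourself, since DataType.fromJson expects the JSON form produced by DataType.json() (e.g. "\"integer\"") rather than the name "IntegerType", which matches the error described in this thread. The helper name and the set of handled types are illustrative.

import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Translate a class-style type name into the corresponding DataTypes constant.
public static DataType fromTypeName(String name) {
    switch (name) {
        case "StringType":    return DataTypes.StringType;
        case "IntegerType":   return DataTypes.IntegerType;
        case "LongType":      return DataTypes.LongType;
        case "DoubleType":    return DataTypes.DoubleType;
        case "BooleanType":   return DataTypes.BooleanType;
        case "TimestampType": return DataTypes.TimestampType;
        default: throw new IllegalArgumentException("Unhandled type name: " + name);
    }
}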

is there a way to create new column with timeuuid using raw spark sql ?

2018-02-01 Thread kant kodali
Hi All, Is there any way to create a new timeuuid column on an existing dataframe using raw sql? You can assume that there is a timeuuid UDF function if that helps. Thanks!
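
A hedged Java sketch of doing this through a registered UDF called from raw SQL (all names are made up). UUID.randomUUID() yields a random v4 UUID, so a real time-based UUID library would have to stand in if a true timeuuid is required; and since Spark 2.2 has no UDF0, the function takes a dummy argument it ignores.

import java.util.UUID;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register the generator once on the session.
spark.udf().register("make_uuid",
        (UDF1<String, String>) ignored -> UUID.randomUUID().toString(),
        DataTypes.StringType);

df.createOrReplaceTempView("events");
Dataset<Row> withUuid = spark.sql(
        "SELECT *, make_uuid(CAST(col1 AS STRING)) AS event_uuid FROM events");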

can we expect UUID type in Spark 2.3?

2018-02-02 Thread kant kodali
Hi All, can we expect a UUID type in Spark 2.3? It looks like it could help a lot of downstream sources with modeling. Thanks!

How and when the types of the result set are figured out in Spark?

2018-01-28 Thread kant kodali
Hi All, I would like to know how and when the types of the result set are figured out in Spark. For example, say I have the following dataframe.

*inputdf*
col1 | col2 | col3
-----|------|-----
  1  |  2   |  5
  2  |  3   |  6

Now say I do something like below (pseudo SQL): resultdf = select

Re: is there a way to create new column with timeuuid using raw spark sql ?

2018-02-01 Thread kant kodali
e, use withColumn()... > > jg > > > > On Feb 1, 2018, at 05:50, kant kodali <kanth...@gmail.com> wrote: > > > > Hi All, > > > > Is there any way to create a new timeuuid column of a existing dataframe > using raw sql? you can assume that there is a timeuuid udf function if that > helps. > > > > Thanks! > >

There is no UDF0 interface?

2018-02-04 Thread kant kodali
Hi All, I see the current UDF APIs can take one or more arguments, but I don't see any UDF0 in Spark 2.2.0. Am I correct? Thanks!

Re: There is no UDF0 interface?

2018-02-04 Thread kant kodali
got it! Not sure why this link https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest doesn't work. On Sun, Feb 4, 2018 at 12:25 PM, Xiao Li <gatorsm...@gmail.com> wrote: > The upcoming 2.3 will have it. > > On Sun, Feb 4, 2018 at 12:24 PM kant kodali <

can udaf's return complex types?

2018-02-10 Thread kant kodali
Hi All, Can UDAFs return complex types? Like, say, a Map with an Integer key and an Array of strings as the value? For example, say I have the following *input dataframe*

id | name | amount
---|------|-------
 1 | foo  |  10
 2 | bar  |  15
 1 | car  |  20
 1 | bus  |  20

and

stdout: org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in

2018-02-14 Thread kant kodali
Hi All, I get an AnalysisException when I run the following query spark.sql(select current_timestamp() as tsp, count(*) from table group by window(tsp, '5 minutes')) I just want to create a processing-time column and run some simple stateful query like the above. I understand
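
A hedged Java sketch of the workaround that usually gets suggested for processing-time windows (assuming a streaming Dataset<Row> df): materialize the processing-time column with withColumn first and group on that, instead of calling current_timestamp() inside the GROUP BY, which is what trips the nondeterministic-expression check.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.current_timestamp;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> counts = df
        .withColumn("tsp", current_timestamp())      // processing-time column
        .groupBy(window(col("tsp"), "5 minutes"))
        .count();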

Does the classloader used by spark blocks the I/O calls from UDF's?

2018-02-16 Thread kant kodali
Hi All, Does the class loader used by spark block the I/O calls from UDFs? If not, for security reasons wouldn't it make sense to block I/O calls within the UDF code? Thanks!

can we do self join on streaming dataset in 2.2.0?

2018-02-17 Thread kant kodali
Hi All, I know that stream-to-stream joins are not yet supported. From the text below, I wonder if we can do self joins on the same streaming dataset/dataframe in 2.2.0, since there are not two explicit streaming datasets or dataframes? Thanks!! In Spark 2.3, we have added support for

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
If I change it to this On Tue, Feb 20, 2018 at 7:52 PM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I have the following code > > import org.apache.spark.sql.streaming.Trigger > > val jdf = spark.readStream.format("kafka").option("ka

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
quot;table") jdf1.createOrReplaceTempView("table") val resultdf = spark.sql("select * from table inner join table1 on table.offset=table1.offset") resultdf.writeStream.outputMode("append").format("console").option("truncate", false).trigger(

what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-20 Thread kant kodali
Hi All, I have the following code import org.apache.spark.sql.streaming.Trigger val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "join_test").option("startingOffsets", "earliest").load(); jdf.createOrReplaceTempView("table")

Re: The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Sorry. please ignore. it works now! On Tue, Feb 20, 2018 at 5:41 AM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I am reading records from Kafka using Spark 2.2.0 Structured Streaming. I > can see my Dataframe has a schema like below. The timestamp column seems to &

The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Hi All, I am reading records from Kafka using Spark 2.2.0 Structured Streaming. I can see my Dataframe has a schema like below. The timestamp column seems to be the same for every record and I am not sure why. Am I missing something (did I fail to configure something)? Thanks! Column Type key

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread kant kodali
t; TD > > > > On Tue, Feb 20, 2018 at 8:20 PM, kant kodali <kanth...@gmail.com> wrote: > >> if I change it to the below code it works. However, I don't believe it is >> the solution I am looking for. I want to be able to do it in raw SQL and >> moreover, If a us

Is there a way to query dataframe views directly without going through scheduler?

2018-02-26 Thread kant kodali
Hi All, I wonder if there is a way to query data frame views directly without going through the scheduler? For example, say I have the following code: Dataset<Row> kafkaDf = session.readStream().format("kafka").load(); kafkaDf.createOrReplaceTempView("table"); Now can I query the view "table" without going

What happens if I can't fit data into memory while doing stream-stream join.

2018-02-23 Thread kant kodali
Hi All, I am experimenting with Spark 2.3.0 stream-stream join feature to see if I can leverage it to replace some of our existing services. Imagine I have 3 worker nodes with *each node* having (16GB RAM and 100GB SSD). My input dataset which is in Kafka is about 250GB per day. Now I want to do

Re: Does Spark Structured Streaming have a JDBC sink or Do I need to use ForEachWriter?

2018-06-21 Thread kant kodali
aalDornick/spark/blob/master/ > sql/core/src/main/scala/org/apache/spark/sql/execution/ > streaming/JdbcSink.scala > > https://github.com/GaalDornick/spark/blob/master/ > sql/core/src/main/scala/org/apache/spark/sql/execution/ > streaming/JDBCSinkLog.scala > > > > > &

How to Create one DB connection per executor and close it after the job is done?

2018-07-28 Thread kant kodali
Hi All, I understand creating a connection per partition with foreachPartition, but I am wondering: can I create one DB connection per executor and close it after the job is done? Any sample code would help. You can imagine I am running a simple batch processing application. Thanks!
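
A hedged Java sketch of the singleton-holder idea from the reply below (the Scala object becomes a static holder class; the JDBC URL and credentials are made up): one lazily created connection per executor JVM, reused by every task that lands there, and closed via a shutdown hook since Spark offers no per-executor "job finished" callback.

import java.sql.Connection;
import java.sql.DriverManager;

public final class DbConnectionHolder {
    private static Connection conn;

    public static synchronized Connection get() throws Exception {
        if (conn == null || conn.isClosed()) {
            conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/mydb", "user", "pass");
            // Close when the executor JVM shuts down at the end of the application.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try { conn.close(); } catch (Exception ignored) { }
            }));
        }
        return conn;
    }
}

// usage on executors:
// dataset.foreachPartition(rows -> {
//     Connection c = DbConnectionHolder.get();   // shared by every partition on this executor
//     while (rows.hasNext()) { /* write rows.next() using c */ }
// });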

Re: How to Create one DB connection per executor and close it after the job is done?

2018-07-30 Thread kant kodali
ect in Scala, of which only one instance will be > created on each JVM / Executor. E.g. > > object MyDatabseSingleton { > var dbConn = ??? > } > > On Sat, 28 Jul 2018, 08:34 kant kodali, wrote: > >> Hi All, >> >> I understand creating a connection forEac

How do I generate current UTC timestamp in raw spark sql?

2018-08-28 Thread kant kodali
Hi All, How do I generate the current UTC timestamp using spark sql? When I do current_timestamp() it is giving me local time. to_utc_timestamp(current_time(), ) takes the timezone as the second parameter and I see no UDF that can give me the current timezone. when I do
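
Two hedged sketches, assuming a SparkSession spark. Setting the session-local time zone (the spark.sql.session.timeZone setting added in 2.2) makes timestamps render in UTC; alternatively, pass the JVM's own zone ID into to_utc_timestamp explicitly.

// Option 1: make the whole session render timestamps in UTC.
spark.conf().set("spark.sql.session.timeZone", "UTC");
spark.sql("SELECT current_timestamp() AS utc_now").show(false);

// Option 2: leave the session alone and convert explicitly from the JVM's zone.
spark.sql("SELECT to_utc_timestamp(current_timestamp(), '"
        + java.util.TimeZone.getDefault().getID() + "') AS utc_now").show(false);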

Re: Do GraphFrames support streaming?

2018-07-15 Thread kant kodali
ackend) then when the next stream arrive join them - create > graph and store the next stream together with the existing stream on disk > etc. > > On 14. Jul 2018, at 17:19, kant kodali wrote: > > The question now would be can it be done in streaming fashion? Are you > talking about

Re: Can I specify watermark using raw sql alone?

2018-07-15 Thread kant kodali
I don't see a withWatermark UDF to use in raw sql. I am currently using Spark 2.3.1 On Sat, Jul 14, 2018 at 4:19 PM, kant kodali wrote: > Hi All, > > Can I specify watermark using raw sql alone? other words without using > .withWatermark from > Dataset API. > > Thanks! >

Can I specify watermark using raw sql alone?

2018-07-14 Thread kant kodali
Hi All, Can I specify a watermark using raw sql alone? In other words, without using .withWatermark from the Dataset API. Thanks!

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread kant kodali
ssages to kafka, might be easier on you to > just explode the rows and let Spark do the rest for you. > > > > *From: *kant kodali > *Date: *Tuesday, July 24, 2018 at 1:04 PM > *To: *Silvio Fiorito > *Cc: *Arun Mahadevan , chandan prakash < > chandanbaran...@gmail.com>

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-23 Thread kant kodali
understand each row has a topic column but can we write one row to multiple topics? On Thu, Jul 12, 2018 at 11:00 AM, Arun Mahadevan wrote: > What I meant was the number of partitions cannot be varied with > ForeachWriter v/s if you were to write to each sink using independent > queries. Maybe

Re: Do GraphFrames support streaming?

2018-07-14 Thread kant kodali
run an algorithms -> look at Janusgraph > You want to update incrementally an existing graph and run incrementally a > graph algorithm suitable for this - you have to implement yourself as far > as I am aware > > > On 29. Apr 2018, at 11:43, kant kodali wrote: > > > > Do GraphFrames support streaming? >

Re: Do GraphFrames support streaming?

2018-07-14 Thread kant kodali
for > incremental graph updates. > > On 14. Jul 2018, at 15:59, kant kodali wrote: > > "You want to update incrementally an existing graph and run incrementally > a graph algorithm suitable for this - you have to implement yourself as > far as I am aware" > > I

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-24 Thread kant kodali
chBatch method that will give you a > DataFrame and let you write to the sink as you wish. > > > > *From: *kant kodali > *Date: *Monday, July 23, 2018 at 4:43 AM > *To: *Arun Mahadevan > *Cc: *chandan prakash , Tathagata Das < > tathagata.das1...@gmail.com>, "ymaha

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread kant kodali
Hi, It's on a local MacBook Pro machine that has 16GB RAM, a 512GB disk and 8 vCPUs! I am not running any code since I can't even spawn spark-shell with yarn as master, as described in my previous email. I just want to run a simple word count using yarn as master. Thanks! Below is the resource manager

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread kant kodali
s allowed to provide. > You might take a look at https://hadoop.apache.org/ > docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html and > search for maximum-am-resource-percent. > > Regards, > > *Yohann Jardin* > Le 7/8/2018 à 4:40 PM, kant kodali a écrit : > > Hi

spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread kant kodali
Hi All, I am trying to run a simple word count using YARN as the cluster manager. I am currently using Spark 2.3.1 and Apache Hadoop 2.7.3. When I spawn spark-shell like below it gets stuck in the ACCEPTED state forever. ./bin/spark-shell --master yarn --deploy-mode client I set my

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-09 Thread kant kodali
the node automatically goes to an unhealthy state and the INFO logs don't tell me why. On Sun, Jul 8, 2018 at 7:36 PM, kant kodali wrote: > @yohann Thanks for shining some light! It is making more sense now. > > I think you are correct when you stated: "Your application master is
