is it ok to have multiple SparkSessions in one Spark Structured Streaming app?

2017-09-06 Thread kant kodali
Hi All, I am wondering if it is ok to have multiple SparkSessions in one Spark Structured Streaming app? Basically, I want to create 1) a Spark session for reading from Kafka and 2) another Spark session for storing the mutations of a dataframe/dataset to a persistent table as I get the mutations f
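A single SparkSession usually covers both the Kafka read and the table write; if isolated SQL state is really needed, a second session can share the same SparkContext via newSession(). A minimal Java sketch, with an assumed app name:

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("kafka-to-table")   // hypothetical app name
        .getOrCreate();

    // Shares the underlying SparkContext, but keeps its own SQL conf,
    // temp views and registered UDFs isolated from the first session.
    SparkSession writerSession = spark.newSession();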

Re: is it ok to have multiple SparkSessions in one Spark Structured Streaming app?

2017-09-08 Thread kant kodali
d > idea. > > On Thu, Sep 7, 2017 at 2:40 AM, kant kodali wrote: > >> Hi All, >> >> I am wondering if it is ok to have multiple SparkSessions in one spark >> structured streaming app? Basically, I want to create 1) Spark session for >> reading from Kafka

Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
Hi All, I have the following code and I am not sure what's wrong with it. Why can I not call dataset.toJSON() (which returns a Dataset)? I am using Spark 2.2.0, so I am wondering if there is any workaround? Dataset ds = newDS.toJSON().map(()->{some function that returns a string}); StreamingQuery

How to convert Row to JSON in Java?

2017-09-09 Thread kant kodali
Hi All, How to convert a Row to JSON in Java? It would be nice to have a .toJson() method in the Row class. Thanks, kant

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
What is newDS? > If it is a Streaming Dataset/DataFrame (since you have writeStream there) > then there seems to be an issue preventing toJSON to work. > > ------ > *From:* kant kodali > *Sent:* Saturday, September 9, 2017 4:04:33 PM > *To:* user @spar

Re: How to convert Row to JSON in Java?

2017-09-09 Thread kant kodali
toJSON on Row object. On Sat, Sep 9, 2017 at 4:18 PM, Felix Cheung wrote: > toJSON on Dataset/DataFrame? > > -- > *From:* kant kodali > *Sent:* Saturday, September 9, 2017 4:15:49 PM > *To:* user @spark > *Subject:* How to convert Row to JSO

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread kant kodali
the Dataset to an RDD, which is not supported > by streaming queries. > > On Sat, Sep 9, 2017 at 4:40 PM, kant kodali wrote: > >> yes it is a streaming dataset. so what is the problem with following code? >> >> Dataset ds = dataset.toJSON().map(()->{some fun

Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread kant kodali
Hi All, I was wondering if we need to checkpoint both the read and write streams when reading from Kafka and inserting into a target store? For example: sparkSession.readStream().option("checkpointLocation", "hdfsPath").load() vs dataSet.writeStream().option("checkpointLocation", "hdfsPath") Thank
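For what it's worth, the checkpoint location belongs to the query, i.e. the write side; a minimal sketch with assumed broker, topic, and path values:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> in = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
        .option("subscribe", "topicA")                        // assumed topic
        .load();

    // Checkpointing is configured once, on the sink/query, not on the source.
    in.writeStream()
        .format("console")
        .option("checkpointLocation", "hdfs:///checkpoints/myQuery")  // assumed path
        .start();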

Re: How to convert Row to JSON in Java?

2017-09-11 Thread kant kodali
#getValuesMap-scala.collection.Seq->. >>> I found this post <https://stackoverflow.com/a/41602178/8356352> >>> useful, it is in Scala but should be a good starting point. >>> An alternative approach is to combine the 'struct' and 'to_json' functi
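The struct plus to_json combination mentioned above, sketched in Java; since it stays inside the Dataset API it should also work on streaming Datasets, unlike toJSON (see the earlier thread):

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Pack all columns of each Row into a struct, then render it as a JSON string.
    Dataset<Row> json = df.select(to_json(struct(col("*"))).alias("value"));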

Did the Kafka dependency jars change for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
Hi All, Did the Kafka dependency jars change for Spark Structured Streaming 2.2.0? kafka-clients-0.10.0.1.jar spark-streaming-kafka-0-10_2.11-2.2.0.jar 1) Are the above two the only Kafka-related jars, or am I missing something? 2) What is the difference between the above two jars? 3) If I have th

unable to read from Kafka (very strange)

2017-09-11 Thread kant kodali
Hi All, I started using Spark 2.2.0 very recently and now I can't even get the JSON data from Kafka out to the console. I have no clue what's happening; this was working for me when I was using 2.1.1. Here is my code: StreamingQuery query = sparkSession.readStream() .format("kafka") .o

Re: Did the Kafka dependency jars change for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
.option("kafka.bootstrap.servers", "localhost:9092") .option("subscribe", "hello")) .option("startingOffsets", "earliest") .load() .writeStream() .format("console") .start(); query.awaitTermination(); On Mon, Sep 11, 2017 at 6:24 PM

Re: Multiple Kafka topics processing in Spark 2.2

2017-09-12 Thread kant kodali
@Dan shouldn't you be using Datasets/Dataframes? I heard it is recommended to use Datasets and Dataframes rather than DStreams, since DStreams are in maintenance mode. On Mon, Sep 11, 2017 at 7:41 AM, Cody Koeninger wrote: > If you want an "easy" but not particularly performant way to do it, each >

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-12 Thread kant kodali
>> shixi...@databricks.com> wrote: >> >>> It's because "toJSON" doesn't support Structured Streaming. The current >>> implementation will convert the Dataset to an RDD, which is not supported >>> by streaming queries. >>> >>>

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-12 Thread kant kodali
I have about 100 fields in my dataset and some of them have "null" in it. Does to_json fail to convert in that case? Thanks! On Tue, Sep 12, 2017 at 12:32 PM, kant kodali wrote: > Hi Michael, > > Interestingly that doesn't seem to quite work for me for some

Should I use DStreams or Structured Streaming to transfer data from source to sink and then back from sink to source?

2017-09-13 Thread kant kodali
Hi All, I am trying to read data from Kafka, insert into Mongo, and read from Mongo and insert back into Kafka. I went with the Structured Streaming approach first, however I believe I am making some naive error because my map operations are not getting invoked. The pseudocode looks like this: DataSet r

Re: Should I use DStreams or Structured Streaming to transfer data from source to sink and then back from sink to source?

2017-09-14 Thread kant kodali
Are you sure > the query is running? Maybe actual code (not pseudocode) may help debug > this. > > > On Wed, Sep 13, 2017 at 11:20 AM, kant kodali wrote: > >> Hi All, >> >> I am trying to read data from kafka, insert into Mongo and read from >> mongo and insert

Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-17 Thread kant kodali
Are you creating threads in your application? On Sun, Sep 17, 2017 at 7:48 AM, HARSH TAKKAR wrote: > > Hi > > I am using spark 2.1.0 with scala 2.11.8, and while iterating over the > partitions of each rdd in a dStream formed using KafkaUtils, i am getting > the below exception, please suggest

Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-17 Thread kant kodali
a DStream > however, we have a single thread for refreshing a resource cache on > driver, but that is totally separate to this connection. > > On Mon, Sep 18, 2017 at 12:29 AM kant kodali wrote: > >> Are you creating threads in your application? >> >> On Sun, Sep 1

How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
Hi All, I am wondering how to read from multiple Kafka topics using Structured Streaming (code below)? I googled prior to asking this question and I see responses related to DStreams but not Structured Streaming. Is it possible to read multiple topics using the same Spark structured stream? sparkSe
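For reference, the Kafka source accepts either a comma-separated topic list or a topic pattern; a sketch with assumed broker and topic names:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Comma-separated list of topics...
    Dataset<Row> df = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "topicA,topicB,topicC")
        .load();

    // ...or a regex over topic names.
    Dataset<Row> df2 = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribePattern", "topic.*")
        .load();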

Re: How to read from multiple kafka topics using structured streaming (spark 2.2.0)?

2017-09-19 Thread kant kodali
> Pozdrawiam, > Jacek Laskowski > > https://about.me/JacekLaskowski > Spark Structured Streaming (Apache Spark 2.2+) https://bit.ly/spark- > structured-streaming > Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski

Structured streaming coding question

2017-09-19 Thread kant kodali
Hi All, I have the following pseudocode (I could paste the real code, however it is pretty long and involves database calls inside dataset.map operations and so on), so I am just trying to simplify my question. I would like to know if there is something wrong with the following pseudocode? DataSet i

Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Looks like my problem was the order of awaitTermination() for some reason. Doesn't work On Tue, Sep 19, 2017 at 1:54 PM, kant kodali wrote: > Hi All, > > I have the following Psuedo code (I could paste the real code however it > is pretty long and involves Database calls i

Re: Structured streaming coding question

2017-09-19 Thread kant kodali
Trigger.processingTime(1000)).foreach(new KafkaSink("hello2")).start(); query2.awaitTermination() On Tue, Sep 19, 2017 at 10:09 PM, kant kodali wrote: > Looks like my problem was the order of awaitTermination() for some reason. > > Doesn't work > > > > > &

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
and as long as your > first query is running, you will not know. Put this as the last line > instead: > > spark.streams.awaitAnyTermination() > > On Tue, Sep 19, 2017 at 10:11 PM, kant kodali wrote: > >> Looks like my problem was the order of awaitTermination() for s

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
wrote: > Please remove > > query1.awaitTermination(); > query2.awaitTermination(); > > once > > query1.awaitTermination(); > > is called, you don't even get to query2.awaitTermination(). > > > On Tue, Sep 19, 2017 at 11:59 PM, kant kodali wrote: &

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
Hi Burak, Are you saying get rid of both query1.awaitTermination(); query2.awaitTermination(); and just have the line below? sparkSession.streams().awaitAnyTermination(); Thanks! On Wed, Sep 20, 2017 at 12:51 AM, kant kodali wrote: > If I don't get anywhere after query1.awaitTer

Re: Structured streaming coding question

2017-09-20 Thread kant kodali
Just tried with sparkSession.streams().awaitAnyTermination(); and that's the only await* I had, and it works! But what if I don't want all my queries to fail or stop making progress if one of them fails? On Wed, Sep 20, 2017 at 2:26 AM, kant kodali wrote: > Hi Burak, > > Are you
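On that last question, one sketch that may help: wait on the StreamingQueryManager in a loop, so a single failed query is logged while the others keep running (awaitAnyTermination throws when the terminated query failed):

    import org.apache.spark.sql.streaming.StreamingQueryException;

    // Start all queries first, then wait on the manager rather than on one query.
    while (true) {
        try {
            sparkSession.streams().awaitAnyTermination();
        } catch (StreamingQueryException e) {
            e.printStackTrace();  // one query failed; the others keep running
        }
        sparkSession.streams().resetTerminated();  // forget it so the next wait works
        // optionally restart the stopped query here
    }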

Is there a SparkILoop for Java?

2017-09-20 Thread kant kodali
Hi All, I am wondering if there is a SparkILoop for Java so I can pass Java code as a string to a REPL? Thanks!

Re: Is there a SparkILoop for Java?

2017-09-20 Thread kant kodali
017 at 8:38 PM, kant kodali wrote: > >> Hi All, >> >> I am wondering if there is a SparkILoop >> <https://spark.apache.org/docs/0.8.0/api/repl/org/apache/spark/repl/SparkILoop.html> >> for >> java so I can pass Java code as a string to repl? >> >> Thanks! >> > >

Re: Is there a SparkILoop for Java?

2017-09-20 Thread kant kodali
tion of Java code through dynamic class loading, not as powerful > and requires some work on your end, but you can build a Java-based > interpreter. > > jg > > > On Sep 20, 2017, at 08:06, kant kodali wrote: > > Got it! Thats what I thought. Java 9 is going to release tomorro
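For what it's worth, the Java 9 release mentioned here ships the JShell API (jdk.jshell), which evaluates Java source passed as a string; a minimal sketch:

    import jdk.jshell.JShell;
    import jdk.jshell.SnippetEvent;

    try (JShell shell = JShell.create()) {
        for (SnippetEvent e : shell.eval("int x = 40 + 2;")) {
            System.out.println(e.value());  // prints 42
        }
    }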

How to know what operations Spark raw SQL can support?

2017-09-21 Thread kant kodali
How can I know all the possible operations Spark raw SQL can support? Is there any document? Thanks!
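One low-tech way to explore what is available, assuming a SparkSession named sparkSession:

    // List every built-in and registered function...
    sparkSession.sql("SHOW FUNCTIONS").show(1000, false);

    // ...and print the usage text for one of them.
    sparkSession.sql("DESCRIBE FUNCTION EXTENDED window").show(false);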

How to convert an Array of JSON rows into a Dataset of specific columns in Spark 2.2.0?

2017-10-07 Thread kant kodali
I have a Dataset ds which consists of JSON rows. *Sample JSON row (this is just an example of one row in the dataset)* [ {"name": "foo", "address": {"state": "CA", "country": "USA"}, "docs":[{"subject": "english", "year": 2016}]} {"name": "bar", "address": {"state": "OH", "country": "USA"
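A sketch of one way to flatten this in Java, assuming the JSON array arrives as a single string column named value and that the Spark 2.2 from_json overload accepts an array-of-structs schema:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    StructType person = new StructType()
        .add("name", DataTypes.StringType)
        .add("address", new StructType()
            .add("state", DataTypes.StringType)
            .add("country", DataTypes.StringType))
        .add("docs", DataTypes.createArrayType(new StructType()
            .add("subject", DataTypes.StringType)
            .add("year", DataTypes.IntegerType)));

    // Parse the JSON array, then explode it into one row per element.
    Dataset<Row> flat = ds
        .select(explode(from_json(col("value"), DataTypes.createArrayType(person))).as("p"))
        .select(col("p.name"), col("p.address.state"), col("p.docs"));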

Re: How to convert an Array of JSON rows into a Dataset of specific columns in Spark 2.2.0?

2017-10-08 Thread kant kodali
ql-high-order-lambda-functions-examine-complex-structured-data-databricks.html > > https://databricks.com/blog/2017/06/13/five-spark-sql-utility-functions-extract-explore-complex-data-types.html > > Cheers > Jules > > > Sent from my iPhone > Pardon the dumb thumb

Does Spark 2.2.0 support Dataset<List<Object>>?

2017-10-09 Thread kant kodali
Hi All, I am wondering if Spark supports Dataset<List<Object>>? When I do the following, it says no map function available? Dataset<List<Object>> resultDs = ds.map(lambda, Encoders.bean(List.class)); Thanks!

Re: Does Spark 2.2.0 support Dataset<List<Object>>?

2017-10-09 Thread kant kodali
upported type > also. Object is not a supported type. > > On Mon, Oct 9, 2017 at 7:36 AM, kant kodali wrote: > >> Hi All, >> >> I am wondering if spark supports Dataset<List<Object>>? >> >> when I do the following it says no map function available? &

Re: Does Spark 2.2.0 support Dataset<List<Object>>?

2017-10-09 Thread kant kodali
; +-+ > |value| > +-+ > |1| > |2| > |3| > +-+ > > > > > On Mon, Oct 9, 2017 at 2:38 PM, kant kodali wrote: > >> Hi Koert, >> >> Thanks! If I have this Dataset<List<Object>> what would be the >> Encoding? Is it Enco

Re: How to convert an Array of JSON rows into a Dataset of specific columns in Spark 2.2.0?

2017-10-09 Thread kant kodali
https://issues.apache.org/jira/browse/SPARK-8 On Sun, Oct 8, 2017 at 11:58 AM, kant kodali wrote: > I have the following so far > > private StructType getSchema() { > return new StructType() > .add("name", StringType) >

Re: Does Spark 2.2.0 support Dataset<List<Object>>?

2017-10-10 Thread kant kodali
asScalaBufferConverter(temp).asScala().toSeq(); } }, Encoders.*bean*(Seq.class)); On Mon, Oct 9, 2017 at 12:08 PM, kant kodali wrote: > Tried the following. > > > dataset.map(new MapFunction<String, List<Object>>() { @Override > public List<Object> call(String

Does Spark SQL have timezone support?

2017-10-25 Thread kant kodali
Hi All, Does Spark SQL have timezone support? Thanks, kant
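As far as I know, Spark 2.2 added a session-local timezone setting plus per-column conversion functions; a sketch worth verifying against the docs:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Session-local timezone used when parsing and rendering timestamps (Spark 2.2+).
    sparkSession.conf().set("spark.sql.session.timeZone", "UTC");

    // Per-column conversion, assuming a timestamp column named "ts".
    Dataset<Row> local = df.select(
        from_utc_timestamp(col("ts"), "America/Los_Angeles").as("ts_pacific"));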

JIRA Ticket 21641

2017-10-25 Thread kant kodali
Hi All, I am just wondering if anyone has had a chance to look at this ticket? https://issues.apache.org/jira/browse/SPARK-21641 I am not expecting it to be resolved quickly; however, I would like to know if this is something that will be implemented or not (since I see no comments in the ticket). Th

Can we pass the Calcite streaming sql queries to spark sql?

2017-11-09 Thread kant kodali
Can we pass the Calcite streaming sql queries to spark sql? https://calcite.apache.org/docs/stream.html#references

What should LivyUrl be set to when running locally?

2017-12-01 Thread kant kodali
Hi All, I am running both Spark and Livy locally, so imagine everything on a local machine. What should my livyUrl be set to? I don't see that in the example. Thanks!

Re: What should LivyUrl be set to when running locally?

2017-12-01 Thread kant kodali
nvm, I see it. It's http://localhost:8998 On Fri, Dec 1, 2017 at 3:28 PM, kant kodali wrote: > Hi All, > > I am running both spark and livy locally so imagine everything on a local > machine. > what should my livyUrl be set to? I don't see that in the example. > > Thanks! >

Is the Databricks REST API open source?

2017-12-02 Thread kant kodali
Hi All, Is the REST API (https://docs.databricks.com/api/index.html), where I can submit Spark jobs over REST, open source? Thanks!

Re: Is the Databricks REST API open source?

2017-12-02 Thread kant kodali
got it! Thanks! On Sat, Dec 2, 2017 at 9:46 PM, Holden Karau wrote: > That API is not open source. There are some other options as separate > projects you can check out (like Livy,spark-jobserver, etc). > > On Sat, Dec 2, 2017 at 8:30 PM kant kodali wrote: > >> HI All, >

Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
Hi All, I have a simple stateless transformation using DStreams (stuck with the old API for one of the applications). The pseudocode is roughly like this: dstream.map().reduce().forEachRDD(rdd -> { rdd.collect().forEach(); // Is this necessary? It does execute fine but is a bit slow }) I understand
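collect() ships every record back to the driver; if the goal is only to write records out, foreachPartition keeps the work on the executors. A sketch where send() is a stand-in for a per-partition Kafka producer call, not a real API:

    dstream.foreachRDD(rdd -> {
        // Runs on the executors; nothing is pulled back to the driver.
        rdd.foreachPartition(records -> {
            // create one producer/connection per partition here
            while (records.hasNext()) {
                send(records.next());  // hypothetical sink call
            }
        });
    });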

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
Reads from Kafka and outputs to Kafka, so I check the output from Kafka. On Tue, Dec 5, 2017 at 1:26 PM, Qiao, Richard wrote: > Where do you check the output result for both case? > > Sent from my iPhone > > > On Dec 5, 2017, at 15:36, kant kodali wrote: > > > > H

Re: Do I need to do .collect inside forEachRDD

2017-12-05 Thread kant kodali
DF() > dataFrame.write > .format("kafka") > .option("kafka.bootstrap.servers", > "host1:port1,host2:port2") > .option("topic", "topic1") > .save(

sparkSession.sql("sql query") vs df.sqlContext().sql(this.query) ?

2017-12-06 Thread kant kodali
Hi All, I have the following snippets of code and I wonder what the difference between these two is and which one I should use? I am using Spark 2.2. Dataset df = sparkSession.readStream() .format("kafka") .load(); df.createOrReplaceTempView("table"); df.printSchema(); *Dataset resu

Re: Do I need to do .collect inside forEachRDD

2017-12-06 Thread kant kodali
send(new ProducerRecord<>("topicA", gson.toJson(map))); }); jssc.start(); jssc.awaitTermination(); On Wed, Dec 6, 2017 at 1:43 AM, Gerard Maas wrote: > Hi Kant, > > > but would your answer on .collect() change depending on running the > spark app in cli

Re: Do I need to do .collect inside forEachRDD

2017-12-07 Thread kant kodali
nd smaller json in a task > > } > > } > > }); > > When you do it, make sure kafka producer (seek kafka sink for it) and > gson’s environment setup correctly in executors. > > > > If after this, there is still OOM, let’s discuss furthe

is Union or Join Supported for Spark Structured Streaming Queries in 2.2.0?

2017-12-13 Thread kant kodali
Hi All, I have messages in a queue that might be coming in with a few different schemas, like: msg1 schema1, msg2 schema2, msg3 schema3, msg4 schema1. I want to put all of this in one data frame. Is it possible with Structured Streaming? I am using Spark 2.2.0 Thanks!

Given an Avro Schema object, is there a way to get a StructType in Java?

2017-12-15 Thread kant kodali
Hi All, Given an Avro Schema object, is there a way to get the StructType that represents the schema in Java? Thanks!
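Assuming the com.databricks:spark-avro package is on the classpath, its SchemaConverters looks like it can do this; a sketch where avroSchemaJson is a stand-in for the schema string:

    import com.databricks.spark.avro.SchemaConverters;
    import org.apache.avro.Schema;
    import org.apache.spark.sql.types.StructType;

    Schema avroSchema = new Schema.Parser().parse(avroSchemaJson);
    StructType sparkSchema = (StructType) SchemaConverters.toSqlType(avroSchema).dataType();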

How to use schema from one of the columns of a dataset to parse another column and create a flattened dataset using Spark Streaming 2.2.0?

2017-12-23 Thread kant kodali
Hi All, How to use the value (schema) of one of the columns of a dataset to parse another column and create a flattened dataset using Spark Streaming 2.2.0? I have the following *source data frame* that I create from reading messages from Kafka col1: string col2: json string col1| col2

can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi All, I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0? For example, can I have stream1 reading from Kafka and writing to HDFS, and stream2 reading from HDFS and writing back to Kafka, such that stream2 will be pulling the latest updates written by stream1? Thanks!

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
s not supported. > > kr, Gerard. > > On Mon, Jan 15, 2018 at 11:41 PM, kant kodali wrote: > >> Hi All, >> >> I am wondering if HDFS can be a streaming source like Kafka in Spark >> 2.2.0? For example can I have stream1 reading from Kafka and writing to >> H

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
n changes will not be picked up. > > Regards, > Gourav Sengupta > > On Tue, Jan 16, 2018 at 12:20 AM, kant kodali wrote: > >> Hi, >> >> I am not sure I understand. any examples ? >> >> On Mon, Jan 15, 2018 at 3:45 PM, Gerard Maas >> wrot

is there a way to write a Streaming Dataframe/Dataset to Cassandra with auto mapping?

2018-01-19 Thread kant kodali
Hi All, I was wondering if there is a way to write a Streaming Dataframe/Dataset to Cassandra with auto mapping? By auto mapping I mean mapping the DataSet/Dataframe schema to the Cassandra table schema. I can, for example, get Dataframe.dtypes() and then map Spark SQL types to CQL types, but I was wondering

how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
Hi All, I have a datatype "IntegerType" represented as a String, and now I want to create a DataType object out of it. I couldn't find anything in the DataType or DataTypes API on how to do that. Thanks!

Re: how to create a DataType Object using the String representation in Java using Spark 2.2.0?

2018-01-25 Thread kant kodali
IntegerType", but this string representation "IntegerType" doesn't seem to be very useful because I cannot do DataType.fromJson("IntegerType"). This will throw an error. So I am not quite sure how to construct a DataType given its String representation? On Thu, Jan 25, 2018 at 4:22 PM, ka
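Two workarounds that seem plausible here: DataType.fromJson expects the JSON form of the name (a quoted lowercase string), or an explicit lookup table over the DataTypes constants avoids the parsing question entirely. A sketch:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.DataTypes;

    // fromJson wants the JSON representation, i.e. a quoted lowercase name.
    DataType dt = DataType.fromJson("\"integer\"");  // IntegerType

    // ...or map the "IntegerType"-style names by hand.
    Map<String, DataType> byName = new HashMap<>();
    byName.put("IntegerType", DataTypes.IntegerType);
    byName.put("StringType", DataTypes.StringType);
    byName.put("LongType", DataTypes.LongType);
    DataType dt2 = byName.get("IntegerType");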

How and when are the types of the result set figured out in Spark?

2018-01-28 Thread kant kodali
Hi All, I would like to know how and when the types of the result set are figured out in Spark? For example, say I have the following dataframe: *inputdf* col1 | col2 | col3 --- 1 | 2 | 5 2 | 3 | 6 Now say I do something like below (pseudo SQL): resultdf = select c

is there a way to create a new column with timeuuid using raw Spark SQL?

2018-02-01 Thread kant kodali
Hi All, Is there any way to create a new timeuuid column on an existing dataframe using raw SQL? You can assume that there is a timeuuid UDF function if that helps. Thanks!

Re: is there a way to create a new column with timeuuid using raw Spark SQL?

2018-02-01 Thread kant kodali
> > jg > > > > On Feb 1, 2018, at 05:50, kant kodali wrote: > > > > Hi All, > > > > Is there any way to create a new timeuuid column of a existing dataframe > using raw sql? you can assume that there is a timeuuid udf function if that > helps. > > > > Thanks! > >

can we expect a UUID type in Spark 2.3?

2018-02-02 Thread kant kodali
Hi All, can we expect a UUID type in Spark 2.3? It looks like it could help a lot of downstream sources to model. Thanks!

There is no UDF0 interface?

2018-02-04 Thread kant kodali
Hi All, I see the current UDF APIs can take one or more arguments, but I don't see any UDF0 in Spark 2.2.0. Am I correct? Thanks!

Re: There is no UDF0 interface?

2018-02-04 Thread kant kodali
got it! Not sure why this link https://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest doesn't work. On Sun, Feb 4, 2018 at 12:25 PM, Xiao Li wrote: > The upcoming 2.3 will have it. > > On Sun, Feb 4, 2018 at 12:24 PM kant kodali wrote: > >> Hi All,
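Once 2.3 is out, a zero-argument Java UDF should look roughly like the sketch below, which would also cover the raw-SQL timeuuid question from the earlier thread; uuid_str and the temp view name are assumptions:

    import java.util.UUID;
    import org.apache.spark.sql.api.java.UDF0;
    import org.apache.spark.sql.types.DataTypes;

    sparkSession.udf().register("uuid_str",
        (UDF0<String>) () -> UUID.randomUUID().toString(),
        DataTypes.StringType);

    sparkSession.sql("SELECT uuid_str() AS id, * FROM table").show(false);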

can UDAFs return complex types?

2018-02-10 Thread kant kodali
Hi All, Can UDAFs return complex types? Like, say, a Map with an Integer key and an Array of strings as the value? For example, say I have the following *input dataframe* id | name | amount - 1 | foo | 10 2 | bar | 15 1 | car | 20 1 | bus | 20 and my

stdout: org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in

2018-02-14 Thread kant kodali
Hi All, I get an AnalysisException when I run the following query: spark.sql(select current_timestamp() as tsp, count(*) from table group by window(tsp, '5 minutes')) I just want to create a processing-time column and run some simple stateful query like the above. I understand current_timestamp
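One workaround sketch that may avoid the exception: materialize the processing-time column with withColumn first, then group on it (current_timestamp and window live in org.apache.spark.sql.functions; df stands in for the streaming table):

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> counts = df
        .withColumn("tsp", current_timestamp())   // processing-time column
        .groupBy(window(col("tsp"), "5 minutes"))
        .count();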

Does the classloader used by Spark block the I/O calls from UDFs?

2018-02-16 Thread kant kodali
Hi All, Does the classloader used by Spark block the I/O calls from UDFs? If not, for security reasons wouldn't it make sense to block I/O calls within the UDF code? Thanks!

can we do a self-join on a streaming dataset in 2.2.0?

2018-02-17 Thread kant kodali
Hi All, I know that stream-to-stream joins are not yet supported. From the text below, I wonder if we can do self-joins on the same streaming dataset/dataframe in 2.2.0, since there are no two explicit streaming datasets or dataframes? Thanks!! In Spark 2.3, we have added support for stream-stre

The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Hi All, I am reading records from Kafka using Spark 2.2.0 Structured Streaming. I can see my Dataframe has a schema like below. The timestamp column seems to be the same for every record, and I am not sure why? Am I missing something (did I fail to configure something)? Thanks! Column Type key bina

Re: The timestamp column for kafka records doesn't seem to change

2018-02-20 Thread kant kodali
Sorry, please ignore. It works now! On Tue, Feb 20, 2018 at 5:41 AM, kant kodali wrote: > Hi All, > > I am reading records from Kafka using Spark 2.2.0 Structured Streaming. I > can see my Dataframe has a schema like below. The timestamp column seems to > be same for every reco

what is the right syntax for self-joins in Spark 2.3.0?

2018-02-20 Thread kant kodali
Hi All, I have the following code import org.apache.spark.sql.streaming.Trigger val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "join_test").option("startingOffsets", "earliest").load(); jdf.createOrReplaceTempView("table") val

Re: what is the right syntax for self-joins in Spark 2.3.0?

2018-02-20 Thread kant kodali
If I change it to this On Tue, Feb 20, 2018 at 7:52 PM, kant kodali wrote: > Hi All, > > I have the following code > > import org.apache.spark.sql.streaming.Trigger > > val jdf = spark.readStream.format("kafka").option("kafka.bootstrap.servers"

Re: what is the right syntax for self-joins in Spark 2.3.0?

2018-02-20 Thread kant kodali
iew("table") jdf1.createOrReplaceTempView("table") val resultdf = spark.sql("select * from table inner join table1 on table.offset=table1.offset") resultdf.writeStream.outputMode("append").format("console").option("truncate", false).trigger(Tr

Re: what is the right syntax for self-joins in Spark 2.3.0?

2018-02-22 Thread kant kodali
--r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn On Thu, Feb 22, 2018 at 11:25 AM, Tathagata Das wrote: > Hey, > > Thanks for testing out stream-stream joins and reporting this issue. I am > going to take a look at this. > > TD > > > > On Tue, Feb

What happens if I can't fit data into memory while doing a stream-stream join?

2018-02-23 Thread kant kodali
Hi All, I am experimenting with Spark 2.3.0 stream-stream join feature to see if I can leverage it to replace some of our existing services. Imagine I have 3 worker nodes with *each node* having (16GB RAM and 100GB SSD). My input dataset which is in Kafka is about 250GB per day. Now I want to do

Is there a way to query dataframe views directly without going through the scheduler?

2018-02-26 Thread kant kodali
Hi All, I wonder if there is a way to query data frame views directly without going through the scheduler? For example, say I have the following code: DataSet kafkaDf = session.readStream().format("kafka").load(); kafkaDf.createOrReplaceView("table") Now can I query the view "table" without going th

How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
I read through the Spark Structured Streaming documentation and I wonder how Spark Structured Streaming determines that an event has arrived late? Does it compare the event-time with the processing time? Taking the above p
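My understanding from the docs: lateness is judged against the watermark, i.e. the maximum event-time seen so far minus a user-chosen delay, not against processing time. A sketch on an assumed events dataframe with an assumed eventTime column:

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> counts = events
        .withWatermark("eventTime", "10 minutes")  // assumed event-time column
        .groupBy(window(col("eventTime"), "5 minutes"))
        .count();
    // A record is treated as too late, and dropped, when its eventTime falls
    // behind (max eventTime seen so far) minus the 10-minute delay.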

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
in real-time a week ago. >> This is fundamentally necessary for achieving the deterministic processing >> that Structured Streaming guarantees. >> >> Regarding the picture, the "time" is actually "event-time". My apologies >> for not making this clear i

Re: what is the right syntax for self-joins in Spark 2.3.0?

2018-03-06 Thread kant kodali
all but a small fraction of self-join > queries. That small fraction can produce incorrect results, and part 2 > avoids that. > > So for 2.3.1, we can enable self-joins by merging only part 1, but it can > give wrong results in some cases. I think that is strictly worse than no >

Re: what is the right syntax for self-joins in Spark 2.3.0?

2018-03-06 Thread kant kodali
Sorry I meant Spark 2.4 in my previous email On Tue, Mar 6, 2018 at 9:15 PM, kant kodali wrote: > Hi TD, > > I agree I think we are better off either with a full fix or no fix. I am > ok with the complete fix being available in master or some branch. I guess > the solution fo

Re: what is the right syntax for self-joins in Spark 2.3.0?

2018-03-07 Thread kant kodali
append mode? Anyway, the moment it is in master I am ready to test, so JIRA tickets on this would help to keep track. Please let me know. Thanks! On Tue, Mar 6, 2018 at 9:16 PM, kant kodali wrote: > Sorry I meant Spark 2.4 in my previous email > > On Tue, Mar 6, 2018 at 9:15 PM, kan

Re: what is the right syntax for self-joins in Spark 2.3.0?

2018-03-10 Thread kant kodali
uite a large range of use cases, including > multiple cascading joins. > > TD > > > > On Thu, Mar 8, 2018 at 9:18 AM, Gourav Sengupta > wrote: > >> super interesting. >> >> On Wed, Mar 7, 2018 at 11:44 AM, kant kodali wrote: >> >>

How to run spark shell using YARN

2018-03-12 Thread kant kodali
Hi All, I am trying to use YARN for the very first time. I believe I configured the resource manager and name node fine. And then I run the below command: ./spark-shell --master yarn --deploy-mode client *I get the below output and it hangs there forever* (I had been waiting over 10 minutes)

Re: How to run spark shell using YARN

2018-03-12 Thread kant kodali
Mon, Mar 12, 2018 at 4:42 PM, kant kodali wrote: > > Hi All, > > > > I am trying to use YARN for the very first time. I believe I configured > all > > the resource manager and name node fine. And then I run the below command > > > > ./spark-shell --master y

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
I set core-site.xml, hdfs-site.xml, yarn-site.xml as per this website, and these are the only three files I changed. Do I need to set or change anything in mapred-site.xml (as of now I have not touched mapred-site.xml)? When I do yarn -node -l

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
any idea? On Wed, Mar 14, 2018 at 12:12 AM, kant kodali wrote: > I set core-site.xml, hdfs-site.xml, yarn-site.xml as per this website > <https://dwbi.org/etl/bigdata/183-setup-hadoop-cluster> and these are the > only three files I changed Do I need to set or change anyth

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
, Mar 14, 2018 at 12:12 AM, kant kodali wrote: > any idea? > > On Wed, Mar 14, 2018 at 12:12 AM, kant kodali wrote: > >> I set core-site.xml, hdfs-site.xml, yarn-site.xml as per this website >> <https://dwbi.org/etl/bigdata/183-setup-hadoop-cluster> and these are >&g

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
by >> updating the following property: >> >> <name>yarn.nodemanager.resource.memory-mb</name> >> <value>8192</value> >> >> https://stackoverflow.com/questions/45687607/waiting-for-am-container-to-be-allocated-launched-and-register-with-rm >> >&

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
). My mapred-site.xml is empty. Do I even need this? If so, what should I set it to? On Wed, Mar 14, 2018 at 2:46 AM, Femi Anthony wrote: > What's the hardware configuration of the box you're running on i.e. how > much memory does it have ? > > Femi > > On Wed, Mar

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
mazon.com/premiumsupport/knowledge-center/restart-service-emr/ > > Femi > > *From: *kant kodali > *Date: *Wednesday, March 14, 2018 at 6:16 AM > *To: *Femi Anthony > *Cc: *vermanurag , "user @spark" < > user@spark.apache.org> > *Subjec

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Do I need to set SPARK_DIST_CLASSPATH or SPARK_CLASSPATH? The latest version of Spark (2.3) only has SPARK_CLASSPATH. On Wed, Mar 14, 2018 at 11:37 AM, kant kodali wrote: > Hi, > > I am not using emr. And yes I restarted several times. > > On Wed, Mar 14, 2018 at 6:35 AM, A

is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1?

2018-03-16 Thread kant kodali
Hi All, is it possible to use Spark 2.3.0 along with Kafka 0.9.0.1? Thanks, kant

select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
Hi All, I have 10 million records in Kafka and I am just trying to run spark.sql("select count(*) from kafka_view"). I am reading from Kafka and writing to Kafka. My writeStream is set to "update" mode with a trigger interval of one second (Trigger.ProcessingTime(1000)). I expect the counts to be prin

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-19 Thread kant kodali
trigger time, you will > notice warnings that it is 'falling behind' (I forget the exact verbiage, > but something to the effect of the calculation took XX time and is falling > behind). In that case, it will immediately check kafka for new messages and > begin processing the n

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
; may be what you’re looking for: > >- spark.streaming.backpressure.enabled >- spark.streaming.backpressure.initialRate >- spark.streaming.receiver.maxRate > - spark.streaming.kafka.maxRatePerPartition > > ​ > > On Mon, Mar 19, 2018 at 5:27 PM, kant kodali

Re: select count * doesn't seem to respect update mode in Kafka Structured Streaming?

2018-03-20 Thread kant kodali
eric mechanism into the > Streaming DataSource V2 so that the engine can do admission control on the > amount of data returned in a source independent way. > > On Tue, Mar 20, 2018 at 2:58 PM, kant kodali wrote: > >> I am using spark 2.3.0 and Kafka 0.10.2.0 so I assume structure
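Until that generic admission-control mechanism lands, the Kafka-source-specific knob appears to be maxOffsetsPerTrigger, which caps how much is read per micro-batch; a sketch with assumed broker and topic values:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> df = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
        .option("subscribe", "my_topic")                      // assumed topic
        .option("maxOffsetsPerTrigger", "100000")             // cap per micro-batch
        .load();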

Is there a mutable dataframe in Spark Structured Streaming 2.3.0?

2018-03-21 Thread kant kodali
Hi All, Is there a mutable dataframe in Spark Structured Streaming 2.3.0? I am currently reading from Kafka, and if I cannot parse the messages that I get from Kafka I want to write them to, say, some "dead_queue" topic. I wonder what is the best way to do this? Thanks!
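Dataframes are immutable, but a dead-letter topic does not require mutation; the stream can simply be split. A sketch, assuming raw is the Kafka source dataframe and schema is the expected StructType (from_json yields null for rows it cannot parse):

    import static org.apache.spark.sql.functions.*;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> parsed = raw
        .select(col("value").cast("string").as("json"))
        .withColumn("data", from_json(col("json"), schema));

    // Route the unparseable rows to a dead-letter topic...
    parsed.filter(col("data").isNull())
        .select(col("json").as("value"))
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
        .option("topic", "dead_queue")
        .option("checkpointLocation", "/tmp/ckpt-dead")       // assumed path
        .start();

    // ...and keep processing the good ones.
    Dataset<Row> good = parsed.filter(col("data").isNotNull()).select("data.*");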
