Spark Structured Streaming Inner Queries fails

2018-04-05 Thread Aakash Basu
Hi, Why are inner queries not allowed in Spark Streaming? Spark assumes the inner query to be a separate stream altogether and expects it to be triggered with a separate writeStream.start(). Why so? Error: pyspark.sql.utils.StreamingQueryException: 'Queries with streaming sources must be execute

Re: Writing record once after the watermarking interval in Spark Structured Streaming

2018-03-29 Thread Bowden, Chris
ing record once after the watermarking interval in Spark Structured Streaming I have the following query: val ds = dataFrame .filter(! $"requri".endsWith(".m3u8")) .filter(! $"bserver".contains("trimmer")) .withWatermark("time", &

Writing record once after the watermarking interval in Spark Structured Streaming

2018-03-29 Thread karthikjay
I have the following query: val ds = dataFrame .filter(! $"requri".endsWith(".m3u8")) .filter(! $"bserver".contains("trimmer")) .withWatermark("time", "120 seconds") .groupBy(window(dataFrame.col("time"),"60 seconds"),col("channelName")) .agg(sum("bytes")/100

Apache Spark - Structured Streaming State Management With Watermark

2018-03-28 Thread M Singh
Hi: I am using Apache Spark Structured Streaming (2.2.1) to implement custom sessionization for events.  The processing is in two steps:1. flatMapGroupsWithState (based on user id) - which stores the state of user and emits events every minute until a expire event is received 2. The next step

Apache Spark - Structured Streaming StreamExecution Stats Description

2018-03-28 Thread M Singh
Hi: I am using spark structured streaming 2.2.1 and am using flatMapGroupWithState and a groupBy count operators. In the StreamExecution logs I see two enteries for stateOperators "stateOperators" : [ {     "numRowsTotal" : 1617339,     "numRowsUpdated" : 9647   },

Re: Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-23 Thread kant kodali
oo. >> >> Thanks, >> Aakash. >> >> On Thu, Mar 22, 2018 at 7:50 AM, kant kodali wrote: >> >>> Hi All, >>> >>> Is there a mutable dataframe spark structured streaming 2.3.0? I am >>> currently reading from Kafka and if I cannot parse the messages that I get >>> from Kafka I want to write them to say some "dead_queue" topic. >>> >>> I wonder what is the best way to do this? >>> >>> Thanks! >>> >> >> >> >

Apache Spark Structured Streaming - How to keep executor alive.

2018-03-23 Thread M Singh
Hi: I am working on spark structured streaming (2.2.1) with kafka and want 100 executors to be alive. I set spark.executor.instances to be 100.  The process starts running with 100 executors but after some time only a few remain which causes backlog of events from kafka.  I thought I saw a

Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi: I am working on a realtime application using spark structured streaming (v 2.2.1). The application reads data from kafka and if there is a failure, I would like to ignore the checkpoint.  Is there any configuration to just read from last kafka offset after a failure and ignore any offset

Re: Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread Tathagata Das
Structured Streaming AUTOMATICALLY saves the offsets in a checkpoint directory that you provide. And when you start the query again with the same directory it will just pick up where it left off. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failur

Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread M Singh
Hi: I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR and for the last few days, after running the application for 30-60 minutes get exception from Kafka Consumer included below. The structured streaming application is processing 1 minute worth of data from kafka topic. So I've tried

Re: Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-22 Thread kant kodali
chain with "*Multiple Kafka Spark Streaming Dataframe Join query*" as > subject, TD and Chris has cleared my doubts, it would help you too. > > Thanks, > Aakash. > > On Thu, Mar 22, 2018 at 7:50 AM, kant kodali wrote: > >> Hi All, >> >> Is there a mutable da

Re: Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-22 Thread Jorge Machado
gt; TD and Chris has cleared my doubts, it would help you too. > > Thanks, > Aakash. > > On Thu, Mar 22, 2018 at 7:50 AM, kant kodali <mailto:kanth...@gmail.com>> wrote: > Hi All, > > Is there a mutable dataframe spark structured streaming 2.3.0? I am currently &g

Re: Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-22 Thread Aakash Basu
> Hi All, > > Is there a mutable dataframe spark structured streaming 2.3.0? I am > currently reading from Kafka and if I cannot parse the messages that I get > from Kafka I want to write them to say some "dead_queue" topic. > > I wonder what is the best way to do this? > > Thanks! >

Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-21 Thread kant kodali
Hi All, Is there a mutable dataframe spark structured streaming 2.3.0? I am currently reading from Kafka and if I cannot parse the messages that I get from Kafka I want to write them to say some "dead_queue" topic. I wonder what is the best way to do this? Thanks!

[Spark Structured Streaming, Spark 2.3.0] Calling current_timestamp() function within a streaming dataframe results in dataType error

2018-03-19 Thread Artem Moskvin
Hi all, There's probably a regression in Spark 2.3.0. Running the code below in 2.2.1 succeeds but in 2.3.0 results in error `org.apache.spark.sql.streaming.StreamingQueryException: Invalid call to dataType on unresolved object, tree: 'current_timestamp`. ``` import org.apache.spark.sql.functions

Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Lian Jiang
to drop > data older than x days outside streaming job. > > Sunil Parmar > > On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang > wrote: > >> I have a spark structured streaming job which dump data into a parquet >> file. To avoid the parquet file grows infinitely, I want

Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Sunil Parmar
Can you use partitioning ( by day ) ? That will make it easier to drop data older than x days outside streaming job. Sunil Parmar On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang wrote: > I have a spark structured streaming job which dump data into a parquet > file. To avoid the parque

retention policy for spark structured streaming dataset

2018-03-14 Thread Lian Jiang
I have a spark structured streaming job which dump data into a parquet file. To avoid the parquet file grows infinitely, I want to discard 3 month old data. Does spark streaming supports this? Or I need to stop the streaming job, trim the parquet file and restart the streaming job? Thanks for any

Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-28 Thread Tathagata Das
I made a JIRA for it - https://issues.apache.org/jira/browse/SPARK-23539 Unfortunately it is blocked by Kafka version upgrade, which has a few nasty issues related to Kafka bugs - https://issues.apache.org/jira/browse/SPARK-18057 On Wed, Feb 28, 2018 at 3:17 PM, karthikus wrote: > TD, > > Thanks

Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-28 Thread karthikus
TD, Thanks for your response. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
in real-time a week ago. >> This is fundamentally necessary for achieving the deterministic processing >> that Structured Streaming guarantees. >> >> Regarding the picture, the "time" is actually "event-time". My apologies >> for not making this clear i

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread naresh Goud
he picture. In hindsight, the picture can be > made much better. :) > > Hope this explanation helps! > > TD > > On Tue, Feb 27, 2018 at 2:26 AM, kant kodali wrote: > >> I read through the spark structured streaming documentation and I wonder >> how does spar

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread Tathagata Das
;time" is actually "event-time". My apologies for not making this clear in the picture. In hindsight, the picture can be made much better. :) Hope this explanation helps! TD On Tue, Feb 27, 2018 at 2:26 AM, kant kodali wrote: > I read through the spark structured streaming docu

Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-27 Thread Tathagata Das
Unfortunately, exposing Kafka headers is not yet supported in Structured Streaming. The community is more than welcome to add support for it :) On Tue, Feb 27, 2018 at 2:51 PM, Karthik Jayaraman wrote: > Hi all, > > I am using Spark 2.2.1 Structured Streaming to read messages from Kafka. I > wou

[Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-27 Thread Karthik Jayaraman
Hi all, I am using Spark 2.2.1 Structured Streaming to read messages from Kafka. I would like to know how to access the Kafka headers programmatically ? Since the Kafka message header support is introduced in Kafka 0.11 (https://issues.apache.org/jira/browse/KAFKA-4208

How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
I read through the spark structured streaming documentation and I wonder how does spark structured streaming determine an event has arrived late? Does it compare the event-time with the processing time? [image: enter image description here] <https://i.stack.imgur.com/CXH4i.png> Taking the

Re: Spark structured streaming: periodically refresh static data frame

2018-02-25 Thread naresh Goud
Appu, I am also landed in same problem. Are you able to solve this issue? Could you please share snippet of code if your able to do? Thanks, Naresh On Wed, Feb 14, 2018 at 8:04 PM, Tathagata Das wrote: > 1. Just loop like this. > > > def startQuery(): Streaming Query = { >// Define the da

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
Hi Vijay: I am using spark-shell because I am still prototyping the steps involved. Regarding executors - I have 280 executors and UI only show a few straggler tasks on each trigger.  The UI does not show too much time spend on GC.  suspect the delay is because of getting data from kafka. The num

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread vijay.bvp
Instead of spark-shell have you tried running it as a job. how many executors and cores, can you share the RDD graph and event timeline on the UI and did you find which of the tasks taking more time was they are any GC please look at the UI if not already it can provide lot of information -

Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi: I am working with spark structured streaming (2.2.1) reading data from Kafka (0.11).  I need to aggregate data ingested every minute and I am using spark-shell at the moment.  The message rate ingestion rate is approx 500k/second.  During some trigger intervals (1 minute) especially when

Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Tathagata Das
1. Just loop like this. def startQuery(): Streaming Query = { // Define the dataframes and start the query } // call this on main thread while (notShutdown) { val query = startQuery() query.awaitTermination(refreshIntervalMs) query.stop() // refresh static data } 2. Yes, stream-

Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Appu K
TD, Thanks a lot for the quick reply :) Did I understand it right that in the main thread, to wait for the termination of the context I'll not be able to use outStream.awaitTermination() - [ since i'll be closing in inside another thread ] What would be a good approach to keep the main app l

Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Tathagata Das
Let me fix my mistake :) What I suggested in that earlier thread does not work. The streaming query that joins a streaming dataset with a batch view, does not correctly pick up when the view is updated. It works only when you restart the query. That is, - stop the query - recreate the dataframes, -

Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Appu K
More specifically, Quoting TD from the previous thread "Any streaming query that joins a streaming dataframe with the view will automatically start using the most updated data as soon as the view is updated” Wondering if I’m doing something wrong in https://gist.github.com/anonymous/90dac8efadca3

Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Appu K
Hi, I had followed the instructions from the thread https://mail-archives.apache.org/mod_mbox/spark-user/201704.mbox/%3cd1315d33-41cd-4ba3-8b77-0879f3669...@qvantel.com%3E while trying to reload a static data frame periodically that gets joined to a structured streaming query. However, the stream

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
description of the query status fields for spark structured streaming ? StreamExecution: Streaming query made progress: {   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",   "name" : null,   "t

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread Richard Qiao
the elements for two group counts. The output is send to console > sink. I see the following output (with my questions in bold). > > Please me know where I can find detailed description of the query status > fields for spark structured streaming ? > > > StreamExe

Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-10 Thread M Singh
Just checking if anyone has any pointers for dynamically updating query state in structured streaming. Thanks On Thursday, February 8, 2018 2:58 PM, M Singh wrote: Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that needs to update the state

Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-10 Thread M Singh
in bold). Please me know where I can find detailed description of the query status fields for spark structured streaming ? StreamExecution: Streaming query made progress: {   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",   "runId" : "21778fbb-406c-4c65-

Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-08 Thread M Singh
Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that needs to update the state periodically. Here is the scenario: 1. I have a udf with a variable with default value (eg: 1)  This value is applied to a column (eg: subtract the variable from the column value )2

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
that's the purpose of watermarking it. That's my understanding at least. Pozdrawiam,Jacek Laskowskihttps://about.me/JacekLaskowskiMastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams http

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread Jacek Laskowski
Hi, What would you expect? The data is simply dropped as that's the purpose of watermarking it. That's my understanding at least. Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming http

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how watermark works in cases where event time is earlier than processing time stamp. On Friday, February 2, 2018 8:47 AM, M Singh wrote: Hi Vishu/Jacek: Thanks for your responses. Jacek - At the moment, the current time for my use case is proc

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
Hi Vishu/Jacek: Thanks for your responses. Jacek - At the moment, the current time for my use case is processing time. Vishnu - Spark documentation (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) does indicate that it can dedup using watermark.  So I believe th

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
> > On Wed, Jan 31, 2018 at 7:18 PM, Divya Gehlot > wrote: > >> Hi , >> I see ,Does that means Spark structured streaming doesn't work with >> Twitter streams ? >> I could see people used kafka or other streaming tools and used spark to >> process

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Tathagata Das
into Kafka/Kinesis. And then just read from Kafka/Kinesis. Hope this helps. TD On Wed, Jan 31, 2018 at 7:18 PM, Divya Gehlot wrote: > Hi , > I see ,Does that means Spark structured streaming doesn't work with > Twitter streams ? > I could see people used kafka or other stream

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
Hi , I see ,Does that means Spark structured streaming doesn't work with Twitter streams ? I could see people used kafka or other streaming tools and used spark to process the data in structured streaming . The below doesn't work directly with Twitter Stream until I set up Kafka ?

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Tathagata Das
Hello Divya, To add further clarification, the Apache Bahir does not have any Structured Streaming support for Twitter. It only has support for Twitter + DStreams. TD On Wed, Jan 31, 2018 at 2:44 AM, vermanurag wrote: > Twitter functionality is not part of Core Spark. We have successfully us

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-31 Thread Vishnu Viswanath
Hi Mans, Watermark is Spark is used to decide when to clear the state, so if the even it delayed more than when the state is cleared by Spark, then it will be ignored. I recently wrote a blog post on this : http://vishnuviswanath.com/spark_structured_streaming.html#watermark Yes, this State is ap

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread vermanurag
Twitter functionality is not part of Core Spark. We have successfully used the following packages from maven central in past org.apache.bahir:spark-streaming-twitter_2.11:2.2.0 Earlier there used to be a twitter package under spark, but I find that it has not been updated beyond Spark 1.6 org.ap

Spark Structured Streaming for Twitter Streaming data

2018-01-30 Thread Divya Gehlot
Hi, I am exploring the spark structured streaming . When turned to internet to understand about it I could find its more integrated with Kafka or other streaming tool like Kenesis. I couldnt find where we can use Spark Streaming API for twitter streaming data . Would be grateful ,if any body used

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread Jacek Laskowski
o filter out records which are lagging behind (based on event time) by a certain amount of time." Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mas

Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread M Singh
Hi: I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time.   Is the watermark api applicable to this scenario (ie, filtering lagging records) or it is only applicable with aggregation ?  I could not get a clear understanding from the documen

Re: [Spark structured streaming] Use of (flat)mapgroupswithstate takes long time

2018-01-23 Thread Christiaan Ras
n Ras Cc: user Subject: Re: [Spark structured streaming] Use of (flat)mapgroupswithstate takes long time For computing mapGroupsWithState, can you check the following. - How many tasks are being launched in the reduce stage (that is, the stage after the shuffle, that is computing mapGroupsWithSta

Re: [Spark structured streaming] Use of (flat)mapgroupswithstate takes long time

2018-01-22 Thread Tathagata Das
For computing mapGroupsWithState, can you check the following. - How many tasks are being launched in the reduce stage (that is, the stage after the shuffle, that is computing mapGroupsWithState) - How long each task is taking? - How many cores does the cluster have? On Thu, Jan 18, 2018 at 11:28

[Spark structured streaming] Use of (flat)mapgroupswithstate takes long time

2018-01-18 Thread chris-sw
Hi, I recently did some experiments with stateful structured streaming by using flatmapgroupswithstate. The streaming application is quit simple: It receives data from Kafka, feed it to the stateful operator (flatmapgroupswithstate) and sinks the output to console. During a test with small dataset

Re: Spark structured streaming time series forecasting

2018-01-09 Thread Tathagata Das
. On Mon, Jan 8, 2018 at 7:04 AM, Bogdan Cojocar wrote: > Hello, > > Is there a method to do time series forecasting in spark structured > streaming? Is there any integration going on with spark-ts or a similar > library? > > Many thanks, > Bogdan Cojocar >

Spark structured streaming time series forecasting

2018-01-08 Thread Bogdan Cojocar
Hello, Is there a method to do time series forecasting in spark structured streaming? Is there any integration going on with spark-ts or a similar library? Many thanks, Bogdan Cojocar

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-30 Thread M Singh
Thanks Eyal - it appears that these are the same patterns used for spark DStreams. On Wednesday, December 27, 2017 1:15 AM, Eyal Zituny wrote: Hiif you're interested in stopping you're spark application externally, you will probably need a way to communicate with the spark driver  (wh

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-27 Thread Eyal Zituny
Hi if you're interested in stopping you're spark application externally, you will probably need a way to communicate with the spark driver (which start and holds a ref to the spark context) this can be done by adding some code to the driver app, for example: - you can expose a rest api that st

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-26 Thread M Singh
Thanks Diogo.  My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira wrote: Hi M Singh! Here I'm using query.stop() Em 25 de dez de 2017 19:19, "M Singh" escreveu: Hi:

Re: Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread Diogo Munaro Vieira
Can you please post here your code? Em 25 de dez de 2017 19:24, "M Singh" escreveu: > Hi: > > I am using spark structured streaming (v 2.2.0) to read data from files. I > have configured checkpoint location. On stopping and restarting the > application, it loo

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread Diogo Munaro Vieira
Hi M Singh! Here I'm using query.stop() Em 25 de dez de 2017 19:19, "M Singh" escreveu: > Hi: > Are there any patterns/recommendations for gracefully stopping a > structured streaming application ? > Thanks > > >

Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread M Singh
Hi: I am using spark structured streaming (v 2.2.0) to read data from files. I have configured checkpoint location. On stopping and restarting the application, it looks like it is reading the previously ingested files.  Is that expected behavior ?   Is there anyway to prevent reading files that

Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread M Singh
Hi:Are there any patterns/recommendations for gracefully stopping a structured streaming application ?Thanks

Re: is Union or Join Supported for Spark Structured Streaming Queries in 2.2.0?

2017-12-16 Thread Jacek Laskowski
u cannot have datasets of different schema in a query. You'd have to use the most wide schema to cover all schemas. p.s. Have you tried anything...spark-shell's your friend, my friend :) Pozdrawiam, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://

is Union or Join Supported for Spark Structured Streaming Queries in 2.2.0?

2017-12-13 Thread kant kodali
Hi All, I have messages in a queue that might be coming in with few different schemas like msg 1 schema 1, msg2 schema2, msg3 schema3, msg 4 schema1 I want to put all of this in one data frame. is it possible with structured streaming? I am using Spark 2.2.0 Thanks!

Spark Structured Streaming how to read data from AWS SQS

2017-12-11 Thread Bogdan Cojocar
For spark streaming there are connectors <https://github.com/imapi/spark-sqs-receiver> that can achieve this functionality. Unfortunately for spark structured streaming I couldn't find any as it's a newer technology. Is there a way to connect to a source using a spark streaming

Spark Structured Streaming + Kafka

2017-11-14 Thread Agostino Calamita
Hi, I have a problem with Structured Streaming and Kafka. I have 2 brokers and a topic with 8 partitions and replication factor 2. This is my driver program: public static void main(String[] args) { SparkSession spark = SparkSession .builder() .app

Re: Generate windows on processing time in Spark Structured Streaming

2017-11-10 Thread Michael Armbrust
Hmmm, we should allow that. current_timestamp() is acutally deterministic within any given batch. Could you open a JIRA ticket? On Fri, Nov 10, 2017 at 1:52 AM, wangsan wrote: > Hi all, > > How can I use current processing time to generate windows in streaming > processing? > *window* function

Generate windows on processing time in Spark Structured Streaming

2017-11-10 Thread wangsan
Hi all, How can I use current processing time to generate windows in streaming processing? window function's Scala doc says "For a streaming query, you may use the function current_timestamp to generate windows on processing time.” But when using current_timestamp as column in window functio

Re: [Spark Structured Streaming] Changing partitions of (flat)MapGroupsWithState

2017-11-08 Thread Michael Armbrust
The relevant config is spark.sql.shuffle.partitions. Note that once you start a query, this number is fixed. The config will only affect queries starting from an empty checkpoint. On Wed, Nov 8, 2017 at 7:34 AM, Teemu Heikkilä wrote: > I have spark structured streaming job and I'm c

[Spark Structured Streaming] Changing partitions of (flat)MapGroupsWithState

2017-11-08 Thread Teemu Heikkilä
I have spark structured streaming job and I'm crunching through few terabytes of data. I'm using file stream reader and it works flawlessly, I can adjust the partitioning of that with spark.default.parallelism However I'm doing sessionization for the data after loading it and

Re: Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-26 Thread Prashant Sharma
Hi Darshan, Did you try passing the config directly as an option, like this: .option("kafka.sasl.jaas.config", saslConfig) Where saslConfig can look like: com.sun.security.auth.module.Krb5LoginModule required \ useKeyTab=true \ storeKey=true \ keyTab="/etc/security/key

Re: Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-16 Thread Darshan Pandya
HI Burak, Well turns out it worked fine when i submit in cluster mode. I also tried to convert my app in dstreams. In dstreams too it works well only when deployed in cluster mode. Here is how i configured the stream. val lines = spark.readStream .format("kafka") .option("kafka.bootstrap.se

Re: Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-16 Thread Burak Yavuz
Hi Darshan, How are you creating your kafka stream? Can you please share the options you provide? spark.readStream.format("kafka") .option(...) // all these please .load() On Sat, Oct 14, 2017 at 1:55 AM, Darshan Pandya wrote: > Hello, > > I'm using Spark 2.1.0 on CDH 5.8 with kafka 0.10.

Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-14 Thread Darshan Pandya
Hello, I'm using Spark 2.1.0 on CDH 5.8 with kafka 0.10.0.1 + kerberos I am unable to connect to the kafka broker with the following message 17/10/14 14:29:10 WARN clients.NetworkClient: Bootstrap broker 10.197.19.25:9092 disconnected and is unable to consume any messages. And am using it as

[Spark Structured Streaming] How to select events by latest timestamp and aggregate count

2017-10-08 Thread Li Zuwei
I would like to perform structured streaming aggregation with a windowing period. Given this following data schema. The objective is to filter by the latest occurring event based on user. Then aggregate the count of each event type for each location. timelocation user type 1A

Spark structured streaming - join static dataset with streaming dataset

2017-10-05 Thread jithinpt
I'm using Spark structured streaming to process records read from Kafka. Here's what I'm trying to achieve: (a) Each record is a Tuple2 of type (Timestamp, DeviceId). (b) I've created a static Dataset[DeviceId] which contains the set of all valid device IDs (of type DeviceI

Any rabbit mq connect for spark structured streaming ?

2017-10-05 Thread Darshan Pandya
-- Sincerely, Darshan

Re: Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
.option("kafka.bootstrap.servers", "localhost:9092") .option("subscribe", "hello")) .option("startingOffsets", "earliest") .load() .writeStream() .format("console") .start(); query.awaitTermination(); On Mon, Sep 11, 2017 at 6:24 PM

Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0?

2017-09-11 Thread kant kodali
Hi All, Does Kafka dependency jars changed for Spark Structured Streaming 2.2.0? kafka-clients-0.10.0.1.jar spark-streaming-kafka-0-10_2.11-2.2.0.jar 1) Above two are the only Kafka related jars or am I missing something? 2) What is the difference between the above two jars? 3) If I have

Re: Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread Michael Armbrust
Checkpoints record what has been processed for a specific query, and as such only need to be defined when writing (which is how you "start" a query). You can use the DataFrame created with readStream to start multiple queries, so it wouldn't really make sense to have a single checkpoint there. On

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Eduardo D'Avila
Burak, thanks for the resources. I was thinking that the trigger interval and the sliding window were the same thing, but now I am confused. I didn't know there was a .trigger() method, since the official Programming Guide

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Burak Yavuz
Hi Eduardo, What you have written out is to output counts "as fast as possible" for windows of 5 minute length and with a sliding window of 1 minute. So for a record at 10:13, you would get that record included in the count for 10:09-10:14, 10:10-10:15, 10:11-10:16, 10:12-10:16, 10:13-10:18. Plea

[Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Eduardo D'Avila
Hi, I'm trying to use Spark 2.1.1 structured streaming to *count the number of records* from Kafka *for each time window* with the code in this GitHub gist . I expected that, *once each minute* (the slide duration), it would *outp

Need some Clarification on checkpointing w.r.t Spark Structured Streaming

2017-09-11 Thread kant kodali
Hi All, I was wondering if we need to checkpoint both read and write streams when reading from Kafka and inserting into a target store? for example sparkSession.readStream().option("checkpointLocation", "hdfsPath").load() vs dataSet.writeStream().option("checkpointLocation", "hdfsPath") Thank

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread Paul
ea. >> >>> On Thu, Sep 7, 2017 at 2:40 AM, kant kodali wrote: >>> Hi All, >>> >>> I am wondering if it is ok to have multiple sparksession's in one spark >>> structured streaming app? Basically, I want to create 1) Spark session for >&

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread kant kodali
d > idea. > > On Thu, Sep 7, 2017 at 2:40 AM, kant kodali wrote: > >> Hi All, >> >> I am wondering if it is ok to have multiple sparksession's in one spark >> structured streaming app? Basically, I want to create 1) Spark session for >> reading from Kafka

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread Arkadiusz Bicz
#x27;s in one spark > structured streaming app? Basically, I want to create 1) Spark session for > reading from Kafka and 2) Another Spark session for storing the mutations > of a dataframe/dataset to a persistent table as I get the mutations from > #1? > > Finally, is this a common practice? > > Thanks, > kant >

is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-06 Thread kant kodali
Hi All, I am wondering if it is ok to have multiple sparksession's in one spark structured streaming app? Basically, I want to create 1) Spark session for reading from Kafka and 2) Another Spark session for storing the mutations of a dataframe/dataset to a persistent table as I get the muta

Spark Structured Streaming and compacted topic in Kafka

2017-09-06 Thread Olivier Girardot
Hi everyone, I'm aware of the issue regarding direct stream 0.10 consumer in spark and compacted topics (c.f. https://issues.apache.org/jira/browse/SPARK-17147). Is there any chance that spark structured-streaming kafka is compatible with compacted topics ? Regards, -- *Olivier Girardot*

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
last few days >>>>>> of max, then you should set watermark on the timestamp column. >>>>>> >>>>>> *trainTimesDataset* >>>>>> * .withWatermark("**activity_timestamp", "5 days")* >>>>>> * .groupBy(window(activity_tim

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread Tathagata Das
gt;>>>> >>>>> On Tue, Aug 29, 2017 at 2:06 PM, kant kodali >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> Thanks for the response. Since this is a streaming based query and in >>>

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
; Thanks for the response. Since this is a streaming based query and in >>>>> my case I need to hold state for 24 hours which I forgot to mention in my >>>>> previous email. can I do ? >>>>> >>>>> *trainTimesDataset.groupBy(window(activity_timestamp, "24

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
p, "24 hours", "24 >>>> hours"), "train", "dest").max("time")* >>>> >>>> >>>> On Tue, Aug 29, 2017 at 1:38 PM, Tathagata Das < >>>> tathagata.das1...@gmail.com> wrote: >>>> >>>>>

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Tathagata Das
: *trainTimesDataset.groupBy("train", "dest").max("time")* >>>> >>>> >>>> *SQL*: *"select train, dest, max(time) from trainTimesView group by >>>> train, dest"*// after calling >>>> *trainTimesDa

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
;>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Thanks for the response. Since this is a streaming based query and >>>>>>>> in my case I need to hold state for 24 hours which I forgot to

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread ayan guha
need to hold state for 24 hours which I forgot to mention >>>>>>> in >>>>>>> my previous email. can I do ? >>>>>>> >>>>>>> *trainTimesDataset.groupBy(window(activity_timestamp, "24 hours", >>&

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
to mention in my >>>>>> previous email. can I do ? >>>>>> >>>>>> *trainTimesDataset.groupBy(window(activity_timestamp, "24 hours", >>>>>> "24 hours"), "train", "dest").max("time&q

<    1   2   3   4   5   6   >