Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
That makes a lot of sense now! I am looking for a tumbling window, so my window interval and batch interval are both 24 hours. Every day I want to start with a fresh state. Finally, since you said I need to do bookkeeping of the "last 24 hours": do you mean I need to do this in some external store and then
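A minimal sketch of the 24-hour tumbling-window idea discussed in this thread, assuming Spark 2.2 Structured Streaming; the rate source and the key/eventTime column names are stand-ins, not from the original mail:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, max, window}

val spark = SparkSession.builder.appName("Tumbling24hSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in streaming source: the built-in "rate" source emits (timestamp, value) rows.
// In the thread the data would come from Kafka; this only keeps the sketch runnable.
val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withColumn("key", expr("value % 10"))          // hypothetical key column
  .withColumnRenamed("timestamp", "eventTime")

// 24-hour tumbling window: each event belongs to exactly one day-long bucket,
// so per-window state is dropped once the watermark moves past the window end.
val daily = events
  .withWatermark("eventTime", "24 hours")
  .groupBy($"key", window($"eventTime", "24 hours"))
  .agg(max($"eventTime").as("latestEventTime"))

val query = daily.writeStream.outputMode("update").format("console").start()
```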

Re: Time window on Processing Time

2017-08-30 Thread madhu phatak
Hi, That's great. Thanks a lot. On Wed, Aug 30, 2017 at 10:44 AM, Tathagata Das wrote: > Yes, it can be! There is a sql function called current_timestamp() which > is self-explanatory. So I believe you should be able to do something like > > import
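For reference, a rough sketch of the current_timestamp() approach described above, i.e. stamping each row with processing time and then windowing on that column; the rate source and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, current_timestamp, window}

val spark = SparkSession.builder.appName("ProcTimeWindowSketch").master("local[*]").getOrCreate()
import spark.implicits._

val input = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

// Stamp every row with the wall-clock time at which it is processed,
// then aggregate over a tumbling window on that processing-time column.
val counts = input
  .withColumn("procTime", current_timestamp())
  .groupBy(window($"procTime", "1 minute"))
  .agg(count("*").as("events"))

counts.writeStream.outputMode("complete").format("console").start()
```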

Re: Do we always need to go through spark-submit?

2017-08-30 Thread vaquar khan
Hi Kant, Ans: Yes. The org.apache.spark.launcher package provides classes for launching Spark jobs as child processes using a simple Java API. Doc:
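A minimal sketch of launching a job through org.apache.spark.launcher.SparkLauncher as referenced above; the jar path, main class, and master are placeholders:

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchFromCode {
  def main(args: Array[String]): Unit = {
    // Launch a Spark application as a child process from plain JVM code,
    // instead of invoking the spark-submit script directly.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/mysparkapp.jar")   // placeholder jar
      .setMainClass("com.example.MySparkApp")      // placeholder main class
      .setMaster("local[*]")                       // placeholder master
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()

    // Poll until the application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Final state: ${handle.getState}")
  }
}
```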

Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread vaquar khan
Hi Alex, Hope the following links help you understand why Spark is a good fit for your use case. - https://www.youtube.com/watch?v=tKkneWcAIqU=youtu.be - https://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/ -

Re: Different watermark for different kafka partitions in Structured Streaming

2017-08-30 Thread Tathagata Das
Why not set the watermark to be looser, one that works across all partitions? The main usage of watermark is to drop state. If you loosen the watermark threshold (e.g. from 1 hour to 10 hours), then you will keep more state with older data, but you are guaranteed that you will not drop important
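A rough sketch of what loosening the watermark threshold looks like in code; the rate source, column names, and thresholds are illustrative, not from the original mail:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, window}

val spark = SparkSession.builder.appName("LooseWatermarkSketch").master("local[*]").getOrCreate()
import spark.implicits._

val events = spark.readStream.format("rate").load()
  .withColumn("key", expr("value % 10"))
  .withColumnRenamed("timestamp", "eventTime")

// Loosening the threshold (e.g. 1 hour -> 10 hours) keeps more old state around,
// but guarantees that events arriving up to 10 hours late are still aggregated,
// even from partitions whose event time lags behind the others.
val counts = events
  .withWatermark("eventTime", "10 hours")
  .groupBy($"key", window($"eventTime", "1 hour"))
  .count()
```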

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread Tathagata Das
The per-key state S is kept in memory. It has to be of a type that can be encoded by Datasets. All you have to do is update S every time the function is called, and the engine takes care of serializing/checkpointing the state value, and retrieving the correct version of the value when
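A minimal sketch of that pattern with mapGroupsWithState, assuming Spark 2.2; the Event case class and latestPerKey function are made up for illustration, and the state here is simply the newest event seen so far per key:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.streaming.GroupState

// Hypothetical event type; the per-key state is just the newest event seen so far.
case class Event(key: String, ts: Timestamp, payload: String)

def latestPerKey(key: String, rows: Iterator[Event], state: GroupState[Event]): Event = {
  val newestInBatch = rows.maxBy(_.ts.getTime)
  val newest = state.getOption match {
    case Some(prev) if prev.ts.after(newestInBatch.ts) => prev
    case _ => newestInBatch
  }
  state.update(newest) // the engine serializes and checkpoints this value for us
  newest
}

// Usage, given a streaming Dataset[Event] called events (not shown here):
// import org.apache.spark.sql.streaming.GroupStateTimeout
// val latest = events.groupByKey(_.key)
//   .mapGroupsWithState(GroupStateTimeout.NoTimeout)(latestPerKey)
```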

Re: Do we always need to go through spark-submit?

2017-08-30 Thread Irving Duran
I don't know how this would work, but maybe your .jar could call spark-submit from within itself if you were to compile the jar with the spark-submit class. Thank You, Irving Duran On Wed, Aug 30, 2017 at 10:57 AM, kant kodali wrote: > Hi All, > > I understand spark-submit

Re: Python UDF to convert timestamps (performance question)

2017-08-30 Thread Brian Wylie
Tathagata, Thanks, your explanation was great. The suggestion worked well, with the only minor detail being that I needed to bring the TS field in as a DoubleType() or the time got truncated. Thanks again, -Brian On Wed, Aug 30, 2017 at 1:34 PM, Tathagata Das

Re: Python UDF to convert timestamps (performance question)

2017-08-30 Thread Tathagata Das
1. Generally, adding columns, etc. will not affect performance because Spark's optimizer will automatically figure out columns that are not needed and eliminate them in the optimization step. So that should never be a concern. 2. Again, this is generally not a concern as the optimizer will take

Design aspects of Data partitioning for Window functions

2017-08-30 Thread Vasu Gourabathina
All: If this question has already been discussed, please let me know; I can try to look into the archive. Data characteristics: entity_id date fact_1 fact_2 fact_N derived_1 derived_2 derived_X a) There are 1000s of such entities in the system b) Each one has various Fact attributes per

Python UDF to convert timestamps (performance question)

2017-08-30 Thread Brian Wylie
Hi All, I'm using structured streaming in Spark 2.2. I'm using PySpark and I have data (from a Kafka publisher) where the timestamp is a float that looks like this: 1379288667.631940 So here's my code (which is working fine) # SUBSCRIBE: Setup connection to Kafka Stream raw_data =
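Not part of the original mail, but for comparison: a small Scala sketch (the question itself is PySpark) showing that a fractional epoch-seconds column can be cast straight to a timestamp without a UDF; the column name ts is assumed:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

val spark = SparkSession.builder.appName("EpochCastSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Fractional epoch seconds, as in the question (the column name `ts` is assumed).
val raw = Seq(1379288667.631940, 1379288668.000001).toDF("ts")

// Casting a double to TimestampType interprets it as seconds since the epoch and
// preserves the fractional part -- no per-row UDF needed.
val parsed = raw.withColumn("event_time", col("ts").cast(TimestampType))
parsed.show(false)
```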

Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread Irving Duran
I think it will work. You might want to explore Spark Streaming. Thank You, Irving Duran On Wed, Aug 30, 2017 at 10:50 AM, wrote: > I don't see why not > > Sent from my iPhone > > > On Aug 24, 2017, at 1:52 PM, Alexandr Porunov < > alexandr.poru...@gmail.com> wrote: > > > >

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
I think I understand groupByKey/mapGroupsWithState and I am still trying to wrap my head around GroupState. So I have a naive question to ask about GroupState. If I were to represent a state that holds a history of events (say 24 hours), and say the number of events can be big for a

Re: [Structured Streaming]Data processing and output trigger should be decoupled

2017-08-30 Thread Shixiong(Ryan) Zhu
I don't think that's a good idea. If the engine keeps on processing data but doesn't output anything, where to keep the intermediate data? On Wed, Aug 30, 2017 at 9:26 AM, KevinZwx wrote: > Hi, > > I'm working with structured streaming, and I'm wondering whether there >

[Structured Streaming]Data processing and output trigger should be decoupled

2017-08-30 Thread KevinZwx
Hi, I'm working with structured streaming, and I'm wondering whether there should be some improvements to the trigger. Currently, when I specify a trigger, i.e. trigger(Trigger.ProcessingTime("10 minutes")), the engine will begin processing data at the time the trigger begins, like 10:00:00,
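For reference, a minimal sketch of how such a processing-time trigger is attached to a query (the sink and interval are placeholders); today the processing and the output both happen within the triggered micro-batch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("TriggerSketch").master("local[*]").getOrCreate()

val input = spark.readStream.format("rate").load()

// With a 10-minute processing-time trigger, a micro-batch is planned every 10 minutes;
// data is read, processed, and written out as part of that same batch.
val query = input.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 minutes"))
  .start()
```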

Re: Spark 2.2 structured streaming with mapGroupsWithState + window functions

2017-08-30 Thread kant kodali
+1 Is this ticket related https://issues.apache.org/jira/browse/SPARK-21641 ? On Mon, Aug 28, 2017 at 7:06 AM, daniel williams wrote: > Hi all, > > I've been looking heavily into Spark 2.2 to solve a problem I have by > specifically using mapGroupsWithState. What

Do we always need to go through spark-submit?

2017-08-30 Thread kant kodali
Hi All, I understand spark-submit sets up its own class loader and other things but I am wondering if it is possible to just compile the code and run it using "java -jar mysparkapp.jar" ? Thanks, kant

Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread kanth909
I don't see why not Sent from my iPhone > On Aug 24, 2017, at 1:52 PM, Alexandr Porunov > wrote: > > Hello, > > I am new in Apache Spark. I need to process different time series data > (numeric values which depend on time) and react on next actions: > 1. Data is

Sync commit to kafka 0.10

2017-08-30 Thread krot.vyacheslav
Hi, I'm looking for a way to make a synchronous commit of offsets to Kafka 0.10. commitAsync works well, but I'd like to proceed to the next job only after a successful commit; a small additional latency is not an issue for my use case. I know I can store offsets somewhere else, but the built-in Kafka offset
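As far as I know the 0-10 integration only exposes commitAsync on the stream; below is a rough sketch of committing with a callback so failures are at least visible. Note that Spark performs the actual commit internally on a later batch, so this is not a true synchronous commit; for that, offsets would have to be stored and committed outside Spark. Broker address, group id, and topic name are placeholders:

```scala
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("CommitCallbackSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(30))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",            // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                      // placeholder
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // Only an asynchronous commit is exposed; the callback at least surfaces failures.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
    override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                            exception: Exception): Unit = {
      if (exception != null) println(s"Offset commit failed: $exception")
    }
  })
}

ssc.start()
ssc.awaitTermination()
```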

Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for each release. I think generally we catch all breaking and most major behavior changes On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun wrote: > +1 > > On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li

Different watermark for different kafka partitions in Structured Streaming

2017-08-30 Thread KevinZwx
Hi, I'm working with Structured Streaming to process logs from kafka and use watermark to handle late events. Currently the watermark is computed by (max event time seen by the engine - late threshold), and the same watermark is used for all partitions. But in production environment it happens

Different watermark for different kafka partitions in Structured Streaming

2017-08-30 Thread 张万新
Hi, I'm working with Structured Streaming to process logs from kafka and use watermark to handle late events. Currently the watermark is computed by (max event time seen by the engine - late threshold), and the same watermark is used for all partitions. But in production environment it happens

Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-30 Thread Andrés Ivaldi
I see. As @ayan said, it's valid, but why not use the API or SQL? The built-in options are optimized. I understand that the SQL API is hard when trying to build an API on top of it, but the Spark API isn't, and you can do a lot with it. Regards, On Wed, Aug 30, 2017 at 10:31 AM, ayan guha

RE: from_json()

2017-08-30 Thread JG Perrin
Hey Sam, Nope – it does not work the way I want. I guess it is only working with one type… Trying to convert: {"releaseDate":147944880,"link":"http://amzn.to/2kup94P","id":1,"authorId":1,"title":"Fantastic Beasts and Where to Find Them: The Original Screenplay"} I get: [Executor task
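For what it's worth, a sketch of from_json with an explicit schema for a record shaped like the sample above; the field types, in particular releaseDate as a long, are guesses:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("FromJsonSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Schema guessed from the sample record above.
val bookSchema = StructType(Seq(
  StructField("releaseDate", LongType),
  StructField("link", StringType),
  StructField("id", IntegerType),
  StructField("authorId", IntegerType),
  StructField("title", StringType)
))

val df = Seq(
  """{"releaseDate":147944880,"link":"http://amzn.to/2kup94P","id":1,"authorId":1,"title":"Fantastic Beasts and Where to Find Them: The Original Screenplay"}"""
).toDF("json")

// Parse the JSON string into a struct column, then flatten it.
val parsed = df.select(from_json($"json", bookSchema).as("book")).select("book.*")
parsed.show(false)
```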

Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-30 Thread ayan guha
Well, using raw SQL is a valid option, but if you do not want to, you can always implement the concept using the APIs. All these constructs have API counterparts, such as filter, window, over, row_number, etc. On Wed, 30 Aug 2017 at 10:49 pm, purna pradeep wrote: > @Andres I
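To illustrate the API counterparts, a small sketch using Window and row_number from the DataFrame API; the column names (id, income, income_age_ts) follow the ones used elsewhere in this thread and the toy data is made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder.appName("WindowApiSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data with the column names used in this thread.
val t = Seq(
  (1, 1000.0, 3), (1, 1200.0, 1), (2, 900.0, 2)
).toDF("id", "income", "income_age_ts")

// API counterpart of the SQL: rank rows per id by income_age_ts descending,
// then keep the newest row for each id.
val w = Window.partitionBy("id").orderBy(col("income_age_ts").desc)
val latest = t.withColumn("r", row_number().over(w)).filter(col("r") === 1).drop("r")
latest.show()
```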

Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-30 Thread purna pradeep
@Andres I need the latest, but it should be less than 10 months old based on the income_age column, and I don't want to use SQL here. On Wed, Aug 30, 2017 at 8:08 AM Andrés Ivaldi wrote: > Hi, if you need the last value from income in window function you can use > last_value. > No tested but meaby

Re: Select entire row based on a logic applied on 2 columns across multiple rows

2017-08-30 Thread Andrés Ivaldi
Hi, if you need the last value of income in a window function you can use last_value. Not tested, but maybe, building on @ayan's SQL: spark.sql("select *, row_number() over (partition by id order by income_age_ts desc) r, last_value(income) over (partition by id order by income_age_ts desc) last_income from t") On Tue, Aug 29, 2017 at 11:30 PM, purna pradeep

[SS] StateStoreSaveExec in Complete output mode and metrics in stateOperators

2017-08-30 Thread Jacek Laskowski
Hi, I've been reviewing how StateStoreSaveExec works per output mode focusing on Complete output mode [1] at the moment. My understanding is that while in Complete output mode StateStoreSaveExec uses the metrics as follows: * numRowsTotal is the number of all the state keys in the state store *

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-30 Thread kant kodali
Hi TD, Thanks for the explanation and for the clear pseudo code and example! mapGroupsWithState is cool and looks very flexible; however, I have a few concerns and questions. For example, say I store TrainHistory as a max heap from the Java Collections library and I keep adding to this heap for

spark streaming and reactive streams

2017-08-30 Thread Mich Talebzadeh
Hi, I just had the idea of reactive streams thrown in. I was wondering, in practical terms, what does it add to Spark Streaming? What have we been missing, so to speak? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw