Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread ayan guha
>> >>>> >>>> On Tue, Aug 29, 2017 at 1:38 PM, Tathagata Das < >>>> tathagata.das1...@gmail.com> wrote: >>>> >>>>> Say, *trainTimesDataset* is the streaming Dataset of schema *[train: >>>>> Int, dest: Strin

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
>> Int, dest: String, time: Timestamp] * >>>> >>>> >>>> *Scala*: *trainTimesDataset.groupBy("train", "dest").max("time")* >>>> >>>> >>>> *SQL*: *"select train, dest, max(time) from trainTime

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Burak Yavuz
"time")* >>> >>> >>> *SQL*: *"select train, dest, max(time) from trainTimesView group by >>> train, dest"*// after calling >>> *trainTimesData.createOrReplaceTempView(trainTimesView)* >>> >>> >>> On Tue, Aug 29, 201

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Tathagata Das
> >> >> *Scala*: *trainTimesDataset.groupBy("train", "dest").max("time")* >> >> >> *SQL*: *"select train, dest, max(time) from trainTimesView group by >> train, dest"*// after calling >> *trainTimesData.createOrReplaceTempView(trainTimesVi
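
For reference, a minimal Scala sketch of the approach quoted in this thread; the rate source and the literal dest value are illustrative stand-ins, not part of the original messages, and note that this returns only the max time per (train, dest), not the entire row asked about in the subject:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder.appName("TrainTimes").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the real streaming source of
    // [train: Int, dest: String, time: Timestamp].
    val trainTimesDataset = spark.readStream
      .format("rate").load()
      .selectExpr("CAST(value % 10 AS INT) AS train", "'CDG' AS dest", "timestamp AS time")

    // Dataset API, as quoted above.
    val latest = trainTimesDataset.groupBy($"train", $"dest").agg(max($"time").as("time"))

    // SQL equivalent after registering a temp view.
    trainTimesDataset.createOrReplaceTempView("trainTimesView")
    val latestSql = spark.sql(
      "select train, dest, max(time) from trainTimesView group by train, dest")

    // Streaming aggregations are typically emitted in update or complete mode.
    val query = latest.writeStream.outputMode("complete").format("console").start()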

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
7 at 12:59 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> I am wondering what is the easiest and concise way to express the >> computation below in Spark Structured streaming given that it supports both >> imperative and declarative styles? >> I am ju

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Tathagata Das
y train, dest"*// after calling *trainTimesData.createOrReplaceTempView(trainTimesView)* On Tue, Aug 29, 2017 at 12:59 PM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I am wondering what is the easiest and concise way to express the > computation below in Spark Str

How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread kant kodali
Hi All, I am wondering what is the easiest and most concise way to express the computation below in Spark Structured Streaming given that it supports both imperative and declarative styles? I am just trying to select the rows that have the max timestamp for each train. Instead of doing some sort of nested

Re: [Spark Structured Streaming]: truncated Parquet after driver crash or kill

2017-08-24 Thread cbowden
you shut down gracefully, e.g. invoke StreamingQuery#stop. https://issues.apache.org/jira/browse/SPARK-21029 is probably of interest. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Structured-Streaming-truncated-Parquet-after-driver-crash-or-kill
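
A small Scala sketch of the graceful shutdown suggested above, assuming a Kafka-to-Parquet query like the one described in this thread; the broker, topic and paths are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("ParquetArchiver").getOrCreate()

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")    // hypothetical broker
      .option("subscribe", "events")                      // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")
      .writeStream
      .format("parquet")
      .option("path", "/archive/events")                  // hypothetical HDFS path
      .option("checkpointLocation", "/checkpoints/events")
      .start()

    // Stop the query instead of killing the driver so the current batch
    // commits cleanly and no truncated Parquet files are left behind.
    sys.addShutdownHook {
      query.stop()
    }

    query.awaitTermination()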

[Spark Structured Streaming]: truncated Parquet after driver crash or kill

2017-08-08 Thread dcam
iodically to deploy new versions of the code. We want to run the application as a long lived process that is continually reading from the Kafka queue and writing out to HDFS for archival purposes. Thanks, Dave Cameron -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.c

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Priyank Shrivastava
Also, in your example doesn't the temp view need to be accessed using the same SparkSession on the Scala side? Since I am not using a notebook, how can I get access to the same SparkSession in Scala? On Fri, Jul 28, 2017 at 3:17 PM, Priyank Shrivastava wrote: > Thanks

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Priyank Shrivastava
Thanks Burak. In a streaming context would I need to do any state management for the temp views? For example, across sliding windows. Priyank On Fri, Jul 28, 2017 at 3:13 PM, Burak Yavuz wrote: > Hi Priyank, > > You may register them as temporary tables to use across language

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Burak Yavuz
Hi Priyank, You may register them as temporary tables to use across language boundaries. Python: df = spark.readStream... # Python logic df.createOrReplaceTempView("tmp1") Scala: val df = spark.table("tmp1") df.writeStream .foreach(...) On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava
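
A sketch of the Scala half of this suggestion, assuming the PySpark side of the same application has already registered the streaming DataFrame as "tmp1"; the ForeachWriter body is a placeholder, not a real Redis sink:

    import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

    // Both sides must share the same SparkSession (same application / same JVM),
    // otherwise the temp view registered in Python will not be visible here.
    val spark = SparkSession.builder.getOrCreate()

    // Python side (for context):
    //   df = spark.readStream...            # Python logic
    //   df.createOrReplaceTempView("tmp1")

    val df = spark.table("tmp1")

    val query = df.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, version: Long): Boolean = true
        def process(row: Row): Unit = println(row)        // replace with the real sink call
        def close(errorOrNull: Throwable): Unit = ()
      })
      .start()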

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Priyank Shrivastava
TD, For a hybrid Python-Scala approach, what's the recommended way of handing off a dataframe from Python to Scala? I would like to know especially in a streaming context. I am not using notebooks/Databricks. We are running it on our own Spark 2.1 cluster. Priyank On Wed, Jul 26, 2017 at

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-27 Thread Tathagata Das
For built-in SQL functions, it does not matter which language you use, as the engine will use the most optimized JVM code to execute. However, in your case, you are asking for foreach in Python. My interpretation was that you want to specify your Python function that processes the rows in Python.

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread ayan guha
Hi TD, I thought structured streaming provides a similar concept to DataFrames, where it does not matter which language I use to invoke the APIs, with the exception of UDFs. So, when I think of supporting the foreach sink in Python, I think of it as just a wrapper API, and the data should remain in the JVM only.

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread Tathagata Das
We see that all the time. For example, in SQL, people can write their user-defined function in Scala/Java and use it from SQL/python/anywhere. That is the recommended way to get the best combo of performance and ease-of-use from non-jvm languages. On Wed, Jul 26, 2017 at 11:49 AM, Priyank

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread Priyank Shrivastava
Thanks TD. I am going to try the Python-Scala hybrid approach by using Scala only for the custom Redis sink and Python for the rest of the app. I understand it might not be as efficient as purely writing the app in Scala, but unfortunately I am constrained on Scala resources. Have you come across

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-26 Thread Tathagata Das
Hello Priyank, Writing something purely in Scala/Java would be the most efficient. Even if we expose Python APIs that allow writing custom sinks in pure Python, it won't be as efficient as the Scala/Java foreach, as the data would have to go through the JVM / PVM boundary, which has significant overheads. So

[SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-25 Thread Priyank Shrivastava
I am trying to write key-values to Redis using a DataStreamWriter object using the PySpark structured streaming APIs. I am using Spark 2.2. Since the Foreach sink is not supported for Python (see here), I am

Fwd: Spark Structured Streaming - Spark Consumer does not display messages

2017-07-21 Thread Cassa L
Hi, This is the first time I am trying structured streaming with Kafka. I have simple code to read from Kafka and display it on the console. The message is in JSON format. However, when I run my code, nothing after the below line gets printed. 17/07/21 13:43:41 INFO AppInfoParser: Kafka commitId :

Spark Structured Streaming - Spark Consumer does not display messages

2017-07-21 Thread Cassa L
Hi, This is the first time I am trying structured streaming with Kafka. I have simple code to read from Kafka and display it on the console. The message is in JSON format. However, when I run my code, nothing after the below line gets printed. 17/07/21 13:43:41 INFO AppInfoParser: Kafka commitId :

RE: Timeline for stable release for Spark Structured Streaming

2017-07-10 Thread Mendelson, Assaf
playing around with it now (most of the capabilities are already there and stable). Thanks, Assaf. From: Dhrubajyoti Hati [mailto:dhruba.w...@gmail.com] Sent: Monday, July 10, 2017 1:33 PM To: user@spark.apache.org Subject: Timeline for stable release for Spark Structured Streaming

Timeline for stable release for Spark Structured Streaming

2017-07-10 Thread Dhrubajyoti Hati
Hi, I was checking the documentation of the Structured Streaming Programming Guide and it seems it's still in alpha mode. Is there any timeline for when this module will be ready to use in production environments? *Regards,*

Re: [Spark Structured Streaming] Exception while using watermark with type of timestamp

2017-06-06 Thread Tathagata Das
Cast the timestamp column to a timestamp type. E.g. "cast timestamp as timestamp". A watermark can be defined only on columns that are of type timestamp. On Jun 6, 2017 3:06 AM, "Biplob Biswas" <revolutioni...@gmail.com> wrote: > Hi, > > I am playing around with Spa
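
A minimal Scala sketch of the fix suggested above: cast the event-time column to a timestamp before defining the watermark. The socket source, column names and intervals are illustrative assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}

    val spark = SparkSession.builder.getOrCreate()

    // Lines like "2017-06-06 10:15:00,deviceA" arriving as plain strings.
    val raw = spark.readStream
      .format("socket")
      .option("host", "localhost").option("port", 9999)
      .load()
      .selectExpr("split(value, ',')[0] AS ts_string", "split(value, ',')[1] AS device")

    // The watermark column must be of TimestampType, so cast it first.
    val events = raw.withColumn("event_time", col("ts_string").cast("timestamp"))

    val counts = events
      .withWatermark("event_time", "10 minutes")
      .groupBy(window(col("event_time"), "5 minutes"), col("device"))
      .count()

    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .start()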

[Spark Structured Streaming] Exception while using watermark with type of timestamp

2017-06-06 Thread Biplob Biswas
Hi, I am playing around with Spark structured streaming and we have a use case to use this as a CEP engine. I am reading from 3 different kafka topics together. I want to perform windowing on this structured stream and then run some queries on this block on a sliding scale. Also, all

Is there a way to do partial sort in update mode in Spark Structured Streaming?

2017-06-03 Thread kant kodali
Hi All, 1. Is there a way to do a partial sort of, say, a timestamp column in update mode? I am currently using Spark 2.1.1 and it looks like it is not possible, however I am wondering if this is possible in 2.2? 2. Can we do a full sort in update mode with a specified watermark? since after a specified

RE: The following Error seems to happen once in every ten minutes (Spark Structured Streaming)?

2017-05-31 Thread Mahesh Sawaiker
: Thursday, June 01, 2017 4:35 AM To: user @spark Subject: The following Error seems to happen once in every ten minutes (Spark Structured Streaming)? Hi All, When my query is streaming I get the following error once in say 10 minutes. Lot of the solutions online seems to suggest just clear data

The following Error seems to happen once in every ten minutes (Spark Structured Streaming)?

2017-05-31 Thread kant kodali
Hi All, When my query is streaming I get the following error once in, say, 10 minutes. A lot of the solutions online seem to suggest just clearing the data directories under the datanode and namenode and restarting the HDFS cluster, but I didn't see anything that explains the cause. If it happens so frequently, what

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread kant kodali
llo Kant, >>> >>> See if the examples in this blog explain how to deal with your >>> particular case: https://databricks.com/blog/2017/02/23/working-complex >>> -data-formats-structured-streaming-apache-spark-2-1.html >>> >>> Cheers >>

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread kant kodali
ant, >> >> See if the examples in this blog explain how to deal with your >> particular case: https://databricks.com/blog/2017/02/23/working-complex >> -data-formats-structured-streaming-apache-spark-2-1.html >> >> Cheers >> Jules >> >> Sent from

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread kant kodali
> I have a Dataset and I am trying to convert it into Dataset > (json String) using Spark Structured Streaming. I have tried the following. > > df2.toJSON().writeStream().foreach(new KafkaSink()) > > This doesn't seem to work for the following reason. > > "Queries wit

Re: How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-31 Thread Jules Damji
017, at 7:31 PM, kant kodali <kanth...@gmail.com> wrote: > > Hi All, > > I have a Dataset and I am trying to convert it into Dataset > (json String) using Spark Structured Streaming. I have tried the following. > > df2.toJSON().writeStream().foreach(new KafkaSink(

How to convert Dataset to Dataset in Spark Structured Streaming?

2017-05-30 Thread kant kodali
Hi All, I have a Dataset and I am trying to convert it into Dataset (json String) using Spark Structured Streaming. I have tried the following. df2.toJSON().writeStream().foreach(new KafkaSink()) This doesn't seem to work for the following reason. "Queries with streaming sources

Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread kant kodali
Hi Burak, My response is inline. Thanks a lot! On Mon, May 22, 2017 at 9:26 AM, Burak Yavuz <brk...@gmail.com> wrote: > Hi Kant, > >> >> >> 1. Can we use Spark Structured Streaming for stateless transformations >> just like we would do with DStreams o

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-22 Thread Michael Armbrust
option("kafka.bootstrap.servers", "host1:port1,host2:port2") \ >> .option("topic", "topic1") \ >> .option("checkpointLocation", "/path/to/HDFS/dir") \ >> .start() >> >> Described here: >> >> https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html >> >> >> >> On 19 May 2017 at 10:45, <kanth...@gmail.com> wrote: >> >>> Is there a Kafka sink for Spark Structured Streaming ? >>> >>> Sent from my iPhone >>> >> >> >

Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread Burak Yavuz
Hi Kant, > > > 1. Can we use Spark Structured Streaming for stateless transformations > just like we would do with DStreams or Spark Structured Streaming is only > meant for stateful computations? > Of course you can do stateless transformations. Any map, filter, select, type
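
A minimal Scala sketch of a purely stateless structured streaming job along the lines of the map/filter/select transformations mentioned above; the socket source is an illustrative assumption:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost").option("port", 9999)
      .load()
      .as[String]

    // Stateless: each record is handled on its own, no aggregation state is kept.
    val cleaned = lines
      .filter(_.nonEmpty)
      .map(_.toUpperCase)

    val query = cleaned.writeStream
      .outputMode("append")        // append mode is enough when there is no state
      .format("console")
      .start()

    query.awaitTermination()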

couple naive questions on Spark Structured Streaming

2017-05-20 Thread kant kodali
Hi, 1. Can we use Spark Structured Streaming for stateless transformations just like we would do with DStreams, or is Spark Structured Streaming only meant for stateful computations? 2. When we use groupBy and Window operations for event time processing and specify a watermark, does this mean

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kant kodali
"to_json(struct(*)) AS >>> value") \ >>> .writeStream \ >>> .format("kafka") \ >>> .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \ >>> .option("topic", "topic1"

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread Tathagata Das
.option("topic", "topic1") \ >> .option("checkpointLocation", "/path/to/HDFS/dir") \ >> .start() >> >> Described here: >> >> https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html >> >> >> >> On 19 May 2017 at 10:45, <kanth...@gmail.com> wrote: >> >>> Is there a Kafka sink for Spark Structured Streaming ? >>> >>> Sent from my iPhone >>> >> >> >

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kant kodali
"host1:port1,host2:port2") \ > .option("topic", "topic1") \ > .option("checkpointLocation", "/path/to/HDFS/dir") \ > .start() > > Described here: > > https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-s

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kanth909
> .option("checkpointLocation", "/path/to/HDFS/dir") \ > .start() > Described here: > https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html > > >> On 19 May 2017 at 10:45, <kanth...@gmail.com> wrote: >> Is there a Kafka sink for Spark Structured Streaming ? >> >> Sent from my iPhone >

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread Patrick McGloin
-2-2.html On 19 May 2017 at 10:45, <kanth...@gmail.com> wrote: > Is there a Kafka sink for Spark Structured Streaming ? > > Sent from my iPhone >

Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kanth909
Is there a Kafka sink for Spark Structured Streaming ? Sent from my iPhone

Re: Spark Structured Streaming is taking too long to process 2KB messages

2017-05-18 Thread kant kodali
OK, so the problem really was that I was compiling with 2.1.0 jars and at run time supplying 2.1.1. Once I changed to 2.1.1 at compile time as well, it seems to work fine and I can see all my 75 fields. On Thu, May 18, 2017 at 2:39 AM, kant kodali wrote: > Hi All, > > Here is my

Spark Structured Streaming is taking too long to process 2KB messages

2017-05-18 Thread kant kodali
Hi All, Here is my code. Dataset df = ds.select(functions.from_json(new Column("value").cast("string"), getSchema()).as("payload")); Dataset df1 = df.selectExpr("payload.data.*"); StreamingQuery query = df1.writeStream().outputMode("append").option("truncate",
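
A Scala sketch of the same from_json pattern as the flattened Java snippet above; the schema, topic and broker are illustrative stand-ins for getSchema() and the real source:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.getOrCreate()

    // Hypothetical stand-in for getSchema().
    val schema = new StructType()
      .add("data", new StructType()
        .add("id", StringType)
        .add("amount", DoubleType)
        .add("ts", TimestampType))

    val ds = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")   // hypothetical broker
      .option("subscribe", "payments")                   // hypothetical topic
      .load()

    val df  = ds.select(from_json(col("value").cast("string"), schema).as("payload"))
    val df1 = df.selectExpr("payload.data.*")

    val query = df1.writeStream
      .outputMode("append")
      .option("truncate", "false")
      .format("console")
      .start()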

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-22 Thread Hemanth Gudela
<hemanth.gud...@qvantel.com>, Tathagata Das <tathagata.das1...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame? Hi Georg, Yes, that should be possible with

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-21 Thread Gene Pang
>> >> >> Thanks, >> >> Hemanth >> >> >> >> >> >> *From: *Tathagata Das <tathagata.das1...@gmail.com> >> >> >> *Date: *Friday, 21 April 2017 at 0.03 >> *To: *Hemanth Gudela <hemanth.gud...@qvantel.com> >>

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-21 Thread Georg Heiler
Is Scala’s > “Futures” the way to achieve this? > > > > Thanks, > > Hemanth > > > > > > *From: *Tathagata Das <tathagata.das1...@gmail.com> > > > *Date: *Friday, 21 April 2017 at 0.03 > *To: *Hemanth Gudela <hemanth.gud...@qvantel.co

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-21 Thread Hemanth Gudela
:georg.kf.hei...@gmail.com>> Date: Thursday, 20 April 2017 at 23.11 To: Hemanth Gudela <hemanth.gud...@qvantel.com<mailto:hemanth.gud...@qvantel.com>>, "user@spark.apache.org<mailto:user@spark.apache.org>" <user@spark.apache.org<mailto:user@spark.apache.org>

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Georg Heiler
y, 21 April 2017 at 0.03 > *To: *Hemanth Gudela <hemanth.gud...@qvantel.com> > *Cc: *Georg Heiler <georg.kf.hei...@gmail.com>, "user@spark.apache.org" < > user@spark.apache.org> > > > *Subject: *Re: Spark structured streaming: Is it possible to perio

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Hemanth Gudela
<tathagata.das1...@gmail.com> Date: Friday, 21 April 2017 at 0.03 To: Hemanth Gudela <hemanth.gud...@qvantel.com> Cc: Georg Heiler <georg.kf.hei...@gmail.com>, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: Spark structured streaming: Is it possible to perio

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Tathagata Das
at 23.11 > *To: *Hemanth Gudela <hemanth.gud...@qvantel.com>, "user@spark.apache.org" > <user@spark.apache.org> > *Subject: *Re: Spark structured streaming: Is it possible to periodically > refresh static data frame? > > > > What about treating the static data as a (s

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Hemanth Gudela
manth Gudela <hemanth.gud...@qvantel.com>, "user@spark.apache.org" <user@spark.apache.org> Subject: Re: Spark structured streaming: Is it possible to periodically refresh static data frame? What about treating the static data as a (slow) stream as well? Hemanth

Re: Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Georg Heiler
eriodically to get the > latest information from underlying database table. > > > > My questions: > > 1. Is it possible to periodically refresh static data frame? > > 2. If refreshing static data frame is not possible, is there a > mechanism to automatically

Spark structured streaming: Is it possible to periodically refresh static data frame?

2017-04-20 Thread Hemanth Gudela
, is there a mechanism to automatically stop & restart the Spark structured streaming job, so that every time the job restarts, the static data frame gets updated with the latest information from the underlying database table. 3. If 1) and 2) are not possible, please suggest alternatives to achieve

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Thank you for your reply, I will open a pull request for this doc issue. The logic is clear. Fri, 14 Apr 2017, 23:34 Michael Armbrust : > 1) could we update documentation for Structured Streaming and describe >> that checkpointing could be specified by >>

Re: SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Michael Armbrust
> > 1) could we update documentation for Structured Streaming and describe > that checkpointing could be specified by > spark.sql.streaming.checkpointLocation > on SparkSession level and thus automatically checkpoint dirs will be > created per foreach query? > > Sure, please open a pull request.

SPARK-20325 - Spark Structured Streaming documentation Update: checkpoint configuration

2017-04-14 Thread Katherin Eri
Hello, guys. I have initiated the ticket https://issues.apache.org/jira/browse/SPARK-20325. My case was: I launch two streams from one source stream *streamToProcess *like this streamToProcess .groupBy(metric) .agg(count(metric)) .writeStream
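
A Scala sketch of the setup behind SPARK-20325, assuming a rate source in place of the real one: with spark.sql.streaming.checkpointLocation set at the session level, each started query gets its own checkpoint sub-directory automatically:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.count

    val spark = SparkSession.builder
      .config("spark.sql.streaming.checkpointLocation", "/checkpoints")  // parent dir; per-query subdirs are created
      .getOrCreate()

    val streamToProcess = spark.readStream.format("rate").load()
      .selectExpr("CAST(value % 3 AS STRING) AS metric")

    val aggregated = streamToProcess
      .groupBy("metric")
      .agg(count("metric"))
      .writeStream
      .outputMode("complete")
      .format("console")
      .queryName("metricCounts")
      .start()

    val raw = streamToProcess
      .writeStream
      .outputMode("append")
      .format("console")
      .queryName("rawMetrics")
      .start()

    spark.streams.awaitAnyTermination()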

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-05 Thread kant kodali
> >> I am talking about "stateful operations like aggregations". Does this >> happen on heap or off heap by default? I came across a article where I saw >> both on and off heap are possible but I am not sure what happens by default >> and when Spark or Spark

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-05 Thread kant kodali
ap by default? I came across a article where I saw > both on and off heap are possible but I am not sure what happens by default > and when Spark or Spark Structured Streaming decides to store off heap? > > I don't even know what mapGroupsWithState does since It's not part of > sp

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-05 Thread kant kodali
Hi! I am talking about "stateful operations like aggregations". Does this happen on heap or off heap by default? I came across an article where I saw that both on and off heap are possible, but I am not sure what happens by default and when Spark or Spark Structured Streaming decides to stor

Re: Why do we ever run out of memory in Spark Structured Streaming?

2017-04-04 Thread Tathagata Das
kodali <kanth...@gmail.com> wrote: > Why do we ever run out of memory in Spark Structured Streaming especially > when Memory can always spill to disk ? until the disk is full we shouldn't > be out of memory.isn't it? sure thrashing will happen more frequently and > degrades performance

Why do we ever run out of memory in Spark Structured Streaming?

2017-04-04 Thread kant kodali
Why do we ever run out of memory in Spark Structured Streaming, especially when memory can always spill to disk? Until the disk is full we shouldn't be out of memory, isn't it? Sure, thrashing will happen more frequently and degrade performance, but do we ever run out of memory even in case

Re: Counting things in Spark Structured Streaming

2017-02-09 Thread Tathagata Das
Probably something like this. dataset .filter { userData => val dateThreshold = lookupThreshold(record)// look up the threshold date based on the record details userData.date > dateThreshold // compare } .groupBy() .count() This would
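
A cleaned-up Scala sketch that adapts the quoted filter-then-aggregate idea to the per-user running totals asked about; the case class, the lookupThreshold helper and the rate source are illustrative assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    case class UserEvent(userId: String, date: java.sql.Timestamp, score: Double)

    // Hypothetical lookup of the threshold date based on the record details.
    def lookupThreshold(event: UserEvent): java.sql.Timestamp =
      java.sql.Timestamp.valueOf("2017-01-01 00:00:00")

    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the real stream of scored user events.
    val events = spark.readStream.format("rate").load()
      .select($"timestamp".as("date"),
              ($"value" % 5).cast("string").as("userId"),
              ($"value" % 100).cast("double").as("score"))
      .as[UserEvent]

    val runningTotals = events
      .filter { e =>
        val dateThreshold = lookupThreshold(e)   // threshold date for this record
        e.date.after(dateThreshold)              // keep only events after the threshold
      }
      .groupBy($"userId")
      .agg(sum($"score").as("runningScore"))

    val query = runningTotals.writeStream
      .outputMode("update")     // totals are updated as new events arrive
      .format("console")
      .start()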

Counting things in Spark Structured Streaming

2017-02-08 Thread Timothy Chan
I would like to count running totals for events coming in since a given date for a given user. How would I go about doing this? For example, we have user data coming in, we'd like to score that data, then keep running totals on that score, since a given date. Specifically, I always want to score

Re: [External] Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-04 Thread Ben Teeuwen
Another option: https://github.com/mysql-time-machine/replicator >From the readme: "Replicates data changes from MySQL binlog to HBase or Kafka. In case of HBase, preserves the previous data versions. HBase storage is intended for auditing purposes of historical data. In addition, special

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, This "inline view" idea is really awesome and enlightens me! Finally I have a plan to move on. I greatly appreciate your help! Best regards, Yang 2017-01-03 18:14 GMT+01:00 ayan guha : > Ahh I see what you meanI confused two terminologiesbecause we were >

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Ahh I see what you mean... I confused two terminologies... because we were talking about partitioning and then changed topic to identifying changed data. For that, you can "construct" a dbtable as an inline view - viewSQL = "(select * from table where >

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Yeah, I understand your proposal, but according to http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases, it says "Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table." So all rows in

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Hi, You need to store and capture the max of the column you intend to use for identifying new records (e.g. INSERTED_ON) after every successful run of your job. Then, use that value in the lowerBound option. Essentially, you want to create a query like select * from table where INSERTED_ON >
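
A Scala sketch combining the suggestions in this thread - capture the max of the tracking column after each run and push the filter into the dbtable option as an inline view. The JDBC URL, table name and the way the previous watermark value is persisted are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    // max(INSERTED_ON) captured after the previous successful run,
    // e.g. read back from a small bookkeeping table or a file.
    val lastIngested = "2017-01-02 23:59:59"

    // Inline view so only new rows are fetched from the database.
    val viewSQL = s"(select * from events where INSERTED_ON > '$lastIngested') incr"

    val newRows = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // hypothetical database
      .option("dbtable", viewSQL)
      .option("partitionColumn", "ID")    // optional: parallelise the read;
      .option("lowerBound", "0")          // the bounds only control the partition stride,
      .option("upperBound", "1000000")    // they do not filter rows
      .option("numPartitions", "4")
      .load()

    // After writing the batch out, persist max(INSERTED_ON) from newRows
    // to use as lastIngested on the next run.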

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Thanks a lot for your suggestion. I am currently looking into sqoop. Concerning your suggestion for Spark, it is indeed parallelized with multiple workers, but the job is one-off and cannot keep streaming. Moreover, I cannot specify any "start row" in the job, it will always ingest the

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread ayan guha
Hi, While the solutions provided by others look promising and I'd like to try out a few of them, our old pal sqoop already "does" the job. It has an incremental mode where you can provide a --check-column and --last-modified-value combination to grab the data - and yes, sqoop essentially does it by

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Tamas, Thanks a lot for your suggestion! I will also investigate this one later. Best regards, Yang 2017-01-03 12:38 GMT+01:00 Tamas Szuromi : > > You can also try https://github.com/zendesk/maxwell > > Tamas > > On 3 January 2017 at 12:25, Amrit Jangid

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Amrit, Thanks a lot for your suggestion! I will investigate it later. Best regards, Yang 2017-01-03 12:25 GMT+01:00 Amrit Jangid : > You can try out *debezium* : https://github.com/debezium. it reads data > from bin-logs, provides structure and stream into Kafka. >

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Ayan, Thanks a lot for such a detailed response. I really appreciate it! I think this use case can be generalized, because the data is immutable and append-only. We only need to find one column or timestamp to track the last row consumed in the previous ingestion. This pattern should be

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Tamas Szuromi
You can also try https://github.com/zendesk/maxwell Tamas On 3 January 2017 at 12:25, Amrit Jangid wrote: > You can try out *debezium* : https://github.com/debezium. it reads data > from bin-logs, provides structure and stream into Kafka. > > Now Kafka can be your new

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Amrit Jangid
You can try out *debezium* : https://github.com/debezium. It reads data from bin-logs, provides structure, and streams into Kafka. Now Kafka can be your new source for streaming. On Tue, Jan 3, 2017 at 4:36 PM, Yuanzhe Yang wrote: > Hi Hongdi, > > Thanks a lot for your

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Hongdi, Thanks a lot for your suggestion. The data is truly immutable and the table is append-only. But actually there are different databases involved, so the only feature they share in common and I can depend on is JDBC... Best regards, Yang 2016-12-30 6:45 GMT+01:00 任弘迪

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2017-01-03 Thread Yuanzhe Yang
Hi Michael, Thanks a lot for your ticket. At least it is the first step. Best regards, Yang 2016-12-30 2:01 GMT+01:00 Michael Armbrust : > We don't support this yet, but I've opened this JIRA as it sounds > generally useful:

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread ayan guha
"If data ingestion speed is faster than data production speed, then eventually the entire database will be harvested and those workers will start to "tail" the database for new data streams and the processing becomes real time." This part is really database dependent. So it will be hard to

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread 任弘迪
Why not sync the binlog of MySQL (hopefully the data is immutable and the table is append-only), send the log through Kafka and then consume it with Spark Streaming? On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust wrote: > We don't support this yet, but I've opened this JIRA

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Michael Armbrust
We don't support this yet, but I've opened this JIRA as it sounds generally useful: https://issues.apache.org/jira/browse/SPARK-19031 In the meantime you could try implementing your own Source, but that is pretty low level and is not yet a stable API. On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe

[Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread Yuanzhe Yang (杨远哲)
Hi all, Thanks a lot for your contributions to bring us new technologies. I don't want to waste your time, so before I write to you, I googled and checked Stack Overflow and the mailing list archive with the keywords "streaming" and "jdbc", but I was not able to find any solution to my use case. I hope I

Kafka Spark structured streaming latency benchmark.

2016-12-17 Thread Prashant Sharma
Hi, The goal of my benchmark is to arrive at an end-to-end latency lower than 100ms and sustain it over time, by consuming from a Kafka topic and writing back to another Kafka topic using Spark. Since the job does not do aggregation and does constant-time processing on each message, it appeared to

Re: Spark structured streaming is Micro batch?

2016-05-07 Thread madhu phatak
tak <phatak@gmail.com> >> wrote: >> >>> Hi, >>> As I was playing with new structured streaming API, I noticed that spark >>> starts processing as and when the data appears. It's no more seems like >>> micro batch processing. Is spark struct

Re: Spark structured streaming is Micro batch?

2016-05-06 Thread Sachin Aggarwal
th new structured streaming API, I noticed that spark >> starts processing as and when the data appears. It's no more seems like >> micro batch processing. Is spark structured streaming will be an event >> based processing? >> >> -- >> Regards, >> Madhuk

Re: Spark structured streaming is Micro batch?

2016-05-06 Thread Deepak Sharma
ing as and when the data appears. It's no more seems like > micro batch processing. Is spark structured streaming will be an event > based processing? > > -- > Regards, > Madhukara Phatak > http://datamantra.io/ > -- Thanks Deepak www.bigdatabig.com www.keosha.net

Spark structured streaming is Micro batch?

2016-05-06 Thread madhu phatak
Hi, As I was playing with the new structured streaming API, I noticed that Spark starts processing as and when the data appears. It no longer seems like micro-batch processing. Will Spark structured streaming be event-based processing? -- Regards, Madhukara Phatak http://datamantra.io/

Re: Spark structured streaming

2016-03-08 Thread Michael Armbrust
> > > > > > From: Jacek Laskowski <ja...@japila.pl> > > To: Praveen Devarao/India/IBM@IBMIN > > Cc: user <user@spark.apache.org>, dev <d...@spark.apache.org> > > Date: 08/03/2016 04:17 pm > > Subject: Re: Spar

Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Jacek Laskowski <ja...@japila.pl> > To:Praveen Devarao/India/IBM@IBMIN > Cc:user <user@spark.apache.org>, dev <d...@spark.apache.org> > Date:08/03/2016 04:17 pm > Subject:Re: Spark structured streaming > _

Re: Spark structured streaming

2016-03-08 Thread Praveen Devarao
again" From: Jacek Laskowski <ja...@japila.pl> To: Praveen Devarao/India/IBM@IBMIN Cc: user <user@spark.apache.org>, dev <d...@spark.apache.org> Date: 08/03/2016 04:17 pm Subject: Re: Spark structured streaming Hi Praveen, I've spent few hours on the

Re: Spark structured streaming

2016-03-08 Thread Jacek Laskowski
Hi Praveen, I've spent a few hours on the changes related to streaming dataframes (included in SPARK-8360) and concluded that it's currently only possible to read.stream(), but not write.stream(), since there are no streaming Sinks yet. Regards, Jacek Laskowski

Spark structured streaming

2016-03-08 Thread Praveen Devarao
Hi, I would like to get my hands on the structured streaming feature coming out in Spark 2.0. I have tried looking around for code samples to get started but am not able to find any. The only few things I could look into are the test cases that have been committed under the JIRA umbrella
