Got it, thanks!

2017-02-11 0:56 GMT-08:00 Sam Elamin <hussam.ela...@gmail.com>:

> Here's a link to the thread:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html
>
> On Sat, 11 Feb 2017 at 08:47, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
>> Hey Egor
>>
>> You can use a ForeachWriter or you can write a custom sink. I
>> personally went with a custom sink, since I get a DataFrame per batch:
>>
>> https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala
>>
>> You can have a look at how I implemented something similar to the file
>> sink that, in the event of a failure, skips batches already written.
>>
>> Also have a look at Michael's reply to me a few days ago on exactly the
>> same topic. The email subject was "Structured Streaming. Dropping
>> duplicates".
>>
>> Regards
>>
>> Sam
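A minimal sketch of the custom-sink approach Sam describes, assuming Spark 2.x. Sink and StreamSinkProvider are internal/experimental APIs (org.apache.spark.sql.execution.streaming and org.apache.spark.sql.sources), so they can change between releases; the class names, the "table" option, and the Hive write below are illustrative placeholders, not the linked BigQuerySink code:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.Sink
    import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
    import org.apache.spark.sql.streaming.OutputMode

    // Structured Streaming hands each micro-batch to addBatch as a DataFrame.
    class HiveBatchSink(options: Map[String, String]) extends Sink {
      @volatile private var lastCommittedBatchId = -1L

      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // Like the file sink, skip batches that were already written, so a
        // restart after a failure does not duplicate data. (A real sink
        // would persist the last committed batch id rather than keep it
        // in memory.)
        if (batchId <= lastCommittedBatchId) return
        // Re-wrap the incoming DataFrame: the one passed to addBatch is
        // tied to the streaming plan, and writing it directly can fail on
        // some 2.x versions.
        val batch = data.sparkSession.createDataFrame(data.rdd, data.schema)
        // insertInto requires the target table to already exist in the
        // Hive metastore, which keeps the metastore in sync with the
        // appended data.
        batch.write.mode("append").insertInto(options("table"))
        lastCommittedBatchId = batchId
      }
    }

    class HiveBatchSinkProvider extends StreamSinkProvider with DataSourceRegister {
      override def shortName(): String = "hive-batch"

      override def createSink(
          sqlContext: SQLContext,
          parameters: Map[String, String],
          partitionColumns: Seq[String],
          outputMode: OutputMode): Sink = new HiveBatchSink(parameters)
    }

A stream would be wired to it with df.writeStream.format("com.example.HiveBatchSinkProvider").option("table", "db.events").start(), where the format string is the fully qualified provider class name (com.example and db.events are placeholders); registering the class under META-INF/services/org.apache.spark.sql.sources.DataSourceRegister lets you use the short name instead.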
>> On Sat, 11 Feb 2017 at 07:59, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>>> "Something like that." I've never tried it out myself, so I'm only
>>> guessing after a brief look at the API.
>>>
>>> Regards,
>>> Jacek Laskowski
>>> ----
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>> On Sat, Feb 11, 2017 at 1:31 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>>>
>>>> Jacek, so I create a cache in ForeachWriter, in every "process()" I
>>>> write to it, and on close I flush? Something like that?
>>>>
>>>> 2017-02-09 12:42 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
>>>>
>>>>> Hi,
>>>>>
>>>>> Yes, that's ForeachWriter.
>>>>>
>>>>> Yes, it works element by element. You're looking for mapPartitions;
>>>>> ForeachWriter has a partitionId that you could use to implement
>>>>> something similar.
>>>>>
>>>>> Regards,
>>>>> Jacek Laskowski
>>>>> ----
>>>>> https://medium.com/@jaceklaskowski/
>>>>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>>>>> Follow me at https://twitter.com/jaceklaskowski
>>>>>
>>>>> On Thu, Feb 9, 2017 at 3:55 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>>>>>
>>>>>> Jacek, you mean
>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter ?
>>>>>> I do not understand how to use it, since it passes every value
>>>>>> separately, not every partition. And adding to a table value by
>>>>>> value would not work.
>>>>>>
>>>>>> 2017-02-07 12:10 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Have you considered the foreach sink?
>>>>>>>
>>>>>>> Jacek
>>>>>>>
>>>>>>> On 6 Feb 2017 8:39 p.m., "Egor Pahomov" <pahomov.e...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, I'm thinking of using Structured Streaming instead of the old
>>>>>>>> streaming API, but I need to be able to save results to a Hive
>>>>>>>> table. The documentation for the file sink
>>>>>>>> (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
>>>>>>>> says: "Supports writes to partitioned tables." But being able to
>>>>>>>> write to partitioned directories is not enough to write to the
>>>>>>>> table: someone needs to write to the Hive metastore. How can I use
>>>>>>>> Structured Streaming and write to a Hive table?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sincerely yours
>>>>>>>> Egor Pakhomov
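And a minimal sketch of the buffer-and-flush ForeachWriter idea discussed above, again assuming Spark 2.x. flushToHive is a hypothetical placeholder for the actual bulk write (for example a JDBC batch insert against HiveServer2), since the writer runs on the executors, where no SparkSession is available:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.sql.{ForeachWriter, Row}

    // Buffer rows per partition in process() and write them in one go in
    // close(): the "cache in ForeachWriter, flush on close" idea.
    class BufferedHiveWriter extends ForeachWriter[Row] {
      private val buffer = ArrayBuffer.empty[Row]
      private var partitionId = -1L

      override def open(partitionId: Long, version: Long): Boolean = {
        this.partitionId = partitionId
        buffer.clear()
        // Returning false skips this (partitionId, version) pair, e.g.
        // when it is known to have been written already; returning true
        // processes it.
        true
      }

      override def process(value: Row): Unit = buffer += value

      override def close(errorOrNull: Throwable): Unit = {
        if (errorOrNull == null) flushToHive(partitionId, buffer)
        buffer.clear()
      }

      // Placeholder for the real bulk write, e.g. JDBC to HiveServer2.
      private def flushToHive(partitionId: Long, rows: Seq[Row]): Unit = ()
    }

It would be attached with streamingDf.writeStream.foreach(new BufferedHiveWriter).start(). Note that this holds a whole partition of each micro-batch in executor memory, so it only suits modestly sized batches.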
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Sincerely yours
Egor Pakhomov