Got it, thanks!

2017-02-11 0:56 GMT-08:00 Sam Elamin <hussam.ela...@gmail.com>:

> Here's a link to the thread
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html
>
> On Sat, 11 Feb 2017 at 08:47, Sam Elamin <hussam.ela...@gmail.com> wrote:
>
>> Hey Egor
>>
>>
>> You can use a ForeachWriter or you can write a custom sink. I personally
>> went with a custom sink, since I get a DataFrame per batch:
>>
>> https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala
>>
>> You can have a look at how I implemented something similar to the file
>> sink that, in the event of a failure, skips batches already written.
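>>
>> In case it helps, here's roughly the shape of such a sink (a hedged,
>> untested sketch against Spark 2.x's internal Sink/StreamSinkProvider API,
>> not my actual BigQuery code; the two commit-log helpers are hypothetical
>> placeholders for however you track completed batches):
>>
>> import org.apache.spark.sql.{DataFrame, SQLContext}
>> import org.apache.spark.sql.execution.streaming.Sink
>> import org.apache.spark.sql.sources.StreamSinkProvider
>> import org.apache.spark.sql.streaming.OutputMode
>>
>> class IdempotentSink(path: String) extends Sink {
>>   override def addBatch(batchId: Long, data: DataFrame): Unit = {
>>     // Skip batches that were already written before a failure/restart.
>>     if (batchId <= latestCommittedBatchId(path)) return
>>     // Re-create the DataFrame before writing; the one passed in is tied
>>     // to the streaming query's execution.
>>     val batchDf = data.sparkSession.createDataFrame(data.rdd, data.schema)
>>     batchDf.write.mode("append").parquet(path)   // or any other store
>>     commitBatchId(path, batchId)                 // record the batch as done
>>   }
>>
>>   // Hypothetical helpers: persist the last committed batch id durably.
>>   private def latestCommittedBatchId(path: String): Long = ???
>>   private def commitBatchId(path: String, batchId: Long): Unit = ???
>> }
>>
>> class IdempotentSinkProvider extends StreamSinkProvider {
>>   override def createSink(
>>       sqlContext: SQLContext,
>>       parameters: Map[String, String],
>>       partitionColumns: Seq[String],
>>       outputMode: OutputMode): Sink =
>>     new IdempotentSink(parameters("path"))
>> }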
>>
>>
>> Also have a look at Michael's reply to me a few days ago on exactly the
>> same topic. The email subject was "Structured Streaming: Dropping
>> Duplicates".
>>
>>
>> Regards
>>
>> Sam
>>
>> On Sat, 11 Feb 2017 at 07:59, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> "Something like that" I've never tried it out myself so I'm only
>> guessing having a brief look at the API.
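>>
>> Roughly what I had in mind (an untested sketch; flushToHive is just a
>> placeholder for whatever bulk write you end up doing):
>>
>> import org.apache.spark.sql.{ForeachWriter, Row}
>> import scala.collection.mutable.ArrayBuffer
>>
>> class BufferingWriter extends ForeachWriter[Row] {
>>   // Buffer the rows of one partition/trigger and write them in one go.
>>   private val buffer = ArrayBuffer[Row]()
>>
>>   override def open(partitionId: Long, version: Long): Boolean = {
>>     buffer.clear()
>>     true  // return false to skip this (partition, version) entirely
>>   }
>>
>>   override def process(value: Row): Unit = buffer += value
>>
>>   override def close(errorOrNull: Throwable): Unit = {
>>     if (errorOrNull == null && buffer.nonEmpty) {
>>       flushToHive(buffer)  // hypothetical bulk insert of the buffered rows
>>     }
>>   }
>>
>>   private def flushToHive(rows: Seq[Row]): Unit = ???
>> }
>>
>> // streamingDf.writeStream.foreach(new BufferingWriter).start()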
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sat, Feb 11, 2017 at 1:31 AM, Egor Pahomov <pahomov.e...@gmail.com>
>> wrote:
>> > Jacek, so I create a cache in the ForeachWriter, write to it in every
>> > "process()", and flush it on close? Something like that?
>> >
>> > 2017-02-09 12:42 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
>> >>
>> >> Hi,
>> >>
>> >> Yes, that's ForeachWriter.
>> >>
>> >> Yes, it works element by element. You're looking for something like
>> >> mapPartitions, and ForeachWriter has a partitionId that you could use
>> >> to implement a similar thing.
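>> >>
>> >> The relevant hook is open(partitionId, version): it is called once per
>> >> partition per trigger, so it gives you the per-partition setup point
>> >> that mapPartitions would, and returning false skips a chunk you have
>> >> already written. A bare, untested skeleton:
>> >>
>> >> import org.apache.spark.sql.{ForeachWriter, Row}
>> >>
>> >> val writer = new ForeachWriter[Row] {
>> >>   def open(partitionId: Long, version: Long): Boolean = {
>> >>     // per-partition setup (connection, buffer); return false to skip a
>> >>     // (partitionId, version) pair that was already processed
>> >>     true
>> >>   }
>> >>   def process(value: Row): Unit = { /* handle one element */ }
>> >>   def close(errorOrNull: Throwable): Unit = { /* flush / teardown */ }
>> >> }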
>> >>
>> >> Pozdrawiam,
>> >> Jacek Laskowski
>> >> ----
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
>> >>
>> >>
>> >> On Thu, Feb 9, 2017 at 3:55 AM, Egor Pahomov <pahomov.e...@gmail.com>
>> >> wrote:
>> >> > Jacek, you mean
>> >> >
>> >> > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter
>> >> > ? I do not understand how to use it, since it passes every value
>> >> > separately, not every partition. And adding to a table value by value
>> >> > would not work.
>> >> >
>> >> > 2017-02-07 12:10 GMT-08:00 Jacek Laskowski <ja...@japila.pl>:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Have you considered foreach sink?
>> >> >>
>> >> >> Jacek
>> >> >>
>> >> >> On 6 Feb 2017 8:39 p.m., "Egor Pahomov" <pahomov.e...@gmail.com>
>> wrote:
>> >> >>>
>> >> >>> Hi, I'm thinking of using Structured Streaming instead of the old
>> >> >>> streaming, but I need to be able to save results to a Hive table.
>> >> >>> The documentation for the file sink
>> >> >>> (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
>> >> >>> says: "Supports writes to partitioned tables." But being able to
>> >> >>> write to partitioned directories is not enough to write to the
>> >> >>> table: someone needs to write to the Hive metastore. How can I use
>> >> >>> Structured Streaming and write to a Hive table?
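>> >> >>>
>> >> >>> For context, this is the kind of gap I mean (a sketch with made-up
>> >> >>> paths and table names): the file sink can write partitioned output,
>> >> >>> but the metastore still has to learn about the table and its
>> >> >>> partitions separately, e.g.:
>> >> >>>
>> >> >>> // Streaming side: file sink writing partitioned Parquet.
>> >> >>> val query = events.writeStream
>> >> >>>   .format("parquet")
>> >> >>>   .option("path", "/warehouse/mydb.db/events")
>> >> >>>   .option("checkpointLocation", "/checkpoints/events")
>> >> >>>   .partitionBy("date")
>> >> >>>   .outputMode("append")
>> >> >>>   .start()
>> >> >>>
>> >> >>> // Separately, something has to register the table and refresh its
>> >> >>> // partitions in the metastore:
>> >> >>> spark.sql("""
>> >> >>>   CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events (id BIGINT, payload STRING)
>> >> >>>   PARTITIONED BY (date STRING)
>> >> >>>   STORED AS PARQUET
>> >> >>>   LOCATION '/warehouse/mydb.db/events'
>> >> >>> """)
>> >> >>> spark.sql("MSCK REPAIR TABLE mydb.events")  // pick up new partitions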
>> >> >>>
>> >> >>> --
>> >> >>> Sincerely yours
>> >> >>> Egor Pakhomov
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Sincerely yours
>> >> > Egor Pakhomov
>> >
>> >
>> >
>> >
>> > --
>> > Sincerely yours
>> > Egor Pakhomov
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


-- 
Sincerely yours
Egor Pakhomov
