Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Egor Pahomov
Got it, thanks!

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
Here's a link to the thread: http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-Dropping-Duplicates-td20884.html

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-11 Thread Sam Elamin
Hey Egor, you can use a ForeachWriter or you can write a custom sink. I personally went with a custom sink since I get a DataFrame per batch: https://github.com/samelamin/spark-bigquery/blob/master/src/main/scala/com/samelamin/spark/bigquery/streaming/BigQuerySink.scala You can have a look at
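
For readers skimming the archive: a minimal sketch of the custom-sink shape Sam describes, with one DataFrame arriving per micro-batch. HiveSink and the table name are hypothetical; note that Sink lives in an internal package, which is also what the linked BigQuerySink extends.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    // Hypothetical sink: append each micro-batch to a Hive table.
    class HiveSink(tableName: String) extends Sink {
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // Structured Streaming hands the sink a full DataFrame per batch,
        // so an ordinary batch writer can be reused here.
        data.write.mode("append").saveAsTable(tableName)
      }
    }

Plugging a sink like this into writeStream typically also needs a StreamSinkProvider registered by class name, which is the route the linked project takes (see the sketch further down the thread).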

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-10 Thread Jacek Laskowski
"Something like that" I've never tried it out myself so I'm only guessing having a brief look at the API. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Sat,

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-10 Thread Egor Pahomov
Jacek, so I create a cache in the ForeachWriter, write to it in every process() call, and flush it on close()? Something like that?
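
A sketch of the buffer-and-flush pattern Egor is asking about: accumulate rows in process() and write them out once per partition in close(). The writeBatch helper is hypothetical; a real implementation would append the buffered rows to the target table in one call.

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.sql.{ForeachWriter, Row}

    class BufferingWriter extends ForeachWriter[Row] {
      private val buffer = new ArrayBuffer[Row]()

      override def open(partitionId: Long, version: Long): Boolean = {
        buffer.clear()
        true // true means: go ahead and process this partition
      }

      override def process(value: Row): Unit = buffer += value

      override def close(errorOrException: Throwable): Unit = {
        // Flush once per partition, and only if no error occurred.
        if (errorOrException == null) writeBatch(buffer)
      }

      // Hypothetical helper: append the buffered rows to the target table.
      private def writeBatch(rows: Seq[Row]): Unit = ()
    }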

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-09 Thread Jacek Laskowski
Hi, yes, that's ForeachWriter. Yes, it works element by element. You're looking for mapPartitions, and ForeachWriter has a partitionId that you could use to implement something similar. Regards, Jacek Laskowski

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-08 Thread Egor Pahomov
Jacek, you mean http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.ForeachWriter ? I do not understand how to use it, since it passes every value separately, not every partition. And adding to a table value by value would not work.

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-07 Thread Jacek Laskowski
Hi, have you considered the foreach sink? Jacek
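
For context, wiring the foreach sink up is one call on the writer; a sketch assuming events is the streaming DataFrame and reusing the BufferingWriter sketched earlier in the thread (the checkpoint path is hypothetical):

    val query = events.writeStream
      .foreach(new BufferingWriter)
      .option("checkpointLocation", "hdfs:///data/checkpoints/events")
      .start()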

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
I presume you may be able to implement a custom sink and use df.saveAsTable. The problem is that you will have to handle idempotence / garbage collection yourself, in case your job fails while writing, etc.
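
Burak's idempotence caveat, sketched in code: remember the last committed batchId and skip batches that are replayed after a failure. How lastCommitted survives a driver restart is deliberately left open here; a real sink would have to recover it from durable storage.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class IdempotentHiveSink(tableName: String) extends Sink {
      // Assumption: a real sink would restore this from durable storage on startup.
      @volatile private var lastCommitted: Long = -1L

      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId > lastCommitted) {
          data.write.mode("append").saveAsTable(tableName)
          lastCommitted = batchId
        }
        // else: this batch was already written before a failure, so skip it.
      }
    }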

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Egor Pahomov
I have a stream of files on HDFS with JSON events. I need to convert them to Parquet in real time, process some fields, and store the result in a simple Hive table so people can query it. People might even want to query it with Impala, so it's important that it be a real Hive-metastore-based table. How can I do

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
Hi Egor, Structured Streaming handles all of its metadata itself: which files are actually valid, etc. You may use the "create table" syntax in SQL to treat it like a Hive table, but it will handle all partitioning information in its own metadata log. Is there a specific reason that you want to
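
Burak's "create table" suggestion as a sketch (table name and path are hypothetical); the file sink's own metadata log, not the metastore, stays the source of truth for which files are valid:

    spark.sql("""
      CREATE TABLE events
      USING parquet
      OPTIONS (path '/data/streams/events')
    """)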

[Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Egor Pahomov
Hi, I'm thinking of using Structured Streaming instead of the old streaming API, but I need to be able to save results to a Hive table. The documentation for the file sink (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks) says: "Supports writes to partitioned tables."
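
A sketch of the file-sink route the quoted documentation describes: stream the JSON events and write partitioned Parquet. Paths, schema, and column names are hypothetical.

    import org.apache.spark.sql.types._

    // Streaming JSON sources need an explicit schema.
    val schema = new StructType()
      .add("event_time", TimestampType)
      .add("event_type", StringType)
      .add("date", StringType)

    val events = spark.readStream
      .schema(schema)
      .json("hdfs:///data/incoming/events")

    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/streams/events")
      .option("checkpointLocation", "hdfs:///data/checkpoints/events")
      .partitionBy("date") // "Supports writes to partitioned tables."
      .start()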