You may be able to implement a custom sink and use df.saveAsTable. The problem is that you will have to handle idempotence and garbage collection yourself in case your job fails while writing, etc.
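In case it helps, here is a rough sketch of what such a custom sink could look like in Scala on Spark 2.x. Note that Sink and StreamSinkProvider are internal/evolving APIs, and the class names and the "hive.table" option key below are made up for illustration; you would still need to add the idempotence handling mentioned above.

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Appends every micro-batch to a metastore-backed table via saveAsTable.
class HiveTableSink(tableName: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Spark can re-deliver a batch after a failure, so to be truly
    // idempotent you'd want to record the last committed batchId somewhere
    // and skip batches that were already written.
    // Depending on the Spark version you may also need to re-create the
    // DataFrame (e.g. from data.rdd and data.schema) before writing it.
    data.write.mode(SaveMode.Append).saveAsTable(tableName)
  }
}

class HiveTableSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new HiveTableSink(parameters("hive.table"))
}

// Usage (illustrative):
// df.writeStream
//   .format("com.example.HiveTableSinkProvider")
//   .option("hive.table", "events")
//   .option("checkpointLocation", "/tmp/checkpoints/events")
//   .start()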
On Mon, Feb 6, 2017 at 5:53 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:

> I have a stream of files on HDFS with JSON events. I need to convert it to
> parquet in real time, process some fields, and store it in a simple Hive table so
> people can query it. People might even want to query it with Impala, so
> it's important that it would be a real Hive-metastore-based table. How can I
> do that?
>
> 2017-02-06 14:25 GMT-08:00 Burak Yavuz <brk...@gmail.com>:
>
>> Hi Egor,
>>
>> Structured Streaming handles all of its metadata itself: which files are
>> actually valid, etc. You may use the "create table" syntax in SQL to treat
>> it like a Hive table, but it will handle all partitioning information in
>> its own metadata log. Is there a specific reason that you want to store the
>> information in the Hive Metastore?
>>
>> Best,
>> Burak
>>
>> On Mon, Feb 6, 2017 at 11:39 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>>
>>> Hi, I'm thinking of using Structured Streaming instead of the old streaming,
>>> but I need to be able to save results to a Hive table. The documentation for the file
>>> sink (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
>>> says: "Supports writes to partitioned tables." But being able to write to partitioned
>>> directories is not enough to write to the table: someone needs to write to the Hive
>>> metastore. How can I use Structured Streaming and write to a Hive table?
>>>
>>> --
>>> *Sincerely yours, Egor Pakhomov*
>>
>
> --
> *Sincerely yours, Egor Pakhomov*