You may be able to implement a custom sink and use df.saveAsTable. The problem is that you will have to handle idempotence and garbage collection yourself in case your job fails while writing, etc.
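In case it helps, here is a rough sketch of what such a custom sink could look like in Scala on Spark 2.x. Note that Sink and StreamSinkProvider are internal/evolving APIs, and the class names and the "hive.table" option key below are made up for illustration; you would still need to add the idempotence handling mentioned above.

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Appends every micro-batch to a metastore-backed table via saveAsTable.
class HiveTableSink(tableName: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Spark can re-deliver a batch after a failure, so to be truly
    // idempotent you'd want to record the last committed batchId somewhere
    // and skip batches that were already written.
    // Depending on the Spark version you may also need to re-create the
    // DataFrame (e.g. from data.rdd and data.schema) before writing it.
    data.write.mode(SaveMode.Append).saveAsTable(tableName)
  }
}

class HiveTableSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink =
    new HiveTableSink(parameters("hive.table"))
}

// Usage (illustrative):
// df.writeStream
//   .format("com.example.HiveTableSinkProvider")
//   .option("hive.table", "events")
//   .option("checkpointLocation", "/tmp/checkpoints/events")
//   .start()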
On Mon, Feb 6, 2017 at 5:53 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:

> I have a stream of files on HDFS with JSON events. I need to convert it to
> parquet in real time, process some fields, and store it in a simple Hive table so
> people can query it. People might even want to query it with Impala, so
> it's important that it would be a real Hive-metastore-based table. How can I
> do that?
>
> 2017-02-06 14:25 GMT-08:00 Burak Yavuz <brk...@gmail.com>:
>
>> Hi Egor,
>>
>> Structured Streaming handles all of its metadata itself: which files are
>> actually valid, etc. You may use the "create table" syntax in SQL to treat
>> it like a Hive table, but it will handle all partitioning information in
>> its own metadata log. Is there a specific reason that you want to store the
>> information in the Hive Metastore?
>>
>> Best,
>> Burak
>>
>> On Mon, Feb 6, 2017 at 11:39 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>>
>>> Hi, I'm thinking of using Structured Streaming instead of the old streaming,
>>> but I need to be able to save results to a Hive table. The documentation for the file
>>> sink (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks)
>>> says: "Supports writes to partitioned tables." But being able to write to partitioned
>>> directories is not enough to write to the table: someone needs to write to the Hive
>>> metastore. How can I use Structured Streaming and write to a Hive table?
>>>
>>> --
>>> *Sincerely yours, Egor Pakhomov*
>>
>
> --
> *Sincerely yours, Egor Pakhomov*