Hi Ted and Silvio, thanks for your responses.

Hive has a new API for streaming (
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest)
that takes care of compaction and doesn't require any downtime for the
table. The data is immediately available, and Hive transparently combines
the small files in the background. I was hoping to use this API from within
Spark to mitigate the issue with lots of small files...
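
For reference, here's a rough, untested sketch of what that API looks like
(Scala against hive-hcatalog-streaming; the metastore URI, table and column
names are placeholders):

import scala.collection.JavaConverters._
import org.apache.hive.hcatalog.streaming.{DelimitedInputWriter, HiveEndPoint}

val endPoint = new HiveEndPoint(
  "thrift://metastore-host:9083", "default", "page_views",
  List("2014-11-06").asJava)
// true = create the partition if it doesn't exist yet
val connection = endPoint.newConnection(true)
val writer = new DelimitedInputWriter(Array("user_id", "page_url"), ",", endPoint)

// a TransactionBatch groups several transactions to amortize metastore calls
val txnBatch = connection.fetchTransactionBatch(10, writer)
txnBatch.beginNextTransaction()
txnBatch.write("geovani,http://example.com".getBytes("UTF-8"))
txnBatch.commit()
txnBatch.close()
connection.close()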

Here's my equivalent code for Trident (work in progress):
https://gist.github.com/lgvier/ee28f1c95ac4f60efc3e
Trident will coordinate the transaction and send all the tuples from each
server/partition to your component at once (Stream.partitionPersist). That
is very helpful since Hive expects batches of records instead of one call
for each record.
I had a look at foreachRDD, but from there I still seem to end up handling
one record at a time. I'd like to get all of the stream's records on each
server/partition at once.
For example, if the stream was processed by 3 servers and resulted in 100
records on each server, I'd like to receive 3 calls (one on each server),
each with 100 records. Please let me know if I'm making any sense. I'm
fairly new to Spark.
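
Conceptually, I'm after something like this (untested; writeBatchToHive is
a made-up placeholder for the Hive Streaming calls):

def writeBatchToHive(records: Iterator[String]): Unit = {
  // placeholder: open a Hive Streaming connection, write all the records
  // in a single transaction, then commit
}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // one call per partition, on the executor that holds it,
    // with an iterator over all of that partition's records
    writeBatchToHive(records)
  }
}

Would rdd.foreachPartition inside foreachRDD be the idiomatic way to get that?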

Thank you,
-Geovani



On Thu, Nov 6, 2014 at 9:54 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>  Geovani,
>
>  You can use HiveContext to do inserts into a Hive table in a Streaming
> app just as you would in a batch app. A DStream is really a collection of
> RDDs, so you can run the insert from within foreachRDD. You just have to be
> careful that you’re not creating large numbers of small files, so you may
> want to either increase the duration of your Streaming batches or
> repartition right before you insert. You’ll just need to do some testing
> based on your ingest volume. You may also want to consider streaming into
> another data store, though.
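>
> A minimal, untested sketch of what I mean (Spark 1.1 APIs; the Event case
> class and the "events" table are made up):
>
> import org.apache.spark.streaming.dstream.DStream
> import org.apache.spark.sql.hive.HiveContext
>
> case class Event(userId: String, pageUrl: String)
>
> def insertBatches(stream: DStream[Event]): Unit = {
>   stream.foreachRDD { rdd =>
>     // in a real app you’d reuse one HiveContext rather than one per batch
>     val hiveContext = new HiveContext(rdd.sparkContext)
>     import hiveContext.createSchemaRDD
>     // coalesce first so each streaming batch writes only a few files
>     rdd.coalesce(4).registerTempTable("events_batch")
>     hiveContext.sql("INSERT INTO TABLE events SELECT * FROM events_batch")
>   }
> }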
>
>  Thanks,
> Silvio
>
>   From: Luiz Geovani Vier <lgv...@gmail.com>
> Date: Thursday, November 6, 2014 at 7:46 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Store DStreams into Hive using Hive Streaming
>
>   Hello,
>
> Is there a built-in way or connector to store DStream results into an
> existing Hive ORC table using the Hive/HCatalog Streaming API?
> Otherwise, do you have any suggestions regarding the implementation of
> such a component?
>
> Thank you,
>  -Geovani
>
