Hi Ted and Silvio, thanks for your responses.

Hive has a new API for streaming ingest (https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest) that takes care of compaction and doesn't require any downtime for the table. The data is immediately available, and Hive combines the small files in the background transparently. I was hoping to use this API from within Spark to mitigate the issue with lots of small files...
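To make that concrete, here's a minimal sketch of the ingest flow from that wiki page (the metastore URI, table, and column names below are placeholders, and the target table has to be a bucketed ORC table with transactions enabled):

    import org.apache.hive.hcatalog.streaming.{DelimitedInputWriter, HiveEndPoint}

    // Placeholder connection details -- adjust for your metastore and table.
    val endPoint = new HiveEndPoint(
      "thrift://metastore-host:9083",          // metastore URI
      "default",                               // database
      "alerts",                                // bucketed, transactional ORC table
      java.util.Arrays.asList("2014-11-06"))   // partition values

    val connection = endPoint.newConnection(true)  // true = create partition if missing
    val writer = new DelimitedInputWriter(Array("id", "msg"), ",", endPoint)

    // A transaction batch covers several transactions; each transaction
    // commits a whole group of records at once, which is why batching matters.
    val txnBatch = connection.fetchTransactionBatch(10, writer)
    txnBatch.beginNextTransaction()
    Seq("1,hello", "2,world").foreach(r => txnBatch.write(r.getBytes("UTF-8")))
    txnBatch.commit()
    txnBatch.close()
    connection.close()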
Here's my equivalent code for Trident (work in progress):
https://gist.github.com/lgvier/ee28f1c95ac4f60efc3e
Trident coordinates the transaction and hands all the tuples from each server/partition to your component at once (Stream.partitionPersist). That is very helpful, since Hive expects batches of records rather than one call per record.

I had a look at foreachRDD, but it seems to be invoked for each record. I'd like to get all of the stream's records on each server/partition at once. For example, if the stream were processed by 3 servers and resulted in 100 records on each server, I'd like to receive 3 calls (one on each server), each with 100 records. (I've put a rough sketch of this pattern at the end of this message.) Please let me know if I'm making any sense. I'm fairly new to Spark.

Thank you,
-Geovani

On Thu, Nov 6, 2014 at 9:54 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Geovani,
>
> You can use HiveContext to do inserts into a Hive table in a Streaming
> app just as you would in a batch app. A DStream is really a collection of
> RDDs, so you can run the insert from within foreachRDD. You just have to
> be careful that you're not creating large numbers of small files. So you
> may want to either increase the duration of your Streaming batches or
> repartition right before you insert. You'll just need to do some testing
> based on your ingest volume. You may also want to consider streaming into
> another data store, though.
>
> Thanks,
> Silvio
>
> From: Luiz Geovani Vier <lgv...@gmail.com>
> Date: Thursday, November 6, 2014 at 7:46 PM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Store DStreams into Hive using Hive Streaming
>
> Hello,
>
> Is there a built-in way or a connector to store DStream results into an
> existing Hive ORC table using the Hive/HCatalog Streaming API?
> Otherwise, do you have any suggestions regarding the implementation of
> such a component?
>
> Thank you,
> -Geovani
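As promised above, here's a rough sketch of the per-partition pattern I'm after (DStream[String] and writeBatchToHive are my own placeholders; the helper would wrap the Hive Streaming calls from my earlier sketch):

    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical helper: the real version would open a HiveEndPoint
    // connection and commit the partition's records in one transaction.
    def writeBatchToHive(records: Iterator[String]): Unit =
      records.foreach(r => println(s"would write: $r"))

    def persist(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        // foreachRDD fires once per micro-batch; foreachPartition then gives
        // one call per partition with an iterator over that partition's
        // records -- e.g. 3 partitions of 100 records => 3 calls of 100 each.
        rdd.foreachPartition { records =>
          if (records.nonEmpty) writeBatchToHive(records)
        }
      }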
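And for completeness, my reading of Silvio's HiveContext suggestion, as a sketch (the Alert case class, the "alerts" table, and the repartition factor are made up; it assumes the table already exists in Hive):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.dstream.DStream

    case class Alert(id: Int, msg: String)

    def persistWithHiveContext(stream: DStream[Alert]): Unit =
      stream.foreachRDD { rdd =>
        val hiveContext = new HiveContext(rdd.sparkContext)  // reuse a singleton in practice
        import hiveContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD
        // Repartition first so each batch writes a few larger files
        // instead of one tiny file per task.
        rdd.repartition(4).insertInto("alerts")
      }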