Also, this blog has a picture of what I described with MergeContent: https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
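One caveat worth double checking: if Hive turns out to want one JSON document per line rather than a single array (I believe the JsonSerDe that ships with Hive's hive-hcatalog-core jar reads files that way), you could leave Header and Footer empty in MergeContent and set the Demarcator to a newline, so each merged file is just newline-separated tweets. A sketch of what the Hive side could look like is below.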
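In that case, a table along these lines might work. Treat it as an untested sketch: the columns are just a few of the tweet fields, and hive-hcatalog-core needs to be on Hive's classpath (e.g. via ADD JAR).

-- sketch: external table over raw tweet JSON, one document per line
create external table if not exists tweets_raw (
  id bigint,
  created_at string,
  text string,
  `user` struct<screen_name:string, name:string, followers_count:int>
)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
location '/tmp/tweets_staging';

The SerDe matches column names to the top-level JSON keys and should simply ignore tweet fields that aren't declared.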
-Bryan

On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende <bbe...@gmail.com> wrote:

> Hi Igor,
>
> I don't know that much about Hive, so I can't really say what format it
> needs to be in for Hive to understand it.
>
> If it needs to be a valid array of JSON documents, in MergeContent change
> the Delimiter Strategy to "Text", which means it will use whatever values
> you type directly into Header, Footer, and Demarcator, and then specify
> [ ] , respectively as the values.
>
> That will get you something like this, where {...} are the incoming
> documents:
>
> [
> {...},
> {...}
> ]
>
> -Bryan
>
>
> On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov <igork.ine...@gmail.com>
> wrote:
>
>> Hi Bryan,
>>
>> I am aware of this example. But I want to store the JSON as-is and
>> create an external table, like in this example:
>> http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
>> What I don't know is how to properly merge multiple JSON documents into
>> one file so that Hive can read it properly.
>>
>> On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I believe this example shows an approach to do it (it includes Hive,
>>> even though the title is Solr/banana):
>>>
>>> https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
>>>
>>> The short version is that it extracts several attributes from each
>>> tweet using EvaluateJsonPath, then uses ReplaceText to replace the
>>> FlowFile content with a pipe-delimited string of those attributes, and
>>> then creates a Hive table that knows how to handle that delimiter. With
>>> this approach you don't need to set the header, footer, and demarcator
>>> in MergeContent.
>>>
>>> create table if not exists tweets_text_partition(
>>>   tweet_id bigint,
>>>   created_unixtime bigint,
>>>   created_time string,
>>>   displayname string,
>>>   msg string,
>>>   fulltext string
>>> )
>>> row format delimited fields terminated by "|"
>>> location "/tmp/tweets_staging";
>>>
>>> -Bryan
>>>
>>>
>>> On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov <igork.ine...@gmail.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I want to create the following workflow:
>>>>
>>>> 1. Fetch tweets using the GetTwitter processor.
>>>> 2. Merge tweets into a bigger file using the MergeContent processor.
>>>> 3. Store the merged files in HDFS.
>>>> 4. On the Hadoop/Hive side, create an external table based on these
>>>> tweets.
>>>>
>>>> There are examples of how to do this, but what I am missing is how to
>>>> configure the MergeContent processor: what to set as the header,
>>>> footer, and demarcator. And what to use as the separator on the Hive
>>>> side so that it will split the merged tweets into rows. Hope I
>>>> described myself clearly.
>>>>
>>>> Thanks in advance.