Thanks guys. I think this will work. One thing: the merged file comes out without an extension. How do I add an extension to a merged file?
On Thu, Apr 21, 2016 at 4:42 PM, Simon Ball <sb...@hortonworks.com> wrote:

> For most Hive JSON serdes you are going to want what some people call JSON
> record format. This is essentially a text file with one JSON document per
> line, where each document represents a record with a reasonably consistent
> structure. You can achieve this by ensuring your JSON is not
> pretty-printed (one doc per line) and then just using binary concatenation
> in the MergeContent processor Bryan mentioned.
>
> Simon
>
>
> On 21 Apr 2016, at 22:38, Bryan Bende <bbe...@gmail.com> wrote:
>
> Also, this blog has a picture of what I described with MergeContent:
>
> https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
>
> -Bryan
>
> On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Hi Igor,
>>
>> I don't know that much about Hive, so I can't really say what format it
>> needs to be in for Hive to understand it.
>>
>> If it needs to be a valid array of JSON documents, then in MergeContent
>> change the Delimiter Strategy to "Text", which means it will use whatever
>> values you type directly into Header, Footer, and Demarcator, and then
>> specify [ ] , respectively as the values.
>>
>> That will get you something like this, where {...} are the incoming
>> documents:
>>
>> [
>> {...},
>> {...},
>> ]
>>
>> -Bryan
>>
>>
>> On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov <igork.ine...@gmail.com>
>> wrote:
>>
>>> Hi Bryan,
>>>
>>> I am aware of this example. But I want to store the JSON as-is and
>>> create an external table, like in this example:
>>> http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
>>> What I don't know is how to properly merge multiple JSON documents into
>>> one file in order for Hive to read it properly.
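The two merge layouts described above can be sketched outside NiFi. This is just an illustration of what MergeContent would emit under each configuration (a newline-demarcated "JSON record format" vs. a Header/Footer/Demarcator array), not NiFi code, and the sample tweets are made up:

```python
import json

# Incoming flow files: one compact (non-pretty-printed) JSON document each.
tweets = [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}]
docs = [json.dumps(t) for t in tweets]

# Layout 1: JSON record format (one document per line), which is what most
# Hive JSON serdes expect. Equivalent to MergeContent doing binary
# concatenation with a newline demarcator.
record_format = "\n".join(docs)

# Layout 2: a single valid JSON array, equivalent to MergeContent with
# Delimiter Strategy = "Text", Header "[", Footer "]", Demarcator ",".
array_format = "[" + ",".join(docs) + "]"

# Each line of the record format parses on its own...
for line in record_format.splitlines():
    json.loads(line)

# ...while the array format only parses as one whole document.
print(json.loads(array_format)[1]["msg"])  # -> world
```

The practical difference: a line-oriented serde can split layout 1 into rows directly, whereas layout 2 must be read as a single JSON value.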
>>>
>>> On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I believe this example shows an approach to do it (it includes Hive
>>>> even though the title is Solr/Banana):
>>>>
>>>> https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
>>>>
>>>> The short version is that it extracts several attributes from each
>>>> tweet using EvaluateJsonPath, then uses ReplaceText to replace the
>>>> FlowFile content with a pipe-delimited string of those attributes, and
>>>> then creates a Hive table that knows how to handle that delimiter. With
>>>> this approach you don't need to set the header, footer, and demarcator
>>>> in MergeContent.
>>>>
>>>> create table if not exists tweets_text_partition(
>>>>   tweet_id bigint,
>>>>   created_unixtime bigint,
>>>>   created_time string,
>>>>   displayname string,
>>>>   msg string,
>>>>   fulltext string
>>>> )
>>>> row format delimited fields terminated by "|"
>>>> location "/tmp/tweets_staging";
>>>>
>>>> -Bryan
>>>>
>>>>
>>>> On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov <igork.ine...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I want to create the following workflow:
>>>>>
>>>>> 1. Fetch tweets using the GetTwitter processor.
>>>>> 2. Merge tweets into a bigger file using the MergeContent processor.
>>>>> 3. Store the merged files in HDFS.
>>>>> 4. On the Hadoop/Hive side, create an external table based on these
>>>>> tweets.
>>>>>
>>>>> There are examples of how to do this, but what I am missing is how to
>>>>> configure the MergeContent processor: what to set as header, footer,
>>>>> and demarcator. And what to use on the Hive side as a separator so
>>>>> that it will split the merged tweets into rows. I hope I described it
>>>>> clearly.
>>>>>
>>>>> Thanks in advance.
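As a rough sketch of the EvaluateJsonPath + ReplaceText approach from the thread, the snippet below builds the pipe-delimited row that the Hive DDL above expects. The tweet payload here is a made-up, heavily simplified stand-in (real Twitter JSON has many more fields), and the mapping from JSON paths to columns is an assumption based on the column names, not taken from the linked article:

```python
import json

# Simplified, hypothetical tweet document; field names assumed from the
# columns in the Hive DDL (tweet_id, created_unixtime, created_time, ...).
tweet_json = json.dumps({
    "id": 725000000000000000,
    "timestamp_ms": "1461270000000",
    "created_at": "Thu Apr 21 20:00:00 +0000 2016",
    "user": {"name": "igor"},
    "text": "hello nifi",
})
tweet = json.loads(tweet_json)

# Roughly what EvaluateJsonPath (pull attributes out of the JSON) followed
# by ReplaceText (rewrite the content as a delimited string) produce per
# flow file. MergeContent can then concatenate these rows with newlines.
row = "|".join(str(v) for v in (
    tweet["id"],            # tweet_id
    tweet["timestamp_ms"],  # created_unixtime
    tweet["created_at"],    # created_time
    tweet["user"]["name"],  # displayname
    tweet["text"],          # msg
    tweet["text"],          # fulltext (same as msg in this toy example)
))
print(row)
```

Because `fields terminated by "|"` is set in the table definition, Hive splits each such line into the six columns without any further parsing.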