Also, this blog has a picture of what I described with MergeContent:

https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and

-Bryan

On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende <bbe...@gmail.com> wrote:

> Hi Igor,
>
> I don't know that much about Hive so I can't really say what format it
> needs to be in for Hive to understand it.
>
> If it needs to be a valid array of JSON documents, then in MergeContent
> change the Delimiter Strategy to "Text", which means it will use whatever
> values you type directly into Header, Footer, and Demarcator; specify "[",
> "]", and "," respectively as the values.
>
> That will get you something like this where {...} are the incoming
> documents:
>
> [
> {...},
> {...}
> ]
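The merge described above can be sanity-checked with a short Python sketch (this is not NiFi code, just a simulation of MergeContent's "Text" delimiter strategy; the sample documents are made up):

```python
import json

# MergeContent "Text" delimiter strategy: the merged content is the
# header, then the FlowFile contents joined by the demarcator, then
# the footer. The demarcator goes only *between* documents, so there
# is no trailing comma and the result is a valid JSON array.
header, footer, demarcator = "[", "]", ",\n"

# Hypothetical incoming FlowFile contents (one JSON document each).
flowfiles = [
    '{"id": 1, "msg": "hello"}',
    '{"id": 2, "msg": "world"}',
]

merged = header + demarcator.join(flowfiles) + footer
parsed = json.loads(merged)  # parses cleanly as an array of 2 objects
```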
>
> -Bryan
>
>
> On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov <igork.ine...@gmail.com>
> wrote:
>
>> Hi Brian,
>>
>> I am aware of this example, but I want to store the JSON as-is and
>> create an external table, like in this example:
>> http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
>> What I don't know is how to properly merge multiple JSON documents into
>> one file so that Hive can read it properly.
>>
>> On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I believe this example shows an approach to do it (it includes Hive even
>>> though the title is Solr/banana):
>>>
>>> https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
>>>
>>> The short version is that it extracts several attributes from each tweet
>>> using EvaluateJsonPath, then uses ReplaceText to replace the FlowFile
>>> content with a pipe delimited string of those attributes, and then creates
>>> a Hive table that knows how to handle that delimiter. With this approach
>>> you don't need to set the header, footer, and demarcator in MergeContent.
>>>
>>> create table if not exists tweets_text_partition(
>>> tweet_id bigint,
>>> created_unixtime bigint,
>>> created_time string,
>>> displayname string,
>>> msg string,
>>> fulltext string
>>> )
>>> row format delimited fields terminated by "|"
>>> location "/tmp/tweets_staging";
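The extract-and-replace step described above can be sketched in Python (hypothetical tweet fields; in a real flow this is done declaratively by EvaluateJsonPath and ReplaceText, not by code):

```python
import json

# A made-up tweet payload with roughly the fields the Hive table expects.
tweet_json = json.dumps({
    "id": 123456789,
    "timestamp_ms": "1461271980000",
    "created_at": "Thu Apr 21 20:33:00 +0000 2016",
    "user": {"name": "example_user"},
    "text": "hello from NiFi",
})

doc = json.loads(tweet_json)

# Roughly what EvaluateJsonPath would pull into attributes, which
# ReplaceText then joins with "|" so the delimited Hive table can
# split each line into columns. Note: if any field can contain "|",
# a different delimiter (or escaping) would be needed.
fields = [
    str(doc["id"]),          # tweet_id
    doc["timestamp_ms"],     # created_unixtime
    doc["created_at"],       # created_time
    doc["user"]["name"],     # displayname
    doc["text"],             # msg
    tweet_json,              # fulltext (original JSON)
]
row = "|".join(fields)
```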
>>>
>>> -Bryan
>>>
>>>
>>> On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov <igork.ine...@gmail.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I want to create the following workflow:
>>>>
>>>> 1. Fetch tweets using the GetTwitter processor.
>>>> 2. Merge tweets into a bigger file using the MergeContent processor.
>>>> 3. Store the merged files in HDFS.
>>>> 4. On the Hadoop/Hive side, create an external table based on these
>>>> tweets.
>>>>
>>>> There are examples of how to do this, but what I am missing is how to
>>>> configure the MergeContent processor: what to set as header, footer, and
>>>> demarcator. And what to use on the Hive side as a separator so that it
>>>> will split merged tweets into rows. Hope I described myself clearly.
>>>>
>>>> Thanks in advance.
>>>>
>>>
>>>
>>
>
