That worked. Thank you. On Thu, Apr 21, 2016 at 5:26 PM, Joe Witt <joe.w...@gmail.com> wrote:
> Run the output through UpdateAttribute and put a property on that
> processor with a name of 'filename' and a value of
> '$(unknown).yourextension'
>
> Thanks
> Joe
>
> On Thu, Apr 21, 2016 at 5:24 PM, Igor Kravzov <igork.ine...@gmail.com> wrote:
> > Thanks guys. I think it will work.
> > One thing: the merged file comes out without an extension. How do I add an
> > extension to a merged file?
> >
> > On Thu, Apr 21, 2016 at 4:42 PM, Simon Ball <sb...@hortonworks.com> wrote:
> >>
> >> For most Hive JSON serdes you are going to want what some people call the
> >> JSON record format. This is essentially a text file with one JSON document
> >> per line, where each line represents a record with a reasonably consistent
> >> structure. You can achieve this by ensuring your JSON is not pretty-printed
> >> (one document per line) and then just using binary concatenation in the
> >> MergeContent processor Bryan mentioned.
> >>
> >> Simon
> >>
> >> On 21 Apr 2016, at 22:38, Bryan Bende <bbe...@gmail.com> wrote:
> >>
> >> Also, this blog has a picture of what I described with MergeContent:
> >>
> >> https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and
> >>
> >> -Bryan
> >>
> >> On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende <bbe...@gmail.com> wrote:
> >>>
> >>> Hi Igor,
> >>>
> >>> I don't know that much about Hive, so I can't really say what format it
> >>> needs to be in for Hive to understand it.
> >>>
> >>> If it needs to be a valid array of JSON documents, then in MergeContent
> >>> change the Delimiter Strategy to "Text", which means it will use whatever
> >>> values you type directly into Header, Footer, and Demarcator; specify
> >>> [ ] , respectively as the values.
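[Editor's note: the Header/Footer/Demarcator advice above can be sketched with a small Python snippet that mimics what MergeContent's "Text" delimiter strategy concatenates. The document contents are made-up examples; note that joining this way leaves no comma after the last document.]

```python
import json

# Contents of the flow files coming into MergeContent
# (each one a compact JSON document).
docs = ['{"id": 1}', '{"id": 2}', '{"id": 3}']

# With Delimiter Strategy = "Text", MergeContent emits the Header,
# then the documents separated by the Demarcator, then the Footer.
header, footer, demarcator = "[", "]", ","
merged = header + demarcator.join(docs) + footer

print(merged)  # → [{"id": 1},{"id": 2},{"id": 3}]

# The merged content is a valid JSON array.
assert json.loads(merged) == [{"id": 1}, {"id": 2}, {"id": 3}]
```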
> >>>
> >>> That will get you something like this, where {...} are the incoming
> >>> documents:
> >>>
> >>> [
> >>> {...},
> >>> {...},
> >>> ]
> >>>
> >>> -Bryan
> >>>
> >>> On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov <igork.ine...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi Bryan,
> >>>>
> >>>> I am aware of this example, but I want to store the JSON as it is and
> >>>> create an external table, like in this example:
> >>>> http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
> >>>> What I don't know is how to properly merge multiple JSON documents into
> >>>> one file so that Hive can read it properly.
> >>>>
> >>>> On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende <bbe...@gmail.com> wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> I believe this example shows an approach to do it (it includes Hive,
> >>>>> even though the title is Solr/Banana):
> >>>>> https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
> >>>>>
> >>>>> The short version is that it extracts several attributes from each
> >>>>> tweet using EvaluateJsonPath, then uses ReplaceText to replace the
> >>>>> FlowFile content with a pipe-delimited string of those attributes, and
> >>>>> then creates a Hive table that knows how to handle that delimiter. With
> >>>>> this approach you don't need to set the header, footer, and demarcator
> >>>>> in MergeContent.
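[Editor's note: the pipe-delimited approach Bryan describes can be sketched in Python. The tweet fields below are illustrative assumptions, not the exact attributes the linked article extracts; the column order follows the tweets_text_partition table Bryan shows.]

```python
import json

# A toy tweet; these field names are illustrative, not the full Twitter schema.
tweet = {
    "id": 725000000000000000,
    "timestamp_ms": "1461271920000",
    "created_at": "Thu Apr 21 20:52:00 +0000 2016",
    "user": {"name": "igor"},
    "text": "hello nifi",
}

# EvaluateJsonPath would pull these values into flow file attributes;
# ReplaceText would then rewrite the flow file content as one
# pipe-delimited line in the table's column order.
fields = [
    str(tweet["id"]),       # tweet_id
    tweet["timestamp_ms"],  # created_unixtime
    tweet["created_at"],    # created_time
    tweet["user"]["name"],  # displayname
    tweet["text"],          # msg
    json.dumps(tweet),      # fulltext
]
line = "|".join(fields)
print(line.split("|")[3])  # → igor
```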
> >>>>>
> >>>>> create table if not exists tweets_text_partition(
> >>>>>   tweet_id bigint,
> >>>>>   created_unixtime bigint,
> >>>>>   created_time string,
> >>>>>   displayname string,
> >>>>>   msg string,
> >>>>>   fulltext string
> >>>>> )
> >>>>> row format delimited fields terminated by "|"
> >>>>> location "/tmp/tweets_staging";
> >>>>>
> >>>>> -Bryan
> >>>>>
> >>>>> On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov <igork.ine...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> I want to create the following workflow:
> >>>>>>
> >>>>>> 1. Fetch tweets using the GetTwitter processor.
> >>>>>> 2. Merge tweets into a bigger file using the MergeContent processor.
> >>>>>> 3. Store the merged files in HDFS.
> >>>>>> 4. On the Hadoop/Hive side, create an external table based on these
> >>>>>>    tweets.
> >>>>>>
> >>>>>> There are examples of how to do this, but what I am missing is how to
> >>>>>> configure the MergeContent processor: what to set as the header, footer,
> >>>>>> and demarcator. And what to use on the Hive side as the separator so
> >>>>>> that it will split the merged tweets into rows. I hope I described
> >>>>>> myself clearly.
> >>>>>>
> >>>>>> Thanks in advance.
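[Editor's note: Simon's one-document-per-line "JSON record format" from earlier in the thread can be sketched as follows; the tweet contents are made up for illustration.]

```python
import json

# Three tweets serialized compactly (json.dumps does not pretty-print
# by default), so each document occupies exactly one line.
tweets = [{"id": i, "text": f"tweet {i}"} for i in range(3)]
flowfiles = [json.dumps(t) for t in tweets]

# Binary concatenation with a newline between documents gives the
# one-record-per-line file that per-line Hive JSON serdes expect.
merged = "\n".join(flowfiles) + "\n"

# Each line parses independently, the way a per-line serde would read it.
records = [json.loads(line) for line in merged.splitlines()]
print(records[2]["text"])  # → tweet 2
```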