There aren't any plans, but it is an awesome idea and a great JIRA.

Thanks
Joe

On Sep 22, 2015 9:31 AM, "Jonathan Lyons" <[email protected]> wrote:
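For reference, the kind of per-column schema inference that spark-csv added (and that the thread below discusses adding to NiFi) could be sketched roughly as follows. This is an illustrative Python sketch, not NiFi or spark-csv code; the function names and the int/float/str candidate order are assumptions for the example.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest type (int, float, or str) that fits every value."""
    for candidate in (int, float):
        try:
            for v in values:
                candidate(v)  # raises ValueError if the value does not fit
            return candidate
        except ValueError:
            continue
    return str

def infer_schema(csv_text):
    """Map each header column to an inferred type, spark-csv style."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into per-column value lists
    return {name: infer_type(col) for name, col in zip(header, columns)}
```

For example, `infer_schema("id,score,name\n1,3.5,a\n2,4,b")` infers `int` for `id`, `float` for `score`, and `str` for `name`.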
> Speaking of CSV to JSON conversion, is there any interest in implementing
> schema inference in general, and specifically schema inference for CSV
> files? This is something that was added to spark-csv recently
> (https://github.com/databricks/spark-csv/pull/93). Any thoughts?
>
> On Tue, Sep 22, 2015 at 9:16 AM, Bryan Bende <[email protected]> wrote:
>
>> Andrew,
>>
>> If you are interested in the ExtractText + ReplaceText approach, I posted
>> an example template that shows how to convert a line from a CSV file to a
>> JSON document [1].
>>
>> The first part of the flow is just for testing and generates a flow file
>> with the content set to "a,b,c,d", then ExtractText pulls those values
>> into attributes (csv.1, csv.2, csv.3, csv.4) and ReplaceText uses them to
>> build a JSON document.
>>
>> -Bryan
>>
>> [1] https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
>> (CsvToJson)
>>
>> On Mon, Sep 21, 2015 at 4:40 PM, Bryan Bende <[email protected]> wrote:
>>
>>> Yup, Joe beat me to it, but I was going to suggest those options...
>>>
>>> In the second case, you would probably use SplitText to get each line of
>>> the CSV as a FlowFile, then ExtractText to pull every value of the line
>>> into attributes, then ReplaceText would construct a JSON document using
>>> expression language to access the attributes from ExtractText.
>>>
>>> On Mon, Sep 21, 2015 at 4:33 PM, Joe Witt <[email protected]> wrote:
>>>
>>>> Adam, Bryan,
>>>>
>>>> You could do the CSV to Avro processor and then follow it with the Avro
>>>> to JSON processor. Alternatively, you could use ExtractText to pull the
>>>> fields out as attributes and then use ReplaceText to produce a JSON
>>>> output.
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Mon, Sep 21, 2015 at 4:21 PM, Adam Williams
>>>> <[email protected]> wrote:
>>>> > Bryan,
>>>> >
>>>> > Thanks for the feedback.
>>>> > I stripped the ExtractText and tried routing all unmatched traffic
>>>> > to Mongo as well, hence the CSV import problems. Off the top of my
>>>> > head I do not think MongoDB allows CSV inserts through the Java
>>>> > client; we've always had to work with the JSON/document model for
>>>> > it. For a CSV format, it would have to be similar to this idea:
>>>> > https://github.com/AdoptOpenJDK/javacountdown/blob/master/src/main/java/org/adoptopenjdk/javacountdown/ImportGeoData.java
>>>> >
>>>> > So looking at the other processors in NiFi, is there a way then to
>>>> > move from a CSV format to JSON before putting to Mongo?
>>>> >
>>>> > ________________________________
>>>> > Date: Mon, 21 Sep 2015 16:09:10 -0400
>>>> > Subject: Re: CSV to Mongo
>>>> > From: [email protected]
>>>> > To: [email protected]
>>>> >
>>>> > Adam,
>>>> >
>>>> > I was able to import the full template, thanks. A couple of things...
>>>> >
>>>> > The ExtractText processor works by adding user-defined properties
>>>> > (the + icon in the top-right of the properties window) where the
>>>> > property name is a destination attribute and the value is a regular
>>>> > expression. Right now there weren't any regular expressions defined,
>>>> > so that processor will always route the file to 'unmatched'.
>>>> > Generally you would probably want to route the matched files to the
>>>> > next processor, and then auto-terminate the unmatched relationship
>>>> > (assuming you want to filter out non-matches).
>>>> >
>>>> > Do you know if MongoDB supports inserting a CSV file through their
>>>> > Java client? Do you have similar code that already does this in Storm?
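As a rough illustration of the ExtractText + ReplaceText pattern discussed in this thread, here is a Python sketch of the equivalent transformation: a regex with capture groups plays the role of the ExtractText dynamic properties (yielding attributes like csv.1 through csv.4 for a line "a,b,c,d"), and a dict filled from those captures plays the role of the ReplaceText JSON template. The `field1`..`field4` names are invented for the example, not part of any NiFi template.

```python
import json
import re

# ExtractText analogue: one capture group per CSV value, so a line like
# "a,b,c,d" yields "attributes" csv.1 .. csv.4.
LINE_PATTERN = re.compile(r"^([^,]*),([^,]*),([^,]*),([^,]*)$")

def csv_line_to_json(line):
    """Mimic ExtractText + ReplaceText: capture each field from the line,
    then fill a JSON document template with the captured values."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # analogous to routing to the 'unmatched' relationship
    # ReplaceText analogue; the field names here are placeholders.
    doc = {
        "field1": match.group(1),
        "field2": match.group(2),
        "field3": match.group(3),
        "field4": match.group(4),
    }
    return json.dumps(doc)
```

In a real flow, SplitText would feed one line at a time into this kind of extraction, and the resulting JSON document would be what PutMongo receives.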
>>>> >
>>>> > I am honestly not that familiar with MongoDB, but in the PutMongo
>>>> > processor it takes the incoming data and calls:
>>>> >
>>>> > Document doc = Document.parse(new String(content, charset));
>>>> >
>>>> > Looking at that Document.parse() method, it looks like it expects a
>>>> > JSON document, so I just want to make sure that we expect CSV
>>>> > insertions to work here. In researching this, it looks like Mongo
>>>> > has a bulk import utility (mongoimport) that handles CSV [1], but
>>>> > this is a command line utility.
>>>> >
>>>> > -Bryan
>>>> >
>>>> > [1] http://docs.mongodb.org/manual/reference/program/mongoimport/
>>>> >
>>>> > On Mon, Sep 21, 2015 at 3:19 PM, Adam Williams
>>>> > <[email protected]> wrote:
>>>> >
>>>> > Sorry about that, this should work. Attached the template and the
>>>> > below error:
>>>> >
>>>> > 2015-09-21 14:36:02,821 ERROR [Timer-Driven Process Thread-10]
>>>> > o.a.nifi.processors.mongodb.PutMongo
>>>> > PutMongo[id=480877a4-f349-4ef7-9538-8e3e3e108e06] Failed to insert
>>>> > StandardFlowFileRecord[uuid=bbd7048f-d5a1-4db4-b938-da64b67e810e,claim=org.apache.nifi.controller.repository.claim.StandardContentClaim@8893ae38,offset=0,name=GDELT.MASTERREDUCEDV2.TXT,size=6581409407]
>>>> > into MongoDB due to java.lang.NegativeArraySizeException:
>>>> > java.lang.NegativeArraySizeException
>>>> >
>>>> > ________________________________
>>>> > Date: Mon, 21 Sep 2015 15:12:43 -0400
>>>> > Subject: Re: CSV to Mongo
>>>> > From: [email protected]
>>>> > To: [email protected]
>>>> >
>>>> > Adam,
>>>> >
>>>> > I imported the template and it looks like it only captured the
>>>> > PutMongo processor. Can you try deselecting everything on the graph
>>>> > and creating the template again so we can take a look at the rest of
>>>> > the flow? Or if you have other stuff on your graph, select all of
>>>> > the processors you described so they all get captured.
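For the command-line route Bryan mentions, a typical mongoimport invocation for a CSV file with a header row looks like the following. This is a config fragment, not something from the thread; the database and collection names are placeholders.

```shell
# Import a headered CSV; --headerline takes field names from the first row.
# "mydb" and "gdelt" are placeholder database/collection names.
mongoimport --db mydb --collection gdelt \
    --type csv --headerline \
    --file GDELT.MASTERREDUCEDV2.TXT
```

Note that mongoimport runs outside NiFi, so using it gives up the flow-level provenance and routing that the processor-based approach provides.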
>>>> >
>>>> > Also, can you provide any of the stack trace for the exception you
>>>> > are seeing? The log is in NIFI_HOME/logs/nifi-app.log
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Bryan
>>>> >
>>>> > On Mon, Sep 21, 2015 at 3:03 PM, Bryan Bende <[email protected]>
>>>> > wrote:
>>>> >
>>>> > Adam,
>>>> >
>>>> > Thanks for attaching the template, we will take a look and see what
>>>> > is going on.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Bryan
>>>> >
>>>> > On Mon, Sep 21, 2015 at 2:50 PM, Adam Williams
>>>> > <[email protected]> wrote:
>>>> >
>>>> > Hey Joe,
>>>> >
>>>> > Sure thing. I attached the template; I'm just taking the GDELT data
>>>> > set for the GetFile processor, which works. The error I get is a
>>>> > negative array.
>>>> >
>>>> >> Date: Mon, 21 Sep 2015 14:24:50 -0400
>>>> >> Subject: Re: CSV to Mongo
>>>> >> From: [email protected]
>>>> >> To: [email protected]
>>>> >>
>>>> >> Adam,
>>>> >>
>>>> >> Regarding moving from Storm to NiFi, I'd say they make better
>>>> >> teammates than competitors. The use case outlined above should be
>>>> >> quite easy for NiFi, but there are analytic/processing functions
>>>> >> Storm is probably a better answer for. We're happy to help explore
>>>> >> that with you as you progress.
>>>> >>
>>>> >> If you ever run into an ArrayIndexOutOfBoundsException, then it
>>>> >> will always be 100% a coding error. Would you mind sending your
>>>> >> flow.xml.gz over or making a template of the flow (assuming it
>>>> >> contains nothing sensitive)? If at all possible, sample data which
>>>> >> exposes the issue would be ideal. As an alternative, can you go
>>>> >> ahead and send us the resulting stack trace/error that comes out?
>>>> >>
>>>> >> We'll get this addressed.
>>>> >>
>>>> >> Thanks
>>>> >> Joe
>>>> >>
>>>> >> On Mon, Sep 21, 2015 at 2:17 PM, Adam Williams
>>>> >> <[email protected]> wrote:
>>>> >> > Hello,
>>>> >> >
>>>> >> > I'm moving from Storm to NiFi and trying to do a simple test with
>>>> >> > getting a large CSV file dumped into MongoDB. The CSV file has a
>>>> >> > header with column names and it is structured; my only problem is
>>>> >> > dumping it into MongoDB. At a high level, do the following
>>>> >> > processor steps look correct? All I want is to just pull the
>>>> >> > whole CSV file over to MongoDB without a regex or anything fancy
>>>> >> > (yet). I eventually always seem to hit trouble with array index
>>>> >> > problems with the PutMongo processor:
>>>> >> >
>>>> >> > GetFile --> ExtractText --> RouteOnAttribute (not a null line) -->
>>>> >> > PutMongo
>>>> >> >
>>>> >> > Does that seem to be the right way to do this in NiFi?
>>>> >> >
>>>> >> > Thank you,
>>>> >> > Adam
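Bryan's observation above, that PutMongo's `Document.parse()` expects a JSON document, explains why sending raw CSV content straight to PutMongo cannot work. The point can be illustrated with an analogous parse in Python; this is a sketch, not NiFi code, and the `f1`..`f4` field names are invented for the example.

```python
import json

csv_line = "a,b,c,d"

# A raw CSV line is not valid JSON, so a JSON parser rejects it outright --
# analogous to PutMongo's Document.parse() failing on CSV content.
try:
    json.loads(csv_line)
    parsed = True
except json.JSONDecodeError:
    parsed = False

# After converting the line to a JSON document (the job of the
# ExtractText/ReplaceText or CSV-to-Avro-to-JSON steps), parsing succeeds.
converted = json.dumps(dict(zip(["f1", "f2", "f3", "f4"], csv_line.split(","))))
doc = json.loads(converted)
```

So regardless of which conversion route is chosen, the content handed to PutMongo must already be a JSON document, one per flow file.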
