Re: CSV/delimited to Parquet conversion via Nifi

Matt Burgess Tue, 22 Mar 2016 19:26:06 -0700

I am +1 for the ConvertFormat processor; the user experience is so much 
enhanced by the hands-off conversion. Such a capability might be contingent on 
the "dependent properties" concept (in Jira somewhere).


Also, this processor could get pretty big in terms of footprint, so I'd imagine 
the forthcoming Registry might be a good place for it.

In general, a format translator would probably make for a great Apache project 
:) Martin Fowler has blogged about some ideas like this (w.r.t. abstracting 
translation logic). Tika has done some of this, but AFAIK its focus is on 
extraction, not transformation. In any case, we could certainly capture the idea 
in NiFi.
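
To put rough numbers on Tony's trade-off (quoted below), here is a minimal 
Python sketch. It is not NiFi code: the function names, the READERS/WRITERS 
registry, and the list-of-dicts canonical form are all assumptions made up for 
illustration, and only CSV and JSON are covered because they are in the 
standard library.

```python
import csv
import io
import json
from math import perm


def direct_converter_count(x: int) -> int:
    """Processors needed with one converter per ordered (source, target) pair."""
    return perm(x, 2)  # P(x, 2) = x * (x - 1)


def canonical_converter_count(x: int) -> int:
    """Processors needed with one reader and one writer per format."""
    return 2 * x


# Hypothetical canonical form: a list of dicts, one dict per record.
def read_csv(text: str) -> list:
    return list(csv.DictReader(io.StringIO(text)))


def write_csv(records: list) -> str:
    if not records:
        return ""
    out = io.StringIO()
    # Column order is taken from the first record (an assumption of this sketch).
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical registry: one reader and one writer per supported format.
READERS = {"csv": read_csv, "json": json.loads}
WRITERS = {"csv": write_csv, "json": json.dumps}


def convert(source: str, target: str, payload: str) -> str:
    """Two hops: source -> canonical -> target (the extra transform Tony mentions)."""
    return WRITERS[target](READERS[source](payload))


if __name__ == "__main__":
    for x in (2, 3, 5, 10):
        print(x, direct_converter_count(x), canonical_converter_count(x))
    print(convert("csv", "json", "a,b\n1,2\n"))
```

The two counts are equal at x = 3 (6 each); at x = 10 it is 90 direct 
converters against 20 canonical ones, at the price of every conversion going 
through the intermediate form.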

Regards,
Matt

> On Mar 22, 2016, at 9:52 PM, Edmon Begoli <ebeg...@gmail.com> wrote:
> 
> Good point. 
> 
> I just think that Parquet and ORC are important targets, just as 
> relational/JDBC stores are. 
> 
>> On Tuesday, March 22, 2016, Tony Kurc <trk...@gmail.com> wrote:
>> Interesting question. A couple discussion points: If we start doing a 
>> processor for each of these conversions, it may become unwieldy (P(x,2) 
>> processors, where x is number of data formats?) I'd say maybe a more general 
>> ConvertFormat processor may be appropriate, but then configuration and code 
>> complexity may suffer. If there is a canonical internal data form and a 
>> bunch (2*x) of convertXtocanonical, and convertcanonicaltoX processors, the 
>> flow could get complex and the extra transform could be expensive.
>> 
>>> On Mar 21, 2016 9:39 PM, "Dmitry Goldenberg" <dgoldenberg...@gmail.com> 
>>> wrote:
>>> Since NiFi has ConvertJsonToAvro and ConvertCsvToAvro processors, would it 
>>> make sense to add a feature request for a ConvertJsonToParquet processor 
>>> and a ConvertCsvToParquet processor?
>>> 
>>> - Dmitry
>>> 
>>>> On Mon, Mar 21, 2016 at 9:23 PM, Matt Burgess <mattyb...@gmail.com> wrote:
>>>> Edmon,
>>>> 
>>>> NIFI-1663 [1] was created to add ORC support to NiFi. If you have a target 
>>>> dataset that has been created with Parquet format, I think you can use 
>>>> ConvertCSVtoAvro then StoreInKiteDataset to get flow files in Parquet 
>>>> format into Hive, HDFS, etc. Others in the community know a lot more about 
>>>> the StoreInKiteDataset processor than I do.
>>>> 
>>>> Regards,
>>>> Matt
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/NIFI-1663
>>>> 
>>>>> On Mon, Mar 21, 2016 at 8:25 PM, Edmon Begoli <ebeg...@gmail.com> wrote:
>>>>> 
>>>>> Is there a way to do straight CSV(PSV) to Parquet or ORC conversion via 
>>>>> Nifi, or do I always need to push the data through some of the "data 
>>>>> engines" - Drill, Spark, Hive, etc.?
