Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

2017-09-25 Thread Arun Manivannan
Hi All,

Just raised a PR (https://github.com/apache/nifi/pull/2172) for JIRA
NIFI-4416.

Appreciate your help, Peter and Matt.  Could you please have a quick look
and give your comments?

Joe - Could you also check out the JIRA and let me know if I've committed
some crime.

You guys are the best!

Best Regards,
Arun

On Mon, Sep 25, 2017 at 9:44 AM Arun Manivannan  wrote:

> Thanks a lot, gentlemen. JIRA and PR coming through in a few hours.
>
> On Mon, Sep 25, 2017, 09:07 Matt Burgess  wrote:
>
>> Thanks all, if the PR is available tomorrow I can review as well and
>> merge, but I will be on vacation for a week after that. No pressure :)
>>
>> Regards,
>> Matt
>>
>> > On Sep 24, 2017, at 8:57 PM, Joe Witt  wrote:
>> >
>> > Thanks Arun and Peter.  Getting that resolved will be nice.  The
>> > performance difference of the record reader/writer approach in all
>> > this is pretty fantastic so the more we can do to iron out these sorts
>> > of edges the better.  Thanks!
>> >
>> >> On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks) <
>> pwi...@micron.com> wrote:
>> >> Arun,
>> >>
>> >> I'm also using Ctrl+A as a delimiter and had the same problem.  I
>> >> haven't had time to write up a PR but it looked like a pretty easy fix to
>> >> me too.
>> >>
>> >> I can't merge the change if you submit it, but I'd be happy to review
>> >> it.
>> >>
>> >> --Peter
>> >>
>> >> -----Original Message-----
>> >> From: Arun Manivannan [mailto:a...@arunma.com]
>> >> Sent: Sunday, September 24, 2017 11:17 PM
>> >> To: Dev@nifi.apache.org
>> >> Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter
>> >>
>> >> Hi,
>> >>
>> >> The ConvertCSVToAvro processor has been having performance issues
>> >> while processing files larger than a GB, and I was advised to use
>> >> ConvertRecord, which leverages the RecordReader and Writer. Did some
>> >> tests and it does perform well.
>> >>
>> >> Strangely, the CSVReader doesn't accept a Unicode character as the value
>> >> delimiter - Control-A (\u0001) is the delimiter of my CSV.
>> >>
>> >> Did some analysis and I see that a minor change needs to be made in
>> >> CSVUtils to unescape the delimiter, as ConvertCSVToAvro does, and
>> >> also to modify the SingleCharacterValidator.
>> >>
>> >> Please let me know if you believe this isn't an issue and there's a
>> >> workaround for this. Else, I am more than happy to raise an issue and
>> >> submit a PR for review.
>> >>
>> >> Best Regards,
>> >> Arun
>>
>


Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

2017-09-24 Thread Arun Manivannan
Thanks a lot, gentlemen. JIRA and PR coming through in a few hours.

On Mon, Sep 25, 2017, 09:07 Matt Burgess  wrote:

> Thanks all, if the PR is available tomorrow I can review as well and
> merge, but I will be on vacation for a week after that. No pressure :)
>
> Regards,
> Matt
>
> > On Sep 24, 2017, at 8:57 PM, Joe Witt  wrote:
> >
> > Thanks Arun and Peter.  Getting that resolved will be nice.  The
> > performance difference of the record reader/writer approach in all
> > this is pretty fantastic so the more we can do to iron out these sorts
> > of edges the better.  Thanks!
> >
> >> On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks) <
> pwi...@micron.com> wrote:
> >> Arun,
> >>
> >> I'm also using Ctrl+A as a delimiter and had the same problem.  I
> >> haven't had time to write up a PR but it looked like a pretty easy fix to
> >> me too.
> >>
> >> I can't merge the change if you submit it, but I'd be happy to review
> >> it.
> >>
> >> --Peter
> >>
> >> -----Original Message-----
> >> From: Arun Manivannan [mailto:a...@arunma.com]
> >> Sent: Sunday, September 24, 2017 11:17 PM
> >> To: Dev@nifi.apache.org
> >> Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter
> >>
> >> Hi,
> >>
> >> The ConvertCSVToAvro processor has been having performance issues
> >> while processing files larger than a GB, and I was advised to use
> >> ConvertRecord, which leverages the RecordReader and Writer. Did some
> >> tests and it does perform well.
> >>
> >> Strangely, the CSVReader doesn't accept a Unicode character as the value
> >> delimiter - Control-A (\u0001) is the delimiter of my CSV.
> >>
> >> Did some analysis and I see that a minor change needs to be made in
> >> CSVUtils to unescape the delimiter, as ConvertCSVToAvro does, and
> >> also to modify the SingleCharacterValidator.
> >>
> >> Please let me know if you believe this isn't an issue and there's a
> >> workaround for this. Else, I am more than happy to raise an issue and
> >> submit a PR for review.
> >>
> >> Best Regards,
> >> Arun
>


Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

2017-09-24 Thread Matt Burgess
Thanks all, if the PR is available tomorrow I can review as well and merge, but 
I will be on vacation for a week after that. No pressure :)

Regards,
Matt

> On Sep 24, 2017, at 8:57 PM, Joe Witt  wrote:
> 
> Thanks Arun and Peter.  Getting that resolved will be nice.  The
> performance difference of the record reader/writer approach in all
> this is pretty fantastic so the more we can do to iron out these sorts
> of edges the better.  Thanks!
> 
>> On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks)  
>> wrote:
>> Arun,
>> 
>> I'm also using Ctrl+A as a delimiter and had the same problem.  I haven't 
>> had time to write up a PR but it looked like a pretty easy fix to me too.
>> 
>> I can't merge the change if you submit it, but I'd be happy to review it.
>> 
>> --Peter
>> 
>> -----Original Message-----
>> From: Arun Manivannan [mailto:a...@arunma.com]
>> Sent: Sunday, September 24, 2017 11:17 PM
>> To: Dev@nifi.apache.org
>> Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter
>> 
>> Hi,
>> 
>> The ConvertCSVToAvro processor has been having performance issues while
>> processing files larger than a GB, and I was advised to use
>> ConvertRecord, which leverages the RecordReader and Writer. Did some tests
>> and it does perform well.
>> 
>> Strangely, the CSVReader doesn't accept a Unicode character as the value
>> delimiter - Control-A (\u0001) is the delimiter of my CSV.
>> 
>> Did some analysis and I see that a minor change needs to be made in
>> CSVUtils to unescape the delimiter, as ConvertCSVToAvro does, and also
>> to modify the SingleCharacterValidator.
>> 
>> Please let me know if you believe this isn't an issue and there's a 
>> workaround for this. Else, I am more than happy to raise an issue and submit 
>> a PR for review.
>> 
>> Best Regards,
>> Arun


Re: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

2017-09-24 Thread Joe Witt
Thanks Arun and Peter.  Getting that resolved will be nice.  The
performance difference of the record reader/writer approach in all
this is pretty fantastic so the more we can do to iron out these sorts
of edges the better.  Thanks!

On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks)  wrote:
> Arun,
>
> I'm also using Ctrl+A as a delimiter and had the same problem.  I haven't had 
> time to write up a PR but it looked like a pretty easy fix to me too.
>
> I can't merge the change if you submit it, but I'd be happy to review it.
>
> --Peter
>
> -----Original Message-----
> From: Arun Manivannan [mailto:a...@arunma.com]
> Sent: Sunday, September 24, 2017 11:17 PM
> To: Dev@nifi.apache.org
> Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter
>
> Hi,
>
> The ConvertCSVToAvro processor has been having performance issues while
> processing files larger than a GB, and I was advised to use
> ConvertRecord, which leverages the RecordReader and Writer. Did some tests
> and it does perform well.
>
> Strangely, the CSVReader doesn't accept a Unicode character as the value
> delimiter - Control-A (\u0001) is the delimiter of my CSV.
>
> Did some analysis and I see that a minor change needs to be made in
> CSVUtils to unescape the delimiter, as ConvertCSVToAvro does, and also
> to modify the SingleCharacterValidator.
>
> Please let me know if you believe this isn't an issue and there's a 
> workaround for this. Else, I am more than happy to raise an issue and submit 
> a PR for review.
>
> Best Regards,
> Arun


RE: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

2017-09-24 Thread Peter Wicks (pwicks)
Arun,

I'm also using Ctrl+A as a delimiter and had the same problem.  I haven't had 
time to write up a PR but it looked like a pretty easy fix to me too.

I can't merge the change if you submit it, but I'd be happy to review it.

--Peter

-----Original Message-----
From: Arun Manivannan [mailto:a...@arunma.com] 
Sent: Sunday, September 24, 2017 11:17 PM
To: Dev@nifi.apache.org
Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter

Hi,

The ConvertCSVToAvro processor has been having performance issues while
processing files larger than a GB, and I was advised to use ConvertRecord,
which leverages the RecordReader and Writer. Did some tests and it does
perform well.

Strangely, the CSVReader doesn't accept a Unicode character as the value
delimiter - Control-A (\u0001) is the delimiter of my CSV.

Did some analysis and I see that a minor change needs to be made in
CSVUtils to unescape the delimiter, as ConvertCSVToAvro does, and also to
modify the SingleCharacterValidator.
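
To illustrate the kind of change meant here, a rough Java sketch follows. It is
not the actual NiFi code: the class DelimiterSketch and the methods unescape
and isSingleCharacter are hypothetical names made up for this example, and the
set of escapes handled is an assumption. The idea is that the configured
property value arrives as the six typed characters backslash, u, 0, 0, 0, 1, so
it has to be unescaped before checking that it is a single character.

```java
// Hypothetical sketch only; names are illustrative and not taken from NiFi.
final class DelimiterSketch {

    // Turns Java-style escapes (tab, newline, carriage return, backslash and
    // backslash-u followed by four hex digits) into the characters they denote.
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 >= raw.length()) {
                out.append(c); // ordinary character, or a trailing backslash
                i++;
                continue;
            }
            char next = raw.charAt(i + 1);
            if (next == 'u' && i + 6 <= raw.length()) {
                // Four hex digits name one UTF-16 code unit; bad digits throw
                // NumberFormatException, which the caller below handles.
                out.append((char) Integer.parseInt(raw.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                switch (next) {
                    case 't': out.append('\t'); break;
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case '\\': out.append('\\'); break;
                    default: out.append(c).append(next); // unknown escape: keep as typed
                }
                i += 2;
            }
        }
        return out.toString();
    }

    // Validator-style check: the value is acceptable as a delimiter only if it
    // is exactly one character AFTER unescaping.
    static boolean isSingleCharacter(String configured) {
        try {
            return unescape(configured).length() == 1;
        } catch (NumberFormatException e) {
            return false; // malformed hex digits in a unicode escape
        }
    }

    public static void main(String[] args) {
        String configured = "\\u0001"; // six characters, as typed into a property field
        System.out.println(isSingleCharacter(configured));         // true
        System.out.println((int) unescape(configured).charAt(0));  // 1
    }
}
```

A validator that checks the length of the raw string would reject the
six-character escaped form, which is why both the unescaping and the
single-character check need to change together.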

Please let me know if you believe this isn't an issue and there's a workaround 
for this. Else, I am more than happy to raise an issue and submit a PR for 
review.

Best Regards,
Arun