Re: Delimiter splitting in ExtractText possible?

prabhu Mahendran Thu, 24 Nov 2016 21:51:47 -0800

Hi folks,

@Jason --> Thank you so much for your suggestions; they are really helpful for us.


@Joe-->
I have CSV data that uses ',' as the separator, and I need to move it into SQL
Server.

I just need the quickest way to extract all of the unstructured data, either with a
common regex or by using the delimiter of the CSV file.

So I thought that giving ',' as the delimiter in the ExtractText processor would
split each row into data.1, data.2, ... up to the number of columns in the file.

Jason gave the common regex ([^,]*?),([^,]*),.... to split the data. It
could be useful for me. However, this regex is very expensive for
pattern matching.
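
For reference, the one-group-per-column idea can be tried outside NiFi. The sketch below is plain Python, not an ExtractText configuration; the helper names (build_row_pattern, extract_attributes) and the "data" prefix are made up for illustration, and it assumes the number of columns is known and that fields never contain quoted commas. It shows that [^,]* (unlike .+) also matches empty fields, so a single pattern covers all three sample rows.

```python
import re

def build_row_pattern(num_columns):
    # One capture group per column; [^,]* also matches an empty field.
    # The optional trailing comma tolerates rows such as "1,Siva,22,91230,Londan,".
    groups = ",".join(["([^,]*)"] * num_columns)
    return re.compile("^" + groups + ",?$")

def extract_attributes(line, num_columns, prefix="data"):
    # Returns {"data.1": ..., "data.2": ...}, or {} when the row does not match.
    match = build_row_pattern(num_columns).match(line)
    if match is None:
        return {}
    return {f"{prefix}.{i}": value
            for i, value in enumerate(match.groups(), start=1)}

rows = ["1,Siva,22,91230,Londan,", "2,,23,91231,UK", "3,Greck,22,,US"]
for row in rows:
    print(extract_attributes(row, 5))
```

Each row is still copied field by field into a dictionary here, which is the same "content into attributes" step Jason warned about, so this only illustrates the pattern, not a fix for the memory cost.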

When I use that regex, it sometimes shows a "Java Heap Space" error in all the
ReplaceText and UpdateAttribute processors.

*Is there any way to split the data by using a separator like , or |?*

Every file has some delimiter. If I could give that delimiter in the processor,
it would extract each row into data.1, data.2, ... etc.

So I gave ',' as the value of a new attribute in the ExtractText processor, but it
shows a validation error.

Is there any other way to extract *CSV* data by using the separator of the
file?
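
The behaviour being asked for, splitting on a configurable separator instead of a regex, can be sketched outside NiFi as follows. This is only an illustration using Python's standard csv module, not a NiFi processor setting; split_rows and the data.N key names are hypothetical, and the first line is assumed to be a header. As far as I understand, ExtractText expects a regular expression with at least one capture group, which is probably why a bare ',' fails validation there.

```python
import csv
import io

SAMPLE = """No,Name,Age,PAN,City
1,Siva,22,91230,Londan,
2,,23,91231,UK
3,Greck,22,,US
"""

def split_rows(text, delimiter=","):
    # Stream rows one at a time; csv.reader also handles RFC 4180 style quoting.
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # skip the header line (assumed to be present)
    for fields in reader:
        # Number the fields from 1, mirroring the data.1, data.2, ... naming.
        yield {f"data.{i}": value for i, value in enumerate(fields, start=1)}

for row in split_rows(SAMPLE):
    print(row)
```

Empty columns simply come back as empty strings, and the trailing comma in the first sample row produces an extra empty data.6. Passing delimiter="|" handles pipe-separated files the same way.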







On Wed, Nov 23, 2016 at 8:57 PM, Joe Witt <[email protected]> wrote:

> Jason
>
> That was an excellent response.
>
> Prabhu - I think the question is what would you like to do with the
> data?  Are you going to transform it then send it somewhere?  Do you
> want to be able to filter some rows out then send the rest?  Can you
> describe that part more?
>
> The general pattern here is
>
> It is certainly easy enough to do the two-phase split to maintain
> efficiency
>
> SplitText (500 line chunks for example)
> SplitText (single line chunks)
> ?? - what do you want to accomplish per line?
> ?? - where is the data going?
>
> Thanks
> Joe
>
> On Wed, Nov 23, 2016 at 9:41 AM, Jason Tarasovic
> <[email protected]> wrote:
> > Prabhu,
> >
> > It's possible to do what you're asking but not especially efficient. You can
> > SplitText twice (10,000 and then 1) outputting the header on each and then
> > running the result through ExtractText. Your regex would be something like
> > ([^,]*?),([^,]*),.... so match 0 or more non-comma characters followed by a
> > comma. ExtractText will place the matched capture groups into attributes
> > like you mentioned (data.1->the_captured_text)
> >
> > However, it's not super efficient or at least it hasn't been in my case as
> > you're moving the FlowFile contents into attributes and the attributes are
> > stored in memory so, depending on how large the file is, you *may*
> > experience excessive GC activity or OOM errors.
> >
> > Using InferAvroSchema (if you don't know the schema in advance) and then
> > using ConvertCSVtoAvro may be a better option depending on where the data is
> > ultimately going. One caveat though is that ConvertCSVtoAvro seems to only
> > work with properly quoted and escaped CSV that conforms to RFC 4180.
> >
> > I'm just getting started with NiFi myself so not an expert or anything but I
> > hope that helps.
> >
> > -Jason
> >
> > On Tue, Nov 22, 2016 at 3:34 AM, prabhu Mahendran <[email protected]> wrote:
> >>
> >> Hi All,
> >>
> >> I have unstructured CSV data with comma as the delimiter, containing 100
> >> rows.
> >>
> >> Is it possible to extract the data in the CSV file using comma as the
> >> separator in NiFi processors?
> >>
> >>
> >> See my sample data, 3 of the 100 rows.
> >>
> >> No,Name,Age,PAN,City
> >> 1,Siva,22,91230,Londan,
> >> 2,,23,91231,UK
> >> 3,Greck,22,,US
> >>
> >>
> >> The 1st row has all values, so it can be separated by a "data" attribute
> >> with the regex (.+),(.+),(.+),(.+),(.+) and the row will be split like below:
> >>
> >>                 data.1-->1
> >>                 data.2-->Siva
> >>                 data.3-->22
> >>                 data.4-->91230
> >>                 data.5-->Londan
> >>
> >> But the second row has an empty value in the Name column, so it needs the
> >> regex (.+),,(.+),(.+),(.+) and the row will be split like below:
> >>
> >>                data.1-->2
> >>                data.2-->23
> >>                data.3-->91231
> >>                data.4-->UK
> >>
> >> The third row is similar: its PAN column is empty, so it can be split using
> >> another regex attribute.
> >>
> >> But my problem is that the data now has 100 rows, and in future it may have
> >> another 100 rows, so I would again need to write more regex attributes to
> >> capture the groups.
> >>
> >>
> >> So I thought that if I give comma (,) as a common regex for all rows in the
> >> CSV file, it would split the data into data.1, data.2, ... data.5
> >>
> >> But I get a validation failed error in the Bulletins indicator of the
> >> ExtractText processor.
> >>
> >> So is it possible to split the rows of a CSV file by delimiter?
> >>
> >> Is it possible to write a common regex for all CSV data in ExtractText or
> >> any other processor?
> >>
> >
>
