Hi folks,

@Jason --> Thank you so much for your suggestions; they are really helpful for us.

@Joe --> I have CSV data that uses ',' as the separator, and I need to move it into SQL Server. I just need the quickest way to extract all of the unstructured data, either with a common regex or by using the delimiter of the CSV file. My thinking was that, with ',' as the delimiter, the ExtractText processor would split each row into data.1, data.2, ... up to the number of columns in the file.

Jason gave the common regex ([^,]*?),([^,]*),.... to split the data, and it could be useful for me. However, this regex is very expensive for pattern matching. If I use it, I sometimes get a "Java Heap Space error" in all ReplaceText and UpdateAttribute processors.

*Is there any way to split the data by using a separator like , or |?* Every file has some delimiter, so if I could give the delimiter to a processor, it would extract each row into data.1, data.2, ... etc.

I tried giving ',' as the value of a new attribute in ExtractText, but it shows a validation error. Is there any other way to extract *CSV* data by using the separator of the file?

On Wed, Nov 23, 2016 at 8:57 PM, Joe Witt <[email protected]> wrote:
> Jason
>
> That was an excellent response.
>
> Prabhu - i think the question is what would you like to do with the
> data? Are you going to transform it then send it somewhere? Do you
> want to be able to filter some rows out then send the rest? Can you
> describe that part more?
>
> The general pattern here is
>
> It is certainly easy enough to do the two-phase split to maintain
> efficiency
>
> SplitText (500 line chunks for example)
> SplitText (single line chunks)
> ?? - what do you want to accomplish per line?
> ?? - where is the data going?
>
> Thanks
> Joe
>
> On Wed, Nov 23, 2016 at 9:41 AM, Jason Tarasovic
> <[email protected]> wrote:
> > Prabhu,
> >
> > It's possible to do what you're asking but not especially efficient. You can
> > SplitText twice (10,000 and then 1) outputting the header on each and then
> > running the result through ExtractText. Your regex would be something like
> > ([^,]*?),([^,]*),.... so match 0 or more non-comma characters followed by a
> > comma. ExtractText will place the matched capture groups into attributes
> > like you mentioned (data.1->the_captured_text)
> >
> > However, it's not super efficient or at least it hasn't been in my case as
> > you're moving the FlowFile contents into attributes and the attributes are
> > stored in memory so, depending on how large the file is, you *may*
> > experience excessive GC activity or OOM errors.
> >
> > Using InferAvroSchema (if you don't know the schema in advance) and then
> > using ConvertCSVtoAvro may be a better option depending on where the data is
> > ultimately going. One caveat though is that ConvertCSVtoAvro seems to only
> > work with properly quoted and escaped CSV that conforms to RFC 4180.
> >
> > I'm just getting started with NiFi myself so not an expert or anything but I
> > hope that helps.
> >
> > -Jason
> >
> > On Tue, Nov 22, 2016 at 3:34 AM, prabhu Mahendran <[email protected]>
> > wrote:
> >>
> >> Hi All,
> >>
> >> I have CSV unstructured data with comma as delimiter which contains 100
> >> rows.
> >>
> >> Is it possible to extract the data in the csv file using comma as separator
> >> in nifi processors.
> >>
> >>
> >> See my Sample data 3 from 100 rows.
> >>
> >> No,Name,Age,PAN,City
> >> 1,Siva,22,91230,Londan,
> >> 2,,23,91231,UK
> >> 3,Greck,22,,US
> >>
> >>
> >> In 1st row having all values which can be separated by "data" attribute
> >> having regex (.+),(.+),(.+),(.+),(.+) then row will be split like below..,
> >>
> >> data.1-->1
> >> data.2-->Siva
> >> data.3-->22
> >> data.4-->91230
> >> data.5-->Londan
> >>
> >> But in Second row which having Empty values in Name column can using regex
> >> (.+),,(.+),(.+),(.+) then row will be split like below..,
> >>
> >> data.1-->2
> >> data.2-->23
> >> data.3-->91231
> >> data.4-->UK
> >>
> >> Third row same as PAN Column empty it can able to split using another
> >> regex attribute.
> >>
> >> But my problem is now data having 100 rows. In future this may having
> >> another 100 rows. So again need to write more regex attributes to capture
> >> group wise.
> >>
> >>
> >> So I think i have given comma(,) as common regex for all rows in csv file
> >> then it will split data as into data.1,data.2,...data.5
> >>
> >> But i gets an validation failed error in Bulletins Indicator in
> >> ExtractTextProcessor.
> >>
> >> So is this possible to write delimiter wise splitting of rows in CSV File?
> >>
> >> Is this possible to write common regex for all csv data in ExtractText or
> >> any other processor?
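
For reference, the delimiter-based splitting asked about in this thread can be sketched outside NiFi. The following is only a minimal illustration in plain Python, not a NiFi processor or anything NiFi provides: the function name `split_rows` and the `data.N` keys are hypothetical and simply mirror the attribute names used above. It assumes the whole input fits in memory and that the separator is a single character. Because it splits on the delimiter instead of matching one regex per row shape, empty fields stay in position as empty strings.

```python
import csv
import io

# Sample rows copied from the thread (note the trailing comma on row 1).
SAMPLE = """No,Name,Age,PAN,City
1,Siva,22,91230,Londan,
2,,23,91231,UK
3,Greck,22,,US
"""


def split_rows(text, delimiter=","):
    """Split delimited text into one dict per row, keyed data.1, data.2, ...

    Empty fields are kept as empty strings, so column positions never shift.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [
        {f"data.{i}": value for i, value in enumerate(row, start=1)}
        for row in reader
    ]


rows = split_rows(SAMPLE)
for row in rows[1:]:  # skip the header row
    print(row)
```

With this approach the second sample row keeps an empty `data.2` rather than shifting `23` into it, and the trailing comma on the first row produces an extra empty `data.6`. Passing `delimiter="|"` handles pipe-separated files the same way.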
