Hey there,

I work for Bullero Capital Pvt. Ltd. in New Delhi, a financial services firm specialising in portfolio management. We have recently started building a database in R covering all the stocks we track, so that the data (which is already in our possession) can be processed automatically and analysed for future investment purposes.
A colleague and I are currently at the data-cleaning stage, where we need to organise and format the data the way we want it in the database. The problem lies in the notation and symbols used in the original CSV files obtained from the government website: we have to match keywords approximately (for efficiency) and then extract only the numeric values from the strings in the relevant column of the data frame.

1.) As of now we are looking at using the agrep function to detect and locate the pattern matches, namely DIVIDEND, SPLIT and BONUS.
2.) From there, we want to extract the numeric values associated with these actions into the corresponding columns: BONUS_NUM (numerator of the ratio), BONUS_DEN (denominator of the ratio), SPLIT_NUM (numerator of the ratio), SPLIT_DEN (denominator of the ratio), Final Dividend, Interim Dividend and Special Dividend.

COLUMN PURPOSE
1. DIVIDEND-RE.1/- PER SHARE
2. AGM/DIV-RS.3.50 PER SHARE
3. SPL DIV-RS.2.70 PER SHARE
4. DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4
5. FV SPLIT Rs.10 to RE.1
6. BON 3:2 + SPLT Rs. 5 to Rs.2.5
7. BONUS 4:1
8. DIV:10%

Ex.
DIVIDEND-RE.1/- PER SHARE  ->  FINAL_DIV = 1
AGM/DIV-RS.3.50 PER SHARE  ->  FINAL_DIV = 3.50
SPL DIV-RS.2.70 PER SHARE  ->  SPECIAL DIV = 2.70

Ex.
FV SPLIT Rs.10 to RE.1  ->  SPLIT_NUM = 1, SPLIT_DEN = 10

Ex.
BONUS 4:1  ->  BONUS_NUM = 4, BONUS_DEN = 1

However, the problem is that agrep returns the indices of the matching elements of the vector rather than the positions of the match within each string, which makes it cumbersome to extract the numeric values that follow each match.
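To make the distinction concrete, here is a minimal sketch contrasting agrep() with aregexec() from the utils package, which reports match positions within each string. The helper names locate_fuzzy and numbers_after are hypothetical, the sample strings are taken from the PURPOSE list above, and max.distance = 1 is only an assumed tolerance that would need tuning against the real data; this is a starting point under those assumptions, not a tested solution for the full file.

```r
# Sample values copied from the PURPOSE column listed above.
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "AGM/DIV-RS.3.50 PER SHARE",
  "FV SPLIT Rs.10 to RE.1",
  "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
  "BONUS 4:1"
)

# agrep() answers "which elements match?" (indices into the vector).
bonus_hits <- agrep("BONUS", purpose, max.distance = 1)

# Hypothetical helper: aregexec() answers "where in each string?".
# Returns one row per element; start = -1 marks "no match".
locate_fuzzy <- function(keyword, text, max.distance = 1) {
  m <- aregexec(keyword, text, max.distance = max.distance)
  start <- vapply(m, function(x) as.integer(x[1]), integer(1))
  len <- vapply(m, function(x) {
    ml <- attr(x, "match.length")
    if (is.null(ml)) -1L else as.integer(ml[1])
  }, integer(1))
  data.frame(start = start, length = len)
}

# Hypothetical helper: all numbers that appear after the matched keyword.
numbers_after <- function(text, pos) {
  found <- !is.na(pos$start) & pos$start > 0
  rest <- ifelse(found, substring(text, pos$start + pos$length), "")
  regmatches(rest, gregexpr("[0-9]+(\\.[0-9]+)?", rest))
}

split_pos  <- locate_fuzzy("SPLIT", purpose)
split_nums <- numbers_after(purpose, split_pos)
bonus_nums <- numbers_after(purpose, locate_fuzzy("BONUS", purpose))
```

Going by the SPLIT example above, the first number after the keyword would map to SPLIT_DEN and the second to SPLIT_NUM. Abbreviations such as BON or DIV are further from the full keyword than one edit, so they would probably need their own keywords or a larger max.distance, at the risk of false matches.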
So I want a fuzzy (approximate) matching approach that gives me:
- a check for the presence of SPLIT, DIVIDEND or BONUS
- the index of whichever cell of the PURPOSE column of the data frame the pattern match occurs in
- the position of that pattern within the string, so that the numeric value following the match can be extracted

Basically, is there any way in R to match patterns approximately while also returning the positions of the matches within the respective strings? (Ideally, if the government website changes its notation or symbols in the future, or the format, positioning or spacing of the data changes, the approach should account for those changes automatically.)

I am attaching below the .csv file containing just the column we need to clean, for your convenience. It would be very helpful if we could get some guidance on how to proceed at the earliest.

Regards,
Aarushi Kaushal

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.