Thanks! I ended up using Beautiful Soup to remove the HTML tags and create three lists (article titles, publication dates, and main body text), but I am still facing a problem: the list for the main body is not storing the text properly. Something is wrong with my code for that section, and any comments would be really helpful!
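For reference, here is a minimal sketch of how the three lists could be kept aligned. It is not the pastebin code. It assumes the data is a list of HTML strings, one per article, and that an article's body may be spread over several "story-element-text" divs, which is one possible reason a body list ends up misaligned or incomplete. The class names are taken from the earlier code in this thread; the sample data and the helper name extract_article are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return (title, date, body) for one article's HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1")
    date_tag = soup.find("div", class_="storyPageMetaData-m__publish-time__19bdV")
    # An article may contain several body divs: collect all of them
    # and join them into ONE string, so each article yields one body.
    body_tags = soup.find_all("div", class_="story-element-text")

    title = title_tag.get_text(" ", strip=True) if title_tag else ""
    date = date_tag.get_text(" ", strip=True) if date_tag else ""
    body = "\n".join(tag.get_text(" ", strip=True) for tag in body_tags)
    return title, date, body


# Made-up sample data standing in for the real list of articles.
articles = [
    '<h1>First title</h1>'
    '<div class="storyPageMetaData-m__publish-time__19bdV">Published: 09 Mar 2021, 10:00</div>'
    '<div class="story-element story-element-text"><p>Paragraph one.</p></div>'
    '<div class="story-element story-element-text"><p>Paragraph two.</p></div>',
    '<h1>Second title</h1>'
    '<div class="storyPageMetaData-m__publish-time__19bdV">Published: 10 Mar 2021, 04:00</div>'
    '<div class="story-element story-element-text"><p>Only paragraph.</p></div>',
]

titles, dates, bodies = [], [], []
for html in articles:
    title, date, body = extract_article(html)
    # Exactly one append per list per article keeps the indexes aligned.
    titles.append(title)
    dates.append(date)
    bodies.append(body)

print(len(titles), len(dates), len(bodies))
```

The point of the design is that every article contributes exactly one entry to each list, so titles[i], dates[i] and bodies[i] always refer to the same article. Appending each body div separately would make the body list longer than the other two.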
ListFile Text
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
BeautifulSoup code for removing tags <https://pastebin.com/qvbVMUGD>

On Wed, Mar 10, 2021 at 4:32 AM Dan Ciprus (dciprus) <dcip...@cisco.com> wrote:

> No problem, list just converts everything into plain/txt which is GREAT ! :-)
>
> So without digging deeply into what you need to do: I am assuming that your
> input contains html tags. Why don't you utilize lib like:
> https://pypi.org/project/beautifulsoup4/ instead of doing harakiri with
> parsing data without using regex ? Just a hint ..
>
> On Wed, Mar 10, 2021 at 04:22:19AM +0600, S Monzur wrote:
> > Thank you and apologies! I did not realize how jumbled it was at the
> > receiver's end.
> > The code is now at this site : [1]https://pastebin.com/wSi2xzBh
> > I'm basically trying to do a few things with my code-
> >
> > 1. Extract 3 strings from the text- title, date and main text
> >
> > 2. Remove all tags afterwards
> >
> > 3. Save in a dictionary, with three keys- title, date and bodytext.
> >
> > 4. Remove punctuation and stopwords (I've used a user generated function
> > for that).
> >
> > I've been able to do all of these steps for the file [2]ListFileReduced,
> > as shown in the code (although it's clunky).
> >
> > But, I would like to be able to do it for the other text file: [3]ListFile
> > which has more articles. I used BeautifulSoup to scrape the data from the
> > website, and then generated a list that I saved as a text file.
> >
> > Best,
> > Monzur
> > On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus)
> > <[4]dcip...@cisco.com> wrote:
> >
> > If you could utilized pastebin or similar site to show your code, it
> > would help tremendously since it's an unindented mess now and can not be
> > read easily.
> >
> > On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
> > >Dear List,
> > >
> > >Newbie here. I am trying to loop over a text file to remove html tags,
> > >punctuation marks, stopwords. I have already used Beautiful Soup
> > >(Python v 3.8.3) to scrape the text (newspaper articles) from the site.
> > >It returns a list that I saved as a file. However, I am not sure how to
> > >use a loop in order to process all the items in the text file.
> > >
> > >In the code below I have used listfilereduced.text (containing data from
> > >one news article, link to listfilereduced.txt here
> > ><[5]https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
> > >however I would like to run this code on listfile.text (containing data
> > >from multiple articles, link to listfile.text
> > ><[6]https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
> > >
> > >Any help would be greatly appreciated!
> > >
> > >P.S. The text is in a Non-English script, but the tags are all in
> > >English.
> > >
> > >#The code below is for a textfile containing just one item. I am not
> > >#sure how to tweak this to make it run for listfile.text (which
> > >#contains raw data from multiple articles)
> > >with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
> > >    rawData = my_file.read()
> > >    print(rawData)
> > >
> > >#Separating body text from other data
> > >articleStart = rawData.find("<div class=\"story-element story-element-text\">")
> > >articleData = rawData[:articleStart]
> > >articleBody = rawData[articleStart:]
> > >print(articleData)
> > >print("*******")
> > >print(articleBody)
> > >print("*******")
> > >
> > >#First, I define a function to strip tags from the body text
> > >def stripTags(pageContents):
> > >    insideTag = 0
> > >    text = ''
> > >    for char in pageContents:
> > >        if char == '<':
> > >            insideTag = 1
> > >        elif (insideTag == 1 and char == '>'):
> > >            insideTag = 0
> > >        elif insideTag == 1:
> > >            continue
> > >        else:
> > >            text += char
> > >    return text
> > >
> > >#Calling the function
> > >articleBodyText = stripTags(articleBody)
> > >print(articleBodyText)
> > >
> > >##Isolating article title and publication date
> > >TitleEndLoc = articleData.find("</h1>")
> > >dateStartLoc = articleData.find("<div class=\"storyPageMetaData-m__publish-time__19bdV\">")
> > >dateEndLoc=articleData.find("<div class=\"meta-data-icons storyPageMetaDataIcons-m__icons__3E4Xg\">")
> > >titleString = articleData[:TitleEndLoc]
> > >dateString = articleData[dateStartLoc:dateEndLoc]
> > >
> > >##Call stripTags to clean
> > >articleTitle= stripTags(titleString)
> > >articleDate = stripTags(dateString)
> > >print(articleTitle)
> > >print(articleDate)
> > >
> > >#Cleaning the date a bit more
> > >startLocDate = articleDate.find(":")
> > >endLocDate = articleDate.find(",")
> > >articleDateClean = articleDate[startLocDate+2:endLocDate]
> > >print(articleDateClean)
> > >
> > >#save all this data to a dictionary that saves the title, data and the
> > >#body text
> > >PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
> > >print(PAloTextDict)
> > >
> > >#Normalize text by:
> > >#1. Splitting paragraphs of text into lists of words
> > >articleBodyWordList = articleBodyText.split()
> > >print(articleBodyWordList)
> > >
> > >#2.Removing punctuation and stopwords
> > >from bnlp.corpus import stopwords, punctuations
> > >
> > >#A. Remove punctuation first
> > >listNoPunct = []
> > >for word in articleBodyWordList:
> > >    for mark in punctuations:
> > >        word=word.replace(mark, '')
> > >    listNoPunct.append(word)
> > >print(listNoPunct)
> > >
> > >#B. removing stopwords
> > >banglastopwords = stopwords()
> > >print(banglastopwords)
> > >cleanList=[]
> > >for word in listNoPunct:
> > >    if word in banglastopwords:
> > >        continue
> > >    else:
> > >        cleanList.append(word)
> > >print(cleanList)
> > >--
> > >[7]https://mail.python.org/mailman/listinfo/python-list
> >
> > --
> >
> > Daniel Ciprus .:|:.:|:.
> > CONSULTING ENGINEER.CUSTOMER DELIVERY Cisco Systems Inc.
> >
> > [8]dcip...@cisco.com
> >
> > tel: +1 703 484 0205
> > mob: +1 540 223 7098
> >
> >References
> >
> > Visible links
> > 1. https://pastebin.com/wSi2xzBh
> > 2. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
> > 3. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
> > 4. mailto:dcip...@cisco.com
> > 5. https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing
> > 6. https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing
> > 7. https://mail.python.org/mailman/listinfo/python-list
> > 8. mailto:dcip...@cisco.com