Re: How to loop over a text file (to remove tags and normalize) using Python

2021-03-10 Thread S Monzur
I initially scraped the links using Beautiful Soup, and from those links
downloaded the specific content of the articles I was interested in
(titles, dates, names of contributors, main texts) and stored that
information in a list. I then saved the list to a text file
(https://pastebin.com/8BMi9qjW). I am now trying to remove the HTML tags
from this text file, and am running into the issues mentioned in the
previous post.
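
One way to keep a scraped list reloadable is to save it as JSON rather than as the printed form of a Python list. The sketch below is only an illustration under assumptions: the record layout, the sample values and the helper names save_articles/load_articles are hypothetical and not taken from the pastebin code.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records standing in for the scraped articles.
articles = [
    {"title": "First headline", "date": "09 Mar 2021",
     "body_html": "<p>One.</p><p>Two.</p>"},
    {"title": "Second headline", "date": "10 Mar 2021",
     "body_html": "<p>Three.</p>"},
]


def save_articles(records, path):
    """Write the records as JSON; ensure_ascii=False keeps non-English text readable."""
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf8")


def load_articles(path):
    """Read the records back as a list of dicts, one per article."""
    return json.loads(Path(path).read_text(encoding="utf8"))


# The file location is an arbitrary choice for this example.
path = Path(tempfile.gettempdir()) / "articles.json"
save_articles(articles, path)
restored = load_articles(path)
```

After reloading, each element of restored is one article, so a plain for loop over the list visits the articles one at a time.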



On Wed, Mar 10, 2021 at 3:46 PM Peter Otten <__pete...@web.de> wrote:

> On 10/03/2021 04:35, S Monzur wrote:
> > Thanks! I ended up using beautiful soup to remove the html tags and create
> > three lists (titles of article, publications dates, main body) but am still
> > facing a problem where the list is not properly storing the main body.
> > There is something wrong with my code for that section, and any comment
> > would be really helpful!
> >
> > ListFile Text
> > <https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
>
> How did you create that file?
>
> > BeautifulSoup code for removing tags <https://pastebin.com/qvbVMUGD>
>
> > print(bodytext[0]) # so here, I'm only getting the first paragraph of
> > # the body of the first article, not all of the first article
> >
> > print(bodytext[1]) # here, I'm getting the second paragraph of the first
> > # article, and not the second article
>
> It may help if you process the individual articles with beautiful soup,
> not the whole list at once.
>
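
A minimal sketch of processing the articles one at a time, so that each list entry holds the whole body of one article. It uses the standard library's html.parser in place of Beautiful Soup only to stay self-contained; the sample HTML, the ParagraphCollector class and the body_text helper are hypothetical, and the real pages may be structured differently.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element in one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.paragraphs.append("".join(self._chunks).strip())
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> still belongs to the paragraph.
        if self._in_p:
            self._chunks.append(data)


def body_text(article_html):
    """Return all paragraphs of ONE article joined into a single string."""
    parser = ParagraphCollector()
    parser.feed(article_html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Hypothetical input: one HTML string per article.
articles = [
    "<div><p>First article, paragraph one.</p>"
    "<p>First article, paragraph <b>two</b>.</p></div>",
    "<div><p>Second article, only paragraph.</p></div>",
]

# One parser run per article, so bodytext[i] is the full body of article i.
bodytext = [body_text(html) for html in articles]
```

Collecting paragraphs across the whole list in one pass would instead give one entry per paragraph, which matches the symptom described for bodytext[0] and bodytext[1].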
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: How to loop over a text file (to remove tags and normalize) using Python

2021-03-09 Thread S Monzur
Thanks! I ended up using Beautiful Soup to remove the HTML tags and create
three lists (article titles, publication dates, main body) but am still
facing a problem where the list is not properly storing the main body.
There is something wrong with my code for that section, and any comment
would be really helpful!

 ListFile Text
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
BeautifulSoup code for removing tags <https://pastebin.com/qvbVMUGD>


On Wed, Mar 10, 2021 at 4:32 AM Dan Ciprus (dciprus) wrote:

> No problem, the list just converts everything into plain/txt, which is GREAT! :-)
>
> So without digging deeply into what you need to do: I am assuming that your
> input contains html tags. Why don't you utilize a lib like
> https://pypi.org/project/beautifulsoup4/ instead of doing harakiri with parsing
> the data without using regex? Just a hint ..
>

Re: How to loop over a text file (to remove tags and normalize) using Python

2021-03-09 Thread S Monzur
Thank you and apologies! I did not realize how jumbled it was at the
receiver's end.

The code is now at this site: https://pastebin.com/wSi2xzBh

I'm basically trying to do a few things with my code:

   1. Extract 3 strings from the text: title, date and main text
   2. Remove all tags afterwards
   3. Save in a dictionary, with three keys: title, date and bodytext.
   4. Remove punctuation and stopwords (I've used a user generated function
      for that).
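
Steps 1 to 3 can be wrapped in one function that handles a single article, which is then called once per article in a loop. This is a rough sketch under assumptions: the <h1>/<time> markup, the sample strings and the names strip_tags and process_article are hypothetical and will not match the real pages, and the standard library's html.parser stands in for whichever tag-stripping approach is preferred.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keep the text nodes of a fragment and drop every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Step 2: return the fragment's text with all tags removed."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()


def process_article(raw):
    """Steps 1-3 for ONE article: split, strip tags, build the dictionary."""
    # Hypothetical layout: title ends at </h1>, date ends at </time>,
    # and everything after that is the body.
    title_part, _, rest = raw.partition("</h1>")
    date_part, _, body_part = rest.partition("</time>")
    return {
        "title": strip_tags(title_part),
        "date": strip_tags(date_part),
        "bodytext": strip_tags(body_part),
    }


# Hypothetical input: one raw HTML string per article.
raw_articles = [
    "<h1>First headline</h1><time>09 Mar 2021</time><div><p>Body one.</p></div>",
    "<h1>Second headline</h1><time>10 Mar 2021</time><div><p>Body two.</p></div>",
]

# The loop: one dictionary per article.
records = [process_article(raw) for raw in raw_articles]
```

Step 4 can then run on each record's "bodytext" value inside the same loop.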

I've been able to do all of these steps for the file ListFileReduced
<https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>,
as shown in the code (although it's clunky).

But, I would like to be able to do it for the other text file: ListFile
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>
which has more articles. I used BeautifulSoup to scrape the data from the
website, and then generated a list that I saved as a text file.


Best,

Monzur

On Wed, Mar 10, 2021 at 4:00 AM Dan Ciprus (dciprus) wrote:

> If you could utilize pastebin or a similar site to show your code, it would
> help tremendously, since it's an unindented mess now and cannot be read easily.
>
> On Wed, Mar 10, 2021 at 03:07:14AM +0600, S Monzur wrote:
> >Dear List,
> >
> >Newbie here. I am trying to loop over a text file to remove html tags,
> >punctuation marks, stopwords. I have already used Beautiful Soup (Python v
> >3.8.3) to scrape the text (newspaper articles) from the site. It returns a
> >list that I saved as a file. However, I am not sure how to use a loop in
> >order to process all the items in the text file.
> >
> >In the code below I have used listfilereduced.text (containing data from one
> >news article, link to listfilereduced.txt here
> ><https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
> >however I would like to run this code on listfile.text (containing data from
> >multiple articles, link to listfile.text
> ><https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).
> >
> >
> >Any help would be greatly appreciated!
> >
> >P.S. The text is in a Non-English script, but the tags are all in English.
> >
> >
> >#The code below is for a textfile containing just one item. I am not sure
> >#how to tweak this to make it run for listfile.text (which contains raw data
> >#from multiple articles)
> >with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
> >    rawData = my_file.read()
> >print(rawData)
> >#Separating body text from other data
> >articleStart = rawData.find(" >class=\"story-element story-element-text\">")
> >articleData = rawData[:articleStart]
> >articleBody = rawData[articleStart:]
> >print(articleData)
> >print("***")
> >print(articleBody)
> >print("***")
> >#First, I define a function to strip tags from the body text
> >def stripTags(pageContents):
> >    insideTag = 0
> >    text = ''
> >    for char in pageContents:
> >        if char == '<':
> >            insideTag = 1
> >        elif (insideTag == 1 and char == '>'):
> >            insideTag = 0
> >        elif insideTag == 1:
> >            continue
> >        else:
> >            text += char
> >    return text
> >#Calling the function
> >articleBodyText = stripTags(articleBody)
> >print(articleBodyText)
> >##Isolating article title and publication date
> >TitleEndLoc = articleData.find("")
> >dateStartLoc = articleData.find(" >class=\"storyPageMetaData-m__publish-time__19bdV\">")
> >dateEndLoc=articleData.find(" >storyPageMetaDataIcons-m__icons__3E4Xg\">")
> >titleString = articleData[:TitleEndLoc]
> >dateString = articleData[dateStartLoc:dateEndLoc]
> >##Call stripTags to clean
> >articleTitle= stripTags(titleString)
> >articleDate = stripTags(dateString)
> >print(articleTitle)
> >print(articleDate)
> >#Cleaning the date a bit more
> >startLocDate = articleDate.find(":")
> >endLocDate = articleDate.find(",")
> >articleDateClean = articleDate[startLocDate+2:endLocDate]
> >print(articleDateClean)
> >#save all this data to a dictionary that saves the title, date and the body text
> >PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
> >print(PAloTextDict)
> >#Normalize text by:
> >#1. Splitting paragraphs of text into lists of words
> >articleBodyWordList = articleBodyText.split()
> >print(articleBodyWordList)
> >#2.Removing punctuation and stopwords
> >from bnlp.corpus import stopwords, punctuations
> >#A. Remove punctuation first
> >listNoPunct = []
> >for word in articleBodyWordList:
> >    for mark in punctuations:
> >        word=word.repl

How to loop over a text file (to remove tags and normalize) using Python

2021-03-09 Thread S Monzur
Dear List,

Newbie here. I am trying to loop over a text file to remove html tags,
punctuation marks, stopwords. I have already used Beautiful Soup (Python v
3.8.3) to scrape the text (newspaper articles) from the site. It returns a
list that I saved as a file. However, I am not sure how to use a loop in
order to process all the items in the text file.

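In the code below I have used listfilereduced.text (containing data from one
news article, link to listfilereduced.txt here
<https://drive.google.com/file/d/1ojwN4u8cmh_nUoMJpdZ5ObaGW5URYYj3/view?usp=sharing>),
however I would like to run this code on listfile.text (containing data from
multiple articles, link to listfile.text
<https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing>).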


Any help would be greatly appreciated!

P.S. The text is in a Non-English script, but the tags are all in English.


#The code below is for a textfile containing just one item. I am not sure
#how to tweak this to make it run for listfile.text (which contains raw data
#from multiple articles)
with open('listfilereduced.txt', 'r', encoding='utf8') as my_file:
    rawData = my_file.read()
print(rawData)

#Separating body text from other data
articleStart = rawData.find("")
articleData = rawData[:articleStart]
articleBody = rawData[articleStart:]
print(articleData)
print("***")
print(articleBody)
print("***")

#First, I define a function to strip tags from the body text
def stripTags(pageContents):
    insideTag = 0
    text = ''
    for char in pageContents:
        if char == '<':
            insideTag = 1
        elif (insideTag == 1 and char == '>'):
            insideTag = 0
        elif insideTag == 1:
            continue
        else:
            text += char
    return text

#Calling the function
articleBodyText = stripTags(articleBody)
print(articleBodyText)

##Isolating article title and publication date
TitleEndLoc = articleData.find("")
dateStartLoc = articleData.find("")
dateEndLoc = articleData.find("")
titleString = articleData[:TitleEndLoc]
dateString = articleData[dateStartLoc:dateEndLoc]
##Call stripTags to clean
articleTitle = stripTags(titleString)
articleDate = stripTags(dateString)
print(articleTitle)
print(articleDate)

#Cleaning the date a bit more
startLocDate = articleDate.find(":")
endLocDate = articleDate.find(",")
articleDateClean = articleDate[startLocDate+2:endLocDate]
print(articleDateClean)

#save all this data to a dictionary that saves the title, date and the body text
PAloTextDict = {"Title": articleTitle, "Date": articleDateClean, "Text": articleBodyText}
print(PAloTextDict)

#Normalize text by:
#1. Splitting paragraphs of text into lists of words
articleBodyWordList = articleBodyText.split()
print(articleBodyWordList)

#2.Removing punctuation and stopwords
from bnlp.corpus import stopwords, punctuations
#A. Remove punctuation first
listNoPunct = []
for word in articleBodyWordList:
    for mark in punctuations:
        word = word.replace(mark, '')
    listNoPunct.append(word)
print(listNoPunct)

#B. removing stopwords
banglastopwords = stopwords()
print(banglastopwords)
cleanList = []
for word in listNoPunct:
    if word in banglastopwords:
        continue
    else:
        cleanList.append(word)
print(cleanList)
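
The two normalization loops above could be folded into one reusable function, which makes it easier to call once per article. This is only a sketch: PUNCTUATION and STOPWORDS below are small hypothetical stand-ins so the example runs without bnlp installed, and normalize is an illustrative name; in the real script the values would come from bnlp.corpus as in the code above.

```python
import string

# Hypothetical stand-ins for the bnlp punctuation marks and stopword list.
PUNCTUATION = string.punctuation + "।"
STOPWORDS = {"the", "a", "of"}


def normalize(text, stopwords=STOPWORDS, punctuation=PUNCTUATION):
    """Split text into words, strip punctuation, then drop stopwords."""
    # A translation table that deletes every punctuation character in one pass.
    table = str.maketrans("", "", punctuation)
    words = (word.translate(table) for word in text.split())
    # Words that were pure punctuation become empty strings and are skipped.
    return [word for word in words if word and word not in stopwords]


clean = normalize("the end of a story, finally!")
```

Converting the stopword list to a set keeps the membership test fast when the function runs over many articles.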