How to consider HTML files in Spark

yh18190 Thu, 12 Mar 2015 09:27:23 -0700

Hi. I am fascinated by the Spark framework. I am trying to use PySpark with
BeautifulSoup to parse HTML files, but I am having trouble loading an HTML
file into BeautifulSoup.
Example:

from bs4 import BeautifulSoup

filepath = "file:///path to html directory"

def readhtml(inputhtml):
    # Load the HTML content into BeautifulSoup
    soup = BeautifulSoup(inputhtml, "html.parser")
    return soup

loaddata = sc.textFile(filepath).map(readhtml)


The problem is that Spark treats the loaded file as a text file and processes
it line by line, so readhtml receives one line at a time. I want to load the
entire HTML content of each file into BeautifulSoup for further processing.
Does anyone know how to take a whole HTML file as a single input instead of
processing it line by line?
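
One possible direction, offered as a minimal sketch rather than a confirmed
answer: SparkContext.wholeTextFiles returns one (path, content) pair per file
instead of one record per line, so each record holds a complete document. The
helper names extract_title and load_parsed are hypothetical, and the sketch
assumes bs4 is installed on every worker and that each file is small enough
to fit in memory as a single string.

```python
import sys


def extract_title(record):
    """Parse one (path, html) pair and return (path, title text or None)."""
    # Imported lazily so the import is resolved on the executors,
    # where the function actually runs.
    from bs4 import BeautifulSoup

    path, html = record
    soup = BeautifulSoup(html, "html.parser")
    # get_text() returns a plain string, which serializes cleanly.
    title = soup.title.get_text(strip=True) if soup.title else None
    return path, title


def load_parsed(sc, directory, parse=extract_title):
    """Apply `parse` to every whole file under `directory`.

    wholeTextFiles yields (path, content) pairs, one per file, so
    `parse` sees the entire document rather than a single line.
    """
    return sc.wholeTextFiles(directory).map(parse)


if __name__ == "__main__":
    from pyspark import SparkContext

    # The directory is taken from the command line (an assumption of this
    # sketch), for example a file:/// URI pointing at the HTML directory.
    sc = SparkContext(appName="html-whole-files")
    for path, title in load_parsed(sc, sys.argv[1]).collect():
        print(path, title)
    sc.stop()
```

Passing the parse function as a parameter keeps the Spark wiring separate from
the HTML handling, so a different BeautifulSoup extraction can be swapped in
without touching the loading code.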




