Hi,

I am fascinated by the Spark framework. I am trying to use PySpark with BeautifulSoup to parse HTML files, but I am having trouble loading an HTML file into BeautifulSoup.

Example:

    from bs4 import BeautifulSoup

    filepath = "file:///path to html directory"

    def readhtml(inputhtml):
        # load the HTML content
        soup = BeautifulSoup(inputhtml, "html.parser")
        return soup

    loaddata = sc.textFile(filepath).map(readhtml)
The problem is that Spark treats the loaded file as a text file and processes it line by line, so readhtml receives one line at a time rather than a whole document. I want to load the entire HTML content into BeautifulSoup for further processing.

Does anyone know how to take the whole HTML file as input instead of processing it line by line?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-consider-HTML-files-in-Spark-tp22017.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
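
One possible approach, sketched here rather than tested against this setup: SparkContext.wholeTextFiles reads each file in a directory as a single (path, content) record, so the whole HTML document can be handed to BeautifulSoup in one piece. The helper names parse_html_file and load_titles are hypothetical, the choice to extract only the page title is an illustration, and the sketch assumes bs4 is installed on every worker and that each file fits in memory.

```python
def parse_html_file(path_and_content):
    """Parse one (path, html) record of the kind wholeTextFiles produces."""
    # Imported inside the function so the import is resolved on the worker
    # that runs the task (assumes bs4 is installed there).
    from bs4 import BeautifulSoup

    path, html = path_and_content
    soup = BeautifulSoup(html, "html.parser")
    # Return plain strings instead of the soup object itself, so the
    # resulting RDD elements stay simple to serialize.
    title = soup.title.get_text() if soup.title else None
    return path, title


def load_titles(sc, filepath):
    """Build an RDD of (path, title) pairs, one per HTML file (hypothetical helper)."""
    # wholeTextFiles yields one record per file, not one record per line.
    return sc.wholeTextFiles(filepath).map(parse_html_file)
```

In the PySpark shell, where sc already exists, this would be used as load_titles(sc, filepath).collect(), with filepath pointing at the HTML directory from the question. Any further BeautifulSoup processing would go inside parse_html_file, where the complete document is available.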