Davies Liu created SPARK-3047:
---------------------------------

Summary: Use utf-8 for textFile() by default
Key: SPARK-3047
URL: https://issues.apache.org/jira/browse/SPARK-3047
Project: Spark
Issue Type: Improvement
Reporter: Davies Liu

In Python 2.x, most strings are byte strings (str), which are more efficient than unicode objects in both CPU and memory. UTF-8 is the default encoding.

After disabling the decoding from UTF-8 into unicode in UTF8Deserializer, the total time for a wc (word count) job is reduced by 32% (from 2m17s to 1m34s).

We could also add an argument unicode=False to textFile().
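
A minimal sketch of the idea, for illustration only: it is not the UTF8Deserializer code and does not use Spark. The function `word_count` and its `decode` flag are hypothetical names; `decode=True` stands in for decoding every line into unicode, while `decode=False` keeps the raw UTF-8 bytes as proposed above. It only shows that both paths produce the same counts; any speedup would have to be measured on a real job.

```python
from collections import Counter


def word_count(raw_lines, decode=True):
    """Count whitespace-separated words in UTF-8 encoded lines.

    decode=True  -> decode each line into unicode before splitting.
    decode=False -> split the raw UTF-8 bytes directly (no decode step).
    """
    counts = Counter()
    for line in raw_lines:
        if decode:
            line = line.decode("utf-8")
        counts.update(line.split())
    return counts


# Two UTF-8 encoded input lines, one containing a non-ASCII character.
raw = [u"spark caf\u00e9 spark".encode("utf-8"), b"hello spark"]

as_unicode = word_count(raw, decode=True)   # keys are unicode strings
as_bytes = word_count(raw, decode=False)    # keys are UTF-8 byte strings
```

Skipping the decode step leaves the keys as UTF-8 byte strings, so callers that need unicode have to decode the results themselves, which is presumably why the issue suggests making the behavior selectable through an argument.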