[ https://issues.apache.org/jira/browse/SPARK-26971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776194#comment-16776194 ]
Hyukjin Kwon commented on SPARK-26971:
--------------------------------------

Questions should go to the mailing list rather than being filed as an issue here.

> How to read delimiter (Cedilla) in spark RDD and Dataframes
> -----------------------------------------------------------
>
>                 Key: SPARK-26971
>                 URL: https://issues.apache.org/jira/browse/SPARK-26971
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark
>    Affects Versions: 1.6.0
>            Reporter: Babu
>            Priority: Minor
>
> I am trying to read a cedilla-delimited HDFS text file. I am getting the
> error below; did anyone face a similar issue?
> {{hadoop fs -cat test_file.dat}}
> {{1ÇCelvelandÇOhio}}
> {{2ÇDurhamÇNC}}
> {{3ÇDallasÇTexas}}
> {{>>> rdd = sc.textFile("test_file.dat")}}
> {{>>> rdd.collect()}}
> {{[u'1\xc7Celveland\xc7Ohio', u'2\xc7Durham\xc7NC', u'3Dallas\xc7Texas']}}
> {{>>> rdd.map(lambda p: p.split("\xc7")).collect()}}
> {{UnicodeDecodeError: 'ascii' codec can't decode byte 0xc7 in position 0: ordinal not in range(128)}}
> {{>>> sqlContext.read.format("text").option("delimiter","Ç").option("encoding","ISO-8859").load("/user/cloudera/test_file.dat").show()}}
> |1ÇCelvelandÇOhio|
> {{2ÇDurhamÇNC}}
> {{3DallasÇTexas}}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
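[Editor's note] The error in the quoted question comes from splitting a byte
string on a non-ASCII delimiter under Python 2's default ASCII codec. A minimal
sketch of the fix, in plain Python without Spark (the file contents mirror the
question; in PySpark the same idea would be splitting each line on the unicode
literal u"\xc7", e.g. rdd.map(lambda p: p.split(u"\xc7")), since
SparkContext.textFile already returns unicode by default):

```python
# -*- coding: utf-8 -*-
# The file is ISO-8859-1 encoded, so decode the raw bytes to unicode first,
# then split on the cedilla code point (U+00C7) rather than on a byte string.
raw = b"1\xc7Celveland\xc7Ohio\n2\xc7Durham\xc7NC\n3\xc7Dallas\xc7Texas\n"

rows = [line.split(u"\xc7")                     # split on the cedilla
        for line in raw.decode("iso-8859-1").splitlines()]

for row in rows:
    print(row)
# first row: ['1', 'Celveland', 'Ohio']
```

For the DataFrame path, the {{text}} source ignores a delimiter option; the
CSV reader is the usual choice, e.g.
{{spark.read.csv(path, sep=u"\xc7", encoding="ISO-8859-1")}} on Spark 2.x
(whether this applies on the reporter's 1.6.0 is not confirmed here).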