Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/)?
It has helped me in the past to deal with special characters when using Windows-based CSVs on Linux. (Might not be the solution here, just an FYI :))

> On Feb 20, 2016, at 2:17 PM, Chandeep Singh <c...@chandeep.com> wrote:
>
> Understood. In that case Ted's suggestion to check the length should solve
> the problem.
>
>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>
>> Hi,
>>
>> That is a good question.
>>
>> When data is exported from CSV to Linux, any character that cannot be
>> transformed is replaced by ?. That question mark is not actually the
>> expected "?" :)
>>
>> So the only way I can get rid of it is by dropping the first character using
>> substring(1). I checked that I did the same in Hive SQL.
>>
>> The actual field in the CSV is "£2,500.00", which translates into "?2,500.00".
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus free,
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their
>> employees accept any responsibility.
>>
>> From: Chandeep Singh [mailto:c...@chandeep.com]
>> Sent: 20 February 2016 13:47
>> To: Mich Talebzadeh <m...@peridale.co.uk>
>> Cc: user @spark <user@spark.apache.org>
>> Subject: Re: Checking for null values when mapping
>>
>> Looks like you're using substring just to get rid of the '?'. Why not use
>> replace for that as well? Then you wouldn't run into index out of bounds
>> issues.
>>
>> val a = "?1,187.50"
>> val b = ""
>>
>> println(a.substring(1).replace(",", ""))
>> --> 1187.50
>>
>> println(a.replace("?", "").replace(",", ""))
>> --> 1187.50
>>
>> println(b.replace("?", "").replace(",", ""))
>> --> No error and no visible output (an empty line), since neither '?' nor ',' is present.
>>
>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>
>>> I have a DF like below reading a csv file
>>>
>>> val df = HiveContext.read.format("com.databricks.spark.csv")
>>>   .option("inferSchema", "true")
>>>   .option("header", "true")
>>>   .load("/data/stg/table2")
>>>
>>> val a = df.map(x => (x.getString(0), x.getString(1),
>>>   x.getString(2).substring(1).replace(",", "").toDouble,
>>>   x.getString(3).substring(1).replace(",", "").toDouble,
>>>   x.getString(4).substring(1).replace(",", "").toDouble))
>>>
>>> For most rows I am reading from the csv file the above mapping works fine.
>>> However, at the bottom of the csv there are a couple of empty columns, as below
>>>
>>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>>> [,,,,]
>>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>>> [,,,,]
>>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>>
>>> However, I get
>>>
>>> a.collect.foreach(println)
>>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>
>>> I suspect the cause is the substring operation, say x.getString(2).substring(1),
>>> on empty values, which according to the web will throw this type of error.
>>>
>>> The easiest solution seems to be to check whether x above is not null and
>>> only then do the substring operation. Can this be done without using a UDF?
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
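
On the question in the quoted message about guarding substring(1) against empty values without a UDF: one option is a small plain-Scala helper called from inside the existing map. The following is only a sketch under assumptions. The names AmountParser and parseAmount are made up for illustration and do not come from the thread or from Spark, and it assumes the only characters to strip are the '?' placeholder and the ',' thousands separator.

```scala
import scala.util.Try

object AmountParser {
  // Hypothetical helper, not from the thread: returns None for null, empty,
  // or non-numeric input instead of throwing.
  def parseAmount(raw: String): Option[Double] =
    Option(raw)                                        // null becomes None
      .map(_.replace("?", "").replace(",", "").trim)   // drop placeholder and separators
      .filter(_.nonEmpty)                              // "" (the blank rows) becomes None
      .flatMap(s => Try(s.toDouble).toOption)          // malformed numbers become None
}
```

In the mapping from the original message, each x.getString(n).substring(1).replace(",", "").toDouble could then be swapped for AmountParser.parseAmount(x.getString(n)), followed by either .getOrElse(0.0) or a filter on the rows where it is defined. Which of those is appropriate depends on whether the blank rows should be kept, which the thread does not say.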