Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/)?

It has helped me in the past to deal with special characters when using
Windows-based CSVs on Linux. (Might not be the solution here, just an FYI :))
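
In case it helps, here is a rough sketch of what I suspect is going on with the special character, plus a parse that does not depend on substring(1). It assumes the CSV was saved in a single-byte Windows/Latin-1 encoding and is being read as UTF-8, which I have not verified against your file. parseAmount and CurrencyCleanup are hypothetical names of mine, not anything from Spark or spark-csv.

```scala
import java.nio.charset.StandardCharsets
import scala.util.Try

object CurrencyCleanup {
  // The pound sign (U+00A3) is the single byte 0xA3 in ISO-8859-1.
  val poundLatin1: Array[Byte] =
    "\u00A32,500.00".getBytes(StandardCharsets.ISO_8859_1)

  // Decoding those bytes as UTF-8 turns the lone 0xA3 into U+FFFD, the
  // replacement character, which is often displayed as "?".
  // It is not an ASCII question mark, so replace("?", "") would not match it.
  val misread: String = new String(poundLatin1, StandardCharsets.UTF_8)

  // Hypothetical helper: keep only digits, '.' and '-', then parse.
  // Null, empty, or non-numeric fields give None instead of throwing.
  def parseAmount(raw: String): Option[Double] =
    Option(raw)
      .map(_.filter(c => c.isDigit || c == '.' || c == '-'))
      .filter(_.nonEmpty)
      .flatMap(s => Try(s.toDouble).toOption)

  def main(args: Array[String]): Unit = {
    println(misread.head == '\uFFFD')   // true
    println(parseAmount(misread))       // Some(2500.0)
    println(parseAmount("?1,187.50"))   // Some(1187.5)
    println(parseAmount(""))            // None
    println(parseAmount(null))          // None
  }
}
```

Inside the df.map you could then write something like parseAmount(x.getString(2)).getOrElse(0.0), or filter out the rows where it returns None, depending on how you want the empty rows treated.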

> On Feb 20, 2016, at 2:17 PM, Chandeep Singh <c...@chandeep.com> wrote:
> 
> Understood. In that case Ted’s suggestion to check the length should solve 
> the problem.
> 
>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>> 
>> Hi,
>>  
>> That is a good question.
>>  
>> When data is exported from CSV to Linux, any character that cannot be 
>> converted is replaced by ?. That question mark is not actually the 
>> expected “?” :)
>>  
>> So the only way I can get rid of it is by dropping the first character using 
>> substring(1). I checked, and I did the same in Hive SQL.
>>  
>> The actual field in the CSV is “£2,500.00”, which translates into “?2,500.00”
>>  
>> HTH
>>  
>>  
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>> employees accept any responsibility.
>>  
>>  
>> From: Chandeep Singh [mailto:c...@chandeep.com] 
>> Sent: 20 February 2016 13:47
>> To: Mich Talebzadeh <m...@peridale.co.uk>
>> Cc: user @spark <user@spark.apache.org>
>> Subject: Re: Checking for null values when mapping
>>  
>> Looks like you’re using substring just to get rid of the ‘?’. Why not use 
>> replace for that as well? Then you wouldn’t run into index-out-of-bounds 
>> errors on empty strings.
>>  
>> val a = "?1,187.50"
>> val b = ""
>>  
>> println(a.substring(1).replace(",", ""))
>> --> 1187.50
>>  
>> println(a.replace("?", "").replace(",", ""))
>> --> 1187.50
>>  
>> println(b.replace("?", "").replace(",", ""))
>> --> No error; prints an empty line since neither '?' nor ',' is present.
>>  
>>  
>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>  
>>>  
>>> I have a DF like below reading a csv file
>>>  
>>>  
>>> val df = 
>>> HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", 
>>> "true").option("header", "true").load("/data/stg/table2")
>>>  
>>> val a = df.map(x => (
>>>   x.getString(0),
>>>   x.getString(1),
>>>   x.getString(2).substring(1).replace(",", "").toDouble,
>>>   x.getString(3).substring(1).replace(",", "").toDouble,
>>>   x.getString(4).substring(1).replace(",", "").toDouble))
>>>  
>>>  
>>> For most rows I am reading from the csv file the above mapping works fine. 
>>> However, at the bottom of the csv there are a couple of rows with empty 
>>> columns, as below
>>>  
>>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>>> [,,,,]
>>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>>> [,,,,]
>>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>>  
>>> When I run the following, I get
>>>  
>>> a.collect.foreach(println)
>>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 
>>> 161)
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>  
>>> I suspect the cause is the substring operation, say 
>>> x.getString(2).substring(1), on empty values, which according to the web 
>>> will throw this type of error
>>>  
>>>  
>>> The easiest solution seems to be to check that the value in x above is not 
>>> null before doing the substring operation. Can this be done without using 
>>> a UDF?
>>>  
>>> Thanks
>>>  
>>> Dr Mich Talebzadeh
> 
