Ah. Ok.
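
On the empty-row problem itself: one option without a UDF is to fold the null/length check into a small helper. This is just a sketch (untested against your file, and toDoubleSafe is a name I made up here), building on the replace approach further down the thread:

```scala
// Sketch only: null/empty-safe parsing for the "?1,234.50"-style cells,
// so substring/replace never runs on an empty string.
def toDoubleSafe(s: String): Double =
  if (s == null || s.trim.isEmpty) 0.0
  else s.replace("?", "").replace(",", "").toDouble

// Assuming the df from the snippet below in this thread:
// val a = df.map(x => (x.getString(0), x.getString(1),
//   toDoubleSafe(x.getString(2)), toDoubleSafe(x.getString(3)),
//   toDoubleSafe(x.getString(4))))

println(toDoubleSafe("?1,187.50")) // 1187.5
println(toDoubleSafe(""))          // 0.0
```

Whether empty cells should become 0.0 or be filtered out up front depends on what the downstream sums expect, of course.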

> On Feb 20, 2016, at 2:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Yes I did that as well but no joy. My shell does it for windows files 
> automatically
>  
> Thanks, 
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
>  
> http://talebzadehmich.wordpress.com
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>  
>  
> From: Chandeep Singh [mailto:c...@chandeep.com] 
> Sent: 20 February 2016 14:27
> To: Mich Talebzadeh <m...@peridale.co.uk>
> Cc: user @spark <user@spark.apache.org>
> Subject: Re: Checking for null values when mapping
>  
> Also, have you looked into dos2unix (http://dos2unix.sourceforge.net/)? It has 
> helped me in the past to deal with special characters when using Windows-based 
> CSVs on Linux. (Might not be the solution here.. just an FYI :))
>  
>> On Feb 20, 2016, at 2:17 PM, Chandeep Singh <c...@chandeep.com> wrote:
>>  
>> Understood. In that case Ted’s suggestion to check the length should solve 
>> the problem.
>>  
>>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>  
>>> Hi,
>>>  
>>> That is a good question.
>>>  
>>> When the data is exported from CSV to Linux, any character that cannot be 
>>> converted is replaced by ?. That question mark is not actually a literal 
>>> "?" :)
>>>  
>>> So the only way I can get rid of it is by dropping the first character 
>>> using substring(1). I checked that I did the same in the Hive SQL.
>>>  
>>> The actual field in the CSV is "£2,500.00", which translates into "?2,500.00".
>>>  
>>> HTH
>>>  
>>>  
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>>  
>>> From: Chandeep Singh [mailto:c...@chandeep.com] 
>>> Sent: 20 February 2016 13:47
>>> To: Mich Talebzadeh <m...@peridale.co.uk>
>>> Cc: user @spark <user@spark.apache.org>
>>> Subject: Re: Checking for null values when mapping
>>>  
>>> Looks like you're using substring just to get rid of the '?'. Why not use 
>>> replace for that as well? Then you wouldn't run into index out of bounds 
>>> issues.
>>>  
>>> val a = "?1,187.50"  
>>> val b = ""
>>>  
>>> println(a.substring(1).replace(",", ""))
>>> --> 1187.50
>>>  
>>> println(a.replace("?", "").replace(",", ""))
>>> --> 1187.50
>>>  
>>> println(b.replace("?", "").replace(",", ""))
>>> --> No error / output, since neither "?" nor "," exists.
>>>  
>>>  
>>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>>>  
>>>>  
>>>> I have a DataFrame like the one below, reading a CSV file:
>>>>  
>>>>  
>>>> val df = HiveContext.read.format("com.databricks.spark.csv")
>>>>   .option("inferSchema", "true")
>>>>   .option("header", "true")
>>>>   .load("/data/stg/table2")
>>>>  
>>>> val a = df.map(x => (x.getString(0), x.getString(1),
>>>>   x.getString(2).substring(1).replace(",", "").toDouble,
>>>>   x.getString(3).substring(1).replace(",", "").toDouble,
>>>>   x.getString(4).substring(1).replace(",", "").toDouble))
>>>>  
>>>>  
>>>> For most of the rows read from the CSV file the above mapping works fine. 
>>>> However, at the bottom of the CSV there are a couple of rows with empty 
>>>> columns, as below:
>>>>  
>>>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>>>> [,,,,]
>>>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>>>> [,,,,]
>>>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>>>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>>>  
>>>> However, I get 
>>>>  
>>>> a.collect.foreach(println)
>>>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 
>>>> (TID 161)
>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>>  
>>>> I suspect the cause is the substring operation, e.g. 
>>>> x.getString(2).substring(1), being applied to empty values, which throws 
>>>> this type of error.
>>>>  
>>>>  
>>>> The easiest solution seems to be to check whether x above is not null and 
>>>> do the substring operation. Can this be done without using a UDF?
>>>>  
>>>> Thanks
>>>>  
>>>> Dr Mich Talebzadeh
>>>>  
>>>> LinkedIn  
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>  
>>>>  
>>>> http://talebzadehmich.wordpress.com
>>>>  
