Ah. Ok.
> On Feb 20, 2016, at 2:31 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Yes I did that as well but no joy. My shell does it for windows files
> automatically
>
> Thanks,
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this message
> shall not be understood as given or endorsed by Peridale Technology Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
> From: Chandeep Singh [mailto:c...@chandeep.com]
> Sent: 20 February 2016 14:27
> To: Mich Talebzadeh <m...@peridale.co.uk>
> Cc: user @spark <user@spark.apache.org>
> Subject: Re: Checking for null values when mapping
>
> Also, have you looked into dos2unix (http://dos2unix.sourceforge.net/)?
>
> It has helped me in the past to deal with special characters when using
> Windows-based CSVs on Linux. (Might not be the solution here, just an FYI. :))
>
>> On Feb 20, 2016, at 2:17 PM, Chandeep Singh <c...@chandeep.com
>> <mailto:c...@chandeep.com>> wrote:
>>
>> Understood. In that case Ted’s suggestion to check the length should solve
>> the problem.
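[Editor's note: for reference, the length check suggested above could look like the following minimal sketch in plain Scala; the helper name is illustrative, not from the thread.]

```scala
// Strip the leading '?' artifact only when the string is non-empty,
// avoiding StringIndexOutOfBoundsException on empty cells.
def stripLead(s: String): String =
  if (s.length > 0) s.substring(1) else s

println(stripLead("?1,187.50"))  // prints 1,187.50
println(stripLead(""))           // prints an empty line
```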
>>
>>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh <m...@peridale.co.uk
>>> <mailto:m...@peridale.co.uk>> wrote:
>>>
>>> Hi,
>>>
>>> That is a good question.
>>>
>>> When data is exported from CSV to Linux, any character that cannot be
>>> transformed is replaced by ?. That question mark is not the literal "?"
>>> you would expect. :)
>>>
>>> So the only way I can get rid of it is by dropping the first character
>>> using substring(1). I checked that I did the same in the Hive SQL.
>>>
>>> The actual field in the CSV is "£2,500.00", which translates into "?2,500.00".
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> From: Chandeep Singh [mailto:c...@chandeep.com <mailto:c...@chandeep.com>]
>>> Sent: 20 February 2016 13:47
>>> To: Mich Talebzadeh <m...@peridale.co.uk <mailto:m...@peridale.co.uk>>
>>> Cc: user @spark <user@spark.apache.org <mailto:user@spark.apache.org>>
>>> Subject: Re: Checking for null values when mapping
>>>
>>> Looks like you're using substring just to get rid of the '?'. Why not use
>>> replace for that as well? Then you wouldn't run into index-out-of-bounds
>>> issues.
>>>
>>> val a = "?1,187.50"
>>> val b = ""
>>>
>>> println(a.substring(1).replace(",", ""))
>>> --> 1187.50
>>>
>>> println(a.replace("?", "").replace(",", ""))
>>> --> 1187.50
>>>
>>> println(b.replace("?", "").replace(",", ""))
>>> --> No output, since neither '?' nor ',' exists in the empty string.
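[Editor's note: one caveat worth adding here, not in the original reply: replace on an empty string is harmless, but a later .toDouble on the empty result would still fail, so the downstream conversion needs its own guard. A small sketch:]

```scala
// replace() never throws on an empty string, but toDouble would:
val empty = ""
val cleaned = empty.replace("?", "").replace(",", "")
println(cleaned.isEmpty)  // true
// cleaned.toDouble here would throw java.lang.NumberFormatException
```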
>>>
>>>
>>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh <m...@peridale.co.uk
>>>> <mailto:m...@peridale.co.uk>> wrote:
>>>>
>>>>
>>>> I have a DataFrame, read from a CSV file as below:
>>>>
>>>>
>>>> val df =
>>>> HiveContext.read.format("com.databricks.spark.csv").option("inferSchema",
>>>> "true").option("header", "true").load("/data/stg/table2")
>>>>
>>>> val a = df.map(x => (x.getString(0), x.getString(1),
>>>> x.getString(2).substring(1).replace(",",
>>>> "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble,
>>>> x.getString(4).substring(1).replace(",", "").toDouble))
>>>>
>>>>
>>>> For most rows read from the CSV file, the above mapping works fine.
>>>> However, at the bottom of the CSV there are a couple of empty rows, as below:
>>>>
>>>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>>>> [,,,,]
>>>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>>>> [,,,,]
>>>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>>>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>>>
>>>> However, I get
>>>>
>>>> a.collect.foreach(println)
>>>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0
>>>> (TID 161)
>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>>
>>>> I suspect the cause is a substring operation, say
>>>> x.getString(2).substring(1), on empty values, which according to the web
>>>> will throw this type of error.
>>>>
>>>>
>>>> The easiest solution seems to be to check that the value of x above is not
>>>> null or empty before doing the substring operation. Can this be done
>>>> without using a UDF?
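[Editor's note: one hedged sketch of how this could look without a UDF, assuming the same five-column layout as the map above; toAmount is a hypothetical helper, and 0.0 is an arbitrary default for empty cells.]

```scala
// Guard each cell before substring/toDouble so empty rows like [,,,,]
// no longer throw StringIndexOutOfBoundsException.
def toAmount(s: String): Double =
  if (s == null || s.isEmpty) 0.0
  else s.substring(1).replace(",", "").toDouble

// val a = df.map(x => (x.getString(0), x.getString(1),
//   toAmount(x.getString(2)), toAmount(x.getString(3)), toAmount(x.getString(4))))

println(toAmount("?1,187.50"))  // 1187.5
println(toAmount(""))           // 0.0
```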
>>>>
>>>> Thanks
>>>>
>>>> Dr Mich Talebzadeh
>>>>