Hi Mich,

I guess you could use the nullValue option, setting it to "-" so those
entries are read as null.

If you are reading them in as strings first, then you would hit
https://github.com/apache/spark/pull/14118, which is resolved as of
2.0.1.

Unfortunately, if I recall correctly, this bug also exists in the external
CSV library for strings.

However, it'd be fine if you set the schema explicitly when you load, as
this bug does not exist for floats at least.
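As for checking for rogue data in p(3), a small sketch in plain Scala (no
Spark needed; the helper name toFloatOpt is mine, not an existing API):

```scala
import scala.util.Try

// Hedged sketch: parse a raw CSV field into Option[Float], so a rogue "-"
// becomes None instead of throwing java.lang.NumberFormatException.
def toFloatOpt(s: String): Option[Float] = Try(s.trim.toFloat).toOption

// A normal price parses; the "-" placeholder does not.
val ok    = toFloatOpt("40.56")  // Some(40.56)
val rogue = toFloatOpt("-")      // None
```

With Spark itself, the equivalent would be .option("nullValue", "-") on the
CSV reader plus an explicit schema with FloatType columns, so those cells
come back as null rather than crashing the map.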

I hope this is helpful.

Thanks!

On 28 Sep 2016 7:06 a.m., "Mich Talebzadeh" <mich.talebza...@gmail.com>
wrote:

> Thanks guys
>
> Actually these are the 7 rogue rows. The column with 0 is the Volume
> column, which means there were no trades on those days
>
>
> *cat stock.csv | grep ",0"*
> SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows where there was no trade volume
> that day when cleaning up the csv file
>
> *cat stock.csv | grep -v ",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
> In here I have
>
> p(3).toString.toFloat
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>>
>> I have historical prices for various stocks.
>>
>> Each csv file has 10 years trade one row per each day.
>>
>> These are the columns defined in the class
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>
>> The issue is with Open, High, Low, Close columns that all are defined as
>> Float.
>>
>> Most rows are OK, like the first one below, but the last one, with "-"
>> in columns defined as Float, causes issues
>>
>>   Date     Open High  Low   Close Volume
>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>> 23-Dec-11   -     -    -    40.56 0
>>
>> Because the prices are defined as Float, these rows cause the application
>> to crash
>> scala> val rs = df2.filter(changeToDate("TradeDate") >= monthsago)
>>          .select(changeToDate("TradeDate").as("TradeDate"),
>>            (('Close + 'Open) / 2).as("AverageDailyPrice"),
>>            'Low.as("Day's Low"), 'High.as("Day's High"))
>>          .orderBy("TradeDate").collect
>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>> (TID 260)
>> java.lang.NumberFormatException: For input string: "-"
>>
>>
>> One way is to define the prices as Strings, but that is not
>> meaningful. Alternatively, do the clean-up before putting the csv in
>> HDFS, but that becomes tedious and error-prone.
>>
>> Any ideas will be appreciated.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>>
>
>
