Hi Mich,

if I understood you correctly, you can cast the value to float; it will
yield null if the value is not a valid float:

Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7))
  .toDF("value", "id")
  .createOrReplaceTempView("lines")

spark.sql("SELECT cast(value AS FLOAT) AS value FROM lines").show()

+-----+
|value|
+-----+
| null|
|  1.0|
| null|
|  8.6|
+-----+

After that you can filter the DataFrame for the rows where the cast value is null.
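
For example, a minimal sketch of the same idea in the DataFrame API rather
than SQL (untested; reuses the toy "lines" data from above and assumes a
spark-shell session where the implicits are in scope):

import org.apache.spark.sql.functions.col

val lines = Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7)).toDF("value", "id")

// Rows whose value does not parse as a float: the cast yields null,
// while genuinely missing values are excluded by isNotNull.
val rogue = lines.filter(col("value").cast("float").isNull && col("value").isNotNull)

// Conversely, keep only the rows that parse cleanly.
val clean = lines.filter(col("value").cast("float").isNotNull)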

Regards,
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Wed, Sep 28, 2016, at 10:11, Mich Talebzadeh wrote:
> Thanks all.
>
> This is the csv schema all columns mapped to String
>
> scala> df2.printSchema
> root
>  |-- Stock: string (nullable = true)
>  |-- Ticker: string (nullable = true)
>  |-- TradeDate: string (nullable = true)
>  |-- Open: string (nullable = true)
>  |-- High: string (nullable = true)
>  |-- Low: string (nullable = true)
>  |-- Close: string (nullable = true)
>  |-- Volume: string (nullable = true)
>
> The issue I have can be shown as below:
>
> df2.filter($"Open" === "-").select(changeToDate("TradeDate").as("TradeDate"),
>   'Open, 'High, 'Low, 'Close, 'Volume).show
>
> +----------+----+----+---+-----+------+
> | TradeDate|Open|High|Low|Close|Volume|
> +----------+----+----+---+-----+------+
> |2011-12-23|   -|   -|  -|40.56|     0|
> |2011-04-21|   -|   -|  -|45.85|     0|
> |2010-12-30|   -|   -|  -|38.10|     0|
> |2010-12-23|   -|   -|  -|38.36|     0|
> |2008-04-30|   -|   -|  -|32.39|     0|
> |2008-04-29|   -|   -|  -|33.05|     0|
> |2008-04-28|   -|   -|  -|32.60|     0|
> +----------+----+----+---+-----+------+
>
> Now there are ways of dealing with this. However, the solution has to
> be generic! Checking for a column == "-" is not generic; what if that
> column contained "," or some other rogue value instead?
>
> This is an issue in most databases, specifically when a field is NaN
> (*NaN*, standing for Not a Number, is a numeric data type value
> representing an undefined or unrepresentable value, especially in
> floating-point calculations).
>
> Spark handles this[1]. I am on Spark 2.0.1; see the class
> DataFrameNaFunctions. The simplest option is to drop these rogue rows:
>
> df2.filter($"Open" =!= "-")
>
> However, a better approach would be to use the replace() method or to
> test each column generically, along the lines of the sketch below.
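>
> A minimal sketch of what I mean (untested; my assumption of a generic
> approach, combining the cast trick with DataFrameNaFunctions):
>
> import org.apache.spark.sql.functions.col
>
> val priceCols = Seq("Open", "High", "Low", "Close")
>
> // Cast each price column to float: anything that does not parse
> // ("-", ",", etc.) becomes null, with no hard-coded rogue token.
> val casted = priceCols.foldLeft(df2)((df, c) => df.withColumn(c, col(c).cast("float")))
>
> // DataFrameNaFunctions can then drop (or fill/replace) the null rows.
> val cleaned = casted.na.drop(priceCols)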
>
>
>
>
> There is a method called isnan(). However, it does not return the
> correct rows!
>
> df2.filter(isnan($"Open")).show
>
> +-----+------+---------+----+----+---+-----+------+
> |Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
> +-----+------+---------+----+----+---+-----+------+
> +-----+------+---------+----+----+---+-----+------+
>
>
> Any suggestions?
>
> Thanks
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which
> may arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>
>
>
>
> On 28 September 2016 at 04:07, Mike Metzger
> <m...@flexiblecreations.com> wrote:
>> Hi Mich -
>>
>>    Can you run a filter on df1, prior to your map, keeping only the
>>    rows where p(3).toString != '-', and then run your map command?
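>>
>>    Something along these lines, perhaps (untested sketch; assumes df1
>>    and the columns case class from your code below):
>>
>>    // Keep only the rows whose Open field (p(3)) is not the "-"
>>    // placeholder, then run the existing map on the survivors.
>>    val df2 = df1.filter(p => p(3).toString != "-")
>>      .map(p => columns(p(0).toString, p(1).toString, p(2).toString,
>>        p(3).toString.toFloat, p(4).toString.toFloat, p(5).toString.toFloat,
>>        p(6).toString.toFloat, p(7).toString.toInt))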
>>
>> Thanks
>>
>>
>> Mike
>>
>>
>> On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>> Thanks guys
>>>
>>> Actually these are the 7 rogue rows. The last column is the Volume
>>> column; a 0 there means there were no trades on those days.
>>>
>>> cat stock.csv | grep ",0"
>>>
>>> SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
>>> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
>>> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
>>> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
>>> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
>>> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
>>> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>>>
>>> So one way would be to exclude the rows where there was no trading
>>> volume that day when cleaning up the csv file:
>>>
>>> cat stock.csv | grep -v ",0"
>>>
>>> and that works, bearing in mind that putting 0s in place of "-" would
>>> skew the price plot.
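>>>
>>> (The same exclusion can also be done after loading, leaving the csv
>>> untouched; a small untested sketch, assuming Volume is still a string
>>> column at this point:)
>>>
>>> // Drop the zero-volume rows in Spark rather than with grep.
>>> val traded = df1.filter($"Volume" =!= "0")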
>>>
>>> BTW I am using Spark csv as well
>>>
>>> val df1 = spark.read.option("header", true).csv(location)
>>>
>>> This is the class and the mapping
>>>
>>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>>>   Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>>
>>> val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString,
>>>   p(3).toString.toFloat, p(4).toString.toFloat, p(5).toString.toFloat,
>>>   p(6).toString.toFloat, p(7).toString.toInt))
>>>
>>>
>>>
>>> In here I have
>>>
>>> p(3).toString.toFloat
>>>
>>>
>>> How can one check for rogue data in p(3)?
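>>>
>>> (One possibility might be a safe parse, roughly like this untested
>>> sketch; toFloatOpt is a hypothetical helper, not an existing API:)
>>>
>>> import scala.util.Try
>>>
>>> // None when the string is not a valid float ("-", ",", ...).
>>> def toFloatOpt(s: String): Option[Float] = Try(s.trim.toFloat).toOption
>>>
>>> // flatMap keeps only the rows where every price field parses;
>>> // Volume is assumed to be a clean integer here.
>>> val df2 = df1.flatMap { p =>
>>>   for {
>>>     open  <- toFloatOpt(p(3).toString)
>>>     high  <- toFloatOpt(p(4).toString)
>>>     low   <- toFloatOpt(p(5).toString)
>>>     close <- toFloatOpt(p(6).toString)
>>>   } yield columns(p(0).toString, p(1).toString, p(2).toString,
>>>     open, high, low, close, p(7).toString.toInt)
>>> }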
>>>
>>> Thanks
>>>
>>>
>>> On 27 September 2016 at 21:49, Mich Talebzadeh
>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>> I have historical prices for various stocks.
>>>>
>>>> Each csv file has 10 years of trades, one row per day.
>>>>
>>>> These are the columns defined in the class
>>>>
>>>> case class columns(Stock: String, Ticker: String, TradeDate:
>>>> String, Open: Float, High: Float, Low: Float, Close: Float, Volume:
>>>> Integer)
>>>>
>>>> The issue is with the Open, High, Low and Close columns, which are
>>>> all defined as Float.
>>>>
>>>> Most rows are OK, like the first one below, but the second one, where
>>>> the prices are "-", causes issues:
>>>>
>>>> Date      Open  High  Low   Close Volume
>>>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>>>> 23-Dec-11 -     -     -     40.56 0
>>>>
>>>> Because the prices are defined as Float, these rows cause the
>>>> application to crash:
>>>>
>>>> scala> val rs = df2.filter(changeToDate("TradeDate") >= monthsago)
>>>>   .select(changeToDate("TradeDate").as("TradeDate"),
>>>>     (('Close + 'Open) / 2).as("AverageDailyPrice"),
>>>>     'Low.as("Day's Low"), 'High.as("Day's High"))
>>>>   .orderBy("TradeDate").collect
>>>>
>>>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0 (TID 260)
>>>> java.lang.NumberFormatException: For input string: "-"
>>>>
>>>>
>>>>
>>>>
>>>> One way is to define the prices as Strings, but that is not
>>>> meaningful. Alternatively, clean up the csv before putting it in
>>>> HDFS, but that becomes tedious and error-prone.
>>>>
>>>> Any ideas will be appreciated.
>>>>


Links:

  1. 
https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrameNaFunctions.html
