On Sep 29, 2016, at 10:29 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Good points :) Would it take "-" as part of a negative number, e.g. -123456?

Yeah… you have to go down a level and start to remember that you’re dealing 
with a stream or buffer of bytes below any casting.

At the moment, this is what the code does:


  1.  The csv is imported into HDFS as is. No cleaning for rogue columns is done 
at the shell level.
  2.  The Spark program then does the following filtering:

      val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer") > 0)

So my first line of defence is to check for !== "-", a dash being commonly used 
for "not available". The next filter is for the volume column > 0 (there were 
trades on this stock); otherwise the calculation could skew the results. Note 
that a single filter combining !== with AND will not work.


You can’t rely on the ‘-‘ to represent NaN or NULL.

The issue is that you’re going from a loose typing to a stronger typing (String 
to Double).
So pretty much any byte buffer could be interpreted as a String, but if the 
String value is too long to be a Double, you will fail the NaN test (or it’s a 
NULL value/string).
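A quick sketch of that String-to-Double behaviour in plain Scala (no Spark needed; `toDoubleOrNull` is a made-up helper for illustration). The point is that a failed parse gives you an error or a null, never a NaN, which is why in Spark SQL a cast of a non-numeric string yields NULL and `$"Open".cast("double").isNull` tends to be a more reliable rogue-value test than isnan:

```scala
// A failed String-to-Double parse never produces NaN; mimic Spark's
// cast semantics by returning null for non-numeric strings instead.
def toDoubleOrNull(s: String): java.lang.Double =
  try java.lang.Double.valueOf(s)
  catch { case _: NumberFormatException => null }

println(toDoubleOrNull("-"))      // null: "-" is not parseable as a number
println(toDoubleOrNull("40.56"))  // 40.56
```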
As to filtering… you would probably want to filter on volume being == 0. (It’s 
possible to actually have a negative volume.)
Or you could set the opening, low, high to the close if the volume is 0 
regardless of the values in those columns.

Note: This would be a transformation of the data and should be done during 
ingestion so you’re doing it only once.
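A minimal sketch of that ingestion-time transformation, on a hypothetical Quote record in plain Scala (in Spark the same idea would be a `when($"Volume" === 0, $"Close").otherwise($"Open")` style column expression; the record shape below just mirrors the csv columns in this thread):

```scala
// Hypothetical record shape matching the csv columns discussed here.
case class Quote(tradeDate: String, open: String, high: String,
                 low: String, close: Double, volume: Long)

// If no trades occurred (volume == 0), carry the close into open/high/low
// instead of keeping the "-" placeholders.
def normalise(q: Quote): Quote =
  if (q.volume == 0L)
    q.copy(open = q.close.toString, high = q.close.toString, low = q.close.toString)
  else q

println(normalise(Quote("2011-12-23", "-", "-", "-", 40.56, 0L)))
// Quote(2011-12-23,40.56,40.56,40.56,40.56,0)
```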

Or you could just remove the rows, since no trades occurred, and then either 
reflect that in your graph as gaps or let the graph interpolate over them.


scala> val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)
<console>:40: error: value && is not a member of String
       val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)

Will throw an error.

But this equality check with === works!

scala> val rs = df2.filter($"Open" === "-" && $"Volume".cast("Integer") > 0)
rs: org.apache.spark.sql.Dataset[columns] = [Stock: string, Ticker: string ... 
6 more fields]
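The difference comes from Scala operator precedence, not Spark: an operator that ends in '=' but does not start with '=' (and is not one of <=, >=, !=) is parsed as an assignment operator with the lowest precedence, so a !== b && c parses as a !== (b && c), while === keeps normal comparison precedence and parses as (a === b) && c. A toy sketch on plain Booleans (the Ops class is made up purely to mimic the two operators):

```scala
// Toy === / !== on Boolean, to show the precedence difference by result.
implicit class Ops(val b: Boolean) {
  def ===(o: Boolean): Boolean = b == o
  def !==(o: Boolean): Boolean = b != o
}

// === binds tighter than &&:  (false === false) && false  ->  false
val eqRes = false === false && false
// !== is assignment-precedence: true !== (false && false)  ->  true
val neqRes = true !== false && false

println(s"eqRes=$eqRes neqRes=$neqRes")
```

So in Spark the usual fixes are to parenthesise, e.g. `df2.filter(($"Open" !== "-") && ($"Volume".cast("Integer") > 0))`, or to use the `=!=` operator (introduced in Spark 2.0, which deprecated `!==` for exactly this precedence problem).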


Another alternative is to check that every character is a digit:

 scala> def isAllPostiveNumber (price: String) = price forall Character.isDigit
isAllPostiveNumber: (price: String)Boolean

Not really a good idea. You’re walking through each byte in the string and 
checking whether it’s a digit. What if it’s a NULL string? What do you set the 
value to? This doesn’t scale well…

Again why not remove the rows where the volume of trades is 0?

Returns Boolean true or false. But the following does not work; can someone 
tell me what is wrong with it?

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)
<console>:1: error: not a legal formal parameter.
Note: Tuples cannot be directly destructured in method or function parameters.
      Either create a single parameter accepting the Tuple1,
      or consider a pattern matching anonymous function: `{ case (param1, 
param1) => ... }
val rs = df2.filter(isAllPostiveNumber("Open") => true)
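For what it’s worth, filter needs either a Column expression or a row predicate, and `isAllPostiveNumber("Open")` would apply the function to the literal string "Open", not to the column. A null-safe sketch of the predicate in plain Scala (the Spark wiring is shown in comments, assuming the df2 from this thread):

```scala
// Null-safe version of the thread's isAllPostiveNumber: empty and null
// strings are rejected explicitly, since "".forall(Character.isDigit) is
// vacuously true and a null receiver would throw a NullPointerException.
val isAllPositiveNumber: String => Boolean =
  price => price != null && price.nonEmpty && price.forall(Character.isDigit)

println(isAllPositiveNumber("123456"))  // true
println(isAllPositiveNumber("-"))       // false
println(isAllPositiveNumber(""))        // false
println(isAllPositiveNumber(null))      // false

// In Spark it would be wrapped as a UDF and given the column, e.g.:
//   import org.apache.spark.sql.functions.udf
//   val isAllPositiveNumberUdf = udf(isAllPositiveNumber)
//   val rs = df2.filter(isAllPositiveNumberUdf($"Open"))
```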


Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 29 September 2016 at 13:45, Michael Segel <msegel_had...@hotmail.com> wrote:
Hi,

Just a few thoughts, so take it for what it’s worth…

Databases have static schemas and will reject a row’s column on insert.

In your case… you have one data set where you have a column which is supposed 
to be a number but you have it as a string.
You want to convert this to a double in your final data set.


It looks like your problem is that the original data set you ingested used a 
‘-‘ (dash) to represent missing data, rather than a NULL value.
In fact, looking at the rows… you seem to have a stock that didn’t trade on a 
given day (all have Volume as 0). Why do you need this? Wouldn’t you want to 
represent it as NULL, or as no row for that date?

The reason your ‘-‘ check failed with isnan() is that ‘-‘ could actually be 
represented as part of a number.

If you replaced the ‘-‘ with a String that is wider than the width of a double 
… the isnan should flag the row.

(I still need more coffee, so I could be wrong) ;-)

HTH

-Mike

On Sep 28, 2016, at 5:56 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:


This is an issue in most databases, specifically when a field is NaN (NaN, 
standing for "not a number", is a numeric data type value representing an 
undefined or unrepresentable value, especially in floating-point calculations).

There is a method called isnan() in Spark that is supposed to handle this 
scenario. However, it does not return correct values! For example, I defined 
column "Open" as String (it should be Float) and it has the following 7 rogue 
entries out of 1272 rows in the csv:

df2.filter( $"OPen" === 
"-").select((changeToDate("TradeDate").as("TradeDate")), 'Open, 'High, 'Low, 
'Close, 'Volume).show

+----------+----+----+---+-----+------+
| TradeDate|Open|High|Low|Close|Volume|
+----------+----+----+---+-----+------+
|2011-12-23|   -|   -|  -|40.56|     0|
|2011-04-21|   -|   -|  -|45.85|     0|
|2010-12-30|   -|   -|  -|38.10|     0|
|2010-12-23|   -|   -|  -|38.36|     0|
|2008-04-30|   -|   -|  -|32.39|     0|
|2008-04-29|   -|   -|  -|33.05|     0|
|2008-04-28|   -|   -|  -|32.60|     0|
+----------+----+----+---+-----+------+

However, the following does not work!

 df2.filter(isnan($"Open")).show
+-----+------+---------+----+----+---+-----+------+
|Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
+-----+------+---------+----+----+---+-----+------+
+-----+------+---------+----+----+---+-----+------+

Any suggestions?

Thanks

