Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Mich Talebzadeh
Thanks guys. This seemed to be working after declaring all columns as Strings to start with and using the filters below to avoid rogue characters. The second filter ensures that there were trade volumes on that date.

    val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer") > 0)
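A fuller sketch of that final approach, stitched together from the thread (the file path, and the choice to cast the price columns only after filtering, are assumptions, not something Mich posted):

    import spark.implicits._

    // read everything as strings so the rogue "-" markers survive parsing
    val df2 = spark.read.option("header", "true").csv("stock.csv")

    // drop rows with the "-" placeholder and rows with zero traded volume
    val rs = df2.filter($"Open" !== "-")
                .filter($"Volume".cast("Integer") > 0)

    // only now is it safe to cast the price columns to numeric types
    val priced = rs.select($"Stock", $"Ticker", $"TradeDate",
                           $"Open".cast("Float"), $"High".cast("Float"),
                           $"Low".cast("Float"), $"Close".cast("Float"),
                           $"Volume".cast("Integer"))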

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Bedrytski Aliaksandr
Hi Mich, if I understood you well, you may cast the value to float; it will yield null if the value is not a correct float:

    val df = Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7)).toDF("value", "id")
    df.createOrReplaceTempView("lines")
    spark.sql("SELECT cast(value as FLOAT) from lines").show()
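A follow-on sketch (my addition, not from Aliaksandr's mail): once the unparseable values have become null, they can be dropped in the same query:

    spark.sql(
      "SELECT cast(value as FLOAT) AS value FROM lines " +
      "WHERE cast(value as FLOAT) IS NOT NULL").show()
    // only 1.0 and 8.6 survive; "-" and "," cast to null and are filtered out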

Re: Issue with rogue data in csv file used in Spark application

2016-09-28 Thread Mich Talebzadeh
Thanks all. This is the csv schema, all columns mapped to String:

    scala> df2.printSchema
    root
     |-- Stock: string (nullable = true)
     |-- Ticker: string (nullable = true)
     |-- TradeDate: string (nullable = true)
     |-- Open: string (nullable = true)
     |-- High: string (nullable = true)
     |-- Low: string
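For context, a hedged sketch of how a schema like that comes about: reading the csv without schema inference leaves every column as a nullable string (the path and header option are assumptions):

    val df2 = spark.read
      .option("header", "true")   // no inferSchema, so everything stays a string
      .csv("stock.csv")
    df2.printSchema()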

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mike Metzger
Hi Mich - Can you run a filter command on df1 prior to your map for any rows where p(3).toString != "-", then run your map command? Thanks Mike
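A minimal sketch of Mike's suggestion, assuming df1 is an RDD of already-split lines and reusing the case class from the start of the thread (variable names are illustrative):

    // p(3) is the Open field; "-" marks the rogue rows, so drop them before mapping
    val rs = df1.filter(p => p(3).toString != "-")
                .map(p => columns(p(0).toString, p(1).toString, p(2).toString,
                                  p(3).toString.toFloat, p(4).toString.toFloat,
                                  p(5).toString.toFloat, p(6).toString.toFloat,
                                  p(7).toString.toInt))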

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich, I guess you could use the nullValue option by setting it to null. If you are reading them into strings at first, then you would hit https://github.com/apache/spark/pull/14118 first, which is resolved in 2.0.1. Unfortunately, this bug also exists in the external csv library for
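A hedged sketch of that suggestion against Spark 2.x's built-in csv reader (treating the rogue "-" token as the null marker is my reading of the thread, not Hyukjin's exact words):

    val df = spark.read
      .option("header", "true")
      .option("nullValue", "-")   // "-" cells come back as null instead of failing a cast
      .csv("stock.csv")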

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
Thanks guys. Actually these are the 7 rogue rows. The 0 is in the Volume column, which means there were no trades on those days:

    cat stock.csv | grep ",0"
    SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
    SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
    SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
    SAP SE,SAP,
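The same hunt can be done inside Spark rather than with grep (a sketch of mine; column names follow the all-string schema shown earlier in the thread):

    df2.filter($"Volume" === "0" || $"Open" === "-").show()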

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread ayan guha
You can read everything as strings, write a map to fix the rows, and then convert back to your desired DataFrame.
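A minimal sketch of that read-as-string, fix, convert round trip, reusing the thread's case class (turning "-" into 0.0f is an assumption made for illustration):

    import spark.implicits._

    val fixed = spark.read.option("header", "true").csv("stock.csv")  // all strings
      .map { row =>
        // "-" marks days with no trades; substitute 0.0f so the cast succeeds
        def num(i: Int): Float = {
          val v = row.getString(i)
          if (v == null || v.trim == "-") 0.0f else v.trim.toFloat
        }
        columns(row.getString(0), row.getString(1), row.getString(2),
                num(3), num(4), num(5), num(6),
                row.getString(7).trim.toInt)
      }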

Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Adrian Bridgett
We use spark-csv (a successor of which is built into Spark 2.0) for this. It doesn't cause crashes; failed parsing is logged. We run on Mesos, so I have to pull back all the logs from all the executors and search for failed lines (so that we can ensure that the failure rate isn't too
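For reference, a hedged sketch of the parse-mode knob this behaviour relies on; DROPMALFORMED is an assumption about Adrian's setup, since his mail does not name the mode:

    val df = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")  // skip unparseable lines instead of failing the job
      .csv("stock.csv")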

Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Mich Talebzadeh
I have historical prices for various stocks. Each csv file has 10 years of trades, one row per day. These are the columns defined in the class:

    case class columns(Stock: String, Ticker: String, TradeDate: String,
                       Open: Float, High: Float, Low: Float,
                       Close: Float, Volume: Integer)

The issue is
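The failure mode the thread turns out to be about, in miniature (my reconstruction, not a snippet from the mail): the no-trade rows carry "-" where the prices should be, and "-" cannot be parsed as a Float:

    scala> "-".toFloat
    java.lang.NumberFormatException: For input string: "-"

so any map that calls .toFloat on those fields blows up before the DataFrame is ever built.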