Thanks guys.
This seemed to be working after declaring all columns as Strings to start
and using the filters below to avoid rogue characters. The second filter
ensures that there were trade volumes on that date.

val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer") > 0).filter(change
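A fuller sketch of that approach (an assumption-laden sketch, not the poster's exact code: it assumes Spark 2.x, where !== is spelled =!=, and a string-typed DataFrame df2):

```scala
import org.apache.spark.sql.functions.col

// Sketch: all columns were read as String, so filter out rogue rows
// before casting the numeric columns.
val rs = df2
  .filter(col("Open") =!= "-")             // drop days with no prices
  .filter(col("Volume").cast("int") > 0)   // keep only days with trades
  .withColumn("Open", col("Open").cast("float"))
  .withColumn("Volume", col("Volume").cast("int"))
```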
Hi Mich,
if I understood you well, you may cast the value to float, it will yield
null if the value is not a correct float:
val df = Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7)).toDF("value", "id")
df.createOrReplaceTempView("lines")
spark.sql("SELECT cast(value as FLOAT) from lines").show()
+---
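The same null-on-bad-input behaviour holds without SQL; a sketch using the DataFrame API (assumes an active spark session and spark.implicits._ in scope):

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._

val lines = Seq(("-", 5), ("1", 6), (",", 7), ("8.6", 7)).toDF("value", "id")
// cast returns null for any value that is not a valid float
lines.select(col("value").cast("float").as("value_f"), col("id")).show()
```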
Thanks all.
This is the csv schema, all columns mapped to String:
scala> df2.printSchema
root
|-- Stock: string (nullable = true)
|-- Ticker: string (nullable = true)
|-- TradeDate: string (nullable = true)
|-- Open: string (nullable = true)
|-- High: string (nullable = true)
|-- Low: string
Hi Mich -
Can you run a filter command on df1 prior to your map for any rows where
p(3).toString != "-", then run your map command?
Thanks
Mike
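Mike's suggestion might be sketched like this (hypothetical: it assumes df1 is an RDD of split csv lines, with the Open price at index 3, and that columns is the case class defined elsewhere in the thread):

```scala
// Hypothetical sketch: skip rogue rows before mapping to the case class.
val good = df1.filter(p => p(3).toString != "-")
val rs = good.map(p => columns(p(0).toString, p(1).toString, p(2).toString,
  p(3).toString.toFloat, p(4).toString.toFloat, p(5).toString.toFloat,
  p(6).toString.toFloat, p(7).toString.toInt))
```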
On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh
wrote:
> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
>
Hi Mich,
I guess you could use the nullValue option by setting it to null.
If you are reading them into strings at first, then you would hit
https://github.com/apache/spark/pull/14118 first, which is resolved
from 2.0.1.
Unfortunately, this bug also exists in the external csv library for stri
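Setting nullValue at read time might look like this (a sketch; the file path is an assumption, and "-" is the placeholder seen in the rogue rows):

```scala
// Sketch: the built-in csv source (Spark 2.x) turns "-" cells into null.
val df = spark.read
  .option("header", "true")
  .option("nullValue", "-")   // treat "-" as null rather than a string
  .csv("stock.csv")           // hypothetical path
```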
Thanks guys
Actually these are the 7 rogue rows. Column 0 is the Volume column,
which means there were no trades on those days.

cat stock.csv | grep ",0"
SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
SAP SE,SAP, 23-Dec-10,-,-,-,38.36,
You can read as strings, write a map to fix the rows, and then convert
back to your desired DataFrame.
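That read-as-string-then-fix approach might be sketched as follows (column names and the "-" placeholder are assumptions taken from elsewhere in the thread):

```scala
import org.apache.spark.sql.functions._

// Sketch: read everything as String, null out the "-" placeholders,
// then cast the repaired columns to their real types.
val raw = spark.read.option("header", "true").csv("stock.csv")
val fixed = raw
  .withColumn("Open",
    when(col("Open") === "-", lit(null)).otherwise(col("Open")).cast("float"))
  .withColumn("Volume", col("Volume").cast("int"))
```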
On 28 Sep 2016 06:49, "Mich Talebzadeh" wrote:
>
> I have historical prices for various stocks.
>
> Each csv file has 10 years of trades, one row per day.
>
> These are the columns defined in the clas
We use spark-csv (a successor of which is built in to Spark 2.0) for
this. It doesn't cause crashes; failed parsing is logged. We run on
Mesos, so I have to pull back all the logs from all the executors and
search for failed lines (so that we can ensure that the failure rate
isn't too hig
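The parse-mode behaviour described above can be configured explicitly; a sketch with the built-in Spark 2.0 csv source (the schema and path are assumptions based on this thread):

```scala
import org.apache.spark.sql.types._

val stockSchema = StructType(Seq(
  StructField("Stock", StringType), StructField("Ticker", StringType),
  StructField("TradeDate", StringType), StructField("Open", FloatType),
  StructField("High", FloatType), StructField("Low", FloatType),
  StructField("Close", FloatType), StructField("Volume", IntegerType)))

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")  // drop lines that do not fit the schema
  .schema(stockSchema)
  .csv("stock.csv")
```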
I have historical prices for various stocks.
Each csv file has 10 years of trades, one row per day.
These are the columns defined in the class
case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
Float, High: Float, Low: Float, Close: Float, Volume: Integer)
The issue is w
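Parsing one raw csv line into that case class might look like this (a hypothetical helper, not the poster's code; it shows exactly where a rogue "-" field would throw a NumberFormatException):

```scala
case class columns(Stock: String, Ticker: String, TradeDate: String,
  Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)

// Hypothetical helper: "-" in a numeric field makes toFloat throw,
// which is the failure discussed in this thread.
def parseLine(line: String): columns = {
  val p = line.split(",").map(_.trim)
  columns(p(0), p(1), p(2), p(3).toFloat, p(4).toFloat,
          p(5).toFloat, p(6).toFloat, p(7).toInt)
}
```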