Agreed. Thanks.

On Sat, Feb 17, 2018 at 9:53 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> You may want to think about separating the import step from the processing
> step. It is not very economical to download all the data again every time
> you want to calculate something. So download it first and store it on a
> distributed file system. Schedule a download of the newest information
> every day/hour etc. You can store it using a query-optimized format such
> as ORC or Parquet. Then you can run queries over it.
>
> On 17. Feb 2018, at 01:10, Lian Jiang <jiangok2...@gmail.com> wrote:
>
> Hi,
>
> I have a use case:
>
> I want to download S&P 500 stock data from the Yahoo API in parallel using
> Spark. I have all the stock symbols as a Dataset. Then I used the code
> below to call the Yahoo API for each symbol:
>
> case class Symbol(symbol: String, sector: String)
>
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>
> // symbolDs is Dataset[Symbol]; pullSymbolFromYahoo returns Dataset[Tick]
>
> symbolDs.map { k =>
>   pullSymbolFromYahoo(k.symbol, k.sector)
> }
>
> This statement does not compile:
>
> Unable to find encoder for type stored in a Dataset. Primitive types
> (Int, String, etc) and Product types (case classes) are supported by
> importing spark.implicits._ Support for serializing other types will be
> added in future releases.
>
> My questions are:
>
> 1. As you can see, this scenario is not traditional dataset handling such
> as count or a SQL query. Instead, it is more like a UDF that applies an
> arbitrary operation to each record. Is Spark good at handling such a
> scenario?
>
> 2. Regarding the compilation error, is there a fix? I did not find a
> satisfactory solution online.
>
> Thanks for the help!
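For question 2 above: the error is most likely because pullSymbolFromYahoo
returns Dataset[Tick], so the map produces a Dataset[Dataset[Tick]], and Spark
has no encoder for a nested Dataset (a Dataset also cannot be constructed
inside map, which runs on the executors where no SparkSession exists). A
minimal sketch of one possible fix, assuming a hypothetical helper
fetchTicksFromYahoo that returns a plain Seq[Tick] and is flatMapped over the
symbols; note that the case classes sit at the top level and spark.implicits._
is imported after the session is created:

import org.apache.spark.sql.{Dataset, SparkSession}

// Case classes must be defined at the top level (not inside a method),
// otherwise Spark cannot derive encoders for them.
case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

object YahooImport {

  // Hypothetical helper: runs on the executors, calls the Yahoo API, and
  // returns plain case-class instances rather than a Dataset.
  def fetchTicksFromYahoo(symbol: String, sector: String): Seq[Tick] = {
    // ... perform the HTTP call and parse the response here ...
    Seq.empty
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("yahoo-import").getOrCreate()
    import spark.implicits._ // brings encoders for Symbol and Tick into scope

    val symbolDs: Dataset[Symbol] = Seq(Symbol("MSFT", "tech")).toDS()

    // flatMap, so each symbol can expand into any number of ticks.
    val tickDs: Dataset[Tick] =
      symbolDs.flatMap(s => fetchTicksFromYahoo(s.symbol, s.sector))
    tickDs.show()
  }
}

This also speaks to question 1: a map/flatMap over a Dataset is a normal place
to run arbitrary per-record code, as long as that code returns encodable
values; the calls are then parallelized across partitions.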
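And to illustrate the import/processing split Jörn suggests, a minimal sketch
continuing from the snippet above (spark, tickDs and Tick are reused; the HDFS
path and the sector aggregation are made-up placeholders):

// Import step (scheduled daily/hourly): persist the downloaded ticks once,
// in a columnar, query-optimized format.
tickDs.write.mode("overwrite").parquet("hdfs:///data/ticks/2018-02-17")

// Processing step (any later job): read the stored files instead of hitting
// the Yahoo API again, then query them.
val ticks = spark.read.parquet("hdfs:///data/ticks/2018-02-17").as[Tick]
ticks.groupBy("sector").avg("close").show()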