You may want to think about separating the import step from the processing 
step. It is not very economical to download all the data again every time you 
want to calculate something. So download it first and store it on a distributed 
file system, and schedule a job that pulls the newest data every day or hour as 
needed. You can store it in a query-optimized format such as ORC or Parquet 
and then run your queries over it.
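
A minimal sketch of that two-step flow, assuming a hypothetical 
fetchTicksFromYahoo helper and placeholder HDFS paths (adapt to your own API 
client and storage layout):

import org.apache.spark.sql.SparkSession

object IngestTicks {
  case class Symbol(symbol: String, sector: String)
  case class Tick(symbol: String, sector: String, open: Double, close: Double)

  // Placeholder for however you call the Yahoo API for one symbol;
  // it returns plain Ticks rather than a nested Dataset.
  def fetchTicksFromYahoo(symbol: String, sector: String): Seq[Tick] = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ingest-ticks").getOrCreate()
    import spark.implicits._

    val symbols = spark.read.parquet("hdfs:///data/symbols").as[Symbol]

    // Step 1 (run on a daily/hourly schedule): download once and persist
    // in a query-optimized format instead of hitting the API on every run.
    val ticks = symbols.flatMap(s => fetchTicksFromYahoo(s.symbol, s.sector))
    ticks.write.mode("append").parquet("hdfs:///data/ticks")

    // Step 2: later jobs read the stored data, with no API calls involved.
    val stored = spark.read.parquet("hdfs:///data/ticks").as[Tick]
    stored.groupBy("sector").avg("close").show()

    spark.stop()
  }
}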

> On 17. Feb 2018, at 01:10, Lian Jiang <jiangok2...@gmail.com> wrote:
> 
> Hi,
> 
> I have a use case:
> 
> I want to download S&P500 stock data from the Yahoo API in parallel using Spark. 
> I have got all stock symbols as a Dataset. Then I used the code below to call 
> the Yahoo API for each symbol:
> 
> case class Symbol(symbol: String, sector: String)
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
> 
> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
> 
>     symbolDs.map { k =>
>       pullSymbolFromYahoo(k.symbol, k.sector)
>     }
> 
> This statement does not compile:
> 
> Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
> String, etc) and Product types (case classes) are supported by importing 
> spark.implicits._  Support for serializing other types will be added in 
> future releases.
> 
> 
> My questions are:
> 
> 1. As you can see, this scenario is not traditional dataset handling such as 
> count or SQL queries. Instead, it is more like a UDF that applies an arbitrary 
> operation to each record. Is Spark good at handling such a scenario?
> 
> 2. Regarding the compilation error, any fix? I did not find a satisfactory 
> solution online.
> 
> Thanks for help!
> 
> 
> 
