Agreed. Thanks.

On Sat, Feb 17, 2018 at 9:53 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> You may want to think about separating the import step from the processing
> step. It is not very economical to download all the data again every time
> you want to calculate something. So download it first and store it on a
> distributed file system, and schedule a download of the newest data every
> day, hour, etc. You can store it in a query-optimized format such as ORC or
> Parquet and then run your queries over it.
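>
> A rough sketch of that split, for illustration only (the output path, the
> hard-coded symbol list, and the Yahoo helper are placeholders):
>
> import org.apache.spark.sql.{Dataset, SparkSession}
>
> case class Symbol(symbol: String, sector: String)
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>
> object DailyImport {
>   // Placeholder for the real Yahoo call; it returns plain case-class
>   // instances rather than a Dataset, so Spark can encode the results.
>   def pullSymbolFromYahoo(symbol: String, sector: String): Seq[Tick] = ???
>
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder().appName("sp500-import").getOrCreate()
>     import spark.implicits._
>
>     // The symbol list could come from a file or table; hard-coded for brevity.
>     val symbolDs: Dataset[Symbol] =
>       Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy")).toDS()
>
>     // One call per symbol, flattened into a Dataset of daily ticks.
>     val ticks: Dataset[Tick] =
>       symbolDs.flatMap(s => pullSymbolFromYahoo(s.symbol, s.sector))
>
>     // Persist once in a columnar format; downstream jobs only read this.
>     ticks.write.mode("append").partitionBy("symbol").parquet("/data/ticks")
>   }
> }
>
> The import job can then be scheduled daily or hourly with whatever scheduler
> you already use (cron, Oozie, Airflow, ...), and analysis jobs simply read
> the stored files, e.g. spark.read.parquet("/data/ticks"), instead of hitting
> the API again.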
>
> On 17. Feb 2018, at 01:10, Lian Jiang <jiangok2...@gmail.com> wrote:
>
> Hi,
>
> I have a use case:
>
> I want to download S&P 500 stock data from the Yahoo API in parallel using
> Spark. I have all the stock symbols in a Dataset, and I used the code below
> to call the Yahoo API for each symbol:
>
>
>
> case class Symbol(symbol: String, sector: String)
>
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>
> // symbolDs is a Dataset[Symbol]; pullSymbolFromYahoo returns a Dataset[Tick]
>
>     symbolDs.map { k =>
>       pullSymbolFromYahoo(k.symbol, k.sector)
>     }
>
>
> This statement does not compile; the error is:
>
>
> Unable to find encoder for type stored in a Dataset.  Primitive types
> (Int, String, etc) and Product types (case classes) are supported by
> importing spark.implicits._  Support for serializing other types will be
> added in future releases.
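>
> (If I read the message correctly, the element type produced by the map is
> what needs an encoder: pullSymbolFromYahoo returns a Dataset[Tick], so the
> map would yield a Dataset[Dataset[Tick]], and there is no encoder for
> Dataset itself, only for primitives and case classes like Tick. A minimal
> sketch of the difference, with fetchTicks as a made-up helper that returns
> plain values:)
>
> import spark.implicits._  // provides Encoder[Tick] because Tick is a case class
>
> // Does not compile: the lambda's result type is Dataset[Tick],
> // and there is no Encoder[Dataset[Tick]].
> // symbolDs.map(k => pullSymbolFromYahoo(k.symbol, k.sector))
>
> // Compiles: the lambda returns plain Ticks, which are encodable.
> def fetchTicks(symbol: String, sector: String): Seq[Tick] = ???
> val ticks = symbolDs.flatMap(k => fetchTicks(k.symbol, k.sector))
>
> (Separately, even if such an encoder existed, a Dataset cannot be created
> inside another Dataset's lambda, since that code runs on the executors
> without access to the SparkSession.)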
>
>
> My questions are:
>
>
> 1. As you can see, this scenario is not traditional dataset handling such
> as counting or SQL queries. Instead, it is more like a UDF that applies an
> arbitrary operation to each record. Is Spark a good fit for this kind of
> scenario?
>
>
> 2. Regarding the compilation error, is there a proper fix? I did not find a
> satisfactory solution online.
>
>
> Thanks for any help!
>
