Hi

A couple of suggestions:

1. Do not use Dataset; use DataFrame in this scenario. There is no benefit
from Dataset features here. With a DataFrame, you can write an arbitrary UDF
that does what you want (first flavour in the sketch below).
2. In fact, you may not even need DataFrames here. You would be better off
with an RDD: just create an RDD of symbols and use map (or flatMap) to do the
processing (second flavour in the sketch below).
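
To make both concrete, here is a minimal sketch. It assumes pullSymbolFromYahoo
is rewritten as a plain function (called fetchTicks below; that name, the app
name, and the sample symbols are just placeholders) returning an ordinary
Seq[Tick] instead of a Dataset[Tick]. That change is also the fix for your
compile error: a Dataset cannot be nested inside another Dataset, so Spark has
no encoder for the Dataset[Tick] your map produces.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

// Placeholder for pullSymbolFromYahoo: fetches quotes for one symbol and
// returns plain case classes. It must be serializable, since it runs on
// the executors.
def fetchTicks(symbol: String, sector: String): Seq[Tick] = ???

val spark = SparkSession.builder.appName("sp500").getOrCreate()
import spark.implicits._

val symbols = Seq(Symbol("AAPL", "Technology"), Symbol("XOM", "Energy"))

// Flavour 1, DataFrame + UDF: the UDF returns an array of structs, and
// explode produces one row per Tick.
val fetchUdf = udf((symbol: String, sector: String) => fetchTicks(symbol, sector))
val ticksDf = symbols.toDF()
  .withColumn("tick", explode(fetchUdf(col("symbol"), col("sector"))))
  .select("tick.*")

// Flavour 2, RDD: no encoders involved until the very end.
val ticksRdd = spark.sparkContext
  .parallelize(symbols)
  .flatMap(s => fetchTicks(s.symbol, s.sector))
val ticksDs = ticksRdd.toDS()  // back to a Dataset[Tick] if you want the SQL API

Each executor makes its own HTTP calls, so the downloads run in parallel
across the cluster; Spark handles this kind of arbitrary per-record work fine.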

On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.du...@gmail.com>
wrote:

> Do you only want to use Scala? Otherwise, I think that with PySpark and
> pandas read_table you should be able to accomplish what you want.
>
> Thank you,
>
> Irving Duran
>
> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>
> Hi,
>
> I have a use case:
>
> I want to download S&P 500 stock data from the Yahoo API in parallel using
> Spark. I have all the stock symbols in a Dataset, and I used the code below
> to call the Yahoo API for each symbol:
>
>
> case class Symbol(symbol: String, sector: String)
>
> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>
> // symbolDs is a Dataset[Symbol]; pullSymbolFromYahoo returns Dataset[Tick]
>
> symbolDs.map { k =>
>   pullSymbolFromYahoo(k.symbol, k.sector)
> }
>
>
> This statement does not compile; the error is:
>
>
> Unable to find encoder for type stored in a Dataset.  Primitive types
> (Int, String, etc) and Product types (case classes) are supported by
> importing spark.implicits._  Support for serializing other types will be
> added in future releases.
>
>
> My questions are:
>
>
> 1. As you can see, this scenario is not traditional Dataset handling such
> as count or SQL queries. Instead, it is more like a UDF that applies an
> arbitrary operation to each record. Is Spark good at handling such a
> scenario?
>
>
> 2. Regarding the compilation error, is there a fix? I did not find a
> satisfactory solution online.
>
>
> Thanks for the help!
>


-- 
Best Regards,
Ayan Guha
