Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, it could be hard to serialize complex operations for Spark to execute in parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.
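For reference, a minimal sketch of what the RDD route could look like, assuming a hypothetical fetchTicks helper that performs the actual Yahoo call and returns plain Tick case class values rather than a Dataset:

import org.apache.spark.sql.SparkSession

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

object RddPull {
  // Hypothetical stand-in for the Yahoo call; a real version would hit the
  // API and may return many ticks per symbol.
  def fetchTicks(symbol: String, sector: String): Seq[Tick] =
    Seq(Tick(symbol, sector, 0.0, 0.0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-pull").getOrCreate()
    import spark.implicits._

    val symbolDs = Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy")).toDS()

    // Drop down to the RDD: no encoder is needed, and the symbols are
    // fetched on the executors in parallel.
    val tickRdd = symbolDs.rdd.flatMap(s => fetchTicks(s.symbol, s.sector))

    // Convert back to a Dataset[Tick] only if Dataset/SQL semantics are wanted.
    val tickDs = spark.createDataset(tickRdd)
    tickDs.show()
    spark.stop()
  }
}

The per-symbol function still has to be serializable, which is the caveat above: if it drags in non-serializable API clients, it fails at runtime rather than compile time.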
On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.a...@gmail.com> wrote:

> ** You do NOT need dataframes, I mean.....
>
> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Couple of suggestions:
>>
>> 1. Do not use Dataset, use Dataframe in this scenario. There is no
>> benefit of Dataset features here. Using a Dataframe, you can write an
>> arbitrary UDF which can do what you want to do.
>> 2. In fact you do not need dataframes here. You would be better off with
>> an RDD here. Just create an RDD of symbols and use map to do the processing.
>>
>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.du...@gmail.com> wrote:
>>
>>> Do you only want to use Scala? Because otherwise, I think with pyspark
>>> and pandas read_table you should be able to accomplish what you want to
>>> accomplish.
>>>
>>> Thank you,
>>>
>>> Irving Duran
>>>
>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>
>>> Hi,
>>>
>>> I have a use case:
>>>
>>> I want to download S&P500 stock data from the Yahoo API in parallel
>>> using Spark. I have got all stock symbols as a Dataset. Then I used the
>>> code below to call the Yahoo API for each symbol:
>>>
>>> case class Symbol(symbol: String, sector: String)
>>>
>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>
>>> // symbolDs is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>>>
>>> symbolDs.map { k =>
>>>   pullSymbolFromYahoo(k.symbol, k.sector)
>>> }
>>>
>>> This statement does not compile:
>>>
>>> Unable to find encoder for type stored in a Dataset. Primitive types
>>> (Int, String, etc) and Product types (case classes) are supported by
>>> importing spark.implicits._ Support for serializing other types will
>>> be added in future releases.
>>>
>>> My questions are:
>>>
>>> 1. As you can see, this scenario is not traditional dataset handling
>>> such as count, sql query... Instead, it is more like a UDF which applies
>>> an arbitrary operation to each record. Is Spark good at handling such a
>>> scenario?
>>>
>>> 2. Regarding the compilation error, is there any fix? I did not find a
>>> satisfactory solution online.
>>>
>>> Thanks for the help!
>>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
> --
> Best Regards,
> Ayan Guha
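Following up on question 2 in the original mail: the closure passed to map returns a Dataset[Tick], so Spark would need an Encoder[Dataset[Tick]], which it does not provide (only primitives and case classes are covered, as the error message says). A minimal sketch of one workaround, assuming pullSymbolFromYahoo can be changed to return an ordinary Seq[Tick] (the stub below is hypothetical), is to flatMap so that only the Encoder[Tick] derived by spark.implicits._ is needed:

import org.apache.spark.sql.{Dataset, SparkSession}

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

object DatasetPull {
  // Hypothetical stub; the real version would call the Yahoo API and return
  // the ticks for one symbol as plain case class values.
  def pullSymbolFromYahoo(symbol: String, sector: String): Seq[Tick] =
    Seq(Tick(symbol, sector, 0.0, 0.0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dataset-pull").getOrCreate()
    import spark.implicits._

    val symbolDs: Dataset[Symbol] =
      Seq(Symbol("AAPL", "Tech"), Symbol("XOM", "Energy")).toDS()

    // Compiles because Encoder[Tick] exists for the case class; the API
    // calls run in parallel across partitions.
    val tickDs: Dataset[Tick] =
      symbolDs.flatMap(s => pullSymbolFromYahoo(s.symbol, s.sector))

    tickDs.show()
    spark.stop()
  }
}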