Thanks Anastasios. This link is helpful!
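For the archives, a minimal sketch of the pattern described in that post, as I understand it. YahooClient, YahooClientHolder and fetchAll are made-up names standing in for a non-serializable third-party client and the calling code; they are illustrative only:

import org.apache.spark.rdd.RDD

// Stand-in for a hypothetical third-party class that is NOT Serializable.
class YahooClient {
  def fetch(symbol: String): String = s"quote for $symbol" // placeholder for the real HTTP call
}

// A singleton holding the client. Scala objects are initialized lazily and
// independently inside each executor JVM, so the client itself is never
// shipped from the driver; only a reference to the object is captured in the closure.
object YahooClientHolder {
  lazy val client = new YahooClient
}

def fetchAll(symbols: RDD[String]): RDD[String] =
  symbols.mapPartitions { it =>
    // The client is constructed on the executor the first time it is touched.
    it.map(sym => YahooClientHolder.client.fetch(sym))
  }

The same idea applies from Dataset code via .rdd or mapPartitions; the key point is that the non-serializable object is constructed on the executors rather than serialized from the driver. (There is also a note on the Encoders.kryo error at the bottom of the thread.)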
On Sat, Feb 17, 2018 at 11:05 AM, Anastasios Zouzias <zouz...@gmail.com> wrote:

> Hi Lian,
>
> The remaining problem is:
>
> Spark needs all classes used in fn() to be serializable for t.rdd.map{ k=> fn(k) } to work. This could be hard since some classes in third-party libraries are not serializable. This restricts the power of using Spark to parallelize an operation across multiple machines. Hope this is clear.
>
> This is not entirely true. You can bypass the serialisation issue in most cases; see the link below for an example.
>
> https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/
>
> In a nutshell, the non-serialisable code is available to all executors, so there is no need for Spark to serialise it from the driver to the executors.
>
> Best regards,
> Anastasios
>
> On Sat, Feb 17, 2018 at 6:13 PM, Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> *Snehasish,*
>>
>> I got this in spark-shell 2.11.8:
>>
>> case class My(name:String, age:Int)
>>
>> import spark.implicits._
>>
>> val t = List(new My("lian", 20), new My("sh", 3)).toDS
>>
>> t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass])
>>
>> <console>:31: error: type getClass is not a member of object My
>>
>>   t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass])
>>
>> Using an RDD can work around this issue, as mentioned in previous emails:
>>
>> t.rdd.map{ k=> print(k) }
>>
>> *Holden,*
>>
>> The remaining problem is:
>>
>> Spark needs all classes used in fn() to be serializable for t.rdd.map{ k=> fn(k) } to work. This could be hard since some classes in third-party libraries are not serializable. This restricts the power of using Spark to parallelize an operation across multiple machines. Hope this is clear.
>>
>> On Sat, Feb 17, 2018 at 12:04 AM, SNEHASISH DUTTA <info.snehas...@gmail.com> wrote:
>>
>>> Hi Lian,
>>>
>>> This could be the solution:
>>>
>>> case class Symbol(symbol: String, sector: String)
>>>
>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>
>>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>>>
>>> symbolDs.map { k =>
>>>   pullSymbolFromYahoo(k.symbol, k.sector)
>>> }(org.apache.spark.sql.Encoders.kryo[Tick.getClass])
>>>
>>> Thanks,
>>> Snehasish
>>>
>>> Regards,
>>> Snehasish
>>>
>>> On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> I'm not sure what you mean by it being hard to serialize complex operations.
>>>>
>>>> Regardless, I think the question is: do you want to parallelize this on multiple machines or just one?
>>>>
>>>> On Feb 17, 2018 4:20 PM, "Lian Jiang" <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, it could be hard to serialize a complex operation for Spark to execute in parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.
>>>>>
>>>>> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> ** You do NOT need dataframes, I mean.....
>>>>>>
>>>>>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> A couple of suggestions:
>>>>>>>
>>>>>>> 1. Do not use Dataset; use DataFrame in this scenario. There is no benefit from Dataset features here. With a DataFrame, you can write an arbitrary UDF which does what you want to do.
>>>>>>>
>>>>>>> 2. In fact you do need dataframes here. You would be better off with an RDD here.
>>>>>>> Just create an RDD of symbols and use map to do the processing.
>>>>>>>
>>>>>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <irving.du...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Do you only want to use Scala? Because otherwise, I think with pyspark and pandas read table you should be able to accomplish what you want to accomplish.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> Irving Duran
>>>>>>>>
>>>>>>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a use case:
>>>>>>>>
>>>>>>>> I want to download S&P 500 stock data from the Yahoo API in parallel using Spark. I have got all stock symbols as a Dataset. Then I used the code below to call the Yahoo API for each symbol:
>>>>>>>>
>>>>>>>> case class Symbol(symbol: String, sector: String)
>>>>>>>>
>>>>>>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>>>>>>
>>>>>>>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>>>>>>>>
>>>>>>>> symbolDs.map { k =>
>>>>>>>>   pullSymbolFromYahoo(k.symbol, k.sector)
>>>>>>>> }
>>>>>>>>
>>>>>>>> This statement cannot compile:
>>>>>>>>
>>>>>>>> Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>>
>>>>>>>> 1. As you can see, this scenario is not traditional dataset handling such as count, sql query... Instead, it is more like a UDF which applies an arbitrary operation on each record. Is Spark good at handling such a scenario?
>>>>>>>>
>>>>>>>> 2. Regarding the compilation error, is there any fix? I did not find a satisfactory solution online.
>>>>>>>>
>>>>>>>> Thanks for help!
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>
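P.S. for anyone finding this thread in the archives: the compile error above comes from writing Encoders.kryo[My.getClass] / Encoders.kryo[Tick.getClass]; kryo takes the type itself, not a .getClass call. Since Tick is a case class (a Product), the implicit encoder from spark.implicits._ should already be enough, provided pullSymbolFromYahoo returns plain values (a Tick, or a Seq[Tick] with flatMap) rather than a Dataset[Tick], because Datasets cannot be created inside another Dataset's map. A rough spark-shell sketch under those assumptions (pullSymbolFromYahoo here is a stub, not the real implementation):

import org.apache.spark.sql.Encoders
import spark.implicits._

case class Symbol(symbol: String, sector: String)
case class Tick(symbol: String, sector: String, open: Double, close: Double)

// Stub: the real implementation would call the Yahoo API.
def pullSymbolFromYahoo(symbol: String, sector: String): Tick =
  Tick(symbol, sector, 0.0, 0.0)

val symbolDs = Seq(Symbol("MSFT", "tech"), Symbol("XOM", "energy")).toDS()

// Case classes get an implicit Product encoder from spark.implicits._:
val ticks = symbolDs.map(k => pullSymbolFromYahoo(k.symbol, k.sector))

// If an explicit Kryo encoder is really wanted, pass the type, not .getClass:
val ticksKryo = symbolDs.map(k => pullSymbolFromYahoo(k.symbol, k.sector))(Encoders.kryo[Tick])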