Thanks Anastasios. This link is helpful!

On Sat, Feb 17, 2018 at 11:05 AM, Anastasios Zouzias <zouz...@gmail.com>
wrote:

> Hi Lian,
>
> The remaining problem is:
>
>
> Spark needs all classes used in fn() to be serializable for t.rdd.map{ k =>
> fn(k) } to work. This can be hard since some classes in third-party
> libraries are not serializable. This limits the power of using Spark to
> parallelize an operation across multiple machines. Hope this is clear.
>
>
> This is not entirely true. You can bypass the serialisation issue in most
> cases; see the link below for an example.
>
> https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/
>
> In a nutshell, the non-serialisable object can be constructed on each executor
> (its class is already on the executors' classpath), so there is no need for
> Spark to serialise it from the driver to the executors.
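>
> For illustration, here is a minimal sketch of one such pattern: construct the
> object inside mapPartitions so it is created on the executor rather than
> shipped from the driver. HttpClient below is a hypothetical stand-in for any
> non-serialisable third-party class.
>
> // hypothetical non-serialisable third-party client
> class HttpClient { def get(url: String): String = "response for " + url }
>
> val urls = sc.parallelize(Seq("http://a", "http://b"))
>
> // build the client inside mapPartitions: it is constructed on the executor,
> // once per partition, and never has to be serialised from the driver
> val responses = urls.mapPartitions { it =>
>   val client = new HttpClient()
>   it.map(u => client.get(u))
> }
> responses.collect().foreach(println)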
>
> Best regards,
> Anastasios
>
>
>
>
> On Sat, Feb 17, 2018 at 6:13 PM, Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> *Snehasish,*
>>
>> I got this in spark-shell (Scala 2.11.8):
>>
>> case class My(name:String, age:Int)
>>
>> import spark.implicits._
>>
>> val t = List(new My("lian", 20), new My("sh", 3)).toDS
>>
>> t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass])
>>
>>
>> <console>:31: error: type getClass is not a member of object My
>>
>>        t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass])
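>>
>> (For reference, My.getClass is a value rather than a type, which is why it
>> cannot appear as the type parameter. A sketch that should compile is to give
>> kryo the type itself, e.g.:
>>
>> val enc = org.apache.spark.sql.Encoders.kryo[My]
>> val t2 = t.map(k => k.copy(age = k.age + 1))(enc)  // Dataset[My], kryo-encoded
>>
>> // and for a pure side effect such as printing, no encoder is needed at all:
>> t.foreach(k => println(k))
>>
>> though with kryo the resulting dataset is stored as a single binary column
>> rather than typed columns.)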
>>
>>
>>
>> Using RDD can work around this issue, as mentioned in previous emails:
>>
>>
>>  t.rdd.map{ k=> print(k) }
>>
>>
>> *Holden,*
>>
>>
>> The remaining problem is:
>>
>>
>> Spark needs all classes used in fn() to be serializable for t.rdd.map{ k =>
>> fn(k) } to work. This can be hard since some classes in third-party
>> libraries are not serializable. This limits the power of using Spark to
>> parallelize an operation across multiple machines. Hope this is clear.
>>
>>
>> On Sat, Feb 17, 2018 at 12:04 AM, SNEHASISH DUTTA <
>> info.snehas...@gmail.com> wrote:
>>
>>> Hi  Lian,
>>>
>>> This could be the solution
>>>
>>>
>>> case class Symbol(symbol: String, sector: String)
>>>
>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>
>>>
>>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>>>
>>>
>>>     symbolDs.map { k =>
>>>
>>>       pullSymbolFromYahoo(k.symbol, k.sector)
>>>
>>>     }(org.apache.spark.sql.Encoders.kryo[Tick.getClass])
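>>>
>>> Alternatively, if pullSymbolFromYahoo can be changed to return plain Tick
>>> values instead of a Dataset[Tick], the implicit case-class encoder from
>>> spark.implicits._ applies and no explicit encoder is needed at all. A
>>> sketch, where fetchTicks is a hypothetical stand-in for the Yahoo call:
>>>
>>> import spark.implicits._
>>>
>>> // hypothetical stand-in: one Tick per symbol, with dummy prices
>>> def fetchTicks(symbol: String, sector: String): Seq[Tick] =
>>>   Seq(Tick(symbol, sector, 0.0, 0.0))
>>>
>>> val tickDs = symbolDs.flatMap(s => fetchTicks(s.symbol, s.sector))  // Dataset[Tick]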
>>>
>>>
>>> Thanks,
>>>
>>> Snehasish
>>>
>>>
>>> Regards,
>>> Snehasish
>>>
>>> On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau <holden.ka...@gmail.com>
>>> wrote:
>>>
>>>> I'm not sure what you mean by "it could be hard to serialize complex
>>>> operations"?
>>>>
>>>> Regardless I think the question is do you want to parallelize this on
>>>> multiple machines or just one?
>>>>
>>>> On Feb 17, 2018 4:20 PM, "Lian Jiang" <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Ayan. RDD may support map better than Dataset/DataFrame.
>>>>> However, it could be hard to serialize a complex operation for Spark to
>>>>> execute in parallel. IMHO, Spark does not fit this scenario. Hope this
>>>>> makes sense.
>>>>>
>>>>> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha <guha.a...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> ** You do NOT need dataframes, I mean.....
>>>>>>
>>>>>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha <guha.a...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Couple of suggestions:
>>>>>>>
>>>>>>> 1. Do not use Dataset; use DataFrame in this scenario. There is no
>>>>>>> benefit from Dataset features here. With a DataFrame, you can write an
>>>>>>> arbitrary UDF that does what you want to do.
>>>>>>> 2. In fact, you do not even need DataFrames here. You would be better
>>>>>>> off with an RDD: just create an RDD of symbols and use map to do the
>>>>>>> processing (a rough sketch follows below).
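>>>>>>>
>>>>>>> A rough sketch of option 2, where fetchTicks is a hypothetical per-symbol
>>>>>>> fetch that returns a local collection of Tick (not a Dataset). Note that
>>>>>>> whatever the closure references still has to be serialisable or else be
>>>>>>> constructed on the executors:
>>>>>>>
>>>>>>> import spark.implicits._
>>>>>>>
>>>>>>> // hypothetical stand-in for the Yahoo call
>>>>>>> def fetchTicks(symbol: String, sector: String): Seq[Tick] =
>>>>>>>   Seq(Tick(symbol, sector, 0.0, 0.0))
>>>>>>>
>>>>>>> // parallelise the symbols as an RDD, fetch per symbol on the executors,
>>>>>>> // then convert the results back to a Dataset[Tick] if needed
>>>>>>> val tickDs = symbolDs.rdd
>>>>>>>   .flatMap(s => fetchTicks(s.symbol, s.sector))
>>>>>>>   .toDS()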
>>>>>>>
>>>>>>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran <
>>>>>>> irving.du...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Do you only want to use Scala? Otherwise, I think that with PySpark
>>>>>>>> and pandas' read_table you should be able to accomplish what you want.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>>
>>>>>>>> Irving Duran
>>>>>>>>
>>>>>>>> On 02/16/2018 06:10 PM, Lian Jiang wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a use case:
>>>>>>>>
>>>>>>>> I want to download S&P 500 stock data from the Yahoo API in parallel
>>>>>>>> using Spark. I have all the stock symbols as a Dataset. Then I used the
>>>>>>>> code below to call the Yahoo API for each symbol:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> case class Symbol(symbol: String, sector: String)
>>>>>>>>
>>>>>>>> case class Tick(symbol: String, sector: String, open: Double, close: Double)
>>>>>>>>
>>>>>>>>
>>>>>>>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
>>>>>>>>
>>>>>>>>
>>>>>>>>     symbolDs.map { k =>
>>>>>>>>
>>>>>>>>       pullSymbolFromYahoo(k.symbol, k.sector)
>>>>>>>>
>>>>>>>>     }
>>>>>>>>
>>>>>>>>
>>>>>>>> This statement cannot compile:
>>>>>>>>
>>>>>>>>
>>>>>>>> Unable to find encoder for type stored in a Dataset.  Primitive types
>>>>>>>> (Int, String, etc) and Product types (case classes) are supported by
>>>>>>>> importing spark.implicits._  Support for serializing other types will
>>>>>>>> be added in future releases.
>>>>>>>>
>>>>>>>>
>>>>>>>> My questions are:
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. As you can see, this scenario is not traditional Dataset handling
>>>>>>>> such as count, SQL queries, etc. Instead, it is more like a UDF that
>>>>>>>> applies an arbitrary operation to each record. Is Spark good at
>>>>>>>> handling such a scenario?
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. Regarding the compilation error, any fix? I did not find a
>>>>>>>> satisfactory solution online.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for the help!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards,
>>>>>>> Ayan Guha
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>
> --
> -- Anastasios Zouzias
> <a...@zurich.ibm.com>
>
