Re: Can spark handle this scenario?

2018-02-26 Thread Lian Jiang
Thanks Vijay. After changing the programming model (creating a context class for the workers), it finally worked for me. Cheers.

On Fri, Feb 23, 2018 at 5:42 PM, vijay.bvp wrote:
> when an HTTP connection is opened, you are opening a connection between one
> specific machine (with …
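For archive readers: a minimal sketch of the context-class pattern described above; YahooClient and its fetch method are hypothetical stand-ins, since the actual class was not posted to the thread.

    import org.apache.spark.sql.SparkSession

    // Hypothetical stand-in for a non-serializable API client.
    class YahooClient {
      def fetch(symbol: String): (Double, Double) = (0.0, 0.0) // placeholder call
    }

    // One client per executor JVM: the lazy val is initialized on first use
    // inside a task, so the client itself is never shipped from the driver.
    object WorkerContext {
      lazy val client = new YahooClient
    }

    case class Symbol(symbol: String, sector: String)
    case class Tick(symbol: String, sector: String, open: Double, close: Double)

    val spark = SparkSession.builder.appName("ticks").getOrCreate()
    import spark.implicits._

    val symbolDs = List(Symbol("AAPL", "tech")).toDS()
    val ticks = symbolDs.map { s =>
      val (open, close) = WorkerContext.client.fetch(s.symbol) // runs on the executor
      Tick(s.symbol, s.sector, open, close)
    }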

Re: Can spark handle this scenario?

2018-02-23 Thread vijay.bvp
When an HTTP connection is opened, you are opening a connection between one specific machine (with an IP and NIC card) and another specific machine, so it can't be serialized and used on another machine, right? This isn't a Spark limitation. I made a simple diagram if it helps. The objects created at the driver …

Re: Can spark handle this scenario?

2018-02-22 Thread Lian Jiang
Hi Vijay, Should HTTPConnection() (or any other object created per partition) be serializable for your code to work? If so, the usage seems limited. Sometimes the error caused by a non-serializable object can be very misleading (e.g. "Return statements aren't allowed in Spark closures") …
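To the question above: the answer in the thread is no; only values captured from the driver must be serializable, not objects created inside the task. A small illustration, with Conn as a hypothetical non-serializable class:

    class Conn // hypothetical, not Serializable

    val rdd = spark.sparkContext.parallelize(1 to 10)

    val conn = new Conn                // created on the driver
    rdd.map { x => conn.hashCode; x }  // fails: Task not serializable

    rdd.mapPartitions { it =>
      val conn = new Conn              // created inside the task, on the executor;
      it.map(x => x)                   // it is never serialized, so this works
    }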

Re: Can spark handle this scenario?

2018-02-20 Thread Lian Jiang
Thanks Vijay! This is very clear.

On Tue, Feb 20, 2018 at 12:47 AM, vijay.bvp wrote:
> I am assuming the pullSymbolFromYahoo function opens a connection to the
> Yahoo API with some token passed; in the code provided so far, if you have
> 2000 symbols, it will make 2000 new …

Re: Can spark handle this scenario?

2018-02-20 Thread vijay.bvp
I am assuming the pullSymbolFromYahoo function opens a connection to the Yahoo API with some token passed. In the code provided so far, if you have 2000 symbols, it will make 2000 new connections and 2000 API calls! Connection objects can't (and shouldn't) be serialized and sent to executors; they should …
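A sketch of where the truncated advice presumably goes: create the connection on the executors, one per partition, rather than serializing it or opening one per symbol. HttpConnection and fetchTick are hypothetical names; symbolsRdd is assumed to be an RDD[String] of symbols.

    // Hypothetical non-serializable connection with a per-symbol fetch.
    class HttpConnection {
      def fetchTick(symbol: String): String = "" // placeholder API call
      def close(): Unit = ()
    }

    val results = symbolsRdd.mapPartitions { symbols =>
      val conn = new HttpConnection                // one connection per partition,
      val out = symbols.map(conn.fetchTick).toList // not one per symbol
      conn.close()                                 // safe: toList consumed the iterator
      out.iterator
    }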

Re: Can spark handle this scenario?

2018-02-17 Thread Lian Jiang
Thanks Anastasios. This link is helpful!

On Sat, Feb 17, 2018 at 11:05 AM, Anastasios Zouzias wrote:
> Hi Lian,
>
> The remaining problem is:
>
> > Spark needs all classes used in fn() to be serializable for t.rdd.map{ k =>
> > fn(k) } to work. This could be hard since some classes …

Re: Can spark handle this scenario?

2018-02-17 Thread Anastasios Zouzias
Hi Lian, The remaining problem is:

> Spark needs all classes used in fn() to be serializable for t.rdd.map{ k => fn(k) } to work. This could be hard since some classes in third-party libraries are not serializable. This restricts the power of using Spark to parallelize an operation across multiple machines.

…
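A common workaround for the quoted problem: wrap the non-serializable third-party class in a serializable holder with a @transient lazy val, so the object is rebuilt on each executor instead of being shipped. Parser here is a hypothetical stand-in for such a library class.

    // Hypothetical third-party class that is not Serializable.
    class Parser { def parse(s: String): String = s.toUpperCase }

    // Serializable shell; the @transient field is skipped during serialization
    // and lazily re-created wherever the shell is deserialized.
    class ParserHolder extends Serializable {
      @transient lazy val parser = new Parser
    }

    val holder = new ParserHolder
    val parsed = spark.sparkContext.parallelize(Seq("a", "b"))
      .map(line => holder.parser.parse(line))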

Re: Can spark handle this scenario?

2018-02-17 Thread Lian Jiang
Agreed. Thanks.

On Sat, Feb 17, 2018 at 9:53 AM, Jörn Franke wrote:
> You may want to think about separating the import step from the processing
> step. It is not very economical to download all the data again every time
> you want to calculate something. So download it …

Re: Can spark handle this scenario?

2018-02-17 Thread Jörn Franke
You may want to think about separating the import step from the processing step. It is not very economical to download all the data again every time you want to calculate something. So download it first and store it on a distributed file system. Schedule a download of the newest information every …
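A sketch of the separation Jörn suggests, assuming Parquet on HDFS as the storage format; fetchAllFromYahoo and the path are hypothetical placeholders:

    import org.apache.spark.sql.Dataset
    import spark.implicits._

    case class Tick(symbol: String, sector: String, open: Double, close: Double)

    // Hypothetical ingest call; the body is a placeholder.
    def fetchAllFromYahoo(): Dataset[Tick] = Seq.empty[Tick].toDS()

    // Ingest job, run on a schedule: fetch once and persist.
    fetchAllFromYahoo().write.mode("append").parquet("hdfs:///data/ticks")

    // Processing job, run whenever needed: reads the stored copy
    // instead of re-downloading from the API.
    val ticks = spark.read.parquet("hdfs:///data/ticks").as[Tick]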

Re: Can spark handle this scenario?

2018-02-17 Thread Lian Jiang
Snehasish, I got this in spark-shell 2.11.8:

    case class My(name: String, age: Int)
    import spark.implicits._
    val t = List(new My("lian", 20), new My("sh", 3)).toDS
    t.map{ k => print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass])

    :31: error: type getClass is not a member of object My
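The error comes from passing a runtime value (My.getClass) where a type parameter is expected; Encoders.kryo takes the class as a type parameter. A corrected sketch of the same attempt:

    import org.apache.spark.sql.Encoders
    import spark.implicits._

    case class My(name: String, age: Int)
    val t = List(My("lian", 20), My("sh", 3)).toDS()

    // Pass the type itself, not .getClass:
    val mapped = t.map(k => k)(Encoders.kryo[My])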

Re: Can spark handle this scenario?

2018-02-17 Thread SNEHASISH DUTTA
Hi Lian, This could be the solution:

    case class Symbol(symbol: String, sector: String)
    case class Tick(symbol: String, sector: String, open: Double, close: Double)

    // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick]
    symbolDs.map { k => …
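The body of the map was cut off by the archive. Judging from Lian's follow-up, the suggestion passed an explicit Kryo encoder to map; a hedged working shape of that idea, reusing the case classes above and assuming pullSymbolFromYahoo yields a plain collection of ticks for one symbol:

    import org.apache.spark.sql.Encoders

    // Hypothetical per-symbol fetch; the body is a placeholder.
    def pullSymbolFromYahoo(symbol: String, sector: String): Seq[Tick] = Seq.empty

    val ticks = symbolDs.map { k =>
      pullSymbolFromYahoo(k.symbol, k.sector)
    }(Encoders.kryo[Seq[Tick]])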

Re: Can spark handle this scenario?

2018-02-16 Thread Holden Karau
I'm not sure what you mean by "it could be hard to serialize complex operations"? Regardless, I think the question is: do you want to parallelize this on multiple machines or just one?

On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote:
> Thanks Ayan. RDD may support map better …

Re: Can spark handle this scenario?

2018-02-16 Thread Lian Jiang
Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, it could be hard to serialize a complex operation for Spark to execute in parallel. IMHO, Spark does not fit this scenario. Hope this makes sense.

On Fri, Feb 16, 2018 at 8:58 PM, ayan guha wrote:
> ** …

Re: Can spark handle this scenario?

2018-02-16 Thread ayan guha
Hi,

A couple of suggestions:

1. Do not use Dataset; use DataFrame in this scenario. There is no benefit from Dataset features here. Using DataFrame, you can write an arbitrary UDF which can do what you want to do (see the sketch after this list).
2. In fact you do need dataframes here. You would be better off with RDD here. Just …
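A sketch of the UDF route in suggestion 1, given the symbols as a DataFrame symbolDf; pullQuote is a hypothetical placeholder for the API call:

    import org.apache.spark.sql.functions.{col, udf}

    // Hypothetical per-symbol fetch; returns (open, close).
    def pullQuote(symbol: String): (Double, Double) = (0.0, 0.0)

    val fetchUdf = udf(pullQuote _)
    val ticksDf = symbolDf.withColumn("quote", fetchUdf(col("symbol")))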

Re: Can spark handle this scenario?

2018-02-16 Thread ayan guha
** You do NOT need dataframes, I mean.

On Sat, Feb 17, 2018 at 3:58 PM, ayan guha wrote:
> Hi
>
> A couple of suggestions:
>
> 1. Do not use Dataset; use DataFrame in this scenario. There is no benefit
> from Dataset features here. Using DataFrame, you can write an …

Re: Can spark handle this scenario?

2018-02-16 Thread Irving Duran
Do you only want to use Scala? Because otherwise, I think with PySpark and pandas' read_table you should be able to accomplish what you want. Thank you, Irving Duran

On 02/16/2018 06:10 PM, Lian Jiang wrote:
> Hi,
>
> I have a use case:
>
> I want to download S&P stock data from …

Can spark handle this scenario?

2018-02-16 Thread Lian Jiang
Hi, I have a use case: I want to download S&P stock data from the Yahoo API in parallel using Spark. I have got all stock symbols as a Dataset. Then I used the code below to call the Yahoo API for each symbol:

    case class Symbol(symbol: String, sector: String)
    case class Tick(symbol: String, sector: String, open: Double, close: Double) …
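The rest of the message was cut off by the archive; a hedged reconstruction of the call's shape, based on the pieces quoted later in the thread (pullSymbolFromYahoo's real body was never shown):

    // Placeholder body; the original implementation called the Yahoo API.
    def pullSymbolFromYahoo(symbol: String, sector: String): Seq[Tick] = Seq.empty

    val ticks = symbolDs.flatMap(s => pullSymbolFromYahoo(s.symbol, s.sector))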