Can Precompiled Stand Alone Python Application Submitted To A Spark Cluster?
Hi, To pretect the IP of our software distributed to customers, one solution is to use precompiled python scriptes, but we are wondering whether this is a supported feature by pyspark. Thanks.
Re: Does Pyspark Support Graphx?
Most likely not as most of the effort is currently on GraphFrames - a great blog post on the what GraphFrames offers can be found at: https://databricks.com/blog/2016/03/03/introducing-graphframes.html. Is there a particular scenario or situation that you're addressing that requires GraphX vs. GraphFrames? On Sat, Feb 17, 2018 at 8:26 PM xiaobo wrote: > Thanks Denny, will it be supported in the near future? > > > > -- Original -- > *From:* Denny Lee > *Date:* Sun,Feb 18,2018 11:05 AM > *To:* 94035420 > *Cc:* user@spark.apache.org > *Subject:* Re: Does Pyspark Support Graphx? > > That’s correct - you can use GraphFrames though as it does support > PySpark. > On Sat, Feb 17, 2018 at 17:36 94035420 wrote: > >> I can not find anything for graphx module in the python API document, >> does it mean it is not supported yet? >> >
Re: Does Pyspark Support Graphx?
Thanks Denny, will it be supported in the near future? -- Original -- From: Denny Lee Date: Sun,Feb 18,2018 11:05 AM To: 94035420 Cc: user@spark.apache.org Subject: Re: Does Pyspark Support Graphx? That??s correct - you can use GraphFrames though as it does support PySpark. On Sat, Feb 17, 2018 at 17:36 94035420 wrote: I can not find anything for graphx module in the python API document, does it mean it is not supported yet?
Re: Does Pyspark Support Graphx?
That’s correct - you can use GraphFrames though as it does support PySpark. On Sat, Feb 17, 2018 at 17:36 94035420 wrote: > I can not find anything for graphx module in the python API document, does > it mean it is not supported yet? >
Does Pyspark Support Graphx?
I can not find anything for graphx module in the python API document, does it mean it is not supported yet?
can we do self join on streaming dataset in 2.2.0?
Hi All, I know that stream to stream joins are not yet supported. From the text below I wonder if we can do self joins on the same streaming dataset/dataframe in 2.2.0 since there are no two explicit streaming datasets or dataframes? Thanks!! In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames. The challenge of generating join results between two data streams is that, at any point of time, the view of the dataset is incomplete for both sides of the join making it much harder to find matches between inputs. Any row received from one input stream can match with any future, yet-to-be-received row from the other input stream. Hence, for both the input streams, we buffer past input as streaming state, so that we can match every future input with past input and accordingly generate joined results. Furthermore, similar to streaming aggregations, we automatically handle late, out-of-order data and can limit the state using watermarks. Let’s discuss the different types of supported stream-stream joins and how to use them.
Re: Can spark handle this scenario?
Thanks Anastasios. This link is helpful! On Sat, Feb 17, 2018 at 11:05 AM, Anastasios Zouzias wrote: > Hi Lian, > > The remaining problem is: > > > Spark need all classes used in the fn() serializable for t.rdd.map{ k=> > fn(k) } to work. This could be hard since some classes in third party > libraries are not serializable. This restricts the power of using spark to > parallel an operation on multiple machines. Hope this is clear. > > > This is not entirely true. You can bypass the serialisation issue in most > cases, see the link below for an example. > > https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in- > apache-spark/ > > In a nutshell, the non-serialisable code is available to all executors, so > there is no need for Spark to serialise from the driver to the executors. > > Best regards, > Anastasios > > > > > On Sat, Feb 17, 2018 at 6:13 PM, Lian Jiang wrote: > >> *Snehasish,* >> >> I got this in spark-shell 2.11.8: >> >> case class My(name:String, age:Int) >> >> import spark.implicits._ >> >> val t = List(new My("lian", 20), new My("sh", 3)).toDS >> >> t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass]) >> >> >> :31: error: type getClass is not a member of object My >> >>t.map{ k=> print(My) }(org.apache.spark.sql.Encoder >> s.kryo[My.getClass]) >> >> >> >> Using RDD can workaround this issue as mentioned in previous emails: >> >> >> t.rdd.map{ k=> print(k) } >> >> >> *Holden,* >> >> >> The remaining problem is: >> >> >> Spark need all classes used in the fn() serializable for t.rdd.map{ k=> >> fn(k) } to work. This could be hard since some classes in third party >> libraries are not serializable. This restricts the power of using spark to >> parallel an operation on multiple machines. Hope this is clear. >> >> >> On Sat, Feb 17, 2018 at 12:04 AM, SNEHASISH DUTTA < >> info.snehas...@gmail.com> wrote: >> >>> Hi Lian, >>> >>> This could be the solution >>> >>> >>> case class Symbol(symbol: String, sector: String) >>> >>> case class Tick(symbol: String, sector: String, open: Double, close: >>> Double) >>> >>> >>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns >>> Dataset[Tick] >>> >>> >>> symbolDs.map { k => >>> >>> pullSymbolFromYahoo(k.symbol, k.sector) >>> >>> }(org.apache.spark.sql.Encoders.kryo[Tick.getClass]) >>> >>> >>> Thanks, >>> >>> Snehasish >>> >>> >>> Regards, >>> Snehasish >>> >>> On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau >>> wrote: >>> I'm not sure what you mean by it could be hard to serialize complex operations? Regardless I think the question is do you want to parallelize this on multiple machines or just one? On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: > Thanks Ayan. RDD may support map better than Dataset/DataFrame. > However, it could be hard to serialize complex operation for Spark to > execute in parallel. IMHO, spark does not fit this scenario. Hope this > makes sense. > > On Fri, Feb 16, 2018 at 8:58 PM, ayan guha > wrote: > >> ** You do NOT need dataframes, I mean. >> >> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha >> wrote: >> >>> Hi >>> >>> Couple of suggestions: >>> >>> 1. Do not use Dataset, use Dataframe in this scenario. There is no >>> benefit of dataset features here. Using Dataframe, you can write an >>> arbitrary UDF which can do what you want to do. >>> 2. In fact you do need dataframes here. You would be better off with >>> RDD here. just create a RDD of symbols and use map to do the processing. >>> >>> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran < >>> irving.du...@gmail.com> wrote: >>> Do you only want to use Scala? Because otherwise, I think with pyspark and pandas read table you should be able to accomplish what you want to accomplish. Thank you, Irving Duran On 02/16/2018 06:10 PM, Lian Jiang wrote: Hi, I have a user case: I want to download S&P500 stock data from Yahoo API in parallel using Spark. I have got all stock symbols as a Dataset. Then I used below code to call Yahoo API for each symbol: case class Symbol(symbol: String, sector: String) case class Tick(symbol: String, sector: String, open: Double, close: Double) // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] symbolDs.map { k => pullSymbolFromYahoo(k.symbol, k.sector) } This statement cannot compile: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported
Re: Can spark handle this scenario?
Hi Lian, The remaining problem is: Spark need all classes used in the fn() serializable for t.rdd.map{ k=> fn(k) } to work. This could be hard since some classes in third party libraries are not serializable. This restricts the power of using spark to parallel an operation on multiple machines. Hope this is clear. This is not entirely true. You can bypass the serialisation issue in most cases, see the link below for an example. https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/ In a nutshell, the non-serialisable code is available to all executors, so there is no need for Spark to serialise from the driver to the executors. Best regards, Anastasios On Sat, Feb 17, 2018 at 6:13 PM, Lian Jiang wrote: > *Snehasish,* > > I got this in spark-shell 2.11.8: > > case class My(name:String, age:Int) > > import spark.implicits._ > > val t = List(new My("lian", 20), new My("sh", 3)).toDS > > t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass]) > > > :31: error: type getClass is not a member of object My > >t.map{ k=> print(My) }(org.apache.spark.sql. > Encoders.kryo[My.getClass]) > > > > Using RDD can workaround this issue as mentioned in previous emails: > > > t.rdd.map{ k=> print(k) } > > > *Holden,* > > > The remaining problem is: > > > Spark need all classes used in the fn() serializable for t.rdd.map{ k=> > fn(k) } to work. This could be hard since some classes in third party > libraries are not serializable. This restricts the power of using spark to > parallel an operation on multiple machines. Hope this is clear. > > > On Sat, Feb 17, 2018 at 12:04 AM, SNEHASISH DUTTA < > info.snehas...@gmail.com> wrote: > >> Hi Lian, >> >> This could be the solution >> >> >> case class Symbol(symbol: String, sector: String) >> >> case class Tick(symbol: String, sector: String, open: Double, close: >> Double) >> >> >> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] >> >> >> symbolDs.map { k => >> >> pullSymbolFromYahoo(k.symbol, k.sector) >> >> }(org.apache.spark.sql.Encoders.kryo[Tick.getClass]) >> >> >> Thanks, >> >> Snehasish >> >> >> Regards, >> Snehasish >> >> On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau >> wrote: >> >>> I'm not sure what you mean by it could be hard to serialize complex >>> operations? >>> >>> Regardless I think the question is do you want to parallelize this on >>> multiple machines or just one? >>> >>> On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: >>> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, it could be hard to serialize complex operation for Spark to execute in parallel. IMHO, spark does not fit this scenario. Hope this makes sense. On Fri, Feb 16, 2018 at 8:58 PM, ayan guha wrote: > ** You do NOT need dataframes, I mean. > > On Sat, Feb 17, 2018 at 3:58 PM, ayan guha > wrote: > >> Hi >> >> Couple of suggestions: >> >> 1. Do not use Dataset, use Dataframe in this scenario. There is no >> benefit of dataset features here. Using Dataframe, you can write an >> arbitrary UDF which can do what you want to do. >> 2. In fact you do need dataframes here. You would be better off with >> RDD here. just create a RDD of symbols and use map to do the processing. >> >> On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran < >> irving.du...@gmail.com> wrote: >> >>> Do you only want to use Scala? Because otherwise, I think with >>> pyspark and pandas read table you should be able to accomplish what you >>> want to accomplish. >>> >>> Thank you, >>> >>> Irving Duran >>> >>> On 02/16/2018 06:10 PM, Lian Jiang wrote: >>> >>> Hi, >>> >>> I have a user case: >>> >>> I want to download S&P500 stock data from Yahoo API in parallel >>> using Spark. I have got all stock symbols as a Dataset. Then I used >>> below >>> code to call Yahoo API for each symbol: >>> >>> >>> >>> case class Symbol(symbol: String, sector: String) >>> >>> case class Tick(symbol: String, sector: String, open: Double, close: >>> Double) >>> >>> >>> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns >>> Dataset[Tick] >>> >>> >>> symbolDs.map { k => >>> >>> pullSymbolFromYahoo(k.symbol, k.sector) >>> >>> } >>> >>> >>> This statement cannot compile: >>> >>> >>> Unable to find encoder for type stored in a Dataset. Primitive >>> types (Int, String, etc) and Product types (case classes) are supported >>> by >>> importing spark.implicits._ Support for serializing other types >>> will be added in future releases. >>> >>> >>> My questions are: >>> >>> >>> 1. As you can see, this scenario is not traditional dataset handling >>> such as count, sql query... Instead, it is more
Re: Can spark handle this scenario?
Agreed. Thanks. On Sat, Feb 17, 2018 at 9:53 AM, Jörn Franke wrote: > You may want to think about separating the import step from the processing > step. It is not very economical to download all the data again every time > you want to calculate something. So download it first and store it on a > distributed file system. Schedule to download newest information every day/ > hour etc. you can store it using a query optimized format such as ORC or > Parquet. Then you can run queries over it. > > On 17. Feb 2018, at 01:10, Lian Jiang wrote: > > Hi, > > I have a user case: > > I want to download S&P500 stock data from Yahoo API in parallel using > Spark. I have got all stock symbols as a Dataset. Then I used below code to > call Yahoo API for each symbol: > > > > case class Symbol(symbol: String, sector: String) > > case class Tick(symbol: String, sector: String, open: Double, close: > Double) > > > // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] > > > symbolDs.map { k => > > pullSymbolFromYahoo(k.symbol, k.sector) > > } > > > This statement cannot compile: > > > Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will be > added in future releases. > > > My questions are: > > > 1. As you can see, this scenario is not traditional dataset handling such > as count, sql query... Instead, it is more like a UDF which apply random > operation on each record. Is Spark good at handling such scenario? > > > 2. Regarding the compilation error, any fix? I did not find a satisfactory > solution online. > > > Thanks for help! > > > > >
Re: Can spark handle this scenario?
You may want to think about separating the import step from the processing step. It is not very economical to download all the data again every time you want to calculate something. So download it first and store it on a distributed file system. Schedule to download newest information every day/ hour etc. you can store it using a query optimized format such as ORC or Parquet. Then you can run queries over it. > On 17. Feb 2018, at 01:10, Lian Jiang wrote: > > Hi, > > I have a user case: > > I want to download S&P500 stock data from Yahoo API in parallel using Spark. > I have got all stock symbols as a Dataset. Then I used below code to call > Yahoo API for each symbol: > > > case class Symbol(symbol: String, sector: String) > case class Tick(symbol: String, sector: String, open: Double, close: Double) > > // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] > > symbolDs.map { k => > pullSymbolFromYahoo(k.symbol, k.sector) > } > > This statement cannot compile: > > Unable to find encoder for type stored in a Dataset. Primitive types (Int, > String, etc) and Product types (case classes) are supported by importing > spark.implicits._ Support for serializing other types will be added in > future releases. > > > My questions are: > > 1. As you can see, this scenario is not traditional dataset handling such as > count, sql query... Instead, it is more like a UDF which apply random > operation on each record. Is Spark good at handling such scenario? > > 2. Regarding the compilation error, any fix? I did not find a satisfactory > solution online. > > Thanks for help! > > >
Re: Can spark handle this scenario?
*Snehasish,* I got this in spark-shell 2.11.8: case class My(name:String, age:Int) import spark.implicits._ val t = List(new My("lian", 20), new My("sh", 3)).toDS t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass]) :31: error: type getClass is not a member of object My t.map{ k=> print(My) }(org.apache.spark.sql.Encoders.kryo[My.getClass]) Using RDD can workaround this issue as mentioned in previous emails: t.rdd.map{ k=> print(k) } *Holden,* The remaining problem is: Spark need all classes used in the fn() serializable for t.rdd.map{ k=> fn(k) } to work. This could be hard since some classes in third party libraries are not serializable. This restricts the power of using spark to parallel an operation on multiple machines. Hope this is clear. On Sat, Feb 17, 2018 at 12:04 AM, SNEHASISH DUTTA wrote: > Hi Lian, > > This could be the solution > > > case class Symbol(symbol: String, sector: String) > > case class Tick(symbol: String, sector: String, open: Double, close: > Double) > > > // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] > > > symbolDs.map { k => > > pullSymbolFromYahoo(k.symbol, k.sector) > > }(org.apache.spark.sql.Encoders.kryo[Tick.getClass]) > > > Thanks, > > Snehasish > > > Regards, > Snehasish > > On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau > wrote: > >> I'm not sure what you mean by it could be hard to serialize complex >> operations? >> >> Regardless I think the question is do you want to parallelize this on >> multiple machines or just one? >> >> On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: >> >>> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, >>> it could be hard to serialize complex operation for Spark to execute in >>> parallel. IMHO, spark does not fit this scenario. Hope this makes sense. >>> >>> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha wrote: >>> ** You do NOT need dataframes, I mean. On Sat, Feb 17, 2018 at 3:58 PM, ayan guha wrote: > Hi > > Couple of suggestions: > > 1. Do not use Dataset, use Dataframe in this scenario. There is no > benefit of dataset features here. Using Dataframe, you can write an > arbitrary UDF which can do what you want to do. > 2. In fact you do need dataframes here. You would be better off with > RDD here. just create a RDD of symbols and use map to do the processing. > > On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran > wrote: > >> Do you only want to use Scala? Because otherwise, I think with >> pyspark and pandas read table you should be able to accomplish what you >> want to accomplish. >> >> Thank you, >> >> Irving Duran >> >> On 02/16/2018 06:10 PM, Lian Jiang wrote: >> >> Hi, >> >> I have a user case: >> >> I want to download S&P500 stock data from Yahoo API in parallel using >> Spark. I have got all stock symbols as a Dataset. Then I used below code >> to >> call Yahoo API for each symbol: >> >> >> >> case class Symbol(symbol: String, sector: String) >> >> case class Tick(symbol: String, sector: String, open: Double, close: >> Double) >> >> >> // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns >> Dataset[Tick] >> >> >> symbolDs.map { k => >> >> pullSymbolFromYahoo(k.symbol, k.sector) >> >> } >> >> >> This statement cannot compile: >> >> >> Unable to find encoder for type stored in a Dataset. Primitive >> types (Int, String, etc) and Product types (case classes) are supported >> by >> importing spark.implicits._ Support for serializing other types >> will be added in future releases. >> >> >> My questions are: >> >> >> 1. As you can see, this scenario is not traditional dataset handling >> such as count, sql query... Instead, it is more like a UDF which apply >> random operation on each record. Is Spark good at handling such scenario? >> >> >> 2. Regarding the compilation error, any fix? I did not find a >> satisfactory solution online. >> >> >> Thanks for help! >> >> >> >> >> >> > > > -- > Best Regards, > Ayan Guha > -- Best Regards, Ayan Guha >>> >>> >
Re: Can spark handle this scenario?
Hi Lian, This could be the solution case class Symbol(symbol: String, sector: String) case class Tick(symbol: String, sector: String, open: Double, close: Double) // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns Dataset[Tick] symbolDs.map { k => pullSymbolFromYahoo(k.symbol, k.sector) }(org.apache.spark.sql.Encoders.kryo[Tick.getClass]) Thanks, Snehasish Regards, Snehasish On Sat, Feb 17, 2018 at 1:05 PM, Holden Karau wrote: > I'm not sure what you mean by it could be hard to serialize complex > operations? > > Regardless I think the question is do you want to parallelize this on > multiple machines or just one? > > On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: > >> Thanks Ayan. RDD may support map better than Dataset/DataFrame. However, >> it could be hard to serialize complex operation for Spark to execute in >> parallel. IMHO, spark does not fit this scenario. Hope this makes sense. >> >> On Fri, Feb 16, 2018 at 8:58 PM, ayan guha wrote: >> >>> ** You do NOT need dataframes, I mean. >>> >>> On Sat, Feb 17, 2018 at 3:58 PM, ayan guha wrote: >>> Hi Couple of suggestions: 1. Do not use Dataset, use Dataframe in this scenario. There is no benefit of dataset features here. Using Dataframe, you can write an arbitrary UDF which can do what you want to do. 2. In fact you do need dataframes here. You would be better off with RDD here. just create a RDD of symbols and use map to do the processing. On Sat, Feb 17, 2018 at 12:40 PM, Irving Duran wrote: > Do you only want to use Scala? Because otherwise, I think with pyspark > and pandas read table you should be able to accomplish what you want to > accomplish. > > Thank you, > > Irving Duran > > On 02/16/2018 06:10 PM, Lian Jiang wrote: > > Hi, > > I have a user case: > > I want to download S&P500 stock data from Yahoo API in parallel using > Spark. I have got all stock symbols as a Dataset. Then I used below code > to > call Yahoo API for each symbol: > > > > case class Symbol(symbol: String, sector: String) > > case class Tick(symbol: String, sector: String, open: Double, close: > Double) > > > // symbolDS is Dataset[Symbol], pullSymbolFromYahoo returns > Dataset[Tick] > > > symbolDs.map { k => > > pullSymbolFromYahoo(k.symbol, k.sector) > > } > > > This statement cannot compile: > > > Unable to find encoder for type stored in a Dataset. Primitive types > (Int, String, etc) and Product types (case classes) are supported by > importing spark.implicits._ Support for serializing other types will > be added in future releases. > > > My questions are: > > > 1. As you can see, this scenario is not traditional dataset handling > such as count, sql query... Instead, it is more like a UDF which apply > random operation on each record. Is Spark good at handling such scenario? > > > 2. Regarding the compilation error, any fix? I did not find a > satisfactory solution online. > > > Thanks for help! > > > > > > -- Best Regards, Ayan Guha >>> >>> >>> >>> -- >>> Best Regards, >>> Ayan Guha >>> >> >>