Re: Extending GraphFrames without running into serialization issues

2021-01-05 Thread Sean Owen
It's because this calls the no-arg superclass constructor, which leaves
_vertices and _edges in the actual GraphFrame class set to null. That is
what produces the error.
Normally you'd just say you want to call the two-arg superclass
constructor with "extends GraphFrame(_vertices, _edges)", but that
constructor is private.
Instead, try overriding the accessors in your class body:

  override def vertices = _vertices
  override def edges = _edges
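
A minimal, untested sketch of what that could look like in your example class,
assuming those accessors are not final in the GraphFrames version you are on:

  import org.apache.spark.sql.DataFrame
  import org.graphframes.GraphFrame

  // Keep references to the two DataFrames in the subclass and expose them
  // through the overridden accessors, since GraphFrame's two-arg constructor
  // is private and the no-arg one leaves its own fields null.
  class NewGraphFrame(@transient private val _vertices: DataFrame,
                      @transient private val _edges: DataFrame) extends GraphFrame {

    override def vertices: DataFrame = _vertices
    override def edges: DataFrame = _edges

    // additional graph functionality goes here
  }

With that in place, the inherited GraphFrame methods that go through
vertices/edges should see your DataFrames rather than the null fields left by
the no-arg constructor.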

On Tue, Jan 5, 2021 at 3:16 PM Michal Monselise 
wrote:

> Hi,
>
> I am trying to extend GraphFrames and create my own class that has some
> additional graph functionality.
>
> To simplify for this example, I have created a class that doesn't contain
> any functions. All it does is just extend GraphFrames:
>
> import org.apache.spark.sql.DataFrame
> import org.graphframes._
>
> class NewGraphFrame(@transient private val _vertices: DataFrame,
>                     @transient private val _edges: DataFrame) extends GraphFrame {
> }
> val vertices = Seq(
>   (1, "John"),
>   (2, "Jane"),
>   (3, "Karen")
> ).toDF("id", "name")
> val edges = Seq(
>   (1, 3),
>   (2, 3),
>   (2, 1)
> ).toDF("src", "dst")
> val g = new NewGraphFrame(vertices, edges)
>
> When I run this code in the REPL, I get the following error:
>
> java.lang.Exception: You cannot use GraphFrame objects within a Spark closure
>   at org.graphframes.GraphFrame.vertices(GraphFrame.scala:125)
>   at org.graphframes.GraphFrame.toString(GraphFrame.scala:55)
>   at 
> scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:332)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:337)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:345)
>   at .$print$lzycompute(:10)
>   at .$print(:6)
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
>   at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
>   at 
> scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
>   at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
>   at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
>   at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
>   at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
>   at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
>   at 
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
>   at 
> scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
>   at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
>   at org.apache.spark.repl.Main$.doMain(Main.scala:76)
>   at org.apache.spark.repl.Main$.main(Main.scala:56)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> I know that this means that I'm serializing twice. However, I am obviously
> not interested in doing that. I simply want to extend this class so that I
> can use the graph functionality in my class. How do I extend this class
> without the Spark REPL throwing this error?
>


Extending GraphFrames without running into serialization issues

2021-01-05 Thread Michal Monselise
Hi,

I am trying to extend GraphFrames and create my own class that has some
additional graph functionality.

To simplify for this example, I have created a class that doesn't contain
any functions. All it does is just extend GraphFrames:

import org.apache.spark.sql.DataFrame
import org.graphframes._

class NewGraphFrame(@transient private val _vertices: DataFrame,
                    @transient private val _edges: DataFrame) extends GraphFrame {
}
val vertices = Seq(
  (1, "John"),
  (2, "Jane"),
  (3, "Karen")
).toDF("id", "name")
val edges = Seq(
  (1, 3),
  (2, 3),
  (2, 1)
).toDF("src", "dst")
val g = new NewGraphFrame(vertices, edges)

When I run this code in the REPL, I get the following error:

java.lang.Exception: You cannot use GraphFrame objects within a Spark closure
  at org.graphframes.GraphFrame.vertices(GraphFrame.scala:125)
  at org.graphframes.GraphFrame.toString(GraphFrame.scala:55)
  at 
scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:332)
  at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:337)
  at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:345)
  at .$print$lzycompute(:10)
  at .$print(:6)
  at $print()
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
  at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
  at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
  at 
scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
  at 
scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
  at 
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
  at 
scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
  at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
  at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
  at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
  at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
  at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)
  at 
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
  at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
  at 
scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
  at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
  at org.apache.spark.repl.Main$.doMain(Main.scala:76)
  at org.apache.spark.repl.Main$.main(Main.scala:56)
  at org.apache.spark.repl.Main.main(Main.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
  at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I know that this means that I'm serializing twice. However, I am obviously
not interested in doing that. I simply want to extend this class so that I
can use the graph functionality in my class. How do I extend this class
without the Spark REPL throwing this error?


Re: Suggestion on Spark 2.4.7 vs Spark 3 for Kubernetes

2021-01-05 Thread Sachit Murarka
Thanks for the link, Prashant.

Regards
Sachit

On Tue, 5 Jan 2021, 15:08 Prashant Sharma,  wrote:

>  A lot of developers may have already moved to 3.0.x. FYI, 3.1.0 is just
> around the corner, hopefully in a few days, and has a lot of improvements to
> Spark on K8s, including Kubernetes support transitioning from experimental to
> GA in this release.
>
> See: https://issues.apache.org/jira/browse/SPARK-33005
>
> Thanks,
>
> On Tue, Jan 5, 2021 at 12:41 AM Sachit Murarka 
> wrote:
>
>> Hi Users,
>>
>> Could you please tell me which Spark version you have used in production for
>> Kubernetes?
>> Which version is recommended for production, given that both
>> Streaming and the core APIs have to be used from PySpark?
>>
>> Thanks !
>>
>> Kind Regards,
>> Sachit Murarka
>>
>


Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
OK will try it thanks



On Tue, 5 Jan 2021 at 15:42, Sean Owen  wrote:

> You need to fit a curve to those points using your chosen model. It sounds
> like you want scipy's curve_fit maybe? matplotlib is for plotting, not
> curve fitting.
> But that and the plotting are nothing to do with Spark here. Spark gives
> you the data as pandas so you can use all these tools as you like.
>
> On Tue, Jan 5, 2021 at 9:38 AM Mich Talebzadeh 
> wrote:
>
>> Thanks again
>>
>> Just to clarify, I want to see the average price for year 2021, 2022 etc
>> based on the best fit. So naively if someone asked a question what the
>> average price will be in 2022, I should be able to make some predictions.
>>
>> I can of course crudely use pen and pencil like shown in the attached
>> figure, but I was wondering if this is possible with anything that
>> matplotlib offers?
>>
>>
>>
>> [image: Capture123.PNG]
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 5 Jan 2021 at 15:22, Sean Owen  wrote:
>>
>>> You will need to use matplotlib on the driver to plot in any event. If
>>> this is a single extrapolation, over 11 data points, you can just use Spark
>>> to do the aggregation, call .toPandas, and do whatever you want in the
>>> Python ecosystem to fit and plot that result.
>>>



Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
You need to fit a curve to those points using your chosen model. It sounds
like you want scipy's curve_fit maybe? matplotlib is for plotting, not
curve fitting.
But that and the plotting are nothing to do with Spark here. Spark gives
you the data as pandas so you can use all these tools as you like.
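
Purely as an illustration of that idea, here is a minimal sketch with scipy's
curve_fit, using a made-up Lorentzian and synthetic yearly values rather than
your real data; once the curve is fitted, "extrapolation" is just evaluating
the fitted function at x values beyond the observed range:

import numpy as np
from scipy.optimize import curve_fit

# A simple three-parameter Lorentzian; substitute whatever model you prefer.
def lorentzian(x, amplitude, center, sigma):
    return (amplitude / np.pi) * sigma / ((x - center) ** 2 + sigma ** 2)

# Synthetic yearly averages for 2010-2020, generated from a Lorentzian plus
# noise purely to demonstrate the mechanics.
rng = np.random.default_rng(0)
years = np.arange(2010, 2021, dtype=float)
prices = lorentzian(years, 3.1e7, 2017.0, 8.0) + rng.normal(0, 2e4, years.size)

# Fit the curve to the observed points; p0 is a rough initial guess.
popt, _ = curve_fit(lorentzian, years, prices, p0=[1e7, 2016.0, 5.0])

# Evaluate the fitted function beyond the data range to get the projections.
future_years = np.array([2021.0, 2022.0])
print(dict(zip(future_years, lorentzian(future_years, *popt))))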

On Tue, Jan 5, 2021 at 9:38 AM Mich Talebzadeh 
wrote:

> Thanks again
>
> Just to clarify, I want to see the average price for year 2021, 2022 etc
> based on the best fit. So naively if someone asked a question what the
> average price will be in 2022, I should be able to make some predictions.
>
> I can of course crudely use pen and pencil like shown in the attached
> figure, but I was wondering if this is possible with anything that
> matplotlib offers?
>
>
>
> [image: Capture123.PNG]
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 5 Jan 2021 at 15:22, Sean Owen  wrote:
>
>> You will need to use matplotlib on the driver to plot in any event. If
>> this is a single extrapolation, over 11 data points, you can just use Spark
>> to do the aggregation, call .toPandas, and do whatever you want in the
>> Python ecosystem to fit and plot that result.
>>
>>>


Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
Thanks again

Just to clarify, I want to see the average price for years 2021, 2022, etc.,
based on the best fit. So, naively, if someone asked what the average price
will be in 2022, I should be able to make some prediction.

I can of course crudely use pen and pencil, as shown in the attached
figure, but I was wondering if this is possible with anything that
matplotlib offers?



[image: Capture123.PNG]



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 5 Jan 2021 at 15:22, Sean Owen  wrote:

> You will need to use matplotlib on the driver to plot in any event. If
> this is a single extrapolation, over 11 data points, you can just use Spark
> to do the aggregation, call .toPandas, and do whatever you want in the
> Python ecosystem to fit and plot that result.
>
> On Tue, Jan 5, 2021 at 9:18 AM Mich Talebzadeh 
> wrote:
>
>> thanks Sean.
>>
>> This is the gist of the case
>>
>> 
>>
>> I have data points for x-axis from 2010 till 2020 and values for y axis.
>> I am using PySpark, pandas and matplotlib. Data is read into PySpark from
>> the underlying database and a pandas Data Frame is built on it. Data is
>> aggregated over each year. However, the underlying prices are provided on a
>> monthly basis in CSV file which has been loaded into a Hive table
>>
>> summary_df = spark.sql(f"""SELECT cast(Year as int) as year,
>> AVGFlatPricePerYear, AVGTerracedPricePerYear, AVGSemiDetachedPricePerYear,
>> AVGDetachedPricePerYear FROM {v.DSDB}.yearlyhouseprices""")
>>
>> df_10 = summary_df.filter(col("year").between(f'{start_date}',
>> f'{end_date}'))
>>
>> p_dfm = df_10.toPandas()  # converting spark DF to Pandas DF
>>
>>
>> for i in range(n):
>>
>>   if p_dfm.columns[i] != 'year':   # year is x axis in integer
>>
>> vcolumn = p_dfm.columns[i]
>>
>>  print(vcolumn)
>>
>>  params = model.guess(p_dfm[vcolumn], x = p_dfm['year'])
>>
>>  result = model.fit(p_dfm[vcolumn], params, x = p_dfm['year'])
>>
>>  result.plot_fit()
>>
>>  if vcolumn == "AVGFlatPricePerYear":
>>
>>  plt.xlabel("Year", fontdict=v.font)
>>
>>  plt.ylabel("Flat house prices in millions/GBP", fontdict=v.font)
>>
>>  plt.title(f"""Flat price fluctuations in {regionname} for the
>> past 10 years """,  fontdict=v.font)
>>
>>  plt.text(0.35,
>>
>> 0.45,
>>
>> "Best-fit based on Non-Linear Lorentzian Model",
>>
>> transform=plt.gca().transAxes,
>>
>> color="grey",
>>
>> fontsize=10
>>
>>  )
>>
>>  print(result.fit_report())
>>
>>  plt.xlim(left=2009)
>>
>>  plt.xlim(right=2022)
>>
>>  plt.show()
>>
>>  plt.close()
>>
>> ```
>>
>> So far so good. I get a best fit plot as shown using Lorentzian model
>>
>> Also I have model fit data
>>
>> [[Model]]
>>
>> Model(lorentzian)
>>
>> [[Fit Statistics]]
>>
>> # fitting method   = leastsq
>>
>> # function evals   = 25
>>
>> # data points  = 11
>>
>> # variables= 3
>>
>> chi-square = 8.4155e+09
>>
>> reduced chi-square = 1.0519e+09
>>
>> Akaike info crit   = 231.009958
>>
>> Bayesian info crit = 232.203644
>>
>> [[Variables]]
>>
>> amplitude:  31107480.0 +/- 1471033.33 (4.73%) (init = 6106104)
>>
>> center: 2016.75722 +/- 0.18632315 (0.01%) (init = 2016.5)
>>
>> sigma:  8.37428353 +/- 0.45979189 (5.49%) (init = 3.5)
>>
>> fwhm:   16.7485671 +/- 0.91958379 (5.49%) == '2.000*sigma'
>>
>> height: 1182407.88 +/- 15681.8211 (1.33%) ==
>> '0.3183099*amplitude/max(2.220446049250313e-16, sigma)'
>>
>> [[Correlations]] (unreported correlations are < 0.100)
>>
>> C(amplitude, sigma)  =  0.977
>>
>> C(amplitude, center) =  0.644
>>
>> C(center, sigma) =  0.603
>>
>>
>> Now I need to predict the prices for years 2021-2022 based on this fit.
>> Is there any way I can use some plt functions to provide extrapolated
>> values for 2021 and beyond?
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> On Tue, 5 Jan 2021 at 14:43, Sean Owen  wrote:
>>
>>> If your data set is 11 points, surely this is not a distributed problem?
>>> or are you asking how to build tens of thousands of those projections in
>>> parallel?
>>>
>>> On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 I am not sure Spark forum is the correct avenue for this question.

>> I am using PySpark with matplotlib to get the best fit for data using
>> the Lorentzian Model. This curve uses 2010-2020 data points (11 on x-axis).
>> I need to predict the prices for years 2021-2025 base

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
You will need to use matplotlib on the driver to plot in any event. If this
is a single extrapolation, over 11 data points, you can just use Spark to
do the aggregation, call .toPandas, and do whatever you want in the Python
ecosystem to fit and plot that result.
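
As a rough sketch of that flow, reusing the lmfit Lorentzian model and column
names from the code quoted below (summary_df is assumed to exist exactly as in
that snippet): aggregate in Spark, convert with .toPandas, fit on the driver,
then evaluate the fitted model at the years you want to project:

import numpy as np
from lmfit.models import LorentzianModel

# summary_df holds the yearly aggregates produced by Spark SQL, as below.
p_dfm = summary_df.toPandas()

model = LorentzianModel()
params = model.guess(p_dfm['AVGFlatPricePerYear'], x=p_dfm['year'])
result = model.fit(p_dfm['AVGFlatPricePerYear'], params, x=p_dfm['year'])

# Extrapolate by evaluating the fitted model beyond the observed 2010-2020
# range; all of this happens on the driver in plain Python.
future_years = np.arange(2021, 2026)
print(dict(zip(future_years, result.eval(x=future_years))))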

On Tue, Jan 5, 2021 at 9:18 AM Mich Talebzadeh 
wrote:

> thanks Sean.
>
> This is the gist of the case
>
> 
>
> I have data points for x-axis from 2010 till 2020 and values for y axis. I
> am using PySpark, pandas and matplotlib. Data is read into PySpark from the
> underlying database and a pandas Data Frame is built on it. Data is
> aggregated over each year. However, the underlying prices are provided on a
> monthly basis in CSV file which has been loaded into a Hive table
>
> summary_df = spark.sql(f"""SELECT cast(Year as int) as year,
> AVGFlatPricePerYear, AVGTerracedPricePerYear, AVGSemiDetachedPricePerYear,
> AVGDetachedPricePerYear FROM {v.DSDB}.yearlyhouseprices""")
>
> df_10 = summary_df.filter(col("year").between(f'{start_date}',
> f'{end_date}'))
>
> p_dfm = df_10.toPandas()  # converting spark DF to Pandas DF
>
>
> for i in range(n):
>
>   if p_dfm.columns[i] != 'year':   # year is x axis in integer
>
> vcolumn = p_dfm.columns[i]
>
>  print(vcolumn)
>
>  params = model.guess(p_dfm[vcolumn], x = p_dfm['year'])
>
>  result = model.fit(p_dfm[vcolumn], params, x = p_dfm['year'])
>
>  result.plot_fit()
>
>  if vcolumn == "AVGFlatPricePerYear":
>
>  plt.xlabel("Year", fontdict=v.font)
>
>  plt.ylabel("Flat house prices in millions/GBP", fontdict=v.font)
>
>  plt.title(f"""Flat price fluctuations in {regionname} for the
> past 10 years """,  fontdict=v.font)
>
>  plt.text(0.35,
>
> 0.45,
>
> "Best-fit based on Non-Linear Lorentzian Model",
>
> transform=plt.gca().transAxes,
>
> color="grey",
>
> fontsize=10
>
>  )
>
>  print(result.fit_report())
>
>  plt.xlim(left=2009)
>
>  plt.xlim(right=2022)
>
>  plt.show()
>
>  plt.close()
>
> ```
>
> So far so good. I get a best fit plot as shown using Lorentzian model
>
> Also I have model fit data
>
> [[Model]]
>
> Model(lorentzian)
>
> [[Fit Statistics]]
>
> # fitting method   = leastsq
>
> # function evals   = 25
>
> # data points  = 11
>
> # variables= 3
>
> chi-square = 8.4155e+09
>
> reduced chi-square = 1.0519e+09
>
> Akaike info crit   = 231.009958
>
> Bayesian info crit = 232.203644
>
> [[Variables]]
>
> amplitude:  31107480.0 +/- 1471033.33 (4.73%) (init = 6106104)
>
> center: 2016.75722 +/- 0.18632315 (0.01%) (init = 2016.5)
>
> sigma:  8.37428353 +/- 0.45979189 (5.49%) (init = 3.5)
>
> fwhm:   16.7485671 +/- 0.91958379 (5.49%) == '2.000*sigma'
>
> height: 1182407.88 +/- 15681.8211 (1.33%) ==
> '0.3183099*amplitude/max(2.220446049250313e-16, sigma)'
>
> [[Correlations]] (unreported correlations are < 0.100)
>
> C(amplitude, sigma)  =  0.977
>
> C(amplitude, center) =  0.644
>
> C(center, sigma) =  0.603
>
>
> Now I need to predict the prices for years 2021-2022 based on this fit. Is
> there any way I can use some plt functions to provide extrapolated values
> for 2021 and beyond?
>
>
> Thanks
>
>
>
>
>
> On Tue, 5 Jan 2021 at 14:43, Sean Owen  wrote:
>
>> If your data set is 11 points, surely this is not a distributed problem?
>> or are you asking how to build tens of thousands of those projections in
>> parallel?
>>
>> On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> I am not sure Spark forum is the correct avenue for this question.
>>>
>>> I am using PySpark with matplotlib to get the best fit for data using
>>> the Lorentzian Model. This curve uses 2010-2020 data points (11 on x-axis).
>>> I need to predict the prices for years 2021-2025 based on this
>>> fit. So not sure if someone can advise me? If Ok, then I can post the
>>> details
>>>
>>> Thanks
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>


Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
Thanks, Sean.

This is the gist of the case



I have data points for the x-axis from 2010 to 2020 and values for the y-axis.
I am using PySpark, pandas and matplotlib. Data is read into PySpark from the
underlying database and a pandas DataFrame is built on it. Data is
aggregated over each year. However, the underlying prices are provided on a
monthly basis in a CSV file which has been loaded into a Hive table:

import matplotlib.pyplot as plt
from pyspark.sql.functions import col
from lmfit.models import LorentzianModel

# spark, v (settings module), regionname, start_date and end_date are defined
# elsewhere in the script; model is the lmfit Lorentzian model, as shown in
# the fit report below.
model = LorentzianModel()

summary_df = spark.sql(f"""SELECT cast(Year as int) as year,
                           AVGFlatPricePerYear, AVGTerracedPricePerYear,
                           AVGSemiDetachedPricePerYear, AVGDetachedPricePerYear
                           FROM {v.DSDB}.yearlyhouseprices""")

df_10 = summary_df.filter(col("year").between(f'{start_date}', f'{end_date}'))

p_dfm = df_10.toPandas()  # convert the Spark DataFrame to a pandas DataFrame

n = len(p_dfm.columns)    # loop over every column: year plus the price columns

for i in range(n):
    if p_dfm.columns[i] != 'year':   # year is the x axis (integer)
        vcolumn = p_dfm.columns[i]
        print(vcolumn)
        params = model.guess(p_dfm[vcolumn], x=p_dfm['year'])
        result = model.fit(p_dfm[vcolumn], params, x=p_dfm['year'])
        result.plot_fit()
        if vcolumn == "AVGFlatPricePerYear":
            plt.xlabel("Year", fontdict=v.font)
            plt.ylabel("Flat house prices in millions/GBP", fontdict=v.font)
            plt.title(f"""Flat price fluctuations in {regionname} for the past 10 years""",
                      fontdict=v.font)
            plt.text(0.35,
                     0.45,
                     "Best-fit based on Non-Linear Lorentzian Model",
                     transform=plt.gca().transAxes,
                     color="grey",
                     fontsize=10)
            print(result.fit_report())
            plt.xlim(left=2009)
            plt.xlim(right=2022)
            plt.show()
            plt.close()

So far so good. I get a best-fit plot as shown, using the Lorentzian model.

I also have the model fit data:

[[Model]]

Model(lorentzian)

[[Fit Statistics]]

# fitting method   = leastsq

# function evals   = 25

# data points  = 11

# variables= 3

chi-square = 8.4155e+09

reduced chi-square = 1.0519e+09

Akaike info crit   = 231.009958

Bayesian info crit = 232.203644

[[Variables]]

amplitude:  31107480.0 +/- 1471033.33 (4.73%) (init = 6106104)

center: 2016.75722 +/- 0.18632315 (0.01%) (init = 2016.5)

sigma:  8.37428353 +/- 0.45979189 (5.49%) (init = 3.5)

fwhm:   16.7485671 +/- 0.91958379 (5.49%) == '2.000*sigma'

height: 1182407.88 +/- 15681.8211 (1.33%) ==
'0.3183099*amplitude/max(2.220446049250313e-16, sigma)'

[[Correlations]] (unreported correlations are < 0.100)

C(amplitude, sigma)  =  0.977

C(amplitude, center) =  0.644

C(center, sigma) =  0.603


Now I need to predict the prices for years 2021-2022 based on this fit. Is
there any way I can use some plt functions to provide extrapolated values
for 2021 and beyond?


Thanks





On Tue, 5 Jan 2021 at 14:43, Sean Owen  wrote:

> If your data set is 11 points, surely this is not a distributed problem?
> or are you asking how to build tens of thousands of those projections in
> parallel?
>
> On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> I am not sure Spark forum is the correct avenue for this question.
>>
>> I am using PySpark with matplotlib to get the best fit for data using
>> the Lorentzian Model. This curve uses 2010-2020 data points (11 on x-axis).
>> I need to predict the prices for years 2021-2025 based on this
>> fit. So not sure if someone can advise me? If Ok, then I can post the
>> details
>>
>> Thanks
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
If your data set is 11 points, surely this is not a distributed problem? Or
are you asking how to build tens of thousands of those projections in
parallel?

On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh 
wrote:

> Hi,
>
> I am not sure Spark forum is the correct avenue for this question.
>
> I am using PySpark with matplotlib to get the best fit for data using the
> Lorentzian Model. This curve uses 2010-2020 data points (11 on x-axis). I
> need to predict the prices for years 2021-2025 based on this fit.
> So not sure if someone can advise me? If Ok, then I can post the details
>
> Thanks
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
Hi,

I am not sure Spark forum is the correct avenue for this question.

I am using PySpark with matplotlib to get the best fit for data using the
Lorentzian Model. This curve uses 2010-2020 data points (11 on the x-axis). I
need to predict the prices for years 2021-2025 based on this fit. I am not
sure whether someone can advise me; if so, I can post the details.

Thanks



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark DF does not rename the column

2021-01-05 Thread Mich Talebzadeh
Yes, many thanks, German. Jayesh kindly reminded me about it.

It is amazing how at times one overlooks these typos and assumes a more
sophisticated cause for the code not working.

Mich



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 4 Jan 2021 at 19:49, German Schiavon 
wrote:

> Hi,
>
> I think you have a typo :
>
> root
>
>  |-- ceated: string (nullable = true)
>
>
> and then:
>
> withColumnRenamed("created","Date Calculated").
>
>
> On Mon, 4 Jan 2021 at 19:12, Lalwani, Jayesh 
> wrote:
>
>> You don’t have a column named “created”. The column name is “ceated”,
>> without the “r”
>>
>>
>>
>> *From: *Mich Talebzadeh 
>> *Date: *Monday, January 4, 2021 at 1:06 PM
>> *To: *"user @spark" 
>> *Subject: *[EXTERNAL] Spark DF does not rename the column
>>
>>
>>
>> *CAUTION*: This email originated from outside of the organization. Do
>> not click links or open attachments unless you can confirm the sender and
>> know the content is safe.
>>
>>
>>
>> Hi,
>>
>>
>>
>>  version 2.4.3
>>
>>
>>
>> I don't know the cause of this.
>>
>>
>>
>> This renaming of DF columns used to work fine. I did a couple of changes to
>> Spark/Scala code not relevant to this table and it refuses to rename the
>> columns for a table!
>>
>>
>>
>> val summaryACC = HiveContext.table("summaryACC")
>>
>>
>>
>> summaryACC.printSchema()
>>
>>
>>
>> root
>>
>>  |-- ceated: string (nullable = true)
>>
>>  |-- hashtag: string (nullable = true)
>>
>>  |-- paid: float (nullable = true)
>>
>>  |-- received: float (nullable = true)
>>
>>
>>
>> summaryACC.
>>
>> orderBy(desc("paid"),desc("received")).
>>
>> withColumnRenamed("created","Date Calculated").
>>
>> withColumnRenamed("hashtag","Who").
>>
>> withColumn(("received"),format_number(col("received"),2)).
>>
>> withColumn(("paid"),format_number(col("paid"),2)).
>>
>> withColumnRenamed("paid","paid out/GBP").
>>
>> withColumnRenamed("received","paid in/GBP").
>>
>> withColumn("paid in/GBP",when(col("paid in/GBP") ===
>> "0.00","--").otherwise(col("paid in/GBP"))).
>>
>> withColumn("paid out/GBP",when(col("paid out/GBP") ===
>> "0.00","--").otherwise(col("paid out/GBP"))).
>>
>> select("Date Calculated","Who","paid in/GBP","paid
>> out/GBP").show(1000,false)
>>
>>
>>
>> and this is the error
>>
>>
>>
>> org.apache.spark.sql.AnalysisException: cannot resolve '`Date
>> Calculated`' given input columns: [alayer.summaryacc.ceated, Who, paid
>> out/GBP, paid in/GBP];;
>>
>>
>>
>> This used to work before!
>>
>>
>>
>> +----------------------------+------+-----------+------------+
>> |Date Calculated             |Who   |paid in/GBP|paid out/GBP|
>> +----------------------------+------+-----------+------------+
>> |Mon Jan 04 14:22:17 GMT 2021|paypal|579.98     |1,526.86    |
>> +----------------------------+------+-----------+------------+
>>
>>
>>
>> Appreciate any ideas.
>>
>>
>>
>> Thanks, Mich
>>
>>
>>
>


Re: Suggestion on Spark 2.4.7 vs Spark 3 for Kubernetes

2021-01-05 Thread Prashant Sharma
 A lot of developers may have already moved to 3.0.x. FYI, 3.1.0 is just
around the corner, hopefully in a few days, and has a lot of improvements to
Spark on K8s, including Kubernetes support transitioning from experimental to
GA in this release.

See: https://issues.apache.org/jira/browse/SPARK-33005

Thanks,

On Tue, Jan 5, 2021 at 12:41 AM Sachit Murarka 
wrote:

> Hi Users,
>
> Could you please tell me which Spark version you have used in production for
> Kubernetes?
> Which version is recommended for production, given that both Streaming
> and the core APIs have to be used from PySpark?
>
> Thanks !
>
> Kind Regards,
> Sachit Murarka
>