Thanks Sean. This is the gist of the case:
<https://stackoverflow.com/posts/65570917/timeline>

I have x-axis data points from 2010 to 2020 and corresponding y-axis values. I am using PySpark, pandas and matplotlib. The data is read into PySpark from the underlying database and a pandas DataFrame is built on it. The data is aggregated per year, although the underlying prices are provided on a monthly basis in a CSV file that has been loaded into a Hive table.

```python
from pyspark.sql.functions import col
import matplotlib.pyplot as plt
from lmfit.models import LorentzianModel

# v (config/constants), start_date, end_date and regionname are defined elsewhere in the application
model = LorentzianModel()   # the non-linear model being fitted (see fit report below)

summary_df = spark.sql(f"""SELECT cast(Year as int) as year,
                                  AVGFlatPricePerYear, AVGTerracedPricePerYear,
                                  AVGSemiDetachedPricePerYear, AVGDetachedPricePerYear
                           FROM {v.DSDB}.yearlyhouseprices""")
df_10 = summary_df.filter(col("year").between(f'{start_date}', f'{end_date}'))
p_dfm = df_10.toPandas()   # converting Spark DF to pandas DF

n = len(p_dfm.columns)     # year column plus the four price columns
for i in range(n):
    if p_dfm.columns[i] != 'year':   # year is the x-axis (integer)
        vcolumn = p_dfm.columns[i]
        print(vcolumn)
        params = model.guess(p_dfm[vcolumn], x=p_dfm['year'])
        result = model.fit(p_dfm[vcolumn], params, x=p_dfm['year'])
        result.plot_fit()
        if vcolumn == "AVGFlatPricePerYear":
            plt.xlabel("Year", fontdict=v.font)
            plt.ylabel("Flat house prices in millions/GBP", fontdict=v.font)
            plt.title(f"""Flat price fluctuations in {regionname} for the past 10 years """, fontdict=v.font)
            plt.text(0.35, 0.45, "Best-fit based on Non-Linear Lorentzian Model",
                     transform=plt.gca().transAxes, color="grey", fontsize=10)
            print(result.fit_report())
            plt.xlim(left=2009)
            plt.xlim(right=2022)
            plt.show()
            plt.close()
```

So far so good. I get a best-fit plot using the Lorentzian model, as shown. I also have the model fit data:

```
[[Model]]
    Model(lorentzian)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 25
    # data points      = 11
    # variables        = 3
    chi-square         = 8.4155e+09
    reduced chi-square = 1.0519e+09
    Akaike info crit   = 231.009958
    Bayesian info crit = 232.203644
[[Variables]]
    amplitude:  31107480.0 +/- 1471033.33 (4.73%) (init = 6106104)
    center:     2016.75722 +/- 0.18632315 (0.01%) (init = 2016.5)
    sigma:      8.37428353 +/- 0.45979189 (5.49%) (init = 3.5)
    fwhm:       16.7485671 +/- 0.91958379 (5.49%) == '2.0000000*sigma'
    height:     1182407.88 +/- 15681.8211 (1.33%) == '0.3183099*amplitude/max(2.220446049250313e-16, sigma)'
[[Correlations]] (unreported correlations are < 0.100)
    C(amplitude, sigma)  = 0.977
    C(amplitude, center) = 0.644
    C(center, sigma)     = 0.603
```

Now I need to predict the prices for 2021-2022 based on this fit. Is there any way I can use some plt functions to provide extrapolated values for 2021 and beyond?
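In case it helps make the question concrete, here is a minimal sketch of the kind of extrapolation I am after, assuming lmfit's ModelResult.eval can be used to evaluate the fitted model at new x values. It reuses `result` and `plt` from the code above; `years_new` is just an illustrative name.

```python
import numpy as np

# years to extrapolate to (illustrative values only)
years_new = np.array([2021, 2022])

# evaluate the fitted Lorentzian at the new years; the prediction comes from
# the fitted model (result), not from matplotlib, which only draws the points
predicted = result.eval(x=years_new)
print(dict(zip(years_new.tolist(), predicted.tolist())))

# overlay the extrapolated points on the existing plot
plt.plot(years_new, predicted, 'ro', label="extrapolated 2021-2022")
plt.legend()
plt.show()
```

In other words, the extrapolated values would presumably have to come from the fitted model itself, with matplotlib only used to overlay them on the plot.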
Thanks

On Tue, 5 Jan 2021 at 14:43, Sean Owen <sro...@gmail.com> wrote:

> If your data set is 11 points, surely this is not a distributed problem?
> Or are you asking how to build tens of thousands of those projections in
> parallel?
>
> On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> I am not sure the Spark forum is the correct avenue for this question.
>>
>> I am using PySpark with matplotlib to get the best fit for data using
>> the Lorentzian model. This curve uses 2010-2020 data points (11 on the
>> x-axis). I need to predict the prices for the years 2021-2025 based on
>> this fit, so I am not sure if someone can advise me. If OK, I can then
>> post the details.
>>
>> Thanks
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed.
>> The author will in no case be liable for any monetary damages arising
>> from such loss, damage or destruction.