[ https://issues.apache.org/jira/browse/SPARK-37926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477517#comment-17477517 ]
Hyukjin Kwon commented on SPARK-37926:
--------------------------------------

[~mithril], the results might be slightly different between Spark and pandas. What exactly are the differences between them? Also, Spark 2.x is EOL; this should be checked with Spark 3+.

toPandas precision doesn't match pandas, and can cause an error in some cases
------------------------------------------------------------------------------

                Key: SPARK-37926
                URL: https://issues.apache.org/jira/browse/SPARK-37926
            Project: Spark
         Issue Type: Bug
         Components: Input/Output, ML
   Affects Versions: 2.4.7
           Reporter: kasim
           Priority: Major

# The background
I have two copies of the same dataset, one on the local filesystem and one on HDFS.
I transformed both with the same logic:
- df: read from HDFS, transformed with Spark SQL, then converted from a Spark DataFrame to a pandas DataFrame
- df1: read from the filesystem and transformed with pandas
Feeding each into the BetaGeoFitter model (https://lifetimes.readthedocs.io/en/latest/), df1 fits fine, but df raises ConvergenceError.
# First: the summary is the same for df and df1
```
In [17]: df.describe()
Out[17]:
          frequency       recency             T  monetary_value
count  68878.000000  68878.000000  68878.000000    68878.000000
mean       0.210198      1.364253     69.407097       66.740974
std        1.094161      7.460129     44.604855      351.516145
min        0.000000      0.000000      0.000000        0.000000
25%        0.000000      0.000000     31.000000        0.000000
50%        0.000000      0.000000     64.000000        0.000000
75%        0.000000      0.000000    108.000000        0.000000
max       59.000000    155.000000    157.000000    18975.360000

In [18]: df1.describe()
Out[18]:
          frequency       recency             T  monetary_value
count  68878.000000  68878.000000  68878.000000    68878.000000
mean       0.210198      1.364253     69.407097       66.740974
std        1.094161      7.460129     44.604856      351.516145
min        0.000000      0.000000      0.000000        0.000000
25%        0.000000      0.000000     31.000000        0.000000
50%        0.000000      0.000000     64.000000        0.000000
75%        0.000000      0.000000    108.000000        0.000000
max       59.000000    155.000000    157.000000    18975.360000

In [19]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
    ...: bgf.fit(df1['frequency'], df1['recency'], df1['T'])
Out[19]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>

In [20]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
    ...: bgf.fit(df['frequency'], df['recency'], df['T'])
      fun: -0.03513675395757231
 hess_inv: array([[ 13.30839758,  17.8546921 ,  -0.17820442,   0.31872313],
       [ 17.8546921 ,  73.49152334,  -1.06609042,   0.96429223],
       [ -0.17820442,  -1.06609042,  65.85101032,  67.62388159],
       [  0.31872313,   0.96429223,  67.62388159, 109.01577057]])
      jac: array([ 1.17874160e-06, -6.62967570e-07,  1.06154732e-06,  1.56458773e-06])
  message: 'Desired error not necessarily achieved due to precision loss.'
     nfev: 130
      nit: 29
     njev: 117
   status: 2
  success: False
        x: array([-3.59592079, -5.36183489,  0.07652525, -0.4253566 ])
---------------------------------------------------------------------------
ConvergenceError                          Traceback (most recent call last)
/data/modou/python/clv.py in <module>
      1 bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
----> 2 bgf.fit(df['frequency'], df['recency'], df['T'])
```
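Since the summaries match but only df fails to fit, one way to answer "what exactly is different" is to compare the two frames element by element at full precision instead of through describe(). A minimal diagnostic sketch of my own (the helper name is hypothetical; it assumes df and df1 share the same row order and that the column can be viewed as float64):
```python
import numpy as np
import pandas as pd

def ulp_report(a: pd.Series, b: pd.Series) -> pd.DataFrame:
    """Per-element gap between two float columns, down to the last bit.

    Assumes both Series are aligned on the same rows; ulp_distance is only
    meaningful for values of the same sign (all values here are >= 0).
    """
    a64 = np.ascontiguousarray(a.to_numpy(dtype=np.float64))
    b64 = np.ascontiguousarray(b.to_numpy(dtype=np.float64))
    # Number of representable doubles between the two values (0 == bit-identical).
    ulps = np.abs(a64.view(np.int64) - b64.view(np.int64))
    return pd.DataFrame(
        {"spark": a64, "pandas": b64, "abs_diff": np.abs(a64 - b64), "ulp_distance": ulps},
        index=a.index,
    )

# rep = ulp_report(df["monetary_value"], df1["monetary_value"])
# rep[rep["ulp_distance"] > 0]   # rows where the two copies disagree, and by how much
```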
# Second, I found the floats differ slightly between df and df1
They show up as different after rounding:
```python
idx = ~np.isclose(df.round(1)['monetary_value'], df1.round(1)['monetary_value'])

In [71]: np.isclose(df[idx]['monetary_value'], df1[idx]['monetary_value'])
Out[71]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [72]: np.isclose(df[idx].round(1)['monetary_value'], df1[idx].round(1)['monetary_value'])
Out[72]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False])
```
The differing contents:
```
In [67]: df[idx].round(1)['monetary_value']
Out[67]:
11498     426.4
17791    1464.1
18037    1309.1
19800     426.4
22464     134.3
24717      29.7
26202     881.6
26729     426.4
29519    1464.1
35798    1464.1
36034     388.7
39156    1464.1
39566     194.1
39687     426.4
39737     388.7
44185    1464.1
45628    1574.9
48241    4325.3
49841    1464.1
54789     129.5
57159    3289.6
66517     426.4
67991     388.7
Name: monetary_value, dtype: float64

In [68]: df1[idx].round(1)['monetary_value']
Out[68]:
11498     426.5
17791    1464.2
18037    1309.2
19800     426.5
22464     134.2
24717      29.8
26202     881.7
26729     426.5
29519    1464.2
35798    1464.2
36034     388.6
39156    1464.2
39566     194.2
39687     426.5
39737     388.6
44185    1464.2
45628    1574.8
48241    4325.2
49841    1464.2
54789     129.6
57159    3289.7
66517     426.5
67991     388.6
Name: monetary_value, dtype: float64
```
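This pattern (426.4 on one side, 426.5 on the other) is what you get when two doubles sit a few ULPs apart on opposite sides of an x.x5 boundary: np.isclose still treats them as equal, but rounding to one decimal splits them. A minimal illustration with made-up doubles around 426.45 (not the actual stored values from either frame):
```python
import numpy as np

# Two doubles straddling the 426.45 rounding boundary by one ULP each
# (illustrative values only, not taken from df or df1).
below = np.nextafter(426.45, -np.inf)   # the representable double just below 426.45
above = np.nextafter(426.45, np.inf)    # the representable double just above 426.45

print(np.isclose(below, above))          # True  -- they differ by only ~1e-13
print(round(below, 1), round(above, 1))  # 426.4 426.5 -- rounding to 1 decimal splits them
```
So a difference in only the last bit or two of the stored doubles is enough to produce the 0.1-sized discrepancies listed above, even though the raw values compare equal under np.isclose.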
# Third, set the idx rows' monetary_value to zero in both df and df1 and test again
Fitting df1 still converges:
```
In [88]: df2 = df1.copy()
    ...: df2.loc[idx, "monetary_value"] = 0

In [89]: df2[idx]
Out[89]:
       frequency  recency      T  monetary_value
11498        6.0     16.0  124.0             0.0
17791        1.0      1.0  109.0             0.0
18037        1.0      1.0  109.0             0.0
19800        2.0      3.0  104.0             0.0
22464        6.0     36.0   69.0             0.0
24717       11.0     11.0   93.0             0.0
26202        1.0     12.0   88.0             0.0
26729        2.0     14.0   34.0             0.0
29519        1.0      5.0   79.0             0.0
35798        1.0      1.0   63.0             0.0
36034        1.0      1.0   63.0             0.0
39156        1.0      1.0   54.0             0.0
39566        1.0      2.0   53.0             0.0
39687        2.0      3.0   53.0             0.0
39737        1.0      1.0   53.0             0.0
44185        1.0      6.0   45.0             0.0
45628        1.0      1.0   43.0             0.0
48241        3.0     17.0   39.0             0.0
49841        1.0      2.0   36.0             0.0
54789        3.0      3.0   27.0             0.0
57159        9.0      9.0   22.0             0.0
66517        2.0      2.0    4.0             0.0
67991        1.0      1.0    1.0             0.0

In [90]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
    ...: bgf.fit(df2['frequency'], df2['recency'], df2['T'])
Out[90]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
```
Fitting df still throws ConvergenceError:
```
In [92]: df2 = df.copy()
    ...: df2.loc[idx, "monetary_value"] = 0

In [93]: df2[idx]
Out[93]:
                   user_id  frequency  recency      T  monetary_value
11498  1515915625531317256        6.0     16.0  124.0             0.0
17791  1515915625538189543        1.0      1.0  109.0             0.0
18037  1515915625538353966        1.0      1.0  109.0             0.0
19800  1515915625539864468        2.0      3.0  104.0             0.0
22464  1515915625542102075        6.0     36.0   69.0             0.0
24717  1515915625545486890       11.0     11.0   93.0             0.0
26202  1515915625547164014        1.0     12.0   88.0             0.0
26729  1515915625547973880        2.0     14.0   34.0             0.0
29519  1515915625561317292        1.0      5.0   79.0             0.0
35798  1515915625569444951        1.0      1.0   63.0             0.0
36034  1515915625569751989        1.0      1.0   63.0             0.0
39156  1515915625573167676        1.0      1.0   54.0             0.0
39566  1515915625573482744        1.0      2.0   53.0             0.0
39687  1515915625573575950        2.0      3.0   53.0             0.0
39737  1515915625573629519        1.0      1.0   53.0             0.0
44185  1515915625592904652        1.0      6.0   45.0             0.0
45628  1515915625593770495        1.0      1.0   43.0             0.0
48241  1515915625595271558        3.0     17.0   39.0             0.0
49841  1515915625596215381        1.0      2.0   36.0             0.0
54789  1515915625599473044        3.0      3.0   27.0             0.0
57159  1515915625601113987        9.0      9.0   22.0             0.0
66517  1515915625609072139        2.0      2.0    4.0             0.0
67991  1515915625610224305        1.0      1.0    1.0             0.0

In [94]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
    ...: bgf.fit(df2['frequency'], df2['recency'], df2['T'])
      fun: -0.03513675395757231
 hess_inv: array([[ 13.30839758,  17.8546921 ,  -0.17820442,   0.31872313],
       [ 17.8546921 ,  73.49152334,  -1.06609042,   0.96429223],
       [ -0.17820442,  -1.06609042,  65.85101032,  67.62388159],
       [  0.31872313,   0.96429223,  67.62388159, 109.01577057]])
      jac: array([ 1.17874160e-06, -6.62967570e-07,  1.06154732e-06,  1.56458773e-06])
  message: 'Desired error not necessarily achieved due to precision loss.'
     nfev: 130
      nit: 29
     njev: 117
   status: 2
  success: False
        x: array([-3.59592079, -5.36183489,  0.07652525, -0.4253566 ])
---------------------------------------------------------------------------
ConvergenceError                          Traceback (most recent call last)
/data/modou/python/clv.py in <module>
      1 bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
----> 2 bgf.fit(df2['frequency'], df2['recency'], df2['T'])

/data/modou/conda/envs/py36/lib/python3.6/site-packages/lifetimes/fitters/beta_geo_fitter.py in fit(self, frequency, recency, T, weights, initial_params, verbose, tol, index, **kwargs)
    141             verbose,
    142             tol,
--> 143             **kwargs
    144         )
    145

/data/modou/conda/envs/py36/lib/python3.6/site-packages/lifetimes/fitters/__init__.py in _fit(self, minimizing_function_args, initial_params, params_size, disp, tol, bounds, **kwargs)
    117                 """
    118                 The model did not converge. Try adding a larger penalizer to see if that helps convergence.
--> 119                 """
    120             )
    121

ConvergenceError:
The model did not converge. Try adding a larger penalizer to see if that helps convergence.
```
## As a result, df still gets the error
There must be something strange about df (the frame transformed in Spark). How can it still fail even after the idx rows' monetary_value is set to zero? I just want to figure this out.
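Note that bgf.fit() is called only with frequency, recency and T, so zeroing monetary_value would not be expected to change its outcome either way. A hedged next step (my own sketch, assuming df and df1 are aligned on the same rows) is to check whether those three columns, and their dtypes, come back identical from toPandas():
```python
import numpy as np

# Compare only the columns that bgf.fit() actually consumes; a dtype difference
# or a single differing bit pattern would already explain the diverging fits.
for col in ["frequency", "recency", "T"]:
    a = np.ascontiguousarray(df[col].to_numpy(dtype=np.float64))
    b = np.ascontiguousarray(df1[col].to_numpy(dtype=np.float64))
    n_bit_diff = int(np.count_nonzero(a.view(np.int64) != b.view(np.int64)))
    print(f"{col}: dtypes {df[col].dtype} vs {df1[col].dtype}, "
          f"rows differing at bit level: {n_bit_diff}")
```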
## Update
Write df out to CSV and read it back, and the fit goes through!? Very strange.
```
In [108]: df["monetary_value"].sum()
Out[108]: 4596984.839164658

In [109]: df1["monetary_value"].sum()
Out[109]: 4596984.8391646575

In [111]: df.to_csv('e.csv', index=False, header=True)

In [112]: x = pd.read_csv('e.csv')

In [113]: x["monetary_value"].sum()
Out[113]: 4596984.8391646575

In [114]: bgf = BetaGeoFitter(penalizer_coef=penalizer_coef)
     ...: bgf.fit(x['frequency'], x['recency'], x['T'])
     ...:
Out[114]: <lifetimes.BetaGeoFitter: fitted with 68878 subjects, a: 1.08, alpha: 0.74, b: 0.65, r: 0.03>
```
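The CSV round trip apparently nudges df toward the pandas-side values (the sum moves from 4596984.839164658 to 4596984.8391646575, matching df1), which suggests the text round trip is not handing back byte-identical data. A small check of my own to confirm what actually changed, reusing the e.csv and x from above:
```python
import numpy as np
import pandas as pd

x = pd.read_csv('e.csv')   # the round-tripped copy of df from above

# Did the text round trip change the column dtypes?
print(df.dtypes)           # dtypes as produced by toPandas()
print(x.dtypes)            # dtypes after to_csv / read_csv

# Did it change any value at the bit level?
a = np.ascontiguousarray(df["monetary_value"].to_numpy(dtype=np.float64))
b = np.ascontiguousarray(x["monetary_value"].to_numpy(dtype=np.float64))
changed = int(np.count_nonzero(a.view(np.int64) != b.view(np.int64)))
print("values changed by the CSV round trip:", changed)
```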