Re: [DISCUSS] Support decimals with negative scale in decimal operation
Jörn, could you explain your proposal in a bit more detail, please? We are not modifying the existing decimal datatype: this is how it works now. If you check the PR, the only difference is how we compute the result for the division operation. The discussion about precision and scale is about whether we should limit them more than we do now. Currently we support any scale <= precision and any precision in the range (1, 38].

On Wed, Jan 9, 2019 at 09:13, Jörn Franke wrote:
> Maybe it is better to introduce a new datatype that supports negative scale [...]
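The constraints described above (any scale not larger than the precision, precision capped at 38) can be sketched as a small validity check. This is plain illustrative Python, not Spark code; the bounds are taken from the message above:

```python
def is_valid_decimal_type(precision: int, scale: int, max_precision: int = 38) -> bool:
    """Check the constraints described above: precision up to 38, and any
    scale (including negative ones) not larger than the precision."""
    return 1 <= precision <= max_precision and scale <= precision

assert is_valid_decimal_type(38, 18)      # ordinary decimal
assert is_valid_decimal_type(1, -20)      # negative scale is currently allowed
assert not is_valid_decimal_type(39, 0)   # precision beyond 38 is rejected
assert not is_valid_decimal_type(5, 6)    # scale larger than precision is rejected
```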
Re: [DISCUSS] Support decimals with negative scale in decimal operation
Maybe it is better to introduce a new datatype that supports negative scale; otherwise the migration and testing effort for organizations running Spark applications becomes too large. Of course, the current decimal type would be kept as it is.

On Jan 7, 2019, at 15:08, Marco Gaido wrote:
> In general we can say that some datasources allow them, others fail. [...]
Re: [DISCUSS] Support decimals with negative scale in decimal operation
Oracle does the same: "The *scale* must be less than or equal to the precision." (see https://docs.oracle.com/javadb/10.6.2.1/ref/rrefsqlj15260.html).

On Wed, Jan 9, 2019 at 05:31, Wenchen Fan wrote:
> Some more thoughts. If we support unlimited negative scale, why can't we support unlimited positive scale? [...]
Re: [DISCUSS] Support decimals with negative scale in decimal operation
Some more thoughts: if we support unlimited negative scale, why can't we support unlimited positive scale? E.g. 0.0001 could be decimal(1, 4) instead of decimal(4, 4). I think we need more references here: how do other databases deal with the decimal type and parse decimal literals?

On Mon, Jan 7, 2019 at 10:36 PM, Wenchen Fan wrote:
> I'm OK with it, i.e. fail the write if there are negative-scale decimals [...]
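The 0.0001 example can be made concrete with Python's decimal module. This is a sketch of the two conventions being contrasted, not Spark's actual literal parser:

```python
from decimal import Decimal

def minimal_type(literal: str) -> tuple[int, int]:
    """Smallest (precision, scale): precision counts only significant
    digits, so 0.0001 -> (1, 4)."""
    t = Decimal(literal).as_tuple()
    return (len(t.digits), -t.exponent)

def scale_bounded_type(literal: str) -> tuple[int, int]:
    """Convention where scale <= precision must hold: precision is padded
    up to the scale, so 0.0001 -> (4, 4)."""
    precision, scale = minimal_type(literal)
    return (max(precision, scale), scale)

assert minimal_type("0.0001") == (1, 4)
assert scale_bounded_type("0.0001") == (4, 4)
```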
Re: [DISCUSS] Support decimals with negative scale in decimal operation
I'm OK with it, i.e. fail the write if there are negative-scale decimals (we need to document it, though). We can improve it later in data source v2.

On Mon, Jan 7, 2019 at 10:09 PM, Marco Gaido wrote:
> In general we can say that some datasources allow them, others fail. [...]
Re: [DISCUSS] Support decimals with negative scale in decimal operation
In general, we can say that some datasources allow them while others fail. At the moment, we do no casting before writing (so we can state that in the doc). But since there is an ongoing discussion for DSv2, we could maybe add a flag/interface there for "negative scale intolerant" data sources and try to cast before writing to them. What do you think about this?

On Mon, Jan 7, 2019 at 15:03, Wenchen Fan wrote:
> AFAIK parquet spec says decimal scale can't be negative. [...]
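One possible shape of the cast-before-write idea is below. This is a hypothetical sketch in plain Python: the function name and the widening formula are assumptions for illustration, not the actual DSv2 API:

```python
def widen_negative_scale(precision: int, scale: int, max_precision: int = 38) -> tuple[int, int]:
    """Rewrite a negative-scale type into an equivalent non-negative-scale
    type for a 'negative scale intolerant' data source. A decimal(p, s)
    with s < 0 holds integers up to p - s digits long, so one safe target
    is decimal(p - s, 0)."""
    if scale >= 0:
        return (precision, scale)
    target_precision = precision - scale
    if target_precision > max_precision:
        raise ValueError(
            f"decimal({precision}, {scale}) does not fit in a "
            f"non-negative-scale decimal within {max_precision} digits")
    return (target_precision, 0)

assert widen_negative_scale(3, -2) == (5, 0)    # e.g. 999 * 10^2 = 99900: 5 digits
assert widen_negative_scale(10, 2) == (10, 2)   # non-negative scale is untouched
```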
Re: [DISCUSS] Support decimals with negative scale in decimal operation
AFAIK the parquet spec says decimal scale can't be negative. If we want to officially support negative-scale decimals, we should clearly define the behavior when writing negative-scale decimals to parquet and other data sources. The most straightforward way is to fail in this case, but maybe we can do something better, like casting decimal(1, -20) to decimal(20, 0) before writing.

On Mon, Jan 7, 2019 at 9:32 PM, Marco Gaido wrote:
> Hi Wenchen, thanks for your email. [...]
Re: [DISCUSS] Support decimals with negative scale in decimal operation
Hi Wenchen,

thanks for your email. I agree on adding documentation for the decimal type, but I am not sure what you mean about the behavior when writing: we are not performing any automatic casting before writing; if we want to do that, I think we need a design for it.

I am not sure it makes sense to set a minimum scale. That would break backward compatibility (for a very weird use case), so I wouldn't do that.

Thanks,
Marco

On Mon, Jan 7, 2019 at 05:53, Wenchen Fan wrote:
> I think we need to do this for backward compatibility [...]
Re: [DISCUSS] Support decimals with negative scale in decimal operation
I think we need to do this for backward compatibility, and according to the discussion in the doc, the SQL standard allows negative scale.

To do this, I think the PR should also include documentation for the decimal type, like the definition of precision and scale (this one <https://stackoverflow.com/questions/35435691/bigdecimal-precision-and-scale> looks pretty good), the result type of decimal operations, and the behavior when writing out decimals (e.g. we can cast decimal(1, -20) to decimal(20, 0) before writing).

Another question is: shall we set a min scale? E.g., shall we allow decimal(1, -1000)?

On Thu, Oct 25, 2018 at 9:49 PM, Marco Gaido wrote:
> Hi all, a bit more than one month ago, I sent a proposal for handling properly decimals with negative scales in our operations. [...]
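The definition referenced above (precision = number of significant digits, scale = digits to the right of the decimal point, possibly negative) can be illustrated with Python's decimal module. This is a sketch of the BigDecimal-style convention, not Spark code:

```python
from decimal import Decimal

def precision_and_scale(literal: str) -> tuple[int, int]:
    """BigDecimal-style: scale is the negated exponent and precision is the
    number of stored digits. A negative scale means trailing zeros are implied."""
    t = Decimal(literal).as_tuple()
    return (len(t.digits), -t.exponent)

assert precision_and_scale("123.45") == (5, 2)   # five digits, two after the point
assert precision_and_scale("1E+20") == (1, -20)  # one stored digit, scale -20
```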
Re: Decimals with negative scale
That is feasible. The main point is that negative scales were never really meant to be there in the first place, so they are something we forgot to forbid, and something which the DBs we draw our inspiration from for decimals (mainly SQLServer) do not support. Honestly, my opinion on this topic is:
- let's add support for negative scales in the operations (I already have a PR out for that: https://github.com/apache/spark/pull/22450);
- let's reduce our usage of DECIMAL in favor of DOUBLE when parsing literals, as done by Hive, Presto, DB2, ...; this way the number of cases where we deal with negative scales is small (and we do not have issues with datasources which don't support them).

Thanks,
Marco

On Tue, Dec 18, 2018 at 19:08, Reynold Xin wrote:
> So why can't we just do validation to fail sources that don't support negative scale, if it is not supported? [...]
Re: Decimals with negative scale
So why can't we just do validation to fail sources that don't support negative scale, if it is not supported? This way, we don't need to break backward compatibility in any way, and it becomes a strict improvement.

On Tue, Dec 18, 2018 at 8:43 AM, Marco Gaido wrote:
> This is at analysis time. [...]
Re: Decimals with negative scale
This is at analysis time.

On Tue, Dec 18, 2018 at 17:32, Reynold Xin wrote:
> Is this an analysis time thing or a runtime thing? [...]
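The analysis-time/runtime distinction means the check fires while resolving the plan's types, before any data is read. A minimal sketch in plain Python — the names are hypothetical illustrations, not Spark's internals:

```python
class AnalysisException(Exception):
    """Raised while type-checking the plan, before any rows are processed."""

def check_write(schema: list[tuple[str, int, int]],
                source_supports_negative_scale: bool) -> None:
    """Validate a write at analysis time: fail fast if the target source
    cannot store a negative-scale decimal column."""
    for name, precision, scale in schema:
        if scale < 0 and not source_supports_negative_scale:
            raise AnalysisException(
                f"column '{name}' has type decimal({precision}, {scale}); "
                "the target source does not support negative scale")

# No data is involved: the failure comes purely from the schema.
check_write([("amount", 10, 2)], source_supports_negative_scale=False)  # passes
try:
    check_write([("big", 1, -20)], source_supports_negative_scale=False)
    raised = False
except AnalysisException:
    raised = True
assert raised
```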
Decimals with negative scale
Hi all,

as you may remember, there was a design doc to support operations involving decimals with negative scales. After the discussion in the design doc, the related PR is now blocked because for 3.0 we have another option we can explore, i.e. forbidding negative scales. This is probably a cleaner solution, as most likely we never wanted negative scales, but it is a breaking change: so we wanted to check the opinion of the community.

Getting to the topic, here are the 2 options:

*- Forbidding negative scales*
Pros: many sources do not support negative scales (so they can create issues); they were not considered possible in the initial implementation, so we get to a more stable situation.
Cons: some operations which were supported earlier won't work anymore. E.g., since our max precision is 38, if the scale cannot be negative, 1e36 * 1e36 would cause an overflow, while it now works fine (producing a decimal with negative scale); it is basically impossible to create a config which controls the behavior.

*- Handling negative scales in operations*
Pros: no regressions; we support all the operations we supported on 2.x.
Cons: negative scales can cause issues in other moments, e.g. when saving to a data source which doesn't support them.

Looking forward to hearing your thoughts,
Thanks.
Marco
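The overflow argument above (1e36 * 1e36 works today but would overflow with non-negative scales) can be checked directly with java.math.BigDecimal, which backs Spark's Decimal; this is an illustrative sketch, not Spark code:

```java
import java.math.BigDecimal;

public class NegativeScaleDemo {
    public static void main(String[] args) {
        // 1e36 is stored as unscaled value 1 with scale -36: precision 1.
        BigDecimal a = new BigDecimal("1e36");
        System.out.println(a.precision() + "," + a.scale()); // 1,-36

        // With negative scales allowed, 1e36 * 1e36 = 1e72 still fits:
        // it is just decimal(1, -72).
        BigDecimal product = a.multiply(a);
        System.out.println(product.precision() + "," + product.scale()); // 1,-72

        // If the scale were clamped to >= 0, representing 1e72 exactly
        // would need 73 integer digits, beyond the 38-digit max precision.
        int digitsNeeded = product.precision() - product.scale();
        System.out.println(digitsNeeded); // 73
    }
}
```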
[DISCUSS] Support decimals with negative scale in decimal operation
Hi all,

a bit more than one month ago, I sent a proposal for properly handling decimals with negative scales in our operations. This is a long-standing problem in our codebase, as we derived our rules from Hive and SQL Server, where negative scales are forbidden, while in Spark they are not.

The discussion has been stale for a while now, and there have been no more comments on the design doc: https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm. So I am writing this e-mail to check whether there are more comments on it, or whether we can go ahead with the PR.

Thanks,
Marco
Re: SPIP: support decimals with negative scale in decimal operation
DISCUSS thread is good to have...

From: Marco Gaido
Sent: Friday, September 21, 2018 3:31 AM
To: Wenchen Fan
Cc: dev
Subject: Re: SPIP: support decimals with negative scale in decimal operation

Hi Wenchen,

Thank you for the clarification. I agree that this is more a bug fix rather than an improvement. I apologize for the error. Please consider this as a design doc.

Thanks,
Marco
Re: SPIP: support decimals with negative scale in decimal operation
Hi Wenchen,

Thank you for the clarification. I agree that this is more a bug fix rather than an improvement. I apologize for the error. Please consider this as a design doc.

Thanks,
Marco

Il giorno ven 21 set 2018 alle ore 12:04 Wenchen Fan ha scritto:
> Hi Marco,
>
> Thanks for sending it! The problem is clearly explained in this email, but
> I would not treat it as a SPIP. It proposes a fix for a very tricky bug,
> and SPIP is usually for new features. Others please correct me if I was
> wrong.
>
> Thanks,
> Wenchen
Re: SPIP: support decimals with negative scale in decimal operation
Hi Marco,

Thanks for sending it! The problem is clearly explained in this email, but I would not treat it as a SPIP. It proposes a fix for a very tricky bug, and SPIP is usually for new features. Others please correct me if I was wrong.

Thanks,
Wenchen
SPIP: support decimals with negative scale in decimal operation
Hi all,

I am writing this e-mail in order to discuss the issue reported in SPARK-25454; following Wenchen's suggestion, I prepared a design doc for it.

The problem we are facing is that our rules for decimal operations are taken from Hive and MS SQL Server, which explicitly don't support decimals with negative scales. So the rules we currently have are not meant to deal with negative scales. The issue is that Spark, instead, doesn't forbid negative scales and, indeed, there are cases in which we produce them (e.g. a SQL constant like 1e8 is turned into a decimal(1, -8)).

Having negative scales most likely wasn't really intended. Unfortunately, getting rid of them would be a breaking change, as many operations that currently work fine would not be allowed anymore and would overflow (e.g. select 1e36 * 1). As such, this is something I'd definitely agree on doing, but I think we can target it only for 3.0.

What we can start doing now, instead, is updating our rules in order to properly handle the case when decimal scales are negative. From my investigation, it turns out that the only operation which has problems with them is Divide.

Here you can find the design doc with all the details: https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit?usp=sharing. The document is also linked in SPARK-25454. There is also already a PR with the change: https://github.com/apache/spark/pull/22450.

Looking forward to hearing your feedback,
Thanks.
Marco
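For illustration, java.math.BigDecimal (which backs Spark's Decimal) shows how a constant like 1e8 ends up as decimal(1, -8), and how rescaling to scale 0 yields an equivalent decimal(9, 0) — the kind of cast a data source that forbids negative scales would need. A sketch, not Spark's actual cast logic:

```java
import java.math.BigDecimal;

public class LiteralScaleDemo {
    public static void main(String[] args) {
        // A SQL constant like 1e8 parses to unscaled value 1, scale -8.
        BigDecimal d = new BigDecimal("1e8");
        System.out.println(d.precision() + "," + d.scale()); // 1,-8

        // Rescaling to scale 0 gives the same value as a decimal(9, 0),
        // which sources that forbid negative scales could accept.
        BigDecimal rescaled = d.setScale(0);
        System.out.println(rescaled.precision() + "," + rescaled.scale()); // 9,0
        System.out.println(d.compareTo(rescaled)); // 0 (same numeric value)
    }
}
```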
Re: Decimals
Hi Marco,

great work, I personally hope it gets included soon! I just wanted to clarify one thing: Oracle and PostgreSQL do not have infinite precision. The scale and precision of decimals are just user-defined (explicitly or implicitly). So both of them follow the exact same rules you mentioned (like every other database). Specifically, both round on loss of precision and both throw an exception on overflow. Here is an example:

*PostgreSQL*
postgres=# create table test(i decimal(3,2));
CREATE TABLE
postgres=# insert into test select 1.5;
INSERT 0 1
postgres=# select * from test;
  i
------
 1.60
(1 row)
postgres=# insert into test select -654.123;
ERROR: numeric field overflow
DETAIL: A field with precision 3, scale 2 must round to an absolute value less than 10^1.

*Oracle*
SQL> create table test(i number(3,2));
Table created.
SQL> insert into test select 1.5 from dual;
1 row created.
SQL> select * from test;
  I
------
 1.6
SQL> insert into test select -654.123 from dual;
insert into test select -654.123 from dual
*
ERROR at line 1:
ORA-01438: value larger than specified precision allowed for this column

I hope that helps, it strengthens your point!
Ofir Manor
Co-Founder & CTO | Equalum
Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Fri, Dec 22, 2017 at 1:11 PM, Marco Gaido <marcogaid...@gmail.com> wrote:
> Hi Xiao, hi all,
>
> I checked the DB2 documentation for which you provided me a link in the PR
> (thanks for it), and what you are stating is not really right.
> DB2, in compliance with the SQL standard, throws an exception if an
> overflow occurs, i.e. if a loss of significant digits is necessary to
> properly represent the value, which is the case I discussed as point 3 of
> the previous e-mail. Since DB2 has a maximum precision of 31, while the
> other DBs have a higher precision (in particular SQL Server and Hive, like
> Spark, have 38 as maximum precision), the same operation running fine on
> Hive or SQL Server (or Spark after my PR) may throw an exception on DB2
> (but this is due to overflow, i.e. point 3, not to loss of precision).
>
> In the case of loss of precision, in compliance with the SQL standard, DB2
> performs truncation (emitting just a warning). And it can truncate more
> than us, since we ensure keeping at least 6 digits for the scale, while
> DB2 has a minimum of 0 for the scale (refer to
> https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_decimalmultiplication.html).
> I'm citing the relevant part here for convenience:
>
>> The truncated copy has a scale of MAX(0,S-(P-15)), where P and S are the
>> original precision and scale. If, in the process of truncation, one or more
>> nonzero digits are removed, SQLWARN7 in SQLCA is set to W, indicating loss
>> of precision.
>
> Moreover, the rules applied by DB2 to determine precision and scale are
> analogous to the ones used by Hive and SQL Server. The only real practical
> difference is that we enforce keeping at least 6 as the value for the
> scale, while DB2 has 0 as the minimum (which is even worse than our
> approach, according to your previous e-mail about the financial use case).
>
> For the brave ones who went on reading so far, I'll summarize the
> situation of the 3 points above, adding DB2 to the comparison:
>
> 1. *Rules to determine precision and scale*
> - *Hive, SQLServer (and Spark after the PR)*: I won't include the exact
> formulas; anyway, the relevant part is that in case of precision higher
> than the maximum value, we use the maximum available value (38) as
> precision, and the maximum between the needed scale (computed according to
> the relevant formula) and a minimum value guaranteed for the scale, which
> is 6.
> - *DB2*: practically the same rules as above. The main differences are:
> the maximum precision is 31, and it doesn't enforce any minimum value for
> the scale (or the minimum value guaranteed for the scale is 0).
> - *Postgres and Oracle*: NA, they have nearly infinite precision...
> - *SQL ANSI 2011*: no indication.
> - *Spark now*: if the precision needed is more than 38, use 38 as
> precision; use the needed scale without any adjustment.
>
> 2. *Behavior in case of precision loss but result in the range of the
> representable values*
> - *Hive, SQLServer (and Spark after the PR)*: round the result.
> - *DB2*: truncates the result (and emits a warning).
> - *Postgres and Oracle*: NA, they have nearly infinite precision...
> - *SQL ANSI 2011*: either truncate or round the value.
> - *Spark now*: returns NULL.
>
> 3. *Behavior in case of result out of the range of the representable
> values (i.e. overflow)*
> - *DB2, SQLServer*: throw a
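The two truncation rules compared above can be put side by side numerically. The helpers below are my own sketch of the behavior described in the thread (DB2's quoted MAX(0, S-(P-15)) rule, and the Hive/SQLServer-style adjustment with a guaranteed minimum scale of 6), not code from either system:

```java
public class ScaleRules {
    static final int MAX_PRECISION = 38; // Hive/SQLServer/Spark maximum
    static final int MIN_SCALE = 6;      // minimum scale guaranteed after the PR

    // DB2's rule as quoted above: truncated copy has scale MAX(0, S - (P - 15)).
    static int db2TruncatedScale(int p, int s) {
        return Math.max(0, s - (p - 15));
    }

    // Sketch of the Hive/SQLServer-style adjustment described in the thread:
    // when the ideal precision exceeds 38, keep the integer digits but
    // guarantee at least 6 fractional digits.
    static int adjustedScale(int p, int s) {
        if (p <= MAX_PRECISION) return s;
        int intDigits = p - s;
        return Math.max(MAX_PRECISION - intDigits, Math.min(s, MIN_SCALE));
    }

    public static void main(String[] args) {
        // decimal(38,18) * decimal(38,18): the ideal result is decimal(77,36).
        int p = 38 + 38 + 1, s = 18 + 18;
        System.out.println(adjustedScale(p, s));     // 6: the scale floor is kept
        System.out.println(db2TruncatedScale(p, s)); // 0: DB2 may drop all decimals
    }
}
```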
R: Decimals
Thanks for your answer Xiao. The point is that behaving like this is against the SQL standard and is also different from Hive's behavior. So I would propose to add a configuration flag to switch between the two behaviors: either being SQL- and Hive-compliant, or behaving like now (as Hermann was suggesting in the PR). Do we agree on this way? If so, is there any way to read a configuration property in the catalyst project?

Thank you,
Marco

----- Original message -----
From: "Xiao Li" <gatorsm...@gmail.com>
Sent: 21/12/2017 22:46
To: "Marco Gaido" <marcogaid...@gmail.com>
Cc: "Reynold Xin" <r...@databricks.com>; "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: Decimals
Re: Decimals
Losing precision is not acceptable to financial customers. Thus, instead of returning NULL, I saw DB2 issue the following error message:

SQL0802N Arithmetic overflow or other arithmetic exception occurred. SQLSTATE=22003

DB2 on z/OS has been used by most of the biggest banks and financial institutions since the 1980s. Either issuing exceptions (what DB2 does) or returning NULL (what we are doing) looks fine to me. If they want to avoid getting NULLs or exceptions, users should manually add the round functions themselves. Also see the technote for DB2 z/OS: http://www-01.ibm.com/support/docview.wss?uid=swg21161024

2017-12-19 8:41 GMT-08:00 Marco Gaido <marcogaid...@gmail.com>:
> Hello everybody,
>
> I did some further research and now I am sharing my findings.
Re: Decimals
Hello everybody,

I did some further research and now I am sharing my findings. I am sorry, this is going to be a quite long e-mail, but I'd really appreciate some feedback when you have time to read it.

Spark's current implementation of arithmetic operations on decimals was "copied" from Hive. Thus, the initial goal of the implementation was to be compliant with Hive, which itself aims to reproduce SQL Server's behavior. Therefore I compared these 3 DBs, and of course I checked the SQL ANSI standard 2011 (you can find it at http://standards.iso.org/ittf/PubliclyAvailableStandards/c053681_ISO_IEC_9075-1_2011.zip) and a late draft of the 2003 standard (http://www.wiscorp.com/sql_2003_standard.zip). There are 3 main topics:

1. how to determine the precision and scale of a result;
2. how to behave when the result is a number which is not representable exactly with the result's precision and scale (i.e. it requires precision loss);
3. how to behave when the result is out of the range of the values representable with the result's precision and scale (i.e. it is bigger than the biggest representable number or lower than the lowest one).

Currently, Spark behaves as follows:

1. It follows some rules taken from the initial Hive implementation;
2. it returns NULL;
3. it returns NULL.

The SQL ANSI standard is pretty clear about points 2 and 3, while it says barely anything about point 1. I am citing SQL ANSI:2011, page 27:

> If the result cannot be represented exactly in the result type, then
> whether it is rounded or truncated is implementation-defined. An exception
> condition is raised if the result is outside the range of numeric values
> of the result type, or if the arithmetic operation is not defined for the
> operands.

Then, as you can see, Spark is not respecting the SQL standard for either point 2 or point 3. Someone might then argue that we need compatibility with Hive. Then, let's take a look at it. Since Hive 2.2.0 (HIVE-15331), Hive's behavior is:

1. Rules are a bit changed, to reflect the SQL Server implementation as described in this blog (https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/);
2. It rounds the result;
3. It returns NULL (HIVE-18291 is open to be compliant with the SQL ANSI standard and throw an Exception).

As far as the other DBs are concerned, there is little to say about Oracle and Postgres, since they have a nearly infinite precision, thus it is hard to test the behavior in these conditions; but SQL Server has the same precision as Hive and Spark. Thus, this is SQL Server's behavior:

1. Rules should be the same as Hive's, as described in their post (tests about the behavior confirm this);
2. It rounds the result;
3. It throws an Exception.

Therefore, since I think that Spark should be compliant with SQL ANSI (first) and Hive, I propose the following changes:

1. Update the rules to derive the result type in order to reflect Hive's new ones (which are SQL Server's);
2. Change Spark's behavior to round the result, as done by Hive and SQL Server and prescribed by the SQL standard;
3. Change Spark's behavior, introducing a configuration parameter in order to determine whether to return null or throw an Exception (by default I propose to throw an exception, in order to be compliant with the SQL standard, which IMHO is more important than being compliant with Hive).

For 1 and 2, I prepared a PR, which is https://github.com/apache/spark/pull/20023. For 3, I'd love to get your feedback in order to agree on what to do, and then I will eventually do a PR which reflects what is decided here by the community. I would really love to get your feedback either here or on the PR.

Thanks for your patience and your time reading this long email,
Best regards.
Marco

2017-12-13 9:08 GMT+01:00 Reynold Xin <r...@databricks.com>:
> Responses inline
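The behaviors compared in points 2 and 3 can be seen in miniature with java.math.BigDecimal: with an explicit MathContext it rounds (the Hive/SQLServer choice), while without one an inexact division raises an exception (the SQL-standard choice); returning NULL is Spark's own addition with no BigDecimal analogue. An illustrative sketch only:

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class PrecisionLossDemo {
    public static void main(String[] args) {
        BigDecimal one = BigDecimal.ONE;
        BigDecimal three = new BigDecimal("3");

        // Rounding on precision loss (proposal 2): 1/3 kept to 38 digits.
        MathContext mc = new MathContext(38, RoundingMode.HALF_UP);
        BigDecimal rounded = one.divide(three, mc);
        System.out.println(rounded.precision()); // 38

        // Raising an error (SQL standard): without a context, an inexact
        // quotient throws ArithmeticException.
        try {
            one.divide(three);
            System.out.println("no exception");
        } catch (ArithmeticException e) {
            System.out.println("ArithmeticException");
        }
    }
}
```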
Re: Decimals
Responses inline

On Tue, Dec 12, 2017 at 2:54 AM, Marco Gaido <marcogaid...@gmail.com> wrote:
> Hi all,
>
> I saw in these weeks that there are a lot of problems related to decimal
> values (SPARK-22036, SPARK-22755, for instance). Some are related to
> historical choices, which I don't know, thus please excuse me if I am
> saying dumb things:
>
> - why are we interpreting literal constants in queries as Decimal and not
> as Double? I think it is very unlikely that a user can enter a number which
> is beyond Double precision.

Probably just to be consistent with some popular databases.

> - why are we returning null in case of precision loss? Is this approach
> better than just giving a result which might lose some accuracy?

The contract with decimal is that it should never lose precision (it is created for financial reports, accounting, etc). Returning null is at least telling the user the data type can no longer support the precision required.

> Thanks,
> Marco
Decimals
Hi all,

I saw in these weeks that there are a lot of problems related to decimal values (SPARK-22036, SPARK-22755, for instance). Some are related to historical choices, which I don't know, thus please excuse me if I am saying dumb things:

- why are we interpreting literal constants in queries as Decimal and not as Double? I think it is very unlikely that a user can enter a number which is beyond Double precision.
- why are we returning null in case of precision loss? Is this approach better than just giving a result which might lose some accuracy?

Thanks,
Marco
Re: Multiplication on decimals in a dataframe query
Not quite sure what's happening, but it's not an issue with multiplication as such, I guess, as the following query worked for me:

trades.select(trades("price")*9.5).show
+-------------+
|(price * 9.5)|
+-------------+
|        199.5|
|        228.0|
|        190.0|
|        199.5|
|        190.0|
|        256.5|
|        218.5|
|        275.5|
|        218.5|
..
..

Could it be with the precision? cc'ing the dev list; maybe you can open up a JIRA for this as it seems to be a bug.

Thanks
Best Regards

On Mon, Nov 30, 2015 at 12:41 AM, Philip Dodds <philip.do...@gmail.com> wrote:
> I hit a weird issue when I tried to multiply two decimals in a select
> (either in Scala or as SQL), and I'm assuming I must be missing the point.
>
> The issue is fairly easy to recreate with something like the following:
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
>
> case class Trade(quantity: Decimal, price: Decimal)
>
> val data = Seq.fill(100) {
>   val price = Decimal(20 + scala.util.Random.nextInt(10))
>   val quantity = Decimal(20 + scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
>
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
>
> trades.select(trades("price")*trades("quantity")).show
>
> sqlContext.sql("select price/quantity,price*quantity,price+quantity,price-quantity from trades").show
>
> The odd part is that if you run it you will see that the addition, division,
> and subtraction work, but the multiplication returns a null.
>
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
>
> i.e.
>
> +------------------+
> |(price * quantity)|
> +------------------+
> |              null|
> |              null|
> |              null|
> |              null|
> |              null|
> +------------------+
>
> +--------------------+----+--------+--------+
> |                 _c0| _c1|     _c2|     _c3|
> +--------------------+----+--------+--------+
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|    8.00|
> |1.272727272727272727|null|50.00...|    6.00|
> |                0.83|null|44.00...|-4.00...|
> |                1.00|null|58.00...|   0E-18|
> +--------------------+----+--------+--------+
>
> Just keen to know what I did wrong?
>
> Cheers
> P
>
> --
> Philip Dodds
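A plausible explanation for the NULLs (assuming the Decimal fields were inferred as Spark's default decimal(38,18), which I have not verified against 1.5): under the Hive-derived type rules, the product's ideal type is decimal(p1+p2+1, s1+s2) = decimal(77, 36), which cannot fit in 38 digits, and pre-2.x Spark returned NULL on such overflow instead of rounding. A sketch of the arithmetic only:

```java
public class ProductTypeDemo {
    public static void main(String[] args) {
        // Assumed input types: both columns decimal(38,18).
        int p1 = 38, s1 = 18, p2 = 38, s2 = 18;

        // Hive/SQLServer-style result type for multiplication.
        int resultPrecision = p1 + p2 + 1; // 77
        int resultScale = s1 + s2;         // 36

        System.out.println(resultPrecision + "," + resultScale); // 77,36
        // 77 > 38: the value cannot be represented, and Spark 1.5
        // signalled that by returning NULL instead of rounding.
        System.out.println(resultPrecision > 38); // true
    }
}
```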