[jira] [Commented] (SPARK-22250) Be less restrictive on type checking
[ https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266642#comment-16266642 ]

Fernando Pereira commented on SPARK-22250:
------------------------------------------

[~bryanc] It could help, but it doesn't solve the problem. If we have a SQL field that is an Array, the best equivalent representation on the Python side would be a plain NumPy array, given that lists are not efficient. When building a dataframe, our projects have use-cases that would benefit immensely from such support.

From dataframe to Python, returning Array fields as NumPy arrays would IMHO also be better, but that changes behavior, so it might be trickier to support. We could eventually control it by detecting whether NumPy is available on the system, and otherwise raising a warning and falling back to plain lists.

What do the developers think?

> Be less restrictive on type checking
> ------------------------------------
>
>                 Key: SPARK-22250
>                 URL: https://issues.apache.org/jira/browse/SPARK-22250
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> I find types.py._verify_type() often too restrictive. E.g.
> {code}
> TypeError: FloatType can not accept object 0 in type
> {code}
> I believe it would be globally acceptable to fill a float field with an int,
> especially since in some formats (json) you don't have a way of inferring the
> type correctly.
> Another situation relates to other equivalent numerical types, like
> array.array or numpy. A numpy scalar int is not accepted as an int, and these
> arrays always have to be converted down to plain lists, which can be
> prohibitively large and computationally expensive.
> Any thoughts?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
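The availability check and fallback suggested above could be sketched roughly like this (a hypothetical helper, not part of any Spark API; the function name is made up for illustration):

```python
import warnings

def as_array_or_list(values):
    """Hypothetical sketch of the proposed behavior: return an Array
    field as a NumPy array when NumPy is installed, otherwise warn
    and fall back to a plain Python list."""
    try:
        import numpy as np
    except ImportError:
        warnings.warn("NumPy not available; falling back to a plain list")
        return list(values)
    return np.asarray(values)
```

Callers would then get an np.ndarray on systems with NumPy and a list elsewhere, which is exactly the behavior change the comment flags as potentially tricky.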
[ https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211750#comment-16211750 ]

Bryan Cutler commented on SPARK-22250:
--------------------------------------

[~ferdonline] maybe SPARK-20791 would help you out when working with numpy arrays?
[ https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207798#comment-16207798 ]

Fernando Pereira commented on SPARK-22250:
------------------------------------------

I did some tests, and even though verifySchema=False might help in some cases, it is still not enough for handling NumPy arrays. Arrays from the array module work nicely (even without disabling verifySchema); I think it is because elements of array.array are automatically converted to their corresponding Python type.

So the problem mentioned involves two issues:
# Accept ints for float fields. Apparently it is just a matter of schema verification, so _acceptable_types should be updated.
# Accept a numpy array for an ArrayField. In this case, even with verifySchema=False there is still an exception:
{code}
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
{code}

I believe numpy array support, even in this simple case, would be extremely valuable for a lot of people. In our case we work with large hdf5 files where the data interface is numpy.
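Both observations can be reproduced without Spark. Below is a toy stand-in for the check (a simplified illustration, not the actual code in types.py) with int added to the types acceptable for a float field, plus a demonstration that array.array hands back native Python floats:

```python
from array import array

# Toy stand-in for the PySpark acceptability table, simplified for
# illustration: int added alongside float. bool is excluded explicitly
# because it subclasses int in Python.
_acceptable_types = {"float": (float, int)}

def verify_float(obj):
    """Raise the familiar error unless obj is of an acceptable type."""
    if isinstance(obj, bool) or not isinstance(obj, _acceptable_types["float"]):
        raise TypeError(
            "FloatType can not accept object %r in type %s" % (obj, type(obj)))
    return float(obj)

# array.array elements come back as native Python floats, which would
# explain why they pass verification even without verifySchema=False.
values = array("d", [0.5, 1.5])
assert type(values[0]) is float
```

With int in the tuple, verify_float(0) returns 0.0 instead of raising.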
[ https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204059#comment-16204059 ]

Fernando Pereira commented on SPARK-22250:
------------------------------------------

I have to admit I was not aware of that option. Nevertheless, even though it would silence all the complaints, I find it a bit extreme, and I believe better handling of these cases would be valuable. A float can be initialized from an int in almost any language, and Java is no exception. And since Pandas is supported, taking NumPy types into consideration would also be really nice, especially for large arrays.

I could work on this feature if the community considers it worthwhile.
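Until such support exists, one possible workaround is to coerce NumPy values to native Python types before handing rows to Spark. A minimal sketch (the helper name to_native is made up; it relies only on NumPy scalars and arrays both exposing .tolist()):

```python
def to_native(value):
    """Convert NumPy scalars and arrays to their native Python
    equivalents, leaving plain Python values untouched."""
    # NumPy scalars and ndarrays both expose .tolist(); for a scalar
    # it returns the corresponding native Python number, for an array
    # a (possibly nested) plain list.
    if hasattr(value, "tolist"):
        return value.tolist()
    return value

# Rows could then be cleaned before createDataFrame, e.g.:
# rows = [tuple(to_native(v) for v in row) for row in data]
```

This trades the pickling error for exactly the list-conversion cost the issue description complains about, so it is a stopgap, not a fix.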
[ https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200309#comment-16200309 ]

Hyukjin Kwon commented on SPARK-22250:
--------------------------------------

{{createDataFrame(... verifySchema=False)}}?