[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

2017-11-27 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266642#comment-16266642
 ] 

Fernando Pereira commented on SPARK-22250:
--

[~bryanc] It could help, but it doesn't solve the problem. If we have a SQL field 
that is an Array, the best equivalent representation on the Python side would 
be a plain Numpy array, given that lists are not efficient. 
When building a dataframe, our projects have use-cases that would 
benefit immensely from such support. 
Going from a dataframe back to Python, returning Array fields as Numpy arrays 
would IMHO be better too, but it also changes behavior, so it might be trickier 
to support. We could eventually control that by detecting whether Numpy is 
available on the system, and otherwise raising a warning and falling back to 
plain lists, as in the sketch below.
What do the developers think?
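
A minimal sketch of that fallback idea, assuming a hypothetical helper inside 
PySpark's conversion layer (the name _convert_array_field is invented here for 
illustration, not an existing function):
{code}
import warnings

try:
    import numpy as np
    _HAVE_NUMPY = True
except ImportError:
    _HAVE_NUMPY = False


def _convert_array_field(values):
    """Return an Array field as a numpy array when numpy is available,
    otherwise warn and fall back to a plain Python list."""
    if _HAVE_NUMPY:
        return np.asarray(values)
    warnings.warn("numpy not available; returning Array fields as plain lists")
    return list(values)
{code}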

> Be less restrictive on type checking
> 
>
> Key: SPARK-22250
> URL: https://issues.apache.org/jira/browse/SPARK-22250
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Fernando Pereira
>Priority: Minor
>
> I find types.py._verify_type() often too restrictive. E.g. 
> {code}
> TypeError: FloatType can not accept object 0 in type <type 'int'>
> {code}
> I believe it would be globally acceptable to fill a float field with an int, 
> especially since in some formats (json) you don't have a way of inferring the 
> type correctly.
> Another situation relates to other equivalent numerical types, like 
> array.array or numpy. A numpy scalar int is not accepted as an int, and these 
> arrays always have to be converted down to plain lists, which can be 
> prohibitively large and computationally expensive.
> Any thoughts?






[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

2017-10-19 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211750#comment-16211750
 ] 

Bryan Cutler commented on SPARK-22250:
--

[~ferdonline] maybe SPARK-20791 would help you out when working with numpy 
arrays?
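
For reference, a hedged usage sketch of what SPARK-20791 enables: an 
Arrow-accelerated createDataFrame from a pandas DataFrame. The config key below 
is the one introduced in Spark 2.3, and `spark` is assumed to be an existing 
SparkSession:
{code}
import numpy as np
import pandas as pd

# Arrow-based conversion must be enabled explicitly (Spark 2.3 config key)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# numpy-backed pandas columns are transferred via Arrow without a
# per-element round trip through Python objects
pdf = pd.DataFrame({"x": np.arange(1000, dtype=np.float64)})
df = spark.createDataFrame(pdf)
{code}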







[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

2017-10-17 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207798#comment-16207798
 ] 

Fernando Pereira commented on SPARK-22250:
--

I did some tests, and even though verifySchema=False might help in some cases, 
it is still not enough for handling numpy arrays. Arrays from the array module 
work nicely (even without disabling verifySchema). I think that is because 
elements of array.array are automatically converted to their corresponding 
Python type.

So I think the problem mentioned involves two issues:
# Accepting ints for float fields. Apparently this is just a schema 
verification issue, so _acceptable_types should be updated.
# Accepting a numpy array for an ArrayType field. In this case, even with 
verifySchema=False there is still an exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction 
of ClassDict (for numpy.core.multiarray._reconstruct)

I believe numpy array support, even in this simple case, would be extremely 
valuable for a lot of people. In our case we work with large hdf5 files where 
the data interface is numpy.
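
A minimal reproduction sketch of the two behaviors reported above, assuming an 
existing SparkSession named `spark`:
{code}
import array
import numpy as np
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType

schema = StructType([StructField("xs", ArrayType(DoubleType()))])

# array.array works, even with schema verification on: its elements
# come out as plain Python floats
df_ok = spark.createDataFrame([(array.array('d', [1.0, 2.0]),)], schema)

# a numpy array still fails, even with verifySchema=False, because the
# ndarray itself gets pickled and the JVM-side unpickler cannot rebuild it:
# net.razorvine.pickle.PickleException: expected zero arguments for
# construction of ClassDict (for numpy.core.multiarray._reconstruct)
df_fail = spark.createDataFrame([(np.array([1.0, 2.0]),)], schema,
                                verifySchema=False)
{code}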







[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

2017-10-13 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204059#comment-16204059
 ] 

Fernando Pereira commented on SPARK-22250:
--

I have to admit I was not aware of that option.
Nevertheless, even though that would silence all the complaints, I find it a 
bit extreme, and I believe better handling of these cases would be valuable.
A float can be initialized by an int in almost any language, and Java is no 
exception.
And since Pandas is supported, taking Numpy types into consideration would 
also be really nice, especially for large arrays.
I could work on this feature if the community considers it worthwhile.
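
For concreteness, the int-into-float rejection the issue describes can be 
reproduced as follows (assuming an existing SparkSession named `spark`):
{code}
from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([StructField("x", FloatType())])

# Fails in Spark 2.x with schema verification on:
# TypeError: FloatType can not accept object 0 in type <type 'int'>
df = spark.createDataFrame([(0,)], schema)
{code}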







[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

2017-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200309#comment-16200309
 ] 

Hyukjin Kwon commented on SPARK-22250:
--

{{createDataFrame(... verifySchema=False)}}?
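
That is, skip the per-row type verification when building the dataframe; a 
short usage sketch (the verifySchema flag exists on createDataFrame since 
Spark 2.1, and `spark` is assumed to be an existing SparkSession):
{code}
from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([StructField("x", FloatType())])

# with verifySchema=False, the int 0 is no longer rejected by _verify_type
df = spark.createDataFrame([(0,)], schema, verifySchema=False)
{code}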



