[jira] [Commented] (SPARK-22250) Be less restrictive on type checking

Fernando Pereira (JIRA) Tue, 17 Oct 2017 08:39:15 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-22250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207798#comment-16207798
 ]


Fernando Pereira commented on SPARK-22250:
------------------------------------------

I did some tests, and even though verifySchema=False might help in some cases, it 
is still not enough to handle numpy arrays. Using arrays from 
the array module works nicely (even without disabling verifySchema). I think this 
is because elements of array.array are automatically converted to their 
corresponding Python type.
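
The array.array behaviour can be checked with the standard library alone. This is only a small sketch of the observation above, not a statement about how Spark handles these values internally; the variable names are made up for illustration.

```python
from array import array

values = array("d", [1.0, 2.5, 4.0])  # stored compactly as C doubles
counts = array("i", [1, 2, 3])        # stored compactly as C ints

# Indexing or iterating hands back built-in Python objects,
# so a strict type check sees ordinary float and int values.
assert all(type(v) is float for v in values)
assert all(type(c) is int for c in counts)
```

Elements of a numpy array, by contrast, come back as numpy scalar types such as numpy.int64, which would explain why the two kinds of array are treated differently.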

So I think the problem mentioned involves two issues:
# Accept ints in float fields. Apparently it is just a matter of schema 
verification, so _acceptable_types should be updated.
# Accept numpy arrays for an ArrayType field. In this case, with verifySchema=False 
there's still an exception:
net.razorvine.pickle.PickleException: expected zero arguments for construction 
of ClassDict (for numpy.core.multiarray._reconstruct)
I believe numpy array support, even in this simple case, would be extremely 
valuable for a lot of people. In our case we work with large HDF5 files where 
the data interface is numpy.

> Be less restrictive on type checking
> ------------------------------------
>
>                 Key: SPARK-22250
>                 URL: https://issues.apache.org/jira/browse/SPARK-22250
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> I find types.py._verify_type() often too restrictive. E.g. 
> {code}
> TypeError: FloatType can not accept object 0 in type <type 'int'>
> {code}
> I believe it would be globally acceptable to fill a float field with an int, 
> especially since in some formats (json) you don't have a way of inferring the 
> type correctly.
> Another situation relates to other equivalent numerical types, like 
> array.array or numpy. A numpy scalar int is not accepted as an int, and these 
> arrays always have to be converted down to plain lists, which can be 
> prohibitively large and computationally expensive.
> Any thoughts?



