[jira] [Resolved] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns

2018-10-23 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25801.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> pandas_udf grouped_map fails with input dataframe with more than 255 columns
> 
>
> Key: SPARK-25801
> URL: https://issues.apache.org/jira/browse/SPARK-25801
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 2.7
> pyspark 2.3.0
>Reporter: Frederik
>Priority: Major
> Fix For: 2.4.0
>
>
> Hi,
> I'm using a pandas_udf to deploy a model that predicts all samples in a Spark
> dataframe; for this I use a udf as follows:
> @pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
> def predict_scores(pdf):
>     score_values = model.predict_proba(pdf)[:, 1]
>     return pd.DataFrame({'scores': score_values})
> So it takes a dataframe, predicts the probability of being positive
> according to an sklearn model for each row, and returns this as a single column.
> This works great on an arbitrary groupBy, e.g.:
> sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
> as long as the dataframe has fewer than 255 columns. When the input dataframe has
> more than 255 columns (and thus features in my model), I get:
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 
> 219, in main
> func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
> eval_type)
>   File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 
> 148, in read_udfs
> mapper = eval(mapper_str, udfs)
>   File "<string>", line 1
> SyntaxError: more than 255 arguments
> This seems to be related to Python's general limitation of not allowing more
> than 255 arguments in a function call?
>  
> Is this a bug or is there a straightforward way around this problem?
>  
> Regards,
> Frederik
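
The {{SyntaxError}} in the traceback comes from CPython's limit of 255 explicit arguments per call expression, which applied before Python 3.7; the worker appears to hit it because {{read_udfs}} builds the mapper by eval-ing a generated call with one argument per input column. A minimal illustration of the limit itself (not the fix that went into 2.4.0):

{code:python}
# Illustration only: CPython < 3.7 rejects call expressions with more than 255
# explicit arguments at compile time, which is what the generated mapper hits.
call_src = "f({})".format(", ".join("a{}".format(i) for i in range(300)))
try:
    compile(call_src, "<string>", "eval")
    print("no limit (Python 3.7+)")
except SyntaxError as e:
    print(e)  # "more than 255 arguments" on Python 2.7 / 3.6
{code}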



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22809) pyspark is sensitive to imports with dots

2018-10-23 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-22809:
-
Fix Version/s: 2.3.2

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}
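
As an aside, an untested, minimal distillation of the pattern the script above exercises, assuming scipy is installed on both the driver and the executors; only the dotted-path lookup inside the shipped closure was reported to fail on 2.2.x:

{code:python}
import pyspark
import scipy.interpolate
import scipy.interpolate as scipy_interpolate

sc = pyspark.SparkContext()
rdd = sc.parallelize([0.0, 0.5, 1.0])

# Reportedly fine: the aliased module is captured directly in the closure.
rdd.map(lambda v: float(scipy_interpolate.interp1d([0, 1], [0, 2])(v))).collect()

# Reportedly raised Py4JJavaError on 2.2.x: the same call, but resolved through
# the dotted path scipy.interpolate inside the serialized closure.
rdd.map(lambda v: float(scipy.interpolate.interp1d([0, 1], [0, 2])(v))).collect()
{code}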



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-10-23 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661502#comment-16661502
 ] 

Bryan Cutler commented on SPARK-22809:
--

Sure, I probably shouldn't have tested out of the branches. Running tests again 
from IPython with Python 3.6.6:

 

*v2.2.2* - Error is raised

*v2.3.2* - Working

*v2.4.0-rc4* - Working

 

From those results, it seems like SPARK-21070 most likely fixed it.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0
>
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-11-05 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675866#comment-16675866
 ] 

Bryan Cutler edited comment on SPARK-25079 at 11/5/18 11:09 PM:


Sounds like a good plan [~shaneknapp]!  The instances of python3.4 in 
SparkLauncherSuite and SparkSubmitSuite look like they are for testing that the 
env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be set to different 
Python versions. So I don't think they need to be changed, but if you want to 
maybe change python3.4 -> python3.5 and also python3.5 -> python3.6 to keep 
them different. Everything else looks good!


was (Author: bryanc):
Sounds like a good plan [~shaneknapp] !  The instances of python3.4 in 
SparkLauncherSuite and SparkSubmitSuite look like they are for testing that the 
env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be set to different 
Python versions. So I don't think they need to be changed, but if you want to 
maybe change python3.4 -> python3.5 and also python3.5 -> python3.6 to keep 
them different. Everything else looks good!

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-11-05 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675866#comment-16675866
 ] 

Bryan Cutler commented on SPARK-25079:
--

Sounds like a good plan [~shaneknapp]!  The instances of python3.4 in 
SparkLauncherSuite and SparkSubmitSuite look like they are for testing that the 
env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be set to different 
Python versions. So I don't think they need to be changed, but if you want to 
maybe change python3.4 -> python3.5 and also python3.5 -> python3.6 to keep 
them different. Everything else looks good!
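
For context, a hypothetical sketch of how those env vars pick the interpreters (the interpreter names are placeholders):

{code:python}
import os

# Hypothetical sketch: PYSPARK_PYTHON is read when executors launch Python
# workers, so setting it before creating the SparkContext selects the worker
# interpreter; PYSPARK_DRIVER_PYTHON is consulted by the bin/pyspark and
# spark-submit launchers for the driver side.
os.environ["PYSPARK_PYTHON"] = "python3.5"

import pyspark
sc = pyspark.SparkContext()
print(sc.pythonVer)  # version of the driver interpreter running this script
{code}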

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-11-05 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675866#comment-16675866
 ] 

Bryan Cutler edited comment on SPARK-25079 at 11/5/18 11:08 PM:


Sounds like a good plan [~shaneknapp] !  The instances of python3.4 in 
SparkLauncherSuite and SparkSubmitSuite look like they are for testing that the 
env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be set to different 
Python versions. So I don't think they need to be changed, but if you want to 
maybe change python3.4 -> python3.5 and also python3.5 -> python3.6 to keep 
them different. Everything else looks good!


was (Author: bryanc):
Sounds like a good plan [~shaneknapp]!  The instances of python3.4 in 
SparkLauncherSuite and SparkSubmitSuite look like they are for testing that the 
env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be set to different 
Python versions. So I don't think they need to be changed, but if you want to 
maybe change python3.4 -> python3.5 and also python3.5 -> python3.6 to keep 
them different. Everything else looks good!

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files

2018-11-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685613#comment-16685613
 ] 

Bryan Cutler commented on SPARK-25344:
--

[~hyukjin.kwon] no problem, I can take on ML and MLlib

> Break large PySpark unittests into smaller files
> 
>
> Key: SPARK-25344
> URL: https://issues.apache.org/jira/browse/SPARK-25344
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking 
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also 
> makes the test parallelization in run-tests.py not work as well. On my 
> laptop, tests.py takes 150s, and the next longest test file takes only 20s. 
> There are similarly large files in other pyspark modules, e.g. sql/tests.py, 
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least for some of these files, the code is already broken into 
> independent test classes, so it shouldn't be too hard to just move them into 
> their own files.
> We could pick one example to follow; the current style looks close to the 
> NumPy structure and seems easier to follow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files

2018-10-09 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643762#comment-16643762
 ] 

Bryan Cutler commented on SPARK-25344:
--

No I don't have strong feelings, my only preference was to put the tests in a 
subdir. I'm sure it's quite a lot of work, so however you see fit 
[~hyukjin.kwon] is fine! If you are planning on breaking this up into tasks, I 
can try to help out as well.

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-25344
> URL: https://issues.apache.org/jira/browse/SPARK-25344
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking 
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also 
> makes the test parallelization in run-tests.py not work as well.  On my 
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.  
> There are similarly large files in other pyspark modules, e.g. sql/tests.py, 
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least for some of these files, the code is already broken into 
> independent test classes, so it shouldn't be too hard to just move them into 
> their own files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-02 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635762#comment-16635762
 ] 

Bryan Cutler commented on SPARK-25461:
--

Thanks for looking into this [~viirya]! You are right that the above udf 
returns a float64 instead of boolean. I'm not sure what the expected cast from 
float to bool should be, but it does seem like pyarrow might be doing something 
wrong here. I'll look into it some more and raise an issue there if so.
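
A minimal sketch of that point: with None present, the body of the udf above no longer yields a boolean Series, so the declared BooleanType forces a conversion in pyarrow:

{code:python}
import pandas as pd

# Same expression as the udf body in the issue, evaluated on a column with None.
column = pd.Series([None, 1.0, 2.0])
result = (column >= 2).where(column.notnull())
print(result.dtype)  # not bool; the reporter's pandas 0.20 produced float64 here
{code}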

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col',
>                                          (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-+---+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-03 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637626#comment-16637626
 ] 

Bryan Cutler commented on SPARK-25461:
--

I filed ARROW-3428, which deals with the incorrect cast from float to bool

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col',
>                                          (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-+---+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-03 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637626#comment-16637626
 ] 

Bryan Cutler edited comment on SPARK-25461 at 10/3/18 11:53 PM:


I filed ARROW-3428, which deals with the incorrect cast from float to bool


was (Author: bryanc):
I file ARROW-3428, which deals with the incorrect cast from float to bool

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col',
>                                          (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-+---+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-08 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642379#comment-16642379
 ] 

Bryan Cutler commented on SPARK-25461:
--

Just wanted to add that the resolution here added a note for the user to verify 
that the return type and data are correct.  The actual fix for correct data when 
converting to/from booleans is on the Arrow side and won't be available until 
Spark upgrades pyarrow to version 0.12.0+.

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col',
>                                          (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-+---+
> | A|potential_bad_col|correct_col|
> +---+-+---+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-+---+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25471) Fix tests for Python 3.6 with Pandas 0.23+

2018-09-19 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25471:


Assignee: (was: Bryan Cutler)

> Fix tests for Python 3.6 with Pandas 0.23+
> --
>
> Key: SPARK-25471
> URL: https://issues.apache.org/jira/browse/SPARK-25471
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Running pyspark tests causes at least 1 error when using Python 3.6 and 
> Pandas 0.23 or higher.  This is because the Pandas DataFrame constructor now 
> creates columns in the defined order, whereas earlier versions might put them in 
> alphabetical order.  This leads to the following failure:
> {noformat}
> ==
> ERROR: test_create_dataframe_from_pandas_with_timestamp 
> (pyspark.sql.tests.SQLTests)
> --
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/sql/tests.py", line 3275, in 
> test_create_dataframe_from_pandas_with_timestamp
> df = self.spark.createDataFrame(pdf, schema="d date, ts timestamp")
>   File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 748, in 
> createDataFrame
> rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 413, in 
> _createFromLocal
> data = list(data)
>   File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 730, in 
> prepare
> verify_func(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in 
> verify
> verify_value(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1370, in 
> verify_struct
> verifier(v)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in 
> verify
> verify_value(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1383, in 
> verify_default
> verify_acceptable_types(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1278, in 
> verify_acceptable_types
> % (dataType, obj, type(obj
> TypeError: field ts: TimestampType can not accept object datetime.date(2018, 
> 9, 19) in type <class 'datetime.date'>
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25471) Fix tests for Python 3.6 with Pandas 0.23+

2018-09-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-25471:


 Summary: Fix tests for Python 3.6 with Pandas 0.23+
 Key: SPARK-25471
 URL: https://issues.apache.org/jira/browse/SPARK-25471
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 2.4.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Running pyspark tests causes at least 1 error when using Python 3.6 and Pandas 
0.23 or higher.  This is because the Pandas DataFrame constructor now creates 
columns in the defined order, whereas earlier versions might put them in alphabetical 
order.  This leads to the following failure:
{noformat}
==
ERROR: test_create_dataframe_from_pandas_with_timestamp 
(pyspark.sql.tests.SQLTests)
--
Traceback (most recent call last):
  File "/home/bryan/git/spark/python/pyspark/sql/tests.py", line 3275, in 
test_create_dataframe_from_pandas_with_timestamp
df = self.spark.createDataFrame(pdf, schema="d date, ts timestamp")
  File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 748, in 
createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 413, in 
_createFromLocal
data = list(data)
  File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 730, in 
prepare
verify_func(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in verify
verify_value(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1370, in 
verify_struct
verifier(v)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in verify
verify_value(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1383, in 
verify_default
verify_acceptable_types(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1278, in 
verify_acceptable_types
% (dataType, obj, type(obj
TypeError: field ts: TimestampType can not accept object datetime.date(2018, 9, 
19) in type <class 'datetime.date'>
{noformat}
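
A small sketch of the ordering difference described above; actual behavior depends on the pandas and Python versions in use:

{code:python}
import datetime
import pandas as pd

# Column order of a dict-constructed DataFrame: pandas 0.23+ on Python 3.6
# keeps insertion order, while older combinations sorted names alphabetically,
# so a schema matched positionally can end up paired with the wrong column.
pdf = pd.DataFrame({"ts": [datetime.datetime(2018, 9, 19, 12, 0, 0)],
                    "d": [datetime.date(2018, 9, 19)]})
print(list(pdf.columns))  # ['ts', 'd'] with insertion order, ['d', 'ts'] when sorted
{code}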



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code

2018-09-24 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626148#comment-16626148
 ] 

Bryan Cutler commented on SPARK-25432:
--

moved description :)

> Consider if using standard getOrCreate from PySpark into JVM SparkSession 
> would simplify code
> -
>
> Key: SPARK-25432
> URL: https://issues.apache.org/jira/browse/SPARK-25432
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>
> As we saw in [https://github.com/apache/spark/pull/22295/files] the logic can 
> get a bit out of sync. It _might_ make sense to try and simplify this so 
> there's less duplicated logic in Python & Scala around session set up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code

2018-09-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25432:
-
Description: As we saw in 
[https://github.com/apache/spark/pull/22295/files] the logic can get a bit out 
of sync. It _might_ make sense to try and simplify this so there's less 
duplicated logic in Python & Scala around session set up.

> Consider if using standard getOrCreate from PySpark into JVM SparkSession 
> would simplify code
> -
>
> Key: SPARK-25432
> URL: https://issues.apache.org/jira/browse/SPARK-25432
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>
> As we saw in [https://github.com/apache/spark/pull/22295/files] the logic can 
> get a bit out of sync. It _might_ make sense to try and simplify this so 
> there's less duplicated logic in Python & Scala around session set up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code

2018-09-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25432:
-
Environment: (was: As we saw in 
[https://github.com/apache/spark/pull/22295/files] the logic can get a bit out 
of sync. It _might_ make sense to try and simplify this so there's less 
duplicated logic in Python & Scala around session set up.)

> Consider if using standard getOrCreate from PySpark into JVM SparkSession 
> would simplify code
> -
>
> Key: SPARK-25432
> URL: https://issues.apache.org/jira/browse/SPARK-25432
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow

2018-09-26 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629572#comment-16629572
 ] 

Bryan Cutler commented on SPARK-25351:
--

Hi [~pgadige], yes please go ahead with this issue!  When creating a DataFrame 
from Pandas without Arrow, category columns are converted into the type of the 
category. So in the example above, the category column "B" becomes a string type. The same 
should be done when Arrow is enabled, so we end up with the same Spark 
DataFrame. If you are able to, we also need to see how this affects pandas_udfs 
too. Thanks!
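
A hypothetical sketch of that handling (not the eventual patch): cast category columns back to the dtype of their categories before the frame reaches Arrow, so the Arrow-enabled path produces the same schema as the non-Arrow output shown in the issue:

{code:python}
import pandas as pd

# Hypothetical pre-processing step before handing the frame to Arrow.
pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype("category")

converted = pdf.copy()
for name in converted.columns:
    if pd.api.types.is_categorical_dtype(converted[name]):
        # Replace the category column with plain values of the categories' dtype.
        converted[name] = converted[name].astype(converted[name].cat.categories.dtype)

print(converted.dtypes)  # both columns back to plain object (string) dtype
{code}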

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Bryan Cutler
>Priority: Major
>
> There needs to be some handling of category types done when calling 
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.  
> Without Arrow, Spark casts each element to the category. For example 
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]: 
>A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]: 
> A  object
> Bcategory
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)   
> ...
>1667 spark_type = ArrayType(from_arrow_type(at.value_type))
>1668 else:
> -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
>1670 return spark_type
>1671
> TypeError: Unsupported type in conversion from Arrow: dictionary
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-16 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744288#comment-16744288
 ] 

Bryan Cutler commented on SPARK-26591:
--

[~elch10] please go ahead and make a Jira for Arrow regarding the pyarrow 
import error. Also, include all relevant details about your system.

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # here it crashes
> {code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}
> The environment is:
> Python 3.6.7
>  PySpark 2.4.0
>  PyArrow: 0.11.1
>  Pandas: 0.23.4
>  NumPy: 1.15.4
>  OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 
> x86_64 x86_64 x86_64 GNU/Linux



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-16 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Description: 
_This is just a placeholder for now to collect what needs to be fixed when we 
upgrade next time_

Version 0.12.0 includes the following:
 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
 * conversion to date object no longer needed, ARROW-3910

 

  was:
_This is just a placeholder for now to collect what needs to be fixed when we 
upgrade next time_

Version 0.12.0 includes the following:
 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098

 


> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> _This is just a placeholder for now to collect what needs to be fixed when we 
> upgrade next time_
> Version 0.12.0 includes the following:
>  * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
>  * conversion to date object no longer needed, ARROW-3910
>  
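
For the first item above, a small sketch of the reader-side change in pyarrow 0.12, round-tripping one record batch through the stream format:

{code:python}
import pyarrow as pa

# The writer side is unchanged for this point; only the reader call moves
# from the deprecated pa.open_stream to pa.ipc.open_stream in 0.12+.
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

reader = pa.ipc.open_stream(sink.getvalue())  # was: pa.open_stream(...)
print(reader.read_all())
{code}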



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-14 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742517#comment-16742517
 ] 

Bryan Cutler commented on SPARK-26591:
--

I created the same virtual environment and could not reproduce, can anyone else 
verify?

OS:  4.15.0-43-generic #46~16.04.1-Ubuntu SMP Fri Dec 7 13:31:08 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.7 (default, Nov 21 2018 02:32:25)
SparkSession available as 'spark'.
>>> import pyarrow
>>> pyarrow.__version__
'0.11.1'
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import numpy
>>> numpy.__version__
'1.15.4'
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> from pyspark.sql.types import IntegerType, StringType
>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
>>> slen
<function <lambda> at 0x7f099a99cd90>
{noformat}

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # here it crashes
> {code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}
> The environment is:
> Python 3.6.7
>  PySpark 2.4.0
>  PyArrow: 0.11.1
>  Pandas: 0.23.4
>  NumPy: 1.15.4
>  OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 
> x86_64 x86_64 x86_64 GNU/Linux



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-15 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743460#comment-16743460
 ] 

Bryan Cutler commented on SPARK-26591:
--

[~elch10] this seems like it is more an Arrow issue with pyarrow compatibility. 
Can you try something like the following in an interpreter to test whether pyarrow 
basics work?
{code:java}
import pandas as pd
import pyarrow as pa

s = pd.Series([1, 2, 3])
pa.Array.from_pandas(s){code}
If any of that fails, we should move the issue over to Arrow

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # here it crashes
> {code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}
> The environment is:
> Python 3.6.7
>  PySpark 2.4.0
>  PyArrow: 0.11.1
>  Pandas: 0.23.4
>  NumPy: 1.15.4
>  OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 
> x86_64 x86_64 x86_64 GNU/Linux



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment

2019-01-15 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743463#comment-16743463
 ] 

Bryan Cutler commented on SPARK-26591:
--

And yes, you could build pyarrow yourself, but it shouldn't fail if you are 
using standard hardware and nothing exotic.  See 
[https://arrow.apache.org/docs/python/development.html#development] for build 
info

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain 
> environment
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Major
> Attachments: core
>
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType())  # here it crashes
> {code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}
> The environment is:
> Python 3.6.7
>  PySpark 2.4.0
>  PyArrow: 0.11.1
>  Pandas: 0.23.4
>  NumPy: 1.15.4
>  OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 
> x86_64 x86_64 x86_64 GNU/Linux



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26676) Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 and PyPy

2019-01-21 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-26676.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23604
[https://github.com/apache/spark/pull/23604]

> Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 
> and PyPy
> -
>
> Key: SPARK-26676
> URL: https://issues.apache.org/jira/browse/SPARK-26676
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> This particular test is being skipped at PyPy and Python 2.
> {code}
> Skipped tests in pyspark.sql.tests.test_context with pypy:
> test_unbounded_frames 
> (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < 
> 3.3 doesn't support mocking"
> Skipped tests in pyspark.sql.tests.test_context with python2.7:
> test_unbounded_frames 
> (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < 
> 3.3 doesn't support mocking"
> {code}
> We don't have to use unittest 3.3 module to mock. And looks the test itself 
> isn't compatible with Python 2. We should fix it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26676) Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 and PyPy

2019-01-21 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-26676:


Assignee: Hyukjin Kwon

> Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 
> and PyPy
> -
>
> Key: SPARK-26676
> URL: https://issues.apache.org/jira/browse/SPARK-26676
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> This particular test is being skipped at PyPy and Python 2.
> {code}
> Skipped tests in pyspark.sql.tests.test_context with pypy:
> test_unbounded_frames 
> (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < 
> 3.3 doesn't support mocking"
> Skipped tests in pyspark.sql.tests.test_context with python2.7:
> test_unbounded_frames 
> (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < 
> 3.3 doesn't support mocking"
> {code}
> We don't have to use unittest 3.3 module to mock. And looks the test itself 
> isn't compatible with Python 2. We should fix it



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26315) auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel

2018-12-11 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717873#comment-16717873
 ] 

Bryan Cutler commented on SPARK-26315:
--

I believe {{def approxSimilarityJoin(...)}} in LSHModel in feature.py should 
have {{threshold = TypeConverters.toFloat(threshold)}} before the call to 
{{_call_java(...)}}.

I can help guide if anyone would like to submit a PR for the fix.
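
A hypothetical sketch of that one-line change (the signature mirrors the 2.3-era Python API; this is not the merged patch):

{code:python}
from pyspark.ml.param import TypeConverters

# Hypothetical method body for LSHModel.approxSimilarityJoin in pyspark/ml/feature.py:
# coerce an int threshold to float so the JVM overload
# (Dataset, Dataset, double, String) is found.
def approxSimilarityJoin(self, datasetA, datasetB, threshold, distCol="distCol"):
    threshold = TypeConverters.toFloat(threshold)
    return self._call_java("approxSimilarityJoin", datasetA, datasetB, threshold, distCol)
{code}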

> auto cast threshold from Integer to Float in approxSimilarityJoin of 
> BucketedRandomProjectionLSHModel
> -
>
> Key: SPARK-26315
> URL: https://issues.apache.org/jira/browse/SPARK-26315
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.3.2
>Reporter: Song Ci
>Priority: Major
>
> When I was using
> {code:java}
> // code placeholder
> BucketedRandomProjectionLSHModel.approxSimilarityJoin(dt_features, 
> dt_features, distCol="EuclideanDistance", threshold=20.)
> {code}
> I was confused that this method reported an exception saying a Java method with 
> the (dataset, dataset, integer, string) signature could not be found. I think 
> that if I pass an integer, the PySpark method should auto-cast 
> it to float if needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704092#comment-16704092
 ] 

Bryan Cutler edited comment on SPARK-26200 at 11/30/18 12:56 AM:
-

I think this is a duplicate of SPARK-24915 except the columns are mixed up 
instead of an exception being thrown, could you please confirm [~davidlyness]?


was (Author: bryanc):
I think this is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-24915 except the columns are mixed 
up instead of an exception being thrown, could you please confirm 
[~davidlyness]?

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified column.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704092#comment-16704092
 ] 

Bryan Cutler commented on SPARK-26200:
--

I think this is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-24915, except that the columns are 
mixed up instead of an exception being thrown. Could you please confirm, 
[~davidlyness]?

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified column.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-30 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-26200.
--
Resolution: Duplicate

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified column.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-12-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24333:


Assignee: Huaxin Gao

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Huaxin Gao
>Priority: Major
>
> Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-12-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24333.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21465
[https://github.com/apache/spark/pull/21465]

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25274) Improve toPandas with Arrow by sending out-of-order record batches

2018-12-06 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-25274:


Assignee: Bryan Cutler

> Improve toPandas with Arrow by sending out-of-order record batches
> --
>
> Key: SPARK-25274
> URL: https://issues.apache.org/jira/browse/SPARK-25274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
> JVM out-of-order must be buffered before they can be sent to Python. This 
> causes an excess of memory to be used in the driver JVM and increases the 
> time it takes to complete because data must sit in the JVM waiting for 
> preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as 
> they arrive in the JVM, followed by a list of indices so that Python can 
> assemble the data in the correct order. This way, data is not buffered at the 
> JVM and there is no waiting on particular partitions so performance will be 
> increased.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25274) Improve toPandas with Arrow by sending out-of-order record batches

2018-12-06 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25274.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22275
https://github.com/apache/spark/pull/22275

> Improve toPandas with Arrow by sending out-of-order record batches
> --
>
> Key: SPARK-25274
> URL: https://issues.apache.org/jira/browse/SPARK-25274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
> JVM out-of-order must be buffered before they can be sent to Python. This 
> causes an excess of memory to be used in the driver JVM and increases the 
> time it takes to complete because data must sit in the JVM waiting for 
> preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as 
> they arrive in the JVM, followed by a list of indices so that Python can 
> assemble the data in the correct order. This way, data is not buffered at the 
> JVM and there is no waiting on particular partitions so performance will be 
> increased.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26573) Python worker not reused with mapPartitions if not consuming iterator

2019-01-08 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-26573:


 Summary: Python worker not reused with mapPartitions if not 
consuming iterator
 Key: SPARK-26573
 URL: https://issues.apache.org/jira/browse/SPARK-26573
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Bryan Cutler


In PySpark, if the user calls RDD mapPartitions and does not consume the 
iterator, the Python worker will read the wrong signal and not be reused.  Test 
to replicate:
{code:java}
def test_worker_reused_in_map_partition(self):

    def map_pid(iterator):
        # Worker reuse fails because this iterator is not consumed;
        # consuming it, e.g. len(list(iterator)), avoids the issue.
        return (os.getpid() for _ in xrange(10))

    rdd = self.sc.parallelize([], 10)

    worker_pids_a = rdd.mapPartitions(map_pid).collect()
    worker_pids_b = rdd.mapPartitions(map_pid).collect()

    self.assertTrue(all([pid in worker_pids_a for pid in worker_pids_b])){code}

This is related to SPARK-26549, which fixes this issue, but only for use in 
rdd.range.
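
A hedged workaround sketch until a general fix lands (names are illustrative): drain the incoming iterator before returning, so the worker reads the expected end-of-stream signal and can be reused.

{code:python}
import os

def map_pid(iterator):
    for _ in iterator:   # consume the partition iterator even if its rows are unused
        pass
    return (os.getpid() for _ in range(10))

# rdd.mapPartitions(map_pid).collect() should then report reused worker pids.
{code}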



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26349) Pyspark should not accept insecure py4j gateways

2019-01-08 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-26349:


Assignee: Imran Rashid

> Pyspark should not accept insecure py4j gateways
> 
>
> Key: SPARK-26349
> URL: https://issues.apache.org/jira/browse/SPARK-26349
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> Pyspark normally produces a secure py4j connection between python & the jvm, 
> but it also lets users provide their own connection.  Spark should fail fast 
> if that connection is insecure.
> Note this is breaking a public api, which is why this is targeted at 3.0.0.  
> SPARK-26019 is about adding a warning, but still allowing the old behavior to 
> work, in 2.x



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26349) Pyspark should not accept insecure py4j gateways

2019-01-08 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-26349.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23441
[https://github.com/apache/spark/pull/23441]

> Pyspark should not accept insecure py4j gateways
> 
>
> Key: SPARK-26349
> URL: https://issues.apache.org/jira/browse/SPARK-26349
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Pyspark normally produces a secure py4j connection between python & the jvm, 
> but it also lets users provide their own connection.  Spark should fail fast 
> if that connection is insecure.
> Note this is breaking a public api, which is why this is targeted at 3.0.0.  
> SPARK-26019 is about adding a warning, but still allowing the old behavior to 
> work, in 2.x



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Target Version/s:   (was: 2.4.0)

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-26566:


 Summary: Upgrade apache/arrow to 0.12.0
 Key: SPARK-26566
 URL: https://issues.apache.org/jira/browse/SPARK-26566
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler
 Fix For: 2.4.0


Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736540#comment-16736540
 ] 

Bryan Cutler commented on SPARK-26566:
--

Version 0.12.0 is slated to be released in mid-January.

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> _This is just a placeholder for now to collect what needs to be fixed when we 
> upgrade next time_
> Version 0.12.0 includes the following:
>  * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25272.
--
Resolution: Won't Fix

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Fix Version/s: (was: 2.4.0)

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Affects Version/s: (was: 2.3.0)
   2.4.0

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-26566:


Assignee: (was: Bryan Cutler)

> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Description: 
_This is just a placeholder for now to collect what needs to be fixed when we 
upgrade next time_

Version 0.12.0 includes the following:
 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported in as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> _This is just a placeholder for now to collect what needs to be fixed when we 
> upgrade next time_
> Version 0.12.0 includes the following:
>  * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26591) illegal hardware instruction

2019-01-11 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740571#comment-16740571
 ] 

Bryan Cutler commented on SPARK-26591:
--

Could you share some details of your pyarrow installation - the version, whether 
you installed it with pip, and whether you are using a virtual env? If possible, 
I would create a clean virtual environment and try the installation again; it 
sounds like something went wrong.
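
If it helps, a quick way to gather those details in one place (standard calls only):

{code:python}
import platform
import sys

import pyarrow

print(sys.version)            # Python version
print(platform.platform())    # OS / kernel
print(pyarrow.__version__)    # pyarrow version
# A non-None value here usually means a virtualenv is active.
print(getattr(sys, "real_prefix", None))
{code}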

> illegal hardware instruction
> 
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 
> x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>  
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>  
>  
>Reporter: Elchin
>Priority: Critical
>
> When I try to use pandas_udf from examples in 
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is 
> crashed{code}
> I get the error:
> {code:java}
> [1]    17969 illegal hardware instruction (core dumped)  python3{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files

2018-09-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614176#comment-16614176
 ] 

Bryan Cutler commented on SPARK-25344:
--

From the mailing list, I think we should agree on a few things first:

1. When do we create a separate test file - one per module? And how should it be 
named, e.g. "test_rdd.py"?
2. Where do we put the test files - the same dir as the source, or a subdir named "tests"?
3. Do we start splitting tests immediately as new tests are written, or incrementally as 
subtasks of this JIRA?

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-25344
> URL: https://issues.apache.org/jira/browse/SPARK-25344
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking 
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also 
> makes the test parallelization in run-tests.py not work as well.  On my 
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.  
> There are similarly large files in other pyspark modules, eg. sql/tests.py, 
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least for some of these files, its already broken into 
> independent test classes, so it shouldn't be too hard to just move them into 
> their own files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization

2018-11-30 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705224#comment-16705224
 ] 

Bryan Cutler commented on SPARK-26200:
--

Thanks [~davidlyness], I'll mark this as a duplicate since the root cause is 
the same. I've been tracking this and similar issues with PySpark Rows, but 
since addressing them will cause a behavior change, we can only make the fix 
in Spark 3.0. If you didn't see the workaround in the other JIRA: when you use 
a schema and care about field ordering, I would recommend creating your Rows 
this way, or using a list as you pointed out.

{code}
In [10]: MyRow = Row("field2", "field1")

In [11]: data = [
...: MyRow(Row(sub_field='world'), "hello")
...: ]

In [12]: df = spark.createDataFrame(data, schema=schema)

In [13]: df.show()
+---+--+
| field2|field1|
+---+--+
|[world]| hello|
+---+--+
{code}

> Column values are incorrectly transposed when a field in a PySpark Row 
> requires serialization
> -
>
> Key: SPARK-26200
> URL: https://issues.apache.org/jira/browse/SPARK-26200
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark version 2.4.0
> Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144
> The same issue is observed when PySpark is run on both macOS 10.13.6 and 
> CentOS 7, so this appears to be a cross-platform issue.
>Reporter: David Lyness
>Priority: Major
>  Labels: correctness
>
> h2. Description of issue
> Whenever a field in a PySpark {{Row}} requires serialization (such as a 
> {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below 
> will assign column values *in alphabetical order*, rather than assigning each 
> column value to its specified column.
> h3. Code to reproduce:
> {code:java}
> import datetime
> from pyspark.sql import Row
> from pyspark.sql.session import SparkSession
> from pyspark.sql.types import DateType, StringType, StructField, StructType
> spark = SparkSession.builder.getOrCreate()
> schema = StructType([
> StructField("date_column", DateType()),
> StructField("my_b_column", StringType()),
> StructField("my_a_column", StringType()),
> ])
> spark.createDataFrame([Row(
> date_column=datetime.date.today(),
> my_b_column="my_b_value",
> my_a_column="my_a_value"
> )], schema).show()
> {code}
> h3. Expected result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_b_value| my_a_value|
> +---+---+---+{noformat}
> h3. Actual result:
> {noformat}
> +---+---+---+
> |date_column|my_b_column|my_a_column|
> +---+---+---+
> | 2018-11-28| my_a_value| my_b_value|
> +---+---+---+{noformat}
> (Note that {{my_a_value}} and {{my_b_value}} are transposed.)
> h2. Analysis of issue
> Reviewing [the relevant code on 
> GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622],
>  there are two relevant conditional blocks:
>  
> {code:java}
> if self._needSerializeAnyField:
> # Block 1, does not work correctly
> else:
> # Block 2, works correctly
> {code}
> {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and 
> a dictionary of named columns. In Block 2, there is a conditional that works 
> specifically to serialize a {{Row}} object:
>  
> {code:java}
> elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False):
> return tuple(obj[n] for n in self.names)
> {code}
> There is no such condition in Block 1, so we fall into this instead:
>  
> {code:java}
> elif isinstance(obj, (tuple, list)):
> return tuple(f.toInternal(v) if c else v
> for f, v, c in zip(self.fields, obj, self._needConversion))
> {code}
> The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will 
> return a different ordering than the schema fields. So we end up with:
> {code:java}
> (date, date, True),
> (b, a, False),
> (a, b, False)
> {code}
> h2. Workarounds
> Correct behaviour is observed if you use a Python {{list}} or {{dict}} 
> instead of PySpark's {{Row}} object:
>  
> {code:java}
> # Using a list works
> spark.createDataFrame([[
> datetime.date.today(),
> "my_b_value",
> "my_a_value"
> ]], schema)
> # Using a dict also works
> spark.createDataFrame([{
> "date_column": datetime.date.today(),
> "my_b_column": "my_b_value",
> "my_a_column": "my_a_value"
> }], schema){code}
> Correct behaviour is also observed if you have no fields that require 
> serialization; in this example, changing {{date_column}} to {{StringType}} 
> avoids the correctness issue.

[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition

2019-01-24 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751785#comment-16751785
 ] 

Bryan Cutler commented on SPARK-26412:
--

[~mengxr] I think Arrow record batches would be a much more ideal way to 
connect with other frameworks. Making the conversion to Pandas carries some 
overhead and the Arrow format/types are more solidly defined. It is also better 
suited to be used with an iterator - most of the Arrow IPC mechanisms operate 
on streams of record batches. Is this proposal meant to replace SPARK-24579 (SPIP: 
Standardize Optimized Data Exchange between Spark and DL/AI frameworks)?
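
For illustration, a small self-contained sketch (plain pyarrow, not Spark code, assuming a reasonably recent pyarrow) of how an Arrow IPC stream naturally yields an iterator of record batches:

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1.0, 2.0, 3.0])], names=["features"])

# Write a stream of several record batches to an in-memory buffer.
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, batch.schema)
for _ in range(3):
    writer.write_batch(batch)
writer.close()

# Read them back lazily, one batch at a time, as an iterator.
reader = pa.ipc.open_stream(pa.BufferReader(sink.getvalue()))
for record_batch in reader:
    print(record_batch.num_rows)
{code}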

> Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition
> --
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workload. However, user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to batch scope, user need to repeatedly load 
> the same model for every batch in the same python worker process, which is 
> inefficient. I created this JIRA to discuss possible solutions.
> Essentially we need to support "start()" and "finish()" besides "apply". We 
> can either provide those interfaces or simply provide users the iterator of 
> batches in pd.DataFrame and let user code handle it.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration

2019-01-24 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751771#comment-16751771
 ] 

Bryan Cutler commented on SPARK-26410:
--

This could be useful to have, but it does seem a little strange to bind batch 
size to a udf. To me, batch size seems more related to the data being used, and 
merging different batch sizes could complicate the behavior. Still, I can see 
how someone might want to change batch size at different points in a session.

> Support per Pandas UDF configuration
> 
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
> "right" batch size usually depends on the task itself. It would be nice if 
> user can configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row 
> count).
> Besides API, we should also discuss how to merge Pandas UDFs of different 
> configurations. For example,
> {code}
> df.select(predict1(col("features"), predict2(col("features")))
> {code}
> when predict1 requests 100 rows per batch, while predict2 requests 120 rows 
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2019-01-24 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751802#comment-16751802
 ] 

Bryan Cutler commented on SPARK-24579:
--

It would be great to start up this discussion again; I saw that the related JIRAs 
SPARK-24579 and SPARK-26413 have recently been created. I think the ideal way for 
Spark to exchange data is an iterator of Arrow record batches, which could be done 
with an arrow_udf (mentioned in the design here) or through a {{mapPartitions}} 
function. Maybe expanding on the examples touched on in the design doc would help 
with the discussion?
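
Purely as a discussion aid, a hypothetical sketch of the shape such a function could take. Neither an arrow_udf nor a record-batch flavour of {{mapPartitions}} exists in Spark today, so the entry point named below is made up:

{code:python}
import pyarrow as pa

def predict_partition(batches):
    """Consume an iterator of pyarrow.RecordBatch for one partition, yield results."""
    # model = load_model_once()  # hypothetical: heavyweight setup done once per partition
    for batch in batches:
        table = pa.Table.from_batches([batch])  # hand the columnar data to a DL/AI framework
        yield batch                             # yield predictions back as record batches

# Hypothetical hook, for discussion only:
# df.mapPartitionsInArrow(predict_partition)
{code}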

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27276) Increase the minimum pyarrow version to 0.12.0

2019-03-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801250#comment-16801250
 ] 

Bryan Cutler commented on SPARK-27276:
--

[~shaneknapp] this will need an upgrade on Jenkins, so let me know whenever you 
think is a good time to do this - I know you are working out some kinks with 
recent upgrades right now so no rush.  I will try to do the code changes next 
week and then we can coordinate anytime after. Thanks!  cc [~hyukjin.kwon]

> Increase the minimum pyarrow version to 0.12.0
> --
>
> Key: SPARK-27276
> URL: https://issues.apache.org/jira/browse/SPARK-27276
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
> been moving fast and a lot has changed since this version. There are 
> currently many workarounds checking for different versions or disabling 
> specific functionality, and the code is getting ugly and difficult to 
> maintain. Increasing the version will allow cleanup and upgrade the testing 
> environment.
> This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
> updating Jenkins to test against the new version, code cleanup to remove 
> workarounds from older versions.  Users would then need to ensure this 
> version is installed on the cluster.
> There is also a 0.12.1 release, so I will need to check what bugs that fixed 
> to see if that will be a better version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27276) Increase the minimum pyarrow version to 0.12.0

2019-03-25 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27276:


 Summary: Increase the minimum pyarrow version to 0.12.0
 Key: SPARK-27276
 URL: https://issues.apache.org/jira/browse/SPARK-27276
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Bryan Cutler


The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
been moving fast and a lot has changed since this version. There are currently 
many workarounds checking for different versions or disabling specific 
functionality, and the code is getting ugly and difficult to maintain. 
Increasing the version will allow cleanup and upgrade the testing environment.

This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
updating Jenkins to test against the new version, code cleanup to remove 
workarounds from older versions.  Users would then need to ensure this version 
is installed on the cluster.

There is also a 0.12.1 release, so I will need to check what bugs that fixed to 
see if that will be a better version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27276) Increase the minimum pyarrow version to 0.12.0

2019-03-26 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27276:
-
Description: 
The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
been moving fast and a lot has changed since this version. There are currently 
many workarounds checking for different versions or disabling specific 
functionality, and the code is getting ugly and difficult to maintain. 
Increasing the version will allow cleanup and upgrade the testing environment.

This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
updating Jenkins to test against the new version, code cleanup to remove 
workarounds from older versions.  Newer versions of pyarrow have dropped 
support for Python 3.4, so it might be necessary to update to Python 3.5+ in 
Jenkins as well. Users would then need to ensure at least this version of 
pyarrow is installed on the cluster.

There is also a 0.12.1 release, so I will need to check what bugs that fixed to 
see if that will be a better version.

  was:
The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
been moving fast and a lot has changed since this version. There are currently 
many workarounds checking for different versions or disabling specific 
functionality, and the code is getting ugly and difficult to maintain. 
Increasing the version will allow cleanup and upgrade the testing environment.

This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
updating Jenkins to test against the new version, code cleanup to remove 
workarounds from older versions.  Users would then need to ensure this version 
is installed on the cluster.

There is also a 0.12.1 release, so I will need to check what bugs that fixed to 
see if that will be a better version.


> Increase the minimum pyarrow version to 0.12.0
> --
>
> Key: SPARK-27276
> URL: https://issues.apache.org/jira/browse/SPARK-27276
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
> been moving fast and a lot has changed since this version. There are 
> currently many workarounds checking for different versions or disabling 
> specific functionality, and the code is getting ugly and difficult to 
> maintain. Increasing the version will allow cleanup and upgrade the testing 
> environment.
> This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
> updating Jenkins to test against the new version, code cleanup to remove 
> workarounds from older versions.  Newer versions of pyarrow have dropped 
> support for Python 3.4, so it might be necessary to update to Python 3.5+ in 
> Jenkins as well. Users would then need to ensure at least this version of 
> pyarrow is installed on the cluster.
> There is also a 0.12.1 release, so I will need to check what bugs that fixed 
> to see if that will be a better version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27276) Increase the minimum pyarrow version to 0.12.1

2019-04-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27276:
-
Summary: Increase the minimum pyarrow version to 0.12.1  (was: Increase the 
minimum pyarrow version to 0.12.0)

> Increase the minimum pyarrow version to 0.12.1
> --
>
> Key: SPARK-27276
> URL: https://issues.apache.org/jira/browse/SPARK-27276
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
> been moving fast and a lot has changed since this version. There are 
> currently many workarounds checking for different versions or disabling 
> specific functionality, and the code is getting ugly and difficult to 
> maintain. Increasing the version will allow cleanup and upgrade the testing 
> environment.
> This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
> updating Jenkins to test against the new version, code cleanup to remove 
> workarounds from older versions.  Newer versions of pyarrow have dropped 
> support for Python 3.4, so it might be necessary to update to Python 3.5+ in 
> Jenkins as well. Users would then need to ensure at least this version of 
> pyarrow is installed on the cluster.
> There is also a 0.12.1 release, so I will need to check what bugs that fixed 
> to see if that will be a better version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27276) Increase the minimum pyarrow version to 0.12.1

2019-04-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810024#comment-16810024
 ] 

Bryan Cutler commented on SPARK-27276:
--

I think we should use 0.12.1, there was a bug fix ARROW-4582 that might affect 
usage in Spark.

> Increase the minimum pyarrow version to 0.12.1
> --
>
> Key: SPARK-27276
> URL: https://issues.apache.org/jira/browse/SPARK-27276
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
> been moving fast and a lot has changed since this version. There are 
> currently many workarounds checking for different versions or disabling 
> specific functionality, and the code is getting ugly and difficult to 
> maintain. Increasing the version will allow cleanup and upgrade the testing 
> environment.
> This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
> updating Jenkins to test against the new version, code cleanup to remove 
> workarounds from older versions.  Newer versions of pyarrow have dropped 
> support for Python 3.4, so it might be necessary to update to Python 3.5+ in 
> Jenkins as well. Users would then need to ensure at least this version of 
> pyarrow is installed on the cluster.
> There is also a 0.12.1 release, so I will need to check what bugs that fixed 
> to see if that will be a better version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27353) PySpark Row __repr__ bug

2019-04-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810009#comment-16810009
 ] 

Bryan Cutler commented on SPARK-27353:
--

Works for me out of master, can you provide a script to reproduce?

{noformat}
In [1]: from pyspark.sql.types import Row

In [2]: import datetime

In [3]: Row(d=datetime.date.today())
Out[3]: Row(d=datetime.date(2019, 4, 4))

In [4]: repr(Row(d=datetime.date.today()))
Out[4]: 'Row(d=datetime.date(2019, 4, 4))'
{noformat}
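
One note on why this may not reproduce with the snippet above (my reading of the 
traceback in the description, not something stated in the ticket): the reporter 
builds the Row positionally via Row(*row.values), which leaves it without 
__fields__, so __repr__ falls into the join-based branch that requires string 
values. A minimal sketch of that path:

{code}
# Sketch of the failing path, assuming the 2.3/2.4 __repr__ quoted below.
import datetime

from pyspark.sql.types import Row

named = Row(d=datetime.date.today())      # has __fields__, repr() works
positional = Row(datetime.date.today())   # built positionally, no __fields__
repr(positional)  # TypeError: sequence item 0: expected str instance, datetime.date found
{code}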

> PySpark  Row  __repr__ bug
> --
>
> Key: SPARK-27353
> URL: https://issues.apache.org/jira/browse/SPARK-27353
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Ihor Bobak
>Priority: Major
>
> Row class has this implementation of __repr__:
>     def __repr__(self):
>         """Printable representation of Row used in Python REPL."""
>         if hasattr(self, "__fields__"):
>             return "Row(%s)" % ", ".join("%s=%r" % (k, v)
>                                          for k, v in zip(self.__fields__, tuple(self)))
>         else:
>             return "<Row(%s)>" % ", ".join(self)
>  
> the last line fails when you have a datetime.date instance in a row:
> TypeError Traceback (most recent call last)
>  in 
>   2 print(*row.values)
>   3 df_row = Row(*row.values)
> > 4 print(repr(df_row))
>   5 break
>   6 
> E:\spark\spark-2.3.2-bin-without-hadoop\python\pyspark\sql\types.py in 
> __repr__(self)
>    1579  for k, v in 
> zip(self.__fields__, tuple(self)))
>    1580 else:
> -> 1581 return "<Row(%s)>" % ", ".join(self)
>    1582 
>    1583 
> TypeError: sequence item 0: expected str instance, datetime.date found
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests

2019-04-04 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27387:


 Summary: Replace sqlutils assertPandasEqual with Pandas 
assert_frame_equal in tests
 Key: SPARK-27387
 URL: https://issues.apache.org/jira/browse/SPARK-27387
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Tests
Affects Versions: 2.4.1
Reporter: Bryan Cutler


In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant to 
check if 2 pandas.DataFrames are equal but it seems for later versions of 
Pandas, this can fail if the DataFrame has an array column. This method can be 
replaced by {{assert_frame_equal}} from pandas.util.testing.  This is what it 
is meant for and it will give a better assertion message as well.

The test failure I have seen is:

 {noformat}
==
ERROR: test_supported_types 
(pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
--
Traceback (most recent call last):
  File 
"/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py",
 line 128, in test_supported_types
    self.assertPandasEqual(expected1, result1)
  File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, in 
assertPandasEqual
    self.assertTrue(expected.equals(result), msg=msg)
  File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas

...
  File "pandas/_libs/lib.pyx", line 523, in 
pandas._libs.lib.array_equivalent_object
ValueError: The truth value of an array with more than one element is 
ambiguous. Use a.any() or a.all()
 {noformat}
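
As a rough illustration of the proposed replacement (a sketch only; the frames 
below are made up, and the import path assumes the pandas versions discussed 
here, where the helper still lives in pandas.util.testing):

{code}
import pandas as pd
from pandas.util.testing import assert_frame_equal  # pandas.testing in newer releases

expected = pd.DataFrame({"id": [1, 2], "v": [1.0, 2.0]})
result = pd.DataFrame({"id": [1, 2], "v": [1.0, 2.0]})

# Raises an AssertionError with a detailed diff on mismatch, instead of the
# ambiguous truth-value error shown above.
assert_frame_equal(expected, result)
{code}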



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests

2019-04-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810197#comment-16810197
 ] 

Bryan Cutler commented on SPARK-27387:
--

This can be done after the upgrade of pyarrow version to avoid conflicts.

> Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
> --
>
> Key: SPARK-27387
> URL: https://issues.apache.org/jira/browse/SPARK-27387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Priority: Major
>
> In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant 
> to check if 2 pandas.DataFrames are equal but it seems for later versions of 
> Pandas, this can fail if the DataFrame has an array column. This method can 
> be replaced by {{assert_frame_equal}} from pandas.util.testing.  This is what 
> it is meant for and it will give a better assertion message as well.
> The test failure I have seen is:
>  {noformat}
> ==
> ERROR: test_supported_types 
> (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py",
>  line 128, in test_supported_types
>     self.assertPandasEqual(expected1, result1)
>   File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, 
> in assertPandasEqual
>     self.assertTrue(expected.equals(result), msg=msg)
>   File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas
> ...
>   File "pandas/_libs/lib.pyx", line 523, in 
> pandas._libs.lib.array_equivalent_object
> ValueError: The truth value of an array with more than one element is 
> ambiguous. Use a.any() or a.all()
>  {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests

2019-04-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810200#comment-16810200
 ] 

Bryan Cutler commented on SPARK-27387:
--

I can work on this

> Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
> --
>
> Key: SPARK-27387
> URL: https://issues.apache.org/jira/browse/SPARK-27387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Priority: Major
>
> In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant 
> to check if 2 pandas.DataFrames are equal but it seems for later versions of 
> Pandas, this can fail if the DataFrame has an array column. This method can 
> be replaced by {{assert_frame_equal}} from pandas.util.testing.  This is what 
> it is meant for and it will give a better assertion message as well.
> The test failure I have seen is:
>  {noformat}
> ==
> ERROR: test_supported_types 
> (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py",
>  line 128, in test_supported_types
>     self.assertPandasEqual(expected1, result1)
>   File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, 
> in assertPandasEqual
>     self.assertTrue(expected.equals(result), msg=msg)
>   File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas
> ...
>   File "pandas/_libs/lib.pyx", line 523, in 
> pandas._libs.lib.array_equivalent_object
> ValueError: The truth value of an array with more than one element is 
> ambiguous. Use a.any() or a.all()
>  {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-05 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811355#comment-16811355
 ] 

Bryan Cutler commented on SPARK-27389:
--

From the stacktrace, it looks like it's getting this from 
"spark.sql.session.timeZone", which defaults to 
java.util.TimeZone.getDefault().getID().
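
A quick way to inspect and pin that value from PySpark (a sketch that assumes an 
active SparkSession named spark; the zone name is only an example):

{code}
# See which session time zone ends up being handed to pytz, then set it to a
# canonical zone so aliases like 'US/Pacific-New' never reach pytz.
print(spark.conf.get("spark.sql.session.timeZone"))
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{code}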

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  that said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I don't know why that alone would 
> cause this failure only sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in <lambda>
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27293) Setting random seed produces different results in RandomForestRegressor

2019-03-28 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27293:
-
Summary: Setting random seed produces different results in 
RandomForestRegressor  (was: I am interested in finding out if there is a bug 
in the implementation of RandomForests. The Issue is when applying a seed and 
getting different results than other people from my class when applying it to 
the same data. )

> Setting random seed produces different results in RandomForestRegressor
> ---
>
> Key: SPARK-27293
> URL: https://issues.apache.org/jira/browse/SPARK-27293
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Martin Skauen
>Priority: Major
>
> I am calculating the RMSE metric like this:
> {code:java}
> (trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
> from pyspark.ml.regression import RandomForestRegressor
> rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", 
> maxDepth=5, numTrees=3, seed = 313)
> from pyspark.ml.evaluation import RegressionEvaluator
> evaluator = RegressionEvaluator\
> (labelCol="labels", predictionCol="prediction", metricName="rmse")
> rmse = evaluator.evaluate(predictions)
> print("RMSE = %g " % rmse)
> {code}
> I am setting the seed. For seed = 50 and also for other seeds I get exact 
> same RMSE as people from class. I set seed to 313 and it is giving me 
> different value. What could be the issue here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27293) Setting random seed produces different results in RandomForestRegressor

2019-03-28 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27293:
-
Description: 
I am interested in finding out if there is a bug in the implementation of 
RandomForests. The Issue is when applying a seed and getting different results 
than other people from my class when applying it to the same data

I am calculating the RMSE metric like this:
{code:java}
(trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
from pyspark.ml.regression import RandomForestRegressor
rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", 
maxDepth=5, numTrees=3, seed = 313)
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator\
(labelCol="labels", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g " % rmse)
{code}
I am setting the seed. For seed = 50 and also for other seeds I get exact same 
RMSE as people from class. I set seed to 313 and it is giving me different 
value. What could be the issue here?

  was:
I am calculating the RMSE metric like this:
{code:java}
(trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
from pyspark.ml.regression import RandomForestRegressor
rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", 
maxDepth=5, numTrees=3, seed = 313)
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator\
(labelCol="labels", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g " % rmse)
{code}
I am setting the seed. For seed = 50 and also for other seeds I get exact same 
RMSE as people from class. I set seed to 313 and it is giving me different 
value. What could be the issue here?


> Setting random seed produces different results in RandomForestRegressor
> ---
>
> Key: SPARK-27293
> URL: https://issues.apache.org/jira/browse/SPARK-27293
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Martin Skauen
>Priority: Major
>
> I am interested in finding out if there is a bug in the implementation of 
> RandomForests. The Issue is when applying a seed and getting different 
> results than other people from my class when applying it to the same data
> I am calculating the RMSE metric like this:
> {code:java}
> (trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
> from pyspark.ml.regression import RandomForestRegressor
> rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", 
> maxDepth=5, numTrees=3, seed = 313)
> from pyspark.ml.evaluation import RegressionEvaluator
> evaluator = RegressionEvaluator\
> (labelCol="labels", predictionCol="prediction", metricName="rmse")
> rmse = evaluator.evaluate(predictions)
> print("RMSE = %g " % rmse)
> {code}
> I am setting the seed. For seed = 50 and also for other seeds I get exact 
> same RMSE as people from class. I set seed to 313 and it is giving me 
> different value. What could be the issue here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27293) Setting random seed produces different results in RandomForestRegressor

2019-03-28 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27293:
-
Component/s: ML

> Setting random seed produces different results in RandomForestRegressor
> ---
>
> Key: SPARK-27293
> URL: https://issues.apache.org/jira/browse/SPARK-27293
> Project: Spark
>  Issue Type: Question
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Martin Skauen
>Priority: Major
>
> I am interested in finding out if there is a bug in the implementation of 
> RandomForests. The Issue is when applying a seed and getting different 
> results than other people from my class when applying it to the same data
> I am calculating the RMSE metric like this:
> {code:java}
> (trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
> from pyspark.ml.regression import RandomForestRegressor
> rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", 
> maxDepth=5, numTrees=3, seed = 313)
> from pyspark.ml.evaluation import RegressionEvaluator
> evaluator = RegressionEvaluator\
> (labelCol="labels", predictionCol="prediction", metricName="rmse")
> rmse = evaluator.evaluate(predictions)
> print("RMSE = %g " % rmse)
> {code}
> I am setting the seed. For seed = 50 and also for other seeds I get exact 
> same RMSE as people from class. I set seed to 313 and it is giving me 
> different value. What could be the issue here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27293) I am interested in finding out if there is a bug in the implementation of RandomForests. The Issue is when applying a seed and getting different results than other peo

2019-03-28 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804124#comment-16804124
 ] 

Bryan Cutler commented on SPARK-27293:
--

Setting the seed like in your example for randomSplit and the regressor will 
change the input data and algorithm initialization, since these operations are 
non-deterministic. So it is not surprising that the result might end up 
different. If you have sufficient data and get blatantly wrong results as 
compared to another implementation of the algorithm, then there might be an 
issue. From what I can see here, there doesn't seem to be a problem.
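
To make that concrete, here is a small check one could run (a sketch only, 
assuming the DataFrame data with "labels" and "features" columns from the 
snippet below): with the same data, same Spark version and same seed, two runs 
should give identical RMSE, while a different environment or split can 
legitimately differ.

{code}
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor

evaluator = RegressionEvaluator(labelCol="labels", predictionCol="prediction",
                                metricName="rmse")

def rmse_with_seed(seed):
    # Same split seed and same estimator seed on every call.
    train, test = data.randomSplit([0.7, 0.3], seed)
    model = RandomForestRegressor(labelCol="labels", featuresCol="features",
                                  maxDepth=5, numTrees=3, seed=seed).fit(train)
    return evaluator.evaluate(model.transform(test))

# Deterministic within one environment; not guaranteed to match another setup.
assert rmse_with_seed(313) == rmse_with_seed(313)
{code}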

> I am interested in finding out if there is a bug in the implementation of 
> RandomForests. The Issue is when applying a seed and getting different 
> results than other people from my class when applying it to the same data. 
> 
>
> Key: SPARK-27293
> URL: https://issues.apache.org/jira/browse/SPARK-27293
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Martin Skauen
>Priority: Major
>
> I am calculating the RMSE metric like this:
> {code:java}
> (trainingData, testData) = data.randomSplit([0.7, 0.3], 313)
> from pyspark.ml.regression import RandomForestRegressor
> rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", 
> maxDepth=5, numTrees=3, seed = 313)
> from pyspark.ml.evaluation import RegressionEvaluator
> evaluator = RegressionEvaluator\
> (labelCol="labels", predictionCol="prediction", metricName="rmse")
> rmse = evaluator.evaluate(predictions)
> print("RMSE = %g " % rmse)
> {code}
> I am setting the seed. For seed = 50 and also for other seeds I get exact 
> same RMSE as people from class. I set seed to 313 and it is giving me 
> different value. What could be the issue here?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27240) Use pandas DataFrame for struct type argument in Scalar Pandas UDF.

2019-03-25 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27240.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24177
https://github.com/apache/spark/pull/24177

> Use pandas DataFrame for struct type argument in Scalar Pandas UDF.
> ---
>
> Key: SPARK-27240
> URL: https://issues.apache.org/jira/browse/SPARK-27240
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> Now that we support returning pandas DataFrame for struct type in Scalar 
> Pandas UDF.
> If we chain another Pandas UDF after the Scalar Pandas UDF returning pandas 
> DataFrame, the argument of the chained UDF will be pandas DataFrame, but 
> currently we don't support pandas DataFrame as an argument of Scalar Pandas 
> UDF. That means there is an inconsistency between the chained UDF and the 
> single UDF.
> We should support taking pandas DataFrame for struct type argument in Scala 
> Pandas UDF to be consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27240) Use pandas DataFrame for struct type argument in Scalar Pandas UDF.

2019-03-25 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-27240:


Assignee: Takuya Ueshin

> Use pandas DataFrame for struct type argument in Scalar Pandas UDF.
> ---
>
> Key: SPARK-27240
> URL: https://issues.apache.org/jira/browse/SPARK-27240
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> Now that we support returning pandas DataFrame for struct type in Scalar 
> Pandas UDF.
> If we chain another Pandas UDF after the Scalar Pandas UDF returning pandas 
> DataFrame, the argument of the chained UDF will be pandas DataFrame, but 
> currently we don't support pandas DataFrame as an argument of Scalar Pandas 
> UDF. That means there is an inconsistency between the chained UDF and the 
> single UDF.
> We should support taking pandas DataFrame for struct type argument in Scala 
> Pandas UDF to be consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)

2019-02-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777416#comment-16777416
 ] 

Bryan Cutler commented on SPARK-23836:
--

I can work on this

> Support returning StructType to the level support in GroupedMap Arrow's 
> "scalar" UDFS (or similar)
> --
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Major
>
> Currently not all of the supported types can be returned from the scalar 
> pandas UDF type. This means if someone wants to return a struct type doing a 
> map operation right now they either have to do a "junk" groupBy or use the 
> non-vectorized results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25147) GroupedData.apply pandas_udf crashing

2019-02-26 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-25147.
--
Resolution: Cannot Reproduce

Going to resolve this for now, please reopen if the above suggestion does not 
fix the issue

> GroupedData.apply pandas_udf crashing
> -
>
> Key: SPARK-25147
> URL: https://issues.apache.org/jira/browse/SPARK-25147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>Reporter: Mike Sukmanowsky
>Priority: Major
>
> Running the following example taken straight from the docs results in 
> {{org.apache.spark.SparkException: Python worker exited unexpectedly 
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = (
> SparkSession
> .builder
> .appName("pandas_udf")
> .getOrCreate()
> )
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v")
> )
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
> v = pdf.v
> return pdf.assign(v=(v - v.mean()) / v.std())
> (
> df
> .groupby("id")
> .apply(normalize)
> .show()
> )
> {code}
>  See output.log for 
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`

2019-02-28 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780751#comment-16780751
 ] 

Bryan Cutler commented on SPARK-26943:
--

If you can try to reproduce locally, that would be ideal. Maybe you could take 
a sample of your dataset? I wouldn't recommend trying out different versions in 
your cluster.
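
For example, something like the following may be enough to trigger the "(N/A)" 
parse error on a single machine (a sketch only; the fraction, seed and output 
path are arbitrary, and sdf is the DataFrame from the report):

{code}
# Pull a small, reproducible sample so the failure (if data-dependent) can be
# reproduced outside the cluster, then repeat the cache-and-count pattern.
sample = sdf.sample(False, 0.01, 42).cache()
sample.count()
sample.write.mode("overwrite").parquet("/tmp/sdf_sample")
{code}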

> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution

2019-02-20 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773284#comment-16773284
 ] 

Bryan Cutler commented on SPARK-26858:
--

{quote}
(One other possibility I was thinking about batches without Schema is that we 
just send Arrow batch by Arrow batch, deserialize each batch to RecordBatch 
instance, and then construct an Arrow table, which is pretty different from 
Python side and hacky)
{quote}

The point from my previous comment is that you can't deserialize a RecordBatch 
and make a Table without a schema.

I think these options make the most sense:

1) Have the user define a schema beforehand, then you can just send serialized 
RecordBatches back to the driver.

2) Send a complete Arrow stream (schema + RecordBatches) from the executor, 
then merge streams in the driver JVM, discarding duplicate schemas, and send 
one final stream to R.

3) Same as (2) but instead of merging streams, send each separate stream 
through to the R driver where they are read and concatenated into one Table - 
I'm not sure if the Arrow R api will support this though.
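
For reference, a minimal pyarrow sketch of the stream mechanics behind options 
(1) and (2), Python side only since the Arrow R API may differ (the column names 
and values are made up):

{code}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array([0.1, 0.2, 0.3])], ["id", "v"])

# Writer side: a complete stream carries the schema followed by the batches.
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# Reader side: the schema is recovered from the stream, so the batches can be
# concatenated into a single Table without any out-of-band schema.
reader = pa.ipc.open_stream(sink.getvalue())
table = pa.Table.from_batches(list(reader), reader.schema)
{code}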

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---
>
> Key: SPARK-26858
> URL: https://issues.apache.org/jira/browse/SPARK-26858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can 
> omit the schema, and Spark SQL doesn't know the return type before actually 
> execution happens.
> In original code path, it's done via using binary schema. Once gapply is done 
> (SPARK-26761). we can mimic this approach in vectorized gapply to support 
> gapplyCollect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23961) pyspark toLocalIterator throws an exception

2019-03-05 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784843#comment-16784843
 ] 

Bryan Cutler commented on SPARK-23961:
--

I could also reproduce with a nearly identical error using the following

{code}
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, udf
from pyspark.sql.types import *

spark = SparkSession\
.builder\
.appName("toLocalIterator_Test")\
.getOrCreate()

df = spark.range(1 << 16).select(rand())

it = df.toLocalIterator()

print(next(it))
it = None

time.sleep(5)
spark.stop()
{code}

I think there are a couple issues with the way this is currently working. When 
toLocalIterator is called in Python, the Scala side also creates a local 
iterator which immediately starts a loop to consume the entire iterator and 
write it all to Python without any synchronization with the Python iterator. 
Blocking the write operation only happens when the socket receive buffer is 
full.  Small examples work fine if the data all fits in the read buffer, but 
the above code fails because the writing becomes blocked, then the Python 
iterator stops reading and closes the connection, which the Scala side sees as 
an error.  I can work on a fix for this.
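
Until a fix lands, the observations in the description suggest two ways to avoid 
the error (a sketch, not an official workaround; df is the DataFrame from the 
snippet above): consume the iterator completely, or fetch only the needed prefix 
without toLocalIterator.

{code}
# Option 1: full consumption keeps the JVM-side writer from being abandoned.
rows = list(df.toLocalIterator())

# Option 2: if only the first N rows are needed, limit on the JVM side instead.
first_20 = df.limit(20).collect()
{code}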

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a dataframe and use toLocalIterator. If we do not consume all records, 
> it will throw: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records do not throw.  Taking only a subset of the 
> partitions create the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in scala shell
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors

2019-03-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783690#comment-16783690
 ] 

Bryan Cutler commented on SPARK-27039:
--

I was able to reproduce in v2.4.0, but it looks like current master raises an 
error in the driver and does not return an empty Pandas DataFrame. This is 
probably due to some of the recent changes in toPandas() with Arrow enabled.

{noformat}
In [4]: spark.conf.set('spark.sql.execution.arrow.enabled', True)

In [5]: import pyspark.sql.functions as F
   ...: df = spark.range(1000 * 1000)
   ...: df_pd = df.withColumn("test", F.lit("this is a long string that should 
make the resulting dataframe too large for maxRe
   ...: sult which is 1m")).toPandas()
   ...: 
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 1 
tasks (13.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 2 
tasks (26.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted 
due to stage failure: Total size of serialized results of 1 tasks (13.2 MiB) is 
bigger than spark.driver.maxResultSize (1024.0 KiB)
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1938)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1926)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1925)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1925)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:935)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:935)
at scala.Option.foreach(Option.scala:274)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2155)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2104)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2093)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:746)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2008)
at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3300)
at 
org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3265)
at 
org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$2(PythonRDD.scala:442)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at 
org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1(PythonRDD.scala:444)
at 
org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1$adapted(PythonRDD.scala:439)
at 
org.apache.spark.api.python.PythonServer$$anon$3.run(PythonRDD.scala:890)
/home/bryan/git/spark/python/pyspark/sql/dataframe.py:2129: UserWarning: 
toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.enabled' is set to true, but has reached the error 
below and can not continue. Note that 
'spark.sql.execution.arrow.fallback.enabled' does not have an effect on 
failures in the middle of computation.
  
  warnings.warn(msg)
19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 3 
tasks (39.6 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB)
[Stage 0:==>(1 + 7) / 8][Stage 1:>  (0 + 8) / 
8]---
EOFError  Traceback (most recent call last)
 in ()
  1 import pyspark.sql.functions as F
  2 df = spark.range(1000 * 1000)
> 3 df_pd = df.withColumn("test", F.lit("this is a long string that should 
make the resulting dataframe too large for maxResult which is 1m")).toPandas()

/home/bryan/git/spark/python/pyspark/sql/dataframe.pyc in toPandas(self)
   2111 _check_dataframe_localize_timestamps
   2112 import pyarrow
-> 2113 batches = self._collectAsArrow()
   2114 if len(batches) > 0:
   2115 table = 

[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`

2019-02-21 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774566#comment-16774566
 ] 

Bryan Cutler commented on SPARK-26943:
--

Could you please provide a complete script to reproduce? Also, have you tried a 
newer version of Spark other than 2.1.0?

> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality

2019-03-14 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27163:


 Summary: Cleanup and consolidate Pandas UDF functionality
 Key: SPARK-27163
 URL: https://issues.apache.org/jira/browse/SPARK-27163
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Bryan Cutler


Some of the code for Pandas UDFs can be cleaned up and consolidated to remove 
duplicated parts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality

2019-03-14 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27163:
-
Priority: Minor  (was: Major)

> Cleanup and consolidate Pandas UDF functionality
> 
>
> Key: SPARK-27163
> URL: https://issues.apache.org/jira/browse/SPARK-27163
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Some of the code for Pandas UDFs can be cleaned up and consolidated to remove 
> duplicated parts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)

2019-03-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23836.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23900
https://github.com/apache/spark/pull/23900

> Support returning StructType to the level support in GroupedMap Arrow's 
> "scalar" UDFS (or similar)
> --
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently not all of the supported types can be returned from the scalar 
> pandas UDF type. This means if someone wants to return a struct type doing a 
> map operation right now they either have to do a "junk" groupBy or use the 
> non-vectorized results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)

2019-03-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23836:


Assignee: Bryan Cutler

> Support returning StructType to the level support in GroupedMap Arrow's 
> "scalar" UDFS (or similar)
> --
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Bryan Cutler
>Priority: Major
>
> Currently not all of the supported types can be returned from the scalar 
> pandas UDF type. This means if someone wants to return a struct type doing a 
> map operation right now they either have to do a "junk" groupBy or use the 
> non-vectorized results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution

2019-02-19 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772266#comment-16772266
 ] 

Bryan Cutler commented on SPARK-26858:
--

[~hyukjin.kwon] actually {{pyarrow.Table.from_batches}} does require a schema, 
but it's a little tricky.  In python, to create a {{RecordBatch}} object, a 
schema is always required. The schema is then attached to the {{RecordBatch}} 
object, so you can always get it from a batch and do things like 
{{Table.from_batches}} with just a list of batches. Look at the full API 
[here|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_batches]
 and you can see the schema is optional and it just pulls from the first batch. 
Serialized record batches are a bit different and do not contain the schema, 
which is why the stream protocol goes {{Schema, RecordBatch, RecordBatch, ..}}. 
So you can't just send serialized Arrow batches and then build a Table.  
Hopefully that makes sense :P
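
A tiny pyarrow illustration of that point (the values are made up): the 
in-memory batch carries its schema, while the serialized form does not.

{code}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])

# Table.from_batches can omit the schema because it pulls it from the batch.
table = pa.Table.from_batches([batch])
assert table.schema.equals(batch.schema)

# A batch serialized on its own does not include the schema; the reader must
# supply it, which is why the stream protocol sends Schema, RecordBatch, ...
buf = batch.serialize()
rebuilt = pa.ipc.read_record_batch(buf, batch.schema)
{code}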

Back on your sequence diagram at step (9), does R write the data frame row by 
row bytes over the socket? I'm wondering how it gets serialized to see where 
Arrow might best help out.

> Vectorized gapplyCollect, Arrow optimization in native R function execution
> ---
>
> Key: SPARK-26858
> URL: https://issues.apache.org/jira/browse/SPARK-26858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Unlike gapply, gapplyCollect requires additional ser/de steps because it can 
> omit the schema, and Spark SQL doesn't know the return type before actually 
> execution happens.
> In original code path, it's done via using binary schema. Once gapply is done 
> (SPARK-26761). we can mimic this approach in vectorized gapply to support 
> gapplyCollect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-25 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Description: 
Version 0.12.0 includes the following selected fixes/improvements relevant to 
Spark users:

* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are 
the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, 
ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / 
libboost issue), ARROW-3048

complete list 
[here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0]

PySpark requires the following fixes to work with PyArrow 0.12.0

* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes error because 
type is assumed datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name

 

  was:
_This is just a placeholder for now to collect what needs to be fixed when we 
upgrade next time_

Version 0.12.0 includes the following:
 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
 * conversion to date object no longer needed, ARROW-3910

 


> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Version 0.12.0 includes the following selected fixes/improvements relevant to 
> Spark users:
> * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
> * Java, Reduce heap usage for variable width vectors, ARROW-4147
> * Binary identity cast not implemented, ARROW-4101
> * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
> * conversion to date object no longer needed, ARROW-3910
> * Error reading IPC file with no record batches, ARROW-3894
> * Signed to unsigned integer cast yields incorrect results when type sizes 
> are the same, ARROW-3790
> * from_pandas gives incorrect results when converting floating point to bool, 
> ARROW-3428
> * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / 
> libboost issue), ARROW-3048
> complete list 
> [here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0]
> PySpark requires the following fixes to work with PyArrow 0.12.0
> * Encrypted pyspark worker fails due to ChunkedStream missing closed property
> * pyarrow now converts dates as objects by default, which causes error 
> because type is assumed datetime64
> * ArrowTests fails due to difference in raised error message
> * pyarrow.open_stream deprecated
> * tests fail because groupby adds index column with duplicate name
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0

2019-01-25 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-26566:
-
Description: 
Version 0.12.0 includes the following selected fixes/improvements relevant to 
Spark users:

* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are 
the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, 
ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / 
libboost issue), ARROW-3048
* Java update to official Flatbuffers version 1.9.0, ARROW-3175

complete list 
[here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0]

PySpark requires the following fixes to work with PyArrow 0.12.0

* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes error because 
type is assumed datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name

 

  was:
Version 0.12.0 includes the following selected fixes/improvements relevant to 
Spark users:

* Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
* Java, Reduce heap usage for variable width vectors, ARROW-4147
* Binary identity cast not implemented, ARROW-4101
* pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
* conversion to date object no longer needed, ARROW-3910
* Error reading IPC file with no record batches, ARROW-3894
* Signed to unsigned integer cast yields incorrect results when type sizes are 
the same, ARROW-3790
* from_pandas gives incorrect results when converting floating point to bool, 
ARROW-3428
* Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / 
libboost issue), ARROW-3048

complete list 
[here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0]

PySpark requires the following fixes to work with PyArrow 0.12.0

* Encrypted pyspark worker fails due to ChunkedStream missing closed property
* pyarrow now converts dates as objects by default, which causes error because 
type is assumed datetime64
* ArrowTests fails due to difference in raised error message
* pyarrow.open_stream deprecated
* tests fail because groupby adds index column with duplicate name

 


> Upgrade apache/arrow to 0.12.0
> --
>
> Key: SPARK-26566
> URL: https://issues.apache.org/jira/browse/SPARK-26566
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Version 0.12.0 includes the following selected fixes/improvements relevant to 
> Spark users:
> * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258
> * Java, Reduce heap usage for variable width vectors, ARROW-4147
> * Binary identity cast not implemented, ARROW-4101
> * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
> * conversion to date object no longer needed, ARROW-3910
> * Error reading IPC file with no record batches, ARROW-3894
> * Signed to unsigned integer cast yields incorrect results when type sizes 
> are the same, ARROW-3790
> * from_pandas gives incorrect results when converting floating point to bool, 
> ARROW-3428
> * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / 
> libboost issue), ARROW-3048
> * Java update to official Flatbuffers version 1.9.0, ARROW-3175
> complete list 
> [here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0]
> PySpark requires the following fixes to work with PyArrow 0.12.0
> * Encrypted pyspark worker fails due to ChunkedStream missing closed property
> * pyarrow now converts dates as objects by default, which causes error 
> because type is assumed datetime64
> * ArrowTests fails due to difference in raised error message
> * pyarrow.open_stream deprecated
> * tests fail because groupby adds index column with duplicate name
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-09 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813896#comment-16813896
 ] 

Bryan Cutler commented on SPARK-27389:
--

Thanks [~shaneknapp] for the fix. I couldn't come up with any idea why this was 
happening all of a sudden either, but at least we are up and running again!

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of Jenkins. That said, 
> I can't figure out what is wrong. There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-08 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812877#comment-16812877
 ] 

Bryan Cutler commented on SPARK-27389:
--

[~shaneknapp], I had a couple of successful tests with worker-4. Do you know if 
the problem is consistent on certain workers or just random on all of them?

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of Jenkins. That said, 
> I can't figure out what is wrong. There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I dunno why that alone would cause 
> this failure sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests

2019-04-11 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-27387:


Assignee: Bryan Cutler

> Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
> --
>
> Key: SPARK-27387
> URL: https://issues.apache.org/jira/browse/SPARK-27387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant 
> to check whether two pandas.DataFrames are equal, but with later versions of 
> Pandas this can fail if the DataFrame has an array column. This method can be 
> replaced by {{assert_frame_equal}} from pandas.util.testing, which is meant for 
> exactly this comparison and gives a better assertion message as well.
> The test failure I have seen is:
>  {noformat}
> ==
> ERROR: test_supported_types 
> (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py",
>  line 128, in test_supported_types
>     self.assertPandasEqual(expected1, result1)
>   File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, 
> in assertPandasEqual
>     self.assertTrue(expected.equals(result), msg=msg)
>   File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas
> ...
>   File "pandas/_libs/lib.pyx", line 523, in 
> pandas._libs.lib.array_equivalent_object
> ValueError: The truth value of an array with more than one element is 
> ambiguous. Use a.any() or a.all()
>  {noformat}
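
A minimal sketch of the proposed replacement, assuming two small DataFrames 
named {{expected}} and {{result}} with an array column (hypothetical data, not 
from the actual test):

{code:python}
import pandas as pd
from pandas.util.testing import assert_frame_equal

expected = pd.DataFrame({'id': [1, 2], 'v': [[1.0, 2.0], [3.0, 4.0]]})
result = pd.DataFrame({'id': [1, 2], 'v': [[1.0, 2.0], [3.0, 4.0]]})

# Instead of self.assertTrue(expected.equals(result)), which trips over the
# array column, compare with pandas' own test helper; on failure it reports
# which column and row differ.
assert_frame_equal(expected, result)
{code}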



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27463) SPIP: Support Dataframe Cogroup via Pandas UDFs

2019-05-17 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842335#comment-16842335
 ] 

Bryan Cutler commented on SPARK-27463:
--

[~d80tb7] I think you could remove the SPIP label from this and begin work. It 
will require some tweaks to the Python worker and a new API, but not major 
changes and additions like other SPIPs. If others feel differently though, we 
could continue with the SPIP process.
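
To make the shape of the new API concrete, a rough sketch of how a cogrouped 
Pandas UDF might be invoked (illustrative only; {{df1}}, {{df2}} and the method 
names are assumptions here, the final API is what the design doc will settle):

{code:python}
import pandas as pd

# Hypothetical user function: receives the matching group from each DataFrame
# as a pandas.DataFrame and returns a single combined pandas.DataFrame.
def asof_join(left, right):
    return pd.merge_asof(left, right, on="time", by="id")

# Hypothetical invocation: cogroup two grouped Spark DataFrames by key and
# apply the UDF to each pair of groups.
result = (
    df1.groupby("id")
       .cogroup(df2.groupby("id"))
       .apply(asof_join)   # output schema handling is part of the open design
)
{code}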

> SPIP: Support Dataframe Cogroup via Pandas UDFs 
> 
>
> Key: SPARK-27463
> URL: https://issues.apache.org/jira/browse/SPARK-27463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Chris Martin
>Priority: Major
>  Labels: SPIP
>
> Recent work on Pandas UDFs in Spark has allowed for improved interoperability 
> between Pandas and Spark. This proposal aims to extend this by introducing a 
> new Pandas UDF type which would allow a cogroup operation to be applied to two 
> PySpark DataFrames.
> Full details are in the google document linked below.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27712) createDataFrame() reorders row

2019-05-17 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27712.
--
Resolution: Duplicate

> createDataFrame() reorders row
> --
>
> Key: SPARK-27712
> URL: https://issues.apache.org/jira/browse/SPARK-27712
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: emr-5.20.0
> PySpark 2.4.0
> Python 2.7.15
>Reporter: Tim Ludwinski
>Priority: Major
>  Labels: correctness
>
> Executing  the following:
> {code:java}
> my_schema = pyspark.sql.types.StructType([
>     pyspark.sql.types.StructField("B", pyspark.sql.types.StringType(), True),
>     pyspark.sql.types.StructField("A", pyspark.sql.types.StringType(), True)
> ])
> spark.createDataFrame(spark.sparkContext.parallelize([pyspark.sql.Row(A="1", 
> B="2")]), my_schema).collect()
> {code}
> should produce this:
> {code:java}
> [Row(A="1", B="2")]
> {code}
> or this:
> {code:java}
> [Row(B='2', A='1')]
> {code}
> but produces this instead:
> {code:java}
> [Row(B=u'1', A=u'2')]
> {code}
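
Since the underlying cause is the keyword-argument {{Row}} sorting its field 
names alphabetically, a hedged workaround sketch (reusing the {{my_schema}} from 
the snippet above) is to pass plain tuples or a positional Row factory so the 
value order matches the schema:

{code:python}
from pyspark.sql import Row

# Option 1: plain tuples in schema order (B first, then A).
spark.createDataFrame([("2", "1")], schema=my_schema).collect()
# -> [Row(B='2', A='1')]

# Option 2: a positional Row factory fixes the field order explicitly,
# avoiding keyword arguments (which are sorted alphabetically here).
MyRow = Row("B", "A")
spark.createDataFrame([MyRow("2", "1")], schema=my_schema).collect()
# -> [Row(B='2', A='1')]
{code}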



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27805:
-
Affects Version/s: (was: 3.1.0)
   2.4.3

> toPandas does not propagate SparkExceptions with arrow enabled
> --
>
> Key: SPARK-27805
> URL: https://issues.apache.org/jira/browse/SPARK-27805
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: David Vogelbacher
>Assignee: David Vogelbacher
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling {{toPandas}} with arrow enabled, errors encountered during the 
> collect are not propagated to the Python process.
> There is only a very general {{EOFError}} raised.
> Example of behavior with arrow enabled vs. arrow disabled:
> {noformat}
> import traceback
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> def raise_exception():
>   raise Exception("My error")
> error_udf = udf(raise_exception, IntegerType())
> df = spark.range(3).toDF("i").withColumn("x", error_udf())
> try:
> df.toPandas()
> except:
> no_arrow_exception = traceback.format_exc()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> try:
> df.toPandas()
> except:
> arrow_exception = traceback.format_exc()
> print no_arrow_exception
> print arrow_exception
> {noformat}
> {{arrow_exception}} gives as output:
> {noformat}
> >>> print arrow_exception
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2143, in toPandas
> batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2205, in _collectAsArrow
> results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 210, in load_stream
> num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 810, in read_int
> raise EOFError
> EOFError
> {noformat}
> {{no_arrow_exception}} gives as output:
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2166, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 516, in collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, 
> in deco
> return f(*a, **kw)
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o38.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 428, in main
> process()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 423, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 438, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 141, in dump_stream
> for obj in iterator:
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 427, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 86, in 
> return lambda *a: f(*a)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 2, in raise_exception
> Exception: My error
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27805.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24677
[https://github.com/apache/spark/pull/24677]

> toPandas does not propagate SparkExceptions with arrow enabled
> --
>
> Key: SPARK-27805
> URL: https://issues.apache.org/jira/browse/SPARK-27805
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: David Vogelbacher
>Assignee: David Vogelbacher
>Priority: Major
> Fix For: 3.0.0
>
>
> When calling {{toPandas}} with arrow enabled, errors encountered during the 
> collect are not propagated to the Python process.
> There is only a very general {{EOFError}} raised.
> Example of behavior with arrow enabled vs. arrow disabled:
> {noformat}
> import traceback
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> def raise_exception():
>   raise Exception("My error")
> error_udf = udf(raise_exception, IntegerType())
> df = spark.range(3).toDF("i").withColumn("x", error_udf())
> try:
> df.toPandas()
> except:
> no_arrow_exception = traceback.format_exc()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> try:
> df.toPandas()
> except:
> arrow_exception = traceback.format_exc()
> print no_arrow_exception
> print arrow_exception
> {noformat}
> {{arrow_exception}} gives as output:
> {noformat}
> >>> print arrow_exception
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2143, in toPandas
> batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2205, in _collectAsArrow
> results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 210, in load_stream
> num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 810, in read_int
> raise EOFError
> EOFError
> {noformat}
> {{no_arrow_exception}} gives as output:
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2166, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 516, in collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, 
> in deco
> return f(*a, **kw)
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o38.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 428, in main
> process()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 423, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 438, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 141, in dump_stream
> for obj in iterator:
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 427, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 86, in 
> return lambda *a: f(*a)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 2, in raise_exception
> Exception: My error
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Assigned] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-27805:


Assignee: David Vogelbacher

> toPandas does not propagate SparkExceptions with arrow enabled
> --
>
> Key: SPARK-27805
> URL: https://issues.apache.org/jira/browse/SPARK-27805
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: David Vogelbacher
>Assignee: David Vogelbacher
>Priority: Major
>
> When calling {{toPandas}} with arrow enabled, errors encountered during the 
> collect are not propagated to the Python process.
> There is only a very general {{EOFError}} raised.
> Example of behavior with arrow enabled vs. arrow disabled:
> {noformat}
> import traceback
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> def raise_exception():
>   raise Exception("My error")
> error_udf = udf(raise_exception, IntegerType())
> df = spark.range(3).toDF("i").withColumn("x", error_udf())
> try:
> df.toPandas()
> except:
> no_arrow_exception = traceback.format_exc()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> try:
> df.toPandas()
> except:
> arrow_exception = traceback.format_exc()
> print no_arrow_exception
> print arrow_exception
> {noformat}
> {{arrow_exception}} gives as output:
> {noformat}
> >>> print arrow_exception
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2143, in toPandas
> batches = self._collectAsArrow()
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2205, in _collectAsArrow
> results = list(_load_from_socket(sock_info, ArrowCollectSerializer()))
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 210, in load_stream
> num = read_int(stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 810, in read_int
> raise EOFError
> EOFError
> {noformat}
> {{no_arrow_exception}} gives as output:
> {noformat}
> Traceback (most recent call last):
>   File "", line 2, in 
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 2166, in toPandas
> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line 
> 516, in collect
> sock_info = self._jdf.collectToPython()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, 
> in deco
> return f(*a, **kw)
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o38.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 
> in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 
> (TID 7, localhost, executor driver): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 428, in main
> process()
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 423, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 438, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 141, in dump_stream
> for obj in iterator:
>   File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line 
> 427, in _batched
> for item in iterator:
>   File "", line 1, in 
>   File 
> "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", 
> line 86, in 
> return lambda *a: f(*a)
>   File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in 
> wrapper
> return f(*args, **kwargs)
>   File "", line 2, in raise_exception
> Exception: My error
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855969#comment-16855969
 ] 

Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:13 PM:
--

Linked to a similar problem with Python {{Row}} class, SPARK-22232


was (Author: bryanc):
Another problem with Python {{Row}} class

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
> _createFromLocal data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
>  data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
> toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in 
> toInternal return self.dataType.toInternal(obj) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in 
> toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File 
> 

[jira] [Resolved] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27939.
--
Resolution: Not A Problem

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
> _createFromLocal data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
>  data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
> toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in 
> toInternal return self.dataType.toInternal(obj) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in 
> toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 167, 
> in serialize raise TypeError("cannot serialize %r of type %r" % (obj, 
> type(obj))) TypeError: cannot serialize 143000 of type {code}
> I don't get 

[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855969#comment-16855969
 ] 

Bryan Cutler commented on SPARK-27939:
--

Another problem with Python {{Row}} class

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
> _createFromLocal data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in 
>  data = [schema.toInternal(row) for row in data] File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
> toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in 
>  for f, v, c in zip(self.fields, obj, self._needConversion)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in 
> toInternal return self.dataType.toInternal(obj) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in 
> toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File 
> "/opt/spark/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 167, 
> in serialize raise TypeError("cannot serialize %r of type %r" % (obj, 
> type(obj))) TypeError: 

[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855966#comment-16855966
 ] 

Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:11 PM:
--

The problem is the {{Row}} class sorts the field names alphabetically, which 
puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), 
SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']{noformat}
This is by design, but it is not intuitive and has caused lots of problems. 
Hopefully, we can improve this for Spark 3.0.0

You can either just specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 
143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or if you want to have keywords, then define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), 
SALESCLOSEPRICE=143000){noformat}


was (Author: bryanc):
The problem is the {{Row}} class sorts the field names alphabetically, which 
puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), 
SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']{noformat}
This is by design, but it is not intuitive and has caused lots of problems.

You can either just specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 
143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or if you want to have keywords, then define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), 
SALESCLOSEPRICE=143000){noformat}

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> 

[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT

2019-06-04 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855966#comment-16855966
 ] 

Bryan Cutler commented on SPARK-27939:
--

The problem is the {{Row}} class sorts the field names alphabetically, which 
puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), 
SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']{noformat}
This is by design, but it is not intuitive and has caused lots of problems.

You can either just specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 
143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or if you want to have keywords, then define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), 
SALESCLOSEPRICE=143000){noformat}

> Defining a schema with VectorUDT
> 
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Johannes Schaffrath
>Priority: Minor
>
> When I try to define a dataframe schema which has a VectorUDT field, I run 
> into an error when the VectorUDT field is not the last element of the 
> StructType list.
> The following example causes the error below:
> {code:java}
> // from pyspark.sql import functions as F
> from pyspark.sql import types as T
> from pyspark.sql import Row
> from pyspark.ml.linalg import VectorUDT, SparseVector
> #VectorUDT should be the last structfield
> train_schema = T.StructType([
>     T.StructField('features', VectorUDT()),
>     T.StructField('SALESCLOSEPRICE', T.IntegerType())
>     ])
>   
> train_df = spark.createDataFrame(
> [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, 
> 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: 
> 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, 
> 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, 
> 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: 
> -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: 
> 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: 
> 1.0, 133: 1.0}), SALESCLOSEPRICE=143000),
>  Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, 
> 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: 
> 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, 
> 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 3.83, 30: 3.83, 31: 
> 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, 
> 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: 
> -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, 
> 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 
> 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), 
> SALESCLOSEPRICE=19),
>  Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: 
> 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: 
> 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, 
> 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, 
> 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, 
> 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: 
> 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, 
> 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), 
> SALESCLOSEPRICE=225000)
>  ], schema=train_schema)
>  
> train_df.printSchema()
> train_df.show()
> {code}
> Error  message:
> {code:java}
> // Fail to execute line 17: ], schema=train_schema) Traceback (most recent 
> call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in 
>  exec(code, _zcUserQueryNameSpace) File "", line 17, in 
>  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", 
> line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, 
> data), schema) File 
> 

[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Description: 
Both SPARK-27805 and SPARK-27548 identified an issue where errors in a Spark job 
are not propagated to Python. This is because toLocalIterator() and toPandas() 
with Arrow enabled run Spark jobs asynchronously in a background thread, after 
creating the socket connection info. The fix for these was to catch a 
SparkException if the job errored and then send the exception through the 
pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way if the serving thread throws an exception, it will be 
propagated on the synchronization call.
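
A rough sketch of what this would look like on the Python side, with 
hypothetical names for the JVM handle and helper (none of these identifiers are 
from an actual patch):

{code:python}
def _load_from_socket_and_join(sock_info, serializer):
    # Hypothetical shape: the JVM returns (port, auth_secret, server_handle),
    # where server_handle exposes the future of the background serving thread.
    port, auth_secret, server_handle = sock_info
    sockfile = _create_socket_file(port, auth_secret)  # hypothetical helper
    try:
        for item in serializer.load_stream(sockfile):
            yield item
    finally:
        # Join the JVM serving thread; if the Spark job failed, its exception
        # is re-raised here instead of surfacing as a bare EOFError.
        server_handle.getResult()
{code}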

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to synchronize on the serving thread 
> future. That way if the serving thread throws an exception, it will be 
> propagated on the synchronization call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Environment: (was: Both SPARK-27805 and SPARK-27548 identified an issue 
that errors in a Spark job are not propagated to Python. This is because 
toLocalIterator() and toPandas() with Arrow enabled run Spark jobs 
asynchronously in a background thread, after creating the socket connection 
info. The fix for these was to catch a SparkException if the job errored and 
then send the exception through the pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way if the serving thread throws an exception, it will be 
propagated on the synchronization call.)

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Bryan Cutler
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27992:


 Summary: PySpark socket server should sync with JVM connection 
thread future
 Key: SPARK-27992
 URL: https://issues.apache.org/jira/browse/SPARK-27992
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.3
 Environment: Both SPARK-27805 and SPARK-27548 identified an issue that 
errors in a Spark job are not propagated to Python. This is because 
toLocalIterator() and toPandas() with Arrow enabled run Spark jobs 
asynchronously in a background thread, after creating the socket connection 
info. The fix for these was to catch a SparkException if the job errored and 
then send the exception through the pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way if the serving thread throws an exception, it will be 
propagated on the synchronization call.
Reporter: Bryan Cutler






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Affects Version/s: (was: 2.4.3)
   3.0.0

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to synchronize on the serving thread 
> future. That way if the serving thread throws an exception, it will be 
> propagated on the synchronization call.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27992:
-
Description: 
Both SPARK-27805 and SPARK-27548 identified an issue where errors in a Spark job 
are not propagated to Python. This is because toLocalIterator() and toPandas() 
with Arrow enabled run Spark jobs asynchronously in a background thread, after 
creating the socket connection info. The fix for these was to catch a 
SparkException when the job failed and then send the exception through the 
pyspark serializer.

A better fix would be to allow Python to await on the serving thread future and 
join the thread. That way, if the serving thread throws an exception, it will be 
propagated on the call to awaitResult.

  was:
Both SPARK-27805 and SPARK-27548 identified an issue where errors in a Spark job 
are not propagated to Python. This is because toLocalIterator() and toPandas() 
with Arrow enabled run Spark jobs asynchronously in a background thread, after 
creating the socket connection info. The fix for these was to catch a 
SparkException when the job failed and then send the exception through the 
pyspark serializer.

A better fix would be to allow Python to synchronize on the serving thread 
future. That way, if the serving thread throws an exception, it will be 
propagated on the synchronization call.


> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue where errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException when the job failed and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to await on the serving thread future 
> and join the thread. That way, if the serving thread throws an exception, it 
> will be propagated on the call to awaitResult.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT

2019-06-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-28003.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24844
[https://github.com/apache/spark/pull/24844]

> spark.createDataFrame with Arrow doesn't work with pandas.NaT 
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:python}
> import pandas as pd
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with Arrow enabled; a fuller repro sketch follows.
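
The Arrow code path has to be switched on for the repro to hit the failing branch. The sketch below is illustrative and not taken from the report: the session setup and app name are assumptions, and spark.sql.execution.arrow.enabled is the Spark 2.x configuration name (later releases rename it to spark.sql.execution.arrow.pyspark.enabled).

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nat-repro").getOrCreate()
# Enable Arrow-based conversion between pandas and Spark DataFrames.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"time": [pd.NaT, pd.Timestamp("2019-06-11")] * 100})

# Fails on the affected versions; once fixed, NaT should arrive as a
# null timestamp in the resulting Spark DataFrame.
df = spark.createDataFrame(pdf)
df.show(2)
{code}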



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT

2019-06-24 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-28003:


Assignee: Li Jin

> spark.createDataFrame with Arrow doesn't work with pandas.NaT 
> --
>
> Key: SPARK-28003
> URL: https://issues.apache.org/jira/browse/SPARK-28003
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
>
> {code:python}
> import pandas as pd
> dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100
> pdf1 = pd.DataFrame({'time': dt1})
> df1 = self.spark.createDataFrame(pdf1)
> {code}
> The example above doesn't work with Arrow enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


