[jira] [Resolved] (SPARK-25801) pandas_udf grouped_map fails with input dataframe with more than 255 columns
[ https://issues.apache.org/jira/browse/SPARK-25801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler resolved SPARK-25801.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

> pandas_udf grouped_map fails with input dataframe with more than 255 columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25801
>                 URL: https://issues.apache.org/jira/browse/SPARK-25801
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>         Environment: python 2.7
> pyspark 2.3.0
>            Reporter: Frederik
>            Priority: Major
>             Fix For: 2.4.0
>
>
> Hi,
> I'm using a pandas_udf to deploy a model to predict all samples in a Spark
> dataframe. For this I use a UDF as follows:
>
>     @pandas_udf("scores double", PandasUDFType.GROUPED_MAP)
>     def predict_scores(pdf):
>         score_values = model.predict_proba(pdf)[:,1]
>         return pd.DataFrame({'scores': score_values})
>
> So it takes a dataframe, predicts the probability of being positive
> according to an sklearn model for each row, and returns this as a single column.
> This works great on a random groupBy, e.g.:
>
>     sdf_to_score.groupBy(sf.col('age')).apply(predict_scores)
>
> as long as the dataframe has <255 columns. When the input dataframe has more
> than 255 columns (thus features in my model), I get:
>
>     org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>       File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
>         func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
>       File "path/to/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 148, in read_udfs
>         mapper = eval(mapper_str, udfs)
>       File "<string>", line 1
>     SyntaxError: more than 255 arguments
>
> This seems to be related to Python's general limitation of not allowing
> more than 255 arguments for a function?
>
> Is this a bug or is there a straightforward way around this problem?
>
> Regards,
> Frederik

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
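The traceback in SPARK-25801 shows the worker calling eval() on a generated mapper string. The sketch below is a simplified, hypothetical illustration of why such a string can hit the argument limit and how star-unpacking sidesteps it. The names build_mapper and count_args are invented for this example, and this is not necessarily how the fix in 2.4.0 was implemented.

```python
def build_mapper(udf, num_columns, unpack):
    """Build a mapper that feeds a row `a` of `num_columns` values to `udf`.

    unpack=False spells out every argument in the generated source text;
    unpack=True passes the whole row with star-unpacking instead.
    """
    if unpack:
        source = "lambda a: f(*a)"
    else:
        args = ", ".join("a[%d]" % i for i in range(num_columns))
        source = "lambda a: f(%s)" % args
    return eval(source, {"f": udf})


def count_args(*columns):
    # Stand-in for a UDF: reports how many columns it received.
    return len(columns)


row = list(range(300))

# The star-unpacking form never writes 300 explicit arguments into source code.
wide_mapper = build_mapper(count_args, len(row), unpack=True)
print(wide_mapper(row))

try:
    explicit_mapper = build_mapper(count_args, len(row), unpack=False)
    print(explicit_mapper(row))
except SyntaxError as exc:
    # Raised by interpreters that cap explicit call arguments at 255
    # (Python versions before 3.7).
    print("explicit form rejected:", exc)
```

On Python 3.7 and later both forms should run, since that release removed the 255-argument cap; on the reporter's Python 2.7 only the star-unpacking form would be expected to compile.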
[jira] [Updated] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler updated SPARK-22809:
---------------------------------
    Fix Version/s: 2.3.2

> pyspark is sensitive to imports with dots
> -----------------------------------------
>
>                 Key: SPARK-22809
>                 URL: https://issues.apache.org/jira/browse/SPARK-22809
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Cricket Temple
>            Assignee: holdenk
>            Priority: Major
>             Fix For: 2.3.2, 2.4.0
>
>
> User code can fail with dotted imports. Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
>     import matplotlib.pyplot as plt
>     for f, data in df_pd.groupby("freq"):
>         plt.plot(*data[['x','y']].values.T)
>     plt.show()
> except Exception:
>     print("I guess we can't plot anything")
> def mymap(x, interp_fn):
>     df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
>     return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> # Aliased import: works
> result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6))
> # Dotted reference inside the lambda: fails on the affected versions
> try:
>     result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
>     raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError as e:
>     print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(list(zip(*result))[1]), np.zeros(len(freq__)), atol=1e-6))
> {noformat}
[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots
[ https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661502#comment-16661502 ]

Bryan Cutler commented on SPARK-22809:
--------------------------------------

Sure, I probably shouldn't have tested out of the branches. Running tests again from IPython with Python 3.6.6:

*v2.2.2* - Error is raised
*v2.3.2* - Working
*v2.4.0-rc4* - Working

From those results, it seems like SPARK-21070 most likely fixed it
[jira] [Comment Edited] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16675866#comment-16675866 ]

Bryan Cutler edited comment on SPARK-25079 at 11/5/18 11:09 PM:
----------------------------------------------------------------

Sounds like a good plan [~shaneknapp]! The instances of python3.4 in SparkLauncherSuite and SparkSubmitSuite look like they are for testing that the env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be set to different Python versions. So I don't think they need to be changed, but if you want to, maybe change python3.4 -> python3.5 and also python3.5 -> python3.6 to keep them different. Everything else looks good!

> [PYTHON] upgrade python 3.4 -> 3.5
> ----------------------------------
>
>                 Key: SPARK-25079
>                 URL: https://issues.apache.org/jira/browse/SPARK-25079
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, PySpark
>    Affects Versions: 2.3.1
>            Reporter: shane knapp
>            Assignee: shane knapp
>            Priority: Major
>
> for the impending arrow upgrade
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python
> 3.4 -> 3.5.
> i have been testing this here:
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and
> upgrade centos workers to python3.5
> 4) simultaneously do the following:
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that
> points to python3.5 (this is currently being tested here:
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69])
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink
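The symlink in steps 4 and 6 of the SPARK-25079 methodology can be sketched as follows. This is only an illustration run against a throwaway temporary directory: bin_dir and the empty python3.5 placeholder file are assumptions made for the example, and the real /home/anaconda/envs/py3k/bin directory named in the ticket is not touched.

```python
import os
import tempfile

# Hypothetical stand-in for the environment's bin directory.
bin_dir = tempfile.mkdtemp(prefix="py3k-bin-")

# Pretend python3.5 has just been installed (an empty placeholder file here).
new_python = os.path.join(bin_dir, "python3.5")
with open(new_python, "w") as handle:
    handle.write("")

# Step 4: keep the old name working by pointing python3.4 at python3.5.
old_python = os.path.join(bin_dir, "python3.4")
os.symlink("python3.5", old_python)
resolved_during_transition = os.path.realpath(old_python)

# Step 6: once every branch asks for python3.5, drop the compatibility link.
os.remove(old_python)
link_still_present = os.path.lexists(old_python)

print(os.path.basename(resolved_during_transition), link_still_present)
```

A relative link target is used so the link keeps resolving if the directory is moved; removing the link leaves the python3.5 file itself in place.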
[jira] [Commented] (SPARK-25344) Break large PySpark unittests into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685613#comment-16685613 ]

Bryan Cutler commented on SPARK-25344:
--------------------------------------

[~hyukjin.kwon] no problem, I can take on ML and MLlib

> Break large PySpark unittests into smaller files
> ------------------------------------------------
>
>                 Key: SPARK-25344
>                 URL: https://issues.apache.org/jira/browse/SPARK-25344
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Imran Rashid
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> We've got a ton of tests in one humongous tests.py file, rather than breaking
> it out into smaller files.
> Having one huge file doesn't seem great for code organization, and it also
> makes the test parallelization in run-tests.py not work as well. On my
> laptop, tests.py takes 150s, and the next longest test file takes only 20s.
> There are similarly large files in other pyspark modules, e.g. sql/tests.py,
> ml/tests.py, mllib/tests.py, streaming/tests.py.
> It seems that at least for some of these files, it's already broken into
> independent test classes, so it shouldn't be too hard to just move them into
> their own files.
> We could pick one example and follow it. The current style looks closer to
> the NumPy structure and looks easier to follow.
[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643762#comment-16643762 ]

Bryan Cutler commented on SPARK-25344:
--------------------------------------

No I don't have strong feelings, my only preference was to put the tests in a subdir. I'm sure it's quite a lot of work, so however you see fit [~hyukjin.kwon] is fine! If you are planning on breaking this up into tasks, I can try to help out as well.
[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635762#comment-16635762 ]

Bryan Cutler commented on SPARK-25461:
--------------------------------------

Thanks for looking into this [~viirya]! You are right that the above udf returns a float64 instead of boolean. I'm not sure what the expected cast from float to bool should be, but it does seem like pyarrow might be doing something wrong here. I'll look into it some more and raise an issue there if so.

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script:
>
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A'))
>                  )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input
> column does not contain None.
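The point in the comment on SPARK-25461 that the UDF does not actually return booleans can be checked in plain pandas, without Spark. The sketch below uses a small hypothetical column; is_boolean_result is a helper invented here, and the exact non-boolean dtype that where() produces is left unasserted because it varies across pandas versions.

```python
import pandas as pd

# Hypothetical input column: one missing value among floats.
column = pd.Series([None, 1.0, 2.0, 3.0])

# The pattern from the report: mask the comparison where the input is null.
# The masked slot becomes NaN, which a plain bool Series cannot hold.
masked = (column >= 2).where(column.notnull())

# Null-free alternative: comparisons against NaN are already False.
strict = (column >= 2) & column.notnull()


def is_boolean_result(series):
    """Check a result before returning it from a UDF declared as BooleanType."""
    return series.dtype == bool


print(masked.dtype, is_boolean_result(masked))
print(strict.dtype, is_boolean_result(strict))
```

Note that the alternative changes semantics: rows with a null input come back as False rather than null, so it only fits cases where that is acceptable.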
[jira] [Comment Edited] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637626#comment-16637626 ]

Bryan Cutler edited comment on SPARK-25461 at 10/3/18 11:53 PM:
----------------------------------------------------------------

I filed ARROW-3428, which deals with the incorrect cast from float to bool
[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None
[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642379#comment-16642379 ]

Bryan Cutler commented on SPARK-25461:
--------------------------------------

Just wanted to add that the resolution here added a note for the user to verify the return type and data is correct. The actual fix for correct data when converting to/from booleans is on the Arrow side and won't be available until Spark upgrades pyarrow to version 0.12.0+.

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 3.0.0
[jira] [Assigned] (SPARK-25471) Fix tests for Python 3.6 with Pandas 0.23+
[ https://issues.apache.org/jira/browse/SPARK-25471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Cutler reassigned SPARK-25471:
------------------------------------
    Assignee:     (was: Bryan Cutler)

> Fix tests for Python 3.6 with Pandas 0.23+
> ------------------------------------------
>
>                 Key: SPARK-25471
>                 URL: https://issues.apache.org/jira/browse/SPARK-25471
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Tests
>    Affects Versions: 2.4.0
>            Reporter: Bryan Cutler
>            Priority: Major
>
> Running pyspark tests causes at least 1 error when using Python 3.6 and
> Pandas 0.23 or higher. This is because the Pandas DataFrame constructor can
> now create columns in the order they are defined, whereas earlier versions
> might put them in alphabetical order. This leads to the following failure:
> {noformat}
> ======================================================================
> ERROR: test_create_dataframe_from_pandas_with_timestamp (pyspark.sql.tests.SQLTests)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/sql/tests.py", line 3275, in test_create_dataframe_from_pandas_with_timestamp
>     df = self.spark.createDataFrame(pdf, schema="d date, ts timestamp")
>   File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
>     rdd, schema = self._createFromLocal(map(prepare, data), schema)
>   File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 413, in _createFromLocal
>     data = list(data)
>   File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 730, in prepare
>     verify_func(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in verify
>     verify_value(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1370, in verify_struct
>     verifier(v)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in verify
>     verify_value(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1383, in verify_default
>     verify_acceptable_types(obj)
>   File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1278, in verify_acceptable_types
>     % (dataType, obj, type(obj))))
> TypeError: field ts: TimestampType can not accept object datetime.date(2018, 9, 19) in type <class 'datetime.date'>
> {noformat}
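The column-order effect described in SPARK-25471 can be illustrated without pandas or Spark. The sketch below is a hypothetical model: record stands in for the dict passed to the DataFrame constructor, schema stands in for the string "d date, ts timestamp", and the accepted-type tuples are an assumption made for illustration rather than a copy of Spark's verifier.

```python
import datetime

# Hypothetical record, with keys written in the order ts, d.
record = {
    "ts": datetime.datetime(2018, 9, 19, 12, 0, 0),
    "d": datetime.date(2018, 9, 19),
}

# Stand-in for the schema "d date, ts timestamp": field name and the exact
# Python types each field is assumed to accept.
schema = [
    ("d", (datetime.date, datetime.datetime)),
    ("ts", (datetime.datetime,)),
]


def mismatched_fields(column_names):
    """Pair columns with schema fields by position and return the fields
    that received a value of a type they do not accept."""
    bad = []
    for (field, accepted), column in zip(schema, column_names):
        if type(record[column]) not in accepted:
            bad.append(field)
    return bad


# Older pandas sorted dict keys, so the columns arrived as ['d', 'ts'].
alphabetical = sorted(record)
# Python 3.6 with Pandas 0.23+ keeps the order in which the keys were written.
as_defined = list(record)

print(mismatched_fields(alphabetical))
print(mismatched_fields(as_defined))
```

With alphabetical columns every field lines up; with definition order the ts field is handed the date value, which mirrors the TypeError in the traceback. Selecting columns by name before applying a positional schema avoids depending on either ordering.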
[jira] [Created] (SPARK-25471) Fix tests for Python 3.6 with Pandas 0.23+
Bryan Cutler created SPARK-25471:

Summary: Fix tests for Python 3.6 with Pandas 0.23+
Key: SPARK-25471
URL: https://issues.apache.org/jira/browse/SPARK-25471
Project: Spark
Issue Type: Bug
Components: PySpark, Tests
Affects Versions: 2.4.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler

Running the pyspark tests causes at least 1 error when using Python 3.6 and Pandas 0.23 or higher. This is because the Pandas DataFrame constructor can create columns in the order they are defined, whereas earlier versions might order them alphabetically. This leads to the following failure:

{noformat}
======
ERROR: test_create_dataframe_from_pandas_with_timestamp (pyspark.sql.tests.SQLTests)
------
Traceback (most recent call last):
  File "/home/bryan/git/spark/python/pyspark/sql/tests.py", line 3275, in test_create_dataframe_from_pandas_with_timestamp
    df = self.spark.createDataFrame(pdf, schema="d date, ts timestamp")
  File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 413, in _createFromLocal
    data = list(data)
  File "/home/bryan/git/spark/python/pyspark/sql/session.py", line 730, in prepare
    verify_func(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in verify
    verify_value(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1370, in verify_struct
    verifier(v)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1389, in verify
    verify_value(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1383, in verify_default
    verify_acceptable_types(obj)
  File "/home/bryan/git/spark/python/pyspark/sql/types.py", line 1278, in verify_acceptable_types
    % (dataType, obj, type(obj
TypeError: field ts: TimestampType can not accept object datetime.date(2018, 9, 19) in type
{noformat}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
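The ordering problem described in SPARK-25471 can be illustrated without Spark or Pandas. The following is a minimal sketch using only the standard library; `record`, `values_in_order` and the other names are hypothetical and are not taken from the Spark test suite. It shows how a value of the wrong type can end up under the `ts` field when values are matched to a schema by position, and how selecting values by the schema's own names avoids this.

```python
import datetime

# Column order the test's schema string "d date, ts timestamp" expects.
schema_names = ["d", "ts"]

# Hypothetical record whose keys were written with "ts" first.
record = {
    "ts": datetime.datetime(2018, 9, 19, 12, 0, 0),
    "d": datetime.date(2018, 9, 19),
}


def values_in_order(rec, column_order):
    """Return the record's values laid out in the given column order."""
    return tuple(rec[name] for name in column_order)


# Order in which the keys were defined (guaranteed from Python 3.7 on,
# a CPython implementation detail in 3.6).
defined_order = list(record)
# Alphabetical order, as older library versions might produce.
sorted_order = sorted(record)

# Positional matching against the schema breaks with the defined order:
# the second slot (schema field "ts") receives a plain date.
mismatched = values_in_order(record, defined_order)
# Selecting by the schema's own names is independent of key order.
aligned = values_in_order(record, schema_names)

print(dict(zip(schema_names, mismatched)))
print(dict(zip(schema_names, aligned)))
```

Selecting columns by name before handing data to a positional consumer is one way to make such a test independent of the library version; whether the actual fix took this form is not stated in the message above.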
[jira] [Commented] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code
[ https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626148#comment-16626148 ] Bryan Cutler commented on SPARK-25432: -- moved description :) > Consider if using standard getOrCreate from PySpark into JVM SparkSession > would simplify code > - > > Key: SPARK-25432 > URL: https://issues.apache.org/jira/browse/SPARK-25432 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Trivial > > As we saw in [https://github.com/apache/spark/pull/22295/files] the logic can > get a bit out of sync. It _might_ make sense to try and simplify this so > there's less duplicated logic in Python & Scala around session set up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code
[ https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-25432: - Description: As we saw in [https://github.com/apache/spark/pull/22295/files] the logic can get a bit out of sync. It _might_ make sense to try and simplify this so there's less duplicated logic in Python & Scala around session set up. > Consider if using standard getOrCreate from PySpark into JVM SparkSession > would simplify code > - > > Key: SPARK-25432 > URL: https://issues.apache.org/jira/browse/SPARK-25432 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Trivial > > As we saw in [https://github.com/apache/spark/pull/22295/files] the logic can > get a bit out of sync. It _might_ make sense to try and simplify this so > there's less duplicated logic in Python & Scala around session set up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code
[ https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-25432: - Environment: (was: As we saw in [https://github.com/apache/spark/pull/22295/files] the logic can get a bit out of sync. It _might_ make sense to try and simplify this so there's less duplicated logic in Python & Scala around session set up.) > Consider if using standard getOrCreate from PySpark into JVM SparkSession > would simplify code > - > > Key: SPARK-25432 > URL: https://issues.apache.org/jira/browse/SPARK-25432 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
[ https://issues.apache.org/jira/browse/SPARK-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629572#comment-16629572 ]

Bryan Cutler commented on SPARK-25351:
--

Hi [~pgadige], yes please go ahead with this issue! When creating a DataFrame from Pandas without Arrow, category columns are converted into the type of the category. So in the example above, column "A" becomes a string type. The same should be done when Arrow is enabled, so we end up with the same Spark DataFrame. If you are able to, we also need to see how this affects pandas_udfs too. Thanks!

> Handle Pandas category type when converting from Python with Arrow
> --
>
> Key: SPARK-25351
> URL: https://issues.apache.org/jira/browse/SPARK-25351
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 2.3.1
> Reporter: Bryan Cutler
> Priority: Major
>
> There needs to be some handling of category types done when calling
> {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}.
> Without Arrow, Spark casts each element to the category. For example
> {noformat}
> In [1]: import pandas as pd
> In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]})
> In [3]: pdf["B"] = pdf["A"].astype('category')
> In [4]: pdf
> Out[4]:
>    A  B
> 0  a  a
> 1  b  b
> 2  c  c
> 3  a  a
> In [5]: pdf.dtypes
> Out[5]:
> A      object
> B    category
> dtype: object
> In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)
> In [8]: df = spark.createDataFrame(pdf)
> In [9]: df.show()
> +---+---+
> |  A|  B|
> +---+---+
> |  a|  a|
> |  b|  b|
> |  c|  c|
> |  a|  a|
> +---+---+
> In [10]: df.printSchema()
> root
>  |-- A: string (nullable = true)
>  |-- B: string (nullable = true)
> In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)
> In [19]: df = spark.createDataFrame(pdf)
>    1667         spark_type = ArrayType(from_arrow_type(at.value_type))
>    1668     else:
> -> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
>    1670     return spark_type
>    1671
> TypeError: Unsupported type in conversion from Arrow: dictionary
> {noformat}
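To make the idea of "converting a category column into the type of the category" concrete, here is a small standard-library sketch of dictionary encoding and decoding. It is only an illustration of the concept: `encode_category` and `decode_category` are hypothetical helpers, not Pandas, Arrow or Spark APIs, and the categories are kept in order of first appearance, which is an assumption of this sketch.

```python
def encode_category(values):
    """Dictionary-encode a column: return (categories, codes)."""
    categories = []  # distinct values, in order of first appearance
    index = {}       # value -> position in `categories`
    codes = []       # one integer code per input element
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return categories, codes


def decode_category(categories, codes):
    """Turn each code back into a plain value of the category's own type."""
    return [categories[code] for code in codes]


column_a = ["a", "b", "c", "a"]
categories, codes = encode_category(column_a)
column_b = decode_category(categories, codes)

print(categories, codes)  # the encoded form of the column
print(column_b)           # plain strings again
```

Decoding before the conversion is the behaviour the comment asks for: after it, the column holds ordinary strings, so both the Arrow and the non-Arrow path would see the same kind of data.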
[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744288#comment-16744288 ] Bryan Cutler commented on SPARK-26591: -- [~elch10] please go ahead and make a Jira for Arrow regarding the pyarrow import error. Also, include all relevant details about your system. > Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain > environment > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} > The environment is: > Python 3.6.7 > PySpark 2.4.0 > PyArrow: 0.11.1 > Pandas: 0.23.4 > NumPy: 1.15.4 > OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 > x86_64 x86_64 x86_64 GNU/Linux -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-26566: - Description: _This is just a placeholder for now to collect what needs to be fixed when we upgrade next time_ Version 0.12.0 includes the following: * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 was: _This is just a placeholder for now to collect what needs to be fixed when we upgrade next time_ Version 0.12.0 includes the following: * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > > _This is just a placeholder for now to collect what needs to be fixed when we > upgrade next time_ > Version 0.12.0 includes the following: > * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 > * conversion to date object no longer needed, ARROW-3910 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742517#comment-16742517 ]

Bryan Cutler commented on SPARK-26591:
--

I created the same virtual environment and could not reproduce, can anyone else verify?

OS: 4.15.0-43-generic #46~16.04.1-Ubuntu SMP Fri Dec 7 13:31:08 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.7 (default, Nov 21 2018 02:32:25)
SparkSession available as 'spark'.
>>> import pyarrow
>>> pyarrow.__version__
'0.11.1'
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import numpy
>>> numpy.__version__
'1.15.4'
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> from pyspark.sql.types import IntegerType, StringType
>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
>>> slen
<function <lambda> at 0x7f099a99cd90>
{noformat}

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>
> Reporter: Elchin
> Priority: Major
> Attachments: core
>
> When I try to use pandas_udf from examples in
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is crashed{code}
> I get the error:
> {code:java}
> [1] 17969 illegal hardware instruction (core dumped) python3{code}
> The environment is:
> Python 3.6.7
> PySpark 2.4.0
> PyArrow: 0.11.1
> Pandas: 0.23.4
> NumPy: 1.15.4
> OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743460#comment-16743460 ]

Bryan Cutler commented on SPARK-26591:
--

[~elch10] this seems like it is more an Arrow issue with pyarrow compatibility. Can you try something like the following in an interpreter to test if pyarrow basics work?

{code:java}
import pandas as pd
import pyarrow as pa
s = pd.Series([1, 2, 3])
pa.Array.from_pandas(s){code}

If any of that fails, we should move the issue over to Arrow.

> Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
>
> Key: SPARK-26591
> URL: https://issues.apache.org/jira/browse/SPARK-26591
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Environment: Python 3.6.7
> Pyspark 2.4.0
> OS:
> {noformat}
> Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux{noformat}
> CPU:
>
> {code:java}
> Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB
> clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz
> {code}
>
> Reporter: Elchin
> Priority: Major
> Attachments: core
>
> When I try to use pandas_udf from examples in
> [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]:
> {code:java}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import IntegerType, StringType
> slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is crashed{code}
> I get the error:
> {code:java}
> [1] 17969 illegal hardware instruction (core dumped) python3{code}
> The environment is:
> Python 3.6.7
> PySpark 2.4.0
> PyArrow: 0.11.1
> Pandas: 0.23.4
> NumPy: 1.15.4
> OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
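When an import can crash the whole interpreter, as in the 'illegal hardware instruction' report above, it can help to run the check in a child process so the calling session survives. The following is a minimal sketch using only the standard library; `import_survives` is a hypothetical helper name, and the module to test (for example `pyarrow`) is passed in by the caller. It does not diagnose the cause, it only reports whether the import completed.

```python
import subprocess
import sys


def import_survives(module_name, timeout=60):
    """Try `import module_name` in a child interpreter.

    Returns (ok, returncode). On POSIX systems a negative return code
    means the child was terminated by a signal; on Linux, -4 would
    correspond to SIGILL (illegal instruction).
    """
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.returncode == 0, result.returncode


# A module from the standard library should import cleanly.
print(import_survives("json"))
# A module that does not exist fails with a non-zero exit code.
print(import_survives("module_that_does_not_exist_xyz"))
```

Calling `import_survives("pyarrow")` on the affected machine would distinguish an ordinary `ImportError` (positive exit code) from a crash by signal (negative return code), which is the kind of detail worth including when filing the issue against Arrow.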
[jira] [Commented] (SPARK-26591) Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain environment
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16743463#comment-16743463 ] Bryan Cutler commented on SPARK-26591: -- And yes, you could build pyarrow yourself, but it shouldn't fail if you are using standard hardware and nothing exotic. See [https://arrow.apache.org/docs/python/development.html#development] for build info > Scalar Pandas UDF fails with 'illegal hardware instruction' in a certain > environment > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Major > Attachments: core > > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} > The environment is: > Python 3.6.7 > PySpark 2.4.0 > PyArrow: 0.11.1 > Pandas: 0.23.4 > NumPy: 1.15.4 > OS: Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 > x86_64 x86_64 x86_64 GNU/Linux -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26676) Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 and PyPy
[ https://issues.apache.org/jira/browse/SPARK-26676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-26676. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23604 [https://github.com/apache/spark/pull/23604] > Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 > and PyPy > - > > Key: SPARK-26676 > URL: https://issues.apache.org/jira/browse/SPARK-26676 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > This particular test is being skipped at PyPy and Python 2. > {code} > Skipped tests in pyspark.sql.tests.test_context with pypy: > test_unbounded_frames > (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < > 3.3 doesn't support mocking" > Skipped tests in pyspark.sql.tests.test_context with python2.7: > test_unbounded_frames > (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < > 3.3 doesn't support mocking" > {code} > We don't have to use unittest 3.3 module to mock. And looks the test itself > isn't compatible with Python 2. We should fix it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26676) Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 and PyPy
[ https://issues.apache.org/jira/browse/SPARK-26676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-26676: Assignee: Hyukjin Kwon > Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 > and PyPy > - > > Key: SPARK-26676 > URL: https://issues.apache.org/jira/browse/SPARK-26676 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > This particular test is being skipped at PyPy and Python 2. > {code} > Skipped tests in pyspark.sql.tests.test_context with pypy: > test_unbounded_frames > (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < > 3.3 doesn't support mocking" > Skipped tests in pyspark.sql.tests.test_context with python2.7: > test_unbounded_frames > (pyspark.sql.tests.test_context.HiveContextSQLTests) ... skipped "Unittest < > 3.3 doesn't support mocking" > {code} > We don't have to use unittest 3.3 module to mock. And looks the test itself > isn't compatible with Python 2. We should fix it -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26315) auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
[ https://issues.apache.org/jira/browse/SPARK-26315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717873#comment-16717873 ]

Bryan Cutler commented on SPARK-26315:
--

I believe {{def approxSimilarityJoin(...)}} in LSHModel in feature.py should have {{threshold = TypeConverters.toFloat(threshold)}} before the call to {{_call_java(...)}}. I can help guide if anyone would like to submit a PR for the fix.

> auto cast threshold from Integer to Float in approxSimilarityJoin of BucketedRandomProjectionLSHModel
>
> Key: SPARK-26315
> URL: https://issues.apache.org/jira/browse/SPARK-26315
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark
> Affects Versions: 2.3.2
> Reporter: Song Ci
> Priority: Major
>
> when I was using
> {code:java}
> // code placeholder
> BucketedRandomProjectionLSHModel.approxSimilarityJoin(dt_features, dt_features, distCol="EuclideanDistance", threshold=20.)
> {code}
> I was confused then that this method reported an exception some java method
> (dataset, dataset, integer, string) fingerprint can not be found I think
> if I give an integer, and the python method of pyspark should be auto-cast
> this to float if needed.
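The suggested fix, coercing `threshold` to a float before it crosses into the JVM, can be sketched in plain Python. This is only an illustration under stated assumptions: `to_float` is a hypothetical stand-in for the converter named in the comment (its rejection of booleans is a choice made for this sketch), and `approx_similarity_join` returns a tuple in place of the real call into Java.

```python
def to_float(value):
    """Accept an int or a float and return a float; reject anything else.

    Hypothetical stand-in for a type converter; not the PySpark
    implementation.
    """
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("Could not convert %r to float" % (value,))
    return float(value)


def approx_similarity_join(dataset_a, dataset_b, threshold, dist_col="distCol"):
    """Coerce `threshold` first, then hand the arguments to the backend.

    The returned tuple stands in for the call into the JVM, where the
    method signature expects a double rather than an integer.
    """
    threshold = to_float(threshold)
    return ("approxSimilarityJoin", dataset_a, dataset_b, threshold, dist_col)


# An integer threshold is now accepted and forwarded as a float.
call = approx_similarity_join("df_a", "df_b", 20, dist_col="EuclideanDistance")
print(call)
```

Doing the conversion on the Python side means the caller can pass `20` or `20.0` interchangeably, and an unsupported value fails early with a Python `TypeError` instead of a missing-method error from the JVM.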
[jira] [Comment Edited] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704092#comment-16704092 ] Bryan Cutler edited comment on SPARK-26200 at 11/30/18 12:56 AM: - I think this is a duplicate of SPARK-24915 except the columns are mixed up instead of an exception being thrown, could you please confirm [~davidlyness]? was (Author: bryanc): I think this is a duplicate of https://issues.apache.org/jira/browse/SPARK-24915 except the columns are mixed up instead of an exception being thrown, could you please confirm [~davidlyness]? > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. >Reporter: David Lyness >Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified columns. > h3. 
Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. 
In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids the correctness issue. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
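The transposition analysed in this issue can be reproduced without Spark. The sketch below uses only the standard library and assumes, as the description states, that a `Row` built from keyword arguments stores its values sorted by field name; the variable and function names are hypothetical. Pairing those values with the schema by position swaps the two string columns, while looking each value up by name does not.

```python
import datetime

# Field order declared by the schema in the report.
schema_names = ["date_column", "my_b_column", "my_a_column"]

# The values as the user supplied them, keyed by name.
row_as_dict = {
    "date_column": datetime.date(2018, 11, 28),
    "my_b_column": "my_b_value",
    "my_a_column": "my_a_value",
}

# Assumption from the issue text: the Row keeps its values as a tuple
# sorted alphabetically by field name.
row_names = sorted(row_as_dict)
row_as_tuple = tuple(row_as_dict[name] for name in row_names)


def assign_by_position(names, values):
    """Pair schema fields with values purely by position (the faulty path)."""
    return dict(zip(names, values))


def assign_by_name(names, mapping):
    """Look each schema field up by name (the path that works)."""
    return {name: mapping[name] for name in names}


wrong = assign_by_position(schema_names, row_as_tuple)
right = assign_by_name(schema_names, row_as_dict)

print(wrong)  # my_a_value lands under my_b_column and vice versa
print(right)  # every value sits under its own column
```

This mirrors the workaround in the report: a plain list in schema order, or a dict keyed by column name, avoids the dependence on the sorted tuple.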
[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704092#comment-16704092 ] Bryan Cutler commented on SPARK-26200: -- I think this is a duplicate of https://issues.apache.org/jira/browse/SPARK-24915 except the columns are mixed up instead of an exception being thrown, could you please confirm [~davidlyness]? > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. >Reporter: David Lyness >Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified columns. > h3. Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. 
Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. 
Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids the correctness issue. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-26200. -- Resolution: Duplicate > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. >Reporter: David Lyness >Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified columns. > h3. Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. 
Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids the correctness issue. 
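The ordering mismatch described in the analysis above can be reproduced without Spark. The sketch below is a plain-Python illustration, not the PySpark implementation: `make_sorted_row`, `to_internal_positional` and `to_internal_by_name` are hypothetical helpers, and the only assumption taken from the report is that a `Row` built from keyword arguments stores its values sorted by field name.

```python
# Plain-Python illustration of the ordering bug; no Spark required.
# All helper names here are hypothetical and exist only for this sketch.
SCHEMA_NAMES = ["date_column", "my_b_column", "my_a_column"]


def make_sorted_row(**kwargs):
    """Mimic a keyword-built Row: field names and values sorted by name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)


def to_internal_positional(schema_names, values):
    """Pair schema fields with values by position (the failing path)."""
    return dict(zip(schema_names, values))


def to_internal_by_name(schema_names, row_names, values):
    """Look each schema field up by name (what the working path does)."""
    by_name = dict(zip(row_names, values))
    return {n: by_name[n] for n in schema_names}


row_names, row_values = make_sorted_row(
    date_column="2018-11-28",
    my_b_column="my_b_value",
    my_a_column="my_a_value",
)

positional = to_internal_positional(SCHEMA_NAMES, row_values)
by_name = to_internal_by_name(SCHEMA_NAMES, row_names, row_values)
```

Here `positional` pairs `my_b_column` with `my_a_value`, matching the "Actual result" table, while `by_name` keeps each value under its own column.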
[jira] [Assigned] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API
[ https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-24333: Assignee: Huaxin Gao > Add fit with validation set to spark.ml GBT: Python API > --- > > Key: SPARK-24333 > URL: https://issues.apache.org/jira/browse/SPARK-24333 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Huaxin Gao >Priority: Major > > Python version of API added by [SPARK-7132] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API
[ https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24333. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21465 [https://github.com/apache/spark/pull/21465] > Add fit with validation set to spark.ml GBT: Python API > --- > > Key: SPARK-24333 > URL: https://issues.apache.org/jira/browse/SPARK-24333 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Python version of API added by [SPARK-7132] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25274) Improve toPandas with Arrow by sending out-of-order record batches
[ https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-25274: Assignee: Bryan Cutler > Improve toPandas with Arrow by sending out-of-order record batches > -- > > Key: SPARK-25274 > URL: https://issues.apache.org/jira/browse/SPARK-25274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > When executing {{toPandas}} with Arrow enabled, partitions that arrive in the > JVM out-of-order must be buffered before they can be sent to Python. This > causes an excess of memory to be used in the driver JVM and increases the > time it takes to complete because data must sit in the JVM waiting for > preceding partitions to come in. > This can be improved by sending out-of-order partitions to Python as soon as > they arrive in the JVM, followed by a list of indices so that Python can > assemble the data in the correct order. This way, data is not buffered at the > JVM and there is no waiting on particular partitions so performance will be > increased.
[jira] [Resolved] (SPARK-25274) Improve toPandas with Arrow by sending out-of-order record batches
[ https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-25274. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22275 https://github.com/apache/spark/pull/22275 > Improve toPandas with Arrow by sending out-of-order record batches > -- > > Key: SPARK-25274 > URL: https://issues.apache.org/jira/browse/SPARK-25274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > When executing {{toPandas}} with Arrow enabled, partitions that arrive in the > JVM out-of-order must be buffered before they can be sent to Python. This > causes an excess of memory to be used in the driver JVM and increases the > time it takes to complete because data must sit in the JVM waiting for > preceding partitions to come in. > This can be improved by sending out-of-order partitions to Python as soon as > they arrive in the JVM, followed by a list of indices so that Python can > assemble the data in the correct order. This way, data is not buffered at the > JVM and there is no waiting on particular partitions so performance will be > increased.
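The reassembly step on the Python side can be sketched in a few lines. This is only an illustration of the idea in the description, not the code from the pull request: `assemble_in_order` is a hypothetical helper, plain lists stand in for Arrow record batches, and the sketch assumes one batch per partition and that the index list gives, for each arrived batch, the partition it belongs to. The actual encoding of the indices may differ.

```python
def assemble_in_order(arrived_batches, partition_indices):
    """Reorder batches received in arrival order into partition order.

    arrived_batches[i] is the i-th batch to arrive; partition_indices[i] is
    the partition number that batch belongs to (assumed encoding).
    """
    ordered = [None] * len(arrived_batches)
    for batch, partition in zip(arrived_batches, partition_indices):
        ordered[partition] = batch
    return ordered


# Partition 2 arrived first, then partition 0, then partition 1.
arrived = [["c1", "c2"], ["a1"], ["b1", "b2"]]
indices = [2, 0, 1]
result = assemble_in_order(arrived, indices)
```

Because the indices arrive after the data, nothing has to wait in the JVM; the reordering cost moves to a single pass in Python.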
[jira] [Created] (SPARK-26573) Python worker not reused with mapPartitions if not consuming iterator
Bryan Cutler created SPARK-26573:

Summary: Python worker not reused with mapPartitions if not consuming iterator
Key: SPARK-26573
URL: https://issues.apache.org/jira/browse/SPARK-26573
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.0
Reporter: Bryan Cutler

In PySpark, if the user calls RDD mapPartitions and does not consume the iterator, the Python worker will read the wrong signal and not be reused. Test to replicate:

{code:java}
def test_worker_reused_in_map_partition(self):

    def map_pid(iterator):
        # Fails when iterator not consumed, e.g. len(list(iterator))
        return (os.getpid() for _ in xrange(10))

    rdd = self.sc.parallelize([], 10)

    worker_pids_a = rdd.mapPartitions(map_pid).collect()
    worker_pids_b = rdd.mapPartitions(map_pid).collect()

    self.assertTrue(all([pid in worker_pids_a for pid in worker_pids_b]))
{code}

This is related to SPARK-26549 which fixes this issue, but only for use in rdd.range
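The comment in the test hints that consuming the iterator avoids the failure. A possible user-side workaround, sketched here under that assumption, is to drain the partition iterator before returning. `map_pid_consuming` is a hypothetical name, and the example runs on a plain Python iterator so it does not need a Spark context.

```python
import os


def map_pid_consuming(iterator, n=10):
    """Drain the partition iterator, then return the worker PID n times."""
    for _ in iterator:
        pass  # exhaust the input so the whole partition is read
    return (os.getpid() for _ in range(n))


# Stand-in for one partition's data.
partition = iter([1, 2, 3])
pids = list(map_pid_consuming(partition))
```

With Spark available, the same function could be passed as `rdd.mapPartitions(map_pid_consuming)`.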
[jira] [Assigned] (SPARK-26349) Pyspark should not accept insecure p4yj gateways
[ https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-26349: Assignee: Imran Rashid > Pyspark should not accept insecure p4yj gateways > > > Key: SPARK-26349 > URL: https://issues.apache.org/jira/browse/SPARK-26349 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > > Pyspark normally produces a secure py4j connection between python & the jvm, > but it also lets users provide their own connection. Spark should fail fast > if that connection is insecure. > Note this is breaking a public api, which is why this is targeted at 3.0.0. > SPARK-26019 is about adding a warning, but still allowing the old behavior to > work, in 2.x -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26349) Pyspark should not accept insecure p4yj gateways
[ https://issues.apache.org/jira/browse/SPARK-26349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-26349. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23441 [https://github.com/apache/spark/pull/23441] > Pyspark should not accept insecure p4yj gateways > > > Key: SPARK-26349 > URL: https://issues.apache.org/jira/browse/SPARK-26349 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > Fix For: 3.0.0 > > > Pyspark normally produces a secure py4j connection between python & the jvm, > but it also lets users provide their own connection. Spark should fail fast > if that connection is insecure. > Note this is breaking a public api, which is why this is targeted at 3.0.0. > SPARK-26019 is about adding a warning, but still allowing the old behavior to > work, in 2.x -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-26566: - Target Version/s: (was: 2.4.0) > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704
[jira] [Created] (SPARK-26566) Upgrade apache/arrow to 0.12.0
Bryan Cutler created SPARK-26566:

Summary: Upgrade apache/arrow to 0.12.0
Key: SPARK-26566
URL: https://issues.apache.org/jira/browse/SPARK-26566
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler
Fix For: 2.4.0

Version 0.10.0 will allow for the following improvements and bug fixes:
* Allow for adding BinaryType support ARROW-2141
* Bug fix related to array serialization ARROW-1973
* Python2 str will be made into an Arrow string instead of bytes ARROW-2101
* Python bytearrays are supported as input to pyarrow ARROW-2141
* Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962
* Cleanup pyarrow type equality checks ARROW-2423
* ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645
* Improved low level handling of messages for RecordBatch ARROW-2704
[jira] [Commented] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736540#comment-16736540 ] Bryan Cutler commented on SPARK-26566: -- Version 0.12.0 is slated to be released in mid January > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > > _This is just a placeholder for now to collect what needs to be fixed when we > upgrade next time_ > Version 0.12.0 includes the following: > * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-25272. -- Resolution: Won't Fix > Show some kind of test output to indicate pyarrow tests were run > > > Key: SPARK-25272 > URL: https://issues.apache.org/jira/browse/SPARK-25272 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > Right now tests only output status when they are skipped and there is no way > to really see from the logs that pyarrow tests, like ArrowTests, have been > run except by the absence of a skipped message. We can add a test that is > skipped if pyarrow is installed, which will give an output in our Jenkins > test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
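The marker test proposed in the description could look roughly like the sketch below. It is an illustration only: the class name, test name and skip message are hypothetical, and it uses the standard `unittest.skipIf` decorator together with `importlib.util.find_spec` to detect pyarrow.

```python
import importlib.util
import unittest

# True when the pyarrow package can be found in this environment.
HAVE_PYARROW = importlib.util.find_spec("pyarrow") is not None


class PyArrowInstalledMarker(unittest.TestCase):
    """Hypothetical marker test: its skip message shows pyarrow was present."""

    @unittest.skipIf(HAVE_PYARROW, "pyarrow installed; Arrow tests were run")
    def test_pyarrow_not_installed(self):
        # Runs only when pyarrow is absent; nothing further to verify.
        self.assertFalse(HAVE_PYARROW)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PyArrowInstalledMarker)
result = unittest.TestResult()
suite.run(result)
```

When pyarrow is installed the test is reported as skipped with the message above, which gives the log line the issue asks for; when it is missing the test simply passes.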
[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-26566: - Fix Version/s: (was: 2.4.0) > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704
[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-26566: - Affects Version/s: (was: 2.3.0) 2.4.0 > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704
[jira] [Assigned] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-26566: Assignee: (was: Bryan Cutler) > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704
[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-26566: - Description: _This is just a placeholder for now to collect what needs to be fixed when we upgrade next time_ Version 0.12.0 includes the following: * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 was: Version 0.10.0 will allow for the following improvements and bug fixes: * Allow for adding BinaryType support ARROW-2141 * Bug fix related to array serialization ARROW-1973 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 * Python bytearrays are supported as input to pyarrow ARROW-2141 * Java has common interface for reset to cleanup complex vectors in Spark ArrowWriter ARROW-1962 * Cleanup pyarrow type equality checks ARROW-2423 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645 * Improved low level handling of messages for RecordBatch ARROW-2704 > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > > _This is just a placeholder for now to collect what needs to be fixed when we > upgrade next time_ > Version 0.12.0 includes the following: > * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098
[jira] [Commented] (SPARK-26591) illegal hardware instruction
[ https://issues.apache.org/jira/browse/SPARK-26591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740571#comment-16740571 ] Bryan Cutler commented on SPARK-26591: -- Could you share some details of your pyarrow installation - version, did you pip install, are you using a virtual env? If possible, I would make a clean virtual environment and try installation again, it sounds like something went bad. > illegal hardware instruction > > > Key: SPARK-26591 > URL: https://issues.apache.org/jira/browse/SPARK-26591 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Python 3.6.7 > Pyspark 2.4.0 > OS: > {noformat} > Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 > x86_64 x86_64 GNU/Linux{noformat} > CPU: > > {code:java} > Dual core AMD Athlon II P360 (-MCP-) cache: 1024 KB > clock speeds: max: 2300 MHz 1: 1700 MHz 2: 1700 MHz > {code} > > >Reporter: Elchin >Priority: Critical > > When I try to use pandas_udf from examples in > [documentation|https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf]: > {code:java} > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.types import IntegerType, StringType > slen = pandas_udf(lambda s: s.str.len(), IntegerType()) #here it is > crashed{code} > I get the error: > {code:java} > [1] 17969 illegal hardware instruction (core dumped) python3{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16614176#comment-16614176 ] Bryan Cutler commented on SPARK-25344: -- From the mailing list, I think we should agree on a few things first: 1. When should we create a separate test file (one for each module?) and how should it be named, e.g. "test_rdd.py"? 2. Where should the test files go: the same dir as the source, or a subdir named "tests"? 3. Should we start splitting tests immediately as new tests are written, or incrementally as subtasks in this JIRA? > Break large tests.py files into smaller files > - > > Key: SPARK-25344 > URL: https://issues.apache.org/jira/browse/SPARK-25344 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > > We've got a ton of tests in one humongous tests.py file, rather than breaking > it out into smaller files. > Having one huge file doesn't seem great for code organization, and it also > makes the test parallelization in run-tests.py not work as well. On my > laptop, tests.py takes 150s, and the next longest test file takes only 20s. > There are similarly large files in other pyspark modules, e.g. sql/tests.py, > ml/tests.py, mllib/tests.py, streaming/tests.py. > It seems that at least for some of these files, the tests are already broken into > independent test classes, so it shouldn't be too hard to just move them into > their own files.
[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705224#comment-16705224 ] Bryan Cutler commented on SPARK-26200: -- Thanks [~davidlyness], I'll mark this as a duplicate since the root cause is the same. I've been tracking this and similar issues with PySpark Rows, but since addressing these will cause a behavior change, we can only make the fix in Spark 3.0. If you didn't see the workaround in the other JIRA, I would recommend creating your Rows this way when using a schema and you care about field ordering, or use a list as you pointed out {code} In [10]: MyRow = Row("field2", "field1") In [11]: data = [ ...: MyRow(Row(sub_field='world'), "hello") ...: ] In [12]: df = spark.createDataFrame(data, schema=schema) In [13]: df.show() +---+--+ | field2|field1| +---+--+ |[world]| hello| +---+--+ {code} > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. >Reporter: David Lyness >Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified columns. > h3. 
Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. 
In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids
[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition
[ https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751785#comment-16751785 ] Bryan Cutler commented on SPARK-26412: -- [~mengxr] I think Arrow record batches would be a better way to connect with other frameworks. Making the conversion to Pandas carries some overhead and the Arrow format/types are more solidly defined. It is also better suited to be used with an iterator - most of the Arrow IPC mechanisms operate on streams of record batches. Is this proposal instead of SPARK-24579 SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks? > Allow Pandas UDF to take an iterator of pd.DataFrames for the entire partition > -- > > Key: SPARK-26412 > URL: https://issues.apache.org/jira/browse/SPARK-26412 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > Pandas UDF is the ideal connection between PySpark and DL model inference > workloads. However, the user needs to load the model file first to make > predictions. It is common to see models of size ~100MB or bigger. If the > Pandas UDF execution is limited to batch scope, the user needs to repeatedly load > the same model for every batch in the same python worker process, which is > inefficient. I created this JIRA to discuss possible solutions. > Essentially we need to support "start()" and "finish()" besides "apply". We > can either provide those interfaces or simply provide users the iterator of > batches in pd.DataFrame and let user code handle it. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
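The "iterator of batches" option from the description can be illustrated with a plain generator. This is a sketch of the pattern only, not a Spark API: `predict_partition` and `load_doubling_model` are hypothetical names, plain lists stand in for `pd.DataFrame` batches, and the "model" is a trivial function so the example runs without any ML library.

```python
def predict_partition(batches, load_model):
    """Load the model once, then score every batch in the partition."""
    model = load_model()  # expensive step, done once per partition
    for batch in batches:
        yield model(batch)


# Records how many times the (pretend) model file is loaded.
load_calls = []


def load_doubling_model():
    """Stand-in for loading a large model file."""
    load_calls.append(1)
    return lambda batch: [2 * x for x in batch]


batches = iter([[1, 2], [3], [4, 5, 6]])
scores = list(predict_partition(batches, load_doubling_model))
```

The model is loaded once for three batches, which is the saving the issue is after; code placed before and after the loop plays the role of "start()" and "finish()".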
[jira] [Commented] (SPARK-26410) Support per Pandas UDF configuration
[ https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751771#comment-16751771 ] Bryan Cutler commented on SPARK-26410: -- This could be useful to have, but it does seem a little strange to bind batch size to a udf. To me, batch size seems more related to the data being used, and merging different batch sizes could complicate the behavior. Still, I can see how someone might want to change batch size at different points in a session. > Support per Pandas UDF configuration > > > Key: SPARK-26410 > URL: https://issues.apache.org/jira/browse/SPARK-26410 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the > "right" batch size usually depends on the task itself. It would be nice if > users could configure the batch size when they declare the Pandas UDF. > This is orthogonal to SPARK-23258 (using max buffer size instead of row > count). > Besides the API, we should also discuss how to merge Pandas UDFs of different > configurations. For example, > {code} > df.select(predict1(col("features")), predict2(col("features"))) > {code} > when predict1 requests 100 rows per batch, while predict2 requests 120 rows > per batch. > cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]
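To see why merging the two UDFs in the example is awkward, it helps to look at where their batch boundaries fall. The sketch below is plain Python and says nothing about how Spark would actually schedule the UDFs: `rebatch` is a hypothetical helper, and 600 rows is an arbitrary partition size chosen for illustration.

```python
def rebatch(rows, batch_size):
    """Split a list of rows into consecutive batches of at most batch_size."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


rows = list(range(600))  # one pretend partition of 600 rows

sizes_for_predict1 = [len(b) for b in rebatch(rows, 100)]
sizes_for_predict2 = [len(b) for b in rebatch(rows, 120)]

# Row offsets at which each UDF's batches start or end.
boundaries1 = set(range(0, len(rows) + 1, 100))
boundaries2 = set(range(0, len(rows) + 1, 120))
shared_boundaries = sorted(boundaries1 & boundaries2)
```

With 100 and 120 rows per batch the boundaries line up only at offsets 0 and 600, so evaluating both UDFs over the same input means either re-slicing the data for each one or settling on a common batch size.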
[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751802#comment-16751802 ] Bryan Cutler commented on SPARK-24579: -- It would be great to start up this discussion again, I saw related jiras SPARK-24579 and SPARK-26413 have been recently created. I think the ideal way for Spark to exchange data is an iterator of Arrow record batches, which could be done with an arrow_udf (mentioned in the design here) or through a {{mapPartitions}} function. Maybe expanding on the examples touched on in the design doc would help with discussion? > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. 
As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27276) Increase the minimum pyarrow version to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801250#comment-16801250 ] Bryan Cutler commented on SPARK-27276: -- [~shaneknapp] this will need an upgrade on Jenkins, so let me know whenever you think is a good time to do this - I know you are working out some kinks with recent upgrades right now so no rush. I will try to do the code changes next week and then we can coordinate anytime after. Thanks! cc [~hyukjin.kwon] > Increase the minimum pyarrow version to 0.12.0 > -- > > Key: SPARK-27276 > URL: https://issues.apache.org/jira/browse/SPARK-27276 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > The current minimum version is 0.8.0, which is pretty ancient since Arrow has > been moving fast and a lot has changed since this version. There are > currently many workarounds checking for different versions or disabling > specific functionality, and the code is getting ugly and difficult to > maintain. Increasing the version will allow cleanup and upgrade the testing > environment. > This involves changing the pyarrow version in setup.py (currently at 0.8.0), > updating Jenkins to test against the new version, code cleanup to remove > workarounds from older versions. Users would then need to ensure this > version is installed on the cluster. > There is also a 0.12.1 release, so I will need to check what bugs that fixed > to see if that will be a better version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27276) Increase the minimum pyarrow version to 0.12.0
Bryan Cutler created SPARK-27276: Summary: Increase the minimum pyarrow version to 0.12.0 Key: SPARK-27276 URL: https://issues.apache.org/jira/browse/SPARK-27276 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Bryan Cutler The current minimum version is 0.8.0, which is pretty ancient since Arrow has been moving fast and a lot has changed since this version. There are currently many workarounds checking for different versions or disabling specific functionality, and the code is getting ugly and difficult to maintain. Increasing the version will allow cleanup and upgrade the testing environment. This involves changing the pyarrow version in setup.py (currently at 0.8.0), updating Jenkins to test against the new version, code cleanup to remove workarounds from older versions. Users would then need to ensure this version is installed on the cluster. There is also a 0.12.1 release, so I will need to check what bugs that fixed to see if that will be a better version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27276) Increase the minimum pyarrow version to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27276: - Description: The current minimum version is 0.8.0, which is pretty ancient since Arrow has been moving fast and a lot has changed since this version. There are currently many workarounds checking for different versions or disabling specific functionality, and the code is getting ugly and difficult to maintain. Increasing the version will allow cleanup and upgrade the testing environment. This involves changing the pyarrow version in setup.py (currently at 0.8.0), updating Jenkins to test against the new version, code cleanup to remove workarounds from older versions. Newer versions of pyarrow have dropped support for Python 3.4, so it might be necessary to update to Python 3.5+ in Jenkins as well. Users would then need to ensure at least this version of pyarrow is installed on the cluster. There is also a 0.12.1 release, so I will need to check what bugs that fixed to see if that will be a better version. was: The current minimum version is 0.8.0, which is pretty ancient since Arrow has been moving fast and a lot has changed since this version. There are currently many workarounds checking for different versions or disabling specific functionality, and the code is getting ugly and difficult to maintain. Increasing the version will allow cleanup and upgrade the testing environment. This involves changing the pyarrow version in setup.py (currently at 0.8.0), updating Jenkins to test against the new version, code cleanup to remove workarounds from older versions. Users would then need to ensure this version is installed on the cluster. There is also a 0.12.1 release, so I will need to check what bugs that fixed to see if that will be a better version. 
> Increase the minimum pyarrow version to 0.12.0 > -- > > Key: SPARK-27276 > URL: https://issues.apache.org/jira/browse/SPARK-27276 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > The current minimum version is 0.8.0, which is pretty ancient since Arrow has > been moving fast and a lot has changed since this version. There are > currently many workarounds checking for different versions or disabling > specific functionality, and the code is getting ugly and difficult to > maintain. Increasing the version will allow cleanup and upgrade the testing > environment. > This involves changing the pyarrow version in setup.py (currently at 0.8.0), > updating Jenkins to test against the new version, code cleanup to remove > workarounds from older versions. Newer versions of pyarrow have dropped > support for Python 3.4, so it might be necessary to update to Python 3.5+ in > Jenkins as well. Users would then need to ensure at least this version of > pyarrow is installed on the cluster. > There is also a 0.12.1 release, so I will need to check what bugs that fixed > to see if that will be a better version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27276) Increase the minimum pyarrow version to 0.12.1
[ https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27276: - Summary: Increase the minimum pyarrow version to 0.12.1 (was: Increase the minimum pyarrow version to 0.12.0) > Increase the minimum pyarrow version to 0.12.1 > -- > > Key: SPARK-27276 > URL: https://issues.apache.org/jira/browse/SPARK-27276 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > The current minimum version is 0.8.0, which is pretty ancient since Arrow has > been moving fast and a lot has changed since this version. There are > currently many workarounds checking for different versions or disabling > specific functionality, and the code is getting ugly and difficult to > maintain. Increasing the version will allow cleanup and upgrade the testing > environment. > This involves changing the pyarrow version in setup.py (currently at 0.8.0), > updating Jenkins to test against the new version, code cleanup to remove > workarounds from older versions. Newer versions of pyarrow have dropped > support for Python 3.4, so it might be necessary to update to Python 3.5+ in > Jenkins as well. Users would then need to ensure at least this version of > pyarrow is installed on the cluster. > There is also a 0.12.1 release, so I will need to check what bugs that fixed > to see if that will be a better version. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27276) Increase the minimum pyarrow version to 0.12.1
[ https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810024#comment-16810024 ] Bryan Cutler commented on SPARK-27276: -- I think we should use 0.12.1; it includes the bug fix ARROW-4582, which might affect usage in Spark. > Increase the minimum pyarrow version to 0.12.1 > -- > > Key: SPARK-27276 > URL: https://issues.apache.org/jira/browse/SPARK-27276 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Affects Versions: 3.0.0 > Reporter: Bryan Cutler > Priority: Major > > The current minimum version is 0.8.0, which is pretty ancient since Arrow has > been moving fast and a lot has changed since this version. There are > currently many workarounds checking for different versions or disabling > specific functionality, and the code is getting ugly and difficult to > maintain. Increasing the version will allow cleanup and upgrade the testing > environment. > This involves changing the pyarrow version in setup.py (currently at 0.8.0), > updating Jenkins to test against the new version, code cleanup to remove > workarounds from older versions. Newer versions of pyarrow have dropped > support for Python 3.4, so it might be necessary to update to Python 3.5+ in > Jenkins as well. Users would then need to ensure at least this version of > pyarrow is installed on the cluster. > There is also a 0.12.1 release, so I will need to check what bugs that fixed > to see if that will be a better version.
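The minimum-version requirement discussed in this issue could be enforced with a small check at import time. The following is only a minimal sketch, assuming versions are plain dotted integers; `parse_version` and `require_minimum_version` are hypothetical helper names and not Spark's actual API. The comparison is done on integer tuples because comparing the raw strings would rank "0.8.0" above "0.12.1".

```python
MINIMUM_PYARROW_VERSION = "0.12.1"  # the value proposed in this issue


def parse_version(version):
    """Turn a dotted version such as '0.12.1' into a tuple of ints: (0, 12, 1)."""
    return tuple(int(part) for part in version.split("."))


def require_minimum_version(installed, minimum=MINIMUM_PYARROW_VERSION):
    """Raise ImportError if `installed` is older than `minimum` (hypothetical helper)."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            "pyarrow >= %s must be installed; found %s" % (minimum, installed)
        )


# Comparing the raw strings would get this wrong: "0.8.0" > "0.12.1" is True.
require_minimum_version("0.12.1")  # passes silently
require_minimum_version("0.13.0")  # passes silently
```

In a real environment the `installed` argument would presumably come from `pyarrow.__version__`. Version strings with suffixes such as development tags would need a proper version parser rather than this integer split.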
[jira] [Commented] (SPARK-27353) PySpark Row __repr__ bug
[ https://issues.apache.org/jira/browse/SPARK-27353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810009#comment-16810009 ]

Bryan Cutler commented on SPARK-27353:
--
This works for me on master. Can you provide a script to reproduce?

In [1]: from pyspark.sql.types import Row
In [2]: import datetime
In [3]: Row(d=datetime.date.today())
Out[3]: Row(d=datetime.date(2019, 4, 4))
In [4]: repr(Row(d=datetime.date.today()))
Out[4]: 'Row(d=datetime.date(2019, 4, 4))'

> PySpark Row __repr__ bug
> --
>
> Key: SPARK-27353
> URL: https://issues.apache.org/jira/browse/SPARK-27353
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.0
> Reporter: Ihor Bobak
> Priority: Major
>
> Row class has this implementation of __repr__:
>     def __repr__(self):
>         """Printable representation of Row used in Python REPL."""
>         if hasattr(self, "__fields__"):
>             return "Row(%s)" % ", ".join("%s=%r" % (k, v)
>                                          for k, v in zip(self.__fields__, tuple(self)))
>         else:
>             return "<Row(%s)>" % ", ".join(self)
>
> the last line fails when you have a datetime.date instance in a row:
> TypeError Traceback (most recent call last)
> in
>       2 print(*row.values)
>       3 df_row = Row(*row.values)
> ----> 4 print(repr(df_row))
>       5 break
>       6
> E:\spark\spark-2.3.2-bin-without-hadoop\python\pyspark\sql\types.py in __repr__(self)
>    1579 for k, v in zip(self.__fields__, tuple(self)))
>    1580 else:
> -> 1581 return "<Row(%s)>" % ", ".join(self)
>    1582
>    1583
> TypeError: sequence item 0: expected str instance, datetime.date found
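The failure reported here only occurs for a Row built from positional values (no field names), where the values are passed straight to `str.join`. The sketch below reproduces that behaviour without Spark, using `MiniRow`, a hypothetical stand-in class written for illustration; it is not the pyspark implementation, and the `repr()`-based variant is one possible fix rather than necessarily the one adopted upstream.

```python
import datetime


class MiniRow(tuple):
    """Hypothetical stand-in for an unnamed pyspark Row (no __fields__)."""

    def __new__(cls, *values):
        return super().__new__(cls, values)

    def broken_repr(self):
        # str.join only accepts strings, so any non-str value raises TypeError.
        return "<Row(%s)>" % ", ".join(self)

    def __repr__(self):
        # Format each value with repr() first, so values of any type can be joined.
        return "<Row(%s)>" % ", ".join(repr(value) for value in self)


row = MiniRow(datetime.date(2019, 4, 4), "a")
print(repr(row))  # <Row(datetime.date(2019, 4, 4), 'a')>
```

Calling `row.broken_repr()` raises the same `TypeError: sequence item 0: expected str instance, datetime.date found` as in the report, which also explains why the keyword form `Row(d=...)` in the comment above does not reproduce it: that path formats values with `%r`.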
[jira] [Created] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
Bryan Cutler created SPARK-27387:

Summary: Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
Key: SPARK-27387
URL: https://issues.apache.org/jira/browse/SPARK-27387
Project: Spark
Issue Type: Bug
Components: PySpark, Tests
Affects Versions: 2.4.1
Reporter: Bryan Cutler

In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant to check whether two pandas.DataFrames are equal, but with later versions of Pandas it can fail if the DataFrame has an array column. This method can be replaced by {{assert_frame_equal}} from pandas.util.testing, which is meant for exactly this purpose and also gives a better assertion message.

The test failure I have seen is:
{noformat}
==
ERROR: test_supported_types (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests)
--
Traceback (most recent call last):
  File "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py", line 128, in test_supported_types
    self.assertPandasEqual(expected1, result1)
  File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, in assertPandasEqual
    self.assertTrue(expected.equals(result), msg=msg)
  File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas
...
  File "pandas/_libs/lib.pyx", line 523, in pandas._libs.lib.array_equivalent_object
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
{noformat}
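As a rough illustration of the proposed replacement, the sketch below compares DataFrames that hold an array column using `assert_frame_equal`. It imports from `pandas.testing`, the public location in current pandas releases, whereas the issue text refers to the older `pandas.util.testing` path. The `frames_match` wrapper and the sample data are assumptions made for this example and are not part of the PySpark test utilities.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(expected, result):
    """Return True if the frames are equal according to assert_frame_equal."""
    try:
        assert_frame_equal(expected, result)
        return True
    except AssertionError as error:
        # The assertion message describes which column and values differ.
        print(error)
        return False


# Each cell of the "arr" column holds a NumPy array (an array column).
expected = pd.DataFrame({"id": [1, 2], "arr": [np.array([1, 2]), np.array([3])]})
same = pd.DataFrame({"id": [1, 2], "arr": [np.array([1, 2]), np.array([3])]})
different = pd.DataFrame({"id": [1, 2], "arr": [np.array([1, 2]), np.array([4])]})

print(frames_match(expected, same))
print(frames_match(expected, different))
```

In a unittest-based suite the wrapper would not be needed: calling `assert_frame_equal(expected, result)` directly lets the descriptive `AssertionError` surface as the test failure.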
[jira] [Commented] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
[ https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810197#comment-16810197 ] Bryan Cutler commented on SPARK-27387: -- This can be done after the upgrade of pyarrow version to avoid conflicts. > Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests > -- > > Key: SPARK-27387 > URL: https://issues.apache.org/jira/browse/SPARK-27387 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.1 >Reporter: Bryan Cutler >Priority: Major > > In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant > to check if 2 pandas.DataFrames are equal but it seems for later versions of > Pandas, this can fail if the DataFrame has an array column. This method can > be replaced by {{assert_frame_equal}} from pandas.util.testing. This is what > it is meant for and it will give a better assertion message as well. > The test failure I have seen is: > {noformat} > == > ERROR: test_supported_types > (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests) > -- > Traceback (most recent call last): > File > "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py", > line 128, in test_supported_types > self.assertPandasEqual(expected1, result1) > File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, > in assertPandasEqual > self.assertTrue(expected.equals(result), msg=msg) > File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas > ... > File "pandas/_libs/lib.pyx", line 523, in > pandas._libs.lib.array_equivalent_object > ValueError: The truth value of an array with more than one element is > ambiguous. Use a.any() or a.all() > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
[ https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810200#comment-16810200 ] Bryan Cutler commented on SPARK-27387: -- I can work on this > Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests > -- > > Key: SPARK-27387 > URL: https://issues.apache.org/jira/browse/SPARK-27387 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.1 >Reporter: Bryan Cutler >Priority: Major > > In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant > to check if 2 pandas.DataFrames are equal but it seems for later versions of > Pandas, this can fail if the DataFrame has an array column. This method can > be replaced by {{assert_frame_equal}} from pandas.util.testing. This is what > it is meant for and it will give a better assertion message as well. > The test failure I have seen is: > {noformat} > == > ERROR: test_supported_types > (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests) > -- > Traceback (most recent call last): > File > "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py", > line 128, in test_supported_types > self.assertPandasEqual(expected1, result1) > File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, > in assertPandasEqual > self.assertTrue(expected.equals(result), msg=msg) > File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas > ... > File "pandas/_libs/lib.pyx", line 523, in > pandas._libs.lib.array_equivalent_object > ValueError: The truth value of an array with more than one element is > ambiguous. Use a.any() or a.all() > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811355#comment-16811355 ] Bryan Cutler commented on SPARK-27389: -- From the stack trace, it looks like the value comes from "spark.sql.session.timeZone", which defaults to the JVM default time zone, java.util.TimeZone.getDefault.getID(). > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark > Affects Versions: 3.0.0 > Reporter: Imran Rashid > Assignee: shane knapp > Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'". eg. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. that said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I dunno what that alone would cause > this failure sometime. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it is really a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27293) Setting random seed produces different results in RandomForestRegressor
[ https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27293: - Summary: Setting random seed produces different results in RandomForestRegressor (was: I am interested in finding out if there is a bug in the implementation of RandomForests. The Issue is when applying a seed and getting different results than other people from my class when applying it to the same data. ) > Setting random seed produces different results in RandomForestRegressor > --- > > Key: SPARK-27293 > URL: https://issues.apache.org/jira/browse/SPARK-27293 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Martin Skauen >Priority: Major > > I am calculating the RMSE metric like this: > {code:java} > (trainingData, testData) = data.randomSplit([0.7, 0.3], 313) > from pyspark.ml.regression import RandomForestRegressor > rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", > maxDepth=5, numTrees=3, seed = 313) > from pyspark.ml.evaluation import RegressionEvaluator > evaluator = RegressionEvaluator\ > (labelCol="labels", predictionCol="prediction", metricName="rmse") > rmse = evaluator.evaluate(predictions) > print("RMSE = %g " % rmse) > {code} > I am setting the seed. For seed = 50 and also for other seeds I get exact > same RMSE as people from class. I set seed to 313 and it is giving me > different value. What could be the issue here? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27293) Setting random seed produces different results in RandomForestRegressor
[ https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27293: - Description: I am interested in finding out if there is a bug in the implementation of RandomForests. The Issue is when applying a seed and getting different results than other people from my class when applying it to the same data I am calculating the RMSE metric like this: {code:java} (trainingData, testData) = data.randomSplit([0.7, 0.3], 313) from pyspark.ml.regression import RandomForestRegressor rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", maxDepth=5, numTrees=3, seed = 313) from pyspark.ml.evaluation import RegressionEvaluator evaluator = RegressionEvaluator\ (labelCol="labels", predictionCol="prediction", metricName="rmse") rmse = evaluator.evaluate(predictions) print("RMSE = %g " % rmse) {code} I am setting the seed. For seed = 50 and also for other seeds I get exact same RMSE as people from class. I set seed to 313 and it is giving me different value. What could be the issue here? was: I am calculating the RMSE metric like this: {code:java} (trainingData, testData) = data.randomSplit([0.7, 0.3], 313) from pyspark.ml.regression import RandomForestRegressor rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", maxDepth=5, numTrees=3, seed = 313) from pyspark.ml.evaluation import RegressionEvaluator evaluator = RegressionEvaluator\ (labelCol="labels", predictionCol="prediction", metricName="rmse") rmse = evaluator.evaluate(predictions) print("RMSE = %g " % rmse) {code} I am setting the seed. For seed = 50 and also for other seeds I get exact same RMSE as people from class. I set seed to 313 and it is giving me different value. What could be the issue here? 
> Setting random seed produces different results in RandomForestRegressor > --- > > Key: SPARK-27293 > URL: https://issues.apache.org/jira/browse/SPARK-27293 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Martin Skauen >Priority: Major > > I am interested in finding out if there is a bug in the implementation of > RandomForests. The Issue is when applying a seed and getting different > results than other people from my class when applying it to the same data > I am calculating the RMSE metric like this: > {code:java} > (trainingData, testData) = data.randomSplit([0.7, 0.3], 313) > from pyspark.ml.regression import RandomForestRegressor > rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", > maxDepth=5, numTrees=3, seed = 313) > from pyspark.ml.evaluation import RegressionEvaluator > evaluator = RegressionEvaluator\ > (labelCol="labels", predictionCol="prediction", metricName="rmse") > rmse = evaluator.evaluate(predictions) > print("RMSE = %g " % rmse) > {code} > I am setting the seed. For seed = 50 and also for other seeds I get exact > same RMSE as people from class. I set seed to 313 and it is giving me > different value. What could be the issue here? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27293) Setting random seed produces different results in RandomForestRegressor
[ https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27293: - Component/s: ML > Setting random seed produces different results in RandomForestRegressor > --- > > Key: SPARK-27293 > URL: https://issues.apache.org/jira/browse/SPARK-27293 > Project: Spark > Issue Type: Question > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Martin Skauen >Priority: Major > > I am interested in finding out if there is a bug in the implementation of > RandomForests. The Issue is when applying a seed and getting different > results than other people from my class when applying it to the same data > I am calculating the RMSE metric like this: > {code:java} > (trainingData, testData) = data.randomSplit([0.7, 0.3], 313) > from pyspark.ml.regression import RandomForestRegressor > rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", > maxDepth=5, numTrees=3, seed = 313) > from pyspark.ml.evaluation import RegressionEvaluator > evaluator = RegressionEvaluator\ > (labelCol="labels", predictionCol="prediction", metricName="rmse") > rmse = evaluator.evaluate(predictions) > print("RMSE = %g " % rmse) > {code} > I am setting the seed. For seed = 50 and also for other seeds I get exact > same RMSE as people from class. I set seed to 313 and it is giving me > different value. What could be the issue here? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27293) I am interested in finding out if there is a bug in the implementation of RandomForests. The Issue is when applying a seed and getting different results than other peo
[ https://issues.apache.org/jira/browse/SPARK-27293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804124#comment-16804124 ] Bryan Cutler commented on SPARK-27293: -- The seeds passed to randomSplit and to the regressor in your example determine both how the input data is split and how the algorithm is initialized, because both operations are randomized. Changing the seed therefore changes both, so it is not surprising that the result ends up different. If you have sufficient data and still get blatantly wrong results compared to another implementation of the algorithm, then there might be an issue. From what I can see here, there doesn't seem to be a problem. > I am interested in finding out if there is a bug in the implementation of > RandomForests. The Issue is when applying a seed and getting different > results than other people from my class when applying it to the same data. > > > Key: SPARK-27293 > URL: https://issues.apache.org/jira/browse/SPARK-27293 > Project: Spark > Issue Type: Question > Components: PySpark > Affects Versions: 2.4.0 > Reporter: Martin Skauen > Priority: Major > > I am calculating the RMSE metric like this: > {code:java} > (trainingData, testData) = data.randomSplit([0.7, 0.3], 313) > from pyspark.ml.regression import RandomForestRegressor > rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", > maxDepth=5, numTrees=3, seed = 313) > from pyspark.ml.evaluation import RegressionEvaluator > evaluator = RegressionEvaluator\ > (labelCol="labels", predictionCol="prediction", metricName="rmse") > rmse = evaluator.evaluate(predictions) > print("RMSE = %g " % rmse) > {code} > I am setting the seed. For seed = 50 and also for other seeds I get exact > same RMSE as people from class. I set seed to 313 and it is giving me > different value. What could be the issue here?
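The effect of the seed on a random split can be illustrated without Spark. The sketch below is a plain-Python analogy, not Spark's `randomSplit` implementation; `random_split` is a hypothetical helper written for this example. It shows that a fixed seed reproduces the same split on the same input, while a different seed produces a different split and therefore different training data.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test using a seeded pseudo-random generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
first = random_split(rows, 0.7, seed=313)
again = random_split(rows, 0.7, seed=313)
other = random_split(rows, 0.7, seed=50)

print(first == again)  # same seed and same input order: identical split
print(first == other)  # different seed: a different split
```

Since the model is then trained on a different subset, a different RMSE for a different seed is expected. In Spark the outcome may additionally depend on how the data is partitioned and ordered, so matching seeds alone may not guarantee identical results across environments.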
[jira] [Resolved] (SPARK-27240) Use pandas DataFrame for struct type argument in Scalar Pandas UDF.
[ https://issues.apache.org/jira/browse/SPARK-27240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-27240. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24177 https://github.com/apache/spark/pull/24177 > Use pandas DataFrame for struct type argument in Scalar Pandas UDF. > --- > > Key: SPARK-27240 > URL: https://issues.apache.org/jira/browse/SPARK-27240 > Project: Spark > Issue Type: Improvement > Components: PySpark > Affects Versions: 2.4.0 > Reporter: Takuya Ueshin > Assignee: Takuya Ueshin > Priority: Major > Fix For: 3.0.0 > > > We now support returning a pandas DataFrame for struct type in Scalar > Pandas UDF. > If we chain another Pandas UDF after the Scalar Pandas UDF returning a pandas > DataFrame, the argument of the chained UDF will be a pandas DataFrame, but > currently we don't support a pandas DataFrame as an argument of a Scalar Pandas > UDF. That means there is an inconsistency between the chained UDF and the > single UDF. > We should support taking a pandas DataFrame for a struct type argument in Scalar > Pandas UDF to be consistent.
[jira] [Assigned] (SPARK-27240) Use pandas DataFrame for struct type argument in Scalar Pandas UDF.
[ https://issues.apache.org/jira/browse/SPARK-27240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-27240: Assignee: Takuya Ueshin > Use pandas DataFrame for struct type argument in Scalar Pandas UDF. > --- > > Key: SPARK-27240 > URL: https://issues.apache.org/jira/browse/SPARK-27240 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > We now support returning a pandas DataFrame for struct type in Scalar > Pandas UDF. > If we chain another Pandas UDF after a Scalar Pandas UDF that returns a pandas > DataFrame, the argument of the chained UDF will be a pandas DataFrame, but > currently we don't support a pandas DataFrame as an argument of Scalar Pandas > UDF. That means there is an inconsistency between the chained UDF and the > single UDF. > We should support taking a pandas DataFrame for a struct type argument in Scalar > Pandas UDF to be consistent.
[jira] [Commented] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)
[ https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777416#comment-16777416 ] Bryan Cutler commented on SPARK-23836: -- I can work on this > Support returning StructType to the level support in GroupedMap Arrow's > "scalar" UDFS (or similar) > -- > > Key: SPARK-23836 > URL: https://issues.apache.org/jira/browse/SPARK-23836 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Major > > Currently not all of the supported types can be returned from the scalar > pandas UDF type. This means if someone wants to return a struct type doing a > map operation right now they either have to do a "junk" groupBy or use the > non-vectorized results. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25147) GroupedData.apply pandas_udf crashing
[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-25147. -- Resolution: Cannot Reproduce Going to resolve this for now, please reopen if the above suggestion does not fix the issue > GroupedData.apply pandas_udf crashing > - > > Key: SPARK-25147 > URL: https://issues.apache.org/jira/browse/SPARK-25147 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: OS: Mac OS 10.13.6 > Python: 2.7.15, 3.6.6 > PyArrow: 0.10.0 > Pandas: 0.23.4 > Numpy: 1.15.0 >Reporter: Mike Sukmanowsky >Priority: Major > > Running the following example taken straight from the docs results in > {{org.apache.spark.SparkException: Python worker exited unexpectedly > (crashed)}} for reasons that aren't clear from any logs I can see: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql import functions as F > spark = ( > SparkSession > .builder > .appName("pandas_udf") > .getOrCreate() > ) > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v") > ) > @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > ( > df > .groupby("id") > .apply(normalize) > .show() > ) > {code} > See output.log for > [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28]. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`
[ https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780751#comment-16780751 ] Bryan Cutler commented on SPARK-26943: -- If you can try to reproduce locally, that would be ideal. Maybe you could take a sample of your dataset? I wouldn't recommend trying out different versions in your cluster. > Weird behaviour with `.cache()` > --- > > Key: SPARK-26943 > URL: https://issues.apache.org/jira/browse/SPARK-26943 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Will Uto >Priority: Major > > > {code:java} > sdf.count(){code} > > works fine. However: > > {code:java} > sdf = sdf.cache() > sdf.count() > {code} > does not, and produces error > {code:java} > Py4JJavaError: An error occurred while calling o314.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 > in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 > (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable > number: "(N/A)" > at java.text.NumberFormat.parse(NumberFormat.java:350) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution
[ https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773284#comment-16773284 ] Bryan Cutler commented on SPARK-26858: -- {quote} (One other possibility I was thinking about batches without Schema is that we just send Arrow batch by Arrow batch, deserialize each batch to RecordBatch instance, and then construct an Arrow table, which is pretty different from Python side and hacky) {quote} The point from my previous comment is that you can't deserialize a RecordBatch and make a Table without a schema. I think these options make the most sense: 1) Have the user define a schema beforehand, then you can just send serialized RecordBatches back to the driver. 2) Send a complete Arrow stream (schema + RecordBatches) from the executor, then merge streams in the driver JVM, discarding duplicate schemas, and send one final stream to R. 3) Same as (2) but instead of merging streams, send each separate stream through to the R driver where they are read and concatenated into one Table - I'm not sure if the Arrow R api will support this though. > Vectorized gapplyCollect, Arrow optimization in native R function execution > --- > > Key: SPARK-26858 > URL: https://issues.apache.org/jira/browse/SPARK-26858 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Unlike gapply, gapplyCollect requires additional ser/de steps because it can > omit the schema, and Spark SQL doesn't know the return type before actually > execution happens. > In original code path, it's done via using binary schema. Once gapply is done > (SPARK-26761). we can mimic this approach in vectorized gapply to support > gapplyCollect. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
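Option (2) in the comment above, merging complete streams and discarding the duplicate schemas, can be sketched with a toy model. This is only an illustration under stated assumptions: each stream is modelled as a plain `(schema, batches)` tuple, the function name `merge_streams` is hypothetical, and nothing here uses the Arrow IPC format or any Arrow API.

```python
def merge_streams(streams):
    """Merge (schema, batches) streams into one stream with a single schema.

    A toy stand-in for "schema + RecordBatches" streams; not Arrow itself.
    """
    merged_schema = None
    merged_batches = []
    for schema, batches in streams:
        if merged_schema is None:
            merged_schema = schema               # keep the first schema
        elif schema != merged_schema:
            raise ValueError("streams have different schemas")
        merged_batches.extend(batches)           # later schemas are dropped as duplicates
    if merged_schema is None:
        raise ValueError("no streams to merge")
    return merged_schema, merged_batches


# Two executors return streams that share one schema.
schema = (("id", "int64"), ("v", "float64"))
executor_streams = [
    (schema, [[(1, 1.0), (1, 2.0)]]),
    (schema, [[(2, 3.0)], [(2, 5.0), (2, 10.0)]]),
]
final_schema, final_batches = merge_streams(executor_streams)
print(final_schema, len(final_batches))
```

The sketch also makes the constraint from the comment visible: the batches alone say nothing about column names or types, so at least one schema has to travel with them before a table can be built.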
[jira] [Commented] (SPARK-23961) pyspark toLocalIterator throws an exception
[ https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784843#comment-16784843 ]

Bryan Cutler commented on SPARK-23961:
--

I could also reproduce with a nearly identical error using the following:

{code}
import time
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, udf
from pyspark.sql.types import *

spark = SparkSession\
    .builder\
    .appName("toLocalIterator_Test")\
    .getOrCreate()

df = spark.range(1 << 16).select(rand())
it = df.toLocalIterator()
print(next(it))
it = None
time.sleep(5)
spark.stop()
{code}

I think there are a couple of issues with the way this currently works. When toLocalIterator is called in Python, the Scala side also creates a local iterator, which immediately starts a loop to consume the entire iterator and write it all to Python without any synchronization with the Python iterator. The write operation blocks only when the socket receive buffer is full. Small examples work fine if the data all fits in the read buffer, but the above code fails because the writing becomes blocked, then the Python iterator stops reading and closes the connection, which the Scala side sees as an error. I can work on a fix for this.

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
> Reporter: Michel Lemay
> Priority: Minor
> Labels: DataFrame, pyspark
>
> Given a dataframe, use toLocalIterator. If we do not consume all records, it will throw:
> {quote}ERROR PythonRDD: Error while sending iterator
> java.net.SocketException: Connection reset by peer: socket write error
> at java.net.SocketOutputStream.socketWrite0(Native Method)
> at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
> at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
> at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
> at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
> at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
> at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
> at org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
> df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
> b = df.toLocalIterator()
> print(len(list(itertools.islice(b, 20))))
> b = None # Make the iterator go out of scope. Throws here.
> {quote}
>
> Observations:
> * Consuming all records does not throw. Taking only a subset of the partitions creates the error.
> * In another experiment, doing the same on a regular RDD works if we cache/materialize it. If we do not cache the RDD, it throws similarly.
> * It works in scala shell
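The failure mode described in the comment (an eager writer, and a reader that stops early and closes the connection) can be reproduced without Spark. The sketch below is an analogy built on Python's standard `socket` and `threading` modules, not the PySpark or Scala code path; the names `eager_writer` and `read_one_row_then_close`, the row size, and the row count are all assumptions chosen so that the data cannot fit in the socket buffers.

```python
import socket
import threading

ROW = 8 * 1024  # bytes written per "row"; an arbitrary size for this sketch


def eager_writer(sock, n_rows, errors):
    """Write every row immediately, without waiting for the reader to ask."""
    try:
        for i in range(n_rows):
            sock.sendall(i.to_bytes(8, "big") * (ROW // 8))
    except OSError as exc:
        # The peer closed the connection while we were still writing.
        errors.append(exc)
    finally:
        sock.close()


def read_one_row_then_close(n_rows=10_000):
    """Read a single row, then close the socket like a dropped iterator."""
    reader, writer = socket.socketpair()
    errors = []
    thread = threading.Thread(target=eager_writer, args=(writer, n_rows, errors))
    thread.start()
    data = b""
    while len(data) < ROW:
        chunk = reader.recv(ROW - len(data))
        if not chunk:
            break
        data += chunk
    reader.close()   # stop consuming while the writer is still mid-stream
    thread.join()
    return int.from_bytes(data[:8], "big"), errors


first_value, errors = read_one_row_then_close()
print(first_value, [type(e).__name__ for e in errors])
```

The writer blocks once the buffers fill and then fails as soon as the reader closes, which mirrors the "Connection reset by peer" seen on the Scala side. The exact exception type depends on the platform. A fix along the lines suggested in the comment would have the writer send data only when the reader requests it.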
[jira] [Commented] (SPARK-27039) toPandas with Arrow swallows maxResultSize errors
[ https://issues.apache.org/jira/browse/SPARK-27039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783690#comment-16783690 ] Bryan Cutler commented on SPARK-27039: -- I was able to reproduce in v2.4.0, but it looks like current master raises an error in the driver and does not return an empty Pandas DataFrame. This is probably due to some of the recent changes in toPandas() with Arrow enabled. {noformat} In [4]: spark.conf.set('spark.sql.execution.arrow.enabled', True) In [5]: import pyspark.sql.functions as F ...: df = spark.range(1000 * 1000) ...: df_pd = df.withColumn("test", F.lit("this is a long string that should make the resulting dataframe too large for maxRe ...: sult which is 1m")).toPandas() ...: 19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 1 tasks (13.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB) 19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 2 tasks (26.4 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB) Exception in thread "serve-Arrow" org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (13.2 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB) at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1938) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1926) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1925) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1925) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:935) at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:935) at scala.Option.foreach(Option.scala:274) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2155) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2104) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2093) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:746) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2008) at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2(Dataset.scala:3300) at org.apache.spark.sql.Dataset.$anonfun$collectAsArrowToPython$2$adapted(Dataset.scala:3265) at org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$2(PythonRDD.scala:442) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319) at org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1(PythonRDD.scala:444) at org.apache.spark.api.python.PythonRDD$.$anonfun$serveToStream$1$adapted(PythonRDD.scala:439) at org.apache.spark.api.python.PythonServer$$anon$3.run(PythonRDD.scala:890) /home/bryan/git/spark/python/pyspark/sql/dataframe.py:2129: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation. 
warnings.warn(msg) 19/03/04 10:54:56 ERROR TaskSetManager: Total size of serialized results of 3 tasks (39.6 MiB) is bigger than spark.driver.maxResultSize (1024.0 KiB) [Stage 0:==>(1 + 7) / 8][Stage 1:> (0 + 8) / 8]--- EOFError Traceback (most recent call last) in () 1 import pyspark.sql.functions as F 2 df = spark.range(1000 * 1000) > 3 df_pd = df.withColumn("test", F.lit("this is a long string that should make the resulting dataframe too large for maxResult which is 1m")).toPandas() /home/bryan/git/spark/python/pyspark/sql/dataframe.pyc in toPandas(self) 2111 _check_dataframe_localize_timestamps 2112 import pyarrow -> 2113 batches = self._collectAsArrow() 2114 if len(batches) > 0: 2115 table =
[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`
[ https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774566#comment-16774566 ] Bryan Cutler commented on SPARK-26943: -- Could you please provide a complete script to reproduce? Also, have you tried a newer version of Spark other than 2.1.0? > Weird behaviour with `.cache()` > --- > > Key: SPARK-26943 > URL: https://issues.apache.org/jira/browse/SPARK-26943 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Will Uto >Priority: Major > > > {code:java} > sdf.count(){code} > > works fine. However: > > {code:java} > sdf = sdf.cache() > sdf.count() > {code} > does not, and produces error > {code:java} > Py4JJavaError: An error occurred while calling o314.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 > in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 > (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable > number: "(N/A)" > at java.text.NumberFormat.parse(NumberFormat.java:350) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality
Bryan Cutler created SPARK-27163: Summary: Cleanup and consolidate Pandas UDF functionality Key: SPARK-27163 URL: https://issues.apache.org/jira/browse/SPARK-27163 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Bryan Cutler Some of the code for Pandas UDFs can be cleaned up and consolidated to remove duplicated parts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality
[ https://issues.apache.org/jira/browse/SPARK-27163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27163: - Priority: Minor (was: Major) > Cleanup and consolidate Pandas UDF functionality > > > Key: SPARK-27163 > URL: https://issues.apache.org/jira/browse/SPARK-27163 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Minor > > Some of the code for Pandas UDFs can be cleaned up and consolidated to remove > duplicated parts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)
[ https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-23836. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23900 https://github.com/apache/spark/pull/23900 > Support returning StructType to the level support in GroupedMap Arrow's > "scalar" UDFS (or similar) > -- > > Key: SPARK-23836 > URL: https://issues.apache.org/jira/browse/SPARK-23836 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: holdenk >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > Currently not all of the supported types can be returned from the scalar > pandas UDF type. This means if someone wants to return a struct type doing a > map operation right now they either have to do a "junk" groupBy or use the > non-vectorized results. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)
[ https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-23836: Assignee: Bryan Cutler > Support returning StructType to the level support in GroupedMap Arrow's > "scalar" UDFS (or similar) > -- > > Key: SPARK-23836 > URL: https://issues.apache.org/jira/browse/SPARK-23836 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: holdenk >Assignee: Bryan Cutler >Priority: Major > > Currently not all of the supported types can be returned from the scalar > pandas UDF type. This means if someone wants to return a struct type doing a > map operation right now they either have to do a "junk" groupBy or use the > non-vectorized results. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26858) Vectorized gapplyCollect, Arrow optimization in native R function execution
[ https://issues.apache.org/jira/browse/SPARK-26858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772266#comment-16772266 ] Bryan Cutler commented on SPARK-26858: -- [~hyukjin.kwon] actually {{pyarrow.Table.from_batches}} does require a schema, but it's a little tricky. In python, to create a {{RecordBatch}} object, a schema is always required. Then it attaches to the {{RecordBatch}} object, so you can always get the schema from a batch and do things like {{Table.from_batches}} with just a list of batches. Look at the full api [here|https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.from_batches] and you can see the schema is optional and it just pulls from the first batch. Serialized record batches are a bit different and do not contain the schema, which is why the stream protocol goes {{Schema, RecordBatch, RecordBatch, ..}}. So you can't just send serialized Arrow batches and then build a Table. Hopefully that makes sense :P Back on your sequence diagram at step (9), does R write the data frame row by row bytes over the socket? I'm wondering how it gets serialized to see where Arrow might best help out. > Vectorized gapplyCollect, Arrow optimization in native R function execution > --- > > Key: SPARK-26858 > URL: https://issues.apache.org/jira/browse/SPARK-26858 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Unlike gapply, gapplyCollect requires additional ser/de steps because it can > omit the schema, and Spark SQL doesn't know the return type before actually > execution happens. > In original code path, it's done via using binary schema. Once gapply is done > (SPARK-26761). we can mimic this approach in vectorized gapply to support > gapplyCollect. 
[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-26566: - Description: Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users: * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 * Java, Reduce heap usage for variable width vectors, ARROW-4147 * Binary identity cast not implemented, ARROW-4101 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 * Error reading IPC file with no record batches, ARROW-3894 * Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790 * from_pandas gives incorrect results when converting floating point to bool, ARROW-3428 * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048 complete list [here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0] PySpark requires the following fixes to work with PyArrow 0.12.0 * Encrypted pyspark worker fails due to ChunkedStream missing closed property * pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64 * ArrowTests fails due to difference in raised error message * pyarrow.open_stream deprecated * tests fail because groupby adds index column with duplicate name was: _This is just a placeholder for now to collect what needs to be fixed when we upgrade next time_ Version 0.12.0 includes the following: * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler 
>Priority: Major > > Version 0.12.0 includes the following selected fixes/improvements relevant to > Spark users: > * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 > * Java, Reduce heap usage for variable width vectors, ARROW-4147 > * Binary identity cast not implemented, ARROW-4101 > * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 > * conversion to date object no longer needed, ARROW-3910 > * Error reading IPC file with no record batches, ARROW-3894 > * Signed to unsigned integer cast yields incorrect results when type sizes > are the same, ARROW-3790 > * from_pandas gives incorrect results when converting floating point to bool, > ARROW-3428 > * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / > libboost issue), ARROW-3048 > complete list > [here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0] > PySpark requires the following fixes to work with PyArrow 0.12.0 > * Encrypted pyspark worker fails due to ChunkedStream missing closed property > * pyarrow now converts dates as objects by default, which causes error > because type is assumed datetime64 > * ArrowTests fails due to difference in raised error message > * pyarrow.open_stream deprecated > * tests fail because groupby adds index column with duplicate name > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26566) Upgrade apache/arrow to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-26566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-26566: - Description: Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users: * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 * Java, Reduce heap usage for variable width vectors, ARROW-4147 * Binary identity cast not implemented, ARROW-4101 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 * Error reading IPC file with no record batches, ARROW-3894 * Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790 * from_pandas gives incorrect results when converting floating point to bool, ARROW-3428 * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048 * Java update to official Flatbuffers version 1.9.0, ARROW-3175 complete list [here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0] PySpark requires the following fixes to work with PyArrow 0.12.0 * Encrypted pyspark worker fails due to ChunkedStream missing closed property * pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64 * ArrowTests fails due to difference in raised error message * pyarrow.open_stream deprecated * tests fail because groupby adds index column with duplicate name was: Version 0.12.0 includes the following selected fixes/improvements relevant to Spark users: * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 * Java, Reduce heap usage for variable width vectors, ARROW-4147 * Binary identity cast not implemented, ARROW-4101 * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 * conversion to date object no longer needed, ARROW-3910 * Error reading IPC 
file with no record batches, ARROW-3894 * Signed to unsigned integer cast yields incorrect results when type sizes are the same, ARROW-3790 * from_pandas gives incorrect results when converting floating point to bool, ARROW-3428 * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / libboost issue), ARROW-3048 complete list [here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0] PySpark requires the following fixes to work with PyArrow 0.12.0 * Encrypted pyspark worker fails due to ChunkedStream missing closed property * pyarrow now converts dates as objects by default, which causes error because type is assumed datetime64 * ArrowTests fails due to difference in raised error message * pyarrow.open_stream deprecated * tests fail because groupby adds index column with duplicate name > Upgrade apache/arrow to 0.12.0 > -- > > Key: SPARK-26566 > URL: https://issues.apache.org/jira/browse/SPARK-26566 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Major > > Version 0.12.0 includes the following selected fixes/improvements relevant to > Spark users: > * Safe cast fails from numpy float64 array with nans to integer, ARROW-4258 > * Java, Reduce heap usage for variable width vectors, ARROW-4147 > * Binary identity cast not implemented, ARROW-4101 > * pyarrow open_stream deprecated, use ipc.open_stream, ARROW-4098 > * conversion to date object no longer needed, ARROW-3910 > * Error reading IPC file with no record batches, ARROW-3894 > * Signed to unsigned integer cast yields incorrect results when type sizes > are the same, ARROW-3790 > * from_pandas gives incorrect results when converting floating point to bool, > ARROW-3428 > * Import pyarrow fails if scikit-learn is installed from conda (boost-cpp / > libboost issue), ARROW-3048 > * Java update to official Flatbuffers version 1.9.0, 
ARROW-3175 > complete list > [here|https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%20AND%20fixVersion%20%3D%200.12.0] > PySpark requires the following fixes to work with PyArrow 0.12.0 > * Encrypted pyspark worker fails due to ChunkedStream missing closed property > * pyarrow now converts dates as objects by default, which causes error > because type is assumed datetime64 > * ArrowTests fails due to difference in raised error message > * pyarrow.open_stream deprecated > * tests fail because groupby adds index column with duplicate name > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813896#comment-16813896 ] Bryan Cutler commented on SPARK-27389: -- Thanks [~shaneknapp] for the fix. I couldn't come up with any idea why this was happening all of a sudden either, but at least we are up and running again! > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: shane knapp >Priority: Major > > I've seen a few odd PR build failures w/ an error in pyspark tests about > "UnknownTimeZoneError: 'US/Pacific-New'", e.g. > https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull > A bit of searching tells me that US/Pacific-New probably isn't really > supposed to be a timezone at all: > https://mm.icann.org/pipermail/tz/2009-February/015448.html > I'm guessing that this is from some misconfiguration of jenkins. That said, > I can't figure out what is wrong. There does seem to be a timezone entry for > US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to > be there on every amp-jenkins-worker, so I don't see why that alone would cause > this failure only some of the time. > [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be > totally wrong here and it may really be a pyspark problem. 
> Full Stack trace from the test failure: > {noformat} > == > ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 522, in test_to_pandas > pdf = self._to_pandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py", > line 517, in _to_pandas > return df.toPandas() > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py", > line 2189, in toPandas > _check_series_convert_timestamps_local_tz(pdf[field.name], timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1891, in _check_series_convert_timestamps_local_tz > return _check_series_convert_timestamps_localize(s, None, timezone) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1877, in _check_series_convert_timestamps_localize > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", > line 2294, in apply > mapped = lib.map_infer(values, f, convert=convert_dtype) > File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer > (pandas/lib.c:66124) > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py", > line 1878, in > if ts is not pd.NaT else pd.NaT) > File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert > (pandas/tslib.c:13923) > File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ > (pandas/tslib.c:10447) > File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject > (pandas/tslib.c:27504) > File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz > (pandas/tslib.c:32362) > File 
"/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line > 178, in timezone > raise UnknownTimeZoneError(zone) > UnknownTimeZoneError: 'US/Pacific-New' > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
[ https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812877#comment-16812877 ] Bryan Cutler commented on SPARK-27389: -- [~shaneknapp], I had a couple of successful tests with worker-4. Do you know if the problem is consistent on certain workers or just random on all of them? > pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'" > - > > Key: SPARK-27389 > URL: https://issues.apache.org/jira/browse/SPARK-27389 > Project: Spark > Issue Type: Task > Components: jenkins, PySpark >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major
[jira] [Assigned] (SPARK-27387) Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests
[ https://issues.apache.org/jira/browse/SPARK-27387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-27387: Assignee: Bryan Cutler > Replace sqlutils assertPandasEqual with Pandas assert_frame_equal in tests > -- > > Key: SPARK-27387 > URL: https://issues.apache.org/jira/browse/SPARK-27387 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Affects Versions: 2.4.1 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 3.0.0 > > > In PySpark unit tests, sqlutils ReusedSQLTestCase.assertPandasEqual is meant > to check whether two pandas.DataFrames are equal, but with later versions of > Pandas it can fail if the DataFrame has an array column. This method can > be replaced by {{assert_frame_equal}} from pandas.util.testing, which is > designed for this purpose and gives a better assertion message as well. > The test failure I have seen is: > {noformat} > == > ERROR: test_supported_types > (pyspark.sql.tests.test_pandas_udf_grouped_map.GroupedMapPandasUDFTests) > -- > Traceback (most recent call last): > File > "/home/bryan/git/spark/python/pyspark/sql/tests/test_pandas_udf_grouped_map.py", > line 128, in test_supported_types > self.assertPandasEqual(expected1, result1) > File "/home/bryan/git/spark/python/pyspark/testing/sqlutils.py", line 268, > in assertPandasEqual > self.assertTrue(expected.equals(result), msg=msg) > File "/home/bryan/miniconda2/envs/pa012/lib/python3.6/site-packages/pandas > ... > File "pandas/_libs/lib.pyx", line 523, in > pandas._libs.lib.array_equivalent_object > ValueError: The truth value of an array with more than one element is > ambiguous. Use a.any() or a.all() > {noformat}
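To illustrate the replacement proposed in SPARK-27387, here is a minimal sketch that compares two pandas DataFrames holding a list-valued (array) column with {{assert_frame_equal}}. It imports from pandas.testing, the public location in newer pandas releases (the issue text refers to pandas.util.testing). The frames, column names and the frames_differ helper are made up for illustration and are not taken from the Spark test suite.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames with an array-like (list) column.
expected = pd.DataFrame({"id": [1, 2], "arr": [[1, 2], [3, 4]]})
result = pd.DataFrame({"id": [1, 2], "arr": [[1, 2], [3, 4]]})

# Passes silently when the frames match.
assert_frame_equal(expected, result)


def frames_differ(left, right):
    """Return the assertion message if the frames differ, else None."""
    try:
        assert_frame_equal(left, right)
    except AssertionError as exc:
        return str(exc)
    return None


# One element of the array column differs, so a message is produced.
changed = pd.DataFrame({"id": [1, 2], "arr": [[1, 2], [3, 5]]})
message = frames_differ(expected, changed)
print(message)
```

Unlike a bare assertTrue(expected.equals(result)), the failure message names the column that differs, which is the "better assertion message" the issue refers to.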
[jira] [Commented] (SPARK-27463) SPIP: Support Dataframe Cogroup via Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-27463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16842335#comment-16842335 ] Bryan Cutler commented on SPARK-27463: -- [~d80tb7] I think you could remove the SPIP label from this and begin work. It will require some tweaks to the Python worker and add a new API, but not major changes and additions like other SPIPs. If others feel differently though, we could continue with the SPIP process. > SPIP: Support Dataframe Cogroup via Pandas UDFs > > > Key: SPARK-27463 > URL: https://issues.apache.org/jira/browse/SPARK-27463 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Chris Martin >Priority: Major > Labels: SPIP > > Recent work on Pandas UDFs in Spark has allowed for improved > interoperability between Pandas and Spark. This proposal aims to extend this > by introducing a new Pandas UDF type which would allow for a cogroup > operation to be applied to two PySpark DataFrames. > Full details are in the Google document linked below. >
[jira] [Resolved] (SPARK-27712) createDataFrame() reorders row
[ https://issues.apache.org/jira/browse/SPARK-27712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-27712. -- Resolution: Duplicate > createDataFrame() reorders row > -- > > Key: SPARK-27712 > URL: https://issues.apache.org/jira/browse/SPARK-27712 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: emr-5.20.0 > PySpark 2.4.0 > Python 2.7.15 >Reporter: Tim Ludwinski >Priority: Major > Labels: correctness > > Executing the following: > {code:java} > my_schema = pyspark.sql.types.StructType([ > pyspark.sql.types.StructField("B", pyspark.sql.types.StringType(), True), > pyspark.sql.types.StructField("A", pyspark.sql.types.StringType(), True) > ]) > spark.createDataFrame(spark.sparkContext.parallelize([pyspark.sql.Row(A="1", > B="2")]), my_schema).collect() > {code} > should produce this: > {code:java} > [Row(A="1", B="2")] > {code} > or this: > {code:java} > [Row(B='2', A='1')] > {code} > but produces this instead: > {code:java} > [Row(B=u'1', A=u'2')] > {code}
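The output reported in SPARK-27712 is consistent with two behaviours of PySpark 2.x: a {{Row}} built from keyword arguments stores its fields sorted by name, and {{createDataFrame}} with an explicit schema then pairs values with schema fields by position rather than by name. The pure-Python sketch below only imitates that behaviour to show where the swap comes from. make_row and apply_schema_by_position are hypothetical helpers written for this illustration, not PySpark APIs.

```python
def make_row(**kwargs):
    """Imitate a keyword-built Row in PySpark 2.x: fields sorted by name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[name] for name in names)


def apply_schema_by_position(schema_names, values):
    """Pair schema field names with values purely by position."""
    return list(zip(schema_names, values))


# Schema declared as (B, A), row built as Row(A="1", B="2").
schema_names = ["B", "A"]
row_names, row_values = make_row(A="1", B="2")

# The row holds its values in sorted-name order: A first, then B.
print(row_names, row_values)

# Positional pairing gives B the value meant for A, and vice versa.
print(apply_schema_by_position(schema_names, row_values))

# Supplying plain values in schema order does not depend on the sorted names.
print(apply_schema_by_position(schema_names, ("2", "1")))
```

Under these assumptions, passing tuples already ordered like the schema sidesteps the mismatch.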
[jira] [Updated] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27805: - Affects Version/s: (was: 3.1.0) 2.4.3 > toPandas does not propagate SparkExceptions with arrow enabled > -- > > Key: SPARK-27805 > URL: https://issues.apache.org/jira/browse/SPARK-27805 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.3 >Reporter: David Vogelbacher >Assignee: David Vogelbacher >Priority: Major > Fix For: 3.0.0 > > > When calling {{toPandas}} with arrow enabled, errors encountered during the > collect are not propagated to the Python process. > Only a very general {{EOFError}} is raised. > Example of behavior with arrow enabled vs. arrow disabled: > {noformat} > import traceback > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > def raise_exception(): > raise Exception("My error") > error_udf = udf(raise_exception, IntegerType()) > df = spark.range(3).toDF("i").withColumn("x", error_udf()) > try: > df.toPandas() > except: > no_arrow_exception = traceback.format_exc() > spark.conf.set("spark.sql.execution.arrow.enabled", "true") > try: > df.toPandas() > except: > arrow_exception = traceback.format_exc() > print no_arrow_exception > print arrow_exception > {noformat} > {{arrow_exception}} gives as output: > {noformat} > >>> print arrow_exception > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2143, in toPandas > batches = self._collectAsArrow() > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2205, in _collectAsArrow > results = list(_load_from_socket(sock_info, ArrowCollectSerializer())) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 210, in load_stream > num = read_int(stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 810, in 
read_int > raise EOFError > EOFError > {noformat} > {{no_arrow_exception}} gives as output: > {noformat} > Traceback (most recent call last): > File "", line 2, in > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 2166, in toPandas > pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/dataframe.py", line > 516, in collect > sock_info = self._jdf.collectToPython() > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", > line 1286, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/Users/dvogelbacher/git/spark/python/pyspark/sql/utils.py", line 89, > in deco > return f(*a, **kw) > File > "/Users/dvogelbacher/git/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o38.collectToPython. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 > in stage 0.0 failed 1 times, most recent failure: Lost task 7.0 in stage 0.0 > (TID 7, localhost, executor driver): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 428, in main > process() > File > "/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 423, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 438, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 141, in dump_stream > for obj in iterator: > File "/Users/dvogelbacher/git/spark/python/pyspark/serializers.py", line > 427, in _batched > for item in iterator: > File "", line 1, in > File > 
"/Users/dvogelbacher/git/spark/python/lib/pyspark.zip/pyspark/worker.py", > line 86, in > return lambda *a: f(*a) > File "/Users/dvogelbacher/git/spark/python/pyspark/util.py", line 99, in > wrapper > return f(*args, **kwargs) > File "", line 2, in raise_exception > Exception: My error > ... > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-27805. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24677 [https://github.com/apache/spark/pull/24677] > toPandas does not propagate SparkExceptions with arrow enabled > -- > > Key: SPARK-27805 > URL: https://issues.apache.org/jira/browse/SPARK-27805 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: David Vogelbacher >Assignee: David Vogelbacher >Priority: Major > Fix For: 3.0.0
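The bare {{EOFError}} in the Arrow traceback for SPARK-27805 comes from the reader reaching the end of the socket stream while it expects a length-prefixed integer, so the original failure never reaches the Python process. A common remedy for this kind of problem is for the writer to send an explicit error marker followed by the message before it closes the stream. The sketch below illustrates that general pattern with in-memory streams. It is not the code from pull request 24677, and the marker value and all helper names are assumptions made for illustration.

```python
import io
import struct

ERROR_MARKER = -1  # assumed sentinel meaning "an error message follows"


def read_int(stream):
    """Read a big-endian 4-byte int; raise EOFError if the stream ended."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError
    return struct.unpack("!i", data)[0]


def write_error(stream, message):
    """Writer side: send the marker, then the length-prefixed UTF-8 message."""
    payload = message.encode("utf-8")
    stream.write(struct.pack("!i", ERROR_MARKER))
    stream.write(struct.pack("!i", len(payload)))
    stream.write(payload)


def load_batch_count(stream):
    """Reader side: return the batch count, or raise the reported error."""
    num = read_int(stream)
    if num == ERROR_MARKER:
        length = read_int(stream)
        raise RuntimeError(stream.read(length).decode("utf-8"))
    return num


def describe(stream):
    """Return a short description of what the reader observed."""
    try:
        return "batches: %d" % load_batch_count(stream)
    except EOFError:
        return "EOFError"
    except RuntimeError as exc:
        return "RuntimeError: %s" % exc


# Writer failed and simply closed the stream: the reader only sees EOFError.
closed_early = io.BytesIO(b"")

# Writer reported the failure before closing: the message reaches the reader.
reported = io.BytesIO()
write_error(reported, "My error")
reported.seek(0)

print(describe(closed_early))
print(describe(reported))
```

The design point is that the reader can only tell "no more data" apart from "the job failed" if the writer encodes the difference in the stream itself.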
[jira] [Assigned] (SPARK-27805) toPandas does not propagate SparkExceptions with arrow enabled
[ https://issues.apache.org/jira/browse/SPARK-27805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-27805: Assignee: David Vogelbacher > toPandas does not propagate SparkExceptions with arrow enabled > -- > > Key: SPARK-27805 > URL: https://issues.apache.org/jira/browse/SPARK-27805 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: David Vogelbacher >Assignee: David Vogelbacher >Priority: Major
[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855969#comment-16855969 ] Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:13 PM: -- Linked to a similar problem with Python {{Row}} class, SPARK-22232 was (Author: bryanc): Another problem with Python {{Row}} class > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor
[jira] [Resolved] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-27939. -- Resolution: Not A Problem > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor > > When I try to define a dataframe schema which has a VectorUDT field, I run > into an error when the VectorUDT field is not the last element of the > StructType list. > The following example causes the error below: > {code:java} > // from pyspark.sql import functions as F > from pyspark.sql import types as T > from pyspark.sql import Row > from pyspark.ml.linalg import VectorUDT, SparseVector > #VectorUDT should be the last structfield > train_schema = T.StructType([ > T.StructField('features', VectorUDT()), > T.StructField('SALESCLOSEPRICE', T.IntegerType()) > ]) > > train_df = spark.createDataFrame( > [Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, 2: 980.0, 3: 10.0, > 5: 980.0, 6: 1858.0, 7: 1858.0, 8: 980.0, 9: 1950.0, 10: 1.28, 11: 1.0, 12: > 1.0, 15: 2.0, 16: 3.0, 20: 2017.0, 21: 7.0, 22: 28.0, 23: 15.0, 24: 196.0, > 25: 25.0, 26: -1.0, 27: 4.03, 28: 3.96, 29: 3.88, 30: 3.9, 31: 3.91, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 49.8, 36: 11.9, 37: 2.7, 38: 0.2926, 39: 142.7551, > 40: 980.0, 41: 0.0133, 42: 1.5, 43: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: > -1.0, 55: -1.0, 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 81: 1.0, 89: > 1.0, 95: 1.0, 96: 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 123: > 1.0, 133: 1.0}), SALESCLOSEPRICE=143000), > Row(features=SparseVector(135, {0: 21.0, 1: 21.0, 2: 1144.0, 3: 4.0, > 5: 1268.0, 6: 1640.0, 7: 1640.0, 8: 2228.0, 9: 1971.0, 10: 0.32, 11: 1.0, 14: > 2.0, 15: 3.0, 16: 4.0, 17: 960.0, 20: 2017.0, 21: 10.0, 22: 41.0, 23: 9.0, > 24: 282.0, 25: 2.0, 26: -1.0, 27: 3.91, 28: 3.85, 29: 
3.83, 30: 3.83, 31: > 3.78, 32: 32.2, 33: 49.0, 34: 18.8, 35: 14.0, 36: 35.8, 37: 14.6, 38: 0.4392, > 39: 94.2549, 40: 2228.0, 41: 0.0078, 42: 1., 43: -1.0, 44: -1.0, 45: > -1.0, 46: -1.0, 47: -1.0, 48: -1.0, 49: -1.0, 50: -1.0, 52: 1.0, 55: -1.0, > 56: -1.0, 57: -1.0, 62: 1.0, 68: 1.0, 77: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: > 1.0, 101: 1.0, 103: 1.0, 108: 1.0, 114: 1.0, 115: 1.0, 124: 1.0, 133: 1.0}), > SALESCLOSEPRICE=19), > Row(features=SparseVector(135, {0: 225000.0, 1: 225000.0, 2: 1102.0, 3: > 28.0, 5: 1102.0, 6: 2390.0, 7: 2390.0, 8: 1102.0, 9: 1949.0, 10: 0.822, 11: > 1.0, 15: 1.0, 16: 2.0, 20: 2017.0, 21: 6.0, 22: 26.0, 23: 26.0, 24: 177.0, > 25: 25.0, 26: -1.0, 27: 3.88, 28: 3.9, 29: 3.91, 30: 3.89, 31: 3.94, 32: 9.8, > 33: 22.4, 34: 67.8, 35: 61.7, 36: 2.7, 38: 0.4706, 39: 204.1742, 40: 1102.0, > 41: 0.0106, 42: 2.0, 49: 1.0, 51: -1.0, 52: -1.0, 53: -1.0, 54: -1.0, 57: > 1.0, 62: 1.0, 68: 1.0, 70: 1.0, 79: 1.0, 89: 1.0, 92: 1.0, 96: 1.0, 100: 1.0, > 103: 1.0, 108: 1.0, 110: 1.0, 115: 1.0, 123: 1.0, 131: 1.0, 132: 1.0}), > SALESCLOSEPRICE=225000) > ], schema=train_schema) > > train_df.printSchema() > train_df.show() > {code} > Error message: > {code:java} > // Fail to execute line 17: ], schema=train_schema) Traceback (most recent > call last): File "/tmp/zeppelin_pyspark-3793375738105660281.py", line 375, in > exec(code, _zcUserQueryNameSpace) File "", line 17, in > File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", > line 748, in createDataFrame rdd, schema = self._createFromLocal(map(prepare, > data), schema) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > _createFromLocal data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 429, in > data = [schema.toInternal(row) for row in data] File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > toInternal for f, v, c in zip(self.fields, obj, self._needConversion)) File > 
"/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 604, in > for f, v, c in zip(self.fields, obj, self._needConversion)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 442, in > toInternal return self.dataType.toInternal(obj) File > "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 685, in > toInternal return self._cachedSqlType().toInternal(self.serialize(obj)) File > "/opt/spark/python/lib/pyspark.zip/pyspark/ml/linalg/__init__.py", line 167, > in serialize raise TypeError("cannot serialize %r of type %r" % (obj, > type(obj))) TypeError: cannot serialize 143000 of type {code} > I don't get
[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855969#comment-16855969 ] Bryan Cutler commented on SPARK-27939: -- Another problem with Python {{Row}} class > Defining a schema with VectorUDT > > > Key: SPARK-27939 > URL: https://issues.apache.org/jira/browse/SPARK-27939 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Johannes Schaffrath >Priority: Minor
[jira] [Comment Edited] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855966#comment-16855966 ]

Bryan Cutler edited comment on SPARK-27939 at 6/4/19 6:11 PM:
--

The problem is that the {{Row}} class sorts the field names alphabetically, which puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']
{noformat}
This is by design, but it is not intuitive and has caused lots of problems. Hopefully, we can improve this for Spark 3.0.0.

You can either specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or, if you want to have keywords, define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), SALESCLOSEPRICE=143000)
{noformat}

> Defining a schema with VectorUDT
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 2.4.0
> Reporter: Johannes Schaffrath
> Priority: Minor
[jira] [Commented] (SPARK-27939) Defining a schema with VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-27939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855966#comment-16855966 ]

Bryan Cutler commented on SPARK-27939:
--

The problem is that the {{Row}} class sorts the field names alphabetically, which puts capital letters first and then conflicts with your schema:
{noformat}
r = Row(features=SparseVector(135, {0: 139900.0, 1: 139900.0, ...}), SALESCLOSEPRICE=143000)

In [3]: r.__fields__
Out[3]: ['SALESCLOSEPRICE', 'features']
{noformat}
This is by design, but it is not intuitive and has caused lots of problems.

You can either specify your data as tuples, for example:
{noformat}
In [5]: train_df = spark.createDataFrame([(SparseVector(135, {0: 139900.0}), 143000)], schema=train_schema)

In [6]: train_df.show()
+--------------------+---------------+
|            features|SALESCLOSEPRICE|
+--------------------+---------------+
|(135,[0],[139900.0])|         143000|
+--------------------+---------------+
{noformat}
Or, if you want to have keywords, define your own row class like this:
{noformat}
In [7]: MyRow = Row('features', 'SALESCLOSEPRICE')

In [8]: MyRow(SparseVector(135, {0: 139900.0}), 143000)
Out[8]: Row(features=SparseVector(135, {0: 139900.0}), SALESCLOSEPRICE=143000)
{noformat}

> Defining a schema with VectorUDT
>
> Key: SPARK-27939
> URL: https://issues.apache.org/jira/browse/SPARK-27939
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 2.4.0
> Reporter: Johannes Schaffrath
> Priority: Minor
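The ordering mismatch described in the SPARK-27939 comments can be reproduced without a Spark session. The following is a minimal sketch in plain Python, assuming only what the comment states (keyword field names get sorted, so capital letters come first). `row_values_as_stored` and `pair_with_schema` are hypothetical helpers written for illustration, not PySpark functions, and the string `'<SparseVector>'` stands in for the real vector.

```python
# Illustrative only: mimics the sorting behaviour described for Row in
# Spark 2.4. None of this is PySpark code.

schema_fields = ['features', 'SALESCLOSEPRICE']  # order declared in train_schema


def row_values_as_stored(**kwargs):
    """Return (names, values) ordered by sorted field name, as the comment describes."""
    names = sorted(kwargs)  # uppercase ASCII letters sort before lowercase ones
    return names, [kwargs[name] for name in names]


def pair_with_schema(values):
    """Pair values with the schema fields purely by position."""
    return list(zip(schema_fields, values))


names, values = row_values_as_stored(features='<SparseVector>', SALESCLOSEPRICE=143000)
print(names)                     # ['SALESCLOSEPRICE', 'features']
print(pair_with_schema(values))  # [('features', 143000), ('SALESCLOSEPRICE', '<SparseVector>')]

# Passing a plain tuple in schema order avoids the reordering entirely.
print(pair_with_schema(('<SparseVector>', 143000)))
```

In the mismatched pairing the integer 143000 lands on the 'features' field, which is consistent with the "cannot serialize 143000" TypeError in the report.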
[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
[ https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27992: - Description: Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job are not propagated to Python. This is because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for these was to catch a SparkException if the job errored and then send the exception through the pyspark serializer. A better fix would be to allow Python to synchronize on the serving thread future. That way if the serving thread throws an exception, it will be propagated on the synchronization call. > PySpark socket server should sync with JVM connection thread future > --- > > Key: SPARK-27992 > URL: https://issues.apache.org/jira/browse/SPARK-27992 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Bryan Cutler >Priority: Major
[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
[ https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27992: - Environment: (was: Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job are not propagated to Python. This is because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for these was to catch a SparkException if the job errored and then send the exception through the pyspark serializer. A better fix would be to allow Python to synchronize on the serving thread future. That way if the serving thread throws an exception, it will be propagated on the synchronization call.) > PySpark socket server should sync with JVM connection thread future > --- > > Key: SPARK-27992 > URL: https://issues.apache.org/jira/browse/SPARK-27992 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.3 >Reporter: Bryan Cutler >Priority: Major
[jira] [Created] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
Bryan Cutler created SPARK-27992: Summary: PySpark socket server should sync with JVM connection thread future Key: SPARK-27992 URL: https://issues.apache.org/jira/browse/SPARK-27992 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.3 Environment: Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job are not propagated to Python. This is because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for these was to catch a SparkException if the job errored and then send the exception through the pyspark serializer. A better fix would be to allow Python to synchronize on the serving thread future. That way if the serving thread throws an exception, it will be propagated on the synchronization call. Reporter: Bryan Cutler
[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
[ https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27992: - Affects Version/s: (was: 2.4.3) 3.0.0 > PySpark socket server should sync with JVM connection thread future > --- > > Key: SPARK-27992 > URL: https://issues.apache.org/jira/browse/SPARK-27992 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major
[jira] [Updated] (SPARK-27992) PySpark socket server should sync with JVM connection thread future
[ https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-27992: - Description: Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job are not propagated to Python. This is because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for these was to catch a SparkException if the job errored and then send the exception through the pyspark serializer. A better fix would be to allow Python to await on the serving thread future and join the thread. That way if the serving thread throws an exception, it will be propagated on the call to awaitResult. was: Both SPARK-27805 and SPARK-27548 identified an issue that errors in a Spark job are not propagated to Python. This is because toLocalIterator() and toPandas() with Arrow enabled run Spark jobs asynchronously in a background thread, after creating the socket connection info. The fix for these was to catch a SparkException if the job errored and then send the exception through the pyspark serializer. A better fix would be to allow Python to synchronize on the serving thread future. That way if the serving thread throws an exception, it will be propagated on the synchronization call. > PySpark socket server should sync with JVM connection thread future > --- > > Key: SPARK-27992 > URL: https://issues.apache.org/jira/browse/SPARK-27992 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major
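The idea in the SPARK-27992 description, waiting on the serving thread's future so that a failure surfaces in the caller, can be illustrated with Python's standard library. This is only a sketch of the general pattern and assumes nothing about Spark's actual JVM or socket code. `serve_data`, `run_and_wait`, and `JobFailed` are hypothetical names introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class JobFailed(Exception):
    """Hypothetical stand-in for an error raised while a job is running."""


def serve_data(fail):
    """Hypothetical serving task that runs in a background thread."""
    if fail:
        raise JobFailed("job errored while serving data")
    return "all data served"


def run_and_wait(fail):
    """Start the serving task, then block on its future.

    Future.result() re-raises any exception raised inside the task, so the
    caller sees the failure instead of it being lost in the background thread.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(serve_data, fail)
        return future.result()


print(run_and_wait(fail=False))
try:
    run_and_wait(fail=True)
except JobFailed as exc:
    print("propagated to caller:", exc)
```

Without the call to `result()`, the exception would stay attached to the future and the caller would never learn that the background work had failed, which is the kind of silent failure the issue describes.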
[jira] [Resolved] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-28003. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24844 [https://github.com/apache/spark/pull/24844] > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 2.4.3 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > Fix For: 3.0.0 > > > {code:java} > import pandas as pd > dt1 = [pd.NaT, pd.Timestamp('2019-06-11')] * 100 > pdf1 = pd.DataFrame({'time': dt1}) > df1 = self.spark.createDataFrame(pdf1) > {code} > The example above doesn't work with arrow enabled.
[jira] [Assigned] (SPARK-28003) spark.createDataFrame with Arrow doesn't work with pandas.NaT
[ https://issues.apache.org/jira/browse/SPARK-28003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-28003: Assignee: Li Jin > spark.createDataFrame with Arrow doesn't work with pandas.NaT > -- > > Key: SPARK-28003 > URL: https://issues.apache.org/jira/browse/SPARK-28003 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.3, 2.4.3 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major