[jira] [Created] (SPARK-26242) Leading slash breaks proxying
Ryan Lovett created SPARK-26242: --- Summary: Leading slash breaks proxying Key: SPARK-26242 URL: https://issues.apache.org/jira/browse/SPARK-26242 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.0 Reporter: Ryan Lovett The WebUI prefixes "/" to each link path (e.g. /jobs), which breaks navigation when attempting to proxy the app at another URL. In my case, a pyspark user creates a SparkContext within a JupyterHub-hosted notebook and attempts to proxy it with nbserverproxy off of address.of.jupyter.hub/user/proxy/4040/. Since the WebUI sets the URLs of its pages to begin with "/", the browser sends the user to address.of.jupyter.hub/jobs, address.of.jupyter.hub/stages, etc. Similar: https://github.com/mesosphere/spark/commit/ada99f1b3801e81db2e367f219377e93f5d32655 (see also https://github.com/apache/spark/pull/11369) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
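The breakage can be reproduced outside Spark with nothing but standard URL resolution: a root-relative href resolves against the host root and discards the proxy prefix, while a relative href preserves it. A minimal sketch using Python's urljoin (the hub address is the hypothetical one from the report):

```python
from urllib.parse import urljoin

# The WebUI page is served through the proxy at this base URL.
base = "https://address.of.jupyter.hub/user/proxy/4040/"

# A root-relative link, as emitted by the WebUI, resolves against the
# host root and drops the /user/proxy/4040/ prefix entirely.
print(urljoin(base, "/jobs"))   # https://address.of.jupyter.hub/jobs

# A relative link would keep the proxy prefix intact.
print(urljoin(base, "jobs"))    # https://address.of.jupyter.hub/user/proxy/4040/jobs
```

This is exactly the navigation failure described: the browser, not Spark, performs this resolution, so any proxy that cannot rewrite the HTML is broken by the leading slash.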
[jira] [Resolved] (SPARK-26189) Fix the doc of unionAll in SparkR
[ https://issues.apache.org/jira/browse/SPARK-26189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-26189. -- Resolution: Fixed Assignee: Huaxin Gao Fix Version/s: 3.0.0 Target Version/s: 3.0.0 > Fix the doc of unionAll in SparkR > - > > Key: SPARK-26189 > URL: https://issues.apache.org/jira/browse/SPARK-26189 > Project: Spark > Issue Type: Documentation > Components: R >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > We should fix the doc of unionAll in SparkR. See the discussion: > https://github.com/apache/spark/pull/23131/files#r236760822 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26221) Improve Spark SQL instrumentation and metrics
[ https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705626#comment-16705626 ] Apache Spark commented on SPARK-26221: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/23192 > Improve Spark SQL instrumentation and metrics > - > > Key: SPARK-26221 > URL: https://issues.apache.org/jira/browse/SPARK-26221 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > This is an umbrella ticket for various small improvements for better metrics > and instrumentation. Some thoughts: > > Differentiate query plans that write data out vs. those that return data to the > driver > * I.e. ETL & report generation vs interactive analysis > * This is related to the data sink item below. We need to make sure from the > query plan we can tell what a query is doing > Data sink: Have an operator for data sink, with metrics that can tell us: > * Write time > * Number of records written > * Size of output written > * Number of partitions modified > * Metastore update time > * Also track number of records for collect / limit > Scan > * Track file listing time (start and end so we can construct a timeline, not > just a duration) > * Track metastore operation time > * Track IO decoding time for row-based input sources; need to make sure > overhead is low > Shuffle > * Track read time and write time > * Decide if we can measure serialization and deserialization > Client fetch time > * Sometimes a query takes long to run because it is blocked on the client > fetching results (e.g. using a result iterator). Record the time blocked on > the client so we can remove it when measuring query execution time. > Make it easy to correlate queries with jobs, stages, and tasks belonging to a > single query, e.g. dump execution id in task logs? 
> Better logging: > * Enable logging the query execution id and TID in executor logs, and the query > execution id in driver logs.
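The data-sink wishlist above maps naturally onto a plain record type that tasks fill in and the driver aggregates. A hypothetical sketch of such a container (field names are illustrative, not Spark's actual metrics API):

```python
from dataclasses import dataclass


@dataclass
class DataSinkMetrics:
    """Illustrative container for the proposed data-sink metrics."""
    write_time_ms: int = 0
    records_written: int = 0
    bytes_written: int = 0
    partitions_modified: int = 0
    metastore_update_time_ms: int = 0

    def merge(self, other: "DataSinkMetrics") -> "DataSinkMetrics":
        # Metrics reported by parallel tasks are combined by summation.
        return DataSinkMetrics(
            self.write_time_ms + other.write_time_ms,
            self.records_written + other.records_written,
            self.bytes_written + other.bytes_written,
            self.partitions_modified + other.partitions_modified,
            self.metastore_update_time_ms + other.metastore_update_time_ms,
        )
```

The point of having a dedicated sink operator is that these values then appear in the query plan itself, which also answers the first bullet: a plan whose root is a data sink is ETL, one without is interactive analysis.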
[jira] [Commented] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration
[ https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705629#comment-16705629 ] Apache Spark commented on SPARK-26226: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/23193 > Update query tracker to report timeline for phases, rather than duration > > > Key: SPARK-26226 > URL: https://issues.apache.org/jira/browse/SPARK-26226 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > Fix For: 3.0.0 > > > It'd be more useful to report start and end time for each phase, rather than > only a single duration. This way we can look at timelines and figure out if > there is any unaccounted time. >
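The change is easy to picture: instead of one duration per phase, record a (start, end) pair, so gaps between phases become visible as unaccounted time. A hedged sketch of the idea (not the actual QueryPlanningTracker API):

```python
import time
from contextlib import contextmanager


class PhaseTracker:
    """Record a (start, end) timestamp per phase instead of a bare
    duration, so gaps between phases can be computed afterwards."""

    def __init__(self):
        self.phases = {}  # phase name -> (start_ns, end_ns)

    @contextmanager
    def phase(self, name):
        start = time.monotonic_ns()
        try:
            yield
        finally:
            self.phases[name] = (start, time.monotonic_ns())

    def unaccounted_ns(self, query_start_ns, query_end_ns):
        # Assuming phases do not overlap, the gap is whatever part of
        # the query window no recorded phase covers.
        covered = sum(end - start for start, end in self.phases.values())
        return (query_end_ns - query_start_ns) - covered


# Two tracked phases with untracked work between them: with only
# durations the middle gap is invisible; with timestamps it shows up.
tracker = PhaseTracker()
t0 = time.monotonic_ns()
with tracker.phase("analysis"):
    time.sleep(0.01)
time.sleep(0.01)  # untracked work becomes unaccounted time
with tracker.phase("optimization"):
    time.sleep(0.01)
t1 = time.monotonic_ns()
```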
[jira] [Assigned] (SPARK-26221) Improve Spark SQL instrumentation and metrics
[ https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26221: Assignee: Apache Spark (was: Reynold Xin) > Improve Spark SQL instrumentation and metrics > - > > Key: SPARK-26221 > URL: https://issues.apache.org/jira/browse/SPARK-26221 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Major > > This is an umbrella ticket for various small improvements for better metrics > and instrumentation. Some thoughts: > > Differentiate query plan that’s writing data out, vs returning data to the > driver > * I.e. ETL & report generation vs interactive analysis > * This is related to the data sink item below. We need to make sure from the > query plan we can tell what a query is doing > Data sink: Have an operator for data sink, with metrics that can tell us: > * Write time > * Number of records written > * Size of output written > * Number of partitions modified > * Metastore update time > * Also track number of records for collect / limit > Scan > * Track file listing time (start and end so we can construct timeline, not > just duration) > * Track metastore operation time > * Track IO decoding time for row-based input sources; Need to make sure > overhead is low > Shuffle > * Track read time and write time > * Decide if we can measure serialization and deserialization > Client fetch time > * Sometimes a query take long to run because it is blocked on the client > fetching result (e.g. using a result iterator). Record the time blocked on > client so we can remove it in measuring query execution time. > Make it easy to correlate queries with jobs, stages, and tasks belonging to a > single query, e.g. dump execution id in task logs? > Better logging: > * Enable logging the query execution id and TID in executor logs, and query > execution id in driver logs. 
[jira] [Assigned] (SPARK-26221) Improve Spark SQL instrumentation and metrics
[ https://issues.apache.org/jira/browse/SPARK-26221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26221: Assignee: Reynold Xin (was: Apache Spark) > Improve Spark SQL instrumentation and metrics > - > > Key: SPARK-26221 > URL: https://issues.apache.org/jira/browse/SPARK-26221 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > This is an umbrella ticket for various small improvements for better metrics > and instrumentation. Some thoughts: > > Differentiate query plan that’s writing data out, vs returning data to the > driver > * I.e. ETL & report generation vs interactive analysis > * This is related to the data sink item below. We need to make sure from the > query plan we can tell what a query is doing > Data sink: Have an operator for data sink, with metrics that can tell us: > * Write time > * Number of records written > * Size of output written > * Number of partitions modified > * Metastore update time > * Also track number of records for collect / limit > Scan > * Track file listing time (start and end so we can construct timeline, not > just duration) > * Track metastore operation time > * Track IO decoding time for row-based input sources; Need to make sure > overhead is low > Shuffle > * Track read time and write time > * Decide if we can measure serialization and deserialization > Client fetch time > * Sometimes a query take long to run because it is blocked on the client > fetching result (e.g. using a result iterator). Record the time blocked on > client so we can remove it in measuring query execution time. > Make it easy to correlate queries with jobs, stages, and tasks belonging to a > single query, e.g. dump execution id in task logs? > Better logging: > * Enable logging the query execution id and TID in executor logs, and query > execution id in driver logs. 
[jira] [Resolved] (SPARK-26226) Update query tracker to report timeline for phases, rather than duration
[ https://issues.apache.org/jira/browse/SPARK-26226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-26226. - Resolution: Fixed Fix Version/s: 3.0.0 > Update query tracker to report timeline for phases, rather than duration > > > Key: SPARK-26226 > URL: https://issues.apache.org/jira/browse/SPARK-26226 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > Fix For: 3.0.0 > > > It'd be more useful to report start and end time for each phase, rather than > only a single duration. This way we can look at timelines and figure out if > there is any unaccounted time. >
[jira] [Created] (SPARK-26241) Add queryId to IncrementalExecution
Reynold Xin created SPARK-26241: --- Summary: Add queryId to IncrementalExecution Key: SPARK-26241 URL: https://issues.apache.org/jira/browse/SPARK-26241 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Reynold Xin Assignee: Reynold Xin It'd be useful to have the streaming query uuid in IncrementalExecution, when we look at the QueryExecution in isolation to trace back the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26241) Add queryId to IncrementalExecution
[ https://issues.apache.org/jira/browse/SPARK-26241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-26241: Issue Type: Sub-task (was: Bug) Parent: SPARK-26221 > Add queryId to IncrementalExecution > --- > > Key: SPARK-26241 > URL: https://issues.apache.org/jira/browse/SPARK-26241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Major > > It'd be useful to have the streaming query uuid in IncrementalExecution, when > we look at the QueryExecution in isolation to trace back the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI
[ https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705620#comment-16705620 ] Apache Spark commented on SPARK-26219: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/23191 > Executor summary is not getting updated for failure jobs in history server UI > - > > Key: SPARK-26219 > URL: https://issues.apache.org/jira/browse/SPARK-26219 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Major > Fix For: 3.0.0 > > Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from > 2018-11-29 22-13-44.png > > > Test step to reproduce: > {code:java} > bin/spark-shell --master yarn --conf spark.executor.instances=3 > sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() > {code} > 1)Open the application from History UI > 2) Go to the executor tab > From History UI: > !Screenshot from 2018-11-29 22-13-34.png! > From Live UI: > !Screenshot from 2018-11-29 22-13-44.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23647) extend hint syntax to support any expression for Python
[ https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23647: - Fix Version/s: 3.0.0 > extend hint syntax to support any expression for Python > --- > > Key: SPARK-23647 > URL: https://issues.apache.org/jira/browse/SPARK-23647 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Dylan Guedes >Assignee: Dylan Guedes >Priority: Major > Fix For: 3.0.0 > > > Relax checks in > [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23647) extend hint syntax to support any expression for Python
[ https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23647: Assignee: Dylan Guedes > extend hint syntax to support any expression for Python > --- > > Key: SPARK-23647 > URL: https://issues.apache.org/jira/browse/SPARK-23647 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Dylan Guedes >Assignee: Dylan Guedes >Priority: Major > > Relax checks in > [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23647) extend hint syntax to support any expression for Python
[ https://issues.apache.org/jira/browse/SPARK-23647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23647. -- Resolution: Fixed Issue resolved by pull request 20788 [https://github.com/apache/spark/pull/20788] > extend hint syntax to support any expression for Python > --- > > Key: SPARK-23647 > URL: https://issues.apache.org/jira/browse/SPARK-23647 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Dylan Guedes >Assignee: Dylan Guedes >Priority: Major > Fix For: 3.0.0 > > > Relax checks in > [https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
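The "relax checks" referenced above is a parameter-type check on `DataFrame.hint`. A rough sketch of the shape of that check, purely illustrative and not the exact dataframe.py code: originally only strings were accepted as hint parameters, and the fix widens the allowed types.

```python
def validate_hint_parameters(parameters, allowed_types=(str,)):
    """Illustrative version of the check being relaxed: with the default
    allowed_types only strings pass; passing a wider tuple (e.g. adding
    int and float) mirrors the spirit of the SPARK-23647 change."""
    for p in parameters:
        if not isinstance(p, allowed_types):
            raise TypeError(
                "all parameters should be in %r, got %r of type %s"
                % (allowed_types, p, type(p).__name__))
    return True
```

Under the strict default, a numeric hint parameter raises a TypeError before the hint ever reaches the SQL layer; the relaxed variant lets the SQL layer decide what the hint means.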
[jira] [Resolved] (SPARK-21030) extend hint syntax to support any expression for Python and R
[ https://issues.apache.org/jira/browse/SPARK-21030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21030. -- Resolution: Done > extend hint syntax to support any expression for Python and R > - > > Key: SPARK-21030 > URL: https://issues.apache.org/jira/browse/SPARK-21030 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Priority: Major > > See SPARK-20854 > we need to relax checks in > https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422 > and > https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25876) Simplify configuration types in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah resolved SPARK-25876. Resolution: Fixed Fix Version/s: 3.0.0 > Simplify configuration types in k8s backend > --- > > Key: SPARK-25876 > URL: https://issues.apache.org/jira/browse/SPARK-25876 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > Fix For: 3.0.0 > > > This is a child of SPARK-25874 to deal with the current issues with the > different configuration objects used in the k8s backend. Please refer to the > parent for further discussion of what this means. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26219) Executor summary is not getting updated for failure jobs in history server UI
[ https://issues.apache.org/jira/browse/SPARK-26219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26219. Resolution: Fixed Assignee: shahid Fix Version/s: 3.0.0 > Executor summary is not getting updated for failure jobs in history server UI > - > > Key: SPARK-26219 > URL: https://issues.apache.org/jira/browse/SPARK-26219 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.4.0 >Reporter: shahid >Assignee: shahid >Priority: Major > Fix For: 3.0.0 > > Attachments: Screenshot from 2018-11-29 22-13-34.png, Screenshot from > 2018-11-29 22-13-44.png > > > Test step to reproduce: > {code:java} > bin/spark-shell --master yarn --conf spark.executor.instances=3 > sc.parallelize(1 to 1, 10).map{ x => throw new RuntimeException("Bad > executor")}.collect() > {code} > 1)Open the application from History UI > 2) Go to the executor tab > From History UI: > !Screenshot from 2018-11-29 22-13-34.png! > From Live UI: > !Screenshot from 2018-11-29 22-13-44.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26225) Scan: track decoding time for row-based data sources
[ https://issues.apache.org/jira/browse/SPARK-26225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705373#comment-16705373 ] Thincrs commented on SPARK-26225: - A user of thincrs has selected this issue. Deadline: Fri, Dec 7, 2018 10:42 PM > Scan: track decoding time for row-based data sources > > > Key: SPARK-26225 > URL: https://issues.apache.org/jira/browse/SPARK-26225 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Priority: Major > > Scan node should report decoding time for each record, if it is not too much > overhead. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26240) [pyspark] Updating illegal column names with withColumnRenamed does not update the schema, causing pyspark.sql.utils.AnalysisException
Ying Wang created SPARK-26240: - Summary: [pyspark] Updating illegal column names with withColumnRenamed does not update the schema, causing pyspark.sql.utils.AnalysisException Key: SPARK-26240 URL: https://issues.apache.org/jira/browse/SPARK-26240 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Environment: Ubuntu 16.04 LTS (x86_64/deb) Reporter: Ying Wang

I am unfamiliar with the internals of Spark, but I tried to ingest a Parquet file with illegal column headers; when I called df = df.withColumnRenamed($COLUMN_NAME, $NEW_COLUMN_NAME) and then df.show(), pyspark errored out with the failed attribute being the old column name.

Steps to reproduce:
- Create a Parquet file from Pandas using this dataframe schema:

```python
In [10]: df.info()
Int64Index: 1000 entries, 0 to 999
Data columns (total 16 columns):
Record_ID            1000 non-null int64
registration_dttm    1000 non-null object
id                   1000 non-null int64
first_name           984 non-null object
last_name            1000 non-null object
email                984 non-null object
gender               933 non-null object
ip_address           1000 non-null object
cc                   709 non-null float64
country              1000 non-null object
birthdate            803 non-null object
salary               932 non-null float64
title                803 non-null object
comments             179 non-null object
Unnamed: 14          10 non-null object
Unnamed: 15          9 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 132.8+ KB
```

- Open a pyspark shell with `pyspark` and read in the Parquet file with `spark.read.format('parquet').load('/path/to/file.parquet')`.
- Call `spark_df.show()` and note the error with column 'Unnamed: 14'.
Rename the column, replacing the illegal space character with an underscore: `spark_df = spark_df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')`. Call `spark_df.show()` again, and note that the error message still shows attribute 'Unnamed: 14':

```python
>>> df = spark.read.parquet('/home/yingw787/Downloads/userdata1.parquet')
>>> newdf = df.withColumnRenamed('Unnamed: 14', 'Unnamed:_14')
>>> newdf.show()
Traceback (most recent call last):
  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/yingw787/anaconda2/envs/scratch/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.showString.
: org.apache.spark.sql.AnalysisException: Attribute name "Unnamed: 14" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
...
```

I would have thought there would be a way to read in Parquet files such that illegal column names can be changed after the dataframe is generated, so this looks like unintended behavior. Please let me know if I am wrong.
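The rejection comes from the character restriction quoted in the error: any of ` ,;{}()\n\t=` is invalid in a Parquet attribute name (note the colon in 'Unnamed:_14' is allowed; only the space was illegal, which is why the rename looked correct yet the read still failed). The check, and a workaround of sanitizing headers before the file is ever written, can be sketched in plain Python; `sanitize` here is a hypothetical helper, not a Spark API:

```python
# Character set taken verbatim from the AnalysisException message.
INVALID_CHARS = set(" ,;{}()\n\t=")


def is_valid_parquet_name(name: str) -> bool:
    """True if the name contains none of the forbidden characters."""
    return not (INVALID_CHARS & set(name))


def sanitize(name: str, replacement: str = "_") -> str:
    """Hypothetical helper: map each forbidden character to a replacement."""
    return "".join(replacement if c in INVALID_CHARS else c for c in name)


print(is_valid_parquet_name("Unnamed: 14"))            # False (the space)
print(sanitize("Unnamed: 14"))                         # Unnamed:_14
print(is_valid_parquet_name(sanitize("Unnamed: 14")))  # True
```

Since the name is validated when the Parquet scan is resolved, renaming after `spark.read` does not help (consistent with the behavior reported above); renaming the pandas columns before writing the file avoids the error altogether.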
[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705307#comment-16705307 ] Damien Doucet-Girard commented on SPARK-26188: -- [~ste...@apache.org] thanks for the tip, I brought it up with the team and we're going to check out a solution for our existing jobs > Spark 2.4.0 Partitioning behavior breaks backwards compatibility > > > Key: SPARK-26188 > URL: https://issues.apache.org/jira/browse/SPARK-26188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Damien Doucet-Girard >Assignee: Gengliang Wang >Priority: Critical > Fix For: 2.4.1, 3.0.0 > > > My team uses spark to partition and output parquet files to amazon S3. We > typically use 256 partitions, from 00 to ff. > We've observed that in spark 2.3.2 and prior, it reads the partitions as > strings by default. However, in spark 2.4.0 and later, the type of each > partition is inferred by default, and partitions such as 00 become 0 and 4d > become 4.0. 
> Here is a log sample of this behavior from one of our jobs: > 2.4.0: > {code:java} > 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, > range: 0-662, partition values: [0] > 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, > range: 0-662, partition values: [ef] > 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, > range: 0-662, partition values: [4a] > 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, > range: 0-662, partition values: [74] > 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, > range: 0-662, partition values: [f5] > 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, > range: 0-662, partition values: [50] > 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, > range: 0-662, partition values: [70] > 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, > range: 0-662, partition values: [b9] > 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, > range: 0-662, partition values: [d2] > 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, > range: 0-662, partition values: [51] > 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, > range: 0-662, 
partition values: [84] > 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, > range: 0-662, partition values: [b5] > 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, > range: 0-662, partition values: [88] > 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, > range: 0-662, partition values: [4.0] > 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, > range: 0-662, partition values: [ac] > 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, > range: 0-662, partition values: [24] > 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, > range: 0-662, partition values: [fd] > 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, > range: 0-662, partition values: [52] > 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, > range: 0-662, partition values: [ab] > 18/11/27 14:02:48 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f8/part-00018-hashredacted.parquet, > range: 0-662, partition values: [f8] > 18/11/27 14:02:49 INFO FileScanRDD: Reading File path: >
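The `4d` → `4.0` surprise in the log above falls out of numeric type inference: Java's `Double.parseDouble` accepts a trailing `d`/`D`/`f`/`F` type suffix, so a hex-looking value like `4d` parses as the double 4.0, while `ef` fails to parse and stays a string, and `00` parses as the integer 0 with its leading zero lost. A rough Python imitation of that inference order, illustrative only and not Spark's actual code:

```python
def infer_partition_value(raw: str):
    """Imitate numeric-first inference: try integer, then a Java-style
    double (which tolerates a trailing d/D/f/F suffix), else string."""
    try:
        return int(raw)              # "00" -> 0 (leading zeros are lost)
    except ValueError:
        pass
    # Java's Double.parseDouble allows a trailing d/D/f/F type suffix.
    if raw and raw[-1] in "dDfF":
        try:
            return float(raw[:-1])   # "4d" -> 4.0
        except ValueError:
            pass
    try:
        return float(raw)
    except ValueError:
        return raw                   # "ef" survives as a string
```

The usual workaround for this class of problem is disabling inference (`spark.sql.sources.partitionColumnTypeInference.enabled=false`) or supplying an explicit schema on read, so hex suffixes come back as strings on every Spark version.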
[jira] [Commented] (SPARK-23078) Allow Submitting Spark Thrift Server in Cluster Mode
[ https://issues.apache.org/jira/browse/SPARK-23078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705282#comment-16705282 ] Marcelo Vanzin commented on SPARK-23078: Is this actually needed now that k8s supports client mode and also arbitrary commands to run in the image's entry point? You could set up your STS pod with a YAML file containing everything you need, running the thrift server from there directly in client mode. > Allow Submitting Spark Thrift Server in Cluster Mode > > > Key: SPARK-23078 > URL: https://issues.apache.org/jira/browse/SPARK-23078 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit, SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > Since SPARK-5176, SparkSubmit has blacklisted the Thrift Server from running > in Cluster mode, since at the time it was not able to do so successfully. I > have confirmed that Spark Thrift Server can run on Cluster mode in > Kubernetes, by commenting out > [https://github.com/apache-spark-on-k8s/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L331.] > I have not had a chance to test against YARN. Since Kubernetes does not have > Client mode, this change is necessary to run Spark Thrift Service in > Kubernetes. > [~foxish] [~coderanger] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705277#comment-16705277 ] Marcelo Vanzin commented on SPARK-26239: That can work but it doesn't address the 3rd bullet.

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver is not created by the k8s backend in Spark.
[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True
[ https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705269#comment-16705269 ] Dongjoon Hyun commented on SPARK-25145: --- I used your `zerocopy` configuration as follows in Apache Spark 2.4.0 and 2.3.2 with Python 3, but it does not seem to be reproducible. Could you reopen with a reproducible example next time?

*Spark 2.4.0*
{code}
~/A/s/spark-2.4.0-bin-hadoop2.7$ bin/pyspark --conf spark.hadoop.hive.exec.orc.zerocopy=true
Python 3.7.1 (default, Nov 6 2018, 18:46:03)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.7.1 (default, Nov 6 2018 18:46:03)
SparkSession available as 'spark'.
>>> import numpy as np
>>> import pandas as pd
>>> spark.conf.set('spark.sql.orc.impl', 'native')
>>> spark.conf.set('spark.sql.orc.filterPushdown', True)
>>> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
>>> sdf = spark.createDataFrame(df)
>>> sdf.write.saveAsTable(format='orc', mode='overwrite', name='t1', compression='zlib')
>>> spark.sql('SELECT * FROM t1 WHERE a > 5').show()
+---+---+
|  a|  b|
+---+---+
|  8|4.0|
|  9|4.5|
|  6|3.0|
|  7|3.5|
+---+---+
{code}

*Spark 2.3.2*
{code}
~/A/s/spark-2.3.2-bin-hadoop2.7$ bin/pyspark --conf spark.hadoop.hive.exec.orc.zerocopy=true
Python 3.7.1 (default, Nov 6 2018, 18:46:03)
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Python version 3.7.1 (default, Nov 6 2018 18:46:03)
SparkSession available as 'spark'.
>>> import numpy as np
>>> import pandas as pd
>>> spark.conf.set('spark.sql.orc.impl', 'native')
>>> spark.conf.set('spark.sql.orc.filterPushdown', True)
>>> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
>>> sdf = spark.createDataFrame(df)
>>> sdf.write.saveAsTable(format='orc', mode='overwrite', name='t1', compression='zlib')
>>> spark.sql('SELECT * FROM t1 WHERE a > 5').show()
+---+---+
|  a|  b|
+---+---+
|  8|4.0|
|  9|4.5|
|  6|3.0|
|  7|3.5|
+---+---+
{code}

> Buffer size too small on spark.sql query with filterPushdown predicate=True
> ---
>
> Key: SPARK-25145
> URL: https://issues.apache.org/jira/browse/SPARK-25145
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.3.3
> Environment:
> {noformat}
> # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018
> spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.eventLog.dir hdfs:///spark2-history/
> spark.eventLog.enabled true
> spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.hadoop.hive.vectorized.execution.enabled true
> spark.history.fs.logDirectory hdfs:///spark2-history/
> spark.history.kerberos.keytab none
> spark.history.kerberos.principal none
> spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.retainedApplications 50
> spark.history.ui.port 18081
> spark.io.compression.lz4.blockSize 128k
> spark.locality.wait 2s
> spark.network.timeout 600s
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.shuffle.consolidateFiles true
> spark.shuffle.io.numConnectionsPerPeer 10
> spark.sql.autoBroadcastJoinTreshold 26214400
> spark.sql.shuffle.partitions 300
> spark.sql.statistics.fallBack.toHdfs true
> spark.sql.tungsten.enabled true
> spark.driver.memoryOverhead 2048
> spark.executor.memoryOverhead 4096
> spark.yarn.historyServer.address service-10-4.local:18081
> spark.yarn.queue default
> spark.sql.warehouse.dir hdfs:///apps/hive/warehouse
> spark.sql.execution.arrow.enabled true
> spark.sql.hive.convertMetastoreOrc true
> spark.sql.orc.char.enabled true
> spark.sql.orc.enabled true
> spark.sql.orc.filterPushdown true
> spark.sql.orc.impl native
> spark.sql.orc.enableVectorizedReader true
> spark.yarn.jars hdfs:///apps/spark-jars/231/jars/*
> {noformat}
>
>Reporter: Bjørnar Jensen
>Priority: Minor
> Attachments: create_bug.py, report.txt
>
>
> java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 2205991
> #
> {code:java}
> Python
> import numpy as np
> import pandas as pd
> # Create a spark dataframe
> df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0})
> sdf = spark.createDataFrame(df)
> print('Created spark dataframe:')
> sdf.show()
> # Save table as orc
> sdf.write.saveAsTable(format='orc', mode='overwrite', name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', compression='zlib')
> # Ensure filterPushdown is
[jira] [Commented] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark client
[ https://issues.apache.org/jira/browse/SPARK-26237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705275#comment-16705275 ] Qi Shao commented on SPARK-26237: - Figured out that the pyspark configuration needs to be set before pyspark is launched. Creating a new SparkSession after logging into the console won't work.

> [K8s] Unable to switch python version in executor when running pyspark client
> -
>
> Key: SPARK-26237
> URL: https://issues.apache.org/jira/browse/SPARK-26237
> Project: Spark
> Issue Type: Bug
> Components: PySpark
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
> Google Kubernetes Engine
>Reporter: Qi Shao
>Priority: Major
>
> Error message:
> {code:java}
> Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.{code}
> Neither
> {code:java}
> spark.kubernetes.pyspark.pythonVersion{code}
> nor
> {code:java}
> spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code}
> works.
> This happens when I'm running a Notebook with pyspark+python3 and also in a pod which has pyspark+python3.
> For notebook, the code is:
> {code:java}
> from __future__ import print_function
> import sys
> from random import random
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder\
>     .master("k8s://https://kubernetes.default.svc")\
>     .appName("PySpark Testout")\
>     .config("spark.submit.deployMode", "client")\
>     .config("spark.executor.instances", "2")\
>     .config("spark.kubernetes.container.image", "azureq/pantheon:pyspark-2.4")\
>     .config("spark.driver.host", "jupyter-notebook-headless")\
>     .config("spark.driver.pod.name", "jupyter-notebook-headless")\
>     .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")\
>     .config("spark.kubernetes.pyspark.pythonVersion", "3")\
>     .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION", "3")\
>     .getOrCreate()
> n = 10
> def f(_):
>     x = random() * 2 - 1
>     y = random() * 2 - 1
>     return 1 if x ** 2 + y ** 2 <= 1 else 0
> count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
> print("Pi is roughly %f" % (4.0 * count / n))
> {code}
> For pyspark shell, the command is:
> {code:java}
> $SPARK_HOME/bin/pyspark --master \
>     k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
>     --deploy-mode client \
>     --conf spark.executor.instances=5 \
>     --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>     --conf spark.driver.host=spark-client-mode-headless \
>     --conf spark.kubernetes.pyspark.pythonVersion=3 \
>     --conf spark.driver.pod.name=spark-client-mode-headless{code}
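The resolution of this issue comes down to ordering: the worker interpreter is chosen when pyspark starts, so setting `PYSPARK_PYTHON`/`PYSPARK_DRIVER_PYTHON` after a SparkSession already exists has no effect. A minimal sketch of the required ordering (the SparkSession call is commented out because it needs a live cluster, and `python3` is an assumed interpreter name; use whatever path exists in your image):

```python
import os

# Must happen *before* pyspark launches any JVM or worker processes.
# "python3" is an assumption; substitute the interpreter path in your container.
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"

# Only now is it safe to build the session, e.g.:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("k8s://...").getOrCreate()
```

Setting these in a notebook cell after `getOrCreate()` has run, or via `spark.conf.set`, is too late: the executors have already started with the default interpreter.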
[jira] [Resolved] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark client
[ https://issues.apache.org/jira/browse/SPARK-26237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Shao resolved SPARK-26237. - Resolution: Invalid > [K8s] Unable to switch python version in executor when running pyspark client > - > > Key: SPARK-26237 > URL: https://issues.apache.org/jira/browse/SPARK-26237 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 > Google Kubernetes Engines >Reporter: Qi Shao >Priority: Major > > Error message: > {code:java} > Exception: Python in worker has different version 2.7 than that in driver > 3.6, PySpark cannot run with different minor versions.Please check > environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly > set.{code} > Neither > {code:java} > spark.kubernetes.pyspark.pythonVersion{code} > nor > {code:java} > spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code} > works. > This happens when I'm running a Notebook with pyspark+python3 and also in a > pod which has pyspark+python3. 
> For notebook, the code is:
> {code:java}
> from __future__ import print_function
> import sys
> from random import random
> from operator import add
> from pyspark.sql import SparkSession
> spark = SparkSession.builder\
>     .master("k8s://https://kubernetes.default.svc")\
>     .appName("PySpark Testout")\
>     .config("spark.submit.deployMode", "client")\
>     .config("spark.executor.instances", "2")\
>     .config("spark.kubernetes.container.image", "azureq/pantheon:pyspark-2.4")\
>     .config("spark.driver.host", "jupyter-notebook-headless")\
>     .config("spark.driver.pod.name", "jupyter-notebook-headless")\
>     .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")\
>     .config("spark.kubernetes.pyspark.pythonVersion", "3")\
>     .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION", "3")\
>     .getOrCreate()
> n = 10
> def f(_):
>     x = random() * 2 - 1
>     y = random() * 2 - 1
>     return 1 if x ** 2 + y ** 2 <= 1 else 0
> count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
> print("Pi is roughly %f" % (4.0 * count / n))
> {code}
> For pyspark shell, the command is:
> {code:java}
> $SPARK_HOME/bin/pyspark --master \
>     k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
>     --deploy-mode client \
>     --conf spark.executor.instances=5 \
>     --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \
>     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
>     --conf spark.driver.host=spark-client-mode-headless \
>     --conf spark.kubernetes.pyspark.pythonVersion=3 \
>     --conf spark.driver.pod.name=spark-client-mode-headless{code}
[jira] [Comment Edited] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705273#comment-16705273 ] Matt Cheah edited comment on SPARK-26239 at 11/30/18 8:59 PM:
--
Could we add a simple version that just points to file paths for the executor and driver to load, with the secret contents being inside? The user can decide how those files are mounted into the containers.

was (Author: mcheah): Would a simple addition just to point to file paths for the executor and driver to load, with the secret contents being inside? The user can decide how those files are mounted into the containers.

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver is not created by the k8s backend in Spark.
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705273#comment-16705273 ] Matt Cheah commented on SPARK-26239: Would a simple addition just to point to file paths for the executor and driver to load, with the secret contents being inside? The user can decide how those files are mounted into the containers.

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver is not created by the k8s backend in Spark.
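The file-based approach suggested in this thread can be sketched in a few lines (this is an illustration, not an existing Spark config: the function name and the mount path are hypothetical, and how the file gets into the container — a k8s Secret volume, a Vault agent sidecar, etc. — is left to the user):

```python
from pathlib import Path


def load_auth_secret(path="/etc/spark-secrets/auth-secret"):
    """Read a pre-shared auth secret from a mounted file.

    The default path is a hypothetical mount point. Driver and executor
    would each read the same mounted file, so Spark itself never has to
    generate or propagate the secret.
    """
    secret = Path(path).read_text().strip()
    if not secret:
        raise ValueError("empty auth secret at %s" % path)
    return secret
```

The appeal of this design is that it sidesteps the first two investigation bullets (secret generation and SecurityManager delegation) for simple deployments, though, as noted above, it still leaves the client-mode question open.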
[jira] [Resolved] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True
[ https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25145. --- Resolution: Cannot Reproduce > Buffer size too small on spark.sql query with filterPushdown predicate=True > --- > > Key: SPARK-25145 > URL: https://issues.apache.org/jira/browse/SPARK-25145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 > Environment: > {noformat} > # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018 > spark.driver.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.eventLog.dir hdfs:///spark2-history/ > spark.eventLog.enabled true > spark.executor.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.hadoop.hive.vectorized.execution.enabled true > spark.history.fs.logDirectory hdfs:///spark2-history/ > spark.history.kerberos.keytab none > spark.history.kerberos.principal none > spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider > spark.history.retainedApplications 50 > spark.history.ui.port 18081 > spark.io.compression.lz4.blockSize 128k > spark.locality.wait 2s > spark.network.timeout 600s > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.shuffle.consolidateFiles true > spark.shuffle.io.numConnectionsPerPeer 10 > spark.sql.autoBroadcastJoinTreshold 26214400 > spark.sql.shuffle.partitions 300 > spark.sql.statistics.fallBack.toHdfs true > spark.sql.tungsten.enabled true > spark.driver.memoryOverhead 2048 > spark.executor.memoryOverhead 4096 > spark.yarn.historyServer.address service-10-4.local:18081 > spark.yarn.queue default > spark.sql.warehouse.dir hdfs:///apps/hive/warehouse > spark.sql.execution.arrow.enabled true > spark.sql.hive.convertMetastoreOrc true > spark.sql.orc.char.enabled true > spark.sql.orc.enabled true > spark.sql.orc.filterPushdown true > 
spark.sql.orc.impl native > spark.sql.orc.enableVectorizedReader true > spark.yarn.jars hdfs:///apps/spark-jars/231/jars/* > {noformat} > >Reporter: Bjørnar Jensen >Priority: Minor > Attachments: create_bug.py, report.txt > > > java.lang.IllegalArgumentException: Buffer size too small. size = 262144 > needed = 2205991 > # > {code:java} > Python > import numpy as np > import pandas as pd > # Create a spark dataframe > df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0}) > sdf = spark.createDataFrame(df) > print('Created spark dataframe:') > sdf.show() > # Save table as orc > sdf.write.saveAsTable(format='orc', mode='overwrite', > name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', > compression='zlib') > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Fetch entire table (works) > print('Read entire table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown').show() > # Ensure filterPushdown is disabled > spark.conf.set('spark.sql.orc.filterPushdown', False) > # Query without filterPushdown (works) > print('Read a selection from table with "filterPushdown"=False') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Query with filterPushDown (fails) > print('Read a selection from table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > {code} > {noformat} > ~/bug_report $ pyspark > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port > 4040. Attempting port 4041. 
> Jupyter console 5.1.0 > Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28) > Type 'copyright', 'credits' or 'license' for more information > IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help. > In [1]: %run -i create_bug.py > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT > /_/ > Using Python version 3.6.3 (default, May 4 2018 04:22:28) > SparkSession available as 'spark'. > Created spark dataframe: > +---+---+ > | a| b| > +---+---+ > | 0|0.0| > | 1|0.5| > | 2|1.0| > | 3|1.5| > | 4|2.0| > | 5|2.5| > | 6|3.0| > | 7|3.5| > | 8|4.0| > | 9|4.5| > +---+---+ > Read entire table with "filterPushdown"=True > +---+---+ > | a| b| > +---+---+ > | 1|0.5| > | 2|1.0| > | 3|1.5| > | 5|2.5| > | 6|3.0| > | 7|3.5| > | 8|4.0| > | 9|4.5| > | 4|2.0| > | 0|0.0| >
[jira] [Commented] (SPARK-26239) Add configurable auth secret source in k8s backend
[ https://issues.apache.org/jira/browse/SPARK-26239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705236#comment-16705236 ] Thincrs commented on SPARK-26239: - A user of thincrs has selected this issue. Deadline: Fri, Dec 7, 2018 8:18 PM

> Add configurable auth secret source in k8s backend
> --
>
> Key: SPARK-26239
> URL: https://issues.apache.org/jira/browse/SPARK-26239
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend.
> There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate:
> - exposing configuration to support that
> - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations
> - figuring out whether this can also be used in client-mode, where the driver is not created by the k8s backend in Spark.
[jira] [Resolved] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-26200. -- Resolution: Duplicate > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. >Reporter: David Lyness >Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified columns. > h3. Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. 
Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids the correctness issue. 
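The analysis in the report above can be emulated without PySpark at all. The following is an illustration only (plain Python standing in for the real `Row` class): a kwargs-constructed Row stores its values in alphabetical key order, and the `zip` in Block 1 then pairs those values positionally against the schema's declared order:

```python
# Plain-Python emulation of the mismatch described above (not pyspark code).
schema_names = ["date_column", "my_b_column", "my_a_column"]  # declared schema order

row_kwargs = {
    "date_column": "2018-11-28",
    "my_b_column": "my_b_value",
    "my_a_column": "my_a_value",
}

# Row(**kwargs) sorts field names alphabetically before storing the values,
# so the stored tuple is ordered date_column, my_a_column, my_b_column:
row_values = tuple(row_kwargs[k] for k in sorted(row_kwargs))

# Block 1's zip pairs schema fields with those positionally-sorted values:
paired = dict(zip(schema_names, row_values))
print(paired)
# {'date_column': '2018-11-28', 'my_b_column': 'my_a_value', 'my_a_column': 'my_b_value'}
```

The printed pairing reproduces the reported transposition exactly: the date survives only because "date_column" happens to sort first, while the two string columns swap values.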
[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705224#comment-16705224 ] Bryan Cutler commented on SPARK-26200: -- Thanks [~davidlyness], I'll mark this as a duplicate since the root cause is the same. I've been tracking this and similar issues with PySpark Rows, but since addressing these will cause a behavior change, we can only make the fix in Spark 3.0. If you didn't see the workaround in the other JIRA, I would recommend creating your Rows this way when using a schema and you care about field ordering, or use a list as you pointed out {code} In [10]: MyRow = Row("field2", "field1") In [11]: data = [ ...: MyRow(Row(sub_field='world'), "hello") ...: ] In [12]: df = spark.createDataFrame(data, schema=schema) In [13]: df.show() +---+--+ | field2|field1| +---+--+ |[world]| hello| +---+--+ {code} > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. >Reporter: David Lyness >Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified columns. > h3. 
Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. 
In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids
[jira] [Commented] (SPARK-21453) Cached Kafka consumer may be closed too early
[ https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705222#comment-16705222 ] Julian commented on SPARK-21453: I also get the following error around the same time (shortly after above one). Maybe it is related? 18/11/30 18:04:17 ERROR Executor: Exception in task 2.0 in stage 115.0 (TID 2377) java.lang.NullPointerException at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:436) at org.apache.kafka.common.network.Selector.poll(Selector.java:395) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:238) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:205) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:137) at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:307) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1138) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1103) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) at 
org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52) at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926) > Cached Kafka consumer may be closed too early > - > > Key: SPARK-21453 > URL: https://issues.apache.org/jira/browse/SPARK-21453 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 and kafka 0.10.2.0 >Reporter: Pablo Panero >Priority: Minor > > On a streaming job using built-in kafka source and sink (over SSL), I > am getting the following exception: > Config of the source: > {code:java} > val df = spark.readStream > .format("kafka") > .option("kafka.bootstrap.servers", config.bootstrapServers) > .option("failOnDataLoss", value = false) > .option("kafka.connections.max.idle.ms", 360) > //SSL: this only applies to communication between Spark and Kafka > brokers; you are still responsible for separately securing Spark inter-node > 
communication. > .option("kafka.security.protocol", "SASL_SSL") > .option("kafka.sasl.mechanism", "GSSAPI") > .option("kafka.sasl.kerberos.service.name", "kafka") > .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts") > .option("kafka.ssl.truststore.password", "changeit") > .option("subscribe", config.topicConfigList.keys.mkString(",")) > .load() > {code} > Config of the sink: > {code:java} > .writeStream > .option("checkpointLocation", > s"${config.checkpointDir}/${topicConfig._1}/") > .format("kafka") > .option("kafka.bootstrap.servers", config.bootstrapServers) >
[jira] [Commented] (SPARK-21453) Cached Kafka consumer may be closed too early
[ https://issues.apache.org/jira/browse/SPARK-21453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705215#comment-16705215 ] Julian commented on SPARK-21453: I see this issue is still open (although no one has posted here for a while), as I periodically get exactly the same error in our Spark 2.2 Structured Streaming job running on our HDP 2.6 cluster with Kafka (HDF). It is a pain, as it causes tasks to fail; with a load currently of 1.5GB per 5 mins (to be upscaled 5+ times in the coming weeks/months), it increases the overall time for a micro batch to complete. Any ideas would be good (I'll check the link above also). Note, we are not using the experimental Continuous Mode in this version... 18/11/30 18:04:02 WARN SslTransportLayer: Failed to send SSL Close message java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:65) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) at org.apache.kafka.common.network.SslTransportLayer.flush(SslTransportLayer.java:212) at org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:170) at org.apache.kafka.common.utils.Utils.closeAll(Utils.java:703) at org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:61) at org.apache.kafka.common.network.Selector.doClose(Selector.java:717) at org.apache.kafka.common.network.Selector.close(Selector.java:708) at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:500) at org.apache.kafka.common.network.Selector.poll(Selector.java:398) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:238) at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:214) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1156) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1103) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:298) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.org$apache$spark$sql$kafka010$CachedKafkaConsumer$$fetchData(CachedKafkaConsumer.scala:206) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:117) at org.apache.spark.sql.kafka010.CachedKafkaConsumer$$anonfun$get$1.apply(CachedKafkaConsumer.scala:106) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:85) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.runUninterruptiblyIfPossible(CachedKafkaConsumer.scala:68) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:157) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:148) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) ……. 
> Cached Kafka consumer may be closed too early > - > > Key: SPARK-21453 > URL: https://issues.apache.org/jira/browse/SPARK-21453 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 and kafka 0.10.2.0 >Reporter: Pablo Panero >Priority: Minor > > On a streaming job using built-in kafka source and sink (over SSL), I > am getting the following exception: > Config of the source: > {code:java} > val df = spark.readStream > .format("kafka") > .option("kafka.bootstrap.servers", config.bootstrapServers) > .option("failOnDataLoss", value = false) > .option("kafka.connections.max.idle.ms", 360) > //SSL: this only applies to communication between Spark and Kafka > brokers; you are still responsible for separately securing Spark inter-node > communication. > .option("kafka.security.protocol", "SASL_SSL") > .option("kafka.sasl.mechanism", "GSSAPI") > .option("kafka.sasl.kerberos.service.name", "kafka") > .option("kafka.ssl.truststore.location", "/etc/pki/java/cacerts") > .option("kafka.ssl.truststore.password", "changeit") > .option("subscribe", config.topicConfigList.keys.mkString(",")) > .load() > {code} > Config of the sink: > {code:java} > .writeStream >
[jira] [Created] (SPARK-26238) Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S
Ilan Filonenko created SPARK-26238: -- Summary: Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S Key: SPARK-26238 URL: https://issues.apache.org/jira/browse/SPARK-26238 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0, 3.0.0 Reporter: Ilan Filonenko Set SPARK_CONF_DIR to point to ${SPARK_HOME}/conf as opposed to /opt/spark/conf, which is hard-coded into the Constants. This is the expected behavior according to the Spark docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
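The change requested above is small: derive the conf dir from SPARK_HOME rather than hard-coding /opt/spark/conf. A minimal sketch of that fallback logic (paths and the helper name are illustrative, not the actual Spark entrypoint code):

```python
import os

def resolve_conf_dir(env=os.environ):
    """Derive SPARK_CONF_DIR from SPARK_HOME instead of hard-coding it.

    An explicit SPARK_CONF_DIR still wins; otherwise fall back to
    ${SPARK_HOME}/conf, defaulting SPARK_HOME to /opt/spark (example only).
    """
    spark_home = env.get("SPARK_HOME", "/opt/spark")
    return env.get("SPARK_CONF_DIR", spark_home + "/conf")

print(resolve_conf_dir({}))                              # -> /opt/spark/conf
print(resolve_conf_dir({"SPARK_HOME": "/usr/lib/spark"}))  # -> /usr/lib/spark/conf
```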
[jira] [Created] (SPARK-26239) Add configurable auth secret source in k8s backend
Marcelo Vanzin created SPARK-26239: -- Summary: Add configurable auth secret source in k8s backend Key: SPARK-26239 URL: https://issues.apache.org/jira/browse/SPARK-26239 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.0.0 Reporter: Marcelo Vanzin This is a follow up to SPARK-26194, which aims to add auto-generated secrets similar to the YARN backend. There's a desire to support different ways to generate and propagate these auth secrets (e.g. using things like Vault). Need to investigate: - exposing configuration to support that - changing SecurityManager so that it can delegate some of the secret-handling logic to custom implementations - figuring out whether this can also be used in client-mode, where the driver is not created by the k8s backend in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled
[ https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-26201. --- Resolution: Fixed Assignee: Sanket Chintapalli Fix Version/s: 3.0.0 2.4.1 2.3.3 > python broadcast.value on driver fails with disk encryption enabled > --- > > Key: SPARK-26201 > URL: https://issues.apache.org/jira/browse/SPARK-26201 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2 >Reporter: Thomas Graves >Assignee: Sanket Chintapalli >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > I was trying Python with RPC and disk encryption enabled; when I created a > Python broadcast variable and just read the value back on the driver side, the > job failed with: > > Traceback (most recent call last): File "broadcast.py", line 37, in > words_new.value File "/pyspark.zip/pyspark/broadcast.py", line 137, in value > File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path File > "pyspark.zip/pyspark/broadcast.py", line 128, in load EOFError: Ran out of > input > To reproduce use configs: --conf spark.network.crypto.enabled=true --conf > spark.io.encryption.enabled=true > > Code: > words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"]) > words_new.value > print(words_new.value) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
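The "EOFError: Ran out of input" in the traceback above is what Python's pickle raises when the stream it is handed contains no usable data. A minimal pure-Python illustration of the failure mode, no PySpark required (connecting this to Spark's disk-encryption read path is an assumption based on the report):

```python
import io
import pickle

payload = ["scala", "java", "hadoop", "spark", "akka"]
raw = pickle.dumps(payload)

# An intact pickle stream round-trips fine:
assert pickle.loads(raw) == payload

# If the driver reads back bytes it cannot decode (e.g. an encrypted
# broadcast file read without decryption), pickle finds no data to load.
# An empty stream reproduces the reported error text exactly:
try:
    pickle.load(io.BytesIO(b""))
except EOFError as e:
    print("EOFError:", e)  # -> EOFError: Ran out of input
```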
[jira] [Updated] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark client
[ https://issues.apache.org/jira/browse/SPARK-26237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Shao updated SPARK-26237: Summary: [K8s] Unable to switch python version in executor when running pyspark client (was: [K8s] Unable to switch python version in executor when running pyspark shell.) > [K8s] Unable to switch python version in executor when running pyspark client > - > > Key: SPARK-26237 > URL: https://issues.apache.org/jira/browse/SPARK-26237 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Spark 2.4.0 > Google Kubernetes Engines >Reporter: Qi Shao >Priority: Major > > Error message: > {code:java} > Exception: Python in worker has different version 2.7 than that in driver > 3.6, PySpark cannot run with different minor versions.Please check > environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly > set.{code} > Neither > {code:java} > spark.kubernetes.pyspark.pythonVersion{code} > nor > {code:java} > spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code} > works. > This happens when I'm running a Notebook with pyspark+python3 and also in a > pod which has pyspark+python3. 
> For notebook, the code is: > {code:java} > from __future__ import print_function > import sys > from random import random > from operator import add > from pyspark.sql import SparkSession > spark = SparkSession.builder\ > .master("k8s://https://kubernetes.default.svc")\ > .appName("PySpark Testout")\ > .config("spark.submit.deployMode","client")\ > .config("spark.executor.instances", "2")\ > .config("spark.kubernetes.container.image","azureq/pantheon:pyspark-2.4")\ > .config("spark.driver.host","jupyter-notebook-headless")\ > .config("spark.driver.pod.name","jupyter-notebook-headless")\ > .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark")\ > .config("spark.kubernetes.pyspark.pythonVersion","3")\ > .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION","3")\ > .getOrCreate() > n = 10 > def f(_): > x = random() * 2 - 1 > y = random() * 2 - 1 > return 1 if x ** 2 + y ** 2 <= 1 else 0 > count = spark.sparkContext.parallelize(range(1, n + 1), > partitions).map(f).reduce(add) > print("Pi is roughly %f" % (4.0 * count / n)) > {code} > For pyspark shell, the command is: > > {code:java} > $SPARK_HOME/bin/pyspark --master \ > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \ > --deploy-mode client \ > --conf spark.executor.instances=5 \ > --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.driver.host=spark-client-mode-headless \ > --conf spark.kubernetes.pyspark.pythonVersion=3 \ > --conf spark.driver.pod.name=spark-client-mode-headless{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26237) [K8s] Unable to switch python version in executor when running pyspark shell.
Qi Shao created SPARK-26237: --- Summary: [K8s] Unable to switch python version in executor when running pyspark shell. Key: SPARK-26237 URL: https://issues.apache.org/jira/browse/SPARK-26237 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Environment: Spark 2.4.0 Google Kubernetes Engines Reporter: Qi Shao Error message: {code:java} Exception: Python in worker has different version 2.7 than that in driver 3.6, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.{code} Neither {code:java} spark.kubernetes.pyspark.pythonVersion{code} nor {code:java} spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION {code} works. This happens when I'm running a Notebook with pyspark+python3 and also in a pod which has pyspark+python3. For notebook, the code is: {code:java} from __future__ import print_function import sys from random import random from operator import add from pyspark.sql import SparkSession spark = SparkSession.builder\ .master("k8s://https://kubernetes.default.svc")\ .appName("PySpark Testout")\ .config("spark.submit.deployMode","client")\ .config("spark.executor.instances", "2")\ .config("spark.kubernetes.container.image","azureq/pantheon:pyspark-2.4")\ .config("spark.driver.host","jupyter-notebook-headless")\ .config("spark.driver.pod.name","jupyter-notebook-headless")\ .config("spark.kubernetes.authenticate.driver.serviceAccountName","spark")\ .config("spark.kubernetes.pyspark.pythonVersion","3")\ .config("spark.executorEnv.PYSPARK_MAJOR_PYTHON_VERSION","3")\ .getOrCreate() n = 10 def f(_): x = random() * 2 - 1 y = random() * 2 - 1 return 1 if x ** 2 + y ** 2 <= 1 else 0 count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add) print("Pi is roughly %f" % (4.0 * count / n)) {code} For pyspark shell, the command is: {code:java} $SPARK_HOME/bin/pyspark --master \ 
k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \ --deploy-mode client \ --conf spark.executor.instances=5 \ --conf spark.kubernetes.container.image=azureq/pantheon:pyspark-2.4 \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.driver.host=spark-client-mode-headless \ --conf spark.kubernetes.pyspark.pythonVersion=3 \ --conf spark.driver.pod.name=spark-client-mode-headless{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
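The error quoted in this report comes from a sanity check PySpark performs when an executor launches a Python worker: the worker's major.minor version must match the driver's. A rough pure-Python sketch of that gate (paraphrased for illustration, not the actual pyspark source):

```python
def check_python_version(worker_version, driver_version):
    """Mimic pyspark's worker/driver version gate (paraphrased)."""
    if worker_version != driver_version:
        raise RuntimeError(
            "Python in worker has different version %s than that in driver "
            "%s, PySpark cannot run with different minor versions."
            % (worker_version, driver_version))

check_python_version("3.6", "3.6")      # matching versions: no error
try:
    check_python_version("2.7", "3.6")  # mismatch: the reported failure
except RuntimeError as e:
    print(e)
```

The point of the sketch is that the check compares interpreter versions actually resolved on each side, so whatever config is used has to change which Python binary the executor container ends up invoking.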
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705045#comment-16705045 ] Marcelo Vanzin commented on SPARK-26043: It would be better if you could show what you mean with an example. In general, it's good to avoid having this kind of logic in executors and, if needed, to have the driver provide the needed info (e.g. by broadcasting the config, which is done internally by Spark). > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and kerberos. > It's been historically marked as "DeveloperApi". But in reality it's not very > useful for others, and changes too much to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True
[ https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705005#comment-16705005 ] Dongjoon Hyun commented on SPARK-25145: --- Thank you for enriching the information. I'll try to check with those options, [~bjensen]. > Buffer size too small on spark.sql query with filterPushdown predicate=True > --- > > Key: SPARK-25145 > URL: https://issues.apache.org/jira/browse/SPARK-25145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 > Environment: > {noformat} > # Generated by Apache Ambari. Wed Mar 21 15:37:53 2018 > spark.driver.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.eventLog.dir hdfs:///spark2-history/ > spark.eventLog.enabled true > spark.executor.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.hadoop.hive.vectorized.execution.enabled true > spark.history.fs.logDirectory hdfs:///spark2-history/ > spark.history.kerberos.keytab none > spark.history.kerberos.principal none > spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider > spark.history.retainedApplications 50 > spark.history.ui.port 18081 > spark.io.compression.lz4.blockSize 128k > spark.locality.wait 2s > spark.network.timeout 600s > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.shuffle.consolidateFiles true > spark.shuffle.io.numConnectionsPerPeer 10 > spark.sql.autoBroadcastJoinTreshold 26214400 > spark.sql.shuffle.partitions 300 > spark.sql.statistics.fallBack.toHdfs true > spark.sql.tungsten.enabled true > spark.driver.memoryOverhead 2048 > spark.executor.memoryOverhead 4096 > spark.yarn.historyServer.address service-10-4.local:18081 > spark.yarn.queue default > spark.sql.warehouse.dir hdfs:///apps/hive/warehouse > spark.sql.execution.arrow.enabled true > spark.sql.hive.convertMetastoreOrc true > 
spark.sql.orc.char.enabled true > spark.sql.orc.enabled true > spark.sql.orc.filterPushdown true > spark.sql.orc.impl native > spark.sql.orc.enableVectorizedReader true > spark.yarn.jars hdfs:///apps/spark-jars/231/jars/* > {noformat} > >Reporter: Bjørnar Jensen >Priority: Minor > Attachments: create_bug.py, report.txt > > > java.lang.IllegalArgumentException: Buffer size too small. size = 262144 > needed = 2205991 > # > {code:java} > Python > import numpy as np > import pandas as pd > # Create a spark dataframe > df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0}) > sdf = spark.createDataFrame(df) > print('Created spark dataframe:') > sdf.show() > # Save table as orc > sdf.write.saveAsTable(format='orc', mode='overwrite', > name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', > compression='zlib') > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Fetch entire table (works) > print('Read entire table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown').show() > # Ensure filterPushdown is disabled > spark.conf.set('spark.sql.orc.filterPushdown', False) > # Query without filterPushdown (works) > print('Read a selection from table with "filterPushdown"=False') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Query with filterPushDown (fails) > print('Read a selection from table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > {code} > {noformat} > ~/bug_report $ pyspark > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port > 4040. 
Attempting port 4041. > Jupyter console 5.1.0 > Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28) > Type 'copyright', 'credits' or 'license' for more information > IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help. > In [1]: %run -i create_bug.py > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /__ / .__/\_,_/_/ /_/\_\ version 2.3.3-SNAPSHOT > /_/ > Using Python version 3.6.3 (default, May 4 2018 04:22:28) > SparkSession available as 'spark'. > Created spark dataframe: > +---+---+ > | a| b| > +---+---+ > | 0|0.0| > | 1|0.5| > | 2|1.0| > | 3|1.5| > | 4|2.0| > | 5|2.5| > | 6|3.0| > | 7|3.5| > | 8|4.0| > | 9|4.5| > +---+---+ > Read entire table with "filterPushdown"=True > +---+---+ > | a| b| > +---+---+ > | 1|0.5| > |
[jira] [Comment Edited] (SPARK-26214) Add "broadcast" method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704978#comment-16704978 ] Thomas Decaux edited comment on SPARK-26214 at 11/30/18 4:24 PM: - Well, not so different : {code:java} public void broadcast() { this.sqlContext().sparkContext().broadcast(this); } {code} {{VERSUS}} {code:java} public void registerTempTable(String tableName) { this.sqlContext().registerDataFrameAsTable(this, tableName); } {code} was (Author: ebuildy): Well, not so different : {code:java} public void broadcast() { this.sqlContext().sparkContext().broadcast(this); } {code} {{VERSUS}} public void registerTempTable(String tableName) { this.sqlContext().registerDataFrameAsTable(this, tableName); } > Add "broadcast" method to DataFrame > --- > > Key: SPARK-26214 > URL: https://issues.apache.org/jira/browse/SPARK-26214 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas Decaux >Priority: Trivial > Labels: broadcast, dataframe > > As discussed at > [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,] > it's possible to force broadcast of DataFrame, even if total size is greater > than ``*spark.sql.autoBroadcastJoinThreshold``.* > But this not trivial for beginner, because there is no "broadcast" method (I > know, I am lazy ...). > We could add this method, with a WARN if size is greater than the threshold. > (if it's an easy one, I could do it?) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26214) Add "broadcast" method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704978#comment-16704978 ] Thomas Decaux commented on SPARK-26214: --- Well, not so different : {code:java} public void broadcast() { this.sqlContext().sparkContext().broadcast(this); } {code} {{VERSUS}} public void registerTempTable(String tableName) { this.sqlContext().registerDataFrameAsTable(this, tableName); } > Add "broadcast" method to DataFrame > --- > > Key: SPARK-26214 > URL: https://issues.apache.org/jira/browse/SPARK-26214 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas Decaux >Priority: Trivial > Labels: broadcast, dataframe > > As discussed at > [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,] > it's possible to force broadcast of DataFrame, even if total size is greater > than ``*spark.sql.autoBroadcastJoinThreshold``.* > But this not trivial for beginner, because there is no "broadcast" method (I > know, I am lazy ...). > We could add this method, with a WARN if size is greater than the threshold. > (if it's an easy one, I could do it?) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26214) Add "broadcast" method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704958#comment-16704958 ] Marco Gaido commented on SPARK-26214: - I don't think it is really the same. I don't think it is a good idea to add this new API. > Add "broadcast" method to DataFrame > --- > > Key: SPARK-26214 > URL: https://issues.apache.org/jira/browse/SPARK-26214 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas Decaux >Priority: Trivial > Labels: broadcast, dataframe > > As discussed at > [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,] > it's possible to force broadcast of DataFrame, even if total size is greater > than ``*spark.sql.autoBroadcastJoinThreshold``.* > But this not trivial for beginner, because there is no "broadcast" method (I > know, I am lazy ...). > We could add this method, with a WARN if size is greater than the threshold. > (if it's an easy one, I could do it?) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26236) Kafka delegation token support documentation
Gabor Somogyi created SPARK-26236: - Summary: Kafka delegation token support documentation Key: SPARK-26236 URL: https://issues.apache.org/jira/browse/SPARK-26236 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi Because SPARK-25501 has been merged to master, it's now time to update the docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704926#comment-16704926 ] Ian O Connell commented on SPARK-26043: --- Dumping it into a map, though, would leak it into the API of the methods; nothing in the user-land API/submitter code needs to be aware of the configuration right now. SparkEnv/HadoopConf coming from the 'ether' magically has just been a useful side channel. I can try to pull it out of the SparkConf instead, though that makes the code less portable between scalding/spark, unfortunately. > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and kerberos. > It's been historically marked as "DeveloperApi". But in reality it's not very > useful for others, and changes too much to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26236) Kafka delegation token support documentation
[ https://issues.apache.org/jira/browse/SPARK-26236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704903#comment-16704903 ] Gabor Somogyi commented on SPARK-26236: --- I've started working on this. > Kafka delegation token support documentation > > > Key: SPARK-26236 > URL: https://issues.apache.org/jira/browse/SPARK-26236 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > Because SPARK-25501 has been merged to master, it's now time to update the docs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704877#comment-16704877 ] Sean Owen commented on SPARK-26043: --- I see. Typically you'd just instantiate Configuration and it would pick up all the Hadoop config from your cluster env. I see you're setting the props locally on the command line though. Configuration is "Writable" but not directly "Serializable" unfortunately, or else you could just use it in the closure you send. You could dump the Configuration key-values into a Map and use that, I suppose? Probably a couple of lines of code. > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and kerberos. > It's been historically marked as "DeveloperApi". But in reality it's not very > useful for others, and changes too much to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
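The pattern suggested in this thread (flatten the driver-side Configuration into a plain map that can travel in a task closure, then inspect it on the executors) can be sketched as follows. This is a hedged illustration: `Configuration` is stood in for by a list of key/value pairs, and the s3guard key and values are examples, not a claim about any particular cluster.

```python
def conf_to_map(conf_entries):
    """Flatten (key, value) pairs into a picklable dict for closures."""
    return {k: v for k, v in conf_entries}

# Driver side: stand-in for iterating a Hadoop Configuration object
# (with real Hadoop you would iterate the Configuration's entries).
driver_entries = [
    ("fs.s3a.metadatastore.impl",
     "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore"),
    ("fs.defaultFS", "hdfs://namenode:8020"),
]
conf_map = conf_to_map(driver_entries)

# Executor side: the dict travels in the task closure, so library code can
# check what the submitter configured without reaching for SparkHadoopUtil.
def s3guard_enabled(conf):
    return "DynamoDBMetadataStore" in conf.get("fs.s3a.metadatastore.impl", "")

print(s3guard_enabled(conf_map))  # -> True
```

The trade-off Ian raises still applies: the map becomes an explicit parameter of whatever executor-side methods need it, rather than arriving from the "ether".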
[jira] [Commented] (SPARK-25857) Document delegation token code in Spark
[ https://issues.apache.org/jira/browse/SPARK-25857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704894#comment-16704894 ] Gabor Somogyi commented on SPARK-25857: --- +1, it's not an easy topic. > Document delegation token code in Spark > --- > > Key: SPARK-25857 > URL: https://issues.apache.org/jira/browse/SPARK-25857 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > By this I mean not user documentation, but documenting the functionality > provided in the {{org.apache.spark.deploy.security}} and related packages, so > that other developers making changes there can refer to it. > It seems to be a source of confusion every time somebody needs to touch that > code, so it would be good to have a document explaining how it all works, > including how it's hooked up to different resource managers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704869#comment-16704869 ] Ian O Connell commented on SPARK-26043: --- I didn't think I could use the hadoopConfiguration from the SparkContext on the executors?
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704864#comment-16704864 ] Ian O Connell commented on SPARK-26043: --- [~srowen] I just edited the comment right this second to add that context, sorry, race condition. But: (context: via the command line / in Spark I had been setting Hadoop configuration options, but I need to pick those up in some libraries on the executors to see what was set (whether S3Guard is enabled, in my case). I need some means to hook into what the submitter thought the Hadoop conf should be, to turn reporting to DynamoDB for S3Guard info on or off.)
[jira] [Comment Edited] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704829#comment-16704829 ] Ian O Connell edited comment on SPARK-26043 at 11/30/18 3:02 PM: - This change makes it difficult to get a fully populated Hadoop configuration on executor hosts. If Spark properties were applied to the Hadoop conf in the driver, those don't show up in a `new Configuration`. SparkHadoopUtil let one go SparkEnv -> SparkHadoopUtil -> Configuration that's fully populated. Is there a nicer way to achieve this, possibly? (context: via the command line / in Spark I had been setting Hadoop configuration options, but I need to pick those up in some libraries on the executors to see what was set (whether S3Guard is enabled, in my case). I need some means to hook into what the submitter thought the Hadoop conf should be, to turn reporting to DynamoDB for S3Guard info on or off.) was (Author: ianoc): With this change it makes it difficult to get a fully populated hadoop configuration on executor hosts. If spark properties were applied to the hadoop conf in the driver those don't show up in a `new Configuration`. This let one go SparkEnv -> SparkHadoopUtil -> Configuration thats fully populated. Is there a nicer way to achieve this possibly?
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704856#comment-16704856 ] Sean Owen commented on SPARK-26043: --- What is the use case for that? I think this class was intended to be internal to Spark. Does SparkContext.hadoopConfiguration() give you what you need?
[jira] [Commented] (SPARK-26200) Column values are incorrectly transposed when a field in a PySpark Row requires serialization
[ https://issues.apache.org/jira/browse/SPARK-26200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704839#comment-16704839 ] David Lyness commented on SPARK-26200: -- Yep, I believe the root cause for this and SPARK-24915 is the same. (You see the behaviour in SPARK-24915 if the mis-ordered columns are of incompatible/different types; you see the behaviour in this ticket if the mis-ordered columns are of the same type.) Note that the impact is more severe in the case of this ticket: rather than a function call that fails during development and can be worked around, this has the potential to silently cause data correctness issues for users of PySpark. > Column values are incorrectly transposed when a field in a PySpark Row > requires serialization > - > > Key: SPARK-26200 > URL: https://issues.apache.org/jira/browse/SPARK-26200 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.4.0 > Environment: Spark version 2.4.0 > Using Scala version 2.11.12, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_144 > The same issue is observed when PySpark is run on both macOS 10.13.6 and > CentOS 7, so this appears to be a cross-platform issue. > Reporter: David Lyness > Priority: Major > Labels: correctness > > h2. Description of issue > Whenever a field in a PySpark {{Row}} requires serialization (such as a > {{DateType}} or {{TimestampType}}), the DataFrame generated by the code below > will assign column values *in alphabetical order*, rather than assigning each > column value to its specified column. > h3. 
Code to reproduce: > {code:java} > import datetime > from pyspark.sql import Row > from pyspark.sql.session import SparkSession > from pyspark.sql.types import DateType, StringType, StructField, StructType > spark = SparkSession.builder.getOrCreate() > schema = StructType([ > StructField("date_column", DateType()), > StructField("my_b_column", StringType()), > StructField("my_a_column", StringType()), > ]) > spark.createDataFrame([Row( > date_column=datetime.date.today(), > my_b_column="my_b_value", > my_a_column="my_a_value" > )], schema).show() > {code} > h3. Expected result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_b_value| my_a_value| > +---+---+---+{noformat} > h3. Actual result: > {noformat} > +---+---+---+ > |date_column|my_b_column|my_a_column| > +---+---+---+ > | 2018-11-28| my_a_value| my_b_value| > +---+---+---+{noformat} > (Note that {{my_a_value}} and {{my_b_value}} are transposed.) > h2. Analysis of issue > Reviewing [the relevant code on > GitHub|https://github.com/apache/spark/blame/master/python/pyspark/sql/types.py#L593-L622], > there are two relevant conditional blocks: > > {code:java} > if self._needSerializeAnyField: > # Block 1, does not work correctly > else: > # Block 2, works correctly > {code} > {{Row}} is implemented as both a tuple of alphabetically-sorted columns, and > a dictionary of named columns. 
In Block 2, there is a conditional that works > specifically to serialize a {{Row}} object: > > {code:java} > elif isinstance(obj, Row) and getattr(obj, "__from_dict__", False): > return tuple(obj[n] for n in self.names) > {code} > There is no such condition in Block 1, so we fall into this instead: > > {code:java} > elif isinstance(obj, (tuple, list)): > return tuple(f.toInternal(v) if c else v > for f, v, c in zip(self.fields, obj, self._needConversion)) > {code} > The behaviour in the {{zip}} call is wrong, since {{obj}} (the {{Row}}) will > return a different ordering than the schema fields. So we end up with: > {code:java} > (date, date, True), > (b, a, False), > (a, b, False) > {code} > h2. Workarounds > Correct behaviour is observed if you use a Python {{list}} or {{dict}} > instead of PySpark's {{Row}} object: > > {code:java} > # Using a list works > spark.createDataFrame([[ > datetime.date.today(), > "my_b_value", > "my_a_value" > ]], schema) > # Using a dict also works > spark.createDataFrame([{ > "date_column": datetime.date.today(), > "my_b_column": "my_b_value", > "my_a_column": "my_a_value" > }], schema){code} > Correct behaviour is also observed if you have no fields that require > serialization; in this example, changing {{date_column}} to {{StringType}} > avoids the correctness issue.
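The mis-zip described above can be reproduced without Spark at all. A plain-Python sketch (field values are hypothetical) of why zipping schema-ordered fields against a Row's alphabetically sorted values transposes columns:

```python
# PySpark's Row stores keyword fields sorted alphabetically, while the schema
# keeps declaration order; zipping the two misassigns values.
schema_fields = ["date_column", "my_b_column", "my_a_column"]  # declaration order
row_values = ["2018-11-28", "my_a_value", "my_b_value"]        # alphabetical order

transposed = dict(zip(schema_fields, row_values))
# date_column happens to line up, but the other two are swapped:
assert transposed["my_b_column"] == "my_a_value"
assert transposed["my_a_column"] == "my_b_value"
```

This is exactly the `(b, a, False), (a, b, False)` pairing shown in the analysis.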
[jira] [Closed] (SPARK-26234) Column list specification in INSERT statement
[ https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joby Joje closed SPARK-26234. - A card, SPARK-20845, is already in place to resolve this, hence closing the ticket. > Column list specification in INSERT statement > - > > Key: SPARK-26234 > URL: https://issues.apache.org/jira/browse/SPARK-26234 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.3.0 > Reporter: Joby Joje > Priority: Major > > While trying to OVERWRITE a Hive table with specific columns from > Spark (PySpark) using a dataframe, I get the below error > {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting > Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', > 'REDUCE'} > (line 1, pos 36)\n\n== SQL ==\ninsert into table DB.TableName (Col1, Col2, > Col3) select Col1, Col2, Col3 FROM > dataframe\n^^^\n" > {quote} > {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select > Col1, Col2, Col3 FROM dataframe")}} > But trying the same via the _Hive Terminal_ goes through fine. > Please check the below link to get more info on the same. > [https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]
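Until SPARK-20845 lands, a common workaround is to skip the column list entirely and align the DataFrame to the table's declared column order. A hedged PySpark sketch (assumes `df` holds the data and the table's columns are declared in the order Col1, Col2, Col3):

```python
# Hypothetical sketch: Spark's INSERT parser rejects the (Col1, Col2, Col3)
# column list, so reorder the DataFrame to the table's column order instead
# and let insertInto match columns positionally.
reordered = df.select("Col1", "Col2", "Col3")
reordered.write.insertInto("DB.TableName", overwrite=True)
```

Because `insertInto` matches by position, getting the `select` order right is the caller's responsibility here.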
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704829#comment-16704829 ] Ian O Connell commented on SPARK-26043: --- This change makes it difficult to get a fully populated Hadoop configuration on executor hosts. If Spark properties were applied to the Hadoop conf in the driver, those don't show up in a `new Configuration`. SparkHadoopUtil let one go SparkEnv -> SparkHadoopUtil -> Configuration that's fully populated. Is there a nicer way to achieve this, possibly?
[jira] [Assigned] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
[ https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26235: Assignee: Apache Spark > Change log level for ClassNotFoundException/NoClassDefFoundError in > SparkSubmit to Error > > > Key: SPARK-26235 > URL: https://issues.apache.org/jira/browse/SPARK-26235 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Gengliang Wang > Assignee: Apache Spark > Priority: Trivial > > In my local setup, I set the log4j root category to ERROR > (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console > , the first item that shows up if you Google "set spark log level".) > When I run a command such as > ``` > spark-submit --class foo bar.jar > ``` > nothing shows up, and the script exits. > After a quick investigation, I think the log level for > ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR > instead of WARN.
[jira] [Commented] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
[ https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704821#comment-16704821 ] Apache Spark commented on SPARK-26235: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/23189
[jira] [Assigned] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
[ https://issues.apache.org/jira/browse/SPARK-26235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26235: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-26188) Spark 2.4.0 Partitioning behavior breaks backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-26188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704804#comment-16704804 ] Steve Loughran commented on SPARK-26188: bq. > My team uses spark to partition and output parquet files to amazon S3. We typically use 256 partitions, from 00 to ff. Independent of the patch, you'd better be using something to deliver the consistency which commit-via-rename requires for directory listing (on S3A: S3Guard), or better, an output committer designed from the ground up for S3 (S3A committers), or you are at risk of data loss due to inconsistent listings. > Spark 2.4.0 Partitioning behavior breaks backwards compatibility > > > Key: SPARK-26188 > URL: https://issues.apache.org/jira/browse/SPARK-26188 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Damien Doucet-Girard >Assignee: Gengliang Wang >Priority: Critical > Fix For: 2.4.1, 3.0.0 > > > My team uses spark to partition and output parquet files to amazon S3. We > typically use 256 partitions, from 00 to ff. > We've observed that in spark 2.3.2 and prior, it reads the partitions as > strings by default. However, in spark 2.4.0 and later, the type of each > partition is inferred by default, and partitions such as 00 become 0 and 4d > become 4.0. 
> Here is a log sample of this behavior from one of our jobs: > 2.4.0: > {code:java} > 18/11/27 14:02:27 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=00/part-00061-hashredacted.parquet, > range: 0-662, partition values: [0] > 18/11/27 14:02:28 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ef/part-00034-hashredacted.parquet, > range: 0-662, partition values: [ef] > 18/11/27 14:02:29 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4a/part-00151-hashredacted.parquet, > range: 0-662, partition values: [4a] > 18/11/27 14:02:30 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=74/part-00180-hashredacted.parquet, > range: 0-662, partition values: [74] > 18/11/27 14:02:32 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=f5/part-00156-hashredacted.parquet, > range: 0-662, partition values: [f5] > 18/11/27 14:02:33 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=50/part-00195-hashredacted.parquet, > range: 0-662, partition values: [50] > 18/11/27 14:02:34 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=70/part-00054-hashredacted.parquet, > range: 0-662, partition values: [70] > 18/11/27 14:02:35 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b9/part-00012-hashredacted.parquet, > range: 0-662, partition values: [b9] > 18/11/27 14:02:37 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=d2/part-00016-hashredacted.parquet, > range: 0-662, partition values: [d2] > 18/11/27 14:02:38 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=51/part-3-hashredacted.parquet, > range: 0-662, partition values: [51] > 18/11/27 14:02:39 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=84/part-00135-hashredacted.parquet, > range: 0-662, 
partition values: [84] > 18/11/27 14:02:40 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=b5/part-00190-hashredacted.parquet, > range: 0-662, partition values: [b5] > 18/11/27 14:02:41 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=88/part-00143-hashredacted.parquet, > range: 0-662, partition values: [88] > 18/11/27 14:02:42 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=4d/part-00120-hashredacted.parquet, > range: 0-662, partition values: [4.0] > 18/11/27 14:02:43 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ac/part-00119-hashredacted.parquet, > range: 0-662, partition values: [ac] > 18/11/27 14:02:44 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=24/part-00139-hashredacted.parquet, > range: 0-662, partition values: [24] > 18/11/27 14:02:45 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=fd/part-00167-hashredacted.parquet, > range: 0-662, partition values: [fd] > 18/11/27 14:02:46 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=52/part-00033-hashredacted.parquet, > range: 0-662, partition values: [52] > 18/11/27 14:02:47 INFO FileScanRDD: Reading File path: > s3a://bucketnamereadacted/ddgirard/suffix=ab/part-00083-hashredacted.parquet, > range:
[jira] [Created] (SPARK-26235) Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error
Gengliang Wang created SPARK-26235: -- Summary: Change log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit to Error Key: SPARK-26235 URL: https://issues.apache.org/jira/browse/SPARK-26235 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Gengliang Wang In my local setup, I set the log4j root category to ERROR (https://stackoverflow.com/questions/27781187/how-to-stop-info-messages-displaying-on-spark-console , the first item that shows up if you Google "set spark log level".) When I run a command such as ``` spark-submit --class foo bar.jar ``` nothing shows up, and the script exits. After a quick investigation, I think the log level for ClassNotFoundException/NoClassDefFoundError in SparkSubmit should be ERROR instead of WARN.
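For reference, the setup described above (root category at ERROR) is the usual one-line entry in conf/log4j.properties; the appender name here assumes the `console` appender from Spark's shipped log4j.properties.template:

```properties
# conf/log4j.properties fragment; assumes a 'console' appender is defined
# elsewhere in the file, as in Spark's template
log4j.rootCategory=ERROR, console
```

With this in place, any message SparkSubmit logs at WARN is suppressed, which is why the class-not-found errors vanish silently.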
[jira] [Resolved] (SPARK-26234) Column list specification in INSERT statement
[ https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-26234. - Resolution: Duplicate
[jira] [Updated] (SPARK-26198) Metadata serialize null values throw NPE
[ https://issues.apache.org/jira/browse/SPARK-26198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26198: -- Priority: Minor (was: Major) > Metadata serialize null values throw NPE > > > Key: SPARK-26198 > URL: https://issues.apache.org/jira/browse/SPARK-26198 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Yuming Wang > Priority: Minor > > How to reproduce this issue: > {code} > scala> val meta = new > org.apache.spark.sql.types.MetadataBuilder().putNull("key").build().json > java.lang.NullPointerException > at > org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196) > at org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180) > {code}
[jira] [Resolved] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros resolved SPARK-26232. Resolution: Won't Fix One of the third-party tools uses Netty3, so Netty3 really is needed as a transitive dependency. > Remove old Netty3 dependency from the build > > > Key: SPARK-26232 > URL: https://issues.apache.org/jira/browse/SPARK-26232 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 3.0.0 > Reporter: Attila Zsolt Piros > Priority: Major > > The old Netty artifact (3.9.9.Final) is unused. > The reason it does not collide with Netty4 is that they use different package > names: > * Netty3: org.jboss.netty.* > * Netty4: io.netty.* > Still, Netty3 is not needed.
[jira] [Updated] (SPARK-26234) Column list specification in INSERT statement
[ https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joby Joje updated SPARK-26234: -- Description: While trying to OVERWRITE the Hive table with specific columns from Spark(Pyspark) using a dataframe getting the below error {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 36)\n\n== SQL ==\ninsert table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe\n^^^\n" {quote} {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe")}} {{But on trying the same via _Hive Terminal_ goes through fine.}} Please check the below link to get more info on the same. [https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement] was: While trying to OVERWRITE the Hive table with specific columns from Spark(Pyspark) using a dataframe getting the below error {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 36)\n\n== SQL ==\ninsert OVERWRITE table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe\n^^^\n" {quote} {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe")}} {{But on trying the same via _Hive Terminal_ goes through fine.}} Please check the below link to get more info on the same. 
https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement
[jira] [Updated] (SPARK-26234) Column list specification in INSERT statement
[ https://issues.apache.org/jira/browse/SPARK-26234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joby Joje updated SPARK-26234: -- Description: While trying to OVERWRITE the Hive table with specific columns from Spark(Pyspark) using a dataframe getting the below error {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 36)\n\n== SQL ==\ninsert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe\n^^^\n" {quote} {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe")}} {{But on trying the same via _Hive Terminal_ goes through fine.}} Please check the below link to get more info on the same. [https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement] was: While trying to OVERWRITE the Hive table with specific columns from Spark(Pyspark) using a dataframe getting the below error {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 36)\n\n== SQL ==\ninsert table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe\n^^^\n" {quote} {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe")}} {{But on trying the same via _Hive Terminal_ goes through fine.}} Please check the below link to get more info on the same. 
[https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement]
[jira] [Created] (SPARK-26234) Column list specification in INSERT statement
Joby Joje created SPARK-26234: - Summary: Column list specification in INSERT statement Key: SPARK-26234 URL: https://issues.apache.org/jira/browse/SPARK-26234 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Joby Joje While trying to OVERWRITE a Hive table with specific columns from Spark (PySpark) using a DataFrame, the below error is thrown {quote}pyspark.sql.utils.ParseException: u"\nmismatched input 'col1' expecting Unknown macro: \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'} (line 1, pos 36)\n\n== SQL ==\ninsert OVERWRITE table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe\n^^^\n" {quote} {{sparkSession.sql("insert into table DB.TableName (Col1, Col2, Col3) select Col1, Col2, Col3 FROM dataframe")}} {{But trying the same via the _Hive Terminal_ goes through fine.}} Please check the below link to get more info on the same. https://stackoverflow.com/questions/53517671/column-list-specification-in-insert-overwrite-statement -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
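For contrast with the parse error reported above, a column list in {{INSERT ... SELECT}} is standard SQL and is accepted by other engines (the report notes the Hive CLI accepts it; Spark 2.3's parser does not). A minimal sketch using Python's built-in sqlite3 module; the table and column names are illustrative, not taken from the report:

```python
import sqlite3

# Standard SQL accepts a column list in INSERT ... SELECT; this is the
# syntax Spark 2.3's parser rejects. Demonstrated here with sqlite3.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
con.execute("CREATE TABLE src (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
con.execute("INSERT INTO src VALUES (1, 2, 3)")

# The column list parses and runs fine in a standard-compliant engine.
con.execute("INSERT INTO t (col1, col2, col3) SELECT col1, col2, col3 FROM src")
assert con.execute("SELECT col1, col2, col3 FROM t").fetchall() == [(1, 2, 3)]
```

In Spark versions affected by this parser limitation, the usual workaround is to select the columns in the target table's order and omit the column list.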
[jira] [Assigned] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26232: Assignee: Apache Spark > Remove old Netty3 dependency from the build > > > Key: SPARK-26232 > URL: https://issues.apache.org/jira/browse/SPARK-26232 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > The old Netty artifact (3.9.9.Final) is unused. > The reason it does not collide with Netty4 is that they use different package > names: > * Netty3: org.jboss.netty.* > * Netty4: io.netty.* > Still, Netty3 is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
Miquel created SPARK-26233: -- Summary: Incorrect decimal value with java beans and first/last/max... functions Key: SPARK-26233 URL: https://issues.apache.org/jira/browse/SPARK-26233 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.4.0, 2.3.1 Reporter: Miquel Decimal values from Java beans are incorrectly scaled when used with functions like first/last/max... This problem arises because Encoders.bean always sets decimal values as _DecimalType(this.MAX_PRECISION(), 18)._ Usually this is not a problem with numeric functions like *sum*, but for functions like *first*/*last*/*max*... it is a problem. How to reproduce this error: Using this class as an example: {code:java} public class Foo implements Serializable { private String group; private BigDecimal var; public BigDecimal getVar() { return var; } public void setVar(BigDecimal var) { this.var = var; } public String getGroup() { return group; } public void setGroup(String group) { this.group = group; } } {code} And a dummy code to create some objects: {code:java} Dataset<Foo> ds = spark.range(5) .map(l -> { Foo foo = new Foo(); foo.setGroup("" + l); foo.setVar(BigDecimal.valueOf(l + 0.)); return foo; }, Encoders.bean(Foo.class)); ds.printSchema(); ds.show(); +-+--+ |group| var| +-+--+ | 0|0.| | 1|1.| | 2|2.| | 3|3.| | 4|4.| +-+--+ {code} We can see that the DecimalType has precision 38 and scale 18 and all values are shown correctly. But if we use the first function, they are scaled incorrectly: {code:java} ds.groupBy(col("group")) .agg( first("var") ) .show(); +-+-+ |group|first(var, false)| +-+-+ | 3| 3.E-14| | 0| 1.111E-15| | 1| 1.E-14| | 4| 4.E-14| | 2| 2.E-14| +-+-+ {code} This incorrect behavior cannot be reproduced if we use "numerical" functions like sum or if the column is cast to a new DecimalType. 
{code:java} ds.groupBy(col("group")) .agg( sum("var") ) .show(); +-++ |group| sum(var)| +-++ | 3|3.00| | 0|0.00| | 1|1.00| | 4|4.00| | 2|2.00| +-++ ds.groupBy(col("group")) .agg( first(col("var").cast(new DecimalType(38, 8))) ) .show(); +-++ |group|first(CAST(var AS DECIMAL(38,8)), false)| +-++ | 3| 3.| | 0| 0.| | 1| 1.| | 4| 4.| | 2| 2.| +-++ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
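The corrupted outputs in the report (e.g. values near 1.0 surfacing around 1E-14) look like the correct digits shifted by many decimal places, which is consistent with a fixed-point value whose unscaled digits were reinterpreted under the wrong scale. A pure-Python sketch of that failure mode; `encode`/`decode` are illustrative helpers, not Spark internals:

```python
from decimal import Decimal

def encode(value, scale):
    # Fixed-point storage: keep only the unscaled integer digits,
    # as DecimalType does with a declared (precision, scale).
    return int(Decimal(value).scaleb(scale))

def decode(unscaled, scale):
    # Reconstruct the value by shifting the decimal point back.
    return Decimal(unscaled).scaleb(-scale)

stored = encode("1.0", 18)                    # DecimalType(38, 18)-style encoding
assert decode(stored, 18) == Decimal("1.0")   # round-trips with the right scale
# Decoding with a mismatched scale shifts the decimal point, e.g. by 14 places:
assert decode(stored, 32) == Decimal("1E-14")
```

The same unscaled digits are only meaningful together with the scale they were written under; if an aggregate carries a different scale in its result type, the decoded value is silently shifted.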
[jira] [Commented] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704700#comment-16704700 ] Apache Spark commented on SPARK-26232: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/23188 > Remove old Netty3 dependency from the build > > > Key: SPARK-26232 > URL: https://issues.apache.org/jira/browse/SPARK-26232 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The old Netty artifact (3.9.9.Final) is unused. > The reason it does not collide with Netty4 is that they use different package > names: > * Netty3: org.jboss.netty.* > * Netty4: io.netty.* > Still, Netty3 is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26232: Assignee: (was: Apache Spark) > Remove old Netty3 dependency from the build > > > Key: SPARK-26232 > URL: https://issues.apache.org/jira/browse/SPARK-26232 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The old Netty artifact (3.9.9.Final) is unused. > The reason it does not collide with Netty4 is that they use different package > names: > * Netty3: org.jboss.netty.* > * Netty4: io.netty.* > Still, Netty3 is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704699#comment-16704699 ] Apache Spark commented on SPARK-26232: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/23188 > Remove old Netty3 dependency from the build > > > Key: SPARK-26232 > URL: https://issues.apache.org/jira/browse/SPARK-26232 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The old Netty artifact (3.9.9.Final) is unused. > The reason it does not collide with Netty4 is that they use different package > names: > * Netty3: org.jboss.netty.* > * Netty4: io.netty.* > Still, Netty3 is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26232) Remove old Netty3 dependency from the build
[ https://issues.apache.org/jira/browse/SPARK-26232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26232: --- Summary: Remove old Netty3 dependency from the build (was: Remove old Netty3 artifact from the build ) > Remove old Netty3 dependency from the build > > > Key: SPARK-26232 > URL: https://issues.apache.org/jira/browse/SPARK-26232 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > The old Netty artifact (3.9.9.Final) is unused. > The reason it does not collide with Netty4 is that they use different package > names: > * Netty3: org.jboss.netty.* > * Netty4: io.netty.* > Still, Netty3 is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26232) Remove old Netty3 artifact from the build
Attila Zsolt Piros created SPARK-26232: -- Summary: Remove old Netty3 artifact from the build Key: SPARK-26232 URL: https://issues.apache.org/jira/browse/SPARK-26232 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Attila Zsolt Piros The old Netty artifact (3.9.9.Final) is unused. The reason it does not collide with Netty4 is that they use different package names: * Netty3: org.jboss.netty.* * Netty4: io.netty.* Still, Netty3 is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
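Because the two major Netty versions live under different root packages, both jars can sit on the classpath without clashing; the point of the issue is that nothing should reference the Netty 3 prefix at all. A toy Python check that classifies references by package prefix; the example import lines are made up, not taken from Spark's sources:

```python
# Netty 3 classes live under org.jboss.netty.*, Netty 4 under io.netty.*.
# A build cleanup like this one amounts to verifying that no source line
# still references the old prefix before dropping the jar.
imports = [
    "import io.netty.buffer.ByteBuf;",               # Netty 4
    "import io.netty.channel.Channel;",              # Netty 4
    "import org.jboss.netty.buffer.ChannelBuffer;",  # hypothetical stale Netty 3 reference
]
netty3_refs = [line for line in imports if "org.jboss.netty" in line]
assert netty3_refs == ["import org.jboss.netty.buffer.ChannelBuffer;"]
```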
[jira] [Commented] (SPARK-24661) Window API - using multiple fields for partitioning with WindowSpec API and dataset that is cached causes org.apache.spark.sql.catalyst.errors.package$TreeNodeExceptio
[ https://issues.apache.org/jira/browse/SPARK-24661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704668#comment-16704668 ] Oleg V Korchagin commented on SPARK-24661: -- Why PySpark in components list? > Window API - using multiple fields for partitioning with WindowSpec API and > dataset that is cached causes > org.apache.spark.sql.catalyst.errors.package$TreeNodeException > > > Key: SPARK-24661 > URL: https://issues.apache.org/jira/browse/SPARK-24661 > Project: Spark > Issue Type: Bug > Components: DStreams, Java API, PySpark >Affects Versions: 2.3.0 >Reporter: David Mavashev >Priority: Major > > Steps to reproduce: > Creating a data set: > > {code:java} > List<String> simpleWindowColumns = new ArrayList<>(); > simpleWindowColumns.add("column1"); > simpleWindowColumns.add("column2"); > Map<String, String> expressionsWithAliasesEntrySet = new HashMap<String, String>(); > expressionsWithAliasesEntrySet.put("count(id)", "count_column"); > DataFrameReader reader = sparkSession.read().format("csv"); > Dataset<Row> sparkDataSet = reader.option("header", > "true").load("/path/to/data/data.csv"); > //Invoking cache: > sparkDataSet = sparkDataSet.cache(); > //Creating window spec with 2 columns: > WindowSpec window = > Window.partitionBy(JavaConverters.asScalaIteratorConverter(simpleWindowColumns.stream().map(item->sparkDataSet.col(item)).iterator()).asScala().toSeq()); > sparkDataSet = > sparkDataSet.withColumns(JavaConverters.asScalaIteratorConverter(expressionsWithAliasesEntrySet.entrySet().stream().map(item->item.getKey()).collect(Collectors.toList()).iterator()).asScala().toSeq(), > > JavaConverters.asScalaIteratorConverter(expressionsWithAliasesEntrySet.entrySet().stream().map(item->new > > Column(item.getValue()).over(window)).collect(Collectors.toList()).iterator()).asScala().toSeq()); > sparkDataSet.show();{code} > Expected: > > Results are shown > > > Actual: the following exception is thrown > {code:java} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: 
windowspecdefinition(O003#3, O006#6, specifiedwindowframe(RowFrame, > unboundedpreceding$(), unboundedfollowing$())) at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385) at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:244) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:190) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$1.apply(Expression.scala:189) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at > scala.collection.immutable.List.map(List.scala:285) at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized$lzycompute(Expression.scala:189) > at > org.apache.spark.sql.catalyst.expressions.Expression.canonicalized(Expression.scala:188) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$.normalizeExprId(QueryPlan.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:232) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$doCanonicalize$1.apply(QueryPlan.scala:226) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at > scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at > scala.collection.AbstractTraversable.map(Traversable.scala:104) at >
[jira] [Commented] (SPARK-26211) Fix InSet for binary, and struct and array with null.
[ https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704591#comment-16704591 ] Apache Spark commented on SPARK-26211: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/23187 > Fix InSet for binary, and struct and array with null. > - > > Key: SPARK-26211 > URL: https://issues.apache.org/jira/browse/SPARK-26211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > Currently {{InSet}} doesn't work properly for binary type, or for struct and > array types with a null value in the set. > This is because, for binary type, the {{HashSet}} doesn't work properly for > {{Array[Byte]}}, and for struct and array types with a null value in the set, > the {{ordering}} will throw an {{NPE}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26211) Fix InSet for binary, and struct and array with null.
[ https://issues.apache.org/jira/browse/SPARK-26211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704590#comment-16704590 ] Apache Spark commented on SPARK-26211: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/23187 > Fix InSet for binary, and struct and array with null. > - > > Key: SPARK-26211 > URL: https://issues.apache.org/jira/browse/SPARK-26211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.3, 2.4.1, 3.0.0 > > > Currently {{InSet}} doesn't work properly for binary type, or for struct and > array types with a null value in the set. > This is because, for binary type, the {{HashSet}} doesn't work properly for > {{Array[Byte]}}, and for struct and array types with a null value in the set, > the {{ordering}} will throw an {{NPE}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
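The binary-type half of this bug is a content-vs-identity hashing issue: a JVM {{Array[Byte]}} inherits identity-based equals/hashCode, so a {{HashSet}} misses an array with equal content. A pure-Python analogue; the wrapper class is illustrative, not Spark code:

```python
# Java arrays hash and compare by object identity, not content, so a
# HashSet of byte[] fails membership tests for equal-content arrays.
# Python's default object semantics are also identity-based, which lets
# us mimic the bug with a wrapper that defines no __eq__/__hash__:

class JavaLikeByteArray:
    """Identity semantics, like a JVM Array[Byte]."""
    def __init__(self, data):
        self.data = bytes(data)

s = {JavaLikeByteArray(b"\x01\x02")}
assert JavaLikeByteArray(b"\x01\x02") not in s  # same content, not found

# The fix is content-based comparison, as Python's bytes already provides:
assert b"\x01\x02" in {bytes([1, 2])}
```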
[jira] [Reopened] (SPARK-25145) Buffer size too small on spark.sql query with filterPushdown predicate=True
[ https://issues.apache.org/jira/browse/SPARK-25145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørnar Jensen reopened SPARK-25145: New information that I believe will make it reproducible: The "zerocopy" option is what triggers this crash in our system. With "zerocopy" set to "false" the reads produce results. spark.sql().explain() shows that it pushes the filters down. Turning on "zerocopy" causes the query with filter pushdown to crash with buffer size too small. Furthermore, orc.compress.size and orc.buffer.size.enforce do not seem to have any effect/stick when tried. {code:java} pyspark --conf 'spark.hadoop.hive.exec.orc.zerocopy=true' => Crashes pyspark --conf 'spark.hadoop.hive.exec.orc.zerocopy=false' => Succeeds {code} I do not know why, though. Best regards, Bjørnar. > Buffer size too small on spark.sql query with filterPushdown predicate=True > --- > > Key: SPARK-25145 > URL: https://issues.apache.org/jira/browse/SPARK-25145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 > Environment: > {noformat} > # Generated by Apache Ambari. 
Wed Mar 21 15:37:53 2018 > spark.driver.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.eventLog.dir hdfs:///spark2-history/ > spark.eventLog.enabled true > spark.executor.extraLibraryPath > /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64 > spark.hadoop.hive.vectorized.execution.enabled true > spark.history.fs.logDirectory hdfs:///spark2-history/ > spark.history.kerberos.keytab none > spark.history.kerberos.principal none > spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider > spark.history.retainedApplications 50 > spark.history.ui.port 18081 > spark.io.compression.lz4.blockSize 128k > spark.locality.wait 2s > spark.network.timeout 600s > spark.serializer org.apache.spark.serializer.KryoSerializer > spark.shuffle.consolidateFiles true > spark.shuffle.io.numConnectionsPerPeer 10 > spark.sql.autoBroadcastJoinTreshold 26214400 > spark.sql.shuffle.partitions 300 > spark.sql.statistics.fallBack.toHdfs true > spark.sql.tungsten.enabled true > spark.driver.memoryOverhead 2048 > spark.executor.memoryOverhead 4096 > spark.yarn.historyServer.address service-10-4.local:18081 > spark.yarn.queue default > spark.sql.warehouse.dir hdfs:///apps/hive/warehouse > spark.sql.execution.arrow.enabled true > spark.sql.hive.convertMetastoreOrc true > spark.sql.orc.char.enabled true > spark.sql.orc.enabled true > spark.sql.orc.filterPushdown true > spark.sql.orc.impl native > spark.sql.orc.enableVectorizedReader true > spark.yarn.jars hdfs:///apps/spark-jars/231/jars/* > {noformat} > >Reporter: Bjørnar Jensen >Priority: Minor > Attachments: create_bug.py, report.txt > > > java.lang.IllegalArgumentException: Buffer size too small. 
size = 262144 > needed = 2205991 > # > {code:java} > Python > import numpy as np > import pandas as pd > # Create a spark dataframe > df = pd.DataFrame({'a': np.arange(10), 'b': np.arange(10) / 2.0}) > sdf = spark.createDataFrame(df) > print('Created spark dataframe:') > sdf.show() > # Save table as orc > sdf.write.saveAsTable(format='orc', mode='overwrite', > name='bjornj.spark_buffer_size_too_small_on_filter_pushdown', > compression='zlib') > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Fetch entire table (works) > print('Read entire table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown').show() > # Ensure filterPushdown is disabled > spark.conf.set('spark.sql.orc.filterPushdown', False) > # Query without filterPushdown (works) > print('Read a selection from table with "filterPushdown"=False') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > # Ensure filterPushdown is enabled > spark.conf.set('spark.sql.orc.filterPushdown', True) > # Query with filterPushDown (fails) > print('Read a selection from table with "filterPushdown"=True') > spark.sql('SELECT * FROM > bjornj.spark_buffer_size_too_small_on_filter_pushdown WHERE a > 5').show() > {code} > {noformat} > ~/bug_report $ pyspark > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 2018-08-17 13:44:31,365 WARN Utils: Service 'SparkUI' could not bind on port > 4040. Attempting port 4041. > Jupyter console 5.1.0 > Python 3.6.3 |Intel Corporation| (default, May 4 2018, 04:22:28) > Type 'copyright', 'credits' or 'license' for more information > IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
[jira] [Created] (SPARK-26231) Dataframes inner join on double datatype columns resulting in Cartesian product
Shrikant created SPARK-26231: Summary: Dataframes inner join on double datatype columns resulting in Cartesian product Key: SPARK-26231 URL: https://issues.apache.org/jira/browse/SPARK-26231 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.1, 1.6.0 Reporter: Shrikant The following code snippet demonstrates the bug. The join on the Double columns results in a Cartesian product; when both columns are typecast to String, it works. Please see the explain plans below. Error: scala> cartesianJoinErr.explain() == Physical Plan == CartesianProduct :- ConvertToSafe : +- Project [name#143,group#144,data#145,name#143 AS name1#146] : +- Filter (name#143 = name#143) : +- Scan ExistingRDD[name#143,group#144,data#145] +- Scan ExistingRDD[name#147,group#148,data#149] --- After conversion to String, explain plan: stringColJoinWorks.explain() == Physical Plan == SortMergeJoin [name1String#151], [name2String#152] :- Sort [name1String#151 ASC], false, 0 : +- TungstenExchange hashpartitioning(name1String#151,200), None : +- Project [name#143,group#144,data#145,cast(name#143 as string) AS name1String#151] : +- Scan ExistingRDD[name#143,group#144,data#145] +- Sort [name2String#152 ASC], false, 0 +- TungstenExchange hashpartitioning(name2String#152,200), None +- Project [name#153,group#154,data#155,cast(name#153 as string) AS name2String#152] +- Scan ExistingRDD[name#153,group#154,data#155] import org.apache.spark.sql.Row import org.apache.spark.sql.Dataset import org.apache.spark.sql.types._ import org.apache.spark.sql.functions val doubleRDD = sc.parallelize(Seq( Row(1.0, 2, 1), Row(2.0, 8, 2), Row(3.0, 10, 3), Row(4.0, 10, 4))) val testSchema = StructType(Seq( StructField("name", DoubleType, nullable = true), StructField("group", IntegerType, nullable = true), StructField("data", IntegerType, nullable = true))) val doubleRDDCartesian = sqlContext.createDataFrame(doubleRDD, testSchema) val cartNewCol = doubleRDDCartesian.select($"name" , $"group", $"data") val newColName1DF = 
cartNewCol.withColumn("name1", $"name") val cartesianJoinErr = newColName1DF.join(doubleRDDCartesian, newColName1DF("name1")===(doubleRDDCartesian("name"))) cartesianJoinErr.show cartesianJoinErr.explain() //Convert both into StringType val stringColDF1 = doubleRDDCartesian.withColumn("name1String",$"name".cast("String")) val stringColDF2 = cartNewCol.withColumn("name2String", $"name".cast("String")) val stringColJoinWorks = stringColDF1.join(stringColDF2, stringColDF1("name1String")===(stringColDF2("name2String"))) stringColJoinWorks.show stringColJoinWorks.explain() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
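The failing explain plan shows {{Filter (name#143 = name#143)}}: both sides of the join condition resolved to the same column of the same lineage (because {{name1}} is just an alias of {{name}} from the same DataFrame), so the predicate is trivially true and the join degenerates into a Cartesian product. A minimal pure-Python sketch of that degeneration, not Spark's planner:

```python
# A nested-loop join keeps every (left, right) pair that satisfies the
# predicate. If the predicate accidentally compares one side with itself,
# it is always true and the result is the full cross product.
left = [1.0, 2.0]
right = [3.0, 4.0]

def join(left, right, pred):
    return [(l, r) for l in left for r in right if pred(l, r)]

# Intended condition: compare left key against right key.
assert join(left, right, lambda l, r: l == r) == []

# Degenerate condition: both references resolve to the same side,
# as in Filter (name#143 = name#143) above.
assert len(join(left, right, lambda l, r: l == l)) == 4
```

Casting to String made the join work because the cast produced a genuinely new column, breaking the ambiguous resolution back to the original.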
[jira] [Comment Edited] (SPARK-26214) Add "broadcast" method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704475#comment-16704475 ] Thomas Decaux edited comment on SPARK-26214 at 11/30/18 9:32 AM: - Sure, it's already possible to do it (like I said). The same thing for the "cache" / "persist" DataFrame method: {code:java} public DataFrame persist(StorageLevel newLevel) { this.sqlContext().cacheManager().cacheQuery(this, scala.None$.MODULE$, newLevel); return this; }{code} This is more of a "short-cut" method; as you can see, it's possible to use the cacheManager *OR* the DataFrame method. Same thing for registerTempTable: {code:java} public void registerTempTable(String tableName) { this.sqlContext().registerDataFrameAsTable(this, tableName); }{code} You can see here, this is a short-cut method. I propose to do the same thing for broadcast. was (Author: ebuildy): Sure, it's already to do this (like I said). The same thing for "cache" / "persist" DataFrame method: {code:java} public DataFrame persist(StorageLevel newLevel) { this.sqlContext().cacheManager().cacheQuery(this, scala.None..MODULE$, newLevel); return this; }{code} This is more a "short-cut" method as you can see, it's possible to use cacheManager *OR* the DataFrame method. Same thing for registerTempTable: {code:java} public void registerTempTable(String tableName) { this.sqlContext().registerDataFrameAsTable(this, tableName); }{code} You can see here, this is a short-cut method. I propose to do the same thing for broadcast. 
> Add "broadcast" method to DataFrame > --- > > Key: SPARK-26214 > URL: https://issues.apache.org/jira/browse/SPARK-26214 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas Decaux >Priority: Trivial > Labels: broadcast, dataframe > > As discussed at > [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,] > it's possible to force broadcast of a DataFrame, even if the total size is greater > than ``spark.sql.autoBroadcastJoinThreshold``. > But this is not trivial for beginners, because there is no "broadcast" method (I > know, I am lazy ...). > We could add this method, with a WARN if the size is greater than the threshold. > (if it's an easy one, I could do it?) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26214) Add "broadcast" method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704475#comment-16704475 ] Thomas Decaux commented on SPARK-26214: --- Sure, it's already possible to do this (like I said). The same thing for the "cache" / "persist" DataFrame method: {code:java} public DataFrame persist(StorageLevel newLevel) { this.sqlContext().cacheManager().cacheQuery(this, scala.None$.MODULE$, newLevel); return this; }{code} This is more of a "short-cut" method; as you can see, it's possible to use the cacheManager *OR* the DataFrame method. Same thing for registerTempTable: {code:java} public void registerTempTable(String tableName) { this.sqlContext().registerDataFrameAsTable(this, tableName); }{code} You can see here, this is a short-cut method. I propose to do the same thing for broadcast. > Add "broadcast" method to DataFrame > --- > > Key: SPARK-26214 > URL: https://issues.apache.org/jira/browse/SPARK-26214 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas Decaux >Priority: Trivial > Labels: broadcast, dataframe > > As discussed at > [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,] > it's possible to force broadcast of a DataFrame, even if the total size is greater > than ``spark.sql.autoBroadcastJoinThreshold``. > But this is not trivial for beginners, because there is no "broadcast" method (I > know, I am lazy ...). > We could add this method, with a WARN if the size is greater than the threshold. > (if it's an easy one, I could do it?) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
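The pattern the commenter describes, a convenience method that simply delegates to an existing function or manager, can be sketched in plain Python. All names here ({{Frame}}, {{hinted}}, {{broadcast_hint}}) are illustrative stand-ins, not Spark's API:

```python
# Sketch of the "short-cut method" pattern: DataFrame.broadcast() would
# delegate to the existing free function (functions.broadcast in Spark),
# the same way persist() delegates to the cache manager.

def broadcast_hint(df):
    """Stands in for the existing free function that marks a frame."""
    df.hinted = True
    return df

class Frame:
    hinted = False

    def broadcast(self):
        # Proposed short-cut: just forward to the free function.
        return broadcast_hint(self)

assert Frame().broadcast().hinted is True
```

The value of the short-cut is purely ergonomic: both spellings remain available, exactly as with `persist` and the cache manager in the quoted decompiled code.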
[jira] [Created] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names
Gengliang Wang created SPARK-26230: -- Summary: FileIndex: if case sensitive, validate partitions with original column names Key: SPARK-26230 URL: https://issues.apache.org/jira/browse/SPARK-26230 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Partition column names are required to be unique under the same directory. The following paths form an invalid partitioned directory: ``` hdfs://host:9000/path/a=1 hdfs://host:9000/path/b=2 ``` If case sensitive, the following paths should be invalid too: ``` hdfs://host:9000/path/a=1 hdfs://host:9000/path/A=2 ``` Since columns 'a' and 'A' are different, it is wrong to use either one as the column name in the partition schema. Also, there is a `TODO` in the code. This PR is to resolve the problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
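The validation being proposed can be sketched in a few lines: under case-insensitive resolution the two directories collapse to one partition column, while case-sensitive resolution sees two distinct columns under one directory, which must be rejected. A pure-Python illustration, not Spark's FileIndex code:

```python
# Extract the partition column name from each "dir/col=value" path and
# count distinct columns under the chosen resolution mode. More than one
# distinct column under the same directory is an invalid layout.

def partition_columns(paths, case_sensitive):
    cols = {p.rsplit("/", 1)[-1].split("=", 1)[0] for p in paths}
    if not case_sensitive:
        cols = {c.lower() for c in cols}
    return cols

paths = ["hdfs://host:9000/path/a=1", "hdfs://host:9000/path/A=2"]

# Case-insensitive: 'a' and 'A' are one column, so the layout looks valid.
assert partition_columns(paths, case_sensitive=False) == {"a"}

# Case-sensitive: two distinct columns under one directory -> invalid,
# just like mixing a=1 with b=2 would be.
assert partition_columns(paths, case_sensitive=True) == {"a", "A"}
```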
[jira] [Assigned] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names
[ https://issues.apache.org/jira/browse/SPARK-26230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26230: Assignee: (was: Apache Spark) > FileIndex: if case sensitive, validate partitions with original column names > > > Key: SPARK-26230 > URL: https://issues.apache.org/jira/browse/SPARK-26230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > > Partition column names are required to be unique under the same directory. The > following paths form an invalid partitioned directory: > ``` > hdfs://host:9000/path/a=1 > hdfs://host:9000/path/b=2 > ``` > If case sensitive, the following paths should be invalid too: > ``` > hdfs://host:9000/path/a=1 > hdfs://host:9000/path/A=2 > ``` > Since columns 'a' and 'A' are different, it is wrong to use either one as > the column name in the partition schema. > Also, there is a `TODO` in the code. This PR is to resolve the problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names
[ https://issues.apache.org/jira/browse/SPARK-26230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16704470#comment-16704470 ] Apache Spark commented on SPARK-26230: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/23186 > FileIndex: if case sensitive, validate partitions with original column names > > > Key: SPARK-26230 > URL: https://issues.apache.org/jira/browse/SPARK-26230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > > Partition column names are required to be unique under the same directory. The > following paths form an invalid partitioned directory: > ``` > hdfs://host:9000/path/a=1 > hdfs://host:9000/path/b=2 > ``` > If case sensitive, the following paths should be invalid too: > ``` > hdfs://host:9000/path/a=1 > hdfs://host:9000/path/A=2 > ``` > Since columns 'a' and 'A' are different, it is wrong to use either one as > the column name in the partition schema. > Also, there is a `TODO` in the code. This PR is to resolve the problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26230) FileIndex: if case sensitive, validate partitions with original column names
[ https://issues.apache.org/jira/browse/SPARK-26230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26230: Assignee: Apache Spark > FileIndex: if case sensitive, validate partitions with original column names > > > Key: SPARK-26230 > URL: https://issues.apache.org/jira/browse/SPARK-26230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > The partition column name is required to be unique under the same directory. The > following paths form an invalid partitioned directory: > ``` > hdfs://host:9000/path/a=1 > hdfs://host:9000/path/b=2 > ``` > When the analysis is case sensitive, the following paths should be invalid too: > ``` > hdfs://host:9000/path/a=1 > hdfs://host:9000/path/A=2 > ``` > Columns 'a' and 'A' are different, so it is ambiguous which one should be used as > the column name in the partition schema. > There is also a `TODO` for this in the code; this PR resolves it.
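The validation rule described in SPARK-26230 can be sketched in a few lines of Python. This is an illustrative, self-contained sketch only, not Spark's actual FileIndex code; both function names are invented for this example.

```python
def partition_columns(paths, case_sensitive):
    """Collect the partition column names used by sibling partition
    directories, e.g. 'hdfs://host:9000/path/a=1' contributes 'a'."""
    names = set()
    for path in paths:
        leaf = path.rstrip("/").rsplit("/", 1)[-1]   # e.g. 'a=1'
        column = leaf.split("=", 1)[0]               # e.g. 'a'
        names.add(column if case_sensitive else column.lower())
    return names


def is_valid_partitioning(paths, case_sensitive):
    """Sibling directories must agree on a single partition column name.
    Under case-sensitive analysis, 'a' and 'A' count as two columns."""
    return len(partition_columns(paths, case_sensitive)) <= 1
```

With case_sensitive=True both example path pairs from the issue are rejected; with case_sensitive=False the 'a'/'A' pair collapses to a single lower-cased column name and passes.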
[jira] [Updated] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark
[ https://issues.apache.org/jira/browse/SPARK-26229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ranjith Pulluru updated SPARK-26229: Component/s: (was: Spark Core) > Expose SizeEstimator as a developer API in pyspark > -- > > Key: SPARK-26229 > URL: https://issues.apache.org/jira/browse/SPARK-26229 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Ranjith Pulluru >Priority: Minor > > Expose SizeEstimator as a developer API in PySpark. > SizeEstimator is not available in PySpark. > This API would help with understanding the memory footprint of an object.
[jira] [Updated] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark
[ https://issues.apache.org/jira/browse/SPARK-26229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ranjith Pulluru updated SPARK-26229: Description: SizeEstimator is not available in PySpark. This API would help with understanding the memory footprint of an object. was: Expose SizeEstimator as a developer API in PySpark. SizeEstimator is not available in PySpark. This API would help with understanding the memory footprint of an object. > Expose SizeEstimator as a developer API in pyspark > -- > > Key: SPARK-26229 > URL: https://issues.apache.org/jira/browse/SPARK-26229 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Ranjith Pulluru >Priority: Minor > > SizeEstimator is not available in PySpark. > This API would help with understanding the memory footprint of an object.
[jira] [Created] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark
Ranjith Pulluru created SPARK-26229: --- Summary: Expose SizeEstimator as a developer API in pyspark Key: SPARK-26229 URL: https://issues.apache.org/jira/browse/SPARK-26229 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 2.3.0 Reporter: Ranjith Pulluru Expose SizeEstimator as a developer API in PySpark. SizeEstimator is not available in PySpark. This API would help with understanding the memory footprint of an object.
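Spark's SizeEstimator lives on the JVM, so today there is nothing equivalent to call from Python. As a rough driver-side analogue of what such an API would report, here is a recursive sys.getsizeof walker for plain Python objects. This is an illustrative sketch only; it measures local Python objects and says nothing about JVM-side sizes.

```python
import sys


def deep_getsizeof(obj, _seen=None):
    """Rough recursive size estimate, in bytes, of a local Python object,
    following dict/list/tuple/set containers and counting shared objects once."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, _seen) + deep_getsizeof(v, _seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, _seen) for item in obj)
    return size
```

For objects that already live on the JVM, the estimator can sometimes be reached through Py4J, e.g. `sc._jvm.org.apache.spark.util.SizeEstimator.estimate(jvm_obj)`, but that is an internal, unsupported access path, which is what this issue asks to formalize.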
[jira] [Resolved] (SPARK-25528) data source V2 API refactoring (batch read)
[ https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25528. - Resolution: Fixed Fix Version/s: 3.0.0 > data source V2 API refactoring (batch read) > --- > > Key: SPARK-25528 > URL: https://issues.apache.org/jira/browse/SPARK-25528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Refactor the read-side API according to this abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code}
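The catalog -> table -> scan layering in SPARK-25528 can be illustrated with a minimal in-memory mock. This is schematic Python, not the real Data Source V2 Java interfaces; all class and method names here are invented for this sketch.

```python
class InMemoryScan:
    """Leaf of both paths: a scan knows how to produce rows."""
    def __init__(self, rows):
        self._rows = rows

    def read(self):
        return list(self._rows)


class InMemoryStream:
    """Streaming adds one layer: each micro-batch yields its own scan."""
    def __init__(self, batches):
        self._batches = batches

    def scan_for(self, batch_id):          # stream -> scan
        return InMemoryScan(self._batches[batch_id])


class InMemoryTable:
    def __init__(self, rows, batches=()):
        self._rows, self._batches = rows, list(batches)

    def new_scan(self):                    # batch: table -> scan
        return InMemoryScan(self._rows)

    def new_stream(self):                  # streaming: table -> stream
        return InMemoryStream(self._batches)


class InMemoryCatalog:
    def __init__(self, tables):
        self._tables = tables

    def load_table(self, name):            # catalog -> table
        return self._tables[name]
```

A batch read follows catalog.load_table("t").new_scan().read(); a streaming read interposes the stream layer, catalog.load_table("t").new_stream().scan_for(i).read(), so the per-batch scan stays the common leaf abstraction.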