[jira] [Updated] (SPARK-25164) Parquet reader builds entire list of columns once for each column
[ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-25164: -- Description: {{VectorizedParquetRecordReader.initializeInternal}} loops through each column, and for each column it calls {noformat} requestedSchema.getColumns().get(i) {noformat} However, {{MessageType.getColumns}} will build the entire column list from getPaths(0). {noformat} public List<ColumnDescriptor> getColumns() { List<String[]> paths = this.getPaths(0); List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size()); for (String[] path : paths) { // TODO: optimize this PrimitiveType primitiveType = getType(path).asPrimitiveType(); columns.add(new ColumnDescriptor( path, primitiveType, getMaxRepetitionLevel(path), getMaxDefinitionLevel(path))); } return columns; } {noformat} This means that for each parquet file, this routine indirectly iterates colCount*colCount times. This is actually not particularly noticeable unless you have: - many parquet files - many columns To verify that this is an issue, I created a 1 million record parquet table with 6000 columns of type double and 67 files (so initializeInternal is called 67 times). I ran the following query: {noformat} sql("select * from 6000_1m_double where id1 = 1").collect {noformat} I used Spark from the master branch. I had 8 executor threads. The filter returns only a few thousand records. The query ran (on average) for 6.4 minutes. Then I cached the column list at the top of {{initializeInternal}} as follows: {noformat} List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); {noformat} Then I changed {{initializeInternal}} to use {{columnCache}} rather than {{requestedSchema.getColumns()}}. With the column cache variable, the same query runs in 5 minutes. So with my simple query, you save 22% of the time by not rebuilding the column list for each column. You get additional savings with a paths cache variable, now saving 34% in total on the above query. was: {{VectorizedParquetRecordReader.initializeInternal}} loops through each column, and for each column it calls {noformat} requestedSchema.getColumns().get(i) {noformat} However, {{MessageType.getColumns}} will build the entire column list from getPaths(0). {noformat} public List<ColumnDescriptor> getColumns() { List<String[]> paths = this.getPaths(0); List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size()); for (String[] path : paths) { // TODO: optimize this PrimitiveType primitiveType = getType(path).asPrimitiveType(); columns.add(new ColumnDescriptor( path, primitiveType, getMaxRepetitionLevel(path), getMaxDefinitionLevel(path))); } return columns; } {noformat} This means that for each parquet file, this routine indirectly iterates colCount*colCount times. This is actually not particularly noticeable unless you have: - many parquet files - many columns To verify that this is an issue, I created a 1 million record parquet table with 6000 columns of type double and 67 files (so initializeInternal is called 67 times). I ran the following query: {noformat} sql("select * from 6000_1m_double where id1 = 1").collect {noformat} I used Spark from the master branch. I had 8 executor threads. The filter returns only a few thousand records. The query ran (on average) for 6.4 minutes. Then I cached the column list at the top of {{initializeInternal}} as follows: {noformat} List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); {noformat} Then I changed {{initializeInternal}} to use {{columnCache}} rather than {{requestedSchema.getColumns()}}. With the column cache variable, the same query runs in 5 minutes.
So with my simple query, you save 22% of the time by not rebuilding the column list for each column. > Parquet reader builds entire list of columns once for each column > - > > Key: SPARK-25164 > URL: https://issues.apache.org/jira/browse/SPARK-25164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > {{VectorizedParquetRecordReader.initializeInternal}} loops through each > column, and for each column it calls > {noformat} > requestedSchema.getColumns().get(i) > {noformat} > However, {{MessageType.getColumns}} will build the entire column list from > getPaths(0). > {noformat} > public List getColumns() { > List paths = this.getPaths(0); > List columns
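For illustration, a minimal Scala sketch of the caching idea described above (not the actual patch to the Java class VectorizedParquetRecordReader); it assumes a parquet-mr {{MessageType}} named {{requestedSchema}} is in scope and elides the per-column initialization work:
{code:scala}
import org.apache.parquet.column.ColumnDescriptor
import org.apache.parquet.schema.MessageType

// Sketch only: build the column list once per file instead of once per column.
def initColumns(requestedSchema: MessageType): Unit = {
  val columnCache: java.util.List[ColumnDescriptor] = requestedSchema.getColumns // one O(colCount) build
  var i = 0
  while (i < columnCache.size()) {
    val column = columnCache.get(i) // O(1) lookup, instead of requestedSchema.getColumns().get(i) per iteration
    // ... per-column initialization would go here ...
    i += 1
  }
}
{code}
The loop then touches each ColumnDescriptor once, so the per-file cost drops from colCount*colCount to colCount list-building work.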
[jira] [Commented] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.
[ https://issues.apache.org/jira/browse/SPARK-25168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587024#comment-16587024 ] Dilip Biswal commented on SPARK-25168: -- [~cloud_fan] [~smilegator] > PlanTest.comparePlans may make a supplied resolved plan unresolved. > --- > > Key: SPARK-25168 > URL: https://issues.apache.org/jira/browse/SPARK-25168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Minor > > Hello, > I came across this behaviour while writing unit test for one of my PR. We > use the utlility class PlanTest for writing plan verification tests. > Currently when we call comparePlans and the resolved input plan contains a > Join operator, the plan becomes unresolved after the call to > `normalizePlan(normalizeExprIds(plan1))`. The reason for this is that, after > cleaning the expression ids, duplicateResolved returns false making the Join > operator unresolved. > {code} > def duplicateResolved: Boolean = > left.outputSet.intersect(right.outputSet).isEmpty > // Joins are only resolved if they don't introduce ambiguous expression ids. > // NaturalJoin should be ready for resolution only if everything else is > resolved here > lazy val resolvedExceptNatural: Boolean = { > childrenResolved && > expressions.forall(_.resolved) && > duplicateResolved && > condition.forall(_.dataType == BooleanType) > } > {code} > Please note that, plan verification actually works fine. It just looked > awkward to compare two unresolved plan for equality. > I am opening this ticket to discuss if it is an okay behaviour. If its an > tolerable behaviour then we can close the ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587020#comment-16587020 ] Jiang Xingbo commented on SPARK-25114: -- The PR has been merged to master and 2.3 > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
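As an illustration of the failure mode named in the title (this is not the actual RecordBinaryComparator source), the Scala snippet below shows how deriving a comparison result from a long difference reduced modulo Integer.MAX_VALUE reports two unequal values as equal, while Long.compare does not:
{code:scala}
// Illustration only: unequal longs whose difference is a multiple of Integer.MAX_VALUE.
val a = 2L * Integer.MAX_VALUE
val b = 0L
val broken  = ((a - b) % Integer.MAX_VALUE).toInt  // 0 -> wrongly reported as "equal"
val correct = java.lang.Long.compare(a, b)         // 1 -> a > b
{code}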
[jira] [Created] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.
Dilip Biswal created SPARK-25168: Summary: PlanTest.comparePlans may make a supplied resolved plan unresolved. Key: SPARK-25168 URL: https://issues.apache.org/jira/browse/SPARK-25168 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Dilip Biswal Hello, I came across this behaviour while writing a unit test for one of my PRs. We use the utility class PlanTest for writing plan verification tests. Currently, when we call comparePlans and the resolved input plan contains a Join operator, the plan becomes unresolved after the call to `normalizePlan(normalizeExprIds(plan1))`. The reason for this is that, after cleaning the expression ids, duplicateResolved returns false, making the Join operator unresolved. {code} def duplicateResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty // Joins are only resolved if they don't introduce ambiguous expression ids. // NaturalJoin should be ready for resolution only if everything else is resolved here lazy val resolvedExceptNatural: Boolean = { childrenResolved && expressions.forall(_.resolved) && duplicateResolved && condition.forall(_.dataType == BooleanType) } {code} Please note that plan verification actually works fine. It just looked awkward to compare two unresolved plans for equality. I am opening this ticket to discuss whether this is an okay behaviour. If it is a tolerable behaviour, we can close the ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
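To make the duplicateResolved check above concrete, here is a toy Scala illustration using plain strings in place of Catalyst attributes (names and ids are made up): once expression ids are normalized to the same value on both sides of a Join, the output sets intersect and duplicateResolved evaluates to false, so the Join reports itself as unresolved.
{code:scala}
// Toy illustration, not Catalyst code: strings stand in for attributes with expression ids.
val beforeNormalization = Set("a#1", "b#2").intersect(Set("a#3")).isEmpty // true: ids are distinct
val afterNormalization  = Set("a#0", "b#0").intersect(Set("a#0")).isEmpty // false: normalized ids collide
{code}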
[jira] [Assigned] (SPARK-25017) Add test suite for ContextBarrierState
[ https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25017: Assignee: (was: Apache Spark) > Add test suite for ContextBarrierState > -- > > Key: SPARK-25017 > URL: https://issues.apache.org/jira/browse/SPARK-25017 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > We shall be able to add unit test to ContextBarrierState with a mocked > RpcCallContext. Currently it's only covered by end-to-end test in > `BarrierTaskContextSuite` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25017) Add test suite for ContextBarrierState
[ https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586950#comment-16586950 ] Apache Spark commented on SPARK-25017: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/22165 > Add test suite for ContextBarrierState > -- > > Key: SPARK-25017 > URL: https://issues.apache.org/jira/browse/SPARK-25017 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > We shall be able to add unit test to ContextBarrierState with a mocked > RpcCallContext. Currently it's only covered by end-to-end test in > `BarrierTaskContextSuite` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25017) Add test suite for ContextBarrierState
[ https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25017: Assignee: Apache Spark > Add test suite for ContextBarrierState > -- > > Key: SPARK-25017 > URL: https://issues.apache.org/jira/browse/SPARK-25017 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Major > > We shall be able to add unit test to ContextBarrierState with a mocked > RpcCallContext. Currently it's only covered by end-to-end test in > `BarrierTaskContextSuite` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)
[ https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586923#comment-16586923 ] Apache Spark commented on SPARK-25167: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22161 > Minor fixes for R sql tests (tests that fail in development environment) > > > Key: SPARK-25167 > URL: https://issues.apache.org/jira/browse/SPARK-25167 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Minor > > A few SQL tests for R are failing development environment (Mac). > * The catalog api tests assumes catalog artifacts named "foo" to be non > existent. I think name such as foo and bar are common and developers use it > frequently. > * One test assumes that we only have one database in the system. I had more > than one and it caused the test to fail. I have changed that check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)
[ https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25167: Assignee: Apache Spark > Minor fixes for R sql tests (tests that fail in development environment) > > > Key: SPARK-25167 > URL: https://issues.apache.org/jira/browse/SPARK-25167 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Minor > > A few SQL tests for R are failing development environment (Mac). > * The catalog api tests assumes catalog artifacts named "foo" to be non > existent. I think name such as foo and bar are common and developers use it > frequently. > * One test assumes that we only have one database in the system. I had more > than one and it caused the test to fail. I have changed that check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)
[ https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25167: Assignee: (was: Apache Spark) > Minor fixes for R sql tests (tests that fail in development environment) > > > Key: SPARK-25167 > URL: https://issues.apache.org/jira/browse/SPARK-25167 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Minor > > A few SQL tests for R are failing development environment (Mac). > * The catalog api tests assumes catalog artifacts named "foo" to be non > existent. I think name such as foo and bar are common and developers use it > frequently. > * One test assumes that we only have one database in the system. I had more > than one and it caused the test to fail. I have changed that check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)
Dilip Biswal created SPARK-25167: Summary: Minor fixes for R sql tests (tests that fail in development environment) Key: SPARK-25167 URL: https://issues.apache.org/jira/browse/SPARK-25167 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.3.1 Reporter: Dilip Biswal A few SQL tests for R are failing in the development environment (Mac). * The catalog API tests assume catalog artifacts named "foo" to be non-existent. I think names such as foo and bar are common and developers use them frequently. * One test assumes that we only have one database in the system. I had more than one, and it caused the test to fail. I have changed that check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23679) uiWebUrl show inproper URL when running on YARN
[ https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23679: Assignee: Apache Spark > uiWebUrl show inproper URL when running on YARN > --- > > Key: SPARK-23679 > URL: https://issues.apache.org/jira/browse/SPARK-23679 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński >Assignee: Apache Spark >Priority: Major > > uiWebUrl returns local url. > Using it will cause HTTP ERROR 500 > {code} > Problem accessing /. Reason: > Server Error > Caused by: > javax.servlet.ServletException: Could not determine the proxy server for > redirection > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:745) > {code} > We should give address to yarn proxy instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23679) uiWebUrl show inproper URL when running on YARN
[ https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23679: Assignee: (was: Apache Spark) > uiWebUrl show inproper URL when running on YARN > --- > > Key: SPARK-23679 > URL: https://issues.apache.org/jira/browse/SPARK-23679 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński >Priority: Major > > uiWebUrl returns local url. > Using it will cause HTTP ERROR 500 > {code} > Problem accessing /. Reason: > Server Error > Caused by: > javax.servlet.ServletException: Could not determine the proxy server for > redirection > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:745) > {code} > We should give address to yarn proxy instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23679) uiWebUrl show inproper URL when running on YARN
[ https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586907#comment-16586907 ] Apache Spark commented on SPARK-23679: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/22164 > uiWebUrl show inproper URL when running on YARN > --- > > Key: SPARK-23679 > URL: https://issues.apache.org/jira/browse/SPARK-23679 > Project: Spark > Issue Type: Bug > Components: Web UI, YARN >Affects Versions: 2.3.0 >Reporter: Maciej Bryński >Priority: Major > > uiWebUrl returns local url. > Using it will cause HTTP ERROR 500 > {code} > Problem accessing /. Reason: > Server Error > Caused by: > javax.servlet.ServletException: Could not determine the proxy server for > redirection > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205) > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145) > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.spark_project.jetty.server.Server.handle(Server.java:524) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253) > at > org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at > org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:745) > {code} > We should give address to yarn proxy instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25165) Cannot parse Hive Struct
[ https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901 ] Frank Yin edited comment on SPARK-25165 at 8/21/18 3:48 AM: {{#!/usr/bin/env python}} {{# -**- *coding: UTF-8 --*}} {{# encoding=utf8}} {{import sys}} {{import os}} {{import json}} {{import argparse}} {{import time}} {{from datetime import datetime, timedelta}} {{from calendar import timegm}} {{from pyspark.sql import SparkSession}} {{from pyspark.conf import SparkConf}} {{from pyspark.sql.functions import *}} {{from pyspark.sql.types import *}} {{spark_conf = SparkConf().setAppName("Test Hive")}} {{ .set("spark.executor.memory", "4g")\}} {{ .set("spark.sql.catalogImplementation","hive")\}} {{ .set("spark.speculation", "true")\}} {{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}} {{ .set("spark.sql.shuffle.partitions", "400")}} {{spark.sql("SELECT * FROM default.a").collect() }} where default.a is a table in hive. schema: columnA:struct,view.b:array> was (Author: frankyin-factual): {{#!/usr/bin/env python}} {{# -*- coding: UTF-8 -*-}} {{# encoding=utf8}} {{import sys}} {{import os}} {{import json}} {{import argparse}} {{import time}} {{from datetime import datetime, timedelta}} {{from calendar import timegm}} {{from pyspark.sql import SparkSession}} {{from pyspark.conf import SparkConf}} {{from pyspark.sql.functions import *}} {{from pyspark.sql.types import *}}{{spark_conf = SparkConf().setAppName("Test Hive")\}} {{ .set("spark.executor.memory", "4g")\}} {{ .set("spark.sql.catalogImplementation","hive")\}} {{ .set("spark.speculation", "true")\}} {{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}} {{ .set("spark.sql.shuffle.partitions", "400")}}{{spark = SparkSession\}} {{ .builder\}} {{ .config(conf=spark_conf)\}} {{ .getOrCreate()}} {{spark.sql("SELECT * FROM default.a").collect() }} where default.a is a table in hive. schema: columnA:struct,view.b:array> > Cannot parse Hive Struct > > > Key: SPARK-25165 > URL: https://issues.apache.org/jira/browse/SPARK-25165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Frank Yin >Priority: Major > > org.apache.spark.SparkException: Cannot recognize hive type string: > struct,view.b:array> > > My guess is dot(.) is causing issues for parsing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25165) Cannot parse Hive Struct
[ https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901 ] Frank Yin edited comment on SPARK-25165 at 8/21/18 3:48 AM: {quote}{{#!/usr/bin/env python}} {{# -***- **coding: UTF-8 --*}} {{# encoding=utf8}} {{import sys}} {{import os}} {{import json}} {{import argparse}} {{import time}} {{from datetime import datetime, timedelta}} {{from calendar import timegm}} {{from pyspark.sql import SparkSession}} {{from pyspark.conf import SparkConf}} {{from pyspark.sql.functions import *}} {{from pyspark.sql.types import *}} {{spark_conf = SparkConf().setAppName("Test Hive")}}.set("spark.sql.catalogImplementation","hive") {{spark.sql("SELECT * FROM default.a").collect() }} where default.a is a table in hive. schema: columnA:struct,view.b:array> {quote} was (Author: frankyin-factual): {{#!/usr/bin/env python}} {{# -**- *coding: UTF-8 --*}} {{# encoding=utf8}} {{import sys}} {{import os}} {{import json}} {{import argparse}} {{import time}} {{from datetime import datetime, timedelta}} {{from calendar import timegm}} {{from pyspark.sql import SparkSession}} {{from pyspark.conf import SparkConf}} {{from pyspark.sql.functions import *}} {{from pyspark.sql.types import *}} {{spark_conf = SparkConf().setAppName("Test Hive")}} {{ .set("spark.executor.memory", "4g")\}} {{ .set("spark.sql.catalogImplementation","hive")\}} {{ .set("spark.speculation", "true")\}} {{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}} {{ .set("spark.sql.shuffle.partitions", "400")}} {{spark.sql("SELECT * FROM default.a").collect() }} where default.a is a table in hive. schema: columnA:struct,view.b:array> > Cannot parse Hive Struct > > > Key: SPARK-25165 > URL: https://issues.apache.org/jira/browse/SPARK-25165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Frank Yin >Priority: Major > > org.apache.spark.SparkException: Cannot recognize hive type string: > struct,view.b:array> > > My guess is dot(.) is causing issues for parsing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25165) Cannot parse Hive Struct
[ https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901 ] Frank Yin edited comment on SPARK-25165 at 8/21/18 3:47 AM: {{#!/usr/bin/env python}} {{# -*- coding: UTF-8 -*-}} {{# encoding=utf8}} {{import sys}} {{import os}} {{import json}} {{import argparse}} {{import time}} {{from datetime import datetime, timedelta}} {{from calendar import timegm}} {{from pyspark.sql import SparkSession}} {{from pyspark.conf import SparkConf}} {{from pyspark.sql.functions import *}} {{from pyspark.sql.types import *}}{{spark_conf = SparkConf().setAppName("Test Hive")\}} {{ .set("spark.executor.memory", "4g")\}} {{ .set("spark.sql.catalogImplementation","hive")\}} {{ .set("spark.speculation", "true")\}} {{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}} {{ .set("spark.sql.shuffle.partitions", "400")}}{{spark = SparkSession\}} {{ .builder\}} {{ .config(conf=spark_conf)\}} {{ .getOrCreate()}} {{spark.sql("SELECT * FROM default.a").collect() }} where default.a is a table in hive. schema: columnA:struct,view.b:array> was (Author: frankyin-factual): #!/usr/bin/env python # -*- coding: UTF-8 -*- # encoding=utf8 import sys import os import json import argparse import time from datetime import datetime, timedelta from calendar import timegm from pyspark.sql import SparkSession from pyspark.conf import SparkConf from pyspark.sql.functions import * from pyspark.sql.types import * spark_conf = SparkConf().setAppName("Test Hive")\ .set("spark.executor.memory", "4g")\ .set("spark.sql.catalogImplementation","hive")\ .set("spark.speculation", "true")\ .set("spark.dynamicAllocation.maxExecutors", "2000")\ .set("spark.sql.shuffle.partitions", "400") spark = SparkSession\ .builder\ .config(conf=spark_conf)\ .getOrCreate() places_and_devices = spark.sql("SELECT * FROM default.a").collect() where default.a is a table in hive. schema: columnA:struct,view.b:array> > Cannot parse Hive Struct > > > Key: SPARK-25165 > URL: https://issues.apache.org/jira/browse/SPARK-25165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Frank Yin >Priority: Major > > org.apache.spark.SparkException: Cannot recognize hive type string: > struct,view.b:array> > > My guess is dot(.) is causing issues for parsing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25165) Cannot parse Hive Struct
[ https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901 ] Frank Yin commented on SPARK-25165: --- #!/usr/bin/env python # -*- coding: UTF-8 -*- # encoding=utf8 import sys import os import json import argparse import time from datetime import datetime, timedelta from calendar import timegm from pyspark.sql import SparkSession from pyspark.conf import SparkConf from pyspark.sql.functions import * from pyspark.sql.types import * spark_conf = SparkConf().setAppName("Test Hive")\ .set("spark.executor.memory", "4g")\ .set("spark.sql.catalogImplementation","hive")\ .set("spark.speculation", "true")\ .set("spark.dynamicAllocation.maxExecutors", "2000")\ .set("spark.sql.shuffle.partitions", "400") spark = SparkSession\ .builder\ .config(conf=spark_conf)\ .getOrCreate() places_and_devices = spark.sql("SELECT * FROM default.a").collect() where default.a is a table in hive. schema: columnA:struct,view.b:array> > Cannot parse Hive Struct > > > Key: SPARK-25165 > URL: https://issues.apache.org/jira/browse/SPARK-25165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Frank Yin >Priority: Major > > org.apache.spark.SparkException: Cannot recognize hive type string: > struct,view.b:array> > > My guess is dot(.) is causing issues for parsing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25157) Streaming of image files from directory
[ https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586847#comment-16586847 ] Hyukjin Kwon commented on SPARK-25157: -- Is this blocked by SPARK-22666? > Streaming of image files from directory > --- > > Key: SPARK-25157 > URL: https://issues.apache.org/jira/browse/SPARK-25157 > Project: Spark > Issue Type: New Feature > Components: ML, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Amit Baghel >Priority: Major > > We are doing video analytics for video streams using Spark. At present there > is no direct way to stream video frames or image files to Spark and process > them using Structured Streaming and Dataset. We are using Kafka to stream > images and then doing processing at spark. We need a method in Spark to > stream images from directory. Currently *{{DataStreamReader}}* doesn't > support Image files. With the introduction of > *org.apache.spark.ml.image.ImageSchema* class, we think streaming > capabilities can be added for image files. It is fine if it won't support > some of the structured streaming features as it is a binary file. This method > could be similar to *mmlspark* *streamImages* method. > [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25165) Cannot parse Hive Struct
[ https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586842#comment-16586842 ] Hyukjin Kwon commented on SPARK-25165: -- Mind if I ask for a reproducer, please? > Cannot parse Hive Struct > > > Key: SPARK-25165 > URL: https://issues.apache.org/jira/browse/SPARK-25165 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.1 >Reporter: Frank Yin >Priority: Major > > org.apache.spark.SparkException: Cannot recognize hive type string: > struct,view.b:array> > > My guess is dot(.) is causing issues for parsing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25166) Reduce the number of write operations for shuffle write.
[ https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25166: Assignee: Apache Spark > Reduce the number of write operations for shuffle write. > > > Key: SPARK-25166 > URL: https://issues.apache.org/jira/browse/SPARK-25166 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: liuxian >Assignee: Apache Spark >Priority: Minor > > Currently, only one record is written to a buffer each time, which increases > the number of copies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25166) Reduce the number of write operations for shuffle write.
[ https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25166: Assignee: (was: Apache Spark) > Reduce the number of write operations for shuffle write. > > > Key: SPARK-25166 > URL: https://issues.apache.org/jira/browse/SPARK-25166 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > Currently, only one record is written to a buffer each time, which increases > the number of copies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25166) Reduce the number of write operations for shuffle write.
[ https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586831#comment-16586831 ] Apache Spark commented on SPARK-25166: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/22163 > Reduce the number of write operations for shuffle write. > > > Key: SPARK-25166 > URL: https://issues.apache.org/jira/browse/SPARK-25166 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > Currently, only one record is written to a buffer each time, which increases > the number of copies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25166) Reduce the number of write operations for shuffle write.
[ https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25166: Description: Currently, only one record is written to a buffer each time, which increases the number of copies. (was: Currently, each record will be write to a buffer , which increases the number of copies.) > Reduce the number of write operations for shuffle write. > > > Key: SPARK-25166 > URL: https://issues.apache.org/jira/browse/SPARK-25166 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > Currently, only one record is written to a buffer each time, which increases > the number of copies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25166) Reduce the number of write operations for shuffle write.
liuxian created SPARK-25166: --- Summary: Reduce the number of write operations for shuffle write. Key: SPARK-25166 URL: https://issues.apache.org/jira/browse/SPARK-25166 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.4.0 Reporter: liuxian Currently, each record will be written to a buffer, which increases the number of copies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
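A rough Scala sketch of the batching idea, assuming records arrive as serialized byte arrays (this is not the actual shuffle writer code): instead of issuing one write per record, records are funneled through a single large buffer so bytes are copied and flushed in bigger chunks.
{code:scala}
import java.io.{BufferedOutputStream, FileOutputStream}

// Sketch only: batch many small record writes through one large buffer.
def writeBatched(records: Iterator[Array[Byte]], path: String, bufferSize: Int = 1 << 20): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(path), bufferSize)
  try {
    records.foreach(r => out.write(r)) // bytes accumulate in the buffer and are written out in large chunks
  } finally {
    out.close()
  }
}
{code}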
[jira] [Resolved] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25132. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22148 [https://github.com/apache/spark/pull/22148] > Case-insensitive field resolution when reading from Parquet/ORC > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25132: Assignee: Chenxiao Mao > Case-insensitive field resolution when reading from Parquet/ORC > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error
[ https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25134: - Fix Version/s: 2.4.0 > Csv column pruning with checking of headers throws incorrect error > -- > > Key: SPARK-25134 > URL: https://issues.apache.org/jira/browse/SPARK-25134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master branch at > a791c29bd824adadfb2d85594bc8dad4424df936 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Major > Fix For: 2.4.0 > > > hello! > seems to me there is some interaction between csv column pruning and the > checking of csv headers that is causing issues. for example this fails: > {code:scala} > Seq(("a", "b")).toDF("columnA", "columnB").write > .format("csv") > .option("header", true) > .save(dir) > spark.read > .format("csv") > .option("header", true) > .option("enforceSchema", false) > .load(dir) > .select("columnA") > .show > {code} > the error is: > {code:bash} > 291.0 (TID 319, localhost, executor driver): > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > [info] Header length: 1, schema size: 2 > {code} > if i remove the project it works fine. if i disable column pruning it also > works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error
[ https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25134. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/22123 > Csv column pruning with checking of headers throws incorrect error > -- > > Key: SPARK-25134 > URL: https://issues.apache.org/jira/browse/SPARK-25134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master branch at > a791c29bd824adadfb2d85594bc8dad4424df936 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > hello! > seems to me there is some interaction between csv column pruning and the > checking of csv headers that is causing issues. for example this fails: > {code:scala} > Seq(("a", "b")).toDF("columnA", "columnB").write > .format("csv") > .option("header", true) > .save(dir) > spark.read > .format("csv") > .option("header", true) > .option("enforceSchema", false) > .load(dir) > .select("columnA") > .show > {code} > the error is: > {code:bash} > 291.0 (TID 319, localhost, executor driver): > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > [info] Header length: 1, schema size: 2 > {code} > if i remove the project it works fine. if i disable column pruning it also > works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error
[ https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25134: - Priority: Major (was: Minor) > Csv column pruning with checking of headers throws incorrect error > -- > > Key: SPARK-25134 > URL: https://issues.apache.org/jira/browse/SPARK-25134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master branch at > a791c29bd824adadfb2d85594bc8dad4424df936 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Major > Fix For: 2.4.0 > > > hello! > seems to me there is some interaction between csv column pruning and the > checking of csv headers that is causing issues. for example this fails: > {code:scala} > Seq(("a", "b")).toDF("columnA", "columnB").write > .format("csv") > .option("header", true) > .save(dir) > spark.read > .format("csv") > .option("header", true) > .option("enforceSchema", false) > .load(dir) > .select("columnA") > .show > {code} > the error is: > {code:bash} > 291.0 (TID 319, localhost, executor driver): > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > [info] Header length: 1, schema size: 2 > {code} > if i remove the project it works fine. if i disable column pruning it also > works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error
[ https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25134: - Affects Version/s: (was: 2.3.1) 2.4.0 > Csv column pruning with checking of headers throws incorrect error > -- > > Key: SPARK-25134 > URL: https://issues.apache.org/jira/browse/SPARK-25134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master branch at > a791c29bd824adadfb2d85594bc8dad4424df936 >Reporter: koert kuipers >Priority: Minor > > hello! > seems to me there is some interaction between csv column pruning and the > checking of csv headers that is causing issues. for example this fails: > {code:scala} > Seq(("a", "b")).toDF("columnA", "columnB").write > .format("csv") > .option("header", true) > .save(dir) > spark.read > .format("csv") > .option("header", true) > .option("enforceSchema", false) > .load(dir) > .select("columnA") > .show > {code} > the error is: > {code:bash} > 291.0 (TID 319, localhost, executor driver): > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > [info] Header length: 1, schema size: 2 > {code} > if i remove the project it works fine. if i disable column pruning it also > works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error
[ https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25134: Assignee: Koert Kuipers > Csv column pruning with checking of headers throws incorrect error > -- > > Key: SPARK-25134 > URL: https://issues.apache.org/jira/browse/SPARK-25134 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master branch at > a791c29bd824adadfb2d85594bc8dad4424df936 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > > hello! > seems to me there is some interaction between csv column pruning and the > checking of csv headers that is causing issues. for example this fails: > {code:scala} > Seq(("a", "b")).toDF("columnA", "columnB").write > .format("csv") > .option("header", true) > .save(dir) > spark.read > .format("csv") > .option("header", true) > .option("enforceSchema", false) > .load(dir) > .select("columnA") > .show > {code} > the error is: > {code:bash} > 291.0 (TID 319, localhost, executor driver): > java.lang.IllegalArgumentException: Number of column in CSV header is not > equal to number of fields in the schema: > [info] Header length: 1, schema size: 2 > {code} > if i remove the project it works fine. if i disable column pruning it also > works fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected
[ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25144. -- Resolution: Fixed > distinct on Dataset leads to exception due to Managed memory leak detected > > > Key: SPARK-25144 > URL: https://issues.apache.org/jira/browse/SPARK-25144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2 >Reporter: Ayoub Benali >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.3, 2.3.2 > > > The following code example: > {code} > case class Foo(bar: Option[String]) > val ds = List(Foo(Some("bar"))).toDS > val result = ds.flatMap(_.bar).distinct > result.rdd.isEmpty > {code} > Produces the following stacktrace > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in > stage 7.0 (TID 125, localhost, executor driver): > org.apache.spark.SparkException: Managed memory leak detected; size = > 16777216 bytes, TID = 125 > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info]at java.lang.Thread.run(Thread.java:748) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > [info] at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at scala.Option.foreach(Option.scala:257) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > [info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > [info] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.take(RDD.scala:1331) > [info] at > 
org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465) > {code} > The code example doesn't produce any error when `distinct` function is not > called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected
[ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25144: Assignee: Dongjoon Hyun > distinct on Dataset leads to exception due to Managed memory leak detected > > > Key: SPARK-25144 > URL: https://issues.apache.org/jira/browse/SPARK-25144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2 >Reporter: Ayoub Benali >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.3, 2.3.2 > > > The following code example: > {code} > case class Foo(bar: Option[String]) > val ds = List(Foo(Some("bar"))).toDS > val result = ds.flatMap(_.bar).distinct > result.rdd.isEmpty > {code} > Produces the following stacktrace > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in > stage 7.0 (TID 125, localhost, executor driver): > org.apache.spark.SparkException: Managed memory leak detected; size = > 16777216 bytes, TID = 125 > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info]at java.lang.Thread.run(Thread.java:748) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > [info] at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at scala.Option.foreach(Option.scala:257) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > [info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > [info] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.take(RDD.scala:1331) > [info] at > 
org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465) > {code} > The code example doesn't produce any error when `distinct` function is not > called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected
[ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25144: - Fix Version/s: 2.3.2 2.2.3 > distinct on Dataset leads to exception due to Managed memory leak detected > > > Key: SPARK-25144 > URL: https://issues.apache.org/jira/browse/SPARK-25144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2 >Reporter: Ayoub Benali >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.3, 2.3.2 > > > The following code example: > {code} > case class Foo(bar: Option[String]) > val ds = List(Foo(Some("bar"))).toDS > val result = ds.flatMap(_.bar).distinct > result.rdd.isEmpty > {code} > Produces the following stacktrace > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in > stage 7.0 (TID 125, localhost, executor driver): > org.apache.spark.SparkException: Managed memory leak detected; size = > 16777216 bytes, TID = 125 > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info]at java.lang.Thread.run(Thread.java:748) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > [info] at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at scala.Option.foreach(Option.scala:257) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > [info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > [info] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.take(RDD.scala:1331) > [info] at > 
org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465) > {code} > The code example doesn't produce any error when `distinct` function is not > called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586647#comment-16586647 ] Xin Yao commented on SPARK-23128: - Thanks [~XuanYuan] for sharing those awesome numbers. If I understand it correctly, one of the goals of this patch is to solve the data skew by splitting large partition into small ones. Have seen any significant improvement in this area? > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586612#comment-16586612 ] Bryan Cutler commented on SPARK-23874: -- For the Python fixes, yes the user would have to upgrade pyarrow. In Spark, there is no upgrade for pyarrow, just bumping up the minimum version which determines what version is installed with pyspark, which is still currently 0.8.0. Since we upgraded the Java library, those fixes are included. > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported in as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586605#comment-16586605 ] Xin Yao commented on SPARK-23128: - Thank [~carsonwang] for the great work. This looks really interesting. As you mentioned you have ported this to spark 2.3, what version of Spark was this built on top of original? Can I find those code anywhere? and are they up to date? Thanks > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586593#comment-16586593 ] Xiao Li commented on SPARK-23874: - [~bryanc]To get these fixes, we need to upgrade pyarrow to 0.10, right? > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported in as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()
[ https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586544#comment-16586544 ] Apache Spark commented on SPARK-24442: -- User 'AndrewKL' has created a pull request for this issue: https://github.com/apache/spark/pull/22162 > Add configuration parameter to adjust the number of records and the characters > per row before truncation when a user runs .show() > --- > > Key: SPARK-24442 > URL: https://issues.apache.org/jira/browse/SPARK-24442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Andrew K Long >Priority: Minor > Attachments: spark-adjustable-display-size.diff > > Original Estimate: 12h > Remaining Estimate: 12h > > Currently the number of characters displayed when a user runs the .show() > function on a data frame is hard coded. The current default is too small when > used with wider console widths. This fix will add two parameters. > > parameter: "spark.show.default.number.of.rows" default: "20" > parameter: "spark.show.default.truncate.characters.per.column" default: "20" > > This change will be backwards compatible and will not break any existing > functionality nor change the default display characteristics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()
[ https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24442: Assignee: (was: Apache Spark) > Add configuration parameter to adjust the number of records and the characters > per row before truncation when a user runs .show() > --- > > Key: SPARK-24442 > URL: https://issues.apache.org/jira/browse/SPARK-24442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Andrew K Long >Priority: Minor > Attachments: spark-adjustable-display-size.diff > > Original Estimate: 12h > Remaining Estimate: 12h > > Currently the number of characters displayed when a user runs the .show() > function on a data frame is hard coded. The current default is too small when > used with wider console widths. This fix will add two parameters. > > parameter: "spark.show.default.number.of.rows" default: "20" > parameter: "spark.show.default.truncate.characters.per.column" default: "20" > > This change will be backwards compatible and will not break any existing > functionality nor change the default display characteristics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()
[ https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24442: Assignee: Apache Spark > Add configuration parameter to adjust the number of records and the characters > per row before truncation when a user runs .show() > --- > > Key: SPARK-24442 > URL: https://issues.apache.org/jira/browse/SPARK-24442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Andrew K Long >Assignee: Apache Spark >Priority: Minor > Attachments: spark-adjustable-display-size.diff > > Original Estimate: 12h > Remaining Estimate: 12h > > Currently the number of characters displayed when a user runs the .show() > function on a data frame is hard coded. The current default is too small when > used with wider console widths. This fix will add two parameters. > > parameter: "spark.show.default.number.of.rows" default: "20" > parameter: "spark.show.default.truncate.characters.per.column" default: "20" > > This change will be backwards compatible and will not break any existing > functionality nor change the default display characteristics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()
[ https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586545#comment-16586545 ] Andrew K Long commented on SPARK-24442: --- I've created a pull request! https://github.com/apache/spark/pull/22162 > Add configuration parameter to adjust the number of records and the characters > per row before truncation when a user runs .show() > --- > > Key: SPARK-24442 > URL: https://issues.apache.org/jira/browse/SPARK-24442 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.3.0 >Reporter: Andrew K Long >Priority: Minor > Attachments: spark-adjustable-display-size.diff > > Original Estimate: 12h > Remaining Estimate: 12h > > Currently the number of characters displayed when a user runs the .show() > function on a data frame is hard coded. The current default is too small when > used with wider console widths. This fix will add two parameters. > > parameter: "spark.show.default.number.of.rows" default: "20" > parameter: "spark.show.default.truncate.characters.per.column" default: "20" > > This change will be backwards compatible and will not break any existing > functionality nor change the default display characteristics. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
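If the two configuration keys proposed above are adopted as described, setting them from application code might look like the sketch below. The key names and their 20/20 defaults are taken from the ticket description; the session setup, the input path, and the chosen values are illustrative assumptions, and the final key names depend on the outcome of the pull request.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShowDefaultsSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("show-defaults-sketch").getOrCreate();

    // Keys as proposed in the ticket; hypothetical until the pull request is merged.
    spark.conf().set("spark.show.default.number.of.rows", "50");
    spark.conf().set("spark.show.default.truncate.characters.per.column", "60");

    Dataset<Row> df = spark.read().parquet("/tmp/some_table"); // placeholder path
    df.show(); // would use the configured defaults instead of the hard-coded 20/20
  }
}
{code}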
[jira] [Created] (SPARK-25165) Cannot parse Hive Struct
Frank Yin created SPARK-25165: - Summary: Cannot parse Hive Struct Key: SPARK-25165 URL: https://issues.apache.org/jira/browse/SPARK-25165 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1, 2.2.1 Reporter: Frank Yin org.apache.spark.SparkException: Cannot recognize hive type string: struct,view.b:array> My guess is that the dot (.) is causing issues for parsing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
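For context, the kind of schema that produces such a type string can be built as below. Only the {{view.b}} field name is visible in the truncated error above; the {{view.a}} sibling, the string element type, and the outer column name are assumptions, since the angle-bracketed parts of the message were lost in the email rendering.
{code:java}
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DottedNestedFieldSketch {
  public static void main(String[] args) {
    // Nested field names that contain dots, e.g. "view.b" from the error above.
    // "view.a" and the string element type are illustrative assumptions.
    StructType nested = new StructType()
        .add("view.a", DataTypes.createArrayType(DataTypes.StringType))
        .add("view.b", DataTypes.createArrayType(DataTypes.StringType));
    StructType schema = new StructType().add("col", nested);

    // Prints something like struct<col:struct<view.a:array<string>,view.b:array<string>>>,
    // the shape of type string the reporter suspects the parser rejects because of the dots.
    System.out.println(schema.catalogString());
  }
}
{code}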
[jira] [Commented] (SPARK-24418) Upgrade to Scala 2.11.12
[ https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586483#comment-16586483 ] Apache Spark commented on SPARK-24418: -- User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/22160 > Upgrade to Scala 2.11.12 > > > Key: SPARK-24418 > URL: https://issues.apache.org/jira/browse/SPARK-24418 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Assignee: DB Tsai >Priority: Major > Fix For: 2.4.0 > > > Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple > version bump. > *loadFIles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to > initialize the Spark before REPL sees any files. > Issue filed in Scala community. > https://github.com/scala/bug/issues/10913 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24639) Add three configs in the doc
[ https://issues.apache.org/jira/browse/SPARK-24639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24639. --- Resolution: Won't Fix > Add three configs in the doc > > > Key: SPARK-24639 > URL: https://issues.apache.org/jira/browse/SPARK-24639 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.1 >Reporter: xueyu >Priority: Trivial > > Add some missing configs mentioned in spark-24560, which are > spark.python.task.killTimeout, spark.worker.driverTerminateTimeout and > spark.ui.consoleProgress.update.interval -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24834) Utils#nanSafeCompare{Double,Float} functions do not differ from normal java double/float comparison
[ https://issues.apache.org/jira/browse/SPARK-24834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24834. --- Resolution: Won't Fix See PR – I think we can't do this directly as it changes semantics unfortunately. > Utils#nanSafeCompare{Double,Float} functions do not differ from normal java > double/float comparison > --- > > Key: SPARK-24834 > URL: https://issues.apache.org/jira/browse/SPARK-24834 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Benjamin Duffield >Priority: Minor > > Utils.scala contains two functions `nanSafeCompareDoubles` and > `nanSafeCompareFloats` which purport to have special handling of NaN values > in comparisons. > The handling in these functions does not appear to differ from > java.lang.Double.compare and java.lang.Float.compare - they seem to produce > identical output to the built-in java comparison functions. > I think it's clearer to not have these special Utils functions, and instead > just use the standard java comparison functions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
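For reference, the NaN behaviour of {{java.lang.Double.compare}} can be checked standalone as below. The signed-zero lines at the end show one place where a '=='-based comparison and {{Double.compare}} disagree; that is offered only as a guess at the "changes semantics" concern in the resolution, since the ticket does not spell it out.
{code:java}
public class NanCompareCheck {
  public static void main(String[] args) {
    // NaN compares equal to itself and greater than any other double:
    System.out.println(Double.compare(Double.NaN, Double.NaN)); // 0
    System.out.println(Double.compare(Double.NaN, 1.0) > 0);    // true
    System.out.println(Double.compare(1.0, Double.NaN) < 0);    // true

    // Unlike an '=='-based comparison, Double.compare distinguishes signed zeros:
    System.out.println(-0.0 == 0.0);                // true
    System.out.println(Double.compare(-0.0, 0.0));  // -1
  }
}
{code}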
[jira] [Created] (SPARK-25164) Parquet reader builds entire list of columns once for each column
Bruce Robbins created SPARK-25164: - Summary: Parquet reader builds entire list of columns once for each column Key: SPARK-25164 URL: https://issues.apache.org/jira/browse/SPARK-25164 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Bruce Robbins {{VectorizedParquetRecordReader.initializeInternal}} loops through each column, and for each column it calls {noformat} requestedSchema.getColumns().get(i) {noformat} However, {{MessageType.getColumns}} will build the entire column list from getPaths(0). {noformat} public List<ColumnDescriptor> getColumns() { List<String[]> paths = this.getPaths(0); List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size()); for (String[] path : paths) { // TODO: optimize this PrimitiveType primitiveType = getType(path).asPrimitiveType(); columns.add(new ColumnDescriptor( path, primitiveType, getMaxRepetitionLevel(path), getMaxDefinitionLevel(path))); } return columns; } {noformat} This means that for each parquet file, this routine indirectly iterates colCount*colCount times. This is actually not particularly noticeable unless you have: - many parquet files - many columns To verify that this is an issue, I created a 1 million record parquet table with 6000 columns of type double and 67 files (so initializeInternal is called 67 times). I ran the following query: {noformat} sql("select * from 6000_1m_double where id1 = 1").collect {noformat} I used Spark from the master branch. I had 8 executor threads. The filter returns only a few thousand records. The query ran (on average) for 6.4 minutes. Then I cached the column list at the top of {{initializeInternal}} as follows: {noformat} List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); {noformat} Then I changed {{initializeInternal}} to use {{columnCache}} rather than {{requestedSchema.getColumns()}}. With the column cache variable, the same query runs in 5 minutes. So with my simple query, you save 22% of the time by not rebuilding the column list for each column. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
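To make the caching idea above concrete, here is a small standalone sketch against parquet-mr. The inline two-column schema and the {{main}} method are placeholders; in Spark the equivalent change would sit at the top of {{VectorizedParquetRecordReader.initializeInternal}}, with the existing loop body left unchanged.
{code:java}
import java.util.List;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ColumnCacheSketch {
  public static void main(String[] args) {
    // Stand-in for the requestedSchema that Spark reads from the Parquet footer.
    MessageType requestedSchema = MessageTypeParser.parseMessageType(
        "message spark_schema { optional double c0; optional double c1; }");

    // Build the column list once, instead of once per column via
    // requestedSchema.getColumns().get(i) inside the loop.
    List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
    for (int i = 0; i < columnCache.size(); i++) {
      ColumnDescriptor column = columnCache.get(i);
      System.out.println(column);
    }
  }
}
{code}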
[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection
[ https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586389#comment-16586389 ] Eric Wohlstadter commented on SPARK-21375: -- [~bryanc] Thanks for the feedback. I think the part that makes my case different is that the producer of the TimeStampMicroTZVector and the consumer (Spark) are in different processes. Because Arrow data is coming to Spark via an ArrowInputStream over a TCP socket. Since the timezone of the TimeStampMicroTZVector needs to be fixed on the producer side (not Spark), the producer process won't have a "spark.sql.session.timeZone". In the current Arrow Java API, the timezone of any TimeStampTZVector can't be mutated after creation. I'm not sure if the timezone is just something that is interpreted on read by the consumer (in which case it seems like it should be possible to mutate the timezone of the vector). Or if the timezone is actually used when encoding the bytes for entries in the value vectors (in which case it seems like it would not be possible to mutate the timezone of the vector). > Add date and timestamp support to ArrowConverters for toPandas() collection > --- > > Key: SPARK-21375 > URL: https://issues.apache.org/jira/browse/SPARK-21375 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > Date and timestamp are not yet supported in DataFrame.toPandas() using > ArrowConverters. These are common types for data analysis used in both Spark > and Pandas and should be supported. > There is a discrepancy with the way that PySpark and Arrow store timestamps, > without timezone specified, internally. PySpark takes a UTC timestamp that > is adjusted to local time and Arrow is in UTC time. Hopefully there is a > clean way to resolve this. > Spark internal storage spec: > * *DateType* stored as days > * *Timestamp* stored as microseconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586378#comment-16586378 ] James Carter commented on SPARK-24432: -- We're looking to shift our production Spark workloads from Mesos to Kubernetes. Consider this an up-vote for Spark Shuffle Service and Dynamic Allocation of executors. I would be happy to participate in beta-testing this feature. > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection
[ https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586369#comment-16586369 ] Bryan Cutler commented on SPARK-21375: -- Hi [~ewohlstadter], the timestamp values should be in UTC. I'm not sure if that's what Hive uses or not, but if not then you will need to do a conversion there. Then when building an Arrow TimeStampMicroTZVector, use the conf "spark.sql.session.timeZone" for the timezone, which should make things consistent. Hope that helps. > Add date and timestamp support to ArrowConverters for toPandas() collection > --- > > Key: SPARK-21375 > URL: https://issues.apache.org/jira/browse/SPARK-21375 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.3.0 > > > Date and timestamp are not yet supported in DataFrame.toPandas() using > ArrowConverters. These are common types for data analysis used in both Spark > and Pandas and should be supported. > There is a discrepancy with the way that PySpark and Arrow store timestamps, > without timezone specified, internally. PySpark takes a UTC timestamp that > is adjusted to local time and Arrow is in UTC time. Hopefully there is a > clean way to resolve this. > Spark internal storage spec: > * *DateType* stored as days > * *Timestamp* stored as microseconds -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
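For reference, declaring a microsecond timestamp field with an explicit timezone in the Arrow Java API looks roughly like the sketch below; the field name and the hard-coded "UTC" are illustrative, and in the scenario discussed above the timezone string would come from "spark.sql.session.timeZone" on the Spark side. The timezone is part of the field's type metadata, which is consistent with it being fixed at vector creation time.
{code:java}
import org.apache.arrow.vector.types.TimeUnit;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

public class TimestampFieldSketch {
  public static void main(String[] args) {
    // Timezone-aware microsecond timestamp type; "UTC" is a placeholder value.
    ArrowType.Timestamp tsType = new ArrowType.Timestamp(TimeUnit.MICROSECOND, "UTC");

    // The timezone travels with the field/type metadata, not with individual values.
    Field tsField = new Field("ts", FieldType.nullable(tsType), null);
    System.out.println(tsField);
  }
}
{code}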
[jira] [Updated] (SPARK-25162) Kubernetes 'in-cluster' client mode and value of spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Carter updated SPARK-25162: - Summary: Kubernetes 'in-cluster' client mode and value of spark.driver.host (was: Kubernetes 'in-cluster' client mode and value spark.driver.host) > Kubernetes 'in-cluster' client mode and value of spark.driver.host > -- > > Key: SPARK-25162 > URL: https://issues.apache.org/jira/browse/SPARK-25162 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: A java program, deployed to kubernetes, that establishes > a Spark Context in client mode. > Not using spark-submit. > Kubernetes 1.10 > AWS EKS > > >Reporter: James Carter >Priority: Minor > > When creating Kubernetes scheduler 'in-cluster' using client mode, the value > for spark.driver.host can be derived from the IP address of the driver pod. > I observed that the value of _spark.driver.host_ defaulted to the value of > _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This > caused the executors to fail to establish a connection back to the driver. > As a work around, in my configuration I pass the driver's pod name _and_ the > driver's ip address to ensure that executors can establish a connection with > the driver. > _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: > metadata.name > _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp > e.g. > Deployment: > {noformat} > env: > - name: DRIVER_POD_NAME > valueFrom: > fieldRef: > fieldPath: metadata.name > - name: DRIVER_POD_IP > valueFrom: > fieldRef: > fieldPath: status.podIP > {noformat} > > Application Properties: > {noformat} > config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME} > config[spark.driver.host]: ${DRIVER_POD_IP} > {noformat} > > BasicExecutorFeatureStep.scala: > {code:java} > private val driverUrl = RpcEndpointAddress( > kubernetesConf.get("spark.driver.host"), > kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT), > CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString > {code} > > Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in > this deployment scenario. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25162) Kubernetes 'in-cluster' client mode and value spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Carter updated SPARK-25162: - Description: When creating Kubernetes scheduler 'in-cluster' using client mode, the value for spark.driver.host can be derived from the IP address of the driver pod. I observed that the value of _spark.driver.host_ defaulted to the value of _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This caused the executors to fail to establish a connection back to the driver. As a work around, in my configuration I pass the driver's pod name _and_ the driver's ip address to ensure that executors can establish a connection with the driver. _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: metadata.name _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp e.g. Deployment: {noformat} env: - name: DRIVER_POD_NAME valueFrom: fieldRef: fieldPath: metadata.name - name: DRIVER_POD_IP valueFrom: fieldRef: fieldPath: status.podIP {noformat} Application Properties: {noformat} config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME} config[spark.driver.host]: ${DRIVER_POD_IP} {noformat} BasicExecutorFeatureStep.scala: {code:java} private val driverUrl = RpcEndpointAddress( kubernetesConf.get("spark.driver.host"), kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT), CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString {code} Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in this deployment scenario. was: When creating Kubernetes scheduler 'in-cluster' using client mode, the value for spark.driver.host can be derived from the IP address of the driver pod. I observed that the value of _spark.driver.host_ defaulted to the value of _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This caused the executors to fail to establish a connection back to the driver. As a work around, in my configuration I pass the driver's pod name _and_ the driver's ip address to ensure that executors can establish a connection with the driver. _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: metadata.name _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp Ideally only _spark.kubernetes.driver.pod.name_ need be provided in this deployment scenario. > Kubernetes 'in-cluster' client mode and value spark.driver.host > --- > > Key: SPARK-25162 > URL: https://issues.apache.org/jira/browse/SPARK-25162 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: A java program, deployed to kubernetes, that establishes > a Spark Context in client mode. > Not using spark-submit. > Kubernetes 1.10 > AWS EKS > > >Reporter: James Carter >Priority: Minor > > When creating Kubernetes scheduler 'in-cluster' using client mode, the value > for spark.driver.host can be derived from the IP address of the driver pod. > I observed that the value of _spark.driver.host_ defaulted to the value of > _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This > caused the executors to fail to establish a connection back to the driver. > As a work around, in my configuration I pass the driver's pod name _and_ the > driver's ip address to ensure that executors can establish a connection with > the driver. > _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: > metadata.name > _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp > e.g. 
> Deployment: > {noformat} > env: > - name: DRIVER_POD_NAME > valueFrom: > fieldRef: > fieldPath: metadata.name > - name: DRIVER_POD_IP > valueFrom: > fieldRef: > fieldPath: status.podIP > {noformat} > > Application Properties: > {noformat} > config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME} > config[spark.driver.host]: ${DRIVER_POD_IP} > {noformat} > > BasicExecutorFeatureStep.scala: > {code:java} > private val driverUrl = RpcEndpointAddress( > kubernetesConf.get("spark.driver.host"), > kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT), > CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString > {code} > > Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in > this deployment scenario. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression
Shixiong Zhu created SPARK-25163: Summary: Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression Key: SPARK-25163 URL: https://issues.apache.org/jira/browse/SPARK-25163 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.4.0 Reporter: Shixiong Zhu I saw it failed multiple times on Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25162) Kubernetes 'in-cluster' client mode and value spark.driver.host
James Carter created SPARK-25162: Summary: Kubernetes 'in-cluster' client mode and value spark.driver.host Key: SPARK-25162 URL: https://issues.apache.org/jira/browse/SPARK-25162 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.0 Environment: A java program, deployed to kubernetes, that establishes a Spark Context in client mode. Not using spark-submit. Kubernetes 1.10 AWS EKS Reporter: James Carter When creating Kubernetes scheduler 'in-cluster' using client mode, the value for spark.driver.host can be derived from the IP address of the driver pod. I observed that the value of _spark.driver.host_ defaulted to the value of _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This caused the executors to fail to establish a connection back to the driver. As a work around, in my configuration I pass the driver's pod name _and_ the driver's ip address to ensure that executors can establish a connection with the driver. _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: metadata.name _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp Ideally only _spark.kubernetes.driver.pod.name_ need be provided in this deployment scenario. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586314#comment-16586314 ] Yinan Li commented on SPARK-24434: -- [~skonto] I will make sure the assignee gets properly set for future JIRAs. [~onursatici], if you would like to work on the implementation, please make sure you read through the design doc from [~skonto] and make the implementation follow what the design proposes. Thanks! > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25161: Assignee: (was: Apache Spark) > Fix several bugs in failure handling of barrier execution mode > -- > > Key: SPARK-25161 > URL: https://issues.apache.org/jira/browse/SPARK-25161 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Fix several bugs in failure handling of barrier execution mode: > * Mark TaskSet for a barrier stage as zombie when a task attempt fails; > * Multiple barrier task failures from a single barrier stage should not > trigger multiple stage retries; > * Barrier task failure from a previous failed stage attempt should not > trigger stage retry; > * Fail the job when a task from a barrier ResultStage failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25161: Assignee: Apache Spark > Fix several bugs in failure handling of barrier execution mode > -- > > Key: SPARK-25161 > URL: https://issues.apache.org/jira/browse/SPARK-25161 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Major > > Fix several bugs in failure handling of barrier execution mode: > * Mark TaskSet for a barrier stage as zombie when a task attempt fails; > * Multiple barrier task failures from a single barrier stage should not > trigger multiple stage retries; > * Barrier task failure from a previous failed stage attempt should not > trigger stage retry; > * Fail the job when a task from a barrier ResultStage failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586268#comment-16586268 ] Apache Spark commented on SPARK-25161: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/22158 > Fix several bugs in failure handling of barrier execution mode > -- > > Key: SPARK-25161 > URL: https://issues.apache.org/jira/browse/SPARK-25161 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Priority: Major > > Fix several bugs in failure handling of barrier execution mode: > * Mark TaskSet for a barrier stage as zombie when a task attempt fails; > * Multiple barrier task failures from a single barrier stage should not > trigger multiple stage retries; > * Barrier task failure from a previous failed stage attempt should not > trigger stage retry; > * Fail the job when a task from a barrier ResultStage failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode
Jiang Xingbo created SPARK-25161: Summary: Fix several bugs in failure handling of barrier execution mode Key: SPARK-25161 URL: https://issues.apache.org/jira/browse/SPARK-25161 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Jiang Xingbo Fix several bugs in failure handling of barrier execution mode: * Mark TaskSet for a barrier stage as zombie when a task attempt fails; * Multiple barrier task failures from a single barrier stage should not trigger multiple stage retries; * Barrier task failure from a previous failed stage attempt should not trigger stage retry; * Fail the job when a task from a barrier ResultStage failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25126) avoid creating OrcFile.Reader for all orc files
[ https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586237#comment-16586237 ] Apache Spark commented on SPARK-25126: -- User 'raofu' has created a pull request for this issue: https://github.com/apache/spark/pull/22157 > avoid creating OrcFile.Reader for all orc files > --- > > Key: SPARK-25126 > URL: https://issues.apache.org/jira/browse/SPARK-25126 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Rao Fu >Priority: Minor > > We have a spark job that starts by reading orc files under an S3 directory > and we noticed the job consumes a lot of memory when both the number of orc > files and the size of the file are large. The memory bloat went away with the > following workaround. > 1) create a DataSet from a single orc file. > Dataset rowsForFirstFile = spark.read().format("orc").load(oneFile); > 2) when creating DataSet from all files under the directory, use the > schema from the previous DataSet. > Dataset rows = > spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path); > I believe the issue is due to the fact in order to infer the schema a > FileReader is created for each orc file under the directory although only the > first one is used. The FileReader creation loads the metadata of the orc file > and the memory consumption is very high when there are many files under the > directory. > The issue exists in both 2.0 and HEAD. > In 2.0, OrcFileOperator.readSchema is used. > [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95] > In HEAD, OrcUtils.readSchema is used. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23573) Create linter rule to prevent misuse of SparkContext.hadoopConfiguration in SQL modules
[ https://issues.apache.org/jira/browse/SPARK-23573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23573. - Resolution: Duplicate Assignee: Gengliang Wang Fix Version/s: 2.4.0 > Create linter rule to prevent misuse of SparkContext.hadoopConfiguration in > SQL modules > --- > > Key: SPARK-23573 > URL: https://issues.apache.org/jira/browse/SPARK-23573 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > SparkContext.hadoopConfiguration should not be used directly in SQL module. > Instead, one should always use sessionState.newHadoopConfiguration(), which > blends in configs set per session. > The idea is to add the linter rule to ban it. > - Restrict the linter rule to the components that use SQL. use the parameter > `scalastyleSources`? > - Exclude genuinely valid uses, like e.g. in SessionState (ok, that can be > done per case with scalastyle:off in the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25126) avoid creating OrcFile.Reader for all orc files
[ https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25126: Assignee: (was: Apache Spark) > avoid creating OrcFile.Reader for all orc files > --- > > Key: SPARK-25126 > URL: https://issues.apache.org/jira/browse/SPARK-25126 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Rao Fu >Priority: Minor > > We have a spark job that starts by reading orc files under an S3 directory > and we noticed the job consumes a lot of memory when both the number of orc > files and the size of the file are large. The memory bloat went away with the > following workaround. > 1) create a DataSet from a single orc file. > Dataset rowsForFirstFile = spark.read().format("orc").load(oneFile); > 2) when creating DataSet from all files under the directory, use the > schema from the previous DataSet. > Dataset rows = > spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path); > I believe the issue is due to the fact in order to infer the schema a > FileReader is created for each orc file under the directory although only the > first one is used. The FileReader creation loads the metadata of the orc file > and the memory consumption is very high when there are many files under the > directory. > The issue exists in both 2.0 and HEAD. > In 2.0, OrcFileOperator.readSchema is used. > [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95] > In HEAD, OrcUtils.readSchema is used. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25126) avoid creating OrcFile.Reader for all orc files
[ https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25126: Assignee: Apache Spark > avoid creating OrcFile.Reader for all orc files > --- > > Key: SPARK-25126 > URL: https://issues.apache.org/jira/browse/SPARK-25126 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.3.1 >Reporter: Rao Fu >Assignee: Apache Spark >Priority: Minor > > We have a spark job that starts by reading orc files under an S3 directory > and we noticed the job consumes a lot of memory when both the number of orc > files and the size of the file are large. The memory bloat went away with the > following workaround. > 1) create a DataSet from a single orc file. > Dataset rowsForFirstFile = spark.read().format("orc").load(oneFile); > 2) when creating DataSet from all files under the directory, use the > schema from the previous DataSet. > Dataset rows = > spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path); > I believe the issue is due to the fact in order to infer the schema a > FileReader is created for each orc file under the directory although only the > first one is used. The FileReader creation loads the metadata of the orc file > and the memory consumption is very high when there are many files under the > directory. > The issue exists in both 2.0 and HEAD. > In 2.0, OrcFileOperator.readSchema is used. > [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95] > In HEAD, OrcUtils.readSchema is used. > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
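Restating the workaround from the description above with the generic parameters that the email rendering stripped; {{oneFile}} and {{path}} are placeholders for a single ORC file and its containing directory.
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrcSchemaWorkaround {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("orc-schema-workaround").getOrCreate();

    String oneFile = "s3://bucket/table/part-00000.orc"; // placeholder
    String path = "s3://bucket/table";                   // placeholder

    // 1) Infer the schema from a single file instead of every file under the directory.
    Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);

    // 2) Reuse that schema for the full load, so a reader (and its footer/metadata read)
    //    is not created for every ORC file just to infer the schema.
    Dataset<Row> rows = spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
    rows.printSchema();
  }
}
{code}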
[jira] [Commented] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected
[ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586168#comment-16586168 ] Apache Spark commented on SPARK-25144: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22156 > distinct on Dataset leads to exception due to Managed memory leak detected > > > Key: SPARK-25144 > URL: https://issues.apache.org/jira/browse/SPARK-25144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2 >Reporter: Ayoub Benali >Priority: Major > > The following code example: > {code} > case class Foo(bar: Option[String]) > val ds = List(Foo(Some("bar"))).toDS > val result = ds.flatMap(_.bar).distinct > result.rdd.isEmpty > {code} > Produces the following stacktrace > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in > stage 7.0 (TID 125, localhost, executor driver): > org.apache.spark.SparkException: Managed memory leak detected; size = > 16777216 bytes, TID = 125 > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info]at java.lang.Thread.run(Thread.java:748) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > [info] at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at scala.Option.foreach(Option.scala:257) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > [info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > [info] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at 
org.apache.spark.rdd.RDD.take(RDD.scala:1331) > [info] at > org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465) > {code} > The code example doesn't produce any error when `distinct` function is not > called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected
[ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586138#comment-16586138 ] Apache Spark commented on SPARK-25144: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22155 > distinct on Dataset leads to exception due to Managed memory leak detected > > > Key: SPARK-25144 > URL: https://issues.apache.org/jira/browse/SPARK-25144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2 >Reporter: Ayoub Benali >Priority: Major > > The following code example: > {code} > case class Foo(bar: Option[String]) > val ds = List(Foo(Some("bar"))).toDS > val result = ds.flatMap(_.bar).distinct > result.rdd.isEmpty > {code} > Produces the following stacktrace > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in > stage 7.0 (TID 125, localhost, executor driver): > org.apache.spark.SparkException: Managed memory leak detected; size = > 16777216 bytes, TID = 125 > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info]at java.lang.Thread.run(Thread.java:748) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > [info] at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at scala.Option.foreach(Option.scala:257) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > [info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > [info] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at 
org.apache.spark.rdd.RDD.take(RDD.scala:1331) > [info] at > org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465) > {code} > The code example doesn't produce any error when `distinct` function is not > called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25140: Assignee: (was: Apache Spark) > Add optional logging to UnsafeProjection.create when it falls back to > interpreted mode > -- > > Key: SPARK-25140 > URL: https://issues.apache.org/jira/browse/SPARK-25140 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection > to fall back to an interpreter mode when codegen fails. That makes Spark much > more usable even when codegen is unable to handle the given query. > But in its current form, the fallback handling can also be a mystery in terms > of performance cliffs. Users may be left wondering why a query runs fine with > some expressions, but then with just one extra expression the performance > goes 2x, 3x (or more) slower. > It'd be nice to have optional logging of the fallback behavior, so that for > users that care about monitoring performance cliffs, they can opt-in to log > when a fallback to interpreter mode was taken. i.e. at > https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586091#comment-16586091 ] Apache Spark commented on SPARK-25140: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/22154 > Add optional logging to UnsafeProjection.create when it falls back to > interpreted mode > -- > > Key: SPARK-25140 > URL: https://issues.apache.org/jira/browse/SPARK-25140 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection > to fall back to an interpreter mode when codegen fails. That makes Spark much > more usable even when codegen is unable to handle the given query. > But in its current form, the fallback handling can also be a mystery in terms > of performance cliffs. Users may be left wondering why a query runs fine with > some expressions, but then with just one extra expression the performance > goes 2x, 3x (or more) slower. > It'd be nice to have optional logging of the fallback behavior, so that for > users that care about monitoring performance cliffs, they can opt-in to log > when a fallback to interpreter mode was taken. i.e. at > https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25140: Assignee: Apache Spark > Add optional logging to UnsafeProjection.create when it falls back to > interpreted mode > -- > > Key: SPARK-25140 > URL: https://issues.apache.org/jira/browse/SPARK-25140 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Minor > > SPARK-23711 implemented a nice graceful handling of allowing UnsafeProjection > to fall back to an interpreter mode when codegen fails. That makes Spark much > more usable even when codegen is unable to handle the given query. > But in its current form, the fallback handling can also be a mystery in terms > of performance cliffs. Users may be left wondering why a query runs fine with > some expressions, but then with just one extra expression the performance > goes 2x, 3x (or more) slower. > It'd be nice to have optional logging of the fallback behavior, so that for > users that care about monitoring performance cliffs, they can opt-in to log > when a fallback to interpreter mode was taken. i.e. at > https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
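As a rough illustration of the behaviour requested above (not Spark's actual Catalyst code), opt-in logging around a codegen-with-interpreted-fallback path could look like the sketch below; the object name, method names and the system property are all hypothetical.

{code:scala}
import org.slf4j.LoggerFactory
import scala.util.control.NonFatal

// Illustrative sketch only: every name here, including the system property,
// is hypothetical and not part of Spark's internals.
object FallbackWithLogging {
  private val log = LoggerFactory.getLogger(getClass)
  private val logFallback: Boolean =
    sys.props.get("example.logCodegenFallback").contains("true")

  def create[T](buildCodegen: () => T, buildInterpreted: () => T): T =
    try {
      buildCodegen()
    } catch {
      case NonFatal(e) =>
        if (logFallback) {
          log.warn("Codegen failed; falling back to interpreted mode", e)
        }
        buildInterpreted()
    }
}
{code}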
[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic
[ https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586090#comment-16586090 ] Apache Spark commented on SPARK-23711: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/22154 > Add fallback to interpreted execution logic > --- > > Key: SPARK-23711 > URL: https://issues.apache.org/jira/browse/SPARK-23711 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Herman van Hovell >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing
[ https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585911#comment-16585911 ] Mike Sukmanowsky commented on SPARK-25147: -- Confirmed, works with: Python 2.7.15 PyArrow==0.9.0.post1 numpy==1.11.3 pandas==0.19.2 > GroupedData.apply pandas_udf crashing > - > > Key: SPARK-25147 > URL: https://issues.apache.org/jira/browse/SPARK-25147 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: OS: Mac OS 10.13.6 > Python: 2.7.15, 3.6.6 > PyArrow: 0.10.0 > Pandas: 0.23.4 > Numpy: 1.15.0 >Reporter: Mike Sukmanowsky >Priority: Major > > Running the following example taken straight from the docs results in > {{org.apache.spark.SparkException: Python worker exited unexpectedly > (crashed)}} for reasons that aren't clear from any logs I can see: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql import functions as F > spark = ( > SparkSession > .builder > .appName("pandas_udf") > .getOrCreate() > ) > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v") > ) > @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP) > def normalize(pdf): > v = pdf.v > return pdf.assign(v=(v - v.mean()) / v.std()) > ( > df > .groupby("id") > .apply(normalize) > .show() > ) > {code} > See output.log for > [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28]. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527351#comment-16527351 ] Lucas Partridge edited comment on SPARK-19498 at 8/20/18 1:03 PM: -- Ok great. Here's my feedback after wrapping a large complex Python algorithm for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond pyspark too. # The inability to save and load custom pyspark models/pipelines/pipelinemodels is an absolute showstopper. Training models can take hours and so we need to be able to save and reload models. Pending the availability of https://issues.apache.org/jira/browse/SPARK-17025 I used a refinement of [https://stackoverflow.com/a/49195515/1843329] to work around this. Had this not been solved no further work would have been done. # Support for saving/loading more param types (e.g., dict) would be great. I had to use json.dumps to convert our algorithm's internal model into a string and then pretend it was a string param in order to save and load that with the rest of the transformer. # Given that pipelinemodels can be saved we also need the ability to export them easily for deployment on other clusters. The cluster where you train the model may be different to the one where you deploy it for predictions. A hack workaround is to use hdfs commands to copy the relevant files and directories but it would be great if we had simple single export/import commands in pyspark to move models.pipelines/pipelinemodels easily between clusters and to allow artifacts to be stored off-cluster. # Creating individual parameters with getters and setters is tedious and error-prone, especially if writing docs inline too. It would be great if as much of this boiler-plate as possible could be auto-generated from a simple parameter definition. I always groan when someone asks for an extra param at the moment! # The Ml Pipeline API seems to assume all the params lie on the estimator and none on the transformer. In the algorithm I wrapped the model/transformer has numerous params that are specific to it rather than the estimator. PipelineModel needs a getStages() command (just as Pipeline does) to get at the model so you can parameterise it. I had to use the undocumented .stages member instead. But then if you want to call transform() on a pipelinemodel immediately after fitting it you also need some ability to set the model/transformer params in advance. I got around this by defining a params class for the estimator-only params and another class for the model-only params. I made the estimator inherit from both these classes and the model inherit from only the model-base params class. The estimator then just passes through any model-specific params to the model when it creates it at the end of its fit() method. But, to distinguish the model-only params from the estimator (e.g., when listing the params on the estimator) I had to prefix all the model-only params with a common value to identify them. This works but it's clunky and ugly. # The algorithm I ported works naturally with individually named column inputs. But the existing ML Pipeline library prefers DenseVectors. I ended up having to support both types of inputs - if a DenseVector input was 'None' I would take the data directly from the individually named columns instead. 
If users want to use the algorithm by itself they can used the column-based input approach; if they want to work with algorithms from the built-in library (e.g., StandardScaler, Binarizer, etc) they can use the DenseVector approach instead. Again this works but is clunky because you're having to handle two different forms of input inside the same implementation. Also DenseVectors are limited by their inability to handle missing values. # Similarly, I wanted to produce multiple separate columns for the outputs of the model's transform() method whereas most built-in algorithms seem to use a single DenseVector output column. DataFrame's withColumn() method could do with a withColumns() equivalent to make it easy to add multiple columns to a Dataframe instead of just one column at a time. # Documentation explaining how to create a custom estimator and transformer (preferably one with transformer-specific params) would be extremely useful for people. Most of what I learned I gleaned off StackOverflow and from looking at Spark's pipeline code. Hope this list will be useful for improving ML Pipelines in future versions of Spark! was (Author: lucas.partridge): Ok great. Here's my feedback after wrapping a large complex Python algorithm for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond pyspark too. # The inability to save and load custom pyspark models/pipelines/pipelinemodels is an absolute showstopper. Training models can take hours and so we need to be
[jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527351#comment-16527351 ] Lucas Partridge edited comment on SPARK-19498 at 8/20/18 1:02 PM: -- Ok great. Here's my feedback after wrapping a large complex Python algorithm for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond pyspark too. # The inability to save and load custom pyspark models/pipelines/pipelinemodels is an absolute showstopper. Training models can take hours and so we need to be able to save and reload models. Pending the availability of https://issues.apache.org/jira/browse/SPARK-17025 I used a refinement of [https://stackoverflow.com/a/49195515/1843329] to work around this. Had this not been solved no further work would have been done. # Support for saving/loading more param types (e.g., dict) would be great. I had to use json.dumps to convert our algorithm's internal model into a string and then pretend it was a string param in order to save and load that with the rest of the transformer. # Given that pipelinemodels can be saved we also need the ability to export them easily for deployment on other clusters. The cluster where you train the model may be different to the one where you deploy it for predictions. A hack workaround is to use hdfs commands to copy the relevant files and directories but it would be great if we had simple single export/import commands in pyspark to move models.pipelines/pipelinemodels easily between clusters and to allow artifacts to be stored off-cluster. # Creating individual parameters with getters and setters is tedious and error-prone, especially if writing docs inline too. It would be great if as much of this boiler-plate as possible could be auto-generated from a simple parameter definition. I always groan when someone asks for an extra param at the moment! # The Ml Pipeline API seems to assume all the params lie on the estimator and none on the transformer. In the algorithm I wrapped the model/transformer has numerous params that are specific to it rather than the estimator. PipelineModel needs a getStages() command (just as Pipeline does) to get at the model so you can parameterise it. I had to use the undocumented .stages member instead. But then if you want to call transform() on a pipelinemodel immediately after fitting it you also need some ability to set the model/transformer params in advance. I got around this by defining a params class for the estimator-only params and another class for the model-only params. I made the estimator inherit from both these classes and the model inherit from only the model-base params class. The estimator then just passes through any model-specific params to the model when it creates it at the end of its fit() method. But, to distinguish the model-only params from the estimator (e.g., when listing the params on the estimator) I had to prefix all the model-only params with a common value to identify them. This works but it's clunky and ugly. # The algorithm I ported works naturally with individually named column inputs. But the existing ML Pipeline library prefers DenseVectors. I ended up having to support both types of inputs - if a DenseVector input was 'None' I would take the data directly from the individually named columns instead. 
If users want to use the algorithm by itself they can used the column-based input approach; if they want to work with algorithms from the built-in library (e.g., StandardScaler, Binarizer, etc) they can use the DenseVector approach instead. Again this works but is clunky because you're having to handle two different forms of input inside the same implementation. Also DenseVectors are limited by their inability to handle missing values. # Similarly, I wanted to produce multiple separate columns for the outputs of the model's transform() method whereas most built-in algorithms seem to use a single DenseVector output column. DataFrame's withColumn() method could do with a withColumns() equivalent to make it easy to add multiple columns to a Dataframe instead of just one column at a time. # Documentation explaining how to create a custom estimator and transformer (preferably one with transformer-specific params) would be extremely useful for people. Most of what I learned I gleaned off StackOverflow and from looking at Spark's pipeline code. Hope this list will be useful for improving ML Pipelines in future versions of Spark! was (Author: lucas.partridge): Ok great. Here's my feedback after wrapping a large complex Python algorithm for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond pyspark too. # The inability to save and load custom pyspark models/pipelines/pipelinemodels is an absolute showstopper. Training models can take hours and so
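On the point above about a missing withColumns() equivalent, the usual workaround is to fold over withColumn. A short Scala sketch follows (the same pattern applies in PySpark); the helper name is ours and not part of the DataFrame API.

{code:scala}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Hypothetical helper: add several columns in one call by folding over withColumn.
def withColumnsCompat(df: DataFrame, cols: Seq[(String, Column)]): DataFrame =
  cols.foldLeft(df) { case (acc, (name, expr)) => acc.withColumn(name, expr) }

// Usage sketch, assuming a DataFrame df with a numeric column "v":
// val out = withColumnsCompat(df, Seq(
//   "v_doubled" -> (col("v") * 2),
//   "v_centred" -> (col("v") - lit(1.0))))
{code}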
[jira] [Assigned] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType
[ https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25160: Assignee: Gengliang Wang > Remove sql configuration spark.sql.avro.outputTimestampType > --- > > Key: SPARK-25160 > URL: https://issues.apache.org/jira/browse/SPARK-25160 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In the PR for supporting logical timestamp types > [https://github.com/apache/spark/pull/21935,] a SQL configuration > spark.sql.avro.outputTimestampType is added, so that user can specify the > output timestamp precision they want. > With PR [https://github.com/apache/spark/pull/21847,] the output file can be > written with user specified types. > So no need to have such configuration. Otherwise to make it consistent we > need to add configuration for all the Catalyst types that can be converted > into different Avro types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType
[ https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25160. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22151 [https://github.com/apache/spark/pull/22151] > Remove sql configuration spark.sql.avro.outputTimestampType > --- > > Key: SPARK-25160 > URL: https://issues.apache.org/jira/browse/SPARK-25160 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In the PR for supporting logical timestamp types > [https://github.com/apache/spark/pull/21935,] a SQL configuration > spark.sql.avro.outputTimestampType is added, so that user can specify the > output timestamp precision they want. > With PR [https://github.com/apache/spark/pull/21847,] the output file can be > written with user specified types. > So no need to have such configuration. Otherwise to make it consistent we > need to add configuration for all the Catalyst types that can be converted > into different Avro types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
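For context, a hedged Scala sketch of the two mechanisms discussed above: the SQL configuration this ticket removes, and an explicitly supplied Avro schema that makes it redundant. The configuration value string and the exact behaviour of the avroSchema write option are assumptions drawn from the linked PRs, not a verified reference.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-timestamp-sketch").getOrCreate()
import spark.implicits._

// The configuration being removed (the value string is an assumption).
spark.conf.set("spark.sql.avro.outputTimestampType", "TIMESTAMP_MICROS")

// The replacement: state the desired Avro types directly in a user-supplied schema.
val avroSchema =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"ts","type":{"type":"long","logicalType":"timestamp-micros"}}
    |]}""".stripMargin

val df = Seq((1L, java.sql.Timestamp.valueOf("2018-08-20 10:00:00"))).toDF("id", "ts")
df.write.format("avro").option("avroSchema", avroSchema).save("/tmp/events_avro")
{code}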
[jira] [Commented] (SPARK-23034) Display tablename for `HiveTableScan` node in UI
[ https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585786#comment-16585786 ] Apache Spark commented on SPARK-23034: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/22153 > Display tablename for `HiveTableScan` node in UI > > > Key: SPARK-23034 > URL: https://issues.apache.org/jira/browse/SPARK-23034 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.2.1 >Reporter: Tejas Patil >Priority: Trivial > > For queries which scan multiple tables, it will be convenient if the DAG > shown in Spark UI also showed which table is being scanned. This will make > debugging easier. For this JIRA, I am scoping those for hive table scans > only. For Spark native tables, the table scan node is abstracted out as a > `WholeStageCodegen` node (which might be doing more things besides table > scan). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689 ] Stavros Kontopoulos edited comment on SPARK-24434 at 8/20/18 10:17 AM: --- [~onursatici] I am working on this (check last comment above), but since I am on vacations I haven't created a PR. If you are going to do it, pls read the design doc and cover the listed cases, will review. At some point we need to coordinate, [~liyinan926] next time pls assign people, I spent quite some time on this. was (Author: skonto): [~onursatici] I am working on this, but since I am on vacations I haven't created a PR. If you are going to do it, pls read the design doc and cover the listed cases, will review. At some point we need to coordinate, [~liyinan926] next time pls assign people, I spent quite some time on this. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689 ] Stavros Kontopoulos edited comment on SPARK-24434 at 8/20/18 9:34 AM: -- [~onursatici] I am working on this, but since I am on vacations I haven't created a PR. If you are going to do it, pls read the design doc and cover the listed cases, will review. At some point we need to coordinate, [~liyinan926] next time pls assign people, I spent quite some time on this. was (Author: skonto): [~onursatici] I am working on this, but since I am on vacations I haven't created a PR. If you are going to do it, pls read the design doc and cover the listed cases. At some point we need to coordinate, [~liyinan926] next time pls assign people. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689 ] Stavros Kontopoulos edited comment on SPARK-24434 at 8/20/18 9:32 AM: -- [~onursatici] I am working on this, but since I am on vacations I haven't created a PR. If you are going to do it, pls read the design doc and cover the listed cases. At some point we need to coordinate, [~liyinan926] next time pls assign people. was (Author: skonto): [~onursatici] I am working on this, but since I am on vacations I haven't created a PR. If you are going to do it, pls read the design doc and cover the listed cases. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates
[ https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689 ] Stavros Kontopoulos commented on SPARK-24434: - [~onursatici] I am working on this, but since I am on vacations I haven't created a PR. If you are going to do it, pls read the design doc and cover the listed cases. > Support user-specified driver and executor pod templates > > > Key: SPARK-24434 > URL: https://issues.apache.org/jira/browse/SPARK-24434 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Yinan Li >Priority: Major > > With more requests for customizing the driver and executor pods coming, the > current approach of adding new Spark configuration options has some serious > drawbacks: 1) it means more Kubernetes specific configuration options to > maintain, and 2) it widens the gap between the declarative model used by > Kubernetes and the configuration model used by Spark. We should start > designing a solution that allows users to specify pod templates as central > places for all customization needs for the driver and executor pods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25159) json schema inference should only trigger one job
[ https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25159: Assignee: Apache Spark (was: Wenchen Fan) > json schema inference should only trigger one job > - > > Key: SPARK-25159 > URL: https://issues.apache.org/jira/browse/SPARK-25159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25159) json schema inference should only trigger one job
[ https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25159: Assignee: Wenchen Fan (was: Apache Spark) > json schema inference should only trigger one job > - > > Key: SPARK-25159 > URL: https://issues.apache.org/jira/browse/SPARK-25159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25159) json schema inference should only trigger one job
[ https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585631#comment-16585631 ] Apache Spark commented on SPARK-25159: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22152 > json schema inference should only trigger one job > - > > Key: SPARK-25159 > URL: https://issues.apache.org/jira/browse/SPARK-25159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
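The ticket has no description, but the user-visible behaviour at stake is straightforward: inferring a JSON schema scans the input and shows up as Spark job(s) in the UI, whereas supplying a schema avoids the inference scan entirely. A small sketch with a hypothetical path:

{code:scala}
// Run in spark-shell, where `spark` is already in scope; the path is hypothetical.
val path = "s3a://bucket/logs/events.json"

// Schema inference scans the data first and surfaces as one or more jobs in the UI.
val inferred = spark.read.json(path)

// Supplying the schema up front skips the inference scan.
val noInference = spark.read.schema(inferred.schema).json(path)
{code}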
[jira] [Assigned] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType
[ https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25160: Assignee: Apache Spark > Remove sql configuration spark.sql.avro.outputTimestampType > --- > > Key: SPARK-25160 > URL: https://issues.apache.org/jira/browse/SPARK-25160 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In the PR for supporting logical timestamp types > [https://github.com/apache/spark/pull/21935,] a SQL configuration > spark.sql.avro.outputTimestampType is added, so that user can specify the > output timestamp precision they want. > With PR [https://github.com/apache/spark/pull/21847,] the output file can be > written with user specified types. > So no need to have such configuration. Otherwise to make it consistent we > need to add configuration for all the Catalyst types that can be converted > into different Avro types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType
[ https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25160: Assignee: (was: Apache Spark) > Remove sql configuration spark.sql.avro.outputTimestampType > --- > > Key: SPARK-25160 > URL: https://issues.apache.org/jira/browse/SPARK-25160 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the PR for supporting logical timestamp types > [https://github.com/apache/spark/pull/21935,] a SQL configuration > spark.sql.avro.outputTimestampType is added, so that user can specify the > output timestamp precision they want. > With PR [https://github.com/apache/spark/pull/21847,] the output file can be > written with user specified types. > So no need to have such configuration. Otherwise to make it consistent we > need to add configuration for all the Catalyst types that can be converted > into different Avro types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType
[ https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585609#comment-16585609 ] Apache Spark commented on SPARK-25160: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22151 > Remove sql configuration spark.sql.avro.outputTimestampType > --- > > Key: SPARK-25160 > URL: https://issues.apache.org/jira/browse/SPARK-25160 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > In the PR for supporting logical timestamp types > [https://github.com/apache/spark/pull/21935,] a SQL configuration > spark.sql.avro.outputTimestampType is added, so that user can specify the > output timestamp precision they want. > With PR [https://github.com/apache/spark/pull/21847,] the output file can be > written with user specified types. > So no need to have such configuration. Otherwise to make it consistent we > need to add configuration for all the Catalyst types that can be converted > into different Avro types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType
Gengliang Wang created SPARK-25160: -- Summary: Remove sql configuration spark.sql.avro.outputTimestampType Key: SPARK-25160 URL: https://issues.apache.org/jira/browse/SPARK-25160 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang In the PR for supporting logical timestamp types [https://github.com/apache/spark/pull/21935,] a SQL configuration spark.sql.avro.outputTimestampType is added, so that user can specify the output timestamp precision they want. With PR [https://github.com/apache/spark/pull/21847,] the output file can be written with user specified types. So no need to have such configuration. Otherwise to make it consistent we need to add configuration for all the Catalyst types that can be converted into different Avro types. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23714) Add metrics for cached KafkaConsumer
[ https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585587#comment-16585587 ] Jungtaek Lim edited comment on SPARK-23714 at 8/20/18 8:06 AM: --- [~yuzhih...@gmail.com] Maybe we can just apply Apache Commons Pool (SPARK-25151) and leverage metrics. PR for SPARK-25151 is also ready to review. was (Author: kabhwan): [~yuzhih...@gmail.com] Maybe we can just apply Apache Commons Pool (SPARK-25251) and leverage metrics. [PR|https://github.com/apache/spark/pull/22138] for SPARK-25251 is also ready to review. > Add metrics for cached KafkaConsumer > > > Key: SPARK-23714 > URL: https://issues.apache.org/jira/browse/SPARK-23714 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Ted Yu >Priority: Major > > SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached > KafkaConsumer. > This JIRA is to add metrics for measuring the operations of the cache so that > users can gain insight into the caching solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23714) Add metrics for cached KafkaConsumer
[ https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585587#comment-16585587 ] Jungtaek Lim commented on SPARK-23714: -- [~yuzhih...@gmail.com] Maybe we can just apply Apache Commons Pool (SPARK-25251) and leverage metrics. [PR|https://github.com/apache/spark/pull/22138] for SPARK-25251 is also ready to review. > Add metrics for cached KafkaConsumer > > > Key: SPARK-23714 > URL: https://issues.apache.org/jira/browse/SPARK-23714 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Ted Yu >Priority: Major > > SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached > KafkaConsumer. > This JIRA is to add metrics for measuring the operations of the cache so that > users can gain insight into the caching solution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
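Since the comments above propose Apache Commons Pool as the pooling mechanism, the Scala sketch below shows the counters that commons-pool2 already exposes and that could back the requested cache metrics; the pooled FakeConsumer is a hypothetical stand-in for a Kafka consumer, not Spark's KafkaDataConsumer.

{code:scala}
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}

// Hypothetical stand-in for a cached Kafka consumer.
class FakeConsumer { def close(): Unit = () }

class FakeConsumerFactory extends BasePooledObjectFactory[FakeConsumer] {
  override def create(): FakeConsumer = new FakeConsumer
  override def wrap(c: FakeConsumer): PooledObject[FakeConsumer] = new DefaultPooledObject(c)
  override def destroyObject(p: PooledObject[FakeConsumer]): Unit = p.getObject.close()
}

object PoolMetricsDemo extends App {
  val pool = new GenericObjectPool(new FakeConsumerFactory)

  val consumer = pool.borrowObject()
  try {
    // ... use the consumer here ...
  } finally {
    pool.returnObject(consumer)
  }

  // Counters maintained by commons-pool2 out of the box.
  println(s"active=${pool.getNumActive} idle=${pool.getNumIdle} " +
    s"created=${pool.getCreatedCount} borrowed=${pool.getBorrowedCount} " +
    s"destroyed=${pool.getDestroyedCount}")
}
{code}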
[jira] [Commented] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected
[ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585554#comment-16585554 ] Ayoub Benali commented on SPARK-25144: -- [~jerryshao] according the comments above this bug doesn't happen in Master but I haven't tried my self thanks for the PR [~dongjoon] > distinct on Dataset leads to exception due to Managed memory leak detected > > > Key: SPARK-25144 > URL: https://issues.apache.org/jira/browse/SPARK-25144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2 >Reporter: Ayoub Benali >Priority: Major > > The following code example: > {code} > case class Foo(bar: Option[String]) > val ds = List(Foo(Some("bar"))).toDS > val result = ds.flatMap(_.bar).distinct > result.rdd.isEmpty > {code} > Produces the following stacktrace > {code} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in > stage 7.0 (TID 125, localhost, executor driver): > org.apache.spark.SparkException: Managed memory leak detected; size = > 16777216 bytes, TID = 125 > [info]at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358) > [info]at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > [info]at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [info]at java.lang.Thread.run(Thread.java:748) > [info] > [info] Driver stacktrace: > [info] at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589) > [info] at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) > [info] at scala.Option.foreach(Option.scala:257) > [info] at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > [info] at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > [info] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > [info] at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) > [info] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074) > [info] at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at 
org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.take(RDD.scala:1331) > [info] at > org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > [info] at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) > [info] at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465) > {code} > The code example doesn't produce any error when `distinct` function is not > called. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25159) json schema inference should only trigger one job
Wenchen Fan created SPARK-25159: --- Summary: json schema inference should only trigger one job Key: SPARK-25159 URL: https://issues.apache.org/jira/browse/SPARK-25159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org