[jira] [Updated] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-20 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-25164:
--
Description: 
{{VectorizedParquetRecordReader.initializeInternal}} loops through each column, 
and for each column it calls
{noformat}
requestedSchema.getColumns().get(i)
{noformat}
However, {{MessageType.getColumns}} will build the entire column list from 
getPaths(0).
{noformat}
  public List<ColumnDescriptor> getColumns() {
    List<String[]> paths = this.getPaths(0);
    List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
    for (String[] path : paths) {
      // TODO: optimize this

      PrimitiveType primitiveType = getType(path).asPrimitiveType();
      columns.add(new ColumnDescriptor(
          path,
          primitiveType,
          getMaxRepetitionLevel(path),
          getMaxDefinitionLevel(path)));
    }
    return columns;
  }
{noformat}
This means that for each parquet file, this routine indirectly iterates 
colCount*colCount times.

This is actually not particularly noticeable unless you have:
 - many parquet files
 - many columns

To verify that this is an issue, I created a 1 million record parquet table 
with 6000 columns of type double and 67 files (so initializeInternal is called 
67 times). I ran the following query:
{noformat}
sql("select * from 6000_1m_double where id1 = 1").collect
{noformat}
I used Spark from the master branch. I had 8 executor threads. The filter 
returns only a few thousand records. The query ran (on average) for 6.4 minutes.

Then I cached the column list at the top of {{initializeInternal}} as follows:
{noformat}
List columnCache = requestedSchema.getColumns();
{noformat}
Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
{{requestedSchema.getColumns()}}.

With the column cache variable, the same query runs in 5 minutes. So with my 
simple query, you save 22% of the time by not rebuilding the column list for each 
column.

You get additional savings with a paths cache variable, now saving 34% in total 
on the above query.
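
For illustration, here is a minimal Scala sketch of the caching idea described 
above (the real change is to the Java class {{VectorizedParquetRecordReader}}; the 
method name below is hypothetical). With 6000 columns, each call to 
{{requestedSchema.getColumns()}} builds a 6000-element list, so calling it once per 
column constructs roughly 36 million ColumnDescriptor objects per file, 67 times 
over.
{code:scala}
// Hedged sketch of the fix described above, not the actual Spark/Parquet patch.
import org.apache.parquet.column.ColumnDescriptor
import org.apache.parquet.schema.MessageType

def initializeColumns(requestedSchema: MessageType): Unit = {
  // Before: requestedSchema.getColumns.get(i) inside the loop rebuilds the whole
  // list on every iteration, i.e. O(columns^2) work per file.
  // After: build the list once and index into the cached copy.
  val columnCache: java.util.List[ColumnDescriptor] = requestedSchema.getColumns
  var i = 0
  while (i < columnCache.size()) {
    val column: ColumnDescriptor = columnCache.get(i)
    // ... per-column initialization would go here ...
    i += 1
  }
}
{code}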

  was:
{{VectorizedParquetRecordReader.initializeInternal}} loops through each column, 
and for each column it calls
{noformat}
requestedSchema.getColumns().get(i)
{noformat}
However, {{MessageType.getColumns}} will build the entire column list from 
getPaths(0).
{noformat}
  public List<ColumnDescriptor> getColumns() {
    List<String[]> paths = this.getPaths(0);
    List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
    for (String[] path : paths) {
      // TODO: optimize this

      PrimitiveType primitiveType = getType(path).asPrimitiveType();
      columns.add(new ColumnDescriptor(
          path,
          primitiveType,
          getMaxRepetitionLevel(path),
          getMaxDefinitionLevel(path)));
    }
    return columns;
  }
{noformat}
This means that for each parquet file, this routine indirectly iterates 
colCount*colCount times.

This is actually not particularly noticeable unless you have:
 - many parquet files
 - many columns

To verify that this is an issue, I created a 1 million record parquet table 
with 6000 columns of type double and 67 files (so initializeInternal is called 
67 times). I ran the following query:
{noformat}
sql("select * from 6000_1m_double where id1 = 1").collect
{noformat}
I used Spark from the master branch. I had 8 executor threads. The filter 
returns only a few thousand records. The query ran (on average) for 6.4 minutes.

Then I cached the column list at the top of {{initializeInternal}} as follows:
{noformat}
List columnCache = requestedSchema.getColumns();
{noformat}
Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
{{requestedSchema.getColumns()}}.

With the column cache variable, the same query runs in 5 minutes. So with my 
simple query, you save 22% of the time by not rebuilding the column list for each 
column.



> Parquet reader builds entire list of columns once for each column
> -
>
> Key: SPARK-25164
> URL: https://issues.apache.org/jira/browse/SPARK-25164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> {{VectorizedParquetRecordReader.initializeInternal}} loops through each 
> column, and for each column it calls
> {noformat}
> requestedSchema.getColumns().get(i)
> {noformat}
> However, {{MessageType.getColumns}} will build the entire column list from 
> getPaths(0).
> {noformat}
>   public List<ColumnDescriptor> getColumns() {
> List<String[]> paths = this.getPaths(0);
> List<ColumnDescriptor> columns

[jira] [Commented] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.

2018-08-20 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587024#comment-16587024
 ] 

Dilip Biswal commented on SPARK-25168:
--

[~cloud_fan] [~smilegator]

> PlanTest.comparePlans may make a supplied resolved plan unresolved.
> ---
>
> Key: SPARK-25168
> URL: https://issues.apache.org/jira/browse/SPARK-25168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Minor
>
> Hello,
> I came across this behaviour while writing a unit test for one of my PRs. We 
> use the utility class PlanTest for writing plan verification tests. 
> Currently, when we call comparePlans and the resolved input plan contains a 
> Join operator, the plan becomes unresolved after the call to 
> `normalizePlan(normalizeExprIds(plan1))`. The reason is that, after 
> cleaning the expression ids, duplicateResolved returns false, making the Join 
> operator unresolved.
> {code}
>  def duplicateResolved: Boolean = 
> left.outputSet.intersect(right.outputSet).isEmpty
>   // Joins are only resolved if they don't introduce ambiguous expression ids.
>   // NaturalJoin should be ready for resolution only if everything else is 
> resolved here
>   lazy val resolvedExceptNatural: Boolean = {
> childrenResolved &&
>   expressions.forall(_.resolved) &&
>   duplicateResolved &&
>   condition.forall(_.dataType == BooleanType)
>   }
> {code}
> Please note that plan verification actually works fine. It just looked 
> awkward to compare two unresolved plans for equality.
> I am opening this ticket to discuss whether this is acceptable behaviour. If it 
> is tolerable, we can close the ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE

2018-08-20 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587020#comment-16587020
 ] 

Jiang Xingbo commented on SPARK-25114:
--

The PR has been merged to master and 2.3

> RecordBinaryComparator may return wrong result when subtraction between two 
> words is divisible by Integer.MAX_VALUE
> ---
>
> Key: SPARK-25114
> URL: https://issues.apache.org/jira/browse/SPARK-25114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
>
> It is possible for two objects to be unequal and yet we consider them as 
> equal within RecordBinaryComparator, if the difference between the two long 
> words is a multiple of Int.MaxValue.
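
For context, a hedged Scala illustration of the failure mode (assuming the 
comparator reduces the difference with a {{% Integer.MAX_VALUE}} style expression 
before truncating to Int; this is a sketch, not a quote of the Spark source):
{code:scala}
// Two distinct 8-byte words whose difference is a multiple of Integer.MAX_VALUE.
val v1 = 2L * Integer.MAX_VALUE   // 4294967294
val v2 = 0L

// Folding the difference through % Integer.MAX_VALUE loses the distinction:
// the result is 0, i.e. "equal", even though v1 != v2.
val cmp = ((v1 - v2) % Integer.MAX_VALUE).toInt
println(cmp)   // 0
{code}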



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.

2018-08-20 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-25168:


 Summary: PlanTest.comparePlans may make a supplied resolved plan 
unresolved.
 Key: SPARK-25168
 URL: https://issues.apache.org/jira/browse/SPARK-25168
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dilip Biswal


Hello,

I came across this behaviour while writing a unit test for one of my PRs. We use 
the utility class PlanTest for writing plan verification tests. Currently, when 
we call comparePlans and the resolved input plan contains a Join operator, the 
plan becomes unresolved after the call to 
`normalizePlan(normalizeExprIds(plan1))`. The reason is that, after cleaning the 
expression ids, duplicateResolved returns false, making the Join operator 
unresolved.

{code}
 def duplicateResolved: Boolean = 
left.outputSet.intersect(right.outputSet).isEmpty

  // Joins are only resolved if they don't introduce ambiguous expression ids.
  // NaturalJoin should be ready for resolution only if everything else is 
resolved here
  lazy val resolvedExceptNatural: Boolean = {
childrenResolved &&
  expressions.forall(_.resolved) &&
  duplicateResolved &&
  condition.forall(_.dataType == BooleanType)
  }
{code}

Please note that plan verification actually works fine. It just looked awkward 
to compare two unresolved plans for equality.

I am opening this ticket to discuss whether this is acceptable behaviour. If it 
is tolerable, we can close the ticket.
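
To make the effect concrete, here is a small self-contained Scala model of the 
behaviour described above (plain case classes stand in for Catalyst attributes; 
this is not Catalyst code, and the id normalization is simplified to setting 
every id to 0):
{code:scala}
// Toy model: Attr stands in for Catalyst's AttributeReference.
case class Attr(name: String, exprId: Long)

def duplicateResolved(left: Set[Attr], right: Set[Attr]): Boolean =
  left.intersect(right).isEmpty

// A resolved self-join: same names, distinct expression ids.
val left  = Set(Attr("a", 10L), Attr("b", 11L))
val right = Set(Attr("a", 20L), Attr("b", 21L))
println(duplicateResolved(left, right))   // true -> Join is resolved

// normalizeExprIds rewrites every id the same way in both children...
def normalizeExprIds(attrs: Set[Attr]): Set[Attr] = attrs.map(_.copy(exprId = 0L))

// ...so the output sets now intersect and the Join reports itself unresolved.
println(duplicateResolved(normalizeExprIds(left), normalizeExprIds(right)))   // false
{code}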



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25017) Add test suite for ContextBarrierState

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25017:


Assignee: (was: Apache Spark)

> Add test suite for ContextBarrierState
> --
>
> Key: SPARK-25017
> URL: https://issues.apache.org/jira/browse/SPARK-25017
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> We should be able to add unit tests for ContextBarrierState with a mocked 
> RpcCallContext. Currently it is only covered by end-to-end tests in 
> `BarrierTaskContextSuite`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25017) Add test suite for ContextBarrierState

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586950#comment-16586950
 ] 

Apache Spark commented on SPARK-25017:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/22165

> Add test suite for ContextBarrierState
> --
>
> Key: SPARK-25017
> URL: https://issues.apache.org/jira/browse/SPARK-25017
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> We should be able to add unit tests for ContextBarrierState with a mocked 
> RpcCallContext. Currently it is only covered by end-to-end tests in 
> `BarrierTaskContextSuite`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25017) Add test suite for ContextBarrierState

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25017:


Assignee: Apache Spark

> Add test suite for ContextBarrierState
> --
>
> Key: SPARK-25017
> URL: https://issues.apache.org/jira/browse/SPARK-25017
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Major
>
> We should be able to add unit tests for ContextBarrierState with a mocked 
> RpcCallContext. Currently it is only covered by end-to-end tests in 
> `BarrierTaskContextSuite`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586923#comment-16586923
 ] 

Apache Spark commented on SPARK-25167:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22161

> Minor fixes for R sql tests (tests that fail in development environment)
> 
>
> Key: SPARK-25167
> URL: https://issues.apache.org/jira/browse/SPARK-25167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Minor
>
> A few SQL tests for R are failing in the development environment (Mac). 
> *  The catalog API tests assume catalog artifacts named "foo" to be 
> non-existent. I think names such as foo and bar are common and developers use 
> them frequently. 
> *  One test assumes that we only have one database in the system. I had more 
> than one and it caused the test to fail. I have changed that check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25167:


Assignee: Apache Spark

> Minor fixes for R sql tests (tests that fail in development environment)
> 
>
> Key: SPARK-25167
> URL: https://issues.apache.org/jira/browse/SPARK-25167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Minor
>
> A few SQL tests for R are failing in the development environment (Mac). 
> *  The catalog API tests assume catalog artifacts named "foo" to be 
> non-existent. I think names such as foo and bar are common and developers use 
> them frequently. 
> *  One test assumes that we only have one database in the system. I had more 
> than one and it caused the test to fail. I have changed that check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25167:


Assignee: (was: Apache Spark)

> Minor fixes for R sql tests (tests that fail in development environment)
> 
>
> Key: SPARK-25167
> URL: https://issues.apache.org/jira/browse/SPARK-25167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Minor
>
> A few SQL tests for R are failing in the development environment (Mac). 
> *  The catalog API tests assume catalog artifacts named "foo" to be 
> non-existent. I think names such as foo and bar are common and developers use 
> them frequently. 
> *  One test assumes that we only have one database in the system. I had more 
> than one and it caused the test to fail. I have changed that check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)

2018-08-20 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-25167:


 Summary: Minor fixes for R sql tests (tests that fail in 
development environment)
 Key: SPARK-25167
 URL: https://issues.apache.org/jira/browse/SPARK-25167
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1
Reporter: Dilip Biswal


A few SQL tests for R are failing in the development environment (Mac).

*  The catalog API tests assume catalog artifacts named "foo" to be non-existent. 
I think names such as foo and bar are common and developers use them frequently. 
*  One test assumes that we only have one database in the system. I had more 
than one and it caused the test to fail. I have changed that check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23679) uiWebUrl shows improper URL when running on YARN

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23679:


Assignee: Apache Spark

> uiWebUrl shows improper URL when running on YARN
> ---
>
> Key: SPARK-23679
> URL: https://issues.apache.org/jira/browse/SPARK-23679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>Priority: Major
>
> uiWebUrl returns a local URL.
> Using it will cause HTTP ERROR 500.
> {code}
> Problem accessing /. Reason:
> Server Error
> Caused by:
> javax.servlet.ServletException: Could not determine the proxy server for 
> redirection
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> We should return the address of the YARN proxy instead.
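
For illustration, a hedged Scala sketch of the symptom, assuming a SparkSession 
named {{spark}} (the proxy address shown is a hypothetical example, not an API 
this issue adds):
{code:scala}
// uiWebUrl reports the driver-local UI address, e.g. Some("http://driver-host:4040").
val localUi = spark.sparkContext.uiWebUrl

// On YARN that address sits behind AmIpFilter and answers with HTTP 500; the
// reachable address is the YARN proxy, e.g.
//   http://<resource-manager>:8088/proxy/<applicationId>/
val appId = spark.sparkContext.applicationId
{code}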



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23679) uiWebUrl shows improper URL when running on YARN

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23679:


Assignee: (was: Apache Spark)

> uiWebUrl shows improper URL when running on YARN
> ---
>
> Key: SPARK-23679
> URL: https://issues.apache.org/jira/browse/SPARK-23679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Priority: Major
>
> uiWebUrl returns a local URL.
> Using it will cause HTTP ERROR 500.
> {code}
> Problem accessing /. Reason:
> Server Error
> Caused by:
> javax.servlet.ServletException: Could not determine the proxy server for 
> redirection
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> We should return the address of the YARN proxy instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23679) uiWebUrl shows improper URL when running on YARN

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586907#comment-16586907
 ] 

Apache Spark commented on SPARK-23679:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22164

> uiWebUrl shows improper URL when running on YARN
> ---
>
> Key: SPARK-23679
> URL: https://issues.apache.org/jira/browse/SPARK-23679
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI, YARN
>Affects Versions: 2.3.0
>Reporter: Maciej Bryński
>Priority: Major
>
> uiWebUrl returns a local URL.
> Using it will cause HTTP ERROR 500.
> {code}
> Problem accessing /. Reason:
> Server Error
> Caused by:
> javax.servlet.ServletException: Could not determine the proxy server for 
> redirection
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.findRedirectUrl(AmIpFilter.java:205)
>   at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:145)
>   at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>   at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
>   at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>   at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>   at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:461)
>   at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>   at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>   at org.spark_project.jetty.server.Server.handle(Server.java:524)
>   at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
>   at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
>   at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>   at 
> org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
>   at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>   at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>   at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> We should return the address of the YARN proxy instead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25165) Cannot parse Hive Struct

2018-08-20 Thread Frank Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901
 ] 

Frank Yin edited comment on SPARK-25165 at 8/21/18 3:48 AM:


 

 

{{#!/usr/bin/env python}}
 {{# -*- coding: UTF-8 -*-}}
 {{# encoding=utf8}}
 {{import sys}}
 {{import os}}
 {{import json}}
 {{import argparse}}
 {{import time}}
 {{from datetime import datetime, timedelta}}
 {{from calendar import timegm}}
 {{from pyspark.sql import SparkSession}}
 {{from pyspark.conf import SparkConf}}
 {{from pyspark.sql.functions import *}}
 {{from pyspark.sql.types import *}}

{{spark_conf = SparkConf().setAppName("Test Hive")\}}
{{ .set("spark.executor.memory", "4g")\}}
{{ .set("spark.sql.catalogImplementation","hive")\}}
{{ .set("spark.speculation", "true")\}}
{{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}}
{{ .set("spark.sql.shuffle.partitions", "400")}}

 

{{spark.sql("SELECT * FROM default.a").collect() }}

where default.a is a table in hive. 

schema: 

columnA:struct,view.b:array>

 


was (Author: frankyin-factual):
 

 

{{#!/usr/bin/env python}}
{{# -*- coding: UTF-8 -*-}}
{{# encoding=utf8}}
{{import sys}}
{{import os}}
{{import json}}
{{import argparse}}
{{import time}}
{{from datetime import datetime, timedelta}}
{{from calendar import timegm}}
{{from pyspark.sql import SparkSession}}
{{from pyspark.conf import SparkConf}}
{{from pyspark.sql.functions import *}}
{{from pyspark.sql.types import *}}{{spark_conf = SparkConf().setAppName("Test 
Hive")\}}
{{ .set("spark.executor.memory", "4g")\}}
{{ .set("spark.sql.catalogImplementation","hive")\}}
{{ .set("spark.speculation", "true")\}}
{{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}}
{{ .set("spark.sql.shuffle.partitions", "400")}}{{spark = SparkSession\}}
{{ .builder\}}
{{ .config(conf=spark_conf)\}}
{{ .getOrCreate()}}

{{spark.sql("SELECT * FROM default.a").collect() }}

where default.a is a table in hive. 

schema: 

columnA:struct,view.b:array>

 

> Cannot parse Hive Struct
> 
>
> Key: SPARK-25165
> URL: https://issues.apache.org/jira/browse/SPARK-25165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Frank Yin
>Priority: Major
>
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct,view.b:array>
>  
> My guess is dot(.) is causing issues for parsing. 
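
For context, a hedged Scala sketch of the kind of schema involved, built 
programmatically with Spark SQL types (the sibling field name {{view.a}} is 
hypothetical; only {{view.b}} appears in the report):
{code:scala}
import org.apache.spark.sql.types._

// Struct field names that contain dots, as in the reported Hive type string.
val columnA = StructType(Seq(
  StructField("view.a", ArrayType(StringType)),   // hypothetical sibling field
  StructField("view.b", ArrayType(StringType))
))
val schema = StructType(Seq(StructField("columnA", columnA)))

// The corresponding Hive-style type string contains the dotted names, which is
// what the reporter suspects the parser trips over.
println(schema.catalogString)
{code}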



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25165) Cannot parse Hive Struct

2018-08-20 Thread Frank Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901
 ] 

Frank Yin edited comment on SPARK-25165 at 8/21/18 3:48 AM:


 

 
{quote}{{#!/usr/bin/env python}}
 {{# -*- coding: UTF-8 -*-}}
 {{# encoding=utf8}}
 {{import sys}}
 {{import os}}
 {{import json}}
 {{import argparse}}
 {{import time}}
 {{from datetime import datetime, timedelta}}
 {{from calendar import timegm}}
 {{from pyspark.sql import SparkSession}}
 {{from pyspark.conf import SparkConf}}
 {{from pyspark.sql.functions import *}}
 {{from pyspark.sql.types import *}}

{{spark_conf = SparkConf().setAppName("Test 
Hive")}}.set("spark.sql.catalogImplementation","hive") 

{{spark.sql("SELECT * FROM default.a").collect() }}

where default.a is a table in hive. 

schema: 

columnA:struct,view.b:array>
{quote}
 


was (Author: frankyin-factual):
 

 

{{#!/usr/bin/env python}}
 {{# -*- coding: UTF-8 -*-}}
 {{# encoding=utf8}}
 {{import sys}}
 {{import os}}
 {{import json}}
 {{import argparse}}
 {{import time}}
 {{from datetime import datetime, timedelta}}
 {{from calendar import timegm}}
 {{from pyspark.sql import SparkSession}}
 {{from pyspark.conf import SparkConf}}
 {{from pyspark.sql.functions import *}}
 {{from pyspark.sql.types import *}}

{{spark_conf = SparkConf().setAppName("Test Hive")\}}
{{ .set("spark.executor.memory", "4g")\}}
{{ .set("spark.sql.catalogImplementation","hive")\}}
{{ .set("spark.speculation", "true")\}}
{{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}}
{{ .set("spark.sql.shuffle.partitions", "400")}}

 

{{spark.sql("SELECT * FROM default.a").collect() }}

where default.a is a table in hive. 

schema: 

columnA:struct,view.b:array>

 

> Cannot parse Hive Struct
> 
>
> Key: SPARK-25165
> URL: https://issues.apache.org/jira/browse/SPARK-25165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Frank Yin
>Priority: Major
>
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct,view.b:array>
>  
> My guess is dot(.) is causing issues for parsing. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25165) Cannot parse Hive Struct

2018-08-20 Thread Frank Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901
 ] 

Frank Yin edited comment on SPARK-25165 at 8/21/18 3:47 AM:


 

 

{{#!/usr/bin/env python}}
{{# -*- coding: UTF-8 -*-}}
{{# encoding=utf8}}
{{import sys}}
{{import os}}
{{import json}}
{{import argparse}}
{{import time}}
{{from datetime import datetime, timedelta}}
{{from calendar import timegm}}
{{from pyspark.sql import SparkSession}}
{{from pyspark.conf import SparkConf}}
{{from pyspark.sql.functions import *}}
{{from pyspark.sql.types import *}}{{spark_conf = SparkConf().setAppName("Test 
Hive")\}}
{{ .set("spark.executor.memory", "4g")\}}
{{ .set("spark.sql.catalogImplementation","hive")\}}
{{ .set("spark.speculation", "true")\}}
{{ .set("spark.dynamicAllocation.maxExecutors", "2000")\}}
{{ .set("spark.sql.shuffle.partitions", "400")}}{{spark = SparkSession\}}
{{ .builder\}}
{{ .config(conf=spark_conf)\}}
{{ .getOrCreate()}}

{{spark.sql("SELECT * FROM default.a").collect() }}

where default.a is a table in hive. 

schema: 

columnA:struct,view.b:array>

 


was (Author: frankyin-factual):
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# encoding=utf8
import sys
import os
import json
import argparse
import time
from datetime import datetime, timedelta
from calendar import timegm
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark_conf = SparkConf().setAppName("Test Hive")\
 .set("spark.executor.memory", "4g")\
 .set("spark.sql.catalogImplementation","hive")\
 .set("spark.speculation", "true")\
 .set("spark.dynamicAllocation.maxExecutors", "2000")\
 .set("spark.sql.shuffle.partitions", "400")

spark = SparkSession\
 .builder\
 .config(conf=spark_conf)\
 .getOrCreate()

 

places_and_devices = spark.sql("SELECT * FROM default.a").collect()

 

where default.a is a table in hive. 

schema: 

columnA:struct,view.b:array>

 

> Cannot parse Hive Struct
> 
>
> Key: SPARK-25165
> URL: https://issues.apache.org/jira/browse/SPARK-25165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Frank Yin
>Priority: Major
>
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct,view.b:array>
>  
> My guess is dot(.) is causing issues for parsing. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25165) Cannot parse Hive Struct

2018-08-20 Thread Frank Yin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586901#comment-16586901
 ] 

Frank Yin commented on SPARK-25165:
---

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
# encoding=utf8
import sys
import os
import json
import argparse
import time
from datetime import datetime, timedelta
from calendar import timegm
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark_conf = SparkConf().setAppName("Test Hive")\
 .set("spark.executor.memory", "4g")\
 .set("spark.sql.catalogImplementation","hive")\
 .set("spark.speculation", "true")\
 .set("spark.dynamicAllocation.maxExecutors", "2000")\
 .set("spark.sql.shuffle.partitions", "400")

spark = SparkSession\
 .builder\
 .config(conf=spark_conf)\
 .getOrCreate()

 

places_and_devices = spark.sql("SELECT * FROM default.a").collect()

 

where default.a is a table in hive. 

schema: 

columnA:struct,view.b:array>

 

> Cannot parse Hive Struct
> 
>
> Key: SPARK-25165
> URL: https://issues.apache.org/jira/browse/SPARK-25165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Frank Yin
>Priority: Major
>
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct,view.b:array>
>  
> My guess is dot(.) is causing issues for parsing. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25157) Streaming of image files from directory

2018-08-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586847#comment-16586847
 ] 

Hyukjin Kwon commented on SPARK-25157:
--

Is this blocked by SPARK-22666?


> Streaming of image files from directory
> ---
>
> Key: SPARK-25157
> URL: https://issues.apache.org/jira/browse/SPARK-25157
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Amit Baghel
>Priority: Major
>
> We are doing video analytics for video streams using Spark. At present there 
> is no direct way to stream video frames or image files to Spark and process 
> them using Structured Streaming and Dataset. We are using Kafka to stream 
> images and then doing the processing in Spark. We need a method in Spark to 
> stream images from a directory. Currently *{{DataStreamReader}}* doesn't 
> support image files. With the introduction of 
> *org.apache.spark.ml.image.ImageSchema* class, we think streaming 
> capabilities can be added for image files. It is fine if it won't support 
> some of the structured streaming features as it is a binary file. This method 
> could be similar to *mmlspark* *streamImages* method. 
> [https://github.com/Azure/mmlspark/blob/4413771a8830e4760f550084da60ea0616bf80b9/src/io/image/src/main/python/ImageReader.py]
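
For reference, a hedged Scala sketch of what exists today (batch reading via 
{{ImageSchema}}) versus what this issue asks for (the streaming form in the 
comment is aspirational and does not exist at the time of writing):
{code:scala}
import org.apache.spark.ml.image.ImageSchema

// Batch reading of image files works in Spark 2.3 via ImageSchema:
val images = ImageSchema.readImages("/path/to/frames")

// Requested (not available here): a streaming source along the lines of
//   spark.readStream.format("image").schema(ImageSchema.imageSchema).load("/path/to/frames")
{code}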



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25165) Cannot parse Hive Struct

2018-08-20 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586842#comment-16586842
 ] 

Hyukjin Kwon commented on SPARK-25165:
--

Mind if I ask a reproducer please?

> Cannot parse Hive Struct
> 
>
> Key: SPARK-25165
> URL: https://issues.apache.org/jira/browse/SPARK-25165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Frank Yin
>Priority: Major
>
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct,view.b:array>
>  
> My guess is dot(.) is causing issues for parsing. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25166) Reduce the number of write operations for shuffle write.

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25166:


Assignee: Apache Spark

> Reduce the number of write operations for shuffle write.
> 
>
> Key: SPARK-25166
> URL: https://issues.apache.org/jira/browse/SPARK-25166
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, only one record is written to a buffer each time, which increases 
> the number of copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25166) Reduce the number of write operations for shuffle write.

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25166:


Assignee: (was: Apache Spark)

> Reduce the number of write operations for shuffle write.
> 
>
> Key: SPARK-25166
> URL: https://issues.apache.org/jira/browse/SPARK-25166
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, only one record is written to a buffer each time, which increases 
> the number of copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25166) Reduce the number of write operations for shuffle write.

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586831#comment-16586831
 ] 

Apache Spark commented on SPARK-25166:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/22163

> Reduce the number of write operations for shuffle write.
> 
>
> Key: SPARK-25166
> URL: https://issues.apache.org/jira/browse/SPARK-25166
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, only one record is written to a buffer each time, which increases 
> the number of copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25166) Reduce the number of write operations for shuffle write.

2018-08-20 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25166:

Description: Currently, only one record is written to a buffer each time, 
which increases the number of copies.  (was: Currently, each record will be 
write to a buffer , which increases the number of copies.)

> Reduce the number of write operations for shuffle write.
> 
>
> Key: SPARK-25166
> URL: https://issues.apache.org/jira/browse/SPARK-25166
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, only one record is written to a buffer each time, which increases 
> the number of copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25166) Reduce the number of write operations for shuffle write.

2018-08-20 Thread liuxian (JIRA)
liuxian created SPARK-25166:
---

 Summary: Reduce the number of write operations for shuffle write.
 Key: SPARK-25166
 URL: https://issues.apache.org/jira/browse/SPARK-25166
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.4.0
Reporter: liuxian


Currently, each record is written to a buffer one at a time, which increases 
the number of copies.
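
As an illustration of the buffering idea (plain JDK I/O, not Spark's shuffle 
writer):
{code:scala}
import java.io.{BufferedOutputStream, FileOutputStream}

// Writing each record directly would issue one write call per record; wrapping
// the stream in a 64 KB buffer lets many records be flushed in a single call.
def writeRecords(records: Iterator[Array[Byte]], path: String): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(path), 64 * 1024)
  try records.foreach(r => out.write(r)) finally out.close()
}
{code}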



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25132.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22148
[https://github.com/apache/spark/pull/22148]

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of whether 
> spark.sql.caseSensitive is set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25132:


Assignee: Chenxiao Mao

> Case-insensitive field resolution when reading from Parquet/ORC
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of whether 
> spark.sql.caseSensitive is set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25134:
-
Fix Version/s: 2.4.0

> Csv column pruning with checking of headers throws incorrect error
> --
>
> Key: SPARK-25134
> URL: https://issues.apache.org/jira/browse/SPARK-25134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master branch at 
> a791c29bd824adadfb2d85594bc8dad4424df936
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Major
> Fix For: 2.4.0
>
>
> hello!
> seems to me there is some interaction between csv column pruning and the 
> checking of csv headers that is causing issues. for example this fails:
> {code:scala}
> Seq(("a", "b")).toDF("columnA", "columnB").write
>   .format("csv")
>   .option("header", true)
>   .save(dir)
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("enforceSchema", false)
>   .load(dir)
>   .select("columnA")
>   .show
> {code}
> the error is:
> {code:bash}
> 291.0 (TID 319, localhost, executor driver): 
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
> [info]  Header length: 1, schema size: 2
> {code}
> if i remove the project it works fine. if i disable column pruning it also 
> works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25134.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/22123

> Csv column pruning with checking of headers throws incorrect error
> --
>
> Key: SPARK-25134
> URL: https://issues.apache.org/jira/browse/SPARK-25134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master branch at 
> a791c29bd824adadfb2d85594bc8dad4424df936
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
>
> hello!
> seems to me there is some interaction between csv column pruning and the 
> checking of csv headers that is causing issues. for example this fails:
> {code:scala}
> Seq(("a", "b")).toDF("columnA", "columnB").write
>   .format("csv")
>   .option("header", true)
>   .save(dir)
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("enforceSchema", false)
>   .load(dir)
>   .select("columnA")
>   .show
> {code}
> the error is:
> {code:bash}
> 291.0 (TID 319, localhost, executor driver): 
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
> [info]  Header length: 1, schema size: 2
> {code}
> if i remove the project it works fine. if i disable column pruning it also 
> works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25134:
-
Priority: Major  (was: Minor)

> Csv column pruning with checking of headers throws incorrect error
> --
>
> Key: SPARK-25134
> URL: https://issues.apache.org/jira/browse/SPARK-25134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master branch at 
> a791c29bd824adadfb2d85594bc8dad4424df936
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Major
> Fix For: 2.4.0
>
>
> hello!
> seems to me there is some interaction between csv column pruning and the 
> checking of csv headers that is causing issues. for example this fails:
> {code:scala}
> Seq(("a", "b")).toDF("columnA", "columnB").write
>   .format("csv")
>   .option("header", true)
>   .save(dir)
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("enforceSchema", false)
>   .load(dir)
>   .select("columnA")
>   .show
> {code}
> the error is:
> {code:bash}
> 291.0 (TID 319, localhost, executor driver): 
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
> [info]  Header length: 1, schema size: 2
> {code}
> if i remove the project it works fine. if i disable column pruning it also 
> works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25134:
-
Affects Version/s: (was: 2.3.1)
   2.4.0

> Csv column pruning with checking of headers throws incorrect error
> --
>
> Key: SPARK-25134
> URL: https://issues.apache.org/jira/browse/SPARK-25134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master branch at 
> a791c29bd824adadfb2d85594bc8dad4424df936
>Reporter: koert kuipers
>Priority: Minor
>
> hello!
> seems to me there is some interaction between csv column pruning and the 
> checking of csv headers that is causing issues. for example this fails:
> {code:scala}
> Seq(("a", "b")).toDF("columnA", "columnB").write
>   .format("csv")
>   .option("header", true)
>   .save(dir)
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("enforceSchema", false)
>   .load(dir)
>   .select("columnA")
>   .show
> {code}
> the error is:
> {code:bash}
> 291.0 (TID 319, localhost, executor driver): 
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
> [info]  Header length: 1, schema size: 2
> {code}
> if i remove the project it works fine. if i disable column pruning it also 
> works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25134) Csv column pruning with checking of headers throws incorrect error

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25134:


Assignee: Koert Kuipers

> Csv column pruning with checking of headers throws incorrect error
> --
>
> Key: SPARK-25134
> URL: https://issues.apache.org/jira/browse/SPARK-25134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master branch at 
> a791c29bd824adadfb2d85594bc8dad4424df936
>Reporter: koert kuipers
>Assignee: Koert Kuipers
>Priority: Minor
>
> hello!
> seems to me there is some interaction between csv column pruning and the 
> checking of csv headers that is causing issues. for example this fails:
> {code:scala}
> Seq(("a", "b")).toDF("columnA", "columnB").write
>   .format("csv")
>   .option("header", true)
>   .save(dir)
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("enforceSchema", false)
>   .load(dir)
>   .select("columnA")
>   .show
> {code}
> the error is:
> {code:bash}
> 291.0 (TID 319, localhost, executor driver): 
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
> [info]  Header length: 1, schema size: 2
> {code}
> if i remove the project it works fine. if i disable column pruning it also 
> works fine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25144.
--
Resolution: Fixed

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.3, 2.3.2
>
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is not 
> called.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25144:


Assignee: Dongjoon Hyun

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.3, 2.3.2
>
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is not 
> called.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25144:
-
Fix Version/s: 2.3.2
   2.2.3

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.3, 2.3.2
>
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is not 
> called.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-08-20 Thread Xin Yao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586647#comment-16586647
 ] 

Xin Yao commented on SPARK-23128:
-

Thanks [~XuanYuan] for sharing those awesome numbers.

If I understand it correctly, one of the goals of this patch is to solve 
data skew by splitting large partitions into smaller ones. Have you seen any 
significant improvement in this area?

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
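
For context, the coordinator-based implementation criticized above is controlled by existing SQL configs; a minimal sketch of enabling it in Spark 2.3 follows, assuming the {{spark.sql.adaptive.enabled}} and {{spark.sql.adaptive.shuffle.targetPostShuffleInputSize}} keys and two hypothetical tables. It illustrates the current behaviour the proposal aims to replace, not the new approach itself.
{code:scala}
// Hedged sketch: enable the existing ExchangeCoordinator-based adaptive
// execution so the number of post-shuffle partitions is decided at runtime.
spark.conf.set("spark.sql.adaptive.enabled", true)
// Rough target size per post-shuffle partition (64 MB here); the coordinator
// merges small shuffle partitions up to about this many bytes.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", 64L * 1024 * 1024)

// t1 and t2 are hypothetical tables sharing an "id" column.
val joined = spark.table("t1").join(spark.table("t2"), "id")
joined.count()  // the reducer count for the shuffle is chosen at runtime
{code}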



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-20 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586612#comment-16586612
 ] 

Bryan Cutler commented on SPARK-23874:
--

For the Python fixes, yes, the user would have to upgrade pyarrow. Spark itself 
does not upgrade pyarrow; it only bumps the minimum supported version, which 
determines what gets installed with pyspark and is still currently 0.8.0. Since 
we upgraded the Java library, those fixes are included.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-08-20 Thread Xin Yao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586605#comment-16586605
 ] 

Xin Yao commented on SPARK-23128:
-

Thanks [~carsonwang] for the great work. This looks really interesting.

As you mentioned you have ported this to Spark 2.3, what version of Spark was 
this originally built on top of? Can I find that code anywhere, and is it up 
to date? Thanks

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-20 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586593#comment-16586593
 ] 

Xiao Li commented on SPARK-23874:
-

[~bryanc] To get these fixes, we need to upgrade pyarrow to 0.10, right?

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586544#comment-16586544
 ] 

Apache Spark commented on SPARK-24442:
--

User 'AndrewKL' has created a pull request for this issue:
https://github.com/apache/spark/pull/22162

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.
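
For comparison, the existing Dataset API already exposes both limits per call; the config keys quoted above are the reporter's proposal and do not exist yet. A minimal sketch using the long-standing {{show(numRows, truncate)}} overloads:
{code:scala}
// Hedged sketch: per-call control over the rows shown and the truncation
// width, using the existing Dataset.show overloads.
val df = spark.range(0, 1000).selectExpr("id", "repeat('x', 200) AS wide_col")

df.show()          // default: 20 rows, cells truncated to 20 characters
df.show(50, 100)   // 50 rows, truncate each cell to 100 characters
df.show(50, false) // 50 rows, no truncation
{code}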



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24442:


Assignee: (was: Apache Spark)

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24442:


Assignee: Apache Spark

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Assignee: Apache Spark
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24442) Add configuration parameter to adjust the number of records and the characters per row before truncation when a user runs .show()

2018-08-20 Thread Andrew K Long (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586545#comment-16586545
 ] 

Andrew K Long commented on SPARK-24442:
---

I've created a pull request!

https://github.com/apache/spark/pull/22162

> Add configuration parameter to adjust the number of records and the characters 
> per row before truncation when a user runs .show()
> ---
>
> Key: SPARK-24442
> URL: https://issues.apache.org/jira/browse/SPARK-24442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Andrew K Long
>Priority: Minor
> Attachments: spark-adjustable-display-size.diff
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> Currently the number of characters displayed when a user runs the .show() 
> function on a data frame is hard coded. The current default is too small when 
> used with wider console widths.  This fix will add two parameters.
>  
> parameter: "spark.show.default.number.of.rows" default: "20"
> parameter: "spark.show.default.truncate.characters.per.column" default: "20"
>  
> This change will be backwards compatible and will not break any existing 
> functionality nor change the default display characteristics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25165) Cannot parse Hive Struct

2018-08-20 Thread Frank Yin (JIRA)
Frank Yin created SPARK-25165:
-

 Summary: Cannot parse Hive Struct
 Key: SPARK-25165
 URL: https://issues.apache.org/jira/browse/SPARK-25165
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.2.1
Reporter: Frank Yin


org.apache.spark.SparkException: Cannot recognize hive type string: 
struct,view.b:array>

 

My guess is that the dot (.) is causing issues for parsing.
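
To make the suspected trigger concrete, here is a minimal illustration (mine, not from the report) of a schema whose struct fields contain dots, built programmatically so no Hive type string has to be parsed; the original type string was mangled in transit, so the field types below are assumptions.
{code:scala}
import org.apache.spark.sql.types._

// Hedged sketch: struct fields whose names contain dots, e.g. "view.a".
val schema = StructType(Seq(
  StructField("col", StructType(Seq(
    StructField("view.a", ArrayType(StringType)),  // assumed element type
    StructField("view.b", ArrayType(StringType))   // assumed element type
  )))
))

println(schema.catalogString)  // the type string a catalog would store
{code}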



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24418) Upgrade to Scala 2.11.12

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586483#comment-16586483
 ] 

Apache Spark commented on SPARK-24418:
--

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/22160

> Upgrade to Scala 2.11.12
> 
>
> Key: SPARK-24418
> URL: https://issues.apache.org/jira/browse/SPARK-24418
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Scala 2.11.12+ will support JDK9+. However, this is not going to be a simple 
> version bump. 
> *loadFiles()* in *ILoop* was removed in Scala 2.11.12. We use it as a hack to 
> initialize Spark before the REPL sees any files.
> Issue filed in Scala community.
> https://github.com/scala/bug/issues/10913



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24639) Add three configs in the doc

2018-08-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24639.
---
Resolution: Won't Fix

> Add three configs in the doc
> 
>
> Key: SPARK-24639
> URL: https://issues.apache.org/jira/browse/SPARK-24639
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: xueyu
>Priority: Trivial
>
> Add some missing configs mentioned in spark-24560, which are 
> spark.python.task.killTimeout, spark.worker.driverTerminateTimeout and 
> spark.ui.consoleProgress.update.interval



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24834) Utils#nanSafeCompare{Double,Float} functions do not differ from normal java double/float comparison

2018-08-20 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24834.
---
Resolution: Won't Fix

See PR – I think we can't do this directly as it changes semantics 
unfortunately.

> Utils#nanSafeCompare{Double,Float} functions do not differ from normal java 
> double/float comparison
> ---
>
> Key: SPARK-24834
> URL: https://issues.apache.org/jira/browse/SPARK-24834
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Benjamin Duffield
>Priority: Minor
>
> Utils.scala contains two functions `nanSafeCompareDoubles` and 
> `nanSafeCompareFloats` which purport to have special handling of NaN values 
> in comparisons.
> The handling in these functions do not appear to differ from 
> java.lang.Double.compare and java.lang.Float.compare - they seem to produce 
> identical output to the built-in java comparison functions.
> I think it's clearer to not have these special Utils functions, and instead 
> just use the standard java comparison functions.
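
A quick check (my own sketch, not from the PR) of the claim that the built-in Java comparators already give NaN a consistent ordering, which is presumably what the Utils helpers were added for:
{code:scala}
// java.lang.Double.compare / Float.compare already define a total order in
// which NaN equals NaN and sorts greater than every other value.
println(java.lang.Double.compare(Double.NaN, Double.NaN))              // 0
println(java.lang.Double.compare(Double.NaN, Double.PositiveInfinity)) // 1
println(java.lang.Double.compare(1.0, Double.NaN))                     // -1
println(java.lang.Float.compare(Float.NaN, 1.0f))                      // 1
{code}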



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25164) Parquet reader builds entire list of columns once for each column

2018-08-20 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-25164:
-

 Summary: Parquet reader builds entire list of columns once for 
each column
 Key: SPARK-25164
 URL: https://issues.apache.org/jira/browse/SPARK-25164
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Bruce Robbins


{{VectorizedParquetRecordReader.initializeInternal}} loops through each column, 
and for each column it calls
{noformat}
requestedSchema.getColumns().get(i)
{noformat}
However, {{MessageType.getColumns}} will build the entire column list from 
getPaths(0).
{noformat}
  public List getColumns() {
List paths = this.getPaths(0);
List columns = new 
ArrayList(paths.size());
for (String[] path : paths) {
  // TODO: optimize this

  PrimitiveType primitiveType = getType(path).asPrimitiveType();
  columns.add(new ColumnDescriptor(
  path,
  primitiveType,
  getMaxRepetitionLevel(path),
  getMaxDefinitionLevel(path)));
}
return columns;
  }
{noformat}
This means that for each parquet file, this routine indirectly iterates 
colCount*colCount times.

This is actually not particularly noticeable unless you have:
 - many parquet files
 - many columns

To verify that this is an issue, I created a 1 million record parquet table 
with 6000 columns of type double and 67 files (so initializeInternal is called 
67 times). I ran the following query:
{noformat}
sql("select * from 6000_1m_double where id1 = 1").collect
{noformat}
I used Spark from the master branch. I had 8 executor threads. The filter 
returns only a few thousand records. The query ran (on average) for 6.4 minutes.

Then I cached the column list at the top of {{initializeInternal}} as follows:
{noformat}
List columnCache = requestedSchema.getColumns();
{noformat}
Then I changed {{initializeInternal}} to use {{columnCache}} rather than 
{{requestedSchema.getColumns()}}.

With the column cache variable, the same query runs in 5 minutes. So with my 
simple query, you save 22% of the time by not rebuilding the column list for 
each column.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection

2018-08-20 Thread Eric Wohlstadter (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586389#comment-16586389
 ] 

Eric Wohlstadter commented on SPARK-21375:
--

[~bryanc]

Thanks for the feedback. I think the part that makes my case different is that 
the producer of the TimeStampMicroTZVector and the consumer (Spark) are in 
different processes, because Arrow data is coming to Spark via an 
ArrowInputStream over a TCP socket.

Since the timezone of the TimeStampMicroTZVector needs to be fixed on the 
producer side (not Spark), the producer process won't have a 
"spark.sql.session.timeZone".

In the current Arrow Java API, the timezone of any TimeStampTZVector can't be 
mutated after creation.

I'm not sure whether the timezone is just something that is interpreted on read 
by the consumer (in which case it seems like it should be possible to mutate the 
timezone of the vector), or whether the timezone is actually used when encoding 
the bytes for entries in the value vectors (in which case it would not be 
possible to mutate it).

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---
>
> Key: SPARK-21375
> URL: https://issues.apache.org/jira/browse/SPARK-21375
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.3.0
>
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using 
> ArrowConverters.  These are common types for data analysis used in both Spark 
> and Pandas and should be supported.
> There is a discrepancy with the way that PySpark and Arrow store timestamps, 
> without timezone specified, internally.  PySpark takes a UTC timestamp that 
> is adjusted to local time and Arrow is in UTC time.  Hopefully there is a 
> clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2018-08-20 Thread James Carter (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586378#comment-16586378
 ] 

James Carter commented on SPARK-24432:
--

We're looking to shift our production Spark workloads from Mesos to Kubernetes. 
 Consider this an up-vote for Spark Shuffle Service and Dynamic Allocation of 
executors.  I would be happy to participate in beta-testing this feature.

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21375) Add date and timestamp support to ArrowConverters for toPandas() collection

2018-08-20 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586369#comment-16586369
 ] 

Bryan Cutler commented on SPARK-21375:
--

Hi [~ewohlstadter], the timestamp values should be in UTC.  I'm not sure if 
that's what Hive uses or not, but if not then you will need to do a conversion 
there. Then when building an Arrow TimeStampMicroTZVector, use the conf 
"spark.sql.session.timeZone" for the timezone, which should make things 
consistent. Hope that helps.
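
For reference, a minimal sketch (my own, assuming the producer can be handed the timezone string out of band) of resolving the session timezone that the Arrow conversion keys off:
{code:scala}
import java.util.TimeZone

// Hedged sketch: resolve the timezone used for Arrow timestamp conversion.
// If spark.sql.session.timeZone is not set, the JVM default applies.
val sessionTz = spark.conf
  .getOption("spark.sql.session.timeZone")
  .getOrElse(TimeZone.getDefault.getID)

// A producer in another process would need this string when it creates its
// TimeStampMicroTZVector, since the vector's timezone is fixed at creation.
println(s"Session timezone for Arrow timestamps: $sessionTz")
{code}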

> Add date and timestamp support to ArrowConverters for toPandas() collection
> ---
>
> Key: SPARK-21375
> URL: https://issues.apache.org/jira/browse/SPARK-21375
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.3.0
>
>
> Date and timestamp are not yet supported in DataFrame.toPandas() using 
> ArrowConverters.  These are common types for data analysis used in both Spark 
> and Pandas and should be supported.
> There is a discrepancy with the way that PySpark and Arrow store timestamps, 
> without timezone specified, internally.  PySpark takes a UTC timestamp that 
> is adjusted to local time and Arrow is in UTC time.  Hopefully there is a 
> clean way to resolve this.
> Spark internal storage spec:
> * *DateType* stored as days
> * *Timestamp* stored as microseconds 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25162) Kubernetes 'in-cluster' client mode and value of spark.driver.host

2018-08-20 Thread James Carter (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Carter updated SPARK-25162:
-
Summary: Kubernetes 'in-cluster' client mode and value of spark.driver.host 
 (was: Kubernetes 'in-cluster' client mode and value spark.driver.host)

> Kubernetes 'in-cluster' client mode and value of spark.driver.host
> --
>
> Key: SPARK-25162
> URL: https://issues.apache.org/jira/browse/SPARK-25162
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: A java program, deployed to kubernetes, that establishes 
> a Spark Context in client mode. 
> Not using spark-submit.
> Kubernetes 1.10
> AWS EKS
>  
>  
>Reporter: James Carter
>Priority: Minor
>
> When creating Kubernetes scheduler 'in-cluster' using client mode, the value 
> for spark.driver.host can be derived from the IP address of the driver pod.
> I observed that the value of _spark.driver.host_ defaulted to the value of 
> _spark.kubernetes.driver.pod.name_, which is not a valid hostname.  This 
> caused the executors to fail to establish a connection back to the driver.
> As a work around, in my configuration I pass the driver's pod name _and_ the 
> driver's ip address to ensure that executors can establish a connection with 
> the driver.
> _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: 
> metadata.name
> _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp
> e.g.
> Deployment:
> {noformat}
> env:
> - name: DRIVER_POD_NAME
>   valueFrom:
> fieldRef:
>   fieldPath: metadata.name
> - name: DRIVER_POD_IP
>   valueFrom:
> fieldRef:
>   fieldPath: status.podIP
> {noformat}
>  
> Application Properties:
> {noformat}
> config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME}
> config[spark.driver.host]: ${DRIVER_POD_IP}
> {noformat}
>  
> BasicExecutorFeatureStep.scala:
> {code:java}
> private val driverUrl = RpcEndpointAddress(
> kubernetesConf.get("spark.driver.host"),
> kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT),
> CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
> {code}
>  
> Ideally only _spark.kubernetes.driver.pod.name_ would need to be provided in 
> this deployment scenario.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25162) Kubernetes 'in-cluster' client mode and value spark.driver.host

2018-08-20 Thread James Carter (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Carter updated SPARK-25162:
-
Description: 
When creating Kubernetes scheduler 'in-cluster' using client mode, the value 
for spark.driver.host can be derived from the IP address of the driver pod.

I observed that the value of _spark.driver.host_ defaulted to the value of 
_spark.kubernetes.driver.pod.name_, which is not a valid hostname.  This caused 
the executors to fail to establish a connection back to the driver.

As a work around, in my configuration I pass the driver's pod name _and_ the 
driver's ip address to ensure that executors can establish a connection with 
the driver.

_spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: 
metadata.name

_spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp

e.g.

Deployment:
{noformat}
env:
- name: DRIVER_POD_NAME
  valueFrom:
fieldRef:
  fieldPath: metadata.name
- name: DRIVER_POD_IP
  valueFrom:
fieldRef:
  fieldPath: status.podIP
{noformat}
 

Application Properties:
{noformat}
config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME}
config[spark.driver.host]: ${DRIVER_POD_IP}
{noformat}
 

BasicExecutorFeatureStep.scala:
{code:java}
private val driverUrl = RpcEndpointAddress(
kubernetesConf.get("spark.driver.host"),
kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT),
CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
{code}
 

Ideally only _spark.kubernetes.driver.pod.name_ would need to be provided in this 
deployment scenario.

 

 

 

  was:
When creating Kubernetes scheduler 'in-cluster' using client mode, the value 
for spark.driver.host can be derived from the IP address of the driver pod.

I observed that the value of _spark.driver.host_ defaulted to the value of 
_spark.kubernetes.driver.pod.name_, which is not a valid hostname.  This caused 
the executors to fail to establish a connection back to the driver.

As a work around, in my configuration I pass the driver's pod name _and_ the 
driver's ip address to ensure that executors can establish a connection with 
the driver.

_spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: 
metadata.name

_spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp

 

Ideally only _spark.kubernetes.driver.pod.name_ need be provided in this 
deployment scenario.

 

 

 


> Kubernetes 'in-cluster' client mode and value spark.driver.host
> ---
>
> Key: SPARK-25162
> URL: https://issues.apache.org/jira/browse/SPARK-25162
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: A java program, deployed to kubernetes, that establishes 
> a Spark Context in client mode. 
> Not using spark-submit.
> Kubernetes 1.10
> AWS EKS
>  
>  
>Reporter: James Carter
>Priority: Minor
>
> When creating Kubernetes scheduler 'in-cluster' using client mode, the value 
> for spark.driver.host can be derived from the IP address of the driver pod.
> I observed that the value of _spark.driver.host_ defaulted to the value of 
> _spark.kubernetes.driver.pod.name_, which is not a valid hostname.  This 
> caused the executors to fail to establish a connection back to the driver.
> As a work around, in my configuration I pass the driver's pod name _and_ the 
> driver's ip address to ensure that executors can establish a connection with 
> the driver.
> _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: 
> metadata.name
> _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp
> e.g.
> Deployment:
> {noformat}
> env:
> - name: DRIVER_POD_NAME
>   valueFrom:
> fieldRef:
>   fieldPath: metadata.name
> - name: DRIVER_POD_IP
>   valueFrom:
> fieldRef:
>   fieldPath: status.podIP
> {noformat}
>  
> Application Properties:
> {noformat}
> config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME}
> config[spark.driver.host]: ${DRIVER_POD_IP}
> {noformat}
>  
> BasicExecutorFeatureStep.scala:
> {code:java}
> private val driverUrl = RpcEndpointAddress(
> kubernetesConf.get("spark.driver.host"),
> kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT),
> CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
> {code}
>  
> Ideally only _spark.kubernetes.driver.pod.name_ would need to be provided in 
> this deployment scenario.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression

2018-08-20 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25163:


 Summary: Flaky test: 
o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression
 Key: SPARK-25163
 URL: https://issues.apache.org/jira/browse/SPARK-25163
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


I saw it fail multiple times on Jenkins:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25162) Kubernetes 'in-cluster' client mode and value spark.driver.host

2018-08-20 Thread James Carter (JIRA)
James Carter created SPARK-25162:


 Summary: Kubernetes 'in-cluster' client mode and value 
spark.driver.host
 Key: SPARK-25162
 URL: https://issues.apache.org/jira/browse/SPARK-25162
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
 Environment: A java program, deployed to kubernetes, that establishes 
a Spark Context in client mode. 

Not using spark-submit.

Kubernetes 1.10

AWS EKS

 

 
Reporter: James Carter


When creating Kubernetes scheduler 'in-cluster' using client mode, the value 
for spark.driver.host can be derived from the IP address of the driver pod.

I observed that the value of _spark.driver.host_ defaulted to the value of 
_spark.kubernetes.driver.pod.name_, which is not a valid hostname.  This caused 
the executors to fail to establish a connection back to the driver.

As a work around, in my configuration I pass the driver's pod name _and_ the 
driver's ip address to ensure that executors can establish a connection with 
the driver.

_spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: 
metadata.name

_spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp

 

Ideally only _spark.kubernetes.driver.pod.name_ need be provided in this 
deployment scenario.

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-20 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586314#comment-16586314
 ] 

Yinan Li commented on SPARK-24434:
--

[~skonto] I will make sure the assignee gets properly set for future JIRAs. 
[~onursatici], if you would like to work on the implementation, please make 
sure you read through the design doc from [~skonto] and make the implementation 
follow what the design proposes. Thanks! 

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25161:


Assignee: (was: Apache Spark)

> Fix several bugs in failure handling of barrier execution mode
> --
>
> Key: SPARK-25161
> URL: https://issues.apache.org/jira/browse/SPARK-25161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Fix several bugs in failure handling of barrier execution mode:
> * Mark TaskSet for a barrier stage as zombie when a task attempt fails;
> * Multiple barrier task failures from a single barrier stage should not 
> trigger multiple stage retries;
> * Barrier task failure from a previous failed stage attempt should not 
> trigger stage retry;
> * Fail the job when a task from a barrier ResultStage failed.
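
For readers unfamiliar with the mode these fixes target, here is a minimal usage sketch of barrier execution (my own illustration of the Spark 2.4 RDD barrier API, not code from the fix):
{code:scala}
import org.apache.spark.BarrierTaskContext

// All tasks of a barrier stage launch together and can synchronize; a single
// task failure is supposed to fail and retry the whole stage, which is the
// failure-handling path the bugs above concern. `sc` is the SparkContext.
val rdd = sc.parallelize(1 to 100, numSlices = 4)

val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()        // wait until every task in the stage reaches this point
  iter.map(_ * 2)
}.collect()
{code}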



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25161:


Assignee: Apache Spark

> Fix several bugs in failure handling of barrier execution mode
> --
>
> Key: SPARK-25161
> URL: https://issues.apache.org/jira/browse/SPARK-25161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Major
>
> Fix several bugs in failure handling of barrier execution mode:
> * Mark TaskSet for a barrier stage as zombie when a task attempt fails;
> * Multiple barrier task failures from a single barrier stage should not 
> trigger multiple stage retries;
> * Barrier task failure from a previous failed stage attempt should not 
> trigger stage retry;
> * Fail the job when a task from a barrier ResultStage failed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586268#comment-16586268
 ] 

Apache Spark commented on SPARK-25161:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/22158

> Fix several bugs in failure handling of barrier execution mode
> --
>
> Key: SPARK-25161
> URL: https://issues.apache.org/jira/browse/SPARK-25161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Fix several bugs in failure handling of barrier execution mode:
> * Mark TaskSet for a barrier stage as zombie when a task attempt fails;
> * Multiple barrier task failures from a single barrier stage should not 
> trigger multiple stage retries;
> * Barrier task failure from a previous failed stage attempt should not 
> trigger stage retry;
> * Fail the job when a task from a barrier ResultStage failed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode

2018-08-20 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-25161:


 Summary: Fix several bugs in failure handling of barrier execution 
mode
 Key: SPARK-25161
 URL: https://issues.apache.org/jira/browse/SPARK-25161
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Fix several bugs in failure handling of barrier execution mode:
* Mark TaskSet for a barrier stage as zombie when a task attempt fails;
* Multiple barrier task failures from a single barrier stage should not trigger 
multiple stage retries;
* Barrier task failure from a previous failed stage attempt should not trigger 
stage retry;
* Fail the job when a task from a barrier ResultStage failed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25126) avoid creating OrcFile.Reader for all orc files

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586237#comment-16586237
 ] 

Apache Spark commented on SPARK-25126:
--

User 'raofu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22157

> avoid creating OrcFile.Reader for all orc files
> ---
>
> Key: SPARK-25126
> URL: https://issues.apache.org/jira/browse/SPARK-25126
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Rao Fu
>Priority: Minor
>
> We have a spark job that starts by reading orc files under an S3 directory 
> and we noticed the job consumes a lot of memory when both the number of orc 
> files and the sizes of the files are large. The memory bloat went away with the 
> following workaround.
> 1) create a DataSet from a single orc file.
> Dataset rowsForFirstFile = spark.read().format("orc").load(oneFile);
> 2) when creating DataSet from all files under the directory, use the 
> schema from the previous DataSet.
> Dataset rows = 
> spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
> I believe the issue is due to the fact that, in order to infer the schema, a 
> FileReader is created for each orc file under the directory although only the 
> first one is used. The FileReader creation loads the metadata of the orc file 
> and the memory consumption is very high when there are many files under the 
> directory.
> The issue exists in both 2.0 and HEAD.
> In 2.0, OrcFileOperator.readSchema is used.
> [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95]
> In HEAD, OrcUtils.readSchema is used.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82
>  
>  
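A self-contained Scala version of the workaround described above (the S3 paths are placeholders):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-schema-workaround").getOrCreate()

// 1) Infer the schema from one ORC file only...
val oneFile = "s3a://bucket/table/part-00000.orc"   // placeholder
val firstFile = spark.read.format("orc").load(oneFile)

// 2) ...then reuse that schema for the whole directory, so schema inference
//    does not open a reader for every file under the path.
val path = "s3a://bucket/table/"                    // placeholder
val rows = spark.read.schema(firstFile.schema).format("orc").load(path)
{code}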



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23573) Create linter rule to prevent misuse of SparkContext.hadoopConfiguration in SQL modules

2018-08-20 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23573.
-
   Resolution: Duplicate
 Assignee: Gengliang Wang
Fix Version/s: 2.4.0

> Create linter rule to prevent misuse of SparkContext.hadoopConfiguration in 
> SQL modules
> ---
>
> Key: SPARK-23573
> URL: https://issues.apache.org/jira/browse/SPARK-23573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> SparkContext.hadoopConfiguration should not be used directly in SQL modules.
> Instead, one should always use sessionState.newHadoopConfiguration(), which 
> blends in configs set per session.
> The idea is to add a linter rule to ban it.
>  - Restrict the linter rule to the components that use SQL. Use the parameter 
> `scalastyleSources`?
>  - Exclude genuinely valid uses, e.g. in SessionState (ok, that can be 
> done per case with scalastyle:off in the code).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25126) avoid creating OrcFile.Reader for all orc files

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25126:


Assignee: (was: Apache Spark)

> avoid creating OrcFile.Reader for all orc files
> ---
>
> Key: SPARK-25126
> URL: https://issues.apache.org/jira/browse/SPARK-25126
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Rao Fu
>Priority: Minor
>
> We have a Spark job that starts by reading orc files under an S3 directory, 
> and we noticed the job consumes a lot of memory when both the number of orc 
> files and the size of each file are large. The memory bloat went away with 
> the following workaround.
> 1) Create a Dataset from a single orc file.
> Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);
> 2) When creating a Dataset from all files under the directory, use the 
> schema from the previous Dataset.
> Dataset<Row> rows = 
> spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
> I believe the issue is due to the fact that, in order to infer the schema, a 
> FileReader is created for each orc file under the directory although only 
> the first one is used. The FileReader creation loads the metadata of the orc 
> file, and the memory consumption is very high when there are many files 
> under the directory.
> The issue exists in both 2.0 and HEAD.
> In 2.0, OrcFileOperator.readSchema is used.
> [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95]
> In HEAD, OrcUtils.readSchema is used.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25126) avoid creating OrcFile.Reader for all orc files

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25126:


Assignee: Apache Spark

> avoid creating OrcFile.Reader for all orc files
> ---
>
> Key: SPARK-25126
> URL: https://issues.apache.org/jira/browse/SPARK-25126
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.1
>Reporter: Rao Fu
>Assignee: Apache Spark
>Priority: Minor
>
> We have a Spark job that starts by reading orc files under an S3 directory, 
> and we noticed the job consumes a lot of memory when both the number of orc 
> files and the size of each file are large. The memory bloat went away with 
> the following workaround.
> 1) Create a Dataset from a single orc file.
> Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);
> 2) When creating a Dataset from all files under the directory, use the 
> schema from the previous Dataset.
> Dataset<Row> rows = 
> spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
> I believe the issue is due to the fact that, in order to infer the schema, a 
> FileReader is created for each orc file under the directory although only 
> the first one is used. The FileReader creation loads the metadata of the orc 
> file, and the memory consumption is very high when there are many files 
> under the directory.
> The issue exists in both 2.0 and HEAD.
> In 2.0, OrcFileOperator.readSchema is used.
> [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95]
> In HEAD, OrcUtils.readSchema is used.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586168#comment-16586168
 ] 

Apache Spark commented on SPARK-25144:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22156

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Priority: Major
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is 
> not called.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586138#comment-16586138
 ] 

Apache Spark commented on SPARK-25144:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22155

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Priority: Major
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is 
> not called.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25140:


Assignee: (was: Apache Spark)

> Add optional logging to UnsafeProjection.create when it falls back to 
> interpreted mode
> --
>
> Key: SPARK-25140
> URL: https://issues.apache.org/jira/browse/SPARK-25140
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> SPARK-23711 implemented nice graceful handling, allowing UnsafeProjection 
> to fall back to interpreted mode when codegen fails. That makes Spark much 
> more usable even when codegen is unable to handle the given query.
> But in its current form, the fallback handling can also be a mystery in terms 
> of performance cliffs. Users may be left wondering why a query runs fine with 
> some expressions, but then with just one extra expression it runs 2x, 3x (or 
> more) slower.
> It'd be nice to have optional logging of the fallback behavior, so that users 
> who care about monitoring performance cliffs can opt in to logging when a 
> fallback to interpreted mode is taken, i.e. at
> https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183
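A hedged sketch (not Spark's actual code) of the general shape being asked for: the usual try-codegen / fall-back-to-interpreted pattern with an opt-in warning, where {{logFallback}} stands in for a hypothetical configuration flag:
{code}
import org.slf4j.LoggerFactory
import scala.util.control.NonFatal

object FallbackSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Try the code-generated path first; on failure, optionally log and use the
  // interpreted path instead.
  def createWithFallback[T](codegen: () => T, interpreted: () => T, logFallback: Boolean): T =
    try codegen() catch {
      case NonFatal(e) =>
        if (logFallback) {
          log.warn(s"Codegen failed, falling back to interpreted mode: ${e.getMessage}")
        }
        interpreted()
    }
}
{code}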



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586091#comment-16586091
 ] 

Apache Spark commented on SPARK-25140:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22154

> Add optional logging to UnsafeProjection.create when it falls back to 
> interpreted mode
> --
>
> Key: SPARK-25140
> URL: https://issues.apache.org/jira/browse/SPARK-25140
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> SPARK-23711 implemented nice graceful handling, allowing UnsafeProjection 
> to fall back to interpreted mode when codegen fails. That makes Spark much 
> more usable even when codegen is unable to handle the given query.
> But in its current form, the fallback handling can also be a mystery in terms 
> of performance cliffs. Users may be left wondering why a query runs fine with 
> some expressions, but then with just one extra expression it runs 2x, 3x (or 
> more) slower.
> It'd be nice to have optional logging of the fallback behavior, so that users 
> who care about monitoring performance cliffs can opt in to logging when a 
> fallback to interpreted mode is taken, i.e. at
> https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25140:


Assignee: Apache Spark

> Add optional logging to UnsafeProjection.create when it falls back to 
> interpreted mode
> --
>
> Key: SPARK-25140
> URL: https://issues.apache.org/jira/browse/SPARK-25140
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-23711 implemented nice graceful handling, allowing UnsafeProjection 
> to fall back to interpreted mode when codegen fails. That makes Spark much 
> more usable even when codegen is unable to handle the given query.
> But in its current form, the fallback handling can also be a mystery in terms 
> of performance cliffs. Users may be left wondering why a query runs fine with 
> some expressions, but then with just one extra expression it runs 2x, 3x (or 
> more) slower.
> It'd be nice to have optional logging of the fallback behavior, so that users 
> who care about monitoring performance cliffs can opt in to logging when a 
> fallback to interpreted mode is taken, i.e. at
> https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23711) Add fallback to interpreted execution logic

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586090#comment-16586090
 ] 

Apache Spark commented on SPARK-23711:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22154

> Add fallback to interpreted execution logic
> ---
>
> Key: SPARK-23711
> URL: https://issues.apache.org/jira/browse/SPARK-23711
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25147) GroupedData.apply pandas_udf crashing

2018-08-20 Thread Mike Sukmanowsky (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585911#comment-16585911
 ] 

Mike Sukmanowsky commented on SPARK-25147:
--

Confirmed, works with:

Python 2.7.15

PyArrow==0.9.0.post1

numpy==1.11.3

pandas==0.19.2

> GroupedData.apply pandas_udf crashing
> -
>
> Key: SPARK-25147
> URL: https://issues.apache.org/jira/browse/SPARK-25147
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: OS: Mac OS 10.13.6
> Python: 2.7.15, 3.6.6
> PyArrow: 0.10.0
> Pandas: 0.23.4
> Numpy: 1.15.0
>Reporter: Mike Sukmanowsky
>Priority: Major
>
> Running the following example taken straight from the docs results in 
> {{org.apache.spark.SparkException: Python worker exited unexpectedly 
> (crashed)}} for reasons that aren't clear from any logs I can see:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> spark = (
> SparkSession
> .builder
> .appName("pandas_udf")
> .getOrCreate()
> )
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v")
> )
> @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
> def normalize(pdf):
>     v = pdf.v
>     return pdf.assign(v=(v - v.mean()) / v.std())
> (
>     df
>     .groupby("id")
>     .apply(normalize)
>     .show()
> )
> {code}
>  See output.log for 
> [stacktrace|https://gist.github.com/msukmanowsky/b9cb6700e8ccaf93f265962000403f28].
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2018-08-20 Thread Lucas Partridge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527351#comment-16527351
 ] 

Lucas Partridge edited comment on SPARK-19498 at 8/20/18 1:03 PM:
--

Ok great. Here's my feedback after wrapping a large complex Python algorithm 
for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond 
pyspark too.
 # The inability to save and load custom pyspark 
models/pipelines/pipelinemodels is an absolute showstopper. Training models can 
take hours and so we need to be able to save and reload models. Pending the 
availability of https://issues.apache.org/jira/browse/SPARK-17025 I used a 
refinement of [https://stackoverflow.com/a/49195515/1843329] to work around 
this. Had this not been solved no further work would have been done.
 # Support for saving/loading more param types (e.g., dict) would be great. I 
had to use json.dumps to convert our algorithm's internal model into a string 
and then pretend it was a string param in order to save and load that with the 
rest of the transformer.
 # Given that pipelinemodels can be saved we also need the ability to export 
them easily for deployment on other clusters. The cluster where you train the 
model may be different to the one where you deploy it for predictions. A hack 
workaround is to use hdfs commands to copy the relevant files and directories 
but it would be great if we had simple single export/import commands in pyspark 
to move models/pipelines/pipelinemodels easily between clusters and to allow 
artifacts to be stored off-cluster.
 # Creating individual parameters with getters and setters is tedious and 
error-prone, especially if writing docs inline too. It would be great if as 
much of this boilerplate as possible could be auto-generated from a simple 
parameter definition (see the sketch after this list). I always groan when 
someone asks for an extra param at the moment!
 # The ML Pipeline API seems to assume all the params lie on the estimator and 
none on the transformer. In the algorithm I wrapped, the model/transformer has 
numerous params that are specific to it rather than to the estimator. 
PipelineModel needs a getStages() command (just as Pipeline does) to get at the 
model so you can parameterise it. I had to use the undocumented .stages member 
instead.  But then if you want to call transform() on a pipelinemodel 
immediately after fitting it you also need some ability to set the 
model/transformer params in advance. I got around this by defining a params 
class for the estimator-only params and another class for the model-only 
params. I made the estimator inherit from both these classes and the model 
inherit from only the model-base params class. The estimator then just passes 
through any model-specific params to the model when it creates it at the end of 
its fit() method. But, to distinguish the model-only params from the estimator 
(e.g., when listing the params on the estimator) I had to prefix all the 
model-only params with a common value to identify them. This works but it's 
clunky and ugly.
 # The algorithm I ported works naturally with individually named column 
inputs. But the existing ML Pipeline library prefers DenseVectors. I ended up 
having to support both types of inputs - if a DenseVector input was 'None' I 
would take the data directly from the individually named  columns instead. If 
users want to use the algorithm by itself they can use the column-based input 
approach; if they want to work with algorithms from the built-in library (e.g., 
StandardScaler, Binarizer, etc) they can use the DenseVector approach instead.  
Again this works but is clunky because you're having to handle two different 
forms of input inside the same implementation. Also DenseVectors are limited by 
their inability to handle missing values.
 # Similarly, I wanted to produce multiple separate columns for the outputs of 
the model's transform() method whereas most built-in algorithms seem to use a 
single DenseVector output column. DataFrame's withColumn() method could do with 
a withColumns() equivalent to make it easy to add multiple columns to a 
Dataframe instead of just one column at a time.
 # Documentation explaining how to create a custom estimator and transformer 
(preferably one with transformer-specific params) would be extremely useful for 
people. Most of what I learned I gleaned off StackOverflow and from looking at 
Spark's pipeline code.

Hope this list will be useful for improving ML Pipelines in future versions of 
Spark!
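To make point 4 concrete, here is a minimal Scala sketch (class and column names are made up) of the per-parameter boilerplate a custom Transformer currently requires; the pyspark equivalent has the same shape:
{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.upper
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical transformer: each param needs its own field, getter and setter.
class UpperCaseColumn(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upperCaseColumn"))

  final val inputCol = new Param[String](this, "inputCol", "name of the input column")
  def getInputCol: String = $(inputCol)
  def setInputCol(value: String): this.type = set(inputCol, value)

  final val outputCol = new Param[String](this, "outputCol", "name of the output column")
  def getOutputCol: String = $(outputCol)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), upper(dataset.col($(inputCol))))

  override def transformSchema(schema: StructType): StructType =
    schema.add($(outputCol), StringType)

  override def copy(extra: ParamMap): UpperCaseColumn = defaultCopy(extra)
}
{code}
Usage would be {{new UpperCaseColumn().setInputCol("name").setOutputCol("name_upper")}}, and the field/getter/setter triple repeats for every additional param.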


was (Author: lucas.partridge):
Ok great. Here's my feedback after wrapping a large complex Python algorithm 
for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond 
pyspark too.
 # The inability to save and load custom pyspark 
models/pipelines/pipelinemodels is an absolute showstopper. Training models can 
take hours and so we need to be 

[jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2018-08-20 Thread Lucas Partridge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527351#comment-16527351
 ] 

Lucas Partridge edited comment on SPARK-19498 at 8/20/18 1:02 PM:
--

Ok great. Here's my feedback after wrapping a large complex Python algorithm 
for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond 
pyspark too.
 # The inability to save and load custom pyspark 
models/pipelines/pipelinemodels is an absolute showstopper. Training models can 
take hours and so we need to be able to save and reload models. Pending the 
availability of https://issues.apache.org/jira/browse/SPARK-17025 I used a 
refinement of [https://stackoverflow.com/a/49195515/1843329] to work around 
this. Had this not been solved no further work would have been done.


 # Support for saving/loading more param types (e.g., dict) would be great. I 
had to use json.dumps to convert our algorithm's internal model into a string 
and then pretend it was a string param in order to save and load that with the 
rest of the transformer.


 # Given that pipelinemodels can be saved we also need the ability to export 
them easily for deployment on other clusters. The cluster where you train the 
model may be different to the one where you deploy it for predictions. A hack 
workaround is to use hdfs commands to copy the relevant files and directories 
but it would be great if we had simple single export/import commands in pyspark 
to move models/pipelines/pipelinemodels easily between clusters and to allow 
artifacts to be stored off-cluster.


 # Creating individual parameters with getters and setters is tedious and 
error-prone, especially if writing docs inline too. It would be great if as 
much of this boiler-plate as possible could be auto-generated from a simple 
parameter definition. I always groan when someone asks for an extra param at 
the moment!


 # The ML Pipeline API seems to assume all the params lie on the estimator and 
none on the transformer. In the algorithm I wrapped, the model/transformer has 
numerous params that are specific to it rather than to the estimator. 
PipelineModel needs a getStages() command (just as Pipeline does) to get at the 
model so you can parameterise it. I had to use the undocumented .stages member 
instead.  But then if you want to call transform() on a pipelinemodel 
immediately after fitting it you also need some ability to set the 
model/transformer params in advance. I got around this by defining a params 
class for the estimator-only params and another class for the model-only 
params. I made the estimator inherit from both these classes and the model 
inherit from only the model-base params class. The estimator then just passes 
through any model-specific params to the model when it creates it at the end of 
its fit() method. But, to distinguish the model-only params from the estimator 
(e.g., when listing the params on the estimator) I had to prefix all the 
model-only params with a common value to identify them. This works but it's 
clunky and ugly.


 # The algorithm I ported works naturally with individually named column 
inputs. But the existing ML Pipeline library prefers DenseVectors. I ended up 
having to support both types of inputs - if a DenseVector input was 'None' I 
would take the data directly from the individually named  columns instead. If 
users want to use the algorithm by itself they can use the column-based input 
approach; if they want to work with algorithms from the built-in library (e.g., 
StandardScaler, Binarizer, etc) they can use the DenseVector approach instead.  
Again this works but is clunky because you're having to handle two different 
forms of input inside the same implementation. Also DenseVectors are limited by 
their inability to handle missing values.


 # Similarly, I wanted to produce multiple separate columns for the outputs of 
the model's transform() method whereas most built-in algorithms seem to use a 
single DenseVector output column. DataFrame's withColumn() method could do with 
a withColumns() equivalent to make it easy to add multiple columns to a 
Dataframe instead of just one column at a time.


 # Documentation explaining how to create a custom estimator and transformer 
(preferably one with transformer-specific params) would be extremely useful for 
people. Most of what I learned I gleaned off StackOverflow and from looking at 
Spark's pipeline code.

Hope this list will be useful for improving ML Pipelines in future versions of 
Spark!


was (Author: lucas.partridge):
Ok great. Here's my feedback after wrapping a large complex Python algorithm 
for ML Pipeline on Spark 2.2.0. Several of these comments probably apply beyond 
pyspark too.
 # The inability to save and load custom pyspark 
models/pipelines/pipelinemodels is an absolute showstopper. Training models can 
take hours and so 

[jira] [Assigned] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25160:


Assignee: Gengliang Wang

> Remove sql configuration spark.sql.avro.outputTimestampType
> ---
>
> Key: SPARK-25160
> URL: https://issues.apache.org/jira/browse/SPARK-25160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> In the PR for supporting logical timestamp types 
> [https://github.com/apache/spark/pull/21935], a SQL configuration 
> spark.sql.avro.outputTimestampType was added so that users can specify the 
> output timestamp precision they want.
> With PR [https://github.com/apache/spark/pull/21847], the output file can be 
> written with user-specified types, so there is no need for such a 
> configuration. Otherwise, to be consistent, we would need to add a 
> configuration for every Catalyst type that can be converted into different 
> Avro types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25160.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22151
[https://github.com/apache/spark/pull/22151]

> Remove sql configuration spark.sql.avro.outputTimestampType
> ---
>
> Key: SPARK-25160
> URL: https://issues.apache.org/jira/browse/SPARK-25160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> In the PR for supporting logical timestamp types 
> [https://github.com/apache/spark/pull/21935], a SQL configuration 
> spark.sql.avro.outputTimestampType was added so that users can specify the 
> output timestamp precision they want.
> With PR [https://github.com/apache/spark/pull/21847], the output file can be 
> written with user-specified types, so there is no need for such a 
> configuration. Otherwise, to be consistent, we would need to add a 
> configuration for every Catalyst type that can be converted into different 
> Avro types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23034) Display tablename for `HiveTableScan` node in UI

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585786#comment-16585786
 ] 

Apache Spark commented on SPARK-23034:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22153

> Display tablename for `HiveTableScan` node in UI
> 
>
> Key: SPARK-23034
> URL: https://issues.apache.org/jira/browse/SPARK-23034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.2.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> For queries which scan multiple tables, it would be convenient if the DAG 
> shown in the Spark UI also showed which table is being scanned. This will 
> make debugging easier. For this JIRA, I am scoping this to Hive table scans 
> only. For Spark native tables, the table scan node is abstracted out as a 
> `WholeStageCodegen` node (which might be doing more things besides table 
> scan).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-20 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 8/20/18 10:17 AM:
---

[~onursatici] I am working on this (check last comment above), but since I am 
on vacations I haven't created a PR. 

If you are going to do it, pls read the design doc and cover the listed cases, 
will review.

At some point we need to coordinate, [~liyinan926] next time pls assign people, 
I spent quite some time on this.


was (Author: skonto):
[~onursatici] I am working on this, but since I am on vacations I haven't 
created a PR. 

If you are going to do it, pls read the design doc and cover the listed cases, 
will review.

At some point we need to coordinate, [~liyinan926] next time pls assign people, 
I spent quite some time on this.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-20 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 8/20/18 9:34 AM:
--

[~onursatici] I am working on this, but since I am on vacations I haven't 
created a PR. 

If you are going to do it, pls read the design doc and cover the listed cases, 
will review.

At some point we need to coordinate, [~liyinan926] next time pls assign people, 
I spent quite some time on this.


was (Author: skonto):
[~onursatici] I am working on this, but since I am on vacations I haven't 
created a PR. 

If you are going to do it, pls read the design doc and cover the listed cases.

At some point we need to coordinate, [~liyinan926] next time pls assign people.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-20 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 8/20/18 9:32 AM:
--

[~onursatici] I am working on this, but since I am on vacations I haven't 
created a PR. 

If you are going to do it, pls read the design doc and cover the listed cases.

At some point we need to coordinate, [~liyinan926] next time pls assign people.


was (Author: skonto):
[~onursatici] I am working on this, but since I am on vacations I haven't 
created a PR. 

If you are going to do it, pls read the design doc and cover the listed cases.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-20 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585689#comment-16585689
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

[~onursatici] I am working on this, but since I am on vacations I haven't 
created a PR. 

If you are going to do it, pls read the design doc and cover the listed cases.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25159) json schema inference should only trigger one job

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25159:


Assignee: Apache Spark  (was: Wenchen Fan)

> json schema inference should only trigger one job
> -
>
> Key: SPARK-25159
> URL: https://issues.apache.org/jira/browse/SPARK-25159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25159) json schema inference should only trigger one job

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25159:


Assignee: Wenchen Fan  (was: Apache Spark)

> json schema inference should only trigger one job
> -
>
> Key: SPARK-25159
> URL: https://issues.apache.org/jira/browse/SPARK-25159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25159) json schema inference should only trigger one job

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585631#comment-16585631
 ] 

Apache Spark commented on SPARK-25159:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22152

> json schema inference should only trigger one job
> -
>
> Key: SPARK-25159
> URL: https://issues.apache.org/jira/browse/SPARK-25159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25160:


Assignee: Apache Spark

> Remove sql configuration spark.sql.avro.outputTimestampType
> ---
>
> Key: SPARK-25160
> URL: https://issues.apache.org/jira/browse/SPARK-25160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In the PR for supporting logical timestamp types 
> [https://github.com/apache/spark/pull/21935], a SQL configuration 
> spark.sql.avro.outputTimestampType was added so that users can specify the 
> output timestamp precision they want.
> With PR [https://github.com/apache/spark/pull/21847], the output file can be 
> written with user-specified types, so there is no need for such a 
> configuration. Otherwise, to be consistent, we would need to add a 
> configuration for every Catalyst type that can be converted into different 
> Avro types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType

2018-08-20 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25160:


Assignee: (was: Apache Spark)

> Remove sql configuration spark.sql.avro.outputTimestampType
> ---
>
> Key: SPARK-25160
> URL: https://issues.apache.org/jira/browse/SPARK-25160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the PR for supporting logical timestamp types 
> [https://github.com/apache/spark/pull/21935], a SQL configuration 
> spark.sql.avro.outputTimestampType was added so that users can specify the 
> output timestamp precision they want.
> With PR [https://github.com/apache/spark/pull/21847], the output file can be 
> written with user-specified types, so there is no need for such a 
> configuration. Otherwise, to be consistent, we would need to add a 
> configuration for every Catalyst type that can be converted into different 
> Avro types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType

2018-08-20 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585609#comment-16585609
 ] 

Apache Spark commented on SPARK-25160:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22151

> Remove sql configuration spark.sql.avro.outputTimestampType
> ---
>
> Key: SPARK-25160
> URL: https://issues.apache.org/jira/browse/SPARK-25160
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the PR for supporting logical timestamp types 
> [https://github.com/apache/spark/pull/21935], a SQL configuration 
> spark.sql.avro.outputTimestampType was added so that users can specify the 
> output timestamp precision they want.
> With PR [https://github.com/apache/spark/pull/21847], the output file can be 
> written with user-specified types, so there is no need for such a 
> configuration. Otherwise, to be consistent, we would need to add a 
> configuration for every Catalyst type that can be converted into different 
> Avro types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25160) Remove sql configuration spark.sql.avro.outputTimestampType

2018-08-20 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25160:
--

 Summary: Remove sql configuration 
spark.sql.avro.outputTimestampType
 Key: SPARK-25160
 URL: https://issues.apache.org/jira/browse/SPARK-25160
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


In the PR for supporting logical timestamp types 
[https://github.com/apache/spark/pull/21935], a SQL configuration 
spark.sql.avro.outputTimestampType was added so that users can specify the 
output timestamp precision they want.

With PR [https://github.com/apache/spark/pull/21847], the output file can be 
written with user-specified types.

So there is no need for such a configuration. Otherwise, to be consistent, we 
would need to add a configuration for every Catalyst type that can be 
converted into different Avro types. 
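A hedged Scala sketch of the replacement path (assuming Spark 2.4's built-in Avro module and its "avroSchema" write option; paths and data are placeholders): the desired timestamp precision travels in a user-supplied writer schema instead of a global SQL configuration.
{code}
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-timestamp-sketch").getOrCreate()
import spark.implicits._

val df = Seq(Timestamp.valueOf("2018-08-20 00:00:00")).toDF("ts")

// Ask for millisecond precision via the writer schema itself.
val avroSchema =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "ts", "type": [{"type": "long", "logicalType": "timestamp-millis"}, "null"]}
    |]}""".stripMargin
df.write.format("avro").option("avroSchema", avroSchema).save("/tmp/events_avro")
{code}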



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23714) Add metrics for cached KafkaConsumer

2018-08-20 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585587#comment-16585587
 ] 

Jungtaek Lim edited comment on SPARK-23714 at 8/20/18 8:06 AM:
---

[~yuzhih...@gmail.com]

Maybe we can just apply Apache Commons Pool (SPARK-25151) and leverage metrics. 
PR for SPARK-25151 is also ready to review.
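A hedged Scala sketch of what "leverage metrics" could look like, using only the counters Apache Commons Pool 2 exposes out of the box; {{FakeConsumer}} is a stand-in, not Spark's KafkaDataConsumer:
{code}
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}

class FakeConsumer
class FakeConsumerFactory extends BasePooledObjectFactory[FakeConsumer] {
  override def create(): FakeConsumer = new FakeConsumer
  override def wrap(c: FakeConsumer): PooledObject[FakeConsumer] = new DefaultPooledObject(c)
}

val pool = new GenericObjectPool[FakeConsumer](new FakeConsumerFactory)
val consumer = pool.borrowObject()
pool.returnObject(consumer)

// Built-in counters that could back the cache metrics this JIRA asks for.
println(s"borrowed=${pool.getBorrowedCount} created=${pool.getCreatedCount} " +
  s"idle=${pool.getNumIdle} active=${pool.getNumActive}")
{code}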


was (Author: kabhwan):
[~yuzhih...@gmail.com]

Maybe we can just apply Apache Commons Pool (SPARK-25251) and leverage metrics. 
[PR|https://github.com/apache/spark/pull/22138] for SPARK-25251 is also ready 
to review.

> Add metrics for cached KafkaConsumer
> 
>
> Key: SPARK-23714
> URL: https://issues.apache.org/jira/browse/SPARK-23714
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Ted Yu
>Priority: Major
>
> SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached 
> KafkaConsumer.
> This JIRA is to add metrics for measuring the operations of the cache so that 
> users can gain insight into the caching solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23714) Add metrics for cached KafkaConsumer

2018-08-20 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585587#comment-16585587
 ] 

Jungtaek Lim commented on SPARK-23714:
--

[~yuzhih...@gmail.com]

Maybe we can just apply Apache Commons Pool (SPARK-25251) and leverage metrics. 
[PR|https://github.com/apache/spark/pull/22138] for SPARK-25251 is also ready 
to review.

> Add metrics for cached KafkaConsumer
> 
>
> Key: SPARK-23714
> URL: https://issues.apache.org/jira/browse/SPARK-23714
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Ted Yu
>Priority: Major
>
> SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached 
> KafkaConsumer.
> This JIRA is to add metrics for measuring the operations of the cache so that 
> users can gain insight into the caching solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Ayoub Benali (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585554#comment-16585554
 ] 

Ayoub Benali commented on SPARK-25144:
--

[~jerryshao] according to the comments above, this bug doesn't happen on 
master, but I haven't tried it myself. 

 

thanks for the PR [~dongjoon]

> distinct on Dataset leads to exception due to Managed memory leak detected  
> 
>
> Key: SPARK-25144
> URL: https://issues.apache.org/jira/browse/SPARK-25144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>Reporter: Ayoub Benali
>Priority: Major
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> Produces the following stacktrace
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in 
> stage 7.0 (TID 125, localhost, executor driver): 
> org.apache.spark.SparkException: Managed memory leak detected; size = 
> 16777216 bytes, TID = 125
> [info]at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]at java.lang.Thread.run(Thread.java:748)
> [info] 
> [info] Driver stacktrace:
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at 
> org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is not
> called.






[jira] [Created] (SPARK-25159) json schema inference should only trigger one job

2018-08-20 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25159:
---

 Summary: json schema inference should only trigger one job
 Key: SPARK-25159
 URL: https://issues.apache.org/jira/browse/SPARK-25159
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan





