[jira] [Updated] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-05-21 Thread xdcjie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xdcjie updated SPARK-24339:
---
Priority: Minor  (was: Major)

> spark sql can not prune column in transform/map/reduce query
> 
>
> Key: SPARK-24339
> URL: https://issues.apache.org/jira/browse/SPARK-24339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: xdcjie
>Priority: Minor
>  Labels: map, reduce, sql, transform
> Fix For: 2.1.1, 2.1.2, 2.2.0, 2.2.1
>
>
> I was using {{TRANSFORM USING}} with branch-2.1/2.2 and noticed that it scans 
> all columns of the data for a query like:
> {code:java}
> SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
> test_table;{code}
> Its physical plan looks like:
> {code:java}
> == Physical Plan ==
> ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, 
> u2#786, c2#787], 
> HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
> )),List((field.delim,   
> )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
> +- FileScan parquet 
> [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
>  367 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct {code}
> In our scenario the Parquet table has 400 columns, so this query takes far more time than it should.
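>
> A possible workaround (untested sketch; it assumes the optimizer still pushes an 
> explicit projection down into the scan) is to pre-select only the needed columns 
> in a subquery, so that the FileScan does not have to read all of the columns:
> {code:scala}
> // "test_table", "usid" and "cch" are the names from the query above.
> spark.sql("""
>   SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2)
>   FROM (SELECT usid, cch FROM test_table) t
> """).show()
> {code}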



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24313:

Labels: correctness  (was: )

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>  Labels: correctness
>
> Several functions working on collections return incorrect results for complex 
> data types in interpreted mode; in particular, the complex data types BINARY and 
> ARRAY are affected. The affected functions are {{array_contains}}, 
> {{array_position}}, {{element_at}} and {{GetMapValue}}.
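>
> A minimal sketch of the kind of interpreted-mode evaluation involved (this uses 
> the internal catalyst expression API directly; calling {{eval}} bypasses codegen):
> {code:scala}
> import org.apache.spark.sql.catalyst.expressions.{ArrayContains, Literal}
> import org.apache.spark.sql.types.{ArrayType, BinaryType}
>
> // An array of BINARY values and an element that is equal by content.
> val arr  = Literal.create(Seq(Array[Byte](1, 2)), ArrayType(BinaryType))
> val elem = Literal.create(Array[Byte](1, 2), BinaryType)
>
> // Interpreted evaluation; with this bug the result can be false instead of
> // true, because the byte arrays end up compared by reference, not by content.
> println(ArrayContains(arr, elem).eval())
> {code}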



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24338) Spark SQL fails to create a table in Hive when running in a Apache Sentry-secured Environment

2018-05-21 Thread Chaoran Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaoran Yu updated SPARK-24338:
---
Description: 
This 
[commit|https://github.com/apache/spark/commit/ce13c2672318242748f7520ed4ce6bcfad4fb428]
 introduced a bug that caused the Spark SQL "CREATE TABLE" statement to fail in 
Hive when Apache Sentry is used to control cluster authorization. This bug 
exists in Spark 2.1.0 and all later releases. The error message thrown is in 
the attached file [^exception.txt].

Cloudera fixed this bug in their fork of Spark, as shown 
[here|https://github.com/cloudera/spark/blob/spark2-2.2.0-cloudera2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L229].
 It would make sense to merge this fix back upstream.

  was:
This 
[commit|https://github.com/apache/spark/commit/ce13c2672318242748f7520ed4ce6bcfad4fb428]
 introduced a bug that caused Spark SQL "CREATE TABLE" statement to fail in 
Hive when Apache Sentry is used to control cluster authorization. This bug 
exists in Spark 2.1.0 and all later releases. The error message thrown is the 
following:

org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User 
soadusr does not have privileges for CREATETABLE);
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createTable(ExternalCatalog.scala:110)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:316)
 at 
org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:127)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
 at org.apache.spark.sql.Dataset.(Dataset.scala:182)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
 at test.HiveTestJob$.runJob(HiveTestJob.scala:86)
 at test.HiveTestJob$.runJob(HiveTestJob.scala:73)
 at 
spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$8.apply(JobManagerActor.scala:407)
 at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
 at 
monitoring.MdcPropagatingExecutionContext$$anon$1.run(MdcPropagatingExecutionContext.scala:24)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:User soadusr does not have privileges for CREATETABLE)
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:446)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:446)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:446)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:445)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:256)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 ... 20 more
Caused by: MetaException(message:User soadusr does not have privileges for 
CREATETABLE)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29983)
 at 

[jira] [Resolved] (SPARK-24292) Proxy user cannot connect to HiveMetastore in local mode

2018-05-21 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-24292.

Resolution: Duplicate

> Proxy user cannot connect to HiveMetastore in local mode
> 
>
> Key: SPARK-24292
> URL: https://issues.apache.org/jira/browse/SPARK-24292
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> [SPARK-19995|https://issues.apache.org/jira/browse/SPARK-19995] only fixed this 
> problem for yarn mode, but there are many cases that need to run in local 
> mode. For example, a user wants to create a table, and local mode is much 
> easier than launching an AM in YARN:
> bin/spark-sql --proxy-user x_user --master local
> But it doesn't work:
> {code}
> Caused by: MetaException(message:Could not connect to meta store using any of 
> the URIs provided. Most recent failure: 
> org.apache.thrift.transport.TTransportException: GSS initiate failed
> at 
> org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
> at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:316)
> at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
> at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
> at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
> at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:192)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24338) Spark SQL fails to create a table in Hive when running in a Apache Sentry-secured Environment

2018-05-21 Thread Chaoran Yu (JIRA)
Chaoran Yu created SPARK-24338:
--

 Summary: Spark SQL fails to create a table in Hive when running in 
a Apache Sentry-secured Environment
 Key: SPARK-24338
 URL: https://issues.apache.org/jira/browse/SPARK-24338
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Chaoran Yu


This 
[commit|https://github.com/apache/spark/commit/ce13c2672318242748f7520ed4ce6bcfad4fb428]
 introduced a bug that caused the Spark SQL "CREATE TABLE" statement to fail in 
Hive when Apache Sentry is used to control cluster authorization. This bug 
exists in Spark 2.1.0 and all later releases. The error message thrown is the 
following:

org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User 
soadusr does not have privileges for CREATETABLE);
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createTable(ExternalCatalog.scala:110)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:316)
 at 
org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:127)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
 at org.apache.spark.sql.Dataset.(Dataset.scala:182)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
 at test.HiveTestJob$.runJob(HiveTestJob.scala:86)
 at test.HiveTestJob$.runJob(HiveTestJob.scala:73)
 at 
spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$8.apply(JobManagerActor.scala:407)
 at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
 at 
monitoring.MdcPropagatingExecutionContext$$anon$1.run(MdcPropagatingExecutionContext.scala:24)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:User soadusr does not have privileges for CREATETABLE)
 at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:446)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:446)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:446)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:445)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:256)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:215)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 ... 20 more
Caused by: MetaException(message:User soadusr does not have privileges for 
CREATETABLE)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29983)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29951)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result.read(ThriftHiveMetastore.java:29877)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
 at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_create_table_with_environment_context(ThriftHiveMetastore.java:1075)
 

[jira] [Updated] (SPARK-24338) Spark SQL fails to create a table in Hive when running in a Apache Sentry-secured Environment

2018-05-21 Thread Chaoran Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaoran Yu updated SPARK-24338:
---
Attachment: exception.txt

> Spark SQL fails to create a table in Hive when running in a Apache 
> Sentry-secured Environment
> -
>
> Key: SPARK-24338
> URL: https://issues.apache.org/jira/browse/SPARK-24338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Chaoran Yu
>Priority: Critical
> Attachments: exception.txt
>
>
> This 
> [commit|https://github.com/apache/spark/commit/ce13c2672318242748f7520ed4ce6bcfad4fb428]
>  introduced a bug that caused the Spark SQL "CREATE TABLE" statement to fail in 
> Hive when Apache Sentry is used to control cluster authorization. This bug 
> exists in Spark 2.1.0 and all later releases. The error message thrown is the 
> following:
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:User 
> soadusr does not have privileges for CREATETABLE);
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:215)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalog.createTable(ExternalCatalog.scala:110)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:316)
>  at 
> org.apache.spark.sql.execution.command.CreateTableCommand.run(tables.scala:127)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:182)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
>  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
>  at test.HiveTestJob$.runJob(HiveTestJob.scala:86)
>  at test.HiveTestJob$.runJob(HiveTestJob.scala:73)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$8.apply(JobManagerActor.scala:407)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> monitoring.MdcPropagatingExecutionContext$$anon$1.run(MdcPropagatingExecutionContext.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> MetaException(message:User soadusr does not have privileges for CREATETABLE)
>  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply$mcV$sp(HiveClientImpl.scala:446)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:446)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$createTable$1.apply(HiveClientImpl.scala:446)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:290)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:231)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:230)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:445)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:256)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:215)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:215)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  ... 20 more
> Caused by: MetaException(message:User soadusr does not have privileges for 
> CREATETABLE)
>  at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$create_table_with_environment_context_result$create_table_with_environment_context_resultStandardScheme.read(ThriftHiveMetastore.java:29983)
>  at 
> 

[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24313:

Priority: Critical  (was: Minor)

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Critical
>  Labels: correctness
>
> Several functions working on collections return incorrect results for complex 
> data types in interpreted mode; in particular, the complex data types BINARY and 
> ARRAY are affected. The affected functions are {{array_contains}}, 
> {{array_position}}, {{element_at}} and {{GetMapValue}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-05-21 Thread xdcjie (JIRA)
xdcjie created SPARK-24339:
--

 Summary: spark sql can not prune column in transform/map/reduce 
query
 Key: SPARK-24339
 URL: https://issues.apache.org/jira/browse/SPARK-24339
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1, 2.2.0, 2.1.2, 2.1.1
Reporter: xdcjie
 Fix For: 2.2.1, 2.2.0, 2.1.2, 2.1.1


I was using {{TRANSFORM USING}} with branch-2.1/2.2 and noticed that it scans 
all columns of the data for a query like:
{code:java}
SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
test_table;{code}
Its physical plan looks like:
{code:java}
== Physical Plan ==
ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, u2#786, 
c2#787], 
HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
  )),List((field.delim,   
)),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
+- FileScan parquet 
[sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
 367 more fields] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: 
struct

[jira] [Updated] (SPARK-24338) Spark SQL fails to create a Hive table when running in a Apache Sentry-secured Environment

2018-05-21 Thread Chaoran Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaoran Yu updated SPARK-24338:
---
Summary: Spark SQL fails to create a Hive table when running in a Apache 
Sentry-secured Environment  (was: Spark SQL fails to create a table in Hive 
when running in a Apache Sentry-secured Environment)

> Spark SQL fails to create a Hive table when running in a Apache 
> Sentry-secured Environment
> --
>
> Key: SPARK-24338
> URL: https://issues.apache.org/jira/browse/SPARK-24338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Chaoran Yu
>Priority: Critical
> Attachments: exception.txt
>
>
> This 
> [commit|https://github.com/apache/spark/commit/ce13c2672318242748f7520ed4ce6bcfad4fb428]
>  introduced a bug that caused the Spark SQL "CREATE TABLE" statement to fail in 
> Hive when Apache Sentry is used to control cluster authorization. This bug 
> exists in Spark 2.1.0 and all later releases. The error message thrown is in 
> the attached file [^exception.txt].
> Cloudera fixed this bug in their fork of Spark, as shown 
> [here|https://github.com/cloudera/spark/blob/spark2-2.2.0-cloudera2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L229].
>  It would make sense to merge this fix back upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24204:


Assignee: (was: Apache Spark)

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> *SUMMARY*
> - CSV: raises an analysis exception
> - JSON: drops columns with null types
> - Parquet/ORC: raise runtime exceptions
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed:
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> the CSV format already does:
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65
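>
> For comparison, a rough sketch of the CSV behaviour referenced above: the same 
> DataFrame fails on the driver at analysis time rather than inside a write task 
> (the exact exception message may differ between versions):
> {code:scala}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, null)))
> val schema = StructType(StructField("a", IntegerType) :: StructField("b", NullType) :: Nil)
> val df = spark.createDataFrame(rdd, schema)
>
> // Throws an AnalysisException on the driver, roughly
> // "CSV data source does not support null data type."
> df.write.csv("/tmp/csv")
> {code}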



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24204:


Assignee: Apache Spark

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> *SUMMARY*
> - CSV: raises an analysis exception
> - JSON: drops columns with null types
> - Parquet/ORC: raise runtime exceptions
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed:
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> the CSV format already does:
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24204) Verify a write schema in Json/Orc/ParquetFileFormat

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483468#comment-16483468
 ] 

Apache Spark commented on SPARK-24204:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21389

> Verify a write schema in Json/Orc/ParquetFileFormat
> ---
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> *SUMMARY*
> - CSV: raises an analysis exception
> - JSON: drops columns with null types
> - Parquet/ORC: raise runtime exceptions
> The native ORC file format throws an exception with a meaningless message on 
> the executor side when unsupported types are passed:
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems better to verify the write schema on the driver side for users, as 
> the CSV format already does:
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24209) 0 configuration Knox gateway support in SHS

2018-05-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24209.

   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0

> 0 configuration Knox gateway support in SHS
> ---
>
> Key: SPARK-24209
> URL: https://issues.apache.org/jira/browse/SPARK-24209
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
>  Labels: usability
> Fix For: 2.4.0
>
>
> SHS supports the Knox gateway by setting {{spark.ui.proxyBase}} to the proper 
> value. The main problem with this approach is that when 
> {{spark.ui.proxyBase}} is set, SHS is no longer accessible through direct 
> access.
> Since Knox provides the base URL to use in the {{X-Forwarded-Context}} header 
> (there is a patch available in KNOX-1295), we can leverage it to 
> dynamically retrieve the right proxy base instead of relying on a fixed 
> config. In this way, both access via the Knox proxy and access via the direct 
> URL work properly.
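>
> A minimal sketch of the idea (hypothetical helper, not the actual SHS code): 
> prefer the header when Knox sets it, and fall back to the configured value 
> otherwise.
> {code:scala}
> import javax.servlet.http.HttpServletRequest
>
> // Resolve the proxy base per request: the Knox-provided X-Forwarded-Context
> // header if present, otherwise the statically configured spark.ui.proxyBase.
> def resolveProxyBase(req: HttpServletRequest, configuredProxyBase: String): String =
>   Option(req.getHeader("X-Forwarded-Context")).getOrElse(configuredProxyBase)
> {code}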



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24337) Improve the error message for invalid SQL conf value

2018-05-21 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-24337:


 Summary: Improve the error message for invalid SQL conf value
 Key: SPARK-24337
 URL: https://issues.apache.org/jira/browse/SPARK-24337
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Shixiong Zhu


Right now Spark throws the following error message when a config is set to 
an invalid value. It would be great if the error message contained the config 
key, so that it is easy to tell which one is wrong.

{code}
Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes 
(g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.
Fractional values are not supported. Input was: 1.6
at 
org.apache.spark.network.util.JavaUtils.byteStringAs(JavaUtils.java:291)
at 
org.apache.spark.internal.config.ConfigHelpers$.byteFromString(ConfigBuilder.scala:66)
at 
org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
at 
org.apache.spark.internal.config.ConfigBuilder$$anonfun$bytesConf$1.apply(ConfigBuilder.scala:234)
at 
org.apache.spark.sql.internal.SQLConf.setConfString(SQLConf.scala:1300)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:78)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$mergeSparkConf$1.apply(BaseSessionStateBuilder.scala:77)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.mergeSparkConf(BaseSessionStateBuilder.scala:77)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.conf$lzycompute(BaseSessionStateBuilder.scala:90)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.conf(BaseSessionStateBuilder.scala:88)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1071)
... 59 more
{code}
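
The message in the trace comes from the generic byte-size parser, which has no 
knowledge of which config key supplied the value. A minimal sketch that produces 
the same message, using the public helper named in the stack trace above:

{code:scala}
import org.apache.spark.network.util.JavaUtils

// Throws IllegalArgumentException: "Size must be specified as bytes (b), ...
// Fractional values are not supported. Input was: 1.6" -- with no config key.
JavaUtils.byteStringAsBytes("1.6")
{code}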



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22055) Port release scripts

2018-05-21 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483217#comment-16483217
 ] 

Marcelo Vanzin commented on SPARK-22055:


IMO it would be a better idea to take Felix's docker image [1] and create a 
script based on it, instead of relying on Jenkins jobs for this. As part of 
2.3.1 I have written a small script to help me prepare the release, and it does 
half of the job. All that is left is to automate building the image and running 
the commands inside it.

I'll send my script for review after 2.3.1 is out and I'm happy with it.

[1] https://github.com/felixcheung/spark-build/blob/master/Dockerfile

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24309:
--

Assignee: Imran Rashid

> AsyncEventQueue should handle an interrupt from a Listener
> --
>
> Key: SPARK-24309
> URL: https://issues.apache.org/jira/browse/SPARK-24309
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
> spark app won't even stop!
> I observed this on an actual workload as the EventLoggingListener can 
> generate an interrupt from the underlying hdfs calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
>  to 
> DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
>  10 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
> eventLog.
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue will continue to pile up events in its 
> queue, even though it's no longer processing them. Then, in the call to stop(), 
> it will block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
> stop.
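>
> A toy illustration of why stop() hangs (plain JDK code, not Spark code): once 
> the dispatch thread is gone, a put() on a full bounded queue never returns.
> {code:scala}
> import java.util.concurrent.LinkedBlockingQueue
>
> val queue = new LinkedBlockingQueue[String](1) // bounded, like the event queue
> queue.put("some event")                        // fills the queue
> // With no consumer draining the queue, the next put blocks forever, which is
> // what happens to queue.put(POISON_PILL) inside stop():
> // queue.put("POISON_PILL")
> {code}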



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24336) Support 'pass through' transformation in BasicOperators

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483209#comment-16483209
 ] 

Apache Spark commented on SPARK-24336:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/21388

> Support 'pass through' transformation in BasicOperators
> ---
>
> Key: SPARK-24336
> URL: https://issues.apache.org/jira/browse/SPARK-24336
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> As of now, BasicOperators enumerates all the cases that match a LogicalPlan to 
> a SparkPlan, but some of the cases are simple transformations, marked in the 
> code as 'pass through': they only pass the same parameters through, converting 
> child LogicalPlans to SparkPlans by calling `planLater`.
> We can automate this by leveraging reflection. ScalaReflection already has 
> many methods to leverage, so we should be good to go.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24336) Support 'pass through' transformation in BasicOperators

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24336:


Assignee: (was: Apache Spark)

> Support 'pass through' transformation in BasicOperators
> ---
>
> Key: SPARK-24336
> URL: https://issues.apache.org/jira/browse/SPARK-24336
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> As of now, BasicOperators enumerates all the cases that match a LogicalPlan to 
> a SparkPlan, but some of the cases are simple transformations, marked in the 
> code as 'pass through': they only pass the same parameters through, converting 
> child LogicalPlans to SparkPlans by calling `planLater`.
> We can automate this by leveraging reflection. ScalaReflection already has 
> many methods to leverage, so we should be good to go.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24309) AsyncEventQueue should handle an interrupt from a Listener

2018-05-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24309.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21356
[https://github.com/apache/spark/pull/21356]

> AsyncEventQueue should handle an interrupt from a Listener
> --
>
> Key: SPARK-24309
> URL: https://issues.apache.org/jira/browse/SPARK-24309
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
> Fix For: 2.4.0, 2.3.1
>
>
> AsyncEventQueue does not properly handle an interrupt from a Listener -- the 
> spark app won't even stop!
> I observed this on an actual workload as the EventLoggingListener can 
> generate an interrupt from the underlying hdfs calls:
> {noformat}
> 18/05/16 17:46:36 WARN hdfs.DFSClient: Error transferring data from 
> DatanodeInfoWithStorage[10.17.206.36:20002,DS-3adac910-5d0a-418b-b0f7-6332b35bf6a1,DISK]
>  to 
> DatanodeInfoWithStorage[10.17.206.42:20002,DS-2e7ed0aa-0e68-441e-b5b2-96ad4a9ce7a5,DISK]:
>  10 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.17.206.35:33950 
> remote=/10.17.206.36:20002]
> 18/05/16 17:46:36 WARN hdfs.DFSClient: DataStreamer Exception
> java.net.SocketTimeoutException: 10 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.17.206.35:33950 remote=/10.17.206.36:20002]
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2305)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$StreamerStreams.sendTransferBlock(DFSOutputStream.java:516)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:1450)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1408)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1559)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1254)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:739)
> 18/05/16 17:46:36 ERROR scheduler.AsyncEventQueue: Listener 
> EventLoggingListener threw an exception
> [... a few more of these ...]
> 18/05/16 17:46:36 INFO scheduler.AsyncEventQueue: Stopping listener queue 
> eventLog.
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1220)
> at 
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
> at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:94)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:83)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:79)
> at 
> org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)
> at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:78)
> {noformat}
> When this happens, the AsyncEventQueue will continue to pile up events in its 
> queue, even though it's no longer processing them. Then, in the call to stop(), 
> it will block on {{queue.put(POISON_PILL)}} forever, so the SparkContext won't 
> stop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24336) Support 'pass through' transformation in BasicOperators

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24336:


Assignee: Apache Spark

> Support 'pass through' transformation in BasicOperators
> ---
>
> Key: SPARK-24336
> URL: https://issues.apache.org/jira/browse/SPARK-24336
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> As of now, BasicOperators enumerates all the cases that match a LogicalPlan to 
> a SparkPlan, but some of the cases are simple transformations, marked in the 
> code as 'pass through': they only pass the same parameters through, converting 
> child LogicalPlans to SparkPlans by calling `planLater`.
> We can automate this by leveraging reflection. ScalaReflection already has 
> many methods to leverage, so we should be good to go.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24336) Support 'pass through' transformation in BasicOperators

2018-05-21 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-24336:


 Summary: Support 'pass through' transformation in BasicOperators
 Key: SPARK-24336
 URL: https://issues.apache.org/jira/browse/SPARK-24336
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Jungtaek Lim


As of now, BasicOperators enumerates all the cases that match a LogicalPlan to 
a SparkPlan, but some of the cases are simple transformations, marked in the 
code as 'pass through': they only pass the same parameters through, converting 
child LogicalPlans to SparkPlans by calling `planLater`.

We can automate this by leveraging reflection. ScalaReflection already has 
many methods to leverage, so we should be good to go.
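
For reference, this is roughly what the hand-written 'pass through' cases look 
like today (a simplified fragment in the style of BasicOperators, not a 
standalone snippet): each case only rewraps the same arguments and converts the 
child with `planLater`.

{code:scala}
// Inside a Strategy's apply(plan: LogicalPlan): Seq[SparkPlan]
case logical.Project(projectList, child) =>
  execution.ProjectExec(projectList, planLater(child)) :: Nil
case logical.Filter(condition, child) =>
  execution.FilterExec(condition, planLater(child)) :: Nil
{code}

The proposal is to generate such cases mechanically (for example via 
ScalaReflection) instead of enumerating them by hand.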



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-21 Thread H Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483095#comment-16483095
 ] 

H Lu commented on SPARK-23928:
--

Dear watchers,

Can you please help review and comment on the PR? Thanks a lot!

For the tests, I was trying to do this:
 # assertEqualsIgnoreOrder(shuffle(originSeq), originSeq). But Spark does not 
have assertEqualsIgnoreOrder implemented. I was thinking of checking the 
multisets of shuffle(originSeq) and originSeq, but I had trouble using a 
multiset for an expression and a seq.
 # About the randomness, I was thinking of generating a Seq range(1, 501) and 
shuffling it 30 times; it should produce at least 80% distinct permutations, 
say by using HashSet.add(shuffledResult). But I don't know how to implement 
this idea in Scala and codegen for expressions.

This is my first time contributing to Spark and codegen. I hope committers and 
contributors can help with the tests, and also with the shuffle function code.

I will quickly fix bugs and change the tests based on your comments. Thanks a 
million!
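
For the first point, a minimal sketch of an "equal ignoring order" check based 
on element counts (plain Scala collections; this is not the catalyst/codegen 
test utility the PR would need, just the idea):

{code:scala}
// Hypothetical helper: two sequences are equal as multisets if every element
// occurs the same number of times in both.
def assertEqualsIgnoreOrder[T](actual: Seq[T], expected: Seq[T]): Unit = {
  def counts(xs: Seq[T]): Map[T, Int] =
    xs.groupBy(identity).map { case (k, v) => k -> v.size }
  assert(counts(actual) == counts(expected),
    s"$actual is not a permutation of $expected")
}

val origin = (1 to 500).toSeq
assertEqualsIgnoreOrder(scala.util.Random.shuffle(origin), origin)
{code}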

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23928:


Assignee: (was: Apache Spark)

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483086#comment-16483086
 ] 

Apache Spark commented on SPARK-23928:
--

User 'pkuwm' has created a pull request for this issue:
https://github.com/apache/spark/pull/21386

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23928:


Assignee: Apache Spark

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24335) Dataset.map schema not applied in some cases

2018-05-21 Thread Robert Reid (JIRA)
Robert Reid created SPARK-24335:
---

 Summary: Dataset.map schema not applied in some cases
 Key: SPARK-24335
 URL: https://issues.apache.org/jira/browse/SPARK-24335
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0, 2.1.0
Reporter: Robert Reid


In the following code an {color:#808080}UnsupportedOperationException{color} is 
thrown in the filter() call just after the Dataset.map() call unless 
withWatermark() is added between them. The error reports 
`{color:#808080}fieldIndex on a Row without schema is undefined{color}`. I 
expect the map() method to have applied the schema and for it to be accessible 
in filter(). Without the extra withWatermark() call my debugger reports that 
the `row` objects in the filter lambda are `GenericRow`; with the watermark 
call it reports that they are `GenericRowWithSchema`.

I should add that I'm new to working with Structured Streaming.  So if I'm 
overlooking some implied dependency please fill me in.

I'm encountering this in new code for a new production job. The presented code 
is distilled down to demonstrate the problem.  While the problem can be worked 
around simply by adding withWatermark() I'm concerned that this will leave the 
code in a fragile state.  With this simplified code if this error occurs again 
it will be easy to identify what change led to the error.  But in the code I'm 
writing, with this functionality delegated to other classes, it is (and has 
been) very challenging to identify the cause.

 
{code:java}
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder().master("local").getOrCreate();

    sparkSession.conf().set(
            "spark.sql.streaming.checkpointLocation",
            "hdfs://localhost:9000/search_relevance/checkpoint" // for spark 2.3
            // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // for spark 2.1
    );

    StructType inSchema = DataTypes.createStructType(
            new StructField[] {
                    DataTypes.createStructField("id", DataTypes.StringType,    false),
                    DataTypes.createStructField("ts", DataTypes.TimestampType, false),
                    DataTypes.createStructField("f1", DataTypes.LongType,      true)
            }
    );

    Dataset<Row> rawSet = sparkSession.sqlContext().readStream()
            .format("rate")
            .option("rowsPerSecond", 1)
            .load()
            .map((MapFunction<Row, Row>) raw -> {
                        Object[] fields = new Object[3];
                        fields[0] = "id1";
                        fields[1] = raw.getAs("timestamp");
                        fields[2] = raw.getAs("value");
                        return RowFactory.create(fields);
                    },
                    RowEncoder.apply(inSchema)
            )
            // If withWatermark() is included above the filter() line then this works.
            // Without it we get:
            //     Caused by: java.lang.UnsupportedOperationException:
            //     fieldIndex on a Row without schema is undefined.
            // at the row.getAs() call.
            // .withWatermark("ts", "10 seconds")  // <-- This is required for row.getAs("f1") to work ???
            .filter((FilterFunction<Row>) row -> !row.getAs("f1").equals(0L))
            .withWatermark("ts", "10 seconds");

    StreamingQuery streamingQuery = rawSet
            .select("*")
            .writeStream()
            .format("console")
            .outputMode("append")
            .start();

    try {
        streamingQuery.awaitTermination(30_000);
    } catch (StreamingQueryException e) {
        System.out.println("Caught exception at 'awaitTermination':");
        e.printStackTrace();
    }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Resolved] (SPARK-24325) Tests for Hadoop's LinesReader

2018-05-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24325.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> Tests for Hadoop's LinesReader
> --
>
> Key: SPARK-24325
> URL: https://issues.apache.org/jira/browse/SPARK-24325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
>
> Currently, there are no tests for [Hadoop 
> LinesReader|https://github.com/apache/spark/blob/8d79113b812a91073d2c24a3a9ad94cc3b90b24a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala#L42].
>  Before refactoring or rewriting the class, we need tests that cover its 
> basic functionality, such as:
>  * Split boundaries that slice lines
>  * A split that slices a delimiter - user-specified or the default
>  * No duplicates when splits slice delimiters or lines
>  * Checking constant limits such as the maximum line length
>  * Handling the case when the internal buffer size is less than the line size



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-21 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483004#comment-16483004
 ] 

Li Jin commented on SPARK-24334:


I have done some investigation and will submit a PR soon.

> Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory 
> allocator
> -
>
> Key: SPARK-24334
> URL: https://issues.apache.org/jira/browse/SPARK-24334
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, ArrowPythonRunner has two threads that free the Arrow vector 
> schema root and allocator - the main writer thread and the task completion 
> listener thread. 
> Having both threads do the cleanup leads to odd failures (e.g., negative ref 
> counts, NPEs, and memory leak exceptions) when an exception is thrown from the 
> user function.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24334) Race condition in ArrowPythonRunner causes unclean shutdown of Arrow memory allocator

2018-05-21 Thread Li Jin (JIRA)
Li Jin created SPARK-24334:
--

 Summary: Race condition in ArrowPythonRunner causes unclean 
shutdown of Arrow memory allocator
 Key: SPARK-24334
 URL: https://issues.apache.org/jira/browse/SPARK-24334
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Li Jin


Currently, ArrowPythonRunner has two threads that free the Arrow vector schema 
root and allocator - the main writer thread and the task completion listener 
thread.

Having both threads do the cleanup leads to odd failures (e.g., negative ref 
counts, NPEs, and memory leak exceptions) when an exception is thrown from the 
user function. A minimal sketch of an idempotent cleanup is shown below.
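
A minimal sketch of one way to make the cleanup idempotent (toy class and names, 
not ArrowPythonRunner's actual fields), so whichever thread arrives second becomes 
a no-op:

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical holder for the Arrow resources; close() may be called from both the
// writer thread and the task-completion listener, but only the first call frees them.
class ArrowResources(root: AutoCloseable, allocator: AutoCloseable) {
  private val closed = new AtomicBoolean(false)

  def close(): Unit = {
    if (closed.compareAndSet(false, true)) {
      root.close()       // release the vector schema root first
      allocator.close()  // then the allocator, exactly once
    }
  }
}
{code}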

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24234) create the bottom-of-task RDD with row buffer

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482998#comment-16482998
 ] 

Apache Spark commented on SPARK-24234:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/21385

> create the bottom-of-task RDD with row buffer
> -
>
> Key: SPARK-24234
> URL: https://issues.apache.org/jira/browse/SPARK-24234
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> This probably ought to be an abstraction of ContinuousDataSourceRDD and 
> ContinuousQueuedDataReader. These classes do pretty much exactly what's 
> needed, except the buffer is filled from the remote data source instead of a 
> remote Spark task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-05-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24333:
-

 Summary: Add fit with validation set to spark.ml GBT: Python API
 Key: SPARK-24333
 URL: https://issues.apache.org/jira/browse/SPARK-24333
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-05-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7132.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21129
[https://github.com/apache/spark/pull/21129]

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.4.0
>
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)
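
As an illustration of the proposal (the column name "isValidation" and the 
variables below are hypothetical, not the merged API), the flag-column split 
looks like this:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// The column an estimator's validationFlagCol Param would point at.
val flagCol = "isValidation"
val data = Seq(
  (0.0, 1.5, false),
  (1.0, 3.2, false),
  (1.0, 2.8, true)
).toDF("label", "feature", flagCol)

// Goal A: one input DataFrame, split once inside fit().
val train      = data.filter(!col(flagCol))
val validation = data.filter(col(flagCol))
// Goal B: evaluate on `validation` after each boosting iteration for early stopping.
// Goal C: holds because both sides went through the same upstream transformers.
{code}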



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24234) create the bottom-of-task RDD with row buffer

2018-05-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-24234:
-

Assignee: Jose Torres

> create the bottom-of-task RDD with row buffer
> -
>
> Key: SPARK-24234
> URL: https://issues.apache.org/jira/browse/SPARK-24234
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> This probably ought to be an abstraction of ContinuousDataSourceRDD and 
> ContinuousQueuedDataReader. These classes do pretty much exactly what's 
> needed, except the buffer is filled from the remote data source instead of a 
> remote Spark task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24234) create the bottom-of-task RDD with row buffer

2018-05-21 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-24234.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 21337
[https://github.com/apache/spark/pull/21337]

> create the bottom-of-task RDD with row buffer
> -
>
> Key: SPARK-24234
> URL: https://issues.apache.org/jira/browse/SPARK-24234
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> [https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE/edit#heading=h.8t3ci57f7uii]
>  
> This probably ought to be an abstraction of ContinuousDataSourceRDD and 
> ContinuousQueuedDataReader. These classes do pretty much exactly what's 
> needed, except the buffer is filled from the remote data source instead of a 
> remote Spark task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23416:


Assignee: Apache Spark

> Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for 
> failOnDataLoss=false
> 
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.3.0
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> 

[jira] [Commented] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482922#comment-16482922
 ] 

Apache Spark commented on SPARK-23416:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/21384

> Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for 
> failOnDataLoss=false
> 
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> 

[jira] [Assigned] (SPARK-23416) Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23416:


Assignee: (was: Apache Spark)

> Flaky test: KafkaSourceStressForDontFailOnDataLossSuite.stress test for 
> failOnDataLoss=false
> 
>
> Key: SPARK-23416
> URL: https://issues.apache.org/jira/browse/SPARK-23416
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or 
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 16b2a2b1-acdd-44ec-902f-531169193169, runId = 
> 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job 
> aborted. at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
>  Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing 
> job aborted. at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> 

[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482901#comment-16482901
 ] 

Imran Rashid commented on SPARK-6235:
-

[~tgraves]
WAL -- the write-ahead log for receiver-based streaming. This wouldn't affect a 
streaming source like KafkaDirectDStream, which isn't receiver based. It might 
not be that hard to fix, but I don't know that code very well and I don't think 
it's nearly as important.

I've also seen records larger than 2 GB. That would probably be a good thing to 
support eventually as well, but I don't think it's as important; I just want to 
put it out of scope here.

For task results, I mean the results sent back to the driver in an action, from 
each partition. It would be hard to imagine that working if RDD records couldn't 
be greater than 2 GB in general; I just thought it was worth calling out because 
I've seen users try to send back large results. A compelling use case is updating 
a statistical model in memory in your RDD action and sending the updates back in 
a reduce to merge them together, as in the sketch below.
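
A toy illustration of that pattern (the per-partition "update" here is just a 
small histogram; with a real model each result could be much larger):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext

val data = sc.parallelize(1 to 1000000, numSlices = 8)

// Each partition produces one "model update" (a histogram of x % 10 in this toy case).
val updates = data.mapPartitions { it =>
  Iterator(it.foldLeft(Map.empty[Int, Long]) { (m, x) =>
    val k = x % 10
    m.updated(k, m.getOrElse(k, 0L) + 1L)
  })
}

// reduce() ships each partition's update back to the driver as a task result and
// merges them there -- the "large task results" path discussed above.
val merged = updates.reduce { (a, b) =>
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
}
{code}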

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23754:


Assignee: (was: Apache Spark)

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23754:


Assignee: Apache Spark

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Apache Spark
>Priority: Major
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482884#comment-16482884
 ] 

Apache Spark commented on SPARK-23754:
--

User 'e-dorigatti' has created a pull request for this issue:
https://github.com/apache/spark/pull/21383

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more 6k for parquet based table

2018-05-21 Thread Bimalendu Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bimalendu Choudhary updated SPARK-24316:

Description: 
When we create a table from a data frame using Spark SQL with around 6k columns 
or more, even a simple query fetching 70k rows takes 20 minutes, while the same 
query against the same table created through Hive with the same data takes only 
5 minutes.

Instrumenting the code, we see that the executors are looping in 
initializeInternal(). The majority of the time is spent in the for loop below, 
which iterates over the columns, and the executor appears to be stalled for a 
long time.

{code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}
private void initializeInternal() ..
 ..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i) {
   ...
 }
}
{code}

When Spark SQL creates a table, it also stores the metadata in TBLPROPERTIES in 
JSON format. We see that if we remove this metadata from the table, the queries 
become fast, which is what happens when we create the same table through Hive. 
The exact same table takes 5 times longer with the JSON metadata than without it.

So it looks like, as the number of columns grows beyond 5k to 6k, processing and 
comparing this metadata becomes more and more expensive and performance degrades 
drastically.

To recreate the problem, simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")
resp_data.write.format("csv").save("/tmp/filename")

The table should be created by Spark SQL from a dataframe so that the JSON 
metadata is stored. For example:

val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv")
dff.createOrReplaceTempView("my_temp_table")
val tmp = spark.sql("Create table tableName stored as parquet as select * from my_temp_table")

A PySpark variant of the read:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
resp_data = spark.sql("select * from test").limit(2000)
print resp_data_fgv_1k.count()
(resp_data_fgv_1k.write.option('header', False).mode('overwrite').csv('/tmp/2.csv'))

The performance is even slower in the loop if the schema does not match or the 
fields are empty, so that the code takes the branch where the missing column is 
marked true:

missingColumns[i] = true;

 

  was:
When we create a table from a data frame using spark sql with columns around 6k 
or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
same table if we create through Hive with same data , the same query just takes 
5 minutes.

 

Instrumenting the code we see that the executors are looping in the while loop 
of the function initializeInternal(). The majority of time is getting spent 
here and the executor seems to be stalled for long time .

 

 
{code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}


private void initializeInternal() ..
 ..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i)

{ ... }

}
{code:java}
 {code}
 

When spark sql is creating table, it also stores the metadata in the 
TBLPROPERTIES in json format. We see that if we remove this metadata from the 
table the queries become fast , which is the case when we create the same table 
through Hive. The exact same table takes 5 times more time with the Json meta 
data as compared to without the json metadata.

 

So looks like as the number of columns are growing bigger than 5 to 6k, the 
processing of the metadata and comparing it becomes more and more expensive and 
the performance degrades drastically.

To recreate the problem simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")

 resp_data.write.format("csv").save("/tmp/filename")

 

The table should be created by spark sql from dataframe so that the Json meta 
data is stored. For ex:-

val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")

dff.createOrReplaceTempView("my_temp_table")

 val tmp = spark.sql("Create table tableName stored as parquet as select * from 
my_temp_table")

 

 

from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc) 
 resp_data = spark.sql( " select * from test").limit(2000) 
 print resp_data_fgv_1k.count() 
 (resp_data_fgv_1k.write.option('header', 
False).mode('overwrite').csv('/tmp/2.csv') ) 

 

 

The performance seems to be even slow in the loop if the schema does not match 
or the fields are empty and the code goes into the if condition where the 
missing column is marked true:

missingColumns[i] = true;

 


> Spark sql queries stall 

[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more 6k for parquet based table

2018-05-21 Thread Bimalendu Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bimalendu Choudhary updated SPARK-24316:

Description: 
When we create a table from a data frame using spark sql with columns around 6k 
or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
same table if we create through Hive with same data , the same query just takes 
5 minutes.

 

Instrumenting the code we see that the executors are looping in the while loop 
of the function initializeInternal(). The majority of time is getting spent 
here and the executor seems to be stalled for long time .

 

 
{code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}


private void initializeInternal() ..
 ..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i)

{ ... }

}
{code:java}
 {code}
 

When spark sql is creating table, it also stores the metadata in the 
TBLPROPERTIES in json format. We see that if we remove this metadata from the 
table the queries become fast , which is the case when we create the same table 
through Hive. The exact same table takes 5 times more time with the Json meta 
data as compared to without the json metadata.

 

So looks like as the number of columns are growing bigger than 5 to 6k, the 
processing of the metadata and comparing it becomes more and more expensive and 
the performance degrades drastically.

To recreate the problem simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")

 resp_data.write.format("csv").save("/tmp/filename")

 

The table should be created by spark sql from dataframe so that the Json meta 
data is stored. For ex:-

val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")

dff.createOrReplaceTempView("my_temp_table")

 val tmp = spark.sql("Create table tableName stored as parquet as select * from 
my_temp_table")

 

 

from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc) 
 resp_data = spark.sql( " select * from test").limit(2000) 
 print resp_data_fgv_1k.count() 
 (resp_data_fgv_1k.write.option('header', 
False).mode('overwrite').csv('/tmp/2.csv') ) 

 

 

The performance seems to be even slow in the loop if the schema does not match 
or the fields are empty and the code goes into the if condition where the 
missing column is marked true:

missingColumns[i] = true;

 

  was:
When we create a table from a data frame using spark sql with columns around 6k 
or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
same table if we create through Hive with same data , the same query just takes 
5 minutes.

 

Instrumenting the code we see that the executors are looping in the while loop 
of the function initializeInternal(). The majority of time is getting spent 
here and the executor seems to be stalled for long time .

 

{{{code:title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}}}

private void initializeInternal() ..
 ..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i)

{ ... }

}
{code:java}
 {code}
 

When spark sql is creating table, it also stores the metadata in the 
TBLPROPERTIES in json format. We see that if we remove this metadata from the 
table the queries become fast , which is the case when we create the same table 
through Hive. The exact same table takes 5 times more time with the Json meta 
data as compared to without the json metadata.

 

So looks like as the number of columns are growing bigger than 5 to 6k, the 
processing of the metadata and comparing it becomes more and more expensive and 
the performance degrades drastically.

To recreate the problem simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")

 resp_data.write.format("csv").save("/tmp/filename")

 

The table should be created by spark sql from dataframe so that the Json meta 
data is stored. For ex:-

val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")

dff.createOrReplaceTempView("my_temp_table")

 val tmp = spark.sql("Create table tableName stored as parquet as select * from 
my_temp_table")

 

 

from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc) 
 resp_data = spark.sql( " select * from test").limit(2000) 
 print resp_data_fgv_1k.count() 
 (resp_data_fgv_1k.write.option('header', 
False).mode('overwrite').csv('/tmp/2.csv') ) 

 

 

The performance seems to be even slow in the loop if the schema does not match 
or the fields are empty and the code goes into the if condition where the 
missing column is marked true:

missingColumns[i] = true;

 


> Spark sql queries stall for  column width more 6k for parquet based table
> 

[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more 6k for parquet based table

2018-05-21 Thread Bimalendu Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bimalendu Choudhary updated SPARK-24316:

Description: 
When we create a table from a data frame using spark sql with columns around 6k 
or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
same table if we create through Hive with same data , the same query just takes 
5 minutes.

 

Instrumenting the code we see that the executors are looping in the while loop 
of the function initializeInternal(). The majority of time is getting spent 
here and the executor seems to be stalled for long time .

 

{{{code:title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}}}

private void initializeInternal() ..
 ..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i)

{ ... }

}
{code:java}
 {code}
 

When spark sql is creating table, it also stores the metadata in the 
TBLPROPERTIES in json format. We see that if we remove this metadata from the 
table the queries become fast , which is the case when we create the same table 
through Hive. The exact same table takes 5 times more time with the Json meta 
data as compared to without the json metadata.

 

So looks like as the number of columns are growing bigger than 5 to 6k, the 
processing of the metadata and comparing it becomes more and more expensive and 
the performance degrades drastically.

To recreate the problem simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")

 resp_data.write.format("csv").save("/tmp/filename")

 

The table should be created by spark sql from dataframe so that the Json meta 
data is stored. For ex:-

val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")

dff.createOrReplaceTempView("my_temp_table")

 val tmp = spark.sql("Create table tableName stored as parquet as select * from 
my_temp_table")

 

 

from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc) 
 resp_data = spark.sql( " select * from test").limit(2000) 
 print resp_data_fgv_1k.count() 
 (resp_data_fgv_1k.write.option('header', 
False).mode('overwrite').csv('/tmp/2.csv') ) 

 

 

The performance seems to be even slow in the loop if the schema does not match 
or the fields are empty and the code goes into the if condition where the 
missing column is marked true:

missingColumns[i] = true;

 

  was:
When we create a table from a data frame using spark sql with columns around 6k 
or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
same table if we create through Hive with same data , the same query just takes 
5 minutes.

 

Instrumenting the code we see that the executors are looping in the while loop 
of the function initializeInternal(). The majority of time is getting spent 
here and the executor seems to be stalled for long time .

spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java

private void initializeInternal() ..
 ..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i)

{ ... }

}

When spark sql is creating table, it also stores the metadata in the 
TBLPROPERTIES in json format. We see that if we remove this metadata from the 
table the queries become fast , which is the case when we create the same table 
through Hive. The exact same table takes 5 times more time with the Json meta 
data as compared to without the json metadata.

 

So looks like as the number of columns are growing bigger than 5 to 6k, the 
processing of the metadata and comparing it becomes more and more expensive and 
the performance degrades drastically.

To recreate the problem simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")

 resp_data.write.format("csv").save("/tmp/filename")

 

The table should be created by spark sql from dataframe so that the Json meta 
data is stored. For ex:-

val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")

dff.createOrReplaceTempView("my_temp_table")

 val tmp = spark.sql("Create table tableName stored as parquet as select * from 
my_temp_table")

 

 

from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc) 
 resp_data = spark.sql( " select * from test").limit(2000) 
 print resp_data_fgv_1k.count() 
 (resp_data_fgv_1k.write.option('header', 
False).mode('overwrite').csv('/tmp/2.csv') ) 

 

 

The performance seems to be even slow in the loop if the schema does not match 
or the fields are empty and the code goes into the if condition where the 
missing column is marked true:

missingColumns[i] = true;

 


> Spark sql queries stall for  column width more 6k for parquet based table
> 

[jira] [Updated] (SPARK-24316) Spark sql queries stall for column width more 6k for parquet based table

2018-05-21 Thread Bimalendu Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bimalendu Choudhary updated SPARK-24316:

Affects Version/s: 2.4.0
   2.3.0
  Description: 
When we create a table from a data frame using spark sql with columns around 6k 
or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
same table if we create through Hive with same data , the same query just takes 
5 minutes.

 

Instrumenting the code we see that the executors are looping in the while loop 
of the function initializeInternal(). The majority of time is getting spent 
here and the executor seems to be stalled for long time .

spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java

private void initializeInternal() ..
 ..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i)

{ ... }

}

When spark sql is creating table, it also stores the metadata in the 
TBLPROPERTIES in json format. We see that if we remove this metadata from the 
table the queries become fast , which is the case when we create the same table 
through Hive. The exact same table takes 5 times more time with the Json meta 
data as compared to without the json metadata.

 

So looks like as the number of columns are growing bigger than 5 to 6k, the 
processing of the metadata and comparing it becomes more and more expensive and 
the performance degrades drastically.

To recreate the problem simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")

 resp_data.write.format("csv").save("/tmp/filename")

 

The table should be created by spark sql from dataframe so that the Json meta 
data is stored. For ex:-

val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")

dff.createOrReplaceTempView("my_temp_table")

 val tmp = spark.sql("Create table tableName stored as parquet as select * from 
my_temp_table")

 

 

from pyspark.sql import SQLContext
 sqlContext = SQLContext(sc) 
 resp_data = spark.sql( " select * from test").limit(2000) 
 print resp_data_fgv_1k.count() 
 (resp_data_fgv_1k.write.option('header', 
False).mode('overwrite').csv('/tmp/2.csv') ) 

 

 

The performance seems to be even slow in the loop if the schema does not match 
or the fields are empty and the code goes into the if condition where the 
missing column is marked true:

missingColumns[i] = true;

 

  was:
When we create a table from a data frame using spark sql with columns around 6k 
or more, even simple queries of fetching 70k rows takes 20 minutes, while the 
same table if we create through Hive with same data , the same query just takes 
5 minutes.

 

Instrumenting the code we see that the executors are looping in the while loop 
of the function initializeInternal(). The majority of time is getting spent 
here and the executor seems to be stalled for long time .

[VectorizedParquetRecordReader.java|http://opengrok.sjc.cloudera.com/source/xref/spark-2.2.0-cloudera1/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java]

private void initializeInternal() ..
..
 for (int i = 0; i < requestedSchema.getFieldCount(); ++i) {

...
 }
}

When spark sql is creating table, it also stores the metadata in the 
TBLPROPERTIES in json format. We see that if we remove this metadata from the 
table the queries become fast , which is the case when we create the same table 
through Hive. The exact same table takes 5 times more time with the Json meta 
data as compared to without the json metadata.

 

So looks like as the number of columns are growing bigger than 5 to 6k, the 
processing of the metadata and comparing it becomes more and more expensive and 
the performance degrades drastically.

To recreate the problem simply run the following query:

import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")

 resp_data.write.format("csv").save("/tmp/filename")

 

The table should be created by spark sql from dataframe so that the Json meta 
data is stored. For ex:-

val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")

dff.createOrReplaceTempView("my_temp_table")

 val tmp = spark.sql("Create table tableName stored as parquet as select * from 
my_temp_table")

 

 

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc) 
resp_data = spark.sql( " select * from test").limit(2000) 
print resp_data_fgv_1k.count() 
(resp_data_fgv_1k.write.option('header', 
False).mode('overwrite').csv('/tmp/2.csv') ) 

 

 

The performance seems to be even slow in the loop if the schema does not match 
or the fields are empty and the code goes into the if condition where the 
missing column is marked true:

missingColumns[i] = true;

 


> Spark sql queries stall for  column width more 

[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-21 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482746#comment-16482746
 ] 

Thomas Graves commented on SPARK-6235:
--

>> Still unsupported:
 * large task results
 * large blocks in the WAL
 * individual records larger than 2 GB

Can you clarify what the WAL is?

I have seen individual records larger than 2 GB, though I don't think it's as 
common.

Also, can you clarify what large task results are?

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24332) Fix places reading 'spark.network.timeout' as milliseconds

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24332:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix places reading 'spark.network.timeout' as milliseconds
> --
>
> Key: SPARK-24332
> URL: https://issues.apache.org/jira/browse/SPARK-24332
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> There are some places reading "spark.network.timeout" using "getTimeAsMs" 
> rather than "getTimeAsSeconds". This will return a wrong value when the user 
> specifies a value without a time unit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2018-05-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482699#comment-16482699
 ] 

Imran Rashid commented on SPARK-6235:
-

Linked a [design 
doc|https://docs.google.com/document/d/1ZialnQ0RSOkyYYND7nU609NJYBC6lnhS4xyl2YqG03A/edit?usp=sharing]

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24332) Fix places reading 'spark.network.timeout' as milliseconds

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24332:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix places reading 'spark.network.timeout' as milliseconds
> --
>
> Key: SPARK-24332
> URL: https://issues.apache.org/jira/browse/SPARK-24332
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> There are some places reading "spark.network.timeout" using "getTimeAsMs" 
> rather than "getTimeAsSeconds". This will return a wrong value when the user 
> specifies a value without a time unit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24332) Fix places reading 'spark.network.timeout' as milliseconds

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482698#comment-16482698
 ] 

Apache Spark commented on SPARK-24332:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/21382

> Fix places reading 'spark.network.timeout' as milliseconds
> --
>
> Key: SPARK-24332
> URL: https://issues.apache.org/jira/browse/SPARK-24332
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> There are some places reading "spark.network.timeout" using "getTimeAsMs" 
> rather than "getTimeAsSeconds". This will return a wrong value when the user 
> specifies a value without a time unit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24332) Fix places reading 'spark.network.timeout' as milliseconds

2018-05-21 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-24332:


 Summary: Fix places reading 'spark.network.timeout' as milliseconds
 Key: SPARK-24332
 URL: https://issues.apache.org/jira/browse/SPARK-24332
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Structured Streaming
Affects Versions: 2.3.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


There are some places reading "spark.network.timeout" using "getTimeAsMs" 
rather than "getTimeAsSeconds". This will return a wrong value when the user 
specifies a value without a time unit.
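
A quick sketch of the mismatch being described (illustrative only, reading the 
key through SparkConf directly rather than at the affected call sites):

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.network.timeout", "120") // no time unit given

// A bare number is interpreted in the unit of the accessor that reads it:
conf.getTimeAsSeconds("spark.network.timeout") // 120 -> treated as 120 seconds (intended)
conf.getTimeAsMs("spark.network.timeout")      // 120 -> treated as 120 milliseconds (wrong by 1000x)
{code}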



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24330) Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)

2018-05-21 Thread Gengliang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-24330:
---
Description: 
Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and 
improve readability.
After the change, callers only need to call {{commit()}} or {{abort}} at the 
end of task.
Also there is less code in {{SingleDirectoryWriteTask}} and 
{{DynamicPartitionWriteTask}}.

Definitions of related classes are moved to a new file, and 
{{ExecuteWriteTask}} is renamed to {{FileFormatDataWriter}}.

  was:
As I am working on File data source V2 write path in my repo 
[https://github.com/gengliangwang/spark/tree/orcWriter] , I find it essential 
to refactor ExecuteWriteTask in FileFormatWriter with DataWriter of Data source 
V2:
 # Reuse the code in both `FileFormat` and Data Source V2
 # Better abstraction, callers only need to call `commit()` or `abort` at the 
end of task.

 

 


> Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)
> -
>
> Key: SPARK-24330
> URL: https://issues.apache.org/jira/browse/SPARK-24330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Refactor ExecuteWriteTask in FileFormatWriter to reduce common logic and 
> improve readability.
> After the change, callers only need to call {{commit()}} or {{abort}} at the 
> end of task.
> Also there is less code in {{SingleDirectoryWriteTask}} and 
> {{DynamicPartitionWriteTask}}.
> Definitions of related classes are moved to a new file, and 
> {{ExecuteWriteTask}} is renamed to {{FileFormatDataWriter}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24331) Add cardinality / arrays_overlap / array_repeat / map_entries

2018-05-21 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482577#comment-16482577
 ] 

Marek Novotny commented on SPARK-24331:
---

I will work on this. Thanks!

> Add cardinality / arrays_overlap / array_repeat / map_entries  
> ---
>
> Key: SPARK-24331
> URL: https://issues.apache.org/jira/browse/SPARK-24331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add SparkR equivalent to:
>  * cardinality - SPARK-23923
>  * arrays_overlap - SPARK-23922
>  * array_repeat - SPARK-23925
>  * map_entries - SPARK-23935



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24331) Add cardinality / arrays_overlap / array_repeat / map_entries

2018-05-21 Thread Marek Novotny (JIRA)
Marek Novotny created SPARK-24331:
-

 Summary: Add cardinality / arrays_overlap / array_repeat / 
map_entries  
 Key: SPARK-24331
 URL: https://issues.apache.org/jira/browse/SPARK-24331
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Marek Novotny


Add SparkR equivalent to:
 * cardinality - SPARK-23923
 * arrays_overlap - SPARK-23922
 * array_repeat - SPARK-23925
 * map_entries - SPARK-23935



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24273) Failure while using .checkpoint method

2018-05-21 Thread Jami Malikzade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482562#comment-16482562
 ] 

Jami Malikzade commented on SPARK-24273:


[~kiszk], I have just noticed that I am getting this error when I use Zeppelin. 
In the spark shell it works.

More info https://issues.apache.org/jira/browse/ZEPPELIN-3477

 

> Failure while using .checkpoint method
> --
>
> Key: SPARK-24273
> URL: https://issues.apache.org/jira/browse/SPARK-24273
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Jami Malikzade
>Priority: Major
>
> We are getting following error:
> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS 
> Service: Amazon S3, AWS Request ID: 
> tx14126-005ae9bfd9-9ed9ac2-default, AWS Error Code: 
> InvalidRange, AWS Error Message: null, S3 Extended Request ID: 
> 9ed9ac2-default-default"
> when we use checkpoint method as below.
> val streamBucketDF = streamPacketDeltaDF
>  .filter('timeDelta > maxGap && 'timeDelta <= 3)
>  .withColumn("bucket", when('timeDelta <= mediumGap, "medium")
>  .otherwise("large")
>  )
>  .checkpoint()
> Do you have idea how to prevent invalid range in header to be sent, or how it 
> can be workarounded or fixed?
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24330) Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24330:


Assignee: (was: Apache Spark)

> Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)
> -
>
> Key: SPARK-24330
> URL: https://issues.apache.org/jira/browse/SPARK-24330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> As I am working on File data source V2 write path in my repo 
> [https://github.com/gengliangwang/spark/tree/orcWriter] , I find it essential 
> to refactor ExecuteWriteTask in FileFormatWriter with DataWriter of Data 
> source V2:
>  # Reuse the code in both `FileFormat` and Data Source V2
>  # Better abstraction, callers only need to call `commit()` or `abort` at the 
> end of task.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24330) Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24330:


Assignee: Apache Spark

> Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)
> -
>
> Key: SPARK-24330
> URL: https://issues.apache.org/jira/browse/SPARK-24330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> As I am working on File data source V2 write path in my repo 
> [https://github.com/gengliangwang/spark/tree/orcWriter] , I find it essential 
> to refactor ExecuteWriteTask in FileFormatWriter with DataWriter of Data 
> source V2:
>  # Reuse the code in both `FileFormat` and Data Source V2
>  # Better abstraction, callers only need to call `commit()` or `abort` at the 
> end of task.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24330) Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482539#comment-16482539
 ] 

Apache Spark commented on SPARK-24330:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21381

> Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)
> -
>
> Key: SPARK-24330
> URL: https://issues.apache.org/jira/browse/SPARK-24330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Priority: Major
>
> As I am working on File data source V2 write path in my repo 
> [https://github.com/gengliangwang/spark/tree/orcWriter] , I find it essential 
> to refactor ExecuteWriteTask in FileFormatWriter with DataWriter of Data 
> source V2:
>  # Reuse the code in both `FileFormat` and Data Source V2
>  # Better abstraction, callers only need to call `commit()` or `abort` at the 
> end of task.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23935) High-order function: map_entries(map<K, V>) → array<row<K,V>>

2018-05-21 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23935.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21236
https://github.com/apache/spark/pull/21236

> High-order function: map_entries(map<K, V>) → array<row<K,V>>
> -
>
> Key: SPARK-23935
> URL: https://issues.apache.org/jira/browse/SPARK-23935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns an array of all entries in the given map.
> {noformat}
> SELECT map_entries(MAP(ARRAY[1, 2], ARRAY['x', 'y'])); -- [ROW(1, 'x'), 
> ROW(2, 'y')]
> {noformat}
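
The Spark SQL equivalent, assuming the function is available from 2.4.0 on 
(sketch only; the map(...) literal syntax is Spark's, not Presto's):

{code:scala}
// Returns an array of (key, value) structs, analogous to the Presto example above.
spark.sql("SELECT map_entries(map(1, 'x', 2, 'y')) AS entries").show(false)
{code}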



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23935) High-order function: map_entries(map<K, V>) → array<row<K,V>>

2018-05-21 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23935:
-

Assignee: Marek Novotny

> High-order function: map_entries(map<K, V>) → array<row<K,V>>
> -
>
> Key: SPARK-23935
> URL: https://issues.apache.org/jira/browse/SPARK-23935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns an array of all entries in the given map.
> {noformat}
> SELECT map_entries(MAP(ARRAY[1, 2], ARRAY['x', 'y'])); -- [ROW(1, 'x'), 
> ROW(2, 'y')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24273) Failure while using .checkpoint method

2018-05-21 Thread Jami Malikzade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482533#comment-16482533
 ] 

Jami Malikzade commented on SPARK-24273:


[~kiszk]

 

I was able to reproduce it with a small piece of code by filtering a column so 
that 0 records are returned, as below: 'salary > 300' matches no records, so it 
should checkpoint 0 records.

The steps are:

Prepare a salary.csv file in the bucket like:

name,year,salary,month
John,2017,12.33,4
John,2018,55.114,5
Smith,2017,36.339,3
Smith,2018,45.36,6

 

And execute the following Spark code:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.when
import spark.implicits._

sc.setCheckpointDir("s3a://testbucket/tmp")

val testschema = StructType(Array(
  StructField("name", StringType, true),
  StructField("year", IntegerType, true),
  StructField("salary", FloatType, true),
  StructField("month", IntegerType, true)
))

val df = spark.read
  .option("header", "true")
  .option("sep", ",")
  .schema(testschema)
  .csv("s3a://testbucket/salary.csv")
  .filter('salary > 300)
  .withColumn("month", when('name === "Smith", "6").otherwise("3"))
  .checkpoint()

df.show()

 

 

It will throw:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 11.0 failed 16 times, most recent failure: Lost task 0.15 in stage 11.0 
(TID 41, 172.16.0.15, executor 5): 
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS 
Service: Amazon S3, AWS Request ID: 
tx00018c023-005b02d1e7-513c871-default, AWS Error Code: 
InvalidRange, AWS Error Message: null, S3 Extended Request ID: 
513c871-default-default

> Failure while using .checkpoint method
> --
>
> Key: SPARK-24273
> URL: https://issues.apache.org/jira/browse/SPARK-24273
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Jami Malikzade
>Priority: Major
>
> We are getting following error:
> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS 
> Service: Amazon S3, AWS Request ID: 
> tx14126-005ae9bfd9-9ed9ac2-default, AWS Error Code: 
> InvalidRange, AWS Error Message: null, S3 Extended Request ID: 
> 9ed9ac2-default-default"
> when we use checkpoint method as below.
> val streamBucketDF = streamPacketDeltaDF
>  .filter('timeDelta > maxGap && 'timeDelta <= 3)
>  .withColumn("bucket", when('timeDelta <= mediumGap, "medium")
>  .otherwise("large")
>  )
>  .checkpoint()
> Do you have idea how to prevent invalid range in header to be sent, or how it 
> can be workarounded or fixed?
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure

2018-05-21 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482512#comment-16482512
 ] 

Steve Loughran commented on SPARK-19790:


Update on this, having spent lots of time working in the internals of hadoop 
committers

The v2 commit algorithm is broken in that if it fails *or partitions* during 
commit execution, the outcome is "undefined"
The v1 committer doesn't have this problem on any FS with atomic rename; the 
task attempt dir rename() will either fail or succeed. A partitioned executor 
may still execute its rename after its successor, but as the outcome of any 
task attempt is required to be acceptable, that can be justified.
The new s3a committers handle task failure.

The only limitation then is the v2 commit algorithm, which, IMO, doesn't meet 
the semantics of a commit algorithm as required by Spark. Which means that if 
people are permitted to use that algorithm, then when it fails during a commit, 
it should be picked up and the job itself failed.

Now: how to do that?

There's currently no method in the hadoop OutputCommitter interface to probe 
this; there's only one for job commit being repeatable. We can add one to 
Hadoop 3.x+ HadoopOutputCommitter() to aid this (or have it implement 
{{StreamCapabilities}} and declare the string -> predicate map there). Spark 
can query it and fail on task execute failure if the task committer has the new 
method and declares that it can't handle commit retry. FileOutputCommitter 
would return true on the probe iff v1 was invoked; the S3A committers will 
return true always.
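
A rough sketch of what that probe could look like on the Spark side (the 
capability string and helper below are hypothetical, nothing like them exists in 
Hadoop or Spark today):

{code:scala}
import org.apache.hadoop.fs.StreamCapabilities
import org.apache.hadoop.mapreduce.OutputCommitter

// Hypothetical capability name, for illustration only.
val RECOVERABLE_TASK_COMMIT = "spark:recoverable-task-commit"

def canRetryTaskCommit(committer: OutputCommitter): Boolean = committer match {
  case c: StreamCapabilities => c.hasCapability(RECOVERABLE_TASK_COMMIT)
  case _                     => false // unknown committer: assume the unsafe default
}
{code}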

Thoughts? 

> OutputCommitCoordinator should not allow another task to commit after an 
> ExecutorFailure
> 
>
> Key: SPARK-19790
> URL: https://issues.apache.org/jira/browse/SPARK-19790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The OutputCommitCoordinator resets the allowed committer when the task fails. 
>  
> https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
> However, if a task fails because of an ExecutorFailure, we actually have no 
> idea what the status is of the task.  The task may actually still be running, 
> and perhaps successfully commit its output.  By allowing another task to 
> commit its output, there is a chance that multiple tasks commit, which can 
> result in corrupt output.  This would be particularly problematic when 
> commit() is an expensive operation, eg. moving files on S3.
> For other task failures, we can allow other tasks to commit.  But with an 
> ExecutorFailure, its not clear what the right thing to do is.  The only safe 
> thing to do may be to fail the job.
> This is related to SPARK-19631, and was discovered during discussion on that 
> PR https://github.com/apache/spark/pull/16959#discussion_r103549134



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse then spark 2.2

2018-05-21 Thread Truong Duc Kien (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482476#comment-16482476
 ] 

Truong Duc Kien commented on SPARK-23309:
-

We're also having a performance problem with cached queries on Spark 2.3. Once 
in a while, a query will take an abnormally long time. We took a look at the 
thread dump and saw the executor waiting to fetch remote cached blocks, which 
progresses very slowly. It seems to be a run-time bug, because if we run the 
same query again, the slow-down might go away.

This did not happen with 2.2.1. 

> Spark 2.3 cached query performance 20-30% worse then spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing spark 2.3 rc2 and I am seeing a performance regression in sql 
> queries on cached data.
> The size of the data: 10.4GB input from hive orc files /18.8 GB cached/5592 
> partitions
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On spark 2.2 I see queries times average 13 seconds
> On the same nodes I see spark 2.3 queries times average 17 seconds
> Note these are times of queries after the initial caching.  so just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335GB input/587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between spark 
> 2.3 and spark 2.2, where 2.2 was better by like 20%.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24330) Refactor ExecuteWriteTask in FileFormatWriter with DataWriter(V2)

2018-05-21 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-24330:
--

 Summary: Refactor ExecuteWriteTask in FileFormatWriter with 
DataWriter(V2)
 Key: SPARK-24330
 URL: https://issues.apache.org/jira/browse/SPARK-24330
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Gengliang Wang


As I am working on File data source V2 write path in my repo 
[https://github.com/gengliangwang/spark/tree/orcWriter] , I find it essential 
to refactor ExecuteWriteTask in FileFormatWriter with DataWriter of Data source 
V2:
 # Reuse the code in both `FileFormat` and Data Source V2
 # Better abstraction, callers only need to call `commit()` or `abort` at the 
end of task.
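
For context, a minimal sketch of the caller-side pattern this aims at, assuming 
the Data Source V2 {{DataWriter}} interface (write/commit/abort); it is not the 
actual FileFormatWriter code:

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.v2.writer.{DataWriter, WriterCommitMessage}

// Illustrative only: write all rows, then a single commit() or abort() per task.
def runTask(writer: DataWriter[InternalRow], rows: Iterator[InternalRow]): WriterCommitMessage = {
  try {
    rows.foreach(r => writer.write(r))
    writer.commit()
  } catch {
    case t: Throwable =>
      writer.abort()
      throw t
  }
}
{code}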

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20389) Upgrade kryo to fix NegativeArraySizeException

2018-05-21 Thread Arseniy Tashoyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arseniy Tashoyan updated SPARK-20389:
-
Affects Version/s: 2.2.1

> Upgrade kryo to fix NegativeArraySizeException
> --
>
> Key: SPARK-20389
> URL: https://issues.apache.org/jira/browse/SPARK-20389
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
> Environment: Linux, Centos7, jdk8
>Reporter: Georg Heiler
>Priority: Major
>
> I am experiencing an issue with Kryo when writing parquet files. Similar to 
> https://github.com/broadinstitute/gatk/issues/1524 a 
> NegativeArraySizeException occurs. Apparently this is fixed in a current Kryo 
> version. Spark is still using the very old 3.3 Kryo. 
> Can you please upgrade to a fixed Kryo version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20389) Upgrade kryo to fix NegativeArraySizeException

2018-05-21 Thread Arseniy Tashoyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482408#comment-16482408
 ] 

Arseniy Tashoyan commented on SPARK-20389:
--

I confirm: the issue is present in Spark 2.2.1. I hit it when running KMeans on 
a dataset of ~4M records while trying to get ~500 clusters.

I tried to avoid the problem by increasing spark.sql.shuffle.partitions to a 
large number like 500, but then I got OutOfMemoryError: Heap space on the 
executors. So this workaround is suitable only with a capable cluster at hand.
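
For reference, the workaround amounts to something like the following, if the 
cluster has the memory headroom (the value 500 is just the number mentioned 
above):

{code:scala}
spark.conf.set("spark.sql.shuffle.partitions", "500")
// or at submit time: --conf spark.sql.shuffle.partitions=500
{code}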

> Upgrade kryo to fix NegativeArraySizeException
> --
>
> Key: SPARK-20389
> URL: https://issues.apache.org/jira/browse/SPARK-20389
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Linux, Centos7, jdk8
>Reporter: Georg Heiler
>Priority: Major
>
> I am experiencing an issue with Kryo when writing parquet files. Similar to 
> https://github.com/broadinstitute/gatk/issues/1524 a 
> NegativeArraySizeException occurs. Apparently this is fixed in a current Kryo 
> version. Spark is still using the very old 3.3 Kryo. 
> Can you please upgrade to a fixed Kryo version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24302) when using spark persist(),"KryoException:IndexOutOfBoundsException" happens

2018-05-21 Thread yijukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482330#comment-16482330
 ] 

yijukang commented on SPARK-24302:
--

[~hyukjin.kwon]  OK thanks

> when using spark persist(),"KryoException:IndexOutOfBoundsException" happens
> 
>
> Key: SPARK-24302
> URL: https://issues.apache.org/jira/browse/SPARK-24302
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.0
>Reporter: yijukang
>Priority: Major
>  Labels: apache-spark
>
> my operation is using spark to insert RDD data into Hbase like this:
> --
> localData.persist()
>  localData.saveAsNewAPIHadoopDataset(jobConf.getConfiguration)
> --
> this way throw Exception:
>    com.esotericsoftware.kryo.KryoException: 
> java.lang.IndexOutOfBoundsException:index:99, Size:6
> Serialization trace:
>     familyMap (org.apache.hadoop.hbase.client.Put)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>    at com.esotericsoftware.kryo.kryo.readClassAndObject(Kryo.java:729)
>    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
>    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
>    at com.esotericsoftware.kryo.kryo.readClassAndObject(Kryo.java:729)
>  
> when i deal with this:
> -
>  localData.saveAsNewAPIHadoopDataset(jobConf.getConfiguration)
> --
> it works well,what the persist() method did?
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482295#comment-16482295
 ] 

Apache Spark commented on SPARK-24329:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21380

> Remove comments filtering before parsing of CSV files
> -
>
> Key: SPARK-24329
> URL: https://issues.apache.org/jira/browse/SPARK-24329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Comments and whitespace filtering has been performed by uniVocity parser 
> already according to parser settings:
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
> It is not necessary to do the same before parsing. Need to inspect all places 
> where the filterCommentAndEmpty method is called, and remove the former one 
> if it duplicates filtering of uniVocity parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24329:


Assignee: (was: Apache Spark)

> Remove comments filtering before parsing of CSV files
> -
>
> Key: SPARK-24329
> URL: https://issues.apache.org/jira/browse/SPARK-24329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Comments and whitespace filtering has been performed by uniVocity parser 
> already according to parser settings:
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
> It is not necessary to do the same before parsing. Need to inspect all places 
> where the filterCommentAndEmpty method is called, and remove the former one 
> if it duplicates filtering of uniVocity parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24329:


Assignee: Apache Spark

> Remove comments filtering before parsing of CSV files
> -
>
> Key: SPARK-24329
> URL: https://issues.apache.org/jira/browse/SPARK-24329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Comments and whitespace filtering has been performed by uniVocity parser 
> already according to parser settings:
> https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
> It is not necessary to do the same before parsing. Need to inspect all places 
> where the filterCommentAndEmpty method is called, and remove the former one 
> if it duplicates filtering of uniVocity parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24329) Remove comments filtering before parsing of CSV files

2018-05-21 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24329:
--

 Summary: Remove comments filtering before parsing of CSV files
 Key: SPARK-24329
 URL: https://issues.apache.org/jira/browse/SPARK-24329
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Maxim Gekk


Comments and whitespace filtering has been performed by uniVocity parser 
already according to parser settings:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180

It is not necessary to do the same before parsing. Need to inspect all places 
where the filterCommentAndEmpty method is called, and remove the former one if 
it duplicates filtering of uniVocity parser.
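
For reference, the user-facing option that relies on this filtering (sketch 
only; the path and comment character are just examples):

{code:scala}
val df = spark.read
  .option("comment", "#")   // lines starting with '#' are skipped as comments
  .option("header", "true")
  .csv("/path/to/data.csv") // hypothetical path
{code}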



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24283) Make standard scaler work without legacy MLlib

2018-05-21 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482277#comment-16482277
 ] 

zhengruifeng commented on SPARK-24283:
--

[~holdenk] do you mean that we should implement the prediction in ML.features 
(and not call MLlib.features.StandardScalerModel.transform)?
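
For illustration, the kind of per-row round trip the ticket wants to avoid (a 
sketch under the assumption that prediction currently goes through the legacy 
MLlib model; the helper name is only for the example):

{code:scala}
import org.apache.spark.ml.linalg.{Vector => NewVector}
import org.apache.spark.mllib.feature.StandardScalerModel
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// Convert ML vector -> MLlib vector, transform, convert back.
def scaleViaLegacy(model: StandardScalerModel, v: NewVector): NewVector =
  model.transform(OldVectors.fromML(v)).asML
{code}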

> Make standard scaler work without legacy MLlib
> --
>
> Key: SPARK-24283
> URL: https://issues.apache.org/jira/browse/SPARK-24283
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> Currently StandardScaler converts Spark ML vectors to MLlib vectors during 
> prediction, we should skip that step.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24323) Java lint errors

2018-05-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24323:


Assignee: Kazuaki Ishizaki

> Java lint errors
> 
>
> Key: SPARK-24323
> URL: https://issues.apache.org/jira/browse/SPARK-24323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>
> The following error occurs when run lint-java
> {code:java}
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java:[39] 
> (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[26]
>  (sizes) LineLength: Line is longer than 100 characters (found 110).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[30]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24323) Java lint errors

2018-05-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24323.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/21374

> Java lint errors
> 
>
> Key: SPARK-24323
> URL: https://issues.apache.org/jira/browse/SPARK-24323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> The following error occurs when run lint-java
> {code:java}
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java:[39] 
> (sizes) LineLength: Line is longer than 100 characters (found 104).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[26]
>  (sizes) LineLength: Line is longer than 100 characters (found 110).
> [ERROR] 
> src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[30]
>  (sizes) LineLength: Line is longer than 100 characters (found 104).
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24242) RangeExec should have correct outputOrdering

2018-05-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24242:


Assignee: Liang-Chi Hsieh

> RangeExec should have correct outputOrdering
> 
>
> Key: SPARK-24242
> URL: https://issues.apache.org/jira/browse/SPARK-24242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Logical Range node has been added with outputOrdering recently. It's used to 
> eliminate redundant Sort during optimization. However, this outputOrdering 
> info doesn't not propagate to physical Range node. We should use this 
> outputOrdering from logical Range node so parent nodes of Range can correctly 
> know the output ordering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24242) RangeExec should have correct outputOrdering

2018-05-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24242.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21291
[https://github.com/apache/spark/pull/21291]

> RangeExec should have correct outputOrdering
> 
>
> Key: SPARK-24242
> URL: https://issues.apache.org/jira/browse/SPARK-24242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Logical Range node has been added with outputOrdering recently. It's used to 
> eliminate redundant Sort during optimization. However, this outputOrdering 
> info doesn't not propagate to physical Range node. We should use this 
> outputOrdering from logical Range node so parent nodes of Range can correctly 
> know the output ordering.
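
A quick way to observe the intended effect from a shell (an expectation under 
this change, not a guarantee of the exact plan shape):

{code:scala}
// spark.range already emits rows ordered by id; with outputOrdering propagated
// to the physical Range node, the extra Sort below should be optimized away.
spark.range(0, 1000).sort("id").explain()
{code}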



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24328) Fix scala.MatchError in literals.sql.out

2018-05-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24328.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.3.0

> Fix scala.MatchError in literals.sql.out 
> -
>
> Key: SPARK-24328
> URL: https://issues.apache.org/jira/browse/SPARK-24328
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Trivial
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org