[jira] [Commented] (SPARK-24891) Fix HandleNullInputsForUDF rule

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555111#comment-16555111
 ] 

Apache Spark commented on SPARK-24891:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/21869

> Fix HandleNullInputsForUDF rule
> ---
>
> Key: SPARK-24891
> URL: https://issues.apache.org/jira/browse/SPARK-24891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely, 
> causing problems such as SQL cache lookups missing an otherwise matching plan. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23957) Sorts in subqueries are redundant and can be removed

2018-07-24 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23957.
-
   Resolution: Fixed
 Assignee: Henry Robinson
Fix Version/s: 2.4.0

> Sorts in subqueries are redundant and can be removed
> 
>
> Key: SPARK-23957
> URL: https://issues.apache.org/jira/browse/SPARK-23957
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Assignee: Henry Robinson
>Priority: Major
> Fix For: 2.4.0
>
>
> Unless combined with a {{LIMIT}}, there's no correctness reason that planned 
> and optimized subqueries should have any sort operators (since the result of 
> the subquery is an unordered collection of tuples). 
> For example:
> {{SELECT count(1) FROM (select id FROM dft ORDER by id)}}
> has the following plan:
> {code:java}
> == Physical Plan ==
> *(3) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *(2) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(2) Project
>  +- *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but the sort operator is redundant.
> Less intuitively, the sort is also redundant in selections from an ordered 
> subquery:
> {{SELECT * FROM (SELECT id FROM dft ORDER BY id)}}
> has plan:
> {code:java}
> == Physical Plan ==
> *(2) Sort [id#0L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>+- *(1) Project [id#0L]
>   +- *(1) FileScan parquet [id#0L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
> {code}
> ... but again, since the subquery returns a bag of tuples, the sort is 
> unnecessary.
> We should consider adding an optimizer rule that removes a sort inside a 
> subquery. SPARK-23375 is related, but removes sorts that are functionally 
> redundant because they perform the same ordering.
>   
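A minimal sketch of such a rule, assuming it only needs to handle a sort feeding directly into an aggregate (a real rule would cover more parent operators and must leave sorts under a LIMIT alone):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: drop a Sort whose ordering cannot affect the query result,
// here the narrow case of a Sort directly below an Aggregate.
object RemoveSortBelowAggregate extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case agg @ Aggregate(_, _, Sort(_, _, child)) =>
      agg.copy(child = child) // aggregation ignores the order of its input rows
  }
}
{code}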






[jira] [Resolved] (SPARK-24890) Short circuiting the `if` condition when `trueValue` and `falseValue` are the same

2018-07-24 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24890.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Short circuiting the `if` condition when `trueValue` and `falseValue` are the 
> same
> --
>
> Key: SPARK-24890
> URL: https://issues.apache.org/jira/browse/SPARK-24890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> When `trueValue` and `falseValue` are semantically equivalent, the condition 
> expression in `if` can be removed to avoid extra computation at runtime.
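A minimal sketch of the rewrite (Catalyst expression names, but an illustration rather than the actual change; it also guards on a deterministic condition to be safe):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.If
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: If(cond, x, x) can be short-circuited to x when both
// branches are semantically equal, so cond never needs to be evaluated.
object ShortCircuitRedundantIf extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case If(cond, trueValue, falseValue)
        if cond.deterministic && trueValue.semanticEquals(falseValue) =>
      trueValue
  }
}
{code}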






[jira] [Updated] (SPARK-24914) totalSize is not a good estimate for broadcast joins

2018-07-24 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-24914:
--
Description: 
When determining whether to do a broadcast join, Spark estimates the size of 
the smaller table as follows:
 - if totalSize is defined and greater than 0, use it.
 - else, if rawDataSize is defined and greater than 0, use it
 - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)

Therefore, Spark prefers totalSize over rawDataSize.

Unfortunately, totalSize is often quite a bit smaller than the actual table 
size, since it represents the size of the table's files on disk. Parquet and 
Orc files, for example, are encoded and compressed. This can result in the JVM 
throwing an OutOfMemoryError while Spark is loading the table into a 
HashedRelation, or when Spark actually attempts to broadcast the data.

On the other hand, rawDataSize represents the uncompressed size of the dataset, 
according to Hive documentation. This seems like a pretty good number to use in 
preference to totalSize. However, due to HIVE-20079, this value is simply 
#columns * #rows. Once that bug is fixed, it may be a superior statistic, at 
least for managed tables.

In the meantime, we could apply a configurable "fudge factor" to totalSize, at 
least for types of files that are encoded and compressed. Hive has the setting 
hive.stats.deserialization.factor, which defaults to 1.0, and is described as 
follows:
{quote}in the absence of uncompressed/raw data size, total file size will be 
used for statistics annotation. But the file may be compressed, encoded and 
serialized which may be lesser in size than the actual uncompressed/raw data 
size. This factor will be multiplied to file size to estimate the raw data size.
{quote}
In addition to the fudge factor, we could compare the adjusted totalSize to 
rawDataSize and use the bigger of the two.

Caveat: This mitigates the issue only for Hive tables. It does not help much 
when the user is reading files using {{spark.read.parquet}}, unless we apply 
the same fudge factor there.
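A minimal sketch of the estimation order described above, with a hypothetical fudge factor applied to totalSize (the helper and the max-of-both choice are illustrative, not existing Spark code):

{code:scala}
// Illustrative only: prefer an adjusted totalSize, compare it with rawDataSize,
// and fall back to spark.sql.defaultSizeInBytes when neither is usable.
def estimateTableSizeInBytes(totalSize: Option[Long],
                             rawDataSize: Option[Long],
                             fudgeFactor: Double,
                             defaultSizeInBytes: Long): Long = {
  val adjustedTotal = totalSize.filter(_ > 0).map(s => (s * fudgeFactor).toLong)
  val raw = rawDataSize.filter(_ > 0)
  (adjustedTotal, raw) match {
    case (Some(t), Some(r)) => math.max(t, r) // use the bigger of the two
    case (Some(t), None)    => t
    case (None, Some(r))    => r
    case (None, None)       => defaultSizeInBytes
  }
}
{code}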

  was:
When determining whether to do a broadcast join, Spark estimates the size of 
the smaller table as follows:
 - if totalSize is defined and greater than 0, use it.
 - else, if rawDataSize is defined and greater than 0, use it
 - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)

Therefore, Spark prefers totalSize over rawDataSize.

Unfortunately, totalSize is often quite a bit smaller than the actual table 
size, since it represents the size of the table's files on disk. Parquet and 
Orc files, for example, are encoded and compressed. This can result in the JVM 
throwing an OutOfMemoryError while Spark is loading the table into a 
HashedRelation, or when Spark actually attempts to broadcast the data.

On the other hand, rawDataSize represents the uncompressed size of the dataset, 
according to Hive documentation. This seems like a pretty good number to use in 
preference to totalSize. However, due to HIVE-20079, this value is simply 
#columns * #rows. Once that bug is fixed, it may be a superior statistic, at 
least for managed tables.

In the meantime, we could apply a configurable "fudge factor" to totalSize, at 
least for types of files that are encoded and compressed. Hive has the setting 
hive.stats.deserialization.factor, which defaults to 1.0, and is described as 
follows:
{quote}in the absence of uncompressed/raw data size, total file size will be 
used for statistics annotation. But the file may be compressed, encoded and 
serialized which may be lesser in size than the actual uncompressed/raw data 
size. This factor will be multiplied to file size to estimate the raw data size.
{quote}
In addition to the fudge factor, we could compare the adjusted totalSize to 
rawDataSize and use the bigger of the two:
{noformat}
size1 = totalSize.isDefined ? totalSize * fudgeFactor : Long.MAX_VALUE
size2 = rawDataSize.isDefined ? rawDataSize : Long.MAX_VALUE
size = max(size1, size2)
{noformat}
Caveat: This mitigates the issue only for Hive tables. It does not help much 
when the user is reading files using {{spark.read.parquet}}, unless we apply 
the same fudge factor there.


> totalSize is not a good estimate for broadcast joins
> 
>
> Key: SPARK-24914
> URL: https://issues.apache.org/jira/browse/SPARK-24914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> When determining whether to do a broadcast join, Spark estimates the size of 
> the smaller table as follows:
>  - if totalSize is defined and greater than 0, use it.
>  - else, if rawDataSize is defined and greater than 0, use it
>  - else, use spark.sql.defaultSizeInBytes (default: 

[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555080#comment-16555080
 ] 

Saisai Shao commented on SPARK-24615:
-

Hi [~tgraves] [~irashid], thanks a lot for your comments.

Currently in my design I don't insert a specific stage boundary for different 
resources; the stage boundaries are still the same (by shuffle or by result), so 
{{withResources}} is not an eval() action that triggers a stage. Instead, it 
just adds a resource hint to the RDD.

This means that RDDs with different resource requirements in one stage may 
conflict. For example, in {{rdd1.withResources.mapPartitions \{ xxx 
\}.withResources.mapPartitions \{ xxx \}.collect}}, the resources of rdd1 may be 
different from those of the mapped RDD. Currently the options I can think of are:

1. Always pick the latter, with a warning log saying that multiple different 
resource requirements in one stage are illegal.
2. Fail the stage, with a warning log saying that multiple different resource 
requirements in one stage are illegal.
3. Merge the conflicts by taking the maximum resource needs. For example, if rdd1 
requires 3 GPUs per task and rdd2 requires 4 GPUs per task, then the merged 
requirement would be 4 GPUs per task. (This is the high-level description; the 
details will be per-partition merging.) [chosen]

Take join for example, where the two input RDDs may have different resource 
requirements, and the joined RDD will potentially have yet another resource 
requirement.

For example:

{code}
val rddA = rdd.mapPartitions().withResources
val rddB = rdd.mapPartitions().withResources
val rddC = rddA.join(rddB).withResources
rddC.collect()
{code}

Here these three {{withResources}} calls may have different requirements. Since 
{{rddC}} runs in a different stage, there is no need to merge its resource 
requirements with the others. But {{rddA}} and {{rddB}} run in the same stage as 
different tasks (partitions), so the merging strategy I'm thinking of is 
partition-based: tasks running partitions from {{rddA}} will use the resources 
specified by {{rddA}}, and likewise for {{rddB}}.
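A minimal sketch of option 3 (per-resource max-merge), using illustrative request classes rather than an existing Spark API:

{code:scala}
// Illustrative only: merge conflicting per-task resource hints within one stage
// by taking the maximum amount per resource, e.g. 3 GPUs vs. 4 GPUs -> 4 GPUs.
case class TaskResourceRequest(resourceName: String, amountPerTask: Int)

def mergeRequirements(a: Seq[TaskResourceRequest],
                      b: Seq[TaskResourceRequest]): Seq[TaskResourceRequest] =
  (a ++ b).groupBy(_.resourceName).map { case (name, reqs) =>
    TaskResourceRequest(name, reqs.map(_.amountPerTask).max)
  }.toSeq
{code}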





> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In one cluster, we can always assume that every node has CPUs, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Comment Edited] (SPARK-23622) Flaky Test: HiveClientSuites

2018-07-24 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555041#comment-16555041
 ] 

Yuming Wang edited comment on SPARK-23622 at 7/25/18 2:44 AM:
--

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/
{noformat}
java.lang.RuntimeException: [unresolved dependency: 
com.sun.jersey#jersey-core;1.14: configuration not found in 
com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 
'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 
compile]
{noformat}


{noformat}
sbt.ForkMain$ForkError: java.lang.RuntimeException: [unresolved dependency: 
com.sun.jersey#jersey-core;1.14: configuration not found in 
com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 
'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 
compile]
at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1305)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$2.apply(IsolatedClientLoader.scala:115)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anonfun$2.apply(IsolatedClientLoader.scala:115)
at org.apache.spark.sql.catalyst.util.package$.quietly(package.scala:42)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.downloadVersion(IsolatedClientLoader.scala:114)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.liftedTree1$1(IsolatedClientLoader.scala:74)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader$.forVersion(IsolatedClientLoader.scala:62)
at 
org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:51)
at 
org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
at 
org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
at 
org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:79)
at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
at 
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
at 
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
at 
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
at 
org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
at org.scalatest.Suite$class.run(Suite.scala:1144)
at 
org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}




was (Author: q79969786):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/
{noformat}
java.lang.RuntimeException: [unresolved dependency: 
com.sun.jersey#jersey-core;1.14: configuration not found in 
com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 
'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 
compile]

{noformat}


> Flaky Test: HiveClientSuites
> 
>
> Key: SPARK-23622
> URL: https://issues.apache.org/jira/browse/SPARK-23622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test 
> 

[jira] [Resolved] (SPARK-24891) Fix HandleNullInputsForUDF rule

2018-07-24 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24891.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 2.4.0
   2.3.2

> Fix HandleNullInputsForUDF rule
> ---
>
> Key: SPARK-24891
> URL: https://issues.apache.org/jira/browse/SPARK-24891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> The HandleNullInputsForUDF rule can generate new {{If}} nodes indefinitely, 
> causing problems such as SQL cache lookups missing an otherwise matching plan. 






[jira] [Commented] (SPARK-23622) Flaky Test: HiveClientSuites

2018-07-24 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555041#comment-16555041
 ] 

Yuming Wang commented on SPARK-23622:
-

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/
{noformat}
java.lang.RuntimeException: [unresolved dependency: 
com.sun.jersey#jersey-core;1.14: configuration not found in 
com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 
'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 
compile]

{noformat}


> Flaky Test: HiveClientSuites
> 
>
> Key: SPARK-23622
> URL: https://issues.apache.org/jira/browse/SPARK-23622
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88052/testReport/org.apache.spark.sql.hive.client/HiveClientSuites/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark QA Test 
> (Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/325
> {code}
> Error Message
> java.lang.reflect.InvocationTargetException: null
> Stacktrace
> sbt.ForkMain$ForkError: java.lang.reflect.InvocationTargetException: null
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:270)
>   at 
> org.apache.spark.sql.hive.client.HiveClientBuilder$.buildClient(HiveClientBuilder.scala:58)
>   at 
> org.apache.spark.sql.hive.client.HiveVersionSuite.buildClient(HiveVersionSuite.scala:41)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$init(HiveClientSuite.scala:48)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuite.beforeAll(HiveClientSuite.scala:71)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1210)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1257)
>   at 
> org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1255)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.scalatest.Suite$class.runNestedSuites(Suite.scala:1255)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.runNestedSuites(HiveClientSuites.scala:24)
>   at org.scalatest.Suite$class.run(Suite.scala:1144)
>   at 
> org.apache.spark.sql.hive.client.HiveClientSuites.run(HiveClientSuites.scala:24)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:444)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
>   ... 29 more
> Caused by: sbt.ForkMain$ForkError: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1453)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:63)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"

2018-07-24 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24663:

Comment: was deleted

(was: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/
{noformat}
java.lang.RuntimeException: [unresolved dependency: 
com.sun.jersey#jersey-core;1.14: configuration not found in 
com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 
'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 
compile]
{noformat}
)

> Flaky test: StreamingContextSuite "stop slow receiver gracefully"
> -
>
> Key: SPARK-24663
> URL: https://issues.apache.org/jira/browse/SPARK-24663
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is another test that sometimes fails on our build machines, although I 
> can't find failures on the riselab jenkins servers. Failure looks like:
> {noformat}
> org.scalatest.exceptions.TestFailedException: 0 was not greater than 0
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
> {noformat}
> The test fails in about 2s, while a successful run generally takes 15s. 
> Looking at the logs, the receiver hasn't even started when things fail, which 
> points at a race during test initialization.






[jira] [Updated] (SPARK-24529) Add spotbugs into maven build process

2018-07-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24529:
-
Fix Version/s: (was: 2.4.0)

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool, 
> [spotbugs|https://spotbugs.github.io/], to avoid possible integer overflows in 
> multiplication. Due to tool limitations, some other checks will also be enabled.
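An illustration (not from the Spark codebase) of the integer-overflow-at-multiplication pattern such a check is meant to catch:

{code:scala}
// Illustrative only: Int * Int overflows before the result is widened to Long.
val numRecords: Int = 100000
val recordSizeBytes: Int = 65536

val overflowed: Long = numRecords * recordSizeBytes        // wraps around to a negative value
val correct: Long    = numRecords.toLong * recordSizeBytes // 6553600000L
{code}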






[jira] [Reopened] (SPARK-24529) Add spotbugs into maven build process

2018-07-24 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-24529:
--

This was reverted in favour of https://github.com/apache/spark/pull/21865 and 
SPARK-24895 for now.

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool, 
> [spotbugs|https://spotbugs.github.io/], to avoid possible integer overflows in 
> multiplication. Due to tool limitations, some other checks will also be enabled.






[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555027#comment-16555027
 ] 

Hyukjin Kwon commented on SPARK-24895:
--

Thank you [~yhuai]. I couldn't foresee this problem.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.






[jira] [Commented] (SPARK-24663) Flaky test: StreamingContextSuite "stop slow receiver gracefully"

2018-07-24 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555028#comment-16555028
 ] 

Yuming Wang commented on SPARK-24663:
-

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93521/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93506/testReport/
{noformat}
java.lang.RuntimeException: [unresolved dependency: 
com.sun.jersey#jersey-core;1.14: configuration not found in 
com.sun.jersey#jersey-core;1.14: 'master(compile)'. Missing configuration: 
'compile'. It was required from org.apache.hadoop#hadoop-yarn-common;2.6.5 
compile]
{noformat}


> Flaky test: StreamingContextSuite "stop slow receiver gracefully"
> -
>
> Key: SPARK-24663
> URL: https://issues.apache.org/jira/browse/SPARK-24663
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> This is another test that sometimes fails on our build machines, although I 
> can't find failures on the riselab jenkins servers. Failure looks like:
> {noformat}
> org.scalatest.exceptions.TestFailedException: 0 was not greater than 0
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply$mcV$sp(StreamingContextSuite.scala:356)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
>   at 
> org.apache.spark.streaming.StreamingContextSuite$$anonfun$24.apply(StreamingContextSuite.scala:335)
> {noformat}
> The test fails in about 2s, while a successful run generally takes 15s. 
> Looking at the logs, the receiver hasn't even started when things fail, which 
> points at a race during test initialization.






[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555023#comment-16555023
 ] 

Apache Spark commented on SPARK-24906:
--

User 'habren' has created a pull request for this issue:
https://github.com/apache/spark/pull/21868

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files such as Parquet, when Spark SQL reads a table, each split will be 
> 128 MB by default, since spark.sql.files.maxPartitionBytes defaults to 
> 128 MB. Even when the user sets it to a large value, such as 512 MB, a task may 
> read only a few MB or even hundreds of KB, because the table (Parquet) may 
> consist of dozens of columns while the SQL query needs only a few of them, and 
> Spark will prune the unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns, 20 integer and another 20 long. 
> When a query touches one integer column and one long column, 
> maxPartitionBytes should be 20 times larger: (20*4 + 20*8) / (4+8) = 20. 
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the RM's scheduling pressure.
>  
> Here is the test.
>  
> The table named test2 has more than 40 columns and there is more than 5 TB of 
> data each hour.
> When we issue a very simple query: 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data.
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median amount of data read per task is 44.2 MB. 
> !image-2018-07-24-20-30-24-552.png!
>  






[jira] [Assigned] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24906:


Assignee: (was: Apache Spark)

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files such as Parquet, when Spark SQL reads a table, each split will be 
> 128 MB by default, since spark.sql.files.maxPartitionBytes defaults to 
> 128 MB. Even when the user sets it to a large value, such as 512 MB, a task may 
> read only a few MB or even hundreds of KB, because the table (Parquet) may 
> consist of dozens of columns while the SQL query needs only a few of them, and 
> Spark will prune the unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns, 20 integer and another 20 long. 
> When a query touches one integer column and one long column, 
> maxPartitionBytes should be 20 times larger: (20*4 + 20*8) / (4+8) = 20. 
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the RM's scheduling pressure.
>  
> Here is the test.
>  
> The table named test2 has more than 40 columns and there is more than 5 TB of 
> data each hour.
> When we issue a very simple query: 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data.
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median amount of data read per task is 44.2 MB. 
> !image-2018-07-24-20-30-24-552.png!
>  






[jira] [Assigned] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24906:


Assignee: Apache Spark

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Assignee: Apache Spark
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files such as Parquet, when Spark SQL reads a table, each split will be 
> 128 MB by default, since spark.sql.files.maxPartitionBytes defaults to 
> 128 MB. Even when the user sets it to a large value, such as 512 MB, a task may 
> read only a few MB or even hundreds of KB, because the table (Parquet) may 
> consist of dozens of columns while the SQL query needs only a few of them, and 
> Spark will prune the unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns, 20 integer and another 20 long. 
> When a query touches one integer column and one long column, 
> maxPartitionBytes should be 20 times larger: (20*4 + 20*8) / (4+8) = 20. 
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the RM's scheduling pressure.
>  
> Here is the test.
>  
> The table named test2 has more than 40 columns and there is more than 5 TB of 
> data each hour.
> When we issue a very simple query: 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data.
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median amount of data read per task is 44.2 MB. 
> !image-2018-07-24-20-30-24-552.png!
>  






[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed

2018-07-24 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555002#comment-16555002
 ] 

Sean Owen commented on SPARK-24897:
---

I don't understand what you're reporting. I think you should maybe make 
reference to a specific change you would suggest that addresses this, or else I 
think it would be closed.

> DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for 
> stage fetchFailed
> --
>
> Key: SPARK-24897
> URL: https://issues.apache.org/jira/browse/SPARK-24897
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> In Spark 2.1, when a stage hits a fetch failure, DAGScheduler will retry both 
> this stage and its parent stage. However, when the parent stage is resubmitted and 
> starts running, the map statuses can 
> still be invalidated by the failed stage's outstanding tasks hitting further fetch failures.
> Those outstanding tasks might unregister the map statuses with a new epoch, 
> thus causing 
> the parent stage to hit repeated MetadataFetchFailed errors and finally failing the job.
>  
>  
> {code:java}
> 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 174.0 in stage 71.0 (TID 154127, , executor 96): 
> FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, 
> reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed 
> to connect to /:22409
> 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking 
> ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a 
> fetch failure from ShuffleMapStage 69 ($plus$plus at 
> DeviceLocateMain.scala:236) 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: 
> ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) 
> and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch 
> failure 
> 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 120.0 in stage 71.0 (TID 154073, , executor 286): 
> FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, 
> reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed 
> to connect to /:22409 
> 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) 
> and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch 
> failure 
> 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free 
> 26.7 MB) 
> 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: 
> 3.0 GB) 
> 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: 
> 3.0 GB) 
> 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: 
> Canceling requests for 0 executor containers 
> 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: 
> Expected to find pending requests, but found none.
>  2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 
> GB) 
> 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free 
> 30.7 MB) 
> 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free 
> 34.7 MB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free 
> 38.7 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free 
> 42.5 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> 

[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-07-24 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554990#comment-16554990
 ] 

Yuming Wang commented on SPARK-24911:
-

Can we show non-printable field delimiters in {{SHOW CREATE TABLE}}? I have 
reported this in https://issues.apache.org/jira/browse/SPARK-23058

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.






[jira] [Commented] (SPARK-24897) DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for stage fetchFailed

2018-07-24 Thread liupengcheng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554988#comment-16554988
 ] 

liupengcheng commented on SPARK-24897:
--

[~srowen] Could anybody verify this issue? Thanks a lot.

> DAGScheduler should not unregisterMapOutput and increaseEpoch repeatedly for 
> stage fetchFailed
> --
>
> Key: SPARK-24897
> URL: https://issues.apache.org/jira/browse/SPARK-24897
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> In Spark 2.1, when a stage hits a fetch failure, DAGScheduler will retry both 
> this stage and its parent stage. However, when the parent stage is resubmitted and 
> starts running, the map statuses can 
> still be invalidated by the failed stage's outstanding tasks hitting further fetch failures.
> Those outstanding tasks might unregister the map statuses with a new epoch, 
> thus causing 
> the parent stage to hit repeated MetadataFetchFailed errors and finally failing the job.
>  
>  
> {code:java}
> 2018-07-23,01:52:33,012 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 174.0 in stage 71.0 (TID 154127, , executor 96): 
> FetchFailed(BlockManagerId(4945, , 22409), shuffleId=24, mapId=667, 
> reduceId=174, message= org.apache.spark.shuffle.FetchFailedException: Failed 
> to connect to /:22409
> 2018-07-23,01:52:33,013 INFO org.apache.spark.scheduler.DAGScheduler: Marking 
> ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) as failed due to a 
> fetch failure from ShuffleMapStage 69 ($plus$plus at 
> DeviceLocateMain.scala:236) 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: 
> ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) failed in 246.856 s 
> 2018-07-23,01:52:33,014 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) 
> and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch 
> failure 
> 2018-07-23,01:52:36,004 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 120.0 in stage 71.0 (TID 154073, , executor 286): 
> FetchFailed(BlockManagerId(4208, , 22409), shuffleId=24, mapId=241, 
> reduceId=120, message= org.apache.spark.shuffle.FetchFailedException: Failed 
> to connect to /:22409 
> 2018-07-23,01:52:36,005 INFO org.apache.spark.scheduler.DAGScheduler: 
> Resubmitting ShuffleMapStage 69 ($plus$plus at DeviceLocateMain.scala:236) 
> and ShuffleMapStage 71 (map at DeviceLocateMain.scala:238) due to fetch 
> failure 
> 2018-07-23,01:52:36,017 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece0 stored as bytes in memory (estimated size 4.0 MB, free 
> 26.7 MB) 
> 2018-07-23,01:52:36,025 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_59_piece1 on :52349 in memory (size: 4.0 MB, free: 
> 3.0 GB) 
> 2018-07-23,01:52:36,029 INFO org.apache.spark.storage.BlockManagerInfo: 
> Removed broadcast_61_piece6 on :52349 in memory (size: 4.0 MB, free: 
> 3.0 GB) 
> 2018-07-23,01:52:36,079 INFO org.apache.spark.deploy.yarn.YarnAllocator: 
> Canceling requests for 0 executor containers 
> 2018-07-23,01:52:36,079 WARN org.apache.spark.deploy.yarn.YarnAllocator: 
> Expected to find pending requests, but found none.
>  2018-07-23,01:52:36,094 INFO org.apache.spark.storage.BlockManagerInfo: 
> Added broadcast_63_piece0 in memory on :56780 (size: 4.0 MB, free: 3.7 
> GB) 
> 2018-07-23,01:52:36,095 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece1 stored as bytes in memory (estimated size 4.0 MB, free 
> 30.7 MB) 
> 2018-07-23,01:52:36,107 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece1 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece2 stored as bytes in memory (estimated size 4.0 MB, free 
> 34.7 MB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece2 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,108 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece3 stored as bytes in memory (estimated size 4.0 MB, free 
> 38.7 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece3 in memory on :56780 (size: 4.0 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.MemoryStore: Block 
> broadcast_63_piece4 stored as bytes in memory (estimated size 3.8 MB, free 
> 42.5 MB) 
> 2018-07-23,01:52:36,132 INFO org.apache.spark.storage.BlockManagerInfo: Added 
> broadcast_63_piece4 in memory on :56780 (size: 3.8 MB, free: 3.7 GB) 
> 2018-07-23,01:52:36,132 INFO 

[jira] [Commented] (SPARK-24867) Add AnalysisBarrier to DataFrameWriter

2018-07-24 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554986#comment-16554986
 ] 

Saisai Shao commented on SPARK-24867:
-

[~smilegator] what's the ETA of this issue?

> Add AnalysisBarrier to DataFrameWriter 
> ---
>
> Key: SPARK-24867
> URL: https://issues.apache.org/jira/browse/SPARK-24867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> {code}
>   val udf1 = udf({(x: Int, y: Int) => x + y})
>   val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10))))
>   df.cache()
>   df.write.saveAsTable("t")
>   df.write.saveAsTable("t1")
> {code}
> The cache is not being used because the plans do not match the cached plan. 
> This is a regression caused by the changes we made in AnalysisBarrier, since 
> not all the Analyzer rules are idempotent. We need to fix it for Spark 2.3.2.
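One way to observe the non-idempotency behind this, using the {{df}} and {{spark}} from the repro above and assuming the session's analyzer is accessible from your code (e.g. a test in the org.apache.spark.sql package); this is a diagnostic sketch, not the fix:

{code:scala}
// Illustrative only: re-running the analyzer over an already-analyzed plan should
// be a no-op; with a non-idempotent rule it adds another If wrapper, so the plan
// no longer matches what the cache manager stored.
val analyzed1 = df.queryExecution.analyzed
val analyzed2 = spark.sessionState.analyzer.execute(analyzed1)
val idempotent = analyzed1.sameResult(analyzed2) // false while this bug is present
{code}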






[jira] [Assigned] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24297:
---

Assignee: Imran Rashid

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request that does not use stream-to-disk and sends over 
> 2 GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2 GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.
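For users on existing releases, the value can also be set explicitly; the number below is only an example of a value below 2 GB, not the default chosen by the fix:

{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: cap in-memory remote block fetches below 2 GB so that
// larger blocks are streamed to disk instead.
val conf = new SparkConf()
  .set("spark.maxRemoteBlockSizeFetchToMem", (2000L * 1024 * 1024).toString)
{code}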






[jira] [Resolved] (SPARK-24297) Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB

2018-07-24 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24297.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21474
[https://github.com/apache/spark/pull/21474]

> Change default value for spark.maxRemoteBlockSizeFetchToMem to be < 2GB
> ---
>
> Key: SPARK-24297
> URL: https://issues.apache.org/jira/browse/SPARK-24297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Any network request that does not use stream-to-disk and sends over 
> 2 GB is doomed to fail, so we might as well at least set the default value of 
> spark.maxRemoteBlockSizeFetchToMem to something < 2 GB.
> It probably makes sense to set it to something even lower still, but that 
> might require more careful testing; this is a totally safe first step.






[jira] [Comment Edited] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Jason Guo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554972#comment-16554972
 ] 

Jason Guo edited comment on SPARK-24906 at 7/25/18 1:03 AM:


Thanks [~maropu] and [~viirya] for your comments. Here is my solution for our 
cluster (more than 40 thousand nodes in total and more than 10 thousand nodes 
in a single cluster).

1. When will this be enabled?

There is a configuration item named
{code:java}
spark.sql.parquet.adaptiveFileSplit=false{code}

Only when this is enabled will DataSourceScanExec enlarge maxPartitionBytes. By 
default, this parameter is set to false. (Our ad-hoc queries set it to true.)

With this configuration, users will know that Spark adjusts the partition / 
split size adaptively. If a user does not want this behavior, he or she can 
disable it.

2. How to calculate maxPartitionBytes and openCostInBytes

Each data type has a different length (calculated with DataType.defaultSize). 
First we get the total default size of all the table's columns (henceforth 
referred to as "T"), then the total default size of the columns in 
requiredSchema (henceforth referred to as "R"). The multiplier should be T / R, 
so maxPartitionBytes and openCostInBytes are enlarged by a factor of T / R.
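
A minimal sketch of that multiplier calculation (names are illustrative and this is not the actual patch; it assumes the full table schema and the pruned requiredSchema are available as StructTypes):

{code:scala}
import org.apache.spark.sql.types.StructType

def splitSizeMultiplier(tableSchema: StructType, requiredSchema: StructType): Double = {
  val total    = tableSchema.fields.map(_.dataType.defaultSize.toLong).sum     // "T"
  val required = requiredSchema.fields.map(_.dataType.defaultSize.toLong).sum  // "R"
  if (required == 0) 1.0 else total.toDouble / required                        // T / R
}

// e.g. 20 int columns + 20 long columns, query touching one int and one long:
// (20*4 + 20*8) / (4 + 8) = 20, so maxPartitionBytes would be enlarged 20x.
{code}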

 

 

 


was (Author: habren):
Thanks [~maropu] and [~viirya] for your comments. Here is my solution for our 
cluster (more than 40 thousands nodes in total and more than 10 thousands nodes 
for a single cluster). 

1. When this is enabled

There is a configuration item named 

 
{code:java}
spark.sql.parquet.adaptiveFileSplit=false{code}
 

Only when this is enabled, DataSourceScanExec will enlarge the 
mapPartitionBytes. By default, this parameter is set to false. (Our ad-hoc 
query will set it to true)

 

2. How to calculate maxPartitionBytes and openCostInBytes

Different data type has different length (calculated with 
DataType.defaultSize). First we get the total size of the whole table 
(henceforth referred to as the “T”). Then we get the total size of all the 
requiredSchema (henceforth referred to as the “R”). The multiplier should be T 
/ R  .Then the maxPartitionBytes and openCostInBytes will be enlarge with T / R 
times.

 

 

 

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files, when Spark SQL reads a table, each split will be 128 MB by 
> default since spark.sql.files.maxPartitionBytes defaults to 128 MB. Even when 
> the user sets it to a large value, such as 512 MB, a task may read only a few 
> MB or even hundreds of KB, because the table (Parquet) may consist of dozens 
> of columns while the SQL only needs a few of them, and Spark will prune the 
> unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns: 20 are integer and the other 20 are 
> long. When a query touches one integer column and one long column, 
> maxPartitionBytes should be 20 times larger: (20*4+20*8) / (4+8) = 20. 
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the RM's scheduling pressure.
>  
> Here is the test.
>  
> The table named test2 has more than 40 columns and there are more than 5 TB of 
> data each hour.
> When we issue a very simple query
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes.
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data.
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median read size is 44.2 MB.
> !image-2018-07-24-20-30-24-552.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Jason Guo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554972#comment-16554972
 ] 

Jason Guo commented on SPARK-24906:
---

Thanks [~maropu] and [~viirya] for your comments. Here is my solution for our 
cluster (more than 40 thousand nodes in total and more than 10 thousand nodes 
in a single cluster).

1. When this is enabled

There is a configuration item named
{code:java}
spark.sql.parquet.adaptiveFileSplit=false{code}

Only when this is enabled will DataSourceScanExec enlarge maxPartitionBytes. By 
default, this parameter is set to false. (Our ad-hoc queries set it to true.)

2. How to calculate maxPartitionBytes and openCostInBytes

Each data type has a different length (calculated with DataType.defaultSize). 
First we get the total default size of all the table's columns (henceforth 
referred to as "T"), then the total default size of the columns in 
requiredSchema (henceforth referred to as "R"). The multiplier should be T / R, 
so maxPartitionBytes and openCostInBytes are enlarged by a factor of T / R.

 

 

 

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files, when Spark SQL reads a table, each split will be 128 MB by 
> default since spark.sql.files.maxPartitionBytes defaults to 128 MB. Even when 
> the user sets it to a large value, such as 512 MB, a task may read only a few 
> MB or even hundreds of KB, because the table (Parquet) may consist of dozens 
> of columns while the SQL only needs a few of them, and Spark will prune the 
> unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns: 20 are integer and the other 20 are 
> long. When a query touches one integer column and one long column, 
> maxPartitionBytes should be 20 times larger: (20*4+20*8) / (4+8) = 20. 
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the RM's scheduling pressure.
>  
> Here is the test.
>  
> The table named test2 has more than 40 columns and there are more than 5 TB of 
> data each hour.
> When we issue a very simple query
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes.
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data.
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median read size is 44.2 MB.
> !image-2018-07-24-20-30-24-552.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554932#comment-16554932
 ] 

Yin Huai commented on SPARK-24895:
--

[~hyukjin.kwon] [~kiszk] seems this revert indeed fixed the problem :)

 

 

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache 
> Maven repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24914) totalSize is not a good estimate for broadcast joins

2018-07-24 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-24914:
-

 Summary: totalSize is not a good estimate for broadcast joins
 Key: SPARK-24914
 URL: https://issues.apache.org/jira/browse/SPARK-24914
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Bruce Robbins


When determining whether to do a broadcast join, Spark estimates the size of 
the smaller table as follows:
 - if totalSize is defined and greater than 0, use it.
 - else, if rawDataSize is defined and greater than 0, use it
 - else, use spark.sql.defaultSizeInBytes (default: Long.MaxValue)

Therefore, Spark prefers totalSize over rawDataSize.

Unfortunately, totalSize is often quite a bit smaller than the actual table 
size, since it represents the size of the table's files on disk. Parquet and 
Orc files, for example, are encoded and compressed. This can result in the JVM 
throwing an OutOfMemoryError while Spark is loading the table into a 
HashedRelation, or when Spark actually attempts to broadcast the data.

On the other hand, rawDataSize represents the uncompressed size of the dataset, 
according to Hive documentation. This seems like a pretty good number to use in 
preference to totalSize. However, due to HIVE-20079, this value is simply 
#columns * #rows. Once that bug is fixed, it may be a superior statistic, at 
least for managed tables.

In the meantime, we could apply a configurable "fudge factor" to totalSize, at 
least for types of files that are encoded and compressed. Hive has the setting 
hive.stats.deserialization.factor, which defaults to 1.0, and is described as 
follows:
{quote}in the absence of uncompressed/raw data size, total file size will be 
used for statistics annotation. But the file may be compressed, encoded and 
serialized which may be lesser in size than the actual uncompressed/raw data 
size. This factor will be multiplied to file size to estimate the raw data size.
{quote}
In addition to the fudge factor, we could compare the adjusted totalSize to 
rawDataSize and use the bigger of the two:
{noformat}
size1 = totalSize.isDefined ? totalSize * fudgeFactor : Long.MAX_VALUE
size2 = rawDataSize.isDefined ? rawDataSize : Long.MAX_VALUE
size = max(size1, size2)
{noformat}
Caveat: This mitigates the issue only for Hive tables. It does not help much 
when the user is reading files using {{spark.read.parquet}}, unless we apply 
the same fudge factor there.
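
A sketch of that combination in Scala (illustrative only, not the proposed patch; the fudge-factor value would come from a new configuration):

{code:scala}
// Fall back to Long.MaxValue when a statistic is missing, matching Spark's
// conservative default of not broadcasting tables of unknown size.
def estimateTableSize(totalSize: Option[Long],
                      rawDataSize: Option[Long],
                      fudgeFactor: Double): Long = {
  val size1 = totalSize.map(s => (s * fudgeFactor).toLong).getOrElse(Long.MaxValue)
  val size2 = rawDataSize.getOrElse(Long.MaxValue)
  math.max(size1, size2)
}
{code}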



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24778) DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed

2018-07-24 Thread Vinitha Reddy Gankidi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554907#comment-16554907
 ] 

Vinitha Reddy Gankidi commented on SPARK-24778:
---

[~maropu] Sorry, I missed seeing this comment. Failing the query when the 
timezone is not supported works too. In some cases, Spark returns NULL when 
the inputs are incorrect. For instance, {{to_date('2018-02-31')}} returns NULL 
instead of throwing an error. 

 

> DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed
> --
>
> Key: SPARK-24778
> URL: https://issues.apache.org/jira/browse/SPARK-24778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Vinitha Reddy Gankidi
>Priority: Major
>
> {{DateTimeUtils.getTimeZone}} calls Java's {{TimeZone.getTimeZone}} method, 
> which defaults to GMT if the timezone cannot be parsed. This can be misleading 
> for users, and it's better to return NULL instead of returning an incorrect 
> value. 
> To reproduce: {{from_utc_timestamp}} is one of the functions that calls 
> {{DateTimeUtils.getTimeZone}}. Session timezone is GMT for the following 
> queries.
> {code:java}
> SELECT from_utc_timestamp('2018-07-10 12:00:00', 'GMT+05:00') -> 2018-07-10 
> 17:00:00 
> SELECT from_utc_timestamp('2018-07-10 12:00:00', '+05:00') -> 2018-07-10 
> 12:00:00 (Defaults to GMT as the timezone is not recognized){code}
> We could fix it by using the workaround mentioned here: 
> [https://bugs.openjdk.java.net/browse/JDK-4412864].
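
A minimal sketch of the kind of workaround the JDK ticket suggests (illustrative; the helper name is made up and this is not the actual Spark change): detect the silent GMT fallback and treat the ID as invalid.

{code:scala}
import java.util.TimeZone

// TimeZone.getTimeZone silently returns GMT for IDs it cannot parse, so only
// trust the result if the input was literally "GMT" or resolved to something else.
def getTimeZoneOrNull(id: String): TimeZone = {
  val tz = TimeZone.getTimeZone(id)
  if (tz.getID == "GMT" && id != "GMT") null else tz
}

// getTimeZoneOrNull("GMT+05:00")  // returns a +05:00 zone
// getTimeZoneOrNull("+05:00")     // returns null instead of silently using GMT
{code}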



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24913) Make `AssertTrue` and `AssertNotNull` non-deterministic

2018-07-24 Thread DB Tsai (JIRA)
DB Tsai created SPARK-24913:
---

 Summary: Make `AssertTrue` and `AssertNotNull` non-deterministic
 Key: SPARK-24913
 URL: https://issues.apache.org/jira/browse/SPARK-24913
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: DB Tsai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24908) [R] remove spaces to make lintr happy

2018-07-24 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24908.
-
  Resolution: Fixed
Target Version/s: 2.4.0

> [R] remove spaces to make lintr happy
> -
>
> Key: SPARK-24908
> URL: https://issues.apache.org/jira/browse/SPARK-24908
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Critical
>
> during my travails in porting spark builds to run on our centos worker, i 
> managed to recreate (as best i could) the centos environment on our new 
> ubuntu-testing machine.
> while running my initial builds, lintr was crashing on some extraneous spaces 
> in test_basic.R (see:  
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]
> after removing those spaces, the ubuntu build happily passed the lintr tests.
> i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
> (see 
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
>  which scp'ed a copy of test_basic.R in to the repo after the git clone.  
> everything seems to be working happily.
> i will be creating a pull request for this now.
>  
> attn:  [~felixcheung] [~shivaram] [~ifilonenko]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-24895.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

[https://github.com/apache/spark/pull/21865] has been merged.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache 
> Maven repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24912) Broadcast join OutOfMemory stack trace obscures actual cause of OOM

2018-07-24 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-24912:
--
Priority: Minor  (was: Major)

> Broadcast join OutOfMemory stack trace obscures actual cause of OOM
> ---
>
> Key: SPARK-24912
> URL: https://issues.apache.org/jira/browse/SPARK-24912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When the Spark driver suffers an OutOfMemoryError while attempting to 
> broadcast a table for a broadcast join, the resulting stack trace obscures 
> the actual cause of the OOM. For example:
> {noformat}
> [GC (Allocation Failure)  585453K->585453K(928768K), 0.0060025 secs]
> [Full GC (Allocation Failure)  585453K->582524K(928768K), 0.4019639 secs]
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid12446.hprof ...
> Heap dump file created [632701033 bytes in 1.016 secs]
> Exception in thread "main" java.lang.OutOfMemoryError: Not enough memory to 
> build and broadcast the table to all worker nodes. As a workaround, you can 
> either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to 
> -1 or increase the spark driver memory by setting spark.driver.memory to a 
> higher value
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 30
> 18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 35
> {noformat}
> The above stack trace blames BroadcastExchangeExec. However, the given line 
> is actually where the original OutOfMemoryError was caught and a new one was 
> created and wrapped by a SparkException. The actual location where the OOM 
> occurred was in LongToUnsafeRowMap#grow, at this line:
> {noformat}
> val newPage = new Array[Long](newNumWords.toInt)
> {noformat}
> Sometimes it is helpful to know the actual location from which an OOM is 
> thrown. In the above case, the location indicated that Spark underestimated 
> the size of a large-ish table and ran out of memory trying to load it into 
> memory.
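
One way to keep that information would be to attach the original error as the cause when re-throwing (a sketch under stated assumptions: {{buildHashedRelation()}} is a made-up stand-in for the broadcast build step, and this is not the actual patch):

{code:scala}
import org.apache.spark.SparkException

// Stand-in for the step that runs out of memory while building the relation.
def buildHashedRelation(): Unit = throw new OutOfMemoryError("Java heap space")

try {
  buildHashedRelation()
} catch {
  case oe: OutOfMemoryError =>
    // Wrapping with the original as the cause keeps the real allocation site
    // (e.g. LongToUnsafeRowMap#grow) visible in the reported stack trace.
    throw new SparkException(
      "Not enough memory to build and broadcast the table to all worker nodes. " +
      "As a workaround, you can either disable broadcast by setting " +
      "spark.sql.autoBroadcastJoinThreshold to -1 or increase spark.driver.memory.",
      oe)
}
{code}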



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554839#comment-16554839
 ] 

Apache Spark commented on SPARK-24911:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21803

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24911:


Assignee: Apache Spark

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24911:


Assignee: (was: Apache Spark)

> SHOW CREATE TABLE drops escaping of nested column names
> ---
>
> Key: SPARK-24911
> URL: https://issues.apache.org/jira/browse/SPARK-24911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> Create a table with quoted nested column - *`b`*:
> {code:sql}
> create table `test` (`a` STRUCT<`b`:STRING>);
> {code}
> and show how the table was created:
> {code:sql}
> SHOW CREATE TABLE `test`
> {code}
> {code}
> CREATE TABLE `test`(`a` struct<b:string>)
> {code}
> The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24912) Broadcast join OutOfMemory stack trace obscures actual cause of OOM

2018-07-24 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-24912:
-

 Summary: Broadcast join OutOfMemory stack trace obscures actual 
cause of OOM
 Key: SPARK-24912
 URL: https://issues.apache.org/jira/browse/SPARK-24912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Bruce Robbins


When the Spark driver suffers an OutOfMemoryError while attempting to broadcast 
a table for a broadcast join, the resulting stack trace obscures the actual 
cause of the OOM. For example:
{noformat}
[GC (Allocation Failure)  585453K->585453K(928768K), 0.0060025 secs]
[Full GC (Allocation Failure)  585453K->582524K(928768K), 0.4019639 secs]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid12446.hprof ...
Heap dump file created [632701033 bytes in 1.016 secs]
Exception in thread "main" java.lang.OutOfMemoryError: Not enough memory to 
build and broadcast the table to all worker nodes. As a workaround, you can 
either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 
or increase the spark driver memory by setting spark.driver.memory to a higher 
value
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:101)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:98)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 30
18/07/24 14:29:58 INFO ContextCleaner: Cleaned accumulator 35
{noformat}
The above stack trace blames BroadcastExchangeExec. However, the given line is 
actually where the original OutOfMemoryError was caught and a new one was 
created and wrapped by a SparkException. The actual location where the OOM 
occurred was in LongToUnsafeRowMap#grow, at this line:
{noformat}
val newPage = new Array[Long](newNumWords.toInt)
{noformat}
Sometimes it is helpful to know the actual location from which an OOM is 
thrown. In the above case, the location indicated that Spark underestimated the 
size of a large-ish table and ran out of memory trying to load it into memory.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24911) SHOW CREATE TABLE drops escaping of nested column names

2018-07-24 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-24911:
--

 Summary: SHOW CREATE TABLE drops escaping of nested column names
 Key: SPARK-24911
 URL: https://issues.apache.org/jira/browse/SPARK-24911
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maxim Gekk


Create a table with quoted nested column - *`b`*:
{code:sql}
create table `test` (`a` STRUCT<`b`:STRING>);
{code}
and show how the table was created:
{code:sql}
SHOW CREATE TABLE `test`
{code}
{code}
CREATE TABLE `test`(`a` struct<b:string>)
{code}

The column *b* becomes unquoted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24910) Spark Bloom Filter Closure Serialization improvement for very high volume of Data

2018-07-24 Thread Himangshu Ranjan Borah (JIRA)
Himangshu Ranjan Borah created SPARK-24910:
--

 Summary: Spark Bloom Filter Closure Serialization improvement for 
very high volume of Data
 Key: SPARK-24910
 URL: https://issues.apache.org/jira/browse/SPARK-24910
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.3.1
Reporter: Himangshu Ranjan Borah


I am proposing an improvement to the Bloom filter generation logic used in the 
DataFrameStatFunctions Bloom filter API: use mapPartitions() instead of 
aggregate() to avoid closure serialization, which fails for huge BitArrays.

Spark's stat functions' Bloom filter implementation uses 
aggregate/treeAggregate operations, which use a closure with a dependency on 
the Bloom filter that is created in the driver. Since Spark hard-codes the 
closure serializer to the Java serializer, it fails in closure cleanup for very 
large Bloom filters (typically with num items ~ billions and fpp ~ 0.001). The 
Kryo serializer works fine at such a scale, but there were apparently issues 
using Kryo for closure serialization, which is why Spark 2.0 hard-coded it to 
Java. The call stack we typically get looks like:

{noformat}
java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)
        at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1092)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.fold(RDD.scala:1086)
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1155)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)
        at org.apache.spark.sql.DataFrameStatFunctions.buildBloomFilter(DataFrameStatFunctions.scala:554)
        at org.apache.spark.sql.DataFrameStatFunctions.bloomFilter(DataFrameStatFunctions.scala:505)
{noformat}

This issue can be overcome if we *don't* use the *aggregate()* operations for 
the Bloom Filter generation and use *mapPartitions()* kind of operations where 
we create the 
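
(The message is truncated here in the archive.) A minimal sketch of the mapPartitions-based direction it describes, under the assumption that Spark's public {{org.apache.spark.util.sketch.BloomFilter}} API is used; the function and column handling are illustrative, not the proposed patch:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.util.sketch.BloomFilter

// Each partition builds its own filter, so no driver-side filter is captured
// in a task closure; the partial filters are then merged with treeReduce.
def bloomFilterViaMapPartitions(df: DataFrame, col: String,
                                expectedNumItems: Long, fpp: Double): BloomFilter = {
  val idx = df.schema.fieldIndex(col)
  df.rdd.mapPartitions { rows =>
    val bf = BloomFilter.create(expectedNumItems, fpp)
    rows.foreach(r => if (!r.isNullAt(idx)) bf.putString(r.get(idx).toString))
    Iterator.single(bf)
  }.treeReduce((a, b) => a.mergeInPlace(b))
}
{code}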

[jira] [Commented] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-07-24 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554803#comment-16554803
 ] 

Thomas Graves commented on SPARK-24909:
---

I haven't come up with a fix yet but have been looking at essentially all the 
things you have mentioned. I will continue working on it, except I'm out 
tomorrow, so I will continue Thursday.

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
> all the tasks in the tasks sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]
> It never creates new task attempts in the task scheduler but the dag 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started.  The DAG scheduler says "Ignoring 
> possibly bogus".. but in the TaskSetManager side it has marked those tasks as 
> completed for all stage attempts. The DAGScheduler gets hung here.  I did a 
> heap dump on the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list but the tasksetmanagers are all complete
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24307) Support sending messages over 2GB from memory

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554784#comment-16554784
 ] 

Apache Spark commented on SPARK-24307:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21867

> Support sending messages over 2GB from memory
> -
>
> Key: SPARK-24307
> URL: https://issues.apache.org/jira/browse/SPARK-24307
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark's networking layer supports sending messages backed by a {{FileRegion}} 
> or a {{ByteBuf}}.  Sending large FileRegions works, as netty supports large 
> FileRegions.   However, {{ByteBuf}} is limited to 2GB.  This is particularly 
> a problem for sending large datasets that are already in memory, eg.  cached 
> RDD blocks.
> eg. if you try to replicate a block stored in memory that is over 2 GB, you 
> will see an exception like:
> {noformat}
> 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC 
> 7420542363232096629 to xyz.com/172.31.113.213:44358: 
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: 
> readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= 
> writerIndex <= capacity(-1294617291))
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816)
> at 
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723)
> at 
> io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
> at 
> io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
> at 
> io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: 
> -1294617291 (expected: 0 <= readerIndex <= writerIndex <= 
> capacity(-1294617291))
> at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129)
> at 
> io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688)
> at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110)
> at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359)
> at 
> org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87)
> at 
> org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95)
> at 
> org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58)
> at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
> at 
> io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
> ... 17 more
> {noformat}
> A simple solution to this is to create a "FileRegion" which is backed by a 
> {{ChunkedByteBuffer}} (spark's existing datastructure to support 

[jira] [Commented] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-07-24 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554748#comment-16554748
 ] 

Imran Rashid commented on SPARK-24909:
--

ugh, yeah I think you're right.  do you have a fix in mind?

we could undo SPARK-23433, so that active taskset doesn't care if an earlier 
taskset completes the tasks it has (and rethink how to solve that case).

Or we could have the markPartitionCompletedInAllTaskSets take the epoch into 
account.

Or, we could change that condition in the dagscheduler where its getting hung 
-- it could resubmit the taskset if it thinks there is still work to be done, 
but all tasksets are complete (which might be a more stable approach in 
general), rather than only doing it if 
{{shuffleStage.pendingPartitions.isEmpty}}.

All of these changes always have a ton of corner cases and past bugs I need to 
remind myself of, though ...

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
> all the tasks in the tasks sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]
> It never creates new task attempts in the task scheduler but the dag 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started.  The DAG scheduler says "Ignoring 
> possibly bogus".. but in the TaskSetManager side it has marked those tasks as 
> completed for all stage attempts. The DAGScheduler gets hung here.  I did a 
> heap dump on the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list but the tasksetmanagers are all complete
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-24 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554742#comment-16554742
 ] 

Imran Rashid commented on SPARK-24615:
--

hi, just catching up here -- agree with a lot of Tom's concerns.  There are a 
lot of corner cases we'd need to think about.  Eg., you mentioned joins, that 
in general they have stage boundaries -- but then again you might in some cases 
have a shared partitioner, which would remove the stage boundary.  And then 
there are other cases like {{zip()}}.

What if you had two rdds in one stage with conflicting requirements?  Eg. they 
request different accelerator types, and no node in the cluster has both 
available?  Would you fail?  Or introduce a stage boundary?  I think failing is 
fine, it would be OK if we require the user to put in their own boundary there 
(eg. by writing to hdfs) just want to think it through.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Spark's current scheduler schedules tasks based on data locality plus the 
> availability of CPUs. This will introduce some problems when scheduling tasks 
> that require accelerators.
>  # CPU cores usually outnumber accelerators on a node, so using CPU cores to 
> schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that every node has CPUs, but this is 
> not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smarter way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-07-24 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24909:
--
Description: 
The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
all the tasks in the tasks sets are marked as completed. 
([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]

It never creates new task attempts in the task scheduler but the dag scheduler 
still has pendingPartitions.
{code:java}
8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 
bytes)

18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
(repartition at Lift.scala:191) as failed due to a fetch failure from 
ShuffleMapStage 42 (map at foo.scala:27)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due 
to fetch failure


18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for executor: 
33 (epoch 18)

18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
parents
18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 
59955 tasks

18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)

8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
ShuffleMapTask(44, 55769) completion from executor 33{code}
 

 

In the logs above you will see that task 55769.0 finished after the executor 
was lost and a new task set was started.  The DAG scheduler says "Ignoring 
possibly bogus".. but in the TaskSetManager side it has marked those tasks as 
completed for all stage attempts. The DAGScheduler gets hung here.  I did a 
heap dump on the process and can see that 55769 is still in the DAGScheduler 
pendingPartitions list but the tasksetmanagers are all complete

 

  was:
The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
all the tasks in the tasks sets are marked as completed. 
([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]

It never creates new task attempts in the task scheduler but the dag scheduler 
still has pendingPartitions.

18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 
bytes)

18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
(repartition at Lift.scala:191) as failed due to a fetch failure from 
ShuffleMapStage 42 (map at foo.scala:27)
 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due 
to fetch failure
 

18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for executor: 
33 (epoch 18)

18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
parents
 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 
59955 tasks

18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)

8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
ShuffleMapTask(44, 55769) completion from executor 33

 

In the logs above you can see that task 55769.0 finished after the executor was 
lost and a new task set had been started. The DAGScheduler says "Ignoring 
possibly bogus", but on the TaskSetManager side those tasks have been marked as 
completed for all stage attempts. The DAGScheduler then hangs: a heap dump of 
the process shows that partition 55769 is still in the DAGScheduler 
pendingPartitions list while the TaskSetManagers are all complete.

 


> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost 

[jira] [Updated] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-07-24 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24909:
--
Summary: Spark scheduler can hang when fetch failures, executor lost, task 
running on lost executor, and multiple stage attempts  (was: Spark scheduler 
can hang with fetch failures and executor lost and multiple stage attempts)

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
> all the tasks in the task sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265])
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
>  18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
>  
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
>  18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
>  18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33
>  
> In the logs above you can see that task 55769.0 finished after the executor 
> was lost and a new task set had been started. The DAGScheduler says "Ignoring 
> possibly bogus", but on the TaskSetManager side those tasks have been marked 
> as completed for all stage attempts. The DAGScheduler then hangs: a heap dump 
> of the process shows that partition 55769 is still in the DAGScheduler 
> pendingPartitions list while the TaskSetManagers are all complete.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24768) Have a built-in AVRO data source implementation

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554715#comment-16554715
 ] 

Apache Spark commented on SPARK-24768:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21866

> Have a built-in AVRO data source implementation
> ---
>
> Key: SPARK-24768
> URL: https://issues.apache.org/jira/browse/SPARK-24768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format. 
> It is widely used in the Spark and Hadoop ecosystem, especially for 
> Kafka-based data pipelines. Using the external package 
> [https://github.com/databricks/spark-avro], Spark SQL can read and write 
> Avro data. Making spark-avro built-in can provide a better experience for 
> first-time users of Spark SQL and Structured Streaming, and we expect the 
> built-in Avro data source to further improve the adoption of Structured 
> Streaming. The proposal is to inline the code from the spark-avro package 
> ([https://github.com/databricks/spark-avro]). The target release is Spark 2.4.
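For reference, reading and writing Avro from Spark SQL looks roughly like the
sketch below. The short format name "avro" (built-in source) versus
"com.databricks.spark.avro" (external package), and the input/output paths, are
assumptions for illustration rather than anything specified in this ticket.

{code:scala}
import org.apache.spark.sql.SparkSession

object AvroRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avro-example").getOrCreate()

    // Read Avro files; "avro" is the short name the built-in source is expected
    // to register, while the external package uses "com.databricks.spark.avro".
    val df = spark.read.format("avro").load("/path/to/input")

    // Write the same data back out as Avro.
    df.write.format("avro").save("/path/to/output")

    spark.stop()
  }
}
{code}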



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554716#comment-16554716
 ] 

Liang-Chi Hsieh commented on SPARK-24906:
-

A {{maxPartitionBytes}} value that is adapted improperly might hurt performance, 
and in that case users may not know what has happened. Do we have a general 
rule for calculating a proper {{maxPartitionBytes}}?
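One candidate rule is to scale the configured value by the ratio of the full
row width to the width of the columns the query actually reads, which is where
the 20x factor in the issue description comes from. The sketch below is a
hypothetical helper to illustrate that arithmetic, not Spark's actual
DataSourceScanExec logic.

{code:scala}
// Hypothetical helper: scale maxPartitionBytes by how much of each row the
// query actually reads, so column-pruned scans still pull enough bytes per task.
object AdaptiveSplitSize {
  def adjustedMaxPartitionBytes(
      maxPartitionBytes: Long,      // e.g. spark.sql.files.maxPartitionBytes, 128 MB by default
      totalRowWidthBytes: Long,     // estimated width of all columns in the file
      selectedRowWidthBytes: Long   // estimated width of only the columns being read
  ): Long = {
    require(selectedRowWidthBytes > 0, "selected column width must be positive")
    val ratio = totalRowWidthBytes.toDouble / selectedRowWidthBytes
    (maxPartitionBytes * ratio).toLong
  }

  def main(args: Array[String]): Unit = {
    // The example from the description: 20 int + 20 long columns, query reads
    // one of each, so (20*4 + 20*8) / (4 + 8) = 20 and 128 MB becomes 2560 MB.
    val adjusted = adjustedMaxPartitionBytes(
      maxPartitionBytes = 128L * 1024 * 1024,
      totalRowWidthBytes = 20 * 4 + 20 * 8,
      selectedRowWidthBytes = 4 + 8)
    println(s"adjusted maxPartitionBytes = $adjusted")
  }
}
{code}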

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files, when Spark SQL reads a table, each split will be 128 MB 
> by default since spark.sql.files.maxPartitionBytes defaults to 128 MB. Even 
> when the user sets it to a large value, such as 512 MB, a task may read only 
> a few MB or even hundreds of KB, because the table (Parquet) may consist of 
> dozens of columns while the SQL only needs a few of them, and Spark will 
> prune the unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, suppose there are 40 columns: 20 are integer and the other 20 
> are long. When a query reads one integer column and one long column, 
> maxPartitionBytes should be 20 times larger, since (20*4 + 20*8) / (4 + 8) = 20. 
>  
> With this optimization, the number of tasks will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more than 10 thousand 
> nodes), it will relieve the RM's scheduling pressure.
>  
> Here is the test
>  
> The table named test2 has more than 40 columns and there are more than 5 TB 
> of data each hour.
> When we issue a very simple query 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB of data.
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30 
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median amount of data read is 44.2 MB. 
> !image-2018-07-24-20-30-24-552.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts

2018-07-24 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554701#comment-16554701
 ] 

Thomas Graves commented on SPARK-24909:
---

Note this may have been introduced as part of SPARK-23433, where we now mark all 
the partitions in stage attempts as completed. If the task that completed had 
been in the latest attempt, it would have been removed from pendingPartitions. 
If we didn't have SPARK-23433, then I think stage attempt 44.1 would have 
gone on to run the task again.

 

cc [~irashid]
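To make the interaction concrete, here is a heavily simplified model of the two
code paths involved; it is illustrative only and not the actual
DAGScheduler/TaskSetManager code. The hang comes from the partition being
treated as successful on the TaskSetManager side while never being removed from
the DAGScheduler's pendingPartitions.

{code:scala}
import scala.collection.mutable

// Illustrative model of the race: a ShuffleMapTask finishes on an executor that
// the DAGScheduler already considers lost, while a newer stage attempt is running.
object SchedulerHangSketch {
  val pendingPartitions = mutable.HashSet[Int]()   // DAGScheduler side
  val tasksetSuccessful = mutable.HashSet[Int]()   // TaskSetManager side (all attempts)
  var currentStageAttempt = 1                      // 44.1 in the logs

  def onTaskCompletion(partition: Int, stageAttempt: Int, executorLost: Boolean): Unit = {
    // TaskSetManager marks the partition successful in every active attempt.
    tasksetSuccessful += partition

    if (executorLost && stageAttempt < currentStageAttempt) {
      // DAGScheduler: "Ignoring possibly bogus ShuffleMapTask completion" --
      // the map output on the lost executor is gone, so the completion is not
      // registered and the partition stays in pendingPartitions.
    } else {
      pendingPartitions -= partition
    }
  }

  def main(args: Array[String]): Unit = {
    pendingPartitions += 55769
    onTaskCompletion(partition = 55769, stageAttempt = 0, executorLost = true)
    // The TaskSetManager never resubmits 55769 (it is already "successful"),
    // but the DAGScheduler still waits for it: the stage can never complete.
    println(s"pending=$pendingPartitions successful=$tasksetSuccessful")
  }
}
{code}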

> Spark scheduler can hang with fetch failures and executor lost and multiple 
> stage attempts
> --
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
> all the tasks in the task sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265])
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
>  18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
>  
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
>  18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
>  18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33
>  
> In the logs above you can see that task 55769.0 finished after the executor 
> was lost and a new task set had been started. The DAGScheduler says "Ignoring 
> possibly bogus", but on the TaskSetManager side those tasks have been marked 
> as completed for all stage attempts. The DAGScheduler then hangs: a heap dump 
> of the process shows that partition 55769 is still in the DAGScheduler 
> pendingPartitions list while the TaskSetManagers are all complete.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts

2018-07-24 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24909:
--
Description: 
The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
all the tasks in the task sets are marked as completed. 
([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265])

It never creates new task attempts in the task scheduler, but the DAG scheduler 
still has pendingPartitions.

18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 
bytes)

18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
(repartition at Lift.scala:191) as failed due to a fetch failure from 
ShuffleMapStage 42 (map at foo.scala:27)
 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due 
to fetch failure
 

18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 
33 (epoch 18)

18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
parents
 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 
59955 tasks

18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)

18/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
ShuffleMapTask(44, 55769) completion from executor 33

 

In the logs above you can see that task 55769.0 finished after the executor was 
lost and a new task set had been started. The DAGScheduler says "Ignoring 
possibly bogus", but on the TaskSetManager side those tasks have been marked as 
completed for all stage attempts. The DAGScheduler then hangs: a heap dump of 
the process shows that partition 55769 is still in the DAGScheduler 
pendingPartitions list while the TaskSetManagers are all complete.

 

  was:
The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
all the tasks in the task sets are marked as completed. 
([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265])

It never creates new task attempts in the task scheduler, but the DAG scheduler 
still has pendingPartitions.

18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 
bytes)

18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
(repartition at Lift.scala:191) as failed due to a fetch failure from 
ShuffleMapStage 42 (map at foo.scala:27)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due 
to fetch failure


18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
18/07/22 08:30:56 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 
33 (epoch 18)

18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
parents
18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 
59955 tasks

18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)


18/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
ShuffleMapTask(44, 55769) completion from executor 33

 

In the logs above you can see that task 55769.0 finished after the executor was 
lost and a new task set had been started. The DAGScheduler says "Ignoring 
possibly bogus", but on the TaskSetManager side those tasks have been marked as 
completed for all stage attempts.

 


> Spark scheduler can hang with fetch failures and executor lost and multiple 
> stage attempts
> --
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
> all the tasks in the task sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265])
> It never creates new task attempts in the task 

[jira] [Created] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts

2018-07-24 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-24909:
-

 Summary: Spark scheduler can hang with fetch failures and executor 
lost and multiple stage attempts
 Key: SPARK-24909
 URL: https://issues.apache.org/jira/browse/SPARK-24909
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.3.1
Reporter: Thomas Graves


The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
all the tasks in the task sets are marked as completed. 
([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265])

It never creates new task attempts in the task scheduler, but the DAG scheduler 
still has pendingPartitions.

18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in stage 
44.0 (TID 970752, host1.com, executor 33, partition 55769, PROCESS_LOCAL, 7874 
bytes)

18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
(repartition at Lift.scala:191) as failed due to a fetch failure from 
ShuffleMapStage 42 (map at foo.scala:27)
18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 42 
(map at foo.scala:27) and ShuffleMapStage 44 (repartition at bar.scala:191) due 
to fetch failure


18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
18/07/22 08:30:56 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 
33 (epoch 18)

18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
(MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
parents
18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 with 
59955 tasks

18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in stage 
44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)


18/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
ShuffleMapTask(44, 55769) completion from executor 33

 

In the logs above you can see that task 55769.0 finished after the executor was 
lost and a new task set had been started. The DAGScheduler says "Ignoring 
possibly bogus", but on the TaskSetManager side those tasks have been marked as 
completed for all stage attempts.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24909) Spark scheduler can hang with fetch failures and executor lost and multiple stage attempts

2018-07-24 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24909:
--
Priority: Critical  (was: Major)

> Spark scheduler can hang with fetch failures and executor lost and multiple 
> stage attempts
> --
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Priority: Critical
>
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
> all the tasks in the task sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265])
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> 18/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33
>  
> In the logs above you can see that task 55769.0 finished after the executor 
> was lost and a new task set had been started. The DAGScheduler says "Ignoring 
> possibly bogus", but on the TaskSetManager side those tasks have been marked 
> as completed for all stage attempts.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-24 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24812.
-
   Resolution: Fixed
 Assignee: Sujith
Fix Version/s: 2.4.0

> Last Access Time in the table description is not valid
> --
>
> Key: SPARK-24812
> URL: https://issues.apache.org/jira/browse/SPARK-24812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Sujith
>Assignee: Sujith
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: image-2018-07-16-15-37-28-896.png, 
> image-2018-07-16-15-38-26-717.png
>
>
> Last Access Time in the table description is not valid. 
> Test steps:
> Step 1 - create a table
> Step 2 - Run command "DESC FORMATTED table"
> Last Access Time will always be displayed as a wrong date:
> Wed Dec 31 15:59:59 PST 1969.
> !image-2018-07-16-15-37-28-896.png!
> In Hive it is displayed as "UNKNOWN", which makes more sense than displaying 
> a wrong date.
> Please find the snapshot tested in Hive for the same command: 
> !image-2018-07-16-15-38-26-717.png!
>  
> This seems to be a limitation as of now; it would be better to follow the 
> Hive behavior in this scenario.
>  
>  
>  
>  
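A minimal repro sketch of the steps above; the table name is hypothetical and a
SparkSession with Hive support is assumed:

{code:scala}
// Assumes an existing SparkSession `spark` with Hive support enabled.
spark.sql("CREATE TABLE IF NOT EXISTS t1 (id INT)")
// The "Last Access" row shows Wed Dec 31 15:59:59 PST 1969 instead of
// something meaningful like UNKNOWN.
spark.sql("DESC FORMATTED t1").show(100, truncate = false)
{code}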



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-24895:
--

Assignee: Eric Chang

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
>   org.apache.spark
>   spark-mllib-local_2.11
>   2.4.0-SNAPSHOT
>   
> 
>   20180723.232411
>   177
> 
> 20180723232411
> 
>   
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
>   
>   
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
>   
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
>   
> 
>   
> 
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Cheng Lian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-24895:
---
Description: 
Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache 
Maven repo have mismatched filenames:
{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
{noformat}
 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}
<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.

  was:
Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache 
Maven repo have mismatched filenames:

{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) on project spark_2.4: Execution 
enforce-banned-dependencies of goal 
org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: Could 
not resolve following dependencies: 
[org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
resolve dependencies for project com.databricks:spark_2.4:pom:1: The following 
artifacts could not be resolved: 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find artifact 
org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
{code}
 

If you check the artifact metadata you will see the pom and jar files are 
2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
{code:xml}
<metadata>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib-local_2.11</artifactId>
  <version>2.4.0-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20180723.232411</timestamp>
      <buildNumber>177</buildNumber>
    </snapshot>
    <lastUpdated>20180723232411</lastUpdated>
    <snapshotVersions>
      <snapshotVersion>
        <extension>jar</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <extension>pom</extension>
        <value>2.4.0-20180723.232411-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>tests</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
      <snapshotVersion>
        <classifier>test-sources</classifier>
        <extension>jar</extension>
        <value>2.4.0-20180723.232410-177</value>
        <updated>20180723232411</updated>
      </snapshotVersion>
    </snapshotVersions>
  </versioning>
</metadata>
{code}
 
 This behavior is very similar to this issue: 
https://issues.apache.org/jira/browse/MDEPLOY-221

Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
2.8.2 plugin, it is highly possible that we introduced a new plugin that causes 
this. 

The most recent addition is the spot-bugs plugin, which is known to have 
incompatibilities with other plugins: 
[https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]

We may want to try building without it to sanity check.


> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> 

[jira] [Commented] (SPARK-24894) Invalid DNS name due to hostname truncation

2018-07-24 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554602#comment-16554602
 ] 

Yinan Li commented on SPARK-24894:
--

[~mcheah]. We need to make sure the truncation leads to a valid hostname.
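One way to do that, sketched below purely for illustration (this is not the
actual BasicExecutorFeatureStep code): truncate to the 63-character DNS label
limit, then trim any characters that would make the result start or end with
something other than an alphanumeric.

{code:scala}
// Illustrative helper: produce a DNS-1123-safe hostname from a long pod name.
// A label must be <= 63 chars, lower-case alphanumeric or '-', and must start
// and end with an alphanumeric character.
object HostnameSanitizer {
  private val MaxDnsLabelLength = 63

  private def isAlnum(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')

  def toDnsLabel(raw: String): String = {
    // Keep only characters valid in a DNS-1123 label, replacing the rest with '-'.
    val lowered = raw.toLowerCase.map(c => if (isAlnum(c) || c == '-') c else '-')
    // Truncate to the 63-character limit, then trim '-' left at either edge.
    val truncated = lowered.take(MaxDnsLabelLength)
    truncated.dropWhile(c => !isAlnum(c)).reverse.dropWhile(c => !isAlnum(c)).reverse
  }
}

// With the invalid value from the error message above:
// HostnameSanitizer.toDnsLabel("-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9")
// yields "archetypes-all-weekly-1532380861251850404-1532380862321-exec-9".
{code}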

> Invalid DNS name due to hostname truncation 
> 
>
> Key: SPARK-24894
> URL: https://issues.apache.org/jira/browse/SPARK-24894
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> The hostname truncation happening here 
> [https://github.com/apache/spark/blob/5ff1b9ba1983d5601add62aef64a3e87d07050eb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L77]
> is problematic and can lead to DNS names starting with "-". 
> Originally filed here: 
> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/issues/229
> ```
> {{2018-07-23 21:21:42 ERROR Utils:91 - Uncaught exception in thread 
> kubernetes-pod-allocator 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://kubernetes.default.svc/api/v1/namespaces/default/pods. 
> Message: Pod 
> "user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is 
> invalid: spec.hostname: Invalid value: 
> "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 
> label must consist of lower case alphanumeric characters or '-', and must 
> start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', 
> regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'). Received 
> status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=spec.hostname, 
> message=Invalid value: 
> "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 
> label must consist of lower case alphanumeric characters or '-', and must 
> start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', 
> regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), 
> reason=FieldValueInvalid, additionalProperties={})], group=null, kind=Pod, 
> name=user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=Pod 
> "user-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9" is 
> invalid: spec.hostname: Invalid value: 
> "-archetypes-all-weekly-1532380861251850404-1532380862321-exec-9": a DNS-1123 
> label must consist of lower case alphanumeric characters or '-', and must 
> start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', 
> regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?'), 
> metadata=ListMeta(resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}). at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:470)
>  at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:409)
>  at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:379)
>  at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:343)
>  at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:226)
>  at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:769)
>  at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:356)
>  at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140)
>  at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3$$anonfun$apply$3.apply(KubernetesClusterSchedulerBackend.scala:140)
>  at org.apache.spark.util.Utils$.tryLog(Utils.scala:1922) at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:139)
>  at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$$anon$1$$anonfun$3.apply(KubernetesClusterSchedulerBackend.scala:138)
>  at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>  at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) 
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) 
> at 

[jira] [Commented] (SPARK-24908) [R] remove spaces to make lintr happy

2018-07-24 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554597#comment-16554597
 ] 

shane knapp commented on SPARK-24908:
-

i noticed lintr complaining in my PRB build 
([https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93509/console)]
 :(

`Error: StartTag: invalid element name [68] Execution halted lintr checks 
passed.`

so i checked other R-based PRB builds and saw the same thing:

pull request:  https://github.com/apache/spark/pull/21835

associated build:  
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93402/consoleFull]

i'm guessing that this is a red herring.

> [R] remove spaces to make lintr happy
> -
>
> Key: SPARK-24908
> URL: https://issues.apache.org/jira/browse/SPARK-24908
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Critical
>
> during my travails in porting spark builds to run on our centos worker, i 
> managed to recreate (as best i could) the centos environment on our new 
> ubuntu-testing machine.
> while running my initial builds, lintr was crashing on some extraneous spaces 
> in test_basic.R (see:  
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]
> after removing those spaces, the ubuntu build happily passed the lintr tests.
> i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
> (see 
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
>  which scp'ed a copy of test_basic.R in to the repo after the git clone.  
> everything seems to be working happily.
> i will be creating a pull request for this now.
>  
> attn:  [~felixcheung] [~shivaram] [~ifilonenko]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24895:


Assignee: Apache Spark

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Apache Spark
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {code}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554595#comment-16554595
 ] 

Apache Spark commented on SPARK-24895:
--

User 'ericfchang' has created a pull request for this issue:
https://github.com/apache/spark/pull/21865

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {code}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24895:


Assignee: (was: Apache Spark)

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Priority: Major
>
> Spark 2.4.0 has maven build errors because artifacts uploaded to apache maven 
> repo has mismatched filenames:
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {code}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> 
> org.apache.spark
> spark-mllib-local_2.11
> 2.4.0-SNAPSHOT
> 
> 
> 20180723.232411
> 177
> 
> 20180723232411
> 
> 
> jar
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> pom
> 2.4.0-20180723.232411-177
> 20180723232411
> 
> 
> tests
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> test-sources
> jar
> 2.4.0-20180723.232410-177
> 20180723232411
> 
> 
> 
> 
> {code}
>  
>  This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-07-24 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23325.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 2.4.0

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 2.4.0
>
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.
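For illustration, the pattern described above, where the source produces
{{InternalRow}} and the conversion to unsafe rows happens in one place via
{{UnsafeProjection}}, looks roughly like the following sketch against Spark's
internal catalyst API; the reader shape and schema are made up for the example.

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

// Illustrative reader: the source produces InternalRow and Spark (or the caller)
// applies an UnsafeProjection once, instead of every source hand-rolling UnsafeRow.
object InternalRowReaderSketch {
  val schema: StructType = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // Whatever the underlying format yields, map it straight into InternalRow.
  def read(records: Iterator[(Long, String)]): Iterator[InternalRow] =
    records.map { case (id, name) =>
      InternalRow(id, UTF8String.fromString(name))
    }

  // Conversion to UnsafeRow happens in one place, via UnsafeProjection.
  // Note: the projection reuses its output buffer, so callers that buffer rows
  // need to copy them, as is usual for Spark's internal row iterators.
  def toUnsafe(rows: Iterator[InternalRow]): Iterator[InternalRow] = {
    val projection = UnsafeProjection.create(schema)
    rows.map(projection)
  }
}
{code}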



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24908) [R] remove spaces to make lintr happy

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554561#comment-16554561
 ] 

Apache Spark commented on SPARK-24908:
--

User 'shaneknapp' has created a pull request for this issue:
https://github.com/apache/spark/pull/21864

> [R] remove spaces to make lintr happy
> -
>
> Key: SPARK-24908
> URL: https://issues.apache.org/jira/browse/SPARK-24908
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Critical
>
> during my travails in porting spark builds to run on our centos worker, i 
> managed to recreate (as best i could) the centos environment on our new 
> ubuntu-testing machine.
> while running my initial builds, lintr was crashing on some extraneous spaces 
> in test_basic.R (see:  
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]
> after removing those spaces, the ubuntu build happily passed the lintr tests.
> i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
> (see 
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
>  which scp'ed a copy of test_basic.R in to the repo after the git clone.  
> everything seems to be working happily.
> i will be creating a pull request for this now.
>  
> attn:  [~felixcheung] [~shivaram]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24908) [R] remove spaces to make lintr happy

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24908:


Assignee: shane knapp  (was: Apache Spark)

> [R] remove spaces to make lintr happy
> -
>
> Key: SPARK-24908
> URL: https://issues.apache.org/jira/browse/SPARK-24908
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Critical
>
> during my travails in porting spark builds to run on our centos worker, i 
> managed to recreate (as best i could) the centos environment on our new 
> ubuntu-testing machine.
> while running my initial builds, lintr was crashing on some extraneous spaces 
> in test_basic.R (see:  
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]
> after removing those spaces, the ubuntu build happily passed the lintr tests.
> i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
> (see 
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
>  which scp'ed a copy of test_basic.R in to the repo after the git clone.  
> everything seems to be working happily.
> i will be creating a pull request for this now.
>  
> attn:  [~felixcheung] [~shivaram]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24908) [R] remove spaces to make lintr happy

2018-07-24 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-24908:

Description: 
during my travails in porting spark builds to run on our centos worker, i 
managed to recreate (as best i could) the centos environment on our new 
ubuntu-testing machine.

while running my initial builds, lintr was crashing on some extraneous spaces 
in test_basic.R (see:  
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]

after removing those spaces, the ubuntu build happily passed the lintr tests.

i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
(see 
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
 which scp'ed a copy of test_basic.R in to the repo after the git clone.  
everything seems to be working happily.

i will be creating a pull request for this now.

 

attn:  [~felixcheung] [~shivaram] [~ifilonenko]

  was:
during my travails in porting spark builds to run on our centos worker, i 
managed to recreate (as best i could) the centos environment on our new 
ubuntu-testing machine.

while running my initial builds, lintr was crashing on some extraneous spaces 
in test_basic.R (see:  
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]

after removing those spaces, the ubuntu build happily passed the lintr tests.

i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
(see 
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
 which scp'ed a copy of test_basic.R in to the repo after the git clone.  
everything seems to be working happily.

i will be creating a pull request for this now.

 

attn:  [~felixcheung] [~shivaram]


> [R] remove spaces to make lintr happy
> -
>
> Key: SPARK-24908
> URL: https://issues.apache.org/jira/browse/SPARK-24908
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Critical
>
> during my travails in porting spark builds to run on our centos worker, i 
> managed to recreate (as best i could) the centos environment on our new 
> ubuntu-testing machine.
> while running my initial builds, lintr was crashing on some extraneous spaces 
> in test_basic.R (see:  
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]
> after removing those spaces, the ubuntu build happily passed the lintr tests.
> i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
> (see 
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
>  which scp'ed a copy of test_basic.R in to the repo after the git clone.  
> everything seems to be working happily.
> i will be creating a pull request for this now.
>  
> attn:  [~felixcheung] [~shivaram] [~ifilonenko]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24908) [R] remove spaces to make lintr happy

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24908:


Assignee: Apache Spark  (was: shane knapp)

> [R] remove spaces to make lintr happy
> -
>
> Key: SPARK-24908
> URL: https://issues.apache.org/jira/browse/SPARK-24908
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: Apache Spark
>Priority: Critical
>
> during my travails in porting spark builds to run on our centos worker, i 
> managed to recreate (as best i could) the centos environment on our new 
> ubuntu-testing machine.
> while running my initial builds, lintr was crashing on some extraneous spaces 
> in test_basic.R (see:  
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]
> after removing those spaces, the ubuntu build happily passed the lintr tests.
> i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
> (see 
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
>  which scp'ed a copy of test_basic.R in to the repo after the git clone.  
> everything seems to be working happily.
> i will be creating a pull request for this now.
>  
> attn:  [~felixcheung] [~shivaram]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24908) [R] remove spaces to make lintr happy

2018-07-24 Thread shane knapp (JIRA)
shane knapp created SPARK-24908:
---

 Summary: [R] remove spaces to make lintr happy
 Key: SPARK-24908
 URL: https://issues.apache.org/jira/browse/SPARK-24908
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.3.1
Reporter: shane knapp
Assignee: shane knapp


during my travails in porting spark builds to run on our centos worker, i 
managed to recreate (as best i could) the centos environment on our new 
ubuntu-testing machine.

while running my initial builds, lintr was crashing on some extraneous spaces 
in test_basic.R (see:  
[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6-ubuntu-test/862/console)]

after removing those spaces, the ubuntu build happily passed the lintr tests.

i then tested this against a modified spark-master-test-sbt-hadoop-2.6 build 
(see 
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/testing-spark-master-test-with-updated-R-crap/4/),]
 which scp'ed a copy of test_basic.R in to the repo after the git clone.  
everything seems to be working happily.

i will be creating a pull request for this now.

 

attn:  [~felixcheung] [~shivaram]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554431#comment-16554431
 ] 

Apache Spark commented on SPARK-18874:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21863

> First phase: Deferring the correlated predicate pull up to Optimizer phase
> --
>
> Key: SPARK-18874
> URL: https://issues.apache.org/jira/browse/SPARK-18874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nattavut Sutyanyong
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.2.0
>
> Attachments: SPARK-18874-3.pdf
>
>
> This JIRA implements the first phase of SPARK-18455 by deferring the 
> correlated predicate pull up from Analyzer to Optimizer. The goal is to 
> preserve the current functionality of subquery in Spark 2.0 (if it works, it 
> continues to work after this JIRA, if it does not, it won't). The performance 
> of subquery processing is expected to be at par with Spark 2.0.
> The representation of the LogicalPlan after Analyzer will be different after 
> this JIRA that it will preserve the original positions of correlated 
> predicates in a subquery. This new representation is a preparation work for 
> the second phase of extending the support of correlated subquery to cases 
> Spark 2.0 does not support such as deep correlation, outer references in 
> SELECT clause.
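
For readers skimming the thread, a minimal illustration of the kind of correlated subquery this work targets; the table and column names below are made up for illustration and this is not code from the patch:

{code:java}
// Hypothetical tables t1/t2; the inner query references o.id from the outer
// query, which is exactly the correlated predicate whose pull-up is being
// deferred from the Analyzer to the Optimizer.
val df = spark.sql("""
  SELECT o.id, o.amount
  FROM   t1 o
  WHERE  o.amount > (SELECT AVG(i.amount) FROM t2 i WHERE i.id = o.id)
""")
df.explain(true)
{code}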



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24581) Design: BarrierTaskContext.barrier()

2018-07-24 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554409#comment-16554409
 ] 

Jiang Xingbo commented on SPARK-24581:
--

Design doc: 
https://docs.google.com/document/d/1r07-vU5JTH6s1jJ6azkmK0K5it6jwpfO6b_K3mJmxR4/edit?usp=sharing

> Design: BarrierTaskContext.barrier()
> 
>
> Key: SPARK-24581
> URL: https://issues.apache.org/jira/browse/SPARK-24581
> Project: Spark
>  Issue Type: Story
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Jiang Xingbo
>Priority: Major
>
> We need to provide a communication barrier function to users to help 
> coordinate tasks within a barrier stage. This is very similar to MPI_Barrier 
> function in MPI. This story is for its design.
>  
> Requirements:
>  * Low-latency. The tasks should be unblocked soon after all tasks have 
> reached this barrier. The latency is more important than CPU cycles here.
>  * Support unlimited timeout with proper logging. For DL tasks, it might take 
> very long to converge, we should support unlimited timeout with proper 
> logging. So users know why a task is waiting.
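
For context, a rough sketch of how a barrier stage with such a barrier() call might look from user code, assuming an existing SparkContext `sc`; the exact API surface is what this design story is settling, so the names here follow the proposal rather than a released API:

{code:java}
import org.apache.spark.BarrierTaskContext

// Run a stage in barrier mode: every task in the stage blocks at barrier()
// until all tasks have reached it, similar to MPI_Barrier.
val rdd = sc.parallelize(1 to 100, 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()          // all 4 tasks wait here for each other
  iter.map(_ * 2)
}.collect()
{code}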



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-24 Thread Brad (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554404#comment-16554404
 ] 

Brad commented on SPARK-21097:
--

Hi [~menelaus],

I have stalled out on this project, so if you would like to work on it, feel 
free. There is a *lot* of discussion on the PR (it might take a couple of 
attempts to load the page). The last recommendation from Squito was that I open 
a SPIP to get more input. I'm not sure the final state of the code is 100% 
correct at this point; I was making a lot of changes at the end.

 

Thanks

Brad

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also since it will be completely opt-in it will 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24903) Allow driver container name to be configurable in k8 spark

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24903:


Assignee: Apache Spark

> Allow driver container name to be configurable in k8 spark
> --
>
> Key: SPARK-24903
> URL: https://issues.apache.org/jira/browse/SPARK-24903
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Yifei Huang
>Assignee: Apache Spark
>Priority: Major
>
> We'd like to expose the container name as a configurable value. In case it 
> changes in spark, we want to be able to override it to guarantee consistency 
> on our end. 
> https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24903) Allow driver container name to be configurable in k8 spark

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554379#comment-16554379
 ] 

Apache Spark commented on SPARK-24903:
--

User 'yifeih' has created a pull request for this issue:
https://github.com/apache/spark/pull/21862

> Allow driver container name to be configurable in k8 spark
> --
>
> Key: SPARK-24903
> URL: https://issues.apache.org/jira/browse/SPARK-24903
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Yifei Huang
>Priority: Major
>
> We'd like to expose the container name as a configurable value. In case it 
> changes in spark, we want to be able to override it to guarantee consistency 
> on our end. 
> https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24903) Allow driver container name to be configurable in k8 spark

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24903:


Assignee: (was: Apache Spark)

> Allow driver container name to be configurable in k8 spark
> --
>
> Key: SPARK-24903
> URL: https://issues.apache.org/jira/browse/SPARK-24903
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Yifei Huang
>Priority: Major
>
> We'd like to expose the container name as a configurable value. In case it 
> changes in spark, we want to be able to override it to guarantee consistency 
> on our end. 
> https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554340#comment-16554340
 ] 

Takeshi Yamamuro commented on SPARK-24906:
--

Ah, I see. It makes some sense to me. In DataSourceScanExec, in the case where 
`requiredSchema` has relatively fewer columns than `dataSchema`, we might 
consider an additional term to make `maxSplitBytes` larger in 
`createNonBucketedReadRDD`.
https://github.com/apache/spark/blob/3d5c61e5fd24f07302e39b5d61294da79aa0c2f9/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L316

WDYT? [~smilegator]  [~viirya] 
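
To make the idea concrete, here is a minimal sketch (not the actual Spark code path) of how such a term could scale the split size by the ratio of the full row width to the width of the columns actually read; the method name and where it would be wired in are assumptions:

{code:java}
import org.apache.spark.sql.types.StructType

// Sketch only: scale maxPartitionBytes by
// (estimated width of all columns) / (estimated width of the columns read),
// using each type's defaultSize as the width estimate.
def scaledMaxSplitBytes(
    maxPartitionBytes: Long,
    dataSchema: StructType,
    requiredSchema: StructType): Long = {
  val fullWidth = dataSchema.fields.map(_.dataType.defaultSize.toLong).sum
  val readWidth = math.max(1L, requiredSchema.fields.map(_.dataType.defaultSize.toLong).sum)
  // e.g. 20 int + 20 long columns, reading one int and one long:
  // (20*4 + 20*8) / (4 + 8) = 20x larger splits
  maxPartitionBytes * math.max(1L, fullWidth / readWidth)
}
{code}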

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar file, such as, when spark sql read the table, each split will be 
> 128 MB by default since spark.sql.files.maxPartitionBytes is default to 
> 128MB. Even when user set it to a large value, such as 512MB, the task may 
> read only few MB or even hundreds of KB. Because the table (Parquet) may 
> consists of dozens of columns while the SQL only need few columns. And spark 
> will prune the unnecessary columns.
>  
> In this case, spark DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, there is 40 columns , 20 are integer while another 20 are long. 
> When use query on an integer type column and an long type column, the 
> maxPartitionBytes should be 20 times larger. (20*4+20*8) /  (4+8) = 20. 
>  
> With this optimization, the number of task will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more the 10 thousand 
> nodes), it will relieve RM's schedule pressure.
>  
> Here is the test
>  
> The table named test2 has more than 40 columns and there are more than 5 TB 
> data each hour.
> When we issue a very simple query 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB data
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job last only 30 
> seconds. It almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median of read data is 44.2MB. 
> !image-2018-07-24-20-30-24-552.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-24 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554327#comment-16554327
 ] 

Thomas Graves commented on SPARK-24615:
---

Right, so I think part of this is trying to make it more obvious to the user 
what the scope actually is. This is in some ways similar to caching. I often see 
people force an evaluation after a cache() so that the data actually gets 
cached, because otherwise it might not do what they expect. That is why I 
mentioned the .eval() type functionality.

For the example:

val rddA = rdd.withResources.mapPartitions()

val rddB = rdd.withResources.mapPartitions()

val rddC = rddA.join(rddB)

Above, the mapPartitions would normally get their own stages, correct? So I 
would think those stages would run with the specified resources while the join 
would run with the default resources; then you wouldn't have to worry about 
merging, etc. But you have the case with map or others that wouldn't normally 
get their own stage, so the question is: perhaps they should, or do you provide 
something to force that?

 

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator card (GPU, FPGA, TPU) is 
> predominant compared to CPUs. To make the current Spark architecture to work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule task onto the executors where 
> accelerators are equipped.
> Current Spark’s scheduler schedules tasks based on the locality of the data 
> plus the available of CPUs. This will introduce some problems when scheduling 
> tasks with accelerators required.
>  # CPU cores are usually more than accelerators on one node, using CPU cores 
> to schedule accelerator required tasks will introduce the mismatch.
>  # In one cluster, we always assume that CPU is equipped in each node, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires scheduler to schedule tasks with a smart way.
> So here propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator requires or not). This can be part of the work of Project 
> hydrogen.
> Details is attached in google doc. It doesn't cover all the implementation 
> details, just highlight the parts should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-07-24 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554306#comment-16554306
 ] 

Thomas Graves commented on SPARK-23128:
---

We also did some initial evaluation with it and it was looking good, which is 
why I pinged here. We should work to get it in rather than everyone maintaining 
their own version.

It might be nice to post on https://issues.apache.org/jira/browse/SPARK-9850 
since it is the initial version, but I don't see any activity there.

Ideally we would have a committer who is more familiar with the SQL code 
shepherd it in. I'm still learning the SQL side of the code, so I don't consider 
myself an expert there. Have you posted this to the dev list at all? We should 
probably have a SPIP for this, which your doc above should pretty much cover, 
although you may want to make sure it's up to date. So I think the first step 
would be to post to the dev list to get feedback and see if someone else is 
willing to volunteer to review. Could you post a DISCUSS thread about it? If no 
one else will review it, I will; it may just take me longer.

 

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges but currently two separated ExchangeCoordinator will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24907:


Assignee: Apache Spark

> Migrate JDBC data source to DataSource API v2
> -
>
> Key: SPARK-24907
> URL: https://issues.apache.org/jira/browse/SPARK-24907
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554221#comment-16554221
 ] 

Apache Spark commented on SPARK-24907:
--

User 'tengpeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/21861

> Migrate JDBC data source to DataSource API v2
> -
>
> Key: SPARK-24907
> URL: https://issues.apache.org/jira/browse/SPARK-24907
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24907:


Assignee: (was: Apache Spark)

> Migrate JDBC data source to DataSource API v2
> -
>
> Key: SPARK-24907
> URL: https://issues.apache.org/jira/browse/SPARK-24907
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-24 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554210#comment-16554210
 ] 

Jackey Lee commented on SPARK-24630:


In SQLStreaming, we also support standard SQL just as for batch queries, except 
for the stream keyword. For example, you can run this query to do a window 
aggregation, which looks just like the batch query:
select stream count(*) from kafka_sql_test group by window(timestamp, '10 
seconds', '5 seconds')

Are there any other queries you can think of that are not supported in batch 
syntax?
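
For comparison, this is how the streaming and batch forms of that query would line up under the proposal; the `stream` keyword is part of this SPIP and not an existing Spark feature, and the table name is taken from the example above:

{code:java}
// Proposed streaming form (SPIP syntax, not yet in Spark):
val streamingDf = spark.sql(
  "select stream count(*) from kafka_sql_test " +
  "group by window(timestamp, '10 seconds', '5 seconds')")

// The equivalent batch form differs only by dropping the `stream` keyword:
val batchDf = spark.sql(
  "select count(*) from kafka_sql_test " +
  "group by window(timestamp, '10 seconds', '5 seconds')")
{code}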




> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL(which is actually based on Calcite), 
> SQLStream, StormSQL all provide a stream type SQL interface, with which users 
> with little knowledge about streaming,  can easily develop a flow system 
> processing model. In Spark, we can also support SQL API based on 
> StructStreamig.
> To support for SQL Streaming, there are two key points: 
> 1, Analysis should be able to parse streaming type SQL. 
> 2, Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24907) Migrate JDBC data source to DataSource API v2

2018-07-24 Thread Teng Peng (JIRA)
Teng Peng created SPARK-24907:
-

 Summary: Migrate JDBC data source to DataSource API v2
 Key: SPARK-24907
 URL: https://issues.apache.org/jira/browse/SPARK-24907
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0
Reporter: Teng Peng






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-24906:
--
Description: 
For columnar files (e.g. Parquet), when Spark SQL reads a table, each split will 
be 128 MB by default since spark.sql.files.maxPartitionBytes defaults to 128 MB. 
Even when the user sets it to a larger value, such as 512 MB, a task may read 
only a few MB or even hundreds of KB, because the table may consist of dozens of 
columns while the SQL query only needs a few of them, and Spark will prune the 
unnecessary columns.

In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes 
adaptively.

For example, suppose there are 40 columns: 20 integers and 20 longs. When a 
query reads one integer column and one long column, maxPartitionBytes should be 
20 times larger: (20*4 + 20*8) / (4 + 8) = 20.

With this optimization, the number of tasks will be smaller and the job will run 
faster. More importantly, for a very large cluster (more than 10 thousand 
nodes), it will relieve the RM's scheduling pressure.

 

Here is the test

 

The table named test2 has more than 40 columns and there are more than 5 TB 
data each hour.

When we issue a very simple query 

 
{code:java}
select count(device_id) from test2 where date=20180708 and hour='23'{code}
 

There are 72176 tasks and the duration of the job is 4.8 minutes

!image-2018-07-24-20-26-32-441.png!

 

Most tasks last less than 1 second and read less than 1.5 MB of data.

!image-2018-07-24-20-28-06-269.png!

 

After the optimization, there are only 1615 tasks and the job lasts only 30 
seconds. It is almost 10 times faster.

!image-2018-07-24-20-29-24-797.png!

 

The median of read data is 44.2MB. 

!image-2018-07-24-20-30-24-552.png!

 

  was:
For columnar file, such as, when spark sql read the table, each split will be 
128 MB by default since spark.sql.files.maxPartitionBytes is default to 128MB. 
Even when user set it to a large value, such as 512MB, the task may read only 
few MB or even hundreds of KB. Because the table (Parquet) may consists of 
dozens of columns while the SQL only need few columns.

 

In this case, spark DataSourceScanExec can enlarge maxPartitionBytes 
adaptively. 

For example, there is 40 columns , 20 are integer while another 20 are long. 
When use query on an integer type column and an long type column, the 
maxPartitionBytes should be 20 times larger. (20*4+20*8) /  (4+8) = 20. 

 

With this optimization, the number of task will be smaller and the job will run 
faster. More importantly, for a very large cluster (more the 10 thousand 
nodes), it will relieve RM's schedule pressure.


> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar file, such as, when spark sql read the table, each split will be 
> 128 MB by default since spark.sql.files.maxPartitionBytes is default to 
> 128MB. Even when user set it to a large value, such as 512MB, the task may 
> read only few MB or even hundreds of KB. Because the table (Parquet) may 
> consists of dozens of columns while the SQL only need few columns. And spark 
> will prune the unnecessary columns.
>  
> In this case, spark DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, there is 40 columns , 20 are integer while another 20 are long. 
> When use query on an integer type column and an long type column, the 
> maxPartitionBytes should be 20 times larger. (20*4+20*8) /  (4+8) = 20. 
>  
> With this optimization, the number of task will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more the 10 thousand 
> nodes), it will relieve RM's schedule pressure.
>  
> Here is the test
>  
> The table named test2 has more than 40 columns and there are more than 5 TB 
> data each hour.
> When we issue a very simple query 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB data
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job last only 30 
> seconds. It almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median of read data is 44.2MB. 
> 

[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-24906:
--
Attachment: image-2018-07-24-20-30-24-552.png

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar file, such as, when spark sql read the table, each split will be 
> 128 MB by default since spark.sql.files.maxPartitionBytes is default to 
> 128MB. Even when user set it to a large value, such as 512MB, the task may 
> read only few MB or even hundreds of KB. Because the table (Parquet) may 
> consists of dozens of columns while the SQL only need few columns.
>  
> In this case, spark DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, there is 40 columns , 20 are integer while another 20 are long. 
> When use query on an integer type column and an long type column, the 
> maxPartitionBytes should be 20 times larger. (20*4+20*8) /  (4+8) = 20. 
>  
> With this optimization, the number of task will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more the 10 thousand 
> nodes), it will relieve RM's schedule pressure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-24906:
--
Attachment: image-2018-07-24-20-29-24-797.png

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png
>
>
> For columnar file, such as, when spark sql read the table, each split will be 
> 128 MB by default since spark.sql.files.maxPartitionBytes is default to 
> 128MB. Even when user set it to a large value, such as 512MB, the task may 
> read only few MB or even hundreds of KB. Because the table (Parquet) may 
> consists of dozens of columns while the SQL only need few columns.
>  
> In this case, spark DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, there is 40 columns , 20 are integer while another 20 are long. 
> When use query on an integer type column and an long type column, the 
> maxPartitionBytes should be 20 times larger. (20*4+20*8) /  (4+8) = 20. 
>  
> With this optimization, the number of task will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more the 10 thousand 
> nodes), it will relieve RM's schedule pressure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

2018-07-24 Thread Jason Guo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Guo updated SPARK-24906:
--
Attachment: image-2018-07-24-20-28-06-269.png

> Enlarge split size for columnar file to ensure the task read enough data
> 
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Critical
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png
>
>
> For columnar file, such as, when spark sql read the table, each split will be 
> 128 MB by default since spark.sql.files.maxPartitionBytes is default to 
> 128MB. Even when user set it to a large value, such as 512MB, the task may 
> read only few MB or even hundreds of KB. Because the table (Parquet) may 
> consists of dozens of columns while the SQL only need few columns.
>  
> In this case, spark DataSourceScanExec can enlarge maxPartitionBytes 
> adaptively. 
> For example, there is 40 columns , 20 are integer while another 20 are long. 
> When use query on an integer type column and an long type column, the 
> maxPartitionBytes should be 20 times larger. (20*4+20*8) /  (4+8) = 20. 
>  
> With this optimization, the number of task will be smaller and the job will 
> run faster. More importantly, for a very large cluster (more the 10 thousand 
> nodes), it will relieve RM's schedule pressure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24905) Spark 2.3 Internal URL env variable

2018-07-24 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-24905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Björn Wenzel updated SPARK-24905:
-
Priority: Critical  (was: Major)

> Spark 2.3 Internal URL env variable
> ---
>
> Key: SPARK-24905
> URL: https://issues.apache.org/jira/browse/SPARK-24905
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Björn Wenzel
>Priority: Critical
>
> Currently the Kubernetes Master internal URL is hardcoded in the 
> Constants.scala file 
> ([https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L75)]
> In some cases these URL should be changed e.g. if the Certificate is valid 
> for another Hostname.
> Is it possible to make this URL a property like: 
> spark.kubernetes.authenticate.driver.hostname?
> Kubernetes The Hard Way maintained by Kelsey Hightower for example uses 
> kubernetes.default as hostname, this will produce again a 
> SSLPeerUnverifiedException.
>  
> Here is the use of the Hardcoded Host: 
> [https://github.com/apache/spark/blob/85fe1297e35bcff9cf86bd53fee615e140ee5bfb/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala#L52]
>  maybe this could be changed like the KUBERNETES_NAMESPACE property in Line 
> 53.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-24 Thread Genmao Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554043#comment-16554043
 ] 

Genmao Yu edited comment on SPARK-24630 at 7/24/18 11:38 AM:
-

[~zsxwing]

{{Structured Streaming supports standard SQL as the batch queries, so the users 
can switch their queries between batch and streaming easily.}}

IIUC, there are some queries we cannot switch from stream to batch, like 
"groupBy window aggregation" or "over window aggregation" on a stream. Is that 
right?


was (Author: unclegen):
{{Structured Streaming supports standard SQL as the batch queries, so the users 
can switch their queries between batch and streaming easily.}}

IIUC, there are some queries we cannot switch from stream to batch, like 
"groupBy window aggregation" or "over window aggregation" on a stream. Is that 
right?

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL(which is actually based on Calcite), 
> SQLStream, StormSQL all provide a stream type SQL interface, with which users 
> with little knowledge about streaming,  can easily develop a flow system 
> processing model. In Spark, we can also support SQL API based on 
> StructStreamig.
> To support for SQL Streaming, there are two key points: 
> 1, Analysis should be able to parse streaming type SQL. 
> 2, Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24904) Join with broadcasted dataframe causes shuffle of redundant data

2018-07-24 Thread Shay Elbaz (JIRA)
Shay Elbaz created SPARK-24904:
--

 Summary: Join with broadcasted dataframe causes shuffle of 
redundant data
 Key: SPARK-24904
 URL: https://issues.apache.org/jira/browse/SPARK-24904
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.1.2
Reporter: Shay Elbaz


When joining a "large" dataframe with a broadcasted small one, and the join type 
preserves the small DF side (see the right join below), the physical plan does 
not include broadcasting the small table. But when the join preserves the large 
DF side, the broadcast does take place. Is there a good reason for this? In the 
example below it surely doesn't make sense to shuffle the entire large table:

 
{code:java}
val small = spark.range(1, 10)
val big = spark.range(1, 1 << 30)
  .withColumnRenamed("id", "id2")

big.join(broadcast(small), $"id" === $"id2", "right")
.explain


== Physical Plan == 
SortMergeJoin [id2#16307L], [id#16310L], RightOuter 
:- *Sort [id2#16307L ASC NULLS FIRST], false, 0
 :  +- Exchange hashpartitioning(id2#16307L, 1000)
 : +- *Project [id#16304L AS id2#16307L]
 :    +- *Range (1, 1073741824, step=1, splits=Some(600))
 +- *Sort [id#16310L ASC NULLS FIRST], false, 0
    +- Exchange hashpartitioning(id#16310L, 1000)
   +- *Range (1, 10, step=1, splits=Some(600))
{code}
As a workaround, users need to perform inner instead of right join, and then 
join the result back with the small DF to fill the missing rows.
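
One way that workaround might look with the DataFrames from the example above, written here purely for illustration and assuming the same session and imports (spark.implicits._ and functions.broadcast); the second join may still shuffle, but only the matched rows rather than the full large table:

{code:java}
// Step 1: inner join, which does get planned as a broadcast join.
val matched = big.join(broadcast(small), $"id" === $"id2", "inner")

// Step 2: re-attach the small-side rows that had no match, null-extending
// them as the right outer join would have done.
val rightOuterLike = small.join(matched, Seq("id"), "left_outer")
{code}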

 

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown

2018-07-24 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554109#comment-16554109
 ] 

Tomasz Gawęda commented on SPARK-24288:
---

[~smilegator] Yes, you are right. If we don't want to use barriers (as 
mentioned by [~rxin] in the mail), we can add an option to disable predicate 
pushdown for the JDBC source. I've also thought about adding a custom optimizer 
rule, but I'm probably not good enough at Spark internals yet ;)
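
For illustration, the kind of per-read switch being discussed might look like the sketch below; the option name is an assumption made here for the example, not an existing Spark API at the time of this discussion, and the JDBC URL and table names are placeholders:

{code:java}
// df1: pushdown disabled for this one JDBC read (hypothetical option name).
val df1 = spark.read.format("jdbc")
  .option("url", "jdbc:db2://host:50000/SAMPLE")
  .option("dbtable", "SCHEMA.TABLE1")
  .option("pushDownPredicate", "false")   // assumed per-source switch
  .load()

// df2: default behaviour, predicates still pushed down to the database.
val df2 = spark.read.format("jdbc")
  .option("url", "jdbc:db2://host:50000/SAMPLE")
  .option("dbtable", "SCHEMA.TABLE2")
  .load()
{code}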

> Enable preventing predicate pushdown
> 
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tomasz Gawęda
>Priority: Major
> Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with JDBC datasource I saw that many "or" clauses with 
> non-equality operators causes huge performance degradation of SQL query 
> to database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in real application whose predicates were pushed 
> many many lines below, many ANDs and ORs 
> If I use cache() before where, there is no predicate pushdown of this 
> "where" clause. However, in production system caching many sources is a 
> waste of memory (especially is pipeline is long and I must do cache many 
> times).There are also few more workarounds, but it would be great if Spark 
> will support preventing predicate pushdown by user.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note, that this should not be a global configuration option. If I read 2 
> DataFrames, df1 and df2, I would like to specify that df1 should not have 
> some predicates pushed down, but some may be, but df2 should have all 
> predicates pushed down, even if target query joins df1 and df2. As far as I 
> understand Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly in logical plan, then predicates won't be 
> pushed down on this particular DataFrames and PP will be still possible on 
> the second one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24901:


Assignee: Apache Spark

> Merge the codegen of RegularHashMap and fastHashMap to reduce compiler 
> maxCodesize when VectorizedHashMap is false
> --
>
> Key: SPARK-24901
> URL: https://issues.apache.org/jira/browse/SPARK-24901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the generated code that updates UnsafeRow in hash aggregation keeps 
> FastHashMap and RegularHashMap as two separate code paths. The two separate 
> paths are only needed when VectorizedHashMap is true; in the other cases we 
> can merge them to reduce the compiler maxCodesize. Thanks.
> case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
> spark.sparkContext.parallelize(
>   DistinctAgg(8, 2, 3, 4, "a") ::
>   DistinctAgg(9, 3, 4, 5, "b") ::
>   Nil).toDF().createOrReplaceTempView("distinctAgg")
> val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else 
> null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e")
> println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan))
> df.show()
> Generate code like:
>  *Before modified:*
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> ...
> /* 354 */
> /* 355 */ if (agg_fastAggBuffer_0 != null) {
> /* 356 */   // common sub-expressions
> /* 357 */
> /* 358 */   // evaluate aggregate function
> /* 359 */   agg_agg_isNull_31_0 = true;
> /* 360 */   int agg_value_34 = -1;
> /* 361 */
> /* 362 */   boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0);
> /* 363 */   int agg_value_35 = agg_isNull_32 ?
> /* 364 */   -1 : (agg_fastAggBuffer_0.getInt(0));
> /* 365 */
> /* 366 */   if (!agg_isNull_32 && (agg_agg_isNull_31_0 ||
> /* 367 */   agg_value_34 > agg_value_35)) {
> /* 368 */ agg_agg_isNull_31_0 = false;
> /* 369 */ agg_value_34 = agg_value_35;
> /* 370 */   }
> /* 371 */
> /* 372 */   if (!false && (agg_agg_isNull_31_0 ||
> /* 373 */   agg_value_34 > agg_expr_2_0)) {
> /* 374 */ agg_agg_isNull_31_0 = false;
> /* 375 */ agg_value_34 = agg_expr_2_0;
> /* 376 */   }
> /* 377 */   agg_agg_isNull_34_0 = true;
> /* 378 */   int agg_value_37 = -1;
> /* 379 */
> /* 380 */   boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1);
> /* 381 */   int agg_value_38 = agg_isNull_35 ?
> /* 382 */   -1 : (agg_fastAggBuffer_0.getInt(1));
> /* 383 */
> /* 384 */   if (!agg_isNull_35 && (agg_agg_isNull_34_0 ||
> /* 385 */   agg_value_37 > agg_value_38)) {
> /* 386 */ agg_agg_isNull_34_0 = false;
> /* 387 */ agg_value_37 = agg_value_38;
> /* 388 */   }
> /* 389 */
> /* 390 */   byte agg_caseWhenResultState_1 = -1;
> /* 391 */   do {
> /* 392 */ boolean agg_value_40 = false;
> /* 393 */ agg_value_40 = agg_expr_0_0 > 10;
> /* 394 */ if (!false && agg_value_40) {
> /* 395 */   agg_caseWhenResultState_1 = (byte)(false ? 1 : 0);
> /* 396 */   agg_agg_value_39_0 = agg_expr_0_0;
> /* 397 */   continue;
> /* 398 */ }
> /* 399 */
> /* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0);
> /* 401 */ agg_agg_value_39_0 = -1;
> /* 402 */
> /* 403 */   } while (false);
> /* 404 */   // TRUE if any condition is met and the result is null, or no 
> any condition is met.
> /* 405 */   final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != 
> 0);
> /* 406 */
> /* 407 */   if (!agg_isNull_36 && (agg_agg_isNull_34_0 ||
> /* 408 */   agg_value_37 > agg_agg_value_39_0)) {
> /* 409 */ agg_agg_isNull_34_0 = false;
> /* 410 */ agg_value_37 = agg_agg_value_39_0;
> /* 411 */   }
> /* 412 */   agg_agg_isNull_42_0 = true;
> /* 413 */   int agg_value_45 = -1;
> /* 414 */
> /* 415 */   boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2);
> /* 416 */   int agg_value_46 = agg_isNull_43 ?
> /* 417 */   -1 : (agg_fastAggBuffer_0.getInt(2));
> /* 418 */
> /* 419 */   if (!agg_isNull_43 && (agg_agg_isNull_42_0 ||
> /* 420 */   agg_value_45 > agg_value_46)) {
> /* 421 */ agg_agg_isNull_42_0 = false;
> /* 422 */ agg_value_45 = agg_value_46;
> /* 423 */   }
> /* 424 */
> /* 425 */   if (!false && (agg_agg_isNull_42_0 ||
> /* 426 */   agg_value_45 > agg_expr_0_0)) {
> /* 427 

[jira] [Commented] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554106#comment-16554106
 ] 

Apache Spark commented on SPARK-24901:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21860

> Merge the codegen of RegularHashMap and fastHashMap to reduce compiler 
> maxCodesize when VectorizedHashMap is false
> --
>
> Key: SPARK-24901
> URL: https://issues.apache.org/jira/browse/SPARK-24901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Major
>
> Currently, the generated code that updates UnsafeRow in hash aggregation keeps 
> FastHashMap and RegularHashMap as two separate code paths. The two separate 
> paths are only needed when VectorizedHashMap is true; in the other cases we 
> can merge them to reduce the compiler maxCodesize. Thanks.
> case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
> spark.sparkContext.parallelize(
>   DistinctAgg(8, 2, 3, 4, "a") ::
>   DistinctAgg(9, 3, 4, 5, "b") ::
>   Nil).toDF().createOrReplaceTempView("distinctAgg")
> val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else 
> null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e")
> println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan))
> df.show()
> Generate code like:
>  *Before modified:*
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> ...
> /* 354 */
> /* 355 */ if (agg_fastAggBuffer_0 != null) {
> /* 356 */   // common sub-expressions
> /* 357 */
> /* 358 */   // evaluate aggregate function
> /* 359 */   agg_agg_isNull_31_0 = true;
> /* 360 */   int agg_value_34 = -1;
> /* 361 */
> /* 362 */   boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0);
> /* 363 */   int agg_value_35 = agg_isNull_32 ?
> /* 364 */   -1 : (agg_fastAggBuffer_0.getInt(0));
> /* 365 */
> /* 366 */   if (!agg_isNull_32 && (agg_agg_isNull_31_0 ||
> /* 367 */   agg_value_34 > agg_value_35)) {
> /* 368 */ agg_agg_isNull_31_0 = false;
> /* 369 */ agg_value_34 = agg_value_35;
> /* 370 */   }
> /* 371 */
> /* 372 */   if (!false && (agg_agg_isNull_31_0 ||
> /* 373 */   agg_value_34 > agg_expr_2_0)) {
> /* 374 */ agg_agg_isNull_31_0 = false;
> /* 375 */ agg_value_34 = agg_expr_2_0;
> /* 376 */   }
> /* 377 */   agg_agg_isNull_34_0 = true;
> /* 378 */   int agg_value_37 = -1;
> /* 379 */
> /* 380 */   boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1);
> /* 381 */   int agg_value_38 = agg_isNull_35 ?
> /* 382 */   -1 : (agg_fastAggBuffer_0.getInt(1));
> /* 383 */
> /* 384 */   if (!agg_isNull_35 && (agg_agg_isNull_34_0 ||
> /* 385 */   agg_value_37 > agg_value_38)) {
> /* 386 */ agg_agg_isNull_34_0 = false;
> /* 387 */ agg_value_37 = agg_value_38;
> /* 388 */   }
> /* 389 */
> /* 390 */   byte agg_caseWhenResultState_1 = -1;
> /* 391 */   do {
> /* 392 */ boolean agg_value_40 = false;
> /* 393 */ agg_value_40 = agg_expr_0_0 > 10;
> /* 394 */ if (!false && agg_value_40) {
> /* 395 */   agg_caseWhenResultState_1 = (byte)(false ? 1 : 0);
> /* 396 */   agg_agg_value_39_0 = agg_expr_0_0;
> /* 397 */   continue;
> /* 398 */ }
> /* 399 */
> /* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0);
> /* 401 */ agg_agg_value_39_0 = -1;
> /* 402 */
> /* 403 */   } while (false);
> /* 404 */   // TRUE if any condition is met and the result is null, or no 
> any condition is met.
> /* 405 */   final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != 
> 0);
> /* 406 */
> /* 407 */   if (!agg_isNull_36 && (agg_agg_isNull_34_0 ||
> /* 408 */   agg_value_37 > agg_agg_value_39_0)) {
> /* 409 */ agg_agg_isNull_34_0 = false;
> /* 410 */ agg_value_37 = agg_agg_value_39_0;
> /* 411 */   }
> /* 412 */   agg_agg_isNull_42_0 = true;
> /* 413 */   int agg_value_45 = -1;
> /* 414 */
> /* 415 */   boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2);
> /* 416 */   int agg_value_46 = agg_isNull_43 ?
> /* 417 */   -1 : (agg_fastAggBuffer_0.getInt(2));
> /* 418 */
> /* 419 */   if (!agg_isNull_43 && (agg_agg_isNull_42_0 ||
> /* 420 */   agg_value_45 > agg_value_46)) {
> /* 421 */ agg_agg_isNull_42_0 = false;
> /* 422 */ agg_value_45 = agg_value_46;
> /* 423 */   }
> /* 424 */
> /* 425 */   if (!false && 

[jira] [Assigned] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24901:


Assignee: (was: Apache Spark)

> Merge the codegen of RegularHashMap and fastHashMap to reduce compiler 
> maxCodesize when VectorizedHashMap is false
> --
>
> Key: SPARK-24901
> URL: https://issues.apache.org/jira/browse/SPARK-24901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: caoxuewen
>Priority: Major
>
> Currently, the generated code that updates UnsafeRow in hash aggregation keeps 
> FastHashMap and RegularHashMap as two separate code paths. The two separate 
> paths are only needed when VectorizedHashMap is true; in the other cases we 
> can merge them to reduce the compiler maxCodesize. Thanks.
> case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
> spark.sparkContext.parallelize(
>   DistinctAgg(8, 2, 3, 4, "a") ::
>   DistinctAgg(9, 3, 4, 5, "b") ::
>   Nil).toDF().createOrReplaceTempView("distinctAgg")
> val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else 
> null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e")
> println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan))
> df.show()
> The generated code looks like:
>  *Before modified:*
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> ...
> /* 354 */
> /* 355 */ if (agg_fastAggBuffer_0 != null) {
> /* 356 */   // common sub-expressions
> /* 357 */
> /* 358 */   // evaluate aggregate function
> /* 359 */   agg_agg_isNull_31_0 = true;
> /* 360 */   int agg_value_34 = -1;
> /* 361 */
> /* 362 */   boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0);
> /* 363 */   int agg_value_35 = agg_isNull_32 ?
> /* 364 */   -1 : (agg_fastAggBuffer_0.getInt(0));
> /* 365 */
> /* 366 */   if (!agg_isNull_32 && (agg_agg_isNull_31_0 ||
> /* 367 */   agg_value_34 > agg_value_35)) {
> /* 368 */ agg_agg_isNull_31_0 = false;
> /* 369 */ agg_value_34 = agg_value_35;
> /* 370 */   }
> /* 371 */
> /* 372 */   if (!false && (agg_agg_isNull_31_0 ||
> /* 373 */   agg_value_34 > agg_expr_2_0)) {
> /* 374 */ agg_agg_isNull_31_0 = false;
> /* 375 */ agg_value_34 = agg_expr_2_0;
> /* 376 */   }
> /* 377 */   agg_agg_isNull_34_0 = true;
> /* 378 */   int agg_value_37 = -1;
> /* 379 */
> /* 380 */   boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1);
> /* 381 */   int agg_value_38 = agg_isNull_35 ?
> /* 382 */   -1 : (agg_fastAggBuffer_0.getInt(1));
> /* 383 */
> /* 384 */   if (!agg_isNull_35 && (agg_agg_isNull_34_0 ||
> /* 385 */   agg_value_37 > agg_value_38)) {
> /* 386 */ agg_agg_isNull_34_0 = false;
> /* 387 */ agg_value_37 = agg_value_38;
> /* 388 */   }
> /* 389 */
> /* 390 */   byte agg_caseWhenResultState_1 = -1;
> /* 391 */   do {
> /* 392 */ boolean agg_value_40 = false;
> /* 393 */ agg_value_40 = agg_expr_0_0 > 10;
> /* 394 */ if (!false && agg_value_40) {
> /* 395 */   agg_caseWhenResultState_1 = (byte)(false ? 1 : 0);
> /* 396 */   agg_agg_value_39_0 = agg_expr_0_0;
> /* 397 */   continue;
> /* 398 */ }
> /* 399 */
> /* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0);
> /* 401 */ agg_agg_value_39_0 = -1;
> /* 402 */
> /* 403 */   } while (false);
> /* 404 */   // TRUE if any condition is met and the result is null, or no any condition is met.
> /* 405 */   final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != 0);
> /* 406 */
> /* 407 */   if (!agg_isNull_36 && (agg_agg_isNull_34_0 ||
> /* 408 */   agg_value_37 > agg_agg_value_39_0)) {
> /* 409 */ agg_agg_isNull_34_0 = false;
> /* 410 */ agg_value_37 = agg_agg_value_39_0;
> /* 411 */   }
> /* 412 */   agg_agg_isNull_42_0 = true;
> /* 413 */   int agg_value_45 = -1;
> /* 414 */
> /* 415 */   boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2);
> /* 416 */   int agg_value_46 = agg_isNull_43 ?
> /* 417 */   -1 : (agg_fastAggBuffer_0.getInt(2));
> /* 418 */
> /* 419 */   if (!agg_isNull_43 && (agg_agg_isNull_42_0 ||
> /* 420 */   agg_value_45 > agg_value_46)) {
> /* 421 */ agg_agg_isNull_42_0 = false;
> /* 422 */ agg_value_45 = agg_value_46;
> /* 423 */   }
> /* 424 */
> /* 425 */   if (!false && (agg_agg_isNull_42_0 ||
> /* 426 */   agg_value_45 > agg_expr_0_0)) {
> /* 427 */ 

[jira] [Updated] (SPARK-24902) Add integration tests for PVs

2018-07-24 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24902:

Description: 
PVs and hostPath support have been added recently 
(https://github.com/apache/spark/pull/21260/files) for Spark on K8s. 

We should have some integration tests based on local storage. 

It is easy to add PVs to minikube and attach them to pods.

We could target a known dir like /tmp for the PV path.

  was:
PVs and hostPath support have been added recently for Spark on K8s. 

We should have some integration tests based on local storage. 

It is easy to add PVs to minikube and attach them to pods.

We could target a known dir like /tmp for the PV path.


> Add integration tests for PVs
> -
>
> Key: SPARK-24902
> URL: https://issues.apache.org/jira/browse/SPARK-24902
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> PVs and hostPath support have been added recently 
> (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. 
> We should have some integration tests based on local storage. 
> It is easy to add PVs to minikube and attach them to pods.
> We could target a known dir like /tmp for the PV path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24903) Allow driver container name to be configurable in k8 spark

2018-07-24 Thread Yifei Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifei Huang updated SPARK-24903:

Description: 
We'd like to expose the container name as a configurable value. In case it 
changes in Spark, we want to be able to override it to guarantee consistency on 
our end. 

https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76

  was:We'd like to expose the container name as a configurable value. In case 
it changes in Spark, we want to be able to override it to guarantee consistency 
on our end.


> Allow driver container name to be configurable in k8 spark
> --
>
> Key: SPARK-24903
> URL: https://issues.apache.org/jira/browse/SPARK-24903
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Yifei Huang
>Priority: Major
>
> We'd like to expose the container name as a configurable value. In case it 
> changes in Spark, we want to be able to override it to guarantee consistency 
> on our end. 
> https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala#L76



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24903) Allow driver container name to be configurable in k8 spark

2018-07-24 Thread Yifei Huang (JIRA)
Yifei Huang created SPARK-24903:
---

 Summary: Allow driver container name to be configurable in k8 spark
 Key: SPARK-24903
 URL: https://issues.apache.org/jira/browse/SPARK-24903
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.1
Reporter: Yifei Huang


We'd like to expose the container name as a configurable value. In case it 
changes in Spark, we want to be able to override it to guarantee consistency on 
our end.
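
As a rough sketch of what such a setting could look like inside Spark's K8s config 
definitions (the config key, doc text, and default below are illustrative assumptions, 
not an agreed API; ConfigBuilder is Spark's internal config DSL, so this would only 
compile inside Spark itself):

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry: the key name is made up for this example and the
// default mirrors the driver container name currently hard-coded in Constants.scala.
val KUBERNETES_DRIVER_CONTAINER_NAME =
  ConfigBuilder("spark.kubernetes.driver.container.name")
    .doc("Name to use for the driver container instead of the built-in default.")
    .stringConf
    .createWithDefault("spark-kubernetes-driver")
{code}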



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24902) Add integration tests for PVs

2018-07-24 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24902:
---

 Summary: Add integration tests for PVs
 Key: SPARK-24902
 URL: https://issues.apache.org/jira/browse/SPARK-24902
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Stavros Kontopoulos


PVs and hostPath support have been added recently for Spark on K8s. 

We should have some integration tests based on local storage. 

It is easy to add PVs to minikube and attach them to pods.

We could target a known dir like /tmp for the PV path.
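
For example, a test could exercise the volume support roughly as follows (a sketch 
only; the property names are recalled from the hostPath/PV change this ticket refers 
to and should be verified against it, and the volume name "spark-local" is made up 
for the example):

{code:scala}
import org.apache.spark.SparkConf

// Sketch: mount a known host dir (/tmp) into the driver and executor pods via the
// hostPath volume confs; "spark-local" is an arbitrary volume name for the example.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.volumes.hostPath.spark-local.mount.path", "/tmp")
  .set("spark.kubernetes.driver.volumes.hostPath.spark-local.options.path", "/tmp")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local.mount.path", "/tmp")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local.options.path", "/tmp")
{code}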



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-07-24 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554099#comment-16554099
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 7/24/18 11:06 AM:
---

[~liyinan926] [~eje] Should we move forward with the implementation? Are there any 
more questions? Could you do another review round, please?


was (Author: skonto):
[~liyinan926] Should we move forward with the implementation? Are there any more 
questions? Could you do another review round, please?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-07-24 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554099#comment-16554099
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

[~liyinan926] Should we move forward with the implementation? Are there any more 
questions? Could you do another review round, please?

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24900) speed up sort when the dataset is small

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24900:


Assignee: Apache Spark

> speed up sort when the dataset is small
> ---
>
> Key: SPARK-24900
> URL: https://issues.apache.org/jira/browse/SPARK-24900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: SongXun
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When running SQL like 'select * from order where order_status = 4 order by 
> order_id', the physical plan is:
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl01gtso1j31kw0ag43q.jpg!
> The Exchange rangepartitioning has two steps:
> 1. Sample the RDD and build the RangePartitioner, which holds the rangeBounds.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0a6ziytj30le0su0vw.jpg!
> 2. Build the rddWithPartitionIds based on the rangeBounds, and do the 
> shuffle.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0bapn6mj30l40ke769.jpg!
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0brnj3jj30kq0mu40q.jpg!
> The file scan and filter are therefore executed twice, which may take a long 
> time. If the final dataset is small and the sample already covers all the data, 
> there is no need to do so.
> !https://ws3.sinaimg.cn/large/006tNc79ly1ftl3u0091wj31600a8wh1.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24900) speed up sort when the dataset is small

2018-07-24 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554094#comment-16554094
 ] 

Apache Spark commented on SPARK-24900:
--

User 'sddyljsx' has created a pull request for this issue:
https://github.com/apache/spark/pull/21859

> speed up sort when the dataset is small
> ---
>
> Key: SPARK-24900
> URL: https://issues.apache.org/jira/browse/SPARK-24900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: SongXun
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When running SQL like 'select * from order where order_status = 4 order by 
> order_id', the physical plan is:
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl01gtso1j31kw0ag43q.jpg!
> The Exchange rangepartitioning has two steps:
> 1. Sample the RDD and build the RangePartitioner, which holds the rangeBounds.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0a6ziytj30le0su0vw.jpg!
> 2. Build the rddWithPartitionIds based on the rangeBounds, and do the 
> shuffle.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0bapn6mj30l40ke769.jpg!
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0brnj3jj30kq0mu40q.jpg!
> The file scan and filter are therefore executed twice, which may take a long 
> time. If the final dataset is small and the sample already covers all the data, 
> there is no need to do so.
> !https://ws3.sinaimg.cn/large/006tNc79ly1ftl3u0091wj31600a8wh1.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24900) speed up sort when the dataset is small

2018-07-24 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24900:


Assignee: (was: Apache Spark)

> speed up sort when the dataset is small
> ---
>
> Key: SPARK-24900
> URL: https://issues.apache.org/jira/browse/SPARK-24900
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: SongXun
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When running SQL like 'select * from order where order_status = 4 order by 
> order_id', the physical plan is:
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl01gtso1j31kw0ag43q.jpg!
> The Exchange rangepartitioning has two steps:
> 1. Sample the RDD and build the RangePartitioner, which holds the rangeBounds.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0a6ziytj30le0su0vw.jpg!
> 2. Build the rddWithPartitionIds based on the rangeBounds, and do the 
> shuffle.
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0bapn6mj30l40ke769.jpg!
> !https://ws1.sinaimg.cn/large/006tNc79ly1ftl0brnj3jj30kq0mu40q.jpg!
> The file scan and filter are therefore executed twice, which may take a long 
> time. If the final dataset is small and the sample already covers all the data, 
> there is no need to do so.
> !https://ws3.sinaimg.cn/large/006tNc79ly1ftl3u0091wj31600a8wh1.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24901) Merge the codegen of RegularHashMap and fastHashMap to reduce compiler maxCodesize when VectorizedHashMap is false

2018-07-24 Thread caoxuewen (JIRA)
caoxuewen created SPARK-24901:
-

 Summary: Merge the codegen of RegularHashMap and fastHashMap to 
reduce compiler maxCodesize when VectorizedHashMap is false
 Key: SPARK-24901
 URL: https://issues.apache.org/jira/browse/SPARK-24901
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, the generated code that updates the UnsafeRow in hash aggregation 
emits two separate code paths, one for the FastHashMap and one for the 
RegularHashMap. These separate paths are only needed when VectorizedHashMap is 
true; in the other cases they can be merged to reduce the compiled method size 
(maxCodesize). Thanks.
case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)
spark.sparkContext.parallelize(
  DistinctAgg(8, 2, 3, 4, "a") ::
  DistinctAgg(9, 3, 4, 5, "b") 
::Nil).toDF().createOrReplaceTempView("distinctAgg")
val df = sql("select a,b,e, min(d) as mind, min(case when a > 10 then a else 
null end) as mincasea, min(a) as mina from distinctAgg group by a, b, e")
println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan))
df.show()
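
For reference, a self-contained version of the reproduction above (a sketch only; 
the app scaffolding, local master, and imports are additions that are not part of 
the original snippet):

{code:scala}
import org.apache.spark.sql.SparkSession

object MinAggCodegenRepro {
  case class DistinctAgg(a: Int, b: Float, c: Double, d: Int, e: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-24901 repro")
      .getOrCreate()
    import spark.implicits._

    spark.sparkContext.parallelize(
      DistinctAgg(8, 2, 3, 4, "a") ::
      DistinctAgg(9, 3, 4, 5, "b") :: Nil).toDF().createOrReplaceTempView("distinctAgg")

    val df = spark.sql(
      "select a, b, e, min(d) as mind, " +
        "min(case when a > 10 then a else null end) as mincasea, " +
        "min(a) as mina from distinctAgg group by a, b, e")

    // Print the whole-stage generated code for the executed plan, then run the query.
    println(org.apache.spark.sql.execution.debug.codegenString(df.queryExecution.executedPlan))
    df.show()
    spark.stop()
  }
}
{code}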

The generated code looks like:
 *Before modified:*
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
...
/* 354 */
/* 355 */ if (agg_fastAggBuffer_0 != null) {
/* 356 */   // common sub-expressions
/* 357 */
/* 358 */   // evaluate aggregate function
/* 359 */   agg_agg_isNull_31_0 = true;
/* 360 */   int agg_value_34 = -1;
/* 361 */
/* 362 */   boolean agg_isNull_32 = agg_fastAggBuffer_0.isNullAt(0);
/* 363 */   int agg_value_35 = agg_isNull_32 ?
/* 364 */   -1 : (agg_fastAggBuffer_0.getInt(0));
/* 365 */
/* 366 */   if (!agg_isNull_32 && (agg_agg_isNull_31_0 ||
/* 367 */   agg_value_34 > agg_value_35)) {
/* 368 */ agg_agg_isNull_31_0 = false;
/* 369 */ agg_value_34 = agg_value_35;
/* 370 */   }
/* 371 */
/* 372 */   if (!false && (agg_agg_isNull_31_0 ||
/* 373 */   agg_value_34 > agg_expr_2_0)) {
/* 374 */ agg_agg_isNull_31_0 = false;
/* 375 */ agg_value_34 = agg_expr_2_0;
/* 376 */   }
/* 377 */   agg_agg_isNull_34_0 = true;
/* 378 */   int agg_value_37 = -1;
/* 379 */
/* 380 */   boolean agg_isNull_35 = agg_fastAggBuffer_0.isNullAt(1);
/* 381 */   int agg_value_38 = agg_isNull_35 ?
/* 382 */   -1 : (agg_fastAggBuffer_0.getInt(1));
/* 383 */
/* 384 */   if (!agg_isNull_35 && (agg_agg_isNull_34_0 ||
/* 385 */   agg_value_37 > agg_value_38)) {
/* 386 */ agg_agg_isNull_34_0 = false;
/* 387 */ agg_value_37 = agg_value_38;
/* 388 */   }
/* 389 */
/* 390 */   byte agg_caseWhenResultState_1 = -1;
/* 391 */   do {
/* 392 */ boolean agg_value_40 = false;
/* 393 */ agg_value_40 = agg_expr_0_0 > 10;
/* 394 */ if (!false && agg_value_40) {
/* 395 */   agg_caseWhenResultState_1 = (byte)(false ? 1 : 0);
/* 396 */   agg_agg_value_39_0 = agg_expr_0_0;
/* 397 */   continue;
/* 398 */ }
/* 399 */
/* 400 */ agg_caseWhenResultState_1 = (byte)(true ? 1 : 0);
/* 401 */ agg_agg_value_39_0 = -1;
/* 402 */
/* 403 */   } while (false);
/* 404 */   // TRUE if any condition is met and the result is null, or no any condition is met.
/* 405 */   final boolean agg_isNull_36 = (agg_caseWhenResultState_1 != 0);
/* 406 */
/* 407 */   if (!agg_isNull_36 && (agg_agg_isNull_34_0 ||
/* 408 */   agg_value_37 > agg_agg_value_39_0)) {
/* 409 */ agg_agg_isNull_34_0 = false;
/* 410 */ agg_value_37 = agg_agg_value_39_0;
/* 411 */   }
/* 412 */   agg_agg_isNull_42_0 = true;
/* 413 */   int agg_value_45 = -1;
/* 414 */
/* 415 */   boolean agg_isNull_43 = agg_fastAggBuffer_0.isNullAt(2);
/* 416 */   int agg_value_46 = agg_isNull_43 ?
/* 417 */   -1 : (agg_fastAggBuffer_0.getInt(2));
/* 418 */
/* 419 */   if (!agg_isNull_43 && (agg_agg_isNull_42_0 ||
/* 420 */   agg_value_45 > agg_value_46)) {
/* 421 */ agg_agg_isNull_42_0 = false;
/* 422 */ agg_value_45 = agg_value_46;
/* 423 */   }
/* 424 */
/* 425 */   if (!false && (agg_agg_isNull_42_0 ||
/* 426 */   agg_value_45 > agg_expr_0_0)) {
/* 427 */ agg_agg_isNull_42_0 = false;
/* 428 */ agg_value_45 = agg_expr_0_0;
/* 429 */   }
/* 430 */   // update fast row
/* 431 */   agg_fastAggBuffer_0.setInt(0, agg_value_34);
/* 432 */
/* 433 */   if (!agg_agg_isNull_34_0) {
/* 434 */ agg_fastAggBuffer_0.setInt(1, agg_value_37);
/* 435 */   } else {
/* 436 */ agg_fastAggBuffer_0.setNullAt(1);
/* 437 */   }
/* 438 */
/* 439 */   agg_fastAggBuffer_0.setInt(2, agg_value_45);
/* 440 */ } else {
/* 441 */   // common 

[jira] [Created] (SPARK-24900) speed up sort when the dataset is small

2018-07-24 Thread SongXun (JIRA)
SongXun created SPARK-24900:
---

 Summary: speed up sort when the dataset is small
 Key: SPARK-24900
 URL: https://issues.apache.org/jira/browse/SPARK-24900
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: SongXun


When running SQL like 'select * from order where order_status = 4 order by 
order_id', the physical plan is:
!https://ws1.sinaimg.cn/large/006tNc79ly1ftl01gtso1j31kw0ag43q.jpg!
The Exchange rangepartitioning has two steps:

1. Sample the RDD and build the RangePartitioner, which holds the rangeBounds.

!https://ws1.sinaimg.cn/large/006tNc79ly1ftl0a6ziytj30le0su0vw.jpg!

2. Build the rddWithPartitionIds based on the rangeBounds, and do the shuffle.

!https://ws1.sinaimg.cn/large/006tNc79ly1ftl0bapn6mj30l40ke769.jpg!

!https://ws1.sinaimg.cn/large/006tNc79ly1ftl0brnj3jj30kq0mu40q.jpg!
The file scan and filter are therefore executed twice, which may take a long 
time. If the final dataset is small and the sample already covers all the data, 
there is no need to do so.

!https://ws3.sinaimg.cn/large/006tNc79ly1ftl3u0091wj31600a8wh1.jpg!
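
To make the two passes concrete, a rough RDD-level sketch (not the exact plan above; 
it assumes an active SparkSession named `spark`, and the modulo filter merely stands 
in for the order_status predicate):

{code:scala}
import org.apache.spark.RangePartitioner

val filtered = spark.sparkContext
  .range(0L, 1000000L)
  .filter(_ % 7 == 4)                  // stands in for `order_status = 4`
  .map(id => (id, ()))                 // key by the sort column

// Pass 1: building the RangePartitioner samples `filtered` to compute rangeBounds,
// which evaluates the scan and filter once.
val partitioner = new RangePartitioner(200, filtered)

// Pass 2: the shuffle re-evaluates the scan and filter to assign partition ids.
val sorted = filtered.repartitionAndSortWithinPartitions(partitioner)
sorted.count()
{code}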



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-07-24 Thread Genmao Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16554043#comment-16554043
 ] 

Genmao Yu commented on SPARK-24630:
---

{{Structured Streaming supports standard SQL as the batch queries, so the users 
can switch their queries between batch and streaming easily.}}

IIUC, there are some queries we cannot switch from stream to batch, like 
"groupBy window aggregation" or "over window aggregation" on a stream. Isn't 
that the case?
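
For instance, a groupBy-window aggregation over an event-time stream (a minimal 
sketch using the built-in rate source; the column rename is only there to give the 
example an event-time column) depends on watermarks and output modes that have no 
direct batch counterpart:

{code:scala}
import org.apache.spark.sql.functions.{col, window}

// Minimal streaming groupBy-window aggregation; `spark` is an active SparkSession.
val events = spark.readStream
  .format("rate")                        // test source emitting (timestamp, value) rows
  .load()
  .withColumnRenamed("timestamp", "eventTime")

val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .count()
{code}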

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which users 
> with little knowledge about streaming can easily develop a stream processing 
> model. In Spark, we can also support such a SQL API based on Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the corresponding 
> Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


