[jira] [Updated] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romeo Kienzer updated SPARK-24828:
--
Description: 
As requested by [~hyukjin.kwon], here is a new issue; the related issue can be found here.

 

Reading the attached parquet file, written by one Spark installation, with an installation 
obtained directly from [http://spark.apache.org/downloads.html] 
yields the following exception:

 

18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
 scala.MatchError: [1.0,null] (of class 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
     at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
     at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
     at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
     at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
     at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
     at org.apache.spark.scheduler.Task.run(Task.scala:99)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
 18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
 java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
     at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
     at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
     at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
     at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
     at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
     at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
     at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
     at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
     at org.apache.spark.scheduler.Task.run(Task.scala:99)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)

 

The file is attached [^a2_m2.parquet.zip]

 

The following code reproduces the error:

{code}
df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

accuracy = binEval.evaluate(df)
{code}
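One way to start narrowing down the long/int mismatch is to compare the schema Spark reads from the file with the types the evaluator is fed. A minimal sketch from a Scala spark-shell (the column names follow the reproducer above; the cast is only an illustrative workaround, not a fix for the reader itself):

{code}
import org.apache.spark.sql.functions.col

val raw = spark.read.parquet("a2_m2.parquet")
// A LongType column here, read through a code path that expects IntegerType,
// would be consistent with the PlainLongDictionary/decodeToInt failure above.
raw.printSchema()

// Illustrative workaround only: force label/prediction to double before evaluating.
val casted = raw
  .withColumn("label", col("label").cast("double"))
  .withColumn("prediction", col("prediction").cast("double"))
casted.printSchema()
{code}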

  was:
As requested by [~hyukjin.kwon], here is a new issue; the related issue can be found 
here #3

 

Reading the attached parquet file, written by one Spark installation, with an installation 
obtained directly from [http://spark.apache.org/downloads.html] 
yields the following exception:

 

18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
scala.MatchError: [1.0,null] (of class 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at 

[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546059#comment-16546059
 ] 

Romeo Kienzer commented on SPARK-17557:
---

Dear [~hyukjin.kwon] - I've done so - new issue is SPARK-24828

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
> Attachments: a2_m2.parquet.zip
>
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}






[jira] [Created] (SPARK-24828) Incompatible parquet formats - java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)
Romeo Kienzer created SPARK-24828:
-

 Summary: Incompatible parquet formats - 
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
 Key: SPARK-24828
 URL: https://issues.apache.org/jira/browse/SPARK-24828
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: Environment for creating the parquet file:

IBM Watson Studio Apache Spark Service, V2.1.2

Environment for reading the parquet file:

java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)

MacOSX 10.13.3 (17D47)

Spark spark-2.1.2-bin-hadoop2.7 directly obtained from 
http://spark.apache.org/downloads.html
Reporter: Romeo Kienzer


As requested by [~hyukjin.kwon], here is a new issue; the related issue can be found 
here #3

 

Reading the attached parquet file, written by one Spark installation, with an installation 
obtained directly from [http://spark.apache.org/downloads.html] 
yields the following exception:

 

18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4)
scala.MatchError: [1.0,null] (of class 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at 
org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
    at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
    at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
    at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
    at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
    at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
    at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

 

The file is attached [^a2_m2.parquet.zip]

 

The following code reproduces the error:

{code}
df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

accuracy = binEval.evaluate(df)
{code}






[jira] [Commented] (SPARK-24568) Code refactoring for DataType equalsXXX methods

2018-07-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16546009#comment-16546009
 ] 

Apache Spark commented on SPARK-24568:
--

User 'swapnilushinde' has created a pull request for this issue:
https://github.com/apache/spark/pull/21787

> Code refactoring for DataType equalsXXX methods
> ---
>
> Key: SPARK-24568
> URL: https://issues.apache.org/jira/browse/SPARK-24568
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Major
>
> Right now there is a lot of code duplication between all of the DataType equalsXXX 
> methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCaseAndNullability}}, 
> {{equalsIgnoreCompatibleNullability}}, {{equalsStructurally}}. We can replace 
> the duplicated code with a helper function.
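The kind of shared helper the description points at could look roughly like the sketch below (illustrative only; the object and parameter names are assumptions, not the actual Spark patch): one recursive comparison parameterised by how field names and nullability are compared, with the existing equalsXXX methods reduced to thin wrappers.

{code}
import org.apache.spark.sql.types._

object DataTypeEquality {
  // One recursive walk over the type tree; callers choose how names and
  // nullability are compared.
  def compare(
      from: DataType,
      to: DataType,
      sameName: (String, String) => Boolean,
      ignoreNullability: Boolean): Boolean = (from, to) match {
    case (ArrayType(fromElem, fn), ArrayType(toElem, tn)) =>
      (ignoreNullability || fn == tn) &&
        compare(fromElem, toElem, sameName, ignoreNullability)
    case (MapType(fk, fv, fn), MapType(tk, tv, tn)) =>
      (ignoreNullability || fn == tn) &&
        compare(fk, tk, sameName, ignoreNullability) &&
        compare(fv, tv, sameName, ignoreNullability)
    case (StructType(fromFields), StructType(toFields)) =>
      fromFields.length == toFields.length &&
        fromFields.zip(toFields).forall { case (f, t) =>
          sameName(f.name, t.name) &&
            (ignoreNullability || f.nullable == t.nullable) &&
            compare(f.dataType, t.dataType, sameName, ignoreNullability)
        }
    case _ => from == to
  }

  // The existing methods then collapse to one-liners, e.g.:
  def equalsIgnoreNullability(from: DataType, to: DataType): Boolean =
    compare(from, to, _ == _, ignoreNullability = true)

  def equalsIgnoreCaseAndNullability(from: DataType, to: DataType): Boolean =
    compare(from, to, _.equalsIgnoreCase(_), ignoreNullability = true)
}
{code}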






[jira] [Assigned] (SPARK-24568) Code refactoring for DataType equalsXXX methods

2018-07-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24568:


Assignee: (was: Apache Spark)

> Code refactoring for DataType equalsXXX methods
> ---
>
> Key: SPARK-24568
> URL: https://issues.apache.org/jira/browse/SPARK-24568
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Priority: Major
>
> Right now there is a lot of code duplication between all of the DataType equalsXXX 
> methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCaseAndNullability}}, 
> {{equalsIgnoreCompatibleNullability}}, {{equalsStructurally}}. We can replace 
> the duplicated code with a helper function.






[jira] [Assigned] (SPARK-24568) Code refactoring for DataType equalsXXX methods

2018-07-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24568:


Assignee: Apache Spark

> Code refactoring for DataType equalsXXX methods
> ---
>
> Key: SPARK-24568
> URL: https://issues.apache.org/jira/browse/SPARK-24568
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Major
>
> Right now there is a lot of code duplication between all of the DataType equalsXXX 
> methods: {{equalsIgnoreNullability}}, {{equalsIgnoreCaseAndNullability}}, 
> {{equalsIgnoreCompatibleNullability}}, {{equalsStructurally}}. We can replace 
> the duplicated code with a helper function.






[jira] [Updated] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty

2018-07-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24402:
-
Fix Version/s: (was: 2.4.0)

> Optimize `In` expression when only one element in the collection or 
> collection is empty 
> 
>
> Key: SPARK-24402
> URL: https://issues.apache.org/jira/browse/SPARK-24402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Two new rules are added to the logical plan optimizer.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical
> plan will be simplified to
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
> TODO:
>  # For multiple conditions with counts below certain thresholds,
> we should still allow predicate pushdown.
>  # Optimize `In` using tableswitch or lookupswitch when the
> number of categories is low and they are `Int` or `Long`.
>  # The default immutable hash tree set is slow to query, and we
> should benchmark different set implementations for faster
> queries.
>  # `filter(if (condition) null else false)` can be optimized to false.
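The first rule quoted above amounts to a Catalyst rewrite of roughly this shape (a sketch under the assumption of Spark 2.3-era Catalyst internals; it is not the implementation from the actual pull request):

{code}
import org.apache.spark.sql.catalyst.expressions.{EqualTo, In}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// In(a, Seq(v)) has the same semantics as a = v, and EqualTo is a filter
// shape that data sources know how to push down.
object RewriteSingleElementIn extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    case In(value, Seq(single)) => EqualTo(value, single)
  }
}
{code}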






[jira] [Reopened] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty

2018-07-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-24402:
--

This was reverted.

> Optimize `In` expression when only one element in the collection or 
> collection is empty 
> 
>
> Key: SPARK-24402
> URL: https://issues.apache.org/jira/browse/SPARK-24402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>
> Two new rules are added to the logical plan optimizer.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical
> plan will be simplified to
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
> TODO:
>  # For multiple conditions with counts below certain thresholds,
> we should still allow predicate pushdown.
>  # Optimize `In` using tableswitch or lookupswitch when the
> number of categories is low and they are `Int` or `Long`.
>  # The default immutable hash tree set is slow to query, and we
> should benchmark different set implementations for faster
> queries.
>  # `filter(if (condition) null else false)` can be optimized to false.






[jira] [Resolved] (SPARK-24798) sortWithinPartitions(xx) will failed in java.lang.NullPointerException

2018-07-16 Thread shengyao piao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shengyao piao resolved SPARK-24798.
---
Resolution: Not A Problem

> sortWithinPartitions(xx) will failed in java.lang.NullPointerException
> --
>
> Key: SPARK-24798
> URL: https://issues.apache.org/jira/browse/SPARK-24798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: shengyao piao
>Priority: Minor
>
> I have an issue in Spark 2.3 when I run the code below in spark-shell or 
> spark-submit.
> I already figured out that the reason for the error is that the name field 
> contains Some(null),
> but I believe this code will run successfully in Spark 2.2.
> Is this expected behavior in Spark 2.3?
>  
> ・Spark code
> {code}
> case class Hoge(id: Int, name: Option[String])
>
> val ds = spark.createDataFrame(Array((1, "John"), (2, null)))
>   .withColumnRenamed("_1", "id").withColumnRenamed("_2", "name")
>   .map(row => Hoge(row.getAs[Int]("id"), Some(row.getAs[String]("name"))))
>
> ds.sortWithinPartitions("id").foreachPartition(iter => println(iter.isEmpty))
> {code}
> ・Error
> {code}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.isEmpty(Iterator.scala:330)
> at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1336)
> at 
> $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
> at 
> $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
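Given the "Not A Problem" resolution: one way to avoid the Some(null) in the quoted reproducer is to build the Option with Option(...), which maps null to None. A minimal sketch, assuming a spark-shell session (so spark and spark.implicits._ are in scope), and not necessarily what the commenters suggested:

{code}
case class Hoge(id: Int, name: Option[String])

// Option(...) turns a null value into None instead of Some(null), so the
// row serializer no longer hits a null where a String is expected.
val ds = spark.createDataFrame(Array((1, "John"), (2, null)))
  .withColumnRenamed("_1", "id").withColumnRenamed("_2", "name")
  .map(row => Hoge(row.getAs[Int]("id"), Option(row.getAs[String]("name"))))

ds.sortWithinPartitions("id").foreachPartition(iter => println(iter.isEmpty))
{code}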






[jira] [Commented] (SPARK-24798) sortWithinPartitions(xx) will failed in java.lang.NullPointerException

2018-07-16 Thread shengyao piao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545973#comment-16545973
 ] 

shengyao piao commented on SPARK-24798:
---

Hi [~mahmoudmahdi24], [~dmateusp]

Thank you! It helped me.


> sortWithinPartitions(xx) will failed in java.lang.NullPointerException
> --
>
> Key: SPARK-24798
> URL: https://issues.apache.org/jira/browse/SPARK-24798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: shengyao piao
>Priority: Minor
>
> I have an issue in Spark 2.3 when I run the code below in spark-shell or 
> spark-submit.
> I already figured out that the reason for the error is that the name field 
> contains Some(null),
> but I believe this code will run successfully in Spark 2.2.
> Is this expected behavior in Spark 2.3?
>  
> ・Spark code
> {code}
> case class Hoge(id: Int, name: Option[String])
>
> val ds = spark.createDataFrame(Array((1, "John"), (2, null)))
>   .withColumnRenamed("_1", "id").withColumnRenamed("_2", "name")
>   .map(row => Hoge(row.getAs[Int]("id"), Some(row.getAs[String]("name"))))
>
> ds.sortWithinPartitions("id").foreachPartition(iter => println(iter.isEmpty))
> {code}
> ・Error
> {code}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:194)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$class.isEmpty(Iterator.scala:330)
> at scala.collection.AbstractIterator.isEmpty(Iterator.scala:1336)
> at 
> $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
> at 
> $line37.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:26)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
> at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:929)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2067)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Commented] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-16 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545962#comment-16545962
 ] 

Saisai Shao commented on SPARK-24615:
-

Sorry [~tgraves] for the late response. Yes, when requesting executors, the user 
should specify whether accelerators are required. If no accelerators satisfy the 
request, the job will be pending or will not be launched.

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors that are 
> equipped with accelerators.
> Spark's current scheduler schedules tasks based on data locality plus the 
> availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that every node has CPUs, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-07-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545961#comment-16545961
 ] 

Hyukjin Kwon commented on SPARK-17557:
--

Please go ahead, but I would alternatively open a separate ticket and leave a 
"relates to" link to this JIRA, because the reproducer and affected version are 
apparently different.

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
> Attachments: a2_m2.parquet.zip
>
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}






[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions

2018-07-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545960#comment-16545960
 ] 

Hyukjin Kwon commented on SPARK-23255:
--

Please go ahead.

> Add user guide and examples for DataFrame image reading functions
> -
>
> Key: SPARK-23255
> URL: https://issues.apache.org/jira/browse/SPARK-23255
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-21866 added built-in support for reading image data into a DataFrame. 
> This new functionality should be documented in the user guide, with example 
> usage.
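For context, the kind of example usage the guide would need to show, sketched against the Spark 2.3 Scala API (the sample path is illustrative):

{code}
import org.apache.spark.ml.image.ImageSchema

// Reads images from a directory into a DataFrame with a single "image"
// struct column (origin, height, width, nChannels, mode, data).
val images = ImageSchema.readImages("data/mllib/images")
images.select("image.origin", "image.width", "image.height").show(truncate = false)
{code}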






[jira] [Resolved] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1

2018-07-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24644.
--
Resolution: Invalid

Let me leave this resolved. Please reopen it if the same issue exists in a 
higher version of Pandas.

> Pyarrow exception while running pandas_udf on pyspark 2.3.1
> ---
>
> Key: SPARK-24644
> URL: https://issues.apache.org/jira/browse/SPARK-24644
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: os: centos
> pyspark 2.3.1
> spark 2.3.1
> pyarrow >= 0.8.0
>Reporter: Hichame El Khalfi
>Priority: Major
>
> Hello,
> When I try to run a `pandas_udf` on my spark dataframe, I get this error
>  
> {code:java}
>   File 
> "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py",
>  lin
> e 280, in load_stream
> pdf = batch.to_pandas()
>   File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226)
> return Table.from_batches([self]).to_pandas(nthreads=nthreads)
>   File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331)
> mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 528, in table_to_blockmanager
> blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 622, in _table_to_blocks
> return [_reconstruct_block(item) for item in result]
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 446, in _reconstruct_block
> block = _int.make_block(block_arr, placement=placement)
> TypeError: make_block() takes at least 3 arguments (2 given)
> {code}
>  
>  More than happy to provide any additional information






[jira] [Resolved] (SPARK-20220) Add thrift scheduling pool config in scheduling docs

2018-07-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20220.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21778
[https://github.com/apache/spark/pull/21778]

> Add thrift scheduling pool config in scheduling docs
> 
>
> Key: SPARK-20220
> URL: https://issues.apache.org/jira/browse/SPARK-20220
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Miklos Christine
>Assignee: Miklos Christine
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Spark 1.2 docs document the thrift job scheduling pool. 
> https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
> This configuration is no longer documented in the 2.x documentation. 
> Adding this back to the job scheduling docs. 






[jira] [Assigned] (SPARK-20220) Add thrift scheduling pool config in scheduling docs

2018-07-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-20220:


Assignee: Miklos Christine

> Add thrift scheduling pool config in scheduling docs
> 
>
> Key: SPARK-20220
> URL: https://issues.apache.org/jira/browse/SPARK-20220
> Project: Spark
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Miklos Christine
>Assignee: Miklos Christine
>Priority: Trivial
> Fix For: 2.4.0
>
>
> Spark 1.2 docs document the thrift job scheduling pool. 
> https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md
> This configuration is no longer documented in the 2.x documentation. 
> Adding this back to the job scheduling docs. 






[jira] [Resolved] (SPARK-23259) Clean up legacy code around hive external catalog

2018-07-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23259.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21780
[https://github.com/apache/spark/pull/21780]

> Clean up legacy code around hive external catalog
> -
>
> Key: SPARK-23259
> URL: https://issues.apache.org/jira/browse/SPARK-23259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Feng Liu
>Assignee: Feng Liu
>Priority: Major
> Fix For: 2.4.0
>
>
> Some legacy code around the Hive metastore catalog needs to be removed for 
> further code improvement:
>  # in HiveExternalCatalog: the `withClient` wrapper is not necessary for the 
> private method `getRawTable`.
>  # in HiveClientImpl: the `runSqlHive()` statement is not necessary in the 
> `addJar` method once the jar has been added to the single class loader.
>  # in HiveClientImpl: there is some redundant code in both the `tableExists` 
> and `getTableOption` methods.






[jira] [Assigned] (SPARK-23259) Clean up legacy code around hive external catalog

2018-07-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23259:


Assignee: Feng Liu

> Clean up legacy code around hive external catalog
> -
>
> Key: SPARK-23259
> URL: https://issues.apache.org/jira/browse/SPARK-23259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Feng Liu
>Assignee: Feng Liu
>Priority: Major
> Fix For: 2.4.0
>
>
> Some legacy code around the Hive metastore catalog needs to be removed for 
> further code improvement:
>  # in HiveExternalCatalog: the `withClient` wrapper is not necessary for the 
> private method `getRawTable`.
>  # in HiveClientImpl: the `runSqlHive()` statement is not necessary in the 
> `addJar` method once the jar has been added to the single class loader.
>  # in HiveClientImpl: there is some redundant code in both the `tableExists` 
> and `getTableOption` methods.






[jira] [Updated] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF

2018-07-16 Thread chenzhiming (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenzhiming updated SPARK-21481:

Attachment: idea64.exe

> Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
> -
>
> Key: SPARK-21481
> URL: https://issues.apache.org/jira/browse/SPARK-21481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Aseem Bansal
>Priority: Major
>
> If we want to find the index of any input based on the hashing trick, it is 
> possible in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
>  but not in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF.
> We should allow that for feature parity.
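For reference, a minimal sketch of what is already possible on the mllib side (the term and the feature count here are arbitrary):

{code}
import org.apache.spark.mllib.feature.HashingTF

// mllib's HashingTF exposes the hash-based index of a single term directly.
val tf = new HashingTF(numFeatures = 1 << 18)
val idx = tf.indexOf("spark")
println(s"'spark' hashes to feature index $idx")
{code}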






[jira] [Updated] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF

2018-07-16 Thread chenzhiming (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenzhiming updated SPARK-21481:

Attachment: (was: idea64.exe)

> Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
> -
>
> Key: SPARK-21481
> URL: https://issues.apache.org/jira/browse/SPARK-21481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Aseem Bansal
>Priority: Major
>
> If we want to find the index of any input based on the hashing trick, it is 
> possible in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
>  but not in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF.
> We should allow that for feature parity.






[jira] [Updated] (SPARK-24615) Accelerator-aware task scheduling for Spark

2018-07-16 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24615:
--
Summary: Accelerator-aware task scheduling for Spark  (was: Accelerator 
aware task scheduling for Spark)

> Accelerator-aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors that are 
> equipped with accelerators.
> Spark's current scheduler schedules tasks based on data locality plus the 
> availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # CPU cores usually outnumber accelerators on one node, so using CPU cores 
> to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster, we can always assume that every node has CPUs, but 
> this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]






[jira] [Updated] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-07-16 Thread Michael Yannakopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Yannakopoulos updated SPARK-24826:
--
Description: 
Running a self-join against a table derived from a parquet file with many 
columns fails during the planning phase with the following stack-trace:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
 Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
coordinator[target post-shuffle partition size: 67108864]
 +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, 
member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, 
int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, emp_length#12, 
home_ownership#13, annual_inc#14, verification_status#15, issue_d#16, 
loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, title#22, 
zip_code#23, ... 92 more fields]
 +- Filter isnotnull(_row_id#0L)
 +- FileScan parquet 
[_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
 92 more 
fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
 92 more fields] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
 PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:133)
 at 

[jira] [Created] (SPARK-24827) Some memory waste in History Server by strings in AccumulableInfo objects

2018-07-16 Thread Misha Dmitriev (JIRA)
Misha Dmitriev created SPARK-24827:
--

 Summary: Some memory waste in History Server by strings in 
AccumulableInfo objects
 Key: SPARK-24827
 URL: https://issues.apache.org/jira/browse/SPARK-24827
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.2
Reporter: Misha Dmitriev


I've analyzed a heap dump of the Spark History Server with jxray 
([www.jxray.com|http://www.jxray.com/]) and found that 42% of the heap is 
wasted due to duplicate strings. The biggest sources of such strings are the 
{{name}} and {{value}} data fields of {{AccumulableInfo}} objects:
{code:java}
7. Duplicate Strings:  overhead 42.1% 

  Total strings   Unique strings   Duplicate values  Overhead 
    13,732,278     729,234   354,032 867,177K (42.1%)

Expensive data fields:


318,421K (15.4%), 3669685 / 100% dup strings (8 unique), 3669685 dup backing 
arrays:

 ↖org.apache.spark.scheduler.AccumulableInfo.name

178,994K (8.7%), 3674403 / 99% dup strings (35640 unique), 3674403 dup backing 
arrays:

 ↖scala.Some.x

168,601K (8.2%), 3401960 / 92% dup strings (175826 unique), 3401960 dup backing 
arrays:

 ↖org.apache.spark.scheduler.AccumulableInfo.value{code}
That is, 15.4% of the heap is wasted by {{AccumulableInfo.name}} and 8.2% is 
wasted by {{AccumulableInfo.value}}.

It turns out that the problem has been partially addressed in Spark 2.3+, e.g.

[https://github.com/apache/spark/blob/b045315e5d87b7ea3588436053aaa4d5a7bd103f/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L590]

However, this code has two minor problems:
 # Strings for {{AccumulableInfo.value}} are not interned in the above code, 
only {{AccumulableInfo.name}}.
 # For interning, the code in the {{weakIntern(String)}} method uses a Guava 
interner ({{stringInterner = Interners.newWeakInterner[String]()}}). This is an 
old-fashioned, less efficient way of interning strings. Since the JDK 7 releases 
of some 3-4 years ago, the built-in JVM {{String.intern()}} method has been much 
more efficient, both in terms of CPU and memory.

It is therefore suggested to add interning for {{value}} and replace the Guava 
interner with {{String.intern()}}.
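A minimal sketch of what the suggestion amounts to (illustrative only; the helper name mirrors the {{weakIntern}} method mentioned above, and the actual call sites in LiveEntity are not shown):

{code}
object StringInterning {
  // JVM-based interning instead of a Guava weak interner; String.intern()
  // has been cheap since the interned-string table moved to the heap in JDK 7.
  def weakIntern(s: String): String =
    if (s == null) null else s.intern()
}

// Both fields would then go through it when accumulator info is copied, e.g.
//   name  -> StringInterning.weakIntern(name)
//   value -> StringInterning.weakIntern(value)
{code}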






[jira] [Updated] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-07-16 Thread Michael Yannakopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Yannakopoulos updated SPARK-24826:
--
Description: 
Running a self-join against a table derived from a parquet file with many 
columns fails during the planning phase with the following stack-trace:

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
 Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
coordinator[target post-shuffle partition size: 67108864]
 +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields|#0L, id#1L, 
member_id#2L, loan_amnt#3L, funded_amnt#4L, funded_amnt_inv#5L, term#6, 
int_rate#7, installment#8, grade#9, sub_grade#10, emp_title#11, emp_length#12, 
home_ownership#13, annual_inc#14, verification_status#15, issue_d#16, 
loan_status#17, pymnt_plan#18, url#19, desc_#20, purpose#21, title#22, 
zip_code#23, ... 92 more fields]
 +- Filter isnotnull(_row_id#0L)
 +- FileScan parquet 
[_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
 92 more 
fields|#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
 92 more fields] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
 PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:133)
 at 

[jira] [Updated] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-07-16 Thread Michael Yannakopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Yannakopoulos updated SPARK-24826:
--
Attachment: part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet

> Self-Join not working in Apache Spark 2.2.2
> ---
>
> Key: SPARK-24826
> URL: https://issues.apache.org/jira/browse/SPARK-24826
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.2
>Reporter: Michael Yannakopoulos
>Priority: Major
> Attachments: 
> part-0-48210471-3088-4cee-8670-a332444bae66-c000.gz.parquet
>
>
> Running a self-join against a table derived from a parquet file with many 
> columns fails during the planning phase with the following stack-trace:
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
> Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
> coordinator[target post-shuffle partition size: 67108864]
> +- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
> funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
> emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
> verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
> desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields]
>  +- Filter isnotnull(_row_id#0L)
>  +- FileScan parquet 
> [_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
>  92 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
> struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
>  at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:133)
>  at 
> 

[jira] [Created] (SPARK-24826) Self-Join not working in Apache Spark 2.2.2

2018-07-16 Thread Michael Yannakopoulos (JIRA)
Michael Yannakopoulos created SPARK-24826:
-

 Summary: Self-Join not working in Apache Spark 2.2.2
 Key: SPARK-24826
 URL: https://issues.apache.org/jira/browse/SPARK-24826
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.2.2
Reporter: Michael Yannakopoulos


Running a self-join against a table derived from a parquet file with many 
columns fails during the planning phase with the following stack-trace:



org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange(coordinator id: 331918455) hashpartitioning(_row_id#0L, 2), 
coordinator[target post-shuffle partition size: 67108864]
+- Project [_row_id#0L, id#1L, member_id#2L, loan_amnt#3L, funded_amnt#4L, 
funded_amnt_inv#5L, term#6, int_rate#7, installment#8, grade#9, sub_grade#10, 
emp_title#11, emp_length#12, home_ownership#13, annual_inc#14, 
verification_status#15, issue_d#16, loan_status#17, pymnt_plan#18, url#19, 
desc_#20, purpose#21, title#22, zip_code#23, ... 92 more fields]
 +- Filter isnotnull(_row_id#0L)
 +- FileScan parquet 
[_row_id#0L,id#1L,member_id#2L,loan_amnt#3L,funded_amnt#4L,funded_amnt_inv#5L,term#6,int_rate#7,installment#8,grade#9,sub_grade#10,emp_title#11,emp_length#12,home_ownership#13,annual_inc#14,verification_status#15,issue_d#16,loan_status#17,pymnt_plan#18,url#19,desc_#20,purpose#21,title#22,zip_code#23,...
 92 more fields] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/c:/Users/gianna/Desktop/alpha.parquet/part-0-48210471-3088-4cee-8670-...,
 PartitionFilters: [], PushedFilters: [IsNotNull(_row_id)], ReadSchema: 
struct<_row_id:bigint,id:bigint,member_id:bigint,loan_amnt:bigint,funded_amnt:bigint,funded_amnt_...

at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at org.apache.spark.sql.execution.SortExec.doExecute(SortExec.scala:101)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:141)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:73)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
 at 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:133)
 at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2865)
 at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
 at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2154)
 at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2846)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2845)
 at org.apache.spark.sql.Dataset.head(Dataset.scala:2154)
 at 
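
For reference, a minimal self-join along the lines described above might look 
like the following sketch (the parquet path and the {{_row_id}} join key are 
taken from the plan above; everything else is illustrative):
{code}
// Hypothetical reproduction sketch
val df = spark.read.parquet("alpha.parquet")              // the attached wide parquet file
val joined = df.as("a").join(df.as("b"), Seq("_row_id"))  // self-join on the row id column
joined.show()
{code}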

[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory

2018-07-16 Thread Misha Dmitriev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545820#comment-16545820
 ] 

Misha Dmitriev commented on SPARK-24801:


Correct, there are indeed 40583 instances of {{EncryptedMessage}} in memory. 
From the other section of the jxray report, which shows reference chains 
starting from GC roots along with the number of objects at each level, I see 
the following:
{code:java}
2,929,966K (72.3%) Object tree for GC root(s) Java Static 
org.apache.spark.network.yarn.YarnShuffleService.instance

org.apache.spark.network.yarn.YarnShuffleService.blockHandler ↘ 2,753,031K 
(67.9%), 1 reference(s) 
org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager ↘ 
2,753,019K (67.9%), 1 reference(s) 
org.apache.spark.network.server.OneForOneStreamManager.streams ↘ 2,753,019K 
(67.9%), 1 reference(s) 
{java.util.concurrent.ConcurrentHashMap}.values ↘ 2,753,008K (67.9%), 169 
reference(s) 
org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
 ↘ 2,640,203K (65.1%), 32 reference(s) 
io.netty.channel.socket.nio.NioSocketChannel.unsafe ↘ 2,640,039K (65.1%), 32 
reference(s) 
io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
 ↘ 2,640,037K (65.1%), 30 reference(s) 
io.netty.channel.ChannelOutboundBuffer.flushedEntry ↘ 2,639,382K (65.1%), 15 
reference(s) 
io.netty.channel.ChannelOutboundBuffer$Entry.{next} ↘ 2,637,973K (65.1%), 
40,583 reference(s) 
io.netty.channel.ChannelOutboundBuffer$Entry.msg ↘ 2,622,966K (64.7%), 40,583 
reference(s) 
org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel ↘ 
2,598,897K (64.1%), 40,583 reference(s) 
org.apache.spark.network.util.ByteArrayWritableChannel.data ↘ 2,597,946K 
(64.1%), 40,583 reference(s) 
org.apache.spark.network.util.ByteArrayWritableChannel self 951K (< 0.1%), 
40,583 object(s){code}
So basically we have 15 netty {{ChannelOutboundBuffer}} objects, and 
collectively, via linked lists starting from their {{flushedEntry}} data 
fields, they end up referencing 40,583 {{ChannelOutboundBuffer$Entry}} objects, 
which ultimately reference all these {{EncryptedMessage}} objects.

So it looks like netty has, for some reason, accumulated (i.e. not yet sent) a 
very large number of messages, and thus netty is likely the main culprit. But 
then I wonder why all these messages are empty.

 

> Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can 
> waste a lot of memory
> ---
>
> Key: SPARK-24801
> URL: https://issues.apache.org/jira/browse/SPARK-24801
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed another Yarn NM heap dump with jxray 
> ([www.jxray.com|http://www.jxray.com]), and found that 81% of memory is 
> wasted by empty (all zeroes) byte[] arrays. Most of these arrays are 
> referenced by 
> {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in 
> turn come from 
> {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is 
> the full reference chain that leads to the problematic arrays:
> {code:java}
> 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%)
> ↖org.apache.spark.network.util.ByteArrayWritableChannel.data
> ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next}
> ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry
> ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
> ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code}
>  
> Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that 
> byteChannel is always initialized eagerly in the constructor:
> {code:java}
> this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code}
> So I think to address the problem of empty byte[] arrays flooding the memory, 
> we should initialize {{byteChannel}} lazily, upon the first use. As far as I 
> can see, it's used only in one method, {{private void nextChunk()}}.
>  
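
A sketch of the lazy-initialization idea (the actual class is Java; this Scala 
sketch only illustrates the pattern, with {{maxOutboundBlockSize}} assumed to 
be in scope):
{code}
// Hypothetical sketch: allocate the channel only when the first chunk is produced
class EncryptedMessageSketch(maxOutboundBlockSize: Int) {
  private lazy val byteChannel =
    new org.apache.spark.network.util.ByteArrayWritableChannel(maxOutboundBlockSize)

  private def nextChunk(): Unit = {
    byteChannel.reset()   // first access here triggers the allocation
    // ... encrypt the next block and write it into byteChannel ...
  }
}
{code}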



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To 

[jira] [Resolved] (SPARK-24402) Optimize `In` expression when only one element in the collection or collection is empty

2018-07-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24402.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Optimize `In` expression when only one element in the collection or 
> collection is empty 
> 
>
> Key: SPARK-24402
> URL: https://issues.apache.org/jira/browse/SPARK-24402
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> Two new rules in the logical plan optimizers are added.
> # When there is only one element in the *{{Collection}}*, the physical plan
> will be optimized to *{{EqualTo}}*, so predicate pushdown can be used.
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
>  """
> |== Physical Plan ==|
> |*(1) Project [profileID#0|#0]|
> |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))|
> |+- *(1) FileScan parquet [profileID#0|#0] Batched: true, Format: Parquet,|
> |PartitionFilters: [],|
> |PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],|
> |ReadSchema: struct
>  """.stripMargin
> {code}
> # When the *{{Set}}* is empty, and the input is nullable, the logical
> plan will be simplified to
> {code}
>  profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
>  """
> |== Optimized Logical Plan ==|
> |Filter if (isnull(profileID#0)) null else false|
> |+- Relation[profileID#0|#0] parquet
>  """.stripMargin
> {code}
> TODO:
>  # For multiple conditions with numbers less than certain thresholds,
> we should still allow predicate pushdown.
>  # Optimize the `In` using tableswitch or lookupswitch when the
> numbers of the categories are low, and they are `Int`, `Long`.
>  # The default immutable hash trees set is slow for query, and we
> should do benchmark for using different set implementation for faster
> query.
>  # `filter(if (condition) null else false)` can be optimized to false.
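
A simplified sketch of what the first rule above amounts to, written as a 
Catalyst-style pattern match (illustrative only, not the actual Spark optimizer 
rule):
{code}
// Hypothetical sketch: rewrite a single-element IN into an equality predicate
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, In}

def rewriteSingleElementIn(e: Expression): Expression = e match {
  case In(value, Seq(single)) => EqualTo(value, single)   // enables predicate pushdown
  case other                  => other
}
{code}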



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-16 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-24825:
---
Issue Type: Bug  (was: Improvement)

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-16 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-24825:
--

 Summary: [K8S][TEST] Kubernetes integration tests don't trace the 
maven project dependency structure
 Key: SPARK-24825
 URL: https://issues.apache.org/jira/browse/SPARK-24825
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 2.4.0
Reporter: Matt Cheah


The Kubernetes integration tests will currently fail if maven installation is 
not performed first, because the integration test build believes it should be 
pulling the Spark parent artifact from maven central. However, this is 
incorrect because the integration test should be building the Spark parent pom 
as a dependency in the multi-module build, and the integration test should just 
use the dynamically built artifact. Or to put it another way, the integration 
test builds should never be pulling Spark dependencies from maven central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24825) [K8S][TEST] Kubernetes integration tests don't trace the maven project dependency structure

2018-07-16 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-24825:
---
Priority: Critical  (was: Major)

> [K8S][TEST] Kubernetes integration tests don't trace the maven project 
> dependency structure
> ---
>
> Key: SPARK-24825
> URL: https://issues.apache.org/jira/browse/SPARK-24825
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Critical
>
> The Kubernetes integration tests will currently fail if maven installation is 
> not performed first, because the integration test build believes it should be 
> pulling the Spark parent artifact from maven central. However, this is 
> incorrect because the integration test should be building the Spark parent 
> pom as a dependency in the multi-module build, and the integration test 
> should just use the dynamically built artifact. Or to put it another way, the 
> integration test builds should never be pulling Spark dependencies from maven 
> central.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24805) Don't ignore files without .avro extension by default

2018-07-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24805.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> Don't ignore files without .avro extension by default
> -
>
> Key: SPARK-24805
> URL: https://issues.apache.org/jira/browse/SPARK-24805
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently to read files without .avro extension, users have to set the flag 
> *avro.mapred.ignore.inputs.without.extension* to *false* (by default it is 
> *true*). The ticket aims to change the default value to *false*. The reasons 
> to do that are:
> - Other systems can create avro files without extensions. When users try to 
> read such files, they silently get only partial results. This behaviour may 
> confuse users.
> - The current behavior also differs from the other supported datasources, 
> CSV and JSON.
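
Until the default changes, the workaround is to set the flag on the Hadoop 
configuration before reading. A minimal sketch (the path is illustrative, and 
it assumes the avro data source is available):
{code}
// Read avro files regardless of their extension by disabling the extension check
spark.sparkContext.hadoopConfiguration
  .set("avro.mapred.ignore.inputs.without.extension", "false")

val df = spark.read.format("avro").load("/path/to/avro/files")  // illustrative path
{code}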



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23901) Data Masking Functions

2018-07-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23901:

Fix Version/s: (was: 2.4.0)

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23901) Data Masking Functions

2018-07-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23901.
-
Resolution: Won't Fix

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23901) Data Masking Functions

2018-07-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-23901:
-

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24801) Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can waste a lot of memory

2018-07-16 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545676#comment-16545676
 ] 

Imran Rashid commented on SPARK-24801:
--

I'm surprised there are so many {{EncryptedMessage}} objects sitting around. 
Are there 40583 of them? That sounds like an extremely overloaded shuffle 
service -- or a leak. Your proposal would probably help some in that case, but 
really there is probably something else we should be doing differently.

> Empty byte[] arrays in spark.network.sasl.SaslEncryption$EncryptedMessage can 
> waste a lot of memory
> ---
>
> Key: SPARK-24801
> URL: https://issues.apache.org/jira/browse/SPARK-24801
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed another Yarn NM heap dump with jxray 
> ([www.jxray.com|http://www.jxray.com]), and found that 81% of memory is 
> wasted by empty (all zeroes) byte[] arrays. Most of these arrays are 
> referenced by 
> {{org.apache.spark.network.util.ByteArrayWritableChannel.data}}, and these in 
> turn come from 
> {{spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel}}. Here is 
> the full reference chain that leads to the problematic arrays:
> {code:java}
> 2,597,946K (64.1%): byte[]: 40583 / 100% of empty 2,597,946K (64.1%)
> ↖org.apache.spark.network.util.ByteArrayWritableChannel.data
> ↖org.apache.spark.network.sasl.SaslEncryption$EncryptedMessage.byteChannel
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.msg
> ↖io.netty.channel.ChannelOutboundBuffer$Entry.{next}
> ↖io.netty.channel.ChannelOutboundBuffer.flushedEntry
> ↖io.netty.channel.socket.nio.NioSocketChannel$NioSocketChannelUnsafe.outboundBuffer
> ↖io.netty.channel.socket.nio.NioSocketChannel.unsafe
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.associatedChannel
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance{code}
>  
> Checking the code of {{SaslEncryption$EncryptedMessage}}, I see that 
> byteChannel is always initialized eagerly in the constructor:
> {code:java}
> this.byteChannel = new ByteArrayWritableChannel(maxOutboundBlockSize);{code}
> So I think to address the problem of empty byte[] arrays flooding the memory, 
> we should initialize {{byteChannel}} lazily, upon the first use. As far as I 
> can see, it's used only in one method, {{private void nextChunk()}}.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x

2018-07-16 Thread Thomas Omans (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545625#comment-16545625
 ] 

Thomas Omans commented on SPARK-16617:
--

My code is hitting the Schema.getLogicalType bug on Spark 2.3.x because we use 
both Parquet and Avro. I don't think this is an improvement/enhancement - it is 
a bugfix for anything that needs to use Parquet and Avro together after the 
Parquet 1.8.2 upgrade.

> Upgrade to Avro 1.8.x
> -
>
> Key: SPARK-16617
> URL: https://issues.apache.org/jira/browse/SPARK-16617
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 2.1.0
>Reporter: Ben McCann
>Priority: Major
>
> Avro 1.8 makes Avro objects serializable so that you can easily have an RDD 
> containing Avro objects.
> See https://issues.apache.org/jira/browse/AVRO-1502



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24644) Pyarrow exception while running pandas_udf on pyspark 2.3.1

2018-07-16 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545589#comment-16545589
 ] 

Bryan Cutler commented on SPARK-24644:
--

[~helkhalfi], the error in the stack trace is coming from pandas internals and 
it looks like you are using a pretty old version, so my guess is that you need 
to upgrade pandas to solve this.  For Spark, we currently test pyarrow with 
pandas 0.19.2 and I would recommend at least that version or higher.

> Pyarrow exception while running pandas_udf on pyspark 2.3.1
> ---
>
> Key: SPARK-24644
> URL: https://issues.apache.org/jira/browse/SPARK-24644
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: os: centos
> pyspark 2.3.1
> spark 2.3.1
> pyarrow >= 0.8.0
>Reporter: Hichame El Khalfi
>Priority: Major
>
> Hello,
> When I try to run a `pandas_udf` on my spark dataframe, I get this error
>  
> {code:java}
>   File 
> "/mnt/ephemeral3/yarn/nm/usercache/user/appcache/application_1524574803975_205774/container_e280_1524574803975_205774_01_44/pyspark.zip/pyspark/serializers.py",
>  lin
> e 280, in load_stream
> pdf = batch.to_pandas()
>   File "pyarrow/table.pxi", line 677, in pyarrow.lib.RecordBatch.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:43226)
> return Table.from_batches([self]).to_pandas(nthreads=nthreads)
>   File "pyarrow/table.pxi", line 1043, in pyarrow.lib.Table.to_pandas 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:46331)
> mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 528, in table_to_blockmanager
> blocks = _table_to_blocks(options, block_table, nthreads, memory_pool)
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 622, in _table_to_blocks
> return [_reconstruct_block(item) for item in result]
>   File "/usr/lib64/python2.7/site-packages/pyarrow/pandas_compat.py", line 
> 446, in _reconstruct_block
> block = _int.make_block(block_arr, placement=placement)
> TypeError: make_block() takes at least 3 arguments (2 given)
> {code}
>  
>  More than happy to provide any additional information



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2018-07-16 Thread Iqbal Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545577#comment-16545577
 ] 

Iqbal Singh commented on SPARK-24295:
-

Hey [~XuanYuan],

We are processing 3000 files every 5 minutes, 24x7, using Structured Streaming; 
the average file size is 120MB.
 * Every Structured Streaming batch commit file is around 800KB to 1000KB, and 
the compact file keeps track of all the data from the start of the process. It 
grows to 8GB after 45 days, and the Structured Streaming process takes more 
than 15 minutes to compact the file at every 10th batch.

 * We are using dynamic partitions while writing the data, which also increases 
the output file count for each micro-batch; the ratio is 2:3 (2 input files 
give us 3 output files).

 * Spark forces jobs to read the data through the _spark_metadata files if the 
job's input directory is a Structured Streaming output, which wastes another 
10-15 minutes generating the list of files from the "_spark_metadata" compact 
file.

 * The compact file stores its data in JSON format and grows very fast if we 
have too many files to process in each batch.

 

*File:* org/apache/spark/sql/execution/streaming/FileStreamSinkLog.scala

-- The delete action is defined in the class "FileStreamSinkLog", but it is not 
implemented anywhere in the code.

 
{code:java}
object FileStreamSinkLog {
 val VERSION = 1
 val DELETE_ACTION = "delete"
 val ADD_ACTION = "add"
} 
{code}
 

-- The code below never executes; it is where sink log entries with the 
"delete" action would be removed while compacting the files:
{code:java}
override def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
 val deletedFiles = logs.filter(_.action == 
FileStreamSinkLog.DELETE_ACTION).map(_.path).toSet
 if (deletedFiles.isEmpty) {
 logs
 } else {
 logs.filter(f => !deletedFiles.contains(f.path))
 }
}{code}
 

-- We do not have the batch number in the compact file as a metric, so it is 
tough to keep only a defined number of batches in the file. We do have the 
modification time, and we can use it to mark sink metadata log records for 
deletion based on a time-based retention policy.

 

We have developed a Spark job that reads the metadata and generates the list of 
files for a particular batch, to keep the exactly-once guarantee, and passes 
that list to the downstream job; it takes 60 seconds to read the compact file 
using Spark.

 

We are working on an explicit data purge job for the compact file to keep its 
size under control. Please let me know if more details are required, and also 
if there is anything we are missing.
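
A rough sketch of the kind of time-based retention discussed above, as if added 
alongside {{compactLogs}} in FileStreamSinkLog.scala (the retention value and 
helper name are hypothetical):
{code}
// Hypothetical sketch: drop sink log entries older than a retention window while compacting
def purgeOldEntries(logs: Seq[SinkFileStatus], retentionMs: Long): Seq[SinkFileStatus] = {
  val cutoff = System.currentTimeMillis() - retentionMs
  logs.filter(_.modificationTime >= cutoff)   // keep only entries inside the retention window
}
{code}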

 

Appreciate your help.

 

Thanks,

Iqbal Singh

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading the data from the FileStreamSinkLog dir, as Spark 
> defaults to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545557#comment-16545557
 ] 

Romeo Kienzer commented on SPARK-17557:
---

[~jayadevan.m] [~hyukjin.kwon] can you please re-open? You can easily reproduce 
the error with the following parquet file

 

[^a2_m2.parquet.zip]

 

and the following code in pyspark 2.1.2, 2.1.3, 2.3.0

df = spark.read.parquet('a2_m2.parquet')

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") \
    .setPredictionCol("prediction").setLabelCol("label")

accuracy = binEval.evaluate(df)

 

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
> Attachments: a2_m2.parquet.zip
>
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-07-16 Thread Romeo Kienzer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romeo Kienzer updated SPARK-17557:
--
Attachment: a2_m2.parquet.zip

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
> Attachments: a2_m2.parquet.zip
>
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-07-16 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545519#comment-16545519
 ] 

Bryan Cutler commented on SPARK-23874:
--

[~smilegator], we are aiming to have the Arrow 0.10.0 release soon, and I will 
pick this up again when that happens. So I think it's possible we could have 
the upgrade done before the code freeze.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported in as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-07-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545491#comment-16545491
 ] 

Apache Spark commented on SPARK-23901:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21786

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23901) Data Masking Functions

2018-07-16 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545446#comment-16545446
 ] 

Reynold Xin edited comment on SPARK-23901 at 7/16/18 4:31 PM:
--

I actually feel pretty strongly we should remove them. It's just so much code 
to maintain for something that doesn't have a clear use case, and we only added 
them for some hypothetical Hive compatibility.

 


was (Author: rxin):
I actually feel pretty strongly we should remove them.

 

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-07-16 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545446#comment-16545446
 ] 

Reynold Xin commented on SPARK-23901:
-

I actually feel pretty strongly we should remove them.

 

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-07-16 Thread Marek Novotny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545443#comment-16545443
 ] 

Marek Novotny commented on SPARK-23901:
---

Is there a consensus on getting the masking functions into version 2.4.0? If 
so, what about also extending the Python and R APIs to be consistent? (I 
volunteer for that.)

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-07-16 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545436#comment-16545436
 ] 

Sanket Reddy commented on SPARK-24787:
--

[~vanzin] do you have any suggestions regarding this issue?

[~olegd] I would rather make this configurable, or have it trigger periodically.

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> in-progress files, allowing the history server to stay responsive.
> However, we have a production job with 6 tasks per stage, and because hsync 
> is slow it starts dropping events, leaving the history server with wrong 
> stats.
> A viable solution is to not sync so frequently, or to make the sync behavior 
> configurable.
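
A sketch of what "not syncing too frequently" could look like around the event 
log stream (the interval and class name here are assumptions for illustration, 
not existing Spark settings):
{code}
// Hypothetical throttle: only force an hsync once a configurable interval has elapsed
class ThrottledSyncer(stream: org.apache.hadoop.fs.FSDataOutputStream, intervalMs: Long) {
  private var lastSyncMs = 0L

  def maybeSync(): Unit = {
    val now = System.currentTimeMillis()
    if (now - lastSyncMs >= intervalMs) {
      stream.hsync()        // the expensive call, now bounded to once per interval
      lastSyncMs = now
    }
  }
}
{code}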



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24734) Fix containsNull of Concat for array type.

2018-07-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24734:
---

Assignee: Takuya Ueshin

> Fix containsNull of Concat for array type.
> --
>
> Key: SPARK-24734
> URL: https://issues.apache.org/jira/browse/SPARK-24734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently {{Concat}} for array type uses the data type of the first child as 
> its own data type, but the children might include an array containing nulls.
> We should be aware of the nullabilities of all children.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24734) Fix containsNull of Concat for array type.

2018-07-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24734.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21704
[https://github.com/apache/spark/pull/21704]

> Fix containsNull of Concat for array type.
> --
>
> Key: SPARK-24734
> URL: https://issues.apache.org/jira/browse/SPARK-24734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently {{Concat}} for array type uses the data type of the first child as 
> its own data type, but the children might include an array containing nulls.
> We should be aware of the nullabilities of all children.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20174) Analyzer gives mysterious AnalysisException when posexplode used in withColumn

2018-07-16 Thread Valery Khamenya (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545314#comment-16545314
 ] 

Valery Khamenya commented on SPARK-20174:
-

Ok, I found a combo-workaround that seems to work:
{code:java}
df selectExpr("*", "posexplode(s) as (p,c)") drop("s"){code}
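
The same thing can be expressed with the DataFrame API by going through 
{{select}} instead of {{withColumn}}, since {{posexplode}} produces two output 
columns (a sketch, with the column names taken from the workaround above):
{code}
import org.apache.spark.sql.functions.{col, posexplode}

val result = df
  .select(col("*"), posexplode(col("s")).as(Seq("p", "c")))  // alias both generated columns
  .drop("s")
{code}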

> Analyzer gives mysterious AnalysisException when posexplode used in withColumn
> --
>
> Key: SPARK-20174
> URL: https://issues.apache.org/jira/browse/SPARK-20174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> Wish I knew how to even describe the issue. It appears that {{posexplode}} 
> cannot be used in {{withColumn}}, but the error message does not seem to say 
> it.
> [The scaladoc of 
> posexplode|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@posexplode(e:org.apache.spark.sql.Column):org.apache.spark.sql.Column]
>  is silent about this "limitation", too.
> {code}
> scala> codes.printSchema
> root
>  |-- id: integer (nullable = false)
>  |-- rate_plan_code: array (nullable = true)
>  ||-- element: string (containsNull = true)
> scala> codes.withColumn("code", posexplode($"rate_plan_code")).show
> org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
> AS clause does not match the number of columns output by the UDTF expected 2 
> aliases but got code ;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.makeGeneratorOutput(Analyzer.scala:1744)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1691)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1679)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1679)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1664)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1664)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1629)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2832)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1137)
>   at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1882)
>   

[jira] [Updated] (SPARK-24813) HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive

2018-07-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-24813:
--
Affects Version/s: (was: 2.1.3)

> HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
> -
>
> Key: SPARK-24813
> URL: https://issues.apache.org/jira/browse/SPARK-24813
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> HiveExternalCatalogVersionsSuite is still failing periodically with errors 
> from mirror sites. In fact, the test depends on the Spark versions it needs 
> being available on the mirrors, but older versions will eventually be removed.
> The test should fall back to downloading from archive.apache.org if mirrors 
> don't have the Spark release, or aren't responding.
> This has become urgent as I helpfully already purged many old Spark releases 
> from mirrors, as requested by the ASF, before realizing it would probably 
> make this test fail deterministically.
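As a rough illustration only (not the actual suite code; the Hadoop profile in 
the file name is an assumption), the fallback described above amounts to 
something like:
{code:java}
import java.net.{HttpURLConnection, URL}

// Try the preferred mirror first; if it no longer hosts the release (or does
// not answer with 200), fall back to the permanent Apache archive.
def resolveSparkUrl(version: String, mirror: String): String = {
  val path = s"spark/spark-$version/spark-$version-bin-hadoop2.7.tgz"
  val candidates = Seq(s"$mirror/$path", s"https://archive.apache.org/dist/$path")
  candidates.find { url =>
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("HEAD")
    conn.getResponseCode == HttpURLConnection.HTTP_OK
  }.getOrElse(candidates.last)
}
{code}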



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process

2018-07-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545284#comment-16545284
 ] 

Apache Spark commented on SPARK-24529:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21785

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.4.0
>
>
> We will enable the Java bytecode check tool 
> [spotbugs|https://spotbugs.github.io/] to catch possible integer overflow in 
> multiplication. Due to the tool's limitations, some other checks will be 
> enabled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist

2018-07-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18230.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21740
[https://github.com/apache/spark/pull/21740]

> MatrixFactorizationModel.recommendProducts throws NoSuchElement exception 
> when the user does not exist
> --
>
> Key: SPARK-18230
> URL: https://issues.apache.org/jira/browse/SPARK-18230
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: Mikael Ståldal
>Assignee: shahid
>Priority: Minor
> Fix For: 2.4.0
>
>
> When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a 
> non-existing user, a {{java.util.NoSuchElementException}} is thrown:
> {code}
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35)
>   at 
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169)
> {code}
> It would be nice if it returned an empty array, or threw a more specific 
> exception, and if that behavior were documented in the ScalaDoc for the method.
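Until then, a caller-side guard is straightforward; a sketch ({{model}} and 
{{userId}} are placeholders):
{code:java}
import scala.util.Try
import org.apache.spark.mllib.recommendation.Rating

// Fall back to an empty recommendation list if the user id is unknown to the model.
val recs: Array[Rating] =
  Try(model.recommendProducts(userId, 10)).getOrElse(Array.empty[Rating])
{code}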



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20174) Analyzer gives mysterious AnalysisException when posexplode used in withColumn

2018-07-16 Thread Valery Khamenya (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545282#comment-16545282
 ] 

Valery Khamenya commented on SPARK-20174:
-

Guys, I have been tracking this issue for quite some time already. Priority 
"Minor" is only applicable if there is a workaround for Spark users. Personally 
I often have the situation where I need to _append_ those two columns coming 
from posexplode, leaving all the other columns intact.

That is, something like:
{code:java}
df.withColumn(Seq("p", "c"), posexplode($"a")){code}
is really wanted, or an alternative combination with the same semantics is 
badly needed for a DataFrame.
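For what it's worth, a sketch of an equivalent that should already work today, 
using the multi-alias form of {{Column.as}} instead of a new {{withColumn}} 
overload (column name {{a}} taken from the snippet above):
{code:java}
import org.apache.spark.sql.functions.{col, posexplode}

// Appends the generated columns p and c while keeping every existing column.
val appended = df.select(col("*"), posexplode(col("a")).as(Seq("p", "c")))
{code}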

> Analyzer gives mysterious AnalysisException when posexplode used in withColumn
> --
>
> Key: SPARK-20174
> URL: https://issues.apache.org/jira/browse/SPARK-20174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> Wish I knew how to even describe the issue. It appears that {{posexplode}} 
> cannot be used in {{withColumn}}, but the error message does not seem to say 
> it.
> [The scaladoc of 
> posexplode|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@posexplode(e:org.apache.spark.sql.Column):org.apache.spark.sql.Column]
>  is silent about this "limitation", too.
> {code}
> scala> codes.printSchema
> root
>  |-- id: integer (nullable = false)
>  |-- rate_plan_code: array (nullable = true)
>  ||-- element: string (containsNull = true)
> scala> codes.withColumn("code", posexplode($"rate_plan_code")).show
> org.apache.spark.sql.AnalysisException: The number of aliases supplied in the 
> AS clause does not match the number of columns output by the UDTF expected 2 
> aliases but got code ;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:90)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.makeGeneratorOutput(Analyzer.scala:1744)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1691)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20$$anonfun$56.apply(Analyzer.scala:1679)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1679)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$20.applyOrElse(Analyzer.scala:1664)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1664)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$.apply(Analyzer.scala:1629)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:70)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:68)
>   at 
> 

[jira] [Assigned] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist

2018-07-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-18230:
-

Assignee: shahid

> MatrixFactorizationModel.recommendProducts throws NoSuchElement exception 
> when the user does not exist
> --
>
> Key: SPARK-18230
> URL: https://issues.apache.org/jira/browse/SPARK-18230
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: Mikael Ståldal
>Assignee: shahid
>Priority: Minor
> Fix For: 2.4.0
>
>
> When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a 
> non-existing user, a {{java.util.NoSuchElementException}} is thrown:
> {code}
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35)
>   at 
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169)
> {code}
> It would be nice if it returned an empty array, or threw a more specific 
> exception, and if that behavior were documented in the ScalaDoc for the method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) Accelerator aware task scheduling for Spark

2018-07-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545253#comment-16545253
 ] 

Thomas Graves commented on SPARK-24615:
---

[~jerryshao] ^

> Accelerator aware task scheduling for Spark
> ---
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Major
>  Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto executors that are equipped 
> with accelerators.
> Currently Spark's scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks introduces a mismatch.
>  # In a cluster we can always assume that every node has CPUs, but this is 
> not true for accelerator cards.
>  # The existence of heterogeneous tasks (accelerator required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-16 Thread Brad (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545240#comment-16545240
 ] 

Brad commented on SPARK-21097:
--

Hi [~menelaus]

The processing time delay is just a way to simulate different-sized processing 
workloads. In the test I just do a map over the dataframe and spin for a few 
microseconds on each row, as configured. In the 0 µs case there is almost no 
processing on the data; there is still the time of loading the data from Hadoop.

The dynamic-allocation-without-recovery benchmark is much slower in the 0 µs 
cached-data case because it has lost all of its cached data and has to reload 
it from Hadoop. You can see the performance is similar to the initial load.
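For reference, a sketch of that simulation (my own illustration, not the 
benchmark code; {{df}} is any cached DataFrame and the 50 µs delay is just an 
example):
{code:java}
// Busy-wait for roughly the requested number of microseconds on each row.
def spin(delayMicros: Long): Unit = {
  val deadline = System.nanoTime() + delayMicros * 1000
  while (System.nanoTime() < deadline) { /* spin */ }
}

df.rdd.map { row => spin(50); row }.count()
{code}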

Thanks

Brad

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2018-07-16 Thread Antony (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545225#comment-16545225
 ] 

Antony commented on SPARK-15343:


{{--conf spark.hadoop.yarn.timeline-service.enabled=false}} works for me
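The same workaround can also be set programmatically; a sketch using the 
standard SparkConf API, with the property name taken from the comment above:
{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("yarn")
  // Disable the YARN timeline service client so the Jersey classes are not needed.
  .set("spark.hadoop.yarn.timeline-service.enabled", "false")
{code}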

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 

[jira] [Commented] (SPARK-24182) Improve error message for client mode when AM fails

2018-07-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545167#comment-16545167
 ] 

Apache Spark commented on SPARK-24182:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21784

> Improve error message for client mode when AM fails
> ---
>
> Key: SPARK-24182
> URL: https://issues.apache.org/jira/browse/SPARK-24182
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.4.0
>
>
> Today, when the client AM fails, there's not a lot of useful information 
> printed on the output. Depending on the type of failure, the information 
> provided by the YARN AM is also not very useful. For example, you'd see this 
> in the Spark shell:
> {noformat}
> 18/05/04 11:07:38 ERROR spark.SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Yarn application has already ended! It might 
> have been killed or unable to launch application master.
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:86)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:63)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
>  [long stack trace]
> {noformat}
> Similarly, on the YARN RM, for certain failures you see a generic error like 
> this:
> {noformat}
> ExitCodeException exitCode=10: at 
> org.apache.hadoop.util.Shell.runCommand(Shell.java:543) at 
> org.apache.hadoop.util.Shell.run(Shell.java:460) at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720) at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:366)
>  at 
> [blah blah blah]
> {noformat}
> It would be nice if we could provide a more accurate description of what went 
> wrong when possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2018-07-16 Thread Sasaki Toru (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasaki Toru resolved SPARK-20050.
-
Resolution: Not A Problem

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>Priority: Major
>
> I use the Kafka 0.10 DirectStream with the property 'enable.auto.commit=false' 
> and call 'DirectKafkaInputDStream#commitAsync' at the end of each batch, as 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which were processed in the last batch before a graceful shutdown 
> of Spark Streaming are reprocessed in the first batch after Spark Streaming 
> restarts, as below
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480  // this is a last record before 
> shutdown Spark Streaming gracefully
> {code}
> * output re-run of this application
> {code}
> key: null value: 7 offset: 101452478   // duplication
> key: null value: 8 offset: 101452479   // duplication
> key: null value: 9 offset: 101452480   // duplication
> key: null value: 10 offset: 101452481
> {code}
> It may be because offsets specified in commitAsync are only committed at the 
> start of the next batch.
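For reference, a sketch of the usual pattern from the Kafka 0.10 integration 
guide: capture the offset ranges and commit them inside the same {{foreachRDD}} 
that does the work. Note that {{commitAsync}} is still asynchronous, so the 
actual commit can land with a later batch, which appears to be why this was 
resolved as Not A Problem.
{code:java}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// kafkaStream is the DirectStream created as in the snippet above.
kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreach(println)                                        // per-batch work
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
{code}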



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24799) A solution of dealing with data skew in left,right,inner join

2018-07-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545032#comment-16545032
 ] 

Apache Spark commented on SPARK-24799:
--

User 'marymwu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21783

> A solution of dealing with data skew in left,right,inner join
> -
>
> Key: SPARK-24799
> URL: https://issues.apache.org/jira/browse/SPARK-24799
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: marymwu
>Priority: Major
>
> For left, right, and inner join statement execution, this solution is mainly 
> about dividing the partitions where data skew has occurred into several 
> partitions with a smaller data scale, in order to execute more tasks in 
> parallel and increase efficiency.
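For comparison, a sketch of the manual technique commonly used for the same 
problem today (my own illustration, not the proposed patch; {{left}}, {{right}}, 
{{key}} and the salt count are placeholders): salt the join key so one hot key 
is spread over several partitions, join on the salted key, then drop the salt.
{code:java}
import org.apache.spark.sql.functions._

val salts = 8  // number of sub-partitions per hot key

// Each left row gets a random salt; each right row is duplicated once per salt
// value so every salted left row still finds its match.
val leftSalted  = left.withColumn("salt", (rand() * salts).cast("int"))
val rightSalted = right.withColumn("salt", explode(array((0 until salts).map(lit): _*)))

val joined = leftSalted.join(rightSalted, Seq("key", "salt"), "inner").drop("salt")
{code}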



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-16 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-24812:
---
Description: 
Last Access Time in the table description is not valid.

Test steps:

Step 1 - create a table

Step 2 - Run the command "DESC FORMATTED table"

Last Access Time will always display a wrong date:

Wed Dec 31 15:59:59 PST 1969 - which is wrong.

!image-2018-07-16-15-37-28-896.png!

In Hive it is displayed as "UNKNOWN", which makes more sense than displaying a 
wrong date.

Please find the snapshot tested in Hive for the same command:
!image-2018-07-16-15-38-26-717.png!

Seems to be a limitation as of now; it would be better to follow the Hive 
behavior in this scenario.

 

 

 

 

  was:
Last Access Time in the table description is not valid, 

Test steps:

Step 1 -  create a table

Step 2 - Run  command "DESC FORMATTED table"

 Last Access Time will always displayed wrong date

Wed Dec 31 15:59:59 PST 1969 - which is wrong.

In hive its displayed as "UNKNOWN" which makes more sense than displaying wrong 
date.

Seems to be a limitation as of now, better we can follow the hive behavior in 
this scenario.

 

 

 

 


> Last Access Time in the table description is not valid
> --
>
> Key: SPARK-24812
> URL: https://issues.apache.org/jira/browse/SPARK-24812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Sujith
>Priority: Minor
> Attachments: image-2018-07-16-15-37-28-896.png, 
> image-2018-07-16-15-38-26-717.png
>
>
> Last Access Time in the table description is not valid.
> Test steps:
> Step 1 - create a table
> Step 2 - Run the command "DESC FORMATTED table"
> Last Access Time will always display a wrong date:
> Wed Dec 31 15:59:59 PST 1969 - which is wrong.
> !image-2018-07-16-15-37-28-896.png!
> In Hive it is displayed as "UNKNOWN", which makes more sense than displaying a 
> wrong date.
> Please find the snapshot tested in Hive for the same command: 
> !image-2018-07-16-15-38-26-717.png!
>  
> Seems to be a limitation as of now; it would be better to follow the Hive 
> behavior in this scenario.
>  
>  
>  
>  
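A quick way to reproduce from the spark-shell (a sketch; the exact row label in 
the DESC FORMATTED output may vary by version):
{code:java}
import org.apache.spark.sql.functions.col

spark.sql("CREATE TABLE t (id INT) USING parquet")
spark.sql("DESC FORMATTED t")
  .filter(col("col_name").contains("Last Access"))
  .show(truncate = false)   // per the report above, prints Wed Dec 31 15:59:59 PST 1969
{code}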



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-16 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-24812:
---
Attachment: image-2018-07-16-15-38-26-717.png

> Last Access Time in the table description is not valid
> --
>
> Key: SPARK-24812
> URL: https://issues.apache.org/jira/browse/SPARK-24812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Sujith
>Priority: Minor
> Attachments: image-2018-07-16-15-37-28-896.png, 
> image-2018-07-16-15-38-26-717.png
>
>
> Last Access Time in the table description is not valid.
> Test steps:
> Step 1 - create a table
> Step 2 - Run the command "DESC FORMATTED table"
> Last Access Time will always display a wrong date:
> Wed Dec 31 15:59:59 PST 1969 - which is wrong.
> In Hive it is displayed as "UNKNOWN", which makes more sense than displaying a 
> wrong date.
> Seems to be a limitation as of now; it would be better to follow the Hive 
> behavior in this scenario.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24812) Last Access Time in the table description is not valid

2018-07-16 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-24812:
---
Attachment: image-2018-07-16-15-37-28-896.png

> Last Access Time in the table description is not valid
> --
>
> Key: SPARK-24812
> URL: https://issues.apache.org/jira/browse/SPARK-24812
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: Sujith
>Priority: Minor
> Attachments: image-2018-07-16-15-37-28-896.png
>
>
> Last Access Time in the table description is not valid.
> Test steps:
> Step 1 - create a table
> Step 2 - Run the command "DESC FORMATTED table"
> Last Access Time will always display a wrong date:
> Wed Dec 31 15:59:59 PST 1969 - which is wrong.
> In Hive it is displayed as "UNKNOWN", which makes more sense than displaying a 
> wrong date.
> Seems to be a limitation as of now; it would be better to follow the Hive 
> behavior in this scenario.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-24816) SQL interface support repartitionByRange

2018-07-16 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Comment: was deleted

(was: I'm working on.)

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: DISTRIBUTE_BY_SORT_BY.png, 
> RANGE_DISTRIBUTE_BY_SORT_BY.png
>
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
> have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
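For reference, a sketch of the existing DataFrame API that the proposed SQL 
syntax would mirror (the output path and column name are placeholders):
{code:java}
import org.apache.spark.sql.functions.col

// Range-partition by id and sort within each partition before writing, so that
// per-file min/max statistics are tight and filters on id prune well.
df.repartitionByRange(200, col("id"))
  .sortWithinPartitions(col("id"))
  .write.mode("overwrite").parquet("/tmp/range_partitioned_table")
{code}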



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24816) SQL interface support repartitionByRange

2018-07-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24816:


Assignee: Apache Spark

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: DISTRIBUTE_BY_SORT_BY.png, 
> RANGE_DISTRIBUTE_BY_SORT_BY.png
>
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
> have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24816) SQL interface support repartitionByRange

2018-07-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24816:


Assignee: (was: Apache Spark)

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: DISTRIBUTE_BY_SORT_BY.png, 
> RANGE_DISTRIBUTE_BY_SORT_BY.png
>
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
> have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24816) SQL interface support repartitionByRange

2018-07-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544980#comment-16544980
 ] 

Apache Spark commented on SPARK-24816:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21782

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: DISTRIBUTE_BY_SORT_BY.png, 
> RANGE_DISTRIBUTE_BY_SORT_BY.png
>
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
> have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24794) DriverWrapper should have both master addresses in -Dspark.master

2018-07-16 Thread Ecaterina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544973#comment-16544973
 ] 

Ecaterina commented on SPARK-24794:
---

Yes, I also face this problem. Would be nice if somebody could answer this.

> DriverWrapper should have both master addresses in -Dspark.master
> -
>
> Key: SPARK-24794
> URL: https://issues.apache.org/jira/browse/SPARK-24794
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Behroz Sikander
>Priority: Major
>
> In standalone cluster mode, one could launch a Driver with supervise mode 
> enabled. Spark launches the driver with a JVM argument -Dspark.master which 
> is set to [host and port of current 
> master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149].
>  
> During the life of the context, the Spark masters can switch for any reason. 
> After that, if the driver dies unexpectedly and comes back up, it tries to 
> connect to the master which was set initially with -Dspark.master, but that 
> master is now in STANDBY mode. The context tries multiple times to connect to 
> the standby and then just kills itself.
>  
> *Suggestion:*
> While launching the driver process, Spark master should use the [spark.master 
> passed as 
> input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124]
>  instead of the host and port of the current master.
> Log messages that we observe:
>  
> {code:java}
> 2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077..
> .
> 2018-07-11 13:03:21,806 INFO netty-rpc-connection-0 
> org.apache.spark.network.client.TransportClientFactory []: Successfully 
> created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in 
> bootstraps)
> .
> 2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077...
> .
> 2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077...
> .
> 2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application 
> has been killed. Reason: All masters are unresponsive! Giving up.{code}
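In other words (a sketch; the second master host is hypothetical), the driver 
should be launched with both HA master addresses so a re-launched driver can 
reach whichever master is currently ACTIVE:
{code:java}
import org.apache.spark.SparkConf

// Standalone HA accepts a comma-separated list of masters in the master URL.
val conf = new SparkConf()
  .setMaster("spark://10.100.100.22:7077,10.100.100.23:7077")
{code}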



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-16 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Description: 
Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
have tested this feature with a big table (data size: 1.1 TB, row count: 
282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

  was:
SQL interface support {{repartitionByRange}} to improvement data pushdown. I 
have test this feature with a big table(data size: 1.1 T, row count: 
282,001,954,428) .

The test sql is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE DISTRIBUTE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE DISTRIBUTE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|


> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: DISTRIBUTE_BY_SORT_BY.png, 
> RANGE_DISTRIBUTE_BY_SORT_BY.png
>
>
> Support {{repartitionByRange}} in the SQL interface to improve data pushdown. I 
> have tested this feature with a big table (data size: 1.1 TB, row count: 
> 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE PARTITION BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE PARTITION BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24824) Make Spark task speculation a per-stage config

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24824:


 Summary: Make Spark task speculation a per-stage config
 Key: SPARK-24824
 URL: https://issues.apache.org/jira/browse/SPARK-24824
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jiang Xingbo


Make Spark task speculation a per-stage config, so we can explicitly disable 
task speculation for a barrier stage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24823) Cancel a job that contains barrier stage(s) if the barrier tasks don't get launched within a configured time

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24823:


 Summary: Cancel a job that contains barrier stage(s) if the 
barrier tasks don't get launched within a configured time
 Key: SPARK-24823
 URL: https://issues.apache.org/jira/browse/SPARK-24823
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jiang Xingbo


Cancel a job that contains barrier stage(s) if the barrier tasks don't get 
launched within a configured time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24822) Python support for barrier execution mode

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24822:


 Summary: Python support for barrier execution mode
 Key: SPARK-24822
 URL: https://issues.apache.org/jira/browse/SPARK-24822
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Enable launching a job containing barrier stage(s) from PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24821) Fail fast when submitted job compute on a subset of all the partitions for a barrier stage

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24821:


 Summary: Fail fast when submitted job compute on a subset of all 
the partitions for a barrier stage
 Key: SPARK-24821
 URL: https://issues.apache.org/jira/browse/SPARK-24821
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Detect when SparkContext.runJob() launches a barrier stage with a subset of all 
the partitions; one example is the `first()` operation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24820) Fail fast when submitted job contains PartitionPruningRDD in a barrier stage

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24820:


 Summary: Fail fast when submitted job contains PartitionPruningRDD 
in a barrier stage
 Key: SPARK-24820
 URL: https://issues.apache.org/jira/browse/SPARK-24820
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Detect when SparkContext.runJob() launches a barrier stage that includes a 
PartitionPruningRDD.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24819) Fail fast when no enough slots to launch the barrier stage on job submitted

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24819:


 Summary: Fail fast when no enough slots to launch the barrier 
stage on job submitted
 Key: SPARK-24819
 URL: https://issues.apache.org/jira/browse/SPARK-24819
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Check all the barrier stages when a job is submitted, to see whether any barrier 
stage requires more slots (to be able to launch all the barrier tasks in the 
same stage together) than there are currently active slots in the cluster. If 
the job requires more slots than are available (both busy and free), fail the 
job on submit.
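As a rough sketch of the check (plain Scala, not the actual scheduler code):
{code:java}
// Fail the job at submission time if a barrier stage cannot possibly get all of
// its tasks running at once.
def checkBarrierStageSlots(requiredSlots: Int, totalSlots: Int): Unit = {
  if (requiredSlots > totalSlots) {
    throw new IllegalArgumentException(
      s"Barrier stage requires $requiredSlots slots, but the cluster only has $totalSlots")
  }
}
{code}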



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24818) Ensure all the barrier tasks in the same stage are launched together

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24818:


 Summary: Ensure all the barrier tasks in the same stage are 
launched together
 Key: SPARK-24818
 URL: https://issues.apache.org/jira/browse/SPARK-24818
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


When some executors/hosts are blacklisted, it may happen that only part of the 
tasks in the same barrier stage can be launched. We shall detect this case and 
revert the allocated resource offers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24817) Implement BarrierTaskContext.barrier()

2018-07-16 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-24817:


 Summary: Implement BarrierTaskContext.barrier()
 Key: SPARK-24817
 URL: https://issues.apache.org/jira/browse/SPARK-24817
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Jiang Xingbo


Implement BarrierTaskContext.barrier(), to support a global sync between all the 
tasks in a barrier stage. The global sync shall finish immediately once all 
tasks in the same barrier stage reach the same barrier.
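A sketch of how this is intended to be used once implemented (API names as 
described in the SPIP, not final):
{code:java}
val rdd = sc.parallelize(1 to 100, 4)

rdd.barrier().mapPartitions { iter =>
  val ctx = org.apache.spark.BarrierTaskContext.get()
  // ... per-partition setup ...
  ctx.barrier()   // all 4 tasks block here until every task reaches this point
  iter
}.count()
{code}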



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24538) ByteArrayDecimalType support push down to parquet data sources

2018-07-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24538:

Summary: ByteArrayDecimalType support push down to parquet data sources  
(was: ByteArrayDecimalType support push down to the data sources)

> ByteArrayDecimalType support push down to parquet data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> The latest Parquet supports decimal type statistics, so we can push filters 
> down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}
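> 
> For illustration, a minimal sketch of the kind of query this enables (the
> path and literal below are assumptions, not taken from the output above):
> with decimal min/max statistics in the footer, a comparison filter on a
> decimal column can be pushed to the Parquet reader and used to skip whole
> row groups. Once decimal push-down is in place, the predicate should show up
> under PushedFilters in the scan's physical plan.
> {code:scala}
> // Write a table with a decimal column, then filter on it and inspect the plan.
> spark.range(0, 483734)
>   .selectExpr("id", "cast(id as decimal(38, 18)) as d6")
>   .write.mode("overwrite").parquet("/tmp/spark/parquet/decimal_example")
> 
> spark.read.parquet("/tmp/spark/parquet/decimal_example")
>   .filter("d6 = 241866")
>   .explain()
> {code}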



--
This message was sent by Atlassian JIRA

[jira] [Resolved] (SPARK-24538) ByteArrayDecimalType support push down to the data sources

2018-07-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24538.
-
  Resolution: Fixed
Assignee: Yuming Wang
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0

> ByteArrayDecimalType support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> The latest Parquet release supports decimal type statistics, so we can push 
> decimal filters down to the data sources:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> creator:      parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:        org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}
> file schema:  spark_schema
> 
> id:           REQUIRED INT64 R:0 D:0
> d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1
> d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1
> d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:  RC:241867 TS:15480513 OFFSET:4
> 
> id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
> SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
> 241866, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
> SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, 
> max: 241866.00, num_nulls: 0]
> row group 2:  RC:241867 TS:15480513 OFFSET:8584904
> 
> id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
> ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
> ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]
> d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]
> d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 
> VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., 
> num_nulls: 0]
> d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
> SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
> max: 483733, num_nulls: 0]
> d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
> SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
> 241867.00, max: 483733.00, num_nulls: 
> 0]{noformat}
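> 
> As a rough sketch (not Spark's actual implementation) of why the
> FIXED_LEN_BYTE_ARRAY columns (d5, d6) are the interesting case here: to
> compare a filter value against the binary statistics above, the decimal's
> unscaled value has to be encoded as a big-endian two's-complement array,
> sign-extended to the fixed length used for the column's precision. The helper
> name and the 16-byte length below are illustrative.
> {code:scala}
> import java.math.BigDecimal
> 
> // Encode the unscaled value, sign-extended to numBytes bytes.
> def toFixedLenBytes(d: BigDecimal, numBytes: Int): Array[Byte] = {
>   val unscaled = d.unscaledValue().toByteArray   // minimal big-endian form
>   val out = new Array[Byte](numBytes)
>   val pad = (if (unscaled(0) < 0) -1 else 0).toByte
>   java.util.Arrays.fill(out, 0, numBytes - unscaled.length, pad)
>   System.arraycopy(unscaled, 0, out, numBytes - unscaled.length, unscaled.length)
>   out
> }
> 
> // e.g. a decimal(38, 0) value such as 241866, padded to 16 bytes
> val encoded = toFixedLenBytes(new BigDecimal("241866"), 16)
> {code}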



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (SPARK-24549) Support DecimalType push down to the parquet data sources

2018-07-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24549:
---

Assignee: Yuming Wang

> Support DecimalType push down to the parquet data sources
> -
>
> Key: SPARK-24549
> URL: https://issues.apache.org/jira/browse/SPARK-24549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21791) ORC should support column names with dot

2018-07-16 Thread Furcy Pin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544899#comment-16544899
 ] 

Furcy Pin commented on SPARK-21791:
---

Indeed, it works like this.

Awesome! Thanks!

> ORC should support column names with dot
> 
>
> Key: SPARK-21791
> URL: https://issues.apache.org/jira/browse/SPARK-21791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>
> *PARQUET*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.parquet("/tmp/parquet_dot")
> scala> spark.read.parquet("/tmp/parquet_dot").show
> ++
> |col.dots|
> ++
> |   1|
> |null|
> ++
> {code}
> *ORC*
> {code}
> scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot")
> scala> spark.read.orc("/tmp/orc_dot").show
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '.' expecting ':'(line 1, pos 10)
> == SQL ==
> struct<col.dots:int>
> ----------^^^
> {code}
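> 
> For completeness, a hedged check of the fixed behaviour (the path is
> illustrative, and forcing the native ORC reader is an assumption; older
> releases default to the Hive-based reader): on a release carrying this fix,
> the same round trip is expected to work, with backticks needed only when
> referring to the dotted column by name.
> {code}
> scala> spark.conf.set("spark.sql.orc.impl", "native")
> scala> Seq(Some(1), None).toDF("col.dots").write.orc("/tmp/orc_dot_fixed")
> scala> spark.read.orc("/tmp/orc_dot_fixed").select("`col.dots`").show
> {code}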



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-16 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Attachment: DISTRIBUTE_BY_SORT_BY.png
RANGE_DISTRIBUTE_BY_SORT_BY.png

> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: DISTRIBUTE_BY_SORT_BY.png, 
> RANGE_DISTRIBUTE_BY_SORT_BY.png
>
>
> The SQL interface should support {{repartitionByRange}} to improve data 
> pushdown. I have tested this feature with a big table (data size: 1.1 TB, row 
> count: 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE DISTRIBUTE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE DISTRIBUTE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|
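> 
> For reference, a sketch of the closest existing DataFrame equivalent of the
> "RANGE DISTRIBUTE BY + SORT BY" preparation step above (table and column
> names are illustrative); the proposal is to make the same thing expressible
> directly in SQL.
> {code:scala}
> import org.apache.spark.sql.functions.col
> 
> // Range-partition by id, sort within each partition, and write the prepared
> // table that the point query above would then read.
> spark.table("table_raw")
>   .repartitionByRange(col("id"))
>   .sortWithinPartitions(col("id"))
>   .write.mode("overwrite").saveAsTable("table_prepared")
> {code}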



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-07-16 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-24558:
--
Affects Version/s: 2.2.1
   2.2.2

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.3.1
>Reporter: sandeep katta
>Assignee: sandeep katta
>Priority: Minor
> Fix For: 2.4.0
>
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)
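> 
> For context, the two dynamic-allocation settings involved (the values below
> are illustrative assumptions, not taken from the log): an executor that still
> holds cached blocks is governed by the cached-executor timeout, so the
> removal message should quote that value rather than the plain idle timeout.
> {code:scala}
> import org.apache.spark.SparkConf
> 
> val conf = new SparkConf()
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
>   .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "120s")
> {code}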



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-07-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24558:
---

Assignee: sandeep katta

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sandeep katta
>Assignee: sandeep katta
>Priority: Minor
> Fix For: 2.4.0
>
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24558) Driver prints the wrong info in the log when the executor which holds cacheBlock is IDLE.Time-out value displayed is not as per configuration value.

2018-07-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24558.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21565
[https://github.com/apache/spark/pull/21565]

> Driver prints the wrong info in the log when the executor which holds 
> cacheBlock is IDLE.Time-out value displayed is not as per configuration value.
> 
>
> Key: SPARK-24558
> URL: https://issues.apache.org/jira/browse/SPARK-24558
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: sandeep katta
>Priority: Minor
> Fix For: 2.4.0
>
>
>  
> launch spark-sql
> spark-sql>cache table sample;
> 2018-05-15 14:02:28 INFO ExecutorAllocationManager:54 - New executor 2 has 
> registered (new total is 1)
> 2018-05-15 14:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 
> (TID 0, vm3, executor 2, partition 0, NODE_LOCAL, 7956 bytes)
> 2018-05-15 14:02:28 INFO BlockManagerMasterEndpoint:54 - Registering block 
> manager vm3:53439 with 93.3 MB RAM, BlockManagerId(2, vm3, 53439, None)
> 2018-05-15 14:02:28 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in 
> memory on vm3:53439 (size: 9.5 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in 
> memory on vm3:53439 (size: 33.8 KB, free: 93.3 MB)
> 2018-05-15 14:02:29 INFO YarnSchedulerBackend$YarnDriverEndpoint:54 - 
> Registered executor NettyRpcEndpointRef(spark-client://Executor) 
> (10.18.99.35:44288) with ID 1
> 2018-05-15 14:02:29 INFO ExecutorAllocationManager:54 - New executor 1 has 
> registered (new total is 2)
> 
> ...
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 2
> 2018-05-15 14:04:31 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 2
> 2018-05-15 14:04:31 INFO ExecutorAllocationManager:54 - Removing executor 2 
> because it has been idle for 60 seconds (new desired total will be 1) *//It 
> should be 120 not 60*
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Request to remove 
> executorIds: 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Requesting to kill 
> executor(s) 1
> 2018-05-15 14:04:32 INFO YarnClientSchedulerBackend:54 - Actual list of 
> executor(s) to be killed is 1
> 2018-05-15 14:04:32 INFO ExecutorAllocationManager:54 - Removing executor 1 
> because it has been idle for 60 seconds (new desired total will be 0)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24816) SQL interface support repartitionByRange

2018-07-16 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24816:

Description: 
The SQL interface should support {{repartitionByRange}} to improve data 
pushdown. I have tested this feature with a big table (data size: 1.1 TB, row 
count: 282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE DISTRIBUTE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE DISTRIBUTE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|

  was:
The SQL interface should support {{repartitionByRange}} to improve data 
pushdown. I have tested this feature with a big table (data size: 1.1 TB, row 
count: 282,001,954,428).

The test SQL is:
{code:sql}
select * from table where id=401564838907
{code}
The test result:
|Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
MB-seconds|
|default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
|DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
|SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
|DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
|RANGE BY |38.5 GB|75355144|45 min|13 s|14525275297|
|RANGE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|


> SQL interface support repartitionByRange
> 
>
> Key: SPARK-24816
> URL: https://issues.apache.org/jira/browse/SPARK-24816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> The SQL interface should support {{repartitionByRange}} to improve data 
> pushdown. I have tested this feature with a big table (data size: 1.1 TB, row 
> count: 282,001,954,428).
> The test SQL is:
> {code:sql}
> select * from table where id=401564838907
> {code}
> The test result:
> |Mode|Input Size|Records|Total Time|Duration|Prepare data Resource Allocation 
> MB-seconds|
> |default|959.2 GB|237624395522|11.2 h|1.3 min|6496280086|
> |DISTRIBUTE BY|970.8 GB|244642791213|11.4 h|1.3 min|10536069846|
> |SORT BY|456.3 GB|101587838784|5.4 h|31 s|8965158620|
> |DISTRIBUTE BY + SORT BY |219.0 GB |51723521593|3.3 h|54 s|12552656774|
> |RANGE DISTRIBUTE BY |38.5 GB|75355144|45 min|13 s|14525275297|
> |RANGE DISTRIBUTE BY + SORT BY|17.4 GB|14334724|45 min|12 s|16255296698|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-07-16 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544841#comment-16544841
 ] 

Xiao Li commented on SPARK-23874:
-

[~bryanc] [~icexelloss] I saw you are working on the JIRA 
https://issues.apache.org/jira/browse/ARROW-2704. Since our code freeze for the 
Spark 2.4 release is Aug 1st, does this mean we will miss the Arrow upgrade in 
Spark 2.4?

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported as input to pyarrow ARROW-2141
>  * Java has a common interface for reset to clean up complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Clean up pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low-level handling of messages for RecordBatch ARROW-2704



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24810) Fix paths to resource files in AvroSuite

2018-07-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24810.
-
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 2.4.0

> Fix paths to resource files in AvroSuite
> 
>
> Key: SPARK-24810
> URL: https://issues.apache.org/jira/browse/SPARK-24810
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: Screen Shot 2018-07-15 at 15.28.13.png
>
>
> Currently, paths to test files from the resource folder are relative in 
> AvroSuite. This causes problems such as making it impossible to run the tests 
> from an IDE. The test file paths need to be wrapped as follows:
> {code:scala}
> def testFile(fileName: String): String = {
>   Thread.currentThread().getContextClassLoader.getResource(fileName).toString
> }
> {code}
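> 
> A hedged usage sketch (the file name and format alias are illustrative): the
> helper resolves the resource to an absolute URL string, so the reader no
> longer depends on the working directory the tests happen to run from.
> {code:scala}
> val path = testFile("episodes.avro")
> spark.read.format("avro").load(path).count()
> {code}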



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org