[jira] [Commented] (SPARK-20313) Possible lack of join optimization when partitions are in the join condition

2017-04-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972142#comment-15972142
 ] 

Takeshi Yamamuro commented on SPARK-20313:
--

What's the issue that you'd like to point out? I think the description is 
ambiguous. 

> Possible lack of join optimization when partitions are in the join condition
> 
>
> Key: SPARK-20313
> URL: https://issues.apache.org/jira/browse/SPARK-20313
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> Given two tables T1 and T2, partitioned on column part1, the following have 
> vastly different execution performance:
> // initial, slow
> {noformat}
> val df1 = // load data from T1
>   .filter(functions.col("part1").between("val1", "val2"))
> val df2 = // load data from T2
>   .filter(functions.col("part1").between("val1", "val2"))
> val df3 = df1.join(df2, Seq("part1", "col1"))
> {noformat}
> // manually optimized, considerably faster
> {noformat}
> val df1 = // load data from T1
> val df2 = // load data from T2
> val part1values = Seq(...) // a collection of values between val1 and val2
> val df3 = part1values
>   .map(part1value => {
> val df1filtered = df1.filter(functions.col("part1") === part1value)
> val df2filtered = df2.filter(functions.col("part1") === part1value)
> df1filtered.join(df2filtered, Seq("col1")) // part1 removed from join
>   })
>   .reduce(_ union _)
> {noformat}
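A quick way to compare the two variants is to check whether partition filters survive into the physical plan. A minimal sketch, assuming T1 and T2 are tables partitioned by part1 and loaded via spark.table (the names are placeholders from the description):

{code}
import org.apache.spark.sql.functions

val df1 = spark.table("T1").filter(functions.col("part1").between("val1", "val2"))
val df2 = spark.table("T2").filter(functions.col("part1").between("val1", "val2"))

// Look for partition pruning / partition filters on both scan nodes of the join.
df1.join(df2, Seq("part1", "col1")).explain(true)
{code}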






[jira] [Commented] (SPARK-20320) AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L))

2017-04-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972126#comment-15972126
 ] 

Takeshi Yamamuro commented on SPARK-20320:
--

Is this query (putting an `AggregateFunction` like `count(value)` in `cube`) valid?
Could you explain what result you expect from this query?
{code}
scala> spark.range(5).cube(count("id")).avg().show
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, 
and '`id`' is not an aggregate function. Wrap '(count(`id`) AS `count(id#7L)`)' 
in windowing function(s) or wrap '`id`' in first() (or first_value) if you 
don't care which value you get.;;
Aggregate [count(id#7L)#19L, spark_grouping_id#17], [count(id#7L) AS 
count(id)#15L, avg(id#7L) AS avg(id)#16]
+- Expand [List(id#7L, count(id#7L)#18L, 0), List(id#7L, null, 1)], [id#7L, 
count(id#7L)#19L, spark_grouping_id#17]
   +- Aggregate [id#7L, count(id#7L) AS count(id#7L)#18L]
  +- Range (0, 5, step=1, splits=Some(4))
{code}

> AnalysisException: Columns of grouping_id (count(value#17L)) does not match 
> grouping columns (count(value#17L))
> ---
>
> Key: SPARK-20320
> URL: https://issues.apache.org/jira/browse/SPARK-20320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> I'm not questioning the {{AnalysisException}} itself (I don't know whether it 
> should be reported or not), but the exception message, which tells...nothing 
> helpful.
> {code}
> val records = spark.range(5).flatMap(n => Seq.fill(n.toInt)(n))
> scala> 
> records.cube(count("value")).agg(grouping_id(count("value"))).queryExecution.logical
> org.apache.spark.sql.AnalysisException: Columns of grouping_id 
> (count(value#17L)) does not match grouping columns (count(value#17L));
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:313)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:308)
> {code}
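For contrast, a minimal sketch of a well-formed cube/grouping_id query, where the aggregate goes into agg() and cube() only receives plain columns (the sample data below is made up to match the column name in the snippet above):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{count, grouping_id}

val records = Seq(1L, 2L, 2L, 3L, 3L, 3L).toDF("value")

// cube on the column itself; count(...) belongs in agg(), not in cube().
records.cube("value").agg(grouping_id(), count("value")).show()
{code}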






[jira] [Commented] (SPARK-20169) Groupby Bug with Sparksql

2017-04-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972123#comment-15972123
 ] 

Takeshi Yamamuro commented on SPARK-20169:
--

I tried this query in v2.1 and master, but I couldn't reproduce this.
Am I missing anything?
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.10 (v2.7.10:15c95b7d81dc, May 23 2015 09:33:12)
SparkSession available as 'spark'.
>>> 
>>> e = 
>>> sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"])
>>> r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
>>> 
>>> r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
>>> jr = e.join(r1, 'src')
>>> jr.show()
+---+---+-----+
|src|dst|count|
+---+---+-----+
|  1|  2|    3|
|  1|  3|    3|
|  1|  4|    3|
|  3|  1|    1|
|  2|  1|    1|
|  4|  1|    1|
+---+---+-----+

>>> r2 = jr.groupBy('dst').count()
>>> r2.show()
+---+-----+
|dst|count|
+---+-----+
|  1|    3|
|  3|    1|
|  2|    1|
|  4|    1|
+---+-----+
{code}

> Groupby Bug with Sparksql
> -
>
> Key: SPARK-20169
> URL: https://issues.apache.org/jira/browse/SPARK-20169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bin Wu
>
> We found a potential bug in the Catalyst optimizer, which cannot correctly 
> process "groupBy". You can reproduce it with the following simple example:
> =
> from pyspark.sql.functions import *
> #e=sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"])
> e = spark.read.csv("graph.csv", header=True)
> r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
> r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
> jr = e.join(r1, 'src')
> jr.show()
> r2 = jr.groupBy('dst').count()
> r2.show()
> =
> FYI, "graph.csv" contains exactly the same data as the commented line.
> You can find that jr is:
> |src|dst|count|
> |  3|  1|1|
> |  1|  4|3|
> |  1|  3|3|
> |  1|  2|3|
> |  4|  1|1|
> |  2|  1|1|
> But, after the last groupBy, the 3 rows with dst = 1 are not grouped together:
> |dst|count|
> |  1|1|
> |  4|1|
> |  3|1|
> |  2|1|
> |  1|1|
> |  1|1|
> If we build jr directly from the raw data (the commented line), this error does 
> not show up, so we suspect that there is a bug in the Catalyst optimizer when 
> multiple joins and groupBys are being optimized.
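For comparison, a Scala sketch of the same chain on the in-memory data (spark-shell implicits assumed); printing the plans shows which 'dst' attribute the final groupBy actually resolves to:

{code}
import spark.implicits._

val e = Seq((1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1)).toDF("src", "dst")
val r = Seq(1, 2, 3, 4).toDF("src")

val r1 = e.join(r, "src").groupBy("dst").count().withColumnRenamed("dst", "src")
val jr = e.join(r1, "src")
val r2 = jr.groupBy("dst").count()

// Compare the analyzed and optimized plans to see how groupBy("dst") is resolved.
r2.explain(true)
r2.show()
{code}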






[jira] [Commented] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971997#comment-15971997
 ] 

Takeshi Yamamuro commented on SPARK-20312:
--

I made the query a bit simpler and tried it, but I couldn't reproduce this on 
master:
{code}
val testUdf = udf { (s: String) => s }

val df1 = Seq((1L, null), (2L, "")).toDF("a", "b")
  .filter($"b".isNotNull)
  .withColumn("c", testUdf($"b"))
  .repartition($"a")
  .sortWithinPartitions($"a")

val df2 = Seq((1L, "")).toDF("a", "b")
  .filter($"b".isNotNull)
  .repartition($"a")
  .sortWithinPartitions($"a")

df1.join(df2, "a" :: Nil, "left_outer").explain(true)

== Analyzed Logical Plan ==
a: bigint, b: string, c: string, b: string
Project [a#509L, b#510, c#515, b#528]
+- Join LeftOuter, (a#509L = a#527L)
   :- Sort [a#509L ASC NULLS FIRST], false
   :  +- RepartitionByExpression [a#509L], 200
   : +- Project [a#509L, b#510, UDF(b#510) AS c#515]
   :+- Filter isnotnull(b#510)
   :   +- Project [_1#506L AS a#509L, _2#507 AS b#510]
   :  +- LocalRelation [_1#506L, _2#507]
   +- Sort [a#527L ASC NULLS FIRST], false
  +- RepartitionByExpression [a#527L], 200
 +- Filter isnotnull(b#528)
+- Project [_1#524L AS a#527L, _2#525 AS b#528]
   +- LocalRelation [_1#524L, _2#525]

== Optimized Logical Plan ==
Project [a#509L, b#510, c#515, b#528]
+- Join LeftOuter, (a#509L = a#527L)
   :- Sort [a#509L ASC NULLS FIRST], false
   :  +- RepartitionByExpression [a#509L], 200
   : +- Project [_1#506L AS a#509L, _2#507 AS b#510, UDF(_2#507) AS c#515]
   :+- Filter isnotnull(_2#507)
   :   +- LocalRelation [_1#506L, _2#507]
   +- Sort [a#527L ASC NULLS FIRST], false
  +- RepartitionByExpression [a#527L], 200
 +- Project [_1#524L AS a#527L, _2#525 AS b#528]
+- Filter isnotnull(_2#525)
   +- LocalRelation [_1#524L, _2#525]
{code}

Am I missing anything? If so, could you give me more details?

> query optimizer calls udf with null values when it doesn't expect them
> --
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> When optimizing an outer join, Spark passes an empty row to both sides to see 
> if nulls would be ignored (side comment: for half-outer joins it subsequently 
> ignores the assessment on the dominant side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT 
> NULL}} might result in checking the right side first, and in an exception if the 
> udf doesn't expect a null input (even though the left side should be checked first).
> An example SIMILAR to the following (see actual query plans separately):
> {noformat}
> def func(value: Any): Int = ... // returns an AnyVal, which probably causes an 
> IS NOT NULL filter to be added on the result
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // LongType both
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
> val df2 = ... // load other data containing col2, similarly repartition and 
> sort
> val df3 =
>   df1.join(df2, Seq("col2"), "left_outer")
> {noformat}
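One common workaround is to make the UDF itself null-tolerant. A sketch only: func below is a hypothetical stand-in for the elided function, and the sample data is made up. Accepting a boxed input and returning an Option means the optimizer's all-null probe row yields null instead of an exception:

{code}
import spark.implicits._
import org.apache.spark.sql.functions.udf

// Hypothetical stand-in for the original func; its real body is elided above.
def func(value: Any): Int = value.toString.length

// Boxed input + Option result: a null probe row produces null, not an NPE.
val nullSafeUdf = udf { (x: java.lang.Long) => Option(x).map(func) }

val df1 = Seq((1L, 10L), (2L, 20L)).toDF("col1", "col2")
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", nullSafeUdf(df1("col1")))
df11.show()
{code}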






[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-17 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971977#comment-15971977
 ] 

Umesh Chaudhary commented on SPARK-20299:
-

[~marmbrus], the issue seems to be caused by changes in *CodeGenerator.scala*. 
Please see my 
[comment|https://issues.apache.org/jira/browse/SPARK-20299?focusedCommentId=15970652=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15970652]
 in which I compared the *Result of Batch CleanExpressions* output between 
Spark 2.1.0 and Spark 2.2.0-SNAPSHOT and found that an additional *assertnotnull* 
function is wrapped around the expression in Spark 2.2.0-SNAPSHOT. I would like 
some pointers for fixing this issue, as I am not too familiar with 
*CodeGenerator.scala*.

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, an NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
>   at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 58 more
> {code}
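A common workaround (a sketch of a user-side change, not a fix for the underlying codegen behavior) is to model the nullable primitive as an Option, which the tuple encoder handles:

{code}
import spark.implicits._

// Option[Int] instead of a primitive Int holding null.
val ds = Seq(("1", None: Option[Int]), ("2", Some(1))).toDS
ds.show()
{code}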






[jira] [Created] (SPARK-20363) sessionstate.get is get the same object in hive project, when I use spark-beeline

2017-04-17 Thread QQShu1 (JIRA)
QQShu1 created SPARK-20363:
--

 Summary: sessionstate.get is get the same object  in hive project, 
when I use spark-beeline
 Key: SPARK-20363
 URL: https://issues.apache.org/jira/browse/SPARK-20363
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.2
Reporter: QQShu1


SessionState.get returns the same object in the Hive project when I use 
spark-beeline, but when I use the Hive beeline, SessionState.get returns a 
different object in the Hive project.






[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20311:


Assignee: Apache Spark

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`






[jira] [Commented] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971932#comment-15971932
 ] 

Apache Spark commented on SPARK-20311:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17666

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`






[jira] [Assigned] (SPARK-20311) SQL "range(N) as alias" or "range(N) alias" doesn't work

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20311:


Assignee: (was: Apache Spark)

> SQL "range(N) as alias" or "range(N) alias" doesn't work
> 
>
> Key: SPARK-20311
> URL: https://issues.apache.org/jira/browse/SPARK-20311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> `select * from range(10) as A;` or `select * from range(10) A;`
> does not work.
> As a workaround, a subquery has to be used:
> `select * from (select * from range(10)) as A;`






[jira] [Updated] (SPARK-20349) ListFunctions returns duplicate functions after using persistent functions

2017-04-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-20349:

Fix Version/s: (was: 2.1.2)

> ListFunctions returns duplicate functions after using persistent functions
> --
>
> Key: SPARK-20349
> URL: https://issues.apache.org/jira/browse/SPARK-20349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> The session catalog caches some persistent functions in the FunctionRegistry, 
> so there can be duplicates. Our Catalog API listFunctions does not handle it.
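A sketch of how the duplicate could be observed; the function name, class, and jar path below are hypothetical:

{code}
// Register a persistent function and use it once so it gets cached in the FunctionRegistry.
spark.sql("CREATE FUNCTION my_upper AS 'com.example.MyUpper' USING JAR '/tmp/my_udfs.jar'")
spark.sql("SELECT my_upper('hi')").show()

// listFunctions() may now report 'my_upper' twice: once from the metastore
// and once from the in-memory FunctionRegistry cache.
spark.catalog.listFunctions().filter("name = 'my_upper'").show()
{code}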






[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971897#comment-15971897
 ] 

Apache Spark commented on SPARK-16742:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/17665

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa






[jira] [Commented] (SPARK-20361) JVM locale affects SQL type names

2017-04-17 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971896#comment-15971896
 ] 

Hyukjin Kwon commented on SPARK-20361:
--

This seems to be fixed now; checked as below just to be sure:

{code}
>>> locale = sc._jvm.java.util.Locale
>>> locale.setDefault(locale.forLanguageTag("tr-TR"))
>>>
>>> spark.createDataFrame([1, 2, 3], IntegerType())
DataFrame[value: int]
{code}

> JVM locale affects SQL type names 
> --
>
> Key: SPARK-20361
> URL: https://issues.apache.org/jira/browse/SPARK-20361
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Szymkiewicz
>
> Steps to reproduce:
> {code}
> from pyspark.sql.types import IntegerType
> locale = sc._jvm.java.util.Locale
> locale.setDefault(locale.forLanguageTag("tr-TR"))
> spark.createDataFrame([1, 2, 3], IntegerType())
> {code}
> {code}
> Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
> : java.util.NoSuchElementException: key not found: integer
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
> ...
> {code}
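The likely mechanism (a sketch; the assumption is that a type name is lower-cased with the default JVM locale somewhere on this code path) is the Turkish dotted/dotless-i casing rule:

{code}
import java.util.Locale

"INTEGER".toLowerCase(Locale.forLanguageTag("tr-TR"))  // "ınteger" (dotless i), so a lookup of "integer" fails
"INTEGER".toLowerCase(Locale.ROOT)                     // "integer"
{code}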






[jira] [Commented] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-17 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971893#comment-15971893
 ] 

Stephane Maarek commented on SPARK-20287:
-

[~c...@koeninger.org] It makes sense. I didn't realize that in the direct streams 
the driver is in charge of assigning metadata to the executors to pull data. 
Therefore, yes, you're right: it's "incompatible" with the Kafka way of being 
"master-free", where each consumer really doesn't know and shouldn't care about 
how many other consumers there are. I think this ticket can now be closed 
(just re-open it if you don't agree). Maybe it would be worth opening a KIP on 
Kafka to add some APIs that allow Spark to be a bit more "optimized", but it 
all seems okay for now. Cheers!

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that Kafka 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions 
> (see the sketch below).
> I'm not sure how that translates to Spark's underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk, as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss this to understand the rationale.
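A sketch of the Kafka API point referenced in the list above (broker address, topic name, and deserializers are assumptions, not taken from the issue): a single KafkaConsumer can be assigned, and poll from, several topic partitions:

{code}
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
// One consumer, explicitly assigned two partitions of the same topic.
consumer.assign(Arrays.asList(new TopicPartition("topic", 0), new TopicPartition("topic", 1)))
val records = consumer.poll(1000L)
{code}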






[jira] [Closed] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition

2017-04-17 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek closed SPARK-20287.
---
Resolution: Not A Problem

> Kafka Consumer should be able to subscribe to more than one topic partition
> ---
>
> Key: SPARK-20287
> URL: https://issues.apache.org/jira/browse/SPARK-20287
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Stephane Maarek
>
> As I understand and as it stands, one Kafka Consumer is created for each 
> topic partition in the source Kafka topics, and they're cached.
> cf 
> https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
> In my opinion, that makes the design an anti-pattern for Kafka and highly 
> inefficient:
> - Each Kafka consumer creates a connection to Kafka
> - Spark doesn't leverage the power of the Kafka consumers, which is that Kafka 
> automatically assigns and balances partitions amongst all the consumers that 
> share the same group.id
> - You can still cache your Kafka consumer even if it has multiple partitions.
> I'm not sure how that translates to Spark's underlying RDD 
> architecture, but from a Kafka standpoint, I believe creating one consumer 
> per partition is a big overhead, and a risk, as the user may have to increase 
> the spark.streaming.kafka.consumer.cache.maxCapacity parameter. 
> Happy to discuss this to understand the rationale.






[jira] [Commented] (SPARK-19986) Make pyspark.streaming.tests.CheckpointTests more stable

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971863#comment-15971863
 ] 

Apache Spark commented on SPARK-19986:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17323

> Make pyspark.streaming.tests.CheckpointTests more stable
> 
>
> Key: SPARK-19986
> URL: https://issues.apache.org/jira/browse/SPARK-19986
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Sometimes, CheckpointTests will hang because the streaming jobs are too slow 
> and cannot catch up.






[jira] [Resolved] (SPARK-20362) spark submit not considering user defined Configs (Pyspark)

2017-04-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20362.

Resolution: Duplicate

> spark submit not considering user defined Configs (Pyspark)
> ---
>
> Key: SPARK-20362
> URL: https://issues.apache.org/jira/browse/SPARK-20362
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Harish
>
> I am trying to set custom configuration at runtime (PySpark), but in 
> my Spark UI at :8080 I see my job is using the complete node/cluster resources 
> and the application name is "test.py" (which is the script name). It looks like 
> the user-defined configurations are not considered at job submit.
> command: spark-submit test.py 
> standalone mode (2 nodes and 1 master)
> Here is the code:
> test.py
> from pyspark.sql import SparkSession, SQLContext, HiveContext
> from pyspark import SparkConf
> if __name__ == "__main__":
>     conf = SparkConf().setAll([('spark.executor.memory', '8g'),
>                                ('spark.executor.cores', '3'), ('spark.cores.max', '10'),
>                                ('spark.driver.memory', '8g')])
>     spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
>     sc = spark.sparkContext
>     print(sc.getConf().getAll())
>     sqlContext = SQLContext(sc)
>     hiveContext = HiveContext(sc)
>     print(hiveContext)
>     print(sc.getConf().getAll())
>     print("Complete")
> Print:
> [('spark.jars.packages', 'com.databricks:spark-csv_2.11:1.2.0'), 
> ('spark.local.dir', '/mnt/sparklocaldir/'), ('hive.metastore.warehouse.dir', 
> ''), ('spark.app.id', 'app-20170417221942-0003'), ('spark.jars', 
> 'file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
>  ('spark.executor.id', 'driver'), ('spark.app.name', 'test.py'), 
> ('spark.cores.max', '10'), ('spark.serializer', 
> 'org.apache.spark.serializer.KryoSerializer'), ('spark.driver.port', 
> '35596'), ('spark.sql.catalogImplementation', 'hive'), 
> ('spark.sql.warehouse.dir', ''), ('spark.rdd.compress', 'True'), 
> ('spark.driver.memory', '8g'), ('spark.serializer.objectStreamReset', '100'), 
> ('spark.executor.memory', '8g'), ('spark.executor.cores', '3'), 
> ('spark.submit.deployMode', 'client'), ('spark.files', 
> 'file:/home/user/test.py,file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
>  ('spark.master', 'spark://master:7077'), ('spark.submit.pyFiles', 
> '/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
>  ('spark.driver.host', 'master')]
> 
> [('spark.jars.packages', 'com.databricks:spark-csv_2.11:1.2.0'), 
> ('spark.local.dir', '/mnt/sparklocaldir/'), ('hive.metastore.warehouse.dir', 
> ''), ('spark.app.id', 'app-20170417221942-0003'), ('spark.jars', 
> 'file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
>  ('spark.executor.id', 'driver'), ('spark.app.name', 'test.py'), 
> ('spark.cores.max', '10'), ('spark.serializer', 
> 'org.apache.spark.serializer.KryoSerializer'), ('spark.driver.port', 
> '35596'), ('spark.sql.catalogImplementation', 'hive'), 
> ('spark.sql.warehouse.dir', ''), ('spark.rdd.compress', 'True'), 
> ('spark.driver.memory', '8g'), ('spark.serializer.objectStreamReset', '100'), 
> ('spark.executor.memory', '8g'), ('spark.executor.cores', '3'), 
> ('spark.submit.deployMode', 'client'), ('spark.files', 
> 'file:/home/user/test.py,file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
>  ('spark.master', 'spark://master:7077'), ('spark.submit.pyFiles', 
> '/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
>  ('spark.driver.host', 'master')]






[jira] [Created] (SPARK-20362) spark submit not considering user defined Configs (Pyspark)

2017-04-17 Thread Harish (JIRA)
Harish created SPARK-20362:
--

 Summary: spark submit not considering user defined Configs 
(Pyspark)
 Key: SPARK-20362
 URL: https://issues.apache.org/jira/browse/SPARK-20362
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Harish


I am trying to set custom configuration at runtime (PySpark), but in my 
Spark UI at :8080 I see my job is using the complete node/cluster resources and 
the application name is "test.py" (which is the script name). It looks like the 
user-defined configurations are not considered at job submit.

command: spark-submit test.py 
standalone mode (2 nodes and 1 master)

Here is the code:

test.py
from pyspark.sql import SparkSession, SQLContext, HiveContext
from pyspark import SparkConf

if __name__ == "__main__":
    conf = SparkConf().setAll([('spark.executor.memory', '8g'),
                               ('spark.executor.cores', '3'), ('spark.cores.max', '10'),
                               ('spark.driver.memory', '8g')])
    spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
    sc = spark.sparkContext
    print(sc.getConf().getAll())
    sqlContext = SQLContext(sc)
    hiveContext = HiveContext(sc)
    print(hiveContext)
    print(sc.getConf().getAll())
    print("Complete")


Print:

[('spark.jars.packages', 'com.databricks:spark-csv_2.11:1.2.0'), 
('spark.local.dir', '/mnt/sparklocaldir/'), ('hive.metastore.warehouse.dir', 
''), ('spark.app.id', 'app-20170417221942-0003'), ('spark.jars', 
'file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
 ('spark.executor.id', 'driver'), ('spark.app.name', 'test.py'), 
('spark.cores.max', '10'), ('spark.serializer', 
'org.apache.spark.serializer.KryoSerializer'), ('spark.driver.port', '35596'), 
('spark.sql.catalogImplementation', 'hive'), ('spark.sql.warehouse.dir', 
''), ('spark.rdd.compress', 'True'), ('spark.driver.memory', '8g'), 
('spark.serializer.objectStreamReset', '100'), ('spark.executor.memory', '8g'), 
('spark.executor.cores', '3'), ('spark.submit.deployMode', 'client'), 
('spark.files', 
'file:/home/user/test.py,file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
 ('spark.master', 'spark://master:7077'), ('spark.submit.pyFiles', 
'/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
 ('spark.driver.host', 'master')]



[('spark.jars.packages', 'com.databricks:spark-csv_2.11:1.2.0'), 
('spark.local.dir', '/mnt/sparklocaldir/'), ('hive.metastore.warehouse.dir', 
''), ('spark.app.id', 'app-20170417221942-0003'), ('spark.jars', 
'file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
 ('spark.executor.id', 'driver'), ('spark.app.name', 'test.py'), 
('spark.cores.max', '10'), ('spark.serializer', 
'org.apache.spark.serializer.KryoSerializer'), ('spark.driver.port', '35596'), 
('spark.sql.catalogImplementation', 'hive'), ('spark.sql.warehouse.dir', 
''), ('spark.rdd.compress', 'True'), ('spark.driver.memory', '8g'), 
('spark.serializer.objectStreamReset', '100'), ('spark.executor.memory', '8g'), 
('spark.executor.cores', '3'), ('spark.submit.deployMode', 'client'), 
('spark.files', 
'file:/home/user/test.py,file:/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,file:/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
 ('spark.master', 'spark://master:7077'), ('spark.submit.pyFiles', 
'/home/user/.ivy2/jars/com.databricks_spark-csv_2.11-1.2.0.jar,/home/user/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,/home/user/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar'),
 ('spark.driver.host', 'master')]









[jira] [Commented] (SPARK-20361) JVM locale affects SQL type names

2017-04-17 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971734#comment-15971734
 ] 

Maciej Szymkiewicz commented on SPARK-20361:


Indeed.

> JVM locale affects SQL type names 
> --
>
> Key: SPARK-20361
> URL: https://issues.apache.org/jira/browse/SPARK-20361
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Szymkiewicz
>
> Steps to reproduce:
> {code}
> from pyspark.sql.types import IntegerType
> locale = sc._jvm.java.util.Locale
> locale.setDefault(locale.forLanguageTag("tr-TR"))
> spark.createDataFrame([1, 2, 3], IntegerType())
> {code}
> {code}
> Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
> : java.util.NoSuchElementException: key not found: integer
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
> ...
> {code}






[jira] [Closed] (SPARK-20361) JVM locale affects SQL type names

2017-04-17 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz closed SPARK-20361.
--
Resolution: Fixed

> JVM locale affects SQL type names 
> --
>
> Key: SPARK-20361
> URL: https://issues.apache.org/jira/browse/SPARK-20361
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Szymkiewicz
>
> Steps to reproduce:
> {code}
> from pyspark.sql.types import IntegerType
> locale = sc._jvm.java.util.Locale
> locale.setDefault(locale.forLanguageTag("tr-TR"))
> spark.createDataFrame([1, 2, 3], IntegerType())
> {code}
> {code}
> Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
> : java.util.NoSuchElementException: key not found: integer
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
> ...
> {code}






[jira] [Commented] (SPARK-20361) JVM locale affects SQL type names

2017-04-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971723#comment-15971723
 ] 

Sean Owen commented on SPARK-20361:
---

This is the same as https://issues.apache.org/jira/browse/SPARK-20156 right?

> JVM locale affects SQL type names 
> --
>
> Key: SPARK-20361
> URL: https://issues.apache.org/jira/browse/SPARK-20361
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Szymkiewicz
>
> Steps to reproduce:
> {code}
> from pyspark.sql.types import IntegerType
> locale = sc._jvm.java.util.Locale
> locale.setDefault(locale.forLanguageTag("tr-TR"))
> spark.createDataFrame([1, 2, 3], IntegerType())
> {code}
> {code}
> Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
> : java.util.NoSuchElementException: key not found: integer
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
> ...
> {code}






[jira] [Updated] (SPARK-20361) JVM locale affects SQL type names

2017-04-17 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-20361:
---
Description: 
Steps to reproduce:

{code}
from pyspark.sql.types import IntegerType

locale = sc._jvm.java.util.Locale
locale.setDefault(locale.forLanguageTag("tr-TR"))

spark.createDataFrame([1, 2, 3], IntegerType())
{code}


{code}

Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
: java.util.NoSuchElementException: key not found: integer
at scala.collection.MapLike$class.default(MapLike.scala:228)
...
{code}

  was:
Steps to reproduce:

{code}
from pyspark.sql.types import IntegerType

locale = sc._jvm.java.util.Locale
locale.setDefault(locale.forLanguageTag("tr-TR"))

spark.createDataFrame([1, 2, 3], IntegerType())
{code}


> JVM locale affects SQL type names 
> --
>
> Key: SPARK-20361
> URL: https://issues.apache.org/jira/browse/SPARK-20361
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Szymkiewicz
>
> Steps to reproduce:
> {code}
> from pyspark.sql.types import IntegerType
> locale = sc._jvm.java.util.Locale
> locale.setDefault(locale.forLanguageTag("tr-TR"))
> spark.createDataFrame([1, 2, 3], IntegerType())
> {code}
> {code}
> Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
> : java.util.NoSuchElementException: key not found: integer
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
> ...
> {code}






[jira] [Created] (SPARK-20361) JVM locale affects SQL type names

2017-04-17 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-20361:
--

 Summary: JVM locale affects SQL type names 
 Key: SPARK-20361
 URL: https://issues.apache.org/jira/browse/SPARK-20361
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.1.0
Reporter: Maciej Szymkiewicz


Steps to reproduce:

{code}
from pyspark.sql.types import IntegerType

locale = sc._jvm.java.util.Locale
locale.setDefault(locale.forLanguageTag("tr-TR"))

spark.createDataFrame([1, 2, 3], IntegerType())
{code}






[jira] [Commented] (SPARK-17647) SQL LIKE does not handle backslashes correctly

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971675#comment-15971675
 ] 

Apache Spark commented on SPARK-17647:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/17663

> SQL LIKE does not handle backslashes correctly
> --
>
> Key: SPARK-17647
> URL: https://issues.apache.org/jira/browse/SPARK-17647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>  Labels: correctness
> Fix For: 2.1.1, 2.2.0
>
>
> Try the following in SQL shell:
> {code}
> select '' like '%\\%';
> {code}
> It returned false, which is wrong.
> cc: [~yhuai] [~joshrosen]
> A false-negative considered previously:
> {code}
> select '' rlike '.*.*';
> {code}
> It returned true, which is correct if we assume that the pattern is treated 
> as a Java string rather than a raw string.






[jira] [Assigned] (SPARK-20360) Create repr functions for interpreters to use

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20360:


Assignee: (was: Apache Spark)

> Create repr functions for interpreters to use
> -
>
> Key: SPARK-20360
> URL: https://issues.apache.org/jira/browse/SPARK-20360
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Kyle Kelley
>Priority: Minor
>
> Create `_repr_html_` for SparkContext, DataFrames, and other objects to 
> target rich display in IPython. This will improve the user experience in 
> Jupyter, Hydrogen, nteract, and any other frontends that use this namespace. 
> http://ipython.readthedocs.io/en/stable/config/integrating.html
> I made this issue only target 2.x since it's an enhancement on the current 
> experience.






[jira] [Commented] (SPARK-20360) Create repr functions for interpreters to use

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971651#comment-15971651
 ] 

Apache Spark commented on SPARK-20360:
--

User 'rgbkrk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17662

> Create repr functions for interpreters to use
> -
>
> Key: SPARK-20360
> URL: https://issues.apache.org/jira/browse/SPARK-20360
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Kyle Kelley
>Priority: Minor
>
> Create `_repr_html_` for SparkContext, DataFrames, and other objects to 
> target rich display in IPython. This will improve the user experience in 
> Jupyter, Hydrogen, nteract, and any other frontends that use this namespace. 
> http://ipython.readthedocs.io/en/stable/config/integrating.html
> I made this issue only target 2.x since it's an enhancement on the current 
> experience.






[jira] [Assigned] (SPARK-20360) Create repr functions for interpreters to use

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20360:


Assignee: Apache Spark

> Create repr functions for interpreters to use
> -
>
> Key: SPARK-20360
> URL: https://issues.apache.org/jira/browse/SPARK-20360
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Kyle Kelley
>Assignee: Apache Spark
>Priority: Minor
>
> Create `_repr_html_` for SparkContext, DataFrames, and other objects to 
> target rich display in IPython. This will improve the user experience in 
> Jupyter, Hydrogen, nteract, and any other frontends that use this namespace. 
> http://ipython.readthedocs.io/en/stable/config/integrating.html
> I made this issue only target 2.x since it's an enhancement on the current 
> experience.






[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

2017-04-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971594#comment-15971594
 ] 

Marcelo Vanzin edited comment on SPARK-18085 at 4/17/17 8:57 PM:
-

I'm getting close to a point where I think the code can start to trickle in. I 
want to wait until 2.2's branch gets going before sending PRs, though. In the 
meantime, I'm keeping "private PRs" in my fork for each milestone, so it's easy 
for anybody interested in getting themselves familiar with the code to provide 
comments:

https://github.com/vanzin/spark/pulls

At this point, all the UI that the SHS shows is kept in a disk store (that's 
core + SQL, but not streaming). At this point, since streaming is not shown in 
the SHS, I'm not planning to touch it (aside from the small changes I made that 
were required by internal API changes in core).

What's left at this point is, from my view:
- managing disk space in the SHS so that large number of apps don't cause the 
SHS to fill local disks
- limiting the number of jobs / stages / tasks / etc kept in the store (similar 
to existing settings, which the code doesn't yet honor)
- an in-memory implementation of the store (in case someone wants lower latency 
or can't / does not want to use the disk store)
- more tests, and more testing



was (Author: vanzin):
I'm getting close to a point where I think the code can start to trickle in. I 
want to wait until 2.2's branch gets going before sending PRs, though. In the 
meantime, I'm keeping "private PRs" in my fork for each milestone, so it's easy 
for anybody interesting in getting themselves familiar with the code to provide 
comments:

https://github.com/vanzin/spark/pulls

At this point, all the UI that the SHS shows is kept in a disk store (that's 
core + SQL, but not streaming). At this point, since streaming is not shown in 
the SHS, I'm not planning to touch it (aside from the small changes I made that 
were required by internal API changes in core).

What's left at this point is, from my view:
- managing disk space in the SHS so that large number of apps don't cause the 
SHS to fill local disks
- limiting the number of jobs / stages / tasks / etc kept in the store (similar 
to existing settings, which the code doesn't yet honor)
- an in-memory implementation of the store (in case someone wants lower latency 
or can't / does not want to use the disk store)
- more tests, and more testing


> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.






[jira] [Commented] (SPARK-14245) webUI should display the user

2017-04-17 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971633#comment-15971633
 ] 

Alex Bozarth commented on SPARK-14245:
--

Given it's been a year since I fixed that PR, I honestly can't remember the 
exact details of that race condition, and looking back I wish I had been more 
thorough in detailing it in my comment. I can remember how the race condition 
manifested, though:

After first starting up a stand-alone master (using user1), I would submit 
two apps simultaneously (or close enough for testing) using two users (user1 
and user2). In this case both applications always listed user1, but this would 
only happen if the two were the first applications submitted after starting up 
the master (subsequent simultaneous submissions worked as expected).

As I mentioned, I can't remember how I tracked down and identified the race 
condition in the code, so feel free to dig into it. I would look at it myself, but 
I don't have time at the moment.

> webUI should display the user
> -
>
> Key: SPARK-14245
> URL: https://issues.apache.org/jira/browse/SPARK-14245
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> It would be nice if the Spark UI (both active and history) showed the user 
> who ran the application somewhere when you are in the application view.   
> Perhaps under the Jobs view by total uptime and scheduler mode.






[jira] [Commented] (SPARK-20349) ListFunctions returns duplicate functions after using persistent functions

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971604#comment-15971604
 ] 

Apache Spark commented on SPARK-20349:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17661

> ListFunctions returns duplicate functions after using persistent functions
> --
>
> Key: SPARK-20349
> URL: https://issues.apache.org/jira/browse/SPARK-20349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.2, 2.2.0
>
>
> The session catalog caches some persistent functions in the FunctionRegistry, 
> so there can be duplicates. Our Catalog API listFunctions does not handle it.






[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-04-17 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971594#comment-15971594
 ] 

Marcelo Vanzin commented on SPARK-18085:


I'm getting close to a point where I think the code can start to trickle in. I 
want to wait until 2.2's branch gets going before sending PRs, though. In the 
meantime, I'm keeping "private PRs" in my fork for each milestone, so it's easy 
for anybody interested in getting themselves familiar with the code to provide 
comments:

https://github.com/vanzin/spark/pulls

At this point, all the UI that the SHS shows is kept in a disk store (that's 
core + SQL, but not streaming). At this point, since streaming is not shown in 
the SHS, I'm not planning to touch it (aside from the small changes I made that 
were required by internal API changes in core).

What's left at this point is, from my view:
- managing disk space in the SHS so that large number of apps don't cause the 
SHS to fill local disks
- limiting the number of jobs / stages / tasks / etc kept in the store (similar 
to existing settings, which the code doesn't yet honor)
- an in-memory implementation of the store (in case someone wants lower latency 
or can't / does not want to use the disk store)
- more tests, and more testing


> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.






[jira] [Created] (SPARK-20360) Create repr functions for interpreters to use

2017-04-17 Thread Kyle Kelley (JIRA)
Kyle Kelley created SPARK-20360:
---

 Summary: Create repr functions for interpreters to use
 Key: SPARK-20360
 URL: https://issues.apache.org/jira/browse/SPARK-20360
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.1.0, 2.0.0
Reporter: Kyle Kelley
Priority: Minor


Create `_repr_html_` for SparkContext, DataFrames, and other objects to target 
rich display in IPython. This will improve the user experience in Jupyter, 
Hydrogen, nteract, and any other frontends that use this namespace. 
http://ipython.readthedocs.io/en/stable/config/integrating.html

I made this issue only target 2.x since it's an enhancement on the current 
experience.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20359) Catalyst EliminateOuterJoin optimization can cause NPE

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971569#comment-15971569
 ] 

Apache Spark commented on SPARK-20359:
--

User 'koertkuipers' has created a pull request for this issue:
https://github.com/apache/spark/pull/17660

> Catalyst EliminateOuterJoin optimization can cause NPE
> --
>
> Key: SPARK-20359
> URL: https://issues.apache.org/jira/browse/SPARK-20359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: spark master at commit 
> 35e5ae4f81176af52569c465520a703529893b50 (Sun Apr 16)
>Reporter: koert kuipers
> Fix For: 2.2.0
>
>
> we were running into an NPE in one of our UDFs for spark sql.
>  
> now this particular function indeed could not handle nulls, but this was by 
> design since null input was never allowed (and we would want it to blow up if 
> there was a null as input).
> we realized the issue was not in our data when we added filters for nulls and 
> the NPE still happened. then we also saw the NPE when just doing 
> dataframe.explain instead of running our job.
> turns out the issue is in EliminateOuterJoin.canFilterOutNull where a row 
> with all nulls is fed into the expression as a test. it's the line:
> val v = boundE.eval(emptyRow)
> I believe it is a bug to assume the expression can always handle nulls.
> for example this fails:
> {noformat}
> val df1 = Seq("a", "b", "c").toDF("x")
>   .withColumn("y", udf{ (x: String) => x.substring(0, 1) + "!" }.apply($"x"))
> val df2 = Seq("a", "b").toDF("x1")
> df1
>   .join(df2, df1("x") === df2("x1"), "left_outer")
>   .filter($"x1".isNotNull || !$"y".isin("a!"))
>   .count
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20359) Catalyst EliminateOuterJoin optimization can cause NPE

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20359:


Assignee: (was: Apache Spark)

> Catalyst EliminateOuterJoin optimization can cause NPE
> --
>
> Key: SPARK-20359
> URL: https://issues.apache.org/jira/browse/SPARK-20359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: spark master at commit 
> 35e5ae4f81176af52569c465520a703529893b50 (Sun Apr 16)
>Reporter: koert kuipers
> Fix For: 2.2.0
>
>
> we were running into an NPE in one of our UDFs for spark sql.
>  
> now this particular function indeed could not handle nulls, but this was by 
> design since null input was never allowed (and we would want it to blow up if 
> there was a null as input).
> we realized the issue was not in our data when we added filters for nulls and 
> the NPE still happened. then we also saw the NPE when just doing 
> dataframe.explain instead of running our job.
> turns out the issue is in EliminateOuterJoin.canFilterOutNull where a row 
> with all nulls is fed into the expression as a test. it's the line:
> val v = boundE.eval(emptyRow)
> I believe it is a bug to assume the expression can always handle nulls.
> for example this fails:
> {noformat}
> val df1 = Seq("a", "b", "c").toDF("x")
>   .withColumn("y", udf{ (x: String) => x.substring(0, 1) + "!" }.apply($"x"))
> val df2 = Seq("a", "b").toDF("x1")
> df1
>   .join(df2, df1("x") === df2("x1"), "left_outer")
>   .filter($"x1".isNotNull || !$"y".isin("a!"))
>   .count
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20359) Catalyst EliminateOuterJoin optimization can cause NPE

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20359:


Assignee: Apache Spark

> Catalyst EliminateOuterJoin optimization can cause NPE
> --
>
> Key: SPARK-20359
> URL: https://issues.apache.org/jira/browse/SPARK-20359
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: spark master at commit 
> 35e5ae4f81176af52569c465520a703529893b50 (Sun Apr 16)
>Reporter: koert kuipers
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>
> we were running into an NPE in one of our UDFs for spark sql.
>  
> now this particular function indeed could not handle nulls, but this was by 
> design since null input was never allowed (and we would want it to blow up if 
> there was a null as input).
> we realized the issue was not in our data when we added filters for nulls and 
> the NPE still happened. then we also saw the NPE when just doing 
> dataframe.explain instead of running our job.
> turns out the issue is in EliminateOuterJoin.canFilterOutNull where a row 
> with all nulls is fed into the expression as a test. it's the line:
> val v = boundE.eval(emptyRow)
> I believe it is a bug to assume the expression can always handle nulls.
> for example this fails:
> {noformat}
> val df1 = Seq("a", "b", "c").toDF("x")
>   .withColumn("y", udf{ (x: String) => x.substring(0, 1) + "!" }.apply($"x"))
> val df2 = Seq("a", "b").toDF("x1")
> df1
>   .join(df2, df1("x") === df2("x1"), "left_outer")
>   .filter($"x1".isNotNull || !$"y".isin("a!"))
>   .count
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20359) Catalyst EliminateOuterJoin optimization can cause NPE

2017-04-17 Thread koert kuipers (JIRA)
koert kuipers created SPARK-20359:
-

 Summary: Catalyst EliminateOuterJoin optimization can cause NPE
 Key: SPARK-20359
 URL: https://issues.apache.org/jira/browse/SPARK-20359
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
 Environment: spark master at commit 
35e5ae4f81176af52569c465520a703529893b50 (Sun Apr 16)
Reporter: koert kuipers
 Fix For: 2.2.0


we were running into an NPE in one of our UDFs for spark sql.
 
now this particular function indeed could not handle nulls, but this was by 
design since null input was never allowed (and we would want it to blow up if 
there was a null as input).

we realized the issue was not in our data when we added filters for nulls and 
the NPE still happened. then we also saw the NPE when just doing 
dataframe.explain instead of running our job.

turns out the issue is in EliminateOuterJoin.canFilterOutNull where a row with 
all nulls is fed into the expression as a test. it's the line:
val v = boundE.eval(emptyRow)

I believe it is a bug to assume the expression can always handle nulls.

for example this fails:
{noformat}
val df1 = Seq("a", "b", "c").toDF("x")
  .withColumn("y", udf{ (x: String) => x.substring(0, 1) + "!" }.apply($"x"))
val df2 = Seq("a", "b").toDF("x1")
df1
  .join(df2, df1("x") === df2("x1"), "left_outer")
  .filter($"x1".isNotNull || !$"y".isin("a!"))
  .count
{noformat}
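
as a user-side workaround (not a fix for the optimizer), making the udf null-tolerant 
keeps the eval(emptyRow) probe from throwing; an illustrative sketch only, assuming a 
spark-shell session:
{noformat}
import org.apache.spark.sql.functions.udf
import spark.implicits._

// workaround sketch only -- the real fix belongs in EliminateOuterJoin
val safeUdf = udf { (x: String) => if (x == null) null else x.substring(0, 1) + "!" }
val df1 = Seq("a", "b", "c").toDF("x").withColumn("y", safeUdf($"x"))
val df2 = Seq("a", "b").toDF("x1")
df1
  .join(df2, df1("x") === df2("x1"), "left_outer")
  .filter($"x1".isNotNull || !$"y".isin("a!"))
  .count
{noformat}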





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14245) webUI should display the user

2017-04-17 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971536#comment-15971536
 ] 

Imran Rashid commented on SPARK-14245:
--

thanks, I should have looked in the PR first, sorry.

But I have to admit -- I really do not understand what that comment is 
referring to.  Why is the master involved at all?  Is there a separate 
standalone-only issue, maybe a race in the master that we need to fix?

I'm mostly concerned about the difference between Hadoop's 
{{UserGroupInformation.getCurrentUser()}} and the "user.name" system property.  
I imagine those are not always the same.
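
For reference, a tiny illustration of the two sources side by side (just a sketch to 
show they can diverge, not what the PR does):

{code}
// Illustration only: the two notions of "the user" being compared above.
import org.apache.hadoop.security.UserGroupInformation

val hadoopUser = UserGroupInformation.getCurrentUser.getShortUserName  // respects proxy users / Kerberos logins
val jvmUser    = sys.props.getOrElse("user.name", "<unknown>")         // plain OS login of the JVM process
// These can differ, e.g. when an application is submitted through a proxy user.
{code}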

> webUI should display the user
> -
>
> Key: SPARK-14245
> URL: https://issues.apache.org/jira/browse/SPARK-14245
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> It would be nice if the Spark UI (both active and history) showed the user 
> who ran the application somewhere when you are in the application view.   
> Perhaps under the Jobs view by total uptime and scheduler mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20358:


Assignee: (was: Apache Spark)

> Executors failing stage on interrupted exception thrown by cancelled tasks
> --
>
> Key: SPARK-20358
> URL: https://issues.apache.org/jira/browse/SPARK-20358
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>
> https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression 
> where now interrupted exceptions will cause a task to fail on cancellation. 
> This is because NonFatal(e) does not match if e is an InterruptedException.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971516#comment-15971516
 ] 

Apache Spark commented on SPARK-20358:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/17659

> Executors failing stage on interrupted exception thrown by cancelled tasks
> --
>
> Key: SPARK-20358
> URL: https://issues.apache.org/jira/browse/SPARK-20358
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>
> https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression 
> where now interrupted exceptions will cause a task to fail on cancellation. 
> This is because NonFatal(e) does not match if e is an InterruptedException.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20358:


Assignee: Apache Spark

> Executors failing stage on interrupted exception thrown by cancelled tasks
> --
>
> Key: SPARK-20358
> URL: https://issues.apache.org/jira/browse/SPARK-20358
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression 
> where now interrupted exceptions will cause a task to fail on cancellation. 
> This is because NonFatal(e) does not match if e is an InterruptedException.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20358) Executors failing stage on interrupted exception thrown by cancelled tasks

2017-04-17 Thread Eric Liang (JIRA)
Eric Liang created SPARK-20358:
--

 Summary: Executors failing stage on interrupted exception thrown 
by cancelled tasks
 Key: SPARK-20358
 URL: https://issues.apache.org/jira/browse/SPARK-20358
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Eric Liang


https://issues.apache.org/jira/browse/SPARK-20217 introduced a regression where 
now interrupted exceptions will cause a task to fail on cancellation. This is 
because NonFatal(e) does not match if e is an InterruptedException.
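
For reference, a minimal illustration of the mechanism (scala.util.control.NonFatal 
deliberately excludes InterruptedException, so a plain NonFatal handler never sees the 
cancellation interrupt):

{code}
import scala.util.control.NonFatal

// NonFatal(e) is false for InterruptedException, so only the explicit case matches.
try {
  throw new InterruptedException("task cancelled")
} catch {
  case NonFatal(e)             => println("non-fatal: " + e)   // not reached
  case e: InterruptedException => println("interrupt: " + e)   // reached
}
{code}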



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20357) Expose Calendar.getWeekYear() as Spark SQL date function to be consistent with weekofyear()

2017-04-17 Thread Jeeyoung Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeeyoung Kim updated SPARK-20357:
-
Description: 
Since weeks and years are extracted using different boundaries (weeks happen 
every 7 days, years happen every 365-ish days, which is not divisible by 7), 
there are weird inconsistencies around how end-of-the-year dates are handled if 
you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.

This happens because the week for "2017-01-01" is calculated as "the last week 
of 2016", while the {{year}} function in Spark SQL ignores this and returns the 
yyyy component of yyyy-MM-dd.

The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
This function calculates week-based years, so "2017-01-01" will return 2016 
in this case.

{noformat}
# Trying out the bug for date - using PySpark
import pyspark.sql.functions as F
df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
("2017-01-02",),("2017-12-30",)], ['id'])
df_parsed = (
df
.withColumn("year", F.year(df['id'].cast("date")))
.withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
)
df_parsed.show()
{noformat}

Prints 
{noformat}
+--++--+
|id|year|weekofyear|
+--++--+
|2016-12-31|2016|52|
|2016-12-30|2016|52|
|2017-01-01|2017|52| <- same (year, weekofyear) output
|2017-01-02|2017| 1|
|2017-12-30|2017|52| <- 
+--++--+
{noformat}


  was:
Since weeks and years are extracted using different boundaries (weeks happen 
every 7 days, years happen every 365-ish days, which is not divisible by 7), 
there are weird inconsistencies around how end-of-the-year dates are handled if 
you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.

This happens because the week for "2017-01-01" is calculated as "the last week 
of 2016", while the {{year}} function in Spark SQL ignores this and returns the 
yyyy component of yyyy-MM-dd.

The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
This function calculates week-based years, so "2017-01-01" will return 2016 
in this case.

{noformat}
# Trying out the bug for date - using PySpark
import pyspark.sql.functions as F
df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
("2017-01-02",),("2017-12-30",)], ['id'])
df_parsed = (
df
.withColumn("year", F.year(df['id'].cast("date")))
.withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
)
df_parsed.show()
{noformat}

Prints 
{noformat}
+--++--+
|id|year|weekofyear|
+--++--+
|2016-12-31|2016|52|
|2016-12-30|2016|52|
|2017-01-01|2017|52|
|2017-01-02|2017| 1| <- same (year, weekofyear) output
|2017-12-30|2017|52| <- 
+--++--+
{noformat}



> Expose Calendar.getWeekYear() as Spark SQL date function to be consistent 
> with weekofyear()
> ---
>
> Key: SPARK-20357
> URL: https://issues.apache.org/jira/browse/SPARK-20357
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Jeeyoung Kim
>Priority: Minor
>
> Since weeks and years are extracted using different boundaries (weeks happen 
> every 7 days, years happen every 365-ish days, which is not divisible by 7), 
> there are weird inconsistencies around how end-of-the-year dates are handled 
> if you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
> shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.
> This happens because the week for "2017-01-01" is calculated as "the last week of 
> 2016", while the {{year}} function in Spark SQL ignores this and returns the yyyy 
> component of yyyy-MM-dd.
> The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
> This function calculates week-based years, so "2017-01-01" will return 2016 
> in this case.
> {noformat}
> # Trying out the bug for date - using PySpark
> import pyspark.sql.functions as F
> df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
> ("2017-01-02",),("2017-12-30",)], ['id'])
> df_parsed = (
> df
> .withColumn("year", F.year(df['id'].cast("date")))
> .withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
> )
> df_parsed.show()
> {noformat}
> Prints 
> {noformat}
> +--++--+
> |id|year|weekofyear|
> +--++--+
> |2016-12-31|2016|52|
> |2016-12-30|2016|52|
> 

[jira] [Updated] (SPARK-20357) Expose Calendar.getWeekYear() as Spark SQL date function to be consistent with weekofyear()

2017-04-17 Thread Jeeyoung Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeeyoung Kim updated SPARK-20357:
-
Affects Version/s: 2.1.0

> Expose Calendar.getWeekYear() as Spark SQL date function to be consistent 
> with weekofyear()
> ---
>
> Key: SPARK-20357
> URL: https://issues.apache.org/jira/browse/SPARK-20357
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Jeeyoung Kim
>Priority: Minor
>
> Since weeks and years are extracted using different boundaries (weeks happen 
> every 7 days, years happen every 365-ish days, which is not divisible by 7), 
> there are weird inconsistencies around how end-of-the-year dates are handled 
> if you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
> shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.
> This happens because the week for "2017-01-01" is calculated as "the last week of 
> 2016", while the {{year}} function in Spark SQL ignores this and returns the yyyy 
> component of yyyy-MM-dd.
> The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
> This function calculates week-based years, so "2017-01-01" will return 2016 
> in this case.
> {noformat}
> # Trying out the bug for date - using PySpark
> import pyspark.sql.functions as F
> df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
> ("2017-01-02",),("2017-12-30",)], ['id'])
> df_parsed = (
> df
> .withColumn("year", F.year(df['id'].cast("date")))
> .withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
> )
> df_parsed.show()
> {noformat}
> Prints 
> {noformat}
> +--++--+
> |id|year|weekofyear|
> +--++--+
> |2016-12-31|2016|52|
> |2016-12-30|2016|52|
> |2017-01-01|2017|52| <- same (year, weekofyear) output
> |2017-01-02|2017| 1|
> |2017-12-30|2017|52| <- 
> +--++--+
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20357) Expose Calendar.getWeekYear() as Spark SQL date function to be consistent with weekofyear()

2017-04-17 Thread Jeeyoung Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeeyoung Kim updated SPARK-20357:
-
Description: 
Since weeks and years are extracted using different boundaries (weeks happen 
every 7 days, years happen every 365-ish days, which is not divisible by 7), 
there are weird inconsistencies around how end-of-the-year dates are handled if 
you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.

This happens because the week for "2017-01-01" is calculated as "the last week 
of 2016", while the {{year}} function in Spark SQL ignores this and returns the 
yyyy component of yyyy-MM-dd.

The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
This function calculates week-based years, so "2017-01-01" will return 2016 
in this case.

{noformat}
# Trying out the bug for date - using PySpark
import pyspark.sql.functions as F
df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
("2017-01-02",),("2017-12-30",)], ['id'])
df_parsed = (
df
.withColumn("year", F.year(df['id'].cast("date")))
.withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
)
df_parsed.show()
{noformat}

Prints 
{noformat}
+--++--+
|id|year|weekofyear|
+--++--+
|2016-12-31|2016|52|
|2016-12-30|2016|52|
|2017-01-01|2017|52|
|2017-01-02|2017| 1| <- same (year, weekofyear) output
|2017-12-30|2017|52| <- 
+--++--+
{noformat}


  was:
Since weeks and years are extracted using different boundaries (weeks happen 
every 7 days, years happen every 365-ish days, which is not divisible by 7), 
there are weird inconsistencies around how end-of-the-year dates are handled if 
you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.

This happens because the week for "2017-01-01" is calculated as "the last week 
of 2016", while the {{year}} function in Spark SQL ignores this and returns the 
yyyy component of yyyy-MM-dd.

The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
This function calculates week-based years, so "2017-01-01" will return 2016 
in this case.

{noformat}
# Trying out the bug for date - using PySpark
import pyspark.sql.functions as F
df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
("2017-01-02",),("2017-12-30",)], ['id'])
df_parsed = (
df
.withColumn("year", F.year(df['id'].cast("date")))
.withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
)
df_parsed.show()
{noformat}

Prints 
{noformat}
+--++--+
|id|year|weekofyear|
+--++--+
|2016-12-31|2016|52|
|2016-12-30|2016|52|
|2017-01-01|2017|52|
|2017-01-02|2017| 1|
|2017-12-30|2017|52|
+--++--+
{noformat}



> Expose Calendar.getWeekYear() as Spark SQL date function to be consistent 
> with weekofyear()
> ---
>
> Key: SPARK-20357
> URL: https://issues.apache.org/jira/browse/SPARK-20357
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jeeyoung Kim
>Priority: Minor
>
> Since weeks and years are extracted using different boundaries (weeks happen 
> every 7 days, years happen every 365-ish days, which is not divisible by 7), 
> there are weird inconsistencies around how end-of-the-year dates are handled 
> if you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
> shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.
> This happens because the week for "2017-01-01" is calculated as "the last week of 
> 2016", while the {{year}} function in Spark SQL ignores this and returns the yyyy 
> component of yyyy-MM-dd.
> The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
> This function calculates week-based years, so "2017-01-01" will return 2016 
> in this case.
> {noformat}
> # Trying out the bug for date - using PySpark
> import pyspark.sql.functions as F
> df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
> ("2017-01-02",),("2017-12-30",)], ['id'])
> df_parsed = (
> df
> .withColumn("year", F.year(df['id'].cast("date")))
> .withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
> )
> df_parsed.show()
> {noformat}
> Prints 
> {noformat}
> +--++--+
> |id|year|weekofyear|
> +--++--+
> |2016-12-31|2016|52|
> |2016-12-30|2016|52|
> |2017-01-01|2017|52|
> |2017-01-02|2017| 

[jira] [Created] (SPARK-20357) Expose Calendar.getWeekYear() as Spark SQL date function to be consistent with weekofyear()

2017-04-17 Thread Jeeyoung Kim (JIRA)
Jeeyoung Kim created SPARK-20357:


 Summary: Expose Calendar.getWeekYear() as Spark SQL date function 
to be consistent with weekofyear()
 Key: SPARK-20357
 URL: https://issues.apache.org/jira/browse/SPARK-20357
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Jeeyoung Kim
Priority: Minor


Since weeks and years are extracted using different boundaries (weeks happen 
every 7 days, years happen every 365-ish days, which is not divisible by 7), 
there are weird inconsistencies around how end-of-the-year dates are handled if 
you use the {{year}} and {{weekofyear}} Spark SQL functions. The example below 
shows how "2017-01-01" and "2017-12-30" have the same {{(year, week)}} pair.

This happens because the week for "2017-01-01" is calculated as "the last week 
of 2016", while the {{year}} function in Spark SQL ignores this and returns the 
yyyy component of yyyy-MM-dd.

The correct way to fix this is by exposing {{java.util.Calendar.getWeekYear()}}. 
This function calculates week-based years, so "2017-01-01" will return 2016 
in this case.

{noformat}
# Trying out the bug for date - using PySpark
import pyspark.sql.functions as F
df = spark.createDataFrame([("2016-12-31",),("2016-12-30",), ("2017-01-01",), 
("2017-01-02",),("2017-12-30",)], ['id'])
df_parsed = (
df
.withColumn("year", F.year(df['id'].cast("date")))
.withColumn("weekofyear", F.weekofyear(df['id'].cast("date")))
)
df_parsed.show()
{noformat}

Prints 
{noformat}
+--++--+
|id|year|weekofyear|
+--++--+
|2016-12-31|2016|52|
|2016-12-30|2016|52|
|2017-01-01|2017|52|
|2017-01-02|2017| 1|
|2017-12-30|2017|52|
+--++--+
{noformat}
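
Until such a function is exposed, a week-based year can be approximated with a UDF over 
java.time; this is only an illustrative Scala sketch (the weekYear name is made up here), 
and it follows the same ISO week rule (Monday start, minimum 4 days) as weekofyear:

{noformat}
// Illustrative sketch, assuming a spark-shell session.
import java.time.temporal.WeekFields
import org.apache.spark.sql.functions.{col, udf}

val weekYear = udf { (d: java.sql.Date) =>
  Option(d).map(_.toLocalDate.get(WeekFields.ISO.weekBasedYear()))
}

// e.g. df.withColumn("weekyear", weekYear(col("id").cast("date")))
// returns 2016 for "2017-01-01", consistent with its weekofyear value of 52.
{noformat}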




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971491#comment-15971491
 ] 

Michael Armbrust commented on SPARK-20299:
--

What input are you looking for?

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
>   at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 58 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17647) SQL LIKE does not handle backslashes correctly

2017-04-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17647.
-
   Resolution: Fixed
 Assignee: Xiangrui Meng
Fix Version/s: 2.2.0
   2.1.1

> SQL LIKE does not handle backslashes correctly
> --
>
> Key: SPARK-17647
> URL: https://issues.apache.org/jira/browse/SPARK-17647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>  Labels: correctness
> Fix For: 2.1.1, 2.2.0
>
>
> Try the following in SQL shell:
> {code}
> select '' like '%\\%';
> {code}
> It returned false, which is wrong.
> cc: [~yhuai] [~joshrosen]
> A false-negative considered previously:
> {code}
> select '' rlike '.*.*';
> {code}
> It returned true, which is correct if we assume that the pattern is treated 
> as a Java string but not raw string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14245) webUI should display the user

2017-04-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971365#comment-15971365
 ] 

Thomas Graves commented on SPARK-14245:
---

see the comment in the PR; I think there was a race with the SparkContext 
version:

I just pushed an update. I changed how I get the username so it uses the same 
method for both active and history. This method actually pulls the information 
from the same source as both sc.sparkUser and the API do at their root, the 
system env. Except with this method we make sure we always get the correct 
user; I noticed a race condition when using sc.sparkUser while finding this 
solution. After starting up the master when two spark applications are started 
at once, one by the user who started the master and one who didn't, both would 
appear to have been started by the user who started the master. This would only 
happen on the first apps started; any apps started after would behave as 
expected. This race condition doesn't exist with my change.

> webUI should display the user
> -
>
> Key: SPARK-14245
> URL: https://issues.apache.org/jira/browse/SPARK-14245
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> It would be nice if the Spark UI (both active and history) showed the user 
> who ran the application somewhere when you are in the application view.   
> Perhaps under the Jobs view by total uptime and scheduler mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14245) webUI should display the user

2017-04-17 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971337#comment-15971337
 ] 

Imran Rashid commented on SPARK-14245:
--

Hi [~ajbozarth] [~tgraves] -- I was just taking a look at this b/c of a new pr 
to get the user added to the rest api: 
https://github.com/apache/spark/pull/17656

but one thing puzzled me about this.  Why is it using a different way of 
getting the user than the history server uses?  The HistoryServer takes the 
user from the SparkListenerApplicationStart event, which originates with 
[{{SparkContext.sparkUser}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L295].
  It seems like that is doing something slightly more general than just taking 
"user.name".  I haven't looked closely yet, but is there a reason to do one or 
the other?

> webUI should display the user
> -
>
> Key: SPARK-14245
> URL: https://issues.apache.org/jira/browse/SPARK-14245
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: Thomas Graves
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> It would be nice if the Spark UI (both active and history) showed the user 
> who ran the application somewhere when you are in the application view.   
> Perhaps under the Jobs view by total uptime and scheduler mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20349) ListFunctions returns duplicate functions after using persistent functions

2017-04-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20349.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.2

> ListFunctions returns duplicate functions after using persistent functions
> --
>
> Key: SPARK-20349
> URL: https://issues.apache.org/jira/browse/SPARK-20349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.2, 2.2.0
>
>
> The session catalog caches some persistent functions in the FunctionRegistry, 
> so there can be duplicates. Our Catalog API listFunctions does not handle it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20356) Spark sql group by returns incorrect results after join + distinct transformations

2017-04-17 Thread Chris Kipers (JIRA)
Chris Kipers created SPARK-20356:


 Summary: Spark sql group by returns incorrect results after join + 
distinct transformations
 Key: SPARK-20356
 URL: https://issues.apache.org/jira/browse/SPARK-20356
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: Linux mint 18
Python 3.5
Reporter: Chris Kipers


I'm experiencing a bug with the head version of Spark as of 4/17/2017. After 
joining two dataframes, renaming a column and invoking distinct, the results of 
the aggregation are incorrect after caching the dataframe. The following code 
snippet consistently reproduces the error.

from pyspark.sql import SparkSession
import pyspark.sql.functions as sf
import pandas as pd

spark = SparkSession.builder.master("local").appName("Word Count").getOrCreate()

mapping_sdf = spark.createDataFrame(pd.DataFrame([
{"ITEM": "a", "GROUP": 1},
{"ITEM": "b", "GROUP": 1},
{"ITEM": "c", "GROUP": 2}
]))

items_sdf = spark.createDataFrame(pd.DataFrame([
{"ITEM": "a", "ID": 1},
{"ITEM": "b", "ID": 2},
{"ITEM": "c", "ID": 3}
]))

mapped_sdf = \
items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
sf.col("GROUP").alias('ITEM')).distinct()

print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
mapped_sdf.cache()
print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 3, incorrect

The next code snippet is almost the same as the first, except I don't call 
distinct on the dataframe. This snippet performs as expected:

mapped_sdf = \
items_sdf.join(mapping_sdf, on='ITEM').select("ID", 
sf.col("GROUP").alias('ITEM'))

print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct
mapped_sdf.cache()
print(mapped_sdf.groupBy("ITEM").count().count())  # Prints 2, correct

I don't experience this bug with Spark 2.1 or even earlier versions of 2.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20355) Display Spark version on history page

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971120#comment-15971120
 ] 

Apache Spark commented on SPARK-20355:
--

User 'redsanket' has created a pull request for this issue:
https://github.com/apache/spark/pull/17658

> Display Spark version on history page
> -
>
> Key: SPARK-20355
> URL: https://issues.apache.org/jira/browse/SPARK-20355
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Sanket Reddy
>Priority: Minor
>
> The Spark version for a specific application is not displayed on the history page 
> now. It would be nice to show the Spark version on the UI when we click on 
> the specific application.
> Currently there seems to be a way, as SparkListenerLogStart records the 
> application version. So, it should be trivial to listen to this event and 
> surface this information on the UI.
> {"Event":"SparkListenerLogStart","Spark 
> Version":"1.6.2.0_2.7.2.7.1604210306_161643"}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20355) Display Spark version on history page

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20355:


Assignee: Apache Spark

> Display Spark version on history page
> -
>
> Key: SPARK-20355
> URL: https://issues.apache.org/jira/browse/SPARK-20355
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Sanket Reddy
>Assignee: Apache Spark
>Priority: Minor
>
> The Spark version for a specific application is not displayed on the history page 
> now. It would be nice to show the Spark version on the UI when we click on 
> the specific application.
> Currently there seems to be a way, as SparkListenerLogStart records the 
> application version. So, it should be trivial to listen to this event and 
> surface this information on the UI.
> {"Event":"SparkListenerLogStart","Spark 
> Version":"1.6.2.0_2.7.2.7.1604210306_161643"}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20355) Display Spark version on history page

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20355:


Assignee: (was: Apache Spark)

> Display Spark version on history page
> -
>
> Key: SPARK-20355
> URL: https://issues.apache.org/jira/browse/SPARK-20355
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: Sanket Reddy
>Priority: Minor
>
> The Spark version for a specific application is not displayed on the history page 
> now. It would be nice to show the Spark version on the UI when we click on 
> the specific application.
> Currently there seems to be a way, as SparkListenerLogStart records the 
> application version. So, it should be trivial to listen to this event and 
> surface this information on the UI.
> {"Event":"SparkListenerLogStart","Spark 
> Version":"1.6.2.0_2.7.2.7.1604210306_161643"}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20355) Display Spark version on history page

2017-04-17 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-20355:


 Summary: Display Spark version on history page
 Key: SPARK-20355
 URL: https://issues.apache.org/jira/browse/SPARK-20355
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 2.1.0
Reporter: Sanket Reddy
Priority: Minor


The Spark version for a specific application is not displayed on the history page 
now. It would be nice to show the Spark version on the UI when we click on 
the specific application.

Currently there seems to be a way, as SparkListenerLogStart records the 
application version. So, it should be trivial to listen to this event and 
surface this information on the UI.
{"Event":"SparkListenerLogStart","Spark 
Version":"1.6.2.0_2.7.2.7.1604210306_161643"}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-04-17 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971081#comment-15971081
 ] 

cen yuhai commented on SPARK-16599:
---

[~srowen] I also encountered this problem.

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20340) Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, cause OOM

2017-04-17 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971071#comment-15971071
 ] 

Thomas Graves commented on SPARK-20340:
---

Right, I figured it was probably for performance. The thing is that when it's 
wrong it causes the job to fail, and this can happen unexpectedly: a 
production job runs fine for months, then the data it gets is all 
of a sudden different/skewed due to, say, a high-traffic day, and a critical 
job fails.  To me this is not good for a production environment; if we want 
Spark to continue to be adopted by larger companies in production environments, 
this sort of thing has to be very reliable.

It looks very bad for Spark that my user said this runs fine on Pig. It makes 
users very wary about switching to Spark, as they have doubts about its 
scalability and reliability.

Anyway, I think this either needs a sanity check at some point that says its estimates 
are really off, so get the real size, or it should be switched to always use the real size.  
For shuffle data we should know what the size is, as we just transferred it.  
That is abstracted away a bit at this point in the code though, so I need to 
understand the code more.

 [~rxin]  [~joshrosen] any thoughts on this?

> Size estimate very wrong in ExternalAppendOnlyMap from CoGroupedRDD, cause OOM
> --
>
> Key: SPARK-20340
> URL: https://issues.apache.org/jira/browse/SPARK-20340
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> I had a user doing a basic join operation. The values are image's binary 
> data(in base64 format) and widely vary in size.
> The job failed with out of memory. Originally failed on yarn with using to 
> much overhead memory, turned spark.shuffle.io.preferDirectBufs  to false 
> then failed with out of heap memory.  I debugged it down to during the 
> shuffle when CoGroupedRDD putting things into the ExternalAppendOnlyMap, it 
> computes an estimated size to determine when to spill.  In this case 
> SizeEstimator handles arrays such that if an array is larger than 400 elements, it 
> samples 100 elements. The estimate is coming back GBs different from the 
> actual size.  It claims 1GB when it is actually using close to 5GB. 
> A temporary workaround is to increase the memory to be very large (10GB 
> executors) but that isn't really acceptable here.  The user did the same thing in 
> pig and it easily handled the data with 1.5GB of memory.
> It seems risky to be using an estimate in such a critical thing. If the 
> estimate is wrong you are going to run out of memory and fail the job.
> I'm looking closer at the users data still to get more insights.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20339) Issue in regex_replace in Apache Spark Java

2017-04-17 Thread Nischay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971054#comment-15971054
 ] 

Nischay commented on SPARK-20339:
-

Sure, I'll not add redundant code in the future, and I'll use u...@spark.apache.org.

We are not able to understand "For such a huge sequence of generating columns you are 
probably much better off constructing a Row directly in a transformation"; can you 
please explain it in more detail?

We used a UDF but are getting a "Task not serializable" exception.
 
UDF1<String, String> removeSpecialCharaters = new UDF1<String, String>() {
    public String call(final String types) throws Exception {
        while (names.hasMoreElements()) {
            String str = (String) names.nextElement();
            types.replaceAll(str, manufacturerNames.get(str).toString());
        }
        return types;
    }
};
sqlContext.udf().register("removeSpecialCharatersUDF", removeSpecialCharaters,
        DataTypes.StringType);
dataFileContent.createOrReplaceTempView("DataSetOfinterest");
Dataset<Row> newDF = sqlContext.sql(
        "select removeSpecialCharatersUDF(ManufacturerSource) FROM DataSetOfinterest");
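
For reference, an alternative we are considering: doing all replacements in a single UDF 
over a broadcast map, so the plan contains one projection instead of ~100 chained 
regexp_replace columns. This is only an illustrative Scala sketch (our code uses the Java 
API; spark and dataFileContent below stand for our existing session and dataframe):

{code}
// Illustrative sketch only.
import org.apache.spark.sql.functions.{col, udf}

val replacements: Map[String, String] = Map(
  "Allen" -> "Apex Tool Group",
  "Armstrong" -> "Apex Tool Group"
  // ... the remaining mappings ...
)
val replacementsBc = spark.sparkContext.broadcast(replacements)

val replaceAllNames = udf { (s: String) =>
  if (s == null) null
  else replacementsBc.value.foldLeft(s) { case (acc, (from, to)) => acc.replaceAll(from, to) }
}

val cleaned = dataFileContent.withColumn("ManufacturerSource",
  replaceAllNames(col("ManufacturerSource")))
{code}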


> Issue in regex_replace in Apache Spark Java
> ---
>
> Key: SPARK-20339
> URL: https://issues.apache.org/jira/browse/SPARK-20339
> Project: Spark
>  Issue Type: Question
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Nischay
>
> We are currently facing couple of issues
> 1. 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB".
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in Jira of Apache spark 
> https://issues.apache.org/jira/browse/SPARK-18492
> We got these issues by the following program. We are trying to replace the 
> Manufacturer name by its equivalent alternate name,
> These issues occur only when we have Huge number of alternate names to 
> replace, for small number of replacements it works with no issues.
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));`
> Kindly suggest us an alternative method or a solution to go around this 
> problem.
> {code}
>   Hashtable<String, String> manufacturerNames = new Hashtable<String, String>();
> Enumeration<String> names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool 
> Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard 
> Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard 
> Safety Equipment Company");   
> // Show all balances in hash table.
> names = manufacturerNames.keys();
> Dataset<Row> dataFileContent = 
> sqlContext.load("com.databricks.spark.csv", options);
>   
> 
> while(names.hasMoreElements()) {
>str = (String) names.nextElement();
>
> dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-04-17 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971025#comment-15971025
 ] 

jin xing commented on SPARK-19659:
--

[~cloud_fan]
I refined the PR. In the current change, I'd propose:
1. Track the average size and also the outliers (which are larger than 2*avgSize) in 
MapStatus;
2. Request memory from *MemoryManager* before fetching blocks, and release the 
memory to MemoryManager when the *ManagedBuffer* is released;
3. Fetch remote blocks to disk when acquiring memory from 
MemoryManager fails, otherwise fetch to memory.

I adjusted the default value of "spark.memory.offHeap.size" to be 384m (the same 
as spark.yarn.executor.memoryOverhead), which will be used to initialize the 
off-heap memory pool. What's more, I think *spark.memory.offHeap.enabled* 
(which is documented in *configuration.md* as "If true, Spark will attempt to 
use off-heap memory for certain operations.") might be confusing, because no 
matter whether it is true or false, remote blocks will be shuffle-read to off-heap 
memory by default.

It would be great if you could take a rough look at the PR and comment, so 
I can know whether I'm heading in the right direction :)
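
To make points 2 and 3 concrete, the decision boils down to something like the sketch 
below (the names are illustrative, not the actual code in the PR):

{code}
// Illustrative sketch only.
trait ShuffleMemoryPool {
  def tryAcquire(bytes: Long): Boolean
  def release(bytes: Long): Unit
}

sealed trait FetchTarget
case object FetchToMemory extends FetchTarget
case object FetchToDisk extends FetchTarget

// Blocks larger than 2 * avgSize are reported individually in MapStatus, so the
// reducer knows their real size up front and can decide where to put them.
def chooseTarget(blockSize: Long, pool: ShuffleMemoryPool): FetchTarget =
  if (pool.tryAcquire(blockSize)) FetchToMemory else FetchToDisk
{code}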

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory (off-heap by default) when 
> shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can 
> be large in skew situations. If OOM happens during shuffle read, the job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more 
> memory can resolve the OOM. However, the approach is not perfectly suitable 
> for production environments, especially for data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope to have a unified 
> parameter (e.g. memory) with less resource wasted (resource that is allocated but not 
> used).
> It's not always easy to predict skew situations; when they happen, it makes sense 
> to fetch remote blocks to disk for shuffle-read, rather than 
> kill the job because of OOM. This approach was mentioned during the discussion 
> in SPARK-3019, by [~sandyr] and [~mridulm80].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19951) Add string concatenate operator || to Spark SQL

2017-04-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971001#comment-15971001
 ] 

Takeshi Yamamuro commented on SPARK-19951:
--

Since this operation is supported in PostgreSQL and MySQL, I think it's useful 
for these users.
If you don't mind, I'll make a pr after branch-2.2 is cut.
https://github.com/apache/spark/compare/master...maropu:SPARK-19951

> Add string concatenate operator || to Spark SQL
> ---
>
> Key: SPARK-19951
> URL: https://issues.apache.org/jira/browse/SPARK-19951
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>
> It is quite natural to concatenate strings using the {||} symbol. For 
> example: {{select a || b || c as abc from tbl_x}}. Let's add to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-17 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970944#comment-15970944
 ] 

HanCheol Cho edited comment on SPARK-20336 at 4/17/17 10:32 AM:


This time, I first checked whether PySpark uses the same version of Python on both 
the client and the worker nodes, with the following code.
The result shows that all servers are using the same version.

{code}
$ pyspark --master yarn --num-executors 12

# client
import sys
print sys.version
Python 2.7.13 :: Anaconda custom (64-bit)

# workers
rdd = spark.sparkContext.parallelize(range(100), 16)
def get_hostname_and_python_version():
import socket, sys
return socket.gethostname() + " --- " + sys.version

res = rdd.map(lambda r: get_hostname_and_python_version()) 
for item in set(res.collect()):
print item
slave01 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)  
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave05 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave06 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave02 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave03 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave04 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
{code}


Unfortunately, the test result was the same: spark.read.csv() with the wholeFile=True 
option fails to read non-ASCII text correctly.
An interesting point is that the broken non-ASCII characters shown as � are 
actually a series of \ufffd. As I understand it, that character appears when input 
characters cannot be decoded with the given codec.
In addition, testing with spark-shell showed the same results (it works in 
local mode but not in YARN mode).
Please see the following code snippets for the details.



{code}
#-#
# Test spark.read.csv() with wholeFile option #
#-#

$ pyspark --master yarn --num-executors 12

# what is test data?
csv_raw = spark.read.text("test.encoding.csv")
csv_raw.show()
+--+
| value|
+--+
|col1,col2,col3|
|  1,a,text|
|  2,b,テキスト|
|   3,c,텍스트|
| 4,d,"text|
|  テキスト|
|  텍스트"|
|  5,e,last|
+--+

# loading csv in one-record-per-line fashion
csv_default = spark.read.csv("test.encoding.csv", header=True)
csv_default.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
|텍스트"|null|null|
|   5|   e|last|
++++

# loading csv in wholeFile mode
csv_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
csv_wholefile.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++

csv_wholefile.collect()[3]  
  
Row(col1=u'4', col2=u'd', 
col3=u'text\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd')


#---#
# Test with spark-shell #
#---#

$ spark-shell --master local[4]
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
+++-+
|col1|col2| col3|
+++-+
|   1|   a| text|
|   2|   b| テキスト|
|   3|   c|  텍스트|
|   4|   d|text
テキスト
텍스트|
|   5|   e| last|
+++-+

$ spark-shell --num-executors 12 --master yarn
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++

{code}
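
As a follow-up experiment (an assumption worth ruling out, not a confirmed fix): the CSV reader accepts an explicit encoding option, and the executor JVMs' default encoding can be forced at submit time; whether the wholeFile code path honors the option in this version is part of what this ticket may come down to.

{code}
# Hedged experiment, not a confirmed fix: pass the charset explicitly and re-run
# the comparison above. Optionally also submit with
#   --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8
# to rule out a differing default JVM encoding on the YARN executors.
df = (spark.read
      .option("wholeFile", True)
      .option("encoding", "UTF-8")   # "charset" is an alias for this CSV option
      .option("header", True)
      .csv("test.encoding.csv"))
df.show()
{code}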


The following is the test for spark.read.json() with the wholeFile option. It 
works without any problem.
However, I would like to point out one issue: the semantics of the wholeFile 
option in this method differ from those in spark.read.csv() (in the json() 
case, one file must contain only one record).
I think this can confuse users.

{code}
#-#
# Test spark.read.json() with wholeFile option #
#-#

# 

[jira] [Comment Edited] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-17 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970944#comment-15970944
 ] 

HanCheol Cho edited comment on SPARK-20336 at 4/17/17 10:31 AM:


This time, I first checked whether PySpark uses the same version of Python on 
both the client and the worker nodes, using the following code.
The result shows that all servers are using the same version.

{code}
$ pyspark --master yarn --num-executors 12

# client
import sys
print sys.version
Python 2.7.13 :: Anaconda custom (64-bit)

# workers
rdd = spark.sparkContext.parallelize(range(100), 16)
def get_hostname_and_python_version():
    import socket, sys
    return socket.gethostname() + " --- " + sys.version

res = rdd.map(lambda r: get_hostname_and_python_version())
for item in set(res.collect()):
    print item
slave01 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)  
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave05 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave06 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave02 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave03 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave04 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
{code}


Unfortunately, the test result was the same: spark.read.csv() with the 
wholeFile=True option fails to read non-ASCII text correctly.
An interesting point is that the broken non-ASCII characters shown as � are 
actually a series of \ufffd. As I understand it, this appears when input 
characters cannot be decoded with the given codec.
In addition, testing with spark-shell showed the same results (it works in 
local mode but not in YARN mode).
Please see the following code snippets for the details.



{code}
#-#
# Test spark.read.csv() with wholeFile option #
#-#

$ pyspark --master yarn --num-executors 12

# what is test data?
csv_raw = spark.read.text("test.encoding.csv")
csv_raw.show()
+--+
| value|
+--+
|col1,col2,col3|
|  1,a,text|
|  2,b,テキスト|
|   3,c,텍스트|
| 4,d,"text|
|  テキスト|
|  텍스트"|
|  5,e,last|
+--+

# loading csv in one-record-per-line fashion
csv_default = spark.read.csv("test.encoding.csv", header=True)
csv_default.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
|텍스트"|null|null|
|   5|   e|last|
++++

# loading csv in wholeFile mode
csv_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
csv_wholefile.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++

csv_wholefile.collect()[3]  
  
Row(col1=u'4', col2=u'd', 
col3=u'text\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd')


#---#
# Test with spark-shell #
#---#

$ spark-shell --master local[4]
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
+++-+
|col1|col2| col3|
+++-+
|   1|   a| text|
|   2|   b| テキスト|
|   3|   c|  텍스트|
|   4|   d|text
テキスト
텍스트|
|   5|   e| last|
+++-+

$ spark-shell --num-executors 12 --master yarn
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++

{code}


The following is the test for spark.read.json() with the wholeFile option. It 
works without any problem.
However, I would like to point out one issue: the semantics of the wholeFile 
option in this method differ from those in spark.read.csv() (in the json() 
case, one file must contain only one record).
I think this can confuse users.

{code}
#-#
# Test spark.read.json() with wholeFile option #
#-#

# 

[jira] [Resolved] (SPARK-20310) Dependency convergence error for scala-xml

2017-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20310.
---
Resolution: Not A Problem

Reopen if that suggestion doesn't work, and, if there is a change you are 
proposing to the Spark build itself.

> Dependency convergence error for scala-xml
> --
>
> Key: SPARK-20310
> URL: https://issues.apache.org/jira/browse/SPARK-20310
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Samik R
>Priority: Minor
>
> Hi,
> I am trying to compile a package (Apache TinkerPop) that has spark-core as 
> one of its dependencies, building against Spark v2.1.0. When I run the Maven 
> build through a dependency checker, it reports a dependency convergence error 
> within spark-core itself for the scala-xml package, as below:
> Dependency convergence error for org.scala-lang.modules:scala-xml_2.11:1.0.1 
> paths to dependency are:
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.json4s:json4s-jackson_2.11:3.2.11
>   +-org.json4s:json4s-core_2.11:3.2.11
> +-org.scala-lang:scalap:2.11.0
>   +-org.scala-lang:scala-compiler:2.11.0
> +-org.scala-lang.modules:scala-xml_2.11:1.0.1
> and
> +-org.apache.tinkerpop:spark-gremlin:3.2.3
>   +-org.apache.spark:spark-core_2.11:2.1.0
> +-org.apache.spark:spark-tags_2.11:2.1.0
>   +-org.scalatest:scalatest_2.11:2.2.6
> +-org.scala-lang.modules:scala-xml_2.11:1.0.2
> Can this be fixed?
> Thanks.
> -Samik



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-17 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970944#comment-15970944
 ] 

HanCheol Cho commented on SPARK-20336:
--

This time, I first checked whether PySpark uses the same version of Python on 
both the client and the worker nodes, using the following code.
The result shows that all servers are using the same version.

{code}
$ pyspark --master yarn --num-executors 12

# client
import sys
print sys.version
Python 2.7.13 :: Anaconda custom (64-bit)

# workers
rdd = spark.sparkContext.parallelize(range(100), 16)
def get_hostname_and_python_version():
    import socket, sys
    return socket.gethostname() + " --- " + sys.version

res = rdd.map(lambda r: get_hostname_and_python_version())
for item in set(res.collect()):
    print item
slave01 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)  
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave05 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave06 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave02 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave03 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
slave04 --- 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
{code}


Unfortunately, the test result was the same: spark.read.csv() with the 
wholeFile=True option fails to read non-ASCII text correctly.
An interesting point is that the broken non-ASCII characters shown as � are 
actually a series of \ufffd. As I understand it, this appears when input 
characters cannot be decoded with the given codec.
In addition, testing with spark-shell showed the same results (it works in 
local mode but not in YARN mode).
Please see the following code snippets for the details.



{code}
#-#
# Test spark.read.csv() with wholeFile option #
#-#

$ pyspark --master yarn --num-executors 12

# what is test data?
csv_raw = spark.read.text("test.encoding.csv")
csv_raw.show()
+--+
| value|
+--+
|col1,col2,col3|
|  1,a,text|
|  2,b,テキスト|
|   3,c,텍스트|
| 4,d,"text|
|  テキスト|
|  텍스트"|
|  5,e,last|
+--+

# loading csv in one-record-per-line fashion
csv_default = spark.read.csv("test.encoding.csv", header=True)
csv_default.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b|テキスト|
|   3|   c| 텍스트|
|   4|   d|text|
|テキスト|null|null|
|텍스트"|null|null|
|   5|   e|last|
++++

# loading csv in wholeFile mode
csv_wholefile = spark.read.csv("test.encoding.csv", header=True, wholeFile=True)
csv_wholefile.show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++

csv_wholefile.collect()[3]  
  
Row(col1=u'4', col2=u'd', 
col3=u'text\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd')


#---#
# Test with spark-shell #
#---#

$ spark-shell --master local[4]
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
+++-+
|col1|col2| col3|
+++-+
|   1|   a| text|
|   2|   b| テキスト|
|   3|   c|  텍스트|
|   4|   d|text
テキスト
텍스트|
|   5|   e| last|
+++-+

$ spark-shell --num-executors 12 --master yarn
spark.read.option("wholeFile", true).option("header", 
true).csv("test.encoding.csv").show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++

{code}


The following is the test for spark.read.json() with the wholeFile option. It 
works without any problem.
However, I would like to point out one issue: the semantics of the wholeFile 
option in this method differ from those in spark.read.csv() (in the json() 
case, one file must contain only one record).
I think this can confuse users.

{code}
#-#
# Test spark.read.json() with wholeFile option #
#-#

# what is test data?
json_raw = 

[jira] [Comment Edited] (SPARK-20347) Provide AsyncRDDActions in Python

2017-04-17 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970933#comment-15970933
 ] 

Maciej Szymkiewicz edited comment on SPARK-20347 at 4/17/17 10:22 AM:
--

This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not like we have 
to worry about the GIL here.

For my own internal usage I went with a monkey patch like this (plus some 
`SparkContext.stop` patching):

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores") or 1)
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor

def asyncCount(self):
    return _get_executor().submit(self.count)

def foreachAsync(self, f):
    return _get_executor().submit(self.foreach, f)

RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync

sc = SparkContext(master="local[*]",
                  conf=SparkConf().set("spark.driver.cores", 3))
f = rdd.asyncCount()
{code}

One possible caveat is the lack of direct legacy Python support, but there is a 
third-party backport, and I would argue this beats implementing it from scratch.

Furthermore, we get a solution that integrates with existing libraries and 
should require minimal maintenance.
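
A minimal usage sketch for the futures returned by the patched methods (assuming the monkey patch above; the RDD below is illustrative and not part of the original snippet):

{code}
from concurrent.futures import wait

rdd = sc.parallelize(range(10 ** 6), 8)   # illustrative RDD
f1 = rdd.asyncCount()
f2 = rdd.foreachAsync(lambda x: None)

wait([f1, f2])          # block until both actions finish
print(f1.result())      # plain concurrent.futures API: result(), done(), add_done_callback()
{code}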



was (Author: zero323):
This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not like we have 
to worry about the GIL here.

For my own internal usage I went with a monkey patch like this (plus some 
`SparkContext.stop` patching):

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores") or 1)
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor

def asyncCount(self):
    return _get_executor().submit(self.count)

def foreachAsync(self, f):
    return _get_executor().submit(self.foreach, f)

RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync

sc = SparkContext(master="local[*]",
                  conf=SparkConf().set("spark.driver.cores", 3))
f = rdd.asyncCount()
{code}


> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20347) Provide AsyncRDDActions in Python

2017-04-17 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970933#comment-15970933
 ] 

Maciej Szymkiewicz edited comment on SPARK-20347 at 4/17/17 10:17 AM:
--

This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not like we have 
to worry about the GIL here.

For my own internal usage I went with a monkey patch like this (plus some 
`SparkContext.stop` patching):

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores") or 1)
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor

def asyncCount(self):
    return _get_executor().submit(self.count)

def foreachAsync(self, f):
    return _get_executor().submit(self.foreach, f)

RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync

sc = SparkContext(master="local[*]",
                  conf=SparkConf().set("spark.driver.cores", 3))
f = rdd.asyncCount()
{code}



was (Author: zero323):
This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not like we have 
to worry about the GIL here.

For my own internal usage I went with a monkey patch like this (plus some 
`SparkContext.stop` patching):

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores") or 1)
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor

def asyncCount(self):
    return _get_executor().submit(self.count)

def foreachAsync(self, f):
    return _get_executor().submit(self.foreach, f)

RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync

sc = SparkContext(master="local[*]",
                  conf=SparkConf().set("spark.driver.cores", 3))
f = rdd.asyncCount()
{code}


> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20347) Provide AsyncRDDActions in Python

2017-04-17 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970933#comment-15970933
 ] 

Maciej Szymkiewicz edited comment on SPARK-20347 at 4/17/17 10:16 AM:
--

This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not like we have 
to worry about the GIL here.

For my own internal usage I went with a monkey patch like this (plus some 
`SparkContext.stop` patching):

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores") or 1)
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor

def asyncCount(self):
    return _get_executor().submit(self.count)

def foreachAsync(self, f):
    return _get_executor().submit(self.foreach, f)

RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync

sc = SparkContext(master="local[*]",
                  conf=SparkConf().set("spark.driver.cores", 3))
f = rdd.asyncCount()
{code}



was (Author: zero323):
This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not like we have 
to worry about the GIL here.

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores") or 1)
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor

def asyncCount(self):
    return _get_executor().submit(self.count)

def foreachAsync(self, f):
    return _get_executor().submit(self.foreach, f)

RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync

{code}

> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20347) Provide AsyncRDDActions in Python

2017-04-17 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970933#comment-15970933
 ] 

Maciej Szymkiewicz commented on SPARK-20347:


This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not like we have 
to worry about the GIL here.

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores") or 1)
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor

def asyncCount(self):
    return _get_executor().submit(self.count)

def foreachAsync(self, f):
    return _get_executor().submit(self.foreach, f)

RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync

{code}

> Provide AsyncRDDActions in Python
> -
>
> Key: SPARK-20347
> URL: https://issues.apache.org/jira/browse/SPARK-20347
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: holdenk
>Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16892) flatten function to get flat array (or map) column from array of array (or array of map) column

2017-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16892.
---
Resolution: Not A Problem

> flatten function to get flat array (or map) column from array of array (or 
> array of map) column
> ---
>
> Key: SPARK-16892
> URL: https://issues.apache.org/jira/browse/SPARK-16892
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Kapil Singh
>
> flatten(input)
> Converts input of array of array type into flat array type by inserting 
> elements of all element arrays into single array. Example:
> input: [[1, 2, 3], [4, 5], [-1, -2, 0]]
> output: [1, 2, 3, 4, 5, -1, -2, 0]
> Converts input of array of map type into flat map type by inserting key-value 
> pairs of all element maps into single map. Example:
> input: [(1 -> "one", 2 -> "two"), (0 -> "zero"), (4 -> "four")]
> output: (1 -> "one", 2 -> "two", 0 -> "zero", 4 -> "four")
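
For reference (and presumably part of why this was resolved as Not A Problem), the array-of-array case above can already be handled with a small UDF; a workaround sketch in PySpark, assuming long/integer elements (adjust the element type as needed):

{code}
from itertools import chain
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

# Flatten array<array<long>> into array<long>; a null input yields an empty array.
flatten_udf = udf(lambda arrays: list(chain.from_iterable(arrays or [])),
                  ArrayType(LongType()))

df = spark.createDataFrame([([[1, 2, 3], [4, 5], [-1, -2, 0]],)], ["nested"])
df.select(flatten_udf("nested").alias("flat")).show(truncate=False)
# flat = [1, 2, 3, 4, 5, -1, -2, 0]
{code}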



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-20352.
-

> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 12
> Spark 2.1
> JRE 8.0
> Python 2.7
>Reporter: hosein
>
> I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core 
> CPU, launching spark-submit my_code.py without any additional configuration 
> parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the 
> context; this process repeats every 10 seconds.
> {code}
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g').getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> {code}
> When the program starts, it works perfectly, but after it has been running for 
> many hours, Spark initialization takes a long time: 10 or 20 seconds just to 
> initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20352.
---
Resolution: Not A Problem

"My code takes too long to run" is not a JIRA. You haven't addressed the 
underlying problem here, which is that you're reopening contexts.
Let committers reopen JIRAs. This should _not_ be reopened.
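
A sketch of the pattern being suggested instead (create the session once and reuse it across iterations; the sleep and the analysis step are placeholders):

{code}
import time
from pyspark.sql import SparkSession

# Build the session once, outside the loop.
spark = (SparkSession.builder
         .appName("sync_task")
         .config("spark.driver.maxResultSize", "5g")
         .getOrCreate())
sc = spark.sparkContext

while True:
    # ... analyze data with the same session/context ...
    time.sleep(10)

# spark.stop() only once, when the application shuts down.
{code}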

> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 12
> Spark 2.1
> JRE 8.0
> Python 2.7
>Reporter: hosein
>
> I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core 
> CPU, launching spark-submit my_code.py without any additional configuration 
> parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the 
> context; this process repeats every 10 seconds.
> {code}
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g').getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> {code}
> When the program starts, it works perfectly, but after it has been running for 
> many hours, Spark initialization takes a long time: 10 or 20 seconds just to 
> initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-17 Thread hosein (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hosein reopened SPARK-20352:


> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 12
> Spark 2.1
> JRE 8.0
> Python 2.7
>Reporter: hosein
>
> I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core 
> CPU, launching spark-submit my_code.py without any additional configuration 
> parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the 
> context; this process repeats every 10 seconds.
> {code}
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g').getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> {code}
> When the program starts, it works perfectly, but after it has been running for 
> many hours, Spark initialization takes a long time: 10 or 20 seconds just to 
> initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-17 Thread hosein (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970906#comment-15970906
 ] 

hosein commented on SPARK-20352:


I monitor the execution time of every line in my code, and this line:

spark = SparkSession.builder.appName("sync_task").config('spark.driver.maxResultSize', '5g').getOrCreate()

takes too long (20 or more seconds) once my code has been running for hours.

> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 12
> Spark 2.1
> JRE 8.0
> Python 2.7
>Reporter: hosein
>
> I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core 
> CPU, launching spark-submit my_code.py without any additional configuration 
> parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the 
> context; this process repeats every 10 seconds.
> {code}
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g').getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> {code}
> When the program starts, it works perfectly, but after it has been running for 
> many hours, Spark initialization takes a long time: 10 or 20 seconds just to 
> initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20352) PySpark SparkSession initialization take longer every iteration in a single application

2017-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20352.
---
   Resolution: Not A Problem
Fix Version/s: (was: 2.1.0)

At the least, it's not supported to stop and start contexts within an app. 
There should be no need to do that. 
You haven't provided detail on what is slow, and that's up to you before you 
create an issue.
Please read http://spark.apache.org/contributing.html

> PySpark SparkSession initialization take longer every iteration in a single 
> application
> ---
>
> Key: SPARK-20352
> URL: https://issues.apache.org/jira/browse/SPARK-20352
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 12
> Spark 2.1
> JRE 8.0
> Python 2.7
>Reporter: hosein
>
> I run Spark on a standalone Ubuntu server with 128 GB of memory and a 32-core 
> CPU, launching spark-submit my_code.py without any additional configuration 
> parameters.
> In a while loop I start a SparkSession, analyze data, and then stop the 
> context; this process repeats every 10 seconds.
> {code}
> while True:
>     spark = SparkSession.builder.appName("sync_task") \
>         .config('spark.driver.maxResultSize', '5g').getOrCreate()
>     sc = spark.sparkContext
>     # some processing and analysis
>     spark.stop()
> {code}
> When the program starts, it works perfectly, but after it has been running for 
> many hours, Spark initialization takes a long time: 10 or 20 seconds just to 
> initialize Spark.
> So what is the problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19976) DirectStream API throws OffsetOutOfRange Exception

2017-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19976.
---
Resolution: Not A Problem

> DirectStream API throws OffsetOutOfRange Exception
> --
>
> Key: SPARK-19976
> URL: https://issues.apache.org/jira/browse/SPARK-19976
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Taukir
>
> I am using the following code. When data on the Kafka topic gets deleted or 
> the retention period is over, it throws an exception and the application crashes:
> def functionToCreateContext(sc:SparkContext):StreamingContext = {
> val kafkaParams = new mutable.HashMap[String, Object]()
> kafkaParams.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokers)
> kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, KafkaConsumerGrp)
> kafkaParams.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, 
> classOf[StringDeserializer])
> kafkaParams.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, 
> classOf[StringDeserializer])
> kafkaParams.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG,"true")
> kafkaParams.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
>val consumerStrategy = ConsumerStrategies.Subscribe[String, 
> String](topic.split(",").map(_.trim).filter(!_.isEmpty).toSet, kafkaParams)
> val kafkaStream  = 
> KafkaUtils.createDirectStream(ssc,LocationStrategies.PreferConsistent,consumerStrategy)
> }
> Spark throws an error and crashes once OffsetOutOfRangeException is thrown:
> WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 : 
> org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of 
> range with no configured reset policy for partitions: {test-2=127287}
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:98)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20353) Implement Tensorflow TFRecords file format

2017-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20353:
--
Priority: Minor  (was: Major)

I think this is too app-specific to live in Spark, and should just be in a 
third-party library.

> Implement Tensorflow TFRecords file format
> --
>
> Key: SPARK-20353
> URL: https://issues.apache.org/jira/browse/SPARK-20353
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Mathew Wicks
>Priority: Minor
>
> Spark is a very good preprocessing engine for tools like Tensorflow. However, 
> we lack native support for Tensorflow's core file format, TFRecords.
> There is a project which implements this functionality as an external JAR. 
> (But is not user friendly, or robust enough for production use.)
> https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector
> Here is some discussion around the above.
> https://github.com/tensorflow/ecosystem/issues/32
> If we were to implement "tfrecords" as a data-frame writable/readable format, 
> we would have to account for the various datatypes that can be present in 
> spark columns, and which ones are actually useful in Tensorflow. 
> Note: The `spark-tensorflow-connector` described above does not properly 
> support the vector data type. 
> Further discussion of whether this is within the scope of Spark SQL is 
> strongly welcomed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-17 Thread Umesh Chaudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970857#comment-15970857
 ] 

Umesh Chaudhary commented on SPARK-20299:
-

[~lwlin], I want to work on this but I am waiting on input from [~marmbrus]. If 
you want to continue, just go ahead.

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
>   at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 58 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19368) Very bad performance in BlockMatrix.toIndexedRowMatrix()

2017-04-17 Thread Angelos Kaltsikis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970853#comment-15970853
 ] 

Angelos Kaltsikis commented on SPARK-19368:
---

Is there any chance this will get fixed soon?

> Very bad performance in BlockMatrix.toIndexedRowMatrix()
> 
>
> Key: SPARK-19368
> URL: https://issues.apache.org/jira/browse/SPARK-19368
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Ohad Raviv
>Priority: Minor
> Attachments: profiler snapshot.png
>
>
> In SPARK-12869, this function was optimized for the case of dense matrices 
> using Breeze. However, I have a case with very very sparse matrices which 
> suffers a great deal from this optimization. A process we have that took 
> about 20 mins now takes about 6.5 hours.
> Here is a sample code to see the difference:
> {quote}
> val n = 4
> val density = 0.0002
> val rnd = new Random(123)
> val rndEntryList = (for (i <- 0 until (n*n*density).toInt) yield 
> (rnd.nextInt\(n\), rnd.nextInt\(n\), rnd.nextDouble()))
>   .groupBy(t => (t._1,t._2)).map\(t => t._2.last).map\{ case 
> (i,j,d) => (i,(j,d)) }.toSeq
> val entries: RDD\[(Int, (Int, Double))] = sc.parallelize(rndEntryList, 10)
> val indexedRows = entries.groupByKey().map(e => IndexedRow(e._1, 
> Vectors.sparse(n, e._2.toSeq)))
> val mat = new IndexedRowMatrix(indexedRows, nRows = n, nCols = n)
> val t1 = System.nanoTime()
> 
> println(mat.toBlockMatrix(1,1).toCoordinateMatrix().toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t2 = System.nanoTime()
> println("took: " + (t2 - t1) / 1000 / 1000 + " ms")
> println("")
> 
> println(mat.toBlockMatrix(1,1).toIndexedRowMatrix().rows.map(_.vector.numActives).sum())
> val t3 = System.nanoTime()
> println("took: " + (t3 - t2) / 1000 / 1000 + " ms")
> println("")
> {quote}
> I get:
> {quote}
> took: 9404 ms
> 
> took: 57350 ms
> 
> {quote}
> Looking at it a little with a profiler, I see that the problem is with the 
> SliceVector.update() and SparseVector.apply.
> I currently work-around this by doing:
> {quote}
> blockMatrix.toCoordinateMatrix().toIndexedRowMatrix()
> {quote}
> like it was in version 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20335) Children expressions of Hive UDF impacts the determinism of Hive UDF

2017-04-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20335:

Fix Version/s: 2.1.1

> Children expressions of Hive UDF impacts the determinism of Hive UDF
> 
>
> Key: SPARK-20335
> URL: https://issues.apache.org/jira/browse/SPARK-20335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.1, 2.2.0
>
>
> {noformat}
>   /**
>* Certain optimizations should not be applied if UDF is not deterministic.
>* Deterministic UDF returns same result each time it is invoked with a
>* particular input. This determinism just needs to hold within the context 
> of
>* a query.
>*
>* @return true if the UDF is deterministic
>*/
>   boolean deterministic() default true;
> {noformat}
> Based on the definition of UDFType, when a Hive UDF's children are 
> non-deterministic, the Hive UDF is also non-deterministic.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20299) NullPointerException when null and string are in a tuple while encoding Dataset

2017-04-17 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970811#comment-15970811
 ] 

Liwei Lin commented on SPARK-20299:
---

hi [~umesh9...@gmail.com], are you planning to work on this? In case not, I 
might be interested in this.

> NullPointerException when null and string are in a tuple while encoding 
> Dataset
> ---
>
> Key: SPARK-20299
> URL: https://issues.apache.org/jira/browse/SPARK-20299
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When creating a Dataset from a tuple with {{null}} and a string, NPE is 
> reported. When either is removed, it works fine.
> {code}
> scala> Seq((1, null.asInstanceOf[Int]), (2, 1)).toDS
> res43: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
> scala> Seq(("1", null.asInstanceOf[Int]), ("2", 1)).toDS
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top 
> level Product input object), - root class: "scala.Tuple2")._1, true) AS _1#474
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true], top level Product 
> input object), - root class: "scala.Tuple2")._2 AS _2#475
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
>   at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
>   at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:246)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 58 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20354) /api/v1/applications’ return sparkUser is null in REST API.

2017-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970772#comment-15970772
 ] 

Apache Spark commented on SPARK-20354:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/17656

> /api/v1/applications’ return sparkUser is null in REST API.
> ---
>
> Key: SPARK-20354
> URL: https://issues.apache.org/jira/browse/SPARK-20354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> /api/v1/applications returns sparkUser as null in the REST API.
> /api/v1/applications returns the following JSON:
> [ {
>   "id" : "app-20170417151455-0003",
>   "name" : "KafkaWordCount",
>   "attempts" : [ {
> "startTime" : "2017-04-17T07:14:51.949GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "lastUpdated" : "2017-04-17T07:14:51.949GMT",
> "duration" : 0,
> "sparkUser" : " ",
> "completed" : false,
> "endTimeEpoch" : -1,
> "startTimeEpoch" : 1492413291949,
> "lastUpdatedEpoch" : 1492413291949
>   } ]
> } ]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20354) /api/v1/applications’ return sparkUser is null in REST API.

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20354:


Assignee: Apache Spark

> /api/v1/applications’ return sparkUser is null in REST API.
> ---
>
> Key: SPARK-20354
> URL: https://issues.apache.org/jira/browse/SPARK-20354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Minor
>
> /api/v1/applications returns sparkUser as null in the REST API.
> /api/v1/applications returns the following JSON:
> [ {
>   "id" : "app-20170417151455-0003",
>   "name" : "KafkaWordCount",
>   "attempts" : [ {
> "startTime" : "2017-04-17T07:14:51.949GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "lastUpdated" : "2017-04-17T07:14:51.949GMT",
> "duration" : 0,
> "sparkUser" : " ",
> "completed" : false,
> "endTimeEpoch" : -1,
> "startTimeEpoch" : 1492413291949,
> "lastUpdatedEpoch" : 1492413291949
>   } ]
> } ]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20354) /api/v1/applications’ return sparkUser is null in REST API.

2017-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20354:


Assignee: (was: Apache Spark)

> /api/v1/applications’ return sparkUser is null in REST API.
> ---
>
> Key: SPARK-20354
> URL: https://issues.apache.org/jira/browse/SPARK-20354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> /api/v1/applications returns sparkUser as null in the REST API.
> /api/v1/applications returns the following JSON:
> [ {
>   "id" : "app-20170417151455-0003",
>   "name" : "KafkaWordCount",
>   "attempts" : [ {
> "startTime" : "2017-04-17T07:14:51.949GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "lastUpdated" : "2017-04-17T07:14:51.949GMT",
> "duration" : 0,
> "sparkUser" : " ",
> "completed" : false,
> "endTimeEpoch" : -1,
> "startTimeEpoch" : 1492413291949,
> "lastUpdatedEpoch" : 1492413291949
>   } ]
> } ]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20354) When I request access to the 'http: //ip:port/api/v1/applications' link, return 'sparkUser' is empty in REST API.

2017-04-17 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20354:
---
Description: 
When I request the 'http://ip:port/api/v1/applications' link, I get the JSON 
below. I need the specific value of the 'sparkUser' field because my Spark big 
data management platform filters on this field to see which user submitted each 
application, for administration and querying. However, this field in the 
returned JSON string is currently empty, so that feature cannot be implemented; 
that is, I cannot tell from this REST API who submitted a given application.

return json:
[ {
  "id" : "app-20170417152053-",
  "name" : "KafkaWordCount",
  "attempts" : [ {
"startTime" : "2017-04-17T07:20:51.395GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2017-04-17T07:20:51.395GMT",
"duration" : 0,
"sparkUser" : "",
"completed" : false,
"endTimeEpoch" : -1,
"startTimeEpoch" : 1492413651395,
"lastUpdatedEpoch" : 1492413651395
  } ]
} ]

  was:
/api/v1/applications returns sparkUser as null in the REST API.
/api/v1/applications returns the following JSON:
[ {
  "id" : "app-20170417151455-0003",
  "name" : "KafkaWordCount",
  "attempts" : [ {
"startTime" : "2017-04-17T07:14:51.949GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2017-04-17T07:14:51.949GMT",
"duration" : 0,
"sparkUser" : " ",
"completed" : false,
"endTimeEpoch" : -1,
"startTimeEpoch" : 1492413291949,
"lastUpdatedEpoch" : 1492413291949
  } ]
} ]

Summary: When I request access to the 'http: 
//ip:port/api/v1/applications' link, return 'sparkUser' is empty in REST API.  
(was: /api/v1/applications’ return sparkUser is null in REST API.)

> When I request access to the 'http: //ip:port/api/v1/applications' link, 
> return 'sparkUser' is empty in REST API.
> -
>
> Key: SPARK-20354
> URL: https://issues.apache.org/jira/browse/SPARK-20354
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> When I request the 'http://ip:port/api/v1/applications' link, I get the JSON 
> below. I need the specific value of the 'sparkUser' field because my Spark big 
> data management platform filters on this field to see which user submitted 
> each application, for administration and querying. However, this field in the 
> returned JSON string is currently empty, so that feature cannot be 
> implemented; that is, I cannot tell from this REST API who submitted a given 
> application.
> return json:
> [ {
>   "id" : "app-20170417152053-",
>   "name" : "KafkaWordCount",
>   "attempts" : [ {
> "startTime" : "2017-04-17T07:20:51.395GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "lastUpdated" : "2017-04-17T07:20:51.395GMT",
> "duration" : 0,
> "sparkUser" : "",
> "completed" : false,
> "endTimeEpoch" : -1,
> "startTimeEpoch" : 1492413651395,
> "lastUpdatedEpoch" : 1492413651395
>   } ]
> } ]
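
A sketch of the client-side filtering described above (hypothetical host, port and user name; requires the `requests` package), which only becomes useful once sparkUser is actually populated:

{code}
import requests

# hypothetical UI / history-server address -- adjust host and port
apps = requests.get("http://host:18080/api/v1/applications").json()
for app in apps:
    for attempt in app["attempts"]:
        if attempt["sparkUser"] == "alice":      # filter by submitting user
            print(app["id"], app["name"], attempt["startTime"])
{code}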



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20354) /api/v1/applications’ return sparkUser is null in REST API.

2017-04-17 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-20354:
--

 Summary: /api/v1/applications’ return sparkUser is null in REST 
API.
 Key: SPARK-20354
 URL: https://issues.apache.org/jira/browse/SPARK-20354
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: guoxiaolongzte
Priority: Minor


/api/v1/applications returns sparkUser as null in the REST API.
/api/v1/applications returns the following JSON:
[ {
  "id" : "app-20170417151455-0003",
  "name" : "KafkaWordCount",
  "attempts" : [ {
"startTime" : "2017-04-17T07:14:51.949GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"lastUpdated" : "2017-04-17T07:14:51.949GMT",
"duration" : 0,
"sparkUser" : " ",
"completed" : false,
"endTimeEpoch" : -1,
"startTimeEpoch" : 1492413291949,
"lastUpdatedEpoch" : 1492413291949
  } ]
} ]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20353) Implement Tensorflow TFRecords file format

2017-04-17 Thread Mathew Wicks (JIRA)
Mathew Wicks created SPARK-20353:


 Summary: Implement Tensorflow TFRecords file format
 Key: SPARK-20353
 URL: https://issues.apache.org/jira/browse/SPARK-20353
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output, SQL
Affects Versions: 2.1.0
Reporter: Mathew Wicks


Spark is a very good preprocessing engine for tools like Tensorflow. However, 
we lack native support for Tensorflow's core file format, TFRecords.

There is a project which implements this functionality as an external JAR. (But 
is not user friendly, or robust enough for production use.)
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector

Here is some discussion around the above.
https://github.com/tensorflow/ecosystem/issues/32

If we were to implement "tfrecords" as a data-frame writable/readable format, 
we would have to account for the various datatypes that can be present in spark 
columns, and which ones are actually useful in Tensorflow. 

Note: The `spark-tensorflow-connector` described above does not properly 
support the vector data type. 

Further discussion of whether this is within the scope of Spark SQL is strongly 
welcomed.
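
For anyone needing this today, a hedged sketch of using the external connector from PySpark; the format short name ("tfrecords") and the "recordType" option are taken from that project's README rather than from Spark itself, so treat them as assumptions and check the connector's documentation for the exact spelling:

{code}
# Assumes the spark-tensorflow-connector JAR is on the driver/executor classpath
# (e.g. via --jars). The "tfrecords" short name and "recordType" option come
# from the connector's README (assumption), not from Spark.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

df.write.format("tfrecords").option("recordType", "Example").save("/tmp/tf-out")
back = spark.read.format("tfrecords").option("recordType", "Example").load("/tmp/tf-out")
back.show()
{code}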



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org