[jira] [Created] (SPARK-25402) Null handling in BooleanSimplification

2018-09-10 Thread Xiao Li (JIRA)
Xiao Li created SPARK-25402:
---

 Summary: Null handling in BooleanSimplification
 Key: SPARK-25402
 URL: https://issues.apache.org/jira/browse/SPARK-25402
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.2.2
Reporter: Xiao Li
Assignee: Xiao Li


SPARK-20350 introduced a bug in BooleanSimplification's null handling. For 
example, the following case returns a wrong answer:

{code}
val schema = StructType.fromDDL("a boolean, b int")
val rows = Seq(Row(null, 1))

val rdd = sparkContext.parallelize(rows)
val df = spark.createDataFrame(rdd, schema)

checkAnswer(df.where("(NOT a) OR a"), Seq.empty)
{code}
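For context (an editorial illustration, not part of the original report): under SQL 
three-valued logic, NOT a evaluates to NULL when a is NULL, and NULL OR NULL is 
NULL, so the WHERE clause must drop the row and the correct answer is empty. A 
minimal standalone sketch, assuming a plain SparkSession named spark:

{code}
// Rewriting "(NOT a) OR a" to TRUE is only safe when a is known to be
// non-nullable; with a = NULL the predicate is NULL and the row is filtered out.
spark.sql("""
  SELECT * FROM VALUES (CAST(NULL AS BOOLEAN), 1) AS t(a, b)
  WHERE (NOT a) OR a
""").show()   // expected: zero rows
{code}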






[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-09-10 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai updated SPARK-23207:
--
Description: 
Currently shuffle repartition uses RoundRobinPartitioning; the generated result 
is nondeterministic because the order of the input rows is not determined.

The bug can be triggered when there is a repartition call following a shuffle 
(which would lead to non-deterministic row ordering), as the pattern below 
shows:
upstream stage -> repartition stage -> result stage
(-> indicates a shuffle)
When one of the executor processes goes down, some tasks of the repartition 
stage will be retried and generate an inconsistent ordering, and some tasks of 
the result stage will be retried, producing different data.

The following code returns 931532, instead of 100:
{code:java}
import scala.sys.process._

import org.apache.spark.TaskContext
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}
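For illustration only (a workaround sketch, not the committed fix): partitioning by 
an expression is deterministic per row, because the target partition depends only 
on the row's content rather than on the order in which rows arrive, so retried 
tasks reproduce the same partitions:

{code:java}
// Sketch: hash partitioning by the "id" column instead of round-robin.
// The partition assignment is a pure function of the row, so executor loss
// and task retries cannot change which partition a row lands in.
import org.apache.spark.sql.functions.col

val res2 = spark.range(0, 1000 * 1000, 1).repartition(200, col("id"))
res2.distinct().count()   // always 1000000
{code}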

  was:
Currently shuffle repartition uses RoundRobinPartitioning; the generated result 
is nondeterministic because the order of the input rows is not determined.

The bug can be triggered when there is a repartition call following a shuffle 
(which would lead to non-deterministic row ordering), as the pattern below 
shows:
upstream stage -> repartition stage -> result stage
(-> indicates a shuffle)
When one of the executor processes goes down, some tasks of the repartition 
stage will be retried and generate an inconsistent ordering, and some tasks of 
the result stage will be retried, producing different data.

The following code returns 931532, instead of 100:
{code}
import scala.sys.process._

import org.apache.spark.TaskContext
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}


> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.1.4, 2.2.3, 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning; the generated 
> result is nondeterministic because the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern below 
> shows:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks of 
> the result stage will be retried, producing different data.
> The following code returns 931532, instead of 100:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}






[jira] [Commented] (SPARK-25241) Configurable empty values when reading/writing CSV files

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610166#comment-16610166
 ] 

Apache Spark commented on SPARK-25241:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22389

> Configurable empty values when reading/writing CSV files
> 
>
> Key: SPARK-25241
> URL: https://issues.apache.org/jira/browse/SPARK-25241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Mario Molina
>Priority: Minor
>
> The CSV parser has an option that controls the value used when we have empty 
> values in the CSV files or in our dataframes.
> Currently, this option cannot be configured and always uses a default value 
> (an empty string for reading and `""` for writing).
> I think it'd be useful to expose this option in the CSV reader/writer to 
> allow customizing these empty values.
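For illustration, a sketch of how a configurable option could look from the user 
side; the option name emptyValue follows the linked pull request and should be 
treated as an assumption here, and df stands for an arbitrary DataFrame:

{code}
// Hypothetical usage: write empty strings as "EMPTY" instead of the
// hard-coded default, and map "EMPTY" back to an empty string when reading.
df.write.option("emptyValue", "EMPTY").csv("/tmp/csv-out")
val roundTrip = spark.read
  .schema(df.schema)
  .option("emptyValue", "EMPTY")
  .csv("/tmp/csv-out")
{code}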






[jira] [Commented] (SPARK-25241) Configurable empty values when reading/writing CSV files

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610167#comment-16610167
 ] 

Apache Spark commented on SPARK-25241:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22389

> Configurable empty values when reading/writing CSV files
> 
>
> Key: SPARK-25241
> URL: https://issues.apache.org/jira/browse/SPARK-25241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Mario Molina
>Priority: Minor
>
> The CSV parser has an option that controls the value used when we have empty 
> values in the CSV files or in our dataframes.
> Currently, this option cannot be configured and always uses a default value 
> (an empty string for reading and `""` for writing).
> I think it'd be useful to expose this option in the CSV reader/writer to 
> allow customizing these empty values.






[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610164#comment-16610164
 ] 

Apache Spark commented on SPARK-17916:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22389

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> When a user configures {{nullValue}} in the CSV data source, in addition to 
> that value, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.
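To spell out the expected behavior (an illustrative sketch, assuming the sample 
above is saved as /tmp/data.csv with the header row included): only the literal 
"-" should be read back as null, while the quoted empty string should survive as 
an empty string.

{code}
val df = spark.read
  .option("header", "true")
  .option("nullValue", "-")   // only "-" should map to null
  .csv("/tmp/data.csv")
df.show()
// Expected: col2 is null in row 1 and an empty string (not null) in row 2.
{code}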






[jira] [Created] (SPARK-25401) Reorder the required ordering to match the table's output ordering for bucket join

2018-09-10 Thread Wang, Gang (JIRA)
Wang, Gang created SPARK-25401:
--

 Summary: Reorder the required ordering to match the table's output 
ordering for bucket join
 Key: SPARK-25401
 URL: https://issues.apache.org/jira/browse/SPARK-25401
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wang, Gang


Currently, we check whether a SortExec is needed between an operator and its 
child in the method orderingSatisfies, and orderingSatisfies requires that the 
SortOrders appear in exactly the same order.

However, take the following case into consideration:
 * Table a is bucketed by (a1, a2), sorted by (a2, a1), and the number of 
buckets is 200.
 * Table b is bucketed by (b1, b2), sorted by (b2, b1), and the number of 
buckets is 200.
 * Table a is joined with table b on (a1=b1, a2=b2).

In this case, if the join is a sort merge join, the query planner won't add an 
exchange on either side, but a sort will be added on both sides. The sort is 
actually unnecessary as well, since within the same bucket (e.g., bucket 1 of 
table a and bucket 1 of table b), (a1=b1, a2=b2) is equivalent to (a2=b2, a1=b1).
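A sketch of the setup described above (table, column, and DataFrame names are 
illustrative):

{code}
// Both tables use 200 buckets, with the sort columns in the reverse order of
// the join keys.
dfA.write.bucketBy(200, "a1", "a2").sortBy("a2", "a1").saveAsTable("a")
dfB.write.bucketBy(200, "b1", "b2").sortBy("b2", "b1").saveAsTable("b")

// Sort merge join on (a1 = b1, a2 = b2): no exchange is added, but a SortExec
// currently is, even though the per-bucket sort on (a2, a1) already satisfies
// the join keys once the equality predicates are reordered.
spark.sql("SELECT * FROM a JOIN b ON a.a1 = b.b1 AND a.a2 = b.b2").explain()
{code}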






[jira] [Resolved] (SPARK-23515) JsonProtocol.sparkEventToJson can OOM when jsonifying an event

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23515.
-
Resolution: Won't Fix

> JsonProtocol.sparkEventToJson can OOM when jsonifying an event
> --
>
> Key: SPARK-23515
> URL: https://issues.apache.org/jira/browse/SPARK-23515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> {code}
> def sparkEventToJson(event: SparkListenerEvent)
> {code}
> has a fallback path which creates a JSON object by turning an unrecognized 
> event into a JSON string and then parsing it again. This materializes the 
> whole string in order to parse the JSON record, which is unnecessary and can 
> cause OOMs, as seen in the stacktrace below:
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Arrays.java:3664)
> at java.lang.String.<init>(String.java:207)
> at java.lang.StringBuilder.toString(StringBuilder.java:407)
> at 
> com.fasterxml.jackson.core.util.TextBuffer.contentsAsString(TextBuffer.java:356)
> at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.getText(ReaderBasedJsonParser.java:235)
> at 
> org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:20)
> at 
> org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
> at 
> org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
> at 
> com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3736)
> at 
> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2726)
> at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:20)
> at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:50)
> at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:103){code}
>  
> We should just use the stream parsing to avoid such OOMs.
>  
>  
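For illustration only (this is not the Spark patch; the issue was resolved as 
Won't Fix): the idea of stream parsing is to build the JSON tree directly from the 
event object instead of materializing the full JSON string first, e.g. with 
Jackson:

{code}
// Sketch, assuming `event` is the SparkListenerEvent being serialized:
// valueToTree walks the object graph and produces a JsonNode without ever
// allocating the intermediate String that triggered the OOM above.
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
val tree: JsonNode = mapper.valueToTree[JsonNode](event)
{code}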






[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-10 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610089#comment-16610089
 ] 

Felix Cheung commented on SPARK-23200:
--

probably need someone to rebuild on the current config names...

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516






[jira] [Updated] (SPARK-23243) Shuffle+Repartition on an RDD could lead to incorrect answers

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23243:

Fix Version/s: 2.2.3

> Shuffle+Repartition on an RDD could lead to incorrect answers
> -
>
> Key: SPARK-23243
> URL: https://issues.apache.org/jira/browse/SPARK-23243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Xingbo Jiang
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.2.3, 2.3.2, 2.4.0
>
>
> RDD repartition also uses round-robin to distribute data, which can also 
> cause incorrect answers for RDD workloads in the same way as in 
> https://issues.apache.org/jira/browse/SPARK-23207
> The approach that fixes DataFrame.repartition() doesn't apply to the RDD 
> repartition issue, as discussed in 
> https://github.com/apache/spark/pull/20393#issuecomment-360912451
> We track alternative solutions for this issue in this task.






[jira] [Updated] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20715:

Fix Version/s: 2.2.3

> MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and 
> MapOutputTracker
> 
>
> Key: SPARK-20715
> URL: https://issues.apache.org/jira/browse/SPARK-20715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 2.2.3, 2.3.0
>
>
> Today the MapOutputTracker and ShuffleMapStage both maintain their own copies 
> of MapStatuses. This creates the potential for bugs in case these two pieces 
> of state become out of sync.
> I believe that we can improve our ability to reason about the code by storing 
> this information only in the MapOutputTracker. This can also help to reduce 
> driver memory consumption.
> I will provide more details in my PR, where I'll walk through the detailed 
> arguments as to why we can take these two different metadata tracking formats 
> and consolidate without loss of performance or correctness.






[jira] [Updated] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22632:

Target Version/s: 3.0.0  (was: 2.4.0)

> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Note: the wording is borrowed from SPARK-22395. The symptom is similar and I 
> think that JIRA describes it well.
> When converting R's DataFrame from/to a Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values respect the R system 
> timezone instead of the session timezone.
> For example, let's say we use "America/Los_Angeles" as the session timezone 
> and have a timestamp value "1970-01-01 00:00:01" in that timezone. Btw, I'm 
> in South Korea, so the R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects R 
> system timezone.






[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2018-09-10 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610083#comment-16610083
 ] 

Felix Cheung commented on SPARK-22632:
--

mismatch between R and JVM time zone could be an issue but not a blocker for 
release. let's move to 3.0

> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Note: the wording is borrowed from SPARK-22395. The symptom is similar and I 
> think that JIRA describes it well.
> When converting R's DataFrame from/to a Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values respect the R system 
> timezone instead of the session timezone.
> For example, let's say we use "America/Los_Angeles" as the session timezone 
> and have a timestamp value "1970-01-01 00:00:01" in that timezone. Btw, I'm 
> in South Korea, so the R timezone would be "KST".
> The timestamp value from current collect() will be the following:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects R 
> system timezone.






[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610082#comment-16610082
 ] 

Wenchen Fan commented on SPARK-23580:
-

There are still 2 open PRs: one has been inactive for a while, and one will be 
merged to master soon. Shall we just move the inactive one out of this umbrella?

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does not 
> work or blows past the JVM class limits. We currently cannot gracefully fall 
> back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. The 
> work can be divided into two main areas:
> - Add interpreted versions of all dataset-related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.






[jira] [Commented] (SPARK-25367) The column attributes obtained by Spark sql are inconsistent with hive

2018-09-10 Thread yy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610056#comment-16610056
 ] 

yy commented on SPARK-25367:


[~hyukjin.kwon]

Thank you for the correction; I will pay attention next time.

By the way, have you ever run into that problem? I wonder whether we need to 
solve it, or whether Spark simply does not support users modifying a column's type.

> The column attributes obtained by Spark sql are inconsistent with hive
> --
>
> Key: SPARK-25367
> URL: https://issues.apache.org/jira/browse/SPARK-25367
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1
> Environment: spark2.2.1-hadoop-2.6.0-chd-5.4.2
> hive-1.2.1
>Reporter: yy
>Priority: Major
>  Labels: sparksql
>
> We save the dataframe object as a Hive table in orc/parquet format in the 
> spark shell.
> After we modify the column type (e.g., bigint to double) of this table via 
> Hive JDBC, we find that the column type queried in spark-shell doesn't 
> change, although it does change in Hive JDBC. Even after restarting the 
> spark-shell, this table's column type is still inconsistent with what Hive 
> JDBC shows.
> The steps are as follows:
> spark-shell:
> {code:java}
> val df = spark.read.json("examples/src/main/resources/people.json");
> df.write.format("orc").saveAsTable("people_test");
> spark.sql("desc people_test").show()
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     age|   bigint|   null|
> |    name|   string|   null|
> +--------+---------+-------+
> {code}
> hive:
> {code:java}
> hive> desc people_test;
> OK
> age bigint 
> name string 
> Time taken: 0.454 seconds, Fetched: 2 row(s)
> hive> alter table people_test change column age age double;
> OK
> Time taken: 0.68 seconds
> hive> desc people_test;
> OK
> age double 
> name string 
> Time taken: 0.358 seconds, Fetched: 2 row(s){code}
> spark-shell:
> {code:java}
> spark.catalog.refreshTable("people_test")
> spark.sql("desc people_test").show()
> +--------+---------+-------+
> |col_name|data_type|comment|
> +--------+---------+-------+
> |     age|   bigint|   null|
> |    name|   string|   null|
> +--------+---------+-------+
> {code}
>  
> We also tested in spark-shell by creating a table using spark.sql("create 
> table XXX()"); in that case the modified column types are consistent.






[jira] [Commented] (SPARK-24882) data source v2 API improvement

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610029#comment-16610029
 ] 

Apache Spark commented on SPARK-24882:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22388

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
> We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API, isolate the stateful part of the API, and think of better 
> names for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Commented] (SPARK-25313) Fix regression in FileFormatWriter output schema

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610021#comment-16610021
 ] 

Apache Spark commented on SPARK-25313:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22387

> Fix regression in FileFormatWriter output schema
> 
>
> Key: SPARK-25313
> URL: https://issues.apache.org/jira/browse/SPARK-25313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> In the following example:
> val location = "/tmp/t"
> val df = spark.range(10).toDF("id")
> df.write.format("parquet").saveAsTable("tbl")
> spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
> spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet LOCATION '$location'")
> spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
> println(spark.read.parquet(location).schema)
> spark.table("tbl2").show()
> The output column name in the schema will be id instead of ID, so the last 
> query shows nothing from tbl2.
> By enabling the debug messages we can see that the output name is changed 
> from ID to id, and then the outputColumns in 
> InsertIntoHadoopFsRelationCommand is changed by RemoveRedundantAliases.
> To guarantee correctness, we should change the output columns from 
> `Seq[Attribute]` to `Seq[String]` to avoid their names being replaced by the 
> optimizer.






[jira] [Updated] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25184:

Fix Version/s: (was: 3.0.0)
   2.4.0

> Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
> ---
>
> Key: SPARK-25184
> URL: https://issues.apache.org/jira/browse/SPARK-25184
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 2.4.0
>
>
> {code}
> Assert on query failed: Check total state rows = List(1), updated state rows 
> = List(2): Array() did not equal List(1) incorrect total rows, recent 
> progresses:
> {
>   "id" : "3598002e-0120-4937-8a36-226e0af992b6",
>   "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55",
>   "name" : null,
>   "timestamp" : "1970-01-01T00:00:12.000Z",
>   "batchId" : 3,
>   "numInputRows" : 0,
>   "durationMs" : {
> "getEndOffset" : 0,
> "setOffsetRange" : 0,
> "triggerExecution" : 0
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : "MemoryStream[value#474622]",
> "startOffset" : 2,
> "endOffset" : 2,
> "numInputRows" : 0
>   } ],
>   "sink" : {
> "description" : "MemorySink"
>   }
> }
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55)
>   
> org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428)
>   
> org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762)
> == Progress ==
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: a
>AdvanceManualClock(1000)
>CheckNewAnswer: [a,1]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1000)
>CheckNewAnswer: [b,1]
>AssertOnQuery(, Check total state rows = List(2), updated state 
> rows = List(1))
>AddData to MemoryStream[value#474622]: b
>AdvanceManualClock(1)
>CheckNewAnswer: [a,-1],[b,2]
>AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>StopStream
>
> StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null)
>AddData to MemoryStream[value#474622]: c
>AdvanceManualClock(11000)
>CheckNewAnswer: [b,-1],[c,1]
> => AssertOnQuery(, Check total state rows = List(1), updated state 
> rows = List(2))
>AdvanceManualClock(12000)
>AssertOnQuery(, )
>AssertOnQuery(, name)
>CheckNewAnswer: [c,-1]
>AssertOnQuery(, Check total state rows = List(0), updated state 
> rows = List(0))
> == Stream ==
> Output Mode: Update
> Stream state: {MemoryStream[value#474622]: 3}
> Thread state: alive
> Thread stack trace: java.lang.Object.wait(Native Method)
> org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61)
> org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34)
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65)
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166)
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293)
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203)
> == Sink ==
> 0: [a,1]
> 1: [b,1]
> 2: [a,-1] [b,2]
> 3: [b,-1] [c,1]
> == Plan ==
> == Parsed Logical Plan ==
> SerializeFromObject [staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) 
> AS _1#474630, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, 
> StringType, 

[jira] [Commented] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609928#comment-16609928
 ] 

Hyukjin Kwon commented on SPARK-25397:
--

That's fixed in 
https://github.com/apache/spark/commit/71f38ac242157cbede684546159f2a27892ee09f 
[~josephkb]:

{code}
>>> spark.conf.get("myConf", False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/conf.py", line 54, in get
self._checkType(default, "default")
  File "/.../spark/python/pyspark/sql/conf.py", line 67, in _checkType
(identifier, obj, type(obj).__name__))
TypeError: expected default 'False' to be a string (was 'bool')
{code}

Just for clarification, that commit looks like it fixes the error message 
rather than the underlying bug, though.

> SparkSession.conf fails when given default value with Python 3
> --
>
> Key: SPARK-25397
> URL: https://issues.apache.org/jira/browse/SPARK-25397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Spark 2.3.1 has a Python 3 incompatibility when requesting a conf value from 
> SparkSession if you give a non-string default value. Reproduce via the 
> SparkSession call:
> {{spark.conf.get("myConf", False)}}
> This gives the error:
> {code}
> >>> spark.conf.get("myConf", False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 51, in get
> self._checkType(default, "default")
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 62, in _checkType
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> *NameError: name 'unicode' is not defined*
> {code}
> The offending line in Spark in branch-2.3 is: 
> https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py 
> which uses the value {{unicode}} which is not available in Python 3.






[jira] [Commented] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609924#comment-16609924
 ] 

Apache Spark commented on SPARK-25399:
--

User 'mukulmurthy' has created a pull request for this issue:
https://github.com/apache/spark/pull/22386

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Critical
>  Labels: correctness
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, which leads to wrong answers. This was caught by a job 
> running the StreamSuite tests, and it only repros occasionally, when the same 
> threads are reused.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.






[jira] [Assigned] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25399:


Assignee: (was: Apache Spark)

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Critical
>  Labels: correctness
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, which leads to wrong answers. This was caught by a job 
> running the StreamSuite tests, and it only repros occasionally, when the same 
> threads are reused.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.






[jira] [Assigned] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25399:


Assignee: Apache Spark

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, which leads to wrong answers. This was caught by a job 
> running the StreamSuite tests, and it only repros occasionally, when the same 
> threads are reused.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.






[jira] [Commented] (SPARK-25072) PySpark custom Row class can be given extra parameters

2018-09-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609920#comment-16609920
 ] 

Dongjoon Hyun commented on SPARK-25072:
---

[~bryanc] and [~smilegator] Since this is reverted from `branch-2.3`, I removed 
the fixed version 2.3.2.

> PySpark custom Row class can be given extra parameters
> --
>
> Key: SPARK-25072
> URL: https://issues.apache.org/jira/browse/SPARK-25072
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: {noformat}
> SPARK_MAJOR_VERSION is set to 2, using Spark2
> Python 3.4.5 (default, Dec 11 2017, 16:57:19)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 18/08/01 04:49:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/08/01 04:49:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 18/08/01 04:49:27 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.2.0
>   /_/
> Using Python version 3.4.5 (default, Dec 11 2017 16:57:19)
> SparkSession available as 'spark'.
> {noformat}
> {{CentOS release 6.9 (Final)}}
> {{Linux sandbox-hdp.hortonworks.com 4.14.0-1.el7.elrepo.x86_64 #1 SMP Sun Nov 
> 12 20:21:04 EST 2017 x86_64 x86_64 x86_64 GNU/Linux}}
> {noformat}openjdk version "1.8.0_161"
> OpenJDK Runtime Environment (build 1.8.0_161-b14)
> OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode){noformat}
>Reporter: Jan-Willem van der Sijp
>Assignee: Li Yuanjian
>Priority: Minor
> Fix For: 2.4.0, 3.0.0
>
>
> When a custom Row class is made in PySpark, it is possible to provide the 
> constructor of this class with more parameters than there are columns. These 
> extra parameters affect the value of the Row, but are not part of the 
> {{repr}} or {{str}} output, making it hard to debug errors due to these 
> "invisible" values. The hidden values can be accessed through integer-based 
> indexing though.
> Some examples:
> {code:python}
> In [69]: RowClass = Row("column1", "column2")
> In [70]: RowClass(1, 2) == RowClass(1, 2)
> Out[70]: True
> In [71]: RowClass(1, 2) == RowClass(1, 2, 3)
> Out[71]: False
> In [75]: RowClass(1, 2, 3)
> Out[75]: Row(column1=1, column2=2)
> In [76]: RowClass(1, 2)
> Out[76]: Row(column1=1, column2=2)
> In [77]: RowClass(1, 2, 3).asDict()
> Out[77]: {'column1': 1, 'column2': 2}
> In [78]: RowClass(1, 2, 3)[2]
> Out[78]: 3
> In [79]: repr(RowClass(1, 2, 3))
> Out[79]: 'Row(column1=1, column2=2)'
> In [80]: str(RowClass(1, 2, 3))
> Out[80]: 'Row(column1=1, column2=2)'
> {code}






[jira] [Updated] (SPARK-25072) PySpark custom Row class can be given extra parameters

2018-09-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25072:
--
Fix Version/s: (was: 2.3.2)

> PySpark custom Row class can be given extra parameters
> --
>
> Key: SPARK-25072
> URL: https://issues.apache.org/jira/browse/SPARK-25072
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: {noformat}
> SPARK_MAJOR_VERSION is set to 2, using Spark2
> Python 3.4.5 (default, Dec 11 2017, 16:57:19)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 18/08/01 04:49:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/08/01 04:49:17 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 18/08/01 04:49:27 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.2.0
>   /_/
> Using Python version 3.4.5 (default, Dec 11 2017 16:57:19)
> SparkSession available as 'spark'.
> {noformat}
> {{CentOS release 6.9 (Final)}}
> {{Linux sandbox-hdp.hortonworks.com 4.14.0-1.el7.elrepo.x86_64 #1 SMP Sun Nov 
> 12 20:21:04 EST 2017 x86_64 x86_64 x86_64 GNU/Linux}}
> {noformat}openjdk version "1.8.0_161"
> OpenJDK Runtime Environment (build 1.8.0_161-b14)
> OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode){noformat}
>Reporter: Jan-Willem van der Sijp
>Assignee: Li Yuanjian
>Priority: Minor
> Fix For: 2.4.0, 3.0.0
>
>
> When a custom Row class is made in PySpark, it is possible to provide the 
> constructor of this class with more parameters than there are columns. These 
> extra parameters affect the value of the Row, but are not part of the 
> {{repr}} or {{str}} output, making it hard to debug errors due to these 
> "invisible" values. The hidden values can be accessed through integer-based 
> indexing though.
> Some examples:
> {code:python}
> In [69]: RowClass = Row("column1", "column2")
> In [70]: RowClass(1, 2) == RowClass(1, 2)
> Out[70]: True
> In [71]: RowClass(1, 2) == RowClass(1, 2, 3)
> Out[71]: False
> In [75]: RowClass(1, 2, 3)
> Out[75]: Row(column1=1, column2=2)
> In [76]: RowClass(1, 2)
> Out[76]: Row(column1=1, column2=2)
> In [77]: RowClass(1, 2, 3).asDict()
> Out[77]: {'column1': 1, 'column2': 2}
> In [78]: RowClass(1, 2, 3)[2]
> Out[78]: 3
> In [79]: repr(RowClass(1, 2, 3))
> Out[79]: 'Row(column1=1, column2=2)'
> In [80]: str(RowClass(1, 2, 3))
> Out[80]: 'Row(column1=1, column2=2)'
> {code}






[jira] [Commented] (SPARK-23597) Audit Spark SQL code base for non-interpreted expressions

2018-09-10 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609918#comment-16609918
 ] 

Liang-Chi Hsieh commented on SPARK-23597:
-

At least, I didn't find any expressions in catalyst that do not provide an 
interpreted execution path now.



> Audit Spark SQL code base for non-interpreted expressions
> -
>
> Key: SPARK-23597
> URL: https://issues.apache.org/jira/browse/SPARK-23597
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>
> We want to eliminate expressions that do not provide an interpreted execution 
> path from the code base. The goal of this ticket is to check whether there 
> are any others besides the ones being addressed by SPARK-23580.






[jira] [Updated] (SPARK-13587) Support virtualenv in PySpark

2018-09-10 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13587:
-
Target Version/s: 3.0.0  (was: 2.4.0)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time consuming, 
> and not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is about bringing these 2 
> tools to the distributed environment.






[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-10 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609892#comment-16609892
 ] 

Reynold Xin commented on SPARK-25331:
-

Yes I would rely on idempotency here. Retries upon failure + idempotency = 
exactly once.

 

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and the resulting task sets 
> complete but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in the data being written twice.
> In other words, if the driver fails after the executors start processing the 
> job, the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # verify the file output - according to the {{Sink.addBatch}} documentation 
> the RDD should be written only once
> I have created a WIP PR with a unit test:
> https://github.com/apache/spark/pull/22331






[jira] [Assigned] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25400:


Assignee: (was: Apache Spark)

> Increase timeouts in schedulerIntegrationSuite
> --
>
> Key: SPARK-25400
> URL: https://issues.apache.org/jira/browse/SPARK-25400
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
>  it seems the timeout really is too short:
> {noformat}
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
> 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
> 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
> 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
> 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
> 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
> 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
> broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
> B, free: 1638.6 MB)
> {noformat}
> you'll see that the "mock backend thread" does keep making progress, but for 
> whatever reason there is over a one-second delay in the middle. That's 
> already going over the existing timeouts.
> It's possible there is something else going on here, but for now just 
> increasing the timeouts seems like the best next step.






[jira] [Commented] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609841#comment-16609841
 ] 

Apache Spark commented on SPARK-25400:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/22385

> Increase timeouts in schedulerIntegrationSuite
> --
>
> Key: SPARK-25400
> URL: https://issues.apache.org/jira/browse/SPARK-25400
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
>  it seems the timeout really is too short:
> {noformat}
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
> 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
> 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
> 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
> 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
> 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
> 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
> broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
> B, free: 1638.6 MB)
> {noformat}
> you'll see that the "mock backend thread" does keep making progress, but for 
> whatever reason there is over a one-second delay in the middle. That's 
> already going over the existing timeouts.
> It's possible there is something else going on here, but for now just 
> increasing the timeouts seems like the best next step.






[jira] [Assigned] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25400:


Assignee: Apache Spark

> Increase timeouts in schedulerIntegrationSuite
> --
>
> Key: SPARK-25400
> URL: https://issues.apache.org/jira/browse/SPARK-25400
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Major
>
> I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
>  it seems the timeout really is too short:
> {noformat}
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
> 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
> 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
> 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
> 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
> 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
> 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
> broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
> B, free: 1638.6 MB)
> {noformat}
> you'll see that the "mock backend thread" does keep making progress, but for 
> whatever reason there is over a one-second delay in the middle. That's 
> already going over the existing timeouts.
> It's possible there is something else going on here, but for now just 
> increasing the timeouts seems like the best next step.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609839#comment-16609839
 ] 

Apache Spark commented on SPARK-25400:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/22385

> Increase timeouts in schedulerIntegrationSuite
> --
>
> Key: SPARK-25400
> URL: https://issues.apache.org/jira/browse/SPARK-25400
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
>  it seems the timeout really is too short:
> {noformat}
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
> 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
> 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
> 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
> 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
> 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
> 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
> broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
> B, free: 1638.6 MB)
> {noformat}
> you'll see that the "mock backend thread" does keep making progress, but for 
> whatever reason there is over a one-second delay in the middle. That's 
> already going over the existing timeouts.
> It's possible there is something else going on here, but for now just 
> increasing the timeouts seems like the best next step.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-10 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-25400:


Assignee: (was: Imran Rashid)

> Increase timeouts in schedulerIntegrationSuite
> --
>
> Key: SPARK-25400
> URL: https://issues.apache.org/jira/browse/SPARK-25400
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
>  it seems the timeout really is too short:
> {noformat}
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
> 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
> 18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
> 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
> 18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
> 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
> 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
> 18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
> 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
> PROCESS_LOCAL, 7677 bytes)
> 18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
> broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
> B, free: 1638.6 MB)
> {noformat}
> you'll see that the "mock backend thread" does keep making progress, but for 
> whatever reason there is over a one-second delay in the middle. That's 
> already going over the existing timeouts.
> It's possible there is something else going on here, but for now just 
> increasing the timeouts seems like the best next step.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25400) Increase timeouts in schedulerIntegrationSuite

2018-09-10 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-25400:


 Summary: Increase timeouts in schedulerIntegrationSuite
 Key: SPARK-25400
 URL: https://issues.apache.org/jira/browse/SPARK-25400
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.0
Reporter: Imran Rashid
Assignee: Imran Rashid


I just took a look at a flaky failure in {{SchedulerIntegrationSuite}} 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887
 it seems the timeout really is too short:

{noformat}
18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, 
PROCESS_LOCAL, 7677 bytes)
18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 
3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 
4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 
6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, 
PROCESS_LOCAL, 7677 bytes)
18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 
5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 
7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, 
PROCESS_LOCAL, 7677 bytes)
18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed 
broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 
B, free: 1638.6 MB)
{noformat}

you'll see that the "mock backend thread" does keep making progress, but for 
whatever reason there is over a one-second delay in the middle. That's already 
going over the existing timeouts.

It's possible there is something else going on here, but for now just increasing 
the timeouts seems like the best next step.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25399:

Priority: Critical  (was: Blocker)

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Critical
>  Labels: correctness
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, producing wrong answers. This was caught by a job 
> running the StreamSuite tests, and only repros occasionally when the same 
> threads are used.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25399:

Labels: correctness  (was: )

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>  Labels: correctness
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, producing wrong answers. This was caught by a job 
> running the StreamSuite tests, and only repros occasionally when the same 
> threads are used.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25399:

Priority: Blocker  (was: Major)

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Blocker
>  Labels: correctness
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, producing wrong answers. This was caught by a job 
> running the StreamSuite tests, and only repros occasionally when the same 
> threads are used.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Mukul Murthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mukul Murthy updated SPARK-25399:
-
Priority: Major  (was: Blocker)

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, producing wrong answers. This was caught by a job 
> running the StreamSuite tests, and only repros occasionally when the same 
> threads are used.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Mukul Murthy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609819#comment-16609819
 ] 

Mukul Murthy commented on SPARK-25399:
--

cc [~joseph.torres] and [~tdas]

> Reusing execution threads from continuous processing for microbatch streaming 
> can result in correctness issues
> --
>
> Key: SPARK-25399
> URL: https://issues.apache.org/jira/browse/SPARK-25399
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Priority: Blocker
>
> Continuous processing sets some thread local variables that, when read by a 
> thread running a microbatch stream, may result in incorrect or no previous 
> state being read, producing wrong answers. This was caught by a job 
> running the StreamSuite tests, and only repros occasionally when the same 
> threads are used.
> The issue is in StateStoreRDD.compute - when we compute currentVersion, we 
> read from a thread local variable which is set by continuous processing 
> threads. If this value is set, we then think we're on the wrong state version.
> I imagine very few people, if any, would run into this bug, because you'd 
> have to use continuous processing and then microbatch processing in the same 
> cluster. However, it can result in silent correctness issues, and it would be 
> very difficult for someone to tell if they were impacted by this or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25399) Reusing execution threads from continuous processing for microbatch streaming can result in correctness issues

2018-09-10 Thread Mukul Murthy (JIRA)
Mukul Murthy created SPARK-25399:


 Summary: Reusing execution threads from continuous processing for 
microbatch streaming can result in correctness issues
 Key: SPARK-25399
 URL: https://issues.apache.org/jira/browse/SPARK-25399
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Mukul Murthy


Continuous processing sets some thread local variables that, when read by a 
thread running a microbatch stream, may result in incorrect or no previous 
state being read, producing wrong answers. This was caught by a job 
running the StreamSuite tests, and only repros occasionally when the same 
threads are used.

The issue is in StateStoreRDD.compute - when we compute currentVersion, we read 
from a thread local variable which is set by continuous processing threads. If 
this value is set, we then think we're on the wrong state version.

I imagine very few people, if any, would run into this bug, because you'd have 
to use continuous processing and then microbatch processing in the same 
cluster. However, it can result in silent correctness issues, and it would be 
very difficult for someone to tell if they were impacted by this or not.
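
For illustration only (a self-contained sketch of the hazard, not Spark's actual 
code): when a pool reuses a thread, a ThreadLocal set by one kind of task leaks 
into the next task that lands on that thread.

{code}
import java.util.concurrent.Executors

object ThreadLocalLeakDemo {
  // Stands in for the per-thread flag a continuous-processing task would set.
  val epochMarker = new ThreadLocal[String]()

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1) // a single thread guarantees reuse

    // "Continuous" task sets the thread-local and never clears it.
    pool.submit(new Runnable {
      def run(): Unit = epochMarker.set("continuous-epoch-42")
    }).get()

    // "Microbatch" task runs on the same thread and sees stale state it never set.
    pool.submit(new Runnable {
      def run(): Unit = println(s"leaked value: ${epochMarker.get()}") // continuous-epoch-42
    }).get()

    pool.shutdown()
  }
}
{code}

Clearing the thread-local in a finally block, or keying the state by query/epoch 
rather than by thread, avoids the leak.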



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23580) Interpreted mode fallback should be implemented for all expressions & projections

2018-09-10 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609787#comment-16609787
 ] 

Reynold Xin commented on SPARK-23580:
-

90% or 100%?

 

> Interpreted mode fallback should be implemented for all expressions & 
> projections
> -
>
> Key: SPARK-23580
> URL: https://issues.apache.org/jira/browse/SPARK-23580
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>  Labels: release-notes
>
> Spark SQL currently does not support interpreted mode for all expressions and 
> projections. This is a problem for scenarios where code generation does 
> not work, or blows past the JVM class limits. We currently cannot gracefully 
> fall back.
> This ticket is an umbrella to fix this class of problem in Spark SQL. This 
> work can be divided into two main areas:
> - Add interpreted versions for all dataset related expressions.
> - Add an interpreted version of {{GenerateUnsafeProjection}}.
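
As a rough illustration of the fallback pattern this umbrella asks for 
(hypothetical names, not Spark's internal API): try the code-generated 
implementation first and degrade to an interpreted one when compilation fails.

{code}
// Hedged sketch of a codegen-with-interpreted-fallback factory (names are made up).
trait Projection { def apply(row: Seq[Any]): Seq[Any] }

object ProjectionFactory {
  // Pretend this path runs Janino and can fail, e.g. by blowing past JVM method limits.
  private def createCodegen(exprs: Seq[String]): Projection =
    throw new UnsupportedOperationException("compilation failed")

  // Slower but always available: evaluate the expressions one by one.
  private def createInterpreted(exprs: Seq[String]): Projection =
    new Projection { def apply(row: Seq[Any]): Seq[Any] = row }

  // The graceful fallback: prefer codegen, but never fail the query over it.
  def create(exprs: Seq[String]): Projection =
    try createCodegen(exprs)
    catch { case _: Exception => createInterpreted(exprs) }
}
{code}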



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25398) Minor bugs from comparing unrelated types

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609730#comment-16609730
 ] 

Apache Spark commented on SPARK-25398:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22384

> Minor bugs from comparing unrelated types
> -
>
> Key: SPARK-25398
> URL: https://issues.apache.org/jira/browse/SPARK-25398
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> I noticed a potential issue from Scala inspections, like this clause in 
> LiveEntity.scala around line 586:
> {code:java}
>  (!acc.metadata.isDefined ||
>   acc.metadata.get != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER)){code}
> The issue is that acc.metadata is Option[String], so can't equal 
> Some[String]. This was just meant to be:
> {code:java}
>  acc.metadata != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER){code}
> This may or may not actually cause a bug, but seems worth fixing. And then 
> there are a number of other ones like this, mostly in tests, that might 
> likewise mask real assertion problems.
> Many are, interestingly, flagging items like this on a Seq[String]:
> {code:java}
> .filter(_.getFoo.equals("foo")){code}
> It complains that Any => Any is compared to String. Either it's wrong, or 
> somehow, this is parsed as (_.getFoo).equals("foo"). In any event, easy 
> enough to write this more clearly as:
> {code:java}
> .filter(_.getFoo == "foo"){code}
> And so on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25398) Minor bugs from comparing unrelated types

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25398:


Assignee: Sean Owen  (was: Apache Spark)

> Minor bugs from comparing unrelated types
> -
>
> Key: SPARK-25398
> URL: https://issues.apache.org/jira/browse/SPARK-25398
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> I noticed a potential issue from Scala inspections, like this clause in 
> LiveEntity.scala around line 586:
> {code:java}
>  (!acc.metadata.isDefined ||
>   acc.metadata.get != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER)){code}
> The issue is that acc.metadata is Option[String], so can't equal 
> Some[String]. This was just meant to be:
> {code:java}
>  acc.metadata != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER){code}
> This may or may not actually cause a bug, but seems worth fixing. And then 
> there are a number of other ones like this, mostly in tests, that might 
> likewise mask real assertion problems.
> Many are, interestingly, flagging items like this on a Seq[String]:
> {code:java}
> .filter(_.getFoo.equals("foo")){code}
> It complains that Any => Any is compared to String. Either it's wrong, or 
> somehow, this is parsed as (_.getFoo).equals("foo"). In any event, easy 
> enough to write this more clearly as:
> {code:java}
> .filter(_.getFoo == "foo"){code}
> And so on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25398) Minor bugs from comparing unrelated types

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25398:


Assignee: Apache Spark  (was: Sean Owen)

> Minor bugs from comparing unrelated types
> -
>
> Key: SPARK-25398
> URL: https://issues.apache.org/jira/browse/SPARK-25398
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.1
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> I noticed a potential issue from Scala inspections, like this clause in 
> LiveEntity.scala around line 586:
> {code:java}
>  (!acc.metadata.isDefined ||
>   acc.metadata.get != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER)){code}
> The issue is that acc.metadata is Option[String], so can't equal 
> Some[String]. This was just meant to be:
> {code:java}
>  acc.metadata != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER){code}
> This may or may not actually cause a bug, but seems worth fixing. And then 
> there are a number of other ones like this, mostly in tests, that might 
> likewise mask real assertion problems.
> Many are, interestingly, flagging items like this on a Seq[String]:
> {code:java}
> .filter(_.getFoo.equals("foo")){code}
> It complains that Any => Any is compared to String. Either it's wrong, or 
> somehow, this is parsed as (_.getFoo).equals("foo"). In any event, easy 
> enough to write this more clearly as:
> {code:java}
> .filter(_.getFoo == "foo"){code}
> And so on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25398) Minor bugs from comparing unrelated types

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609728#comment-16609728
 ] 

Apache Spark commented on SPARK-25398:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22384

> Minor bugs from comparing unrelated types
> -
>
> Key: SPARK-25398
> URL: https://issues.apache.org/jira/browse/SPARK-25398
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.3.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> I noticed a potential issue from Scala inspections, like this clause in 
> LiveEntity.scala around line 586:
> {code:java}
>  (!acc.metadata.isDefined ||
>   acc.metadata.get != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER)){code}
> The issue is that acc.metadata is Option[String], so can't equal 
> Some[String]. This was just meant to be:
> {code:java}
>  acc.metadata != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER){code}
> This may or may not actually cause a bug, but seems worth fixing. And then 
> there are a number of other ones like this, mostly in tests, that might 
> likewise mask real assertion problems.
> Many are, interestingly, flagging items like this on a Seq[String]:
> {code:java}
> .filter(_.getFoo.equals("foo")){code}
> It complains that Any => Any is compared to String. Either it's wrong, or 
> somehow, this is parsed as (_.getFoo).equals("foo"). In any event, easy 
> enough to write this more clearly as:
> {code:java}
> .filter(_.getFoo == "foo"){code}
> And so on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25398) Minor bugs from comparing unrelated types

2018-09-10 Thread Sean Owen (JIRA)
Sean Owen created SPARK-25398:
-

 Summary: Minor bugs from comparing unrelated types
 Key: SPARK-25398
 URL: https://issues.apache.org/jira/browse/SPARK-25398
 Project: Spark
  Issue Type: Bug
  Components: Mesos, Spark Core, YARN
Affects Versions: 2.3.1
Reporter: Sean Owen
Assignee: Sean Owen


I noticed a potential issue from Scala inspections, like this clause in 
LiveEntity.scala around line 586:
{code:java}
 (!acc.metadata.isDefined ||
  acc.metadata.get != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER)){code}
The issue is that acc.metadata is Option[String], so can't equal Some[String]. 
This was just meant to be:
{code:java}
 acc.metadata != Some(AccumulatorContext.SQL_ACCUM_IDENTIFIER){code}
This may or may not actually cause a bug, but seems worth fixing. And then 
there are a number of other ones like this, mostly in tests, that might 
likewise mask real assertion problems.

Many are, interestingly, flagging items like this on a Seq[String]:
{code:java}
.filter(_.getFoo.equals("foo")){code}
It complains that Any => Any is compared to String. Either it's wrong, or 
somehow, this is parsed as (_.getFoo).equals("foo"). In any event, easy enough 
to write this more clearly as:
{code:java}
.filter(_.getFoo == "foo"){code}
And so on.
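
A small Scala sketch of why the original clause is vacuous (illustration only; 
the literal "sql" stands in for AccumulatorContext.SQL_ACCUM_IDENTIFIER):

{code}
val metadata: Option[String] = Some("sql")

// Compares a String to a Some[String]: unrelated types, so this is always true
// and the surrounding filter silently stops excluding anything.
val buggy = !metadata.isDefined || metadata.get != Some("sql")   // true

// Compares Option[String] to Option[String]: this is what was intended.
val fixed = metadata != Some("sql")                              // false
{code}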



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-09-10 Thread Dmitry Zanozin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609708#comment-16609708
 ] 

Dmitry Zanozin edited comment on SPARK-23986 at 9/10/18 7:47 PM:
-

Spark 2.3.1 still generates methods with duplicate parameter names. I've just 
got this method (which obviously failed with the following exception: "{{ERROR 
CodeGenerator:91 - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
686, Column 28: Redefinition of parameter "agg_expr_21"}}" ):
{code}
/* 686 */
private void agg_doConsume1(byte agg_expr_01, boolean agg_exprIsNull_01,
short agg_expr_11, boolean agg_exprIsNull_11,
short agg_expr_21, boolean agg_exprIsNull_21,
int agg_expr_31, boolean agg_exprIsNull_31,
int agg_expr_41, boolean agg_exprIsNull_41,
int agg_expr_51, boolean agg_exprIsNull_51,
UTF8String agg_expr_61, boolean agg_exprIsNull_61,
byte agg_expr_71, boolean agg_exprIsNull_71,
long agg_expr_81, boolean agg_exprIsNull_81,
double agg_expr_91, boolean agg_exprIsNull_91,
long agg_expr_101, boolean agg_exprIsNull_101,
double agg_expr_111, boolean agg_exprIsNull_111,
long agg_expr_121, boolean agg_exprIsNull_121,
int agg_expr_131, boolean agg_exprIsNull_131,
long agg_expr_141, boolean agg_exprIsNull_141,
int agg_expr_151, boolean agg_exprIsNull_151,
boolean agg_expr_161, boolean agg_exprIsNull_161,
long agg_expr_171,
byte agg_expr_18, boolean agg_exprIsNull_18,
boolean agg_expr_19, boolean agg_exprIsNull_19,
byte agg_expr_20, boolean agg_exprIsNull_20,
boolean agg_expr_21, boolean agg_exprIsNull_21,
short agg_expr_22, boolean agg_exprIsNull_22,
int agg_expr_23, boolean agg_exprIsNull_23) throws 
java.io.IOException {
{code}


was (Author: dzanozin):
Spark 2.3.1 still generates methods with duplicate parameter names. I've just 
got this method (which obviously failed with the following exception: "{{ERROR 
CodeGenerator:91 - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
686, Column 28: Redefinition of parameter "agg_expr_21"}}"):

{code}
/* 686 */
private void agg_doConsume1(byte agg_expr_01, boolean agg_exprIsNull_01,
    short agg_expr_11, boolean agg_exprIsNull_11,
    short agg_expr_21, boolean agg_exprIsNull_21,
    int agg_expr_31, boolean agg_exprIsNull_31,
    int agg_expr_41, boolean agg_exprIsNull_41,
    int agg_expr_51, boolean agg_exprIsNull_51,
    UTF8String agg_expr_61, boolean agg_exprIsNull_61,
    byte agg_expr_71, boolean agg_exprIsNull_71,
    long agg_expr_81, boolean agg_exprIsNull_81,
    double agg_expr_91, boolean agg_exprIsNull_91,
    long agg_expr_101, boolean agg_exprIsNull_101,
    double agg_expr_111, boolean agg_exprIsNull_111,
    long agg_expr_121, boolean agg_exprIsNull_121,
    int agg_expr_131, boolean agg_exprIsNull_131,
    long agg_expr_141, boolean agg_exprIsNull_141,
    int agg_expr_151, boolean agg_exprIsNull_151,
    boolean agg_expr_161, boolean agg_exprIsNull_161,
    long agg_expr_171,
    byte agg_expr_18, boolean agg_exprIsNul

[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-09-10 Thread Dmitry Zanozin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609708#comment-16609708
 ] 

Dmitry Zanozin commented on SPARK-23986:


Spark 2.3.1 still generates methods with duplicate parameter names. I've just 
got this method (which obviously failed with the following exception: "{{ERROR 
CodeGenerator:91 - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
686, Column 28: Redefinition of parameter "agg_expr_21"}}"):

{code}
/* 686 */
private void agg_doConsume1(byte agg_expr_01, boolean agg_exprIsNull_01,
    short agg_expr_11, boolean agg_exprIsNull_11,
    short agg_expr_21, boolean agg_exprIsNull_21,
    int agg_expr_31, boolean agg_exprIsNull_31,
    int agg_expr_41, boolean agg_exprIsNull_41,
    int agg_expr_51, boolean agg_exprIsNull_51,
    UTF8String agg_expr_61, boolean agg_exprIsNull_61,
    byte agg_expr_71, boolean agg_exprIsNull_71,
    long agg_expr_81, boolean agg_exprIsNull_81,
    double agg_expr_91, boolean agg_exprIsNull_91,
    long agg_expr_101, boolean agg_exprIsNull_101,
    double agg_expr_111, boolean agg_exprIsNull_111,
    long agg_expr_121, boolean agg_exprIsNull_121,
    int agg_expr_131, boolean agg_exprIsNull_131,
    long agg_expr_141, boolean agg_exprIsNull_141,
    int agg_expr_151, boolean agg_exprIsNull_151,
    boolean agg_expr_161, boolean agg_exprIsNull_161,
    long agg_expr_171,
    byte agg_expr_18, boolean agg_exprIsNull_18,
    boolean agg_expr_19, boolean agg_exprIsNull_19,
    byte agg_expr_20, boolean agg_exprIsNull_20,
    boolean agg_expr_21, boolean agg_exprIsNull_21,
    short agg_expr_22, boolean agg_exprIsNull_22,
    int agg_expr_23, boolean agg_exprIsNull_23) throws java.io.IOException {
{code}

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a spark expert but after investigation, I realized that th

[jira] [Resolved] (SPARK-23672) Document Support returning lists in Arrow UDFs

2018-09-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23672.
--
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 20908
[https://github.com/apache/spark/pull/20908]

> Document Support returning lists in Arrow UDFs
> --
>
> Key: SPARK-23672
> URL: https://issues.apache.org/jira/browse/SPARK-23672
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 3.0.0, 2.4.0
>
>
> Documenting the support for returning lists for individual inputs on 
> non-grouped data inside of PySpark UDFs to better support the wordcount 
> example (and other things but wordcount is the simplest I can think of).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23672) Document Support returning lists in Arrow UDFs

2018-09-10 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23672:


Assignee: holdenk

> Document Support returning lists in Arrow UDFs
> --
>
> Key: SPARK-23672
> URL: https://issues.apache.org/jira/browse/SPARK-23672
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> Documenting the support for returning lists for individual inputs on 
> non-grouped data inside of PySpark UDFs to better support the wordcount 
> example (and other things but wordcount is the simplest I can think of).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark

2018-09-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609638#comment-16609638
 ] 

Dongjoon Hyun edited comment on SPARK-12417 at 9/10/18 6:23 PM:


This is fixed since 2.0.0.
{code}
scala> spark.version
res0: String = 2.0.0

scala> Seq((1,2)).toDF("a", "b").write.option("orc.bloom.filter.columns", 
"*").orc("/tmp/orc200")

$ hive --orcfiledump 
/tmp/orc200/part-r-7-d36ca145-1e23-4d3a-ba99-09506e4ed8cc.snappy.orc
...
Stripes:
  Stripe: offset: 3 data: 12 rows: 1 tail: 92 index: 1390
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 0 section BLOOM_FILTER start: 14 length 426
Stream: column 1 section ROW_INDEX start: 440 length 24
Stream: column 1 section BLOOM_FILTER start: 464 length 456
Stream: column 2 section ROW_INDEX start: 920 length 24
Stream: column 2 section BLOOM_FILTER start: 944 length 449
Stream: column 1 section DATA start: 1393 length 6
Stream: column 2 section DATA start: 1399 length 6
...
{code}


was (Author: dongjoon):
This is fixed since 2.0.0.
{code}
scala> spark.version
res0: String = 2.0.0

scala> Seq((1,2)).toDF("a", "b").write.option("orc.bloom.filter.columns", 
"*").orc("/tmp/orc200")
{code}
$ hive --orcfiledump 
/tmp/orc200/part-r-7-d36ca145-1e23-4d3a-ba99-09506e4ed8cc.snappy.orc
...
Stripes:
  Stripe: offset: 3 data: 12 rows: 1 tail: 92 index: 1390
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 0 section BLOOM_FILTER start: 14 length 426
Stream: column 1 section ROW_INDEX start: 440 length 24
Stream: column 1 section BLOOM_FILTER start: 464 length 456
Stream: column 2 section ROW_INDEX start: 920 length 24
Stream: column 2 section BLOOM_FILTER start: 944 length 449
Stream: column 1 section DATA start: 1393 length 6
Stream: column 2 section DATA start: 1399 length 6
...
{code}

> Orc bloom filter options are not propagated during file write in spark
> --
>
> Key: SPARK-12417
> URL: https://issues.apache.org/jira/browse/SPARK-12417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: SPARK-12417.1.patch
>
>
> ORC bloom filter is supported by the version of hive used in Spark 1.5.2. 
> However, when trying to create orc file with bloom filter option, it does not 
> make use of it.
> E.g, following orc output does not create the bloom filter even though the 
> options are specified.
> {noformat}
> Map orcOption = new HashMap();
> orcOption.put("orc.bloom.filter.columns", "*");
> hiveContext.sql("select * from accounts where 
> effective_date='2015-12-30'").write().
> format("orc").options(orcOption).save("/tmp/accounts");
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12417) Orc bloom filter options are not propagated during file write in spark

2018-09-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-12417.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

This is fixed since 2.0.0.
{code}
scala> spark.version
res0: String = 2.0.0

scala> Seq((1,2)).toDF("a", "b").write.option("orc.bloom.filter.columns", 
"*").orc("/tmp/orc200")

$ hive --orcfiledump 
/tmp/orc200/part-r-7-d36ca145-1e23-4d3a-ba99-09506e4ed8cc.snappy.orc
...
Stripes:
  Stripe: offset: 3 data: 12 rows: 1 tail: 92 index: 1390
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 0 section BLOOM_FILTER start: 14 length 426
Stream: column 1 section ROW_INDEX start: 440 length 24
Stream: column 1 section BLOOM_FILTER start: 464 length 456
Stream: column 2 section ROW_INDEX start: 920 length 24
Stream: column 2 section BLOOM_FILTER start: 944 length 449
Stream: column 1 section DATA start: 1393 length 6
Stream: column 2 section DATA start: 1399 length 6
...
{code}

> Orc bloom filter options are not propagated during file write in spark
> --
>
> Key: SPARK-12417
> URL: https://issues.apache.org/jira/browse/SPARK-12417
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: SPARK-12417.1.patch
>
>
> ORC bloom filter is supported by the version of hive used in Spark 1.5.2. 
> However, when trying to create orc file with bloom filter option, it does not 
> make use of it.
> E.g, following orc output does not create the bloom filter even though the 
> options are specified.
> {noformat}
> Map orcOption = new HashMap();
> orcOption.put("orc.bloom.filter.columns", "*");
> hiveContext.sql("select * from accounts where 
> effective_date='2015-12-30'").write().
> format("orc").options(orcOption).save("/tmp/accounts");
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-09-10 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23425:
--
Docs Text: 
Release notes:

Wildcard symbols {{*}} and {{?}} can now be used in SQL paths when loading 
data, e.g.:

LOAD DATA INPATH 'hdfs://hacluster/user/ext*'
LOAD DATA INPATH 'hdfs://hacluster/user/???/data'

Where these characters are used literally in paths, they must be escaped with a 
backslash.

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Assignee: Sujith
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using wild 
> card strings like * is not working.
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> Getting an Analysis exception while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609605#comment-16609605
 ] 

Maxim Gekk commented on SPARK-25396:


I have a concern regarding when I should close the Jackson parser. For now it is 
closed before returning the result from the parse method here: 
[https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L394-L404]
 . If I return an *Iterator[InternalRow]* instead of *Seq[InternalRow]*, I 
have to postpone closing the Jackson parser at least until the end of the current 
task, right? ... but that is bad for per-line mode because it could leave a 
lot of JSON parsers open. It seems the implementations for multiLine and 
per-line mode should be different.
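
A minimal sketch of one way to defer closing until the iterator is exhausted 
(hypothetical helper, not Spark's actual code); note that in per-line mode every 
record would still get its own parser, which is exactly the concern above:

{code}
// Wraps an iterator of parsed rows and closes the underlying resource
// (e.g. a Jackson JsonParser) once the last element has been consumed.
def closeAfterUse[T](resource: java.io.Closeable, rows: Iterator[T]): Iterator[T] =
  new Iterator[T] {
    private var closed = false
    private def closeOnce(): Unit = if (!closed) { closed = true; resource.close() }

    override def hasNext: Boolean = {
      val more = rows.hasNext
      if (!more) closeOnce() // safe point: nothing left to read from the parser
      more
    }
    override def next(): T = rows.next()
  }
{code}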

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
>   ...
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory in both cases, whether the schema is `ArrayType` or 
> `StructType`. This can lead to unnecessary memory consumption and even to OOM 
> for big JSON files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, JSON objects of an array can be read via an Iterator. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-09-10 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609595#comment-16609595
 ] 

Shixiong Zhu commented on SPARK-23425:
--

Added "release-note" label.

Previously, when INPATH contained special characters (such as " "), the user had 
to manually escape them, e.g., use "/a/b/foo%20bar" rather than "/a/b/foo bar", 
because the latter would throw "URISyntaxException: Illegal character in path at 
index XX: /a/b/foo bar".

After this patch, the above workaround will throw "AnalysisException: LOAD DATA 
input path does not exist: /a/b/foo%20bar;".

The root cause is that we changed from "new URI(user_specified_path)" to "new 
Path(user_specified_path)". I believe this patch is indeed a bug fix, but it's 
worth highlighting in the release note.

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Assignee: Sujith
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using wild 
> card strings like * is not working.
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> Getting an Analysis exception while executing this query:
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-10 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609593#comment-16609593
 ] 

Joseph K. Bradley commented on SPARK-25397:
---

CC [~smilegator], [~cloud_fan] for visibility

> SparkSession.conf fails when given default value with Python 3
> --
>
> Key: SPARK-25397
> URL: https://issues.apache.org/jira/browse/SPARK-25397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Spark 2.3.1 has a Python 3 incompatibility when requesting a Conf value from 
> SparkSession with a non-string default value. Reproduce via a SparkSession 
> call:
> {{spark.conf.get("myConf", False)}}
> This gives the error:
> {code}
> >>> spark.conf.get("myConf", False)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 51, in get
> self._checkType(default, "default")
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 62, in _checkType
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> *NameError: name 'unicode' is not defined*
> {code}
> The offending line in Spark in branch-2.3 is: 
> https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py 
> which uses the value {{unicode}} which is not available in Python 3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25395) Remove Spark Optional Java API

2018-09-10 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25395.

Resolution: Duplicate

> Remove Spark Optional Java API
> --
>
> Key: SPARK-25395
> URL: https://issues.apache.org/jira/browse/SPARK-25395
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Mario Molina
>Priority: Minor
>
> Previous Spark versions didn't require Java 8 and an ``Optional`` Spark Java 
> API had to be  implemented to support optional values.
> Since Spark 2.4 uses Java 8, the ``Optional`` Spark Java API should be 
> removed so that Spark uses the original Java API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-10 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25397:
--
Priority: Minor  (was: Major)

> SparkSession.conf fails when given default value with Python 3
> --
>
> Key: SPARK-25397
> URL: https://issues.apache.org/jira/browse/SPARK-25397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Spark 2.3.1 has a Python 3 incompatibility when requesting a Conf value from 
> SparkSession with a non-string default value. Reproduce via a SparkSession 
> call:
> {{spark.conf.get("myConf", False)}}
> This gives the error:
> {code}
> >>> spark.conf.get("myConf", False)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 51, in get
> self._checkType(default, "default")
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 62, in _checkType
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> *NameError: name 'unicode' is not defined*
> {code}
> The offending line in branch-2.3 is in 
> https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py 
> which refers to the name {{unicode}}, which does not exist in Python 3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-09-10 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25091.

Resolution: Duplicate

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png, 4.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, the Storage tab shows the cached table removed, but the 
> Executors tab shows the executors still holding the RDD and the memory is not 
> cleared. This results in a large waste of executor memory. When we then call 
> CACHE TABLE again, we run into issues where the cached tables are spilled to 
> disk instead of reusing the storage memory that should have been reclaimed. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().
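 
For reference, a minimal PySpark sketch of the same sequence through the catalog API (illustrative only; the table name is taken from the report above, `spark` is an active SparkSession, and the executor-memory symptom itself has to be checked in the UI):
{code:python}
# Cache and uncache the table from the report, then check what the catalog
# believes. Per the report, the executor storage memory shown in the UI stays
# allocated even though isCached() returns False afterwards.
spark.sql("CACHE TABLE test.test_cache")
print(spark.catalog.isCached("test.test_cache"))   # True

spark.sql("UNCACHE TABLE test.test_cache")
print(spark.catalog.isCached("test.test_cache"))   # False, yet executors keep the memory
{code}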



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-09-10 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23425:
-
Labels: release-notes  (was: release)

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Assignee: Sujith
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
> Attachments: wildcard_issue.PNG
>
>
> The load data command is not working when loading data from non-local file 
> paths that use wildcard strings like *,
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> An AnalysisException is thrown while executing this query.
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-25397:
-

 Summary: SparkSession.conf fails when given default value with 
Python 3
 Key: SPARK-25397
 URL: https://issues.apache.org/jira/browse/SPARK-25397
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Joseph K. Bradley


Spark 2.3.1 has a Python 3 incompatibility when requesting a conf value from 
SparkSession with a non-string default value. Reproduce via the SparkSession 
call:
{{spark.conf.get("myConf", False)}}

This gives the error:
{code}
>>> spark.conf.get("myConf", False)
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
 line 51, in get
self._checkType(default, "default")
  File 
"/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
 line 62, in _checkType
if not isinstance(obj, str) and not isinstance(obj, unicode):
*NameError: name 'unicode' is not defined*
{code}

The offending line in branch-2.3 is in 
https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py 
which refers to the name {{unicode}}, which does not exist in Python 3.
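 
A minimal sketch of a version-aware check that would avoid the {{NameError}} (an illustration, not the actual patch; the helper name is made up):
{code:python}
import sys

# On Python 2, strings can be `str` or `unicode`; on Python 3 only `str` exists.
if sys.version_info >= (3,):
    _string_types = (str,)
else:
    _string_types = (str, unicode)  # noqa: F821 -- only evaluated on Python 2

def _check_default_is_string(obj, identifier):
    """Illustrative stand-in for the default-type check in pyspark.sql.conf."""
    if not isinstance(obj, _string_types):
        raise TypeError("expected %s '%s' to be a string" % (identifier, obj))
{code}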



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23425) load data for hdfs file path with wild card usage is not working properly

2018-09-10 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23425:
-
Labels: release  (was: )

> load data for hdfs file path with wild card usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Assignee: Sujith
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
> Attachments: wildcard_issue.PNG
>
>
> The load data command is not working when loading data from non-local file 
> paths that use wildcard strings like *,
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> An AnalysisException is thrown while executing this query.
> !image-2018-02-14-23-41-39-923.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-09-10 Thread Babulal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609534#comment-16609534
 ] 

Babulal commented on SPARK-25332:
-

Hi [~maropu] 

it seemed to be a straightforward issue, so I raised it directly. The issue 
happens because the relation size is correct within the same session, but after 
restarting the application the HadoopFsRelation size falls back to the default 
relation size (spark.sql.defaultSizeInBytes, which is Long.MaxValue), which is 
why SortMergeJoin is chosen instead of a broadcast join.
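 
If that analysis is right, two hedged workarounds come to mind (illustrative PySpark, not from this ticket; table names are taken from the report): persist table-level statistics so a fresh session sees realistic sizes, or force the build side explicitly with a broadcast hint:
{code:python}
from pyspark.sql.functions import broadcast

# Option 1: persist table-level statistics so a restarted session does not fall
# back to spark.sql.defaultSizeInBytes when estimating the relation size.
spark.sql("ANALYZE TABLE x1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE x2 COMPUTE STATISTICS")

# Option 2: bypass the size estimate entirely with an explicit broadcast hint.
t1 = spark.table("x1").alias("t1")
t2 = spark.table("x2").alias("t2")
t1.join(broadcast(t2), t1["name"] == t2["name"]).explain()
{code}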

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell / spark-submit or restart the JDBC server again and 
> run the same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With a datasource table it works fine (i.e. create table ... using parquet 
> instead of ... stored as parquet).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-09-10 Thread Peter Knight (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609495#comment-16609495
 ] 

Peter Knight commented on SPARK-21542:
--

It would be really helpful to have some example code on how to use these.

I have tried: 
{code}
from pyspark.ml import Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class MedianTrend(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    # code here to define Params and transform

# instantiate it
mt1 = MedianTrend(inputColList = ["v1"], outputColList = ["v1_trend_no_reset"], 
sortCol = "date")

# then save it
path1 = "test_MedianTrend" 
mt1.write().overwrite().save(path1)

# then load it
mt1_loaded = mt1.load(path1)
df2 = mt1_loaded.transform(df)
df2.show()
{code}
This gives the following error:
{noformat}
'module' object has no attribute 'MedianTrend'{noformat}
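 
For what it's worth, here is a hedged, minimal sketch of a persistable custom transformer (the class and param names are made up, not from this ticket). One likely cause of the error above is that the reader resolves the class from the module path stored in the saved metadata, so the class has to live in an importable module rather than only in {{__main__}} / a notebook cell:
{code:python}
# my_transformers.py -- must be importable on the PYTHONPATH for load() to work.
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
import pyspark.sql.functions as F


class ConstantAdder(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    """Adds a constant to column 'v1'; illustrative only."""

    addend = Param(Params._dummy(), "addend", "constant added to column 'v1'")

    def __init__(self, addend=1):
        super(ConstantAdder, self).__init__()
        self._set(addend=addend)

    def _transform(self, dataset):
        return dataset.withColumn("v1_plus", F.col("v1") + self.getOrDefault(self.addend))


# Usage: persist only the params, then load the transformer back by class.
adder = ConstantAdder(addend=5)
adder.write().overwrite().save("test_ConstantAdder")
adder_loaded = ConstantAdder.load("test_ConstantAdder")
{code}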
 

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist Json-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609479#comment-16609479
 ] 

Hyukjin Kwon commented on SPARK-25396:
--

At that time, there was no multiLine mode or JSON functions. So I wonder what 
the current status looks like, but I still agree with this idea in general.

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
>   ...
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory, both when the schema is `ArrayType` and when it is 
> `StructType`. This can lead to unnecessary memory consumption and even to OOM 
> for big JSON files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609475#comment-16609475
 ] 

Hyukjin Kwon commented on SPARK-25396:
--

Oh haha, yeah, I tried this myself before and kind of failed due to dealing 
with malformed records. If you see a good approach, please go ahead.

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
>   ...
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory, both when the schema is `ArrayType` and when it is 
> `StructType`. This can lead to unnecessary memory consumption and even to OOM 
> for big JSON files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-25396:
---
Description: 
If a JSON file has a structure like below:
{code}
[
  {
 "time":"2018-08-13T18:00:44.086Z",
 "resourceId":"some-text",
 "category":"A",
 "level":2,
 "operationName":"Error",
 "properties":{...}
 },
{
 "time":"2018-08-14T18:00:44.086Z",
 "resourceId":"some-text2",
 "category":"B",
 "level":3,
 "properties":{...}
 },
  ...
]
{code}
it should be read in `multiLine` mode. In this mode, Spark reads the whole array 
into memory, both when the schema is `ArrayType` and when it is `StructType`. 
This can lead to unnecessary memory consumption and even to OOM for big JSON 
files.

In general, there is no need to materialize all parsed JSON records in memory 
there: 
https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
 . So, the JSON objects of an array can be read via an Iterator. 

  was:
If a JSON file has a structure like below:
{code}
[
  {
 "time":"2018-08-13T18:00:44.086Z",
 "resourceId":"some-text",
 "category":"A",
 "level":2,
 "operationName":"Error",
 "properties":{...}
 },
{
 "time":"2018-08-14T18:00:44.086Z",
 "resourceId":"some-text2",
 "category":"B",
 "level":3,
 "properties":{...}
 },
]
{code}
it should be read in `multiLine` mode. In this mode, Spark reads the whole array 
into memory, both when the schema is `ArrayType` and when it is `StructType`. 
This can lead to unnecessary memory consumption and even to OOM for big JSON 
files.

In general, there is no need to materialize all parsed JSON records in memory 
there: 
https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
 . So, the JSON objects of an array can be read via an Iterator. 


> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
>   ...
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory, both when the schema is `ArrayType` and when it is 
> `StructType`. This can lead to unnecessary memory consumption and even to OOM 
> for big JSON files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609469#comment-16609469
 ] 

Maxim Gekk commented on SPARK-25396:


[~hyukjin.kwon] WDYT

> Read array of JSON objects via an Iterator
> --
>
> Key: SPARK-25396
> URL: https://issues.apache.org/jira/browse/SPARK-25396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> If a JSON file has a structure like below:
> {code}
> [
>   {
>  "time":"2018-08-13T18:00:44.086Z",
>  "resourceId":"some-text",
>  "category":"A",
>  "level":2,
>  "operationName":"Error",
>  "properties":{...}
>  },
> {
>  "time":"2018-08-14T18:00:44.086Z",
>  "resourceId":"some-text2",
>  "category":"B",
>  "level":3,
>  "properties":{...}
>  },
> ]
> {code}
> it should be read in `multiLine` mode. In this mode, Spark reads the whole 
> array into memory, both when the schema is `ArrayType` and when it is 
> `StructType`. This can lead to unnecessary memory consumption and even to OOM 
> for big JSON files.
> In general, there is no need to materialize all parsed JSON records in memory 
> there: 
> https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
>  . So, the JSON objects of an array can be read via an Iterator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-25378:
--
Summary: ArrayData.toArray(StringType) assume UTF8String in 2.4  (was: 
ArrayData.toArray assume UTF8String)

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but fails in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25396) Read array of JSON objects via an Iterator

2018-09-10 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25396:
--

 Summary: Read array of JSON objects via an Iterator
 Key: SPARK-25396
 URL: https://issues.apache.org/jira/browse/SPARK-25396
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


If a JSON file has a structure like below:
{code}
[
  {
 "time":"2018-08-13T18:00:44.086Z",
 "resourceId":"some-text",
 "category":"A",
 "level":2,
 "operationName":"Error",
 "properties":{...}
 },
{
 "time":"2018-08-14T18:00:44.086Z",
 "resourceId":"some-text2",
 "category":"B",
 "level":3,
 "properties":{...}
 },
]
{code}
it should be read in `multiLine` mode. In this mode, Spark reads the whole array 
into memory, both when the schema is `ArrayType` and when it is `StructType`. 
This can lead to unnecessary memory consumption and even to OOM for big JSON 
files.

In general, there is no need to materialize all parsed JSON records in memory 
there: 
https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L88-L95
 . So, the JSON objects of an array can be read via an Iterator. 
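 
For context, this is how such a file is read from PySpark today (path is illustrative); each top-level array element becomes a row, but the whole array of a file is materialized while parsing, which is what this ticket wants to avoid for large inputs:
{code:python}
# multiLine is required because the JSON document spans multiple lines and the
# top-level value is an array rather than one object per line.
df = spark.read.option("multiLine", "true").json("/path/to/array_of_objects.json")
df.printSchema()
df.show(truncate=False)
{code}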



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray assume UTF8String

2018-09-10 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609467#comment-16609467
 ] 

Xiangrui Meng commented on SPARK-25378:
---

I sent a PR to spark-tensorflow-connector at 
https://github.com/tensorflow/ecosystem/pull/100 to use the suggested method 
from [~hvanhovell]. 

I won't mark this ticket as resolved yet. The potential issue is that other 
data sources may rely on this behavior. If that is the case, users won't be 
able to migrate to 2.4 until the data source owners publish new versions. If 
there isn't a simple way to check, maybe we should send a notice to the dev@ 
list.



> ArrayData.toArray assume UTF8String
> ---
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but fails in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23597) Audit Spark SQL code base for non-interpreted expressions

2018-09-10 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609411#comment-16609411
 ] 

Marco Gaido commented on SPARK-23597:
-

I haven't, [~hvanhovell]?

> Audit Spark SQL code base for non-interpreted expressions
> -
>
> Key: SPARK-23597
> URL: https://issues.apache.org/jira/browse/SPARK-23597
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Major
>
> We want to eliminate expressions that do not provide an interpreted execution 
> path from the code base. The goal of this ticket is to check whether there are 
> any others besides the ones being addressed by SPARK-23580.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25376) Scenarios we should handle but missed in 2.4 for barrier execution mode

2018-09-10 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609391#comment-16609391
 ] 

Imran Rashid commented on SPARK-25376:
--

I raised some of my concerns on one of the earlier PRs, and Xingbo filed 
follow-up issues for a few of them: SPARK-24954, SPARK-24941, 
https://github.com/apache/spark/pull/21758#discussion_r206682882

I think there may be some more things that concerned me, but I need to review 
the changes and refresh my memory a bit ...

> Scenarios we should handle but missed in 2.4 for barrier execution mode
> ---
>
> Key: SPARK-25376
> URL: https://issues.apache.org/jira/browse/SPARK-25376
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> [~irashid] You mentioned that there are a couple of scenarios we should handle 
> in barrier execution mode but didn't in 2.4. Could you elaborate here?
> One scenario we are aware of is that speculation is not supported by barrier 
> mode, so a barrier stage might hang if there is a hardware issue on one node. 
> I don't have a good proposal here except letting users set a timeout for the 
> barrier stage. I would like to hear your thoughts.
> You also mentioned multi-tenancy issues. Could you say more?
> cc: [~jiangxb1987]
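 
For readers unfamiliar with the mode, a minimal PySpark barrier stage looks roughly like this (illustrative sketch against the 2.4 RDD API; `sc` is an active SparkContext). The hang scenario above arises because every task below must reach the barrier before any can proceed, and barrier tasks are not speculated:
{code:python}
from pyspark.taskcontext import BarrierTaskContext

def do_partition(rows):
    ctx = BarrierTaskContext.get()
    ctx.barrier()            # all tasks of the stage block here until everyone arrives
    yield (ctx.partitionId(), sum(rows))

# 4 partitions -> 4 barrier tasks that must all be scheduled together.
result = sc.parallelize(range(100), 4).barrier().mapPartitions(do_partition).collect()
{code}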



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25395) Remove Spark Optional Java API

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609357#comment-16609357
 ] 

Apache Spark commented on SPARK-25395:
--

User 'mmolimar' has created a pull request for this issue:
https://github.com/apache/spark/pull/22383

> Remove Spark Optional Java API
> --
>
> Key: SPARK-25395
> URL: https://issues.apache.org/jira/browse/SPARK-25395
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Mario Molina
>Priority: Minor
>
> Previous Spark versions didn't require Java 8, so a Spark-specific ``Optional`` 
> Java API had to be implemented to support optional values.
> Since Spark 2.4 uses Java 8, the Spark ``Optional`` Java API should be removed 
> so that Spark uses the standard Java 8 ``java.util.Optional``.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25395) Remove Spark Optional Java API

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25395:


Assignee: (was: Apache Spark)

> Remove Spark Optional Java API
> --
>
> Key: SPARK-25395
> URL: https://issues.apache.org/jira/browse/SPARK-25395
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Mario Molina
>Priority: Minor
>
> Previous Spark versions didn't require Java 8, so a Spark-specific ``Optional`` 
> Java API had to be implemented to support optional values.
> Since Spark 2.4 uses Java 8, the Spark ``Optional`` Java API should be removed 
> so that Spark uses the standard Java 8 ``java.util.Optional``.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25395) Remove Spark Optional Java API

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609356#comment-16609356
 ] 

Apache Spark commented on SPARK-25395:
--

User 'mmolimar' has created a pull request for this issue:
https://github.com/apache/spark/pull/22383

> Remove Spark Optional Java API
> --
>
> Key: SPARK-25395
> URL: https://issues.apache.org/jira/browse/SPARK-25395
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Mario Molina
>Priority: Minor
>
> Previous Spark versions didn't require Java 8, so a Spark-specific ``Optional`` 
> Java API had to be implemented to support optional values.
> Since Spark 2.4 uses Java 8, the Spark ``Optional`` Java API should be removed 
> so that Spark uses the standard Java 8 ``java.util.Optional``.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25395) Remove Spark Optional Java API

2018-09-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25395:


Assignee: Apache Spark

> Remove Spark Optional Java API
> --
>
> Key: SPARK-25395
> URL: https://issues.apache.org/jira/browse/SPARK-25395
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Mario Molina
>Assignee: Apache Spark
>Priority: Minor
>
> Previous Spark versions didn't require Java 8, so a Spark-specific ``Optional`` 
> Java API had to be implemented to support optional values.
> Since Spark 2.4 uses Java 8, the Spark ``Optional`` Java API should be removed 
> so that Spark uses the standard Java 8 ``java.util.Optional``.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21291) R bucketBy partitionBy API

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609325#comment-16609325
 ] 

Wenchen Fan commented on SPARK-21291:
-

I'm removing the target version, since no one is working on it.

> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Priority: Major
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21291) R bucketBy partitionBy API

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21291:

Target Version/s:   (was: 2.4.0)

> R bucketBy partitionBy API
> --
>
> Key: SPARK-21291
> URL: https://issues.apache.org/jira/browse/SPARK-21291
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Priority: Major
>
> partitionBy exists but it's for windowspec only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25395) Remove Spark Optional Java API

2018-09-10 Thread Mario Molina (JIRA)
Mario Molina created SPARK-25395:


 Summary: Remove Spark Optional Java API
 Key: SPARK-25395
 URL: https://issues.apache.org/jira/browse/SPARK-25395
 Project: Spark
  Issue Type: Improvement
  Components: Java API
Affects Versions: 2.4.0
Reporter: Mario Molina


Previous Spark versions didn't require Java 8, so a Spark-specific ``Optional`` 
Java API had to be implemented to support optional values.

Since Spark 2.4 uses Java 8, the Spark ``Optional`` Java API should be removed 
so that Spark uses the standard Java 8 ``java.util.Optional``.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21320) Make sure all expressions support interpreted evaluation

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21320.
-
Resolution: Duplicate

> Make sure all expressions support interpreted evaluation
> 
>
> Key: SPARK-21320
> URL: https://issues.apache.org/jira/browse/SPARK-21320
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Most of the time, Spark SQL evaluates expressions with codegen. However, 
> codegen is a complex technology; we have already fixed a lot of bugs in it, 
> and there will be more. To make Spark SQL more stable, we should have a 
> fallback evaluation path for when codegen fails, which requires that all 
> expressions support interpreted evaluation. Currently the encoder-related 
> expressions are codegen-only; we should fix them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21320) Make sure all expressions support interpreted evaluation

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609300#comment-16609300
 ] 

Wenchen Fan commented on SPARK-21320:
-

This is replaced by https://issues.apache.org/jira/browse/SPARK-23580

> Make sure all expressions support interpreted evaluation
> 
>
> Key: SPARK-21320
> URL: https://issues.apache.org/jira/browse/SPARK-21320
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Most of the time, Spark SQL evaluates expressions with codegen. However, 
> codegen is a complex technology; we have already fixed a lot of bugs in it, 
> and there will be more. To make Spark SQL more stable, we should have a 
> fallback evaluation path for when codegen fails, which requires that all 
> expressions support interpreted evaluation. Currently the encoder-related 
> expressions are codegen-only; we should fix them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21395) Spark SQL hive-thriftserver doesn't register operation log before execute sql statement

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609298#comment-16609298
 ] 

Wenchen Fan commented on SPARK-21395:
-

I'm removing the target version, since we are not going to merge it to 2.4

> Spark SQL hive-thriftserver doesn't register operation log before execute sql 
> statement
> ---
>
> Key: SPARK-21395
> URL: https://issues.apache.org/jira/browse/SPARK-21395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Chaozhong Yang
>Priority: Major
>
> In HiveServer2, TFetchResultsReq has a member named `fetchType`. If fetchType 
> is `1`, the thrift server should return the operation log to the client. 
> However, we found that Spark SQL's thrift server always returns nothing to the 
> client for a TFetchResultsReq with fetchType(1). We have checked the 
> ${HIVE_SERVER2_LOGGING_OPERATION_LOG_LOCATION}/${session-id} directory 
> carefully and found that the operation log files existed but had zero bytes 
> (empty files). Why? Let's take a look at SQLOperation.java in Hive:
> {code:java}
>   @Override
>   public void runInternal() throws HiveSQLException {
> setState(OperationState.PENDING);
> final HiveConf opConfig = getConfigForOperation();
> prepare(opConfig);
> if (!shouldRunAsync()) {
>   runQuery(opConfig);
> } else {
>   // We'll pass ThreadLocals in the background thread from the foreground 
> (handler) thread
>   final SessionState parentSessionState = SessionState.get();
>   // ThreadLocal Hive object needs to be set in background thread.
>   // The metastore client in Hive is associated with right user.
>   final Hive parentHive = getSessionHive();
>   // Current UGI will get used by metastore when metsatore is in embedded 
> mode
>   // So this needs to get passed to the new background thread
>   final UserGroupInformation currentUGI = getCurrentUGI(opConfig);
>   // Runnable impl to call runInternal asynchronously,
>   // from a different thread
>   Runnable backgroundOperation = new Runnable() {
> @Override
> public void run() {
>   PrivilegedExceptionAction doAsAction = new 
> PrivilegedExceptionAction() {
> @Override
> public Object run() throws HiveSQLException {
>   Hive.set(parentHive);
>   SessionState.setCurrentSessionState(parentSessionState);
>   // Set current OperationLog in this async thread for keeping on 
> saving query log.
>   registerCurrentOperationLog();
>   try {
> runQuery(opConfig);
>   } catch (HiveSQLException e) {
> setOperationException(e);
> LOG.error("Error running hive query: ", e);
>   } finally {
> unregisterOperationLog();
>   }
>   return null;
> }
>   };
>   try {
> currentUGI.doAs(doAsAction);
>   } catch (Exception e) {
> setOperationException(new HiveSQLException(e));
> LOG.error("Error running hive query as user : " + 
> currentUGI.getShortUserName(), e);
>   }
>   finally {
> /**
>  * We'll cache the ThreadLocal RawStore object for this 
> background thread for an orderly cleanup
>  * when this thread is garbage collected later.
>  * @see 
> org.apache.hive.service.server.ThreadWithGarbageCleanup#finalize()
>  */
> if (ThreadWithGarbageCleanup.currentThread() instanceof 
> ThreadWithGarbageCleanup) {
>   ThreadWithGarbageCleanup currentThread =
>   (ThreadWithGarbageCleanup) 
> ThreadWithGarbageCleanup.currentThread();
>   currentThread.cacheThreadLocalRawStore();
> }
>   }
> }
>   };
>   try {
> // This submit blocks if no background threads are available to run 
> this operation
> Future backgroundHandle =
> 
> getParentSession().getSessionManager().submitBackgroundOperation(backgroundOperation);
> setBackgroundHandle(backgroundHandle);
>   } catch (RejectedExecutionException rejected) {
> setState(OperationState.ERROR);
> throw new HiveSQLException("The background threadpool cannot accept" +
> " new task for execution, please retry the operation", rejected);
>   }
> }
>   }
> {code}
> Obviously, registerOperationLog is the key to Hive being able to produce and 
> return the operation log to the client.
> But in Spark SQL, SparkExecuteStatementOperation doesn't call 
> registerOperationLog before executing the SQL statement

[jira] [Updated] (SPARK-21395) Spark SQL hive-thriftserver doesn't register operation log before execute sql statement

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21395:

Target Version/s:   (was: 2.4.0)

> Spark SQL hive-thriftserver doesn't register operation log before execute sql 
> statement
> ---
>
> Key: SPARK-21395
> URL: https://issues.apache.org/jira/browse/SPARK-21395
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Chaozhong Yang
>Priority: Major
>
> In HiveServer2, TFetchResultsReq has a member named `fetchType`. If fetchType 
> is `1`, the thrift server should return the operation log to the client. 
> However, we found that Spark SQL's thrift server always returns nothing to the 
> client for a TFetchResultsReq with fetchType(1). We have checked the 
> ${HIVE_SERVER2_LOGGING_OPERATION_LOG_LOCATION}/${session-id} directory 
> carefully and found that the operation log files existed but had zero bytes 
> (empty files). Why? Let's take a look at SQLOperation.java in Hive:
> {code:java}
>   @Override
>   public void runInternal() throws HiveSQLException {
> setState(OperationState.PENDING);
> final HiveConf opConfig = getConfigForOperation();
> prepare(opConfig);
> if (!shouldRunAsync()) {
>   runQuery(opConfig);
> } else {
>   // We'll pass ThreadLocals in the background thread from the foreground 
> (handler) thread
>   final SessionState parentSessionState = SessionState.get();
>   // ThreadLocal Hive object needs to be set in background thread.
>   // The metastore client in Hive is associated with right user.
>   final Hive parentHive = getSessionHive();
>   // Current UGI will get used by metastore when metsatore is in embedded 
> mode
>   // So this needs to get passed to the new background thread
>   final UserGroupInformation currentUGI = getCurrentUGI(opConfig);
>   // Runnable impl to call runInternal asynchronously,
>   // from a different thread
>   Runnable backgroundOperation = new Runnable() {
> @Override
> public void run() {
>   PrivilegedExceptionAction doAsAction = new 
> PrivilegedExceptionAction() {
> @Override
> public Object run() throws HiveSQLException {
>   Hive.set(parentHive);
>   SessionState.setCurrentSessionState(parentSessionState);
>   // Set current OperationLog in this async thread for keeping on 
> saving query log.
>   registerCurrentOperationLog();
>   try {
> runQuery(opConfig);
>   } catch (HiveSQLException e) {
> setOperationException(e);
> LOG.error("Error running hive query: ", e);
>   } finally {
> unregisterOperationLog();
>   }
>   return null;
> }
>   };
>   try {
> currentUGI.doAs(doAsAction);
>   } catch (Exception e) {
> setOperationException(new HiveSQLException(e));
> LOG.error("Error running hive query as user : " + 
> currentUGI.getShortUserName(), e);
>   }
>   finally {
> /**
>  * We'll cache the ThreadLocal RawStore object for this 
> background thread for an orderly cleanup
>  * when this thread is garbage collected later.
>  * @see 
> org.apache.hive.service.server.ThreadWithGarbageCleanup#finalize()
>  */
> if (ThreadWithGarbageCleanup.currentThread() instanceof 
> ThreadWithGarbageCleanup) {
>   ThreadWithGarbageCleanup currentThread =
>   (ThreadWithGarbageCleanup) 
> ThreadWithGarbageCleanup.currentThread();
>   currentThread.cacheThreadLocalRawStore();
> }
>   }
> }
>   };
>   try {
> // This submit blocks if no background threads are available to run 
> this operation
> Future backgroundHandle =
> 
> getParentSession().getSessionManager().submitBackgroundOperation(backgroundOperation);
> setBackgroundHandle(backgroundHandle);
>   } catch (RejectedExecutionException rejected) {
> setState(OperationState.ERROR);
> throw new HiveSQLException("The background threadpool cannot accept" +
> " new task for execution, please retry the operation", rejected);
>   }
> }
>   }
> {code}
> Obviously, registerOperationLog is the key to Hive being able to produce and 
> return the operation log to the client.
> But in Spark SQL, SparkExecuteStatementOperation doesn't call 
> registerOperationLog before executing the SQL statement:
> {code:scala}
>   override def runInternal(): Unit = {
> setState(OperationState.PENDING

[jira] [Updated] (SPARK-21940) Support timezone for timestamps in SparkR

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21940:

Target Version/s:   (was: 2.4.0)

> Support timezone for timestamps in SparkR
> -
>
> Key: SPARK-21940
> URL: https://issues.apache.org/jira/browse/SPARK-21940
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>Priority: Major
>
> {{SparkR::createDataFrame()}} wipes the timezone attribute from POSIXct and 
> POSIXlt values. See the following example:
> {code}
> > x <- data.frame(x = c(Sys.time()))
> > x
> x
> 1 2017-09-06 19:17:16
> > attr(x$x, "tzone") <- "Europe/Paris"
> > x
> x
> 1 2017-09-07 04:17:16
> > collect(createDataFrame(x))
> x
> 1 2017-09-06 19:17:16
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21940) Support timezone for timestamps in SparkR

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609297#comment-16609297
 ] 

Wenchen Fan commented on SPARK-21940:
-

I'm removing the target version, since no one is working on it.

> Support timezone for timestamps in SparkR
> -
>
> Key: SPARK-21940
> URL: https://issues.apache.org/jira/browse/SPARK-21940
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>Priority: Major
>
> {{SparkR::createDataFrame()}} wipes the timezone attribute from POSIXct and 
> POSIXlt values. See the following example:
> {code}
> > x <- data.frame(x = c(Sys.time()))
> > x
> x
> 1 2017-09-06 19:17:16
> > attr(x$x, "tzone") <- "Europe/Paris"
> > x
> x
> 1 2017-09-07 04:17:16
> > collect(createDataFrame(x))
> x
> 1 2017-09-06 19:17:16
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21972:

Target Version/s:   (was: 2.4.0)

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>Priority: Major
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.
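 
Until something like that param exists, the usual hedge is to manage persistence explicitly around {{fit()}} (illustrative PySpark; the DataFrame and the estimator choice are assumptions, and as SPARK-18608 notes the internal persistence check may still misfire):
{code:python}
from pyspark import StorageLevel
from pyspark.ml.classification import LogisticRegression

train_df.persist(StorageLevel.MEMORY_AND_DISK)   # choose the storage level yourself, once
model = LogisticRegression(maxIter=10, regParam=0.01).fit(train_df)
train_df.unpersist()
{code}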



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609296#comment-16609296
 ] 

Wenchen Fan commented on SPARK-21972:
-

I'm removing the target version, since we are not going to merge it to 2.4

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>Priority: Major
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22054) Allow release managers to inject their keys

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22054:

Target Version/s:   (was: 2.4.0)

> Allow release managers to inject their keys
> ---
>
> Key: SPARK-22054
> URL: https://issues.apache.org/jira/browse/SPARK-22054
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Major
>
> Right now the release process signs with Patrick's keys; let's update the 
> scripts to allow the release manager to sign the release as part of the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22054) Allow release managers to inject their keys

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609295#comment-16609295
 ] 

Wenchen Fan commented on SPARK-22054:
-

I'm removing the target version, since we can't make it before 2.4

> Allow release managers to inject their keys
> ---
>
> Key: SPARK-22054
> URL: https://issues.apache.org/jira/browse/SPARK-22054
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Major
>
> Right now the release process signs with Patrick's keys; let's update the 
> scripts to allow the release manager to sign the release as part of the job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22055) Port release scripts

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22055:

Target Version/s:   (was: 2.4.0)

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Major
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22055) Port release scripts

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609294#comment-16609294
 ] 

Wenchen Fan commented on SPARK-22055:
-

I'm removing the target version, since we can't make it before 2.4

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Major
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these scripts to enable changes like SPARK-22054.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22632) Fix the behavior of timestamp values for R's DataFrame to respect session timezone

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609292#comment-16609292
 ] 

Wenchen Fan commented on SPARK-22632:
-

Is this still a problem now?

> Fix the behavior of timestamp values for R's DataFrame to respect session 
> timezone
> --
>
> Key: SPARK-22632
> URL: https://issues.apache.org/jira/browse/SPARK-22632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Note: the wording is borrowed from SPARK-22395; the symptom is similar, and that 
> JIRA describes it well.
> When converting an R data.frame to or from a Spark DataFrame using 
> {{createDataFrame}} or {{collect}}, timestamp values respect the R system 
> timezone instead of the session timezone.
> For example, suppose we use "America/Los_Angeles" as the session timezone and 
> have a timestamp value "1970-01-01 00:00:01" in that timezone. (I'm in South 
> Korea, so my R system timezone is "KST".)
> With the current collect(), the timestamp value comes back as follows:
> {code}
> > sparkR.session(master = "local[*]", sparkConfig = 
> > list(spark.sql.session.timeZone = "America/Los_Angeles"))
> > collect(sql("SELECT cast(cast(28801 as timestamp) as string) as ts"))
>ts
> 1 1970-01-01 00:00:01
> > collect(sql("SELECT cast(28801 as timestamp) as ts"))
>ts
> 1 1970-01-01 17:00:01
> {code}
> As you can see, the value becomes "1970-01-01 17:00:01" because it respects the 
> R system timezone.
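
For reference, a small standalone Scala sketch of the arithmetic behind those two strings: 28801 seconds after the epoch is 08:00:01 UTC, which renders as 00:00:01 in the session timezone (America/Los_Angeles) but as 17:00:01 in the reporter's system timezone (KST, written here as Asia/Seoul).

{code}
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

val instant = Instant.ofEpochSecond(28801L)  // 1970-01-01 08:00:01 UTC
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// What collect() should return: the value rendered in the session timezone.
println(fmt.format(instant.atZone(ZoneId.of("America/Los_Angeles"))))  // 1970-01-01 00:00:01

// What it currently returns: the value rendered in the R system timezone.
println(fmt.format(instant.atZone(ZoneId.of("Asia/Seoul"))))           // 1970-01-01 17:00:01
{code}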



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609291#comment-16609291
 ] 

Apache Spark commented on SPARK-20715:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22382

> MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and 
> MapOutputTracker
> 
>
> Key: SPARK-20715
> URL: https://issues.apache.org/jira/browse/SPARK-20715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 2.3.0
>
>
> Today the MapOutputTracker and ShuffleMapStage each maintain their own copy 
> of the MapStatuses. This creates the potential for bugs if these two pieces 
> of state fall out of sync.
> I believe we can make the code easier to reason about by storing this 
> information only in the MapOutputTracker. This can also help reduce driver 
> memory consumption.
> I will provide more details in my PR, where I'll walk through the detailed 
> argument for why we can consolidate these two metadata tracking formats 
> without losing performance or correctness.
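
A minimal self-contained Scala sketch of the proposed consolidation (all class and method names below are simplified stand-ins, not the real Spark internals): the stage keeps no copy of the MapStatuses and derives output availability from the single tracker, so the two views can never diverge.

{code}
import scala.collection.mutable

case class MapStatus(mapId: Int, location: String)

class MapOutputTracker {
  private val statuses = mutable.Map.empty[(Int, Int), MapStatus]
  def register(shuffleId: Int, status: MapStatus): Unit =
    statuses((shuffleId, status.mapId)) = status
  def missingPartitions(shuffleId: Int, numMaps: Int): Seq[Int] =
    (0 until numMaps).filterNot(m => statuses.contains((shuffleId, m)))
}

// The stage queries the tracker instead of holding its own outputLocs array.
class ShuffleMapStage(shuffleId: Int, numMaps: Int, tracker: MapOutputTracker) {
  def findMissingPartitions(): Seq[Int] = tracker.missingPartitions(shuffleId, numMaps)
  def isAvailable: Boolean = findMissingPartitions().isEmpty
}
{code}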



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609290#comment-16609290
 ] 

Apache Spark commented on SPARK-20715:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22382

> MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and 
> MapOutputTracker
> 
>
> Key: SPARK-20715
> URL: https://issues.apache.org/jira/browse/SPARK-20715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 2.3.0
>
>
> Today the MapOutputTracker and ShuffleMapStage each maintain their own copy 
> of the MapStatuses. This creates the potential for bugs if these two pieces 
> of state fall out of sync.
> I believe we can make the code easier to reason about by storing this 
> information only in the MapOutputTracker. This can also help reduce driver 
> memory consumption.
> I will provide more details in my PR, where I'll walk through the detailed 
> argument for why we can consolidate these two metadata tracking formats 
> without losing performance or correctness.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23243) Shuffle+Repartition on an RDD could lead to incorrect answers

2018-09-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609289#comment-16609289
 ] 

Apache Spark commented on SPARK-23243:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22382

> Shuffle+Repartition on an RDD could lead to incorrect answers
> -
>
> Key: SPARK-23243
> URL: https://issues.apache.org/jira/browse/SPARK-23243
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.2, 2.4.0
>
>
> RDD repartition also uses round-robin partitioning to distribute data, which 
> can cause incorrect answers for RDD workloads in the same way as in 
> https://issues.apache.org/jira/browse/SPARK-23207
> The approach that fixes DataFrame.repartition() does not apply to the RDD 
> repartition issue, as discussed in 
> https://github.com/apache/spark/pull/20393#issuecomment-360912451
> We track alternative solutions for this issue in this task.
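
A rough standalone Scala sketch (not the actual Spark code) of why round-robin repartition is order-sensitive: a record's target partition depends only on its position in the input iterator, so if a retried upstream task yields the same records in a different order, they land in different output partitions.

{code}
def roundRobin[T](records: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
  var position = 0
  records.map { record =>
    position += 1
    (position % numPartitions, record)  // partition chosen by position, not by value
  }
}

// Same multiset of records, different input order => different assignment.
val runA = roundRobin(Iterator("a", "b", "c"), numPartitions = 2).toList
val runB = roundRobin(Iterator("c", "a", "b"), numPartitions = 2).toList
// runA: List((1,a), (0,b), (1,c))   runB: List((1,c), (0,a), (1,b))
{code}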



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609286#comment-16609286
 ] 

Wenchen Fan commented on SPARK-22796:
-

I'm removing the target version, since there has been no progress yet.

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22796:

Target Version/s:   (was: 2.4.0)

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609285#comment-16609285
 ] 

Wenchen Fan commented on SPARK-22798:
-

I'm removing the target version, since there has been no progress yet.

> Add multiple column support to PySpark StringIndexer
> 
>
> Key: SPARK-22798
> URL: https://issues.apache.org/jira/browse/SPARK-22798
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22798) Add multiple column support to PySpark StringIndexer

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22798:

Target Version/s:   (was: 2.4.0)

> Add multiple column support to PySpark StringIndexer
> 
>
> Key: SPARK-22798
> URL: https://issues.apache.org/jira/browse/SPARK-22798
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23153) Support application dependencies in submission client's local file system

2018-09-10 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16609283#comment-16609283
 ] 

Wenchen Fan commented on SPARK-23153:
-

I'm removing the target version, since no one is working on it.

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23153) Support application dependencies in submission client's local file system

2018-09-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23153:

Target Version/s:   (was: 2.4.0)

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


