[jira] [Updated] (SPARK-26847) Pruning nested serializers from object serializers: MapType support

2019-02-07 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-26847:

Summary: Pruning nested serializers from object serializers: MapType 
support  (was: Prune nested serializers from object serializers: MapType 
support)

> Pruning nested serializers from object serializers: MapType support
> ---
>
> Key: SPARK-26847
> URL: https://issues.apache.org/jira/browse/SPARK-26847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> This is TODO PR for SPARK-26837. In SPARK-26837, we prune nested fields from 
> object serializers if they are unnecessary in the query execution. 
> SPARK-26837 leaves the support of MapType as a TODO item. This ticket is 
> created to track it.






[jira] [Resolved] (SPARK-26831) bin/pyspark: avoid hardcoded `python` command and improve version checks

2019-02-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26831.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23736
[https://github.com/apache/spark/pull/23736]

> bin/pyspark: avoid hardcoded `python` command and improve version checks
> 
>
> Key: SPARK-26831
> URL: https://issues.apache.org/jira/browse/SPARK-26831
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Stefaan Lippens
>Priority: Major
> Fix For: 3.0.0
>
>
> (this originally started at https://github.com/apache/spark/pull/23736)
> I was trying out pyspark on a system with only a {{python3}}  command but no 
> {{python}} command and got this error:
> {code}
> /opt/spark/bin/pyspark: line 45: python: command not found
> {code}
> While the pyspark script is full of variables referring to a Python 
> interpreter, there is still a hardcoded {{python}} used for
> {code}
> WORKS_WITH_IPYTHON=$(python -c 'import sys; print(sys.version_info >= (2, 7, 
> 0))')
> {code}
> While looking into this, I also noticed the bash syntax for the IPython 
> version check is wrong: 
> {code}
> if [[ ! $WORKS_WITH_IPYTHON ]]
> {code}
> always evaluates to false when {{$WORKS_WITH_IPYTHON}} is non-empty, which it is 
> for both "True" and "False", so the version check never triggers.






[jira] [Assigned] (SPARK-26831) bin/pyspark: avoid hardcoded `python` command and improve version checks

2019-02-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26831:


Assignee: Stefaan Lippens

> bin/pyspark: avoid hardcoded `python` command and improve version checks
> 
>
> Key: SPARK-26831
> URL: https://issues.apache.org/jira/browse/SPARK-26831
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Stefaan Lippens
>Assignee: Stefaan Lippens
>Priority: Major
> Fix For: 3.0.0
>
>
> (this originally started at https://github.com/apache/spark/pull/23736)
> I was trying out pyspark on a system with only a {{python3}}  command but no 
> {{python}} command and got this error:
> {code}
> /opt/spark/bin/pyspark: line 45: python: command not found
> {code}
> While the pyspark script is full of variables referring to a Python 
> interpreter, there is still a hardcoded {{python}} used for
> {code}
> WORKS_WITH_IPYTHON=$(python -c 'import sys; print(sys.version_info >= (2, 7, 
> 0))')
> {code}
> While looking into this, I also noticed the bash syntax for the IPython 
> version check is wrong: 
> {code}
> if [[ ! $WORKS_WITH_IPYTHON ]]
> {code}
> always evaluates to false when {{$WORKS_WITH_IPYTHON}} is non-empty, which it is 
> for both "True" and "False", so the version check never triggers.






[jira] [Commented] (SPARK-24657) SortMergeJoin may cause SparkOutOfMemory in execution memory because of not cleanup resource when finished the merge join

2019-02-07 Thread Tao Luo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763243#comment-16763243
 ] 

Tao Luo commented on SPARK-24657:
-

If SortMergeJoinScanner doesn't consume the UnsafeExternalRowSorter entirely, the 
memory that UnsafeExternalSorter acquired from the TaskMemoryManager is never 
released. This leads to a memory leak, spills, and eventually an OOM error: one 
page is held per partition of the unconsumed iterator.
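
For reference, a minimal repro sketch of the scenario described in the quoted 
issue below (the row counts, local[1] master, shuffle-partition count, and output 
path are illustrative assumptions, not taken from the original report):

{code:java}
// Hedged sketch: three small tables sort-merge joined, then coalesced into one
// task, so a single task walks many parent partitions. If a parent partition's
// UnsafeExternalRowSorter iterator is not fully consumed, its page is retained.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.sql.shuffle.partitions", "20")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1") // force sort-merge joins
  .getOrCreate()
import spark.implicits._

val a = (1 to 100000).map(i => (i, s"a$i")).toDF("id", "a")
val b = (1 to 100000).map(i => (i, s"b$i")).toDF("id", "b")
val c = (1 to 100000).map(i => (i, s"c$i")).toDF("id", "c")

// coalesce(1) pulls all 20 shuffle partitions through one task; if pages are not
// released after each parent partition, execution memory eventually runs out.
a.join(b, "id").join(c, "id").coalesce(1)
  .write.mode("overwrite").csv("/tmp/smj-coalesce-repro")
{code}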

> SortMergeJoin may cause SparkOutOfMemory  in execution memory  because of not 
> cleanup resource when finished the merge join
> ---
>
> Key: SPARK-24657
> URL: https://issues.apache.org/jira/browse/SPARK-24657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 2.3.1
>Reporter: Joshuawangzj
>Priority: Major
>
> My SQL query joins three tables, all of which are small (about 2 MB each). To 
> solve the small-files issue I use coalesce(1), but it throws this OOM exception:
> {code:java}
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes 
> of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:159)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:99)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.(UnsafeInMemorySorter.java:128)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:162)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:129)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:111)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.create(UnsafeExternalRowSorter.java:96)
>   at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:89)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.init(generated.java:22)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10.apply(WholeStageCodegenExec.scala:611)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10.apply(WholeStageCodegenExec.scala:608)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
> {code}
> {code:java}
> 12:10:51.175 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 3.0 (TID 34, localhost, executor driver): 
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 65536 bytes 
> of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:159)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:99)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.(UnsafeInMemorySorter.java:128)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:162)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:129)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:111)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.create(UnsafeExternalRowSorter.java:96)
>   at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:89)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.init(generated.java:22)
> {code}
> Finally, I found the problem by studying the source code. The reason for the 
> exception is that the task can't allocate a page (in my case, 32 MB per page) 
> from the MemoryManager, because coalesce runs 20 parent partitions in one task 
> (spark.sql.shuffle.partitions=20), and after the sort-merge join of each parent 
> partition, the UnsafeExternalRowSorter cannot clean up some of the pages it 
> allocated. After the 14th parent partition (in my case), there is not enough 
> execution memory left to acquire a page for the sort.
> Why can't UnsafeExternalRowSorter clean up its pages after the join of a parent 
> partition finishes?
> After many attempts, I found the problem is in SortMergeJoinScanner. 
> UnsafeExternalRowSorter cleans up its resources only when its iterator is 
> advanced to the end. But in SortMergeJoinScanner, when the streamedIterator is 
> exhausted the bufferedIterator may not be, so the bufferedIterator cannot 
> release its resources, and vice versa.
> The solution may be:
> 1. advance to last for 

[jira] [Updated] (SPARK-24211) Flaky test: StreamingOuterJoinSuite

2019-02-07 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-24211:
-
Affects Version/s: 2.3.2  (was: 2.4.0)

> Flaky test: StreamingOuterJoinSuite
> ---
>
> Key: SPARK-24211
> URL: https://issues.apache.org/jira/browse/SPARK-24211
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.3.2
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *windowed left outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/]
> *windowed right outer join*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
> *left outer join with non-key condition violated*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/366/]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/386/]
> *left outer early state exclusion on left*
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375]
>  - 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/385/]






[jira] [Commented] (SPARK-26847) Prune nested serializers from object serializers: MapType support

2019-02-07 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763211#comment-16763211
 ] 

Liang-Chi Hsieh commented on SPARK-26847:
-

I will finish this once SPARK-26837 is merged.

> Prune nested serializers from object serializers: MapType support
> -
>
> Key: SPARK-26847
> URL: https://issues.apache.org/jira/browse/SPARK-26847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> This is TODO PR for SPARK-26837. In SPARK-26837, we prune nested fields from 
> object serializers if they are unnecessary in the query execution. 
> SPARK-26837 leaves the support of MapType as a TODO item. This ticket is 
> created to track it.






[jira] [Created] (SPARK-26847) Prune nested serializers from object serializers: MapType support

2019-02-07 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-26847:
---

 Summary: Prune nested serializers from object serializers: MapType 
support
 Key: SPARK-26847
 URL: https://issues.apache.org/jira/browse/SPARK-26847
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


This is TODO PR for SPARK-26837. In SPARK-26837, we prune nested fields from 
object serializers if they are unnecessary in the query execution. SPARK-26837 
leaves the support of MapType as a TODO item. This ticket is created to track 
it.
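
For illustration, a hedged sketch of the kind of query this optimization targets 
(spark-shell assumed; the case class names are made up, not taken from the ticket):

{code:java}
// Only m("k").a is referenced, so with MapType support the object serializer
// for the unused nested field Inner.b could be pruned as well.
import org.apache.spark.sql.functions.col
import spark.implicits._

case class Inner(a: Int, b: String)
case class Outer(id: Int, m: Map[String, Inner])

val ds = Seq(Outer(1, Map("k" -> Inner(1, "x")))).toDS()
ds.select(col("m").getItem("k").getField("a")).explain(true)
{code}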






[jira] [Commented] (SPARK-26841) Timestamp pushdown on Kafka table

2019-02-07 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763149#comment-16763149
 ] 

Jungtaek Lim commented on SPARK-26841:
--

[~Bartalos]
Are you working on the patch? I'm interested in addressing offsets by timestamp 
as well, though my first goal is not pushdown but an alternative to 
startingOffsets/endingOffsets.

> Timestamp pushdown on Kafka table
> -
>
> Key: SPARK-26841
> URL: https://issues.apache.org/jira/browse/SPARK-26841
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Tomas Bartalos
>Priority: Major
>  Labels: Kafka, pushdown, timestamp
>
> As a Spark user I'd like to have fast queries on Kafka table restricted by 
> timestamp.
> I'd like to have quick answers on questions like:
>  * What was inserted to Kafka in past x minutes
>  * What was inserted to Kafka in specified time range
> Example:
> {quote}select * from kafka_table where timestamp > 
> from_unixtime(unix_timestamp() - 5 * 60, "yyyy-MM-dd HH:mm:ss")
> select * from kafka_table where timestamp > $from_time and timestamp < 
> $end_time
> {quote}
> Currently timestamp restrictions are not pushed down to KafkaRelation, and 
> querying by timestamp on a large Kafka topic takes forever to complete.
> *Technical solution*
> Technically it's possible to retrieve Kafka's offsets for a provided timestamp 
> with the org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method. 
> Afterwards we can query the Kafka topic by the retrieved offset ranges.
> Querying by a timestamp range is already implemented, so this change should have 
> minor impact.
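
A hedged sketch of the offsetsForTimes lookup described above (Scala, assuming 
kafka-clients on the classpath; the broker address and topic name are placeholders):

{code:java}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val fromTime = System.currentTimeMillis() - 5 * 60 * 1000L  // "past 5 minutes"

// Ask for the earliest offset at or after fromTime in every partition of the topic.
val partitions = consumer.partitionsFor("some_topic").asScala
  .map(p => new TopicPartition(p.topic, p.partition))
val query = partitions.map(tp => tp -> java.lang.Long.valueOf(fromTime)).toMap.asJava

// offsetsForTimes maps each partition to the first offset whose timestamp is
// >= the requested one (or null if there is none), which bounds the scan.
consumer.offsetsForTimes(query).asScala.foreach { case (tp, oat) =>
  println(s"$tp -> ${Option(oat).map(_.offset)}")
}
consumer.close()
{code}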






[jira] [Updated] (SPARK-26846) Empty Strings in dataframe are written as "" in CSV

2019-02-07 Thread Arvind Krishnan Iyer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvind Krishnan Iyer updated SPARK-26846:
-
Description: 
 
{code:java}
import spark.implicits._
val sc = spark.sparkContext
val df = Seq((8,"100","sfd"),(0,"","sfd"),(8, null, 
"asfasd")).toDF("num","str_num","word").toDF().coalesce(1)
df.write.mode(SaveMode.Overwrite).csv("/Users/arvind.iyer/abcd.csv")
{code}
We are writing the contents of this CSV into a DB, and the contents of that 
column are going in as "". 

+Output+ 

8,100,sfd
 0,"",sfd
 8,"",asfasd

  was:
 
{code:java}
import spark.implicits._
val sc = spark.sparkContext
val df = Seq((8,"100","sfd"),(0,"","sfd"),(8, null, 
"asfasd")).toDF("num","str_num","word").toDF().coalesce(1)
df.write.mode(SaveMode.Overwrite).csv("/Users/arvind.iyer/abcd.csv")
{code}
+Output+ 

8,100,sfd
0,"",sfd
8,"",asfasd


> Empty Strings in dataframe are written as "" in CSV
> ---
>
> Key: SPARK-26846
> URL: https://issues.apache.org/jira/browse/SPARK-26846
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arvind Krishnan Iyer
>Priority: Major
>
>  
> {code:java}
> import spark.implicits._
> val sc = spark.sparkContext
> val df = Seq((8,"100","sfd"),(0,"","sfd"),(8, null, 
> "asfasd")).toDF("num","str_num","word").toDF().coalesce(1)
> df.write.mode(SaveMode.Overwrite).csv("/Users/arvind.iyer/abcd.csv")
> {code}
> We are writing the contents of this CSV into a DB, and the contents of that 
> column are going in as "". 
> +Output+ 
> 8,100,sfd
>  0,"",sfd
>  8,"",asfasd
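
A possible workaround sketch, assuming a spark-shell session and the CSV 
{{emptyValue}} write option introduced in Spark 2.4 (the output path is 
illustrative); note that empty strings written this way become indistinguishable 
from nulls in the file:

{code:java}
// Hedged sketch, not a confirmed fix: the CSV writer's default emptyValue is the
// two-character string "", so overriding it writes empty strings as nothing.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq((8, "100", "sfd"), (0, "", "sfd"), (8, null, "asfasd"))
  .toDF("num", "str_num", "word")
  .coalesce(1)

df.write
  .mode(SaveMode.Overwrite)
  .option("emptyValue", "")   // assumption: available since Spark 2.4
  .csv("/tmp/abcd_out.csv")
{code}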






[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763094#comment-16763094
 ] 

Attila Zsolt Piros commented on SPARK-26845:


This also works:
{code}
test("roundtrip in to_avro and from_avro - string") {
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))

val avroDF = df.select(to_avro('str).as("b"))
val avroTypeStr = s"""
  |{
  |   "type": "record",
  |   "name": "topLevelRecord",
  |   "fields": [
  | {
  |   "name": "str",
  |   "type": ["string", "null"]
  | }
  |   ]
  |}""".stripMargin
checkAnswer(
  avroDF.select(from_avro('b, avroTypeStr).as("rec")).select($"rec.str"),
  df)
  }
{code}
I have introduced a topLevelRecord, because a union type at the top level is not 
allowed / not working (good question why); I mean this:
{code:javascript}
  {
"name": "str",
"type": ["string", "null"]
  }
{code}
Throws an exception:
{noformat}
org.apache.avro.SchemaParseException: No type: 
{"name":"str","type":["string","null"]} 
{noformat}

> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}






[jira] [Updated] (SPARK-26192) MesosClusterScheduler reads options from dispatcher conf instead of submission conf

2019-02-07 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-26192:

Description: 
There are at least two options accessed in MesosClusterScheduler that should 
come from the submission's configuration instead of the dispatcher's:

spark.app.name
spark.mesos.fetchCache.enable

This means that all Mesos tasks for Spark drivers have uninformative names of 
the form "Driver for (MainClass)" rather than the configured application name, 
and Spark drivers never cache files unless the caching setting is specified on 
dispatcher as well. Coincidentally, the spark.mesos.fetchCache.enable option 
was misnamed, as referenced in the linked JIRA.

  was:
There are at least two options accessed in MesosClusterScheduler that should 
come from the submission's configuration instead of the dispatcher's:

spark.app.name
spark.mesos.fetchCache.enable

This means that all Mesos tasks for Spark drivers have uninformative names of 
the form "Driver for (MainClass)" rather than the configured application name, 
and Spark drivers never cache files. Coincidentally, the 
spark.mesos.fetchCache.enable option is misnamed, as referenced in the linked 
JIRA.


> MesosClusterScheduler reads options from dispatcher conf instead of 
> submission conf
> ---
>
> Key: SPARK-26192
> URL: https://issues.apache.org/jira/browse/SPARK-26192
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
>
> There are at least two options accessed in MesosClusterScheduler that should 
> come from the submission's configuration instead of the dispatcher's:
> spark.app.name
> spark.mesos.fetchCache.enable
> This means that all Mesos tasks for Spark drivers have uninformative names of 
> the form "Driver for (MainClass)" rather than the configured application 
> name, and Spark drivers never cache files unless the caching setting is 
> specified on dispatcher as well. Coincidentally, the 
> spark.mesos.fetchCache.enable option was misnamed, as referenced in the 
> linked JIRA.






[jira] [Commented] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763066#comment-16763066
 ] 

Attila Zsolt Piros commented on SPARK-26845:


The test would work if you replace the line
{code:java}
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))
{code}
with
{code:java}
val df = spark.range(3).select('id.cast("string").as("str"))
{code}

*And the difference is caused by the nullable flag of the _StructField_.*

For the _Seq_ you used, the schema is:
{code:java}
scala> spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str")).schema 
res0: org.apache.spark.sql.types.StructType = 
StructType(StructField(str,StringType,true))
{code}
And for the range:
{code:java}
scala> spark.range(3).select('id.cast("string").as("str")).schema 
res1: org.apache.spark.sql.types.StructType = 
StructType(StructField(str,StringType,false))
{code}
So in your case the _avroTypeStr_ does not match the data.
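
Putting the two comments together, a short hedged sketch (spark-shell assumed): 
check the nullable flag first, then pick the matching Avro schema, i.e. the 
union-typed record schema from the earlier comment for a nullable column, or a 
plain string schema for a non-nullable one.

{code:java}
// The nullable flag decides which Avro schema matches the serialized data.
import spark.implicits._

val nullableDf    = Seq("1", "2", "3").toDF("str")                      // StructField(str, StringType, true)
val nonNullableDf = spark.range(3).select('id.cast("string").as("str")) // StructField(str, StringType, false)

println(nullableDf.schema("str").nullable)     // true  -> needs ["string", "null"] inside a record schema
println(nonNullableDf.schema("str").nullable)  // false -> a plain string schema matches
{code}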

> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}






[jira] [Updated] (SPARK-26845) Avro to_avro from_avro roundtrip fails if data type is string

2019-02-07 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26845:
--
Summary: Avro to_avro from_avro roundtrip fails if data type is string  
(was: Avro from_avro to_avro roundtrip fails if data type is string)

> Avro to_avro from_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}






[jira] [Created] (SPARK-26846) Empty Strings in dataframe are written as "" in CSV

2019-02-07 Thread Arvind Krishnan Iyer (JIRA)
Arvind Krishnan Iyer created SPARK-26846:


 Summary: Empty Strings in dataframe are written as "" in CSV
 Key: SPARK-26846
 URL: https://issues.apache.org/jira/browse/SPARK-26846
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Arvind Krishnan Iyer


 
{code:java}
import spark.implicits._
val sc = spark.sparkContext
val df = Seq((8,"100","sfd"),(0,"","sfd"),(8, null, 
"asfasd")).toDF("num","str_num","word").toDF().coalesce(1)
df.write.mode(SaveMode.Overwrite).csv("/Users/arvind.iyer/abcd.csv")
{code}
+Output+ 

8,100,sfd
0,"",sfd
8,"",asfasd






[jira] [Updated] (SPARK-26845) Avro from_avro to_avro roundtrip fails if data type is string

2019-02-07 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26845:
--
Labels: correctness  (was: )

> Avro from_avro to_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>  Labels: correctness
>
> I was playing with AvroFunctionsSuite and created a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}






[jira] [Updated] (SPARK-26845) Avro from_avro to_avro roundtrip fails if data type is string

2019-02-07 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26845:
--
Description: 
I was playing with AvroFunctionsSuite and created a situation where test fails 
which I believe it shouldn't:


{code:java}
  test("roundtrip in to_avro and from_avro - string") {
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))

val avroDF = df.select(to_avro('str).as("b"))
val avroTypeStr = s"""
  |{
  |  "type": "string",
  |  "name": "str"
  |}
""".stripMargin
checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
  }
{code}


{code:java}
== Results ==
!== Correct Answer - 3 ==   == Spark Answer - 3 ==
!struct struct
![1][]
![2][]
![3][]
{code}


  was:
I was playing with AvroFunctionsSuite and creates a situation where test fails 
which I believe it shouldn't:


{code:java}
  test("roundtrip in to_avro and from_avro - string") {
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))

val avroDF = df.select(to_avro('str).as("b"))
val avroTypeStr = s"""
  |{
  |  "type": "string",
  |  "name": "str"
  |}
""".stripMargin
checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
  }
{code}


{code:java}
== Results ==
!== Correct Answer - 3 ==   == Spark Answer - 3 ==
!struct struct
![1][]
![2][]
![3][]
{code}



> Avro from_avro to_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> I was playing with AvroFunctionsSuite and created a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}






[jira] [Updated] (SPARK-26845) Avro from_avro to_avro roundtrip fails if data type is string

2019-02-07 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-26845:
--
Description: 
I was playing with AvroFunctionsSuite and creates a situation where test fails 
which I believe it shouldn't:


{code:java}
  test("roundtrip in to_avro and from_avro - string") {
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))

val avroDF = df.select(to_avro('str).as("b"))
val avroTypeStr = s"""
  |{
  |  "type": "string",
  |  "name": "str"
  |}
""".stripMargin
checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
  }
{code}


{code:java}
== Results ==
!== Correct Answer - 3 ==   == Spark Answer - 3 ==
!struct struct
![1][]
![2][]
![3][]
{code}


  was:
I was playing with AvroFunctionsSuite and creates a situation where test fails 
which I believe it shouldn't:


{code:java}
  test("roundtrip in to_avro and from_avro - string") {
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))

val avroDF = df.select(to_avro('str).as("b"))
val avroTypeStr = s"""
  |{
  |  "type": "string",
  |  "name": "str"
  |}
""".stripMargin
checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
  }
{code}



> Avro from_avro to_avro roundtrip fails if data type is string
> -
>
> Key: SPARK-26845
> URL: https://issues.apache.org/jira/browse/SPARK-26845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> I was playing with AvroFunctionsSuite and creates a situation where test 
> fails which I believe it shouldn't:
> {code:java}
>   test("roundtrip in to_avro and from_avro - string") {
> val df = spark.createDataset(Seq("1", "2", 
> "3")).select('value.cast("string").as("str"))
> val avroDF = df.select(to_avro('str).as("b"))
> val avroTypeStr = s"""
>   |{
>   |  "type": "string",
>   |  "name": "str"
>   |}
> """.stripMargin
> checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
>   }
> {code}
> {code:java}
> == Results ==
> !== Correct Answer - 3 ==   == Spark Answer - 3 ==
> !struct struct
> ![1][]
> ![2][]
> ![3][]
> {code}






[jira] [Created] (SPARK-26845) Avro from_avro to_avro roundtrip fails if data type is string

2019-02-07 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-26845:
-

 Summary: Avro from_avro to_avro roundtrip fails if data type is 
string
 Key: SPARK-26845
 URL: https://issues.apache.org/jira/browse/SPARK-26845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: Gabor Somogyi


I was playing with AvroFunctionsSuite and creates a situation where test fails 
which I believe it shouldn't:


{code:java}
  test("roundtrip in to_avro and from_avro - string") {
val df = spark.createDataset(Seq("1", "2", 
"3")).select('value.cast("string").as("str"))

val avroDF = df.select(to_avro('str).as("b"))
val avroTypeStr = s"""
  |{
  |  "type": "string",
  |  "name": "str"
  |}
""".stripMargin
checkAnswer(avroDF.select(from_avro('b, avroTypeStr)), df)
  }
{code}







[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2019-02-07 Thread Jash Gala (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763001#comment-16763001
 ] 

Jash Gala commented on SPARK-10892:
---

 This issue is still reproducible in Spark 2.4.0.

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  6|  TMAX|   

[jira] [Closed] (SPARK-26842) java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26842.
-

>  java.lang.IllegalArgumentException: Unsupported class file major version 55 
> -
>
> Key: SPARK-26842
> URL: https://issues.apache.org/jira/browse/SPARK-26842
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
> Environment: Java:
> openjdk version "11.0.1" 2018-10-16 LTS
> OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS) OpenJDK 64-Bit 
> Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)
>  
> Maven: (Spark Streaming)
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>     <version>2.4.0</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_2.11</artifactId>
>     <version>2.4.0</version>
> </dependency>
>Reporter: Ranjit Hande
>Priority: Major
>
> Getting following Runtime Error with Java 11:
> {"@timestamp":"2019-02-07T11:54:30.624+05:30","@version":"1","message":"Application
>  run 
> failed","logger_name":"org.springframework.boot.SpringApplication","thread_name":"main","level":"ERROR","level_value":4,"stack_trace":"java.lang.IllegalStateException:
>  Failed to execute CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:816)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:324) at 
> com.avaya.measures.AgentMeasures.AgentMeasuresApplication.main(AgentMeasuresApplication.java:41)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
>  at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) at
> org.springframework.boot.loader.Launcher.launch(Launcher.java:50) at 
> org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)\r\n*{color:#FF}Caused
>  by: java.lang.IllegalArgumentException: Unsupported class file major version 
> 55{color}* at 
>  org.apache.xbean.asm6.ClassReader.(ClassReader.java:166) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:148) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:136) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:237) at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) 
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
> scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
>  at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at 
> org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355) at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2326) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2100) at 
> org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1364) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> 

[jira] [Commented] (SPARK-26844) Parquet Reader exception - ArrayIndexOutOfBound should give more information to user

2019-02-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762993#comment-16762993
 ] 

Dongjoon Hyun commented on SPARK-26844:
---

Hi, [~tenstriker].
Please report with a reproducible example.

> Parquet Reader exception - ArrayIndexOutOfBound should give more information 
> to user
> 
>
> Key: SPARK-26844
> URL: https://issues.apache.org/jira/browse/SPARK-26844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: nirav patel
>Priority: Major
>
> I get the following error while reading a parquet file which has primitive 
> datatypes (INT32, binary)
>  
>  
> spark.read.format("parquet").load(path).show() // error happens here
>  
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163)
> at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733)
> at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:203)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
>  
>  
> Point: if an ArrayIndexOutOfBoundsException is raised on a column/field, Spark 
> should say which particular column/field it is; that helps in troubleshooting.
>  
> e.g. I get the following error while reading the same file with the Drill reader.
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: 
> Error reading page data File: /.../../part-00016-0-m-00016.parquet 
> *Column: GROUP_NAME* Row Group Start: 5539 Fragment 0:0 
> I also get more specific information in Drillbit.log






[jira] [Updated] (SPARK-26844) Parquet Reader exception - ArrayIndexOutOfBound should give more information to user

2019-02-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26844:
--
Priority: Minor  (was: Major)

> Parquet Reader exception - ArrayIndexOutOfBound should give more information 
> to user
> 
>
> Key: SPARK-26844
> URL: https://issues.apache.org/jira/browse/SPARK-26844
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.1
>Reporter: nirav patel
>Priority: Minor
>
> I get the following error while reading a parquet file which has primitive 
> datatypes (INT32, binary)
>  
>  
> spark.read.format("parquet").load(path).show() // error happens here
>  
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163)
> at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733)
> at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:203)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
>  
>  
> Point: if an ArrayIndexOutOfBoundsException is raised on a column/field, Spark 
> should say which particular column/field it is; that helps in troubleshooting.
>  
> e.g. I get the following error while reading the same file with the Drill reader.
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: 
> Error reading page data File: /.../../part-00016-0-m-00016.parquet 
> *Column: GROUP_NAME* Row Group Start: 5539 Fragment 0:0 
> I also get more specific information in Drillbit.log






[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2019-02-07 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762991#comment-16762991
 ] 

Thincrs commented on SPARK-26164:
-

A user of thincrs has selected this issue. Deadline: Thu, Feb 14, 2019 7:16 PM

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Spark currently always requires a local sort on the partition/bucket columns 
> before writing to an output table [1]. The disadvantage is that the sort might 
> waste reserved CPU time on the executor due to spill. Hive does not require 
> this local sort before writing the output table [2], and we saw a performance 
> regression when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping between file path and output 
> writer. When a row targets a new file path, we create a new output writer; 
> otherwise we re-use the writer that already exists (the main change would be 
> in FileFormatDataWriter.scala). This is very similar to what Hive does in [2].
> The new behavior (avoiding the sort by keeping multiple output writers open at 
> the same time) consumes more memory on the executor than the current behavior 
> (only one output writer open), so we can add a config to switch between the 
> current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  
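
A conceptual sketch of the proposal above (all names below are made up for 
illustration; the real change would live in FileFormatDataWriter.scala): create 
writers lazily and keep one open writer per partition path, instead of sorting so 
that each path is written contiguously.

{code:java}
// Hedged sketch of "keep the mapping between file path and output writer": rows
// arrive in any order; a writer is created lazily per partition path and re-used,
// trading executor memory (many open writers) for skipping the local sort.
import scala.collection.mutable

trait SimpleWriter { def write(row: String): Unit; def close(): Unit }

class ConcurrentPartitionWriters(newWriter: String => SimpleWriter) {
  private val writers = mutable.Map.empty[String, SimpleWriter]

  def write(partitionPath: String, row: String): Unit =
    writers.getOrElseUpdate(partitionPath, newWriter(partitionPath)).write(row)

  def closeAll(): Unit = writers.values.foreach(_.close())
}
{code}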






[jira] [Resolved] (SPARK-21755) Spark 2.1.1 UI page not displaying any dynamic updates on job progress after showing progress for initial few minutes of job run.

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21755.

Resolution: Not A Bug

I'm closing this for a few reasons.

- There have been a lot of changes in the history server's handling of event logs 
since 2.2 that may make this better.
- From your description, it seems that "UI" here means the history server. It's 
normal for it to not see many updates, depending on how the event logs are 
written.
- For example, if your event logs are on s3 or in some other storage where the 
read side doesn't necessarily see updates from the write side, you can get into 
this situation.

I'd suggest working with the EMR guys first, and if they identify a Spark 
issue, then file a bug here.

> Spark 2.1.1 UI page not displaying any dynamic updates on job progress after 
> showing progress for initial few minutes of job run.
> -
>
> Key: SPARK-21755
> URL: https://issues.apache.org/jira/browse/SPARK-21755
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
> Environment: Issue was produced on an EMR cluster with following 
> configurations:
> ### EMR Release label:  emr-5.6.0
> ### Hadoop distribution:  Amazon 2.7.3
> ### Applications installed:   Hive 2.1.1, Spark 2.1.1
>Reporter: Ankur
>Priority: Major
>
> When a Spark SQL job is run, the Spark application's Web Console (UI) is 
> updated intermittently for the initial few minutes (~10-15 minutes), and 
> after that there are no updates on job progress (even after job execution 
> completes). As soon as the spark-sql session is terminated, the Spark UI 
> gets updated with the job summary.
> Issue was reproduced by using spark-sql on a data-set of around 1.2 TB size. 
> Here are the steps:
> Step 1> An EMR cluster is launched ( release emr-5.6.0 and applications as 
> Hive 2.1.1, Spark 2.1.1 )
> Step 2> The following command is run:
> spark-sql> CREATE TABLE total_flights USING com.databricks.spark.csv OPTIONS 
> (path "s3://bucket/test_web_UI/flight/", header "true", inferSchema "true");
> Data-set used : Flights history in CSV files provided by US Department of 
> Transportation, Bureau of Transportation Statistics - 
> https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236_Short_Name=On-Time
> Step 3> There were no updates on the Web UI after the initial ~10 minutes. The 
> Web UI did not get updated even after a few hours, when the job had completed 
> successfully. 
> Step 4> Once the spark-sql session ended, the Spark UI was updated with the 
> job summary correctly, as expected. 
> I have verified that "spark.history.fs.update.interval" is set to default 
> value of 10 seconds as mentioned in this document 
> "https://spark.apache.org/docs/latest/monitoring.html ".  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19528) external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19528.

Resolution: Duplicate

I believe this is the same issue as SPARK-24355. There's a new configuration 
you can use to reserve some RPC resources in the shuffle service to non-shuffle 
data requests, such as authentication.

> external shuffle service registration timeout is very short with heavy 
> workloads when dynamic allocation is enabled 
> 
>
> Key: SPARK-19528
> URL: https://issues.apache.org/jira/browse/SPARK-19528
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle, Spark Core
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
> Environment: Hadoop2.7.1
> spark1.6.2
> hive2.2
>Reporter: KaiXu
>Priority: Major
> Attachments: SPARK-19528.1.patch, SPARK-19528.1.spark2.patch
>
>
> When dynamic allocation is enabled, the external shuffle service is used to 
> serve shuffle state on behalf of executors. So the external shuffle service 
> should not close a connection while an executor still has outstanding 
> requests (such as its registration) against it.
> container's log:
> {noformat}
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to 
> driver: spark://CoarseGrainedScheduler@192.168.1.1:41867
> 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully 
> registered with driver
> 17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host 
> hsx-node8
> 17/02/09 08:30:46 INFO util.Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374.
> 17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 
> 40374
> 17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port = 
> 7337
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register 
> BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager
> 17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local 
> external shuffle service.
> 17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from hsx-node8/192.168.1.8:7337 is closed
> 17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external 
> shuffle server, will retry 2 more times after waiting 5 seconds...
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215)
>   at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201)
>   at org.apache.spark.executor.Executor.(Executor.scala:86)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
>   at 
> org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
>   at 
> org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274)
>   ... 14 more
> 17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external 
> shuffle server, will retry 1 more times after waiting 5 seconds...
> {noformat}
> nodemanager's log:
> {noformat}
> 2017-02-09 08:30:48,836 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed 
> completed containers from NM context: [container_1486564603520_0097_01_05]
> 

[jira] [Assigned] (SPARK-26707) Insert into table with single struct column fails

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26707:


Assignee: (was: Apache Spark)

> Insert into table with single struct column fails
> -
>
> Key: SPARK-26707
> URL: https://issues.apache.org/jira/browse/SPARK-26707
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.2, 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> This works:
> {noformat}
> scala> sql("select named_struct('d1', 123) c1, 12 
> c2").write.format("parquet").saveAsTable("structtbl2")
> scala> sql("show create table structtbl2").show(truncate=false)
> +---+
> |createtab_stmt |
> +---+
> |CREATE TABLE `structtbl2` (`c1` STRUCT<`d1`: INT>, `c2` INT)
> USING parquet
> |
> +---+
> scala> sql("insert into structtbl2 values (struct(789), 17)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from structtbl2").show
> +-+---+
> |   c1| c2|
> +-+---+
> |[789]| 17|
> |[123]| 12|
> +-+---+
> scala>
> {noformat}
> However, if the table's only column is the struct column, the insert does not 
> work:
> {noformat}
> scala> sql("select named_struct('d1', 123) 
> c1").write.format("parquet").saveAsTable("structtbl1")
> scala> sql("show create table structtbl1").show(truncate=false)
> +-+
> |createtab_stmt   |
> +-+
> |CREATE TABLE `structtbl1` (`c1` STRUCT<`d1`: INT>)
> USING parquet
> |
> +-+
> scala> sql("insert into structtbl1 values (struct(789))")
> org.apache.spark.sql.AnalysisException: cannot resolve '`col1`' due to data 
> type mismatch: cannot cast int to struct;;
> 'InsertIntoHadoopFsRelationCommand 
> file:/Users/brobbins/github/spark_upstream/spark-warehouse/structtbl1, false, 
> Parquet, Map(path -> 
> file:/Users/brobbins/github/spark_upstream/spark-warehouse/structtbl1), 
> Append, CatalogTable(
> ...etc...
> {noformat}
> I can work around it by using a named_struct as the value:
> {noformat}
> scala> sql("insert into structtbl1 values (named_struct('d1',789))")
> res7: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from structtbl1").show
> +-+
> |   c1|
> +-+
> |[789]|
> |[123]|
> +-+
> scala>
> {noformat}
> My guess is that I just don't understand how structs work. But maybe this is 
> a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26707) Insert into table with single struct column fails

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26707:


Assignee: Apache Spark

> Insert into table with single struct column fails
> -
>
> Key: SPARK-26707
> URL: https://issues.apache.org/jira/browse/SPARK-26707
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.2, 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> This works:
> {noformat}
> scala> sql("select named_struct('d1', 123) c1, 12 
> c2").write.format("parquet").saveAsTable("structtbl2")
> scala> sql("show create table structtbl2").show(truncate=false)
> +---+
> |createtab_stmt |
> +---+
> |CREATE TABLE `structtbl2` (`c1` STRUCT<`d1`: INT>, `c2` INT)
> USING parquet
> |
> +---+
> scala> sql("insert into structtbl2 values (struct(789), 17)")
> res2: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from structtbl2").show
> +-+---+
> |   c1| c2|
> +-+---+
> |[789]| 17|
> |[123]| 12|
> +-+---+
> scala>
> {noformat}
> However, if the table's only column is the struct column, the insert does not 
> work:
> {noformat}
> scala> sql("select named_struct('d1', 123) 
> c1").write.format("parquet").saveAsTable("structtbl1")
> scala> sql("show create table structtbl1").show(truncate=false)
> +-+
> |createtab_stmt   |
> +-+
> |CREATE TABLE `structtbl1` (`c1` STRUCT<`d1`: INT>)
> USING parquet
> |
> +-+
> scala> sql("insert into structtbl1 values (struct(789))")
> org.apache.spark.sql.AnalysisException: cannot resolve '`col1`' due to data 
> type mismatch: cannot cast int to struct;;
> 'InsertIntoHadoopFsRelationCommand 
> file:/Users/brobbins/github/spark_upstream/spark-warehouse/structtbl1, false, 
> Parquet, Map(path -> 
> file:/Users/brobbins/github/spark_upstream/spark-warehouse/structtbl1), 
> Append, CatalogTable(
> ...etc...
> {noformat}
> I can work around it by using a named_struct as the value:
> {noformat}
> scala> sql("insert into structtbl1 values (named_struct('d1',789))")
> res7: org.apache.spark.sql.DataFrame = []
> scala> sql("select * from structtbl1").show
> +-+
> |   c1|
> +-+
> |[789]|
> |[123]|
> +-+
> scala>
> {noformat}
> My guess is that I just don't understand how structs work. But maybe this is 
> a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23974) Do not allocate more containers as expected in dynamic allocation

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23974.

Resolution: Not A Problem

Closing based on the above comment.

> Do not allocate more containers as expected in dynamic allocation
> -
>
> Key: SPARK-23974
> URL: https://issues.apache.org/jira/browse/SPARK-23974
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Darcy Shen
>Priority: Major
>
> Using YARN with dynamic allocation enabled, Spark does not allocate more 
> containers when the current number of containers (executors) is less than 
> the maximum executor count.
> For example, we have only 7 executors working while our cluster is not busy, 
> even though I have set
> {{spark.dynamicAllocation.maxExecutors = 600}},
> and the current jobs of the context execute slowly.
>  
> A live case with online logs:
> ```
> $ grep "Not adding executors because our current target total" 
> spark-job-server.log.9 | tail
> [2018-04-12 16:07:19,070] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:20,071] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:21,072] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:22,073] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:23,074] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:24,075] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:25,076] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:26,077] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:27,078] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 16:07:28,079] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> $ grep "Not adding executors because our current target total" 
> spark-job-server.log.9 | head
> [2018-04-12 13:52:18,067] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:19,071] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:20,072] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:21,073] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:22,074] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:23,075] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:24,076] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:25,077] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:26,078] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - Not adding executors because our current 
> target total is already 600 (limit 600)
> [2018-04-12 13:52:27,079] DEBUG .ExecutorAllocationManager [] 
> [akka://JobServer/user/jobManager] - 

[jira] [Assigned] (SPARK-21097) Dynamic allocation will preserve cached data

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21097:


Assignee: Apache Spark

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Assignee: Apache Spark
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, such as an error, a timeout, 
> or insufficient space, then we will fall back to the original behavior: drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.
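A rough sketch of the decision path described above, assuming hypothetical helpers (hasCachedBlocks, replicateBlocks, and killExecutor are placeholders, not existing Spark APIs):

{noformat}
// Illustrative only: what happens when an executor reaches its idle timeout.
def onIdleTimeout(
    executorId: String,
    preserveCachedData: Boolean,           // the proposed opt-in setting
    hasCachedBlocks: String => Boolean,
    replicateBlocks: String => Boolean,    // false on error, timeout, or no space
    killExecutor: String => Unit): Unit = {
  if (preserveCachedData && hasCachedBlocks(executorId)) {
    if (!replicateBlocks(executorId)) {
      // Fall back to the original behavior: the cached data is simply dropped.
      println(s"Could not replicate blocks from $executorId; dropping cached data")
    }
  }
  killExecutor(executorId)
}
{noformat}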



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16158) Support pluggable dynamic allocation heuristics

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-16158:
--

Assignee: Marcelo Vanzin

> Support pluggable dynamic allocation heuristics
> ---
>
> Key: SPARK-16158
> URL: https://issues.apache.org/jira/browse/SPARK-16158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>Assignee: Marcelo Vanzin
>Priority: Major
>
> It would be nice if Spark supports plugging in custom dynamic allocation 
> heuristics. This feature would be useful for experimenting with new 
> heuristics and also useful for plugging in different heuristics per job etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21097) Dynamic allocation will preserve cached data

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21097:


Assignee: (was: Apache Spark)

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, such as an error, a timeout, 
> or insufficient space, then we will fall back to the original behavior: drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16158) Support pluggable dynamic allocation heuristics

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-16158:
--

Assignee: (was: Marcelo Vanzin)

> Support pluggable dynamic allocation heuristics
> ---
>
> Key: SPARK-16158
> URL: https://issues.apache.org/jira/browse/SPARK-16158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Nezih Yigitbasi
>Priority: Major
>
> It would be nice if Spark supports plugging in custom dynamic allocation 
> heuristics. This feature would be useful for experimenting with new 
> heuristics and also useful for plugging in different heuristics per job etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26264) It is better to add @transient to field 'locs' for class `ResultTask`.

2019-02-07 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762931#comment-16762931
 ] 

Thincrs commented on SPARK-26264:
-

A user of thincrs has selected this issue. Deadline: Thu, Feb 14, 2019 6:06 PM

> It is better to add @transient to field 'locs'  for class `ResultTask`.
> ---
>
> Key: SPARK-26264
> URL: https://issues.apache.org/jira/browse/SPARK-26264
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> The field 'locs' of class `ResultTask` is only used on the driver side, so it 
> does not need to be serialized when sending the `ResultTask` to an executor.
> Although it is not very big, it is serialized very frequently, so we can add 
> `@transient` to it, as is already done in `ShuffleMapTask`.
>  
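For illustration only, a toy class (not the real ResultTask) showing that an @transient field is skipped by Java serialization and comes back as null, which is harmless when only the driver reads it:

{noformat}
import java.io._

class MyTask(val partitionId: Int, @transient val locs: Seq[String])
  extends Serializable

// Round-trip through Java serialization, as happens when a task is shipped.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(new MyTask(0, Seq("host1", "host2")))
oos.close()

val restored = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  .readObject().asInstanceOf[MyTask]
// restored.partitionId == 0, restored.locs == null: the field was not serialized.
{noformat}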



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21097) Dynamic allocation will preserve cached data

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-21097:
--

Assignee: (was: Marcelo Vanzin)

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, such as an error, a timeout, 
> or insufficient space, then we will fall back to the original behavior: drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21097) Dynamic allocation will preserve cached data

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-21097:
--

Assignee: Marcelo Vanzin

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Assignee: Marcelo Vanzin
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, such as an error, a timeout, 
> or insufficient space, then we will fall back to the original behavior: drop 
> the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26817) Use System.nanoTime to measure time intervals

2019-02-07 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762926#comment-16762926
 ] 

Thincrs commented on SPARK-26817:
-

A user of thincrs has selected this issue. Deadline: Thu, Feb 14, 2019 6:03 PM

> Use System.nanoTime to measure time intervals
> -
>
> Key: SPARK-26817
> URL: https://issues.apache.org/jira/browse/SPARK-26817
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Replace System.currentTimeMillis() with System.nanoTime() in time-interval 
> measurements. System.currentTimeMillis() returns the current wallclock time and 
> will follow changes to the system clock. Thus, negative wallclock adjustments 
> can cause timeouts to "hang" for a long time (until wallclock time has caught 
> up to its previous value again). This can happen when ntpd does a "step" 
> after the network has been disconnected for some time. The most canonical 
> example is during system bootup when DHCP takes longer than usual. This can 
> lead to failures that are really hard to understand/reproduce. 
> System.nanoTime() is guaranteed to be monotonically increasing irrespective 
> of wallclock changes.
>  
> https://github.com/databricks/scala-style-guide#prefer-nanotime-over-currenttimemillis
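A small sketch of the pattern being proposed, measuring an interval with System.nanoTime (monotonic) instead of System.currentTimeMillis (wallclock):

{noformat}
import java.util.concurrent.TimeUnit

// Returns the body's result and its elapsed time in milliseconds,
// unaffected by NTP steps or other wallclock adjustments.
def timed[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
  (result, elapsedMs)
}

val (value, ms) = timed { Thread.sleep(50); 42 }
// ms is roughly 50 even if the system clock jumps backwards meanwhile.
{noformat}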



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26266) Update to Scala 2.12.8

2019-02-07 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762918#comment-16762918
 ] 

Thincrs commented on SPARK-26266:
-

A user of thincrs has selected this issue. Deadline: Thu, Feb 14, 2019 5:58 PM

> Update to Scala 2.12.8
> --
>
> Key: SPARK-26266
> URL: https://issues.apache.org/jira/browse/SPARK-26266
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Yuming Wang
>Priority: Minor
>  Labels: release-notes
>
> [~yumwang] notes that Scala 2.12.8 is out and fixes two minor issues:
> Don't reject views with result types which are TypeVars (#7295)
> Don't emit static forwarders (which simplify the use of methods in top-level 
> objects from Java) for bridge methods (#7469)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26822) Upgrade the deprecated module 'optparse'

2019-02-07 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762910#comment-16762910
 ] 

Thincrs commented on SPARK-26822:
-

A user of thincrs has selected this issue. Deadline: Thu, Feb 14, 2019 5:56 PM

> Upgrade the deprecated module 'optparse'
> 
>
> Key: SPARK-26822
> URL: https://issues.apache.org/jira/browse/SPARK-26822
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Neil Chien
>Priority: Minor
>  Labels: pull-request-available, test
>
> Follow the [official 
> document|https://docs.python.org/2/library/argparse.html#upgrading-optparse-code]
>  to upgrade the deprecated module 'optparse' to 'argparse'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26842) java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26842.

Resolution: Invalid

Java 11 is not supported in 2.4.

>  java.lang.IllegalArgumentException: Unsupported class file major version 55 
> -
>
> Key: SPARK-26842
> URL: https://issues.apache.org/jira/browse/SPARK-26842
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
> Environment: Java:
> openjdk version "11.0.1" 2018-10-16 LTS
> OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS) OpenJDK 64-Bit 
> Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)
>  
> Maven: (Spark Streaming)
> 
>     org.apache.spark
>     spark-streaming-kafka-0-10_2.11
>     2.4.0
> 
> 
>     org.apache.spark
>     spark-streaming_2.11
>     2.4.0
> 
>Reporter: Ranjit Hande
>Priority: Major
>
> Getting following Runtime Error with Java 11:
> {"@timestamp":"2019-02-07T11:54:30.624+05:30","@version":"1","message":"Application
>  run 
> failed","logger_name":"org.springframework.boot.SpringApplication","thread_name":"main","level":"ERROR","level_value":4,"stack_trace":"java.lang.IllegalStateException:
>  Failed to execute CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:816)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:324) at 
> com.avaya.measures.AgentMeasures.AgentMeasuresApplication.main(AgentMeasuresApplication.java:41)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
>  at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) at
> org.springframework.boot.loader.Launcher.launch(Launcher.java:50) at 
> org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)\r\n*{color:#FF}Caused
>  by: java.lang.IllegalArgumentException: Unsupported class file major version 
> 55{color}* at 
>  org.apache.xbean.asm6.ClassReader.(ClassReader.java:166) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:148) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:136) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:237) at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) 
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
> scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
>  at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at 
> org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355) at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2326) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2100) at 
> org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1364) at 
> 

[jira] [Updated] (SPARK-2387) Remove the stage barrier for better resource utilization

2019-02-07 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-2387:

Component/s: Scheduler

> Remove the stage barrier for better resource utilization
> 
>
> Key: SPARK-2387
> URL: https://issues.apache.org/jira/browse/SPARK-2387
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Reporter: Rui Li
>Priority: Major
>
> DAGScheduler divides a Spark job into multiple stages according to RDD 
> dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a 
> shuffle map stage on the map side, and another stage depending on that stage.
> Currently, the downstream stage cannot start until all the stages it depends 
> on have finished. This barrier between stages leads to idle slots while 
> waiting for the last few upstream tasks to finish, and thus wastes cluster 
> resources.
> Therefore we propose to remove the barrier and pre-start the reduce stage 
> once there are free slots. This can achieve better resource utilization and 
> improve overall job performance, especially when many executors are granted 
> to the application.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26842) java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-07 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762749#comment-16762749
 ] 

Gabor Somogyi commented on SPARK-26842:
---

Please see the responses provided by [~kabhwan] and me on the user list.

>  java.lang.IllegalArgumentException: Unsupported class file major version 55 
> -
>
> Key: SPARK-26842
> URL: https://issues.apache.org/jira/browse/SPARK-26842
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
> Environment: Java:
> openjdk version "11.0.1" 2018-10-16 LTS
> OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS) OpenJDK 64-Bit 
> Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)
>  
> Maven: (Spark Streaming)
> 
>     org.apache.spark
>     spark-streaming-kafka-0-10_2.11
>     2.4.0
> 
> 
>     org.apache.spark
>     spark-streaming_2.11
>     2.4.0
> 
>Reporter: Ranjit Hande
>Priority: Major
>
> Getting following Runtime Error with Java 11:
> {"@timestamp":"2019-02-07T11:54:30.624+05:30","@version":"1","message":"Application
>  run 
> failed","logger_name":"org.springframework.boot.SpringApplication","thread_name":"main","level":"ERROR","level_value":4,"stack_trace":"java.lang.IllegalStateException:
>  Failed to execute CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:816)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:324) at 
> com.avaya.measures.AgentMeasures.AgentMeasuresApplication.main(AgentMeasuresApplication.java:41)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
>  at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) at
> org.springframework.boot.loader.Launcher.launch(Launcher.java:50) at 
> org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)\r\n*{color:#FF}Caused
>  by: java.lang.IllegalArgumentException: Unsupported class file major version 
> 55{color}* at 
>  org.apache.xbean.asm6.ClassReader.(ClassReader.java:166) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:148) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:136) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:237) at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) 
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
> scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
>  at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at 
> org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355) at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2326) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2100) at 
> 

[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2019-02-07 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762756#comment-16762756
 ] 

Gabor Somogyi commented on SPARK-23685:
---

[~sindiri] gentle ping.

> Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive 
> Offsets (i.e. Log Compaction)
> -
>
> Key: SPARK-23685
> URL: https://issues.apache.org/jira/browse/SPARK-23685
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: sirisha
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always 
> be just an increment of 1. If not, it throws the below exception:
>  
> "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). 
> Some data may have been lost because they are not available in Kafka any 
> more; either the data was aged out by Kafka or the topic may have been 
> deleted before all the data in the topic was processed. If you don't want 
> your streaming query to fail on such cases, set the source option 
> "failOnDataLoss" to "false". "
>  
> FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147
>  
>  
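A minimal sketch of the difference, using a simplified record type rather than the real CachedKafkaConsumer code: with log compaction the consumer has to resume from whatever offset the fetched record actually carries instead of insisting on offset + 1.

{noformat}
// Simplified stand-in for Kafka's ConsumerRecord, for illustration only.
case class Record(offset: Long, value: String)

// Roughly the fragile assumption: offsets must be consecutive.
def nextOffsetStrict(expected: Long, rec: Record): Long = {
  require(rec.offset == expected, s"Cannot fetch records at offset $expected")
  rec.offset + 1
}

// Compaction-tolerant version: gaps are fine, resume after the observed offset.
def nextOffsetTolerant(expected: Long, rec: Record): Long = {
  // rec.offset may be > expected when compaction removed intermediate records.
  rec.offset + 1
}
{noformat}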



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26842) java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-07 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762754#comment-16762754
 ] 

Gabor Somogyi commented on SPARK-26842:
---

Have you tried it out [~ranjit_hande]?

>  java.lang.IllegalArgumentException: Unsupported class file major version 55 
> -
>
> Key: SPARK-26842
> URL: https://issues.apache.org/jira/browse/SPARK-26842
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
> Environment: Java:
> openjdk version "11.0.1" 2018-10-16 LTS
> OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS) OpenJDK 64-Bit 
> Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)
>  
> Maven: (Spark Streaming)
> 
>     org.apache.spark
>     spark-streaming-kafka-0-10_2.11
>     2.4.0
> 
> 
>     org.apache.spark
>     spark-streaming_2.11
>     2.4.0
> 
>Reporter: Ranjit Hande
>Priority: Major
>
> Getting following Runtime Error with Java 11:
> {"@timestamp":"2019-02-07T11:54:30.624+05:30","@version":"1","message":"Application
>  run 
> failed","logger_name":"org.springframework.boot.SpringApplication","thread_name":"main","level":"ERROR","level_value":4,"stack_trace":"java.lang.IllegalStateException:
>  Failed to execute CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:816)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:324) at 
> com.avaya.measures.AgentMeasures.AgentMeasuresApplication.main(AgentMeasuresApplication.java:41)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
>  at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) at
> org.springframework.boot.loader.Launcher.launch(Launcher.java:50) at 
> org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)\r\n*{color:#FF}Caused
>  by: java.lang.IllegalArgumentException: Unsupported class file major version 
> 55{color}* at 
>  org.apache.xbean.asm6.ClassReader.(ClassReader.java:166) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:148) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:136) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:237) at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) 
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
> scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
>  at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at 
> org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355) at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2326) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2100) at 
> org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1364) at 
> 

[jira] [Commented] (SPARK-26842) java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-07 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762753#comment-16762753
 ] 

Gabor Somogyi commented on SPARK-26842:
---

{quote}so you may need to use lower version of JDK (8 safest) for Spark 2.4.0, 
and try out master branch for preparing Java 11.{quote}

>  java.lang.IllegalArgumentException: Unsupported class file major version 55 
> -
>
> Key: SPARK-26842
> URL: https://issues.apache.org/jira/browse/SPARK-26842
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.0
> Environment: Java:
> openjdk version "11.0.1" 2018-10-16 LTS
> OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS) OpenJDK 64-Bit 
> Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)
>  
> Maven: (Spark Streaming)
> 
>     org.apache.spark
>     spark-streaming-kafka-0-10_2.11
>     2.4.0
> 
> 
>     org.apache.spark
>     spark-streaming_2.11
>     2.4.0
> 
>Reporter: Ranjit Hande
>Priority: Major
>
> Getting following Runtime Error with Java 11:
> {"@timestamp":"2019-02-07T11:54:30.624+05:30","@version":"1","message":"Application
>  run 
> failed","logger_name":"org.springframework.boot.SpringApplication","thread_name":"main","level":"ERROR","level_value":4,"stack_trace":"java.lang.IllegalStateException:
>  Failed to execute CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:816)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:324) at 
> com.avaya.measures.AgentMeasures.AgentMeasuresApplication.main(AgentMeasuresApplication.java:41)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
>  at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) at
> org.springframework.boot.loader.Launcher.launch(Launcher.java:50) at 
> org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)\r\n*{color:#FF}Caused
>  by: java.lang.IllegalArgumentException: Unsupported class file major version 
> 55{color}* at 
>  org.apache.xbean.asm6.ClassReader.(ClassReader.java:166) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:148) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:136) at 
> org.apache.xbean.asm6.ClassReader.(ClassReader.java:237) at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) 
> at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
>  at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
>  at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
> scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>  at 
> org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
>  at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at 
> org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631) at 
> org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355) at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
>  at 
> org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
> org.apache.spark.SparkContext.clean(SparkContext.scala:2326) at 
> 

[jira] [Created] (SPARK-26844) Parquet Reader exception - ArrayIndexOutOfBound should give more information to user

2019-02-07 Thread nirav patel (JIRA)
nirav patel created SPARK-26844:
---

 Summary: Parquet Reader exception - ArrayIndexOutOfBound should 
give more information to user
 Key: SPARK-26844
 URL: https://issues.apache.org/jira/browse/SPARK-26844
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1, 2.2.1
Reporter: nirav patel


I get the following error while reading a Parquet file that has primitive 
datatypes (INT32, binary):

 

 

spark.read.format("parquet").load(path).show() // error happens here

 

Caused by: java.lang.ArrayIndexOutOfBoundsException

at java.lang.System.arraycopy(Native Method)

at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putBytes(OnHeapColumnVector.java:163)

at 
org.apache.spark.sql.execution.vectorized.ColumnVector.appendBytes(ColumnVector.java:733)

at 
org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putByteArray(OnHeapColumnVector.java:410)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReader.readBinary(VectorizedPlainValuesReader.java:167)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readBinarys(VectorizedRleValuesReader.java:402)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:419)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:203)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)

at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)

at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)

at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)

at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)

at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)

at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)

at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)

at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)

at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)

at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)

at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:108)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

 

 

 

Point: if an ArrayIndexOutOfBoundsException is raised on a column/field, Spark should 
say which particular column/field it is. That helps in troubleshooting.

 

e.g. I get the following error while reading the same file using the Drill reader.

org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: Error 
reading page data File: /.../../part-00016-0-m-00016.parquet *Column: 
GROUP_NAME* Row Group Start: 5539 Fragment 0:0 

I also get more specific information in Drillbit.log
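
A rough way to narrow down the offending column today, before the error message itself 
is improved, is to force a scan of each column separately (a minimal troubleshooting 
sketch, assuming a spark-shell session with {{spark}} in scope and {{path}} pointing at 
the problematic file; the loop is illustrative and not part of any Spark API):

{code:scala}
// Hypothetical sketch: materialize one column at a time so the
// ArrayIndexOutOfBoundsException can be attributed to a single field.
val df = spark.read.format("parquet").load(path)
df.schema.fieldNames.foreach { name =>
  try {
    df.select(name).foreach(_ => ())   // full scan of just this column
    println(s"column $name read OK")
  } catch {
    case e: Throwable => println(s"column $name failed: ${e.getMessage}")
  }
}
{code}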



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26843) Use ConfigEntry for hardcoded configs for "mesos" resource manager

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26843:


Assignee: (was: Apache Spark)

> Use ConfigEntry for hardcoded configs for "mesos" resource manager
> --
>
> Key: SPARK-26843
> URL: https://issues.apache.org/jira/browse/SPARK-26843
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Make the hardcoded configs in the "mesos" module use ConfigEntry.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26843) Use ConfigEntry for hardcoded configs for "mesos" resource manager

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26843:


Assignee: Apache Spark

> Use ConfigEntry for hardcoded configs for "mesos" resource manager
> --
>
> Key: SPARK-26843
> URL: https://issues.apache.org/jira/browse/SPARK-26843
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Make the hardcoded configs in the "mesos" module use ConfigEntry.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26843) Use ConfigEntry for hardcoded configs for "mesos" resource manager

2019-02-07 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-26843:


 Summary: Use ConfigEntry for hardcoded configs for "mesos" 
resource manager
 Key: SPARK-26843
 URL: https://issues.apache.org/jira/browse/SPARK-26843
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


Make the hardcoded configs in the "mesos" module use ConfigEntry.
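
For reference, the ConfigEntry pattern being asked for looks roughly like this (a 
sketch only; the package, object name and chosen key are illustrative, not the actual 
change):

{code:scala}
package org.apache.spark.deploy.mesos   // illustrative location

import org.apache.spark.internal.config.ConfigBuilder

private[spark] object MesosConfigs {
  // One typed, documented entry replaces a string literal repeated at call sites.
  val FETCHER_CACHE_ENABLE =
    ConfigBuilder("spark.mesos.fetcherCache.enable")
      .doc("If true, downloaded URIs are cached by the Mesos fetcher cache.")
      .booleanConf
      .createWithDefault(false)
}

// Call sites then read conf.get(MesosConfigs.FETCHER_CACHE_ENABLE)
// instead of conf.getBoolean("spark.mesos.fetcherCache.enable", false).
{code}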



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26842) java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-07 Thread Ranjit Hande (JIRA)
Ranjit Hande created SPARK-26842:


 Summary:  java.lang.IllegalArgumentException: Unsupported class 
file major version 55 
 Key: SPARK-26842
 URL: https://issues.apache.org/jira/browse/SPARK-26842
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.4.0
 Environment: Java:

openjdk version "11.0.1" 2018-10-16 LTS

OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS) OpenJDK 64-Bit 
Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)

 

Maven: (Spark Streaming)

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.0</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.0</version>
</dependency>

Reporter: Ranjit Hande


Getting the following runtime error with Java 11:

{"@timestamp":"2019-02-07T11:54:30.624+05:30","@version":"1","message":"Application
 run 
failed","logger_name":"org.springframework.boot.SpringApplication","thread_name":"main","level":"ERROR","level_value":4,"stack_trace":"java.lang.IllegalStateException:
 Failed to execute CommandLineRunner at 
org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:816)
 at 
org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:324) 
at 
com.avaya.measures.AgentMeasures.AgentMeasuresApplication.main(AgentMeasuresApplication.java:41)
 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48) 
at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) at

org.springframework.boot.loader.Launcher.launch(Launcher.java:50) at 
org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)
*Caused by: java.lang.IllegalArgumentException: Unsupported class file major version 55* at 
org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166) at 
org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148) at 
org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136) at 
org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237) at 
org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) 
at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
 at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
 at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
 at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) 
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) 
at 
org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
 at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at 
org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at 
org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631) at 
org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355) at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
 at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
 at scala.collection.immutable.List.foreach(List.scala:392) at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at 
org.apache.spark.SparkContext.clean(SparkContext.scala:2326) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2100) at 
org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1364) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at 
org.apache.spark.rdd.RDD.take(RDD.scala:1337) at 
org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$3$1.apply(DStream.scala:735)
 at 
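
Class file major version 55 corresponds to Java 11 bytecode, which the ASM 6 
ClassReader shaded into Spark 2.4.x (used by ClosureCleaner, as the trace shows) 
refuses to parse; Spark 2.4.x does not yet support running on Java 11. A small 
self-contained sketch for checking which Java release produced a suspect class file 
(the helper name and example path are illustrative):

{code:scala}
import java.io.DataInputStream
import java.nio.file.{Files, Paths}

// Reads the class file header: magic 0xCAFEBABE, then minor and major version.
// Major 52 = Java 8, 53 = Java 9, 54 = Java 10, 55 = Java 11.
def classFileMajorVersion(path: String): Int = {
  val in = new DataInputStream(Files.newInputStream(Paths.get(path)))
  try {
    require(in.readInt() == 0xCAFEBABE, s"$path is not a class file")
    in.readUnsignedShort()   // skip minor version
    in.readUnsignedShort()   // major version is the result
  } finally {
    in.close()
  }
}

// e.g. classFileMajorVersion("target/classes/com/example/App.class")
{code}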

[jira] [Updated] (SPARK-26841) Timestamp pushdown on Kafka table

2019-02-07 Thread Tomas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas updated SPARK-26841:
--
Description: 
As a Spark user I'd like to have fast queries on Kafka table restricted by 
timestamp.

I'd like to have quick answers on questions like:
 * What was inserted to Kafka in past x minutes
 * What was inserted to Kafka in specified time range

Example:
{quote}select * from kafka_table where timestamp > 
from_unixtime(unix_timestamp() - 5 * 60, "-MM-dd HH:mm:ss")

select * from kafka_table where timestamp > $from_time and timestamp < $end_time
{quote}
Currently, timestamp restrictions are not pushed down to KafkaRelation, and querying 
by timestamp on a large Kafka topic takes forever to complete.

*Technical solution*

Technically, it's possible to retrieve Kafka's offsets for a provided timestamp with 
the org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method. 
Afterwards, we can query the Kafka topic over the offset range corresponding to the 
requested timestamps.

Querying by timestamp range is already implemented so this change should have 
minor impact.

  was:
As a Spark user I'd like to have fast queries on Kafka table restricted by 
timestamp.

I'd like to have quick answers on questions like "What was inserted in Kafka in 
past x minutes", "what was inserted in Kafka in specified time range", ...

Example:
{quote}select * from kafka_table where timestamp > 
from_unixtime(unix_timestamp() - 5 * 60, "-MM-dd HH:mm:ss")

select * from kafka_table where timestamp > $from_time and timestamp < $end_time
{quote}
Currently timestamp restrictions is not pushdown to KafkaRelation and querying 
by timestamp on a large Kafka topic takes forever to complete.

*Technical solution*

Technically its possible to retrieve Kafka's offsets by provided timestamp with 
org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method. 
Afterwards we can query Kafka topic by retrieved timestamp ranges.

Querying by timestamp range is already implemented so this change should have 
just a minor impact.


> Timestamp pushdown on Kafka table
> -
>
> Key: SPARK-26841
> URL: https://issues.apache.org/jira/browse/SPARK-26841
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Tomas
>Priority: Major
>  Labels: Kafka, pushdown, timestamp
>
> As a Spark user I'd like to have fast queries on Kafka table restricted by 
> timestamp.
> I'd like to have quick answers on questions like:
>  * What was inserted to Kafka in past x minutes
>  * What was inserted to Kafka in specified time range
> Example:
> {quote}select * from kafka_table where timestamp > 
> from_unixtime(unix_timestamp() - 5 * 60, "-MM-dd HH:mm:ss")
> select * from kafka_table where timestamp > $from_time and timestamp < 
> $end_time
> {quote}
> Currently, timestamp restrictions are not pushed down to KafkaRelation, and 
> querying by timestamp on a large Kafka topic takes forever to complete.
> *Technical solution*
> Technically, it's possible to retrieve Kafka's offsets for a provided timestamp 
> with the org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method. 
> Afterwards, we can query the Kafka topic over the offset range corresponding to 
> the requested timestamps.
> Querying by timestamp range is already implemented so this change should have 
> minor impact.
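
To make the proposal concrete, this is roughly what a timestamp pushdown would do 
under the hood with the plain consumer API (a standalone sketch, not Spark code; the 
broker address and topic name are placeholders):

{code:scala}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val fromTime = System.currentTimeMillis() - 5 * 60 * 1000L   // "past 5 minutes"

// Ask every partition of the topic for its first offset at or after fromTime.
val partitions = consumer.partitionsFor("kafka_table").asScala
  .map(p => new TopicPartition(p.topic, p.partition))
val query = partitions.map(tp => tp -> java.lang.Long.valueOf(fromTime)).toMap.asJava

// A KafkaRelation could then scan only from these offsets instead of from the
// beginning of the topic.
consumer.offsetsForTimes(query).asScala.foreach { case (tp, oat) =>
  val offset = Option(oat).map(_.offset.toString).getOrElse("no data after timestamp")
  println(s"$tp -> $offset")
}
consumer.close()
{code}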



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26841) Timestamp pushdown on Kafka table

2019-02-07 Thread Tomas (JIRA)
Tomas created SPARK-26841:
-

 Summary: Timestamp pushdown on Kafka table
 Key: SPARK-26841
 URL: https://issues.apache.org/jira/browse/SPARK-26841
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.5.0
Reporter: Tomas


As a Spark user I'd like to have fast queries on Kafka table restricted by 
timestamp.

I'd like to have quick answers on questions like "What was inserted in Kafka in 
past x minutes", "what was inserted in Kafka in specified time range", ...

Example:
{quote}select * from kafka_table where timestamp > 
from_unixtime(unix_timestamp() - 5 * 60, "-MM-dd HH:mm:ss")

select * from kafka_table where timestamp > $from_time and timestamp < $end_time
{quote}
Currently timestamp restrictions is not pushdown to KafkaRelation and querying 
by timestamp on a large Kafka topic takes forever to complete.

*Technical solution*

Technically its possible to retrieve Kafka's offsets by provided timestamp with 
org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method. 
Afterwards we can query Kafka topic by retrieved timestamp ranges.

Querying by timestamp range is already implemented so this change should have 
just a minor impact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26841) Timestamp pushdown on Kafka table

2019-02-07 Thread Tomas (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas updated SPARK-26841:
--
Affects Version/s: (was: 2.5.0)
   2.4.0

> Timestamp pushdown on Kafka table
> -
>
> Key: SPARK-26841
> URL: https://issues.apache.org/jira/browse/SPARK-26841
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Tomas
>Priority: Major
>  Labels: Kafka, pushdown, timestamp
>
> As a Spark user I'd like to have fast queries on Kafka table restricted by 
> timestamp.
> I'd like to have quick answers on questions like "What was inserted in Kafka 
> in past x minutes", "what was inserted in Kafka in specified time range", ...
> Example:
> {quote}select * from kafka_table where timestamp > 
> from_unixtime(unix_timestamp() - 5 * 60, "-MM-dd HH:mm:ss")
> select * from kafka_table where timestamp > $from_time and timestamp < 
> $end_time
> {quote}
> Currently timestamp restrictions is not pushdown to KafkaRelation and 
> querying by timestamp on a large Kafka topic takes forever to complete.
> *Technical solution*
> Technically its possible to retrieve Kafka's offsets by provided timestamp 
> with org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method. 
> Afterwards we can query Kafka topic by retrieved timestamp ranges.
> Querying by timestamp range is already implemented so this change should have 
> just a minor impact.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26835) Document configuration properties of Spark SQL Generic Load/Save Functions

2019-02-07 Thread Peter Horvath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762584#comment-16762584
 ] 

Peter Horvath edited comment on SPARK-26835 at 2/7/19 11:33 AM:


This might be trivial / self-explanatory if you have been working with Spark 
for a while, but for newcomers, it's a bit confusing.

Having better / more detailed docs is always good. :)

 

I've opened a pull request for this: 
[https://github.com/apache/spark/pull/23742]

 


was (Author: peter.gergely.horv...@gmail.com):
This might be trivial / self-explanatory if you have been working with Spark 
for a while, but for newcomers, it's a bit confusing.

Having better / more details docs is always good. :)

 

I've opened a pull request for this: 
[https://github.com/apache/spark/pull/23742]

 

> Document configuration properties of Spark SQL Generic Load/Save Functions
> --
>
> Key: SPARK-26835
> URL: https://issues.apache.org/jira/browse/SPARK-26835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Peter Horvath
>Priority: Major
>
> Currently the [Generic Load/Save Functions section of Spark SQL 
> documentation|https://spark.apache.org/docs/2.4.0/sql-data-sources-load-save-functions.html]
>  
>  does not explain the available configuration properties at all.
> Neither the available formats nor their configuration properties are listed 
> properly: there are some usage samples, but that is really all.
> For some formats, there is a remark to visit the site of the provider; quote: 
> "_To find more detailed information about the extra ORC/Parquet options, 
> visit the official Apache ORC/Parquet websites."_
> However, this is not applicable for all format providers; for example there 
> is not even a hint regarding the CSV writer's configuration properties or 
> where they can be looked up.
> Please add documentation regarding the configuration properties. Either copy 
> over documentation completely, or link.
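
For context, the options in question are the per-format keys passed through 
{{option}}/{{options}} on DataFrameReader/DataFrameWriter. A small sketch for the CSV 
case mentioned above (assuming a SparkSession named {{spark}} and an output path 
{{out}}; the option values are only examples):

{code:scala}
// "header" and "sep" are CSV-specific options; format/mode/save are the generic
// load/save knobs that the linked documentation page already covers.
spark.range(5).selectExpr("id", "id * 2 AS doubled")
  .write
  .format("csv")
  .option("header", "true")
  .option("sep", ";")
  .mode("overwrite")
  .save(out)
{code}

The ask here is that option keys like these be listed, or at least linked, from the 
generic load/save page for every built-in format.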



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26835) Document configuration properties of Spark SQL Generic Load/Save Functions

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26835:


Assignee: Apache Spark

> Document configuration properties of Spark SQL Generic Load/Save Functions
> --
>
> Key: SPARK-26835
> URL: https://issues.apache.org/jira/browse/SPARK-26835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Peter Horvath
>Assignee: Apache Spark
>Priority: Major
>
> Currently the [Generic Load/Save Functions section of Spark SQL 
> documentation|https://spark.apache.org/docs/2.4.0/sql-data-sources-load-save-functions.html]
>  
>  does not explain the available configuration properties at all.
> Neither the available formats nor their configuration properties are listed 
> properly: there are some usage samples, but that is really all.
> For some formats, there is a remark to visit the site of the provider; quote: 
> "_To find more detailed information about the extra ORC/Parquet options, 
> visit the official Apache ORC/Parquet websites."_
> However, this is not applicable for all format providers; for example there 
> is not even a hint regarding the CSV writer's configuration properties or 
> where they can be looked up.
> Please add documentation regarding the configuration properties. Either copy 
> over documentation completely, or link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26835) Document configuration properties of Spark SQL Generic Load/Save Functions

2019-02-07 Thread Peter Horvath (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762584#comment-16762584
 ] 

Peter Horvath commented on SPARK-26835:
---

This might be trivial / self-explanatory if you have been working with Spark 
for a while, but for newcomers, it's a bit confusing.

Having better / more details docs is always good. :)

 

I've opened a pull request for this: 
[https://github.com/apache/spark/pull/23742]

 

> Document configuration properties of Spark SQL Generic Load/Save Functions
> --
>
> Key: SPARK-26835
> URL: https://issues.apache.org/jira/browse/SPARK-26835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Peter Horvath
>Priority: Major
>
> Currently the [Generic Load/Save Functions section of Spark SQL 
> documentation|https://spark.apache.org/docs/2.4.0/sql-data-sources-load-save-functions.html]
>  
>  does not explain the available configuration properties at all.
> Neither the available formats nor their configuration properties are listed 
> properly: there are some usage samples, but that is really all.
> For some formats, there is a remark to visit the site of the provider; quote: 
> "_To find more detailed information about the extra ORC/Parquet options, 
> visit the official Apache ORC/Parquet websites."_
> However, this is not applicable for all format providers; for example there 
> is not even a hint regarding the CSV writer's configuration properties or 
> where they can be looked up.
> Please add documentation regarding the configuration properties. Either copy 
> over documentation completely, or link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26835) Document configuration properties of Spark SQL Generic Load/Save Functions

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26835:


Assignee: (was: Apache Spark)

> Document configuration properties of Spark SQL Generic Load/Save Functions
> --
>
> Key: SPARK-26835
> URL: https://issues.apache.org/jira/browse/SPARK-26835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Peter Horvath
>Priority: Major
>
> Currently the [Generic Load/Save Functions section of Spark SQL 
> documentation|https://spark.apache.org/docs/2.4.0/sql-data-sources-load-save-functions.html]
>  
>  does not explain the available configuration properties at all.
> Neither the available formats nor their configuration properties are listed 
> properly: there are some usage samples, but that is really all.
> For some formats, there is a remark to visit the site of the provider; quote: 
> "_To find more detailed information about the extra ORC/Parquet options, 
> visit the official Apache ORC/Parquet websites."_
> However, this is not applicable for all format providers; for example there 
> is not even a hint regarding the CSV writer's configuration properties or 
> where they can be looked up.
> Please add documentation regarding the configuration properties. Either copy 
> over documentation completely, or link.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762552#comment-16762552
 ] 

Dongjoon Hyun commented on SPARK-26082:
---

Although this is not a blocker, cc [~maropu].

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 
> 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).
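
Until the naming is unified, the workaround described above means setting both 
spellings so the cache applies to drivers and executors alike (a sketch of the 
workaround only):

{code:scala}
import org.apache.spark.SparkConf

// MesosClusterScheduler (driver side) reads the "fetchCache" spelling, while the
// coarse-grained backend (executor side) reads "fetcherCache", so set both.
val conf = new SparkConf()
  .set("spark.mesos.fetcherCache.enable", "true")
  .set("spark.mesos.fetchCache.enable", "true")
{code}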



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762511#comment-16762511
 ] 

Dongjoon Hyun commented on SPARK-26082:
---

Thank you, [~mwlon]. You have been added to the Apache Spark contributor group.

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 
> 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-26082:
-

Assignee: Martin Loncaric

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 
> 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Assignee: Martin Loncaric
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26082.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1
   2.3.4

This is resolved via https://github.com/apache/spark/pull/23734

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 
> 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762499#comment-16762499
 ] 

Dongjoon Hyun edited comment on SPARK-26082 at 2/7/19 9:17 AM:
---

Since this bug was introduced by SPARK-15994, which was added in Spark 2.1.0, I 
removed 2.0.x from the affected versions.

BTW, Spark 2.2.x is EOL (https://spark.apache.org/versioning-policy.html).


was (Author: dongjoon):
Since this bug is introduced by SPARK-15994 which is added Spark 2.1.0, I 
removed 2.0.x from the affected versions.

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 
> 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762499#comment-16762499
 ] 

Dongjoon Hyun commented on SPARK-26082:
---

Since this bug is introduced by SPARK-15994 which is added Spark 2.1.0, I 
removed 2.0.x from the affected versions.

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 
> 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26082:
--
Affects Version/s: (was: 2.0.2)
   (was: 2.0.1)
   (was: 2.0.0)

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 
> 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16762496#comment-16762496
 ] 

Apache Spark commented on SPARK-26082:
--

User 'mwlon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23734

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26082:


Assignee: (was: Apache Spark)

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2019-02-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26082:


Assignee: Apache Spark

> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Assignee: Apache Spark
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming discrepancy dates back to version 2.0.0 
> ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).
> This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, 
> the Mesos cache will be used only for executors, and not for drivers.
> IMPACT:
> Not caching these driver files (typically including at least spark binaries, 
> custom jar, and additional dependencies) adds considerable overhead network 
> traffic and startup time when frequently running spark Applications on a 
> Mesos cluster. Additionally, since extracted files like 
> {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox 
> with the cache off (rather than extracted directly without an extra copy), 
> this can considerably increase disk usage. Users CAN currently workaround by 
> specifying the {{spark.mesos.fetchCache.enable}} option, but this should at 
> least be specified in the documentation.
> SUGGESTED FIX:
> Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
> 2.4, and update {{MesosClusterScheduler.scala}} to use 
> {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line 
> change).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26826) Array indexing functions array_allpositions and array_select

2019-02-07 Thread Petar Zecevic (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Petar Zecevic resolved SPARK-26826.
---
Resolution: Won't Fix

> Array indexing functions array_allpositions and array_select
> 
>
> Key: SPARK-26826
> URL: https://issues.apache.org/jira/browse/SPARK-26826
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Petar Zecevic
>Priority: Major
>
> This ticket proposes two extra array functions: {{array_allpositions}} (named 
> after {{array_position}}) and {{array_select}}. These functions should make 
> it easier to:
> * get an array of indices of all occurences of a value in an array 
> ({{array_allpositions}})
> * select all elements of an array based on an array of indices 
> ({{array_select}})
> Although higher-order functions such as {{aggregate}} and {{transform}} have 
> recently been added, performing the tasks above is still not simple, hence this 
> addition.
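
For reference, both proposed operations can already be written with the Spark 2.4 
higher-order functions, which is presumably why this was resolved as Won't Fix (a 
sketch, assuming a spark-shell session with {{spark}} in scope; the column names and 
sample values are made up):

{code:scala}
import spark.implicits._

val df = Seq((Seq(1, 2, 1, 3), 1, Seq(0, 3))).toDF("xs", "v", "idx")

// array_allpositions(xs, v): indices of every occurrence of v in xs
df.selectExpr(
  "filter(transform(xs, (x, i) -> IF(x = v, i, -1)), p -> p >= 0) AS positions"
).show()

// array_select(xs, idx): the elements of xs at the given (0-based) indices
df.selectExpr("transform(idx, i -> xs[i]) AS selected").show()
{code}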



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org