[jira] [Resolved] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-06-26 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-27100.
-
   Resolution: Fixed
Fix Version/s: 2.4.4

Issue resolved by pull request 24957
[https://github.com/apache/spark/pull/24957]

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Priority: Major
> Fix For: 2.4.4
>
> Attachments: SPARK-27100-Overflow.txt, stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)






[jira] [Created] (SPARK-28169) Spark can’t push down predicate for OR expression

2019-06-26 Thread angerszhu (JIRA)
angerszhu created SPARK-28169:
-

 Summary: Spark can’t push down predicate for OR expression
 Key: SPARK-28169
 URL: https://issues.apache.org/jira/browse/SPARK-28169
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: angerszhu









[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-26 Thread Luca Canali (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873041#comment-16873041
 ] 

Luca Canali commented on SPARK-28091:
-

Indeed toString() is quite useful, thanks.
+ good proposal, I think it could be useful.

I'll take the occasion to address a related topic: I wanted to ask whether there
are any plans to add read and write time/latency metrics to the S3A
instrumentation. This would be useful for (Spark) performance troubleshooting,
for example by comparing CPU time and I/O time. I think the question of I/O time
instrumentation has come up a few times already over the years. I'd add that in
the context discussed here, that is, Spark monitoring instrumentation with the
Dropwizard libraries, we would "only" need latency metrics aggregated over the
executor JVM, rather than task/query level statistics (BTW, I see the latter has
been discussed in the case of HDFS in HADOOP-11873).

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to  monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code
> there, we propose to add a metrics plugin system, which is more flexible and
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in Filesystem 
> and other metrics to the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not
> normally be considered general enough for inclusion in Apache Spark code, but
> that can nevertheless be useful for specialized use cases, tests or
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.






[jira] [Updated] (SPARK-28169) Spark can’t push down predicate for OR expression

2019-06-26 Thread angerszhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-28169:
--
Description: 
Spark can't push down the filter condition of an OR expression.

For example, given a table default.test whose partition column is "dt", consider
the query:
{code:sql}
select * from default.test where dt=20190625 or (dt = 20190626 and id in
(1,2,3) )
{code}
In this case, Spark resolves the whole OR condition as one expression, and since
that expression also references "id", it cannot be pushed down.
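As a workaround sketch (an editor's note, not part of this ticket), the partition
predicate can be stated as a separate conjunct so that it can be pushed down on
its own:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Logically equivalent rewrite of the query above: the standalone
# "dt IN (20190625, 20190626)" conjunct references only the partition column,
# so it can be pushed down, while the remaining OR keeps the id filter.
spark.sql("""
    SELECT *
    FROM default.test
    WHERE dt IN (20190625, 20190626)
      AND (dt = 20190625 OR id IN (1, 2, 3))
""").show()
{code}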

> Spark can’t push down predicate for OR expression
> -
>
> Key: SPARK-28169
> URL: https://issues.apache.org/jira/browse/SPARK-28169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: angerszhu
>Priority: Major
>  Labels: SQL
>
> Spark can't push down the filter condition of an OR expression.
> For example, given a table default.test whose partition column is "dt",
> consider the query:
> {code:sql}
> select * from default.test where dt=20190625 or (dt = 20190626 and id in
> (1,2,3) )
> {code}
> In this case, Spark resolves the whole OR condition as one expression, and
> since that expression also references "id", it cannot be pushed down.






[jira] [Assigned] (SPARK-28169) Spark can’t push down predicate for OR expression

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28169:


Assignee: Apache Spark

> Spark can’t push down predicate for OR expression
> -
>
> Key: SPARK-28169
> URL: https://issues.apache.org/jira/browse/SPARK-28169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>  Labels: SQL
>
> Spark can't push down the filter condition of an OR expression.
> For example, given a table default.test whose partition column is "dt",
> consider the query:
> {code:sql}
> select * from default.test where dt=20190625 or (dt = 20190626 and id in
> (1,2,3) )
> {code}
> In this case, Spark resolves the whole OR condition as one expression, and
> since that expression also references "id", it cannot be pushed down.






[jira] [Assigned] (SPARK-28169) Spark can’t push down predicate for OR expression

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28169:


Assignee: (was: Apache Spark)

> Spark can’t push down predicate for OR expression
> -
>
> Key: SPARK-28169
> URL: https://issues.apache.org/jira/browse/SPARK-28169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: angerszhu
>Priority: Major
>  Labels: SQL
>
> Spark can't push down the filter condition of an OR expression.
> For example, given a table default.test whose partition column is "dt",
> consider the query:
> {code:sql}
> select * from default.test where dt=20190625 or (dt = 20190626 and id in
> (1,2,3) )
> {code}
> In this case, Spark resolves the whole OR condition as one expression, and
> since that expression also references "id", it cannot be pushed down.






[jira] [Commented] (SPARK-28091) Extend Spark metrics system with executor plugin metrics

2019-06-26 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873161#comment-16873161
 ] 

Steve Loughran commented on SPARK-28091:


IO metrics.

hmmm. Maybe, but there's a lot going on underneath, including retries happening
in the AWS code. And things like the duration of a rename are O(bytes), whereas
for a getFileStatus it's that of a pair of HEAD calls.

We've discussed hooking up to the AWS SDK metrics, if that interests you: 
HADOOP-13551. Getting the throttle and retry count which is often handled 
transparently in the AWS code would be the one to go for. Sometimes we see it 
in our code, where again, the throttle count would be good. It's better to say 
"your job was slow because you overloaded an S3 shard" than "your job was 
slow". 

There are some quantiles to measure DynamoDB duration in S3Guard, which led to
HADOOP-16278. Now that DDB supports pay-on-demand, IO throttling is less common
and so these counters are less important.

> Extend Spark metrics system with executor plugin metrics
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics systems implemented with the 
> Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics, see 
> also SPARK-26890, useful to  monitor and troubleshoot Spark workloads. A 
> typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
> Improvement: The original goal of this work was to add instrumentation for S3 
> filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code
> there, we propose to add a metrics plugin system, which is more flexible and
> of more general use.
> Advantages:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug in Filesystem 
> and other metrics to the Spark monitoring system. Future work on plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not
> normally be considered general enough for inclusion in Apache Spark code, but
> that can nevertheless be useful for specialized use cases, tests or
> troubleshooting.
> Implementation:
> The proposed implementation is currently a WIP open for comments and 
> improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.






[jira] [Updated] (SPARK-26683) Incorrect value of "internal.metrics.input.recordsRead" when reading from temp hive table backed by HDFS file

2019-06-26 Thread Amar Agrawal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amar Agrawal updated SPARK-26683:
-
Fix Version/s: 3.0.0

> Incorrect value of "internal.metrics.input.recordsRead" when reading from 
> temp hive table backed by HDFS file
> -
>
> Key: SPARK-26683
> URL: https://issues.apache.org/jira/browse/SPARK-26683
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Amar Agrawal
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: asyncfactory.scala, input1.txt, input2.txt
>
>
> *Issue description*
> The summary of the issue is: when a persisted DataFrame is used in two
> different concurrent threads, we get a wrong value of
> *internal.metrics.input.recordsRead* in the SparkListenerStageCompleted event.
>  
> *Issue Details* 
> The Spark code I have written has 2 source temp hive tables. When the first
> temp table is read, its dataframe is persisted, whereas for the other temp
> table, its source dataframe is not persisted. After that, we have 2 pipelines
> which we run in async fashion. In the 1st pipeline, the persisted dataframe 
> is written to some hive target table. Whereas, in the 2nd pipeline, we are 
> performing a UNION of persisted dataframe with non-persisted dataframe, which 
> is then written to a separate hive table.
> Our expectation is, since the first dataframe is persisted, its metric for 
> recordsRead should be computed exactly once. But in our case, we are seeing 
> an increased value of the metric.
> Example: if my persisted dataframe has 2 rows, the above-mentioned metric
> consistently reports it as 3 rows.
>  
> *Steps to reproduce Issue:*
>  # Create directory /tmp/INFA_UNION1 and copy input1.txt to this directory.
>  # Create directory /tmp/INFA_UNION2 and copy input2.txt to this directory.
>  # Run the following in spark-shell:
> scala> :load asyncfactory.scala
> scala> :paste -raw
>  
> {code:java}
> package org.apache.spark
> import org.apache.spark.scheduler._
> import org.apache.spark.util.JsonProtocol
> import org.json4s.jackson.JsonMethods._
> class InfaListener(mode:String="ACCUMULATOR") extends 
> org.apache.spark.scheduler.SparkListener {
> def onEvent(event: SparkListenerEvent): Unit = {
> val jv = JsonProtocol.sparkEventToJson(event)
> println(compact(jv))
> }
> override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): 
> Unit = { onEvent(stageCompleted)}
> override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): 
> Unit = { onEvent(stageSubmitted)}
> }
> {code}
>  
> scala> :paste
> {code:java}
> import org.apache.spark.InfaListener
> implicit def df2idf(d:DataFrame):InfaDataFrame = new InfaDataFrame(d);
> val sqlc = spark.sqlContext
> val sc = spark.sparkContext
> val lis = new InfaListener("TAG")
> sc.addSparkListener(lis)
> sqlc.sql("DROP TABLE IF EXISTS `default`.`read1`")
> sqlc.sql("CREATE TABLE `default`.`read1` (`col0` STRING) LOCATION 
> '/tmp/INFA_UNION1'")
> sqlc.sql("DROP TABLE IF EXISTS `default`.`read2`")
> sqlc.sql("CREATE TABLE `default`.`read2` (`col0` STRING) LOCATION 
> '/tmp/INFA_UNION2'")
> sqlc.sql("DROP TABLE IF EXISTS `default`.`write1`")
> sqlc.sql("CREATE TABLE `default`.`write1` (`col0` STRING)")
> sqlc.sql("DROP TABLE IF EXISTS `default`.`write2`")
> sqlc.sql("CREATE TABLE `default`.`write2` (`col0` STRING)")
> val v0 = sqlc.sql("SELECT `read1`.`col0` as a0 FROM 
> `default`.`read1`").itoDF.persist(MEMORY_AND_DISK).where(lit(true));
> async(asBlock(sqlc.sql("INSERT OVERWRITE TABLE `default`.`write1` SELECT 
> tbl0.c0 as a0 FROM tbl0"), v0.unionAll(sqlc.sql("SELECT `read2`.`col0` as a0 
> FROM 
> `default`.`read2`").itoDF).itoDF("TGT_").itoDF("c").createOrReplaceTempView("tbl0")));
> async(asBlock(sqlc.sql("INSERT OVERWRITE TABLE `default`.`write2` SELECT 
> tbl1.c0 as a0 FROM tbl1"), 
> v0.itoDF("TGT_").itoDF("c").createOrReplaceTempView("tbl1")));
> stop;
> {code}
> *NOTE* - The above code refers to 2 file directories /tmp/INFA_UNION1 and 
> /tmp/INFA_UNION2. We have attached the files which need to be copied to the 
> above locations after these directories are created.
>  






[jira] [Created] (SPARK-28170) DenseVector .toArray() and .values documentation do not specify they are aliases

2019-06-26 Thread Sivam Pasupathipillai (JIRA)
Sivam Pasupathipillai created SPARK-28170:
-

 Summary: DenseVector .toArray() and .values documentation do not 
specify they are aliases
 Key: SPARK-28170
 URL: https://issues.apache.org/jira/browse/SPARK-28170
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.4.3
Reporter: Sivam Pasupathipillai


The documentation of the *toArray()* method and the *values* property in 
pyspark.ml.linalg.DenseVector is confusing.

*toArray():* Returns a numpy.ndarray

*values:* Returns a list of values

However, they are actually aliases and they both return a numpy.ndarray.

FIX: either change the documentation or change the *values* property to return
a Python list.
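A quick illustration of the current behaviour (a minimal sketch against
pyspark.ml.linalg, assuming a local pyspark installation):

{code:python}
from pyspark.ml.linalg import DenseVector
import numpy as np

v = DenseVector([1.0, 2.0, 3.0])

print(type(v.toArray()))  # <class 'numpy.ndarray'>
print(type(v.values))     # also <class 'numpy.ndarray'>, not a Python list
print(np.array_equal(v.toArray(), v.values))  # True: both expose the same data
{code}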






[jira] [Assigned] (SPARK-28164) usage description does not match with shell scripts

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28164:


Assignee: (was: Apache Spark)

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?






[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts

2019-06-26 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873272#comment-16873272
 ] 

Apache Spark commented on SPARK-28164:
--

User 'shivusondur' has created a pull request for this issue:
https://github.com/apache/spark/pull/24974

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?






[jira] [Assigned] (SPARK-28164) usage description does not match with shell scripts

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28164:


Assignee: Apache Spark

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Assignee: Apache Spark
>Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?






[jira] [Commented] (SPARK-28164) usage description does not match with shell scripts

2019-06-26 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873275#comment-16873275
 ] 

Shivu Sondur commented on SPARK-28164:
--

[~hannankan]

I updated the usage message and raised the pull request.

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?






[jira] [Created] (SPARK-28171) Correct interval type conversion behavior

2019-06-26 Thread Zhu, Lipeng (JIRA)
Zhu, Lipeng created SPARK-28171:
---

 Summary: Correct interval type conversion behavior 
 Key: SPARK-28171
 URL: https://issues.apache.org/jira/browse/SPARK-28171
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Zhu, Lipeng


When calculating between a timestamp and an interval:
{code:sql}
select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, 
timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second
{code}
 * PostgreSQL return *2019-01-02 02:03:04*    *2018-12-31 02:03:04*
 * SparkSQL return    *2019-01-02 02:03:04*    *2018-12-30 21:56:56*

{code:sql}
select timestamp '2019-01-01 00:00:00' + interval '1 -2:03:04' day to second, 
timestamp '2019-01-01 00:00:00' + interval '-1 -2:03:04' day to second
{code}
 * PostgreSQL return *2019-01-01 21:56:56*    *2018-12-30 21:56:56*
 * SparkSQL return    _*Interval string does not match day-time format of 'd 
h:m:s.n': '1 -2:03:04'(line 1, pos 50)*_
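A rough illustration of where the two results diverge (an editor's sketch,
assuming the difference is only in whether the leading sign applies to the whole
interval or just to the day field):

{code:python}
from datetime import datetime, timedelta

ts = datetime(2019, 1, 1)

# PostgreSQL-style reading of interval '-1 2:03:04': the sign applies per field,
# i.e. minus one day, then plus 02:03:04
print(ts + timedelta(days=-1, hours=2, minutes=3, seconds=4))  # 2018-12-31 02:03:04

# SparkSQL-style reading: the sign applies to the whole day-time interval
print(ts - timedelta(days=1, hours=2, minutes=3, seconds=4))   # 2018-12-30 21:56:56
{code}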






[jira] [Commented] (SPARK-28171) Correct interval type conversion behavior

2019-06-26 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873318#comment-16873318
 ] 

Yuming Wang commented on SPARK-28171:
-

{code:sql}
postgres=# select substr(version(), 0, 16), timestamp '2019-01-01 00:00:00' + 
interval '1 2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval 
'-1 2:03:04' day to second;
     substr      |      ?column?       |      ?column?
-----------------+---------------------+---------------------
 PostgreSQL 11.3 | 2019-01-02 02:03:04 | 2018-12-31 02:03:04
(1 row)
{code}

{code:sql}
dbadmin=> select version(), timestamp '2019-01-01 00:00:00' + interval '1 
2:03:04' day to second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' 
day to second;
              version               |      ?column?       |      ?column?
------------------------------------+---------------------+---------------------
Vertica Analytic Database v9.1.1-0 | 2019-01-02 02:03:04 | 2018-12-30 21:56:56
(1 row)
{code}


{code:sql}
presto> select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to 
second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second;
          _col0          |          _col1
-------------------------+-------------------------
 2019-01-02 02:03:04.000 | 2018-12-30 21:56:56.000
(1 row)
{code}

{code:sql}
SQL> -- Oracle
SQL> select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to 
second, timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second 
from dual;

TIMESTAMP'2019-01-0100:00:00'+INTERVAL'12:03:04'DAYTOSECOND
---
TIMESTAMP'2019-01-0100:00:00'+INTERVAL'-12:03:04'DAYTOSECOND
---
02-JAN-19 02.03.04.0 AM
30-DEC-18 09.56.56.0 PM
{code}

> Correct interval type conversion behavior 
> --
>
> Key: SPARK-28171
> URL: https://issues.apache.org/jira/browse/SPARK-28171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> When calculating between a timestamp and an interval:
> {code:sql}
> select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, 
> timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second
> {code}
>  * PostgreSQL return *2019-01-02 02:03:04*    *2018-12-31 02:03:04*
>  * SparkSQL return    *2019-01-02 02:03:04*    *2018-12-30 21:56:56*
> {code:sql}
> select timestamp '2019-01-01 00:00:00' + interval '1 -2:03:04' day to second, 
> timestamp '2019-01-01 00:00:00' + interval '-1 -2:03:04' day to second
> {code}
>  * PostgreSQL return *2019-01-01 21:56:56*    *2018-12-30 21:56:56*
>  * SparkSQL return    _*Interval string does not match day-time format of 'd 
> h:m:s.n': '1 -2:03:04'(line 1, pos 50)*_






[jira] [Created] (SPARK-28172) pyspark DataFrame equality operator

2019-06-26 Thread Hugo (JIRA)
Hugo created SPARK-28172:


 Summary: pyspark DataFrame equality operator
 Key: SPARK-28172
 URL: https://issues.apache.org/jira/browse/SPARK-28172
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.3
Reporter: Hugo


*Motivation:*
Facilitating testing equality between DataFrames. Many approaches for applying
TDD practices in data science require checking function output against expected
output (similar to API snapshots). Having an equality operator or something
similar would make this easier.

A basic example:
{code}
from pyspark.ml.feature import Imputer

def create_mock_missing_df():
   return spark.createDataFrame([("a", 1.0), ("b", 2.0), ("c", float('nan'))],
['COL1', 'COL2'])

def test_imputation():
   """Test mean value imputation"""

   df1 = create_mock_missing_df()

   #load snapshot
   pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect()
   df2 = spark.createDataFrame(pickled_snapshot)
   """
   >>> df2.show()
   +----+------------+
   |COL1|COL2_imputed|
   +----+------------+
   |   a|         1.0|
   |   b|         2.0|
   |   c|         1.5|
   +----+------------+

   """ 
   imputer = Imputer(
  inputCols=['COL2'],
  outputCols=['COL2_imputed']
   )
   df1 = imputer.fit(df1).transform(df1)
   df1 = df1.drop('COL2')

   assert df1 == df2
{code}
 
 Suggested change:
{code}
class DataFrame(object):
   ...

   def __eq__(self, other):
  """Returns ``True`` if DataFrame content is equal to other.

  >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
  >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)])

  >>> df1 == df2
  True
  """
  return self.unionAll(other) \
 .subtract(self.intersect(other)) \
 .count() == 0
{code}
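One property of the suggested set-difference check worth noting (an editor's
aside, not part of the proposal): since subtract and intersect have set
semantics, two DataFrames that differ only in duplicate row counts would compare
equal. A minimal sketch:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([("a", 1), ("a", 1)], ["k", "v"])  # two identical rows
df2 = spark.createDataFrame([("a", 1)], ["k", "v"])            # one row

# same expression as the suggested __eq__ body
equal = df1.unionAll(df2).subtract(df1.intersect(df2)).count() == 0
print(equal)  # True, although the row counts differ
{code}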






[jira] [Resolved] (SPARK-28005) SparkRackResolver should not log for resolving empty list

2019-06-26 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-28005.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24935
[https://github.com/apache/spark/pull/24935]

> SparkRackResolver should not log for resolving empty list
> -
>
> Key: SPARK-28005
> URL: https://issues.apache.org/jira/browse/SPARK-28005
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> After SPARK-13704, {{SparkRackResolver}} generates an INFO message every time
> it is called with 0 arguments:
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76
> That actually happens every 1s when there are no active executors, because of 
> the repeated offers that happen as part of delay scheduling:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139
> While this is relatively benign, it's a pretty annoying thing to be logging at
> INFO level every 1 second.
> This is easy to reproduce -- in spark-shell, with dynamic allocation, set the
> log level to info and see the logs appear every 1 second. Then run something
> and see the msgs stop. After the executors time out, see the msgs reappear.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> sc.setLogLevel("info")
> Thread.sleep(5000)
> sc.parallelize(1 to 10).count()
> // Exiting paste mode, now interpreting.
> 19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at :28
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at 
> :28) with 2 output partitions
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 
> (count at :28)
> ...
> 19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose 
> tasks have all completed, from pool 
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at 
> :28) finished in 9.548 s
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at 
> :28, took 9.613049 s
> res2: Long = 10   
>   
> scala> 
> ...
> 19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> ...
> {noformat}






[jira] [Assigned] (SPARK-28005) SparkRackResolver should not log for resolving empty list

2019-06-26 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-28005:


Assignee: Gabor Somogyi

> SparkRackResolver should not log for resolving empty list
> -
>
> Key: SPARK-28005
> URL: https://issues.apache.org/jira/browse/SPARK-28005
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> After SPARK-13704, {{SparkRackResolver}} generates an INFO message every time
> it is called with 0 arguments:
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/SparkRackResolver.scala#L73-L76
> That actually happens every 1s when there are no active executors, because of 
> the repeated offers that happen as part of delay scheduling:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L134-L139
> While this is relatively benign, it's a pretty annoying thing to be logging at
> INFO level every 1 second.
> This is easy to reproduce -- in spark-shell, with dynamic allocation, set the
> log level to info and see the logs appear every 1 second. Then run something
> and see the msgs stop. After the executors time out, see the msgs reappear.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> sc.setLogLevel("info")
> Thread.sleep(5000)
> sc.parallelize(1 to 10).count()
> // Exiting paste mode, now interpreting.
> 19/06/11 12:43:40 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:41 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:42 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:43 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:44 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:43:45 INFO spark.SparkContext: Starting job: count at :28
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Got job 0 (count at 
> :28) with 2 output partitions
> 19/06/11 12:43:45 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 
> (count at :28)
> ...
> 19/06/11 12:43:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose 
> tasks have all completed, from pool 
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: ResultStage 0 (count at 
> :28) finished in 9.548 s
> 19/06/11 12:43:54 INFO scheduler.DAGScheduler: Job 0 finished: count at 
> :28, took 9.613049 s
> res2: Long = 10   
>   
> scala> 
> ...
> 19/06/11 12:44:56 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:57 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:58 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> 19/06/11 12:44:59 INFO yarn.SparkRackResolver: Got an error when resolving 
> hostNames. Falling back to /default-rack for all
> ...
> {noformat}






[jira] [Created] (SPARK-28173) Add Kafka delegation token proxy user support

2019-06-26 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-28173:
-

 Summary: Add Kafka delegation token proxy user support
 Key: SPARK-28173
 URL: https://issues.apache.org/jira/browse/SPARK-28173
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi


In SPARK-26592 I've turned off proxy user usage because 
https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented. Since
the KIP will be under discussion and hopefully implemented, here is this jira to
track the Spark-side effort.






[jira] [Commented] (SPARK-28173) Add Kafka delegation token proxy user support

2019-06-26 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873420#comment-16873420
 ] 

Gabor Somogyi commented on SPARK-28173:
---

cc [~viktorsomogyi]

> Add Kafka delegation token proxy user support
> -
>
> Key: SPARK-28173
> URL: https://issues.apache.org/jira/browse/SPARK-28173
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In SPARK-26592 I've turned off proxy user usage because 
> https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented. 
> Since the KIP will be under discussion and hopefully implemented, here is this
> jira to track the Spark-side effort.






[jira] [Commented] (SPARK-28173) Add Kafka delegation token proxy user support

2019-06-26 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873422#comment-16873422
 ] 

Gabor Somogyi commented on SPARK-28173:
---

Will file the appropriate PR when the Kafka part is ready. I'm working closely
on this with [~viktorsomogyi].

> Add Kafka delegation token proxy user support
> -
>
> Key: SPARK-28173
> URL: https://issues.apache.org/jira/browse/SPARK-28173
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> In SPARK-26592 I've turned off proxy user usage because 
> https://issues.apache.org/jira/browse/KAFKA-6945 is not yet implemented. 
> Since the KIP will be under discussion and hopefully implemented, here is this
> jira to track the Spark-side effort.






[jira] [Resolved] (SPARK-28164) usage description does not match with shell scripts

2019-06-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28164.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   2.3.4
   3.0.0

Issue resolved by pull request 24974
[https://github.com/apache/spark/pull/24974]

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Assignee: Shivu Sondur
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?






[jira] [Assigned] (SPARK-28164) usage description does not match with shell scripts

2019-06-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28164:
-

Assignee: Shivu Sondur

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Assignee: Shivu Sondur
>Priority: Major
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?






[jira] [Updated] (SPARK-28164) usage description does not match with shell scripts

2019-06-26 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-28164:
--
Priority: Minor  (was: Major)

> usage description does not match with shell scripts
> ---
>
> Key: SPARK-28164
> URL: https://issues.apache.org/jira/browse/SPARK-28164
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 2.4.3
>Reporter: Hanna Kan
>Assignee: Shivu Sondur
>Priority: Minor
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> I found that "spark/sbin/start-slave.sh" may have an error.
> Line 43 gives: echo "Usage: ./sbin/start-slave.sh [options] "
> but later in this script, line 59 has MASTER=$1.
> Is this a conflict?






[jira] [Resolved] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries

2019-06-26 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-28157.
-
   Resolution: Resolved
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

> Make SHS clear KVStore LogInfo for the blacklisted entries
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> As of Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegates access permission checks
> to the file system and maintains a blacklist of all event log files that have
> failed once at reading. The blacklisted log files are released back after
> CLEAN_INTERVAL_S.
> However, files whose size doesn't change are ignored forever, because
> shouldReloadLog always returns false when the size is the same as the value
> in the KVStore. This is only recovered via an SHS restart.






[jira] [Updated] (SPARK-28157) Make SHS clear KVStore LogInfo for the blacklisted entries

2019-06-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28157:
--
Affects Version/s: 2.3.2
   2.3.3

> Make SHS clear KVStore LogInfo for the blacklisted entries
> --
>
> Key: SPARK-28157
> URL: https://issues.apache.org/jira/browse/SPARK-28157
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> As of Spark 2.4.0/2.3.2/2.2.3, SPARK-24948 delegates access permission checks
> to the file system and maintains a blacklist of all event log files that have
> failed once at reading. The blacklisted log files are released back after
> CLEAN_INTERVAL_S.
> However, files whose size doesn't change are ignored forever, because
> shouldReloadLog always returns false when the size is the same as the value
> in the KVStore. This is only recovered via an SHS restart.






[jira] [Created] (SPARK-28174) Upgrade to Kafka 2.3.0

2019-06-26 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28174:
-

 Summary: Upgrade to Kafka 2.3.0
 Key: SPARK-28174
 URL: https://issues.apache.org/jira/browse/SPARK-28174
 Project: Spark
  Issue Type: Improvement
  Components: Build, Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue updates Kafka dependency to 2.3.0 to bring the following 10 
client-side patches at least.

https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients






[jira] [Commented] (SPARK-28165) SHS does not delete old inprogress files until cleaner.maxAge after SHS start time

2019-06-26 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873614#comment-16873614
 ] 

Imran Rashid commented on SPARK-28165:
--

btw if anybody wants to investigate this more, here's a simple test case
(though, as discussed above, we can't just use the modtime as it's not totally
trustworthy):

{code}
  test("log cleaner for inprogress files before SHS startup") {
val firstFileModifiedTime = TimeUnit.SECONDS.toMillis(10)
val secondFileModifiedTime = TimeUnit.SECONDS.toMillis(100)
val maxAge = TimeUnit.SECONDS.toMillis(40)
val clock = new ManualClock(0)

val log1 = newLogFile("inProgressApp1", None, inProgress = true)
writeFile(log1, true, None,
  SparkListenerApplicationStart(
"inProgressApp1", Some("inProgressApp1"), 3L, "test", Some("attempt1"))
)
log1.setLastModified(firstFileModifiedTime)

val log2 = newLogFile("inProgressApp2", None, inProgress = true)
writeFile(log2, true, None,
  SparkListenerApplicationStart(
"inProgressApp2", Some("inProgressApp2"), 23L, "test2", 
Some("attempt2"))
)
log2.setLastModified(secondFileModifiedTime)

    // advance the clock so the first log is expired, but the second log is still recent
clock.setTime(secondFileModifiedTime)
assert(clock.getTimeMillis() > firstFileModifiedTime + maxAge)

// start up the SHS
val provider = new FsHistoryProvider(
  createTestConf().set("spark.history.fs.cleaner.maxAge", s"${maxAge}ms"), 
clock)

provider.checkForLogs()

// We should cleanup one log immediately
updateAndCheck(provider) { list =>
  assert(list.size  === 1)
}
assert(!log1.exists())
assert(log2.exists())
  }
{code}

> SHS does not delete old inprogress files until cleaner.maxAge after SHS start 
> time
> --
>
> Key: SPARK-28165
> URL: https://issues.apache.org/jira/browse/SPARK-28165
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Imran Rashid
>Priority: Major
>
> The SHS will not delete inprogress files until 
> {{spark.history.fs.cleaner.maxAge}} time after it has started (7 days by 
> default), regardless of when the last modification to the file was.  This is 
> particularly problematic if the SHS gets restarted regularly, as then you'll 
> end up never deleting old files.
> There might not be much we can do about this -- we can't really trust the 
> modification time of the file, as that isn't always updated reliably.
> We could take the last time of any event from the file, but then we'd have to 
> turn off the optimization of SPARK-6951, to avoid reading the entire file 
> just for the listing.
> *WORKAROUND*: have the SHS save state across restarts to local disk by 
> specifying a path in {{spark.history.store.path}}.  It'll still take 7 days 
> from when you add that config for the cleaning to happen, but from then on the
> cleaning should happen reliably.






[jira] [Assigned] (SPARK-28174) Upgrade to Kafka 2.3.0

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28174:


Assignee: (was: Apache Spark)

> Upgrade to Kafka 2.3.0
> --
>
> Key: SPARK-28174
> URL: https://issues.apache.org/jira/browse/SPARK-28174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue updates Kafka dependency to 2.3.0 to bring the following 10 
> client-side patches at least.
> https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients






[jira] [Assigned] (SPARK-28174) Upgrade to Kafka 2.3.0

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28174:


Assignee: Apache Spark

> Upgrade to Kafka 2.3.0
> --
>
> Key: SPARK-28174
> URL: https://issues.apache.org/jira/browse/SPARK-28174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This issue updates Kafka dependency to 2.3.0 to bring the following 10 
> client-side patches at least.
> https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients






[jira] [Updated] (SPARK-28174) Upgrade to Kafka 2.3.0

2019-06-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28174:
--
Description: 
This issue updates Kafka dependency to 2.3.0 to bring the following 10 
client-side patches at least.

- 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients

The following is a full release note.
- https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html

  was:
This issue updates Kafka dependency to 2.3.0 to bring the following 10 
client-side patches at least.

https://issues.apache.org/jira/browse/KAFKA-8379?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients


> Upgrade to Kafka 2.3.0
> --
>
> Key: SPARK-28174
> URL: https://issues.apache.org/jira/browse/SPARK-28174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue updates Kafka dependency to 2.3.0 to bring the following 10 
> client-side patches at least.
> - 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients
> The following is a full release note.
> - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html






[jira] [Updated] (SPARK-28174) Upgrade to Kafka 2.3.0

2019-06-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28174:
--
Description: 
This issue updates Kafka dependency to 2.3.0 to bring the following 9 
client-side patches at least.

- 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients

The following is a full release note.
- https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html

  was:
This issue updates Kafka dependency to 2.3.0 to bring the following 10 
client-side patches at least.

- 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients

The following is a full release note.
- https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html


> Upgrade to Kafka 2.3.0
> --
>
> Key: SPARK-28174
> URL: https://issues.apache.org/jira/browse/SPARK-28174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue updates Kafka dependency to 2.3.0 to bring the following 9 
> client-side patches at least.
> - 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients
> The following is a full release note.
> - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html






[jira] [Updated] (SPARK-28174) Upgrade to Kafka 2.3.0

2019-06-26 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28174:
--
Description: 
This issue updates Kafka dependency to 2.3.0 to bring the following 9 
client-side patches at least.

- 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20fixVersion%20NOT%20IN%20(2.2.0%2C%202.2.1)%20AND%20component%20%3D%20clients

The following is a full release note.
- https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html

  was:
This issue updates Kafka dependency to 2.3.0 to bring the following 9 
client-side patches at least.

- 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20component%20%3D%20clients

The following is a full release note.
- https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html


> Upgrade to Kafka 2.3.0
> --
>
> Key: SPARK-28174
> URL: https://issues.apache.org/jira/browse/SPARK-28174
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue updates Kafka dependency to 2.3.0 to bring the following 9 
> client-side patches at least.
> - 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20KAFKA%20AND%20fixVersion%20%3D%202.3.0%20AND%20fixVersion%20NOT%20IN%20(2.2.0%2C%202.2.1)%20AND%20component%20%3D%20clients
> The following is a full release note.
> - https://www.apache.org/dist/kafka/2.3.0/RELEASE_NOTES.html






[jira] [Assigned] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-26 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-27992:


Assignee: Bryan Cutler

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Both SPARK-27805 and SPARK-27548 identified an issue where errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to await on the serving thread future 
> and join the thread. That way if the serving thread throws an exception, it 
> will be propagated on the call to awaitResult.
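As a rough illustration of the proposed direction (not the actual patch; the 
helper name below is hypothetical), the serving work can be run in a background 
thread whose outcome is exposed as a future, so blocking on that future rethrows 
any job failure to the consumer:
{code:scala}
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration.Duration
import scala.util.Try

// Hypothetical helper: run `body` in a named background thread and surface its
// result (or failure) through a Future the caller can block on later.
def serveInBackground[T](name: String)(body: => T): Future[T] = {
  val promise = Promise[T]()
  val thread = new Thread(name) {
    override def run(): Unit = promise.complete(Try(body))
  }
  thread.setDaemon(true)
  thread.start()
  promise.future
}

val served: Future[Int] = serveInBackground("serve-results") {
  // stand-in for the Spark job that streams partitions out over the socket;
  // any exception thrown here is captured by the Future
  (1 to 100).sum
}

// Blocking here rethrows the serving thread's exception, if any, which is the
// "propagated on the call to awaitResult" behaviour described above.
val value = Await.result(served, Duration.Inf)
{code}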






[jira] [Resolved] (SPARK-27992) PySpark socket server should sync with JVM connection thread future

2019-06-26 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27992.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24834
[https://github.com/apache/spark/pull/24834]

> PySpark socket server should sync with JVM connection thread future
> ---
>
> Key: SPARK-27992
> URL: https://issues.apache.org/jira/browse/SPARK-27992
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> Both SPARK-27805 and SPARK-27548 identified an issue where errors in a Spark 
> job are not propagated to Python. This is because toLocalIterator() and 
> toPandas() with Arrow enabled run Spark jobs asynchronously in a background 
> thread, after creating the socket connection info. The fix for these was to 
> catch a SparkException if the job errored and then send the exception through 
> the pyspark serializer.
> A better fix would be to allow Python to await on the serving thread future 
> and join the thread. That way if the serving thread throws an exception, it 
> will be propagated on the call to awaitResult.






[jira] [Created] (SPARK-28175) Exception when using userClassPathFirst

2019-06-26 Thread Shan Zhao (JIRA)
Shan Zhao created SPARK-28175:
-

 Summary: Exception when using userClassPathFirst
 Key: SPARK-28175
 URL: https://issues.apache.org/jira/browse/SPARK-28175
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.0, 2.0.0
Reporter: Shan Zhao


We want to override the executor/driver Spark version with our own Spark 
version shipped in a fat jar.

Conf: 

spark.driver.userClassPathFirst=true

spark.executor.userClassPathFirst=true

Build: spark 2.3.0

Deploy: cluster mode with spark 2.0.0 installed

We encounter the following error: 
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Unable to load YARN 
support
 at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392)
 at 
org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387)
 at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387)
 at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412)
 at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:69)
 at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290)
 at org.apache.spark.deploy.yarn.Client.main(Client.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
org.apache.hadoop.security.GroupMappingServiceProvider
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273)
 at org.apache.hadoop.security.Groups.<init>(Groups.java:99)
 at org.apache.hadoop.security.Groups.<init>(Groups.java:95)
 at 
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420)
 at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324)
 at 
org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
 at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:51)
 at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(YarnSparkHadoopUtil.scala:49)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at java.lang.Class.newInstance(Class.java:374)
 at 
org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389)
 ... 15 more
Caused by: java.lang.RuntimeException: class 
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
org.apache.hadoop.security.GroupMappingServiceProvider
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267)
 ... 28 more{code}






[jira] [Created] (SPARK-28176) Add Dataset.collect(PartialFunction) method for parity with RDD API

2019-06-26 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-28176:
--

 Summary: Add Dataset.collect(PartialFunction) method for parity 
with RDD API
 Key: SPARK-28176
 URL: https://issues.apache.org/jira/browse/SPARK-28176
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


RDDs have a {{collect(PartialFunction)}} method which behaves like the 
equivalent Scala collections method (see 
[https://github.com/apache/spark/commit/f4e6b9361ffeec1018d5834f09db9fd86f2ba7bd]).

It would be nice if Datasets had a similar method (to ease migration of RDD 
code to Datasets).
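For context, a sketch of the RDD API being referenced together with a 
present-day Dataset workaround (the Dataset method itself does not exist yet; 
this only illustrates the desired parity):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("collect-pf").getOrCreate()
import spark.implicits._

val pf: PartialFunction[Int, String] = { case x if x % 2 == 0 => s"even:$x" }

// Existing RDD API: keeps the elements on which the partial function is defined
// and applies it, like Scala's collection collect.
val fromRdd = spark.sparkContext.parallelize(1 to 10).collect(pf)      // RDD[String]

// Dataset has no collect(PartialFunction) today; one workaround is lifting the
// partial function to an Option and flatMapping over it.
val fromDs = (1 to 10).toDS().flatMap(x => pf.lift(x).toSeq)           // Dataset[String]

fromDs.show()
{code}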






[jira] [Commented] (SPARK-27292) Spark Job Fails with Unknown Error writing to S3 from AWS EMR

2019-06-26 Thread Padmakar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873670#comment-16873670
 ] 

Padmakar commented on SPARK-27292:
--

Hi, I am facing a similar issue on my 10-node cluster:

{color:#205081}org.apache.hadoop.hdfs.DataStreamer: Exception for 
BP-181186916-10.138.0.22-1561568502844:blk_1073741840_1016{color}
{color:#205081}java.io.EOFException: Unexpected EOF while trying to read 
response from server{color}
{color:#205081} at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:453){color}
{color:#205081} at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213){color}
{color:#205081} at 
org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1080){color}
{color:#205081}19/06/26 21:18:24 WARN org.apache.hadoop.hdfs.DataStreamer: 
Error Recovery for BP-181186916-10.138.0.22-1561568502844:blk_1073741840_1016 
in pipeline 
[DatanodeInfoWithStorage[10.138.0.23:9866,DS-197cd21a-3e5f-46cc-9bbb-fda94b2ba6c0,DISK],
 
DatanodeInfoWithStorage[10.138.15.250:9866,DS-c2bc521d-37b0-43ea-a80d-191037fcb8a7,DISK]]:
 datanode 
0(DatanodeInfoWithStorage[10.138.0.23:9866,DS-197cd21a-3e5f-46cc-9bbb-fda94b2ba6c0,DISK])
 is bad.{color}

{color:#205081}Was there a fix or any solution or any insights on this issue? 
Please let me know.{color}

> Spark Job Fails with Unknown Error writing to S3 from AWS EMR
> -
>
> Key: SPARK-27292
> URL: https://issues.apache.org/jira/browse/SPARK-27292
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 2.3.2
>Reporter: Olalekan Elesin
>Priority: Major
>
> I am currently experiencing issues writing data to S3 from my Spark Job 
> running on AWS EMR.
> The job writings to some staging path in S3 e.g 
> \{{.spark-random-alphanumeric}}. After which it fails with this error:
> {code:java}
> 9/03/26 10:54:07 WARN AsyncEventQueue: Dropped 196300 events from appStatus 
> since Tue Mar 26 10:52:05 UTC 2019.
> 19/03/26 10:55:07 WARN AsyncEventQueue: Dropped 211186 events from appStatus 
> since Tue Mar 26 10:54:07 UTC 2019.
> 19/03/26 11:37:09 WARN DataStreamer: Exception for 
> BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172
> java.io.EOFException: Unexpected EOF while trying to read response from server
>   at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
>   at 
> org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
> 19/03/26 11:37:09 WARN DataStreamer: Error Recovery for 
> BP-312054361-10.41.97.71-1553586781241:blk_1073742995_2172 in pipeline 
> [DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK],
>  
> DatanodeInfoWithStorage[10.41.71.181:50010,DS-c90a1d87-b40a-4928-a709-1aef027db65a,DISK]]:
>  datanode 
> 0(DatanodeInfoWithStorage[10.41.121.135:50010,DS-cba2a850-fa30-4933-af2a-05b40b58fdb5,DISK])
>  is bad.
> 19/03/26 11:50:34 WARN AsyncEventQueue: Dropped 157572 events from appStatus 
> since Tue Mar 26 10:55:07 UTC 2019.
> 19/03/26 11:51:34 WARN AsyncEventQueue: Dropped 785 events from appStatus 
> since Tue Mar 26 11:50:34 UTC 2019.
> 19/03/26 11:52:34 WARN AsyncEventQueue: Dropped 656 events from appStatus 
> since Tue Mar 26 11:51:34 UTC 2019.
> 19/03/26 11:53:35 WARN AsyncEventQueue: Dropped 1335 events from appStatus 
> since Tue Mar 26 11:52:34 UTC 2019.
> 19/03/26 11:54:35 WARN AsyncEventQueue: Dropped 1087 events from appStatus 
> since Tue Mar 26 11:53:35 UTC 2019.
> ...
> 19/03/26 13:39:39 WARN TaskSetManager: Lost task 33302.0 in stage 1444.0 (TID 
> 1324427, ip-10-41-122-224.eu-west-1.compute.internal, executor 18): 
> org.apache.spark.SparkException: Task failed while writing rows.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:254)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:168)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolEx

[jira] [Created] (SPARK-28177) Adjust post shuffle partition number in adaptive execution

2019-06-26 Thread Carson Wang (JIRA)
Carson Wang created SPARK-28177:
---

 Summary: Adjust post shuffle partition number in adaptive execution
 Key: SPARK-28177
 URL: https://issues.apache.org/jira/browse/SPARK-28177
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Carson Wang


Implement a rule in the new adaptive execution framework introduced in 
SPARK-23128. This rule is used to adjust the post shuffle partitions based on 
the map output statistics.
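A simplified sketch of the kind of coalescing such a rule performs, assuming 
per-partition byte sizes taken from the map output statistics and an example 
target size (this illustrates the idea only; it is not the actual rule):
{code:scala}
import scala.collection.mutable.ArrayBuffer

// Greedily pack adjacent post-shuffle partitions into groups whose total size
// stays near the target, returning the index ranges to coalesce.
def coalescePartitions(partitionBytes: Array[Long], targetSize: Long): Seq[Range] = {
  val groups = ArrayBuffer.empty[Range]
  var start = 0
  var acc = 0L
  partitionBytes.indices.foreach { i =>
    if (i > start && acc + partitionBytes(i) > targetSize) {
      groups += (start until i)   // close the current group before it overflows
      start = i
      acc = 0L
    }
    acc += partitionBytes(i)
  }
  if (start < partitionBytes.length) groups += (start until partitionBytes.length)
  groups.toSeq
}

// Example: six reduce partitions with skewed sizes and a ~64MB target.
val mb = 1024L * 1024L
coalescePartitions(Array(10 * mb, 20 * mb, 50 * mb, 5 * mb, 70 * mb, 3 * mb), 64 * mb)
// groups: (0 until 2), (2 until 4), (4 until 5), (5 until 6)
{code}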






[jira] [Assigned] (SPARK-28177) Adjust post shuffle partition number in adaptive execution

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28177:


Assignee: Apache Spark

> Adjust post shuffle partition number in adaptive execution
> --
>
> Key: SPARK-28177
> URL: https://issues.apache.org/jira/browse/SPARK-28177
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Carson Wang
>Assignee: Apache Spark
>Priority: Major
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule is used to adjust the post shuffle partitions based on 
> the map output statistics.






[jira] [Assigned] (SPARK-28177) Adjust post shuffle partition number in adaptive execution

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28177:


Assignee: (was: Apache Spark)

> Adjust post shuffle partition number in adaptive execution
> --
>
> Key: SPARK-28177
> URL: https://issues.apache.org/jira/browse/SPARK-28177
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Carson Wang
>Priority: Major
>
> Implement a rule in the new adaptive execution framework introduced in 
> SPARK-23128. This rule is used to adjust the post shuffle partitions based on 
> the map output statistics.






[jira] [Commented] (SPARK-28036) Built-in udf left/right has inconsistent behavior

2019-06-26 Thread Shivu Sondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873696#comment-16873696
 ] 

Shivu Sondur commented on SPARK-28036:
--

[~yumwang]

You can close this JIRA, right?

It is working in Spark; the only difference is that the offset should be used 
without the '-' symbol.

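For example, a quick sketch in spark-shell showing that positive lengths give 
the same result PostgreSQL produces for the negative-offset form quoted below:
{code:scala}
spark.sql("SELECT left('ahoj', 2) AS l, right('ahoj', 2) AS r").show()
// l = "ah", r = "oj", matching PostgreSQL's left('ahoj', -2) / right('ahoj', -2)
{code}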
> Built-in udf left/right has inconsistent behavior
> -
>
> Key: SPARK-28036
> URL: https://issues.apache.org/jira/browse/SPARK-28036
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> PostgreSQL:
> {code:sql}
> postgres=# select left('ahoj', -2), right('ahoj', -2);
>  left | right 
> --+---
>  ah   | oj
> (1 row)
> {code}
> Spark SQL:
> {code:sql}
> spark-sql> select left('ahoj', -2), right('ahoj', -2);
> spark-sql>
> {code}






[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable

2019-06-26 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Description: 
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl

  was:
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")


> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl






[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable

2019-06-26 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Description: 
Support multiple catalogs in the following use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl

  was:
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl


> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl






[jira] [Created] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto

2019-06-26 Thread John Zhuge (JIRA)
John Zhuge created SPARK-28178:
--

 Summary: DataSourceV2: DataFrameWriter.insertInto
 Key: SPARK-28178
 URL: https://issues.apache.org/jira/browse/SPARK-28178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")






[jira] [Updated] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto

2019-06-26 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-28178:
---
Description: 
Support multiple catalogs in the following InsertInto use cases:
 * DataFrameWriter.insertInto("catalog.db.tbl")

  was:
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")


> DataSourceV2: DataFrameWriter.insertInto
> 
>
> Key: SPARK-28178
> URL: https://issues.apache.org/jira/browse/SPARK-28178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * DataFrameWriter.insertInto("catalog.db.tbl")






[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst

2019-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873762#comment-16873762
 ] 

Hyukjin Kwon commented on SPARK-28175:
--

Are you able to test this against a higher version of Spark?

> Exception when using userClassPathFirst
> ---
>
> Key: SPARK-28175
> URL: https://issues.apache.org/jira/browse/SPARK-28175
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Shan Zhao
>Priority: Major
>
> We want to override the executor/driver spark version with our own spark 
> version in a fat jar.
> Conf: 
> spark.driver.userClassPathFirst=true
> spark.executor.userClassPathFirst=true,
> Build: spark 2.3.0 
>  
> Deploy: cluster mode with spark 2.0.0 installed
> We encounter the following error: 
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Unable to load 
> YARN support
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412)
>  at org.apache.spark.deploy.yarn.Client.(Client.scala:69)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273)
>  at org.apache.hadoop.security.Groups.(Groups.java:99)
>  at org.apache.hadoop.security.Groups.(Groups.java:95)
>  at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420)
>  at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324)
>  at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
>  at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51)
>  at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>  at java.lang.Class.newInstance(Class.java:374)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389)
>  ... 15 more
> Caused by: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267)
>  ... 28 more{code}






[jira] [Resolved] (SPARK-27369) Standalone worker can load resource conf and discover resources

2019-06-26 Thread Xingbo Jiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingbo Jiang resolved SPARK-27369.
--
Resolution: Done

> Standalone worker can load resource conf and discover resources
> ---
>
> Key: SPARK-27369
> URL: https://issues.apache.org/jira/browse/SPARK-27369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: wuyi
>Priority: Major
>







[jira] [Updated] (SPARK-28175) Exception when using userClassPathFirst

2019-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28175:
-
Labels:   (was: version)

> Exception when using userClassPathFirst
> ---
>
> Key: SPARK-28175
> URL: https://issues.apache.org/jira/browse/SPARK-28175
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Shan Zhao
>Priority: Major
>
> We want to override the executor/driver spark version with our own spark 
> version in a fat jar.
> Conf: 
> spark.driver.userClassPathFirst=true
> spark.executor.userClassPathFirst=true,
> Build: spark 2.3.0 
>  
> Deploy: cluster mode with spark 2.0.0 installed
> We encounter the following error: 
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Unable to load 
> YARN support
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412)
>  at org.apache.spark.deploy.yarn.Client.(Client.scala:69)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273)
>  at org.apache.hadoop.security.Groups.(Groups.java:99)
>  at org.apache.hadoop.security.Groups.(Groups.java:95)
>  at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420)
>  at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324)
>  at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
>  at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51)
>  at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>  at java.lang.Class.newInstance(Class.java:374)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389)
>  ... 15 more
> Caused by: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267)
>  ... 28 more{code}






[jira] [Commented] (SPARK-28172) pyspark DataFrame equality operator

2019-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873764#comment-16873764
 ] 

Hyukjin Kwon commented on SPARK-28172:
--

{code}
  >>> df1 == df2
  True
{code}

Wouldn't this require executing both DataFrames and collecting the data on the 
driver side? When the datasets are large, it's very easy for users to shoot 
themselves in the foot. I won't do that without an explicit plan and a design 
doc for all the other operators.

> pyspark DataFrame equality operator
> ---
>
> Key: SPARK-28172
> URL: https://issues.apache.org/jira/browse/SPARK-28172
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Hugo
>Priority: Minor
>
> *Motivation:*
>  Facilitating testing equality between DataFrames. Many approaches for 
> applying TDD practices in data science require checking function output 
> against expected output (similar to API snapshots). Having an equality 
> operator or something like it would make this easier. 
> A basic example:
> {code}
> from pyspark.ml.feature import Imputer
> def create_mock_missing_df():
>return spark.createDataFrame([("a", 1.0),("b", 2.0),("c", float('nan'))],  
>  ['COL1', 'COL2'])
> def test_imputation():
>"""Test mean value imputation"""
>df1 = create_mock_missing_df()
>#load snapshot
>pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect()
>df2 = spark.createDataFrame(pickled_snapshot)
>"""
>>>> df2.show()
>+++
>|COL1|COL2_imputed|
>+++
>| a  | 1.0|
>| b  | 2.0|
>| c  | 1.5|
>+++
>""" 
>imputer = Imputer(
>   inputCols=['COL2'],
>   outputCols=['COL2_imputed']
>)
>df1 = imputer.fit(df1).transform(df1)
>df1 = df1.drop('COL2')
>assert df1 == df2
> {code}
>  
>  Suggested change:
> {code}
> class DataFrame(object):
>...
>def __eq__(self, other):
>   """Returns ``True`` if DataFrame content is equal to other.
>   >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
>   >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)])
>   >>> df1 == df2
>   True
>   """
>   return self.unionAll(other) \
>  .subtract(self.intersect(other)) \
>  .count() == 0
> {code}






[jira] [Resolved] (SPARK-28172) pyspark DataFrame equality operator

2019-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28172.
--
Resolution: Duplicate

> pyspark DataFrame equality operator
> ---
>
> Key: SPARK-28172
> URL: https://issues.apache.org/jira/browse/SPARK-28172
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Hugo
>Priority: Minor
>
> *Motivation:*
>  Facilitating testing equality between DataFrames. Many approaches for 
> applying TDD practices in data science require checking function output 
> against expected output (similar to API snapshots). Having an equality 
> operator or something like it would make this easier. 
> A basic example:
> {code}
> from pyspark.ml.feature import Imputer
> def create_mock_missing_df():
>return spark.createDataFrame([("a", 1.0),("b", 2.0),("c", float('nan'))],  
>  ['COL1', 'COL2'])
> def test_imputation():
>"""Test mean value imputation"""
>df1 = create_mock_missing_df()
>#load snapshot
>pickled_snapshot = sc.pickleFile('imputed_df.pkl').collect()
>df2 = spark.createDataFrame(pickled_snapshot)
>"""
>>>> df2.show()
>+++
>|COL1|COL2_imputed|
>+++
>| a  | 1.0|
>| b  | 2.0|
>| c  | 1.5|
>+++
>""" 
>imputer = Imputer(
>   inputCols=['COL2'],
>   outputCols=['COL2_imputed']
>)
>df1 = imputer.fit(df1).transform(df1)
>df1 = df1.drop('COL2')
>assert df1 == df2
> {code}
>  
>  Suggested change:
> {code}
> class DataFrame(object):
>...
>def __eq__(self, other):
>   """Returns ``True`` if DataFrame content is equal to other.
>   >>> df1 = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)])
>   >>> df2 = spark.createDataFrame([("b", 2), ("a", 1), ("c", 3)])
>   >>> df1 == df2
>   True
>   """
>   return self.unionAll(other) \
>  .subtract(self.intersect(other)) \
>  .count() == 0
> {code}






[jira] [Updated] (SPARK-28171) Correct interval type conversion behavior

2019-06-26 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28171:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-27764

> Correct interval type conversion behavior 
> --
>
> Key: SPARK-28171
> URL: https://issues.apache.org/jira/browse/SPARK-28171
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> When calculating between a timestamp and an interval:
> {code:sql}
> select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, 
> timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second
> {code}
>  * PostgreSQL return *2019-01-02 02:03:04*    *2018-12-31 02:03:04*
>  * SparkSQL return    *2019-01-02 02:03:04*    *2018-12-30 21:56:56*
> {code:sql}
> select timestamp '2019-01-01 00:00:00' + interval '1 -2:03:04' day to second, 
> timestamp '2019-01-01 00:00:00' + interval '-1 -2:03:04' day to second
> {code}
>  * PostgreSQL return *2019-01-01 21:56:56*    *2018-12-30 21:56:56*
>  * SparkSQL return    _*Interval string does not match day-time format of 'd 
> h:m:s.n': '1 -2:03:04'(line 1, pos 50)*_






[jira] [Updated] (SPARK-28171) Difference of interval type conversion between SparkSQL and PostgreSQL

2019-06-26 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28171:

Summary: Difference of interval type conversion between SparkSQL and 
PostgreSQL  (was: Correct interval type conversion behavior )

> Difference of interval type conversion between SparkSQL and PostgreSQL
> --
>
> Key: SPARK-28171
> URL: https://issues.apache.org/jira/browse/SPARK-28171
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> When calculating between a timestamp and an interval:
> {code:sql}
> select timestamp '2019-01-01 00:00:00' + interval '1 2:03:04' day to second, 
> timestamp '2019-01-01 00:00:00' + interval '-1 2:03:04' day to second
> {code}
>  * PostgreSQL return *2019-01-02 02:03:04*    *2018-12-31 02:03:04*
>  * SparkSQL return    *2019-01-02 02:03:04*    *2018-12-30 21:56:56*
> {code:sql}
> select timestamp '2019-01-01 00:00:00' + interval '1 -2:03:04' day to second, 
> timestamp '2019-01-01 00:00:00' + interval '-1 -2:03:04' day to second
> {code}
>  * PostgreSQL return *2019-01-01 21:56:56*    *2018-12-30 21:56:56*
>  * SparkSQL return    _*Interval string does not match day-time format of 'd 
> h:m:s.n': '1 -2:03:04'(line 1, pos 50)*_






[jira] [Updated] (SPARK-28168) release-build.sh for Hadoop-3.2

2019-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28168:
-
Description: Hadoop-3.2 release process in o release-build.sh.  (was: Add 
Hadoop-3.2 to release-build.sh?)

> release-build.sh for Hadoop-3.2
> ---
>
> Key: SPARK-28168
> URL: https://issues.apache.org/jira/browse/SPARK-28168
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Hadoop-3.2 release process in o release-build.sh.






[jira] [Updated] (SPARK-28168) release-build.sh for Hadoop-3.2

2019-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28168:
-
Description: Hadoop 3.2 support was added. We might have to add Hadoop-3.2 
in release-build.sh.  (was: Hadoop-3.2 release process in o release-build.sh.)

> release-build.sh for Hadoop-3.2
> ---
>
> Key: SPARK-28168
> URL: https://issues.apache.org/jira/browse/SPARK-28168
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Hadoop 3.2 support was added. We might have to add Hadoop-3.2 in 
> release-build.sh.






[jira] [Created] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase

2019-06-26 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28179:
---

 Summary: Avoid hard-coded config: spark.sql.globalTempDatabase
 Key: SPARK-28179
 URL: https://issues.apache.org/jira/browse/SPARK-28179
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang









[jira] [Assigned] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28179:


Assignee: Apache Spark

> Avoid hard-coded config: spark.sql.globalTempDatabase
> -
>
> Key: SPARK-28179
> URL: https://issues.apache.org/jira/browse/SPARK-28179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-28179) Avoid hard-coded config: spark.sql.globalTempDatabase

2019-06-26 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28179:


Assignee: (was: Apache Spark)

> Avoid hard-coded config: spark.sql.globalTempDatabase
> -
>
> Key: SPARK-28179
> URL: https://issues.apache.org/jira/browse/SPARK-28179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-28170) DenseVector .toArray() and .values documentation do not specify they are aliases

2019-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28170:
-
Component/s: PySpark

> DenseVector .toArray() and .values documentation do not specify they are 
> aliases
> 
>
> Key: SPARK-28170
> URL: https://issues.apache.org/jira/browse/SPARK-28170
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.3
>Reporter: Sivam Pasupathipillai
>Priority: Minor
>
> The documentation of the *toArray()* method and the *values* property in 
> pyspark.ml.linalg.DenseVector is confusing.
> *toArray():* Returns a numpy.ndarray
> *values:* Returns a list of values
> However, they are actually aliases and they both return a numpy.ndarray.
> FIX: either change the documentation or change the *values* property to 
> return a Python list.






[jira] [Resolved] (SPARK-28152) ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector

2019-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28152.
--
Resolution: Duplicate

> ShortType and FloatTypes are not correctly mapped to right JDBC types when 
> using JDBC connector
> ---
>
> Key: SPARK-28152
> URL: https://issues.apache.org/jira/browse/SPARK-28152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Shiv Prashant Sood
>Priority: Minor
>
> ShortType and FloatType are not correctly mapped to the right JDBC types when 
> using the JDBC connector. This results in tables and Spark data frames being 
> created with unintended types.
> Some example issues:
>  * A write from a DataFrame with a ShortType column results in a SQL table 
> with column type INTEGER as opposed to SMALLINT, and thus a larger table than 
> expected.
>  * A read results in a DataFrame with type INTEGER as opposed to ShortType.
> FloatType has an issue on the read path. In the write path the Spark data type 
> 'FloatType' is correctly mapped to the JDBC equivalent data type 'Real', but 
> in the read path, when JDBC data types need to be converted to Catalyst data 
> types (getCatalystType), 'Real' incorrectly gets mapped to 'DoubleType' rather 
> than 'FloatType'.
>  
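As a possible user-side workaround until the mappings are fixed (a sketch only, 
not the Spark fix itself; the JDBC URL prefix below is an assumption), a custom 
JdbcDialect can pin both directions of the mapping:
{code:scala}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object ShortAndFloatDialect extends JdbcDialect {
  // Assumption: the target database is reached via a jdbc:sqlserver URL.
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  // Write path: Catalyst type -> JDBC column definition.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case ShortType => Some(JdbcType("SMALLINT", Types.SMALLINT))
    case FloatType => Some(JdbcType("REAL", Types.REAL))
    case _ => None
  }

  // Read path: JDBC type -> Catalyst type (avoids REAL being read as DoubleType).
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    sqlType match {
      case Types.SMALLINT => Some(ShortType)
      case Types.REAL => Some(FloatType)
      case _ => None
    }
}

JdbcDialects.registerDialect(ShortAndFloatDialect)
{code}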






[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst

2019-06-26 Thread Shan Zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873800#comment-16873800
 ] 

Shan Zhao commented on SPARK-28175:
---

No, since only 2.0.0 is available in our cluster.

 

> Exception when using userClassPathFirst
> ---
>
> Key: SPARK-28175
> URL: https://issues.apache.org/jira/browse/SPARK-28175
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Shan Zhao
>Priority: Major
>
> We want to override the executor/driver spark version with our own spark 
> version in a fat jar.
> Conf: 
> spark.driver.userClassPathFirst=true
> spark.executor.userClassPathFirst=true,
> Build: spark 2.3.0 
>  
> Deploy: cluster mode with spark 2.0.0 installed
> We encounter the following error: 
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Unable to load 
> YARN support
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412)
>  at org.apache.spark.deploy.yarn.Client.(Client.scala:69)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273)
>  at org.apache.hadoop.security.Groups.(Groups.java:99)
>  at org.apache.hadoop.security.Groups.(Groups.java:95)
>  at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420)
>  at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324)
>  at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
>  at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51)
>  at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>  at java.lang.Class.newInstance(Class.java:374)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389)
>  ... 15 more
> Caused by: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267)
>  ... 28 more{code}






[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst

2019-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873801#comment-16873801
 ] 

Hyukjin Kwon commented on SPARK-28175:
--

hm, then was it not tested in Spark 2.3?

> Exception when using userClassPathFirst
> ---
>
> Key: SPARK-28175
> URL: https://issues.apache.org/jira/browse/SPARK-28175
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Shan Zhao
>Priority: Major
>
> We want to override the executor/driver spark version with our own spark 
> version in a fat jar.
> Conf: 
> spark.driver.userClassPathFirst=true
> spark.executor.userClassPathFirst=true,
> Build: spark 2.3.0 
>  
> Deploy: cluster mode with spark 2.0.0 installed
> We encounter the following error: 
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Unable to load 
> YARN support
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412)
>  at org.apache.spark.deploy.yarn.Client.(Client.scala:69)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273)
>  at org.apache.hadoop.security.Groups.(Groups.java:99)
>  at org.apache.hadoop.security.Groups.(Groups.java:95)
>  at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420)
>  at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324)
>  at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
>  at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:51)
>  at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.(YarnSparkHadoopUtil.scala:49)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>  at java.lang.Class.newInstance(Class.java:374)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389)
>  ... 15 more
> Caused by: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267)
>  ... 28 more{code}
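For reference, the two properties listed in the report are Spark's (experimental) 
user-classpath-first switches. A minimal sketch, assuming they were written out 
programmatically just to make the exact keys visible; in practice they are passed 
as --conf options to spark-submit, and spark.driver.userClassPathFirst in 
particular has to be set before the driver JVM starts, so the SparkConf form 
below is illustrative only:

{code:java}
import org.apache.spark.SparkConf;

// Sketch only, not taken from the report: the same two properties the reporter
// lists, collected in a SparkConf so the keys and values are visible in one place.
public final class UserClassPathFirstConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("user-classpath-first-demo")            // hypothetical app name
                .set("spark.driver.userClassPathFirst", "true")     // driver prefers classes from the user jar
                .set("spark.executor.userClassPathFirst", "true");  // executors do the same
        System.out.println(conf.toDebugString());                   // print the effective settings
    }
}
{code}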



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28175) Exception when using userClassPathFirst

2019-06-26 Thread Shan Zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873804#comment-16873804
 ] 

Shan Zhao commented on SPARK-28175:
---

We have another cluster with Spark 2.3, and apparently it doesn't need the two 
"userClassPathFirst" settings; we haven't tested the code with the 
"userClassPathFirst" conf in Spark 2.3. So what we're trying to do is to compile 
the code against Spark 2.3 and be able to run it on both the Spark 2.0 and 
Spark 2.3 clusters, if that makes sense.
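One way to check which build actually wins at runtime is to print the Spark 
version the driver sees and where its classes were loaded from, on both clusters. 
A minimal sketch (the class name and app name are made up, not from this thread):

{code:java}
import org.apache.spark.sql.SparkSession;

// Diagnostic sketch: report which Spark build the driver loaded and which jar its
// classes come from, to see whether the fat jar's 2.3 classes or the cluster's
// installed Spark win under userClassPathFirst.
public final class WhichSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("which-spark").getOrCreate();
        System.out.println("Spark version seen by the driver: " + spark.version());
        // getCodeSource() can be null in some environments; guard accordingly in real code.
        System.out.println("SparkSession loaded from: "
                + SparkSession.class.getProtectionDomain().getCodeSource().getLocation());
        spark.stop();
    }
}
{code}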

> Exception when using userClassPathFirst
> ---
>
> Key: SPARK-28175
> URL: https://issues.apache.org/jira/browse/SPARK-28175
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Shan Zhao
>Priority: Major
>
> We want to override the executor/driver Spark version with our own Spark 
> version in a fat jar.
> Conf: 
> spark.driver.userClassPathFirst=true
> spark.executor.userClassPathFirst=true
> Build: Spark 2.3.0 
>  
> Deploy: cluster mode with Spark 2.0.0 installed
> We encounter the following error: 
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Unable to load 
> YARN support
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:392)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:387)
>  at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:412)
>  at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:69)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1290)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
>  at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2273)
>  at org.apache.hadoop.security.Groups.<init>(Groups.java:99)
>  at org.apache.hadoop.security.Groups.<init>(Groups.java:95)
>  at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:420)
>  at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:324)
>  at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:352)
>  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:51)
>  at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.<init>(YarnSparkHadoopUtil.scala:49)
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>  at java.lang.Class.newInstance(Class.java:374)
>  at 
> org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:389)
>  ... 15 more
> Caused by: java.lang.RuntimeException: class 
> org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not 
> org.apache.hadoop.security.GroupMappingServiceProvider
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2267)
>  ... 28 more{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-26 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873813#comment-16873813
 ] 

Hyukjin Kwon commented on SPARK-28161:
--

Can you show the full error messages? Also, let's ask questions on the mailing 
list before filing them as issues; you would get a better answer there. I don't 
think this is an issue within Spark for now; it rather sounds like an environment 
issue.

> Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
> --
>
> Key: SPARK-28161
> URL: https://issues.apache.org/jira/browse/SPARK-28161
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
> Environment: {code}
> # Dockerfile
> # Pull base image.
> FROM ubuntu:16.04
> RUN apt update --fix-missing
> RUN apt-get install -y software-properties-common
> RUN mkdir /usr/java
> ADD jdk-8u212-linux-x64.tar /usr/java
> ENV JAVA_HOME=/usr/java/jdk1.8.0_212
> RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
> 2
> RUN update-alternatives --install /usr/bin/javac javac 
> ${JAVA_HOME%*/}/bin/javac 2
> ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"
> ENV MAVEN_VERSION 3.6.1
> RUN apt-get install -y curl wget
> RUN curl -fsSL 
> http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
>  | tar xzf - -C /usr/share \
>  && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
>  && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn
> ENV MAVEN_HOME /usr/share/maven
> ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
> ENV SPARK_SRC="/usr/src/spark"
> ENV BRANCH="v2.4.3"
> RUN apt-get update && apt-get install -y --no-install-recommends \
>  git python3 python3-setuptools r-base-dev r-cran-evaluate
> RUN mkdir -p $SPARK_SRC
> RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC
> WORKDIR $SPARK_SRC
> RUN ./build/mvn -DskipTests clean package
> {code}
>Reporter: Martin Nigsch
>Priority: Minor
>
> Trying to build Spark from source based on the attached Dockerfile (launched 
> with Docker on OSX) fails. 
> Attempts to change/add the following things beyond what's recommended on the 
> build page do not bring improvement: 
> 1. adding RUN ./dev/change-scala-version.sh 2.11 --> doesn't help
> 2. editing the pom.xml to exclude zinc as in one of the answers in 
> [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
>  --> doesn't help
> 3. adding the option -DrecompileMode=all --> doesn't help
> I've downloaded the Java from Oracle directly (jdk-8u212-linux-x64.tar) and 
> manually put it into /usr/java, as the Oracle Java seems to be the recommended 
> one. 
> Build fails at the Spark Project SQL module with:
> {code}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 2.4.3:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [ 59.875 
> s]
> [INFO] Spark Project Tags . SUCCESS [ 20.386 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 3.026 s]
> [INFO] Spark Project Local DB . SUCCESS [ 5.654 s]
> [INFO] Spark Project Networking ... SUCCESS [ 7.401 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 3.400 s]
> [INFO] Spark Project Unsafe ... SUCCESS [ 6.306 s]
> [INFO] Spark Project Launcher . SUCCESS [ 17.471 
> s]
> [INFO] Spark Project Core . SUCCESS [02:36 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 50.313 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [ 21.097 
> s]
> [INFO] Spark Project Streaming  SUCCESS [ 52.537 
> s]
> [INFO] Spark Project Catalyst . SUCCESS [02:44 
> min]
> [INFO] Spark Project SQL .. FAILURE [10:44 
> min]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> [INFO] Spark Avro ...

[jira] [Resolved] (SPARK-28161) Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212

2019-06-26 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28161.
--
Resolution: Invalid

> Can't build Spark 2.4.3 based on Ubuntu and Oracle Java 8 SDK v212
> --
>
> Key: SPARK-28161
> URL: https://issues.apache.org/jira/browse/SPARK-28161
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
> Environment: {code}
> # Dockerfile
> # Pull base image.
> FROM ubuntu:16.04
> RUN apt update --fix-missing
> RUN apt-get install -y software-properties-common
> RUN mkdir /usr/java
> ADD jdk-8u212-linux-x64.tar /usr/java
> ENV JAVA_HOME=/usr/java/jdk1.8.0_212
> RUN update-alternatives --install /usr/bin/java java ${JAVA_HOME%*/}/bin/java 
> 2
> RUN update-alternatives --install /usr/bin/javac javac 
> ${JAVA_HOME%*/}/bin/javac 2
> ENV PATH="${PATH}:/usr/java/jdk1.8.0_212/bin"
> ENV MAVEN_VERSION 3.6.1
> RUN apt-get install -y curl wget
> RUN curl -fsSL 
> http://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz
>  | tar xzf - -C /usr/share \
>  && mv /usr/share/apache-maven-${MAVEN_VERSION} /usr/share/maven \
>  && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn
> ENV MAVEN_HOME /usr/share/maven
> ENV MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
> ENV SPARK_SRC="/usr/src/spark"
> ENV BRANCH="v2.4.3"
> RUN apt-get update && apt-get install -y --no-install-recommends \
>  git python3 python3-setuptools r-base-dev r-cran-evaluate
> RUN mkdir -p $SPARK_SRC
> RUN git clone --branch $BRANCH https://github.com/apache/spark $SPARK_SRC
> WORKDIR $SPARK_SRC
> RUN ./build/mvn -DskipTests clean package
> {code}
>Reporter: Martin Nigsch
>Priority: Minor
>
> Trying to build Spark from source based on the attached Dockerfile (launched 
> with Docker on OSX) fails. 
> Attempts to change/add the following things beyond what's recommended on the 
> build page do not bring improvement: 
> 1. adding RUN ./dev/change-scala-version.sh 2.11 --> doesn't help
> 2. editing the pom.xml to exclude zinc as in one of the answers in 
> [https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven/41223558]
>  --> doesn't help
> 3. adding the option -DrecompileMode=all --> doesn't help
> I've downloaded the Java from Oracle directly (jdk-8u212-linux-x64.tar) and 
> manually put it into /usr/java, as the Oracle Java seems to be the recommended 
> one. 
> Build fails at the Spark Project SQL module with:
> {code}
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 2.4.3:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [ 59.875 
> s]
> [INFO] Spark Project Tags . SUCCESS [ 20.386 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [ 3.026 s]
> [INFO] Spark Project Local DB . SUCCESS [ 5.654 s]
> [INFO] Spark Project Networking ... SUCCESS [ 7.401 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 3.400 s]
> [INFO] Spark Project Unsafe ... SUCCESS [ 6.306 s]
> [INFO] Spark Project Launcher . SUCCESS [ 17.471 
> s]
> [INFO] Spark Project Core . SUCCESS [02:36 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 50.313 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [ 21.097 
> s]
> [INFO] Spark Project Streaming  SUCCESS [ 52.537 
> s]
> [INFO] Spark Project Catalyst . SUCCESS [02:44 
> min]
> [INFO] Spark Project SQL .. FAILURE [10:44 
> min]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> [INFO] Spark Avro . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time: 20:15 min
> [INFO] Finished at: 

[jira] [Created] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-26 Thread M. Le Bihan (JIRA)
M. Le Bihan created SPARK-28180:
---

 Summary: Encoding CSV to Pojo works with Encoders.bean on RDD but 
fails on asserts when attempting it from a Dataset
 Key: SPARK-28180
 URL: https://issues.apache.org/jira/browse/SPARK-28180
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
 Environment: Debian 9, Java 8.
Reporter: M. Le Bihan


I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to 
an RDD of Entreprise objects, with Encoders.bean(Entreprise.class); now it does 
the conversion more simply, by loading the CSV content into a Dataset<Row> and 
applying _Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
 .option("header","true").option("quote", "\"").option("escape", "\"")
 .load(source.getAbsolutePath())
 .selectExpr(
"ActivitePrincipaleUniteLegale as ActivitePrincipale",
"CAST(AnneeCategorieEntreprise as INTEGER) as 
AnneeCategorieEntreprise",
"CAST(AnneeEffectifsUniteLegale as INTEGER) as 
AnneeValiditeEffectifSalarie",
"CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as 
CaractereEmployeur",
"CategorieEntreprise", 
"CategorieJuridiqueUniteLegale as CategorieJuridique",
"DateCreationUniteLegale as DateCreationEntreprise", "DateDebut as 
DateDebutHistorisation", "DateDernierTraitementUniteLegale as 
DateDernierTraitement",
"DenominationUniteLegale as Denomination", 
"DenominationUsuelle1UniteLegale as DenominationUsuelle1", 
"DenominationUsuelle2UniteLegale as DenominationUsuelle2", 
"DenominationUsuelle3UniteLegale as DenominationUsuelle3",
"CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as 
EconomieSocialeSolidaire",
"CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
"IdentifiantAssociationUniteLegale as IdentifiantAssociation",
"NicSiegeUniteLegale as NicSiege", 
"CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
"NomenclatureActivitePrincipaleUniteLegale as 
NomenclatureActivitePrincipale",
"NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
"Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2", 
"Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4", 
"PrenomUsuelUniteLegale as PrenomUsuel",
"PseudonymeUniteLegale as Pseudonyme",
"SexeUniteLegale as Sexe", 
"SigleUniteLegale as Sigle", 
"Siren", 
"TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
 );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Entreprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java
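The report ends before showing the bean class or the exact line that triggers the 
assertion. A minimal sketch of what is presumably involved, under the usual 
Encoders.bean contract (a public class with a no-argument constructor and 
getter/setter pairs matching the selected column names); the single field and the 
method names below are illustrative, not taken from the report:

{code:java}
import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Hypothetical bean: only one of the ~30 selected columns is sketched here.
public class Entreprise implements Serializable {

    private String siren;                           // would carry the "Siren" column

    public Entreprise() { }                         // no-arg constructor expected by Encoders.bean

    public String getSiren() { return siren; }
    public void setSiren(String siren) { this.siren = siren; }

    // Presumed failing step from the description, csv being the Dataset<Row> built above:
    public static Dataset<Entreprise> toBeans(Dataset<Row> csv) {
        return csv.as(Encoders.bean(Entreprise.class));
    }
}
{code}

Whether the assertion comes from the bean's shape or from the selected schema is 
exactly what the truncated stack trace above leaves open.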

[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-26 Thread M. Le Bihan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Le Bihan updated SPARK-28180:

Description: 
I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to 
an RDD of Entreprise objects, with Encoders.bean(Entreprise.class); now it does 
the conversion more simply, by loading the CSV content into a Dataset<Row> and 
applying _Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
 .option("header","true").option("quote", "\"").option("escape", "\"")
 .load(source.getAbsolutePath())
 .selectExpr(
"ActivitePrincipaleUniteLegale as ActivitePrincipale",
"CAST(AnneeCategorieEntreprise as INTEGER) as 
AnneeCategorieEntreprise",
"CAST(AnneeEffectifsUniteLegale as INTEGER) as 
AnneeValiditeEffectifSalarie",
"CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as 
CaractereEmployeur",
"CategorieEntreprise", 
"CategorieJuridiqueUniteLegale as CategorieJuridique",
"DateCreationUniteLegale as DateCreationEntreprise", "DateDebut as 
DateDebutHistorisation", "DateDernierTraitementUniteLegale as 
DateDernierTraitement",
"DenominationUniteLegale as Denomination", 
"DenominationUsuelle1UniteLegale as DenominationUsuelle1", 
"DenominationUsuelle2UniteLegale as DenominationUsuelle2", 
"DenominationUsuelle3UniteLegale as DenominationUsuelle3",
"CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as 
EconomieSocialeSolidaire",
"CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
"IdentifiantAssociationUniteLegale as IdentifiantAssociation",
"NicSiegeUniteLegale as NicSiege", 
"CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
"NomenclatureActivitePrincipaleUniteLegale as 
NomenclatureActivitePrincipale",
"NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
"Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2", 
"Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4", 
"PrenomUsuelUniteLegale as PrenomUsuel",
"PseudonymeUniteLegale as Pseudonyme",
"SexeUniteLegale as Sexe", 
"SigleUniteLegale as Sigle", 
"Siren", 
"TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
 );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Entreprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 
org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAl

[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-26 Thread M. Le Bihan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Le Bihan updated SPARK-28180:

Description: 
I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to 
an RDD of Entreprise objects, with Encoders.bean(Entreprise.class); now it does 
the conversion more simply, by loading the CSV content into a Dataset<Row> and 
applying _Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
 .option("header","true").option("quote", "\"").option("escape", "\"")
 .load(source.getAbsolutePath())
 .selectExpr(
"ActivitePrincipaleUniteLegale as ActivitePrincipale",
"CAST(AnneeCategorieEntreprise as INTEGER) as 
AnneeCategorieEntreprise",
"CAST(AnneeEffectifsUniteLegale as INTEGER) as 
AnneeValiditeEffectifSalarie",
"CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as 
CaractereEmployeur",
"CategorieEntreprise", 
"CategorieJuridiqueUniteLegale as CategorieJuridique",
"DateCreationUniteLegale as DateCreationEntreprise", "DateDebut as 
DateDebutHistorisation", "DateDernierTraitementUniteLegale as 
DateDernierTraitement",
"DenominationUniteLegale as Denomination", 
"DenominationUsuelle1UniteLegale as DenominationUsuelle1", 
"DenominationUsuelle2UniteLegale as DenominationUsuelle2", 
"DenominationUsuelle3UniteLegale as DenominationUsuelle3",
"CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as 
EconomieSocialeSolidaire",
"CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
"IdentifiantAssociationUniteLegale as IdentifiantAssociation",
"NicSiegeUniteLegale as NicSiege", 
"CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
"NomenclatureActivitePrincipaleUniteLegale as 
NomenclatureActivitePrincipale",
"NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
"Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2", 
"Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4", 
"PrenomUsuelUniteLegale as PrenomUsuel",
"PseudonymeUniteLegale as Pseudonyme",
"SexeUniteLegale as Sexe", 
"SigleUniteLegale as Sigle", 
"Siren", 
"TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
 );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Entreprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 
org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAl

[jira] [Updated] (SPARK-28180) Encoding CSV to Pojo works with Encoders.bean on RDD but fails on asserts when attempting it from a Dataset

2019-06-26 Thread M. Le Bihan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Le Bihan updated SPARK-28180:

Description: 
I am converting an _RDD_ Spark program to a _Dataset_ one.
Previously, it converted a CSV file, mapped with the help of a Jackson loader, to 
an RDD of Entreprise objects, with Encoders.bean(Entreprise.class); now it does 
the conversion more simply, by loading the CSV content into a Dataset<Row> and 
applying _Encoders.bean(Entreprise.class)_ to it.


{code:java}
  Dataset<Row> csv = this.session.read().format("csv")
 .option("header","true").option("quote", "\"").option("escape", "\"")
 .load(source.getAbsolutePath())
 .selectExpr(
"ActivitePrincipaleUniteLegale as ActivitePrincipale",
"CAST(AnneeCategorieEntreprise as INTEGER) as 
AnneeCategorieEntreprise",
"CAST(AnneeEffectifsUniteLegale as INTEGER) as 
AnneeValiditeEffectifSalarie",
"CAST(CaractereEmployeurUniteLegale == 'O' as BOOLEAN) as 
CaractereEmployeur",
"CategorieEntreprise", 
"CategorieJuridiqueUniteLegale as CategorieJuridique",
"DateCreationUniteLegale as DateCreationEntreprise", "DateDebut as 
DateDebutHistorisation", "DateDernierTraitementUniteLegale as 
DateDernierTraitement",
"DenominationUniteLegale as Denomination", 
"DenominationUsuelle1UniteLegale as DenominationUsuelle1", 
"DenominationUsuelle2UniteLegale as DenominationUsuelle2", 
"DenominationUsuelle3UniteLegale as DenominationUsuelle3",
"CAST(EconomieSocialeSolidaireUniteLegale == 'O' as BOOLEAN) as 
EconomieSocialeSolidaire",
"CAST(EtatAdministratifUniteLegale == 'A' as BOOLEAN) as Active",
"IdentifiantAssociationUniteLegale as IdentifiantAssociation",
"NicSiegeUniteLegale as NicSiege", 
"CAST(NombrePeriodesUniteLegale as INTEGER) as NombrePeriodes",
"NomenclatureActivitePrincipaleUniteLegale as 
NomenclatureActivitePrincipale",
"NomUniteLegale as NomNaissance", "NomUsageUniteLegale as NomUsage",
"Prenom1UniteLegale as Prenom1", "Prenom2UniteLegale as Prenom2", 
"Prenom3UniteLegale as Prenom3", "Prenom4UniteLegale as Prenom4", 
"PrenomUsuelUniteLegale as PrenomUsuel",
"PseudonymeUniteLegale as Pseudonyme",
"SexeUniteLegale as Sexe", 
"SigleUniteLegale as Sigle", 
"Siren", 
"TrancheEffectifsUniteLegale as TrancheEffectifSalarie"
 );
{code}

The _Dataset_ is successfully created, but the following call to 
_Encoders.bean(Entreprise.class)_ fails:


{code:java}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:87)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:142)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at 
fr.ecoemploi.spark.entreprise.EntrepriseService.dsEntreprises(EntrepriseService.java:178)
at 
test.fr.ecoemploi.spark.entreprise.EntreprisesIT.datasetEntreprises(EntreprisesIT.java:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:532)
at 
org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:115)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$6(TestMethodTestDescriptor.java:171)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:167)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:114)
at 
org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:59)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$4(NodeTestTask.java:108)
at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:72)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:98)
at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:74)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 
org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAl