[jira] [Commented] (SPARK-19293) Spark 2.1.x unstable with spark.speculation=true

2017-06-21 Thread Damian Momot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057091#comment-16057091
 ] 

Damian Momot commented on SPARK-19293:
--

Yep,

Some tasks are marked as "killed" but some become "failed". In some specific 
cases, if the number of failures is very large, it causes the entire Spark job to 
fail. Disabling "speculation" eliminates the failures entirely.

It was working flawlessly before Spark 2.1 with "speculation" enabled.
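For reference, a minimal sketch of the speculation settings involved (a sketch only; the values are illustrative defaults, not the reporter's actual configuration):

{code}
// Minimal sketch: toggling speculation when building a session.
// Setting spark.speculation to "false" is the workaround described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("speculation-example")
  .config("spark.speculation", "true")             // "false" disables speculative execution
  .config("spark.speculation.interval", "100ms")   // how often to check for slow tasks
  .config("spark.speculation.multiplier", "1.5")   // how much slower than the median a task must be
  .config("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish first
  .getOrCreate()
{code}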

> Spark 2.1.x unstable with spark.speculation=true
> 
>
> Key: SPARK-19293
> URL: https://issues.apache.org/jira/browse/SPARK-19293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Damian Momot
>Priority: Critical
>
> After upgrading from Spark 2.0.2 to 2.1.0 we've observed that jobs often fail 
> when speculative mode is enabled.
> In 2.0.2, speculative tasks were simply skipped if their result was not used 
> (i.e. another instance finished earlier) - and it was clearly visible in the UI 
> that those tasks were not counted as failures.
> In 2.1.0, many tasks are marked failed/killed when speculative tasks start to 
> run (that is, at the end of a stage when there are spare executors to use), 
> which also leads to entire stage/job failures.
> Disabling spark.speculation solves the failure problem - but speculative mode is 
> very useful, especially when different executors run on machines with varying 
> load (for example in YARN)






[jira] [Commented] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-21 Thread niefei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057117#comment-16057117
 ] 

niefei commented on SPARK-21159:


Thank you for your reply. It should use the launcher's IP address to connect, 
rather than the driver's IP address, as the launcher and the driver will not run 
on the same server in cluster mode.
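For context, a minimal sketch of the launch path under discussion (the jar path, main class and master URL are placeholders, not taken from the reporter's setup):

{code}
// Sketch: submit in cluster mode via SparkLauncher and wait for a final state.
// In cluster mode the driver runs on a worker host, not on the launcher host,
// which is why a loopback connection back to the launcher cannot work.
import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val done = new CountDownLatch(1)
val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")    // placeholder
  .setMainClass("com.example.Main")      // placeholder
  .setMaster("spark://master:7077")      // placeholder
  .setDeployMode("cluster")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      if (h.getState.isFinal) done.countDown()
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
done.await()
println(s"final state: ${handle.getState}")
{code}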

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>
> When a Spark application is submitted via the SparkLauncher#startApplication 
> method, the caller gets a SparkAppHandle. In the test environment the launcher 
> runs on Server A. If the application runs in client mode, everything is OK. In 
> cluster mode the launcher runs on Server A and the driver runs on Server B; in 
> this scenario, when the SparkContext is initialized, a LauncherBackend tries to 
> connect back to the launcher application via the specified port and IP address. 
> The problem is that the LauncherBackend implementation uses the loopback IP 
> (127.0.0.1) to connect, which causes the connection to be refused because 
> Server B never ran the launcher.
> The expected behavior is that the LauncherBackend should use Server A's IP 
> address to connect for reporting the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.<init>(Socket.java:434)
>   at java.net.Socket.<init>(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
> 

[jira] [Commented] (SPARK-19293) Spark 2.1.x unstable with spark.speculation=true

2017-06-21 Thread coneyliu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057127#comment-16057127
 ] 

coneyliu commented on SPARK-19293:
--

Have you tried the latest code? The exceptions you give are all about 
`InterruptedException` and `RuntimeException`; those seem to have been fixed by 
recent patches such as SPARK-20358 and others.

> Spark 2.1.x unstable with spark.speculation=true
> 
>
> Key: SPARK-19293
> URL: https://issues.apache.org/jira/browse/SPARK-19293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Damian Momot
>Priority: Critical
>
> After upgrading from Spark 2.0.2 to 2.1.0 we've observed that jobs often fail 
> when speculative mode is enabled.
> In 2.0.2, speculative tasks were simply skipped if their result was not used 
> (i.e. another instance finished earlier) - and it was clearly visible in the UI 
> that those tasks were not counted as failures.
> In 2.1.0, many tasks are marked failed/killed when speculative tasks start to 
> run (that is, at the end of a stage when there are spare executors to use), 
> which also leads to entire stage/job failures.
> Disabling spark.speculation solves the failure problem - but speculative mode is 
> very useful, especially when different executors run on machines with varying 
> load (for example in YARN)






[jira] [Comment Edited] (SPARK-19293) Spark 2.1.x unstable with spark.speculation=true

2017-06-21 Thread coneyliu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057127#comment-16057127
 ] 

coneyliu edited comment on SPARK-19293 at 6/21/17 8:01 AM:
---

Have you tried the latest code? The exceptions you give are all about 
`InterruptedException` and `RuntimeException`; those seem to have been fixed by 
recent patches such as SPARK-20358 and others.


was (Author: coneyliu):
Have you tried the latest code? The exceptions you give are all about 
`InterruptedException` and `RuntimeException`, those seems have been fixed in 
recently path, such as #SPARK-20358 and more. 

> Spark 2.1.x unstable with spark.speculation=true
> 
>
> Key: SPARK-19293
> URL: https://issues.apache.org/jira/browse/SPARK-19293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Damian Momot
>Priority: Critical
>
> After upgrading from Spark 2.0.2 to 2.1.0 we've observed that jobs often fail 
> when speculative mode is enabled.
> In 2.0.2, speculative tasks were simply skipped if their result was not used 
> (i.e. another instance finished earlier) - and it was clearly visible in the UI 
> that those tasks were not counted as failures.
> In 2.1.0, many tasks are marked failed/killed when speculative tasks start to 
> run (that is, at the end of a stage when there are spare executors to use), 
> which also leads to entire stage/job failures.
> Disabling spark.speculation solves the failure problem - but speculative mode is 
> very useful, especially when different executors run on machines with varying 
> load (for example in YARN)






[jira] [Commented] (SPARK-19293) Spark 2.1.x unstable with spark.speculation=true

2017-06-21 Thread Damian Momot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057157#comment-16057157
 ] 

Damian Momot commented on SPARK-19293:
--

I'll try to build from the 2.2 branch and test it today.

> Spark 2.1.x unstable with spark.speculation=true
> 
>
> Key: SPARK-19293
> URL: https://issues.apache.org/jira/browse/SPARK-19293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Damian Momot
>Priority: Critical
>
> After upgrading from Spark 2.0.2 to 2.1.0 we've observed that jobs often fail 
> when speculative mode is enabled.
> In 2.0.2, speculative tasks were simply skipped if their result was not used 
> (i.e. another instance finished earlier) - and it was clearly visible in the UI 
> that those tasks were not counted as failures.
> In 2.1.0, many tasks are marked failed/killed when speculative tasks start to 
> run (that is, at the end of a stage when there are spare executors to use), 
> which also leads to entire stage/job failures.
> Disabling spark.speculation solves the failure problem - but speculative mode is 
> very useful, especially when different executors run on machines with varying 
> load (for example in YARN)






[jira] [Commented] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057312#comment-16057312
 ] 

Apache Spark commented on SPARK-21144:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18375

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}
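For readers who want to try this outside the test suite, a hedged, self-contained rendering of the repro above (withTempPath is a Spark test helper, so a plain temporary directory is used here; `spark` is assumed to be an existing SparkSession):

{code}
import java.nio.file.Files

val basePath = Files.createTempDirectory("dup-col").toString
spark.range(0, 3).toDF("foo").write.parquet(s"$basePath/foo=1")
spark.range(0, 3).toDF("foo").write.parquet(s"$basePath/foo=a")
// The data column "foo" collides with the partition column "foo" inferred from the
// directory names, which is why the values shown above come out ambiguous.
spark.read.parquet(basePath).show()
{code}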






[jira] [Updated] (SPARK-20466) HadoopRDD#addLocalConfiguration throws NPE

2017-06-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20466:
--
Priority: Minor  (was: Major)

Hm, I think the question is how the JobConf is ever null here. I think adding a 
null check here would only be a band-aid, or at least, something that would 
need to be taken care of consistently across many more classes.

> HadoopRDD#addLocalConfiguration throws NPE
> --
>
> Key: SPARK-20466
> URL: https://issues.apache.org/jira/browse/SPARK-20466
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.2
>Reporter: liyunzhang_intel
>Priority: Minor
> Attachments: NPE_log
>
>
> In Spark 2.0.2, it throws an NPE:
> {code}
> 17/04/23 08:19:55 ERROR executor.Executor: Exception in task 439.0 in stage 16.0 (TID 986)
> java.lang.NullPointerException
>   at org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:373)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:243)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Suggestion: add a null check to avoid the NPE:
> {code}
> /** Add Hadoop configuration specific to a single partition and attempt. */
> def addLocalConfiguration(jobTrackerId: String, jobId: Int, splitId: Int,
>     attemptId: Int, conf: JobConf) {
>   val jobID = new JobID(jobTrackerId, jobId)
>   val taId = new TaskAttemptID(new TaskID(jobID, TaskType.MAP, splitId), attemptId)
>   // Guard against a null JobConf instead of failing with an NPE.
>   if (conf != null) {
>     conf.set("mapred.tip.id", taId.getTaskID.toString)
>     conf.set("mapred.task.id", taId.toString)
>     conf.setBoolean("mapred.task.is.map", true)
>     conf.setInt("mapred.task.partition", splitId)
>     conf.set("mapred.job.id", jobID.toString)
>   }
> }
> {code}






[jira] [Created] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Edoardo Vivo (JIRA)
Edoardo Vivo created SPARK-21160:


 Summary: Filtering rows with "not equal" operator yields 
unexpected result with null rows
 Key: SPARK-21160
 URL: https://issues.apache.org/jira/browse/SPARK-21160
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.0.2
Reporter: Edoardo Vivo
Priority: Minor


```
schema = StructType([StructField("Test", DoubleType())])
test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
test2.where("Test != 1").show()
```
This returns only the rows with the value 2; it does not return the null row. 
This should not be the expected behavior, IMO. 
Thank you.







[jira] [Commented] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2017-06-21 Thread Arkadiusz Bicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057410#comment-16057410
 ] 

Arkadiusz Bicz commented on SPARK-18484:


Usage of DecimalType should be avoided with this implementation, as there are so 
many issues with it. From my experience you never know which precision you will 
end up with in the Parquet file, and if Parquet files in one directory have 
different precisions, the directory is not readable by Spark.
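A minimal sketch (my own illustration, not part of this ticket) of pinning the decimal type up front with an explicit schema instead of relying on the encoder's Decimal(38,18) default:

{code}
// Sketch: declare decimal(10,2) explicitly when building the DataFrame so every
// written file carries the same precision/scale. Column names and values are
// illustrative; `spark` is assumed to be an existing SparkSession.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("money", DecimalType(10, 2), nullable = true)
))
val rows = spark.sparkContext.parallelize(Seq(
  Row("1", new java.math.BigDecimal("22.50")),
  Row("2", new java.math.BigDecimal("500.66"))
))
spark.createDataFrame(rows, schema).printSchema()  // money: decimal(10,2)
{code}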

> case class datasets - ability to specify decimal precision and scale
> 
>
> Key: SPARK-18484
> URL: https://issues.apache.org/jira/browse/SPARK-18484
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Damian Momot
>
> Currently, when using a decimal type (BigDecimal in a Scala case class) there's no 
> way to enforce precision and scale. This is quite critical when saving data - 
> regarding space usage and compatibility with external systems (for example Hive 
> tables) - because Spark saves the data as Decimal(38,18)
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> A workaround is to convert the Dataset to a DataFrame before saving and manually 
> cast it to a specific decimal scale/precision:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}






[jira] [Created] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-21 Thread Jian Wu (JIRA)
Jian Wu created SPARK-21161:
---

 Summary: SparkContext stopped when execute a query on Solr
 Key: SPARK-21161
 URL: https://issues.apache.org/jira/browse/SPARK-21161
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
 Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
solr-solrj-6.5.1.jar
Reporter: Jian Wu


The SparkContext stopped because DAGSchedulerEventProcessLoop failed when I 
queried Solr data in Spark.

{code:none}
17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
shutting down SparkContext
17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
input string: “8983_solr”
17/06/21 12:40:53 INFO ContextLauncher: at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
17/06/21 12:40:53 INFO ContextLauncher: at 
java.lang.Integer.parseInt(Integer.java:580)
17/06/21 12:40:53 INFO ContextLauncher: at 
java.lang.Integer.parseInt(Integer.java:615)
17/06/21 12:40:53 INFO ContextLauncher: at 
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
17/06/21 12:40:53 INFO ContextLauncher: at 
scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
17/06/21 12:40:53 INFO ContextLauncher: at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
17/06/21 12:40:53 INFO ContextLauncher: at 
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
17/06/21 12:40:53 INFO ContextLauncher: at 
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:159)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
17/06/21 12:40:53 INFO ContextLauncher: at 
scala.collection.immutable.List.foreach(List.scala:381)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
17/06/21 12:40:53 INFO ContextLauncher: at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}

It is caused by the special Solr node name, like "idx5.oi.dev:8983_solr", which 
carries "_solr" along with the port number. So when the YarnScheduler parses the 
port, it gets a "java.lang.N

[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-21 Thread Mikael Valot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057510#comment-16057510
 ] 

Mikael Valot commented on SPARK-21137:
--

This is a very common issue; I do not understand why this is closed.

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!
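For what it's worth, a hedged sketch (my illustration, not a confirmed fix) of settings sometimes tried when listing very many small files; whether they help depends on how much time is really spent in the Hadoop listing path:

{code}
// Sketch only: multi-threaded input listing plus an explicit partition hint.
// The input path is a placeholder; `spark` is assumed to be an existing SparkSession.
val sc = spark.sparkContext
// Hadoop 2.x setting that lets FileInputFormat list input paths with several threads.
sc.hadoopConfiguration.setInt("mapreduce.input.fileinputformat.list-status.num-threads", 32)
// Read the small files as (path, content) pairs with a minimum partition count.
val files = sc.wholeTextFiles("/data/enron/maildir/*", minPartitions = 512)
println(files.count())
{code}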






[jira] [Assigned] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21161:


Assignee: Apache Spark

> SparkContext stopped when execute a query on Solr
> -
>
> Key: SPARK-21161
> URL: https://issues.apache.org/jira/browse/SPARK-21161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
> solr-solrj-6.5.1.jar
>Reporter: Jian Wu
>Assignee: Apache Spark
>
> The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I 
> query Solr data in Spark.
> {code:none}
> 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
> scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
> shutting down SparkContext
> 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
> input string: “8983_solr”
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:580)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:615)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.List.foreach(List.scala:381)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.sca

[jira] [Commented] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057530#comment-16057530
 ] 

Apache Spark commented on SPARK-21161:
--

User 'janplus' has created a pull request for this issue:
https://github.com/apache/spark/pull/18376

> SparkContext stopped when execute a query on Solr
> -
>
> Key: SPARK-21161
> URL: https://issues.apache.org/jira/browse/SPARK-21161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
> solr-solrj-6.5.1.jar
>Reporter: Jian Wu
>
> The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I 
> query Solr data in Spark.
> {code:none}
> 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
> scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
> shutting down SparkContext
> 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
> input string: “8983_solr”
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:580)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:615)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.List.foreach(List.scala:381)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
> 17/06/21 12:40:53 INFO ContextLauncher:  

[jira] [Assigned] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21161:


Assignee: (was: Apache Spark)

> SparkContext stopped when execute a query on Solr
> -
>
> Key: SPARK-21161
> URL: https://issues.apache.org/jira/browse/SPARK-21161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
> solr-solrj-6.5.1.jar
>Reporter: Jian Wu
>
> The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I 
> query Solr data in Spark.
> {code:none}
> 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
> scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
> shutting down SparkContext
> 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
> input string: “8983_solr”
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:580)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:615)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.List.foreach(List.scala:381)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> 17/06/21 12:40

[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-21 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057550#comment-16057550
 ] 

Nick Pentreath commented on SPARK-21093:


Just adding the info from the test failure report from the 2.2.0-RC4 vote thread:

R - 3.3.0
OpenJDK Runtime Environment (build 1.8.0_111-b15)
CentOS 7.2.1511



> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> On Centos 7.2.1511 with R 3.4.0/3.3.0, multiple execution of {{gapply}} looks 
> failed as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/

[jira] [Resolved] (SPARK-20640) Make rpc timeout and retry for shuffle registration configurable

2017-06-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20640.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18092
[https://github.com/apache/spark/pull/18092]
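For illustration, a hedged sketch of how such settings would be applied once they are configurable. The key names below are my assumption about what the linked pull request introduces, not something confirmed in this thread; the values are illustrative:

{code}
// Sketch only: raising the shuffle-service registration timeout and retry count
// for an application that uses the external shuffle service.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.shuffle.registration.timeout", "30000")  // assumed key, in milliseconds
  .config("spark.shuffle.registration.maxAttempts", "5")  // assumed key, retry count
  .getOrCreate()
{code}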

> Make rpc timeout and retry for shuffle registration configurable
> 
>
> Key: SPARK-20640
> URL: https://issues.apache.org/jira/browse/SPARK-20640
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
> Fix For: 2.3.0
>
>
> Currently the shuffle service registration timeout and retry has been 
> hardcoded (see 
> https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144
>  and 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197).
>  This works well for small workloads but under heavy workload when the 
> shuffle service is busy transferring large amount of data we see significant 
> delay in responding to the registration request, as a result we often see the 
> executors fail to register with the shuffle service, eventually failing the 
> job. We need to make these two parameters configurable.






[jira] [Assigned] (SPARK-20640) Make rpc timeout and retry for shuffle registration configurable

2017-06-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20640:
---

Assignee: Li Yichao

> Make rpc timeout and retry for shuffle registration configurable
> 
>
> Key: SPARK-20640
> URL: https://issues.apache.org/jira/browse/SPARK-20640
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.2
>Reporter: Sital Kedia
>Assignee: Li Yichao
> Fix For: 2.3.0
>
>
> Currently the shuffle service registration timeout and retry has been 
> hardcoded (see 
> https://github.com/sitalkedia/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleClient.java#L144
>  and 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L197).
>  This works well for small workloads but under heavy workload when the 
> shuffle service is busy transferring large amount of data we see significant 
> delay in responding to the registration request, as a result we often see the 
> executors fail to register with the shuffle service, eventually failing the 
> job. We need to make these two parameters configurable.






[jira] [Commented] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057564#comment-16057564
 ] 

Sean Owen commented on SPARK-21161:
---

Yes, that's not a valid host/port. No, you can't just ignore that by stripping 
it. I don't think your app / Solr can employ whatever this convention is when 
interacting with Spark.
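To illustrate the failure mode with a simplified stand-in (this is not Spark's actual parsing code, just a sketch of the same parse):

{code}
// Everything after the ':' is treated as a port number, so the "_solr" suffix
// that Solr appends to its node name makes the parse throw.
val nodeName = "idx5.oi.dev:8983_solr"                        // value from the ticket
val portPart = nodeName.substring(nodeName.indexOf(':') + 1)  // "8983_solr"
portPart.toInt  // java.lang.NumberFormatException: For input string: "8983_solr"
{code}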

> SparkContext stopped when execute a query on Solr
> -
>
> Key: SPARK-21161
> URL: https://issues.apache.org/jira/browse/SPARK-21161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
> solr-solrj-6.5.1.jar
>Reporter: Jian Wu
>
> The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I 
> query Solr data in Spark.
> {code:none}
> 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
> scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
> shutting down SparkContext
> 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
> input string: “8983_solr”
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:580)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:615)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.List.foreach(List.scala:381)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onRece

[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057565#comment-16057565
 ] 

Sean Owen commented on SPARK-21137:
---

[~leakimav] -- there still isn't detail here about why this is a Spark issue vs. a 
Hadoop API issue, for example. Please read the JIRA for how you can help.

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!






[jira] [Commented] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057587#comment-16057587
 ] 

Takeshi Yamamuro commented on SPARK-21160:
--

This is expected behaviour. Probably you want something like 
df.where("Test != 1 OR Test is null").show(), right?
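A short Scala sketch of the three-valued-logic point (assuming a test2 DataFrame built as in the report; both filters below keep the null row):

{code}
// "Test != 1" evaluates to NULL (not true) for the null row, so that row is filtered out.
// Keep nulls either with an explicit IS NULL branch or with the null-safe operator <=>.
import org.apache.spark.sql.functions.col

test2.where("Test != 1 OR Test IS NULL").show()  // the 2.0 rows plus the null row
test2.where(!(col("Test") <=> 1)).show()         // null-safe "not equal", same result
{code}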

> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-21 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057592#comment-16057592
 ] 

sam commented on SPARK-21137:
-

[~srowen]

I thought I had already made that point? Please can you tell me what is wrong 
with the following reasoning:

> Yes, it's likely that the underlying Hadoop APIs have some yucky code that 
> does something silly; I have delved down there before and my stomach cannot 
> handle it. Nevertheless, Spark made the choice to inherit the complexities of 
> the Hadoop APIs, and reading multiple small files seems like a pretty basic 
> use case for Spark (come on Sean, this is Enron data!). It would feel a bit 
> perverse to just close this and blame the layer cake underneath. Spark should 
> use its own extensions of the Hadoop APIs where the Hadoop APIs don't work 
> (and the Hadoop code is easily extensible).
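
To make the "hand crank" point concrete, here is a minimal sketch, assuming a 
flat input directory and that the goal is (path, content) pairs like 
{{wholeTextFiles}}, with the listing done through the Hadoop FileSystem API and 
the file reads done inside tasks rather than on the driver (the input path and 
partition count are illustrative, not from the report):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("many-small-files").getOrCreate()
val sc = spark.sparkContext

// List the files once with the Hadoop FileSystem API.
val root = new Path("/data/enron")
val fs = root.getFileSystem(sc.hadoopConfiguration)
val paths = fs.listStatus(root).filter(_.isFile).map(_.getPath.toString).toSeq

// Read each file inside a task, yielding (path, content) pairs.
val texts = sc.parallelize(paths, 512).map { p =>
  val path = new Path(p)
  // A fresh Configuration picks up core-site.xml/hdfs-site.xml from the classpath.
  val in = path.getFileSystem(new Configuration()).open(path)
  try (p, scala.io.Source.fromInputStream(in, "UTF-8").mkString)
  finally in.close()
}
texts.count()
{code}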



> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057601#comment-16057601
 ] 

Takeshi Yamamuro commented on SPARK-21160:
--

BTW, does anybody know why `a` is nullable in this case?
{code}
scala> Seq(Some(1), Some(2), None).toDF("a").where('a === 1).explain
== Physical Plan ==
*Project [value#87 AS a#89]
+- *Filter (isnotnull(value#87) && (value#87 = 1))
   +- LocalTableScan [value#87]

scala> Seq(Some(1), Some(2), None).toDF("a").where('a === 1).printSchema
root
 |-- a: integer (nullable = true)
{code}

> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057605#comment-16057605
 ] 

Sean Owen commented on SPARK-21137:
---

(This is not a common use case.)

What change are you proposing in Spark? That really helps focus the 
conversation. If you mean "don't use the Hadoop APIs", that's a non-starter 
given the tradeoffs involved. But first, what is even the source of the 
slowness? It's pretty easy to take the debug step I suggested above, which is a 
thread dump. Why not just do that?
Then: what do you think the workaround is? Have you looked at the source?

JIRA is primarily for developers and isn't used as tech support here. Ideally 
you're expected to bring concrete changes. A reproducible bug report with clear 
expected vs actual behavior is OK too, but there as well, the more important it 
is to you to get something resolved, the more it's on you to invest the effort. 
This kind of back and forth isn't getting you closer.
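
For reference, a quick way to grab a driver-side thread dump from inside the 
job itself (a sketch only; running jstack against the driver pid gives the same 
information):

{code}
import scala.collection.JavaConverters._

// Print every live thread's stack so you can see where the driver is spending
// its time (e.g. in file listing) while the job appears to hang.
Thread.getAllStackTraces.asScala.foreach { case (t, frames) =>
  println(s"--- ${t.getName} (${t.getState})")
  frames.foreach(f => println(s"    at $f"))
}
{code}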

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Edoardo Vivo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057614#comment-16057614
 ] 

Edoardo Vivo commented on SPARK-21160:
--

Sorry for the stupid question, but may I ask WHY this is the expected behavior?
1 is different from null...

BTW, this is not pandas behavior, for instance. I really don't understand.
Thank you.



> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21082) Consider Executor's memory usage when scheduling task

2017-06-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21082.
---
Resolution: Won't Fix

> Consider Executor's memory usage when scheduling task 
> --
>
> Key: SPARK-21082
> URL: https://issues.apache.org/jira/browse/SPARK-21082
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: DjvuLee
>
>  The Spark scheduler does not consider memory usage when dispatching tasks. 
> This can lead to executor OOM when an RDD is cached, because Spark cannot 
> estimate memory usage well enough (especially when the RDD type is not flat), 
> so the scheduler may dispatch too many tasks to one executor.
> We can offer a configuration for users to decide whether the scheduler should 
> consider memory usage.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057629#comment-16057629
 ] 

Takeshi Yamamuro commented on SPARK-21160:
--

You had better google it, though: this is because NULL is not a value, so a 
comparison like {{NULL != 1}} evaluates to NULL (unknown) rather than true, and 
the WHERE clause drops the row.
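
A tiny illustration (assuming a SparkSession named {{spark}}):

{code}
// WHERE/filter only keeps rows whose predicate is true; a comparison with NULL
// evaluates to NULL ("unknown"), which is not true, so those rows are dropped.
spark.sql("SELECT 2 != 1 AS a, NULL != 1 AS b").show()
// column a comes back true, column b comes back null.
{code}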

> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21162) Cannot count rows in an empty Hive table stored as parquet when spark.sql.parquet.cacheMetadata is set to false

2017-06-21 Thread Tom Ogle (JIRA)
Tom Ogle created SPARK-21162:


 Summary: Cannot count rows in an empty Hive table stored as 
parquet when spark.sql.parquet.cacheMetadata is set to false
 Key: SPARK-21162
 URL: https://issues.apache.org/jira/browse/SPARK-21162
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.3, 1.6.2
Reporter: Tom Ogle


With spark.sql.parquet.cacheMetadata set to false, creating an empty Hive table 
stored as Parquet and then trying to count the rows using SparkSQL throws an 
IOException. The issue does not affect Spark 2. This issue is inconvenient in 
environments using Spark 1.6.x where spark.sql.parquet.cacheMetadata is 
explicitly set to false for some reason, such as in Google DataProc 1.0.
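
A minimal repro sketch along those lines (Spark 1.6.x API, assuming an existing 
SparkContext {{sc}}; the database and table names are taken from the stack 
trace below, everything else is assumed):

{code}
import org.apache.spark.sql.hive.HiveContext

// Disable the Parquet metadata cache, create an empty Parquet-backed Hive
// table, then count it. On 1.6.x this ends in the IOException shown below.
val hc = new HiveContext(sc)
hc.setConf("spark.sql.parquet.cacheMetadata", "false")
hc.sql("CREATE DATABASE IF NOT EXISTS my_test_db")
hc.sql("CREATE TABLE IF NOT EXISTS my_test_db.test_table (id INT) STORED AS PARQUET")
hc.sql("SELECT * FROM my_test_db.test_table").count()
{code}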

Here is the stacktrace:

{code}
17/06/21 15:30:10 INFO ParquetRelation: Reading Parquet file(s) from 
Exception in thread "main" 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], 
output=[count#30L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#33L])
  +- Scan ParquetRelation: my_test_db.test_table[] InputPaths: 
/my_test_db.db/test_table

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
at 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1499)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1506)
at 
org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1516)
at 
org.apache.spark.sql.DataFrame$$anonfun$count$1.apply(DataFrame.scala:1515)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2100)
at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1515)
at App$.main(App.scala:23)
at App.main(App.scala)
Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
execute, tree:
TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], 
functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#33L])
   +- Scan ParquetRelation: my_test_db.test_table[] InputPaths: 
/my_test_db.db/test_table

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.Exchange.doExecute(Exchange.scala:247)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.apply(TungstenAggregate.scala:80)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 19 more
Caused by: java.io.IOException: No input paths specified in job
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at 
org.apache.parquet.hadoop.ParquetInputFormat.listStatus(ParquetInputFormat.java:339)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$buildInternalScan$1$$anon$1$$anon$4.listStatus(ParquetRelation.scala:358)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:340)
at 
org.apache.parquet.hadoop.ParquetInputFormat.getSplits(ParquetIn

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057657#comment-16057657
 ] 

Apache Spark commented on SPARK-18016:
--

User 'bdrillard' has created a pull request for this issue:
https://github.com/apache/spark/pull/18377

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
> 

[jira] [Commented] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Edoardo Vivo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057722#comment-16057722
 ] 

Edoardo Vivo commented on SPARK-21160:
--

Thank you for your answer. I noticed the same happens in relational databases 
and in R too. Strangely enough, this is the first time I have come across the 
issue.

However, I still maintain that this should not be the default behavior. I 
understand you will probably close this issue, but I would like to suggest that 
issuing a warning in this case might be helpful (for naive users like me).
Thank you again.

> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21163) DataFrame.toPandas should respect the data type

2017-06-21 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21163:
---

 Summary: DataFrame.toPandas should respect the data type
 Key: SPARK-21163
 URL: https://issues.apache.org/jira/browse/SPARK-21163
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21163) DataFrame.toPandas should respect the data type

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21163:


Assignee: Apache Spark  (was: Wenchen Fan)

> DataFrame.toPandas should respect the data type
> ---
>
> Key: SPARK-21163
> URL: https://issues.apache.org/jira/browse/SPARK-21163
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21163) DataFrame.toPandas should respect the data type

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21163:


Assignee: Wenchen Fan  (was: Apache Spark)

> DataFrame.toPandas should respect the data type
> ---
>
> Key: SPARK-21163
> URL: https://issues.apache.org/jira/browse/SPARK-21163
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21163) DataFrame.toPandas should respect the data type

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057729#comment-16057729
 ] 

Apache Spark commented on SPARK-21163:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18378

> DataFrame.toPandas should respect the data type
> ---
>
> Key: SPARK-21163
> URL: https://issues.apache.org/jira/browse/SPARK-21163
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-21 Thread Jian Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057753#comment-16057753
 ] 

Jian Wu commented on SPARK-21161:
-

I'll fix this bug in the `spark-solr` project. Thanks for the comment.

> SparkContext stopped when execute a query on Solr
> -
>
> Key: SPARK-21161
> URL: https://issues.apache.org/jira/browse/SPARK-21161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
> solr-solrj-6.5.1.jar
>Reporter: Jian Wu
>
> The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I 
> query Solr data in Spark.
> {code:none}
> 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
> scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
> shutting down SparkContext
> 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
> input string: “8983_solr”
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:580)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:615)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.List.foreach(List.scala:381)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerE

[jira] [Commented] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-21 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057767#comment-16057767
 ] 

Marcelo Vanzin commented on SPARK-21159:


No, that should not be it. That's not how the launcher works internally.

I'm only keeping this open because apps shouldn't fail just because this 
feature is not implemented.

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>
> When an spark application submitted by SparkLauncher#startApplication method, 
> this will get a SparkAppHandle. In the test environment, the launcher runs on 
> server A, if it runs in Client mode, everything is ok. In cluster mode, the 
> launcher will run on Server A, and the driver will be run on Server B, in 
> this scenario, when initialize SparkContext, a LauncherBackend will try to 
> connect to the launcher application via specified port and ip address. the 
> problem is the implementation of LauncherBackend uses loopback ip to connect 
> which is 127.0.0.1. this will cause the connection refused as server B never 
> ran the launcher. 
> The expected behavior is the LauncherBackend should use Server A's Ip address 
> to connect for reporting the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.(Socket.java:434)
>   at java.net.Socket.(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
>  

[jira] [Commented] (SPARK-10878) Race condition when resolving Maven coordinates via Ivy

2017-06-21 Thread Todd Morrison (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057797#comment-16057797
 ] 

Todd Morrison commented on SPARK-10878:
---

Any chance we can move the priority of this issue up? 

This is causing some issues on large Spark clusters with Yarn and PySpark. 

Currently, a workaround is to run a single job first so the dependency cache is 
populated, and only then launch the concurrent jobs. This isn't ideal: with 
parallel jobs there is a long wait for that initial job to complete.

Thanks!

> Race condition when resolving Maven coordinates via Ivy
> ---
>
> Key: SPARK-10878
> URL: https://issues.apache.org/jira/browse/SPARK-10878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Ryan Williams
>Priority: Minor
>
> I've recently been shell-scripting the creation of many concurrent 
> Spark-on-YARN apps and observing a fraction of them to fail with what I'm 
> guessing is a race condition in their Maven-coordinate resolution.
> For example, I might spawn an app for each path in file {{paths}} with the 
> following shell script:
> {code}
> cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}"
> {code}
> When doing this, I observe some fraction of the spawned jobs to fail with 
> errors like:
> {code}
> :: retrieving :: org.apache.spark#spark-submit-parent
> confs: [default]
> Exception in thread "main" java.lang.RuntimeException: problem during 
> retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: 
> failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
> at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
> at 
> org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006)
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.text.ParseException: failed to parse report: 
> /hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
>  Premature end of file.
> at 
> org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
> at 
> org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
> ... 7 more
> Caused by: org.xml.sax.SAXParseException; Premature end of file.
> at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown 
> Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> {code}
> The more apps I try to launch simultaneously, the greater fraction of them 
> seem to fail with this or similar errors; a batch of ~10 will usually work 
> fine, a batch of 15 will see a few failures, and a batch of ~60 will have 
> dozens of failures.
> [This gist shows 11 recent failures I 
> observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17851) Make sure all test sqls in catalyst pass checkAnalysis

2017-06-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17851.
-
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.3.0

> Make sure all test sqls in catalyst pass checkAnalysis
> --
>
> Key: SPARK-17851
> URL: https://issues.apache.org/jira/browse/SPARK-17851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently we have several tens of test SQLs in catalyst that fail at 
> `SimpleAnalyzer.checkAnalysis`; we should make sure they are valid.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-21 Thread Jian Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057855#comment-16057855
 ] 

Jian Wu commented on SPARK-21161:
-

For others who come up with the same issue, please check 
https://github.com/lucidworks/spark-solr/pull/158.

> SparkContext stopped when execute a query on Solr
> -
>
> Key: SPARK-21161
> URL: https://issues.apache.org/jira/browse/SPARK-21161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
> solr-solrj-6.5.1.jar
>Reporter: Jian Wu
>
> The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I 
> query Solr data in Spark.
> {code:none}
> 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
> scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
> shutting down SparkContext
> 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
> input string: “8983_solr”
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:580)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:615)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.List.foreach(List.scala:381)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
> 17/06/21 12:40:53 INFO ContextLauncher:   a

[jira] [Resolved] (SPARK-20917) SparkR supports string encoding consistent with R

2017-06-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20917.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> SparkR supports string encoding consistent with R
> -
>
> Key: SPARK-20917
> URL: https://issues.apache.org/jira/browse/SPARK-20917
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
> Fix For: 2.3.0
>
>
> Add stringIndexerOrderType to spark.glm and spark.survreg to support string 
> encoding that is consistent with default R.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21164) Remove isTableSample from Sample

2017-06-21 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21164:
---

 Summary: Remove isTableSample from Sample
 Key: SPARK-21164
 URL: https://issues.apache.org/jira/browse/SPARK-21164
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


{{isTableSample}} was introduced for SQL Generation. Since SQL Generation is 
removed, we do not need to keep {{isTableSample}}. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21147) the schema of socket/rate source can not be set.

2017-06-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21147:
-
Summary: the schema of socket/rate source can not be set.  (was: the schema 
of socket source can not be set.)

> the schema of socket/rate source can not be set.
> 
>
> Key: SPARK-21147
> URL: https://issues.apache.org/jira/browse/SPARK-21147
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: Win7,spark 2.1.0
>Reporter: Fei Shao
>
> The schema set for DataStreamReader can not work. The code is shown as below:
> val line = ss.readStream.format("socket")
> .option("ip",xxx)
> .option("port",xxx)
> .schema( StructField("name",StringType)::StructField("area",StringType)::Nil)
> .load
> line.printSchema
> The printSchema prints:
> root
> |--value:String(nullable=true)
> According to the code, it should print the schema set by schema().
> Suggestion from Michael Armbrust:
> throw an exception saying that you can't set schema here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21147) the schema of socket/rate source can not be set.

2017-06-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21147.
--
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.3.0

> the schema of socket/rate source can not be set.
> 
>
> Key: SPARK-21147
> URL: https://issues.apache.org/jira/browse/SPARK-21147
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
> Environment: Win7,spark 2.1.0
>Reporter: Fei Shao
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> The schema set for DataStreamReader can not work. The code is shown as below:
> val line = ss.readStream.format("socket")
> .option("ip",xxx)
> .option("port",xxx)
> .schema( StructField("name",StringType)::StructField("area",StringType)::Nil)
> .load
> line.printSchema
> The printSchema prints:
> root
> |--value:String(nullable=true)
> According to the code, it should print the schema set by schema().
> Suggestion from Michael Armbrust:
> throw an exception saying that you can't set schema here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21125) PySpark context missing function to set Job Description.

2017-06-21 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-21125:
-

Assignee: Shane Jarvie

> PySpark context missing function to set Job Description.
> 
>
> Key: SPARK-21125
> URL: https://issues.apache.org/jira/browse/SPARK-21125
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Shane Jarvie
>Assignee: Shane Jarvie
>Priority: Trivial
>  Labels: beginner
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The PySpark API is missing a convenient function, currently found in the 
> Scala API, which sets the Job Description for display in the Spark UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21125) PySpark context missing function to set Job Description.

2017-06-21 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-21125.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18332
[https://github.com/apache/spark/pull/18332]

> PySpark context missing function to set Job Description.
> 
>
> Key: SPARK-21125
> URL: https://issues.apache.org/jira/browse/SPARK-21125
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.1
>Reporter: Shane Jarvie
>Priority: Trivial
>  Labels: beginner
> Fix For: 2.3.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The PySpark API is missing a convenient function, currently found in the 
> Scala API, which sets the Job Description for display in the Spark UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21147) the schema of socket/rate source can not be set.

2017-06-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21147:
-
Affects Version/s: 2.2.0

> the schema of socket/rate source can not be set.
> 
>
> Key: SPARK-21147
> URL: https://issues.apache.org/jira/browse/SPARK-21147
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
> Environment: Win7,spark 2.1.0
>Reporter: Fei Shao
>Assignee: Hyukjin Kwon
> Fix For: 2.3.0
>
>
> The schema set for DataStreamReader can not work. The code is shown as below:
> val line = ss.readStream.format("socket")
> .option("ip",xxx)
> .option("port",xxx)
> .schema( StructField("name",StringType)::StructField("area",StringType)::Nil)
> .load
> line.printSchema
> The printSchema prints:
> root
> |--value:String(nullable=true)
> According to the code, it should print the schema set by schema().
> Suggestion from Michael Armbrust:
> throw an exception saying that you can't set schema here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21164) Remove isTableSample from Sample

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16057923#comment-16057923
 ] 

Apache Spark commented on SPARK-21164:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18379

> Remove isTableSample from Sample
> 
>
> Key: SPARK-21164
> URL: https://issues.apache.org/jira/browse/SPARK-21164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{isTableSample}} was introduced for SQL Generation. Since SQL Generation is 
> removed, we do not need to keep {{isTableSample}}. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21164) Remove isTableSample from Sample

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21164:


Assignee: Xiao Li  (was: Apache Spark)

> Remove isTableSample from Sample
> 
>
> Key: SPARK-21164
> URL: https://issues.apache.org/jira/browse/SPARK-21164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {{isTableSample}} was introduced for SQL Generation. Since SQL Generation is 
> removed, we do not need to keep {{isTableSample}}. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21164) Remove isTableSample from Sample

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21164:


Assignee: Apache Spark  (was: Xiao Li)

> Remove isTableSample from Sample
> 
>
> Key: SPARK-21164
> URL: https://issues.apache.org/jira/browse/SPARK-21164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {{isTableSample}} was introduced for SQL Generation. Since SQL Generation is 
> removed, we do not need to keep {{isTableSample}}. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-21165:


 Summary: Fail to write into partitioned hive table due to 
attribute reference not working with cast on partition column
 Key: SPARK-21165
 URL: https://issues.apache.org/jira/browse/SPARK-21165
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Imran Rashid
Priority: Blocker


A simple "insert into ... select" involving partitioned hive tables fails.  
Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
2.1.1, but fails on 2.2.0-rc5:

{noformat}
spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
spark.sql("""DROP TABLE IF EXISTS src""")
spark.sql("""DROP TABLE IF EXISTS dest""")
spark.sql("""
CREATE TABLE src (first string, word string)
  PARTITIONED BY (length int)
""")

spark.sql("""
INSERT INTO src PARTITION(length) VALUES
  ('a', 'abc', 3),
  ('b', 'bcde', 4),
  ('c', 'cdefg', 5)
""")

spark.sql("""
  CREATE TABLE dest (word string, length int)
PARTITIONED BY (first string)
""")

spark.sql("""
  INSERT INTO TABLE dest PARTITION(first) SELECT word, length, first FROM src
""")

spark.sql("""
  INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
string) as first FROM src
""")
{noformat}

The exception is

{noformat}
17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
localhost, executor driver): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute
, tree: first#74
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
at 
org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
at 
org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
at 
org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
at 
org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.sca

[jira] [Updated] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-21165:
-
Description: 
A simple "insert into ... select" involving partitioned hive tables fails.  
Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
2.1.1, but fails on 2.2.0-rc5:

{noformat}
spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
spark.sql("""DROP TABLE IF EXISTS src""")
spark.sql("""DROP TABLE IF EXISTS dest""")
spark.sql("""
CREATE TABLE src (first string, word string)
  PARTITIONED BY (length int)
""")

spark.sql("""
INSERT INTO src PARTITION(length) VALUES
  ('a', 'abc', 3),
  ('b', 'bcde', 4),
  ('c', 'cdefg', 5)
""")

spark.sql("""
  CREATE TABLE dest (word string, length int)
PARTITIONED BY (first string)
""")

spark.sql("""
  INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
string) as first FROM src
""")
{noformat}

The exception is

{noformat}
17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
localhost, executor driver): 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute
, tree: first#74
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
at 
org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
at 
org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
at 
org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
at 
org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:320)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecu

[jira] [Assigned] (SPARK-16019) Eliminate unexpected delay during spark on yarn job launch

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16019:


Assignee: Apache Spark

> Eliminate unexpected delay during spark on yarn job launch
> --
>
> Key: SPARK-16019
> URL: https://issues.apache.org/jira/browse/SPARK-16019
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Olasoji
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, when launching a job in YARN mode, there is an added delay of about 
> "spark.yarn.report.interval" seconds before launch. By default this parameter 
> is set to 1 second; however, if a user increases it for whatever reason, an 
> unexpected startup delay is introduced.
> The "waitForApplication" function called during job submission and launch 
> eventually calls monitorApplication and so sleeps for the configured interval 
> before checking the state of the job and proceeding with job startup.
> One solution would be to add an "interval" argument to the monitorApplication 
> function. This added flexibility would let callers that don't need to wait 
> choose a lower wait interval, or skip the wait entirely.
> Patch to follow soon
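
To make the idea concrete, here is a minimal, self-contained sketch of the 
proposal; the names are illustrative and this is not the actual patch:

{code}
// Hypothetical sketch: give the monitoring loop an interval parameter whose
// default mirrors the configured report interval, so the submission path can
// pass a much smaller value and avoid the startup delay.
def monitorApplication(isFinished: () => Boolean,
                       intervalMs: Long = 1000L): Unit = {
  while (!isFinished()) {
    Thread.sleep(intervalMs)  // sleep the caller-chosen interval, not a fixed config read
  }
}

// A waitForApplication-style caller that does not want the full report interval:
monitorApplication(() => true, intervalMs = 100L)
{code}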



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16019) Eliminate unexpected delay during spark on yarn job launch

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16019:


Assignee: (was: Apache Spark)

> Eliminate unexpected delay during spark on yarn job launch
> --
>
> Key: SPARK-16019
> URL: https://issues.apache.org/jira/browse/SPARK-16019
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Olasoji
>Priority: Minor
>
> Currently, when launching a job in YARN mode, there is an added delay of about 
> "spark.yarn.report.interval" seconds before launch. By default this parameter 
> is set to 1 second; however, if a user increases it for whatever reason, an 
> unexpected startup delay is introduced.
> The "waitForApplication" function called during job submission and launch 
> eventually calls monitorApplication and so sleeps for the configured interval 
> before checking the state of the job and proceeding with job startup.
> One solution would be to add an "interval" argument to the monitorApplication 
> function. This added flexibility would let callers that don't need to wait 
> choose a lower wait interval, or skip the wait entirely.
> Patch to follow soon



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16019) Eliminate unexpected delay during spark on yarn job launch

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058116#comment-16058116
 ] 

Apache Spark commented on SPARK-16019:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18380

> Eliminate unexpected delay during spark on yarn job launch
> --
>
> Key: SPARK-16019
> URL: https://issues.apache.org/jira/browse/SPARK-16019
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Olasoji
>Priority: Minor
>
> Currently, when launching a job in YARN mode, there is an added delay of about 
> "spark.yarn.report.interval" seconds before launch. By default this parameter 
> is set to 1 second; however, if a user increases it for whatever reason, an 
> unexpected startup delay is introduced.
> The "waitForApplication" function called during job submission and launch 
> eventually calls monitorApplication and so sleeps for the configured interval 
> before checking the state of the job and proceeding with job startup.
> One solution would be to add an "interval" argument to the monitorApplication 
> function. This added flexibility would let callers that don't need to wait 
> choose a lower wait interval, or skip the wait entirely.
> Patch to follow soon



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2017-06-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20114:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-14501

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First list a few design issues to be discussed, then subtasks like Scala, 
> Python and R API will be created.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting directly on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input Dataset. The PrefixSpanModel is only used to provide 
> access to frequent sequential patterns.
>  #*  Add the feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform() as FPGrowthModel does. 
> The rules extracted are of the form X -> Y where X and Y are sequential 
> patterns. But in practice these rules are not very good, as they are too 
> precise and thus not noise tolerant.
> # Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are X -> Y where both X and Y are 
> unordered, but X must appear before Y; this is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from users about which kind of sequential rules is 
> more practical. 
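
For reference, a minimal sketch of the existing MLlib API that a DataFrame-based 
fit() would wrap (runnable in spark-shell where `spark` is in scope; the 
sequences are toy data):

{code}
import org.apache.spark.mllib.fpm.PrefixSpan

// Toy input: each record is a sequence of itemsets.
val sequences = spark.sparkContext.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2).cache()

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .run(sequences)

// The model only exposes frequent sequential patterns, which is the limitation
// discussed above for a direct transform()/predict use.
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ",", "]")).mkString(",") + " => " + fs.freq)
}
{code}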



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21166) Automated ML persistence

2017-06-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21166:
-

 Summary: Automated ML persistence
 Key: SPARK-21166
 URL: https://issues.apache.org/jira/browse/SPARK-21166
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley


This JIRA is for discussing the possibility of automating ML persistence.  
Currently, custom save/load methods are written for every Model.  However, we 
could design a mixin which provides automated persistence, inspecting model 
data and Params and reading/writing (known types) automatically.  This was 
brought up in discussions with developers behind 
https://github.com/azure/mmlspark

Some issues we will need to consider:
* Providing generic mixin usable in most or all cases
* Handling corner cases (strange Param types, etc.)
* Backwards compatibility (loading models saved by old Spark versions)

Because of backwards compatibility in particular, it may make sense to 
implement testing for that first, before we try to address automated 
persistence: [SPARK-15573]
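
As a starting point for the discussion, a very rough sketch of what a 
params-only mixin could look like (the trait name and the JSON handling are 
assumptions, not a design):

{code}
import org.apache.spark.ml.param.Params

// Hypothetical mixin, for discussion only: serialize all explicitly set Params
// of simple types. A real design would also need model data, complex Param
// types, and versioning for backwards compatibility.
trait AutoParamsPersistence { self: Params =>
  def paramsAsJson: String =
    self.extractParamMap().toSeq
      .map(pair => "\"" + pair.param.name + "\": \"" + pair.value + "\"")
      .mkString("{", ", ", "}")
}
{code}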



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20830) PySpark wrappers for explode_outer and posexplode_outer

2017-06-21 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-20830:
-

Assignee: Maciej Szymkiewicz

> PySpark wrappers for explode_outer and posexplode_outer
> ---
>
> Key: SPARK-20830
> URL: https://issues.apache.org/jira/browse/SPARK-20830
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
> Fix For: 2.3.0
>
>
> Implement Python wrappers for {{o.a.s.sql.functions.explode_outer}} and 
> {{o.a.s.sql.functions.posexplode_outer}}
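
For context, the Scala functions being wrapped behave as below (spark-shell 
sketch with made-up sample data):

{code}
import org.apache.spark.sql.functions.{explode, explode_outer}
import spark.implicits._

val df = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "xs")

df.select($"id", explode($"xs")).show()        // drops id = 2 (empty array)
df.select($"id", explode_outer($"xs")).show()  // keeps id = 2 with a null value
{code}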



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20830) PySpark wrappers for explode_outer and posexplode_outer

2017-06-21 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-20830.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18049
[https://github.com/apache/spark/pull/18049]

> PySpark wrappers for explode_outer and posexplode_outer
> ---
>
> Key: SPARK-20830
> URL: https://issues.apache.org/jira/browse/SPARK-20830
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
> Fix For: 2.3.0
>
>
> Implement Python wrappers for {{o.a.s.sql.functions.explode_outer}} and 
> {{o.a.s.sql.functions.posexplode_outer}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058440#comment-16058440
 ] 

Xiao Li commented on SPARK-21165:
-

Unable to reproduce it in the current master branch. Will try to use 2.2 RC5 
later

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec

[jira] [Created] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-21 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-21167:


 Summary: Path is not decoded correctly when reading output of 
FileSink
 Key: SPARK-21167
 URL: https://issues.apache.org/jira/browse/SPARK-21167
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.1
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


When reading the output of FileSink, the path is not decoded correctly, so if 
the path contains special characters such as spaces, Spark cannot read it.
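
A small spark-shell illustration of the underlying encoding behaviour (the 
concrete paths are made up; this is not the FileSink code itself):

{code}
import java.net.URI
import org.apache.hadoop.fs.Path

// Path#toUri percent-encodes the space; rebuilding a Path from that *string*
// keeps the literal "%20", while rebuilding it from the URI decodes it back.
val p = new Path("/data/event log/part-00000")
val asString   = p.toUri.toString            // "/data/event%20log/part-00000"
val notDecoded = new Path(asString)          // still contains "%20"
val decoded    = new Path(new URI(asString)) // "/data/event log/part-00000"
println(s"$notDecoded vs $decoded")
{code}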



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21167:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When reading output of FileSink, path is not decoded correctly. So if the 
> path has some special characters, such as spaces, Spark cannot read it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21167:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> When reading output of FileSink, path is not decoded correctly. So if the 
> path has some special characters, such as spaces, Spark cannot read it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058486#comment-16058486
 ] 

Apache Spark commented on SPARK-21167:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18381

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When reading output of FileSink, path is not decoded correctly. So if the 
> path has some special characters, such as spaces, Spark cannot read it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058490#comment-16058490
 ] 

Xiao Li commented on SPARK-21165:
-

2.2 branch failed with the same error.

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Priority: Blocker
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
> at 
> org.apache.spark.

[jira] [Resolved] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21160.
--
Resolution: Not A Bug

There is null-safe equality comparison

```
scala> Seq(Some(1), Some(2), None).toDF("a").where("a != 1").show()
+---+
|  a|
+---+
|  2|
+---+


scala> Seq(Some(1), Some(2), None).toDF("a").where("not(a <=> 1)").show()
++
|   a|
++
|   2|
|null|
++
```

I am resolving this. Issuing warnings will mess the logs and I guess what you 
tested in RDB and R does not produce such warnings as well as references.

> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058537#comment-16058537
 ] 

Hyukjin Kwon edited comment on SPARK-21160 at 6/22/17 12:40 AM:


There is null-safe equality comparison

{code}
scala> Seq(Some(1), Some(2), None).toDF("a").where("a != 1").show()
+---+
|  a|
+---+
|  2|
+---+


scala> Seq(Some(1), Some(2), None).toDF("a").where("not a <=> 1").show()
++
|   a|
++
|   2|
|null|
++
{code}

I am resolving this. Issuing warnings will mess the logs and I guess what you 
tested in RDB and R does not produce such warnings as well as references.


was (Author: hyukjin.kwon):
There is null-safe equality comparison

{code}
scala> Seq(Some(1), Some(2), None).toDF("a").where("a != 1").show()
+---+
|  a|
+---+
|  2|
+---+


scala> Seq(Some(1), Some(2), None).toDF("a").where("not(a <=> 1)").show()
++
|   a|
++
|   2|
|null|
++
{code}

I am resolving this. Issuing warnings will mess the logs and I guess what you 
tested in RDB and R does not produce such warnings as well as references.

> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21160) Filtering rows with "not equal" operator yields unexpected result with null rows

2017-06-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058537#comment-16058537
 ] 

Hyukjin Kwon edited comment on SPARK-21160 at 6/22/17 12:40 AM:


There is null-safe equality comparison

{code}
scala> Seq(Some(1), Some(2), None).toDF("a").where("a != 1").show()
+---+
|  a|
+---+
|  2|
+---+


scala> Seq(Some(1), Some(2), None).toDF("a").where("not(a <=> 1)").show()
++
|   a|
++
|   2|
|null|
++
{code}

I am resolving this. Issuing warnings will mess the logs and I guess what you 
tested in RDB and R does not produce such warnings as well as references.


was (Author: hyukjin.kwon):
There is null-safe equality comparison

```
scala> Seq(Some(1), Some(2), None).toDF("a").where("a != 1").show()
+---+
|  a|
+---+
|  2|
+---+


scala> Seq(Some(1), Some(2), None).toDF("a").where("not(a <=> 1)").show()
++
|   a|
++
|   2|
|null|
++
```

I am resolving this. Issuing warnings will mess the logs and I guess what you 
tested in RDB and R does not produce such warnings as well as references.

> Filtering rows with "not equal" operator yields unexpected result with null 
> rows
> 
>
> Key: SPARK-21160
> URL: https://issues.apache.org/jira/browse/SPARK-21160
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.2
>Reporter: Edoardo Vivo
>Priority: Minor
>
> ```
> schema = StructType([StructField("Test", DoubleType())])
> test2 = spark.createDataFrame([[1.0],[1.0],[2.0],[2.0],[None]], schema=schema)
> test2.where("Test != 1").show()
> ```
> This returns only the rows with the value 2, it does not return the null row. 
> This should not be the expected behavior, IMO. 
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity

2017-06-21 Thread Kathryn McClintic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058577#comment-16058577
 ] 

Kathryn McClintic commented on SPARK-21158:
---

I'm fine with that from my perspective.

> SparkSQL function SparkSession.Catalog.ListTables() does not handle spark 
> setting for case-sensitivity
> --
>
> Key: SPARK-21158
> URL: https://issues.apache.org/jira/browse/SPARK-21158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Windows 10
> IntelliJ 
> Scala
>Reporter: Kathryn McClintic
>Priority: Minor
>  Labels: easyfix, features, sparksql, windows
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When working with SQL table names in Spark SQL we have noticed some issues 
> with case-sensitivity.
> If you set the spark.sql.caseSensitive setting to true, SparkSQL stores table 
> names the way they were provided. This is correct.
> If you set the spark.sql.caseSensitive setting to false, SparkSQL stores table 
> names in lower case.
> Then, we use the function sqlContext.tableNames() to get all the tables in 
> our DB. We check whether this list contains("<table name>") to determine 
> whether we have already created a table. If case sensitivity is turned off 
> (false), this function should check whether the table name is contained in the 
> table list regardless of case.
> However, it only looks for names that match the lower-case version of the 
> stored table name. Therefore, if you pass in a camel-case or upper-case table 
> name, this function returns false when in fact the table does exist.
> The root cause of this issue is in the function 
> SparkSession.Catalog.ListTables()
> For example:
> In your SQL context you have the following tables, and you have chosen 
> spark.sql.caseSensitive=false, so your table names are stored in lowercase: 
> carnames
> carmodels
> carnamesandmodels
> users
> dealerlocations
> When running your pipeline, you want to see if you have already created the 
> temp join table 'carnamesandmodels'. However, for readability you have stored 
> it as a constant which reads CarNamesAndModels.
> So you can use the function
> sqlContext.tableNames().contains("CarNamesAndModels").
> This should return true, because we know it's already created, but it will 
> currently return false since CarNamesAndModels is not in lowercase.
> The responsibility for lowercasing the name passed into the .contains method 
> should not be put on the Spark user. Spark SQL should do this when 
> case-sensitivity is set to false.
> Proposed solutions:
> - Setting case sensitivity in the SQL context should make the SQL context 
> case-agnostic without changing how the table name is stored
> - There should be a custom contains method for ListTables() which converts 
> the table name to lowercase before checking
> - SparkSession.Catalog.ListTables() should return the list of tables in the 
> format the names were provided instead of all lowercase.
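
A caller-side workaround in the spirit of the second proposal could look like 
the sketch below (assuming a sqlContext in scope, e.g. in spark-shell; 
illustrative only, not the proposed fix):

{code}
// Compare table names case-insensitively instead of relying on contains(),
// which only matches the lowercased form stored by the catalog.
val exists = sqlContext.tableNames().exists(_.equalsIgnoreCase("CarNamesAndModels"))
{code}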



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19341) Bucketing support for Structured Streaming

2017-06-21 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058590#comment-16058590
 ] 

Fei Shao commented on SPARK-19341:
--

@gagan taneja 
Would you like to add more info about this issue, please?
Or could you write some pseudo-code here, please?

> Bucketing support for Structured Streaming
> --
>
> Key: SPARK-19341
> URL: https://issues.apache.org/jira/browse/SPARK-19341
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>
> One of the major use cases planned for Structured Streaming is to insert data 
> into a partitioned and clustered/bucketed table in append mode.
> However, Structured Streaming currently does not support bucketing, therefore 
> it cannot be used for that insert operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21149) Add job description API for R

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058599#comment-16058599
 ] 

Apache Spark commented on SPARK-21149:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18382

> Add job description API for R
> -
>
> Key: SPARK-21149
> URL: https://issues.apache.org/jira/browse/SPARK-21149
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see SPARK-21125



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21149) Add job description API for R

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21149:


Assignee: Apache Spark

> Add job description API for R
> -
>
> Key: SPARK-21149
> URL: https://issues.apache.org/jira/browse/SPARK-21149
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> see SPARK-21125



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21149) Add job description API for R

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21149:


Assignee: (was: Apache Spark)

> Add job description API for R
> -
>
> Key: SPARK-21149
> URL: https://issues.apache.org/jira/browse/SPARK-21149
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Minor
>
> see SPARK-21125



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-21 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21155:

Comment: was deleted

(was: Before )

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of 
> running tasks.  See the screenshot attachments for what this looks like.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20338) Spaces in spark.eventLog.dir are not correctly handled

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058666#comment-16058666
 ] 

Apache Spark commented on SPARK-20338:
--

User 'zuotingbing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17638

> Spaces in spark.eventLog.dir are not correctly handled
> --
>
> Key: SPARK-20338
> URL: https://issues.apache.org/jira/browse/SPARK-20338
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: zuotingbing
>Assignee: zuotingbing
> Fix For: 2.3.0
>
>
> Set spark.eventLog.dir=/home/mr/event log and submit an app; we got the 
> following error:
> 017-04-14 17:28:40,378 INFO org.apache.spark.SparkContext: Successfully 
> stopped SparkContext
> Exception in thread "main" ExitCodeException exitCode=1: chmod: cannot access 
> `/home/mr/event%20log/app-20170414172839-.inprogress': No such file or 
> directory
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:831)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:814)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:712)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:506)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:125)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:879)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$9.apply(SparkSession.scala:871)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:871)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21168) KafkaRDD should always set kafka clientId.

2017-06-21 Thread Xingxing Di (JIRA)
Xingxing Di created SPARK-21168:
---

 Summary: KafkaRDD should always set kafka clientId.
 Key: SPARK-21168
 URL: https://issues.apache.org/jira/browse/SPARK-21168
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.0.2
Reporter: Xingxing Di
Priority: Trivial


I found that KafkaRDD does not set the kafka client.id in its "fetchBatch" method 
(FetchRequestBuilder sets clientId to empty by default). Normally this affects 
nothing, but in our case we use the clientId on the kafka server side, so we 
have to rebuild spark-streaming-kafka.
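
For reference, a hedged sketch of what setting the id on the old SimpleConsumer 
fetch path looks like (placeholder values; this is not the actual KafkaRDD 
patch):

{code}
import kafka.api.FetchRequestBuilder

// Placeholder values, for illustration only.
val topic = "events"
val partition = 0
val offset = 0L
val fetchSizeBytes = 1024 * 1024

// KafkaRDD reportedly leaves clientId at the builder's default (empty);
// setting it explicitly gives the broker side a meaningful id.
val request = new FetchRequestBuilder()
  .clientId("my-streaming-app")
  .addFetch(topic, partition, offset, fetchSizeBytes)
  .build()
{code}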



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21168) KafkaRDD should always set kafka clientId.

2017-06-21 Thread Xingxing Di (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingxing Di updated SPARK-21168:

External issue URL: https://github.com/apache/spark/pull/18383

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Priority: Trivial
>
> I found KafkaRDD not set kafka client.id in "fetchBatch" method 
> (FetchRequestBuilder will set clientId to empty by default),  normally this 
> will affect nothing, but in our case ,we use clientId at kafka server side, 
> so we have to rebuild spark-streaming-kafka。



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058678#comment-16058678
 ] 

Apache Spark commented on SPARK-21167:
--

User 'dijingran' has created a pull request for this issue:
https://github.com/apache/spark/pull/18383

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> When reading output of FileSink, path is not decoded correctly. So if the 
> path has some special characters, such as spaces, Spark cannot read it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20906) Constrained Logistic Regression for SparkR

2017-06-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20906.
--
  Resolution: Fixed
Assignee: Miao Wang
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Constrained Logistic Regression for SparkR
> --
>
> Key: SPARK-20906
> URL: https://issues.apache.org/jira/browse/SPARK-20906
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.3.0
>
>
> PR https://github.com/apache/spark/pull/17715 added Constrained Logistic 
> Regression to ML. We should add it to SparkR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21168) KafkaRDD should always set kafka clientId.

2017-06-21 Thread Xingxing Di (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingxing Di updated SPARK-21168:

External issue URL:   (was: https://github.com/apache/spark/pull/18383)

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Priority: Trivial
>
> I found KafkaRDD not set kafka client.id in "fetchBatch" method 
> (FetchRequestBuilder will set clientId to empty by default),  normally this 
> will affect nothing, but in our case ,we use clientId at kafka server side, 
> so we have to rebuild spark-streaming-kafka。



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21165:
---

Assignee: Xiao Li

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>Priority: Blocker
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInterna

[jira] [Created] (SPARK-21169) Spark HA: Jobs state is in WAITING status after reconnecting to standby master

2017-06-21 Thread Srinivasarao Daruna (JIRA)
Srinivasarao Daruna created SPARK-21169:
---

 Summary: Spark HA: Jobs state is in WAITING status after 
reconnecting to standby master
 Key: SPARK-21169
 URL: https://issues.apache.org/jira/browse/SPARK-21169
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
Reporter: Srinivasarao Daruna


I have created a spark cluster with 2 spark masters and a separate zookeeper 
cluster.

Configured the following:

SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
-Dspark.deploy.zookeeper.url=zk_machine1:2181,zk_machine2:2181 
-Dspark.deploy.zookeeper.dir=/secondlook/spark-ha"

spark.master configuration in spark-defaults looks as below.

spark://spark_master1:7077,spark_master2:7077

1) Submitted a spark streaming job with the spark.master configuration above. The 
job got its resources and moved to the RUNNING state. The job runs in client mode.
2) Killed spark master 1, which was the active master at startup.
3) Workers shifted to the standby master, and the standby master became ACTIVE.
4) The running job appeared in the new spark master's UI as well.

However, the application state is shown as WAITING instead of RUNNING, while the 
executors of the application are shown as RUNNING.

It looks like the application state is not updated after reconnecting to the 
standby master.






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19141) VectorAssembler metadata causing memory issues

2017-06-21 Thread Mayur Bhole (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058713#comment-16058713
 ] 

Mayur Bhole commented on SPARK-19141:
-

Is there any possible workaround for this issue?
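
One possible workaround, sketched under the assumption that downstream stages do 
not need the categorical attribute information, is to strip the ML attribute 
metadata from the assembled column; `processedData` and the "Features" column 
refer to the reproduction quoted below:

{code}
import org.apache.spark.sql.types.Metadata

// Replace the assembled column with a copy carrying empty metadata, so the huge
// per-slot attribute list is not kept in the schema.
val strippedData = processedData.withColumn(
  "Features",
  processedData("Features").as("Features", Metadata.empty))
{code}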

> VectorAssembler metadata causing memory issues
> --
>
> Key: SPARK-19141
> URL: https://issues.apache.org/jira/browse/SPARK-19141
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
> Environment: Windows 10, Ubuntu 16.04.1, Scala 2.11.8, Spark 1.6.0, 
> 2.0.0, 2.1.0
>Reporter: Antonia Oprescu
>
> VectorAssembler produces unnecessary metadata that overflows the Java heap in 
> the case of sparse vectors. In the example below, the logical length of the 
> vector is 10^6, but the number of non-zero values is only 2.
> The problem arises when the vector assembler creates metadata (ML attributes) 
> for each of the 10^6 slots, even if this metadata didn't exist upstream (i.e. 
> HashingTF doesn't produce metadata per slot). Here is a chunk of metadata it 
> produces:
> {noformat}
> {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"HashedFeat_0"},{"idx":1,"name":"HashedFeat_1"},{"idx":2,"name":"HashedFeat_2"},{"idx":3,"name":"HashedFeat_3"},{"idx":4,"name":"HashedFeat_4"},{"idx":5,"name":"HashedFeat_5"},{"idx":6,"name":"HashedFeat_6"},{"idx":7,"name":"HashedFeat_7"},{"idx":8,"name":"HashedFeat_8"},{"idx":9,"name":"HashedFeat_9"},...,{"idx":100,"name":"Feat01"}]},"num_attrs":101}}
> {noformat}
> In this lightweight example, the feature size limit seems to be 1,000,000 
> when run locally, but this scales poorly with more complicated routines. With 
> a larger dataset and a learner (say LogisticRegression), it maxes out 
> anywhere between 10k and 100k hash size even on a decent sized cluster.
> I did some digging, and it seems that the only metadata necessary for 
> downstream learners is the one indicating categorical columns. Thus, I 
> thought of the following possible solutions:
> 1. Compact representation of ml attributes metadata (but this seems to be a 
> bigger change)
> 2. Removal of non-categorical tags from the metadata created by the 
> VectorAssembler
> 3. An option on the existent VectorAssembler to skip unnecessary ml 
> attributes or create another transformer altogether
> I would happy to take a stab at any of these solutions, but I need some 
> direction from the Spark community.
> {code:title=VABug.scala |borderStyle=solid}
> import org.apache.spark.SparkConf
> import org.apache.spark.ml.feature.{HashingTF, VectorAssembler}
> import org.apache.spark.sql.SparkSession
> object VARepro {
>   case class Record(Label: Double, Feat01: Double, Feat02: Array[String])
>   def main(args: Array[String]) {
> val conf = new SparkConf()
>   .setAppName("Vector assembler bug")
>   .setMaster("local[*]")
> val spark = SparkSession.builder.config(conf).getOrCreate()
> import spark.implicits._
> val df = Seq(Record(1.0, 2.0, Array("4daf")), Record(0.0, 3.0, 
> Array("a9ee"))).toDS()
> val numFeatures = 1000
> val hashingScheme = new 
> HashingTF().setInputCol("Feat02").setOutputCol("HashedFeat").setNumFeatures(numFeatures)
> val hashedData = hashingScheme.transform(df)
> val vectorAssembler = new 
> VectorAssembler().setInputCols(Array("HashedFeat","Feat01")).setOutputCol("Features")
> val processedData = vectorAssembler.transform(hashedData).select("Label", 
> "Features")
> processedData.show()
>   }
> }
> {code}
> *Stacktrace from the example above:*
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit 
> exceeded
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.copy(attributes.scala:272)
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:215)
>   at 
> org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:195)
>   at 
> org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:71)
>   at 
> org.apache.spark.ml.attribute.AttributeGroup$$anonfun$3$$anonfun$apply$1.apply(AttributeGroup.scala:70)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> scala.collection.IterableLike$class.copyToArray(IterableLike.scala:254)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.copyToArray(SeqViewLike.scala:37)
>   at 
> scala.collection.TraversableOnce$class.copyToArray(TraversableOnce.scala:278)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.copyToArray(SeqViewLike.scala:37)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:286)
>   at 
> scala.collection.SeqViewLike$AbstractTransformed.toArray(SeqViewLike.scala:37)
>   at 
> org.apache.spark.

[jira] [Created] (SPARK-21170) Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted

2017-06-21 Thread Devaraj K (JIRA)
Devaraj K created SPARK-21170:
-

 Summary: Utils.tryWithSafeFinallyAndFailureCallbacks throws 
IllegalArgumentException: Self-suppression not permitted
 Key: SPARK-21170
 URL: https://issues.apache.org/jira/browse/SPARK-21170
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Devaraj K
Priority: Minor


{code:xml}
17/06/20 22:49:39 ERROR Executor: Exception in task 225.0 in stage 1.0 (TID 
27225)
java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

{code:xml}
17/06/20 22:52:32 INFO scheduler.TaskSetManager: Lost task 427.0 in stage 1.0 
(TID 27427) on 192.168.1.121, executor 12: java.lang.IllegalArgumentException 
(Self-suppression not permitted) [duplicate 1]
17/06/20 22:52:33 INFO scheduler.TaskSetManager: Starting task 427.1 in stage 
1.0 (TID 27764, 192.168.1.122, executor 106, partition 427, PROCESS_LOCAL, 4625 
bytes)
17/06/20 22:52:33 INFO scheduler.TaskSetManager: Lost task 186.0 in stage 1.0 
(TID 27186) on 192.168.1.122, executor 106: java.lang.IllegalArgumentException 
(Self-suppression not permitted) [duplicate 2]
17/06/20 22:52:38 INFO scheduler.TaskSetManager: Starting task 186.1 in stage 
1.0 (TID 27765, 192.168.1.121, executor 9, partition 186, PROCESS_LOCAL, 4625 
bytes)
17/06/20 22:52:38 WARN scheduler.TaskSetManager: Lost task 392.0 in stage 1.0 
(TID 27392, 192.168.1.121, executor 9): java.lang.IllegalArgumentException: 
Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Here it is trying to suppress the same Throwable instance, which causes the 
IllegalArgumentException and masks the original exception.

I think the throwable should not be added to the suppressed list when it is the 
same instance.
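
A minimal sketch of the guard being suggested, written against a simplified helper 
rather than the real Utils.tryWithSafeFinallyAndFailureCallbacks code:

{code}
def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
  var original: Throwable = null
  try {
    block
  } catch {
    case t: Throwable =>
      original = t
      throw t
  } finally {
    try {
      finallyBlock
    } catch {
      case t: Throwable =>
        if (original != null && (original ne t)) {
          // Only suppress a *different* throwable; t.addSuppressed(t) fails with
          // IllegalArgumentException: Self-suppression not permitted.
          original.addSuppressed(t)
          throw original
        } else {
          throw t
        }
    }
  }
}
{code}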



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21170) Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21170:


Assignee: (was: Apache Spark)

> Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: 
> Self-suppression not permitted
> ---
>
> Key: SPARK-21170
> URL: https://issues.apache.org/jira/browse/SPARK-21170
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>Priority: Minor
>
> {code:xml}
> 17/06/20 22:49:39 ERROR Executor: Exception in task 225.0 in stage 1.0 (TID 
> 27225)
> java.lang.IllegalArgumentException: Self-suppression not permitted
> at java.lang.Throwable.addSuppressed(Throwable.java:1043)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {code:xml}
> 17/06/20 22:52:32 INFO scheduler.TaskSetManager: Lost task 427.0 in stage 1.0 
> (TID 27427) on 192.168.1.121, executor 12: java.lang.IllegalArgumentException 
> (Self-suppression not permitted) [duplicate 1]
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Starting task 427.1 in stage 
> 1.0 (TID 27764, 192.168.1.122, executor 106, partition 427, PROCESS_LOCAL, 
> 4625 bytes)
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Lost task 186.0 in stage 1.0 
> (TID 27186) on 192.168.1.122, executor 106: 
> java.lang.IllegalArgumentException (Self-suppression not permitted) 
> [duplicate 2]
> 17/06/20 22:52:38 INFO scheduler.TaskSetManager: Starting task 186.1 in stage 
> 1.0 (TID 27765, 192.168.1.121, executor 9, partition 186, PROCESS_LOCAL, 4625 
> bytes)
> 17/06/20 22:52:38 WARN scheduler.TaskSetManager: Lost task 392.0 in stage 1.0 
> (TID 27392, 192.168.1.121, executor 9): java.lang.IllegalArgumentException: 
> Self-suppression not permitted
>   at java.lang.Throwable.addSuppressed(Throwable.java:1043)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here it is trying to suppress the same Throwable instance, which causes the 
> IllegalArgumentException and masks the original exception.
> I think the throwable should not be added to the suppressed list when it is 
> the same instance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21170) Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058775#comment-16058775
 ] 

Apache Spark commented on SPARK-21170:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/18384

> Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: 
> Self-suppression not permitted
> ---
>
> Key: SPARK-21170
> URL: https://issues.apache.org/jira/browse/SPARK-21170
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>Priority: Minor
>
> {code:xml}
> 17/06/20 22:49:39 ERROR Executor: Exception in task 225.0 in stage 1.0 (TID 
> 27225)
> java.lang.IllegalArgumentException: Self-suppression not permitted
> at java.lang.Throwable.addSuppressed(Throwable.java:1043)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {code:xml}
> 17/06/20 22:52:32 INFO scheduler.TaskSetManager: Lost task 427.0 in stage 1.0 
> (TID 27427) on 192.168.1.121, executor 12: java.lang.IllegalArgumentException 
> (Self-suppression not permitted) [duplicate 1]
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Starting task 427.1 in stage 
> 1.0 (TID 27764, 192.168.1.122, executor 106, partition 427, PROCESS_LOCAL, 
> 4625 bytes)
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Lost task 186.0 in stage 1.0 
> (TID 27186) on 192.168.1.122, executor 106: 
> java.lang.IllegalArgumentException (Self-suppression not permitted) 
> [duplicate 2]
> 17/06/20 22:52:38 INFO scheduler.TaskSetManager: Starting task 186.1 in stage 
> 1.0 (TID 27765, 192.168.1.121, executor 9, partition 186, PROCESS_LOCAL, 4625 
> bytes)
> 17/06/20 22:52:38 WARN scheduler.TaskSetManager: Lost task 392.0 in stage 1.0 
> (TID 27392, 192.168.1.121, executor 9): java.lang.IllegalArgumentException: 
> Self-suppression not permitted
>   at java.lang.Throwable.addSuppressed(Throwable.java:1043)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here it is trying to suppress the same Throwable instance, which causes the 
> IllegalArgumentException and masks the original exception.
> I think the throwable should not be added to the suppressed list when it is 
> the same instance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21170) Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: Self-suppression not permitted

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21170:


Assignee: Apache Spark

> Utils.tryWithSafeFinallyAndFailureCallbacks throws IllegalArgumentException: 
> Self-suppression not permitted
> ---
>
> Key: SPARK-21170
> URL: https://issues.apache.org/jira/browse/SPARK-21170
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>Assignee: Apache Spark
>Priority: Minor
>
> {code:xml}
> 17/06/20 22:49:39 ERROR Executor: Exception in task 225.0 in stage 1.0 (TID 
> 27225)
> java.lang.IllegalArgumentException: Self-suppression not permitted
> at java.lang.Throwable.addSuppressed(Throwable.java:1043)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {code:xml}
> 17/06/20 22:52:32 INFO scheduler.TaskSetManager: Lost task 427.0 in stage 1.0 
> (TID 27427) on 192.168.1.121, executor 12: java.lang.IllegalArgumentException 
> (Self-suppression not permitted) [duplicate 1]
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Starting task 427.1 in stage 
> 1.0 (TID 27764, 192.168.1.122, executor 106, partition 427, PROCESS_LOCAL, 
> 4625 bytes)
> 17/06/20 22:52:33 INFO scheduler.TaskSetManager: Lost task 186.0 in stage 1.0 
> (TID 27186) on 192.168.1.122, executor 106: 
> java.lang.IllegalArgumentException (Self-suppression not permitted) 
> [duplicate 2]
> 17/06/20 22:52:38 INFO scheduler.TaskSetManager: Starting task 186.1 in stage 
> 1.0 (TID 27765, 192.168.1.121, executor 9, partition 186, PROCESS_LOCAL, 4625 
> bytes)
> 17/06/20 22:52:38 WARN scheduler.TaskSetManager: Lost task 392.0 in stage 1.0 
> (TID 27392, 192.168.1.121, executor 9): java.lang.IllegalArgumentException: 
> Self-suppression not permitted
>   at java.lang.Throwable.addSuppressed(Throwable.java:1043)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1400)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1145)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here it is trying to suppress the same Throwable instance, which causes the 
> IllegalArgumentException and masks the original exception.
> I think the throwable should not be added to the suppressed list when it is 
> the same instance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18016:

Fix Version/s: (was: 2.3.0)
   2.2.0
   2.1.2

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.1.2, 2.2.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(Si

[jira] [Created] (SPARK-21171) Speculate task scheduling block dirve handle normal task when a job task number more than one hundred thousand

2017-06-21 Thread wangminfeng (JIRA)
wangminfeng created SPARK-21171:
---

 Summary: Speculate task scheduling block dirve handle normal task 
when a job task number more than one hundred thousand
 Key: SPARK-21171
 URL: https://issues.apache.org/jira/browse/SPARK-21171
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 2.1.1
 Environment: We have more than two hundred high-performance machine to 
handle more than 2T data by one query
Reporter: wangminfeng


If a job has more than one hundred thousand tasks and spark.speculation is 
true, then when speculatable tasks start, choosing a speculatable task wastes a 
lot of time and blocks the scheduling of other tasks. We run ad-hoc queries for 
badiu data analysis, and we cannot tolerate a job wasting time even when it is 
a large job.
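
This does not remove the cost of the speculatable-task scan itself, but as a 
stop-gap the existing speculation knobs can be tuned so the driver scans less 
often and over a smaller candidate set; the values below are illustrative, not 
recommendations:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "1s")    // default 100ms: check for stragglers less often
  .set("spark.speculation.quantile", "0.9")   // default 0.75: only after 90% of tasks finish
  .set("spark.speculation.multiplier", "2")   // default 1.5: only speculate much slower tasks
{code}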



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21171) Speculate task scheduling block dirve handle normal task when a job task number more than one hundred thousand

2017-06-21 Thread wangminfeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangminfeng updated SPARK-21171:

Description: If a job have more then one hundred thousand tasks and 
spark.speculation is true, when speculable tasks start, choosing a speculable 
will waste lots of time and block other tasks. We do a ad-hoc query for data 
analyse,  we can't tolerate one job wasting time even it is a large job  (was: 
If a job have more then one hundred thousand tasks and spark.speculation is 
true, when speculable tasks start, choosing a speculable will waste lots of 
time and block other tasks. We do a ad-hoc query for badiu data analyse,  we 
can't tolerate one job wasting time even it is a large job)

> Speculate task scheduling block dirve handle normal task when a job task 
> number more than one hundred thousand
> --
>
> Key: SPARK-21171
> URL: https://issues.apache.org/jira/browse/SPARK-21171
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.1.1
> Environment: We have more than two hundred high-performance machine 
> to handle more than 2T data by one query
>Reporter: wangminfeng
>
> If a job have more then one hundred thousand tasks and spark.speculation is 
> true, when speculable tasks start, choosing a speculable will waste lots of 
> time and block other tasks. We do a ad-hoc query for data analyse,  we can't 
> tolerate one job wasting time even it is a large job



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21171) Speculate task scheduling block dirve handle normal task when a job task number more than one hundred thousand

2017-06-21 Thread wangminfeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangminfeng updated SPARK-21171:

Affects Version/s: (was: 2.1.1)
   2.0.0

> Speculate task scheduling block dirve handle normal task when a job task 
> number more than one hundred thousand
> --
>
> Key: SPARK-21171
> URL: https://issues.apache.org/jira/browse/SPARK-21171
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
> Environment: We have more than two hundred high-performance machine 
> to handle more than 2T data by one query
>Reporter: wangminfeng
>
> If a job have more then one hundred thousand tasks and spark.speculation is 
> true, when speculable tasks start, choosing a speculable will waste lots of 
> time and block other tasks. We do a ad-hoc query for data analyse,  we can't 
> tolerate one job wasting time even it is a large job



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21172) EOFException reached end of stream in UnsafeRowSerializer

2017-06-21 Thread liupengcheng (JIRA)
liupengcheng created SPARK-21172:


 Summary: EOFException reached end of stream in UnsafeRowSerializer
 Key: SPARK-21172
 URL: https://issues.apache.org/jira/browse/SPARK-21172
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.0.1
Reporter: liupengcheng


Spark sql job failed because of the following Exception. Seems like a bug in 
shuffle stage. 

{code}
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:264)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: reached end of stream after reading 9034374 
bytes; 1684891936 bytes expected
at 
org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:259)
... 8 more
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21172) EOFException reached end of stream in UnsafeRowSerializer

2017-06-21 Thread liupengcheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-21172:
-
Description: 
Spark sql job failed because of the following Exception. Seems like a bug in 
shuffle stage. 

Shuffle read size for single task is tens of GB

{code}
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:264)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: reached end of stream after reading 9034374 
bytes; 1684891936 bytes expected
at 
org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:259)
... 8 more
{code}

  was:
Spark sql job failed because of the following Exception. Seems like a bug in 
shuffle stage. 

{code}
org.apache.spark.SparkException: Task failed while writing rows
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:264)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: reached end of stream after reading 9034374 
bytes; 1684891936 bytes expected
at 
org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(

[jira] [Created] (SPARK-21173) There are several configuration about SSL displayed in configuration.md but never be used.

2017-06-21 Thread liuzhaokun (JIRA)
liuzhaokun created SPARK-21173:
--

 Summary: There are several configuration about SSL displayed in 
configuration.md but never be used.
 Key: SPARK-21173
 URL: https://issues.apache.org/jira/browse/SPARK-21173
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 2.1.1
Reporter: liuzhaokun
Priority: Trivial


There are several configurations about SSL listed in configuration.md that are 
never used and do not appear in Spark's code, so I think they should be removed 
from configuration.md.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21173) There are several configuration about SSL displayed in configuration.md but never be used.

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058826#comment-16058826
 ] 

Apache Spark commented on SPARK-21173:
--

User 'liu-zhaokun' has created a pull request for this issue:
https://github.com/apache/spark/pull/18385

> There are several configuration about SSL displayed in configuration.md but 
> never be used.
> --
>
> Key: SPARK-21173
> URL: https://issues.apache.org/jira/browse/SPARK-21173
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Trivial
>
> There are several configurations about SSL listed in configuration.md that are 
> never used and do not appear in Spark's code, so I think they should be removed 
> from configuration.md.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21173) There are several configuration about SSL displayed in configuration.md but never be used.

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21173:


Assignee: Apache Spark

> There are several configuration about SSL displayed in configuration.md but 
> never be used.
> --
>
> Key: SPARK-21173
> URL: https://issues.apache.org/jira/browse/SPARK-21173
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Assignee: Apache Spark
>Priority: Trivial
>
> There are several configurations about SSL listed in configuration.md that are 
> never used and do not appear in Spark's code, so I think they should be removed 
> from configuration.md.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21173) There are several configuration about SSL displayed in configuration.md but never be used.

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21173:


Assignee: (was: Apache Spark)

> There are several configuration about SSL displayed in configuration.md but 
> never be used.
> --
>
> Key: SPARK-21173
> URL: https://issues.apache.org/jira/browse/SPARK-21173
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Trivial
>
> There are several configurations about SSL listed in configuration.md that are 
> never used and do not appear in Spark's code, so I think they should be removed 
> from configuration.md.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058836#comment-16058836
 ] 

Apache Spark commented on SPARK-21165:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18386

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>Priority: Blocker
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> org.

[jira] [Assigned] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21165:


Assignee: Apache Spark  (was: Xiao Li)

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Blocker
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, 
> localhost, executor driver): 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> , tree: first#74
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at 
> org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at 
> org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at 
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
> at 
> org.apache.spark.r

[jira] [Assigned] (SPARK-21165) Fail to write into partitioned hive table due to attribute reference not working with cast on partition column

2017-06-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21165:


Assignee: Xiao Li  (was: Apache Spark)

> Fail to write into partitioned hive table due to attribute reference not 
> working with cast on partition column
> --
>
> Key: SPARK-21165
> URL: https://issues.apache.org/jira/browse/SPARK-21165
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>Assignee: Xiao Li
>Priority: Blocker
>
> A simple "insert into ... select" involving partitioned hive tables fails.  
> Here's a simpler repro which doesn't involve hive at all -- this succeeds on 
> 2.1.1, but fails on 2.2.0-rc5:
> {noformat}
> spark.sql("""SET hive.exec.dynamic.partition.mode=nonstrict""")
> spark.sql("""DROP TABLE IF EXISTS src""")
> spark.sql("""DROP TABLE IF EXISTS dest""")
> spark.sql("""
> CREATE TABLE src (first string, word string)
>   PARTITIONED BY (length int)
> """)
> spark.sql("""
> INSERT INTO src PARTITION(length) VALUES
>   ('a', 'abc', 3),
>   ('b', 'bcde', 4),
>   ('c', 'cdefg', 5)
> """)
> spark.sql("""
>   CREATE TABLE dest (word string, length int)
> PARTITIONED BY (first string)
> """)
> spark.sql("""
>   INSERT INTO TABLE dest PARTITION(first) SELECT word, length, cast(first as 
> string) as first FROM src
> """)
> {noformat}
> The exception is
> {noformat}
> 17/06/21 14:25:53 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 10, localhost, executor driver): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: first#74
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$$anonfun$bind$1.apply(GenerateOrdering.scala:49)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:49)
> at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.bind(GenerateOrdering.scala:43)
> at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:884)
> at org.apache.spark.sql.execution.SparkPlan.newOrdering(SparkPlan.scala:363)
> at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:63)
> at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:102)
> at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
> at org.apache.spark.rdd.RD
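
For comparison, here is a minimal sketch of the same repro driven through the
DataFrame writer API instead of SQL. It assumes the {{src}} and {{dest}} tables
and the nonstrict dynamic-partition setting from the snippet above are already
in place; that it exercises the same attribute-binding code path is an
assumption, not something verified here.

{noformat}
// Illustrative only: equivalent of the SQL repro via the DataFrame writer API.
// Assumes the src/dest tables from the snippet above already exist.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.table("src")
  .selectExpr("word", "length", "cast(first as string) as first")
  .write
  .insertInto("dest")   // insertInto resolves columns by position: (word, length, first)
{noformat}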

[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions

2017-06-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058843#comment-16058843
 ] 

Saisai Shao commented on SPARK-21080:
-

That PR should work; we applied it to our internal branch and verified it.

[~srowen] what's your opinion on this PR 
(https://github.com/apache/spark/pull/9168)? That PR tries to work around a 
Kerberos issue in the HDFS HA scenario. The issue is really an HDFS issue, but 
it was only fixed in Hadoop 2.8.2. I think we still support Hadoop 2.6, and for 
many users it is pretty hard to upgrade HDFS to a newer version. So I think it 
would be useful to merge that workaround into Spark and fix the issue from 
Spark's side. What is your suggestion?

> Workaround for HDFS delegation token expiry broken with some Hadoop versions
> 
>
> Key: SPARK-21080
> URL: https://issues.apache.org/jira/browse/SPARK-21080
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3
>Reporter: Lukasz Raszka
>Priority: Minor
>
> We're being hit by SPARK-11182, where the core issue in HDFS has been fixed 
> in more recent versions. It seems that the [workaround introduced by user 
> SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d]
>  doesn't work in newer versions of Hadoop. This seems to be caused by a move 
> of the property name from {{fs.hdfs.impl}} to 
> {{fs.AbstractFileSystem.hdfs.impl}}, which happened somewhere around 2.7.0 or 
> earlier. Taking this into account should make the workaround work again for 
> less recent Hadoop versions.
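
As an illustration of the property rename described above, a minimal Scala
sketch of how a lookup could honor both keys. This is an assumption-laden
illustration against Hadoop's standard {{Configuration}} API, not the actual
patch under discussion.

{noformat}
import org.apache.hadoop.conf.Configuration

// Prefer the newer key (present since roughly Hadoop 2.7.0) and fall back to
// the older one, so the lookup keeps working on both sides of the rename.
def hdfsImplClass(conf: Configuration): Option[String] =
  Option(conf.get("fs.AbstractFileSystem.hdfs.impl"))
    .orElse(Option(conf.get("fs.hdfs.impl")))

// Hypothetical usage from a Spark application:
// val impl = hdfsImplClass(spark.sparkContext.hadoopConfiguration)
{noformat}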



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions

2017-06-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058843#comment-16058843
 ] 

Saisai Shao edited comment on SPARK-21080 at 6/22/17 6:39 AM:
--

That PR should work; we applied it to our internal branch and verified it. But 
it is a little outdated and needs a rebase.

[~srowen] what's your opinion on this PR 
(https://github.com/apache/spark/pull/9168)? That PR tries to work around a 
Kerberos issue in the HDFS HA scenario. The issue is really an HDFS issue, but 
it was only fixed in Hadoop 2.8.2. I think we still support Hadoop 2.6, and for 
many users it is pretty hard to upgrade HDFS to a newer version. So I think it 
would be useful to merge that workaround into Spark and fix the issue from 
Spark's side. What is your suggestion?


was (Author: jerryshao):
That PR should be worked, we applied that one to our internal branch and 
verified.

[~srowen] what's your opinion of this PR 
(https://github.com/apache/spark/pull/9168)? That PR tried to workaround 
kerberos issue in HDFS HA scenario, the issue is really a HDFS issue but it 
only fixed after version Hadoop 2.8.2. I think we still supports Hadoop 2.6, 
and for lots of user it is pretty hard to upgrade HDFS to a newer version. So I 
think it should be useful to merge that workaround into Spark, to fix the issue 
from Spark's aspect. What is your suggestion?

> Workaround for HDFS delegation token expiry broken with some Hadoop versions
> 
>
> Key: SPARK-21080
> URL: https://issues.apache.org/jira/browse/SPARK-21080
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3
>Reporter: Lukasz Raszka
>Priority: Minor
>
> We're being hit by SPARK-11182, where the core issue in HDFS has been fixed 
> in more recent versions. It seems that the [workaround introduced by user 
> SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d]
>  doesn't work in newer versions of Hadoop. This seems to be caused by a move 
> of the property name from {{fs.hdfs.impl}} to 
> {{fs.AbstractFileSystem.hdfs.impl}}, which happened somewhere around 2.7.0 or 
> earlier. Taking this into account should make the workaround work again for 
> less recent Hadoop versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


