[jira] [Created] (SPARK-21149) Add job description API for R

2017-06-19 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-21149:


 Summary: Add job description API for R
 Key: SPARK-21149
 URL: https://issues.apache.org/jira/browse/SPARK-21149
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Felix Cheung
Priority: Minor


see SPARK-21125






[jira] [Assigned] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21148:


Assignee: Apache Spark

> Set SparkUncaughtExceptionHandler to the Master
> ---
>
> Key: SPARK-21148
> URL: https://issues.apache.org/jira/browse/SPARK-21148
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>Assignee: Apache Spark
>
> If any thread of the Master gets an uncaught exception, that thread 
> terminates and the Master process keeps running without functioning properly.
> I think we need to handle the uncaught exception and exit the Master 
> gracefully.






[jira] [Assigned] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21148:


Assignee: (was: Apache Spark)

> Set SparkUncaughtExceptionHandler to the Master
> ---
>
> Key: SPARK-21148
> URL: https://issues.apache.org/jira/browse/SPARK-21148
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>
> If any thread of the Master gets an uncaught exception, that thread 
> terminates and the Master process keeps running without functioning properly.
> I think we need to handle the uncaught exception and exit the Master 
> gracefully.






[jira] [Commented] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master

2017-06-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055070#comment-16055070
 ] 

Apache Spark commented on SPARK-21148:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/18358

> Set SparkUncaughtExceptionHandler to the Master
> ---
>
> Key: SPARK-21148
> URL: https://issues.apache.org/jira/browse/SPARK-21148
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>
> If any thread of the Master gets an uncaught exception, that thread 
> terminates and the Master process keeps running without functioning properly.
> I think we need to handle the uncaught exception and exit the Master 
> gracefully.






[jira] [Resolved] (SPARK-20889) SparkR grouped documentation for Column methods

2017-06-19 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20889.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> SparkR grouped documentation for Column methods
> ---
>
> Key: SPARK-20889
> URL: https://issues.apache.org/jira/browse/SPARK-20889
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: documentation
> Fix For: 2.3.0
>
>
> Group the documentation of individual methods defined for the Column class. 
> This aims to create the following improvements:
> - Centralized documentation for easy navigation (user can view multiple 
> related methods on one single page).
> - Reduced number of items in Seealso.
> - Better examples using shared data. This avoids creating a data frame for 
> each function when they are documented separately. More importantly, users 
> can copy and paste the examples and run them directly!
> - Cleaner structure and much fewer Rd files (remove a large number of Rd 
> files).
> - Remove duplicated definition of param (since they share exactly the same 
> argument).
> - No need to write meaningless examples for trivial functions (because of 
> grouping).






[jira] [Created] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master

2017-06-19 Thread Devaraj K (JIRA)
Devaraj K created SPARK-21148:
-

 Summary: Set SparkUncaughtExceptionHandler to the Master
 Key: SPARK-21148
 URL: https://issues.apache.org/jira/browse/SPARK-21148
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Core
Affects Versions: 2.1.1
Reporter: Devaraj K


If any thread of the Master gets an uncaught exception, that thread terminates 
and the Master process keeps running without functioning properly.
I think we need to handle the uncaught exception and exit the Master gracefully.
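The issue title points at SparkUncaughtExceptionHandler; as a minimal, self-contained sketch of the general idea (the object name and exit code below are illustrative, not the actual patch), installing a process-wide default uncaught-exception handler makes a dying thread take the process down instead of leaving it half-alive:

{code:scala}
// Illustrative sketch only, not the actual Spark change.
object GracefulExitSketch {
  def main(args: Array[String]): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        System.err.println(s"Uncaught exception in thread ${t.getName}: $e")
        // Exit instead of letting the process linger in a broken state.
        sys.exit(1)
      }
    })
    // Any thread that now throws an unhandled exception terminates the JVM.
    new Thread(new Runnable {
      override def run(): Unit = throw new RuntimeException("boom")
    }).start()
    Thread.sleep(1000)
  }
}
{code}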






[jira] [Commented] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-19 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055049#comment-16055049
 ] 

Takeshi Yamamuro commented on SPARK-21144:
--

okay, I'm currently looking into this.

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}






[jira] [Created] (SPARK-21147) the schema of socket source can not be set.

2017-06-19 Thread Fei Shao (JIRA)
Fei Shao created SPARK-21147:


 Summary: the schema of socket source can not be set.
 Key: SPARK-21147
 URL: https://issues.apache.org/jira/browse/SPARK-21147
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
 Environment: Win7,spark 2.1.0
Reporter: Fei Shao


The schema set on DataStreamReader does not take effect for the socket source. 
The code is shown below:
val line = ss.readStream.format("socket")
  .option("ip", xxx)
  .option("port", xxx)
  .schema(StructType(StructField("name", StringType) :: StructField("area", StringType) :: Nil))
  .load()
line.printSchema()

printSchema prints:
root
 |-- value: string (nullable = true)

According to the code, it should print the schema set by schema().

Suggestion from Michael Armbrust:
throw an exception saying that you can't set schema here.
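A rough sketch of what that suggestion could look like, assuming a validation step before the source is created (the method and parameter names here are hypothetical, not Spark's actual internals):

{code:scala}
// Hypothetical guard, sketched for illustration only.
import org.apache.spark.sql.types.StructType

def validateUserSchema(sourceName: String, userSpecifiedSchema: Option[StructType]): Unit = {
  // The socket source always produces a single string column named "value",
  // so a user-specified schema cannot be honored.
  if (sourceName == "socket" && userSpecifiedSchema.isDefined) {
    throw new UnsupportedOperationException(
      s"The $sourceName source does not support a user-specified schema.")
  }
}
{code}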







[jira] [Assigned] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21133:
---

Assignee: Yuming Wang

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> To reproduce, set {{spark.sql.shuffle.partitions}} to a value greater than 
> 2000 with a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Resolved] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21133.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18343
[https://github.com/apache/spark/pull/18343]

> HighlyCompressedMapStatus#writeExternal throws NPE
> --
>
> Key: SPARK-21133
> URL: https://issues.apache.org/jira/browse/SPARK-21133
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Priority: Blocker
> Fix For: 2.2.0
>
>
> To reproduce, set {{spark.sql.shuffle.partitions}} to a value greater than 
> 2000 with a shuffle, for example:
> {code:sql}
> spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
>   set spark.sql.shuffle.partitions=2001;
>   drop table if exists spark_hcms_npe;
>   create table spark_hcms_npe as select id, count(*) from big_table group by 
> id;
> "
> {code}
> Error logs:
> {noformat}
> 17/06/18 15:00:27 ERROR Utils: Exception encountered
> java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
> java.io.IOException: java.lang.NullPointerException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
> at 
> org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
> at 
> java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at 
> org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
> at 
> org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Commented] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException

2017-06-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055024#comment-16055024
 ] 

Apache Spark commented on SPARK-21146:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/18357

> Worker should handle and shutdown when any thread gets UncaughtException
> 
>
> Key: SPARK-21146
> URL: https://issues.apache.org/jira/browse/SPARK-21146
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>
> {code:xml}
> 17/06/19 11:41:23 INFO Worker: Asked to launch executor 
> app-20170619114055-0005/228 for ScalaSort
> Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: 
> unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:714)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
>   at 
> java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I see in the logs that the Worker's dispatcher-event-loop thread got the 
> above exception and the Worker keeps running without performing any 
> functionality. The Worker's state also changed from ALIVE to DEAD in the 
> Master's web UI.
> {code:xml}
> worker-20170619150349-192.168.1.120-56175 192.168.1.120:56175 DEAD
> 88 (41 Used)251.2 GB (246.0 GB Used)
> {code}
> I think Worker should handle and shutdown when any thread gets 
> UncaughtException.






[jira] [Assigned] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21146:


Assignee: Apache Spark

> Worker should handle and shutdown when any thread gets UncaughtException
> 
>
> Key: SPARK-21146
> URL: https://issues.apache.org/jira/browse/SPARK-21146
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>Assignee: Apache Spark
>
> {code:xml}
> 17/06/19 11:41:23 INFO Worker: Asked to launch executor 
> app-20170619114055-0005/228 for ScalaSort
> Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: 
> unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:714)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
>   at 
> java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I see in the logs that the Worker's dispatcher-event-loop thread got the 
> above exception and the Worker keeps running without performing any 
> functionality. The Worker's state also changed from ALIVE to DEAD in the 
> Master's web UI.
> {code:xml}
> worker-20170619150349-192.168.1.120-56175 192.168.1.120:56175 DEAD
> 88 (41 Used)251.2 GB (246.0 GB Used)
> {code}
> I think Worker should handle and shutdown when any thread gets 
> UncaughtException.






[jira] [Assigned] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21146:


Assignee: (was: Apache Spark)

> Worker should handle and shutdown when any thread gets UncaughtException
> 
>
> Key: SPARK-21146
> URL: https://issues.apache.org/jira/browse/SPARK-21146
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Devaraj K
>
> {code:xml}
> 17/06/19 11:41:23 INFO Worker: Asked to launch executor 
> app-20170619114055-0005/228 for ScalaSort
> Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: 
> unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:714)
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
>   at 
> java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I see in the logs that the Worker's dispatcher-event-loop thread got the 
> above exception and the Worker keeps running without performing any 
> functionality. The Worker's state also changed from ALIVE to DEAD in the 
> Master's web UI.
> {code:xml}
> worker-20170619150349-192.168.1.120-56175 192.168.1.120:56175 DEAD
> 88 (41 Used)251.2 GB (246.0 GB Used)
> {code}
> I think Worker should handle and shutdown when any thread gets 
> UncaughtException.






[jira] [Assigned] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21144:


Assignee: Apache Spark

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}






[jira] [Commented] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055015#comment-16055015
 ] 

Apache Spark commented on SPARK-21144:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18356

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}






[jira] [Assigned] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21144:


Assignee: (was: Apache Spark)

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}






[jira] [Created] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException

2017-06-19 Thread Devaraj K (JIRA)
Devaraj K created SPARK-21146:
-

 Summary: Worker should handle and shutdown when any thread gets 
UncaughtException
 Key: SPARK-21146
 URL: https://issues.apache.org/jira/browse/SPARK-21146
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Devaraj K


{code:xml}
17/06/19 11:41:23 INFO Worker: Asked to launch executor 
app-20170619114055-0005/228 for ScalaSort
Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: 
unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
at 
java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

I see in the logs that the Worker's dispatcher-event-loop thread got the above 
exception and the Worker keeps running without performing any functionality. 
The Worker's state also changed from ALIVE to DEAD in the Master's web UI.

{code:xml}
worker-20170619150349-192.168.1.120-56175   192.168.1.120:56175 DEAD
88 (41 Used)251.2 GB (246.0 GB Used)
{code}

I think Worker should handle and shutdown when any thread gets 
UncaughtException.
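As a complement to the handler sketch under SPARK-21148 above, here is a hedged sketch of one way the Worker could react, under the assumption that OutOfMemoryError deserves special handling because shutdown hooks may themselves need memory (the exit codes are illustrative, not taken from the patch):

{code:scala}
// Illustrative only; not the actual patch.
val exitOnUncaught = new Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit = e match {
    case _: OutOfMemoryError =>
      // Halt immediately: running shutdown hooks may fail for lack of memory.
      Runtime.getRuntime.halt(2)
    case _ =>
      // Log and exit so the Worker does not keep running in a broken state.
      System.err.println(s"Uncaught exception in thread ${t.getName}: $e")
      System.exit(1)
  }
}
Thread.setDefaultUncaughtExceptionHandler(exitOnUncaught)
{code}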






[jira] [Commented] (SPARK-18191) Port RDD API to use commit protocol

2017-06-19 Thread Shridhar Ramachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054978#comment-16054978
 ] 

Shridhar Ramachandran commented on SPARK-18191:
---

I see this got committed only in 2.2.0 and 2.1.0 as mentioned. Can the "Fix 
Version/s" be changed accordingly?

> Port RDD API to use commit protocol
> ---
>
> Key: SPARK-18191
> URL: https://issues.apache.org/jira/browse/SPARK-18191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Jiang Xingbo
> Fix For: 2.1.0
>
>
> Commit protocol is actually not specific to SQL. We can move it over to core 
> so the RDD API can use it too.






[jira] [Comment Edited] (SPARK-18191) Port RDD API to use commit protocol

2017-06-19 Thread Shridhar Ramachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054978#comment-16054978
 ] 

Shridhar Ramachandran edited comment on SPARK-18191 at 6/20/17 12:11 AM:
-

I see this got committed only in 2.2.0 and not 2.1.0 as mentioned. Can the "Fix 
Version/s" be changed accordingly?


was (Author: shridharama):
I see this got committed only in 2.2.0 and 2.1.0 as mentioned. Can the "Fix 
Version/s" be changed accordingly?

> Port RDD API to use commit protocol
> ---
>
> Key: SPARK-18191
> URL: https://issues.apache.org/jira/browse/SPARK-18191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Jiang Xingbo
> Fix For: 2.1.0
>
>
> Commit protocol is actually not specific to SQL. We can move it over to core 
> so the RDD API can use it too.






[jira] [Resolved] (SPARK-8642) Ungraceful failure when yarn client is not configured.

2017-06-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-8642.
---
Resolution: Won't Fix

Even though a better error here would be nice, I'll close this because there's 
no good way to do this with the current YARN API. As silly as it may be, 
"0.0.0.0:8032" is a valid address - it would work if the RM was running on the 
same host as the Spark app, so it sort of makes sense as a default. (I'd prefer 
a null default and an error if it's not set, but that boat has sailed long ago.)

It also doesn't look like you can configure the retry policy used by the YARN 
client...

So yeah, it sucks, but there's not much that Spark can do here without making 
assumptions about how configuration should be deployed.

> Ungraceful failure when yarn client is not configured.
> --
>
> Key: SPARK-8642
> URL: https://issues.apache.org/jira/browse/SPARK-8642
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0, 1.3.1
>Reporter: Juliet Hougland
>Priority: Minor
> Attachments: yarnretries.log
>
>
> When HADOOP_CONF_DIR is not configured (i.e. yarn-site.xml is not available), 
> the YARN client will still try to submit an application. No connection to the 
> resource manager can be established. The client tries to connect 10 times 
> (with a max retry of ten), and then does that 30 more times. This takes about 
> 5 minutes before an error is recorded for SparkContext initialization, caused 
> by a connect exception. I would expect that after the first 10 tries fail, the 
> initialization of the SparkContext should fail too. At least that is what I 
> would think given the logs. An earlier failure would be ideal/preferred.






[jira] [Assigned] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21145:


Assignee: Apache Spark  (was: Tathagata Das)

> Restarted queries reuse same StateStoreProvider, causing multiple concurrent 
> tasks to update same StateStore
> 
>
> Key: SPARK-21145
> URL: https://issues.apache.org/jira/browse/SPARK-21145
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> StateStoreProvider instances are loaded on demand in an executor when a query 
> is started. When a query is restarted, the loaded provider instance gets 
> reused. Now, there is a non-trivial chance that a task of the previous query 
> run is still running while the tasks of the restarted run have started. So 
> there may be two concurrent tasks related to the same stateful partition, and 
> therefore using the same provider instance. This can lead to inconsistent 
> results and possibly random failures, as state store implementations are not 
> designed to be thread-safe.






[jira] [Commented] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore

2017-06-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054945#comment-16054945
 ] 

Apache Spark commented on SPARK-21145:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/18355

> Restarted queries reuse same StateStoreProvider, causing multiple concurrent 
> tasks to update same StateStore
> 
>
> Key: SPARK-21145
> URL: https://issues.apache.org/jira/browse/SPARK-21145
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> StateStoreProvider instances are loaded on demand in an executor when a query 
> is started. When a query is restarted, the loaded provider instance gets 
> reused. Now, there is a non-trivial chance that a task of the previous query 
> run is still running while the tasks of the restarted run have started. So 
> there may be two concurrent tasks related to the same stateful partition, and 
> therefore using the same provider instance. This can lead to inconsistent 
> results and possibly random failures, as state store implementations are not 
> designed to be thread-safe.






[jira] [Assigned] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21145:


Assignee: Tathagata Das  (was: Apache Spark)

> Restarted queries reuse same StateStoreProvider, causing multiple concurrent 
> tasks to update same StateStore
> 
>
> Key: SPARK-21145
> URL: https://issues.apache.org/jira/browse/SPARK-21145
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> StateStoreProvider instances are loaded on demand in an executor when a query 
> is started. When a query is restarted, the loaded provider instance gets 
> reused. Now, there is a non-trivial chance that a task of the previous query 
> run is still running while the tasks of the restarted run have started. So 
> there may be two concurrent tasks related to the same stateful partition, and 
> therefore using the same provider instance. This can lead to inconsistent 
> results and possibly random failures, as state store implementations are not 
> designed to be thread-safe.






[jira] [Created] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore

2017-06-19 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-21145:
-

 Summary: Restarted queries reuse same StateStoreProvider, causing 
multiple concurrent tasks to update same StateStore
 Key: SPARK-21145
 URL: https://issues.apache.org/jira/browse/SPARK-21145
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Tathagata Das
Assignee: Tathagata Das


StateStoreProvider instances are loaded on demand in an executor when a query is 
started. When a query is restarted, the loaded provider instance gets reused. 
Now, there is a non-trivial chance that a task of the previous query run is 
still running while the tasks of the restarted run have started. So there may 
be two concurrent tasks related to the same stateful partition, and therefore 
using the same provider instance. This can lead to inconsistent results and 
possibly random failures, as state store implementations are not designed to 
be thread-safe.
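One possible direction, sketched under assumptions (this is not necessarily what the linked PR does): include the query run ID in the key under which providers are cached on the executor, so a restarted run never shares a provider instance with a still-running task from the previous run. All names below are illustrative.

{code:scala}
// Illustrative sketch; ProviderKey/ProviderCache are hypothetical names.
import java.util.UUID
import scala.collection.mutable

case class ProviderKey(checkpointLocation: String, operatorId: Long, partitionId: Int, runId: UUID)

object ProviderCache {
  private val providers = mutable.HashMap.empty[ProviderKey, AnyRef]

  // Each (partition, runId) pair gets its own provider instance; the previous
  // run's instance is left alone and can be unloaded separately.
  def getOrCreate(key: ProviderKey)(create: => AnyRef): AnyRef = synchronized {
    providers.getOrElseUpdate(key, create)
  }
}
{code}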






[jira] [Resolved] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

2017-06-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21138.

   Resolution: Fixed
 Assignee: sharkd tu
Fix Version/s: 2.3.0
   2.2.1
   2.1.2
   2.0.3

> Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and 
> "spark.hadoop.fs.defaultFS" are different 
> -
>
> Key: SPARK-21138
> URL: https://issues.apache.org/jira/browse/SPARK-21138
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Assignee: sharkd tu
> Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
>
> When I set different clusters for "spark.hadoop.fs.defaultFS" and 
> "spark.yarn.stagingDir" as follows:
> {code:java}
> spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
> spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
> {code}
> I got the following logs:
> {code:java}
> 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext
> 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618
> 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
> expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {code}
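For readers hitting the same "Wrong FS" error, the general pattern behind a fix looks roughly like the following sketch (an assumption about the approach, not a copy of the merged change): resolve the FileSystem from the staging path itself rather than from fs.defaultFS.

{code:scala}
// Sketch only; the method name is illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def deleteStagingDir(stagingDir: Path, hadoopConf: Configuration): Unit = {
  // getFileSystem resolves the FS from the path's own scheme/authority
  // (hdfs://ss-teg-2-v2 here), not from fs.defaultFS, avoiding "Wrong FS".
  val fs = stagingDir.getFileSystem(hadoopConf)
  fs.delete(stagingDir, true)
}
{code}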






[jira] [Resolved] (SPARK-21124) Wrong user shown in UI when using kerberos

2017-06-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21124.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.3.0

> Wrong user shown in UI when using kerberos
> --
>
> Key: SPARK-21124
> URL: https://issues.apache.org/jira/browse/SPARK-21124
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.3.0
>
>
> When submitting an app to a kerberos-secured cluster, the OS user and the 
> user running the application may differ. Although it may also happen in 
> cluster mode depending on the cluster manager's configuration, it's more 
> common in client mode.
> The UI should show enough information about the user running the application 
> to correctly identify the actual user.
> via {{Utils.getCurrentUserName()}}, so it's mostly a matter of how to record 
> this information (for showing in replayed applications) and how to present it 
> in the UI.
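For reference, a sketch of how the "app user" can be obtained, paraphrasing the behavior described for {{Utils.getCurrentUserName()}} (falling back to the Hadoop/OS user when SPARK_USER is not set); treat it as an approximation rather than the exact implementation:

{code:scala}
// Approximation of Utils.getCurrentUserName(); not a verbatim copy.
import org.apache.hadoop.security.UserGroupInformation

def currentUserName(): String =
  Option(System.getenv("SPARK_USER"))
    .getOrElse(UserGroupInformation.getCurrentUser.getShortUserName)
{code}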






[jira] [Updated] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21144:

Target Version/s: 2.2.0

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}






[jira] [Commented] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054788#comment-16054788
 ] 

Xiao Li commented on SPARK-21144:
-

cc [~maropu]

> Unexpected results when the data schema and partition schema have the 
> duplicate columns
> ---
>
> Key: SPARK-21144
> URL: https://issues.apache.org/jira/browse/SPARK-21144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> withTempPath { dir =>
>   val basePath = dir.getCanonicalPath
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=1").toString)
>   spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
> "foo=a").toString)
>   spark.read.parquet(basePath).show()
> }
> {noformat}
> The result of the above case is
> {noformat}
> +---+
> |foo|
> +---+
> |  1|
> |  1|
> |  a|
> |  a|
> |  1|
> |  a|
> +---+
> {noformat}






[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054785#comment-16054785
 ] 

Apache Spark commented on SPARK-18016:
--

User 'bdrillard' has created a pull request for this issue:
https://github.com/apache/spark/pull/18354

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 

[jira] [Created] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns

2017-06-19 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21144:
---

 Summary: Unexpected results when the data schema and partition 
schema have the duplicate columns
 Key: SPARK-21144
 URL: https://issues.apache.org/jira/browse/SPARK-21144
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


{noformat}
withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
"foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, 
"foo=a").toString)
  spark.read.parquet(basePath).show()
}
{noformat}

The result of the above case is
{noformat}
+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+
{noformat}
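Until the behavior is fixed, one way to sidestep the ambiguity (a workaround sketch, assuming the same spark/basePath setup as the snippet above) is to give the data column a name different from the partition directory key:

{code:scala}
// Workaround sketch: "bar" is the data column, "foo" stays the partition column.
spark.range(0, 3).toDF("bar").write.parquet(new Path(basePath, "foo=1").toString)
spark.range(0, 3).toDF("bar").write.parquet(new Path(basePath, "foo=a").toString)
spark.read.parquet(basePath).show()   // columns: bar and foo, no name clash
{code}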







[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-19 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054784#comment-16054784
 ] 

Aleksander Eskilson commented on SPARK-18016:
-

[~cloud_fan], [~divshukla], I've created a PR for the backport patch, 
[#18354|https://github.com/apache/spark/pull/18354].

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 

[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version

2017-06-19 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054707#comment-16054707
 ] 

Ryan Williams commented on SPARK-21143:
---

[~zsxwing] 

bq. it's too risky to upgrade from 4.0.X to 4.1.X

makes sense, I wasn't meaning to suggest that as the action to take

bq. The reason you cannot use 4.0.42.Final is because you are using 4.1.X APIs?

I'm depending on 
[google-cloud-nio|https://github.com/GoogleCloudPlatform/google-cloud-java/tree/v0.10.0/google-cloud-contrib/google-cloud-nio]
 while running Spark apps in Google Cloud, and it depends transitively on Netty 
4.1.6.Final:

{code}
org.hammerlab:google-cloud-nio:jar:0.10.0-alpha
\- com.google.cloud:google-cloud-storage:jar:0.10.0-beta:compile
   \- com.google.cloud:google-cloud-core:jar:0.10.0-alpha:compile
  \- com.google.api:gax:jar:0.4.0:compile
 \- io.grpc:grpc-netty:jar:1.0.3:compile
+- io.netty:netty-handler-proxy:jar:4.1.6.Final:compile
{code}

Luckily, google-cloud-nio [publishes a shaded 
JAR|http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.cloud%22%20AND%20a%3A%22google-cloud-nio%22]
 with [all dependencies 
shaded+renamed|https://github.com/GoogleCloudPlatform/google-cloud-java/blob/v0.10.0/google-cloud-contrib/google-cloud-nio/pom.xml#L101-L129],
 so I can just use that to avoid this conflict.

I mostly interpret this kind of issue as a nudge toward increasingly isolating 
Spark's classpath (by preemptive shading+renaming of some or all dependencies), 
so that these kinds of issues don't happen.
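
To make the shading+renaming idea concrete on the application side, here is a minimal sketch using sbt-assembly; the tool choice and the {{myapp.shaded}} package prefix are my assumptions, not something this issue prescribes:

{code}
// build.sbt, with the sbt-assembly plugin already enabled
assemblyShadeRules in assembly := Seq(
  // relocate the app's Netty 4.1.x classes so they cannot clash with Spark's 4.0.x
  ShadeRule.rename("io.netty.**" -> "myapp.shaded.io.netty.@1").inAll
)
{code}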

[~sowen] thanks for the pointer, good to know that upgrade is in progress. This is 
narrowly a Netty 4.0 vs. 4.1 conflict, but per the above it could also be interpreted 
as a shading / classpath-isolation concern.

Anyway, feel free to triage as you like; I just doubt I'll be the last person 
to see these stack traces, and I didn't see any good Google hits about them yet.

> Fail to fetch blocks >1MB in size in presence of conflicting Netty version
> --
>
> Key: SPARK-21143
> URL: https://issues.apache.org/jira/browse/SPARK-21143
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Williams
>Priority: Minor
>
> One of my spark libraries inherited a transitive-dependency on Netty 
> 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I 
> wanted to document: fetches of blocks larger than 1MB (pre-compression, 
> afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s 
> and ultimately stage failures.
> I put a minimal repro in [this github 
> repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on 
> a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 
> 1033 {{Array}}'s it dies in a confusing way.
> Not sure what fixing/mitigating this in Spark would look like, other than 
> defensively shading+renaming netty.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11170) EOFException on History server reading in progress lz4

2017-06-19 Thread remoteServer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054701#comment-16054701
 ] 

remoteServer commented on SPARK-11170:
--

Do we have steps to reproduce the issue? I have enabled Spark event log 
compression and lz4 files are generated; however, I do not see any error message 
in the History Server logs. I am on Spark 2.1.0.
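
For reference, a hedged sketch of the configuration the original report implies (equivalent entries can go into spark-defaults.conf); note the failing file was still *.inprogress, i.e. the application was still running when the History Server tried to replay its log. This is only what the description suggests, not verified against 2.1.0:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.compress", "true")
  .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
{code}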

> EOFException on History server reading in progress lz4
> 
>
> Key: SPARK-11170
> URL: https://issues.apache.org/jira/browse/SPARK-11170
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
> Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950)
> Spark: 1.5.x (c27e1904)
>Reporter: Sebastian YEPES FERNANDEZ
>
> The Spark History server is not able to read/save the jobs history if Spark 
> is configured to use 
> "spark.io.compression.codec=org.apache.spark.io.LZ4CompressionCodec", it 
> continuously generated the following error:
> {code}
> ERROR 2015-10-16 16:21:39 org.apache.spark.deploy.history.FsHistoryProvider: 
> Exception encountered when attempting to load application log 
> hdfs://DATA/user/spark/history/application_1444297190346_0073_1.lz4.inprogress
> java.io.EOFException: Stream ended prematurely
> at 
> net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:218)
> at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:150)
> at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:117)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:161)
> at java.io.BufferedReader.readLine(BufferedReader.java:324)
> at java.io.BufferedReader.readLine(BufferedReader.java:389)
> at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
> at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> INFO 2015-10-16 16:21:39 org.apache.spark.deploy.history.FsHistoryProvider: 
> Replaying log path: 
> hdfs://DATA/user/spark/history/application_1444297190346_0072_1.lz4.inprogress
> {code}
> As a workaround, setting 
> "spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec" 
> makes the History server work correctly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-06-19 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054690#comment-16054690
 ] 

Cody Koeninger commented on SPARK-20928:


Cool, can you label it SPIP so it shows up linked from 
http://spark.apache.org/improvement-proposals.html

My only concern so far was the one I mentioned already, namely that it seems 
like you shouldn't have to give up exactly-once delivery semantics in all cases.

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054631#comment-16054631
 ] 

Sean Owen commented on SPARK-21143:
---

If this reduces to a 4.0 vs 4.1 conflict, then this is SPARK-19552 and we gotta 
make that change perhaps in the next minor release.

> Fail to fetch blocks >1MB in size in presence of conflicting Netty version
> --
>
> Key: SPARK-21143
> URL: https://issues.apache.org/jira/browse/SPARK-21143
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Williams
>Priority: Minor
>
> One of my spark libraries inherited a transitive-dependency on Netty 
> 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I 
> wanted to document: fetches of blocks larger than 1MB (pre-compression, 
> afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s 
> and ultimately stage failures.
> I put a minimal repro in [this github 
> repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on 
> a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 
> 1033 {{Array}}'s it dies in a confusing way.
> Not sure what fixing/mitigating this in Spark would look like, other than 
> defensively shading+renaming netty.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21142:


Assignee: Apache Spark

> spark-streaming-kafka-0-10 has too fat dependency on kafka
> --
>
> Key: SPARK-21142
> URL: https://issues.apache.org/jira/browse/SPARK-21142
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Tim Van Wassenhove
>Assignee: Apache Spark
>Priority: Minor
>
> Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
> should only need a dependency on kafka-clients.
> The only reason there is a dependency on kafka (the full server) is the 
> KafkaTestUtils class, which runs in-memory tests against a Kafka broker. This 
> class should be moved to src/test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-06-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054612#comment-16054612
 ] 

Apache Spark commented on SPARK-21142:
--

User 'timvw' has created a pull request for this issue:
https://github.com/apache/spark/pull/18353

> spark-streaming-kafka-0-10 has too fat dependency on kafka
> --
>
> Key: SPARK-21142
> URL: https://issues.apache.org/jira/browse/SPARK-21142
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Tim Van Wassenhove
>Priority: Minor
>
> Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
> should only need a dependency on kafka-clients.
> The only reason there is a dependency on kafka (the full server) is the 
> KafkaTestUtils class, which runs in-memory tests against a Kafka broker. This 
> class should be moved to src/test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21142:


Assignee: (was: Apache Spark)

> spark-streaming-kafka-0-10 has too fat dependency on kafka
> --
>
> Key: SPARK-21142
> URL: https://issues.apache.org/jira/browse/SPARK-21142
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Tim Van Wassenhove
>Priority: Minor
>
> Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
> should only need a dependency on kafka-clients.
> The only reason there is a dependency on kafka (the full server) is the 
> KafkaTestUtils class, which runs in-memory tests against a Kafka broker. This 
> class should be moved to src/test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE

2017-06-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-21133:
-
Target Version/s: 2.2.0
Priority: Blocker  (was: Major)
 Description: 
To reproduce, set {{spark.sql.shuffle.partitions}} > 2000 with a shuffle; for 
example:

{code:sql}
spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7   -e "
  set spark.sql.shuffle.partitions=2001;
  drop table if exists spark_hcms_npe;
  create table spark_hcms_npe as select id, count(*) from big_table group by id;
"
{code}

Error logs:
{noformat}
17/06/18 15:00:27 ERROR Utils: Exception encountered
java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException
java.io.IOException: java.lang.NullPointerException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167)
at 
java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at 
org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303)
... 17 more
17/06/18 

[jira] [Commented] (SPARK-21102) Refresh command is too aggressive in parsing

2017-06-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054600#comment-16054600
 ] 

Reynold Xin commented on SPARK-21102:
-

Can you submit a pull request so we can discuss the details of implementation 
there?


> Refresh command is too aggressive in parsing
> 
>
> Key: SPARK-21102
> URL: https://issues.apache.org/jira/browse/SPARK-21102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>  Labels: starter
>
> SQL REFRESH command parsing is way too aggressive:
> {code}
> | REFRESH TABLE tableIdentifier
> #refreshTable
> | REFRESH .*?  
> #refreshResource
> {code}
> We should change it so it takes the whole string (without space), or a quoted 
> string.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming

2017-06-19 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054596#comment-16054596
 ] 

Michael Armbrust commented on SPARK-20928:
--

Hi Cody, I do plan to flesh this out with the other sections of the SIP 
document and will email the dev list at that point.  All that has been done so 
far is some basic prototyping to estimate how much work an alternative 
{{StreamExecution}} would take to build, and some experiments to validate the 
latencies that this arch could achieve.  Do you have specific concerns with the 
proposal as it stands?

> Continuous Processing Mode for Structured Streaming
> ---
>
> Key: SPARK-20928
> URL: https://issues.apache.org/jira/browse/SPARK-20928
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Michael Armbrust
>
> Given the current Source API, the minimum possible latency for any record is 
> bounded by the amount of time that it takes to launch a task.  This 
> limitation is a result of the fact that {{getBatch}} requires us to know both 
> the starting and the ending offset, before any tasks are launched.  In the 
> worst case, the end-to-end latency is actually closer to the average batch 
> time + task launching time.
> For applications where latency is more important than exactly-once output 
> however, it would be useful if processing could happen continuously.  This 
> would allow us to achieve fully pipelined reading and writing from sources 
> such as Kafka.  This kind of architecture would make it possible to process 
> records with end-to-end latencies on the order of 1 ms, rather than the 
> 10-100ms that is possible today.
> One possible architecture here would be to change the Source API to look like 
> the following rough sketch:
> {code}
>   trait Epoch {
> def data: DataFrame
> /** The exclusive starting position for `data`. */
> def startOffset: Offset
> /** The inclusive ending position for `data`.  Incrementally updated 
> during processing, but not complete until execution of the query plan in 
> `data` is finished. */
> def endOffset: Offset
>   }
>   def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], 
> limits: Limits): Epoch
> {code}
> The above would allow us to build an alternative implementation of 
> {{StreamExecution}} that processes continuously with much lower latency and 
> only stops processing when needing to reconfigure the stream (either due to a 
> failure or a user-requested change in parallelism).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version

2017-06-19 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054593#comment-16054593
 ] 

Shixiong Zhu commented on SPARK-21143:
--

The reason you cannot use 4.0.42.Final is because you are using 4.1.X APIs?

> Fail to fetch blocks >1MB in size in presence of conflicting Netty version
> --
>
> Key: SPARK-21143
> URL: https://issues.apache.org/jira/browse/SPARK-21143
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Williams
>Priority: Minor
>
> One of my spark libraries inherited a transitive-dependency on Netty 
> 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I 
> wanted to document: fetches of blocks larger than 1MB (pre-compression, 
> afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s 
> and ultimately stage failures.
> I put a minimal repro in [this github 
> repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on 
> a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 
> 1033 {{Array}}'s it dies in a confusing way.
> Not sure what fixing/mitigating this in Spark would look like, other than 
> defensively shading+renaming netty.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19975) Add map_keys and map_values functions to Python

2017-06-19 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19975.
-
   Resolution: Fixed
 Assignee: Yong Tang
Fix Version/s: 2.3.0

> Add map_keys and map_values functions  to Python 
> -
>
> Key: SPARK-19975
> URL: https://issues.apache.org/jira/browse/SPARK-19975
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Bryński
>Assignee: Yong Tang
> Fix For: 2.3.0
>
>
> We have `map_keys` and `map_values` functions in SQL.
> There are no equivalent Python functions for them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version

2017-06-19 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054592#comment-16054592
 ] 

Shixiong Zhu commented on SPARK-21143:
--

As Netty is so core to Spark, it's too risky to upgrade from 4.0.X to 4.1.X. 
There is no plan right now.

> Fail to fetch blocks >1MB in size in presence of conflicting Netty version
> --
>
> Key: SPARK-21143
> URL: https://issues.apache.org/jira/browse/SPARK-21143
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Ryan Williams
>Priority: Minor
>
> One of my spark libraries inherited a transitive-dependency on Netty 
> 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I 
> wanted to document: fetches of blocks larger than 1MB (pre-compression, 
> afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s 
> and ultimately stage failures.
> I put a minimal repro in [this github 
> repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on 
> a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 
> 1033 {{Array}}'s it dies in a confusing way.
> Not sure what fixing/mitigating this in Spark would look like, other than 
> defensively shading+renaming netty.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21102) Refresh command is too aggressive in parsing

2017-06-19 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054591#comment-16054591
 ] 

Anton Okolnychyi commented on SPARK-21102:
--

Hi [~rxin],

I took a look at this issue and have a prototype that fixes it. It is 
available 
[here|https://github.com/aokolnychyi/spark/commit/fc2b7c02fab7f570ae3ca080ae1c2c9502300de7].
 I am not sure my current implementation is optimal, so any 
feedback is appreciated. My first idea was to make the grammar as strict as 
possible, but that ran into some problems. I tried the approach below:

SqlBase.g4

{noformat}
...
| REFRESH TABLE tableIdentifier
#refreshTable
| REFRESH resourcePath 
#refreshResource
...

resourcePath
: STRING
| (IDENTIFIER | number | nonReserved | '/' | '-')+ // other symbols can be 
added if needed
;
{noformat}

It is not flexible enough and requires explicitly listing all possible 
symbols. Therefore, I came up with the approach mentioned at the 
beginning.

Let me know your opinion on which one is better.
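
For concreteness, hedged examples of what the stricter grammar is meant to accept, per the issue description (assuming a SparkSession named {{spark}}, e.g. in spark-shell; the paths are illustrative only, not from this issue):

{code}
spark.sql("REFRESH TABLE my_db.my_table")           // table form, unchanged
spark.sql("REFRESH \"/some path/with spaces/x\"")   // quoted-string resource path
spark.sql("REFRESH /tmp/data/x.parquet")            // single token, no spaces
{code}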

> Refresh command is too aggressive in parsing
> 
>
> Key: SPARK-21102
> URL: https://issues.apache.org/jira/browse/SPARK-21102
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>  Labels: starter
>
> SQL REFRESH command parsing is way too aggressive:
> {code}
> | REFRESH TABLE tableIdentifier
> #refreshTable
> | REFRESH .*?  
> #refreshResource
> {code}
> We should change it so it takes the whole string (without space), or a quoted 
> string.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12414) Remove closure serializer

2017-06-19 Thread Ritesh Tijoriwala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054529#comment-16054529
 ] 

Ritesh Tijoriwala commented on SPARK-12414:
---

I have a similar situation. I have several classes that I would like to 
instantiate and use in executors, e.g. DB connections, Elasticsearch 
clients, etc. I don't want to write instantiation code in Spark functions and 
use "statics". There was a neat trick suggested here - 
https://issues.apache.org/jira/browse/SPARK-650 - but it seems this will no longer 
work starting with 2.0.0 as a consequence of this ticket. 

Could anybody from the Spark community recommend how to do some initialization on 
each Spark executor for the job, before any task execution begins?
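
A hedged sketch of the pattern usually suggested instead (not an official hook): a lazily initialized, per-JVM singleton that is constructed the first time a task on an executor touches it. The String below stands in for a real client object; all names are placeholders:

{code}
import org.apache.spark.sql.SparkSession

object ExecutorSetup {
  // initialized at most once per executor JVM, on first use by a task
  lazy val resource: String = {
    // put real client construction (DB, Elasticsearch, ...) here
    "connection-handle"
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("executor-setup").getOrCreate()
    val tagged = spark.sparkContext
      .parallelize(1 to 100, numSlices = 4)
      .mapPartitions { it =>
        val r = ExecutorSetup.resource   // triggers the one-time initialization on this executor
        it.map(x => s"$r:$x")
      }
    println(tagged.count())
    spark.stop()
  }
}
{code}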

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Sean Owen
> Fix For: 2.0.0
>
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-06-19 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21142:
-
Component/s: (was: Structured Streaming)
 DStreams

> spark-streaming-kafka-0-10 has too fat dependency on kafka
> --
>
> Key: SPARK-21142
> URL: https://issues.apache.org/jira/browse/SPARK-21142
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Tim Van Wassenhove
>Priority: Minor
>
> Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
> should only need a dependency on kafka-clients.
> The only reason there is a dependency on kafka (the full server) is the 
> KafkaTestUtils class, which runs in-memory tests against a Kafka broker. This 
> class should be moved to src/test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21123) Options for file stream source are in a wrong table

2017-06-19 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21123.
--
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.0

> Options for file stream source are in a wrong table
> ---
>
> Key: SPARK-21123
> URL: https://issues.apache.org/jira/browse/SPARK-21123
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: starter
> Fix For: 2.2.0, 2.3.0
>
> Attachments: Screen Shot 2017-06-15 at 11.25.49 AM.png
>
>
> Right now, options for the file stream source are documented with the file sink. We 
> should create a table for the source options and fix it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16430) Add an option in file stream source to read 1 file at a time

2017-06-19 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-16430:
-
Fix Version/s: (was: 2.1.0)
   2.0.0

> Add an option in file stream source to read 1 file at a time
> 
>
> Key: SPARK-16430
> URL: https://issues.apache.org/jira/browse/SPARK-16430
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>
> An option that limits the file stream source to read 1 file at a time enables 
> rate limiting. It has the additional convenience that a static set of files 
> can be used like a stream for testing, as this allows those files to be 
> considered one at a time.
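
A hedged illustration of using such an option on the file stream source, assuming the option in question is {{maxFilesPerTrigger}} and a SparkSession named {{spark}}; the schema and path are placeholders:

{code}
import org.apache.spark.sql.types._

val inputSchema = new StructType().add("id", LongType).add("name", StringType)
val oneFileAtATime = spark.readStream
  .schema(inputSchema)                // streaming file sources need a user-supplied schema
  .option("maxFilesPerTrigger", 1)    // consider one new file per micro-batch
  .csv("/tmp/stream-input")
{code}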



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16430) Add an option in file stream source to read 1 file at a time

2017-06-19 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-16430:
-
Fix Version/s: 2.1.0

> Add an option in file stream source to read 1 file at a time
> 
>
> Key: SPARK-16430
> URL: https://issues.apache.org/jira/browse/SPARK-16430
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> An option that limits the file stream source to read 1 file at a time enables 
> rate limiting. It has the additional convenience that a static set of files 
> can be used like a stream for testing, as this allows those files to be 
> considered one at a time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-06-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19688.

   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1
   2.1.2
   2.0.3
   1.6.4

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
> Fix For: 1.6.4, 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
>
> The spark.yarn.credentials.file property is set to a different application ID 
> instead of the actual application ID.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version

2017-06-19 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-21143:
-

 Summary: Fail to fetch blocks >1MB in size in presence of 
conflicting Netty version
 Key: SPARK-21143
 URL: https://issues.apache.org/jira/browse/SPARK-21143
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Ryan Williams
Priority: Minor


One of my spark libraries inherited a transitive-dependency on Netty 
4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I 
wanted to document: fetches of blocks larger than 1MB (pre-compression, afaict) 
seem to trigger a code path that results in {{AbstractMethodError}}'s and 
ultimately stage failures.

I put a minimal repro in [this github 
repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on a 
1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 1033 
{{Array}}'s it dies in a confusing way.

Not sure what fixing/mitigating this in Spark would look like, other than 
defensively shading+renaming netty.
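
A hedged sketch of the {{collect}} repro described above (the linked repo is the authoritative version); on a healthy classpath it simply succeeds, and per the report it only fails when the conflicting Netty 4.1.x is present. Assumes a SparkContext named {{sc}}, e.g. in spark-shell:

{code}
// 1033 arrays of 1000 bytes (~1 MB) in a single partition, per the report;
// 1032 arrays reportedly work, 1033 fail with AbstractMethodError
val rows = Seq.fill(1033)(Array.fill[Byte](1000)(0.toByte))
val fetched = sc.parallelize(rows, numSlices = 1).collect()
println(fetched.length)
{code}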



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054176#comment-16054176
 ] 

sam edited comment on SPARK-21137 at 6/19/17 3:20 PM:
--

[~srowen] Ah OK, sorry, not used to that process.  On other projects I've seen 
users are allowed to raise bug tickets even if they don't run debugging tools 
to try to diagnose what the cause is.  Usually that step happens *after* the 
bug is accepted.

Very happy to investigate further and provide a thread dump ASAP.  Is there 
anything else I should try while I'm at it?



was (Author: sams):
[~srowen] Ah OK, sorry, not used to that process.  On other projects I've seen 
users are allowed to raise bug tickets even if they don't run debugging tools 
to try to diagnose what the cause is.  Usually that step happens *after* the 
bug is accepted.

Very happy to investigate further and provide a thread dump ASAP.  Is there 
anything else I should try while I'm at it?  How about turning the cluster off 
and on again?


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054176#comment-16054176
 ] 

sam commented on SPARK-21137:
-

[~srowen] Ah OK, sorry, not used to that process.  On other projects I've seen 
users are allowed to raise bug tickets even if they don't run debugging tools 
to try to diagnose what the cause is.  Usually that step happens *after* the 
bug is accepted.

Very happy to investigate further and provide a thread dump ASAP.  Is there 
anything else I should try while I'm at it?  How about turning the cluster off 
and on again?


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-06-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054162#comment-16054162
 ] 

Michael Schmeißer commented on SPARK-650:
-

[~riteshtijoriwala] - Sorry, but I am not familiar with Spark 2.0.0 yet. But 
what I can say is that we have raised a Cloudera support case to address this 
issue so maybe we can expect some help from this side.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions

2017-06-19 Thread Lukasz Raszka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054155#comment-16054155
 ] 

Lukasz Raszka commented on SPARK-21080:
---

[~jerryshao] Yes, it's in HA mode. Updating to a newer HDFS is not an option in 
our infrastructure, unfortunately. The PR you reference is outdated at the moment - 
I've tried to crudely apply it with my limited knowledge of the mechanism, but 
it seems the issue persists. Possibly a mistake on my side; it would be great 
if someone could update the fix.

> Workaround for HDFS delegation token expiry broken with some Hadoop versions
> 
>
> Key: SPARK-21080
> URL: https://issues.apache.org/jira/browse/SPARK-21080
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3
>Reporter: Lukasz Raszka
>Priority: Minor
>
> We're getting struck by SPARK-11182, where the core issue in HDFS has been 
> fixed in more recent versions. It seems that the [workaround introduced by user 
> SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d]
>  doesn't work in newer versions of Hadoop. This seems to be caused by a move of 
> the property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}}, 
> which happened somewhere around 2.7.0 or earlier. Taking this into account 
> should make the workaround work again for less recent Hadoop versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17176) Task are sorted by "Index" in Stage Page.

2017-06-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17176.
---
Resolution: Won't Fix

> Task are sorted by "Index" in Stage Page.
> -
>
> Key: SPARK-17176
> URL: https://issues.apache.org/jira/browse/SPARK-17176
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.2, 2.0.0
>Reporter: cen yuhai
>Priority: Minor
>
> Tasks are sorted by "Index" in the Stage Page, but users are always concerned about 
> tasks which have failed (see error messages) or are still running (maybe the data is 
> skewed). When there are too many tasks, sorting is too slow. So it is 
> better to set the default sort column to "Status". 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-06-19 Thread Tim Van Wassenhove (JIRA)
Tim Van Wassenhove created SPARK-21142:
--

 Summary: spark-streaming-kafka-0-10 has too fat dependency on kafka
 Key: SPARK-21142
 URL: https://issues.apache.org/jira/browse/SPARK-21142
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.1
Reporter: Tim Van Wassenhove
Priority: Minor


Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
should only need a dependency on kafka-clients.

The only reason there is a dependency on kafka (the full server) is the 
KafkaTestUtils class, which runs in-memory tests against a Kafka broker. This class 
should be moved to src/test.
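
Until the module itself slims down, a hedged user-side sketch (sbt; group/artifact names follow the published poms, version numbers are illustrative) of depending on the connector while dropping the full broker and keeping kafka-clients explicitly; this assumes the application does not use KafkaTestUtils:

{code}
// build.sbt
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.1")
    .exclude("org.apache.kafka", "kafka_2.11"),        // drop the full Kafka broker
  "org.apache.kafka" % "kafka-clients" % "0.10.0.1"    // keep the client library explicitly
)
{code}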



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka

2017-06-19 Thread Tim Van Wassenhove (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054130#comment-16054130
 ] 

Tim Van Wassenhove commented on SPARK-21142:


Opened a PR on github: https://github.com/apache/spark/pull/18353

> spark-streaming-kafka-0-10 has too fat dependency on kafka
> --
>
> Key: SPARK-21142
> URL: https://issues.apache.org/jira/browse/SPARK-21142
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Tim Van Wassenhove
>Priority: Minor
>
> Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, whereas it 
> should only need a dependency on kafka-clients.
> The only reason there is a dependency on kafka (the full server) is the 
> KafkaTestUtils class, which runs in-memory tests against a Kafka broker. This 
> class should be moved to src/test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-06-19 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054118#comment-16054118
 ] 

Aleksander Eskilson commented on SPARK-18016:
-

[~cloud_fan], [~divshukla], yeah, I'd be happy to write a backport of the patch 
for 2.1.x.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 

[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054115#comment-16054115
 ] 

Sean Owen commented on SPARK-21137:
---

Try a thread dump on the driver. Until there's some more detail here, I don't 
think anyone would work on this for you, no -- it works the other way around. 
It's mostly on you to investigate and ideally propose a change.
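
For what it's worth, a minimal sketch of one way to take that thread dump from 
inside the driver program itself (running jstack against the driver PID, or 
using the thread dump link on the Executors page of the UI, should work equally 
well):

{code}
import scala.collection.JavaConverters._

// Print a stack trace for every live thread in the driver JVM, which shows
// what the driver is busy with while no tasks have started.
Thread.getAllStackTraces.asScala.foreach { case (thread, frames) =>
  println(s"--- ${thread.getName} (${thread.getState})")
  frames.foreach(frame => println(s"    at $frame"))
}
{code}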

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054111#comment-16054111
 ] 

sam edited comment on SPARK-21137 at 6/19/17 2:36 PM:
--

[~srowen] 

> what stages are executing if any?

None, no tasks are started for any stages. The logs do NOT output any line of 
the form:

{code}
17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ...
{code}

As I have explained, *it takes 2 hours just to output the text (twice) 
"1,227,645 input paths to process"*.  My hand-cranked, single-threaded code 
does the job in 11 minutes. It's not rocket science: it's reading some files 
and writing them back out again.
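
For illustration, a hypothetical sketch of the kind of single-threaded 
list-read-write loop described above (paths and details are invented for 
illustration only; the reporter's actual code is in the linked repo):

{code}
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Hypothetical input directory and output file.
val in  = Paths.get("/data/enron")
val out = Paths.get("/data/enron-flat.txt")

// Walk every file once, escape newlines, and write one line per input file.
val writer = Files.newBufferedWriter(out)
Files.walk(in).iterator().asScala
  .filter(p => Files.isRegularFile(p))
  .foreach { p =>
    val body = new String(Files.readAllBytes(p), "UTF-8")
    writer.write(body.replace("\n", "\\n"))
    writer.newLine()
  }
writer.close()
{code}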

> If no stages are executing, what is the driver executing (thread dump)? 

I don't know what the driver is doing; I'm not trying to debug the issue here, 
I'm just trying to raise a bug.  When the bug is accepted we can start trying 
to debug it and fix it.

> (This is the kind of thing that should go into a mailing list exchange)

If there is a configuration setting along the lines of "Don't spend 2 hours 
just to count how many files to process=true" then indeed I can see this is 
just a silly user error :)  Right now I find it hard to believe such a 
configuration setting exists and so still believe this belongs in a bug JIRA.  
Perhaps we ought to ask some other people to take a look as we don't seem to be 
reaching a conclusion.



was (Author: sams):
[~srowen] 

> what stages are executing if any?

None, no tasks are started. The logs do NOT output any line of the form:

{code}
17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ...
{code}

As I have explained, *it takes 2 hours just to output the text (twice) 
"1,227,645 input paths to process"*.  My hand-cranked, single-threaded code 
does the job in 11 minutes. It's not rocket science: it's reading some files 
and writing them back out again.

> If no stages are executing, what is the driver executing (thread dump)? 

I don't know what the driver is doing; I'm not trying to debug the issue here, 
I'm just trying to raise a bug.  When the bug is accepted we can start trying 
to debug it and fix it.

> (This is the kind of thing that should go into a mailing list exchange)

If there is a configuration setting along the lines of "Don't spend 2 hours 
just to count how many files to process=true" then indeed I can see this is 
just a silly user error :)  Right now I find it hard to believe such a 
configuration setting exists and so still believe this belongs in a bug JIRA.  
Perhaps we ought to ask some other people to take a look as we don't seem to be 
reaching a conclusion.


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054111#comment-16054111
 ] 

sam edited comment on SPARK-21137 at 6/19/17 2:35 PM:
--

[~srowen] 

> what stages are executing if any?

None, no tasks are started. The logs do NOT output any line of the form:

{code}
17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ...
{code}

As I have explained, *it takes 2 hours just to output the text (twice) 
"1,227,645 input paths to process"*.  My hand-cranked, single-threaded code 
does the job in 11 minutes. It's not rocket science: it's reading some files 
and writing them back out again.

> If no stages are executing, what is the driver executing (thread dump)? 

I don't know what the driver is doing; I'm not trying to debug the issue here, 
I'm just trying to raise a bug.  When the bug is accepted we can start trying 
to debug it and fix it.

> (This is the kind of thing that should go into a mailing list exchange)

If there is a configuration setting along the lines of "Don't spend 2 hours 
just to count how many files to process=true" then indeed I can see this is 
just a silly user error :)  Right now I find it hard to believe such a 
configuration setting exists and so still believe this belongs in a bug JIRA.  
Perhaps we ought to ask some other people to take a look as we don't seem to be 
reaching a conclusion.



was (Author: sams):
[~srowen] 

> what stages are executing if any?

*None, no tasks are started*. The logs do NOT output any line of the form:

{code}
17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ...
{code}

As I have explained, *it takes 2 hours just to output the text (twice) 
"1,227,645 input paths to process"*.  My hand-cranked, single-threaded code 
does the job in 11 minutes. It's not rocket science: it's reading some files 
and writing them back out again.

> If no stages are executing, what is the driver executing (thread dump)? 

I don't know what the driver is doing; I'm not trying to debug the issue here, 
I'm just trying to raise a bug.  When the bug is accepted we can start trying 
to debug it and fix it.

> (This is the kind of thing that should go into a mailing list exchange)

If there is a configuration setting along the lines of "Don't spend 2 hours 
just to count how many files to process=true" then indeed I can see this is 
just a silly user error :)  Right now I find it hard to believe such a 
configuration setting exists and so still believe this belongs in a bug JIRA.  
Perhaps we ought to ask some other people to take a look as we don't seem to be 
reaching a conclusion.


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054111#comment-16054111
 ] 

sam commented on SPARK-21137:
-

[~srowen] 

> what stages are executing if any?

*None, no tasks are started*. The logs do NOT output any line of the form:

{code}
17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ...
{code}

As I have explained, *it takes 2 hours just to output the text (twice) 
"1,227,645 input paths to process"*.  My hand-cranked, single-threaded code 
does the job in 11 minutes. It's not rocket science: it's reading some files 
and writing them back out again.

> If no stages are executing, what is the driver executing (thread dump)? 

I don't know what the driver is doing; I'm not trying to debug the issue here, 
I'm just trying to raise a bug.  When the bug is accepted we can start trying 
to debug it and fix it.

> (This is the kind of thing that should go into a mailing list exchange)

If there is a configuration setting along the lines of "Don't spend 2 hours 
just to count how many files to process=true" then indeed I can see this is 
just a silly user error :)  Right now I find it hard to believe such a 
configuration setting exists and so still believe this belongs in a bug JIRA.  
Perhaps we ought to ask some other people to take a look as we don't seem to be 
reaching a conclusion.


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file

2017-06-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054108#comment-16054108
 ] 

Dongjoon Hyun commented on SPARK-19809:
---

Yep. I'm trying to fix this with the new ORC data source. It will be in 2.3.0.

> NullPointerException on empty ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1
>Reporter: Michał Dawid
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> 

[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file

2017-06-19 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054076#comment-16054076
 ] 

Hyukjin Kwon commented on SPARK-19809:
--

What you see is what you get. This is "Reopened" per the discussion above and 
still "Unresolved".

> NullPointerException on empty ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1
>Reporter: Michał Dawid
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Resolved] (SPARK-21141) spark-update --version is hard to parse

2017-06-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21141.
---
Resolution: Not A Problem

[~mprocop] please don't reopen JIRAs. We can reopen if needed. As I say, I 
don't think there is a problem here.

> spark-update --version is hard to parse
> ---
>
> Key: SPARK-21141
> URL: https://issues.apache.org/jira/browse/SPARK-21141
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Debian 8 using hadoop 2.7.2
>Reporter: michael procopio
>
> We need to be able to determine the Spark version in order to 
> reference our jars: one set built for 1.x using Scala 2.10 and the other 
> built for 2.x using Scala 2.11.  spark-update --version returns a lot of 
> extraneous output.  It would be preferable if an option were available that 
> only returned the version number.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054055#comment-16054055
 ] 

Sean Owen commented on SPARK-21140:
---

Yes, it's possible the executor makes a copy of some data during processing. 
Given the overhead of serializing data and merging intermediate buffers, it 
could be largeish.
This isn't a very minimal example, and it doesn't establish that something runs 
out of memory.
There is also no proposal here about what could be done differently, or any 
leads about where a lot of memory is being allocated: serialization?
I don't think this is actionable as is.
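
One way to get the kind of lead asked for here is to measure object footprints 
directly. A minimal sketch using Spark's SizeEstimator developer API (record 
sizes chosen to match the report; the numbers are estimates, not exact 
allocations):

{code}
import org.apache.spark.util.SizeEstimator

// JVM footprint of a single 256-byte record: the raw payload plus the
// array header and padding.
println(SizeEstimator.estimate(new Array[Byte](256)))

// Footprint of a batch of such records held in an outer array, which is
// roughly the shape of what collect() materializes for a partition.
val batch = Array.fill(1000)(new Array[Byte](256))
println(SizeEstimator.estimate(batch))
{code}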

> Reduce collect high memory requirements
> --
>
> Key: SPARK-21140
> URL: https://issues.apache.org/jira/browse/SPARK-21140
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Linux Debian 8 using hadoop 2.7.2.
>Reporter: michael procopio
>
> I wrote a very simple Scala application which used flatMap to create an RDD 
> containing a 512 mb partition of 256 byte arrays.  Experimentally, I 
> determined that spark.executor.memory had to be set at 3 gb in order to 
> collect the data.  This seems extremely high.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21141) spark-update --version is hard to parse

2017-06-19 Thread michael procopio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

michael procopio reopened SPARK-21141:
--

My apologies, I mean spark-submit --version.


> spark-update --version is hard to parse
> ---
>
> Key: SPARK-21141
> URL: https://issues.apache.org/jira/browse/SPARK-21141
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Debian 8 using hadoop 2.7.2
>Reporter: michael procopio
>
> We need to be able to determine the Spark version in order to 
> reference our jars: one set built for 1.x using Scala 2.10 and the other 
> built for 2.x using Scala 2.11.  spark-update --version returns a lot of 
> extraneous output.  It would be preferable if an option were available that 
> only returned the version number.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread michael procopio (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

michael procopio reopened SPARK-21140:
--

I am not sure what detail you are looking for.  I provided the test code I was 
using.  It seems to me that multiple copies of the data must be generated when 
collecting a partition.  Having to set spark.executor.memory to 3 gb to collect 
a partition of 512 mb seems high to me.


> Reduce collect high memory requirements
> --
>
> Key: SPARK-21140
> URL: https://issues.apache.org/jira/browse/SPARK-21140
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Linux Debian 8 using hadoop 2.7.2.
>Reporter: michael procopio
>
> I wrote a very simple Scala application which used flatMap to create an RDD 
> containing a 512 mb partition of 256 byte arrays.  Experimentally, I 
> determined that spark.executor.memory had to be set at 3 gb in order to 
> collect the data.  This seems extremely high.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054043#comment-16054043
 ] 

Sean Owen edited comment on SPARK-21140 at 6/19/17 2:02 PM:


I disagree: executor memory does depend on the size of the partition being 
collected.  A 6 to 1 ratio to collect data seems onerous to me.  It seems like 
multiple copies must be created in the executor.

I found that collecting an RDD containing a single partition of 512 mb took 
3 gb.

Here's the code I was using:

{code}
package com.test

import org.apache.spark._
import org.apache.spark.SparkContext._

object SparkRdd2 {
def main(args: Array[String]) {
  try
  {
  //
  // Process any arguments.
  //
 def parseOptions( map: Map[String,Any], listArgs: List[String]): 
Map[String,Any] = {
 listArgs match {
 case Nil =>
 map
 case "-master" :: value :: tail =>
 parseOptions( map+("master"-> value),tail)
 case "-recordSize" :: value :: tail =>
 parseOptions( map+("recordSize"-> value.toInt),tail)
 case "-partitionSize" :: value :: tail  =>
 parseOptions( map+("partitionSize"-> value.toLong),tail)
 case "-executorMemory" :: value :: tail =>
 parseOptions( map+("executorMemory"-> value),tail)
 case option :: tail =>
 println("unknown option"+option)
 sys.exit(1)
 }
 }
 val listArgs = args.toList
 val optionmap = parseOptions( Map[String,Any](),listArgs)
 val master = optionmap.getOrElse("master","local").asInstanceOf[String]
 val recordSize = 
optionmap.getOrElse("recordSize",128).asInstanceOf[Int]
 val partitionSize = 
optionmap.getOrElse("partitionSize",1024*1024*1024).asInstanceOf[Long]
 val executorMemory = 
optionmap.getOrElse("executorMemory","6g").asInstanceOf[String]
 println(f"Creating single partition of $partitionSize%d with records 
of length $recordSize%d")
 println(f"Setting spark.executor.memory to $executorMemory")
 //
 // Create SparkConf.
 //
 val sparkConf = new SparkConf()
 
sparkConf.setAppName("MyEnvVar").setMaster(master).setExecutorEnv("myenvvar","good")
 sparkConf.set("spark.executor.cores","1")
 sparkConf.set("spark.executor.instances","1")
 sparkConf.set("spark.executor.memory",executorMemory)
 sparkConf.set("spark.eventLog.enabled","true")
 
sparkConf.set("spark.eventLog.dir","hdfs://hadoop01glnxa64:54310/user/mprocopi/spark-events");
 sparkConf.set("spark.driver.maxResultSize","0")
 /*
  sparkConf.set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
  sparkConf.set("spark.kryoserializer.buffer.max","768m")
  sparkConf.set("spark.kryoserializer.buffer","64k")
  */
 //
 // Create SparkContext
 //
 val sc = new SparkContext(sparkConf)
 //
 //
 //
 def createdSizedPartition( recordSize:Int ,partitionSize:Long): 
Iterator[Array[Byte]] = {
var sizeReturned:Long = 0
new Iterator[Array[Byte]] {
override def hasNext(): Boolean = {
   (sizeReturned < partitionSize)
}
override def next(): Array[Byte] = {
   sizeReturned = sizeReturned + recordSize
   new Array[Byte](recordSize)
}
}
}
// assumed reconstruction: build a one-element RDD carrying the sizing tuple
// and flatMap it into the sized partition of byte arrays
val sizedRdd = sc.parallelize(Seq((recordSize, partitionSize)), 1)
  .flatMap(rddInfo => createdSizedPartition(rddInfo._1, rddInfo._2))
 val results = sizedRdd.collect
 var countLines: Int = 0
 var countBytes: Long = 0
 var maxRecord: Int = 0
 for (line <- results) {
 countLines = countLines+1
 countBytes = countBytes+line.length
  if (line.length> maxRecord)
  {
 maxRecord = line.length
  }
 }
 println(f"Collected $countLines%d lines")
 println(f"  $countBytes%d bytes")
 println(f"Max record $maxRecord%d bytes")
  } catch {
   case e: Exception =>
  println("Error in executing application: ", e.getMessage)
  throw e
  }

   }
}
{code}

After building it can be invoked as:

spark-submit --class com.test.SparkRdd2 --driver-memory 10g 
./target/scala-2.11/envtest_2.11-0.0.1.jar -recordSize 256 -partitionSize 
536870912

Allows you to vary the 


was (Author: mprocop):
I disagree executor 

[jira] [Commented] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread michael procopio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054043#comment-16054043
 ] 

michael procopio commented on SPARK-21140:
--

I disagree: executor memory does depend on the size of the partition being 
collected.  A 6 to 1 ratio to collect data seems onerous to me.  It seems like 
multiple copies must be created in the executor.

I found that collecting an RDD containing a single partition of 512 mb took 
3 gb.

Here's the code I was using:

package com.test

import org.apache.spark._
import org.apache.spark.SparkContext._

object SparkRdd2 {
def main(args: Array[String]) {
  try
  {
  //
  // Process any arguments.
  //
 def parseOptions( map: Map[String,Any], listArgs: List[String]): 
Map[String,Any] = {
 listArgs match {
 case Nil =>
 map
 case "-master" :: value :: tail =>
 parseOptions( map+("master"-> value),tail)
 case "-recordSize" :: value :: tail =>
 parseOptions( map+("recordSize"-> value.toInt),tail)
 case "-partitionSize" :: value :: tail  =>
 parseOptions( map+("partitionSize"-> value.toLong),tail)
 case "-executorMemory" :: value :: tail =>
 parseOptions( map+("executorMemory"-> value),tail)
 case option :: tail =>
 println("unknown option"+option)
 sys.exit(1)
 }
 }
 val listArgs = args.toList
 val optionmap = parseOptions( Map[String,Any](),listArgs)
 val master = optionmap.getOrElse("master","local").asInstanceOf[String]
 val recordSize = 
optionmap.getOrElse("recordSize",128).asInstanceOf[Int]
 val partitionSize = 
optionmap.getOrElse("partitionSize",1024*1024*1024).asInstanceOf[Long]
 val executorMemory = 
optionmap.getOrElse("executorMemory","6g").asInstanceOf[String]
 println(f"Creating single partition of $partitionSize%d with records 
of length $recordSize%d")
 println(f"Setting spark.executor.memory to $executorMemory")
 //
 // Create SparkConf.
 //
 val sparkConf = new SparkConf()
 
sparkConf.setAppName("MyEnvVar").setMaster(master).setExecutorEnv("myenvvar","good")
 sparkConf.set("spark.executor.cores","1")
 sparkConf.set("spark.executor.instances","1")
 sparkConf.set("spark.executor.memory",executorMemory)
 sparkConf.set("spark.eventLog.enabled","true")
 
sparkConf.set("spark.eventLog.dir","hdfs://hadoop01glnxa64:54310/user/mprocopi/spark-events");
 sparkConf.set("spark.driver.maxResultSize","0")
 /*
  sparkConf.set("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
  sparkConf.set("spark.kryoserializer.buffer.max","768m")
  sparkConf.set("spark.kryoserializer.buffer","64k")
  */
 //
 // Create SparkContext
 //
 val sc = new SparkContext(sparkConf)
 //
 //
 //
 def createdSizedPartition( recordSize:Int ,partitionSize:Long): 
Iterator[Array[Byte]] = {
var sizeReturned:Long = 0
new Iterator[Array[Byte]] {
override def hasNext(): Boolean = {
   (sizeReturned < partitionSize)
}
override def next(): Array[Byte] = {
   sizeReturned = sizeReturned + recordSize
   new Array[Byte](recordSize)
}
}
}
// assumed reconstruction: build a one-element RDD carrying the sizing tuple
// and flatMap it into the sized partition of byte arrays
val sizedRdd = sc.parallelize(Seq((recordSize, partitionSize)), 1)
  .flatMap(rddInfo => createdSizedPartition(rddInfo._1, rddInfo._2))
 val results = sizedRdd.collect
 var countLines: Int = 0
 var countBytes: Long = 0
 var maxRecord: Int = 0
 for (line <- results) {
 countLines = countLines+1
 countBytes = countBytes+line.length
  if (line.length> maxRecord)
  {
 maxRecord = line.length
  }
 }
 println(f"Collected $countLines%d lines")
 println(f"  $countBytes%d bytes")
 println(f"Max record $maxRecord%d bytes")
  } catch {
   case e: Exception =>
  println("Error in executing application: ", e.getMessage)
  throw e
  }

   }
}

After building it can be invoked as:

spark-submit --class com.test.SparkRdd2 --driver-memory 10g 
./target/scala-2.11/envtest_2.11-0.0.1.jar -recordSize 256 -partitionSize 
536870912

Allows you to vary the 

> Reduce collect high memory requirements
> --
>
>   

[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-21137:

Description: 
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.

So I've provided full reproduce steps here (including code and cluster setup) 
https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
easily just clone, and follow the README to reproduce exactly!

  was:
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.

All the app does is read the files, then try to output them again (escape the 
newlines and write one file per line).

So I've provided full reproduce steps here (including code and cluster setup) 
https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
easily just clone, and follow the README to reproduce exactly!


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 

[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054042#comment-16054042
 ] 

Sean Owen commented on SPARK-21137:
---

Are you sure it's not just appearing to be stuck reading the file when 
downstream processing is happening? What stages are executing, if any? This is 
the sort of thing that needs clarification. If no stages are executing, what is 
the driver executing (thread dump)?
The number of partitions isn't some default from the environment -- look at the 
API call.
Also, what is the number of partitions as observed when you read the RDD? Print 
it.

(This is the kind of thing that should go into a mailing list exchange)
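
For reference, a minimal sketch of the partition check being asked for, 
assuming a SparkContext `sc`; the path and minPartitions value are placeholders:

{code}
// Placeholder input path; substitute the real directory of small files.
val path = "hdfs:///data/enron/*"

// minPartitions is the second argument of wholeTextFiles; the default is
// small, so be explicit when reading a very large number of files.
val rdd = sc.wholeTextFiles(path, minPartitions = 10000)

// Print the partition count actually observed before running the job.
println(s"partitions = ${rdd.getNumPartitions}")
{code}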

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054026#comment-16054026
 ] 

sam edited comment on SPARK-21137 at 6/19/17 1:53 PM:
--

[~srowen]  As I said in the description, which you may have missed, the logs do 
not even reach the point where tasks are started - partitioning cannot be an 
issue here.  If the logs said "Started task" and THEN hung for hours on end, 
then I would see your point, but this is not the case.

FYI, the number of partitions used by `wholeTextFiles` will be the default when 
the job is started on an EMR cluster via the command:

{code}
spark-submit --class scenron.UnzippedToEmailPerRowDistinctAltApp --master 
local[6] --driver-memory 8g /home/hadoop/scenron.jar
{code}

Finally, EVEN if I were running with only 1 partition, this should just be 
roughly equivalent to my hand-cranked code, which takes 11 minutes.


was (Author: sams):
[~srowen]  As I said in the description, which you may have missed, the logs do 
not even reach the point where tasks are started - partitioning cannot be an 
issue here.  If the logs said "Started task" and THEN hung for hours on end, 
then I would see your point, but this is not the case.



> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file

2017-06-19 Thread Renu Yadav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054031#comment-16054031
 ] 

Renu Yadav commented on SPARK-19809:


What is the resolution of this issue? spark.sql.files.ignoreCorruptFiles does 
not work for ORC files.
Please help.
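
Until a fix lands, one possible workaround (a sketch only, assuming Spark 2.x 
with a SparkSession `spark` and a placeholder path; it bypasses the Hive table 
and reads the ORC files directly) is to drop zero-byte files before they reach 
split generation:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder location of the ORC files backing the table.
val dir = new Path("hdfs:///warehouse/my_table")

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Keep only non-empty files so ORC split generation never sees a zero-byte
// file, which is what triggers the NullPointerException in this issue.
val nonEmpty = fs.listStatus(dir)
  .filter(s => s.isFile && s.getLen > 0)
  .map(_.getPath.toString)

val df = spark.read.orc(nonEmpty: _*)
{code}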

> NullPointerException on empty ORC file
> --
>
> Key: SPARK-19809
> URL: https://issues.apache.org/jira/browse/SPARK-19809
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.3, 2.0.2, 2.1.1
>Reporter: Michał Dawid
>
> When reading from hive ORC table if there are some 0 byte files we get 
> NullPointerException:
> {code}java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:240)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
>   at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
>   at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> 

[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054026#comment-16054026
 ] 

sam commented on SPARK-21137:
-

[~srowen]  As I said in the description, which you may have missed, the logs do 
not even reach the point where tasks are started - partitioning cannot be an 
issue here.  If the logs said "Started task" and THEN hung for hours on end, 
then I would see your point, but this is not the case.



> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054004#comment-16054004
 ] 

Sean Owen commented on SPARK-21137:
---

As I say, you're not setting anything about the partitioning here. That's 
likely the issue. You need to start by looking into how many partitions you 
have, how big they are, and how long they're taking to execute. That's one of 
the first steps in diagnosing slow stages: do I have too few (wasted 
parallelism) or way too many (completing in <1s)?

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21141) spark-update --version is hard to parse

2017-06-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21141.
---
Resolution: Not A Problem

There is no spark-update. It is not intended as an API for determining the 
version externally. It's also not exactly hard to parse if you really needed 
to, because it outputs "version x.y.z".

> spark-update --version is hard to parse
> ---
>
> Key: SPARK-21141
> URL: https://issues.apache.org/jira/browse/SPARK-21141
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Debian 8 using hadoop 2.7.2
>Reporter: michael procopio
>
> We need to be able to determine the Spark version in order to reference our 
> jars: one set built for 1.x using Scala 2.10 and the other built for 2.x 
> using Scala 2.11.  spark-update --version returns a lot of extraneous 
> output.  It would be preferable if an option were available that returned 
> only the version number.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21140.
---
Resolution: Invalid

There's no real detail here. Executor memory doesn't directly determine how 
much data you can collect on the driver. Of course, collecting half-gig 
partitions to a driver is going to fail even with 1 partition, because that's 
about the default size of the driver memory.

This should start as a question on a mailing list.
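
A rough sketch of the two points above, with placeholder names (`rdd` stands in 
for whatever is being collected); it is not the reporter's code:

{code}
// collect() materialises the whole RDD on the driver, so the limiting setting is
// driver memory, which is normally passed at submit time:
//   spark-submit --driver-memory 2g ...
//
// If the data only needs to be scanned once on the driver, toLocalIterator pulls
// one partition at a time instead of everything at once.
rdd.toLocalIterator.foreach(println)
{code}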

> Reduce collect high memory requirements
> --
>
> Key: SPARK-21140
> URL: https://issues.apache.org/jira/browse/SPARK-21140
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
> Environment: Linux Debian 8 using hadoop 2.7.2.
>Reporter: michael procopio
>
> I wrote a very simple Scala application which used flatMap to create an RDD 
> containing a 512 MB partition of 256-byte arrays.  Experimentally, I 
> determined that spark.executor.memory had to be set to 3 GB in order to 
> collect the data.  This seems extremely high.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21141) spark-update --version is hard to parse

2017-06-19 Thread michael procopio (JIRA)
michael procopio created SPARK-21141:


 Summary: spark-update --version is hard to parse
 Key: SPARK-21141
 URL: https://issues.apache.org/jira/browse/SPARK-21141
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.1.1
 Environment: Debian 8 using hadoop 2.7.2
Reporter: michael procopio


We need to be able to determine the Spark version in order to reference our 
jars: one set built for 1.x using Scala 2.10 and the other built for 2.x using 
Scala 2.11.  spark-update --version returns a lot of extraneous output.  It 
would be preferable if an option were available that returned only the version 
number.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21140) Reduce collect high memory requirements

2017-06-19 Thread michael procopio (JIRA)
michael procopio created SPARK-21140:


 Summary: Reduce collect high memory requirements
 Key: SPARK-21140
 URL: https://issues.apache.org/jira/browse/SPARK-21140
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 2.1.1
 Environment: Linux Debian 8 using hadoop 2.7.2.

Reporter: michael procopio


I wrote a very simple Scala application which used flatMap to create an RDD 
containing a 512 MB partition of 256-byte arrays.  Experimentally, I determined 
that spark.executor.memory had to be set to 3 GB in order to collect the data.  
This seems extremely high.
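
A minimal sketch of the kind of job described above, with illustrative sizes; 
this is not the reporter's actual code:

{code}
// ~2 million 256-byte arrays in a single partition, roughly 512 MB of payload,
// then collected back to the driver.
val numRecords = (512L * 1024 * 1024 / 256).toInt

val rdd = sc.parallelize(Seq(0), numSlices = 1)
  .flatMap(_ => Iterator.fill(numRecords)(new Array[Byte](256)))

val local = rdd.collect()   // fails unless the driver (not executor) heap is large enough
println(local.length)
{code}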




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053977#comment-16053977
 ] 

sam commented on SPARK-21137:
-

[~srowen]

So I've provided full reproduce steps here (including code and cluster setup) 
https://github.com/samthebest/scenron, scroll down to "Bug In Spark".  You can 
easily just clone, and follow the README to reproduce exactly!

> This reads the files into memory.

Yes, I'm aware that `wholeTextFiles` reads entire files into memory, but all 
the files are rather small.  My hand-cranked code also slurps entire files.

> Also, slow compared to what?

The link includes two versions of the same code. I killed the Spark version 
(after 5 hours of running); my hand-cranked version takes 11 minutes.

> Don't reopen this please. Someone will do that if it's appropriate.

Sorry, like I said, I just thought this was a known issue that no one had 
bothered to add to JIRA because most people just hand-crank their own 
workarounds.

> in the Hadoop APIs

Yes, it's likely that the underlying Hadoop APIs have some yucky code that 
does something silly; I have delved down there before and my stomach cannot 
handle it.  Nevertheless, Spark made the choice to inherit the complexities of 
the Hadoop APIs, and reading many small files seems like a pretty basic use 
case for Spark (come on Sean, this is Enron data!).  It would feel a bit 
perverse to just close this and blame the layer cake underneath.  Spark should 
use its own extensions of the Hadoop APIs where the Hadoop APIs don't work 
(and the Hadoop code is easily extensible).
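
For context, a sketch of what such a "hand-cranked" workaround typically looks 
like; it is not the code in the linked repository, the path and partition count 
are illustrative, and it assumes the Hadoop configuration is available on the 
executors' classpath:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// List the files once on the driver, then read them inside tasks, so no per-file
// split planning happens in the Hadoop input-format layer.
val dir = new Path("hdfs:///data/enron")
val fs = dir.getFileSystem(new Configuration())
val paths = fs.listStatus(dir).filter(_.isFile).map(_.getPath.toString)

val texts = sc.parallelize(paths, numSlices = 64).map { p =>
  val path = new Path(p)
  val in = path.getFileSystem(new Configuration()).open(path)
  try {
    val out = new java.io.ByteArrayOutputStream()
    IOUtils.copyBytes(in, out, 4096, false)
    (p, new String(out.toByteArray, "UTF-8"))
  } finally {
    in.close()
  }
}
{code}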

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-21137:

Description: 
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.

All the app does is read the files, then try to output them again (escape the 
newlines and write one file per line).

So I've provided full reproduce steps here (including code and cluster setup) 
https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
easily just clone, and follow the README to reproduce exactly!

  was:
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.

All the app does is read the files, then try to output them again (escape the 
newlines and write one file per line).


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug 

[jira] [Resolved] (SPARK-20931) Built-in SQL Function ABS support string type

2017-06-19 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-20931.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Built-in SQL Function ABS support string type
> -
>
> Key: SPARK-20931
> URL: https://issues.apache.org/jira/browse/SPARK-20931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>  Labels: starter
> Fix For: 2.3.0
>
>
> {noformat}
> ABS()
> {noformat}
> Hive/MySQL support this.
> Ref: 
> https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21139) java.util.concurrent.RejectedExecutionException: rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, active threads = 0, queued t

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053911#comment-16053911
 ] 

Sean Owen commented on SPARK-21139:
---

That looks like an issue from the HBase client, not Spark.

> java.util.concurrent.RejectedExecutionException: rejected from 
> java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 14109]
> -
>
> Key: SPARK-21139
> URL: https://issues.apache.org/jira/browse/SPARK-21139
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.2
> Environment: use spark1.5.2 and hbase 1.1.2
>Reporter: shining
>
> We create two tables using the Hive HBaseStorageHandler, like:
> CREATE EXTERNAL TABLE `yx_bw`(
>   `rowkey` string, 
>   `occur_time` string, 
>   `milli_second` string, 
>   `yx_id` string , 
>   `resp_area` string , 
>   `st_id` string, 
>   `bay_id` string, 
>   `device_type_id` string, 
>   `content` string, 
> ..)
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>  
> 'hbase.columns.mapping'=':key,f:OCCUR_TIME,f:MILLI_SECOND,f:YX_ID,f:RESP_AREA,f:ST_ID,f:BAY_ID,f:STATUS,f:CONTENT,f:VLTY_ID,f:MEAS_TYPE,f:RESTRAIN_FLAG,f:R
> ESERV_INT1,f:RESERV_INT2,f:CUSTOMIZED_GROUP,f:CONFIRM_STATUS,f:CONFIRM_TIME,f:CONFIRM_USER_ID,f:CONFIRM_NODE_ID,f:IF_DISPLAY',
>'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='yx_bw'')
> Then we use Spark SQL to run a join between the two tables:
> select * from xxgljxb a, yx_bw b where a.YX_ID = b.YX_ID; 
> When scanning the HBase table, we encounter this issue:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 
> (TID 3, localhost): java.lang.RuntimeException: 
> java.util.concurrent.RejectedExecutionException: Task 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@37b2d978
>  rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, 
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14109]
>   at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
>   at 
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
>   at 
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:403)
>   at 
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
>   at 
> org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:205)
>   at 
> org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147)
>   at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.nextKeyValue(TableInputFormatBase.java:216)
>   at 
> org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:156)
>   at 
> org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:114)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> 

[jira] [Created] (SPARK-21139) java.util.concurrent.RejectedExecutionException: rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, active threads = 0, queued tas

2017-06-19 Thread shining (JIRA)
shining created SPARK-21139:
---

 Summary: java.util.concurrent.RejectedExecutionException: rejected 
from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 
0, active threads = 0, queued tasks = 0, completed tasks = 14109]
 Key: SPARK-21139
 URL: https://issues.apache.org/jira/browse/SPARK-21139
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.5.2
 Environment: use spark1.5.2 and hbase 1.1.2
Reporter: shining


We create two tables using the Hive HBaseStorageHandler, like:
CREATE EXTERNAL TABLE `yx_bw`(
  `rowkey` string, 
  `occur_time` string, 
  `milli_second` string, 
  `yx_id` string , 
  `resp_area` string , 
  `st_id` string, 
  `bay_id` string, 
  `device_type_id` string, 
  `content` string, 
..)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
 
'hbase.columns.mapping'=':key,f:OCCUR_TIME,f:MILLI_SECOND,f:YX_ID,f:RESP_AREA,f:ST_ID,f:BAY_ID,f:STATUS,f:CONTENT,f:VLTY_ID,f:MEAS_TYPE,f:RESTRAIN_FLAG,f:R
ESERV_INT1,f:RESERV_INT2,f:CUSTOMIZED_GROUP,f:CONFIRM_STATUS,f:CONFIRM_TIME,f:CONFIRM_USER_ID,f:CONFIRM_NODE_ID,f:IF_DISPLAY',
   'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='yx_bw'')

Then we use Spark SQL to run a join between the two tables:
select * from xxgljxb a, yx_bw b where a.YX_ID = b.YX_ID; 

When scanning the HBase table, we encounter this issue:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 
3, localhost): java.lang.RuntimeException: 
java.util.concurrent.RejectedExecutionException: Task 
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@37b2d978
 rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, 
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14109]
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at 
org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
at 
org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:403)
at 
org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364)
at 
org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:205)
at 
org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147)
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.nextKeyValue(TableInputFormatBase.java:216)
at 
org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:156)
at 
org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:114)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Task 
org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@37b2d978
 rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, 
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14109]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
at 

[jira] [Commented] (SPARK-20568) Delete files after processing in structured streaming

2017-06-19 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053871#comment-16053871
 ] 

Fei Shao commented on SPARK-20568:
--

I do not support this feature either.
If we delete files once they are processed, we cannot do any other work with 
them later.
If we do want this feature, we could add a flag that indicates whether 
processed files should be deleted.
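
Purely as an illustration of the flag being proposed here, a sketch of what it 
might look like at the API surface; "cleanupProcessedFiles" is a hypothetical 
option name, not an existing Spark option, and `jsonSchema` is a placeholder:

{code}
val stream = spark.readStream
  .format("json")
  .schema(jsonSchema)                          // placeholder schema
  .option("cleanupProcessedFiles", "true")     // hypothetical flag: delete inputs once committed
  .load("/data/incoming")
{code}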



> Delete files after processing in structured streaming
> -
>
> Key: SPARK-20568
> URL: https://issues.apache.org/jira/browse/SPARK-20568
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Saul Shanabrook
>
> It would be great to be able to delete files after processing them with 
> structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into 
> Parquet. If the JSON files are not deleted after they are processed, it 
> quickly fills up my hard drive. I originally [posted this on Stack 
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
> make a feature request for it. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053820#comment-16053820
 ] 

Sean Owen commented on SPARK-21137:
---

Here's a hint, or an example of what could be going wrong: you may have a huge 
number of partitions after reading these files, depending on how you read 
them, and you may need to repartition down. Equally, there are situations 
where you end up with just a few. Either might lead to slow processing.

This reads the files into memory. You might be grinding in GC. Probably not, 
because then you'd likely have run out of memory already, but these are the 
kinds of things you need to look into first. Also, slow compared to what?
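
One concrete knob worth checking, assuming the files are read with 
wholeTextFiles (the path and value below are illustrative): the second argument 
is a minimum partition count, which influences how the many small files are 
grouped into partitions.

{code}
val rdd = sc.wholeTextFiles("hdfs:///data/enron", minPartitions = 200)
println(s"partitions: ${rdd.getNumPartitions}")
{code}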

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21138:


Assignee: (was: Apache Spark)

> Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and 
> "spark.hadoop.fs.defaultFS" are different 
> -
>
> Key: SPARK-21138
> URL: https://issues.apache.org/jira/browse/SPARK-21138
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>
> When I set different clusters for "spark.hadoop.fs.defaultFS" and 
> "spark.yarn.stagingDir" as follows:
> {code:java}
> spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
> spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
> {code}
> I got following logs:
> {code:java}
> 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext
> 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618
> 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
> expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

2017-06-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053817#comment-16053817
 ] 

Apache Spark commented on SPARK-21138:
--

User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18352
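
For context on the error itself (a sketch of the general pattern, not the 
contents of that pull request): the "Wrong FS" message indicates the staging 
path is being deleted through the FileSystem of fs.defaultFS rather than the 
FileSystem that actually owns the staging directory. Resolving the FileSystem 
from the path avoids that:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val stagingDirPath =
  new Path("hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618")

// FileSystem.get(hadoopConf) resolves against fs.defaultFS and throws "Wrong FS"
// when the path lives on another cluster; deriving the FileSystem from the path works.
val fs = stagingDirPath.getFileSystem(hadoopConf)
fs.delete(stagingDirPath, true)
{code}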

> Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and 
> "spark.hadoop.fs.defaultFS" are different 
> -
>
> Key: SPARK-21138
> URL: https://issues.apache.org/jira/browse/SPARK-21138
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>
> When I set different clusters for "spark.hadoop.fs.defaultFS" and 
> "spark.yarn.stagingDir" as follows:
> {code:java}
> spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
> spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
> {code}
> I got following logs:
> {code:java}
> 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext
> 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618
> 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
> expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

2017-06-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21138:


Assignee: Apache Spark

> Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and 
> "spark.hadoop.fs.defaultFS" are different 
> -
>
> Key: SPARK-21138
> URL: https://issues.apache.org/jira/browse/SPARK-21138
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Assignee: Apache Spark
>
> When I set different clusters for "spark.hadoop.fs.defaultFS" and 
> "spark.yarn.stagingDir" as follows:
> {code:java}
> spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
> spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
> {code}
> I got following logs:
> {code:java}
> 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext
> 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618
> 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2
> java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
> expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
>   at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-21137.
-

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053808#comment-16053808
 ] 

sam edited comment on SPARK-21137 at 6/19/17 11:14 AM:
---

[~srowen] Sorry about the lack of detail Sean.  I guess I just assumed this was 
a known problem (been working with spark for 3 years and every person I have 
worked with seems to be aware of this problem, just thought I'd actually bother 
creating a JIRA for it so it might actually get looked at some day).

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.

All the app does is read the files, then try to output them again (escape the 
newlines and write one file per line).


was (Author: sams):
[~srowen] Sorry about the lack of detail Sean.  I guess I just assumed this was 
a known problem (been working with spark for 3 years and every person I have 
worked with seems to be aware of this problem, just thought I'd actually bother 
creating a JIRA for it so it might actually get looked at some day).

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

2017-06-19 Thread sharkd tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sharkd tu updated SPARK-21138:
--
Description: 
When I set different clusters for "spark.hadoop.fs.defaultFS" and 
"spark.yarn.stagingDir" as follows:

{code:java}
spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
{code}

I got following logs:

{code:java}
17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext
17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory 
hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618
17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640)
at 
org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
{code}


  was:
When I set different clusters for "spark.hadoop.fs.defaultFS" and 
"spark.yarn.stagingDir" as follows:

{code:java}
spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
{code}

I got following logs:


17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext
17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory 
hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618
17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640)
at 

[jira] [Created] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different

2017-06-19 Thread sharkd tu (JIRA)
sharkd tu created SPARK-21138:
-

 Summary: Cannot delete staging dir when the clusters of 
"spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different 
 Key: SPARK-21138
 URL: https://issues.apache.org/jira/browse/SPARK-21138
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.1.1
Reporter: sharkd tu


When I set different clusters for "spark.hadoop.fs.defaultFS" and 
"spark.yarn.stagingDir" as follows:

{code:java}
spark.hadoop.fs.defaultFS  hdfs://tl-nn-tdw.tencent-distribute.com:54310
spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark
{code}

I got following logs:


17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext
17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory 
hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618
17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2
java.lang.IllegalArgumentException: Wrong FS: 
hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, 
expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640)
at 
org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam reopened SPARK-21137:
-

Reopened after adding detail.

> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
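
For completeness, this is the kind of configuration tweak worth trying first (a sketch with a made-up input path; the listing thread count is a Hadoop-side setting and may not apply to the CombineFileInputFormat path that wholeTextFiles uses):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: knobs that can affect listing/reading of very many small files.
val conf = new SparkConf()
  .setAppName("enron-small-files")
  // Ask Hadoop's FileInputFormat to list input paths with multiple threads.
  .set("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "32")

val sc = new SparkContext(conf)

// minPartitions is only a hint, but with ~1.2M tiny files a few thousand
// partitions keeps individual tasks from becoming either huge or trivially small.
val texts = sc.wholeTextFiles("hdfs:///data/enron/*", minPartitions = 2000)
{code}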



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-21137:

Description: 
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.


  was:
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.




> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21137.
---
Resolution: Invalid

Don't reopen this please. Someone will do that if it's appropriate.

This still doesn't establish that it's not either a) processing in your code or 
b) in the Hadoop APIs. There's no link to any external issues here.
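
One way to establish that is to time the raw Hadoop listing on its own, outside Spark; if that alone takes an hour, the bottleneck is the NameNode round-trips rather than anything in Spark (sketch only, hypothetical input directory):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: count the files under the input dir using only the Hadoop client.
val fs = FileSystem.get(new Configuration())
val start = System.nanoTime()
val it = fs.listFiles(new Path("/data/enron"), true)  // recursive listing
var n = 0L
while (it.hasNext) { it.next(); n += 1 }
println(s"Listed $n files in ${(System.nanoTime() - start) / 1e9} s")
{code}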


> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)

2017-06-19 Thread sam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-21137:

Description: 
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.

All the app does is read the files, then try to output them again (escape the 
newlines and write one file per line).

  was:
A very common use case in big data is to read a large number of small files.  
For example the Enron email dataset has 1,227,645 small files.

When one tries to read this data using Spark one will hit many issues.  
Firstly, even if the data is small (each file only say 1K) any job can take a 
very long time (I have a simple job that has been running for 3 hours and has 
not yet got to the point of starting any tasks, I doubt if it will ever finish).

It seems all the code in Spark that manages file listing is single threaded and 
not well optimised.  When I hand crank the code and don't use Spark, my job 
runs much faster.

Is it possible that I'm missing some configuration option? It seems kinda 
surprising to me that Spark cannot read Enron data given that it's such a 
quintessential example.

So it takes 1 hour to output a line "1,227,645 input paths to process", it then 
takes another hour to output the same line. Then it outputs a CSV of all the 
input paths (so creates a text storm).

Now it's been stuck on the following:

{code}
17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
{code}

for 2.5 hours.



> Spark cannot read many small files (wholeTextFiles)
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: sam
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> All the app does is read the files, then try to output them again (escape the 
> newlines and write one file per line).
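
A sketch of the hand-cranked route that sidesteps wholeTextFiles' split computation: list the paths once, then let each task open its own files (this assumes an existing SparkContext sc; the paths and partition count are illustrative, and error handling is omitted):

{code}
import java.io.ByteArrayOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

// Collect the fully qualified paths once up front (or read them from a manifest).
val rootFs = FileSystem.get(new Configuration())
val paths = scala.collection.mutable.ArrayBuffer[String]()
val it = rootFs.listFiles(new Path("/data/enron"), true)
while (it.hasNext) paths += it.next().getPath.toString

// Each task opens its own files; no driver-side split computation involved.
val oneLinePerFile = sc.parallelize(paths.toSeq, numSlices = 2000).map { p =>
  val path = new Path(p)
  val fs = path.getFileSystem(new Configuration())  // resolved on the executor
  val in = fs.open(path)
  val out = new ByteArrayOutputStream()
  try IOUtils.copyBytes(in, out, 4096, false) finally in.close()
  new String(out.toByteArray, "UTF-8").replace("\n", "\\n")  // one escaped line per file
}

oneLinePerFile.saveAsTextFile("/data/enron-one-line-per-file")
{code}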



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


