[jira] [Created] (SPARK-21149) Add job description API for R
Felix Cheung created SPARK-21149: Summary: Add job description API for R Key: SPARK-21149 URL: https://issues.apache.org/jira/browse/SPARK-21149 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.3.0 Reporter: Felix Cheung Priority: Minor see SPARK-21125 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master
[ https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21148: Assignee: Apache Spark > Set SparkUncaughtExceptionHandler to the Master > --- > > Key: SPARK-21148 > URL: https://issues.apache.org/jira/browse/SPARK-21148 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K >Assignee: Apache Spark > > Any one thread of the Master gets any of the UncaughtException then the > thread gets terminate and the Master process keeps running without > functioning properly. > I think we need to handle the UncaughtException and exit the Master > gracefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master
[ https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21148: Assignee: (was: Apache Spark) > Set SparkUncaughtExceptionHandler to the Master > --- > > Key: SPARK-21148 > URL: https://issues.apache.org/jira/browse/SPARK-21148 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K > > Any one thread of the Master gets any of the UncaughtException then the > thread gets terminate and the Master process keeps running without > functioning properly. > I think we need to handle the UncaughtException and exit the Master > gracefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master
[ https://issues.apache.org/jira/browse/SPARK-21148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055070#comment-16055070 ] Apache Spark commented on SPARK-21148: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/18358 > Set SparkUncaughtExceptionHandler to the Master > --- > > Key: SPARK-21148 > URL: https://issues.apache.org/jira/browse/SPARK-21148 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K > > Any one thread of the Master gets any of the UncaughtException then the > thread gets terminate and the Master process keeps running without > functioning properly. > I think we need to handle the UncaughtException and exit the Master > gracefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20889) SparkR grouped documentation for Column methods
[ https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20889. -- Resolution: Fixed Assignee: Wayne Zhang Fix Version/s: 2.3.0 Target Version/s: 2.3.0 > SparkR grouped documentation for Column methods > --- > > Key: SPARK-20889 > URL: https://issues.apache.org/jira/browse/SPARK-20889 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: documentation > Fix For: 2.3.0 > > > Group the documentation of individual methods defined for the Column class. > This aims to create the following improvements: > - Centralized documentation for easy navigation (user can view multiple > related methods on one single page). > - Reduced number of items in Seealso. > - Betters examples using shared data. This avoids creating a data frame for > each function if they are documented separately. And more importantly, user > can copy and paste to run them directly! > - Cleaner structure and much fewer Rd files (remove a large number of Rd > files). > - Remove duplicated definition of param (since they share exactly the same > argument). > - No need to write meaningless examples for trivial functions (because of > grouping). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21148) Set SparkUncaughtExceptionHandler to the Master
Devaraj K created SPARK-21148: - Summary: Set SparkUncaughtExceptionHandler to the Master Key: SPARK-21148 URL: https://issues.apache.org/jira/browse/SPARK-21148 Project: Spark Issue Type: Improvement Components: Deploy, Spark Core Affects Versions: 2.1.1 Reporter: Devaraj K Any one thread of the Master gets any of the UncaughtException then the thread gets terminate and the Master process keeps running without functioning properly. I think we need to handle the UncaughtException and exit the Master gracefully. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055049#comment-16055049 ] Takeshi Yamamuro commented on SPARK-21144: -- okay, I'm currently looking into this. > Unexpected results when the data schema and partition schema have the > duplicate columns > --- > > Key: SPARK-21144 > URL: https://issues.apache.org/jira/browse/SPARK-21144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > withTempPath { dir => > val basePath = dir.getCanonicalPath > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=1").toString) > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=a").toString) > spark.read.parquet(basePath).show() > } > {noformat} > The result of the above case is > {noformat} > +---+ > |foo| > +---+ > | 1| > | 1| > | a| > | a| > | 1| > | a| > +---+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21147) the schema of socket source can not be set.
Fei Shao created SPARK-21147: Summary: the schema of socket source can not be set. Key: SPARK-21147 URL: https://issues.apache.org/jira/browse/SPARK-21147 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Environment: Win7,spark 2.1.0 Reporter: Fei Shao The schema set for DataStreamReader can not work. The code is shown as below: val line = ss.readStream.format("socket") .option("ip",xxx) .option("port",xxx) .schema( StructField("name",StringType)::StructField("area",StringType)::Nil) .load line.printSchema The printSchema prints: root |--value:String(nullable=true) According to the code, it should print the schema set by schema(). Suggestion from Michael Armbrust: throw an exception saying that you can't set schema here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE
[ https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21133: --- Assignee: Yuming Wang > HighlyCompressedMapStatus#writeExternal throws NPE > -- > > Key: SPARK-21133 > URL: https://issues.apache.org/jira/browse/SPARK-21133 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Blocker > Fix For: 2.2.0 > > > Reproduce, set {{set spark.sql.shuffle.partitions>2000}} with shuffle, for > simple: > {code:sql} > spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7 -e " > set spark.sql.shuffle.partitions=2001; > drop table if exists spark_hcms_npe; > create table spark_hcms_npe as select id, count(*) from big_table group by > id; > " > {code} > Error logs: > {noformat} > 17/06/18 15:00:27 ERROR Utils: Exception encountered > java.lang.NullPointerException > at > org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) > at > java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) > at > org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) > at > java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) > at > org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Resolved] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE
[ https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21133. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18343 [https://github.com/apache/spark/pull/18343] > HighlyCompressedMapStatus#writeExternal throws NPE > -- > > Key: SPARK-21133 > URL: https://issues.apache.org/jira/browse/SPARK-21133 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Priority: Blocker > Fix For: 2.2.0 > > > Reproduce, set {{set spark.sql.shuffle.partitions>2000}} with shuffle, for > simple: > {code:sql} > spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7 -e " > set spark.sql.shuffle.partitions=2001; > drop table if exists spark_hcms_npe; > create table spark_hcms_npe as select id, count(*) from big_table group by > id; > " > {code} > Error logs: > {noformat} > 17/06/18 15:00:27 ERROR Utils: Exception encountered > java.lang.NullPointerException > at > org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) > at > java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) > at > org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException > java.io.IOException: java.lang.NullPointerException > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310) > at > org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) > at > java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at > org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) > at > org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Commented] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException
[ https://issues.apache.org/jira/browse/SPARK-21146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055024#comment-16055024 ] Apache Spark commented on SPARK-21146: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/18357 > Worker should handle and shutdown when any thread gets UncaughtException > > > Key: SPARK-21146 > URL: https://issues.apache.org/jira/browse/SPARK-21146 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K > > {code:xml} > 17/06/19 11:41:23 INFO Worker: Asked to launch executor > app-20170619114055-0005/228 for ScalaSort > Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: > unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:714) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) > at > java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > I see in the logs that Worker's dispatcher-event got the above exception and > the Worker keeps running without performing any functionality. And also > Worker state changed from ALIVE to DEAD in Master's web UI. > {code:xml} > worker-20170619150349-192.168.1.120-56175 192.168.1.120:56175 DEAD > 88 (41 Used)251.2 GB (246.0 GB Used) > {code} > I think Worker should handle and shutdown when any thread gets > UncaughtException. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException
[ https://issues.apache.org/jira/browse/SPARK-21146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21146: Assignee: Apache Spark > Worker should handle and shutdown when any thread gets UncaughtException > > > Key: SPARK-21146 > URL: https://issues.apache.org/jira/browse/SPARK-21146 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K >Assignee: Apache Spark > > {code:xml} > 17/06/19 11:41:23 INFO Worker: Asked to launch executor > app-20170619114055-0005/228 for ScalaSort > Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: > unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:714) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) > at > java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > I see in the logs that Worker's dispatcher-event got the above exception and > the Worker keeps running without performing any functionality. And also > Worker state changed from ALIVE to DEAD in Master's web UI. > {code:xml} > worker-20170619150349-192.168.1.120-56175 192.168.1.120:56175 DEAD > 88 (41 Used)251.2 GB (246.0 GB Used) > {code} > I think Worker should handle and shutdown when any thread gets > UncaughtException. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException
[ https://issues.apache.org/jira/browse/SPARK-21146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21146: Assignee: (was: Apache Spark) > Worker should handle and shutdown when any thread gets UncaughtException > > > Key: SPARK-21146 > URL: https://issues.apache.org/jira/browse/SPARK-21146 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Devaraj K > > {code:xml} > 17/06/19 11:41:23 INFO Worker: Asked to launch executor > app-20170619114055-0005/228 for ScalaSort > Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: > unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:714) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) > at > java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > I see in the logs that Worker's dispatcher-event got the above exception and > the Worker keeps running without performing any functionality. And also > Worker state changed from ALIVE to DEAD in Master's web UI. > {code:xml} > worker-20170619150349-192.168.1.120-56175 192.168.1.120:56175 DEAD > 88 (41 Used)251.2 GB (246.0 GB Used) > {code} > I think Worker should handle and shutdown when any thread gets > UncaughtException. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21144: Assignee: Apache Spark > Unexpected results when the data schema and partition schema have the > duplicate columns > --- > > Key: SPARK-21144 > URL: https://issues.apache.org/jira/browse/SPARK-21144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {noformat} > withTempPath { dir => > val basePath = dir.getCanonicalPath > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=1").toString) > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=a").toString) > spark.read.parquet(basePath).show() > } > {noformat} > The result of the above case is > {noformat} > +---+ > |foo| > +---+ > | 1| > | 1| > | a| > | a| > | 1| > | a| > +---+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16055015#comment-16055015 ] Apache Spark commented on SPARK-21144: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/18356 > Unexpected results when the data schema and partition schema have the > duplicate columns > --- > > Key: SPARK-21144 > URL: https://issues.apache.org/jira/browse/SPARK-21144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > withTempPath { dir => > val basePath = dir.getCanonicalPath > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=1").toString) > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=a").toString) > spark.read.parquet(basePath).show() > } > {noformat} > The result of the above case is > {noformat} > +---+ > |foo| > +---+ > | 1| > | 1| > | a| > | a| > | 1| > | a| > +---+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21144: Assignee: (was: Apache Spark) > Unexpected results when the data schema and partition schema have the > duplicate columns > --- > > Key: SPARK-21144 > URL: https://issues.apache.org/jira/browse/SPARK-21144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > withTempPath { dir => > val basePath = dir.getCanonicalPath > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=1").toString) > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=a").toString) > spark.read.parquet(basePath).show() > } > {noformat} > The result of the above case is > {noformat} > +---+ > |foo| > +---+ > | 1| > | 1| > | a| > | a| > | 1| > | a| > +---+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21146) Worker should handle and shutdown when any thread gets UncaughtException
Devaraj K created SPARK-21146: - Summary: Worker should handle and shutdown when any thread gets UncaughtException Key: SPARK-21146 URL: https://issues.apache.org/jira/browse/SPARK-21146 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1 Reporter: Devaraj K {code:xml} 17/06/19 11:41:23 INFO Worker: Asked to launch executor app-20170619114055-0005/228 for ScalaSort Exception in thread "dispatcher-event-loop-79" java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1018) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} I see in the logs that Worker's dispatcher-event got the above exception and the Worker keeps running without performing any functionality. And also Worker state changed from ALIVE to DEAD in Master's web UI. {code:xml} worker-20170619150349-192.168.1.120-56175 192.168.1.120:56175 DEAD 88 (41 Used)251.2 GB (246.0 GB Used) {code} I think Worker should handle and shutdown when any thread gets UncaughtException. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18191) Port RDD API to use commit protocol
[ https://issues.apache.org/jira/browse/SPARK-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054978#comment-16054978 ] Shridhar Ramachandran commented on SPARK-18191: --- I see this got committed only in 2.2.0 and 2.1.0 as mentioned. Can the "Fix Version/s" be changed accordingly? > Port RDD API to use commit protocol > --- > > Key: SPARK-18191 > URL: https://issues.apache.org/jira/browse/SPARK-18191 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Jiang Xingbo > Fix For: 2.1.0 > > > Commit protocol is actually not specific to SQL. We can move it over to core > so the RDD API can use it too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18191) Port RDD API to use commit protocol
[ https://issues.apache.org/jira/browse/SPARK-18191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054978#comment-16054978 ] Shridhar Ramachandran edited comment on SPARK-18191 at 6/20/17 12:11 AM: - I see this got committed only in 2.2.0 and not 2.1.0 as mentioned. Can the "Fix Version/s" be changed accordingly? was (Author: shridharama): I see this got committed only in 2.2.0 and 2.1.0 as mentioned. Can the "Fix Version/s" be changed accordingly? > Port RDD API to use commit protocol > --- > > Key: SPARK-18191 > URL: https://issues.apache.org/jira/browse/SPARK-18191 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Jiang Xingbo > Fix For: 2.1.0 > > > Commit protocol is actually not specific to SQL. We can move it over to core > so the RDD API can use it too. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8642) Ungraceful failure when yarn client is not configured.
[ https://issues.apache.org/jira/browse/SPARK-8642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-8642. --- Resolution: Won't Fix Even though a better error here would be nice, I'll close this because there's no good way to do this with the current YARN API. As silly as it may be, "0.0.0.0:8032" is a valid address - it would work if the RM was running on the same host as the Spark app, so it sort of makes sense as a default. (I'd prefer a null default and an error if it's not set, but that boat has sailed long ago.) It also doesn't look like you can configure the retry policy used by the YARN client... So yeah, it sucks, but there's not much that Spark can do here without making assumptions about how configuration should be deployed. > Ungraceful failure when yarn client is not configured. > -- > > Key: SPARK-8642 > URL: https://issues.apache.org/jira/browse/SPARK-8642 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0, 1.3.1 >Reporter: Juliet Hougland >Priority: Minor > Attachments: yarnretries.log > > > When HADOOP_CONF_DIR is not configured (ie yarn-site.xml is not available) > the yarn client will try to submit an application. No connection to the > resource manager will be able to be established. The client will try to > connect 10 times (with a max retry of ten), and then do that 30 more time. > This takes about 5 minutes before an Error is recorded for spark context > initialization, which is caused by a connect exception. I would expect that > after the first 1- tries fail, the initialization of the spark context should > fail too. At least that is what I would think given the logs. An earlier > failure would be ideal/preferred. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore
[ https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21145: Assignee: Apache Spark (was: Tathagata Das) > Restarted queries reuse same StateStoreProvider, causing multiple concurrent > tasks to update same StateStore > > > Key: SPARK-21145 > URL: https://issues.apache.org/jira/browse/SPARK-21145 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Apache Spark > > StateStoreProvider instances are loaded on-demand in a executor when a query > is started. When a query is restarted, the loaded provider instance will get > reused. Now, there is a non-trivial chance, that the task of the previous > query run is still running, while the tasks of the restarted run has started. > So for a stateful partition, there may be two concurrent tasks related to the > same stateful partition, and there for using the same provider instance. This > can lead to inconsistent results and possibly random failures, as state store > implementations are not designed to be thread-safe. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore
[ https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054945#comment-16054945 ] Apache Spark commented on SPARK-21145: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/18355 > Restarted queries reuse same StateStoreProvider, causing multiple concurrent > tasks to update same StateStore > > > Key: SPARK-21145 > URL: https://issues.apache.org/jira/browse/SPARK-21145 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > StateStoreProvider instances are loaded on-demand in a executor when a query > is started. When a query is restarted, the loaded provider instance will get > reused. Now, there is a non-trivial chance, that the task of the previous > query run is still running, while the tasks of the restarted run has started. > So for a stateful partition, there may be two concurrent tasks related to the > same stateful partition, and there for using the same provider instance. This > can lead to inconsistent results and possibly random failures, as state store > implementations are not designed to be thread-safe. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore
[ https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21145: Assignee: Tathagata Das (was: Apache Spark) > Restarted queries reuse same StateStoreProvider, causing multiple concurrent > tasks to update same StateStore > > > Key: SPARK-21145 > URL: https://issues.apache.org/jira/browse/SPARK-21145 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > StateStoreProvider instances are loaded on-demand in a executor when a query > is started. When a query is restarted, the loaded provider instance will get > reused. Now, there is a non-trivial chance, that the task of the previous > query run is still running, while the tasks of the restarted run has started. > So for a stateful partition, there may be two concurrent tasks related to the > same stateful partition, and there for using the same provider instance. This > can lead to inconsistent results and possibly random failures, as state store > implementations are not designed to be thread-safe. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore
Tathagata Das created SPARK-21145: - Summary: Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore Key: SPARK-21145 URL: https://issues.apache.org/jira/browse/SPARK-21145 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Tathagata Das Assignee: Tathagata Das StateStoreProvider instances are loaded on-demand in a executor when a query is started. When a query is restarted, the loaded provider instance will get reused. Now, there is a non-trivial chance, that the task of the previous query run is still running, while the tasks of the restarted run has started. So for a stateful partition, there may be two concurrent tasks related to the same stateful partition, and there for using the same provider instance. This can lead to inconsistent results and possibly random failures, as state store implementations are not designed to be thread-safe. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
[ https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21138. Resolution: Fixed Assignee: sharkd tu Fix Version/s: 2.3.0 2.2.1 2.1.2 2.0.3 > Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and > "spark.hadoop.fs.defaultFS" are different > - > > Key: SPARK-21138 > URL: https://issues.apache.org/jira/browse/SPARK-21138 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: sharkd tu >Assignee: sharkd tu > Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0 > > > When I set different clusters for "spark.hadoop.fs.defaultFS" and > "spark.yarn.stagingDir" as follows: > {code:java} > spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 > spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark > {code} > I got following logs: > {code:java} > 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext > 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster > with SUCCEEDED > 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be > successfully unregistered. > 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618 > 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2 > java.lang.IllegalArgumentException: Wrong FS: > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, > expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21124) Wrong user shown in UI when using kerberos
[ https://issues.apache.org/jira/browse/SPARK-21124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21124. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.3.0 > Wrong user shown in UI when using kerberos > -- > > Key: SPARK-21124 > URL: https://issues.apache.org/jira/browse/SPARK-21124 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.3.0 > > > When submitting an app to a kerberos-secured cluster, the OS user and the > user running the application may differ. Although it may also happen in > cluster mode depending on the cluster manager's configuration, it's more > common in client mode. > The UI should show enough information about user running the application to > correctly identify the actual user. The "app user" can be easily retrieved > via {{Utils.getCurrentUserName()}}, so it's mostly a matter of how to record > this information (for showing in replayed applications) and how to present it > in the UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21144: Target Version/s: 2.2.0 > Unexpected results when the data schema and partition schema have the > duplicate columns > --- > > Key: SPARK-21144 > URL: https://issues.apache.org/jira/browse/SPARK-21144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > withTempPath { dir => > val basePath = dir.getCanonicalPath > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=1").toString) > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=a").toString) > spark.read.parquet(basePath).show() > } > {noformat} > The result of the above case is > {noformat} > +---+ > |foo| > +---+ > | 1| > | 1| > | a| > | a| > | 1| > | a| > +---+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns
[ https://issues.apache.org/jira/browse/SPARK-21144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054788#comment-16054788 ] Xiao Li commented on SPARK-21144: - cc [~maropu] > Unexpected results when the data schema and partition schema have the > duplicate columns > --- > > Key: SPARK-21144 > URL: https://issues.apache.org/jira/browse/SPARK-21144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > withTempPath { dir => > val basePath = dir.getCanonicalPath > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=1").toString) > spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, > "foo=a").toString) > spark.read.parquet(basePath).show() > } > {noformat} > The result of the above case is > {noformat} > +---+ > |foo| > +---+ > | 1| > | 1| > | a| > | a| > | 1| > | a| > +---+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054785#comment-16054785 ] Apache Spark commented on SPARK-18016: -- User 'bdrillard' has created a pull request for this issue: https://github.com/apache/spark/pull/18354 > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Aleksander Eskilson > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at >
[jira] [Created] (SPARK-21144) Unexpected results when the data schema and partition schema have the duplicate columns
Xiao Li created SPARK-21144: --- Summary: Unexpected results when the data schema and partition schema have the duplicate columns Key: SPARK-21144 URL: https://issues.apache.org/jira/browse/SPARK-21144 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li {noformat} withTempPath { dir => val basePath = dir.getCanonicalPath spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString) spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString) spark.read.parquet(basePath).show() } {noformat} The result of the above case is {noformat} +---+ |foo| +---+ | 1| | 1| | a| | a| | 1| | a| +---+ {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054784#comment-16054784 ] Aleksander Eskilson commented on SPARK-18016: - [~cloud_fan], [~divshukla], I've created a PR for the backport patch, [#18354|https://github.com/apache/spark/pull/18354]. > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Aleksander Eskilson > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at
[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version
[ https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054707#comment-16054707 ] Ryan Williams commented on SPARK-21143: --- [~zsxwing] bq. it's too risky to upgrade from 4.0.X to 4.1.X makes sense, I wasn't meaning to suggest that as the action to take bq. The reason you cannot use 4.0.42.Final is because you are using 4.1.X APIs? I'm depending on [google-cloud-nio|https://github.com/GoogleCloudPlatform/google-cloud-java/tree/v0.10.0/google-cloud-contrib/google-cloud-nio] while running Spark apps in Google Cloud; and it depends transitively on Netty 4.1.6.Final: {code} org.hammerlab:google-cloud-nio:jar:0.10.0-alpha \- com.google.cloud:google-cloud-storage:jar:0.10.0-beta:compile \- com.google.cloud:google-cloud-core:jar:0.10.0-alpha:compile \- com.google.api:gax:jar:0.4.0:compile \- io.grpc:grpc-netty:jar:1.0.3:compile +- io.netty:netty-handler-proxy:jar:4.1.6.Final:compile {code} Luckily, google-cloud-nio [publishes a shaded JAR|http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22com.google.cloud%22%20AND%20a%3A%22google-cloud-nio%22] with [all dependencies shaded+renamed|https://github.com/GoogleCloudPlatform/google-cloud-java/blob/v0.10.0/google-cloud-contrib/google-cloud-nio/pom.xml#L101-L129], so I can just use that to avoid this conflict. I mostly interpret this kind of issue as a nudge toward increasingly isolating Spark's classpath (by preemptive shading+renaming of some or all dependencies), so that these kinds of issues don't happen. [~sowen] thanks for the pointer, gtk that upgrade is in progress. This is narrowly a Netty 4.0 vs. 4.1 conflict, but per the above could be interpreted as a shading / classpath-isolation concern. Anyway, feel free to triage as you like, I just doubt I'll be the last person to see these stack traces and didn't see any good google hits about them yet. > Fail to fetch blocks >1MB in size in presence of conflicting Netty version > -- > > Key: SPARK-21143 > URL: https://issues.apache.org/jira/browse/SPARK-21143 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Ryan Williams >Priority: Minor > > One of my spark libraries inherited a transitive-dependency on Netty > 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I > wanted to document: fetches of blocks larger than 1MB (pre-compression, > afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s > and ultimately stage failures. > I put a minimal repro in [this github > repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on > a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at > 1033 {{Array}}'s it dies in a confusing way. > Not sure what fixing/mitigating this in Spark would look like, other than > defensively shading+renaming netty. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11170) EOFException on History server reading in progress lz4
[ https://issues.apache.org/jira/browse/SPARK-11170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054701#comment-16054701 ] remoteServer commented on SPARK-11170: -- Do we have steps to reproduce the issue? I have enabled spark event log compression, lz4 files are generated, however, I do not see any error message in the History Server logs. I am in Spark 2.1.0 > EOFException on History server reading in progress lz4 > > > Key: SPARK-11170 > URL: https://issues.apache.org/jira/browse/SPARK-11170 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.1 > Environment: HDP: 2.3.2.0-2950 (Hadoop 2.7.1.2.3.2.0-2950) > Spark: 1.5.x (c27e1904) >Reporter: Sebastian YEPES FERNANDEZ > > The Spark History server is not able to read/save the jobs history if Spark > is configured to use > "spark.io.compression.codec=org.apache.spark.io.LZ4CompressionCodec", it > continuously generated the following error: > {code} > ERROR 2015-10-16 16:21:39 org.apache.spark.deploy.history.FsHistoryProvider: > Exception encountered when attempting to load application log > hdfs://DATA/user/spark/his > tory/application_1444297190346_0073_1.lz4.inprogress > java.io.EOFException: Stream ended prematurely > at > net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:218) > at > net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:150) > at > net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:117) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:457) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:292) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$10.apply(FsHistoryProvider.scala:289) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:289) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$1$$anon$2.run(FsHistoryProvider.scala:210) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > INFO 2015-10-16 16:21:39 org.apache.spark.deploy.history.FsHistoryProvider: > Replaying log path: > hdfs://DATA/user/spark/history/application_1444297190346_0072_1.lz4.i > nprogress > {code} > As a workaround setting > "spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec" > makes the History server work correctly -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054690#comment-16054690 ] Cody Koeninger commented on SPARK-20928: Cool, can you label it SPIP so it shows up linked from http://spark.apache.org/improvement-proposals.html My only concern so far was the one I mentioned already, namely that it seems like you shouldn't have to give up exactly-once delivery semantics in all cases. > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version
[ https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054631#comment-16054631 ] Sean Owen commented on SPARK-21143: --- If this reduces to a 4.0 vs 4.1 conflict, then this is SPARK-19552 and we gotta make that change perhaps in the next minor release. > Fail to fetch blocks >1MB in size in presence of conflicting Netty version > -- > > Key: SPARK-21143 > URL: https://issues.apache.org/jira/browse/SPARK-21143 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Ryan Williams >Priority: Minor > > One of my spark libraries inherited a transitive-dependency on Netty > 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I > wanted to document: fetches of blocks larger than 1MB (pre-compression, > afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s > and ultimately stage failures. > I put a minimal repro in [this github > repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on > a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at > 1033 {{Array}}'s it dies in a confusing way. > Not sure what fixing/mitigating this in Spark would look like, other than > defensively shading+renaming netty. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
[ https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21142: Assignee: Apache Spark > spark-streaming-kafka-0-10 has too fat dependency on kafka > -- > > Key: SPARK-21142 > URL: https://issues.apache.org/jira/browse/SPARK-21142 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 >Reporter: Tim Van Wassenhove >Assignee: Apache Spark >Priority: Minor > > Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it > should only need a dependency on kafka-clients. > The only reason there is a dependency on kafka (full server) is due to > KafkaTestUtils class to run in memory tests against a kafka broker. This > class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
[ https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054612#comment-16054612 ] Apache Spark commented on SPARK-21142: -- User 'timvw' has created a pull request for this issue: https://github.com/apache/spark/pull/18353 > spark-streaming-kafka-0-10 has too fat dependency on kafka > -- > > Key: SPARK-21142 > URL: https://issues.apache.org/jira/browse/SPARK-21142 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 >Reporter: Tim Van Wassenhove >Priority: Minor > > Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it > should only need a dependency on kafka-clients. > The only reason there is a dependency on kafka (full server) is due to > KafkaTestUtils class to run in memory tests against a kafka broker. This > class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
[ https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21142: Assignee: (was: Apache Spark) > spark-streaming-kafka-0-10 has too fat dependency on kafka > -- > > Key: SPARK-21142 > URL: https://issues.apache.org/jira/browse/SPARK-21142 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 >Reporter: Tim Van Wassenhove >Priority: Minor > > Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it > should only need a dependency on kafka-clients. > The only reason there is a dependency on kafka (full server) is due to > KafkaTestUtils class to run in memory tests against a kafka broker. This > class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21133) HighlyCompressedMapStatus#writeExternal throws NPE
[ https://issues.apache.org/jira/browse/SPARK-21133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-21133: - Target Version/s: 2.2.0 Priority: Blocker (was: Major) Description: Reproduce, set {{set spark.sql.shuffle.partitions>2000}} with shuffle, for simple: {code:sql} spark-sql --executor-memory 12g --driver-memory 8g --executor-cores 7 -e " set spark.sql.shuffle.partitions=2001; drop table if exists spark_hcms_npe; create table spark_hcms_npe as select id, count(*) from big_table group by id; " {code} Error logs: {noformat} 17/06/18 15:00:27 ERROR Utils: Exception encountered java.lang.NullPointerException at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/06/18 15:00:27 ERROR MapOutputTrackerMaster: java.lang.NullPointerException java.io.IOException: java.lang.NullPointerException at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1310) at org.apache.spark.scheduler.HighlyCompressedMapStatus.writeExternal(MapStatus.scala:167) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1459) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1430) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply$mcV$sp(MapOutputTracker.scala:617) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$1.apply(MapOutputTracker.scala:616) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) at org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:351) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply$mcV$sp(MapStatus.scala:171) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.scheduler.HighlyCompressedMapStatus$$anonfun$writeExternal$2.apply(MapStatus.scala:167) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1303) ... 17 more 17/06/18
[jira] [Commented] (SPARK-21102) Refresh command is too aggressive in parsing
[ https://issues.apache.org/jira/browse/SPARK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054600#comment-16054600 ] Reynold Xin commented on SPARK-21102: - Can you submit a pull request so we can discuss the details of implementation there? > Refresh command is too aggressive in parsing > > > Key: SPARK-21102 > URL: https://issues.apache.org/jira/browse/SPARK-21102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > Labels: starter > > SQL REFRESH command parsing is way too aggressive: > {code} > | REFRESH TABLE tableIdentifier > #refreshTable > | REFRESH .*? > #refreshResource > {code} > We should change it so it takes the whole string (without space), or a quoted > string. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20928) Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054596#comment-16054596 ] Michael Armbrust commented on SPARK-20928: -- Hi Cody, I do plan to flesh this out with the other sections of the SIP document and will email the dev list at that point. All that has been done so far is some basic prototyping to estimate how much work an alternative {{StreamExecution}} would take to build, and some experiments to validate the latencies that this arch could achieve. Do you have specific concerns with the proposal as it stands? > Continuous Processing Mode for Structured Streaming > --- > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user requested change in parallelism. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version
[ https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054593#comment-16054593 ] Shixiong Zhu commented on SPARK-21143: -- The reason you cannot use 4.0.42.Final is because you are using 4.1.X APIs? > Fail to fetch blocks >1MB in size in presence of conflicting Netty version > -- > > Key: SPARK-21143 > URL: https://issues.apache.org/jira/browse/SPARK-21143 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Ryan Williams >Priority: Minor > > One of my spark libraries inherited a transitive-dependency on Netty > 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I > wanted to document: fetches of blocks larger than 1MB (pre-compression, > afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s > and ultimately stage failures. > I put a minimal repro in [this github > repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on > a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at > 1033 {{Array}}'s it dies in a confusing way. > Not sure what fixing/mitigating this in Spark would look like, other than > defensively shading+renaming netty. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19975) Add map_keys and map_values functions to Python
[ https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19975. - Resolution: Fixed Assignee: Yong Tang Fix Version/s: 2.3.0 > Add map_keys and map_values functions to Python > - > > Key: SPARK-19975 > URL: https://issues.apache.org/jira/browse/SPARK-19975 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Maciej Bryński >Assignee: Yong Tang > Fix For: 2.3.0 > > > We have `map_keys` and `map_values` functions in SQL. > There is no Python equivalent functions for that. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version
[ https://issues.apache.org/jira/browse/SPARK-21143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054592#comment-16054592 ] Shixiong Zhu commented on SPARK-21143: -- As Netty is so core to Spark, it's too risky to upgrade from 4.0.X to 4.1.X. There is no plan right now. > Fail to fetch blocks >1MB in size in presence of conflicting Netty version > -- > > Key: SPARK-21143 > URL: https://issues.apache.org/jira/browse/SPARK-21143 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Ryan Williams >Priority: Minor > > One of my spark libraries inherited a transitive-dependency on Netty > 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I > wanted to document: fetches of blocks larger than 1MB (pre-compression, > afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s > and ultimately stage failures. > I put a minimal repro in [this github > repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on > a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at > 1033 {{Array}}'s it dies in a confusing way. > Not sure what fixing/mitigating this in Spark would look like, other than > defensively shading+renaming netty. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21102) Refresh command is too aggressive in parsing
[ https://issues.apache.org/jira/browse/SPARK-21102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054591#comment-16054591 ] Anton Okolnychyi commented on SPARK-21102: -- Hi [~rxin], I took a look at this issue and have a prototype that fixes this. It is available [here| https://github.com/aokolnychyi/spark/commit/fc2b7c02fab7f570ae3ca080ae1c2c9502300de7]. I am not sure that my current implementation is the most optimal, so any feedback is appreciated. My first idea was to make the grammar as strict as possible. Unfortunately, there were some problems. I tried the approach below: SqlBase.g4 {noformat} ... | REFRESH TABLE tableIdentifier #refreshTable | REFRESH resourcePath #refreshResource ... resourcePath : STRING | (IDENTIFIER | number | nonReserved | '/' | '-')+ // other symbols can be added if needed ; {noformat} It is not flexible enough and requires to explicitly mention all possible symbols. Therefore, I came up with the approach that was mentioned at the beginning. Let me know your opinion on which one is better. > Refresh command is too aggressive in parsing > > > Key: SPARK-21102 > URL: https://issues.apache.org/jira/browse/SPARK-21102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > Labels: starter > > SQL REFRESH command parsing is way too aggressive: > {code} > | REFRESH TABLE tableIdentifier > #refreshTable > | REFRESH .*? > #refreshResource > {code} > We should change it so it takes the whole string (without space), or a quoted > string. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12414) Remove closure serializer
[ https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054529#comment-16054529 ] Ritesh Tijoriwala commented on SPARK-12414: --- I have a similar situation. I have several classes that I would like to instantiate and use in Executors for e.g. DB connections, Elasticsearch clients, etc. I don't want to write instantiation code in Spark functions and use "statics". There was a nit trick suggested here - https://issues.apache.org/jira/browse/SPARK-650 but seems like this will not work starting 2.0.0 as a consequence of this ticket. Could anybody from Spark community recommend how to do some initialization on each Spark executor for the job before any task execution begins? > Remove closure serializer > - > > Key: SPARK-12414 > URL: https://issues.apache.org/jira/browse/SPARK-12414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Sean Owen > Fix For: 2.0.0 > > > There is a config `spark.closure.serializer` that accepts exactly one value: > the java serializer. This is because there are currently bugs in the Kryo > serializer that make it not a viable candidate. This was uncovered by an > unsuccessful attempt to make it work: SPARK-7708. > My high level point is that the Java serializer has worked well for at least > 6 Spark versions now, and it is an incredibly complicated task to get other > serializers (not just Kryo) to work with Spark's closures. IMO the effort is > not worth it and we should just remove this documentation and all the code > associated with it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
[ https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21142: - Component/s: (was: Structured Streaming) DStreams > spark-streaming-kafka-0-10 has too fat dependency on kafka > -- > > Key: SPARK-21142 > URL: https://issues.apache.org/jira/browse/SPARK-21142 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 >Reporter: Tim Van Wassenhove >Priority: Minor > > Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it > should only need a dependency on kafka-clients. > The only reason there is a dependency on kafka (full server) is due to > KafkaTestUtils class to run in memory tests against a kafka broker. This > class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21123) Options for file stream source are in a wrong table
[ https://issues.apache.org/jira/browse/SPARK-21123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21123. -- Resolution: Fixed Fix Version/s: 2.3.0 2.2.0 > Options for file stream source are in a wrong table > --- > > Key: SPARK-21123 > URL: https://issues.apache.org/jira/browse/SPARK-21123 > Project: Spark > Issue Type: Documentation > Components: Documentation, Structured Streaming >Affects Versions: 2.1.1, 2.2.0 >Reporter: Shixiong Zhu >Priority: Minor > Labels: starter > Fix For: 2.2.0, 2.3.0 > > Attachments: Screen Shot 2017-06-15 at 11.25.49 AM.png > > > Right now options for file stream source are documented with file sink. We > should create a table for source options and fix it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16430) Add an option in file stream source to read 1 file at a time
[ https://issues.apache.org/jira/browse/SPARK-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-16430: - Fix Version/s: (was: 2.1.0) 2.0.0 > Add an option in file stream source to read 1 file at a time > > > Key: SPARK-16430 > URL: https://issues.apache.org/jira/browse/SPARK-16430 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.0.0 > > > An option that limits the file stream source to read 1 file at a time enables > rate limiting. It has the additional convenience that a static set of files > can be used like a stream for testing as this will allows those files to be > considered one at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16430) Add an option in file stream source to read 1 file at a time
[ https://issues.apache.org/jira/browse/SPARK-16430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-16430: - Fix Version/s: 2.1.0 > Add an option in file stream source to read 1 file at a time > > > Key: SPARK-16430 > URL: https://issues.apache.org/jira/browse/SPARK-16430 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.0 > > > An option that limits the file stream source to read 1 file at a time enables > rate limiting. It has the additional convenience that a static set of files > can be used like a stream for testing as this will allows those files to be > considered one at a time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19688) Spark on Yarn Credentials File set to different application directory
[ https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19688. Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 2.1.2 2.0.3 1.6.4 > Spark on Yarn Credentials File set to different application directory > - > > Key: SPARK-19688 > URL: https://issues.apache.org/jira/browse/SPARK-19688 > Project: Spark > Issue Type: Bug > Components: DStreams, YARN >Affects Versions: 1.6.3 >Reporter: Devaraj Jonnadula >Priority: Minor > Fix For: 1.6.4, 2.0.3, 2.1.2, 2.2.1, 2.3.0 > > > spark.yarn.credentials.file property is set to different application Id > instead of actual Application Id -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21143) Fail to fetch blocks >1MB in size in presence of conflicting Netty version
Ryan Williams created SPARK-21143: - Summary: Fail to fetch blocks >1MB in size in presence of conflicting Netty version Key: SPARK-21143 URL: https://issues.apache.org/jira/browse/SPARK-21143 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1 Reporter: Ryan Williams Priority: Minor One of my spark libraries inherited a transitive-dependency on Netty 4.1.6.Final (vs. Spark's 4.0.42.Final), and I observed a strange failure I wanted to document: fetches of blocks larger than 1MB (pre-compression, afaict) seem to trigger a code path that results in {{AbstractMethodError}}'s and ultimately stage failures. I put a minimal repro in [this github repo|https://github.com/ryan-williams/spark-bugs/tree/netty]: {{collect}} on a 1-partition RDD with 1032 {{Array\[Byte\]}}'s of size 1000 works, but at 1033 {{Array}}'s it dies in a confusing way. Not sure what fixing/mitigating this in Spark would look like, other than defensively shading+renaming netty. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054176#comment-16054176 ] sam edited comment on SPARK-21137 at 6/19/17 3:20 PM: -- [~srowen] Ah OK, sorry, not used to that process. On other projects I've seen users are allowed to raise bug tickets even if they don't run debugging tools to try to diagnose what the cause is. Usually that step happens *after* the bug is accepted. Very happy to investigate further and provide a thread dump ASAP. Is there anything else I should try while I'm at it? was (Author: sams): [~srowen] Ah OK, sorry, not used to that process. On other projects I've seen users are allowed to raise bug tickets even if they don't run debugging tools to try to diagnose what the cause is. Usually that step happens *after* the bug is accepted. Very happy to investigate further and provide a thread dump ASAP. Is there anything else I should try while I'm at it? How about turning the cluster off and on again? > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054176#comment-16054176 ] sam commented on SPARK-21137: - [~srowen] Ah OK, sorry, not used to that process. On other projects I've seen users are allowed to raise bug tickets even if they don't run debugging tools to try to diagnose what the cause is. Usually that step happens *after* the bug is accepted. Very happy to investigate further and provide a thread dump ASAP. Is there anything else I should try while I'm at it? How about turning the cluster off and on again? > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054162#comment-16054162 ] Michael Schmeißer commented on SPARK-650: - [~riteshtijoriwala] - Sorry, but I am not familiar with Spark 2.0.0 yet. But what I can say is that we have raised a Cloudera support case to address this issue so maybe we can expect some help from this side. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054155#comment-16054155 ] Lukasz Raszka commented on SPARK-21080: --- [~jerryshao] Yes, it's in HA mode. Updating to newer HDFS is not an option in our infrastructure, unfortunately. PR you reference is outdated at the moment - I've tried to crudely apply it with my limited knowledge of the mechanism, but it seems that the issue persists. Possibly a mistake on my side, would be great if someone could update the fix. > Workaround for HDFS delegation token expiry broken with some Hadoop versions > > > Key: SPARK-21080 > URL: https://issues.apache.org/jira/browse/SPARK-21080 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3 >Reporter: Lukasz Raszka >Priority: Minor > > We're getting struck by SPARK-11182, where the core issue in HDFS has been > fixed in more recent versions. It seems that [workaround introduced by user > SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d] > doesn't work in newer version of Hadoop. This seems to be cause by a move of > property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}} > which happened somewhere around 2.7.0 or earlier. Taking this into account > should make workaround work again for less recent Hadoop versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17176) Task are sorted by "Index" in Stage Page.
[ https://issues.apache.org/jira/browse/SPARK-17176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17176. --- Resolution: Won't Fix > Task are sorted by "Index" in Stage Page. > - > > Key: SPARK-17176 > URL: https://issues.apache.org/jira/browse/SPARK-17176 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.2, 2.0.0 >Reporter: cen yuhai >Priority: Minor > > Task are sorted by "Index" in Stage Page, but user are always concerned about > tasks which are failed(see error messages) or still running (maybe it is > skewed). When there are too many tasks, it is too slow to sort. So it is > better to set the default sort column to ”Status“. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
Tim Van Wassenhove created SPARK-21142: -- Summary: spark-streaming-kafka-0-10 has too fat dependency on kafka Key: SPARK-21142 URL: https://issues.apache.org/jira/browse/SPARK-21142 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Tim Van Wassenhove Priority: Minor Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it should only need a dependency on kafka-clients. The only reason there is a dependency on kafka (full server) is due to KafkaTestUtils class to run in memory tests against a kafka broker. This class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21142) spark-streaming-kafka-0-10 has too fat dependency on kafka
[ https://issues.apache.org/jira/browse/SPARK-21142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054130#comment-16054130 ] Tim Van Wassenhove commented on SPARK-21142: Opened a PR on github: https://github.com/apache/spark/pull/18353 > spark-streaming-kafka-0-10 has too fat dependency on kafka > -- > > Key: SPARK-21142 > URL: https://issues.apache.org/jira/browse/SPARK-21142 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Tim Van Wassenhove >Priority: Minor > > Currently spark-streaming-kafka-0-10_2.11 has a dependency on kafka, where it > should only need a dependency on kafka-clients. > The only reason there is a dependency on kafka (full server) is due to > KafkaTestUtils class to run in memory tests against a kafka broker. This > class should be moved to src/test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054118#comment-16054118 ] Aleksander Eskilson commented on SPARK-18016: - [~cloud_fan], [~divshukla], yeah, I'd be happy to write a backport of the patch for 2.1.x. > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Aleksander Eskilson > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at >
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054115#comment-16054115 ] Sean Owen commented on SPARK-21137: --- Try a thread dump on the driver. Until there's some more detail here, I don't think anyone would work on this for you, no -- it works the other way around. It's mostly on you to investigate and ideally propose a change. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054111#comment-16054111 ] sam edited comment on SPARK-21137 at 6/19/17 2:36 PM: -- [~srowen] > what stages are executing if any? None, no tasks are started for any stages. The logs do NOT output any line of the form: {code} 17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ... {code} As I have explained, *it takes 2 hours to just output the text (twice) "1,227,645 input paths to process"*. My hand cranked code single threaded, does the job in 11 minutes. It's not rocket science : it's reading some files, and writing them back out again. > If no stages are executing, what is the driver executing (thread dump)? I don't know what the driver is doing, I'm not trying to debug the issue here, I'm just trying to raise a bug. When the bug is accepted we can start trying to de-bug it and fix it. > (This is the kind of thing that should go into a mailing list exchange) If there is a configuration setting along the lines of "Don't spend 2 hours just to count how many files to process=true" then indeed I can see this is just a silly user error :) Right now I find it hard to believe such a configuration setting exists and so still believe this belongs in a bug JIRA. Perhaps we ought to ask some other people to take a look as we don't seem to be reaching a conclusion. was (Author: sams): [~srowen] > what stages are executing if any? None, no tasks are started. The logs do NOT output any line of the form: {code} 17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ... {code} As I have explained, *it takes 2 hours to just output the text (twice) "1,227,645 input paths to process"*. My hand cranked code single threaded, does the job in 11 minutes. It's not rocket science : it's reading some files, and writing them back out again. > If no stages are executing, what is the driver executing (thread dump)? I don't know what the driver is doing, I'm not trying to debug the issue here, I'm just trying to raise a bug. When the bug is accepted we can start trying to de-bug it and fix it. > (This is the kind of thing that should go into a mailing list exchange) If there is a configuration setting along the lines of "Don't spend 2 hours just to count how many files to process=true" then indeed I can see this is just a silly user error :) Right now I find it hard to believe such a configuration setting exists and so still believe this belongs in a bug JIRA. Perhaps we ought to ask some other people to take a look as we don't seem to be reaching a conclusion. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054111#comment-16054111 ] sam edited comment on SPARK-21137 at 6/19/17 2:35 PM: -- [~srowen] > what stages are executing if any? None, no tasks are started. The logs do NOT output any line of the form: {code} 17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ... {code} As I have explained, *it takes 2 hours to just output the text (twice) "1,227,645 input paths to process"*. My hand cranked code single threaded, does the job in 11 minutes. It's not rocket science : it's reading some files, and writing them back out again. > If no stages are executing, what is the driver executing (thread dump)? I don't know what the driver is doing, I'm not trying to debug the issue here, I'm just trying to raise a bug. When the bug is accepted we can start trying to de-bug it and fix it. > (This is the kind of thing that should go into a mailing list exchange) If there is a configuration setting along the lines of "Don't spend 2 hours just to count how many files to process=true" then indeed I can see this is just a silly user error :) Right now I find it hard to believe such a configuration setting exists and so still believe this belongs in a bug JIRA. Perhaps we ought to ask some other people to take a look as we don't seem to be reaching a conclusion. was (Author: sams): [~srowen] > what stages are executing if any? *None, no tasks are started*. The logs do NOT output any line of the form: {code} 17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ... {code} As I have explained, *it takes 2 hours to just output the text (twice) "1,227,645 input paths to process"*. My hand cranked code single threaded, does the job in 11 minutes. It's not rocket science : it's reading some files, and writing them back out again. > If no stages are executing, what is the driver executing (thread dump)? I don't know what the driver is doing, I'm not trying to debug the issue here, I'm just trying to raise a bug. When the bug is accepted we can start trying to de-bug it and fix it. > (This is the kind of thing that should go into a mailing list exchange) If there is a configuration setting along the lines of "Don't spend 2 hours just to count how many files to process=true" then indeed I can see this is just a silly user error :) Right now I find it hard to believe such a configuration setting exists and so still believe this belongs in a bug JIRA. Perhaps we ought to ask some other people to take a look as we don't seem to be reaching a conclusion. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054111#comment-16054111 ] sam commented on SPARK-21137: - [~srowen] > what stages are executing if any? *None, no tasks are started*. The logs do NOT output any line of the form: {code} 17/06/19 12:55:10 INFO TaskSetManager: Starting task 975.0 in stage 0.0 ... {code} As I have explained, *it takes 2 hours to just output the text (twice) "1,227,645 input paths to process"*. My hand cranked code single threaded, does the job in 11 minutes. It's not rocket science : it's reading some files, and writing them back out again. > If no stages are executing, what is the driver executing (thread dump)? I don't know what the driver is doing, I'm not trying to debug the issue here, I'm just trying to raise a bug. When the bug is accepted we can start trying to de-bug it and fix it. > (This is the kind of thing that should go into a mailing list exchange) If there is a configuration setting along the lines of "Don't spend 2 hours just to count how many files to process=true" then indeed I can see this is just a silly user error :) Right now I find it hard to believe such a configuration setting exists and so still believe this belongs in a bug JIRA. Perhaps we ought to ask some other people to take a look as we don't seem to be reaching a conclusion. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054108#comment-16054108 ] Dongjoon Hyun commented on SPARK-19809: --- Yep. I'm trying to fix this with new ORC data source. It will be 2.3.0. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at >
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054076#comment-16054076 ] Hyukjin Kwon commented on SPARK-19809: -- What you see is what you get. This is "Reopened" per the discussion above and "Unresolved" yet. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Resolved] (SPARK-21141) spark-update --version is hard to parse
[ https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21141. --- Resolution: Not A Problem [~mprocop] please don't reopen JIRAs. We can reopen if needed. As I say, I don't think there is a problem here. > spark-update --version is hard to parse > --- > > Key: SPARK-21141 > URL: https://issues.apache.org/jira/browse/SPARK-21141 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.1.1 > Environment: Debian 8 using hadoop 2.7.2 >Reporter: michael procopio > > We have need of being able to determine the spark version in order to > reference our jars: one set built for 1.x using scala 2.10 and the other > built for 2.x using scala 2.11. spark-update --version returns a lot of > extraneous output. It would be preferable if an option were available that > only returned the version number. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21140) Reduce collect high memory requrements
[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054055#comment-16054055 ] Sean Owen commented on SPARK-21140: --- Yes, it's possible the executor makes a copy of some data during processing. Given overhead of serializing data and merging intermediate buffers, it could be largeish. This isn't a very minimal example, and it doesn't establish that something runs out of memory. There is also no proposal here about what it is that could be done differently, or leads about where memory is being allocated a lot: serialization? I don't think this is actionable as is. > Reduce collect high memory requrements > -- > > Key: SPARK-21140 > URL: https://issues.apache.org/jira/browse/SPARK-21140 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.1.1 > Environment: Linux Debian 8 using hadoop 2.7.2. >Reporter: michael procopio > > I wrote a very simple Scala application which used flatMap to create an RDD > containing a 512 mb partition of 256 byte arrays. Experimentally, I > determined that spark.executor.memory had to be set at 3 gb in order to > colledt the data. This seems extremely high. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21141) spark-update --version is hard to parse
[ https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael procopio reopened SPARK-21141: -- My apologies, I mean spark-submit --version. > spark-update --version is hard to parse > --- > > Key: SPARK-21141 > URL: https://issues.apache.org/jira/browse/SPARK-21141 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.1.1 > Environment: Debian 8 using hadoop 2.7.2 >Reporter: michael procopio > > We have need of being able to determine the spark version in order to > reference our jars: one set built for 1.x using scala 2.10 and the other > built for 2.x using scala 2.11. spark-update --version returns a lot of > extraneous output. It would be preferable if an option were available that > only returned the version number. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21140) Reduce collect high memory requrements
[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael procopio reopened SPARK-21140: -- I am not sure what detail you are looking for. I provided the test code I was using. Seems to me multiple copies of the data must be generated when collecting a partition. Having to set driver.executor.memory to 3gb to collect a partition of 512 mb seems high to me. > Reduce collect high memory requrements > -- > > Key: SPARK-21140 > URL: https://issues.apache.org/jira/browse/SPARK-21140 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.1.1 > Environment: Linux Debian 8 using hadoop 2.7.2. >Reporter: michael procopio > > I wrote a very simple Scala application which used flatMap to create an RDD > containing a 512 mb partition of 256 byte arrays. Experimentally, I > determined that spark.executor.memory had to be set at 3 gb in order to > colledt the data. This seems extremely high. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21140) Reduce collect high memory requrements
[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054043#comment-16054043 ] Sean Owen edited comment on SPARK-21140 at 6/19/17 2:02 PM: I disagree executor memory does depend on the size of the partition being collected. A 6 to 1 ratio to collect data seems onerous to me. Seems like multiple copies must be created in the executor. I found that to collect an RDD containing a single partition of 512 mb took 3gb. Here's the code I was using: {code} package com.test import org.apache.spark._ import org.apache.spark.SparkContext._ object SparkRdd2 { def main(args: Array[String]) { try { // // Process any arguments. // def parseOptions( map: Map[String,Any], listArgs: List[String]): Map[String,Any] = { listArgs match { case Nil => map case "-master" :: value :: tail => parseOptions( map+("master"-> value),tail) case "-recordSize" :: value :: tail => parseOptions( map+("recordSize"-> value.toInt),tail) case "-partitionSize" :: value :: tail => parseOptions( map+("partitionSize"-> value.toLong),tail) case "-executorMemory" :: value :: tail => parseOptions( map+("executorMemory"-> value),tail) case option :: tail => println("unknown option"+option) sys.exit(1) } } val listArgs = args.toList val optionmap = parseOptions( Map[String,Any](),listArgs) val master = optionmap.getOrElse("master","local").asInstanceOf[String] val recordSize = optionmap.getOrElse("recordSize",128).asInstanceOf[Int] val partitionSize = optionmap.getOrElse("partitionSize",1024*1024*1024).asInstanceOf[Long] val executorMemory = optionmap.getOrElse("executorMemory","6g").asInstanceOf[String] println(f"Creating single partition of $partitionSize%d with records of length $recordSize%d") println(f"Setting spark.executor.memory to $executorMemory") // // Create SparkConf. // val sparkConf = new SparkConf() sparkConf.setAppName("MyEnvVar").setMaster(master).setExecutorEnv("myenvvar","good") sparkConf.set("spark.executor.cores","1") sparkConf.set("spark.executor.instances","1") sparkConf.set("spark.executor.memory",executorMemory) sparkConf.set("spark.eventLog.enabled","true") sparkConf.set("spark.eventLog.dir","hdfs://hadoop01glnxa64:54310/user/mprocopi/spark-events"); sparkConf.set("spark.driver.maxResultSize","0") /* sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") sparkConf.set("spark.kryoserializer.buffer.max","768m") sparkConf.set("spark.kryoserializer.buffer","64k") */ // // Create SparkContext // val sc = new SparkContext(sparkConf) // // // def createdSizedPartition( recordSize:Int ,partitionSize:Long): Iterator[Array[Byte]] = { var sizeReturned:Long = 0 new Iterator[Array[Byte]] { override def hasNext(): Boolean = { (sizeReturnedcreatedSizedPartition( rddInfo._1, rddInfo._2)) val results = sizedRdd.collect var countLines: Int = 0 var countBytes: Long = 0 var maxRecord: Int = 0 for (line <- results) { countLines = countLines+1 countBytes = countBytes+line.length if (line.length> maxRecord) { maxRecord = line.length } } println(f"Collected $countLines%d lines") println(f" $countBytes%d bytes") println(f"Max record $maxRecord%d bytes") } catch { case e: Exception => println("Error in executing application: ", e.getMessage) throw e } } } {code} After building it can be invoked as: spark-submit --class com.test.SparkRdd2 --driver-memory 10g ./target/scala-2.11/envtest_2.11-0.0.1.jar -recordSize 256 -partitionSize 536870912 Allows you to vary the was (Author: mprocop): I disagree executor
[jira] [Commented] (SPARK-21140) Reduce collect high memory requrements
[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054043#comment-16054043 ] michael procopio commented on SPARK-21140: -- I disagree executor memory does depend on the size of the partition being collected. A 6 to 1 ratio to collect data seems onerous to me. Seems like multiple copies must be created in the executor. I found that to collect an RDD containing a single partition of 512 mb took 3gb. Here's the code I was using: package com.test import org.apache.spark._ import org.apache.spark.SparkContext._ object SparkRdd2 { def main(args: Array[String]) { try { // // Process any arguments. // def parseOptions( map: Map[String,Any], listArgs: List[String]): Map[String,Any] = { listArgs match { case Nil => map case "-master" :: value :: tail => parseOptions( map+("master"-> value),tail) case "-recordSize" :: value :: tail => parseOptions( map+("recordSize"-> value.toInt),tail) case "-partitionSize" :: value :: tail => parseOptions( map+("partitionSize"-> value.toLong),tail) case "-executorMemory" :: value :: tail => parseOptions( map+("executorMemory"-> value),tail) case option :: tail => println("unknown option"+option) sys.exit(1) } } val listArgs = args.toList val optionmap = parseOptions( Map[String,Any](),listArgs) val master = optionmap.getOrElse("master","local").asInstanceOf[String] val recordSize = optionmap.getOrElse("recordSize",128).asInstanceOf[Int] val partitionSize = optionmap.getOrElse("partitionSize",1024*1024*1024).asInstanceOf[Long] val executorMemory = optionmap.getOrElse("executorMemory","6g").asInstanceOf[String] println(f"Creating single partition of $partitionSize%d with records of length $recordSize%d") println(f"Setting spark.executor.memory to $executorMemory") // // Create SparkConf. // val sparkConf = new SparkConf() sparkConf.setAppName("MyEnvVar").setMaster(master).setExecutorEnv("myenvvar","good") sparkConf.set("spark.executor.cores","1") sparkConf.set("spark.executor.instances","1") sparkConf.set("spark.executor.memory",executorMemory) sparkConf.set("spark.eventLog.enabled","true") sparkConf.set("spark.eventLog.dir","hdfs://hadoop01glnxa64:54310/user/mprocopi/spark-events"); sparkConf.set("spark.driver.maxResultSize","0") /* sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") sparkConf.set("spark.kryoserializer.buffer.max","768m") sparkConf.set("spark.kryoserializer.buffer","64k") */ // // Create SparkContext // val sc = new SparkContext(sparkConf) // // // def createdSizedPartition( recordSize:Int ,partitionSize:Long): Iterator[Array[Byte]] = { var sizeReturned:Long = 0 new Iterator[Array[Byte]] { override def hasNext(): Boolean = { (sizeReturnedcreatedSizedPartition( rddInfo._1, rddInfo._2)) val results = sizedRdd.collect var countLines: Int = 0 var countBytes: Long = 0 var maxRecord: Int = 0 for (line <- results) { countLines = countLines+1 countBytes = countBytes+line.length if (line.length> maxRecord) { maxRecord = line.length } } println(f"Collected $countLines%d lines") println(f" $countBytes%d bytes") println(f"Max record $maxRecord%d bytes") } catch { case e: Exception => println("Error in executing application: ", e.getMessage) throw e } } } After building it can be invoked as: spark-submit --class com.test.SparkRdd2 --driver-memory 10g ./target/scala-2.11/envtest_2.11-0.0.1.jar -recordSize 256 -partitionSize 536870912 Allows you to vary the > Reduce collect high memory requrements > -- > >
[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-21137: Description: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. So I've provided full reproduce steps here (including code and cluster setup) https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can easily just clone, and follow the README to reproduce exactly! was: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. All the app does is read the files, then try to output them again (escape the newlines and write one file per line). So I've provided full reproduce steps here (including code and cluster setup) https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can easily just clone, and follow the README to reproduce exactly! > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054042#comment-16054042 ] Sean Owen commented on SPARK-21137: --- Are you sure it's not just appearing to be stuck reading the file when downstream processing is happening? what stages are executing if any? this is the sort of thing that needs clarification. If no stages are executing, what is the driver executing (thread dump)? The number of partitions isn't some default from the environment -- look at the API call. Also, what is the number of partitions as observed when you read the RDD? print it. (This is the kind of thing that should go into a mailing list exchange) > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054026#comment-16054026 ] sam edited comment on SPARK-21137 at 6/19/17 1:53 PM: -- [~srowen] As I said in the description, which you may have missed, the logs do not even reach the point where tasks are started - partitioning cannot be an issue here. If the logs said "Started task", THEN hung for hours on end then I would see your point, but this is not the case. FYI The number of partitions used by `wholeTextFiles` will be the default when started on an EMR cluster via the command: {code} spark-submit --class scenron.UnzippedToEmailPerRowDistinctAltApp --master local[6] --driver-memory 8g /home/hadoop/scenron.jar {code} Finally, EVEN if I was running with only 1 partition, this should just be roughly equivalent to my hand cranked code that takes 11 minutes. was (Author: sams): [~srowen] As I said in the description, which you may have missed, the logs do not even reach the point where tasks are started - partitioning cannot be an issue here. If the logs said "Started task", THEN hung for hours on end then I would see your point, but this is not the case. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054031#comment-16054031 ] Renu Yadav commented on SPARK-19809: What is the resolution of this issue. spark.sql.files.ignoreCorruptFiles does not work for orc file. Please help. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054026#comment-16054026 ] sam commented on SPARK-21137: - [~srowen] As I said in the description, which you may have missed, the logs do not even reach the point where tasks are started - partitioning cannot be an issue here. If the logs said "Started task", THEN hung for hours on end then I would see your point, but this is not the case. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16054004#comment-16054004 ] Sean Owen commented on SPARK-21137: --- As i say, you're not setting anything about the partitioning here. That's likely the issue. You need to start by looking into how many partitions you have, how big they are, how long they're taking to execute. That's one of the first steps in diagnosing slow stages: do I have too few (wasted parallelism) or way too many (completing in <1s) > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can > easily just clone, and follow the README to reproduce exactly! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21141) spark-update --version is hard to parse
[ https://issues.apache.org/jira/browse/SPARK-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21141. --- Resolution: Not A Problem There is no spark-update. It is not intended as an API to determine the version externally. It's also not exactly hard to parse if you really needed to because it outputs "version x.y.z" > spark-update --version is hard to parse > --- > > Key: SPARK-21141 > URL: https://issues.apache.org/jira/browse/SPARK-21141 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.1.1 > Environment: Debian 8 using hadoop 2.7.2 >Reporter: michael procopio > > We have need of being able to determine the spark version in order to > reference our jars: one set built for 1.x using scala 2.10 and the other > built for 2.x using scala 2.11. spark-update --version returns a lot of > extraneous output. It would be preferable if an option were available that > only returned the version number. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21140) Reduce collect high memory requrements
[ https://issues.apache.org/jira/browse/SPARK-21140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21140. --- Resolution: Invalid There's no real detail here. Executor memory doesn't directly matter to how much data you can collect on the driver. Of course, collecting half-gig partitions to a driver is going to fail with even 1 partition, because that's about the default size of the driver memory. This should start as a question on a mailing list. > Reduce collect high memory requrements > -- > > Key: SPARK-21140 > URL: https://issues.apache.org/jira/browse/SPARK-21140 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.1.1 > Environment: Linux Debian 8 using hadoop 2.7.2. >Reporter: michael procopio > > I wrote a very simple Scala application which used flatMap to create an RDD > containing a 512 mb partition of 256 byte arrays. Experimentally, I > determined that spark.executor.memory had to be set at 3 gb in order to > colledt the data. This seems extremely high. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21141) spark-update --version is hard to parse
michael procopio created SPARK-21141: Summary: spark-update --version is hard to parse Key: SPARK-21141 URL: https://issues.apache.org/jira/browse/SPARK-21141 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.1.1 Environment: Debian 8 using hadoop 2.7.2 Reporter: michael procopio We have need of being able to determine the spark version in order to reference our jars: one set built for 1.x using scala 2.10 and the other built for 2.x using scala 2.11. spark-update --version returns a lot of extraneous output. It would be preferable if an option were available that only returned the version number. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21140) Reduce collect high memory requrements
michael procopio created SPARK-21140: Summary: Reduce collect high memory requrements Key: SPARK-21140 URL: https://issues.apache.org/jira/browse/SPARK-21140 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.1.1 Environment: Linux Debian 8 using hadoop 2.7.2. Reporter: michael procopio I wrote a very simple Scala application which used flatMap to create an RDD containing a 512 mb partition of 256 byte arrays. Experimentally, I determined that spark.executor.memory had to be set at 3 gb in order to colledt the data. This seems extremely high. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053977#comment-16053977 ] sam commented on SPARK-21137: - [~srowen] So I've provided full reproduce steps here (including code and cluster setup) https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can easily just clone, and follow the README to reproduce exactly! > This reads the files into memory. Yes I'm aware that `wholeTextFiles` reads the entire files, but all the files are rather small. My hand cranked code also slurps the entire files. > Also, slow compared to what? The link includes two versions of the same code, I killed the Spark version (after 5 hours of running), my hand cranked version takes 11 minutes. > Don't reopen this please. Someone will do that if it's appropriate. Sorry, like I said, I just thought this was a known issue no one had bothered to add to JIRA because most people just hand crank their own work arounds. > in the Hadoop APIs Yes it's likely that the underlying Hadoop APIs have some yucky code that does something silly, I have delved down their before and my stomach cannot handle it. Nevertheless Spark made the choice to inherit the complexities of the Hadoop APIs and reading multiple small files seems like a pretty basic use case for Spark (come on Sean this is Enron data!). It would feel a bit perverse to just close this and blame the layer cake underneath. Spark should use it's own extensions of the Hadoop APIs where the Hadoop APIs don't work (and the Hadoop code is easily extensible). > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-21137: Description: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. All the app does is read the files, then try to output them again (escape the newlines and write one file per line). So I've provided full reproduce steps here (including code and cluster setup) https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can easily just clone, and follow the README to reproduce exactly! was: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. All the app does is read the files, then try to output them again (escape the newlines and write one file per line). > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). > So I've provided full reproduce steps here (including code and cluster setup) > https://github.com/samthebest/scenron, scroll down to "Bug
[jira] [Resolved] (SPARK-20931) Built-in SQL Function ABS support string type
[ https://issues.apache.org/jira/browse/SPARK-20931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-20931. - Resolution: Fixed Fix Version/s: 2.3.0 > Built-in SQL Function ABS support string type > - > > Key: SPARK-20931 > URL: https://issues.apache.org/jira/browse/SPARK-20931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > Fix For: 2.3.0 > > > {noformat} > ABS() > {noformat} > Hive/MySQL support this. > Ref: > https://github.com/apache/hive/blob/4ba713ccd85c3706d195aeef9476e6e6363f1c21/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFAbs.java#L93 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21139) java.util.concurrent.RejectedExecutionException: rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, active threads = 0, queued t
[ https://issues.apache.org/jira/browse/SPARK-21139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053911#comment-16053911 ] Sean Owen commented on SPARK-21139: --- That looks like an issue from the HBase client, not Spark. > java.util.concurrent.RejectedExecutionException: rejected from > java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 14109] > - > > Key: SPARK-21139 > URL: https://issues.apache.org/jira/browse/SPARK-21139 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.2 > Environment: use spark1.5.2 and hbase 1.1.2 >Reporter: shining > > We create two tables use Hive HBaseStorageHandler like: > CREATE EXTERNAL TABLE `yx_bw`( > `rowkey` string, > `occur_time` string, > `milli_second` string, > `yx_id` string , > `resp_area` string , > `st_id` string, > `bay_id` string, > `device_type_id` string, > `content` string, > ..) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.hbase.HBaseSerDe' > STORED BY > 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ( > > 'hbase.columns.mapping'=':key,f:OCCUR_TIME,f:MILLI_SECOND,f:YX_ID,f:RESP_AREA,f:ST_ID,f:BAY_ID,f:STATUS,f:CONTENT,f:VLTY_ID,f:MEAS_TYPE,f:RESTRAIN_FLAG,f:R > ESERV_INT1,f:RESERV_INT2,f:CUSTOMIZED_GROUP,f:CONFIRM_STATUS,f:CONFIRM_TIME,f:CONFIRM_USER_ID,f:CONFIRM_NODE_ID,f:IF_DISPLAY', >'serialization.format'='1') > TBLPROPERTIES ( > 'hbase.table.name'='yx_bw'') > Then we use sparksql to run a join between two tables. > select * from xxgljxb a, yx_bw b where a.YX_ID = b.YX_ID; > When scan hbase table, we encounter the issue: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in > stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 > (TID 3, localhost): java.lang.RuntimeException: java.util.concurrent. > : Task > org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFutureQueueingFuture@37b2d978 > rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, > pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14109] > at > org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208) > at > org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320) > at > org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:403) > at > org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:205) > at > org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147) > at > org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.nextKeyValue(TableInputFormatBase.java:216) > at > org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:156) > at > org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:114) > at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248) > at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at >
[jira] [Created] (SPARK-21139) java.util.concurrent.RejectedExecutionException: rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, active threads = 0, queued tas
shining created SPARK-21139: --- Summary: java.util.concurrent.RejectedExecutionException: rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14109] Key: SPARK-21139 URL: https://issues.apache.org/jira/browse/SPARK-21139 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.5.2 Environment: use spark1.5.2 and hbase 1.1.2 Reporter: shining We create two tables use Hive HBaseStorageHandler like: CREATE EXTERNAL TABLE `yx_bw`( `rowkey` string, `occur_time` string, `milli_second` string, `yx_id` string , `resp_area` string , `st_id` string, `bay_id` string, `device_type_id` string, `content` string, ..) ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key,f:OCCUR_TIME,f:MILLI_SECOND,f:YX_ID,f:RESP_AREA,f:ST_ID,f:BAY_ID,f:STATUS,f:CONTENT,f:VLTY_ID,f:MEAS_TYPE,f:RESTRAIN_FLAG,f:R ESERV_INT1,f:RESERV_INT2,f:CUSTOMIZED_GROUP,f:CONFIRM_STATUS,f:CONFIRM_TIME,f:CONFIRM_USER_ID,f:CONFIRM_NODE_ID,f:IF_DISPLAY', 'serialization.format'='1') TBLPROPERTIES ( 'hbase.table.name'='yx_bw'') Then we use sparksql to run a join between two tables. select * from xxgljxb a, yx_bw b where a.YX_ID = b.YX_ID; When scan hbase table, we encounter the issue: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3, localhost): java.lang.RuntimeException: java.util.concurrent. : Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFutureQueueingFuture@37b2d978 rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14109] at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208) at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320) at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:403) at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:364) at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:205) at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:147) at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase$1.nextKeyValue(TableInputFormatBase.java:216) at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:156) at org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat$1.next(HiveHBaseTableInputFormat.java:114) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248) at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@37b2d978 rejected from java.util.concurrent.ThreadPoolExecutor@46477dd0[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14109] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048) at
[jira] [Commented] (SPARK-20568) Delete files after processing in structured streaming
[ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053871#comment-16053871 ] Fei Shao commented on SPARK-20568: -- I also do not support this feature too. If we delete files processed, we can not do other works about them later. If we want to do this feature, we can add a flag to indicate whether to delete files processed. > Delete files after processing in structured streaming > - > > Key: SPARK-20568 > URL: https://issues.apache.org/jira/browse/SPARK-20568 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Saul Shanabrook > > It would be great to be able to delete files after processing them with > structured streaming. > For example, I am reading in a bunch of JSON files and converting them into > Parquet. If the JSON files are not deleted after they are processed, it > quickly fills up my hard drive. I originally [posted this on Stack > Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to > make a feature request for it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053820#comment-16053820 ] Sean Owen commented on SPARK-21137: --- Here's a hint, or example of what could be going wrong: you may have a huge number of partitions after reading these files, depending on how you read them. You may need to repartition down. Equally there are situations where you end up with just a few. Either might lead to slow processing. This reads the files into memory. You might be grinding in GC. Probably not because you'd probably readily run out of memory then, but these are the kinds of things you need to look into first. Also, slow compared to what? > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
[ https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21138: Assignee: (was: Apache Spark) > Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and > "spark.hadoop.fs.defaultFS" are different > - > > Key: SPARK-21138 > URL: https://issues.apache.org/jira/browse/SPARK-21138 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: sharkd tu > > When I set different clusters for "spark.hadoop.fs.defaultFS" and > "spark.yarn.stagingDir" as follows: > {code:java} > spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 > spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark > {code} > I got following logs: > {code:java} > 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext > 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster > with SUCCEEDED > 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be > successfully unregistered. > 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618 > 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2 > java.lang.IllegalArgumentException: Wrong FS: > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, > expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
[ https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053817#comment-16053817 ] Apache Spark commented on SPARK-21138: -- User 'sharkdtu' has created a pull request for this issue: https://github.com/apache/spark/pull/18352 > Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and > "spark.hadoop.fs.defaultFS" are different > - > > Key: SPARK-21138 > URL: https://issues.apache.org/jira/browse/SPARK-21138 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: sharkd tu > > When I set different clusters for "spark.hadoop.fs.defaultFS" and > "spark.yarn.stagingDir" as follows: > {code:java} > spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 > spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark > {code} > I got following logs: > {code:java} > 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext > 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster > with SUCCEEDED > 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be > successfully unregistered. > 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618 > 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2 > java.lang.IllegalArgumentException: Wrong FS: > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, > expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
[ https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21138: Assignee: Apache Spark > Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and > "spark.hadoop.fs.defaultFS" are different > - > > Key: SPARK-21138 > URL: https://issues.apache.org/jira/browse/SPARK-21138 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: sharkd tu >Assignee: Apache Spark > > When I set different clusters for "spark.hadoop.fs.defaultFS" and > "spark.yarn.stagingDir" as follows: > {code:java} > spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 > spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark > {code} > I got following logs: > {code:java} > 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext > 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster > with SUCCEEDED > 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be > successfully unregistered. > 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618 > 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2 > java.lang.IllegalArgumentException: Wrong FS: > hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, > expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640) > at > org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216) > at > org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-21137. - > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053808#comment-16053808 ] sam edited comment on SPARK-21137 at 6/19/17 11:14 AM: --- [~srowen] Sorry about the lack of detail Sean. I guess I just assumed this was a known problem (been working with spark for 3 years and every person I have worked with seems to be aware of this problem, just thought I'd actually bother creating a JIRA for it so it might actually get looked at some day). So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. All the app does is read the files, then try to output them again (escape the newlines and write one file per line). was (Author: sams): [~srowen] Sorry about the lack of detail Sean. I guess I just assumed this was a known problem (been working with spark for 3 years and every person I have worked with seems to be aware of this problem, just thought I'd actually bother creating a JIRA for it so it might actually get looked at some day). So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
[ https://issues.apache.org/jira/browse/SPARK-21138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sharkd tu updated SPARK-21138: -- Description: When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows: {code:java} spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark {code} I got following logs: {code:java} 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be successfully unregistered. 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2 java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640) at org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) {code} was: When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows: {code:java} spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark {code} I got following logs: 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be successfully unregistered. 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2 java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640) at
[jira] [Created] (SPARK-21138) Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different
sharkd tu created SPARK-21138: - Summary: Cannot delete staging dir when the clusters of "spark.yarn.stagingDir" and "spark.hadoop.fs.defaultFS" are different Key: SPARK-21138 URL: https://issues.apache.org/jira/browse/SPARK-21138 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.1.1 Reporter: sharkd tu When I set different clusters for "spark.hadoop.fs.defaultFS" and "spark.yarn.stagingDir" as follows: {code:java} spark.hadoop.fs.defaultFS hdfs://tl-nn-tdw.tencent-distribute.com:54310 spark.yarn.stagingDir hdfs://ss-teg-2-v2/tmp/spark {code} I got following logs: 17/06/19 17:55:48 INFO SparkContext: Successfully stopped SparkContext 17/06/19 17:55:48 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED 17/06/19 17:55:48 INFO AMRMClientImpl: Waiting for application to be successfully unregistered. 17/06/19 17:55:48 INFO ApplicationMaster: Deleting staging directory hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618 17/06/19 17:55:48 ERROR Utils: Uncaught exception in thread Thread-2 java.lang.IllegalArgumentException: Wrong FS: hdfs://ss-teg-2-v2/tmp/spark/.sparkStaging/application_1496819138021_77618, expected: hdfs://tl-nn-tdw.tencent-distribute.com:54310 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.getPathName(Cdh3DistributedFileSystem.java:197) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.access$000(Cdh3DistributedFileSystem.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:644) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem$10.doCall(Cdh3DistributedFileSystem.java:640) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoopcdh3.hdfs.Cdh3DistributedFileSystem.delete(Cdh3DistributedFileSystem.java:640) at org.apache.hadoop.fs.FilterFileSystem.delete(FilterFileSystem.java:216) at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:545) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:233) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam reopened SPARK-21137: - Reopened after adding detail. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-21137: Description: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. was: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21137. --- Resolution: Invalid Don't reopen this please. Someone will do that if it's appropriate. This still doesn't establish that it's not either a) processing in your code or b) in the Hadoop APIs. There's no link to any external issues here. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21137) Spark cannot read many small files (wholeTextFiles)
[ https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-21137: Description: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. All the app does is read the files, then try to output them again (escape the newlines and write one file per line). was: A very common use case in big data is to read a large number of small files. For example the Enron email dataset has 1,227,645 small files. When one tries to read this data using Spark one will hit many issues. Firstly, even if the data is small (each file only say 1K) any job can take a very long time (I have a simple job that has been running for 3 hours and has not yet got to the point of starting any tasks, I doubt if it will ever finish). It seems all the code in Spark that manages file listing is single threaded and not well optimised. When I hand crank the code and don't use Spark, my job runs much faster. Is it possible that I'm missing some configuration option? It seems kinda surprising to me that Spark cannot read Enron data given that it's such a quintessential example. So it takes 1 hour to output a line "1,227,645 input paths to process", it then takes another hour to output the same line. Then it outputs a CSV of all the input paths (so creates a text storm). Now it's been stuck on the following: {code} 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] {code} for 2.5 hours. > Spark cannot read many small files (wholeTextFiles) > --- > > Key: SPARK-21137 > URL: https://issues.apache.org/jira/browse/SPARK-21137 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: sam > > A very common use case in big data is to read a large number of small files. > For example the Enron email dataset has 1,227,645 small files. > When one tries to read this data using Spark one will hit many issues. > Firstly, even if the data is small (each file only say 1K) any job can take a > very long time (I have a simple job that has been running for 3 hours and has > not yet got to the point of starting any tasks, I doubt if it will ever > finish). > It seems all the code in Spark that manages file listing is single threaded > and not well optimised. When I hand crank the code and don't use Spark, my > job runs much faster. > Is it possible that I'm missing some configuration option? It seems kinda > surprising to me that Spark cannot read Enron data given that it's such a > quintessential example. > So it takes 1 hour to output a line "1,227,645 input paths to process", it > then takes another hour to output the same line. Then it outputs a CSV of all > the input paths (so creates a text storm). > Now it's been stuck on the following: > {code} > 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo > library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608] > {code} > for 2.5 hours. > All the app does is read the files, then try to output them again (escape the > newlines and write one file per line). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org