Re: Spark task hangs infinitely when accessing S3 from AWS
Hi,

I think I have the same issue mentioned here:
https://issues.apache.org/jira/browse/SPARK-8898

I tried to run the job with 1 core and it no longer hangs. I can live with
that for now, but any suggestions are welcome. (A sketch of the single-core
configuration follows the quoted thread dump below.)

Erisa

On Tue, Jan 26, 2016 at 4:51 PM, Erisa Dervishi wrote:

> Actually, now that I was taking a closer look at the thread dump, it looks
> like all the worker threads are in a "Waiting" condition:
>
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:159)
> org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:398)
> org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:298)
> org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:238)
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:320)
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:265)
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestGet(RestStorageService.java:966)
> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestGet(RestStorageService.java:938)
> org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2129)
> org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2066)
> org.jets3t.service.S3Service.getObject(S3Service.java:2583)
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieve(Jets3tNativeFileSystemStore.java:230)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:497)
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> org.apache.hadoop.fs.s3native.$Proxy32.retrieve(Unknown Source)
> org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:206)
> org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:96)
> org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
> org.apache.avro.mapred.FsInput.seek(FsInput.java:50)
> org.apache.avro.file.DataFileReader$SeekableInputStream.seek(DataFileReader.java:190)
> org.apache.avro.file.DataFileReader.seek(DataFileReader.java:114)
> org.apache.avro.file.DataFileReader.sync(DataFileReader.java:127)
> org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:102)
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:153)
> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:124)
> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> org.apache.spark.scheduler.Task.run(Task.scala:88)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
>
> On Tue, Jan 26, 2016 at 4:26 PM, Erisa Dervishi wrote:
>
>> I have quite a different situation though.
>> My job works fine for S3 files (avro format) up to 1G. It starts to hang
>> for files larger than that size (1.5G for example).
>>
>> This is how I am creating the RDD:
>>
>> val rdd: RDD[T] = ctx.newAPIHadoopFile[AvroKey[T], NullWritable,
>> AvroKeyInputFormat[T]](s"s3n://path-to-avro-file")
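For reference, a minimal sketch of the single-core workaround mentioned at the top of this message. It assumes a cluster manager that honors spark.executor.cores; the exact knob depends on how the job is launched, so treat the property below as illustrative rather than the poster's actual configuration:

import org.apache.spark.{SparkConf, SparkContext}

// With a single task slot per executor there is at most one concurrent
// S3 read per executor, which avoids the contention on the jets3t HTTP
// connection pool that the quoted thread dump is blocked on.
val conf = new SparkConf()
  .setAppName("s3-avro-job")           // illustrative app name
  .set("spark.executor.cores", "1")    // one core (task slot) per executor
val sc = new SparkContext(conf)

Depending on the deployment, the same effect can be achieved with spark-submit's --executor-cores 1 flag.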
Re: Spark task hangs infinitely when accessing S3 from AWS
Actually, now that I was taking a closer look at the thread dump, it looks
like all the worker threads are in a "Waiting" condition:

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:159)
org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:398)
org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:298)
org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:238)
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:320)
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:265)
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestGet(RestStorageService.java:966)
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestGet(RestStorageService.java:938)
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2129)
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2066)
org.jets3t.service.S3Service.getObject(S3Service.java:2583)
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieve(Jets3tNativeFileSystemStore.java:230)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:497)
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
org.apache.hadoop.fs.s3native.$Proxy32.retrieve(Unknown Source)
org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:206)
org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:96)
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
org.apache.avro.mapred.FsInput.seek(FsInput.java:50)
org.apache.avro.file.DataFileReader$SeekableInputStream.seek(DataFileReader.java:190)
org.apache.avro.file.DataFileReader.seek(DataFileReader.java:114)
org.apache.avro.file.DataFileReader.sync(DataFileReader.java:127)
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:102)
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:153)
org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:124)
org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:88)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

On Tue, Jan 26, 2016 at 4:26 PM, Erisa Dervishi wrote:

> I have quite a different situation though.
> My job works fine for S3 files (avro format) up to 1G. It starts to hang
> for files larger than that size (1.5G for example).
>
> This is how I am creating the RDD:
>
> val rdd: RDD[T] = ctx.newAPIHadoopFile[AvroKey[T], NullWritable,
> AvroKeyInputFormat[T]](s"s3n://path-to-avro-file")
>
> Because of dependency issues I had to use an older version of Spark, and
> the job was hanging while reading from S3. I have now upgraded to Spark
> 1.5.2, and reading from S3 seems to work fine (first successful task in
> the attached screenshot, which takes 42 s).
>
> But then it gets stuck. The attached screenshot shows 24 running tasks
> that hang forever (with a "Running" status) even though I am just doing
> rdd.count() (initially it was a groupBy and I thought
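For completeness, here is a self-contained sketch of the read-and-count pattern quoted above, with GenericRecord standing in for the concrete Avro type T. The bucket path, app name, and credential values are placeholders, and the fs.s3n.* keys assume the jets3t-backed s3n:// connector that appears in the thread dump:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-avro-count"))

// Credentials for the s3n:// connector (placeholders, not real keys).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<SECRET_KEY>")

// Same shape as the call quoted above; yields one
// (AvroKey[GenericRecord], NullWritable) pair per Avro record.
val rdd = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("s3n://my-bucket/path-to-avro-file")

println(rdd.count())
sc.stop()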
Re: Spark task hangs infinitely when accessing S3 from AWS
Hi,

I am kind of in your situation now while trying to read from S3. Were you
able to find a workaround in the end?

Thanks,
Erisa

On Thu, Nov 12, 2015 at 12:00 PM, aecc wrote:

> Some other stats:
>
> The number of files I have in the folder is 48.
> The number of partitions used when reading data is 7315.
> The maximum size of a file to read is 14G.
> The size of the folder is around 270G.
Re: Efficient way to get top K values per key in (key, value) RDD?
Hi,

I am a Spark newbie trying to solve the same problem, and I have implemented
the exact solution that sowen is suggesting: I am using priority queues to
keep track of the top 25 sub_categories for each category, via the
combineByKey function. However, I run into the following exception when I
submit the Spark job:

ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 17)
java.lang.UnsupportedOperationException: unsuitable as hash key
        at scala.collection.mutable.PriorityQueue.hashCode(PriorityQueue.scala:226)

From the error it looks like Spark is trying to use the mutable priority
queue as a hash key, so the exception itself makes sense, but I don't get
why it is doing that, since the priority queue is the value of the RDD
record, not the key. Maybe there is a more straightforward solution to what
I want to achieve, so any suggestion is appreciated :)

Thanks,
Erisa
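For reference, a minimal sketch of the combineByKey + bounded-heap idea described above, keeping the top 25 sub-categories per category. The input shape and names are illustrative assumptions, not the poster's actual code, and the conversion of the mutable queue to an immutable List is one way to keep anything downstream from calling the unsupported PriorityQueue.hashCode; whether that alone resolves the exception depends on where Spark ends up hashing the value in a given job:

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.PriorityQueue

object TopKPerCategory {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("top-k-per-category"))
    val k = 25

    // Illustrative input: (category, (subCategory, count)).
    val data = sc.parallelize(Seq(
      ("books", ("scifi", 40L)), ("books", ("history", 10L)),
      ("music", ("jazz", 7L)),   ("music", ("rock", 90L))))

    // Reversing the ordering turns the queue into a min-heap on count,
    // so the smallest retained element is the one dequeued when we go over k.
    val minFirst = Ordering.by[(String, Long), Long](_._2).reverse

    def add(q: PriorityQueue[(String, Long)], v: (String, Long)) = {
      q.enqueue(v)
      if (q.size > k) q.dequeue()   // evict the current smallest
      q
    }

    val topK = data
      .combineByKey(
        (v: (String, Long)) => add(PriorityQueue.empty[(String, Long)](minFirst), v),
        add _,
        (a: PriorityQueue[(String, Long)], b: PriorityQueue[(String, Long)]) => {
          b.foreach(v => add(a, v)); a
        })
      // Convert each heap to an immutable List (largest count first) before
      // any further keyed work, since mutable.PriorityQueue.hashCode throws.
      .mapValues(_.toList.sortBy(-_._2))

    topK.collect().foreach(println)
    sc.stop()
  }
}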