Hi,

It may be interesting to try this. Can you please create a HiveContext
(using the standard AWS Spark stack on EMR 4.0), define a table over the
Avro file, and read the data into a DataFrame using HiveContext SQL?
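
A sketch of what I mean, assuming a placeholder table name, column list, and
S3 path (with STORED AS AVRO the column definitions drive the Avro schema,
so adjust them to match your file):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("AvroViaHive"))
val hiveContext = new HiveContext(sc)

// External table over the Avro data; the columns and location here are
// placeholders and must match the schema of the actual file.
hiveContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS avro_events (id BIGINT, payload STRING)
    |STORED AS AVRO
    |LOCATION 's3://your-bucket/path/to/avro/'""".stripMargin)

// Read the table into a DataFrame through HiveContext SQL.
val df = hiveContext.sql("SELECT * FROM avro_events")
df.show()
```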

Please let me know if I can be of any help with this.

Regards,
Gourav

On Wed, Jan 27, 2016 at 9:03 AM, Erisa Dervishi <erisa...@gmail.com> wrote:

> Hi,
>
> I think I have the same issue mentioned here:
>
> https://issues.apache.org/jira/browse/SPARK-8898
>
> I tried to run the job with 1 core and it didn't hang anymore. I can live
> with that for now, but any suggestions are welcome.
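>
> (In case it helps anyone reproduce this: the single-core run can be
> requested with something along these lines; the class and jar names are
> placeholders:)

```shell
spark-submit \
  --class com.example.AvroCountJob \
  --executor-cores 1 \
  target/avro-count-job.jar
```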
>
> Erisa
>
> On Tue, Jan 26, 2016 at 4:51 PM, Erisa Dervishi <erisa...@gmail.com>
> wrote:
>
>> Actually, now that I take a closer look at the thread dump, it looks
>> like all the worker threads are in a "Waiting" state:
>>
>> sun.misc.Unsafe.park(Native Method)
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>> org.apache.http.impl.conn.tsccm.WaitingThread.await(WaitingThread.java:159)
>> org.apache.http.impl.conn.tsccm.ConnPoolByRoute.getEntryBlocking(ConnPoolByRoute.java:398)
>> org.apache.http.impl.conn.tsccm.ConnPoolByRoute$1.getPoolEntry(ConnPoolByRoute.java:298)
>> org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager$1.getConnection(ThreadSafeClientConnManager.java:238)
>> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
>> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:320)
>> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:265)
>> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestGet(RestStorageService.java:966)
>> org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestGet(RestStorageService.java:938)
>> org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2129)
>> org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2066)
>> org.jets3t.service.S3Service.getObject(S3Service.java:2583)
>> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieve(Jets3tNativeFileSystemStore.java:230)
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> java.lang.reflect.Method.invoke(Method.java:497)
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>> org.apache.hadoop.fs.s3native.$Proxy32.retrieve(Unknown Source)
>> org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:206)
>> org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:96)
>> org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
>> org.apache.avro.mapred.FsInput.seek(FsInput.java:50)
>> org.apache.avro.file.DataFileReader$SeekableInputStream.seek(DataFileReader.java:190)
>> org.apache.avro.file.DataFileReader.seek(DataFileReader.java:114)
>> org.apache.avro.file.DataFileReader.sync(DataFileReader.java:127)
>> org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:102)
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:153)
>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:124)
>> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> org.apache.spark.scheduler.Task.run(Task.scala:88)
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> java.lang.Thread.run(Thread.java:745)
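>>
>> (The frames at ConnPoolByRoute.getEntryBlocking suggest all the tasks are
>> waiting for a free HTTP connection from JetS3t's pool, which matches
>> SPARK-8898. If so, raising the pool size via a jets3t.properties file on
>> the classpath may help; the value below is illustrative, not a tuned
>> recommendation:)

```
# jets3t.properties (JetS3t's HTTP connection pool; the default is 20)
httpclient.max-connections=100
```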
>>
>>
>> On Tue, Jan 26, 2016 at 4:26 PM, Erisa Dervishi <erisa...@gmail.com>
>> wrote:
>>
>>> I have quite a different situation though.
>>> My job works fine for S3 files (Avro format) up to 1 GB, but it starts
>>> to hang for larger files (1.5 GB, for example).
>>>
>>> This is how I am creating the RDD:
>>>
>>> val rdd: RDD[T] = ctx.newAPIHadoopFile[AvroKey[T], NullWritable,
>>> AvroKeyInputFormat[T]](s"s3n://path-to-avro-file")
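>>>
>>> (For completeness, a self-contained version of that read might look like
>>> the following; the generic-record type and the path are placeholders, and
>>> note that newAPIHadoopFile returns key/value pairs, with the Avro record
>>> in the key:)

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

val ctx = new SparkContext(new SparkConf().setAppName("AvroRead"))

// newAPIHadoopFile yields (key, value) pairs; the Avro record is the key.
val pairs = ctx.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("s3n://your-bucket/path/to/file.avro")

// Unwrap the records before counting.
val records = pairs.map(_._1.datum())
println(records.count())
```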
>>>
>>> Because of dependency issues I had to use an older version of Spark, and
>>> the job was hanging while reading from S3. Now that I have upgraded to
>>> Spark 1.5.2, reading from S3 seems to work fine (the first succeeded task
>>> in the attached screenshot takes 42 s).
>>>
>>> But then it gets stuck. The attached screenshot shows 24 running tasks
>>> that hang forever (with a "Running" status), even though I am just doing
>>> rdd.count(). (Initially it was a groupBy, and I thought that was causing
>>> the issue, but it hangs even when just counting the lines of the file.)
>>>
>>> Any suggestion is appreciated,
>>> Erisa
>>>
>>> On Tue, Jan 26, 2016 at 2:19 PM, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Are you creating RDDs using the textFile option? Can you please let me
>>>> know the following:
>>>> 1. Number of partitions
>>>> 2. Number of files
>>>> 3. Time taken to create the RDDs
>>>>
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>>
>>>> On Tue, Jan 26, 2016 at 1:12 PM, Gourav Sengupta <
>>>> gourav.sengu...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Are you creating RDDs out of the data?
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Gourav
>>>>>
>>>>> On Tue, Jan 26, 2016 at 12:45 PM, aecc <alessandroa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Sorry, I have not been able to solve the issue. I used speculation
>>>>>> mode as a workaround.
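>>>>>>
>>>>>> (The speculation workaround can be enabled with settings along these
>>>>>> lines; the interval and multiplier shown are Spark's documented
>>>>>> defaults, not tuned recommendations:)

```
spark.speculation            true
spark.speculation.interval   100ms
spark.speculation.multiplier 1.5
```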
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-task-hangs-infinitely-when-accessing-S3-from-AWS-tp25289p26068.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
