Have you looked at the Spark GUI to see what it is waiting for? Is it waiting
on available memory? Which resource manager are you using?

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 13 June 2016 at 20:45, Khaled Hammouda <khaled.hammo...@kik.com> wrote:

> Hi Michael,
>
> Thanks for the suggestion to use Spark 2.0 preview. I just downloaded the
> preview and tried using it, but I’m running into the exact same issue.
>
> Khaled
>
> On Jun 13, 2016, at 2:58 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
> You might try with the Spark 2.0 preview.  We spent a bunch of time
> improving the handling of many small files.
>
> On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda <khaled.hammo...@kik.com
> > wrote:
>
>> I'm trying to use Spark SQL to load JSON data that is split across about 70k
>> files in 24 directories in HDFS, using
>> sqlContext.read.json("hdfs:///user/hadoop/data/*/*").
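>>
>> (For context, a minimal standalone sketch of the load; everything besides the
>> read call above is just boilerplate context setup:)
>>
>> ---
>> # Minimal sketch (PySpark / Spark 1.6): set up a context and issue the same
>> # wildcard read quoted above.
>> from pyspark import SparkContext
>> from pyspark.sql import SQLContext
>>
>> sc = SparkContext(appName="load-json")
>> sqlContext = SQLContext(sc)
>>
>> # Reads all ~70k files under the 24 sub-directories in one call.
>> df = sqlContext.read.json("hdfs:///user/hadoop/data/*/*")
>> df.printSchema()
>> ---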
>>
>> This doesn't seem to work; I get timeout errors like the following:
>>
>> -------
>> 16/06/13 15:46:31 ERROR TransportChannelHandler: Connection to
>> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 has been quiet for
>> 120000
>> ms while there are outstanding requests. Assuming connection is dead;
>> please
>> adjust spark.network.timeout if this is wrong.
>> 16/06/13 15:46:31 ERROR TransportResponseHandler: Still have 1 requests
>> outstanding when connection from
>> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 is closed
>> ...
>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>> ...
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>> [120 seconds]
>> ------
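>>
>> (For reference, the two settings named in the log can be raised when the
>> context is created; a sketch with purely illustrative values:)
>>
>> ---
>> # Illustrative only: bump the timeouts the log complains about. "600s" is an
>> # arbitrary example value, not a recommendation.
>> from pyspark import SparkConf, SparkContext
>> from pyspark.sql import SQLContext
>>
>> conf = (SparkConf()
>>         .set("spark.network.timeout", "600s")
>>         .set("spark.rpc.askTimeout", "600s"))
>> sc = SparkContext(conf=conf)
>> sqlContext = SQLContext(sc)
>> ---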
>>
>> I don't want to start tinkering with increasing timeouts yet. I tried to
>> load just one sub-directory, which contains around 4k files, and this
>> seems
>> to work fine. So I thought of writing a loop where I load the json files
>> from each sub-dir and then unionAll the current dataframe with the
>> previous
>> dataframe. However, this also fails because apparently the json files
>> don't
>> have the exact same schema, causing this error:
>>
>> ---
>> Traceback (most recent call last):
>>   File "/home/hadoop/load_json.py", line 65, in <module>
>>     df = df.unionAll(hrdf)
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>> line 998, in unionAll
>>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>> line 813, in __call__
>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
>> 51, in deco
>> pyspark.sql.utils.AnalysisException: u"unresolved operator 'Union;"
>> ---
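>>
>> (For reference, a rough sketch of such a loop; the sub-directory names are
>> placeholders, and the column-intersection step is only an untested idea for
>> aligning the schemas before the union:)
>>
>> ---
>> # Rough sketch of the per-directory loop described above (PySpark / Spark 1.6).
>> df = None
>> for subdir in ["dir01", "dir02"]:  # placeholder names; the real data has 24 sub-dirs
>>     hrdf = sqlContext.read.json("hdfs:///user/hadoop/data/%s/*" % subdir)
>>     if df is None:
>>         df = hrdf
>>     else:
>>         # unionAll needs matching schemas; keeping only the shared columns is
>>         # one possible (untested) workaround for the AnalysisException above.
>>         common = [c for c in df.columns if c in hrdf.columns]
>>         df = df.select(common).unionAll(hrdf.select(common))
>> ---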
>>
>> I'd like to know what's preventing Spark from loading 70k files the same
>> way it loads 4k files.
>>
>> To give you some idea about my setup and data:
>> - ~70k files across 24 directories in HDFS
>> - Each directory contains 3k files on average
>> - Cluster: 200-node EMR cluster; each node has 53 GB memory and 8 cores
>> available to YARN
>> - Spark 1.6.1
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-limit-on-the-number-of-tasks-in-one-job-tp27158.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
