Hi,

You can control the initial number of partitions (tasks) in v2.0:
https://www.mail-archive.com/user@spark.apache.org/msg51603.html
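For example, something like this in PySpark (an untested sketch using the 2.0 file-source options; the values here are just illustrative, see the linked thread for details):

---
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-load").getOrCreate()

# Max bytes packed into one input partition; raising it gives fewer,
# larger tasks.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)

# Estimated cost (in bytes) of opening each file; it is added per file
# during bin-packing, so it influences how many small files share a task.
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)

df = spark.read.json("hdfs:///user/hadoop/data/*/*")
---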
// maropu

On Tue, Jun 14, 2016 at 7:24 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Have you looked at the Spark GUI to see what it is waiting on? Is it
> available memory? Which resource manager are you using?
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 13 June 2016 at 20:45, Khaled Hammouda <khaled.hammo...@kik.com> wrote:
>
>> Hi Michael,
>>
>> Thanks for the suggestion to use the Spark 2.0 preview. I just downloaded
>> the preview and tried using it, but I'm running into the exact same issue.
>>
>> Khaled
>>
>> On Jun 13, 2016, at 2:58 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> You might try the Spark 2.0 preview. We spent a bunch of time improving
>> the handling of many small files.
>>
>> On Mon, Jun 13, 2016 at 11:19 AM, khaled.hammouda <khaled.hammo...@kik.com> wrote:
>>
>>> I'm trying to use Spark SQL to load json data split across about 70k
>>> files in 24 directories in HDFS, using
>>> sqlContext.read.json("hdfs:///user/hadoop/data/*/*").
>>>
>>> This doesn't seem to work; I get timeout errors like the following:
>>>
>>> -------
>>> 16/06/13 15:46:31 ERROR TransportChannelHandler: Connection to
>>> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 has been quiet for 120000
>>> ms while there are outstanding requests. Assuming connection is dead; please
>>> adjust spark.network.timeout if this is wrong.
>>> 16/06/13 15:46:31 ERROR TransportResponseHandler: Still have 1 requests
>>> outstanding when connection from
>>> ip-172-31-31-114.ec2.internal/172.31.31.114:46028 is closed
>>> ...
>>> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
>>> seconds]. This timeout is controlled by spark.rpc.askTimeout
>>> ...
>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
>>> [120 seconds]
>>> -------
>>>
>>> I don't want to start tinkering with increasing timeouts yet. I tried
>>> loading just one sub-directory, which contains around 4k files, and that
>>> works fine. So I wrote a loop that loads the json files from each sub-dir
>>> and then unionAlls the current dataframe with the previous one. However,
>>> this also fails, because the json files apparently don't all have the
>>> exact same schema, causing this error:
>>>
>>> ---
>>> Traceback (most recent call last):
>>>   File "/home/hadoop/load_json.py", line 65, in <module>
>>>     df = df.unionAll(hrdf)
>>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>>> line 998, in unionAll
>>>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>>> line 813, in __call__
>>>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line
>>> 51, in deco
>>> pyspark.sql.utils.AnalysisException: u"unresolved operator 'Union;"
>>> ---
>>>
>>> I'd like to know what's preventing Spark from loading 70k files the same
>>> way it loads 4k files.
>>>
>>> To give you some idea of my setup and data:
>>> - ~70k files across 24 directories in HDFS
>>> - Each directory contains ~3k files on average
>>> - Cluster: 200-node EMR cluster, each node with 53 GB memory and 8 cores
>>> available to YARN
>>> - Spark 1.6.1
>>>
>>> Thanks.
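>>>
>>> P.S. For reference, a simplified sketch of the loop described above
>>> (the sub-directory naming is a placeholder; variable names match the
>>> traceback):
>>>
>>> ---
>>> # Placeholder list of the 24 sub-directories.
>>> subdirs = ["hdfs:///user/hadoop/data/dir%02d" % i for i in range(24)]
>>>
>>> df = None
>>> for d in subdirs:
>>>     # Load one sub-directory, then union it with what we have so far.
>>>     hrdf = sqlContext.read.json(d + "/*")
>>>     df = hrdf if df is None else df.unionAll(hrdf)
>>> ---
>>>
>>> The unionAll call is where the unresolved 'Union' error above is raised
>>> when the inferred schemas differ between sub-directories.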
--
---
Takeshi Yamamuro