Hi,
I'm loading 1000 files using the spark-avro package:
val df = sqlContext.read.avro("/incoming/")

When I perform an action on this df, it seems like a broadcast is created
for each file and sent to the workers (instead of the workers reading
their data-local files):

scala> df.coalesce(4).count
15/09/21 15:11:32 INFO storage.MemoryStore: ensureFreeSpace(261920) called
with curMem=0, maxMem=2223023063
15/09/21 15:11:32 INFO storage.MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 255.8 KB, free 2.1 GB)
15/09/21 15:11:32 INFO storage.MemoryStore: ensureFreeSpace(22987) called
with curMem=261920, maxMem=2223023063
15/09/21 15:11:32 INFO storage.MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 22.4 KB, free 2.1 GB)
15/09/21 15:11:32 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
in memory on 192.168.3.4:39736 (size: 22.4 KB, free: 2.1 GB)
....
....
....
15/09/21 15:12:45 INFO storage.MemoryStore: ensureFreeSpace(22987) called
with curMem=294913622, maxMem=2223023063
15/09/21 15:12:45 INFO storage.MemoryStore: Block broadcast_1034_piece0
stored as bytes in memory (estimated size 22.4 KB, free 1838.8 MB)
15/09/21 15:12:45 INFO storage.BlockManagerInfo: Added
broadcast_1034_piece0 in memory on 192.168.3.4:39736 (size: 22.4 KB, free:
2.0 GB)
15/09/21 15:12:45 INFO spark.SparkContext: Created broadcast 1034 from
hadoopFile at AvroRelation.scala:121
15/09/21 15:12:46 INFO execution.Exchange: Using SparkSqlSerializer2.
15/09/21 15:12:46 INFO spark.SparkContext: Starting job: count at
<console>:25
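The log line above points at hadoopFile in AvroRelation.scala:121, so my working theory is that reading the directory behaves like a union of one hadoopFile call per input file, each of which broadcasts its Hadoop configuration. A sketch of what that would look like (the file names here are made up, and this is just an illustration of the suspected behavior, not what I actually ran):

```scala
// Hypothetical illustration only: paths and file listing are invented.
// If spark-avro issues one hadoopFile call per input file, reading
// "/incoming/" with 1000 files would behave roughly like:
val files: Seq[String] = (0 until 1000).map(i => s"/incoming/part-$i.avro")

// ...where each per-file read triggers its own broadcast (presumably of
// the Hadoop configuration), matching the ~1000 broadcast_N log entries:
val perFile = files
  .map(f => sqlContext.read.avro(f)) // one hadoopFile (and broadcast) each?
  .reduce(_ unionAll _)
```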

Am I understanding this wrong?

Thank you.
Daniel
