<code>
val enriched_web_logs = sqlContext.sql("""
select web_logs.datetime, web_logs.node as app_host, source_ip, b.node as
source_host, log
from web_logs
left outer join (select distinct node, address from nodes) b on source_ip =
address
""")
enriched_web_logs.coalesce(1).write.format("parquet").mode("overwrite").save(bucket+"derived/enriched_web_logs")
enriched_web_logs.registerTempTable("enriched_web_logs")
sqlContext.cacheTable("enriched_web_logs")
</code>

There are only 524 records in the resulting table, and I have explicitly
attempted to coalesce into 1 partition.

Yet my Spark UI shows 200 (mostly empty) partitions:
RDD NameStorage LevelCached PartitionsFraction CachedSize in MemorySize in
ExternalBlockStoreSize on Disk
In-memory table enriched_web_logs
<http://localhost:4040/storage/rdd?id=86> Memory
Deserialized 1x Replicated 200 100% 22.0 KB 0.0 B 0.0 BWhy would there be
200 partitions despite the coalesce call?

Reply via email to