Hi guys,
I have a simple job which reads LZO thrift files and writes them out in Parquet
format. The driver is going out of memory. The Parquet writer does keep some
metadata in memory, but that should cause the executor, not the driver, to go
out of memory. No computation is being done on the driver. Any idea what could
be causing this?
BlockMissingException typically indicates the HDFS file is corrupted. This
might be an HDFS issue; the Hadoop mailing list is a better bet:
u...@hadoop.apache.org.
Capture the full stack trace in the executor log.
If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693`
to determine the status of that block.
Has anyone used Spark Structured Streaming from/to files (JSON, CSV, Parquet,
Avro) in a non-test setting?
I realize Kafka is probably the way to go, but let's say I have a situation
where Kafka is not available for reasons out of my control, and I want to do
micro-batching. Could I use files as the source and sink instead?
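Something along these lines, perhaps? The paths, schema, and trigger interval
below are just placeholders for the sake of the sketch:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("file-microbatch").getOrCreate()

# File sources require an explicit schema.
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

# Pick up newly arriving JSON files from a landing directory.
events = spark.readStream.schema(schema).json("/data/landing/json")

# Write each micro-batch out as Parquet; the checkpoint directory tracks
# which input files have already been processed.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/out/parquet")
         .option("checkpointLocation", "/data/checkpoints/file-microbatch")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()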
Hey Bipin,
Thanks for the reply. I am actually aggregating after the groupByKey()
operation; I posted the wrong code snippet in my first email. Here is what I am
doing:
dd = hive_context.read.orc(orcfile_dir).rdd \
        .map(lambda x: (x[0], x)).groupByKey(25).map(build_edges)
Can we replace the groupByKey here with reduceByKey?
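Maybe something like the sketch below, assuming build_edges could be broken
into an incremental merge? merge_row and merge_parts here are made-up
stand-ins, not the real functions:

# Hypothetical seq/comb functions; together they must produce the same
# per-key result that build_edges computes from the full group.
def merge_row(acc, row):
    acc.append(row)        # or fold the row into a smaller summary
    return acc

def merge_parts(acc1, acc2):
    acc1.extend(acc2)      # combine partial results from two partitions
    return acc1

pairs = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0], x))

# aggregateByKey builds per-key results incrementally instead of shipping
# every value for a key to one task, as groupByKey does.
dd = pairs.aggregateByKey([], merge_row, merge_parts, numPartitions=25)

Of course this only helps if the accumulator is a compact summary rather than
the full list of rows.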
Hi Santhosh,
If you are not performing any aggregation, then I don't think you can replace
your groupByKey with a reduceByKey. As I see it, you are only grouping and
taking 2 values of the result, so I believe you can't just replace your
groupByKey with that.
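For example, with some toy data (not your job, just to show the difference,
assuming a SparkContext sc is available):

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 5)])

# reduceByKey needs a (value, value) -> value function, i.e. a real aggregation.
sums = pairs.reduceByKey(lambda x, y: x + y)      # [("a", 3), ("b", 5)]

# groupByKey hands you every value for a key, which is what you need when the
# next step has to look at the group itself.
groups = pairs.groupByKey().mapValues(list)       # [("a", [1, 2]), ("b", [5])]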
Thanks & Regards
Biplob Biswas