Driver OOM when writing parquet

2018-08-06 Thread Nikhil Goyal
Hi guys, I have a simple job which reads LZO thrift files and writes them in Parquet format. The driver is going out of memory. The Parquet writer does keep some metadata in memory, but that should cause the executor, not the driver, to go out of memory. No computation is being done on the driver. Any idea what could be
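A minimal sketch of the job shape being described, with a generic read standing in for the LZO thrift input (the actual input format, paths, and submit settings are not given in the thread); the comments note where each step runs and how driver memory is usually raised:

```python
from pyspark.sql import SparkSession

# Driver memory generally cannot be raised from inside the job in client mode;
# it is normally passed at submit time, e.g.:
#   spark-submit --driver-memory 8g job.py
spark = SparkSession.builder.appName("thrift-to-parquet").getOrCreate()

# Hypothetical stand-in for the LZO thrift source described in the thread
df = spark.read.load("/data/input")

# The Parquet encoding and row-group buffering happen on the executors;
# the driver mainly coordinates tasks and gathers per-file commit metadata,
# which can grow if the job writes a very large number of output files.
df.write.mode("overwrite").parquet("/data/output")
```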

Re: Handle BlockMissingException in pyspark

2018-08-06 Thread John Zhuge
BlockMissingException typically indicates the HDFS file is corrupted. It might be an HDFS issue; the Hadoop mailing list is a better bet: u...@hadoop.apache.org. Capture the full stack trace in the executor log. If the file still exists, run `hdfs fsck -blockId blk_1233169822_159765693` to determine
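One way to surface the full JVM stack trace from the PySpark side, as suggested above, is to catch the Py4J error around the failing action; a sketch with a hypothetical path, the block ID taken from the message:

```python
from py4j.protocol import Py4JJavaError

try:
    # Hypothetical read that touches the corrupted HDFS file
    df = spark.read.parquet("hdfs:///data/events")
    df.count()
except Py4JJavaError as e:
    # java_exception carries the full JVM stack trace, including the
    # BlockMissingException and the offending block ID
    print(str(e.java_exception))
    raise

# Then, outside Spark, check the reported block:
#   hdfs fsck -blockId blk_1233169822_159765693
```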

spark structured streaming with file based sources and sinks

2018-08-06 Thread Koert Kuipers
has anyone used spark structured streaming from/to files (json, csv, parquet, avro) in a non-test setting? i realize kafka is probably the way to go, but let's say i have a situation where kafka is not available for reasons out of my control, and i want to do micro-batching. could i use files to
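For reference, a file-source to file-sink micro-batching pipeline can be sketched roughly like this in PySpark (the paths, schema fields, and trigger interval below are made-up placeholders); file sources need an explicit schema and file sinks need a checkpoint location:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("file-to-file-streaming").getOrCreate()

# File sources require an explicit schema (hypothetical fields here)
schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

events = (spark.readStream
          .schema(schema)
          .json("/data/incoming"))                     # hypothetical input dir

query = (events.writeStream
         .format("parquet")
         .option("path", "/data/parquet_out")          # hypothetical output dir
         .option("checkpointLocation", "/data/chk")    # required for file sinks
         .trigger(processingTime="1 minute")           # micro-batch interval
         .start())

query.awaitTermination()
```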

Re: Replacing groupBykey() with reduceByKey()

2018-08-06 Thread Bathi CCDB
Hey Bipin, thanks for the reply. I am actually aggregating after the groupByKey() operation; I posted the wrong code snippet in my first email. Here is what I am doing: dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0],x)).groupByKey(25).map(build_edges) Can we replace
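If the aggregation currently done inside build_edges can be expressed as an associative, commutative merge of two values, the shuffle could move into reduceByKey; merge_edges below is a hypothetical stand-in for that pairwise merge, since build_edges itself is not shown in the thread:

```python
# Current shape: groupByKey ships every value for a key across the shuffle
# and materializes the whole group before build_edges runs.
dd = (hive_context.read.orc(orcfile_dir)
      .rdd
      .map(lambda x: (x[0], x))
      .groupByKey(25)
      .map(build_edges))

# Possible shape: reduceByKey combines values map-side first, but only if the
# per-group aggregation is a pairwise merge of type (V, V) -> V.
dd = (hive_context.read.orc(orcfile_dir)
      .rdd
      .map(lambda x: (x[0], x))
      .reduceByKey(merge_edges, 25))   # merge_edges: hypothetical pairwise merge
# Note: downstream code now sees (key, merged_value) instead of
# (key, iterable_of_values), so the old per-group map needs adjusting.
```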

Re: Replacing groupBykey() with reduceByKey()

2018-08-06 Thread Biplob Biswas
Hi Santhosh, if you are not performing any aggregation, then I don't think you can replace your groupByKey with a reduceByKey. As I see it, you are only grouping and taking two values of the result, so you can't simply swap in reduceByKey. Thanks & Regards Biplob Biswas
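The reason a plain swap doesn't work is the shape of the function reduceByKey expects, a (V, V) -> V merge; a short sketch with a hypothetical pairs RDD:

```python
# reduceByKey needs an associative (V, V) -> V merge, e.g. summing counts:
counts = pairs.reduceByKey(lambda a, b: a + b)   # pairs: hypothetical (k, v) RDD

# "Group the values and take a couple of them" has no such merge. Emulating it,
#   pairs.mapValues(lambda v: [v]).reduceByKey(lambda a, b: a + b)
# still pushes every value through the shuffle as list concatenation,
# so it is no cheaper than groupByKey.
```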