AWS EMR slow write to HDFS

2019-06-11 Thread Femi Anthony
I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs rather slowly on EMR versus, say, Databricks. I realize that if I were able to use Hadoop 3.1, it would be much more performant because it has a high-performance output committer. Is this the case, and if so, when will…
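
The thread is truncated here, but as a rough sketch of the settings people often try first when the commit phase of a Parquet write is slow on Hadoop 2.x (the property values and write path below are illustrative, not taken from this thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parquet-write")
      // FileOutputCommitter algorithm version 2 moves task output directly into the
      // final directory, avoiding the serial rename step in the job commit. It trades
      // away some atomicity if the job dies mid-commit.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // Skip the _SUCCESS marker files if nothing downstream depends on them.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
      .getOrCreate()

    // "df" is a placeholder for the large dataset being written.
    // df.write.mode("overwrite").parquet("hdfs:///user/example/output")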

Re: Write to HDFS

2017-10-20 Thread Deepak Sharma
>>> val counts = textFile.flatMap(line => line.split(" "))
>>>   .map(word => (word, 1))
>>>   .reduceByKey(_ + _)
>>> counts.saveAsTextFile("hdfs://master:8020/user/abc")
>>>
>>> I want to write co…

Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
"Sample.txt") >> val counts = textFile.flatMap(line => line.split(" ")) >> .map(word => (word, 1)) >> .reduceByKey(_ + _) >> counts.saveAsTextFile("hdfs://master:8020/user/abc") >> >> I want to write

Re: Write to HDFS

2017-10-20 Thread Uğur Sopaoğlu
>>   .map(word => (word, 1))
>>   .reduceByKey(_ + _)
>> counts.saveAsTextFile("hdfs://master:8020/user/abc")
>>
>> I want to write the collection "counts", which is used in the code above, to HDFS, so:
>>
>> val x = counts.collect()
>>
>> Actually I want to write x to HDFS, but Spark wants an RDD to write something to HDFS.
>>
>> How can I write an Array[(String, Int)] to HDFS?
>>
>> --
>> Uğur

--
Uğur Sopaoğlu

Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
> …which is used in the code above to HDFS, so:
>
> val x = counts.collect()
>
> Actually I want to write x to HDFS, but Spark wants an RDD to write something to HDFS.
>
> How can I write an Array[(String, Int)] to HDFS?
>
> --
> Uğur

Write to HDFS

2017-10-20 Thread Uğur Sopaoğlu
c") I want to write collection of "*counts" *which is used in code above to HDFS, so val x = counts.collect() Actually I want to write *x *to HDFS. But spark wants to RDD to write sometihng to HDFS How can I write Array[(String,Int)] to HDFS -- Uğur

Slow Parquet write to HDFS using Spark

2016-11-03 Thread morfious902002
http://apache-spark-user-list.1001560.n3.nabble.com/Slow-Parquet-write-to-HDFS-using-Spark-tp28011.html

Re: Looking for the method executors use to write to HDFS

2015-11-06 Thread Reynold Xin
> …Parquet file from the driver. I could use the HDFS API, but I am worried that it won't work
> on a secure cluster. I assume that the method the executors use to write to HDFS takes care
> of managing Hadoop security. However, I can't find the place where the HDFS write happens
> in the Spark source.

Looking for the method executors use to write to HDFS

2015-11-04 Thread Tóth Zoltán
Hi, I'd like to write a Parquet file from the driver. I could use the HDFS API, but I am worried that it won't work on a secure cluster. I assume that the method the executors use to write to HDFS takes care of managing Hadoop security. However, I can't find the place where the HDFS write happens…
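
The reply above is cut off in the archive; as a hedged sketch of one way to keep a driver-side write on Spark's normal, security-aware output path, the local data can be wrapped in a small DataFrame and written through Spark itself (the session API, column names, and path are illustrative; on Spark 1.5 the equivalent would go through SQLContext):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("driver-parquet-write").getOrCreate()
    import spark.implicits._

    // Hypothetical data that lives only on the driver.
    val localRows = Seq(("a", 1L), ("b", 2L))

    localRows.toDF("key", "value")
      .coalesce(1)                                    // single output file
      .write
      .mode("overwrite")
      .parquet("hdfs:///user/example/driver_output")  // placeholder path

Because this goes through the usual Hadoop OutputFormat machinery on the executors, the same delegation-token handling that covers executor writes should apply.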

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Patanachai Tangchaisin
Currently, I use rdd.isEmpty().

Thanks,
Patanachai

On 08/06/2015 12:02 PM, gpatcham wrote:
> Is there a way to filter out empty partitions before I write to HDFS, other than using repartition and coalesce?
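
As a sketch of what this could look like, combining the isEmpty() guard above with a partition-level prune instead of a blanket repartition/coalesce (the output path is illustrative, and PartitionPruningRDD is a developer API):

    import org.apache.spark.rdd.PartitionPruningRDD

    if (!rdd.isEmpty()) {
      // Index of every partition that actually contains at least one record.
      val nonEmpty = rdd
        .mapPartitionsWithIndex { (idx, it) => if (it.hasNext) Iterator(idx) else Iterator.empty }
        .collect()
        .toSet

      // Keep only the non-empty partitions, then write; empty part files are avoided
      // without shuffling the data the way repartition would.
      PartitionPruningRDD.create(rdd, nonEmpty.contains)
        .saveAsTextFile("hdfs:///user/example/no_empty_parts")
    }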

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Richard Marscher
…the write path.

On Thu, Aug 6, 2015 at 3:33 PM, Patanachai Tangchaisin <patanac...@ipsy.com> wrote:
> Currently, I use rdd.isEmpty().
>
> Thanks,
> Patanachai
>
> On 08/06/2015 12:02 PM, gpatcham wrote:
>> Is there a way to filter out empty partitions before I write to HDFS, other than using repartition…

How to transform large local files into Parquet format and write into HDFS?

2014-08-14 Thread Parthus
Hi there, I have several large files (500GB per file) to transform into Parquet format and write to HDFS. The problems I encountered can be described as follows: 1) At first, I tried to load all the records in a file and then used sc.parallelize(data) to generate an RDD, and finally used…
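
The message is truncated above, but one common fix for the first problem is to avoid pulling the records through the driver at all: let Spark read the file directly (from HDFS or another shared filesystem) so it streams through the executors in partitions, then write Parquet from there. A minimal sketch; the paths, delimiter, and column names are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("to-parquet").getOrCreate()
    import spark.implicits._

    spark.sparkContext
      .textFile("hdfs:///data/raw/big_file.txt")      // read in partitions, no driver-side load
      .map(_.split('\t'))
      .collect { case Array(k, v) => (k, v) }         // keep only lines with exactly two fields
      .toDF("key", "value")
      .write
      .mode("overwrite")
      .parquet("hdfs:///data/parquet/big_file")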