I'm writing a large dataset in Parquet format to HDFS using Spark, and it runs
rather slowly on EMR compared with, say, Databricks. I understand that if I
were able to use Hadoop 3.1, it would be much more performant because it has a
high-performance output committer. Is this the case, and if so, when will it
be supported?
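As an interim speed-up, a commonly cited knob (a hedged sketch; this is the
general Hadoop commit-algorithm setting, not the Hadoop 3.1 committer asked
about above) is FileOutputCommitter algorithm version 2, which moves each
task's output into place at task commit instead of renaming everything
sequentially at job commit:

import org.apache.spark.sql.SparkSession

// Algorithm version 2 commits each task's files into the final location as
// the task finishes, avoiding the slow single-threaded rename pass at job
// commit. Trade-off: partial output may be visible if the job fails mid-write.
val spark = SparkSession.builder()
  .appName("parquet-write")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

The same setting can also be passed with --conf at spark-submit time; any
subsequent df.write.parquet(...) picks it up automatically.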
val textFile = sc.textFile("Sample.txt")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://master:8020/user/abc")

I want to write the collection "counts", which is used in the code above, to
HDFS, so:

val x = counts.collect()

Actually I want to write x to HDFS, but Spark wants an RDD to write something
to HDFS.

How can I write an Array[(String, Int)] to HDFS?

--
Uğur Sopaoğlu
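A sketch of the two usual answers (assuming a live SparkContext named sc; the
output paths are hypothetical):

// Option 1: wrap the driver-local array back into an RDD so that
// saveAsTextFile, which is defined on RDDs, applies again.
val x: Array[(String, Int)] = counts.collect()
sc.parallelize(x.toSeq, numSlices = 1)
  .map { case (word, n) => s"$word\t$n" }
  .saveAsTextFile("hdfs://master:8020/user/abc_collected")  // hypothetical path

// Option 2: write directly from the driver with the Hadoop FileSystem API.
import java.io.PrintWriter
import org.apache.hadoop.fs.{FileSystem, Path}

val fs  = FileSystem.get(sc.hadoopConfiguration)
val out = new PrintWriter(fs.create(new Path("/user/abc/counts.txt")))  // hypothetical path
try x.foreach { case (word, n) => out.println(s"$word\t$n") } finally out.close()

Option 1 keeps everything inside Spark's machinery; Option 2 avoids a Spark
job entirely, which is reasonable since the data already fits on the driver.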
Hi,

I'd like to write a Parquet file from the driver. I could use the HDFS API,
but I am worried that it won't work on a secure cluster. I assume that the
method the executors use to write to HDFS takes care of managing Hadoop
security. However, I can't find the place where the HDFS write happens in the
Spark source.
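One way to approach this (a sketch, not an answer confirmed in this thread;
the SparkSession name, data, and paths are hypothetical): either hand the
driver-local rows back to Spark so its normal, security-aware write path does
the work, or reuse the SparkContext's Hadoop configuration, which already
carries the Kerberos login and delegation tokens Spark set up, for raw
driver-side HDFS access:

import org.apache.hadoop.fs.{FileSystem, Path}
import spark.implicits._  // assumes a SparkSession named `spark`

// Route the driver-local data through Spark's regular (secured) write path.
Seq(("a", 1), ("b", 2))   // hypothetical data
  .toDF("word", "count")
  .coalesce(1)            // single output file
  .write.mode("overwrite")
  .parquet("hdfs://master:8020/user/abc/from_driver")  // hypothetical path

// Raw driver-side HDFS access with Spark's Hadoop configuration, so the
// credentials Spark already established are picked up.
val fs  = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val out = fs.create(new Path("/user/abc/from_driver.bin"))  // hypothetical path
try out.writeUTF("payload") finally out.close()

Note the first variant technically writes from the executors, not the driver,
but that is exactly what makes it inherit Spark's security handling.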
Currently, I use rdd.isEmpty()

Thanks,
Patanachai

On 08/06/2015 12:02 PM, gpatcham wrote:
> Is there a way to filter out empty partitions before I write to HDFS other
> than using repartition and coalesce?
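Beyond the rdd.isEmpty() guard mentioned above, one option (a sketch built on
PartitionPruningRDD, a DeveloperApi class; the data and paths are
hypothetical) is to probe each partition once and prune the empty ones before
writing, so no empty part-files are produced:

import org.apache.spark.rdd.PartitionPruningRDD

// Hypothetical RDD; cache it so the emptiness probe below doesn't force a
// second full computation when the pruned RDD is written out.
val data = sc.parallelize(Seq(1, 2, 3), numSlices = 8).cache()

// Probe every partition for at least one element (computes the RDD once).
val hasData = data
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.hasNext)))
  .collect()
  .toMap

// Keep only the non-empty partitions before writing.
val pruned = PartitionPruningRDD.create(data, i => hasData.getOrElse(i, false))
pruned.saveAsTextFile("hdfs://master:8020/user/out")  // hypothetical path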
Hi there,

I have several large files (500 GB per file) to transform into Parquet format
and write to HDFS. The problems I encountered can be described as follows:

1) At first, I tried to load all the records in a file, then used
sc.parallelize(data) to generate an RDD, and finally used it to write the
Parquet output.
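The usual fix for the pattern in (1), as a sketch with hypothetical paths,
delimiter, and column names: let Spark read the file in a distributed way
rather than materializing all records on the driver, so no parallelize call
(and no driver-side copy of 500 GB) is needed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("to-parquet").getOrCreate()
import spark.implicits._

// The file is read partition by partition on the executors; records never
// pass through the driver on their way to Parquet.
spark.sparkContext
  .textFile("hdfs://master:8020/data/big_file.txt")    // hypothetical path
  .map(_.split("\t"))                                  // hypothetical record layout
  .filter(_.length == 2)
  .map(a => (a(0), a(1)))
  .toDF("key", "value")                                // hypothetical columns
  .write.mode("overwrite")
  .parquet("hdfs://master:8020/data/big_file_parquet") // hypothetical path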