Try this
https://github.com/RetailRocket/SparkMultiTool
This loader solved the slow reading of a big data set of many small files in HDFS.
Andrew and developers, thank you for an excellent release!
It fixed almost all of our issues. We are now migrating to Spark from a zoo of
Python, Java, Hive, and Pig jobs.
Our Scala/Spark jobs often failed on 1.1. Spark 1.1.1 works like a Swiss
watch.
Dear all,
We encountered problems with jobs failing on huge amounts of data.
A simple local test was prepared for this question at
https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
It generates two sets of key-value pairs, joins them, selects distinct values,
and finally counts the data.
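The gist itself is linked above; a minimal sketch of what it does (the object
and value names, data sizes, and the local master are our assumptions for
illustration, not the actual gist code):

import org.apache.spark.{SparkConf, SparkContext}

object JoinDistinctTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JoinDistinctTest").setMaster("local[*]"))

    // two sets of key-value pairs
    val left  = sc.parallelize(0 until 1000000).map(i => (i % 1000, i))
    val right = sc.parallelize(0 until 1000000).map(i => (i % 1000, i * 2))

    // join them, keep distinct values, and count the result
    println(left.join(right).distinct().count())

    sc.stop()
  }
}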
You could use combineTextFile from
https://github.com/RetailRocket/SparkMultiTool
It combines input files before the mappers by means of Hadoop's
CombineFileInputFormat. In our case it reduced the number of mappers to
approximately 3,000 and made the job significantly faster.
Example:
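(The snippet below is a sketch based on the SparkMultiTool README; the exact
package path and the optional size/delim parameters are assumptions and may
differ in the current version.)

import ru.retailrocket.spark.multitool.Loaders

// combine many small files into larger splits before the map phase
val sessions = Loaders.combineTextFile(sc, "hdfs:///path/to/small/files/*")
// optionally with an explicit split size (assumed to be in megabytes) and line delimiter:
// val sessions = Loaders.combineTextFile(sc, "hdfs:///path/to/small/files/*", size = 256, delim = "\n")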
Dear all,
We encountered a problem with failing Spark jobs.
We have a Spark/Hadoop cluster - CDH 5.1.2 + Spark 1.1.
After launching a Spark job with the command:
~/soft/spark-1.1.0-bin-hadoop2.3/bin/spark-submit --master yarn-cluster
--executor-memory 4G --driver-memory 4G --class
We encountered a problem loading a huge number of small files (hundreds of
thousands of files) from HDFS in Spark. Our jobs failed over time.
This forced us to write our own loader that combines files by means of
Hadoop's CombineFileInputFormat.
It significantly reduced the number of mappers.
Could you advise on the best practice for using named tuples in Scala
with Spark RDDs?
Currently we access fields by number in a tuple:
rdd.map(_._2)
but we would like to write something like:
rdd.map(_.itemId)
This would be helpful for debugging purposes.
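For illustration, the kind of construction we mean (sketched here with a
hypothetical Item case class; sc is an existing SparkContext):

case class Item(itemId: Long, price: Double)

val items = sc.parallelize(Seq(Item(1L, 9.99), Item(2L, 19.99)))
val ids = items.map(_.itemId)   // access by field name instead of _._1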
We build some Spark jobs with external jars. Currently I compile the jobs by
including the jars in one assembly,
but I am looking for an approach that puts all external jars into HDFS.
We have already put the Spark jar in an HDFS folder and set the
SPARK_JAR variable.
What is the best way to do that for other external jars?
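One option we are considering (our own assumption, not verified yet) is to
reference the jars in HDFS directly from the driver:

// sc is an existing SparkContext; the HDFS paths are placeholders
sc.addJar("hdfs:///libs/external-lib-1.0.jar")
sc.addJar("hdfs:///libs/another-lib-2.3.jar")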
Is it possible to use DoubleRDDFunctions
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
for calculating the mean and standard deviation per key in paired RDDs (key, value)?
Currently I am using an approach with reduceByKey, but I want to make my code
more concise and readable.
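For context, a sketch of what we need per key (using StatCounter with
aggregateByKey as one possible alternative; this is our own sketch, not
DoubleRDDFunctions):

import org.apache.spark.SparkContext._   // pair RDD functions on Spark 1.x
import org.apache.spark.util.StatCounter

// pairs is an RDD[(String, Double)]; sc is an existing SparkContext
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

// build one StatCounter per key, then read mean and stdev from it
val stats = pairs.aggregateByKey(new StatCounter())(
  (acc, v) => acc.merge(v),
  (a, b) => a.merge(b))
val meanAndStdev = stats.mapValues(s => (s.mean, s.stdev))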