Re: why is spark + scala code so slow, compared to python?

2014-12-12 Thread rzykov
Try this: https://github.com/RetailRocket/SparkMultiTool. This loader solved the slow reading of a large data set of small files in HDFS.

Re: Announcing Spark 1.1.1!

2014-12-03 Thread rzykov
Andrew and developers, thank you for an excellent release! It fixed almost all of our issues. Now we are migrating to Spark from a zoo of Python, Java, Hive, and Pig jobs. Our Scala/Spark jobs often failed on 1.1; Spark 1.1.1 works like a Swiss watch.

Spark: Simple local test failed depending on memory settings

2014-11-21 Thread rzykov
Dear all, we encountered problems with failing jobs on large amounts of data. A simple local test for this question was prepared at https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb. It generates two sets of key-value pairs, joins them, selects distinct values, and finally counts the data.
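The code in the message is cut off by the archive; the full test is in the linked gist. Below is a minimal sketch of the test as described (generate two key-value data sets, join them, take distinct values, count), with sizes and key ranges that are illustrative rather than taken from the gist:

import org.apache.spark.{SparkConf, SparkContext}

object JoinDistinctTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinDistinctTest").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Two key-value data sets with overlapping keys (sizes are illustrative).
    val left = sc.parallelize(1 to 100000).map(i => (i % 1000, i))
    val right = sc.parallelize(1 to 100000).map(i => (i % 1000, i * 2))

    // Join on the key, keep the joined values, de-duplicate, and count.
    val result = left.join(right).map { case (_, pair) => pair }.distinct().count()
    println(s"distinct joined value pairs: $result")

    sc.stop()
  }
}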

Re: Optimizing text file parsing, many small files versus few big files

2014-11-20 Thread rzykov
You could use combineTextFile from https://github.com/RetailRocket/SparkMultiTool. It combines input files before the mappers by means of Hadoop's CombineFileInputFormat. In our case it reduced the number of mappers from 10 to approx. 3000 and made the job significantly faster. Example:
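(The code in the original message is truncated by the archive.) A minimal sketch of how the loader might be called, assuming a Loaders.combineTextFile(sc, path) entry point; the package name and exact signature are assumptions, so check the SparkMultiTool README for the real ones:

// Assumed import and call; verify against the SparkMultiTool README.
import ru.retailrocket.spark.multitool.Loaders

// Combine many small HDFS files into large splits before the map stage,
// so one mapper reads many files instead of one file each (path is illustrative).
val sessions = Loaders.combineTextFile(sc, "hdfs:///data/logs/*")
sessions.take(5).foreach(println)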

Spark doesn't kill worker process after failing on Yarn

2014-11-20 Thread rzykov
Dear all, we encountered a problem with failed Spark jobs. We have a Spark/Hadoop cluster (CDH 5.1.2 + Spark 1.1). After launching a Spark job with the command: ~/soft/spark-1.1.0-bin-hadoop2.3/bin/spark-submit --master yarn-cluster --executor-memory 4G --driver-memory 4G --class

Solution for small files in HDFS

2014-10-01 Thread rzykov
We encountered a problem loading a huge number of small files (hundreds of thousands of files) from HDFS in Spark. Our jobs failed over time. This forced us to write our own loader that combines files by means of Hadoop's CombineFileInputFormat. It significantly reduced the number of mappers from 10
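For reference, a similar effect can be had with the stock CombineTextInputFormat that ships with Hadoop 2.x (a CombineFileInputFormat subclass), rather than the custom loader described above; this is only a sketch, and the path and split size are illustrative:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CombineSmallFiles"))

// Pack small files into combined splits of at most ~256 MB each,
// so one task reads many files instead of one task per file.
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

val lines = sc
  .newAPIHadoopFile[LongWritable, Text, CombineTextInputFormat]("hdfs:///data/small-files/*")
  .map(_._2.toString)

println(lines.count())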

Access by name in tuples in Scala with Spark

2014-09-26 Thread rzykov
Could you advise on the best practice for using named tuples in Scala with Spark RDDs? Currently we can access a field by number in a tuple: RDD.map{_._2}. But we would like to write a construction such as: RDD.map{_.itemId}. This would be helpful for debugging purposes.
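A common way to get name-based access is to map each tuple into a case class; the field names (userId, itemId, price), the delimiter, and the input path below are illustrative, and a SparkContext sc (e.g. from spark-shell) is assumed:

// A case class gives every field a name, which also reads better while debugging.
case class OrderItem(userId: Long, itemId: Long, price: Double)

val orders = sc.textFile("hdfs:///data/orders")   // illustrative input
  .map(_.split("\t"))
  .map(f => OrderItem(f(0).toLong, f(1).toLong, f(2).toDouble))

// Access fields by name instead of by position (_._2):
val itemIds = orders.map(_.itemId)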

SPARK 1.1.0 on yarn-cluster and external JARs

2014-09-25 Thread rzykov
We build some Spark jobs with external jars. I compile the jobs by including them in one assembly, but I am looking for an approach that puts all external jars into HDFS. We have already put the Spark jar in an HDFS folder and set the SPARK_JAR variable. What is the best way to do that for the other external jars?
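One possible approach, sketched under the assumption of an existing SparkContext sc and illustrative HDFS paths: register the jars at runtime with SparkContext.addJar, which accepts hdfs:// URIs (spark-submit's --jars option takes the same kind of URIs):

// Register external jars that already live in HDFS with the running job.
// Paths and jar names are illustrative.
Seq(
  "hdfs:///libs/algebird-core_2.10-0.8.0.jar",
  "hdfs:///libs/joda-time-2.4.jar"
).foreach(sc.addJar)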

Re: Computing mean and standard deviation by key

2014-09-12 Thread rzykov
Is it possible to use DoubleRDDFunctions (https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html) for calculating the mean and standard deviation of paired RDDs (key, value)? Right now I'm using an approach based on reduceByKey, but I want to make my code more concise and readable.
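One way to keep this concise with plain pair-RDD operations is to track (count, sum, sum of squares) per key in a single reduceByKey pass and derive the mean and population standard deviation from those; the sample data is illustrative and a SparkContext sc is assumed:

val data = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("a", 4.0), ("b", 10.0), ("b", 20.0)))

val statsByKey = data
  .mapValues(v => (1L, v, v * v))   // (count, sum, sum of squares)
  .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }
  .mapValues { case (n, sum, sumSq) =>
    val mean = sum / n
    val variance = math.max(sumSq / n - mean * mean, 0.0)   // guard against tiny FP negatives
    (mean, math.sqrt(variance))
  }

statsByKey.collect().foreach { case (k, (mean, sd)) => println(s"$k: mean=$mean stdDev=$sd") }

Another option with the same shape is aggregateByKey with org.apache.spark.util.StatCounter, which exposes mean and stdev directly.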