Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Srinath C
Hi Aakash, Glad to know that repartition helped! The overall number of tasks actually depends on the kind of operations you are performing and also on how the DF is partitioned. I can't comment on the former but can provide some pointers on the latter. The default value of spark.sql.shuffle.partitions is 200.
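As a minimal illustration of that pointer (the value 8 and the DataFrame name df are placeholders, not from the thread), the shuffle partition count can be tuned globally or a specific DataFrame can be repartitioned explicitly:

  // Lower the default of 200 shuffle partitions for a small dataset (illustrative value)
  spark.conf.set("spark.sql.shuffle.partitions", "8")

  // Or control the partitioning of one DataFrame directly
  val evenlySpread = df.repartition(8)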

Re: Building SparkML vectors from long data

2018-06-12 Thread Nathan Kronenfeld
I don't know if this is the best way or not, but:

val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
val indexModel = indexer.fit(data)
val indexedData = indexModel.transform(data)
val variables = indexModel.labels.length
val toSeq = udf((a: Double, b: Double) => Seq(a,
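The preview above is cut off by the archive. A self-contained sketch of one way the same idea could be completed (assuming Spark 2.x ML APIs and a SparkSession named spark; the final assembly into sparse vectors is an assumption, not Nathan's original code):

  import org.apache.spark.ml.feature.StringIndexer
  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.functions._
  import spark.implicits._

  // Long-format input: one row per (ID, variable, value)
  val data = Seq(
    ("A", "v1", 1.0), ("A", "v2", 2.0),
    ("B", "v1", 1.5), ("B", "v3", -1.0)
  ).toDF("ID", "vr", "value")

  // Map each variable name to a numeric index
  val indexer = new StringIndexer().setInputCol("vr").setOutputCol("vrIdx")
  val indexModel = indexer.fit(data)
  val indexedData = indexModel.transform(data)
  val numVariables = indexModel.labels.length

  // Collect (index, value) pairs per ID and assemble a sparse ML vector
  val toVector = udf { (pairs: Seq[Row]) =>
    val elems = pairs.map(r => (r.getDouble(0).toInt, r.getDouble(1))).sortBy(_._1)
    Vectors.sparse(numVariables, elems)
  }

  val vectors = indexedData
    .groupBy("ID")
    .agg(collect_list(struct(col("vrIdx"), col("value"))).as("pairs"))
    .withColumn("features", toVector(col("pairs")))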

Building SparkML vectors from long data

2018-06-12 Thread Patrick McCarthy
I work with a lot of data in a long format, cases in which an ID column is repeated, followed by a variable and a value column like so:

+---+-----+-------+
|ID | var | value |
+---+-----+-------+
| A | v1  | 1.0   |
| A | v2  | 2.0   |
| B | v1  | 1.5   |
| B | v3  | -1.0  |
+---+-----+-------+
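Not part of the thread, but a hedged sketch of one common alternative for widening such long data before building vectors: pivot on the variable column and then run VectorAssembler (longDf stands for the long-format DataFrame above; filling missing variables with 0.0 is an assumption):

  import org.apache.spark.ml.feature.VectorAssembler
  import org.apache.spark.sql.functions.first

  // Pivot long rows into one wide row per ID
  val wide = longDf.groupBy("ID").pivot("var").agg(first("value")).na.fill(0.0)

  // Assemble every non-ID column into a single ML vector column
  val assembler = new VectorAssembler()
    .setInputCols(wide.columns.filter(_ != "ID"))
    .setOutputCol("features")
  val withVectors = assembler.transform(wide)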

Writing rows directly in Tungsten format into memory

2018-06-12 Thread Vadim Semenov
Is there a way to write rows directly into off-heap memory in the Tungsten format, bypassing object creation? I have a lot of rows, and right now I'm creating objects, which then get encoded, but because of the number of rows this creates significant pressure on the GC. I'd like to avoid creating
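For reference, one internal mechanism often pointed to here is Catalyst's UnsafeProjection, which encodes an InternalRow into the Tungsten UnsafeRow binary layout. It is an unstable internal API, so this is a rough sketch rather than a supported approach:

  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
  import org.apache.spark.sql.types._
  import org.apache.spark.unsafe.types.UTF8String

  val schema = StructType(Seq(
    StructField("id", LongType),
    StructField("name", StringType)))

  // Converts a generic InternalRow into the contiguous UnsafeRow format.
  // Note: the projection reuses its internal buffer, so copy() the row before retaining it.
  val toUnsafe = UnsafeProjection.create(schema)
  val unsafeRow = toUnsafe(InternalRow(1L, UTF8String.fromString("a"))).copy()
  println(unsafeRow.getSizeInBytes)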

Re: testing frameworks

2018-06-12 Thread Ryan Adams
We use spark testing base for unit testing. These tests execute on a very small amount of data that covers all paths the code can take (or most paths anyway). https://github.com/holdenk/spark-testing-base For integration testing we use automated routines to ensure that aggregate values match an
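A small illustration of the spark-testing-base style described above (the suite and the aggregation under test are hypothetical; DataFrameSuiteBase supplies a local SparkSession for the test):

  import com.holdenkarau.spark.testing.DataFrameSuiteBase
  import org.apache.spark.sql.functions.sum
  import org.scalatest.FunSuite

  class AggregationSpec extends FunSuite with DataFrameSuiteBase {
    test("sums values per key on a tiny fixture covering both keys") {
      import spark.implicits._

      val input    = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
      val expected = Seq(("a", 3L), ("b", 3L)).toDF("key", "total")

      // The aggregation under test (hypothetical)
      val result = input.groupBy("key").agg(sum("value").as("total")).orderBy("key")

      assertDataFrameEquals(expected, result)
    }
  }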

Re: testing frameworks

2018-06-12 Thread Lars Albertsson
Hi, I wrote this answer to the same question a couple of years ago: https://www.mail-archive.com/user%40spark.apache.org/msg48032.html I have made a couple of presentations on the subject. Slides and video are linked on this page: http://www.mapflat.com/presentations/ You can find more material

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
Hi Srinath, Thanks for such an elaborate reply. How do I reduce the overall number of tasks? I found that simply repartitioning the CSV file into 8 parts and converting it to Parquet with Snappy compression helped not only in an even distribution of the tasks across all nodes, but also helped in
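For concreteness, the repartition-and-convert step described here could look roughly like this (the partition count of 8 is from the message; the paths and read options are placeholders):

  // Read the original CSV, spread it over 8 partitions, and persist as Snappy-compressed Parquet
  val csvDf = spark.read.option("header", "true").csv("/path/to/input.csv")
  csvDf
    .repartition(8)
    .write
    .option("compression", "snappy")
    .parquet("/path/to/output.parquet")

  // Downstream jobs then read the evenly partitioned Parquet instead of the CSV
  val df = spark.read.parquet("/path/to/output.parquet")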

Scala Partition Question

2018-06-12 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hello, can I do complex data manipulations inside the groupBy function? I.e., I want to group my whole dataframe by a column and then do some processing for each group.
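One way to express arbitrary per-group processing on a Dataset, as a hedged sketch (not from the thread; assumes a SparkSession named spark and a simple record type):

  import spark.implicits._

  case class Record(key: String, value: Double)
  val ds = Seq(Record("a", 1.0), Record("a", 2.0), Record("b", 3.0)).toDS()

  // groupByKey + mapGroups hands you an iterator over each group's rows,
  // so any custom per-group logic can run here (it executes on the executors).
  val perGroup = ds
    .groupByKey(_.key)
    .mapGroups { (key, rows) =>
      val values = rows.map(_.value).toSeq
      (key, values.sum / values.size)   // e.g. a per-group mean
    }
    .toDF("key", "mean")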

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Srinath C
Hi Aakash, Can you check the logs for Executor ID 0? It was restarted on worker 192.168.49.39, perhaps due to OOM or something similar. I also observed that the number of tasks is high and unevenly distributed across the workers. Check if there are too many partitions in the RDD and tune it using
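To make the "too many partitions" check concrete (illustrative only; df stands for the DataFrame feeding the skewed stage):

  // How many partitions the DataFrame currently has
  println(df.rdd.getNumPartitions)

  // Rough row count per partition, to spot skew across tasks
  df.rdd
    .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
    .collect()
    .foreach { case (idx, count) => println(s"partition $idx -> $count rows") }

  // If the count is too high or skewed, repartition (or coalesce to shrink without a full shuffle)
  val retuned = df.repartition(8)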

Re: [Spark Optimization] Why is one node getting all the pressure?

2018-06-12 Thread Aakash Basu
Yes, but when I increased my executor memory, the Spark job halted after running a few steps, even though the executor wasn't dying. Data: 60,000 data points, 230 columns (60 MB of data). Any input on why it behaves like that? On Tue, Jun 12, 2018 at 8:15 AM, Vamshi Talla wrote: >