Unsubscribe

2022-11-07 Thread Pedro Tuero
Unsubscribe

Re: Java : Testing RDD aggregateByKey

2021-08-23 Thread Pedro Tuero
> On Thu, Aug 19, 2021 at 5:43 PM Pedro Tuero wrote: >> Hi, I'm sorry, the problem was really silly: in the test the number of

Re: Java : Testing RDD aggregateByKey

2021-08-19 Thread Pedro Tuero
> On Tue, Aug 17, 2021 at 4:14 PM Pedro Tuero wrote: >> Context: spark-core_2.12-3.1.1 >> Testing with Maven and Eclipse. >> I'm modifying a project and a test stops working as expected. >> The difference is

Java : Testing RDD aggregateByKey

2021-08-17 Thread Pedro Tuero
Context: spark-core_2.12-3.1.1. Testing with Maven and Eclipse. I'm modifying a project and a test stops working as expected. The difference is in the parameters passed to the aggregateByKey function of JavaPairRDD. The JavaSparkContext is created this way: new JavaSparkContext(new SparkConf()
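The test itself is not included in the snippet, so the following is only a rough, self-contained sketch of exercising aggregateByKey on a JavaPairRDD with a local context; the class name, sample data, and master setting are assumptions, not the thread's actual code.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Map;

public class AggregateByKeyTest {
    public static void main(String[] args) {
        // Local context for a test; the thread only shows "new JavaSparkContext(new SparkConf()...".
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[2]").setAppName("aggregateByKey-test"));

        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

        // Three-argument form: zero value, sequence function, combiner function.
        // A sibling overload also takes the number of partitions as the second
        // argument, which is the kind of parameter difference the thread describes.
        JavaPairRDD<String, Integer> sums = pairs.aggregateByKey(0, Integer::sum, Integer::sum);

        Map<String, Integer> result = sums.collectAsMap();
        System.out.println(result); // e.g. {a=3, b=3}
        sc.close();
    }
}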

Coalesce vs reduce operation parameter

2021-03-18 Thread Pedro Tuero
I was reviewing a Spark Java application running on AWS EMR. The code was like: RDD.reduceByKey(func).coalesce(number).saveAsTextFile() That stage took hours to complete. I changed it to: RDD.reduceByKey(func, number).saveAsTextFile() and it now takes less than 2 minutes, and the final output is
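As a hedged, self-contained sketch of the two shapes being compared (the sample data, output paths, and partition count are placeholders, not the application's real values):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class ReduceThenWrite {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[2]").setAppName("reduce-vs-coalesce"));
        int number = 4; // placeholder for the partition count used on the cluster

        JavaPairRDD<String, Long> counts = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1L), new Tuple2<>("b", 2L), new Tuple2<>("a", 3L)));

        // Shape from the original code: reduce with the default partitioner,
        // then coalesce the result (the thread reports this took hours).
        counts.reduceByKey(Long::sum)
              .coalesce(number)
              .saveAsTextFile("/tmp/out-slow");

        // Rewritten shape from the thread: pass the partition count directly
        // to reduceByKey (reported to finish in under two minutes).
        counts.reduceByKey(Long::sum, number)
              .saveAsTextFile("/tmp/out-fast");

        sc.close();
    }
}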

Re: Spark 2.4 partitions and tasks

2019-02-25 Thread Pedro Tuero
Good question. What I have read is that Spark is not a magician and can't know how many tasks would be best for your input, so it can get it wrong. Spark sets the default parallelism to twice the number of cores on the cluster. In my jobs, it seemed that using the parallelism inherited from the input
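For completeness, a minimal sketch of pinning the parallelism explicitly instead of relying on the inherited or cluster-wide default; the value below is just the partition count mentioned elsewhere in this thread, used here as a placeholder.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExplicitParallelism {
    public static void main(String[] args) {
        // Hypothetical value: set the default parallelism explicitly instead of
        // relying on the cluster default (described above as twice the core count)
        // or on whatever the input partitioning happens to be.
        SparkConf conf = new SparkConf()
                .setAppName("explicit-parallelism")
                .set("spark.default.parallelism", "5580");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.close();
    }
}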

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
* It is not getPartitions() but getNumPartitions(). On Tue, Feb 12, 2019 at 13:08, Pedro Tuero (tuerope...@gmail.com) wrote: > And this is happening in every job I run. It is not just one case. If I > add a forced repartition it works fine, even better than before. But

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
>> On Fri, Feb 8, 2019 at 5:09 PM Pedro Tuero wrote:

Re: Spark 2.4 partitions and tasks

2019-02-12 Thread Pedro Tuero
> On Fri, Feb 8, 2019 at 5:09 PM Pedro Tuero wrote: > >> I did a repartition to 1 (hardcoded) before the keyBy and it ends in >> 1.2 minutes. >> The questions remain open, because I don't want to hardcode parallelism. >> >> On Fri, Feb 8

Re: Spark 2.4 partitions and tasks

2019-02-08 Thread Pedro Tuero
I did a repartition to 1 (hardcoded) before the keyBy and it ends in 1.2 minutes. The questions remain open, because I don't want to hardcode parallelism. On Fri, Feb 8, 2019 at 12:50, Pedro Tuero (tuerope...@gmail.com) wrote: > 128 is the default parallelism defi
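A self-contained sketch of that workaround, with a placeholder partition count, since keyBy itself takes no numPartitions argument; the data and key function are assumptions for illustration only.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RepartitionBeforeKeyBy {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[2]").setAppName("repartition-keyBy"));

        JavaRDD<String> records = sc.parallelize(Arrays.asList("a:1", "b:2", "a:3"));

        // keyBy has no numPartitions argument, so the partition count is forced
        // on the parent RDD instead. The hardcoded value here is a placeholder,
        // which is exactly the hardcoding the thread wants to avoid.
        int parallelism = 4;
        JavaPairRDD<String, String> keyed = records
                .repartition(parallelism)
                .keyBy(line -> line.split(":")[0]);

        System.out.println(keyed.getNumPartitions()); // 4
        sc.close();
    }
}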

Re: Spark 2.4 partitions and tasks

2019-02-08 Thread Pedro Tuero
128 is the default parallelism defined for the cluster. The question now is why the keyBy operation is using the default parallelism instead of the number of partitions of the RDD created by the previous step (5580). Any clues? On Thu, Feb 7, 2019 at 15:30, Pedro Tuero (tuerope...@gmail.com

Re: Aws

2019-02-08 Thread Pedro Tuero
>> I tested the maximizeResourceAllocation option. When it's enabled, it seems >> Spark utilizes the cores fully. However, the performance is not so >> different from the default setting. >> >> I'm considering using s3-distcp for uploading files. And I think >> table(data

Spark 2.4 partitions and tasks

2019-02-07 Thread Pedro Tuero
Hi, I am running a job in Spark (on AWS EMR) and some stages are taking a lot longer with Spark 2.4 than with Spark 2.3.1: Spark 2.4: [image: image.png] Spark 2.3.1: [image: image.png] With Spark 2.4, the keyBy operation takes more than 10X what it took with Spark 2.3.1. It seems to be

Re: Aws

2019-02-01 Thread Pedro Tuero
performance tuning. > > Do you configure dynamic allocation? > > FYI: > > https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation > > I've not tested it yet. I guess spark-submit needs to specify the number of > executors. > > Regards, > Hi

Aws

2019-01-31 Thread Pedro Tuero
Hi guys, I used to run Spark jobs on AWS EMR. Recently I switched from EMR release label 5.16 to 5.20 (which uses Spark 2.4.0). I've noticed that a lot of steps are taking longer than before. I think it is related to the automatic configuration of cores per executor. In version 5.16, some executors took
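The thread doesn't show the job configuration; as a sketch only, the per-executor settings that EMR would otherwise pick automatically can be pinned explicitly. The values below are placeholders, and these properties are normally supplied via spark-submit or the EMR spark-defaults classification rather than hardcoded.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ExplicitExecutorCores {
    public static void main(String[] args) {
        // Placeholder values: pin cores, memory, and instance count per executor
        // explicitly instead of relying on the EMR release's automatic defaults.
        SparkConf conf = new SparkConf()
                .setAppName("explicit-executor-cores")
                .set("spark.executor.cores", "4")
                .set("spark.executor.memory", "8g")
                .set("spark.executor.instances", "10");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.close();
    }
}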

Broadcasted Object is empty in executors.

2017-05-22 Thread Pedro Tuero
Hi, I'm using Spark 2.1.0 on AWS EMR, with the Kryo serializer. I'm broadcasting a Java class: public class NameMatcher { private static final Logger LOG = LoggerFactory.getLogger(NameMatcher.class); private final Splitter splitter; private final SetMultimap itemsByWord;
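Only a fragment of NameMatcher is quoted, so the sketch below substitutes a simplified index class and shows the broadcast-plus-Kryo-registrator wiring; the class names, data, and local master are assumptions. A broadcast that shows up empty on executors often comes down to fields the serializer cannot round-trip, such as the Guava SetMultimap in the original class.

import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.serializer.KryoRegistrator;

import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BroadcastWithKryo {

    // Simplified stand-in for the NameMatcher in the thread; the real class
    // wraps a Guava SetMultimap, which may need its own Kryo serializer.
    public static class NameIndex implements Serializable {
        private final Map<String, String> itemsByWord = new HashMap<>();
        public void put(String word, String item) { itemsByWord.put(word, item); }
        public String lookup(String word) { return itemsByWord.get(word); }
        public int size() { return itemsByWord.size(); }
    }

    // Register the broadcast class with Kryo so it round-trips correctly.
    public static class Registrator implements KryoRegistrator {
        @Override
        public void registerClasses(Kryo kryo) {
            kryo.register(NameIndex.class);
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]").setAppName("broadcast-kryo")
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrator", Registrator.class.getName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        NameIndex index = new NameIndex();
        index.put("foo", "item-1");

        Broadcast<NameIndex> broadcast = sc.broadcast(index);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("foo", "bar"));
        // If the broadcast arrives empty on the executors, this prints a size of 0 there.
        words.foreach(w -> System.out.println(
                broadcast.value().size() + " -> " + broadcast.value().lookup(w)));

        sc.close();
    }
}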

Kryo Exception: NegativeArraySizeException

2016-11-24 Thread Pedro Tuero
Hi, I'm trying to broadcast a map of 2.6 GB but I'm getting a weird Kryo exception. I tried to set -XX:hashCode=0 on the executors and the driver, following this comment: https://github.com/broadinstitute/gatk/issues/1524#issuecomment-189368808 But it didn't change anything. Are you aware of this
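For reference, a sketch of how the JVM flag from that GitHub comment and Kryo's buffer ceiling could be wired in. Whether either helps here is uncertain: spark.kryoserializer.buffer.max tops out at 2047m, below the ~2.6 GB object in the thread, and the driver's extraJavaOptions normally has to be supplied via spark-submit or spark-defaults rather than in code, since the driver JVM is already running by then.

import org.apache.spark.SparkConf;

public class KryoTuning {
    public static void main(String[] args) {
        // Sketch only: pass the -XX:hashCode=0 workaround from the linked comment
        // to driver and executors, and raise Kryo's serialization buffer ceiling.
        SparkConf conf = new SparkConf()
                .setAppName("kryo-tuning")
                .set("spark.driver.extraJavaOptions", "-XX:hashCode=0")
                .set("spark.executor.extraJavaOptions", "-XX:hashCode=0")
                .set("spark.kryoserializer.buffer.max", "2047m");
        // ... create the JavaSparkContext with this conf and run the job ...
    }
}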

Broadcasting Complex Custom Objects

2016-10-17 Thread Pedro Tuero
Hi guys, I'm trying to do a job with Spark, using Java. The thing is I need to have an index of words of about 3 GB on each machine, so I'm trying to broadcast custom objects to represent the index and the interface to it. I'm using Java standard serialization, so I tried to implement
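The message is cut off at "I tried to implement", presumably Serializable on the index classes. As a minimal, hypothetical sketch of what plain Java serialization requires of such a class (the name and fields are made up for illustration):

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical version of a broadcastable word index using plain
// Java serialization: the class and everything it references (including any
// interface or strategy objects it holds) must implement Serializable, or the
// broadcast will fail with a NotSerializableException.
public class WordIndex implements Serializable {
    private static final long serialVersionUID = 1L;

    private final Map<String, String> entries = new HashMap<>();

    public void add(String word, String payload) { entries.put(word, payload); }
    public String get(String word) { return entries.get(word); }
}

The broadcast call itself would then be sc.broadcast(index), the same wiring as in the 2017 thread above.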