Re: Problems with JobScheduler

2015-07-31 Thread Guillermo Ortiz
It doesn't make sense to me, because the other cluster processes all the data in less than a second. Anyway, I'm going to set that parameter. 2015-07-31 0:36 GMT+02:00 Tathagata Das t...@databricks.com: Yes, and that is indeed the problem. It is trying to process all the data in Kafka, and

Re: Problems with JobScheduler

2015-07-31 Thread Guillermo Ortiz
I detected the error. The final step is to index data in Elasticsearch, and the Elasticsearch in one of the clusters is overwhelmed and doesn't work correctly. I pointed the cluster that wasn't working at another ES and don't get any delay. Sorry, it wasn't related to Spark! 2015-07-31
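
The indexing code itself isn't shown in the thread; a minimal sketch of the kind of final step being described, assuming the elasticsearch-hadoop (elasticsearch-spark) connector, a Map-based record type, and a made-up index name:

    import org.apache.spark.streaming.dstream.DStream
    import org.elasticsearch.spark._   // elasticsearch-hadoop connector; assumes es.nodes is set in the SparkConf

    // Hypothetical final step: bulk-index each micro-batch into Elasticsearch.
    // If the ES cluster is overwhelmed, the bulk requests block here, so the
    // whole batch (and the JobScheduler queue behind it) gets delayed.
    def indexToEs(stream: DStream[Map[String, String]]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.saveToEs("metrics/events")   // "metrics/events" is a placeholder index/type
      }
    }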

Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I have a problem with the JobScheduler. I have executed the same code in two clusters. I read from three topics in Kafka with DirectStream, so I have three tasks. I have checked YARN and there are no other jobs launched. In the cluster where I have trouble I get these logs: 15/07/30 14:32:58 INFO
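
The job itself isn't included in the thread; a minimal sketch of a direct-stream setup like the one described, assuming Spark 1.x with the Kafka 0.8 integration (broker address, topic names and batch interval are placeholders, not values from the thread):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("MetricsSpark")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092")
    val topics = Set("topicA", "topicB", "topicC")   // one partition per topic => three tasks per batch

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      // process / index the micro-batch here (foreachRDD at MetricsSpark.scala in the logs)
    }

    ssc.start()
    ssc.awaitTermination()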

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I read about the maxRatePerPartition parameter; I haven't set it. Could that be the problem? Although this wouldn't explain why it doesn't work in one of the clusters. 2015-07-30 14:47 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com: They just share the Kafka, the rest of the resources are
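
For reference, the rate limit is set through the Spark configuration; a minimal sketch, with the limit and batch interval as example numbers rather than values from the thread:

    import org.apache.spark.SparkConf

    // spark.streaming.kafka.maxRatePerPartition is records per partition per second.
    // With one partition per topic and, say, 5-second batches, a value of 1000
    // caps each topic at 5000 records per batch instead of "everything in Kafka".
    val conf = new SparkConf()
      .setAppName("MetricsSpark")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")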

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
They just share the Kafka; the rest of the resources are independent. I tried stopping one cluster and running only the cluster that isn't working, but the same thing happens. 2015-07-30 14:41 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com: I have a problem with the JobScheduler. I have executed the same

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
Just so I'm clear, the difference in timing you're talking about is this: 15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at MetricsSpark.scala:67, took 60.391761 s 15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at MetricsSpark.scala:67, took 0.531323 s Are

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I have three topics with one partition each, so each job runs on one topic. 2015-07-30 16:20 GMT+02:00 Cody Koeninger c...@koeninger.org: Just so I'm clear, the difference in timing you're talking about is this: 15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at

Re: Problems with JobScheduler

2015-07-30 Thread Cody Koeninger
If the jobs are running on different topicpartitions, what's different about them? Is one of them 120x the throughput of the other, for instance? You should be able to eliminate cluster config as a difference by running the same topic partition on the different clusters and comparing the

Re: Problems with JobScheduler

2015-07-30 Thread Tathagata Das
Yes, and that is indeed the problem. It is trying to process all the data in Kafka, and therefore taking 60 seconds. You need to set the rate limits for that. On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger c...@koeninger.org wrote: If you don't set it, there is no maximum rate, it will get

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
The difference is that one receives more data than the other two. I can pass the topics through parameters, so I could execute the code with one topic at a time and figure out which topic it is, although I guess it's the topic which gets more data. Anyway, it's pretty weird, those delays in
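
A minimal sketch of what "passing the topics through parameters" could look like, so the job can be launched against a single topic to isolate the slow one; the object name follows MetricsSpark.scala from the logs, and the argument handling is an assumption:

    object MetricsSpark {
      def main(args: Array[String]): Unit = {
        // e.g. spark-submit --class MetricsSpark app.jar topicA
        val topics = args.toSet
        require(topics.nonEmpty, "pass at least one Kafka topic as an argument")
        // ... build the StreamingContext and call
        // KafkaUtils.createDirectStream(ssc, kafkaParams, topics) with just these topics
      }
    }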