I terminated the old job and started a new one. The Spark Streaming job has now been running for 2 hours, and when I use lsof I do not see many files related to the Spark job.
BTW, I am running Spark Streaming in local[2] mode. The batch interval is 5 seconds and each batch reads around 50 lines.

On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger <c...@koeninger.org> wrote:

Did you use lsof to see what files were opened during the job?

On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

The data ingestion is in the outermost portion of the foreachRDD block. Although I now close the JDBC statement, the same exception happened again. It seems it is not related to the data ingestion part.

On Wed, Apr 29, 2015 at 8:35 PM, Cody Koeninger <c...@koeninger.org> wrote:

Use lsof to see what files are actually being held open.

That stacktrace looks to me like it's from the driver, not the executors. Where in foreach is it being called? The outermost portion of foreachRDD runs in the driver; the innermost portion runs in the executors. From the docs:

https://spark.apache.org/docs/latest/streaming-programming-guide.html

    dstream.foreachRDD { rdd =>
      val connection = createNewConnection()  // executed at the driver
      rdd.foreach { record =>
        connection.send(record)  // executed at the worker
      }
    }

@td I've specifically looked at Kafka socket connections for the standard 1.3 code vs my branch that has cached connections. The standard non-caching code has very short-lived connections. I've had jobs running for a month at a time, including ones writing to MySQL. Not saying it's impossible, but I'd think we need some evidence before speculating this has anything to do with it.
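For reference, the same programming guide also sketches a connection-per-partition variant, roughly as below, where the connection is created on the executor and closed when the partition finishes; createNewConnection and send are placeholder names from the docs example, not real APIs:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        // runs on the executor: one connection per partition, closed when the partition is done
        val connection = createNewConnection()
        partitionOfRecords.foreach(record => connection.send(record))
        connection.close()
      }
    }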
On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

This function is called in foreachRDD. I think it should be running in the executors. I added statement.close() to the code and it is running. I will let you know if this fixes the issue.

On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das <t...@databricks.com> wrote:

Is the function ingestToMysql running on the driver or on the executors? Accordingly, you can try debugging while running in a distributed manner, with and without calling the function.

If you don't get "too many open files" without calling ingestToMysql(), the problem is likely to be in ingestToMysql(). If you get the problem even without calling ingestToMysql(), then the problem may be in Kafka. If the problem is occurring in the driver, then it is in the DirectKafkaInputDStream code. If the problem is occurring in the executors, then it is in KafkaRDD.

TD

On Wed, Apr 29, 2015 at 2:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Maybe add statement.close() in a finally block?

Streaming / Kafka experts may have better insight.

On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

Thanks for the suggestion. I ran the command and the limit is 1024.

Based on my understanding, the connector to Kafka should not open so many files. Do you think there is possible socket leakage? BTW, in every batch, which is 5 seconds, I output some results to MySQL:

    def ingestToMysql(data: Array[String]) {
      val url = "jdbc:mysql://localhost:3306/realtime?user=root&password=123"
      var sql = "insert into loggingserver1 values "
      data.foreach(line => sql += line)
      sql = sql.dropRight(1)
      sql += ";"
      logger.info(sql)
      var conn: java.sql.Connection = null
      try {
        conn = DriverManager.getConnection(url)
        val statement = conn.createStatement()
        statement.executeUpdate(sql)
      } catch {
        case e: Exception => logger.error(e.getMessage())
      } finally {
        if (conn != null) {
          conn.close
        }
      }
    }

I am not sure whether the leakage originates from the Kafka connector or the SQL connections.

Bill
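A sketch of the same function with the Statement also closed in the finally block, along the lines Ted Yu suggests above and Bill later reports adding; logger is assumed to be the same logger used in the snippet above:

    import java.sql.{Connection, DriverManager, Statement}

    def ingestToMysql(data: Array[String]): Unit = {
      val url = "jdbc:mysql://localhost:3306/realtime?user=root&password=123"
      // Build the INSERT statement from the pre-formatted value tuples, as above
      var sql = "insert into loggingserver1 values "
      data.foreach(line => sql += line)
      sql = sql.dropRight(1) + ";"
      logger.info(sql)
      var conn: Connection = null
      var statement: Statement = null
      try {
        conn = DriverManager.getConnection(url)
        statement = conn.createStatement()
        statement.executeUpdate(sql)
      } catch {
        case e: Exception => logger.error(e.getMessage())
      } finally {
        // Close the statement as well as the connection so neither leaks a descriptor
        if (statement != null) statement.close()
        if (conn != null) conn.close()
      }
    }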
On Wed, Apr 29, 2015 at 2:12 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Can you run the command 'ulimit -n' to see the current limit?

To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf

Cheers

On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:

Hi all,

I am using the direct approach to receive real-time data from Kafka, as described in the following link:

https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html

My code follows the word count direct example:

https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala

After around 12 hours, I got the following error messages in the Spark log:

15/04/29 20:18:10 ERROR JobScheduler: Error generating jobs for time 1430338690000 ms
org.apache.spark.SparkException: ArrayBuffer(java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files)
    at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:94)
    at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:116)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
    at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:239)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:237)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:237)
    at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:174)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:85)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:83)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Thanks for the help.

Bill
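As a supplement to the lsof and ulimit -n checks discussed above, a rough way to watch the descriptor count from inside the job is to count the entries under /proc/self/fd; a minimal sketch, assuming a Linux host and the same logger as in the snippets above:

    import java.io.File

    // Number of file descriptors currently open in this JVM process (Linux only).
    def openFdCount(): Int = {
      val fds = new File("/proc/self/fd").list()
      if (fds == null) -1 else fds.length
    }

    // For example, log it from the driver once per batch:
    // dstream.foreachRDD { _ => logger.info(s"open fds: ${openFdCount()}") }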