Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
The data ingestion is in the outermost portion of the foreachRDD block. Although I now close the JDBC statement, the same exception happened again, so it seems unrelated to the data ingestion part. On Wed, Apr 29, 2015 at 8:35 PM, Cody Koeninger c...@koeninger.org wrote: Use lsof to see what

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Cody Koeninger
Did you use lsof to see what files were opened during the job? On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: The data ingestion is in the outermost portion of the foreachRDD block. Although I now close the JDBC statement, the same exception happened again. It seems it
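
A typical way to check this (a sketch; the driver PID is a placeholder, and lsof flags vary slightly by platform):

    # count file descriptors held by the driver JVM (substitute the real PID)
    lsof -p <driver-pid> | wc -l
    # -a ANDs the selections: only that process's network sockets, to spot a socket leak
    lsof -a -p <driver-pid> -i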

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
I terminated the old job and started a new one. The Spark Streaming job has now been running for 2 hours, and when I use lsof, I do not see many files related to the Spark job. BTW, I am running Spark Streaming in local[2] mode. The batch interval is 5 seconds and it has around 50 lines to read

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding, the connector to Kafka should not open so many files. Do you think there is a possible socket leak? BTW, in every batch, which is 5 seconds, I output some results to MySQL: def ingestToMysql(data:
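
The definition is truncated above, so the following is only a hypothetical sketch of the leak being suspected: a per-batch ingestion function that opens a JDBC connection and never closes it, so every batch pins another socket descriptor. Names, signature, and the connection URL are illustrative, not from the thread:

    // Hypothetical sketch -- NOT the actual ingestToMysql from the thread.
    def ingestToMysql(data: Iterator[String]): Unit = {
      val conn = java.sql.DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/db", "user", "pass") // placeholder URL/credentials
      val stmt = conn.createStatement()
      // a PreparedStatement would be safer; a plain Statement keeps the sketch short
      data.foreach(row => stmt.executeUpdate(s"INSERT INTO t VALUES ('$row')"))
      // missing stmt.close() / conn.close(): each 5-second batch leaks one socket
    }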

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Can you run the command 'ulimit -n' to see the current limit? To configure ulimit settings on Ubuntu, edit */etc/security/limits.conf* Cheers On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using the direct approach to receive real-time data from
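
Checking and raising the limit might look like this (illustrative values; the right numbers depend on the workload):

    # show the current per-process open-file limit
    ulimit -n
    # raise it for the current shell session only
    ulimit -n 4096

    # persistent change for a user, in /etc/security/limits.conf:
    #   someuser  soft  nofile  4096
    #   someuser  hard  nofile  8192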

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Maybe add statement.close() in a finally block? Streaming / Kafka experts may have better insight. On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding, the connector to Kafka
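
The suggested pattern, as a minimal Scala sketch (the surrounding code, URL, and SQL are assumed, since only the close() call is discussed in the thread):

    val conn = java.sql.DriverManager.getConnection(url, user, pass) // placeholders
    val stmt = conn.createStatement()
    try {
      stmt.executeUpdate(insertSql) // the batch's inserts
    } finally {
      stmt.close() // release the statement even if the insert throws...
      conn.close() // ...and close the connection so its socket is freed
    }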

Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Hi all, I am using the direct approach described in the following link to receive real-time data from Kafka: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example:
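
For reference, the direct word count example from that guide looks roughly like this (condensed; brokers and topics are assumed to be defined, and the package names are those of Spark 1.3):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val kafkaParams = Map("metadata.broker.list" -> brokers) // broker list assumed
    val messages = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    messages.map(_._2)           // the message value
      .flatMap(_.split(" "))     // split into words
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()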

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Is the function ingestToMysql running on the driver or on the executors? Accordingly, you can try debugging while running in a distributed manner, with and without calling the function. If you don't get too many open files without calling ingestToMysql(), the problem is likely to be in

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
This function is called in foreachRDD. I think it should be running in the executors. I added statement.close() to the code and it is running. I will let you know if this fixes the issue. On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote: Is the function ingestToMysql

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Also cc'ing Cody. @Cody, maybe there is a reason for doing connection pooling even if there is no performance difference. TD On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote: Is the function ingestToMysql running on the driver or on the executors? Accordingly you can
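
A pool keeps a few long-lived connections per executor JVM instead of opening one per batch or per partition. A minimal sketch, with a hypothetical singleton pool (nothing here is code from the thread; url, user, and pass are placeholders):

    import java.sql.{Connection, DriverManager}
    import java.util.concurrent.ConcurrentLinkedQueue

    // Scala objects are per-JVM singletons, so this is created lazily once per executor.
    object ConnectionPool {
      private val url  = "jdbc:mysql://host:3306/db" // placeholder
      private val idle = new ConcurrentLinkedQueue[Connection]()
      def borrow(): Connection =
        Option(idle.poll()).getOrElse(DriverManager.getConnection(url, "user", "pass"))
      def giveBack(c: Connection): Unit = idle.offer(c)
    }

Used from foreachPartition, the same handful of connections is reused across batches, so the descriptor count stays flat instead of growing.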

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Cody Koeninger
Use lsof to see what files are actually being held open. That stacktrace looks to me like it's from the driver, not the executors. Where in foreachRDD is it being called? The outermost portion of foreachRDD runs in the driver; the innermost portion runs in the executors. From the docs:
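
The quote is cut off here, but the pattern the Spark Streaming programming guide describes is roughly the following (paraphrased; createNewConnection and send are the guide's placeholders, not real APIs). Naively creating the connection in the outermost portion puts it on the driver:

    dstream.foreachRDD { rdd =>
      val connection = createNewConnection() // outermost: runs on the driver, per batch
      rdd.foreach { record =>
        connection.send(record) // innermost: runs on the executors
      }
    }

whereas the guide recommends creating it per partition, on the executors:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = createNewConnection() // runs on the executor, once per partition
        partitionOfRecords.foreach(record => connection.send(record))
        connection.close() // freed promptly, so descriptors don't accumulate
      }
    }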