Reporting errors from Spark SQL
Hello, I am working on a SQL editor powered by Spark SQL. When the SQL is invalid, I would like to give the user the line number and column number where the first error occurred, but I am having a hard time finding a mechanism that provides that information programmatically. Most of the time, an erroneous SQL statement produces a RuntimeException in which the line and column numbers are only implicitly embedded in the message text, and it is really error-prone to parse that text and count the number of spaces before the '^' symbol. Sometimes an AnalysisException is thrown instead, but when I try to extract line and startPosition from it, they are always empty. Any help would be greatly appreciated. Thanks!
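For reference, the caret-counting workaround described above might look roughly like the sketch below. It is plain Python with no Spark dependency, and it rests on assumptions that are not confirmed anywhere in this thread: that the exception message echoes the offending SQL line verbatim and puts a line containing only whitespace and a single '^' directly beneath it. The function name caret_position is hypothetical, not part of any Spark API.

```python
from typing import Optional, Tuple


def caret_position(sql: str, message: str) -> Optional[Tuple[int, int]]:
    """Return the 1-based (line, column) marked by '^' in an error message.

    Assumption (not a documented Spark contract): the message echoes the
    offending SQL line and places a caret-only line directly under it.
    Returns None when no such pair of lines can be found.
    """
    message_lines = message.splitlines()
    sql_lines = sql.splitlines()

    for i in range(1, len(message_lines)):
        marker = message_lines[i]
        # A marker line is nothing but whitespace and one caret.
        if marker.strip() != "^":
            continue

        echoed = message_lines[i - 1]
        # Characters before the caret give a 0-based offset; +1 makes it 1-based.
        column = marker.index("^") + 1

        # Map the echoed text back to a line of the submitted SQL.
        # If the same text occurs on several lines, the first match wins.
        for line_number, sql_line in enumerate(sql_lines, start=1):
            if sql_line == echoed:
                return line_number, column
        return None

    return None
```

Calling caret_position(sql, str(error)) returns None whenever the message does not have the assumed shape, so an editor can fall back to showing the raw message text. This only makes the workaround less painful; it is still as fragile as the message format it depends on.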
System.exit in local mode?
Hello, I have noticed that in https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala Spark calls System.exit when an uncaught exception is encountered. I have an application that runs Spark in local mode, and I would like to avoid exiting the application if that happens. Will Spark exit my application in local mode too, or is that the behavior only in cluster mode? Is there a setting to override this behavior? Thanks, Yael
Allowing parallelism in Spark local mode
Hello, I have an application that receives requests over HTTP and uses Spark in local mode to process them. Each request runs in its own thread. It seems that Spark is queueing the jobs and processing them one at a time: when two requests arrive simultaneously, the processing time for each of them almost doubles. I tried setting spark.default.parallelism, spark.executor.cores, and spark.driver.cores, but that did not change the time in a meaningful way. Am I missing something obvious? Thanks, Yael
Unexplained sleep time
Hello, I am working on improving the performance of our Spark on YARN applications. Scanning through the logs, I found the following lines:
[2015-10-07T16:25:17.245-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:yarn] Started progress reporter thread - sleep time : 5000
[2015-10-07T16:25:22.262-04:00] [DataProcessing] [INFO] [] [org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl] [tid:Reporter] [userID:yarn] Received new token for : hostname:8041
As the log says, the main thread sleeps for 5 seconds. Is there a way to configure or eliminate this sleep? Thanks, Yael
Adding Kafka topics to a running streaming context
Hello, my streaming application needs to be able to start consuming new Kafka topics at arbitrary times. I know I can stop and restart the streaming context whenever I need to introduce a new stream, but that seems quite disruptive. I am wondering whether other people have run into this situation and whether there is a more elegant solution. Thanks, Yael