Reporting errors from Spark SQL

2016-08-18 Thread yael aharon
Hello,
I am working on an SQL editor that is powered by Spark SQL. When the SQL is
not valid, I would like to give the user the line and column number where the
first error occurred, but I am having a hard time finding a mechanism that
provides that information programmatically.

Most of the time, an erroneous SQL statement produces a RuntimeException
whose message only implicitly embeds the line and column numbers; parsing the
message text and counting the spaces before the '^' marker is really
error-prone...

Sometimes an AnalysisException is thrown instead, but when I try to extract
line and startPosition from it, they are always empty.
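
To make this concrete, here is a minimal sketch of what I am doing
(hypothetical table and app names; assumes Spark 2.0's SparkSession and a
local master):

import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sql-editor")
  .getOrCreate()

try {
  spark.sql("SELEC * FROM my_table")  // deliberately malformed SQL
} catch {
  case e: AnalysisException =>
    // line and startPosition are Option[Int]; for me both are always None
    println(s"line=${e.line} startPosition=${e.startPosition}")
  case e: RuntimeException =>
    // the position is only embedded in the message text, next to the '^' marker
    println(e.getMessage)
}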

Any help would be greatly appreciated.
thanks!


System.exit in local mode?

2016-05-26 Thread yael aharon
Hello,
I have noticed that in
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala
Spark calls System.exit when an uncaught exception is encountered.
I have an application that runs Spark in local mode, and I would like the
application not to exit if that happens.
Will Spark exit my application in local mode too, or does that happen only in
cluster mode? Is there a setting to override this behavior?
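
To make the question concrete, here is a small probe I can run (hypothetical
object name; just an experiment, not production code):

import org.apache.spark.{SparkConf, SparkContext}

object ExitProbe {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("exit-probe"))
    try {
      sc.parallelize(1 to 4).foreach { i =>
        if (i == 3) throw new IllegalStateException("boom")  // fails the task
      }
    } catch {
      case e: Exception => println(s"Driver caught the job failure: ${e.getMessage}")
    }
    println("Process still alive after the failed job")  // unreached if Spark exited
    sc.stop()
  }
}
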
thanks, Yael


Allowing parallelism in Spark local mode

2016-02-12 Thread yael aharon
Hello,
I have an application that receives requests over HTTP and uses Spark in
local mode to process the requests. Each request runs in its own thread.
It seems that Spark queues the jobs and processes them one at a time: when
two requests arrive simultaneously, the processing time for each of them
nearly doubles.
I tried setting spark.default.parallelism, spark.executor.cores, and
spark.driver.cores, but none of them changed the timing in a meaningful way.
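
For reference, the setup looks roughly like this (simplified; hypothetical
names; one SparkContext shared by the HTTP server's worker threads):

import org.apache.spark.{SparkConf, SparkContext}

object RequestHandler {
  private val sc = new SparkContext(
    new SparkConf().setMaster("local[*]").setAppName("http-service"))

  // called concurrently, one invocation per HTTP request
  def handle(payload: Seq[Int]): Long =
    sc.parallelize(payload).map(_ * 2).count()
}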

Am I missing something obvious?
thanks, Yael


Unexplained sleep time

2015-10-08 Thread yael aharon
Hello,
I am working on improving the performance of our Spark on YARN applications.
Scanning through the logs, I found the following lines:


[2015-10-07T16:25:17.245-04:00] [DataProcessing] [INFO] []
[org.apache.spark.Logging$class] [tid:main] [userID:yarn] Started
progress reporter thread - sleep time : 5000
[2015-10-07T16:25:22.262-04:00] [DataProcessing] [INFO] []
[org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl] [tid:Reporter]
[userID:yarn] Received new token for : hostname:8041


As the log shows, the progress reporter thread sleeps for 5 seconds at a
time. Is there a way to configure or eliminate this sleep?
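
The only related knob I have found so far is the YARN application master
heartbeat interval from the Spark on YARN configuration; I am assuming,
without being sure, that it is what drives this sleep:

import org.apache.spark.SparkConf

// unverified guess: lower the AM heartbeat interval from its 5000 ms default
val conf = new SparkConf()
  .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")
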
thanks, Yael


Adding Kafka topics to a running streaming context

2015-08-27 Thread yael aharon
Hello,
My streaming application needs to start consuming new Kafka topics at
arbitrary times. I know I can stop and restart the streaming context when I
need to introduce a new stream, but that seems quite disruptive. Have other
people run into this situation, and is there a more elegant solution?
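
For context, this is roughly what the stop-and-restart approach looks like
today (simplified; the broker address, topic names, and batch interval are
made up; uses the direct stream API from spark-streaming-kafka):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// rebuild the whole streaming context with the enlarged topic set
def startWithTopics(topics: Set[String]): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-consumer")
  val ssc = new StreamingContext(conf, Seconds(10))
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))  // placeholder
  ssc.start()
  ssc
}

// on each new topic request:
// oldSsc.stop(stopSparkContext = true, stopGracefully = true)
// val newSsc = startWithTopics(currentTopics + "new-topic")
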
thanks, Yael