How to get the data url

2017-10-29 Thread onoke
Hi, I am searching for a useful API for getting a data URL that is accessed by an application on Spark. For example, when this URL is in an application: new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv") How can I get this URL using the Spark API? I looked in
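Not from the thread: Spark's DataFrame reader has no HTTP(S) source, so a common pattern is to fetch the remote file to a path Spark can read, then load that local copy. A minimal Python sketch of the download-then-read step; the helper name and the /tmp destination are assumptions, and the actual network call is left commented out:

```python
# Sketch: resolve a remote CSV URL to a local path for spark.read.csv.
from urllib.parse import urlparse
from urllib.request import urlretrieve
import os

def fetch_to_local(url, dest_dir="/tmp"):
    """Download a remote file so Spark can read it from a local path."""
    name = os.path.basename(urlparse(url).path)  # e.g. "bank.csv"
    dest = os.path.join(dest_dir, name)
    # urlretrieve(url, dest)  # the real download, omitted in this sketch
    return dest

path = fetch_to_local("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv")
# Then: spark.read.option("header", "true").csv(path)
```

Another option worth checking is SparkContext.addFile(url) followed by SparkFiles.get(name), which distributes the file to the executors.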

Re: Spark 2.2.0 GC Overhead Limit Exceeded and OOM errors in the executors

2017-10-29 Thread mmdenny
Hi Supun, Did you look at https://spark.apache.org/docs/latest/tuning.html? In addition to the info there, if you're partitioning by some key where you've got a lot of data skew, one task's memory requirement may be larger than the RAM of a given executor, while the rest of the tasks
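A quick way to check for the skew described above, before repartitioning by a key, is to count rows per key on a sample and see whether one key dominates. A minimal sketch, not from the thread; the 98/1/1 key distribution is invented for illustration:

```python
from collections import Counter

def skew_report(keys):
    """Return the heaviest key and the fraction of rows it holds."""
    counts = Counter(keys)
    total = sum(counts.values())
    top_key, top_count = counts.most_common(1)[0]
    return top_key, top_count / total

# Simulated sample of partition keys: "a" holds 98 of 100 rows,
# so the task handling "a" would receive almost all of the data.
keys = ["a"] * 98 + ["b", "c"]
top_key, fraction = skew_report(keys)
```

In Spark this sample could come from df.groupBy(key).count() on a small fraction of the data.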

is there a way to specify interval between task retry attempts ?

2017-10-29 Thread 王 宇
Sorry for interrupting, I have a quick question regarding the retry mechanism for failed tasks. I'd like to know whether there is a way to specify the interval between task retry attempts. I have set spark.task.maxFailures to a relatively large number, but due to the unstable network condition
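As far as I know, Spark core exposes spark.task.maxFailures but no companion setting for the wait between task retries, so one workaround is to retry at the application level, around the flaky call itself. A hedged sketch of retry-with-growing-interval; the function and parameter names here are made up, not a Spark API:

```python
import time

def retry_with_interval(fn, max_failures=4, base_wait=0.5, factor=2.0, sleep=time.sleep):
    """Call fn(), sleeping between attempts; the wait grows by `factor` each retry."""
    wait = base_wait
    for attempt in range(1, max_failures + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_failures:
                raise  # out of attempts: surface the last error
            sleep(wait)
            wait *= factor

# Example: a call that fails twice, then succeeds on the third attempt.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient network error")
    return "ok"

waits = []  # record the sleeps instead of actually sleeping
result = retry_with_interval(flaky, sleep=waits.append)
```

The sleep parameter is injected so the backoff schedule can be observed (or tested) without real delays.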

Re: FW: Kafka Direct Stream - dynamic topic subscription

2017-10-29 Thread Cody Koeninger
As it says in SPARK-10320 and in the docs at http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#consumerstrategies , you can use SubscribePattern On Sun, Oct 29, 2017 at 3:56 PM, Ramanan, Buvana (Nokia - US/Murray Hill) wrote: > Hello
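SubscribePattern takes a regex and subscribes to every matching topic, including topics created after the stream starts. This Python sketch only illustrates which topic names a given pattern would pick up; the pattern and topic names are invented (the real call, per the docs linked above, is ConsumerStrategies.SubscribePattern with your kafkaParams):

```python
import re

# Hypothetical pattern: subscribe to every topic whose name starts with "events-".
pattern = re.compile(r"events-.*")

# Topics currently present on the (imaginary) broker.
topics = ["events-us", "events-eu", "metrics-us"]

# SubscribePattern-style matching: full-name match against the regex.
matched = [t for t in topics if pattern.fullmatch(t)]
```

A topic like "events-apac" created later would also match and be picked up on a subsequent consumer poll.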

Re: Spark Streaming Small files in Hive

2017-10-29 Thread Siva Gudavalli
Hello Asmath, We had a similar challenge recently. When you write back to Hive, you are creating files on HDFS, and the file count depends on your batch window. If you increase your batch window, let's say from 1 min to 5 mins, you will end up creating 5x fewer files. The other factor is your partitioning.
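The batch-window arithmetic above can be sketched directly: with a fixed number of output partitions per batch, the file count scales with the number of batches per hour. The partitions-per-batch figure below is an assumption for illustration:

```python
def files_per_hour(batch_window_minutes, partitions_per_batch):
    """Output files written per hour for a given streaming batch window."""
    batches_per_hour = 60 // batch_window_minutes
    return batches_per_hour * partitions_per_batch

# Assuming 10 output partitions per batch:
one_min = files_per_hour(1, 10)   # 60 batches/hour
five_min = files_per_hour(5, 10)  # 12 batches/hour -> 5x fewer files
```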

Spark Streaming Small files in Hive

2017-10-29 Thread KhajaAsmath Mohammed
Hi, I am using Spark Streaming to write data back into Hive with the code snippet below: eventHubsWindowedStream.map(x => EventContent(new String(x))) .foreachRDD(rdd => { val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate import
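One common mitigation for the small-files problem, not stated in the thread, is to coalesce each batch's DataFrame to a target file count before writing to Hive. A minimal sketch of the sizing arithmetic only; the 128 MiB target is an assumed HDFS-block-sized default, not a Spark setting:

```python
def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """How many output files to aim for; pass the result to DataFrame.coalesce(n)."""
    # Ceiling division: always at least one partition.
    return max(1, -(-total_bytes // target_file_bytes))

# ~1 GiB of batch data at 128 MiB per file -> 8 output files
n = target_partitions(1024 ** 3)
# Then, inside foreachRDD: df.coalesce(n).write.mode("append").saveAsTable(...)
```

Estimating total_bytes per batch (e.g. from record counts and average record size) is left to the application.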