Re: Spark Streaming data checkpoint performance

2015-11-07 Thread trung kien
Hmm, it seems that is just a workaround. Using this method, it's very hard to recover from failure, since we don't know which batches have been done. I really want to maintain the whole running stats in memory to achieve full fault tolerance. I just wonder if the performance of data checkpointing is that
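For context, a minimal sketch of the kind of stateful job being discussed, with assumed source, checkpoint path and names: running counts kept with updateStateByKey, which is the operation that forces the periodic data checkpoint whose cost is in question.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("running-stats").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint("hdfs:///checkpoints/running-stats") // required for stateful operations

  val lines = ssc.socketTextStream("localhost", 9999) // assumed source
  val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1L))
  // Fold each batch's counts into the running totals held in Spark's state store
  val runningCounts = pairs.updateStateByKey[Long] {
    (newValues: Seq[Long], state: Option[Long]) => Some(state.getOrElse(0L) + newValues.sum)
  }
  runningCounts.print()

  ssc.start()
  ssc.awaitTermination()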

Re: Checkpoint not working after driver restart

2015-11-07 Thread vimal dinakaran
I am pasting the code here. Please let me know if there is any sequence that is wrong. def createContext(checkpointDirectory: String, config: Config): StreamingContext = { println("Creating new context") val conf = new
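For reference, a minimal sketch of the restart-safe pattern (app name and paths assumed): the entire streaming graph is defined inside the create function, and the driver always goes through StreamingContext.getOrCreate, so a restart recovers from the checkpoint instead of building a fresh, empty context.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  def createContext(checkpointDirectory: String): StreamingContext = {
    println("Creating new context")
    val conf = new SparkConf().setAppName("checkpoint-restart")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDirectory)
    // Define all input streams and output operations here, inside the factory,
    // so they can be reconstructed from the checkpoint on restart.
    ssc
  }

  val checkpointDirectory = "hdfs:///checkpoints/my-app" // assumed path
  val ssc = StreamingContext.getOrCreate(checkpointDirectory, () => createContext(checkpointDirectory))
  ssc.start()
  ssc.awaitTermination()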

RE: [sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-07 Thread Sun, Rui
This is probably because your config options do not actually take effect. Please refer to the email thread titled “How to set memory for SparkR with master="local[*]"”, which may answer your question. I recommend trying SparkR built from the master branch, which contains two fixes that may

streaming: missing data. does saveAsTextFile() append or replace?

2015-11-07 Thread Andy Davidson
Hi, I just started a new Spark Streaming project. In this phase of the system all we want to do is save the data we receive to HDFS. After running for a couple of days it looks like I am missing a lot of data. I wonder if saveAsTextFile("hdfs:///rawSteamingData"); is overwriting the data I
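For reference, a minimal sketch of one way to rule out overwriting (directory prefix taken from the question, batch suffix assumed): by default RDD.saveAsTextFile neither appends nor silently replaces, it fails if the target directory already exists, so each batch can be written to its own time-stamped directory.

  import org.apache.spark.streaming.dstream.DStream

  def saveBatches(lines: DStream[String]): Unit = {
    lines.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        // One directory per batch, so earlier batches are never touched
        rdd.saveAsTextFile(s"hdfs:///rawSteamingData/batch-${time.milliseconds}")
      }
    }
  }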

sqlCtx.sql('some_hive_table') works in pyspark but not spark-submit

2015-11-07 Thread YaoPau
Within a pyspark shell, both of these work for me: print hc.sql("SELECT * from raw.location_tbl LIMIT 10").collect() print sqlCtx.sql("SELECT * from raw.location_tbl LIMIT 10").collect() But when I submit both of those in batch mode (hc and sqlCtx both exist), I get the following error. Why is
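A hedged sketch of the usual cause (in Scala for illustration, although the thread is PySpark; table name taken from the question): the interactive shell pre-creates a Hive-aware context, but a submitted application has to build its own HiveContext, and hive-site.xml must be on the application classpath so the real metastore is used rather than a fresh local one.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("hive-batch"))
  // In spark-submit there is no pre-built sqlCtx/hc; create the Hive-aware context explicitly
  val hc = new HiveContext(sc)
  hc.sql("SELECT * FROM raw.location_tbl LIMIT 10").collect().foreach(println)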

Re: bug: can not run Ipython notebook on cluster

2015-11-07 Thread Andy Davidson
What a BEAR! The following recipe worked for me (it took a couple of days of hacking). I hope this improves the out-of-the-box experience for others. Andy My test program is now In [1]: from pyspark import SparkContext textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md") In [2]:

Spark Job failing with exit status 15

2015-11-07 Thread Shashi Vishwakarma
I am trying to run a simple word count job in Spark but I am getting an exception while running the job. For more detailed output, check application tracking page: http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/ Then, click on links to logs of each attempt. Diagnostics: Exception from

Re: Checkpointing an InputDStream from Kafka

2015-11-07 Thread Sandip Mehta
I believe you’ll have to use another way of creating the StreamingContext, by passing a create function to the getOrCreate function. private def setupSparkContext(): StreamingContext = { val streamingSparkContext = { val sparkConf = new SparkConf().setAppName(config.appName).setMaster(config.master)
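For reference, a hedged sketch of that pattern for a Kafka direct stream (Spark 1.x API; broker, topic and paths assumed): the stream and its output operations live inside the create function, and the driver starts through StreamingContext.getOrCreate so the DStream graph can be rebuilt from the checkpoint after a restart.

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  def createStreamingContext(checkpointDir: String): StreamingContext = {
    val sparkConf = new SparkConf().setAppName("kafka-checkpoint")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // assumed broker
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events")) // assumed topic
    stream.map(_._2).count().print()
    ssc
  }

  val checkpointDir = "hdfs:///checkpoints/kafka-app" // assumed path
  val ssc = StreamingContext.getOrCreate(checkpointDir, () => createStreamingContext(checkpointDir))
  ssc.start()
  ssc.awaitTermination()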

Re: Whether Spark is appropriate for our use case.

2015-11-07 Thread Igor Berman
1. If you join by some specific field (e.g. user id, account id or whatever), you may try to partition the parquet file by this field; the join will then be more efficient (see the sketch below). 2. You need to look at the Spark metrics for the performance of the particular join: how many partitions there are, what the shuffle
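A minimal sketch of the first suggestion (Spark 1.4+ DataFrame API; paths and column name assumed): the parquet data is written partitioned by the join field, so a later job that filters or joins on that field only reads the partitions it needs.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("partition-by-join-key"))
  val sqlContext = new SQLContext(sc)
  val accounts = sqlContext.read.parquet("hdfs:///data/accounts")
  accounts.write
    .partitionBy("account_id") // the field the join is keyed on
    .parquet("hdfs:///data/accounts_by_account_id")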

Re: Spark Streaming : minimum cores for a Receiver

2015-11-07 Thread Gideon
I'm not a Spark expert, but: what Spark does is run receivers in the executors. These receivers are long-running tasks; each receiver occupies 1 core in its executor. If an executor has more cores than receivers, it can also process (at least some of) the data that it is receiving. So, enough
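A minimal sketch of the implication (ports and sizing assumed): with two receiver-based streams, two cores are permanently occupied, so the application needs more cores than receivers or no batches will ever get processed.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.dstream.DStream

  // local[4]: 2 cores are tied up by the receivers, 2 remain for batch processing
  val conf = new SparkConf().setAppName("receiver-cores").setMaster("local[4]")
  val ssc = new StreamingContext(conf, Seconds(10))
  val streams: Seq[DStream[String]] = (1 to 2).map(i => ssc.socketTextStream("localhost", 9990 + i))
  ssc.union(streams).count().print()
  ssc.start()
  ssc.awaitTermination()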