How to make sure that a Spark Kafka direct streaming job maintains its state upon code deployment?

2017-06-27 Thread SRK
Hi, we use updateStateByKey and reduceByKeyAndWindow and checkpoint the data. We store the offsets in ZooKeeper. How do we make sure that the state of the job is maintained upon redeploying the code? Thanks!
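
A minimal sketch of the standard recovery pattern, assuming a hypothetical checkpoint directory: the driver must be restarted through StreamingContext.getOrCreate for checkpointed state to be restored. Note that a checkpoint generally cannot be reused after the application code changes (the serialized DStream graph no longer matches), so surviving a redeploy usually means persisting state externally and rebuilding it on startup.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"  // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("stateful-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // build the DStream graph (updateStateByKey, reduceByKeyAndWindow, ...) here
      ssc
    }

    // Restores the graph and state from the checkpoint if one exists,
    // otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()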

How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-06-27 Thread SRK
Hi, I have checkpoints enabled in Spark Streaming and I use updateStateByKey and reduceByKeyAndWindow with inverse functions. How do I reduce the amount of data that I am writing to the checkpoint, or clear out the data that I don't care about? Thanks!
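
Two knobs that typically help, sketched below under the assumption of a pair DStream named `pairs`: lengthening the checkpoint interval on the stateful stream, and returning None from the update function so dead keys are dropped from the state (which is what actually gets checkpointed).

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // assumes `pairs: DStream[(String, Int)]` already exists
    def updateFunc(values: Seq[Int], state: Option[Int]): Option[Int] =
      if (values.isEmpty) None                     // aggressive policy: drop idle keys
      else Some(state.getOrElse(0) + values.sum)   // otherwise accumulate

    val state: DStream[(String, Int)] = pairs.updateStateByKey(updateFunc)

    // checkpoint less often; a common rule of thumb is 5-10x the batch interval
    state.checkpoint(Seconds(50))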

Spark standalone, client mode. How do I monitor?

2017-06-27 Thread anna stax
Hi all, I have a spark standalone cluster. I am running a spark streaming application on it and the deploy mode is client. I am looking for the best way to monitor the cluster and application so that I will know when the application/cluster is down. I cannot move to cluster deploy mode now. I
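
One lightweight option, sketched with a hypothetical master host: the standalone master's web UI also serves the cluster state as JSON at the /json path on the same port, so an external watchdog can poll it and alert when the master, its workers, or the application disappear.

    import scala.io.Source
    import scala.util.{Failure, Success, Try}

    val masterJson = "http://spark-master:8080/json"  // hypothetical host:port

    // Any failure to fetch (or an application missing from the payload)
    // is a signal to alert.
    Try(Source.fromURL(masterJson).mkString) match {
      case Success(body) => println(s"master is up, state: ${body.take(80)}...")
      case Failure(e)    => println(s"ALERT: master unreachable: ${e.getMessage}")
    }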

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-27 Thread Aaron Perrin
I'm assuming some things here, but hopefully I understand. So, basically you have a big table of data distributed across a bunch of executors. And, you want an efficient way to call a native method for each row. It sounds similar to a dataframe writer to me. Except, instead of writing to disk or
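
For the Dataset API specifically, there is a direct equivalent. A sketch with a hypothetical schema and a hypothetical native library: the expensive per-partition setup happens once per iterator, and the native call runs per row.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("mapPartitionsOnDataset").getOrCreate()
    import spark.implicits._

    case class Record(id: Long, value: String)  // hypothetical schema

    val ds = spark.range(0, 100).map(i => Record(i, s"v$i"))

    // Dataset.mapPartitions: one iterator per partition, so per-partition
    // setup (e.g. opening a native library handle) happens once, not per row.
    val result = ds.mapPartitions { records =>
      // val handle = NativeLib.open()            // hypothetical expensive setup
      records.map(r => (r.id, r.value.length))    // per-row native call goes here
    }
    result.show(5)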

Re: IDE for python

2017-06-27 Thread ayan guha
Depends on the need. For data exploration, I use notebooks whenever I can. For development, any good text editor should work; I use Sublime. If you want auto-completion and all, you can use Eclipse or PyCharm, I do not :) On Wed, 28 Jun 2017 at 7:17 am, Xiaomeng Wan wrote:

IDE for python

2017-06-27 Thread Xiaomeng Wan
Hi, I recently switched from Scala to Python, and wondered which IDE people are using for Python. I heard about PyCharm, Spyder, etc. How do they compare with each other? Thanks, Shawn

Spark Encoder with MySQL enum and "Data truncated" error

2017-06-27 Thread mckunkel
I am using Spark via Java for a MySQL/ML (machine learning) project. In the MySQL database, I have a column "status_change_type" of type enum = {broke, fixed} in a table called "status_change" in a DB called "test". I have an object StatusChangeDB that constructs the needed structure for the
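
One workaround, sketched with hypothetical connection details: MySQL ENUM columns do not map cleanly onto Spark's JDBC type handling, so casting the enum to CHAR in the pushed-down query makes Spark see a plain StringType on read; symmetrically, writing through a VARCHAR staging table sidesteps the "Data truncated" error on insert.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("mysqlEnumRead").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "test_user")      // hypothetical credentials
    props.setProperty("password", "secret")

    // Push the cast down to MySQL so the enum arrives as a string.
    val table =
      "(SELECT id, CAST(status_change_type AS CHAR) AS status_change_type " +
      "FROM status_change) AS t"

    val df = spark.read.jdbc("jdbc:mysql://localhost:3306/test", table, props)
    df.printSchema()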

(Spark-ml) java.util.NoSuchElementException: key not found exception on doing prediction and computing test error.

2017-06-27 Thread neha nihal
Hi, I am using Apache Spark 2.0.2 RandomForest ML (standalone mode) for text classification. The TF-IDF feature extractor is also used. The training part runs without any issues and returns 100% accuracy. But when I am trying to do prediction using the trained model and compute the test error, it fails with
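
A frequent cause of "key not found" at prediction time is fitting the label indexer or feature stages separately on the training and test sets, so prediction hits keys the fitted model never saw. A sketch of the usual fix, with hypothetical column names and DataFrames: fit one Pipeline on the training data and reuse the fitted model for the test data, letting StringIndexer skip unseen labels.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, Tokenizer}

    // assumes trainingDF and testDF DataFrames with "text" and "category" columns
    val indexer = new StringIndexer()
      .setInputCol("category").setOutputCol("label")
      .setHandleInvalid("skip")  // drop test rows whose label was unseen in training
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val tf  = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val rf  = new RandomForestClassifier()

    val model = new Pipeline()
      .setStages(Array(indexer, tokenizer, tf, idf, rf))
      .fit(trainingDF)                 // every stage fit exactly once, on training data

    val predictions = model.transform(testDF)  // same fitted stages applied to test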

How do I find the time taken by each step in a stage in a Spark job

2017-06-27 Thread SRK
Hi, how do I find the time taken by each step in a stage in a Spark job? Also, how do I find the bottleneck in each step, and whether a stage is skipped because its RDDs are persisted in streaming? I am trying to identify which step is taking time in my streaming job. Thanks!
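
Per-stage wall-clock time is easy to capture programmatically; per-task detail (the closest thing to "steps") lives in the stage page of the Spark UI or the REST endpoint /api/v1/applications/<app-id>/stages, and skipped stages show up marked as skipped on the UI's job page. A minimal listener sketch, assuming an existing SparkContext `sc`:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Logs the duration of every completed stage as it finishes.
    class StageTimeListener extends SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        val info = event.stageInfo
        for (start <- info.submissionTime; end <- info.completionTime)
          println(s"Stage ${info.stageId} '${info.name}' took ${end - start} ms")
      }
    }

    sc.addSparkListener(new StageTimeListener)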

Re: ZeroMQ Streaming in Spark2.x

2017-06-27 Thread Aashish Chaudhary
Thanks, I was able to get it up and running. One thing I am not entirely sure about is whether Bahir provides Python bindings for ZeroMQ. Looking at the code it does not seem like it, but I might be wrong. Thanks, On Mon, Jun 26, 2017 at 5:13 PM Aashish Chaudhary <aashish.chaudh...@kitware.com> wrote: >

Re: Question about Parallel Stages in Spark

2017-06-27 Thread satish lalam
Thanks Bryan. This is one Spark application with one job. This job has 3 stages. The first 2 are basic reads from Cassandra tables and the 3rd is a join between the two. I was expecting the first 2 stages to run in parallel; however, they run serially. The job has enough resources. On Tue, Jun 27,

The countByValueAndWindow and foreachRDD functions in DStream: would you help me understand them, please?

2017-06-27 Thread ??????????
Hi all, I have code like below: Logger.getLogger("org.apache.spark").setLevel(Level.ERROR) //Logger.getLogger("org.apache.spark.streaming.dstream").setLevel(Level.DEBUG) val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]") //val sc =
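
For reference, a self-contained sketch of how the two operators fit together (the socket source and window sizes are hypothetical): countByValueAndWindow emits (element, count) pairs over a sliding window and needs checkpointing because it uses an inverse reduce internally, while the body of foreachRDD runs on the driver once per batch, with RDD actions inside it executing on the cluster.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("testDstream").setMaster("local[4]")
    val ssc  = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/cbvw-checkpoint")  // required by the windowed count

    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source

    // Every 10s, count occurrences of each distinct line seen in the last 30s.
    val counts = lines.countByValueAndWindow(Seconds(30), Seconds(10))

    // Runs on the driver once per batch; rdd.take executes on the executors.
    counts.foreachRDD { (rdd, time) =>
      println(s"batch $time: ${rdd.take(5).mkString(", ")}")
    }

    ssc.start()
    ssc.awaitTermination()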

[ML] Stop conditions for RandomForest

2017-06-27 Thread OBones
Hello, Reading around on the theory behind tree-based regression, I concluded that there are various reasons to stop exploring the tree when a given node has been reached. Among these, I have these two: 1. When starting to process a node, if its size (row count) is less than X then consider
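
For what it's worth, Spark ML already exposes both of these as tunable parameters, so no custom stopping logic is needed for the two conditions above. A brief sketch (the values are arbitrary placeholders):

    import org.apache.spark.ml.regression.RandomForestRegressor

    val rf = new RandomForestRegressor()
      .setMinInstancesPerNode(20)  // condition 1: a node with fewer rows becomes a leaf
      .setMinInfoGain(0.001)       // splits gaining less than this are not made
      .setMaxDepth(10)             // hard cap on tree depth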

Proxy on Spark UI

2017-06-27 Thread Soheila S.
Hi all, I am using Hadoop 2.6.5 and Spark 2.1.0 and run a job using spark-submit with the master set to "yarn". When Spark starts, I can load the Spark UI page on port 4040, but no job is shown on the page. After the following logs (registering the application master on YARN), the Spark UI is not accessible
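
For context: when the master is "yarn", the application UI moves behind the YARN ResourceManager's web proxy once the application master registers, so the driver's port 4040 stops serving it directly. The usual address then has the form http://<resourcemanager-host>:8088/proxy/<application-id>/ (8088 is the YARN default; adjust host, port, and application id to your cluster).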

What is the purpose of having RDD.context and RDD.sparkContext at the same time?

2017-06-27 Thread Sergey Zhemzhitsky
Hello Spark gurus, Could you please shed some light on the purpose of having two identical functions in RDD, RDD.context [1] and RDD.sparkContext [2]? RDD.context seems to be used more frequently across the source code. [1]
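
For reference, both methods just return the RDD's underlying SparkContext, so they are interchangeable; as far as the source shows, context is simply the older name and sparkContext the later, more explicit alias. Paraphrasing the definitions in RDD.scala:

    // both delegate to the same private field `sc`
    def sparkContext: SparkContext = sc
    def context: SparkContext = sc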

PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-27 Thread John Omernik
Hello all, I am running PySpark 2.1.1 as a user, jomernik. I am working through some documentation here: https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests And was working on the Random Forest Classification, and found it to be working! That said, when I try to save the
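
"Permission denied" on save usually means the path resolved to a filesystem location the submitting user cannot write to (for example, the HDFS root or another user's directory). A sketch of the workaround, shown with the Scala mllib API (the PySpark call is analogous), assuming an existing `model` and `sc` and hypothetical paths:

    // write under a directory the submitting user owns, with an explicit scheme
    model.save(sc, "file:///home/jomernik/models/myRandomForestModel")  // local FS
    // or, on HDFS, under the user's home directory:
    // model.save(sc, "hdfs:///user/jomernik/models/myRandomForestModel")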

Re: Question about Parallel Stages in Spark

2017-06-27 Thread Bryan Jeffrey
Satish, Are these two separate applications submitted to the YARN scheduler? If so, then you would expect to see the original case run in parallel. However, if this is one application, your submission to YARN guarantees that this application will contend fairly for resources

Re: Question about Parallel Stages in Spark

2017-06-27 Thread satish lalam
Thanks, all. To reiterate: stages inside a job can run in parallel as long as (a) there is no sequential dependency and (b) the job has sufficient resources. However, my code was launching 2 jobs, and they are sequential, as you rightly pointed out. The issue which I was trying to highlight with that
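
For completeness, a sketch of how to make two independent jobs actually overlap from one application (assumes df1 and df2 are the two Cassandra table reads; count stands in for any materializing action): actions block the calling thread, so each job must be submitted from its own thread, for example via Futures, optionally with spark.scheduler.mode=FAIR.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // submit each read as its own job on its own thread
    val f1 = Future { df1.cache(); df1.count() }
    val f2 = Future { df2.cache(); df2.count() }
    Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)

    val joined = df1.join(df2, "key")  // hypothetical join key; both inputs now cached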

Re: gfortran runtime library for Spark

2017-06-27 Thread Saroj C
Thanks a lot. Thanks & Regards, Saroj Kumar Choudhury, Tata Consultancy Services