Re: Spark Streaming Application Got killed after 2 hours

2014-11-16 Thread Prannoy
Hi Saj, What is the size of the input data that you are putting on the stream? Have you tried running the same application with a different set of data? It's weird that the streaming stops after exactly 2 hours. Try running the same application with different data of different sizes to see if it

How to kill/upgrade/restart driver launched in Spark standalone cluster+supervised mode?

2014-11-16 Thread Jesper Lundgren
Hello, I have a Spark Standalone cluster running in HA mode. I launched an application using spark-submit with cluster and supervised mode enabled, and it launched successfully on one of the worker nodes. How can I stop/restart/kill or otherwise manage such a task running in a standalone cluster?

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Cheng Lian
(Forgot to cc user mail list) On 11/16/14 4:59 PM, Cheng Lian wrote: Hey Sadhan, Thanks for the additional information, this is helpful. Seems that some Parquet internal contract was broken, but I'm not sure whether it's caused by Spark SQL or Parquet, or even maybe the Parquet file itself

Re: Help with Spark Streaming

2014-11-16 Thread Bahubali Jain
Hi, Can anybody help me on this please? I haven't been able to find the problem :( Thanks. On Nov 15, 2014 4:48 PM, Bahubali Jain bahub...@gmail.com wrote: Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per window

Re: Help with Spark Streaming

2014-11-16 Thread ZhangYi
I guess maybe you don't need to invoke reduceByKey() after mapToPair, because updateStateByKey already covers it. For your reference, here is a sample written in Scala using a text file stream instead of a socket, as below: object LocalStatefulWordCount extends App { val sparkConf = new
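For reference, a minimal sketch of the stateful word count pattern described in that reply, using updateStateByKey so no separate reduceByKey is needed for the consolidated count (application name, batch interval, checkpoint directory and input path are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("StatefulWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint") // updateStateByKey requires checkpointing

    // Merge this batch's counts for a word into its running total.
    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))

    val counts = ssc.textFileStream("/tmp/streaming-input") // placeholder path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](updateFunc)

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}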

Re: Using data in RDD to specify HDFS directory to write to

2014-11-16 Thread Akhil Das
Can you check in the worker logs what exactly is happening!?? Thanks Best Regards On Sun, Nov 16, 2014 at 2:54 AM, jschindler john.schind...@utexas.edu wrote: UPDATE I have removed and added things systematically to the job and have figured that the inclusion of the construction of the

Re: Cancelled Key Exceptions on Massive Join

2014-11-16 Thread Akhil Das
This usually happens when one of the workers is stuck on a GC pause and it times out. Enable the following configurations while creating the SparkContext: sc.set("spark.rdd.compress", "true") sc.set("spark.storage.memoryFraction", "1") sc.set("spark.core.connection.ack.wait.timeout", "6000")
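A sketch of how those properties could be set on a SparkConf before the SparkContext is created (property values are passed as strings; the exact values are the thread's suggestion, not a general recommendation):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("massive-join") // placeholder name
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("spark.core.connection.ack.wait.timeout", "6000")

val sc = new SparkContext(conf)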

Re: User Authn and Authz in Spark missing ?

2014-11-16 Thread Akhil Das
Have a look at this doc: http://spark.apache.org/docs/latest/security.html You can configure your network to only accept connections from trusted CIDRs. If you are using cloud services like EC2/Azure/GCE etc., then it is straightforward from their web portal. If you are having a bunch of custom

Re: Status of MLLib exporting models to PMML

2014-11-16 Thread Charles Earl
Manish and others, A follow-up question on my mind is whether there are protobuf (or other binary format) frameworks in the vein of PMML. Perhaps scientific data storage frameworks like NetCDF or ROOT are also possibilities. I like the comprehensiveness of PMML but as you mention the complexity of

RE: filtering a SchemaRDD

2014-11-16 Thread Daniel, Ronald (ELS-SDG)
Indeed it did. Thanks! Ron From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, November 14, 2014 9:53 PM To: Daniel, Ronald (ELS-SDG) Cc: user@spark.apache.org Subject: Re: filtering a SchemaRDD If I use row[6] instead of row[text] I get what I am looking for. However,

Returning breeze.linalg.DenseMatrix from method

2014-11-16 Thread Ritesh Kumar Singh
Hi, I have a method that returns DenseMatrix: def func(str: String): DenseMatrix = { ... ... } But I keep getting this error: *class DenseMatrix takes type parameters* I tried this too: def func(str: String): DenseMatrix(Int, Int, Array[Double]) = { ... ... } But this gives me
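The error arises because Breeze's DenseMatrix is generic in its element type, so the return type needs a type parameter. A minimal sketch (constructor arguments are purely illustrative):

import breeze.linalg.DenseMatrix

// DenseMatrix[V] takes the element type as a type parameter.
def func(str: String): DenseMatrix[Double] = {
  // rows, cols, column-major data; illustrative values only
  new DenseMatrix(2, 2, Array(1.0, 2.0, 3.0, 4.0))
}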

Re: Communication between Driver and Executors

2014-11-16 Thread Tobias Pfeiffer
Hi, On Fri, Nov 14, 2014 at 3:20 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: I wonder if SparkConf is dynamically updated on all worker nodes or only during initialization. It can be used to piggyback information. Otherwise I guess you are stuck with Broadcast. Primarily I have had

Interoperability between ScalaRDD, JavaRDD and PythonRDD

2014-11-16 Thread Nam Nguyen
Hello, Is it possible to reuse RDD implementations written in Scala/Java with PySpark? Thanks, Nam

re: How to incrementally compile spark examples using mvn

2014-11-16 Thread Yiming (John) Zhang
Thank you Marcelo. I tried your suggestion (# mvn -pl :spark-examples_2.10 compile), but it required downloading many Spark components (as listed below), which I have already compiled on my server. Downloading:

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Sadhan Sood
Hi Cheng, I tried reading the parquet file (on which we were getting the exception) through parquet-tools and it is able to dump the file and I can read the metadata, etc. I also loaded the file through a Hive table and can run a table scan query on it as well. Let me know if I can do more to help

Iterative changes to RDD and broadcast variables

2014-11-16 Thread Shannon Quinn
Hi all, I'm iterating over an RDD (representing a distributed matrix...have to roll my own in Python) and making changes to different submatrices at each iteration. The loop structure looks something like: for i in range(x): VAR = sc.broadcast(i) rdd.map(func1).reduceByKey(func2) M =

Re: Communication between Driver and Executors

2014-11-16 Thread Tobias Pfeiffer
Hi again, On Mon, Nov 17, 2014 at 8:16 AM, Tobias Pfeiffer t...@preferred.jp wrote: I have been trying to mis-use broadcast as in - create a class with a boolean var, set to true - query this boolean on the executors as a prerequisite to process the next item - when I want to shutdown, I
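A sketch of the pattern being described, assuming an existing SparkContext sc and an RDD rdd; the flag class is hypothetical, and note that mutations made on the driver after the broadcast is created are not shipped to executors, which is why this approach breaks down:

// Hypothetical flag class used only to illustrate the pattern.
class ShutdownFlag extends Serializable {
  @volatile var keepRunning: Boolean = true
}

val flag = new ShutdownFlag
val flagBc = sc.broadcast(flag)

rdd.foreachPartition { items =>
  items.foreach { item =>
    // Executors see the value captured when the broadcast was created,
    // not later mutations made on the driver.
    if (flagBc.value.keepRunning) {
      // process item
    }
  }
}

// Flipping the flag on the driver does not reach already-shipped copies.
flag.keepRunning = false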

RDD.aggregate versus accumulables...

2014-11-16 Thread Segerlind, Nathan L
Hi All. I am trying to get my head around why using accumulators and accumulables seems to be the most recommended method for accumulating running sums, averages, variances and the like, whereas the aggregate method seems to me to be the right one. I have no performance measurements as of yet,
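For comparison, a minimal sketch of a running sum and count (and from them a mean) computed with RDD.aggregate rather than accumulators; the sample data is made up and an existing SparkContext sc is assumed:

// zeroValue carries (sum, count); seqOp folds one element into a partition's
// accumulator, combOp merges accumulators from different partitions.
val data = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

val (sum, count) = data.aggregate((0.0, 0L))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)

val mean = if (count > 0) sum / count else 0.0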

Load json format dataset as RDD

2014-11-16 Thread J
Hi, I am new to Spark. I ran into a problem when I tried to load one dataset. I have a dataset where the data is in JSON format and I'd like to load it as an RDD. As one record may span multiple lines, SparkContext.textFile() is not usable. I also tried to use json4s to parse the JSON manually
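One workaround (not suggested in this thread, and assuming each JSON record sits in its own file) is SparkContext.wholeTextFiles combined with json4s; the path and record schema below are hypothetical:

import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

// Hypothetical record layout, for illustration only.
case class Record(id: Int, name: String)

val records = sc.wholeTextFiles("hdfs:///data/json-records/")
  .map { case (_, content) =>
    implicit val formats = DefaultFormats
    parse(content).extract[Record]
  }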

Functions in Spark

2014-11-16 Thread Deep Pradhan
Hi, Is there any way to know which of my functions performs better in Spark? In other words, say I have achieved the same thing using two different implementations. How do I judge which implementation is better than the other? Is processing time the only metric that we can use to claim the

Re: Functions in Spark

2014-11-16 Thread Samarth Mailinglist
Check this video out: https://www.youtube.com/watch?v=dmL0N3qfSc8&list=UURzsq7k4-kT-h3TDUBQ82-w On Mon, Nov 17, 2014 at 9:43 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any way to know which of my functions perform better in Spark? In other words, say I have achieved same

Re: Load json format dataset as RDD

2014-11-16 Thread Cheng Lian
SQLContext.jsonFile assumes one JSON record per line. Although I haven't tried it yet, it seems that this JsonInputFormat [1] can be helpful. You may read your original data set with SparkContext.hadoopFile and JsonInputFormat, then transform the resulting RDD[String] into a JsonRDD
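For reference, a minimal sketch of the one-record-per-line case that SQLContext.jsonFile expects (Spark 1.x API; the path is a placeholder and an existing SparkContext sc is assumed):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Each line of the input must be a complete JSON object.
val people = sqlContext.jsonFile("hdfs:///data/people.jsonl")
people.registerTempTable("people")
sqlContext.sql("SELECT * FROM people").collect().foreach(println)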

Questions Regarding to MPI Program Migration to Spark

2014-11-16 Thread Jun Yang
Guys, Recently we have been migrating our backend pipeline to Spark. In our pipeline, we have an MPI-based HAC implementation; to ensure result consistency of the migration, we also want to migrate this MPI-implemented code to Spark. However, during the migration process, I found that there are

Re: Functions in Spark

2014-11-16 Thread Mukesh Jha
Thanks, I did go through the video and it was very informative, but I think I was looking for the Transformations section on the page https://spark.apache.org/docs/0.9.1/scala-programming-guide.html. On Mon, Nov 17, 2014 at 10:31 AM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: Check this

Re: spark-submit question

2014-11-16 Thread Sean Owen
You are changing these paths and filenames to match your own actual scripts and file locations, right? On Nov 17, 2014 4:59 AM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: I am trying to run a job written in Python with the following command: bin/spark-submit --master