FW: Spark streaming - failed recovery from checkpoint

2015-11-02 Thread Adrian Tanase
Re-posting here, didn’t get any feedback on the dev list. Has anyone experienced corrupted checkpoints recently? Thanks! -adrian From: Adrian Tanase Date: Thursday, October 29, 2015 at 1:38 PM To: "d...@spark.apache.org" Subject: Spark streaming - failed recovery

Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Jacek Laskowski
Hi Sparkians, I use the latest Spark 1.6.0-SNAPSHOT in spark-shell with the default local[*] master. I created an RDD of pairs using the following snippet: val rdd = sc.parallelize(0 to 5).map(n => (n, util.Random.nextBoolean)) It's all fine so far. The map transformation causes no
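A minimal sketch of why a job shows up (in Spark 1.x, sortByKey eagerly builds a RangePartitioner, which samples the input to compute range boundaries, and that sampling is itself a job):

    val rdd = sc.parallelize(0 to 5).map(n => (n, util.Random.nextBoolean))
    // This transformation alone schedules a sampling job: sortByKey
    // constructs a RangePartitioner, which samples the RDD up front.
    val sorted = rdd.sortByKey()
    // The "real" sort only executes when an action runs:
    sorted.collect()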

Re: Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Jacek Laskowski
Hi, Answering my own question after... searching for sortByKey in the mailing list archives and later in JIRA. It turns out it's a known issue, filed as https://issues.apache.org/jira/browse/SPARK-1021 "sortByKey() launches a cluster job when it shouldn't". It's labelled "starter", which should

Re: execute native system commands in Spark

2015-11-02 Thread Adrian Tanase
Have you seen .pipe()? On 11/2/15, 5:36 PM, "patcharee" wrote: >Hi, > >Is it possible to execute native system commands (in parallel) in Spark, >like scala.sys.process? > >Best, >Patcharee
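A minimal sketch of the .pipe() approach (assuming the piped executable exists on every worker node):

    // Each partition's elements go to the process's stdin, one per line;
    // the process's stdout lines become the elements of the result RDD.
    val upper = sc.parallelize(Seq("a", "b", "c"))
      .pipe(Seq("tr", "a-z", "A-Z"))
    upper.collect()   // Array("A", "B", "C")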

Spark Streaming and periodic broadcast

2015-11-02 Thread Serafín Sedano Arenas
Hi all, I have a long lived Spark Streaming cluster feeding from several Kinesis streams. There is some data that must be obtained from another data source to be used in one of the steps on the workers. My understanding is that the best way to achieve this is by broadcasting that data. Is that
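One commonly suggested pattern (a sketch only, not an official API; `stream` stands for the Kinesis DStream and loadSideData for the external lookup) is to hold the broadcast in a driver-side variable and re-broadcast when the side data goes stale:

    import org.apache.spark.broadcast.Broadcast

    // Placeholder for the real external data source.
    def loadSideData(): Map[String, String] =
      Map("refreshedAt" -> System.currentTimeMillis.toString)

    @volatile var side: Broadcast[Map[String, String]] = sc.broadcast(loadSideData())

    stream.foreachRDD { rdd =>
      // foreachRDD bodies run on the driver, so the broadcast can be
      // swapped here before this batch's tasks are scheduled.
      side.unpersist()
      side = sc.broadcast(loadSideData())
      val local = side                    // capture the current handle
      rdd.foreach(x => local.value)       // executors see the fresh copy
    }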

Re: Sort Merge Join

2015-11-02 Thread Alex Nastetsky
Thanks for the response. Treating the file-system-based data source as “UnknownPartitioning” will be a simple and SAFE way for JOIN, as it’s hard to guarantee that records from different data sets with identical join keys will be loaded by the same node/task, since lots of factors need to be

Re: Best practises

2015-11-02 Thread Denny Lee
In addition, you may want to check out Tuning and Debugging in Apache Spark (https://sparkhub.databricks.com/video/tuning-and-debugging-apache-spark/) On Mon, Nov 2, 2015 at 05:27 Stefano Baghino wrote: > There is this interesting book from Databricks: >

Re: Best practises

2015-11-02 Thread satish chandra j
Hi All, Yes, any such doc will be a great help!!! On Fri, Oct 30, 2015 at 4:35 PM, huangzheng <1106944...@qq.com> wrote: > I have the same question. Anyone able to help us? > > > -- Original Message -- > *From:* "Deepak Sharma"; > *Sent:* Friday, October 30, 2015

Re: Spark, Mesos problems with remote connections

2015-11-02 Thread Tamas Szuromi
Hello Sebastian, Did you set the MESOS_NATIVE_JAVA_LIBRARY variable before you started pyspark? cheers, Tamas On 2 November 2015 at 15:24, Sebastian Kuepers < sebastian.kuep...@publicispixelpark.de> wrote: > Hey, > > > I have a Mesos cluster with a single Master. If I run the following >

execute native system commands in Spark

2015-11-02 Thread patcharee
Hi, Is it possible to execute native system commands (in parallel) in Spark, like scala.sys.process? Best, Patcharee

Re: Required file not found: sbt-interface.jar

2015-11-02 Thread Ted Yu
sbt-interface.jar is under build/zinc-0.3.5.3/lib/sbt-interface.jar You can run build/mvn first to download it. Cheers On Mon, Nov 2, 2015 at 1:51 AM, Todd wrote: > Hi, > I am trying to build spark 1.5.1 in my environment, but encounter the > following error complaining

Re: Exception while reading from kafka stream

2015-11-02 Thread Cody Koeninger
combine topicsSet_1 and topicsSet_2 in a single createDirectStream call. Then you can use hasOffsetRanges to see what the topic for a given partition is. On Mon, Nov 2, 2015 at 7:26 AM, Ramkumar V wrote: > if i try like below code snippet , it shows exception , how to
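A Scala sketch of the suggestion (the thread uses the Java API, but the shape is the same; ssc, kafkaParams and the topic sets are assumed from the original code):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    val allTopics = topicsSet_1 ++ topicsSet_2
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, allTopics)

    stream.foreachRDD { rdd =>
      // The direct stream's RDDs carry offset metadata, so this cast is
      // safe here (and only here, on the unmodified RDD).
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ranges(i).topic identifies the source topic of partition i,
      // which lets processing branch per topic.
    }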

Re: Best practises

2015-11-02 Thread Sushrut Ikhar
This presentation may clarify many of your doubts. https://www.youtube.com/watch?v=7ooZ4S7Ay6Y Regards, Sushrut Ikhar https://about.me/sushrutikhar On Mon, Nov 2, 2015 at 7:15 PM, Denny Lee wrote: > In addition,

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Adrian Tanase
You are correct, the default checkpointing interval is 10 seconds or your batch size, whichever is bigger. You can change it by calling .checkpoint(x) on your resulting DStream. For the rest, you are probably keeping an “all time” word count that grows unbounded if you never remove words from
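For example (a sketch; the interval must be a multiple of the batch duration, and Seconds(50) simply illustrates the often-recommended ~5x batch size; `wordCounts` stands for the stateful DStream from the thread):

    import org.apache.spark.streaming.Seconds

    wordCounts.checkpoint(Seconds(50))   // override the default checkpoint interval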

Re: apply simplex method to fix linear programming in spark

2015-11-02 Thread Sean Owen
I might be steering this a bit off topic: does this need the simplex method? This is just an instance of nonnegative least squares. I don't think it relates to LDA either. Spark doesn't have any particular support for NNLS (right?) or simplex though. On Mon, Nov 2, 2015 at 6:03 PM, Debasish Das

Re: apply simplex method to fix linear programming in spark

2015-11-02 Thread Debasish Das
Use Breeze simplex, which in turn uses Apache Math's simplex... If you want to use an interior point method you can use ECOS: https://github.com/embotech/ecos-java-scala ... The Spark Summit 2014 talk on the quadratic solver in matrix factorization will show you an example integration with Spark. ECOS runs as JNI

Re: [Yarn] How to set user in ContainerLaunchContext?

2015-11-02 Thread Marcelo Vanzin
You can try the "--proxy-user" command line argument for spark-submit. That requires that your RM configuration allows the user running your AM to "proxy" other users. And I'm not completely sure it works without Kerberos. See:

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-11-02 Thread Umesh Kacha
Hi Ted, I checked hive-exec-1.2.1.spark.jar and it contains the following required classes, but it still doesn't compile. I don't understand why this jar is getting overwritten in scope org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFMultiplePercentileApproxEvaluator.class

Re: ipython notebook NameError: name 'sc' is not defined

2015-11-02 Thread Jörn Franke
You can check a script that I created for the Amazon cloud: https://snippetessay.wordpress.com/2015/04/18/big-data-lab-in-the-cloud-with-hadoopsparkrpython/ If I remember correctly, you need to add something to the startup .py for IPython > On 03 Nov 2015, at 01:04, Andy Davidson

Re: How do I get the executor ID from running Java code

2015-11-02 Thread Gideon
Looking at the post date I can only assume you've got your answer. Since I just encountered your post while trying to do the same thing, I decided it's worth answering for other people. In order to get the executor ID you can use: SparkEnv.get().executorId() I hope this helps anyone.
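A small sketch of using it inside a job (note SparkEnv is a fairly internal API, so this may change across versions):

    import org.apache.spark.SparkEnv

    val executorIds = sc.parallelize(1 to 100, 4).mapPartitions { iter =>
      // On an executor this returns its ID; on the driver it returns "driver".
      Iterator(SparkEnv.get.executorId)
    }.distinct().collect()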

Re: Sort Merge Join

2015-11-02 Thread Jonathan Coveney
Additionally, I'm curious whether there are any JIRAs around making DataFrames support ordering better? There are a lot of operations that can be optimized if you know that you have a total ordering on your data... are there any plans, or at least JIRAs, around having the Catalyst optimizer handle this

Time-series prediction using spark

2015-11-02 Thread Cui Lin
Hello, all, I am wondering if anyone has tried time-series prediction using Spark? Any good practices to suggest? Thanks a lot! -- Best regards! Lin,Cui

Re: PySpark + Streaming + DataFrames

2015-11-02 Thread Jason White
This should be resolved with https://github.com/apache/spark/commit/f92f334ca47c03b980b06cf300aa652d0ffa1880. The conversion no longer does a `.take` when converting from RDD -> DF. On Mon, Oct 19, 2015 at 6:30 PM, Tathagata Das wrote: > Yes, precisely! Also, for other

Where does mllib's .save method save a model to?

2015-11-02 Thread apu mishra . rr
I want to save an mllib model to disk, and am trying the model.save operation as described in http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html#examples: model.save(sc, "myModelPath") But after running it, I am unable to find any newly created file or dir by the name
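The path is resolved against the default Hadoop filesystem, so on a cluster "myModelPath" usually lands in HDFS (often under the user's home directory) rather than on local disk. A scheme-qualified URI makes the destination explicit (a sketch; the namenode host is hypothetical):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    model.save(sc, "file:///tmp/myModelPath")                       // local disk
    // model.save(sc, "hdfs://namenode:8020/user/apu/myModelPath")  // HDFS, explicitly
    val sameModel = MatrixFactorizationModel.load(sc, "file:///tmp/myModelPath")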

Standalone cluster not using multiple workers for single application

2015-11-02 Thread Jeff Jones
I’ve got a series of applications using a single standalone Spark cluster (v1.4.1). The cluster has 1 master and 4 workers (4 CPUs per worker node). I am using the start-slave.sh script to launch the worker process on each node and I can see the nodes were successfully registered using the

Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Yin Huai
Hi Ross, What version of Spark are you using? There were two issues that affected the results of window functions in the Spark 1.5 branch. Both issues have been fixed and will be released with Spark 1.5.2 (this release will happen soon). For more details on these two issues, you can take a look at

Dump table into file

2015-11-02 Thread Shepherd
Hi all, I have one table called "result" in the database, for example: /user/hive/warehouse/data_result.db/result How do I export the table "result" into a local csv file? Thanks a lot.
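A sketch for small tables (assumes sqlContext is a HiveContext that can see the database; for large tables, write to a distributed path and merge the part files instead):

    val df = sqlContext.sql("SELECT * FROM data_result.result")

    // Collect to the driver and write a single local CSV - small tables only.
    val writer = new java.io.PrintWriter("/tmp/result.csv")
    writer.println(df.columns.mkString(","))            // header row
    df.collect().foreach(row => writer.println(row.mkString(",")))
    writer.close()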

kinesis batches hang after YARN automatic driver restart

2015-11-02 Thread Hster Geguri
Hello Wonderful Spark People, We are testing AWS Kinesis/Spark Streaming (1.5) failover behavior with Hadoop/YARN 2.6 and 2.7.1 and want to understand the expected behavior. When I manually kill a YARN application master/driver with a Linux kill -9, YARN will automatically relaunch another master

Fwd: Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-11-02 Thread Babar Tareen
Resending, haven't found a workaround. Any help is highly appreciated. -- Forwarded message -- From: Babar Tareen Date: Thu, Oct 22, 2015 at 2:47 PM Subject: Getting ClassNotFoundException: scala.Some on Spark 1.5.x To: user@spark.apache.org Hi, I am

How to handle Option[Int] in dataframe

2015-11-02 Thread manas kar
Hi, I have a case class with many columns that are Option[Int] or Option[Array[Byte]] and such. I would like to save it to a parquet file and later read it back into my case class too. I found that an Option[Int] field comes back as 0 when the value is null. My question: Is there a way to get
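The 0 typically comes from calling a primitive getter on a null cell. A sketch of round-tripping Options through parquet, guarding with isNullAt on the way back (class and path are illustrative):

    case class Rec(id: Int, score: Option[Int])

    val df = sqlContext.createDataFrame(Seq(Rec(1, Some(10)), Rec(2, None)))
    df.write.parquet("/tmp/recs")

    val back = sqlContext.read.parquet("/tmp/recs").map { row =>
      // getInt on a null cell returns 0; check isNullAt first.
      Rec(row.getInt(0), if (row.isNullAt(1)) None else Some(row.getInt(1)))
    }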

SparkSQL implicit conversion on insert

2015-11-02 Thread Bryan Jeffrey
All, I have an object with Joda DateTime fields. I would prefer to continue to use DateTime in my application. When I am inserting into Hive I need to cast to a Timestamp field (DateTime is not supported). I added an implicit conversion from DateTime to Timestamp - but it does not appear to be
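Scala implicits only rewrite code the compiler sees; the insert path inspects field types at runtime via reflection, so the implicit never fires. A sketch of converting explicitly before writing (class and field names are illustrative):

    import java.sql.Timestamp
    import org.joda.time.DateTime

    case class Event(id: Int, at: DateTime)
    case class EventRow(id: Int, at: Timestamp)   // Timestamp is Hive-supported

    val events = Seq(Event(1, DateTime.now))
    val df = sqlContext.createDataFrame(
      events.map(e => EventRow(e.id, new Timestamp(e.at.getMillis))))
    df.write.saveAsTable("events")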

Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
Hello Spark community - I am running a Spark SQL query to calculate the difference in time between consecutive events, using lag(event_time) over window - SELECT device_id, unix_time, event_id, unix_time - lag(unix_time) OVER (PARTITION BY device_id ORDER
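The message is truncated above; a sketch of the full shape of such a query (window functions in 1.5 require a HiveContext; "events" is a hypothetical registered table):

    val diffs = sqlContext.sql("""
      SELECT device_id, unix_time, event_id,
             unix_time - lag(unix_time)
               OVER (PARTITION BY device_id ORDER BY unix_time) AS secs_since_prev
      FROM events
    """)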

Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
I am using Spark 1.5.0 on Yarn On Nov 2, 2015, at 3:16 PM, Yin Huai > wrote: Hi Ross, What version of spark are you using? There were two issues that affected the results of window function in Spark 1.5 branch. Both of issues have been fixed

Re: Standalone cluster not using multiple workers for single application

2015-11-02 Thread Jean-Baptiste Onofré
Hi Jeff, it may depend on your application code. To verify your setup and whether you are able to scale to multiple workers, you can try using the SparkTC example for instance (it should use all workers). Regards JB On 11/02/2015 08:56 PM, Jeff Jones wrote: I’ve got a series of applications
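If the test app does scale, a common culprit is the application's own resource request: standalone mode spreads executors across workers by default, so a single-worker run often means few cores were requested or the data has a single partition. A sketch of the relevant settings (values are examples only):

    val conf = new org.apache.spark.SparkConf()
      .setMaster("spark://master:7077")     // hypothetical master URL
      .setAppName("MyApp")
      .set("spark.cores.max", "16")         // ask for cores across all workers
      .set("spark.executor.memory", "4g")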

Re: How to lookup by a key in an RDD

2015-11-02 Thread Deenar Toraskar
Swetha, Currently IndexedRDD is an external library and not part of Spark core. You can use it by adding a dependency and pulling it in. There are plans to move it to Spark core, tracked in https://issues.apache.org/jira/browse/SPARK-2365. See
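Until then, plain pair RDDs already offer a built-in (if slower) lookup; a minimal sketch:

    val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b")).cache()
    // Scans all partitions unless the RDD has a known partitioner, in which
    // case only the partition owning the key is scanned.
    val hits: Seq[String] = pairs.lookup(2)   // Seq("b")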

Re: Submitting Spark Applications - Do I need to leave ports open?

2015-11-02 Thread Akhil Das
Yes, you need to open up a few ports for that to happen. Have a look at http://spark.apache.org/docs/latest/configuration.html#networking - you can see the *.port configurations, which bind to random ports by default; fix those ports to specific numbers and open them in your firewall, and it should
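A sketch of pinning the driver-side ports so a firewall can whitelist them (port numbers are examples only):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.driver.port", "51000")        // executors connect back here
      .set("spark.fileserver.port", "51100")
      .set("spark.broadcast.port", "51200")
      .set("spark.blockManager.port", "51300")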

Re: execute native system commands in Spark

2015-11-02 Thread Deenar Toraskar
You can do the following; make sure the number of executors requested equals the number of executors on your cluster. import scala.sys.process._ import org.apache.hadoop.security.UserGroupInformation import org.apache.spark.deploy.SparkHadoopUtil sc.parallelize(0 to 10).map { _
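The snippet above is cut off; a minimal self-contained sketch of the same idea:

    import scala.sys.process._

    // One command per element, executed in parallel on the executors;
    // !! runs the command and returns its stdout.
    val hostnames = sc.parallelize(0 to 10)
      .map(_ => "hostname".!!.trim)
      .collect()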

[Yarn] How to set user in ContainerLaunchContext?

2015-11-02 Thread Peter Rudenko
Hi, I have an ApplicationMaster which accepts requests and launches containers on which it launches spark-submit --master yarn. In the request I have a field "username" - the user I want to launch a job as. How can I set the user that the command will run as on a container? Currently they are all running

Re: Why does sortByKey() transformation trigger a job in spark-shell?

2015-11-02 Thread Mark Hamstra
Hah! No, that is not a "starter" issue. It touches on some fairly deep Spark architecture, and there have already been a few attempts to resolve the issue -- none entirely satisfactory, but you should definitely search out the work that has already been done. On Mon, Nov 2, 2015 at 5:51 AM,

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Thúy Hằng Lê
Hi Adrian, Thanks for the information. However your 2 suggestions couldn't really work for me. Accuracy is the most important aspect in my application, so keeping only 15-minute window stats or pruning out some of the keys is impossible for my application. I can change the checkpoint interval

Re: Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-11-02 Thread Jonathan Coveney
Caused by: java.lang.ClassNotFoundException: scala.Some indicates that you don't have the Scala libs present. How are you executing this? My guess is the issue is a conflict between Scala 2.11.6 in your build and 2.11.7? Not sure... try setting your Scala to 2.11.7? But really, first it'd be good
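A sketch of a build.sbt shape that avoids mixed Scala versions (versions are illustrative only):

    scalaVersion := "2.11.7"

    // %% appends the Scala binary suffix, keeping Spark's Scala version in
    // sync with scalaVersion; a _2.10 artifact on a 2.11 build is a classic
    // source of errors like this.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"

    fork in run := true   // run in a fresh JVM instead of inside sbt's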

Re: --jars option using hdfs jars cannot effect when spark standlone deploymode with cluster

2015-11-02 Thread Akhil Das
Can you try putting the jar locally, without HDFS? Thanks Best Regards On Wed, Oct 28, 2015 at 8:40 AM, our...@cnsuning.com wrote: > hi all, >when using command: > spark-submit *--deploy-mode cluster --jars > hdfs:///user/spark/cypher.jar* --class >

Split RDD into multiple RDDs using filter-transformation

2015-11-02 Thread Sushrut Ikhar
Hi, I need to split an RDD into 3 different RDDs using filter transformations. I have cached the original RDD before using filter. The input is lopsided, leaving some executors with a heavy load while others with less, so I have repartitioned it. *DAG-lineage I expected:* I/P RDD --> MAP RDD -->

Re: Programatically create RDDs based on input

2015-11-02 Thread amit tewari
Thanks Natu, Ayan. I was able to create an array of DataFrames (Spark 1.3+). DataFrame[] dfs = new DataFrame[uniqueFileIds.length]; Thanks Amit On Sun, Nov 1, 2015 at 10:58 AM, Natu Lauchande wrote: > Hi Amit, > > I don't see any default constructor in the JavaRDD docs
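In Scala the same idea reads naturally as a Map keyed by file ID (a sketch; uniqueFileIds and the input paths are assumptions based on the thread):

    val dfs: Map[String, org.apache.spark.sql.DataFrame] =
      uniqueFileIds.map { id =>
        id -> sqlContext.read.json(s"/data/input-$id.json")   // hypothetical path
      }.toMap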

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Balachandar R.A.
-- Forwarded message -- From: "Balachandar R.A." Date: 02-Nov-2015 12:53 pm Subject: Re: Error : - No filesystem for scheme: spark To: "Jean-Baptiste Onofré" Cc: > HI JB, > Thanks for the response, > Here is the content of my

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Jean-Baptiste Onofré
Just to be sure: you use yarn cluster (not standalone), right ? Regards JB On 11/02/2015 10:37 AM, Balachandar R.A. wrote: Yes. In two different places I use spark:// 1. In my code, while creating spark configuration, I use the code below val sConf = new

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Balachandar R.A.
No.. I am not using YARN. YARN is not running in my cluster, so it is a standalone one. Regards Bala On 02-Nov-2015 3:11 pm, "Jean-Baptiste Onofré" wrote: > Just to be sure: you use yarn cluster (not standalone), right ? > > Regards > JB > > On 11/02/2015 10:37 AM, Balachandar

Re: How to catch error during Spark job?

2015-11-02 Thread Akhil Das
Usually you add exception handling within the transformations; in your case you have added it in the driver code. This approach won't be able to catch those exceptions happening inside the executor. eg: try { val rdd = sc.parallelize(1 to 100) rdd.foreach(x => throw new
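The example above is cut off; a minimal sketch of handling failures inside the transformation itself, so bad records are dealt with on the executors instead of aborting the job:

    val rdd = sc.parallelize(1 to 100)
    val results = rdd.map { x =>
      try Right(100 / (x - 42))                        // work that may throw
      catch { case e: ArithmeticException => Left(x) } // keep the bad record
    }
    val failed = results.collect { case Left(x) => x }.count()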

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Romi Kuntsman
except "spark.master", do you have "spark://" anywhere in your code or config files? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Nov 2, 2015 at 11:27 AM, Balachandar R.A. wrote: > > -- Forwarded message -- > From: "Balachandar

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Balachandar R.A.
Yes. In two different places I use spark:// 1. In my code, while creating the Spark configuration, I use the code below: val sConf = new SparkConf().setAppName("Dummy").setMaster("spark://:7077") val sc = new SparkContext(sConf) 2. I run the job using the command below: spark-submit

Required file not found: sbt-interface.jar

2015-11-02 Thread Todd
Hi, I am trying to build Spark 1.5.1 in my environment, but encounter the following error complaining "Required file not found: sbt-interface.jar". The error message is below, and I am building with: ./make-distribution.sh --name spark-1.5.1-bin-2.6.0 --tgz --with-tachyon -Phadoop-2.6

Re: Split RDD into multiple RDDs using filter-transformation

2015-11-02 Thread Deng Ching-Mallete
Hi, You should perform an action (e.g. count, take, saveAs*, etc.) in order for your RDDs to be cached, since cache/persist are lazy functions. You might also want to do coalesce instead of repartition to avoid shuffling. Thanks, Deng On Mon, Nov 2, 2015 at 5:53 PM, Sushrut Ikhar
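A sketch of the suggested ordering (input path and filter predicates are placeholders):

    val base = sc.textFile("/data/in")    // hypothetical input
      .map(_.toLowerCase)                 // stand-in for the real map step
      .cache()
    base.count()   // cache/persist are lazy; an action materializes them

    val a = base.filter(_.startsWith("a"))
    val b = base.filter(_.startsWith("b"))
    val c = base.filter(l => !l.startsWith("a") && !l.startsWith("b"))
    // Actions on a, b and c now read the cached data instead of
    // recomputing the map stage three times.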

Re: How to lookup by a key in an RDD

2015-11-02 Thread swetha kasireddy
Hi, Has IndexedRDD been released yet? Thanks, Swetha On Sun, Nov 1, 2015 at 1:21 AM, Gylfi wrote: > Hi. > > You may want to look into Indexed RDDs > https://github.com/amplab/spark-indexedrdd > > Regards, > Gylfi.

Re: How to lookup by a key in an RDD

2015-11-02 Thread Ted Yu
Please take a look at SPARK-2365 On Mon, Nov 2, 2015 at 3:25 PM, swetha kasireddy wrote: > Hi, > > Has IndexedRDD been released yet? > > Thanks, > Swetha > > On Sun, Nov 1, 2015 at 1:21 AM, Gylfi wrote: > >> Hi. >> >> You may want to look into Indexed

Re: Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-11-02 Thread Babar Tareen
I am using *'sbt run'* to execute the code. Detailed sbt output is here ( https://drive.google.com/open?id=0B2dlA_DzEohVakpValRjRS1zVG8). I had scala 2.11.7 installed on my machine. But even after uninstalling it, I am still getting the exception with 2.11.6. Changing the scala version to 2.11.7

ipython notebook NameError: name 'sc' is not defined

2015-11-02 Thread Andy Davidson
Hi, I recently installed a new cluster using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2. The SparkPi sample app works correctly. I am trying to run IPython notebook on my cluster master and use an ssh tunnel so that I can work with the notebook in a browser running on my Mac. Below is how I set up

Re: Getting ClassNotFoundException: scala.Some on Spark 1.5.x

2015-11-02 Thread Jonathan Coveney
My guess, and it's just a guess, is that there is some change between versions which you got bit by as it changed the classpath. On Monday, November 2, 2015, Babar Tareen wrote: > I am using *'sbt run'* to execute the code. Detailed sbt output is here ( >

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Jean-Baptiste Onofré
Ah ok. Good catch ;) Regards JB On 11/02/2015 11:51 AM, Balachandar R.A. wrote: I made a stupid mistake it seems. I supplied the --master option to the spark url in my launch command. And this error is gone. Thanks for pointing out possible places for troubleshooting Regards Bala On

Re: Error : - No filesystem for scheme: spark

2015-11-02 Thread Balachandar R.A.
I made a stupid mistake, it seems. I supplied the --master option with the Spark URL in my launch command, and this error is gone. Thanks for pointing out possible places for troubleshooting. Regards Bala On 02-Nov-2015 3:15 pm, "Balachandar R.A." wrote: > No.. I am not

ClassNotFoundException even if class is present in Jarfile

2015-11-02 Thread hveiga
Hello, I am facing an issue where I cannot run my Spark job in a cluster environment (standalone or EMR) but it works successfully if I run it locally using local[*] as master. I am getting ClassNotFoundException: com.mycompany.folder.MyObject on the slave executors. I don't really understand

RE: Sort Merge Join

2015-11-02 Thread Cheng, Hao
No, as far as I can tell. @Michael @YinHuai @Reynold, any comments on this optimization? From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Tuesday, November 3, 2015 4:17 AM To: Alex Nastetsky Cc: Cheng, Hao; user Subject: Re: Sort Merge Join Additionally, I'm curious if there are any

Spark executor jvm classloader not able to load nested jars

2015-11-02 Thread Nirav Patel
Hi, I have a Maven-based mixed Scala/Java application that can submit Spark jobs. My application jar "myapp.jar" has some nested jars inside its lib folder. It's a fat jar created using the spring-boot-maven plugin, which nests other jars inside the lib folder of the parent jar. I prefer not to create a shaded flat jar

Re: Spark Streaming data checkpoint performance

2015-11-02 Thread Shixiong Zhu
"trackStateByKey" is about to be added in 1.6 to resolve the performance issue of "updateStateByKey". You can take a look at https://issues.apache.org/jira/browse/SPARK-2629 and https://github.com/apache/spark/pull/9256

Re: Exception while reading from kafka stream

2015-11-02 Thread Ramkumar V
If I try the below code snippet, it shows an exception. How do I avoid this exception? How do I switch processing based on topic? JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(30)); HashSet topicsSet_1 = new HashSet(Arrays.asList(topics.split(","))); HashSet

Re: Best practises

2015-11-02 Thread Stefano Baghino
There is this interesting book from Databricks: https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details What do you think? Does it contain the info you're looking for? :) On Mon, Nov 2, 2015 at 2:18 PM, satish chandra j wrote: > HI All, > Yes,

Spark, Mesos problems with remote connections

2015-11-02 Thread Sebastian Kuepers
Hey, I have a Mesos cluster with a single Master. If I run the following directly on the master machine: pyspark --master mesos://host:5050 everything works just fine. If I try to connect to the master, starting a driver from my laptop, everything stops after the following log output

Does the Standalone cluster and Applications need to be same Spark version?

2015-11-02 Thread pnpritchard
The title gives the gist of it: do the standalone cluster and the applications need to be the same Spark version? For example, say I have a standalone cluster running version 1.5.0. Can I run an application that was built against the Spark 1.5.1 library, submitted using the spark-submit script from 1.5.1