Re: Configure Spark to run with MemSQL DB Cluster

2016-07-26 Thread yash datta
Read here: https://github.com/memsql/memsql-spark-connector Best Regards Yash On Wed, Jul 27, 2016 at 12:54 PM, Subhajit Purkayastha wrote: > All, > > > > Is it possible to integrate spark 1.6.1 with MemSQL Cluster? Any pointers > on how to start with the project will be
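
The connector above is the documented route. As a rough fallback sketch (an assumption, not from the thread): MemSQL speaks the MySQL wire protocol, so a Spark 1.6 DataFrame can also be read over plain JDBC; the host, database, table, and credentials below are made-up placeholders.

```scala
// Hedged sketch: reading a MemSQL table through the generic JDBC data source.
val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://memsql-host:3306/mydb")   // hypothetical host/db
  .option("driver", "com.mysql.jdbc.Driver")             // MySQL driver works against MemSQL
  .option("dbtable", "my_table")                         // hypothetical table
  .option("user", "root")
  .option("password", "secret")
  .load()
df.printSchema()
```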

Configure Spark to run with MemSQL DB Cluster

2016-07-26 Thread Subhajit Purkayastha
All, Is it possible to integrate spark 1.6.1 with MemSQL Cluster? Any pointers on how to start with the project will be appreciated. Thx, Subhajit

Re: How to give name to Spark jobs shown in Spark UI

2016-07-26 Thread rahulkumar-aws
You can set the name in SparkConf(), or if you are using spark-submit, set the --name flag: val sparkconf = new SparkConf().setMaster("local[4]").setAppName("saveFileJob"); val sc = new SparkContext(sparkconf). Or with spark-submit: ./bin/spark-submit --name "FileSaveJob"

Spark Jobs not getting shown in Spark UI browser

2016-07-26 Thread Prashant verma
Hi All, I have recently started using Spark 1.6.2 for running my Spark jobs. But now my jobs are not shown in the Spark UI in the browser, even though the job runs fine, which I can see in the shell output. Any suggestions? Thanks, Prashant Verma

Fail a batch in Spark Streaming forcefully based on business rules

2016-07-26 Thread Hemalatha A
Hello, I have a use case wherein I have to fail certain batches in my streaming job, based on application-specific business rules. Ex: if in a batch of 2 seconds I don't receive 100 messages, I should fail the batch and move on. How do I achieve this behavior? -- Regards Hemalatha

Re: Spark Beginner Question

2016-07-26 Thread Holden Karau
So you will need to convert your input DataFrame into something with vectors and labels to train on - the Spark ML documentation has examples http://spark.apache.org/docs/latest/ml-guide.html (although the website seems to be having some issues mid update to Spark 2.0 so if you want to read it
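
A hedged sketch of the conversion described above, in Scala (the thread uses PySpark; the ml.feature API is analogous). The column names "f1", "f2", and "category" are assumptions, not from the thread.

```scala
// Hedged sketch: assemble numeric columns into a vector and index a string class
// column into a numeric label, then feed both into a classifier via a Pipeline.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))   // numeric input columns -> single "features" vector
  .setOutputCol("features")
val indexer = new StringIndexer()
  .setInputCol("category")           // string class column -> numeric "label"
  .setOutputCol("label")
val lr = new LogisticRegression()

val model = new Pipeline().setStages(Array(assembler, indexer, lr)).fit(df)
```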

Spark 2.0 just released

2016-07-26 Thread Chanh Le
It's official now: http://spark.apache.org/releases/spark-release-2-0-0.html Everyone should check it out.

The Future Of DStream

2016-07-26 Thread Chang Chen
Hi guys, Structured Streaming is coming with Spark 2.0, but I noticed that DStream is still here. What's the future of DStream? Will it be deprecated and removed eventually, or co-exist with Structured Streaming forever? Thanks Chang

Spark Beginner Question

2016-07-26 Thread Shi Yu
Hello, Question 1: I am new to Spark. I am trying to train a classification model on a Spark DataFrame. I am using PySpark, and I created a Spark DataFrame object in df: from pyspark.sql.types import * query = """select * from table""" df = sqlContext.sql(query) My question is

Re: how to use spark.mesos.constraints

2016-07-26 Thread Rodrick Brown
Shuffle service has nothing to do with constraints it is however advised to run the mesos-shuffle-service on each of your agent nodes running spark. Here is the command I use to run a typical spark jobs on my cluster using constraints (this is generated but from another script we run but should

Re: Maintaining order of pair rdd

2016-07-26 Thread Kuchekar
Hi Janardhan, You could do something like this: to maintain the insertion order by key, first partition by key (so that each key is located in the same partition), and after that you can do something like this: RDD.mapValues(x => ArrayBuffer(x)).reduceByKey((x, y) =>
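
A fuller sketch of that approach, under the assumption that keeping each key's values together in one partition is what matters (note that reduceByKey itself does not strictly guarantee the order in which values are combined):

```scala
// Hedged sketch: co-locate each key with a partitioner, wrap values in buffers,
// then concatenate the buffers per key.
import org.apache.spark.HashPartitioner
import scala.collection.mutable.ArrayBuffer

val pairs = sc.parallelize(Seq(("ID2", 18159), ("ID1", 18159), ("ID2", 36318), ("ID1", 36318)))
val grouped = pairs
  .partitionBy(new HashPartitioner(4))    // each key lands in exactly one partition
  .mapValues(v => ArrayBuffer(v))         // wrap every value in a one-element buffer
  .reduceByKey((a, b) => a ++ b)          // merge buffers per key
```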

Re: Maintaining order of pair rdd

2016-07-26 Thread janardhan shetty
Let me provide step-wise details: 1. I have an RDD = { (ID2,18159) - element 1 (ID1,18159) - element 2 (ID3,18159) - element 3 (ID2,36318) - element 4 (ID1,36318) - element 5 (ID3,36318) (ID2,54477) (ID1,54477) (ID3,54477) } 2. RDD.groupByKey().mapValues(v => v.toArray()) Array(

Re: how to use spark.mesos.constraints

2016-07-26 Thread Jia Yu
Hi, I am also trying to use spark.mesos.constraints but it gives me the same error: the job has not been accepted by any resources. I suspect that I should start some additional service like ./sbin/start-mesos-shuffle-service.sh. Am I correct? Thanks, Jia On Tue, Dec 1, 2015 at 5:14 PM,

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
I don't think so, but that sounds like a good idea On Tue, Jul 26, 2016 at 6:19 PM, Sudhir Babu Pothineni < sbpothin...@gmail.com> wrote: > Just correction: > > ORC Java libraries from Hive are forked into Apache ORC. Vectorization > default. > > Do not know If Spark leveraging this new repo? > >

Re: dynamic coalesce to pick file size

2016-07-26 Thread Pedro Rodriguez
I asked something similar if you search for "Tools for Balancing Partitions By Size" (I couldn't find link on archives). Unfortunately there doesn't seem to be something good right now other than knowing your job statistics. I am planning on implementing the idea I explained in the last paragraph

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
Parquet was inspired by Dremel but written from the ground up as a library with support for a variety of big data systems (Hive, Pig, Impala, Cascading, etc.). It is also easy to add new support, since it's a proper library. ORC has been enhanced while deployed at Facebook in Hive and at Yahoo in

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
libraryDependencies ++= Seq( // other dependencies here "org.apache.spark" %% "spark-core" % "1.6.2" % "provided", "org.apache.spark" %% "spark-mllib" % "1.6.2" % "provided", "org.scalanlp" %% "breeze" % "0.12", // native

Re: libraryDependencies

2016-07-26 Thread Martin Somers
cheers - I updated libraryDependencies ++= Seq( // other dependencies here "org.apache.spark" %% "spark-core" % "1.6.2" % "provided", "org.apache.spark" %% "spark-mllib_2.10" % "1.6.2", "org.scalanlp" %% "breeze" % "0.12", //

Re: libraryDependencies

2016-07-26 Thread Michael Armbrust
Also, you'll want all of the various spark versions to be the same. On Tue, Jul 26, 2016 at 12:34 PM, Michael Armbrust wrote: > If you are using %% (double) then you do not need _2.11. > > On Tue, Jul 26, 2016 at 12:18 PM, Martin Somers wrote: > >> >>

libraryDependencies

2016-07-26 Thread Martin Somers
my build file looks like libraryDependencies ++= Seq( // other dependencies here "org.apache.spark" %% "spark-core" % "1.6.2" % "provided", "org.apache.spark" %% "spark-mllib_2.11" % "1.6.0", "org.scalanlp" % "breeze_2.11" % "0.7",
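
Based on the replies above (use %% without an explicit _2.10/_2.11 suffix, and keep all Spark artifacts on the same version), a corrected build file might look like the sketch below; the 1.6.2 version comes from the thread, while the Scala version is an assumption.

```scala
// build.sbt sketch: %% appends the Scala binary suffix automatically, so artifact
// names must not carry _2.10/_2.11, and every Spark module should share one version.
scalaVersion := "2.10.6"                 // assumed; must match the Spark build you run against

val sparkVersion = "1.6.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "org.scalanlp"     %% "breeze"      % "0.12"
)
```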

Re: read only specific jsons

2016-07-26 Thread Cody Koeninger
Have you tried filtering out corrupt records with something along the lines of df.filter(df("_corrupt_record").isNull) On Tue, Jul 26, 2016 at 1:53 PM, vr spark wrote: > i am reading data from kafka using spark streaming. > > I am reading json and creating dataframe. > I

dynamic coalesce to pick file size

2016-07-26 Thread Maurin Lenglart
Hi, I am running a SQL query that returns a DataFrame. Then I am writing the result of the query using “df.write”, but the result gets written in a lot of small files (~100 files of ~200 KB each). So now I am doing a “.coalesce(2)” before the write. But the number “2” that I picked is static, is
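
One way to make that number dynamic is to derive it from an estimated output size and a target file size. This is a rough sketch of that idea, not something proposed in the thread, and the size estimate forces an extra pass over the data:

```scala
// Hedged sketch: choose the coalesce factor from a crude size estimate instead of
// a hard-coded 2. The 128 MB target and the output path are assumptions.
val targetFileBytes = 128L * 1024 * 1024
val estimatedBytes  = df.rdd.map(_.toString.length.toLong).reduce(_ + _)  // rough per-row sizing
val numFiles        = math.max(1, (estimatedBytes / targetFileBytes).toInt)

df.coalesce(numFiles).write.parquet("/tmp/output")
```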

read only specific jsons

2016-07-26 Thread vr spark
I am reading data from Kafka using Spark Streaming. I am reading JSON and creating a DataFrame. I am using PySpark: kvs = KafkaUtils.createDirectStream(ssc, kafkaTopic1, kafkaParams) lines = kvs.map(lambda x: x[1]) lines.foreachRDD(mReport) def mReport(clickRDD): clickDF =

read only specific jsons

2016-07-26 Thread vr spark
I am reading data from Kafka using Spark Streaming. I am reading JSON and creating a DataFrame. kvs = KafkaUtils.createDirectStream(ssc, kafkaTopic1, kafkaParams) lines = kvs.map(lambda x: x[1]) lines.foreachRDD(mReport) def mReport(clickRDD): clickDF = sqlContext.jsonRDD(clickRDD)

File System closed while submitting job in spark

2016-07-26 Thread KhajaAsmath Mohammed
Hi, My Spark job is failing while pulling the properties file from HDFS. The same code runs fine when I run it on Windows, but it does not run when testing it on YARN. Spark submit script: spark-submit --class com.mcd.sparksql.datahub.DataMarts --master local[*] gdwspark.jar

getting more concurrency best practices

2016-07-26 Thread Andy Davidson
Below is a very simple application. It runs very slowly, and it does not look like I am getting a lot of parallel execution. I imagine this is a very common workflow: periodically I want to run some standard summary statistics across several different data sets. Any suggestions would be greatly

Is RowMatrix missing in org.apache.spark.ml package?

2016-07-26 Thread Rohit Chaddha
It is present in mllib but I don't seem to find it in the ml package. Any suggestions please? -Rohit

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Andy Davidson
Yup, in cluster mode you need to figure out which machine the driver is running on; that is the machine the UI will run on. https://issues.apache.org/jira/browse/SPARK-15829 You may also have firewall issues. Here are some notes I made about how to figure out what machine the driver is running on

Re: Upgrade from 1.2 to 1.6 - parsing flat files in working directory

2016-07-26 Thread Sumona Routh
Can anyone provide some guidance on how to get files on the classpath for our Spark job? This used to work in 1.2, however after upgrading we are getting nulls when attempting to load resources. Thanks, Sumona On Thu, Jul 21, 2016 at 4:43 PM Sumona Routh wrote: > Hi all, >

Re: spark sql aggregate function "Nth"

2016-07-26 Thread Alex Nastetsky
Ah, that gives me an idea. val window = Window.partitionBy() val getRand = udf((cnt:Int) => ) df .withColumn("cnt", count().over(window)) .withColumn("rnd", getRand($"cnt")) .where($"rnd" === $"cnt") Not sure how performant this would be, but writing a UDF is much simpler than a UDAF. On Tue,

Re: spark sql aggregate function "Nth"

2016-07-26 Thread ayan guha
You can use rank with window function. Rank=1 is same as calling first(). Not sure how you would randomly pick records though, if there is no Nth record. In your example, what happens if data is of only 2 rows? On 27 Jul 2016 00:57, "Alex Nastetsky" wrote: >
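
A minimal sketch of the rank/window idea (row_number is used instead of rank to avoid ties; the "category" and "value" columns and n = 3 are assumptions). As noted above, a group with fewer than n rows simply produces no output row.

```scala
// Hedged sketch: pick the Nth row per group with a window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val n = 3
val w = Window.partitionBy(col("category")).orderBy(col("value"))
val nth = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === n)   // groups with fewer than n rows yield nothing
  .drop("rn")
```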

Re: FileUtil.fullyDelete does ?

2016-07-26 Thread Divya Gehlot
What happened in my use case? I do know what it does :) I need to know why they are deleting the src and destination file paths On Jul 26, 2016 10:20 PM, "praveenesh kumar" wrote: > >

sparse vector to dense vecotor in pyspark

2016-07-26 Thread pseudo oduesp
Hi, with StandardScaler we get a sparse vector. How can I transform it to a list or a dense vector without losing the sparse values? Thanks
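
The thread is about PySpark, but the idea is the same across the APIs: toArray materializes the implicit zeros, so nothing is lost. A hedged Scala sketch:

```scala
// Hedged sketch: a sparse vector only stores non-zero entries; toArray expands it,
// and Vectors.dense rebuilds it as a dense vector with all positions present.
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val sparse: Vector = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
val dense          = Vectors.dense(sparse.toArray)   // [0.0, 2.0, 0.0, 4.0, 0.0]
val asList         = sparse.toArray.toList           // plain list of all 5 values
```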

Re: UDF returning generic Seq

2016-07-26 Thread Chris Beavers
Yong, Thanks for the response. While those are good examples, they are able to leverage the keytype/valuetype structure of Maps to specify an explicit return type. I guess maybe the more fundamental issue is that I want to support heterogeneous maps/arrays allowed by JSON: [1, "str", 2.345] or

ioStreams for DataFrameReader/Writer

2016-07-26 Thread Roger Holenweger
Hello all, why are all DataFrameReader and Writer methods tied to Path objects? In other words, why are there no Readers/Writers that can load data from an input or output stream? After looking at the Spark source code, I realize that there is no easy way to add methods that can handle

spark sql aggregate function "Nth"

2016-07-26 Thread Alex Nastetsky
Spark SQL has a "first" function that returns the first item in a group. Is there a similar function, perhaps in a third party lib, that allows you to return an arbitrary (e.g. 3rd) item from the group? Was thinking of writing a UDAF for it, but didn't want to reinvent the wheel. My endgoal is to

Re: UDF returning generic Seq

2016-07-26 Thread Yong Zhang
I don't know if "ANY" will work or not, but take a look at how the "map_values" UDF is implemented in Spark, which returns the map values as an array/seq of arbitrary type. https://issues.apache.org/jira/browse/SPARK-16279 Yong From: Chris Beavers

Re: sbt build under scala

2016-07-26 Thread Jacek Laskowski
Hi, I don't think there are any sbt-related changes in Spark 2.0. Just different versions in libraryDependencies. As to the article, I'm surprised it didn't mention using sbt-assembly [1] for docker-like deployment or sbt-native-packager [2] that could create a Docker image. [1]

Re: Event Log Compression

2016-07-26 Thread Jacek Laskowski
Hi, I guess it's spark.io.compression.codec, with lz4 being the default; lzf and snappy are also supported. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Jul 26, 2016
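
A hedged sketch of what that would look like when building the conf, assuming (per the reply above) that event-log compression follows spark.io.compression.codec:

```scala
// Hedged sketch: event-log compression reuses the general I/O codec setting, so
// switching spark.io.compression.codec to snappy should cover the event logs too.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("eventLogWithSnappy")               // assumed app name
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.compress", "true")
  .set("spark.io.compression.codec", "snappy")
```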

Re: Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Cody Koeninger
Can you go ahead and open a Jira ticket with that explanation? Is there a reason you need to use receivers instead of the direct stream? On Tue, Jul 26, 2016 at 4:45 AM, Andy Zhao wrote: > Hi guys, > > I wrote a spark streaming program which consume 1000 messages from

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
Interesting opinion, thank you Still, on the website parquet is basically inspired by Dremel (Google) [1] and part of orc has been enhanced while deployed for Facebook, Yahoo [2]. Other than this presentation [3], do you guys know any other benchmark?

Re: Num of executors and cores

2016-07-26 Thread Mail.com
Hi, In spark-submit, I specify --master yarn-client. When I go to Executors in the UI I do see all 12 executors assigned. But for the stage, when I drill down to Tasks I saw only 8 tasks with index 0-7. I ran again, increasing the number of executors to 15, and I now see 12 tasks for

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
I did netstat -apn | grep 4040 on machine 6, and I see tcp 0 0 :::4040 :::* LISTEN 30597/java What does this mean? On Tue, Jul 26, 2016 at 6:47 AM, Jestin Ma wrote: > I do not deploy using cluster mode and I don't use EC2. > > I

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
I do not deploy using cluster mode and I don't use EC2. I just read that launching as client mode: "the driver is launched directly within the spark-submit process which acts as a *client* to the cluster." My current setup is that I have cluster machines 1, 2, 3, 4, 5, with 1 being the master. I

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jestin Ma
I tried doing that on my master node. I got nothing. However, I grep'd port 8080 and I got the standalone UI. On Tue, Jul 26, 2016 at 12:39 AM, Chanh Le wrote: > You’re running in StandAlone Mode? > Usually inside active task it will show the address of current job. > or

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jacek Laskowski
Hi, Do you perhaps deploy using cluster mode? Is this EC2? You'd need to figure out where the driver runs and use the machine's IP. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at

Event Log Compression

2016-07-26 Thread Bryan Jeffrey
All, I am running Spark 1.6.1. I enabled 'spark.eventLog.compress', and the data is now being compressed using lz4. I would like to move that back to Snappy as I have some third party tools that require using Snappy. Is there a variable used to control Spark eventLog compression algorithm?

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Pedro Rodriguez
:) Just realized you didn't get your original question answered though: scala> import sqlContext.implicits._ import sqlContext.implicits._ scala> case class Person(age: Long, name: String) defined class Person scala> val df = Seq(Person(24, "pedro"), Person(22, "fritz")).toDF() df:

Re: Outer Explode needed

2016-07-26 Thread Yong Zhang
The reason of no response is that this feature is not available yet. You can vote and following this JIRA https://issues.apache.org/jira/browse/SPARK-13721, if you really need this feature. Yong From: Don Drake Sent: Monday, July 25,

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
When Parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. Nice. ORC, on the other hand, initially only supported Hive. It wasn't even designed as a library that can be re-used. Even today it brings in the kitchen

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Gourav Sengupta
And Pedro has made sense of a world running amok, scared, and drunken stupor. Regards, Gourav On Tue, Jul 26, 2016 at 2:01 PM, Pedro Rodriguez wrote: > I am not 100% as I haven't tried this out, but there is a huge difference > between the two. Both foreach and collect

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Pedro Rodriguez
I am not 100% sure as I haven't tried this out, but there is a huge difference between the two. Both foreach and collect are actions regardless of whether or not the data frame is empty. Doing a collect will bring all the results back to the driver, possibly forcing it to run out of memory. Foreach
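
A tiny sketch of that difference, assuming a DataFrame df:

```scala
// Hedged sketch: foreach runs the closure on the executors, so any println lands in
// executor logs; collect() first ships every row to the driver (possibly exhausting
// its memory), and only then does the foreach run locally on the driver.
df.foreach(row => println(row))            // distributed; output appears in executor stderr
df.collect().foreach(row => println(row))  // all rows pulled to the driver, printed there
```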

sbt build under scala

2016-07-26 Thread Martin Somers
Just wondering, what is the correct way of building a Spark job using Scala? Are there any changes coming with Spark v2? I've been following this post: http://www.infoobjects.com/spark-submit-with-sbt/ Then again, I've been mainly using Docker locally. What is a decent container for submitting

Re: Num of executors and cores

2016-07-26 Thread Jacek Laskowski
Hi, Where's this yarn-client mode specified? When you said "However, when I run the job I see that the stage which reads the directory has only 8 tasks." -- how do you see 8 tasks for a stage? It appears you're in local[*] mode on an 8-core machine (like me) and that's why I'm asking such basic

Re: Num of executors and cores

2016-07-26 Thread Mail.com
More of jars and files and app name. It runs on yarn-client mode. Thanks, Pradeep > On Jul 26, 2016, at 7:10 AM, Jacek Laskowski wrote: > > Hi, > > What's ""? What master URL do you use? > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ >

Re: DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-26 Thread Jacek Laskowski
Hi, Anything relevant in ApplicationMaster's log? What about the executors? You should have 2 (default) so review the logs of each executors. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ovidiu-Cristian MARCU
So did you actually try to run your use case with Spark 2.0 and ORC files? It's hard to understand your 'apparently..'. Best, Ovidiu > On 26 Jul 2016, at 13:10, Gourav Sengupta wrote: > > If you have ever tried to use ORC via SPARK you will know that SPARK's >

Re: DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-26 Thread Ascot Moss
It is YARN cluster, /bin/spark-submit \ --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCTimeStamps -XX:+PrintGCDetails" \ --driver-memory 64G \ --executor-memory 16g \ On Tue, Jul 26, 2016 at 7:00 PM, Jacek Laskowski wrote: > Hi, > > What's the cluster

Re: Num of executors and cores

2016-07-26 Thread Jacek Laskowski
Hi, What's ""? What master URL do you use? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Tue, Jul 26, 2016 at 2:18 AM, Mail.com wrote:

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Gourav Sengupta
If you have ever tried to use ORC via SPARK you will know that SPARK's promise of accessing ORC files is a sham. SPARK cannot access partitioned tables via HIVEcontext which are ORC, SPARK cannot stripe through ORC faster and what more, if you are using SQL and have thought of using HIVE with ORC

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Jacek Laskowski
Hi, Go to 8080 and under Running Applications click the Application ID. You're on the page with Application Detail UI just before Executor Summary table. Use it to access the web UI. Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark

Re: DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-26 Thread Jacek Laskowski
Hi, What's the cluster manager? Is this YARN perhaps? Do you have any other apps on the cluster? How do you submit your app? What are the properties? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at

FileUtil.fullyDelete does ?

2016-07-26 Thread Divya Gehlot
Hi, When I am using the FileUtil.copyMerge function: val file = "/tmp/primaryTypes.csv" FileUtil.fullyDelete(new File(file)) val destinationFile= "/tmp/singlePrimaryTypes.csv" FileUtil.fullyDelete(new File(destinationFile)) val counts = partitions. reduceByKey {case (x,y) => x +

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Ofir Manor
One additional point specific to Spark 2.0 - for the alpha Structured Streaming API (only), the file sink only supports Parquet format (I'm sure that limitation will be lifted in a future release before Structured Streaming is GA): "File sink - Stores the output to a directory. As of Spark

Re: DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-26 Thread Ascot Moss
any ideas? On Tuesday, July 26, 2016, Ascot Moss wrote: > Hi, > > spark: 1.6.1 > java: java 1.8_u40 > I tried random forest training phase, the same code works well if with 20 > trees (lower accuracy, about 68%). When trying the training phase with > more tree, I set to

Spark streaming lost data when ReceiverTracker writes Blockinfo to hdfs timeout

2016-07-26 Thread Andy Zhao
Hi guys, I wrote a Spark Streaming program which consumes 1000 messages from one topic of Kafka, does some transformation, and writes the result back to another topic. But I only found 988 messages in the second topic. I checked the log info and confirmed all messages were received by the receivers. But I

Re: Maintaining order of pair rdd

2016-07-26 Thread Marco Mistroni
Apologies Janardhan, I always get confused on this. OK, so you have a (key, val) RDD (val is irrelevant here); then you can do this: val reduced = myRDD.reduceByKey((first, second) => first ++ second) val sorted = reduced.sortBy(tpl => tpl._1) hth On Tue, Jul 26, 2016 at 3:31 AM, janardhan

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Jörn Franke
I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push-down. In the end you have to check which application you are using and do some tests (with

Re: PCA machine learning

2016-07-26 Thread pseudo oduesp
Hi, I want to add some points. I am getting the following two vectors; the first one is the features vector: Row(features=SparseVector(765, {0: 3.0, 1: 1.0, 2: 50.0, 3: 16.0, 5: 88021.0, 6: 88021.0, 8: 1.0, 11: 1.0, 12: 200.0, 14: 200.0, 15: 200.0, 16: 200.0, 17: 2.0, 18: 1.0, 25: 1.0, 26: 2.0, 31: 89200.0, 32:

PCA machine learning

2016-07-26 Thread pseudo oduesp
Hi, when I perform PCA dimensionality reduction I get a dense vector whose length is the number of principal components. My questions: - How do I get the names of the features contributing to these vectors? - Are the values inside the result vectors the projections of all the features onto these components? - How do I use it?
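
A hedged Scala sketch of how the pieces relate (the thread is PySpark; the ml.feature.PCA API is analogous, and the "features" column and k = 3 are assumptions). The fitted model's pc matrix is what links the original features to the components: entry (i, j) is the weight of feature i in component j, and each output vector is the projection of the input features onto those components.

```scala
// Hedged sketch: fit PCA, inspect the loading matrix, and project the data.
import org.apache.spark.ml.feature.PCA

val pca = new PCA()
  .setInputCol("features")      // assumed vector column
  .setOutputCol("pcaFeatures")
  .setK(3)
val model = pca.fit(df)
println(model.pc)               // numFeatures x k loading matrix: feature-to-component weights
val reduced = model.transform(df).select("pcaFeatures")
```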

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
Thank you Chanh 2016-07-26 15:34 GMT+08:00 Chanh Le : > Hi Ken, > > blacklistDF -> just DataFrame > Spark is lazy until you call something like collect, take, write; it > will execute the whole process, like the map or filter you do before you > collect. > That mean until

Re: Spark Web UI port 4040 not working

2016-07-26 Thread Chanh Le
You’re running in StandAlone Mode? Usually inside active task it will show the address of current job. or you can check in master node by using netstat -apn | grep 4040 > On Jul 26, 2016, at 8:21 AM, Jestin Ma wrote: > > Hello, when running spark jobs, I can access

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread Chanh Le
Hi Ken, blacklistDF -> just a DataFrame. Spark is lazy: only when you call something like collect, take, or write will it execute the whole process, including the map or filter you do before you collect. That means until you call collect, Spark does nothing, so your df would not have any data -> you can't call foreach. Call

dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
Hi all, I don't quite understand the difference between dataframe.foreach and dataframe.collect().foreach. When should I use dataframe.foreach? I use Spark 2.0, and I want to iterate over a dataframe to get one column's value; this works: blacklistDF.collect().foreach { x =>

Re: yarn.exceptions.ApplicationAttemptNotFoundException when trying to shut down spark application via yarn application --kill

2016-07-26 Thread Saisai Shao
Several useful information can be found here ( https://issues.apache.org/jira/browse/YARN-1842), though personally I haven't met this problem before. Thanks Saisai On Tue, Jul 26, 2016 at 2:21 PM, Yu Wei wrote: > Hi guys, > > > When I tried to shut down spark application

yarn.exceptions.ApplicationAttemptNotFoundException when trying to shut down spark application via yarn application --kill

2016-07-26 Thread Yu Wei
Hi guys, When I tried to shut down a Spark application via "yarn application --kill" (I run the Spark application in yarn-cluster mode on my laptop), I found the following exception in the log: org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException: Application attempt