Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Harsh J
How many CPU cores are on that machine? Read http://qr.ae/8Uv3Xq You can also confirm the above by running the pmap utility on your process; most of the virtual memory will show up under 'anon'. On Fri, 13 May 2016 09:11 jone, wrote: > The virtual memory is 9G When i run

Re: High virtual memory consumption on spark-submit client.

2016-05-12 Thread Mich Talebzadeh
Can you please do the following: run jps|grep SparkSubmit, then send the output of ps aux|grep <pid>, top -p <PID> and the output of free. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Confused - returning RDDs from functions

2016-05-12 Thread Holden Karau
This is not the expected behavior, can you maybe post the code where you are running into this? On Thursday, May 12, 2016, Dood@ODDO wrote: > Hello all, > > I have been programming for years but this has me baffled. > > I have an RDD[(String,Int)] that I return from a

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Rakesh H (Marketing Platform-BLR)
Ping!! Has anybody tested graceful shutdown of spark streaming in yarn-cluster mode? It looks like a defect to me. On Thu, May 12, 2016 at 12:53 PM Rakesh H (Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > We are on spark 1.5.1 > Above change was to add a shutdown hook. > I am not

Re: sbt for Spark build with Scala 2.11

2016-05-12 Thread Luciano Resende
Spark has moved to building with Scala 2.11 by default in master/trunk. As for 2.0.0-SNAPSHOT, it is actually the version of master/trunk, and you might be missing some modules/profiles in your build. What command did you use to build? On Thu, May 12, 2016 at 9:01 PM, Raghava Mutharaju <

sbt for Spark build with Scala 2.11

2016-05-12 Thread Raghava Mutharaju
Hello All, I built Spark from the source code available at https://github.com/apache/spark/. Although I haven't specified the "-Dscala-2.11" option (to build with Scala 2.11), from the build messages I see that it ended up using Scala 2.11. Now, for my application sbt, what should be the spark
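
On the sbt side, a hedged sketch of what the application build might look like, assuming the locally built 2.0.0-SNAPSHOT artifacts have been published to the local repository (e.g. via mvn install or sbt publishLocal); the version string and the "provided" scope are assumptions:

    // build.sbt (sketch)
    scalaVersion := "2.11.8"
    resolvers += Resolver.mavenLocal
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.0.0-SNAPSHOT" % "provided"
    )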

High virtual memory consumption on spark-submit client.

2016-05-12 Thread jone
The virtual memory is 9G when I run org.apache.spark.examples.SparkPi in yarn-cluster mode, using the default configuration.   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

Why spark 1.6.1 run so slow?

2016-05-12 Thread sunday2000
Hi, When we use spark 1.6.1 to run a word count on a 25 MB file, with 2 nodes in cluster mode, it takes 10 seconds to finish the task. Why is it so slow?

Confused - returning RDDs from functions

2016-05-12 Thread Dood
Hello all, I have been programming for years but this has me baffled. I have an RDD[(String,Int)] that I return from a function after extensive manipulation of an initial RDD of a different type. When I return this RDD and initiate the .collectAsMap() on it from the caller, I get an empty
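
For reference, a minimal sketch of the pattern being described, assuming a SparkContext `sc` as in the shell; the input path and the transformation are placeholders, not the original code:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // build an RDD[(String, Int)] inside a function and hand it back to the caller
    def countsByKey(sc: SparkContext, path: String): RDD[(String, Int)] =
      sc.textFile(path)
        .map(_.split(",", 2))
        .filter(_.length == 2)
        .map(parts => (parts(0), 1))
        .reduceByKey(_ + _)

    // the RDD is only materialised when the caller runs an action on it
    val asMap: scala.collection.Map[String, Int] =
      countsByKey(sc, "input.csv").collectAsMap()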

Re: Joining a RDD to a Dataframe

2016-05-12 Thread Cyril Scetbon
Nobody has the answer ? Another thing I've seen is that if I have no documents at all : scala> df.select(explode(df("addresses.id")).as("aid")).collect res27: Array[org.apache.spark.sql.Row] = Array() Then scala> df.select(explode(df("addresses.id")).as("aid"), df("id"))
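
For context, a hedged reconstruction of the kind of statement being discussed; the second DataFrame `otherDf` and its `id` column are assumptions introduced only to illustrate the join:

    import org.apache.spark.sql.functions.explode

    // one row per address id, keeping the parent document id alongside it
    val exploded = df.select(explode(df("addresses.id")).as("aid"), df("id"))

    // join the exploded ids against another (hypothetical) DataFrame
    val joined = exploded.join(otherDf, exploded("aid") === otherDf("id"))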

Re:Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
Sorry, the bug link in the previous mail was wrong. Here is the real link: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Memory-leak-with-spark-streaming-and-spark-sql-in-spark-1-5-1-td14603.html At 2016-05-13 09:49:05, "李明伟" wrote: It seems

Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread Ted Yu
The link below doesn't refer to a specific bug. Can you send the correct link? Thanks > On May 12, 2016, at 6:50 PM, "kramer2...@126.com" wrote: > > It seems we hit the same issue. > > There was a bug on 1.5.1 about memory leak. But I am using 1.6.1 > > Here is the link

Re:Re: Re:Re: Will the HiveContext cause memory leak ?

2016-05-12 Thread kramer2...@126.com
It seems we hit the same issue. There was a bug on 1.5.1 about memory leak. But I am using 1.6.1 Here is the link about the bug in 1.5.1 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark At 2016-05-12 23:10:43, "Simon Schiff [via Apache Spark User List]"

Re: Spark 1.6 Catalyst optimizer

2016-05-12 Thread Telmo Rodrigues
Thank you Takeshi. After executing df3.explain(true) I realised that the Optimiser batches are being performed, and also the predicate push-down. I think that only the analyser batches are executed when creating the data frame with context.sql(query). It seems that the optimiser batches are
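
A minimal sketch of the check being described, assuming two registered temp tables t1 and t2; explain(true) prints the parsed, analysed, optimised and physical plans, which is where the predicate push-down becomes visible:

    val df3 = sqlContext.sql(
      "SELECT t1.id FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.id > 1")
    df3.explain(true)  // parsed, analysed, optimised logical plans plus the physical plan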

Re: XML Processing using Spark SQL

2016-05-12 Thread Mail.com
Hi Arun, Could you try using StAX or JAXB? Thanks, Pradeep > On May 12, 2016, at 8:35 PM, Hyukjin Kwon wrote: > > Hi Arunkumar, > > > I guess your records are self-closing ones. > > There is an issue open here, https://github.com/databricks/spark-xml/issues/92 > >

Re: XML Processing using Spark SQL

2016-05-12 Thread Hyukjin Kwon
Hi Arunkumar, I guess your records are self-closing ones. There is an issue open here: https://github.com/databricks/spark-xml/issues/92 This is about XmlInputFormat.scala, and it seems a bit tricky to handle the case, so I have left it open until now. Thanks! 2016-05-13 5:03 GMT+09:00 Arunkumar

RE: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Yong Zhang
Not sure what you mean? You want to have exactly one query running fine in both SQLContext and HiveContext? The query parsers are different; why do you want this feature? Do I understand your question correctly? Yong Date: Thu, 12 May 2016 13:09:34 +0200 Subject: SQLContext and

Re: Spark handling spill overs

2016-05-12 Thread Takeshi Yamamuro
Hi, Which version of Spark do you use? The recent one cannot handle this kind of spilling, see: http://spark.apache.org/docs/latest/tuning.html#memory-management-overview. // maropu On Fri, May 13, 2016 at 8:07 AM, Ashok Kumar wrote: > Hi, > > How one can avoid
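
For reference, a hedged sketch of the unified memory-management settings that the linked tuning page covers for Spark 1.6; the values shown are the 1.6 defaults, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.75")        // share of heap used for execution + storage
      .set("spark.memory.storageFraction", "0.5")  // portion of that region protected from eviction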

Spark handling spill overs

2016-05-12 Thread Ashok Kumar
Hi, How can one avoid having Spark spill over after filling the node's memory? Thanks

Re: apache spark on gitter?

2016-05-12 Thread Xinh Huynh
I agree that it can help build a community and be a place for real-time conversations. Xinh On Thu, May 12, 2016 at 12:28 AM, Paweł Szulc wrote: > Hi, > > well I guess the advantage of gitter over maling list is the same as with > IRC. It's not actually a replacer because

Re: How to get and save core dump of native library in executors

2016-05-12 Thread prateek arora
ubuntu 14.04 On Thu, May 12, 2016 at 2:40 PM, Ted Yu wrote: > Which OS are you using ? > > See http://en.linuxreviews.org/HOWTO_enable_core-dumps > > On Thu, May 12, 2016 at 2:23 PM, prateek arora > wrote: > >> Hi >> >> I am running my spark

Re: How to get and save core dump of native library in executors

2016-05-12 Thread Ted Yu
Which OS are you using ? See http://en.linuxreviews.org/HOWTO_enable_core-dumps On Thu, May 12, 2016 at 2:23 PM, prateek arora wrote: > Hi > > I am running my spark application with some third party native libraries . > but it crashes some time and show error "

How to get and save core dump of native library in executors

2016-05-12 Thread prateek arora
Hi I am running my spark application with some third-party native libraries, but it sometimes crashes and shows the error " Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again ". Below are the logs: A fatal error

Re: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Mich Talebzadeh
yep the same error I got root |-- a: array (nullable = true) ||-- element: integer (containsNull = false) |-- b: integer (nullable = false) NoViableAltException(35@[]) at org.apache.hadoop.hive.ql.parse.HiveParser.primitiveType(HiveParser.java:38886) at

XML Processing using Spark SQL

2016-05-12 Thread Arunkumar Chandrasekar
Hello, Greetings. I'm trying to process an XML file exported from the Health Kit application using Spark SQL, for learning purposes. The sample record data is like the below. I want the column names of my table to be the field values, like type, sourceName, sourceVersion, and the row entries
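
In case it helps, a hedged sketch of reading such an export with the spark-xml package that comes up later in this thread; the row tag "Record", the file name and the attribute names are assumptions about the export format, and attributes are exposed with the package's default "_" prefix:

    // assumes the com.databricks:spark-xml package is on the classpath
    val records = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Record")
      .load("export.xml")

    records.select("_type", "_sourceName", "_sourceVersion").show()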

Re: kryo

2016-05-12 Thread Ted Yu
This should be related: https://github.com/JodaOrg/joda-time/issues/307 Do you have more of the stack trace ? Cheers On Thu, May 12, 2016 at 12:39 PM, Younes Naguib < younes.nag...@tritondigital.com> wrote: > Thanks, > > I used that. > > Now I seem to have the following problem: > >
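
A commonly used workaround for this Joda-Time/Kryo interaction is to register a dedicated serializer; a hedged sketch, assuming the de.javakaffee kryo-serializers artifact is on the classpath:

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
    import org.apache.spark.serializer.KryoRegistrator

    class JodaKryoRegistrator extends KryoRegistrator {
      // register DateTime with a serializer that preserves the time zone correctly
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer())
    }

    // in the SparkConf:
    //   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   .set("spark.kryo.registrator", "JodaKryoRegistrator")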

RE: kryo

2016-05-12 Thread Younes Naguib
Thanks, I used that. Now I seem to have the following problem: java.lang.NullPointerException at org.joda.time.tz.CachedDateTimeZone.getInfo(CachedDateTimeZone.java:143) at org.joda.time.tz.CachedDateTimeZone.getOffset(CachedDateTimeZone.java:103) at

S3A Creating Task Per Byte (pyspark / 1.6.1)

2016-05-12 Thread Aaron Jackson
I'm using the spark 1.6.1 (hadoop-2.6) and I'm trying to load a file that's in s3. I've done this previously with spark 1.5 with no issue. Attempting to load and count a single file as follows: dataFrame = sqlContext.read.text('s3a://bucket/path-to-file.csv') dataFrame.count() But when it

mllib random forest - executor heartbeat timed out

2016-05-12 Thread vtkmh
Hello, I have a random forest that works fine with 20 trees on 5e6 LabeledPoints for training and 300 features... but when I try to scale it up just a bit to 60 or 100 trees and 10e6 training points, it consistently gets ExecutorLostFailure's due to "no recent heartbeats" with timeout of 120s.
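
For what it's worth, a hedged sketch of the timeout settings that are usually the first knob turned for "no recent heartbeats" failures; the values are illustrative, and the underlying cause is often GC pressure or executor memory exhaustion rather than the timeout itself:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.network.timeout", "600s")            // default 120s; governs the heartbeat timeout
      .set("spark.executor.heartbeatInterval", "30s")  // default 10s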

Re: Reliability of JMS Custom Receiver in Spark Streaming JMS

2016-05-12 Thread Sourav Mazumder
Any inputs on this issue ? Regards, Sourav On Tue, May 10, 2016 at 6:17 PM, Sourav Mazumder < sourav.mazumde...@gmail.com> wrote: > Hi, > > Need to get bit more understanding of reliability aspects of the Custom > Receivers in the context of the code in spark-streaming-jms >

LinearRegressionWithSGD fails on 12Mb data

2016-05-12 Thread RainDev
I'm using Spark 1.6.1 along with Scala 2.11.7 on Ubuntu 14.04, with the following memory settings for my project: JAVA_OPTS="-Xmx8G -Xms2G". My data is organized in 20 json-like files; every file is about 8-15 MB, containing categorical and numerical values. I parse this data, passing by DataFrame
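
A minimal, hedged sketch of the mllib API in question; the tiny dataset is illustrative, and the feature-scaling step is an assumption about a likely fix (unscaled features or a large step size commonly make SGD diverge), not a reproduction of the original code:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val raw = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(10.0, 200.0)),
      LabeledPoint(2.0, Vectors.dense(20.0, 400.0))
    ))

    // scale features to zero mean / unit variance before running SGD
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(raw.map(_.features))
    val scaled = raw.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

    val model = LinearRegressionWithSGD.train(scaled, 100, 0.01)  // numIterations, stepSize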

Re: Spark 1.6 Catalyst optimizer

2016-05-12 Thread Takeshi Yamamuro
Hi, What's the result of `df3.explain(true)`? // maropu On Thu, May 12, 2016 at 10:04 AM, Telmo Rodrigues < telmo.galante.rodrig...@gmail.com> wrote: > I'm building spark from branch-1.6 source with mvn -DskipTests package and > I'm running the following code with spark shell. > > *val*

Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Tom Ellis
I would also like one, Mich - please send it through, thanks! On Thu, 12 May 2016 at 15:14 Alonso Isidoro wrote: > Me too, send me the guide. > > Sent from my iPhone > > On 12 May 2016, at 12:11, Ashok Kumar >

Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Alonso Isidoro
Me too, send me the guide. Sent from my iPhone > On 12 May 2016, at 12:11, Ashok Kumar > wrote: > > Hi Dr Mich, > > I will be very keen to have a look at it and review if possible. > > Please forward me a copy > > Thanking you warmly > > > On

Re: Spark 1.6.0: substring on df.select

2016-05-12 Thread Sun Rui
Alternatively, you may try the built-in function: regexp_extract > On May 12, 2016, at 20:27, Ewan Leith wrote: > > You could use a UDF pretty easily, something like this should work, the > lastElement function could be changed to do pretty much any string >
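
A hedged sketch of the built-in alternative suggested above; the column name "path" is an assumption standing in for whichever column holds the slash-separated string:

    import org.apache.spark.sql.functions.regexp_extract

    // pull the last "/"-separated element without a UDF
    val withLast = df.withColumn("last", regexp_extract(df("path"), "([^/]+)$", 1))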

Re: groupBy and store in parquet

2016-05-12 Thread Michal Vince
Hi Xinh, sorry for my late reply. It's slow because of two reasons (at least to my knowledge): 1. lots of I/O - writing as JSON, then reading and writing again as Parquet; 2. because of the nested RDD I can't run the cycle and filter by event_type in parallel - this applies to your solution (3rd
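
As a point of comparison, a hedged sketch of doing the split in a single pass with DataFrameWriter.partitionBy (available since Spark 1.4), which avoids both the intermediate JSON step and the per-event_type loop; the output path is a placeholder:

    // one pass: each event_type value lands in its own parquet sub-directory
    df.write
      .partitionBy("event_type")
      .parquet("/path/to/output")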

Re: Is this possible to do in spark ?

2016-05-12 Thread Mathieu Longtin
Make a function (or lambda) that reads the text file. Make an RDD with a list of X/Y, then map that RDD through the file-reading function. Same with your X/Y/Z directory. You then have RDDs with the content of each file as a record. Work with those as needed. On Wed, May 11, 2016 at 2:36 PM
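
A minimal sketch of that pattern, using scala.io.Source as one possible reader; the paths are illustrative and must be readable from every executor (a shared filesystem or identical local copies), which is an assumption:

    import scala.io.Source

    // an RDD of paths, mapped through a plain file-reading function
    val paths = sc.parallelize(Seq("/data/x/y/a.txt", "/data/x/y/b.txt"))
    val contents = paths.map { p =>
      val src = Source.fromFile(p)
      try src.mkString finally src.close()
    }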

Efficient for loops in Spark

2016-05-12 Thread flyinggip
Hi there, I'd like to write some iterative computation, i.e., computation that can be done via a for loop. I understand that in Spark foreach is a better choice. However, foreach and foreachPartition seem to be for self-contained computation that only involves the corresponding Row or Partition,
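
For concreteness, a hedged sketch of the usual shape of driver-side iteration in Spark: each step is a transformation, an action materialises it, and caching keeps lineage short; the update step itself is a placeholder:

    var current = sc.parallelize(1 to 1000).map(_.toDouble).cache()

    for (i <- 1 to 10) {
      val next = current.map(_ * 0.9).cache()  // placeholder update step
      val metric = next.sum()                  // action: forces this iteration to run
      println(s"iteration $i: sum = $metric")
      current.unpersist()
      current = next
    }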

RE: Spark 1.6.0: substring on df.select

2016-05-12 Thread Ewan Leith
You could use a UDF pretty easily, something like this should work, the lastElement function could be changed to do pretty much any string manipulation you want. import org.apache.spark.sql.functions.udf def lastElement(input: String) = input.split("/").last val lastElementUdf =
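
The message is truncated here; a hedged completion of the sketch, with "path" as a hypothetical column name:

    // wrap the function as a UDF and apply it to a column (hedged completion)
    val lastElementUdf = udf(lastElement _)
    val withLast = df.withColumn("last", lastElementUdf(df("path")))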

SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Hao Ren
Hi, I just want to figure out why the two contexts behave differently even on a simple query. In a nutshell, I have a query in which there is a String containing a single quote and a cast to Array/Map. I have tried all combinations of the different types of SQL context and query call APIs (sql,

RE: Spark 1.6.0: substring on df.select

2016-05-12 Thread Bharathi Raja
Thanks Raghav. I have 5+ million records. I feel creating multiple columns is not an optimal way. Please suggest any other alternative solution. Can’t we do any string operation in DF.Select? Regards, Raja From: Raghavendra Pandey Sent: 11 May 2016 09:04 PM To: Bharathi Raja Cc: User Subject: Re:

Re: My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Ashok Kumar
Hi Dr Mich, I will be very keen to have a look at it and review if possible. Please forward me a copy Thanking you warmly On Thursday, 12 May 2016, 11:08, Mich Talebzadeh wrote: Hi Al,, Following the threads in spark forum, I decided to write up on

My notes on Spark Performance & Tuning Guide

2016-05-12 Thread Mich Talebzadeh
Hi All, Following the threads in the spark forum, I decided to write up on the configuration of Spark, including allocation of resources and configuration of the driver, executors, threads, execution of Spark apps, and general troubleshooting, taking into account the allocation of resources for Spark

Why spark give out this error message?

2016-05-12 Thread sunday2000
Hi, Do you know what this message means? org.apache.spark.shuffle.FetchFailedException: Failed to connect to localhost/127.0.0.1:50606 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-12 Thread Renato Marroquín Mogrovejo
Hi Amit, This is very interesting indeed, because I have got similar results. I tried doing a filter + groupBy using a Dataset with a function, and using the inner RDD of the DF (RDD[Row]). I used the inner RDD of a DataFrame because apparently there is no straightforward way to create an RDD of
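
For reference, a hedged sketch of the two code paths being compared; the case class, the column order and the input path are assumptions used only to make the comparison concrete:

    case class Event(key: String, value: Int)
    import sqlContext.implicits._

    val ds = sqlContext.read.parquet("/path/to/events").as[Event]

    // Dataset API (1.6): filter + groupBy with plain functions
    val dsCounts = ds.filter(_.value > 0).groupBy(_.key).count()

    // the same aggregation on the DataFrame's underlying RDD[Row]
    val rddCounts = ds.toDF().rdd
      .filter(_.getInt(1) > 0)
      .map(r => (r.getString(0), 1L))
      .reduceByKey(_ + _)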

Re: ML regression - spark context dies without error

2016-05-12 Thread AlexModestov
Hello, I have the same problem... Sometimes I get the error: "Py4JError: Answer from Java side is empty" Sometimes my code works fine, but sometimes not... Did you find out why it might occur? What was the reason? Thanks. -- View this message in context:

Re: Need for advice - performance improvement and out of memory resolution

2016-05-12 Thread AlexModestov
Hello. I'm sorry, but did you find the answer? I have a similar error and I cannot solve it... No one answered me... The Spark driver dies and I get the error "Answer from Java side is empty". I thought it was because I made a mistake in the conf file. I use Sparkling Water 1.6.3, Spark

Submitting Job to YARN-Cluster using Spark Job Server

2016-05-12 Thread ashesh_28
Hi Guys, Have any of you tried this mechanism before? I am able to run it locally and get the output. But how do I submit the job to the YARN cluster using Spark-JobServer? Any documentation? Regards Ashesh -- View this message in context:

parallelism of task executor worker threads during s3 reads

2016-05-12 Thread sanusha
I am using a spark cluster on Amazon (launched using the spark-1.6-prebuilt-with-hadoop-2.6 spark-ec2 script) to run a scala driver application to read S3 object content in parallel. I have tried “s3n://bucket” with sc.textFile, as well as setting up an RDD with the S3 keys and then using the java aws sdk
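
A hedged sketch of the second approach mentioned (an RDD of S3 keys read through the Java AWS SDK); the bucket name, keys and partition count are illustrative, and credentials are assumed to come from the default provider chain:

    import com.amazonaws.services.s3.AmazonS3Client
    import scala.io.Source

    val keys = sc.parallelize(Seq("path/one.csv", "path/two.csv"), 8)
    val contents = keys.mapPartitions { it =>
      val s3 = new AmazonS3Client()  // one client per partition, default credential chain
      it.map { k =>
        val obj = s3.getObject("my-bucket", k)
        try Source.fromInputStream(obj.getObjectContent).mkString
        finally obj.close()
      }
    }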

Re: apache spark on gitter?

2016-05-12 Thread Paweł Szulc
Hi, well I guess the advantage of gitter over a mailing list is the same as with IRC. It's not actually a replacement, because the mailing list is also important. But it is a lot easier to build a community around a tool with an ad-hoc ability to connect with each other. I have gitter running constantly, I

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Rakesh H (Marketing Platform-BLR)
We are on spark 1.5.1 The above change was to add a shutdown hook. I am not adding a shutdown hook in code, so the inbuilt shutdown hook is being called. The driver signals that it is going to do a graceful shutdown, but the executor sees that the driver is dead and shuts down abruptly. Could this issue be related to

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
This is happening because the spark context shuts down without shutting down the ssc first. This was the behavior till spark 1.4 and was addressed in later releases. https://github.com/apache/spark/pull/6307 Which version of spark are you on? Thanks Deepak On Thu, May 12, 2016 at 12:14 PM, Rakesh H

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Rakesh H (Marketing Platform-BLR)
Yes, it seems to be the case. In this case the executors should have continued logging values till 300, but they are shut down as soon as I do "yarn kill ..". On Thu, May 12, 2016 at 12:11 PM Deepak Sharma wrote: > So in your case , the driver is shutting down gracefully ,

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
So in your case, the driver is shutting down gracefully, but the executors are not. Is this the problem? Thanks Deepak On Thu, May 12, 2016 at 11:49 AM, Rakesh H (Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Yes, it is set to true. > Log of driver : > > 16/05/12 10:18:29 ERROR

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Rakesh H (Marketing Platform-BLR)
Yes, it is set to true. Log of driver : 16/05/12 10:18:29 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM 16/05/12 10:18:29 INFO streaming.StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook 16/05/12 10:18:29 INFO scheduler.JobGenerator: Stopping JobGenerator

Re: Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Deepak Sharma
Hi Rakesh Did you try setting *spark.streaming.stopGracefullyOnShutdown to true *for your spark configuration instance? If not, try this and let us know if it helps. Thanks Deepak On Thu, May 12, 2016 at 11:42 AM, Rakesh H (Marketing Platform-BLR) < rakes...@flipkart.com> wrote: > Issue i
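
For completeness, a hedged sketch of wiring the suggested flag into the streaming application; with it set, the built-in shutdown hook invokes stop(stopGracefully = true) when the driver receives SIGTERM:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-app")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // ... define the input streams and processing here ...
    ssc.start()
    ssc.awaitTermination()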

Graceful shutdown of spark streaming on yarn

2016-05-12 Thread Rakesh H (Marketing Platform-BLR)
The issue I am having is similar to the one mentioned here: http://stackoverflow.com/questions/36911442/how-to-stop-gracefully-a-spark-streaming-application-on-yarn I am creating an RDD from the sequence 1 to 300 and creating a streaming RDD out of it. val rdd = ssc.sparkContext.parallelize(1 to 300)