Total Task size exception in Spark 1.6.0 when writing a DataFrame

2016-01-17 Thread Night Wolf
Hi all, I'm doing some simple column transformations (e.g. trimming strings) on a DataFrame using UDFs. The DataFrame is in Avro format and is being loaded off HDFS. The job has about 16,000 parts/tasks. About halfway through, the job fails with a message: org.apache.spark.SparkException: Job
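
For context, a minimal sketch (not taken from the thread) of the kind of trim UDF described above; the data, column names, and app name are hypothetical stand-ins for the Avro-backed DataFrame:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sc = new SparkContext(new SparkConf().setAppName("trim-udf-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical data standing in for the Avro-backed DataFrame described in the message
val df = Seq(("  alice  ", 1), ("bob ", 2)).toDF("name", "id")

// Null-safe trim UDF; the real job would apply this to columns loaded from Avro on HDFS
val trimUdf = udf((s: String) => if (s == null) null else s.trim)
df.withColumn("name", trimUdf(df("name"))).show()
```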

Spark + Sentry + Kerberos don't add up?

2016-01-17 Thread Ruslan Dautkhanov
Getting the following error stack. The Spark session could not be created in the cluster: > at org.apache.hadoop.security.UserGroupInformation.doAs > (UserGroupInformation.java:1671) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:160) > at

RE: SparkContext SyntaxError: invalid syntax

2016-01-17 Thread Felix Cheung
Do you still need help on the PR? btw, does this apply to YARN client mode? From: andrewweiner2...@u.northwestern.edu Date: Sun, 17 Jan 2016 17:00:39 -0600 Subject: Re: SparkContext SyntaxError: invalid syntax To: cutl...@gmail.com CC: user@spark.apache.org Yeah, I do think it would be worth

Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
The re-use of shuffle files is always a nice surprise to me. On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra wrote: > Same SparkContext means same pool of Workers. It's up to the Scheduler, > not the SparkContext, whether the exact same Workers or Executors will be > used

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
-dev What do you mean by JobContext? That is a Hadoop mapreduce concept, not Spark. On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou wrote: > Dear all, > > Is there a way to reuse executor JVM across different JobContexts? Thanks. > > Best Regards, > Jia >

Re: Spark Streaming: BatchDuration and Processing time

2016-01-17 Thread Silvio Fiorito
It will just queue up the subsequent batches; however, if this delay is constant you may start losing batches. It can handle spikes in processing time, but if you know you're consistently running over your batch duration, you either need to increase the duration or look at enabling back pressure
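
As a rough illustration of the two knobs mentioned above (batch duration and back pressure), a minimal sketch for Spark 1.5+; the app name, master, and chosen duration are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-backpressure-sketch")
  .setMaster("local[2]")
  // Available from Spark 1.5: lets the ingestion rate adapt to the observed processing time
  .set("spark.streaming.backpressure.enabled", "true")

// Alternatively (or additionally), give each batch more headroom by increasing the batch duration
val ssc = new StreamingContext(conf, Seconds(2))
```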

Re: SparkContext SyntaxError: invalid syntax

2016-01-17 Thread Andrew Weiner
Yeah, I do think it would be worth explicitly stating this in the docs. I was going to try to edit the docs myself and submit a pull request, but I'm having trouble building the docs from github. If anyone else wants to do this, here is approximately what I would say: (To be added to

Running out of memory locally launching multiple spark jobs using spark yarn / submit from shell script.

2016-01-17 Thread Colin Kincaid Williams
I launch around 30-60 of these jobs defined like start-job.sh in the background from a wrapper script. I wait about 30 seconds between launches, then the wrapper monitors yarn to determine when to launch more. There is a limit defined at around 60 jobs, but even if I set it to 30, I run out of

Re: simultaneous actions

2016-01-17 Thread Matei Zaharia
They'll be able to run concurrently and share workers / data. Take a look at http://spark.apache.org/docs/latest/job-scheduling.html for how scheduling happens across multiple running jobs in the same SparkContext. Matei > On Jan 17,
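
A minimal sketch (not from the thread) of two actions submitted concurrently against one SparkContext, which is the situation the job-scheduling page covers; the input path and app name are hypothetical:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("concurrent-actions-sketch").setMaster("local[*]"))
val rdd = sc.textFile("hdfs:///data/events").cache()  // hypothetical input path

// Two actions submitted from separate threads; both jobs run inside the same SparkContext
val total    = Future { rdd.count() }
val distinct = Future { rdd.distinct().count() }

println(Await.result(total, Duration.Inf))
println(Await.result(distinct, Duration.Inf))
```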

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
Yes, that is one of the basic reasons to use a jobserver/shared-SparkContext. Otherwise, in order to share the data in an RDD you have to use an external storage system, such as a distributed filesystem or Tachyon. On Sun, Jan 17, 2016 at 1:52 PM, Jia wrote: > Thanks,

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
Hi, Mark, sorry for the confusion. Let me clarify: when an application is submitted, the master will tell each Spark worker to spawn an executor JVM process. All the task sets of the application will be executed by the executor. After the application runs to completion, the executor process

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
There is a 1-to-1 relationship between Spark Applications and SparkContexts -- fundamentally, a Spark Application is a program that creates and uses a SparkContext, and that SparkContext is destroyed when the Application ends. A jobserver generically and the Spark JobServer specifically is an

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
Thanks, Mark. Then, I guess JobServer can fundamentally solve my problem, so that jobs can be submitted at different times and still share RDDs. Best Regards, Jia On Jan 17, 2016, at 3:44 PM, Mark Hamstra wrote: > There is a 1-to-1 relationship between Spark

Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
Same SparkContext means same pool of Workers. It's up to the Scheduler, not the SparkContext, whether the exact same Workers or Executors will be used to calculate simultaneous actions against the same RDD. It is likely that many of the same Workers and Executors will be used as the Scheduler

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
Hi, Mark, sorry, I mean SparkContext. I want to change Spark so that all submitted jobs (SparkContexts) run in one executor JVM. Best Regards, Jia On Sun, Jan 17, 2016 at 2:21 PM, Mark Hamstra wrote: > -dev > > What do you mean by JobContext? That is a Hadoop

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Mark Hamstra
You've still got me confused. The SparkContext exists at the Driver, not on an Executor. Many Jobs can be run by a SparkContext -- it is a common pattern to use something like the Spark Jobserver where all Jobs are run through a shared SparkContext. On Sun, Jan 17, 2016 at 12:57 PM, Jia Zou

Monitoring Spark with Ganglia on ElCapo

2016-01-17 Thread william tellme
Does anyone have a link handy that describes configuring Ganglia on the Mac?

Re: Sending large objects to specific RDDs

2016-01-17 Thread Daniel Imberman
This is perfect. So I guess my best course of action will be to create a custom partitioner to assure that the smallest amount of data is shuffled when I join the partitions, and then I really only need to do a map (rather than a mapPartitions) since the inverted index object will be pointed to
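
A sketch, under assumed types, of the kind of custom partitioner being discussed: both sides of the join share one hypothetical DomainPartitioner instance, so the join itself does not trigger a wide shuffle. The RDD contents and key type are illustrative only:

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: records with the same (illustrative) integer key land in the same partition
class DomainPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case id: Int => math.abs(id) % numPartitions
    case _       => 0
  }
}

val sc = new SparkContext(new SparkConf().setAppName("copartition-sketch").setMaster("local[*]"))
val part  = new DomainPartitioner(4)
val docs  = sc.parallelize(Seq((1, "doc-a"), (2, "doc-b"))).partitionBy(part)
val index = sc.parallelize(Seq((1, "idx-a"), (2, "idx-b"))).partitionBy(part)

// Both sides use the same partitioner instance, so the join does not reshuffle either of them
docs.join(index).collect().foreach(println)
```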

Re: SQL UDF problem (with re to types)

2016-01-17 Thread Ted Yu
While reading a book on Java 8, I saw a reference to the following w.r.t. declaration-site variance: https://bugs.openjdk.java.net/browse/JDK-8043488 The above reportedly targets Java 9. FYI On Thu, Jan 14, 2016 at 12:33 PM, Michael Armbrust wrote: > I don't

Spark Streaming: Does mapWithState implicitly partition the dstream?

2016-01-17 Thread Lin Zhao
When the state is passed to the task that handles a mapWithState for a particular key, if the key is distributed, it seems extremely difficult to coordinate and synchronise the state. Is there a partition by key before a mapWithState? If not what exactly is the execution model? Thanks, Lin
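
For reference, a minimal mapWithState sketch against the Spark 1.6 API; the socket source, checkpoint directory, and state type are hypothetical, and the snippet only shows where the key/state pairing enters the picture:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/mapwithstate-sketch")  // hypothetical checkpoint dir; mapWithState requires checkpointing

// Hypothetical source: words arriving on a local socket, keyed for stateful counting
val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map(word => (word, 1))

// Running count per key; each invocation sees one key, its new values, and that key's state
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}
pairs.mapWithState(StateSpec.function(mappingFunc)).print()

ssc.start()
ssc.awaitTermination()
```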

Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
It can be far more than that (e.g. https://issues.apache.org/jira/browse/SPARK-11838), and is generally either unrecognized or a greatly under-appreciated and underused feature of Spark. On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers wrote: > the re-use of shuffle files is

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia
I guess all jobs submitted through JobServer are executed in the same JVM, so RDDs cached by one job can be visible to all other jobs executed later. On Jan 17, 2016, at 3:56 PM, Mark Hamstra wrote: > Yes, that is one of the basic reasons to use a

Re: [Spark 1.6][Streaming] About the behavior of mapWithState

2016-01-17 Thread Terry Hoo
Hi Ryan, Thanks for your comments! Using reduceByKey() before the mapWithState can get the expected result. Have we ever considered having mapWithState output the changed keys only once in every batch interval, just like updateStateByKey? For some cases, the user may only care about the final

Re: PCA OutOfMemoryError

2016-01-17 Thread Bharath Ravi Kumar
Hello Alex, Thanks for the response. There isn't much other data on the driver, so the issue is probably inherent to this particular PCA implementation. I'll try the alternative approach that you suggested instead. Thanks again. -Bharath On Wed, Jan 13, 2016 at 11:24 PM, Alex Gittens

Re: has any one implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
Hi Andy, Actually, the output of the ML IDF model is the TF-IDF vector of each instance rather than the IDF vector. So it's unnecessary to do member-wise multiplication to calculate the TF-IDF value. You can refer to the code here:
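
A small sketch of the usual ML TF-IDF pipeline this refers to, assuming an existing sqlContext (e.g. from spark-shell) and illustrative column names; the resulting "features" column already holds the per-document TF-IDF vectors:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Hypothetical documents standing in for the real input
val docs = sqlContext.createDataFrame(Seq(
  (0L, "spark streaming and sql"),
  (1L, "spark ml tf idf example")
)).toDF("id", "text")

val words    = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)
val tf       = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(words)
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)

// "features" already holds the per-document TF-IDF vector -- no extra multiplication needed
idfModel.transform(tf).select("id", "features").show()
```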

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-17 Thread Yanbo Liang
Hi Robin, #1 This feature is available from Spark 1.5.0. #2 You should use the new ML package rather than the old MLlib package to train the Random Forest model and get featureImportances, because it is only exposed in the ML package. You can refer to the documentation:
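
A minimal sketch of reading featureImportances from the ML (DataFrame-based) random forest; the tiny training set, column names, and sqlContext are hypothetical placeholders for a real feature pipeline:

```scala
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.mllib.linalg.Vectors

// Tiny illustrative training set; assumes an existing sqlContext (e.g. from spark-shell)
val training = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (1.0, Vectors.dense(0.1, 1.2, 0.3))
)).toDF("label", "features")

val model = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(20)
  .fit(training)

// One importance score per feature, exposed on the ML model but not the old MLlib one
println(model.featureImportances)
```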

Re: simultaneous actions

2016-01-17 Thread Mennour Rostom
Hi, Thank you all for your answers. If I understand correctly, actions (in my case foreach) can be run concurrently and simultaneously on the SAME RDD (which is logical because they are read-only objects). However, I want to know if the same workers are used for the concurrent analysis? Thank

Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-17 Thread Jacek Laskowski
Hi, I'm trying to understand how Scheduling Delays are displayed in Streaming page in web UI and think the values are displayed incorrectly in the Timelines column. I'm only concerned with the scheduling delays (on y axis) per batch times (x axis). It appears that the values (on y axis) are

Re: Converting CSV files to Avro

2016-01-17 Thread Igor Berman
https://github.com/databricks/spark-avro ? On 17 January 2016 at 13:46, Gideon wrote: > Hi everyone, > > I'm writing a Scala program which uses Spark CSV > to read CSV files from a > directory. After reading the CSVs as
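
A sketch of the CSV-to-Avro round trip using the spark-csv and spark-avro packages the thread points to; the package coordinates, paths, and options are illustrative assumptions:

```scala
// Assumes the spark-csv and spark-avro packages are on the classpath, e.g.
//   --packages com.databricks:spark-csv_2.10:1.3.0,com.databricks:spark-avro_2.10:2.0.1
// and an existing sqlContext; the paths are hypothetical
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///input/csv/")

df.write
  .format("com.databricks.spark.avro")
  .save("hdfs:///output/avro/")
```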

Converting CSV files to Avro

2016-01-17 Thread Gideon
Hi everyone, I'm writing a Scala program which uses Spark CSV to read CSV files from a directory. After reading the CSVs as data frames I need to convert them to Avro format since I need to eventually convert that data to a GenericRecord

Re: How to tune my Spark application.

2016-01-17 Thread Ted Yu
In sampleArray(), there is a loop: for (i <- 0 until ARRAY_SAMPLE_SIZE) { ARRAY_SAMPLE_SIZE is a constant (100). It's not clear how the amount of computation in sampleArray() can be reduced. Which Spark release are you using? Thanks On Sun, Jan 17, 2016 at 6:22 AM, 张峻

Re: How to tune my Spark application.

2016-01-17 Thread 张峻
Dear Ted, My Spark release is 1.5.2. BR, Julian Zhang > On Jan 17, 2016, at 23:10, Ted Yu wrote: > > In sampleArray(), there is a loop: > for (i <- 0 until ARRAY_SAMPLE_SIZE) { > > ARRAY_SAMPLE_SIZE is a constant (100). > > Not clear how the amount of computation in

Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
Dear all, Is there a way to reuse executor JVM across different JobContexts? Thanks. Best Regards, Jia

How to tune my Spark application.

2016-01-17 Thread 张峻
Dear All, I used JProfiler to profile my Spark application and found that more than 70% of CPU time is used by the org.apache.spark.util.SizeEstimator class. The call tree is as below. java.lang.Thread.run --scala.collection.immutable.Range.foreach$mVc$sp

Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
Same RDD means same SparkContext, which means the same workers. Cache/persist the RDD to avoid repeated jobs. On Jan 17, 2016 5:21 AM, "Mennour Rostom" wrote: > Hi, > > Thank you all for your answers, > > If I correctly understand, actions (in my case foreach) can be run > concurrently
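
A tiny sketch of the cache/persist point: without the persist, each action below would re-read the (hypothetical) input from scratch. It assumes an existing sc, e.g. from spark-shell:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/events").persist(StorageLevel.MEMORY_AND_DISK)  // hypothetical path

println(rdd.count())                     // first action materializes and caches the partitions
println(rdd.filter(_.nonEmpty).count())  // second action reuses the cached data instead of re-reading HDFS
```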

Re: How to tune my Spark application.

2016-01-17 Thread Ted Yu
For 'List.foreach', it is likely for the pointerFields shown below: private class ClassInfo( val shellSize: Long, val pointerFields: List[Field]) {} FYI On Sun, Jan 17, 2016 at 7:15 AM, 张峻 wrote: > Dear Ted > > My Spark release is 1.5.2 > > BR > > Julian Zhang >

Spark Streaming: BatchDuration and Processing time

2016-01-17 Thread pyspark2555
Hi, If BatchDuration is set to 1 second in StreamingContext and the actual processing time is longer than one second, then how does Spark handle that? For example, I am receiving a continuous input stream. Every 1 second (batch duration), the RDDs will be processed. What if this processing time