Is there a way to change the default storage level?
If not, how can I properly change the storage level wherever necessary, if
my input and intermediate results do not fit into memory?
In this example:
context.wholeTextFiles(...)
.flatMap(s => ...)
.flatMap(s => ...)
Does persist()
On Thu, Jul 9, 2015 at 12:32 PM, besil sbernardine...@beintoo.com wrote:
Hi,
We are experiencing scheduling errors due to a Mesos slave failing.
It seems to be an open bug, more information can be found here.
https://issues.apache.org/jira/browse/SPARK-3289
According to this link
I am running a Spark cluster over SSH in standalone mode,
and I have run the PageRank LiveJournal example:
MASTER=spark://172.17.27.12:7077 bin/run-example graphx.SynthBenchmark
-app=pagerank -niters=100 -nverts=4847571 Output/soc-liveJounral.txt
It's been running for more than 2 hours; I guess this is
I have a Spark cluster with 1 master and 9 nodes. I am running in standalone mode.
I do not have access to a web browser from any of the nodes in the cluster
(I am connecting to the nodes through SSH -- it is a Grid5000 cluster). I was
wondering, is there any possibility to access the Spark Web UI in this
Hi folks, I just rewrote a query from using UNION ALL to using WITH ROLLUP,
and I'm seeing some unexpected behavior. I'll open a JIRA if needed but
wanted to check if this is user error. Here is my code:
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i,
MEMORY_AND_DISK will use disk if there is not enough memory. If there is not
enough memory when putting a MEMORY_AND_DISK block, BlockManager will store
it to disk. And if a MEMORY_AND_DISK block is dropped from memory,
MemoryStore will call BlockManager.dropFromMemory to store it to disk, see
Hi,
Thank you for confirming my doubts and for the link.
Yes, we actually run in fine-grained mode because we would like to
dynamically resize our cluster as needed (thank you for the dynamic
allocation link).
However, we tried coarse-grained mode and Mesos seems not to relaunch the
failed task.
Spark won't store RDDs to memory unless you use a memory StorageLevel. By
default, your input and intermediate results won't be put into memory. You
can call persist if you want to avoid duplicate computation or reading.
E.g.,
val r1 = context.wholeTextFiles(...)
val r2 = r1.flatMap(s => ...)
val
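A minimal sketch of the pattern being described, assuming Spark's Scala API with an existing SparkContext named `context`; the path and transformations are placeholders:

```
import org.apache.spark.storage.StorageLevel

// Assumes `context` is an existing SparkContext; path and transforms are placeholders.
val r1 = context.wholeTextFiles("hdfs:///input/")
val r2 = r1.flatMap { case (_, text) => text.split("\\s+") }

// MEMORY_AND_DISK spills partitions to disk instead of recomputing them
// when memory runs short.
r2.persist(StorageLevel.MEMORY_AND_DISK)

r2.count()            // first action materializes and stores r2
r2.distinct().count() // later actions reuse the stored data
```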
Thanks Cody, so does it mean that if the old Kafka consumer 0.8.1.1 works with
Kafka cluster version 0.8.2, then Spark Streaming 1.3 should also work?
I have tested the standalone Kafka consumer 0.8.0 with Kafka cluster
0.8.2 and that works.
On Thu, Jul 9, 2015 at 9:58 PM, Cody Koeninger
Spark Version: 1.3.1
Cluster: Mesos 0.22.0
Scala Version: 2.10.4
I am seeing work done on my cluster when invoking cache on an RDD. I would
have expected the last line of the code below to not invoke any cluster
work. Is there some condition where cache will do cluster work?
val sqlContext =
Can you please post the result of show()?
On 10 Jul 2015 01:00, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Hi folks, I just rewrote a query from using UNION ALL to using WITH
ROLLUP, and I'm seeing some unexpected behavior. I'll open a JIRA if needed
but wanted to check if this is user error. Here
Hi Ashish,
Your 00-pyspark-setup file looks very different from mine (and from the one
described in the blog post). Questions:
1) Do you have SPARK_HOME set up in your environment? Because if not, it
sets it to None in your code. You should provide the path to your Spark
installation. In my case
+---+---+---+
|cnt|_c1|grp|
+---+---+---+
| 1| 31| 0|
| 1| 31| 1|
| 1| 4| 0|
| 1| 4| 1|
| 1| 42| 0|
| 1| 42| 1|
| 1| 15| 0|
| 1| 15| 1|
| 1| 26| 0|
| 1| 26| 1|
| 1| 37| 0|
| 1| 10| 0|
| 1| 37| 1|
| 1| 10| 1|
| 1| 48| 0|
| 1| 21| 0|
| 1| 48| 1|
| 1| 21| 1|
|
Yes, it should work, let us know if not.
On Thu, Jul 9, 2015 at 11:34 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Thanks Cody, so does it mean that if the old Kafka consumer 0.8.1.1 works with
Kafka cluster version 0.8.2, then Spark Streaming 1.3 should also work?
I have tested the standalone
Hi,
I am not a Spark expert, but I found that passing a small partitions value
might help. Try using the option --numEPart=$partitions, where
partitions=3 (the number of workers) or at most 3*40 (the total number of cores).
Thanks,
-Khaled
On Thu, Jul 9, 2015 at 11:37 AM, AshutoshRaghuvanshi
Does Spark Streaming 1.3 require Kafka version 0.8.1.1, and is it not
compatible with Kafka 0.8.2?
As per the Maven dependency of Spark Streaming 1.3 with Kafka:

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.3.0</version>
It's the consumer version. Should work with 0.8.2 clusters.
On Thu, Jul 9, 2015 at 11:10 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Does Spark Streaming 1.3 require Kafka version 0.8.1.1, and is it not
compatible with Kafka 0.8.2?
As per the Maven dependency of Spark Streaming 1.3 with
I have about 20 environment variables to pass to my Spark workers. Even
though they're in the init scripts on the Linux box, the workers don't see
these variables.
Does Spark do something to shield itself from what may be defined in the
environment?
I see multiple pieces of info on how to pass
Is it possible to ignore features in MLlib? In other words, I would like to
have some 'pass-through' data, Strings for example, attached to training
examples and test data.
A related stackoverflow question:
Hi,
In the Spark UI, under “Show additional metrics”, there are two extra
metrics you can show:
1. Scheduler delay
2. Getting result time
When hovering over “Scheduler Delay” it says (among other things):
…time to send task result from executor…
When hovering over “Getting result time”:
Time that the
Is the read / aggregate performance better when caching Spark SQL tables
on-heap with sqlContext.cacheTable() or off-heap by saving them to Tachyon?
Has anybody tested this? Any theories?
1. There will be a long-running job with the description start(), as that is
the job that runs the receivers. It will never end.
2. You need to set the number of cores given to the Spark executors by the
YARN container. That is spark.executor.cores in SparkConf, or --executor-cores
in spark-submit (a sketch follows below).
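A hedged sketch of the equivalent programmatic setting (the core count and app name are arbitrary examples):

```
import org.apache.spark.SparkConf

// Equivalent to passing --executor-cores 4 to spark-submit; the value is illustrative.
val conf = new SparkConf()
  .setAppName("my-streaming-app")
  .set("spark.executor.cores", "4")
```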
If you use the Pipelines API with DataFrames, you select which columns you
would like to train on using the VectorAssembler. While using the
VectorAssembler, you can choose not to select some features if you like.
Best,
Burak
On Thu, Jul 9, 2015 at 10:38 AM, Arun Luthra arun.lut...@gmail.com
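A minimal sketch of that suggestion, assuming the Spark 1.4 ML Pipelines API; the DataFrame `df` and its column names are hypothetical:

```
import org.apache.spark.ml.feature.VectorAssembler

// `df` is a hypothetical DataFrame with numeric columns f1, f2, f3 and a String id column.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2")) // f3 and the String column are simply not selected
  .setOutputCol("features")

// Columns not listed in inputCols are carried through the output untouched.
val withFeatures = assembler.transform(df)
```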
Hi,
I checked the logs and it looks like YARN is trying to communicate with my
test server through the local IP (the Spark cluster and my test server are
in different VPCs in Amazon EC2), and that's why YARN can't respond.
I tried the same script in yarn-cluster mode and it runs correctly that
way.
No plans to change that at the moment, but agreed it is against accepted
convention. It would be a lot of work to change the tool, change the AMIs,
and test everything. My suggestion is not to hold your breath for such a
change.
spark-ec2, as far as I understand, is not intended for spinning up
Hi Guys,
Can anyone please share how to use the caching feature of Spark via Spark
SQL queries?
-Vinod
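A hedged sketch of the usual Spark 1.x options (the table name `my_table` is hypothetical):

```
// Via the SQLContext API:
sqlContext.cacheTable("my_table")

// Or via SQL itself:
sqlContext.sql("CACHE TABLE my_table")

// And to release it later:
sqlContext.uncacheTable("my_table")
```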
Hi Everyone,
Is there any document/material which compares Spark with SQL Server?
If so, please share the details.
Thanks,
Vinod
Hey,
I was wondering if it is possible to tune the number of jobs generated by
Spark SQL. Currently my query generates over 80 runJob at SparkPlan.scala:122
jobs, every one of them gets executed in ~4 sec and contains only 5 tasks.
As a result of this, most of my cores do nothing.
Hi everyone,
Queries take more time when I execute them using (group by +
order by) or (group by + cast + order by) in the same query. Kindly refer to the
following query.
Could you please provide any solution regarding this performance issue?
SELECT
Never mind, I’ve created the JIRA issue at
https://issues.apache.org/jira/browse/SPARK-8972.
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Friday, July 10, 2015 9:15 AM
To: yana.kadiy...@gmail.com; ayan guha
Cc: user
Subject: RE: [SparkSQL] Incorrect ROLLUP results
Yes, this is a bug, do
Thanks Todd, this was helpful! I also got some help from the other forum,
and for those that might run into this problem in the future, the solution
that worked for me was:
foreachRDD { r => r.map(x => data(x.getString(0),
x.getInt(1))).saveToCassandra("demo", "sqltest") }
On Thu, Jul 9, 2015 at 4:37
Thanks Andrew.
I tried with your suggestions and (2) works for me. (1) still doesn't work.
Chen
On Thu, Jul 9, 2015 at 4:58 PM, Andrew Or and...@databricks.com wrote:
Hi Chen,
I believe the issue is that `object foo` is a member of `object testing`,
so the only way to access `object foo`
Thanks Matei! It worked.
On 9 July 2015 at 19:43, Matei Zaharia matei.zaha...@gmail.com wrote:
This means that one of your cached RDD partitions is bigger than 2 GB of
data. You can fix it by having more partitions. If you read data from a
file system like HDFS or S3, set the number of
For records below 50,000, SQL is better, right?
On Fri, Jul 10, 2015 at 12:18 AM, ayan guha guha.a...@gmail.com wrote:
With your load, either should be fine.
I would suggest you run a couple of quick prototypes.
Best
Ayan
On Fri, Jul 10, 2015 at 2:06 PM, vinod kumar
Yes, this is a bug. Do you mind creating a JIRA issue for this? I will fix
this ASAP.
BTW, what’s your Spark version?
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Friday, July 10, 2015 12:16 AM
To: ayan guha
Cc: user
Subject: Re: [SparkSQL] Incorrect ROLLUP results
Ayan,
I would want to process data of roughly 5 records up to 2L records (in flat
form).
Is there any scaling guideline to decide which technology is best, SQL or
Spark?
On Thu, Jul 9, 2015 at 9:40 AM, ayan guha guha.a...@gmail.com wrote:
It depends on workload. How much data
With your load, either should be fine.
I would suggest you run a couple of quick prototypes.
Best
Ayan
On Fri, Jul 10, 2015 at 2:06 PM, vinod kumar vinodsachin...@gmail.com
wrote:
Ayan,
I would want to process data of roughly 5 records up to 2L records (in flat
form).
Is there
Hi,
I am trying to set the Hive metadata destination to a MySQL database in the
hive context. It works fine in Spark 1.3.1, but it seems broken in Spark
1.4.1-rc1, where it always connects to the default metastore (local). Is this
a regression, or must we set the connection in hive-site.xml?
The code
I am trying to normalize a dataset (convert values for all attributes in the
vector to the 0-1 range). I created an RDD of (attrib-name,
attrib-value) tuples for all the records in the dataset as follows:
val attribMap : RDD[(String,DoubleDimension)] = contactDataset.flatMap(
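The snippet above is cut off; as a hedged sketch of one way the idea can be completed, here is per-attribute min-max scaling over such pairs, with plain Double standing in for DoubleDimension:

```
// attribMap: RDD[(String, Double)] of (attrib-name, attrib-value) pairs.
val minMax = attribMap
  .mapValues(v => (v, v))
  .reduceByKey { case ((lo1, hi1), (lo2, hi2)) =>
    (math.min(lo1, lo2), math.max(hi1, hi2))
  }
  .collectAsMap() // assumes the number of distinct attributes is small

val normalized = attribMap.map { case (name, v) =>
  val (lo, hi) = minMax(name)
  (name, if (hi == lo) 0.0 else (v - lo) / (hi - lo))
}
```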
Hi Ashish,
Julian's approach is probably better, but a few observations:
1) Your SPARK_HOME should be C:\spark-1.3.0 (not C:\spark-1.3.0\bin).
2) If you have Anaconda Python installed (I saw that you had set this up in
a separate thread), py4j should be part of the package - at least I think
so.
Hello Akhil,
Thanks for the response. I will have to figure this out.
Sincerely,
Ashish
On Thu, Jul 9, 2015 at 3:40 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt ashish.du...@gmail.com
wrote:
Hi,
We have a cluster with 4 nodes. The cluster
If you are continuously unioning RDDs, then you are accumulating
ever-increasing data, and you are processing an ever-increasing amount of data in
every batch. Obviously this is not going to last for very long. You
fundamentally cannot keep processing an ever-increasing amount of data with
finite
Query 1) What Spark runs are tasks in task slots; whatever the mapping of
tasks to physical cores is, it does not matter. If there are two task slots (2
threads in local mode, or an executor with 2 task slots in distributed
mode), it can only run two tasks concurrently. That is true even if the
task
Do you have enough cores in the configured number of executors in YARN?
On Thu, Jul 9, 2015 at 2:29 AM, Bin Wang wbi...@gmail.com wrote:
I'm using Spark Streaming with Kafka and submit it to the YARN cluster in
yarn-cluster mode. But it hangs at SparkContext.start(). The Kafka config
is
The data coming from the dstream have the same keys that are in myRDD, so
the reduceByKey
after the union keeps the overall tuple count in myRDD fixed. Even with a
fixed tuple count, will it keep consuming more resources?
On 9 July 2015 at 16:19, Tathagata Das t...@databricks.com wrote:
If you are
Hello all,
I'm trying to configure Flume to push data into a sink so that my
streaming job can pick up the data. My events are in JSON format, but the
Spark + Flume integration [1] document only refers to an Avro sink.
[1] https://spark.apache.org/docs/latest/streaming-flume-integration.html
I
Hi,
I would just like to clarify a few doubts I have about how the BlockManager
behaves. This is mostly in regard to the Spark Streaming context.
There are two possible cases where blocks may get dropped / not stored in memory:
Case 1. While writing the block with MEMORY_ONLY settings, if the node's
BlockManager does
Are there any significant performance differences between reading text
files from S3 and hdfs?
Hi,
Spark EC2 scripts are useful, but they install everything as root.
AFAIK, it's not a good practice ;-)
Why is it so?
Should these scripts be reserved for test/demo purposes, and not be used for
a production system?
Is it planned in some roadmap to improve that, or to replace the ec2-scripts
Looks like a configuration problem with your spark setup, are you running
the driver on a different network? Can you try a simple program from
spark-shell and make sure your setup is proper? (like sc.parallelize(1 to
1000).collect())
Thanks
Best Regards
On Thu, Jul 9, 2015 at 1:02 AM, ÐΞ€ρ@Ҝ
Hi,
I've developed a POC Spark Streaming application.
But it seems to perform better on my development machine than on our cluster.
I submit it to YARN on our Cloudera cluster.
But my first question is more detailed:
In the application UI (:4040) I see in the streaming section that the batch
On Wed, Jul 8, 2015 at 7:31 PM, Ashish Dutt ashish.du...@gmail.com wrote:
Hi,
We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two
days I have been trying to connect my laptop to the server using spark
master ip:port but it has been unsuccessful.
The server contains data
Did you try sc.stop() and creating a new one?
Thanks
Best Regards
On Wed, Jul 8, 2015 at 8:12 PM, Terry Hole hujie.ea...@gmail.com wrote:
I am using spark 1.4.1rc1 with default hive settings
Thanks
- Terry
Hi All,
I'd like to use the hive context in the spark shell. I need to recreate the
Yes, just to add, see the following scenario of RDD lineage:
RDD1 -> RDD2 -> RDD3 -> RDD4
Here RDD2 depends on RDD1's output, and the lineage goes till RDD4.
Now, if for some reason RDD3 is lost, Spark will recompute it from RDD2.
Thanks
Best Regards
On Thu, Jul 9, 2015 at 5:51 AM, canan
Ok, got it, thanks.
On Thu, Jul 9, 2015 at 12:02 PM, prosp4300 prosp4...@163.com wrote:
Seems what Feynman mentioned is the source code instead of documentation,
vectorMean is private, see
Hi Akhil,
I have tried it; it does not work.
This may be related to the newly added isolated classloader in the Spark hive
context. The error call stack is:
java.sql.SQLException: Failed to start database 'metastore_db' with class
loader
What was the number of cores in the executor? It could be that you had
only one core in the executor, which did all the 50 tasks serially, so 50
tasks x 15 ms = ~1 second.
Could you take a look at the task details in the stage page to see when the
tasks were added, and whether that explains the 5
Latency is much bigger for S3 (if that matters),
and with HDFS you'd get data locality that will boost your app performance.
I did some light experimenting on this.
See my presentation here for some benchmark numbers, etc.
http://www.slideshare.net/sujee/hadoop-to-sparkv2
from slide# 34
cheers
Hi All:
We already know that Spark utilizes the lineage to recompute RDDs when a
failure occurs.
I want to study the performance of this fault-tolerant approach and have
some questions about it.
1) Is there any benchmark (or standard failure model) to test the fault
tolerance of these kinds of
Hi Everyone,
I am new to Spark.
I am using SQL in my application to handle data, and I am thinking of moving
to Spark now.
Is the data processing speed of Spark better than SQL Server?
Thanks,
Vinod
I suppose every RDBMS has a JDBC driver to connect to. I know Oracle, MySQL,
SQL Server, Teradata, and Netezza have one.
On Thu, Jul 9, 2015 at 10:09 PM, Niranda Perera niranda.per...@gmail.com
wrote:
Hi,
I'm planning to use Spark SQL JDBC datasource provider in various RDBMS
databases.
what are the
I'm using Spark Streaming with Kafka and submit it to the YARN cluster in
yarn-cluster mode. But it hangs at SparkContext.start(). The Kafka config
is right since it can show some events in the Streaming tab of the web UI.
The attached file is the screenshot of the Jobs tab of the web UI. The code
in the
Hi,
I have an application in which an RDD is being updated with tuples coming
from RDDs in a DStream with the following pattern:
dstream.foreachRDD(rdd => {
  myRDD = myRDD.union(rdd.filter(myfilter)).reduceByKey(_+_)
})
I'm using cache() and checkpointing to cache results. Over time, the
lineage
Hi, Ben
1) I guess this may be a JDK version mismatch. Could you check the JDK
version?
2) I believe this is a bug in SparkR. I will file a JIRA issue for it.
From: Ben Spark [mailto:ben_spar...@yahoo.com.au]
Sent: Thursday, July 9, 2015 12:14 PM
To: user
Subject: SparkR dataFrame
Hi,
I was just wondering how you generated the second image with the charts.
What product?
From: Anand Nalya [mailto:anand.na...@gmail.com]
Sent: donderdag 9 juli 2015 11:48
To: spark users
Subject: Breaking lineage and reducing stages in Spark Streaming
Hi,
I've an application in which an rdd
Hi,
We are experiencing scheduling errors due to a Mesos slave failing.
It seems to be an open bug, more information can be found here.
https://issues.apache.org/jira/browse/SPARK-3289
According to this link
That's from the Streaming tab of the Spark 1.4 web UI.
On 9 July 2015 at 15:35, Michel Hubert mich...@vsnsystemen.nl wrote:
Hi,
I was just wondering how you generated the second image with the charts.
What product?
*From:* Anand Nalya [mailto:anand.na...@gmail.com]
*Sent:* donderdag 9
Is myRDD outside a DStream? If so, are you persisting on each batch
iteration? It should be checkpointed frequently too.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler
Yes, myRDD is outside of the DStream. Following is the actual code, where newBase
and current are the RDDs being updated with each batch:
val base = sc.textFile...
var newBase = base.cache()
val dstream: DStream[String] = ssc.textFileStream...
var current: RDD[(String, Long)] =
That’s correct. We were setting up a Spark EC2 cluster from the command line,
then installing RStudio Server, logging into that through the web interface and
attempting to initialize the cluster within RStudio.
We have made some progress on this outside of the thread - I will see what I
can
I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell and try reading
files from an S3 bucket and from hdfs://master IP:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
I just tested our application this
I think you're complicating the cache behavior by aggressively re-using
vars when temporary vals would be more straightforward. For example,
newBase = newBase.unpersist()... effectively means that newBase's data is
not actually cached when the subsequent .union(...) is performed, so it
probably
Not really a clean solution, but I solved the problem by reinstalling Anaconda.
It depends on workload. How much data would you want to process?
On 9 Jul 2015 22:28, vinod kumar vinodsachin...@gmail.com wrote:
Hi Everyone,
I am new to Spark.
I am using SQL in my application to handle data, and I am thinking of moving
to Spark now.
Is the data processing speed
Hello,
I have a 4-node Spark cluster running on EC2 and it's running out of disk
space. I'm running Spark 1.3.1.
I have mounted a second SSD disk in every instance on /tmp/spark and set
SPARK_LOCAL_DIRS and SPARK_WORKER_DIRS pointing to this folder:
set | grep SPARK
Hi to all,
Is there any way to run pyspark scripts in yarn-cluster mode without using
the spark-submit script? I need it this way because I will integrate this
code into a Django web app.
When I try to run any script in yarn-cluster mode I get the following error
:
Hello All,
I also posted this on the Spark/Datastax thread, but thought it was also
50% a Spark question (or mostly a Spark question).
I was wondering what is the best practice for saving streaming Spark SQL (
Hi Chen,
I believe the issue is that `object foo` is a member of `object testing`,
so the only way to access `object foo` is to first pull `object testing`
into the closure, then access a pointer to get to `object foo`. There are
two workarounds that I'm aware of:
(1) Move `object foo` outside
Take a look at the examples here:
https://spark.apache.org/docs/latest/ml-guide.html
Mohammed
From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Saturday, July 4, 2015 10:49 PM
To: ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark
I have one document
For example, in the simplest word count example, I want to update the count in
memory and always have the same word getting updated by the same task - not
use any distributed memstore.
I know that updateStateByKey should guarantee that, but how do you approach
this problem outside of spark
Thanks Erik. I saw the document too. That is why I am confused: as per
the article, it should be good as long as *foo* is serializable.
However, what I have seen is that it works if *testing* is
serializable, even if foo is not serializable, as shown below. I don't know if
there is
Reposting the code example:

object testing extends Serializable {
  object foo {
    val v = 42
  }
  val list = List(1, 2, 3)
  val rdd = sc.parallelize(list)
  def func = {
    val after = rdd.foreachPartition {
      it => println(foo.v)
    }
  }
}
On Thu, Jul 9, 2015 at 4:09
Reading that article and applying it to your observations of what happens
at runtime:
shouldn't the closure require serializing testing? The foo singleton object
is a member of testing, and then you call this foo value in the closure
func and further in the foreachPartition closure. So following
Dear list,
I have some questions regarding collaborative filtering. Although they are not
specific to Spark, I hope the folks in this community might be able to help me
somehow.
We are looking for a simple way to recommend users to other users, i.e.,
how to recommend new friends.
Do you
Summarizing the main problems discussed by Dean:
1. If you have an infinitely growing lineage, bad things will eventually
happen. You HAVE TO periodically (say, every 10th batch) checkpoint the
information.
2. Unpersist the previous `current` RDD ONLY AFTER running an action on the
`newCurrent`. (See the sketch below.)
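A minimal sketch combining both points; `sc`, `dstream`, and `myfilter` follow the thread's code, the ten-batch interval is arbitrary, and a checkpoint directory is assumed to be set:

```
import org.apache.spark.rdd.RDD

// Assumes sc, dstream, and myfilter from the thread's code, and that
// sc.setCheckpointDir(...) has been called.
var current: RDD[(String, Long)] = sc.emptyRDD[(String, Long)]
var batch = 0L

dstream.foreachRDD { rdd =>
  val newCurrent = current
    .union(rdd.filter(myfilter).map(w => (w, 1L)))
    .reduceByKey(_ + _)
  newCurrent.cache()
  batch += 1
  if (batch % 10 == 0) newCurrent.checkpoint() // point 1: periodically cut the lineage
  newCurrent.count()                           // run an action first...
  current.unpersist()                          // ...point 2: only then drop the old RDD
  current = newCurrent
}
```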
I am not sure if this is more of a question for Spark or just Scala, but I am
posting my question here.
The code snippet below shows an example of passing a reference to a closure
in the rdd.foreachPartition method.
```
object testing {
  object foo extends Serializable {
    val v = 42
  }
I think you have stumbled across this idiosyncrasy:
http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/
- Original Message -
I am not sure if this is more of a question for Spark or just Scala, but I am
posting my question here.
The code
Hi,
I'm running Spark 1.4 on Azure.
DataFrame's insertInto fails, but saveAsTable works.
It seems like some issue with accessing Azure's blob storage, but that
doesn't explain why one type of write works and the other doesn't.
This is the stack trace:
Caused by:
Spark version 1.4.0 in the Standalone mode
2015-07-09 20:12:02 INFO (sparkDriver-akka.actor.default-dispatcher-3)
BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8
GB)
2015-07-09 20:12:02 ERROR (Executor task launch worker-0) Executor:96 -
Exception in task 0.0 in stage
I am not sure why you are getting NODE_LOCAL and not PROCESS_LOCAL. Also,
there is probably no good documentation other than the configuration
page - http://spark.apache.org/docs/latest/configuration.html (search for
locality)
On Thu, Jul 9, 2015 at 5:51 AM, Michel Hubert
Hi
I am new to Spark. I am confused about the relationship between threads and
physical cores.
As per my understanding, the number of tasks created corresponds to the
number of partitions in the data set. For example, if I have a machine which
has 10 physical cores and a data set which has 100 partitions, then in
Could anyone please give me pointers for an appropriate SparkConf to work
around "Size exceeds Integer.MAX_VALUE"?
Stacktrace:
2015-07-09 20:12:02 INFO (sparkDriver-akka.actor.default-dispatcher-3)
BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8
GB)
2015-07-09 20:12:02
Hi, All,
I have a function and want to access it in my Spark programs, but I got:
Exception in thread "main" java.lang.NoSuchMethodError in spark-submit. I
put the function under:
./src/main/scala/com/aaa/MYFUNC/MYFUNC.scala:
./src/main/scala/com/aaa/MYFUNC/MYFUNC.scala:
package com.aaa.MYFUNC
object MYFUNC {
  def FUNC1(input:
Which release of Spark are you using?
Can you show the complete stack trace?
getBytes() could be called from:
getBytes(file, 0, file.length)
or:
getBytes(segment.file, segment.offset, segment.length)
Cheers
On Thu, Jul 9, 2015 at 2:50 PM, Michal Čizmazia mici...@gmail.com wrote:
You cannot run Spark in cluster mode by instantiating a SparkContext like
that.
You have to launch it with the spark-submit command line script.
On Thu, Jul 9, 2015 at 2:23 PM, jegordon jgordo...@gmail.com wrote:
Hi to all,
Is there any way to run pyspark scripts with yarn-cluster mode
Thanks Shixiong! Your response helped me to understand the role of
persist(). No persist() calls were required indeed. I solved my problem by
setting spark.local.dir to allow more space for Spark's temporary folder. It
works automatically. I am seeing logs like this:
Not enough space to cache
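For reference, a hedged sketch of setting this programmatically (the path is illustrative; it can equally go in spark-defaults.conf):

```
import org.apache.spark.SparkConf

// Point Spark's scratch space at a volume with more room; the path is illustrative.
val conf = new SparkConf().set("spark.local.dir", "/mnt/big-disk/spark-tmp")
```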
foreachRDD returns Unit:

def foreachRDD(foreachFunc: (RDD[T]) ⇒ Unit): Unit
(https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)

Apply a function to each RDD in this DStream. This is an output operator,
so 'this' DStream will be registered as an output stream and
This means that one of your cached RDD partitions is bigger than 2 GB of data.
You can fix it by having more partitions. If you read data from a file system
like HDFS or S3, set the number of partitions higher in the sc.textFile,
hadoopFile, etc methods (it's an optional second parameter to
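A hedged sketch of both approaches (the partition counts are illustrative):

```
// Ask for more input partitions up front; the optional second argument is minPartitions.
val data = sc.textFile("hdfs:///big/input", 2000)

// Or split an existing RDD further before caching it.
val finer = data.repartition(2000)
```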
Thanks for the help. I set --executor-cores and it works now. I had used
--total-executor-cores and didn't realize it changed.
On Fri, Jul 10, 2015 at 3:11 AM, Tathagata Das t...@databricks.com wrote:
1. There will be a long-running job with the description start(), as that is
the job that runs the receivers.