Greetings!
Is it true that functions, such as those passed to RDD.map(), are deserialized
once per task? This seems to be the case looking at Executor.scala, but I don't
really understand the code.
I'm hoping the answer is yes because that makes it easier to write code without
worrying about
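If deserialization does happen once per task, one useful consequence is that per-task state can be rebuilt with a @transient lazy val. A minimal sketch (no Spark needed) that simulates shipping a function to an executor with plain Java serialization; the names InitLog and Mapper are illustrative only:

```scala
import java.io._

// Illustrative names only (InitLog, Mapper): this simulates Spark shipping
// a serialized function to an executor, using plain Java serialization.
object InitLog { var count = 0 }

class Mapper extends Serializable {
  // @transient: the field is not serialized; lazy: it is rebuilt on first
  // use after each deserialization -- i.e. once per task, if functions are
  // indeed deserialized once per task.
  @transient lazy val resource: String = { InitLog.count += 1; "ready" }
  def apply(x: Int): Int = { require(resource == "ready"); x * 2 }
}

// One round trip through serialization stands in for one task's copy.
def roundTrip[T <: Serializable](t: T): T = {
  val bos = new ByteArrayOutputStream()
  new ObjectOutputStream(bos).writeObject(t)
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[T]
}
```

Each deserialized copy re-runs the lazy initializer exactly once, however many records it then processes.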
This is something of a wild guess, but I find that when executors start
disappearing for no obvious reason, it is usually because the YARN
node-managers have decided that the containers are using too much memory and
have terminated the executors.
Unfortunately, to see evidence of this, one
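If that is what is happening, the usual fix is to give the container headroom beyond the executor heap. A sketch of the relevant Spark-on-YARN settings (the values are placeholders, not recommendations; in Spark 1.x the overhead is given in MB):

```
spark.executor.memory                7g
spark.yarn.executor.memoryOverhead   1024
spark.yarn.driver.memoryOverhead     1024
```

These go in spark-defaults.conf or as --conf flags; YARN kills the container when heap plus overhead exceeds its allocation, which matches the node-manager log message.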
Note that in Scala, return is a non-local return:
https://tpolecat.github.io/2014/05/09/return.html
So that return is *NOT* returning from the anonymous function, but attempting
to return from the enclosing method, i.e., main, which is running on the
driver, not on the workers. So on the workers,
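One way to see this without Spark: if the closure outlives its enclosing method, the `return` has no stack frame to unwind to and surfaces as `scala.runtime.NonLocalReturnControl`, which is roughly what happens when the closure runs on a worker. A sketch with an illustrative `makeMapper`:

```scala
// `return` inside the function literal targets makeMapper, not the lambda.
// While makeMapper's frame is still on the stack this "works"; once the
// closure outlives the method (as it does when shipped to a worker), the
// return surfaces as scala.runtime.NonLocalReturnControl.
def makeMapper(): Int => Int = { (x: Int) =>
  if (x < 0) return (y: Int) => y   // non-local: tries to return from makeMapper
  x * 2
}
```

Calling makeMapper()(3) gives 6, but makeMapper()(-1) throws NonLocalReturnControl, because makeMapper has already returned and there is nothing to unwind to.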
My apologies for following my own post, but a friend just pointed out that if I
use kryo with reference counting AND copy-and-paste, this runs.
However, if I try to load file, this fails as described below.
I thought load was supposed to be equivalent?
Thanks!
-Mike
From: Sean Owen so...@cloudera.com
To: Michael Albert m_albert...@yahoo.com
Cc: User user@spark.apache.org
Sent: Monday, March 23, 2015 7:31 AM
Subject: Re: How to check that a dataset is sorted after it has been written
out?
Data is not (necessarily) sorted when read from disk, no. A file might
have
Greetings! [My apologies for this repost; I'm not certain that the first
message made it to the list.]
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
Greetings!
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
partition (i.e., as seen in the partition index of mapPartitionsWithIndex) is
not the same as
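For the within-partition half of that check, a streaming test that never materializes the partition can be written over a plain Iterator, the same shape mapPartitionsWithIndex hands you; a plain-Scala sketch, names illustrative:

```scala
// True iff the iterator's elements are non-decreasing; works lazily,
// so it is safe to use inside mapPartitions on a large partition.
def isSorted[T](it: Iterator[T])(implicit ord: Ordering[T]): Boolean =
  it.sliding(2).forall {
    case Seq(a, b) => ord.lteq(a, b)
    case _         => true   // 0- or 1-element window at the end
  }

// Inside Spark one would then do something like
//   rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, isSorted(it))) }
// and separately compare each partition's max key with the next one's min,
// since within-partition order says nothing about cross-partition order.
```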
For what it's worth, I was seeing mysterious hangs, but they went away when I
upgraded from Spark 1.2 to 1.2.1. I don't know if this is your problem. Also,
I'm using AWS EMR images, which were also upgraded.
Anyway, that's my experience.
-Mike
From: Manas Kar manasdebashis...@gmail.com
To:
Greetings!
Again, thanks to all who have given suggestions. I am still trying to diagnose
a problem in which processes that run for one or several hours intermittently
stall or hang. By "stall" I mean that there is no CPU usage on the workers or
the driver, nor network activity, nor do I
completely confused :-).
Thanks!
-Mike
From: Michael Albert m_albert...@yahoo.com.INVALID
To: user@spark.apache.org user@spark.apache.org
Sent: Thursday, February 5, 2015 9:04 PM
Subject: Spark stalls or hangs: is this a clue? remote fetches seem to never
return?
Greetings!
Again, thanks
1) Parameters like --num-executors should come before the jar. That is, you
want something like
$SPARK_HOME --num-executors 3 --driver-memory 6g --executor-memory 7g \
  --master yarn-cluster --class EDDApp target/scala-2.10/eddjar \
  outputPath
That is, *your* parameters come after the jar,
From: Sandy Ryza sandy.r...@cloudera.com
To: Imran Rashid iras...@cloudera.com
Cc: Michael Albert m_albert...@yahoo.com; user@spark.apache.org
user@spark.apache.org
Sent: Wednesday, February 4, 2015 12:54 PM
Subject: Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?
Also, do
Thank you!
This is very helpful.
-Mike
From: Aaron Davidson ilike...@gmail.com
To: Imran Rashid iras...@cloudera.com
Cc: Michael Albert m_albert...@yahoo.com; Sean Owen so...@cloudera.com;
user@spark.apache.org user@spark.apache.org
Sent: Tuesday, February 3, 2015 6:13 PM
Subject: Re
) at
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
From: Sean Owen so...@cloudera.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Monday, February 2, 2015 10:13 PM
Subject: Re: 2GB limit for partitions?
The limit is on blocks
Greetings!
First, my sincere thanks to all who have given me advice. Following previous
discussion, I've rearranged my code to try to keep the partitions to more
manageable sizes. Thanks to all who commented.
At the moment, the input set I'm trying to work with is about 90GB (avro
parquet
Greetings!
SPARK-1476 says that there is a 2G limit for blocks. Is this the same as a 2G
limit for partitions (or approximately so)?
What I had been attempting to do is the following.
1) Start with a moderately large data set (currently about 100GB, but growing).
2) Create about 1,000 files
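If the 2G block limit does bound partition size, a lower bound on the partition count falls out of simple arithmetic; a sketch with a safety factor (the 4x headroom figure is just an assumption for illustration, not a Spark recommendation):

```scala
// SPARK-1476: a single block must stay under 2 GB, so (to the extent blocks
// and partitions correspond) each partition should too.
val totalBytes = 100L * 1024 * 1024 * 1024        // ~100 GB of input
val blockLimit = 2L * 1024 * 1024 * 1024          // the 2 GB ceiling
val headroom   = 4                                // assumed safety factor
val minPartitions =
  math.ceil(totalBytes.toDouble * headroom / blockLimit).toInt
// 200 partitions here, each ~512 MB -- comfortably under the limit.
```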
Greetings!
My executors apparently are being terminated because they are running beyond
physical memory limits, according to the yarn-hadoop-nodemanager logs on the
worker nodes (/mnt/var/log/hadoop on AWS EMR). I'm setting the driver-memory
to 8G. However, looking at stdout in userlogs, I can
writing, but perhaps there is
some subtle difference in the context?
Thank you.
Sincerely, Mike
From: Akhil Das ak...@sigmoidanalytics.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Monday, January 5, 2015 1:21 AM
Subject: Re: a vague
Greetings!
I would like to know if the code below will read one-partition-at-a-time,
and whether I am reinventing the wheel.
If I may explain, upstream code has managed (I hope) to save an RDD such
that each partition file (e.g, part-r-0, part-r-1) contains exactly the
data subset
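On the "am I reinventing the wheel" half: if each part-r-* file really does hold exactly one partition's subset, the files can at least be walked deterministically outside Spark. A plain-Scala sketch (the file layout is an assumption, and `eachPartitionFile` is an illustrative name):

```scala
import java.io.File

// Assumed layout: each part-r-* file in `dir` holds exactly one
// partition's subset. Walk them one at a time, in partition order.
def eachPartitionFile(dir: File)(handle: (Int, File) => Unit): Unit =
  dir.listFiles()
    .filter(_.getName.startsWith("part-"))
    .sortBy(_.getName)
    .zipWithIndex
    .foreach { case (file, index) => handle(index, file) }
```

Note the lexicographic sort only matches partition order because Spark zero-pads part-file names (part-r-00000 and so on).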
Greetings!
So, I think I have data saved so that each partition (part-r-0, etc.) is
exactly what I want to translate into an output file of a format not related
to Hadoop.
I believe I've figured out how to tell Spark to read the data set without
re-partitioning (in another post I mentioned
6E7 values, and the data is
(DataKey(Int,Int), Option[Float]), so that shouldn't need 5g?
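For what it's worth, boxed JVM representations make 6E7 such records bigger than intuition suggests; a back-of-envelope sketch (the ~120 bytes/record figure is a rough assumption covering the tuple, the key object, the Option wrapper, the boxed float, and references, not a measured number):

```scala
// Back-of-envelope size for 6e7 records of (DataKey(Int, Int), Option[Float]).
val records     = 60000000L          // 6e7
val bytesPerRec = 120L               // ASSUMED: object headers, boxing, refs
val approxGiB   = records * bytesPerRec / (1024.0 * 1024 * 1024)
// Roughly 6.7 GiB boxed on-heap, versus only ~12 bytes/record of raw
// primitives -- one way the "shouldn't need 5g" intuition can go wrong.
```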
Anyway, thanks for the info.
Best wishes,
Mike
From: Sean Owen so...@cloudera.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org
Sent: Friday, December 26, 2014 3:23 PM
Subject
Greetings!
I'm trying to do something similar, and having a very bad time of it.
What I start with is
key1: (col1: val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4, ...)
key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val-2-4, ...)
What I want (what I have been asked to produce
Hive at 0.13.1 still can't read it, though...
Thanks!
-Mike
From: Michael Armbrust mich...@databricks.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Tuesday, November 4, 2014 2:37 PM
Subject: Re: avro + parquet + vector&lt;string&gt;
Greetings!
I'm trying to use avro and parquet with the following schema:
{
  "name": "TestStruct",
  "namespace": "bughunt",
  "type": "record",
  "fields": [
    {
      "name": "string_array",
      "type": { "type": "array", "items": "string" }
    }
  ]
}
The writing
Greetings!
This might be a documentation issue as opposed to a coding issue, in that
perhaps the correct answer is "don't do that", but as this is not obvious,
I am writing.
The following code produces output most would not expect:
package misc
import org.apache.spark.SparkConf
import