I happened to encounter something similar.
It's probably because you are just `explain`-ing it. When you actually `run`
it, you will get the final Spark plan, in which case the exchange will be
reused.
Right, this is different compared with 3.1, probably because of the upgraded
AQE.
Not sure whether this
Hi,
When migrating to Spark 3, I'm getting a NoSuchElementException
when getting the partitions of a Parquet DataFrame.
The code I'm trying to execute is:
val df = sparkSession.read.parquet(inputFilePath)
val partitions = df.rdd.partitions
and the Spark session is created
Greetings!
Is it true that functions, such as those passed to RDD.map(), are deserialized
once per task? This seems to be the case looking at Executor.scala, but I don't
really understand the code.
I'm hoping the answer is yes, because that makes it easier to write code without
worrying about
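If the answer is yes (an assumption this sketch depends on), then per-task state such as a non-thread-safe formatter is safe to keep inside the function object; all names below are hypothetical:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Sketch: if the function is deserialized once per task, each task gets
// its own SimpleDateFormat instance, so its lack of thread safety does
// not matter across tasks.
class FormatDates extends (Long => String) with Serializable {
  // Rebuilt on each deserialization, hence (assumed) once per task.
  @transient lazy val fmt = new SimpleDateFormat("yyyy-MM-dd")
  def apply(ts: Long): String = fmt.format(new Date(ts))
}
// usage: rdd.map(new FormatDates)
```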
This is something of a wild guess, but I find that when executors start
disappearing for no obvious reason, it is usually because the YARN
node-managers have decided that the containers are using too much memory and
have terminated the executors.
Unfortunately, to see evidence of this, one
code out there that
is doing this already that we can look at for some inspiration?
Any advice appreciated.
Thanks
Albert
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h
Note that in Scala, return is a non-local return:
https://tpolecat.github.io/2014/05/09/return.html
So that return is *NOT* returning from the anonymous function, but attempting
to return from the enclosing method, i.e., main. Which is running on the
driver, not on the workers. So on the workers,
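A minimal plain-Scala illustration of the non-local return the linked post describes (no Spark needed; the names here are hypothetical):

```scala
object ReturnDemo {
  // `return` inside the closure is a non-local return: it is compiled as
  // a thrown control exception that the enclosing method catches. That
  // only works while the enclosing method's frame is still on the stack.
  def firstEven(xs: List[Int]): Option[Int] = {
    xs.foreach { x =>
      if (x % 2 == 0) return Some(x) // returns from firstEven, not the closure
    }
    None
  }
  // In Spark, a closure containing `return` is serialized and run on a
  // worker, where no enclosing method frame exists, so the control-flow
  // exception escapes instead of returning.
}
```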
My apologies for following up on my own post, but a friend just pointed out that if I
use Kryo with reference counting AND copy-and-paste, this runs.
However, if I try to load the file, this fails as described below.
I thought load was supposed to be equivalent?
Thanks! -Mike
From: Sean Owen so...@cloudera.com
To: Michael Albert m_albert...@yahoo.com
Cc: User user@spark.apache.org
Sent: Monday, March 23, 2015 7:31 AM
Subject: Re: How to check that a dataset is sorted after it has been written
out?
Data is not (necessarily) sorted when read from disk, no. A file might
have
Greetings! [My apologies for this repost; I'm not certain that the first
message made it to the list.]
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
Greetings!
I sorted a dataset in Spark and then wrote it out in avro/parquet.
Then I wanted to check that it was sorted.
It looks like each partition has been sorted, but when reading in, the first
partition (i.e., as seen in the partition index of mapPartitionsWithIndex) is
not the same as
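One way to check this (a sketch only; `data` is a hypothetical sorted RDD[Int], and this assumes partition index order is meant to match the global order):

```scala
// Each partition reports (index, min, max, locally sorted?). For a
// globally sorted dataset, every partition must be locally sorted and
// the (min, max) ranges must ascend without overlap in index order.
val summary = data.mapPartitionsWithIndex { (idx, it) =>
  val xs = it.toArray
  if (xs.isEmpty) Iterator.empty
  else Iterator((idx, xs.min, xs.max, xs.sameElements(xs.sorted)))
}.collect().sortBy(_._1)

val globallySorted =
  summary.forall(_._4) &&
  summary.sliding(2).forall {
    case Array((_, _, max1, _), (_, min2, _, _)) => max1 <= min2
    case _                                       => true
  }
```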
For what it's worth, I was seeing mysterious hangs, but they went away when
upgrading from Spark 1.2 to 1.2.1. I don't know if this is your problem.
Also, I'm using AWS EMR images, which were also upgraded.
Anyway, that's my experience.
-Mike
From: Manas Kar manasdebashis...@gmail.com
To:
Greetings!
Again, thanks to all who have given suggestions. I am still trying to diagnose a
problem where I have processes that run for one or several hours but
intermittently stall or hang. By stall I mean that there is no CPU usage on
the workers or the driver, nor network activity, nor do I
completely confused :-).
Thanks! -Mike
From: Michael Albert m_albert...@yahoo.com.INVALID
To: user@spark.apache.org user@spark.apache.org
Sent: Thursday, February 5, 2015 9:04 PM
Subject: Spark stalls or hangs: is this a clue? remote fetches seem to never
return?
Greetings!
Again, thanks
1) Parameters like --num-executors should come before the jar. That is, you
want something like:
$SPARK_HOME/bin/spark-submit --num-executors 3 --driver-memory 6g \
  --executor-memory 7g --master yarn-cluster --class EDDApp \
  target/scala-2.10/eddjar \
  outputPath
That is, *your* parameters come after the jar,
From: Sandy Ryza sandy.r...@cloudera.com
To: Imran Rashid iras...@cloudera.com
Cc: Michael Albert m_albert...@yahoo.com; user@spark.apache.org
user@spark.apache.org
Sent: Wednesday, February 4, 2015 12:54 PM
Subject: Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?
Also, do
Thank you!
This is very helpful.
-Mike
From: Aaron Davidson ilike...@gmail.com
To: Imran Rashid iras...@cloudera.com
Cc: Michael Albert m_albert...@yahoo.com; Sean Owen so...@cloudera.com;
user@spark.apache.org user@spark.apache.org
Sent: Tuesday, February 3, 2015 6:13 PM
Subject: Re
) at
org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
From: Sean Owen so...@cloudera.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Monday, February 2, 2015 10:13 PM
Subject: Re: 2GB limit for partitions?
The limit is on blocks
GB of physical memory and, as far as I can determine,
no swap space.
The messages bracketing the stall are shown below.
Any advice is welcome.
Thanks!
Sincerely, Mike Albert
Before the stall:
15/02/03 21:45:28 INFO cluster.YarnClientClusterScheduler:
Removed TaskSet 5.0, whose tasks have all
?
Admittedly, this is an odd use case
Thanks!
Sincerely, Mike Albert
Greetings!
My executors apparently are being terminated because they are running beyond
physical memory limits, according to the yarn-hadoop-nodemanager logs on the
worker nodes (/mnt/var/log/hadoop on AWS EMR). I'm setting the driver-memory
to 8G. However, looking at stdout in userlogs, I can
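For what it's worth, when YARN kills containers for exceeding physical memory, the usual knob is the executor memory overhead, since YARN enforces heap plus off-heap together. A sketch (option name as of Spark 1.x on YARN; the value here is an assumption):

```shell
# Sketch: add non-heap headroom (in MB) so the container limit YARN
# enforces (executor heap + overhead) is not exceeded.
spark-submit \
  --master yarn-cluster \
  --executor-memory 7g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  ... # remaining arguments unchanged
```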
writing, but perhaps there is
some subtle difference in the context?
Thank you.
Sincerely, Mike
From: Akhil Das ak...@sigmoidanalytics.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Monday, January 5, 2015 1:21 AM
Subject: Re: a vague
Greetings!
I would like to know if the code below will read one partition at a time,
and whether I am reinventing the wheel.
If I may explain: upstream code has managed (I hope) to save an RDD such
that each partition file (e.g., part-r-0, part-r-1) contains exactly the
data subset
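For reference, the straightforward way to handle one saved partition file at a time (a sketch; `sc`, `basePath`, and the part-file names are all assumptions, not taken from the thread) is simply to open each part file as its own RDD:

```scala
// Sketch: each part-r-N file holds exactly one logical subset, so read
// and process them one at a time rather than as a single re-partitioned
// RDD. File names are assumed to follow the part-r-N convention.
val partFiles = Seq("part-r-0", "part-r-1") // hypothetical listing
for (part <- partFiles) {
  val subset = sc.textFile(s"$basePath/$part")
  // ... translate this one partition's data, then move to the next ...
}
```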
Greetings!
So, I think I have data saved so that each partition (part-r-0, etc.) is
exactly what I want to translate into an output file of a format not related to
Hadoop.
I believe I've figured out how to tell Spark to read the data set without
re-partitioning (in another post I mentioned
6E7 values, and the data is
(DataKey(Int,Int), Option[Float]), so that shouldn't need 5g?
Anyway, thanks for the info.
Best wishes, Mike
From: Sean Owen so...@cloudera.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org
Sent: Friday, December 26, 2014 3:23 PM
Subject
Greetings!
I'm trying to do something similar, and having a very bad time of it.
What I start with is:
key1: (col1: val-1-1, col2: val-1-2, col3: val-1-3, col4: val-1-4, ...)
key2: (col1: val-2-1, col2: val-2-2, col3: val-2-3, col4: val-2-4, ...)
What I want (what I have been asked to produce
MatrixFactorizationModel cannot be accessed in object RecommendALS:
val model = new MatrixFactorizationModel(8, userFeatures, productFeatures)
Any ideas?
Thanks!
--
Albert Manyà
alber...@eml.cc
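If the constructor really is inaccessible (in some Spark versions it is package-private to mllib), one workaround sketch is to score directly from the factor data rather than constructing the model. Everything below (names, types) is an assumption, not taken from the thread:

```scala
// Sketch: a rank-8 ALS prediction is just the dot product of the user
// and product feature vectors, so the model object is not strictly
// needed in order to score.
def predict(user: Int,
            product: Int,
            userFeatures: Map[Int, Array[Double]],
            productFeatures: Map[Int, Array[Double]]): Double = {
  val u = userFeatures(user)
  val p = productFeatures(product)
  u.zip(p).map { case (a, b) => a * b }.sum
}
```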
In that case, what is the strategy to train a model in some background
batch process and make recommendations for some other service in real
time? Run both processes in the same spark cluster?
Thanks.
--
Albert Manyà
alber...@eml.cc
On Mon, Dec 15, 2014, at 05:58 PM, Sean Owen wrote
who is compiled against such an old version of
httpclient. I see in the project dependencies that amazonaws 1.9.10
depends on httpclient 4.3... Is it Spark that is compiled against an old
version of amazonaws?
Thanks.
--
Albert Manyà alber...@eml.cc
On Fri, Dec 12, 2014, at 09:27 AM, Akhil Das
signature for setSoKeepalive:
public static void setSoKeepalive(HttpParams params, boolean enableKeepalive)
At this point I'm stuck and don't know where to keep looking... some
help would be greatly appreciated :)
Thank you very much!
--
Albert Manyà
alber...@eml.cc
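One common way out of this kind of transitive-version conflict (a sketch, assuming an sbt build; the exact httpclient version is an assumption based on the 4.3 mentioned above) is to force the newer artifact:

```scala
// build.sbt sketch: override the older httpclient that another
// dependency pulls in transitively with the 4.3.x the AWS SDK needs.
libraryDependencies ++= Seq(
  "com.amazonaws"             % "aws-java-sdk" % "1.9.10",
  "org.apache.httpcomponents" % "httpclient"   % "4.3.6" force()
)
```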
Hive at 0.13.1 still can't read it, though... Thanks! -Mike
From: Michael Armbrust mich...@databricks.com
To: Michael Albert m_albert...@yahoo.com
Cc: user@spark.apache.org user@spark.apache.org
Sent: Tuesday, November 4, 2014 2:37 PM
Subject: Re: avro + parquet + vector<string>
stumped. I can read and write records and maps, but arrays/vectors elude
me. Am I missing something obvious?
Thanks!
Sincerely, Mike Albert
Greetings!
This might be a documentation issue as opposed to a coding issue, in that
perhaps the correct answer is "don't do that", but as this is not obvious, I am
writing.
The following code produces output most would not expect:
package misc
import org.apache.spark.SparkConf
import
Hi
I'm evaluating Spark Streaming to see if it fits to scale our current
architecture.
We are currently downloading and processing 6M documents per day from
online and social media. We have a different workflow for each type of
document, but some of the steps are keyword extraction, language
Hi Jayant,
On 23 October 2014 11:14, Jayant Shekhar jay...@cloudera.com wrote:
Hi Albert,
Have a couple of questions:
- You mentioned near real-time. What exactly is your SLA for
processing each document?
As low as possible :). Right now it's between 30s and 5m, but I would like
should be independent of that.
I'm sure there's something subtle I'm missing or not understanding,
thanks in advance.
Al
--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory