When running the connectedComponents algorithm in GraphFrames on a
sufficiently large dataset, I get the following error, which I have not encountered
before:
17/04/20 20:35:26 WARN TaskSetManager: Lost task 3.0 in stage 101.0 (TID
53644, 172.19.1.206, executor 40): java.lang.IllegalStateException:
how much of the execution time is used up by
cPickle.load() and .dump() methods.
Hope that helps,
Rok
On Wed, Mar 8, 2017 at 3:18 AM, Yeoul Na [via Apache Spark User List] <
ml-node+s1001560n28468...@n3.nabble.com> wrote:
>
> Hi all,
>
> I am trying to analyze PySpark performance
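For reference, one way to see where the Python worker time goes is the built-in
worker profiler (available since roughly Spark 1.2); time spent in pickle
load/dump shows up in its cProfile output. A minimal sketch, assuming a
throwaway job:

from pyspark import SparkConf, SparkContext

# Enable per-stage cProfile collection in the Python workers.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10**6)).map(lambda x: x * 2)
rdd.count()

# Print the accumulated profiles; pickling time appears in these stats.
sc.show_profiles()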
There has been a stddev function in pyspark.sql.functions since Spark 1.6:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.stddev
If you are using Spark < 1.6, you can write your own more or less easily.
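A minimal usage sketch ("df" and the column name "value" are placeholders):

from pyspark.sql import functions as F

# Spark >= 1.6: built-in (sample) standard deviation.
df.agg(F.stddev("value")).show()

# Spark < 1.6: population standard deviation by hand; apply the n/(n-1)
# correction if the sample standard deviation is needed.
df.selectExpr("sqrt(avg(value * value) - avg(value) * avg(value)) AS stddev_pop").show()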
On Wed, Feb 17, 2016 at 5:06 PM, mayx [via Apache Spark User List] <
Will it be fixed? Right now it's rendering
many metrics useless, since I want to have a complete view into the
application and I'm only seeing a few executors at a time.
Thanks,
rok
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-metrics-in-graphite-missing
I'm not sure what you mean? I didn't do anything specifically to partition
the columns
On Nov 14, 2015 00:38, "Davies Liu" <dav...@databricks.com> wrote:
> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar <rokros...@gmail.com> wrote:
schema consisting of a StructType with a few StructField floats and
a string. I'm using all the Spark defaults for I/O compression.
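For illustration, a schema of the shape described might look like this (the
field names here are made up, not from the original message):

from pyspark.sql.types import StructType, StructField, FloatType, StringType

schema = StructType([
    StructField("x", FloatType(), True),
    StructField("y", FloatType(), True),
    StructField("z", FloatType(), True),
    StructField("id", StringType(), True),
])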
I'll see what I can do about running a profiler -- can you point me to a
resource/example?
Thanks,
Rok
ps: my post on the mailing list is still listed as not accepted
I'm writing a ~100 GB pyspark DataFrame with a few hundred partitions into
a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
the size of file this is way over-provisioned (I've tried it with fewer
partitions and fewer nodes, no obvious effect). I was expecting the dump to
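As a rough sketch of the write being described (the DataFrame name, partition
count, and path are placeholders), controlling the partition count before the
write is one way to keep the number of output files sensible:

# ~100 GB spread over a few hundred partitions before writing to HDFS.
df.repartition(300).write.parquet("hdfs:///path/to/output.parquet")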
zipWithIndex gives you global indices, which is not what you want. You'll
want to use flatMap with a map function that iterates through each iterable
and returns the (String, Int, String) tuple for each element.
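A minimal sketch of that flatMap approach, assuming "grouped" is an RDD of
(key, iterable) pairs such as the output of groupByKey(); enumerate() gives an
index local to each group, rather than the global index zipWithIndex() assigns:

def index_group(kv):
    key, values = kv
    return [(key, i, v) for i, v in enumerate(values)]

indexed = grouped.flatMap(index_group)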
On Thu, Jul 30, 2015 at 4:13 AM, askformore [via Apache Spark User List]
15/07/24 08:10:59 INFO yarn.ApplicationMaster: ApplicationAttemptId:
appattempt_1437724871597_0001_01
15/07/24 08:11:00 INFO spark.SecurityManager: Changing view acls to: root,rok
15/07/24 08:11:00 INFO spark.SecurityManager: Changing modify acls to: root,rok
15/07/24 08:11:00 INFO
I am trying to run Spark applications with the driver running locally and
interacting with a firewalled remote cluster via a SOCKS proxy.
I have to modify the hadoop configuration on the *local machine* to try to
make this work, adding
property
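The message is cut off at "property"; for context, the pair of core-site.xml
entries usually added for this kind of SOCKS setup looks roughly like the
following (the proxy host and port are placeholders):

<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:1080</value>
</property>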
for you to send the logs.
Imran
On Fri, May 15, 2015 at 10:07 AM, rok rokros...@gmail.com wrote:
I am trying to sort a collection of (key, value) pairs (between several hundred
million and a few billion) and have recently been getting lots of
FetchFailedException errors that seem to originate when one of the
executors fails to find a temporary shuffle file on disk. E.g.:
That's exactly what I'm saying -- I specify the memory options using spark
options, but this is not reflected in how the JVM is created. No matter
which memory settings I specify, the JVM for the driver is always created with
512 MB of memory. So I'm not sure if this is a feature or a bug?
rok
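For context, a common explanation for this behaviour is that spark.driver.memory
set from inside the application cannot resize a driver JVM that is already
running; it has to be supplied at launch time, e.g. (values are placeholders):

spark-submit --driver-memory 8g --conf spark.yarn.am.memory=4g my_app.py

or in conf/spark-defaults.conf:

spark.driver.memory   8g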
the spark.yarn.am.memory
and overhead parameters but it doesn't seem to have an effect.
Thanks,
Rok
On Apr 23, 2015, at 7:47 AM, Xiangrui Meng men...@gmail.com wrote:
What is the feature dimension? Did you set the driver memory? -Xiangrui
On Tue, Apr 21, 2015 at 6:59 AM, rok rokros...@gmail.com
are being ignored and instead
starting the driver with just 512Mb of heap?
On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar rokros...@gmail.com wrote:
the feature dimension is 800k.
yes, I believe the driver memory is likely the problem since it doesn't
crash until the very last part of the tree
, Shixiong Zhu zsxw...@gmail.com wrote:
RDD is not thread-safe. You should not use it in multiple threads.
Best Regards,
Shixiong Zhu
2015-02-27 23:14 GMT+08:00 rok rokros...@gmail.com:
I'm seeing this java.util.NoSuchElementException: key not found: exception
pop up sometimes when I run operations on an RDD from multiple threads in a
Python application. It ends up shutting down the SparkContext, so I'm
assuming this is a bug -- from what I understand, I should be able to run
. Are the applications somehow binding to the wrong
ports? Is this a spark setting I need to configure or something within YARN?
Thanks!
Rok
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/cannot-connect-to-Spark-Application-Master-in-YARN-tp21699.html
..., 0.6606453820150496, 0.8610156719813942, 0.6971353266345091, 0.9896836700210551, 0.05789392881996358]
Is there a size limit for objects serialized with Kryo? Or an option that
controls it? The Java serializer works fine.
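For reference, the Kryo buffer ceiling is configurable; a single record larger
than the maximum buffer fails to serialize. A minimal sketch (the value is only
an example):

from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# Spark 1.x-era name, expressed in MB; newer releases use
# spark.kryoserializer.buffer.max (e.g. "512m").
conf.set("spark.kryoserializer.buffer.max.mb", "512")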
On Wed, Feb 11, 2015 at 8:04 PM, Rok Roskar rokros...@gmail.com wrote
Aha great! Thanks for the clarification!
On Feb 11, 2015 8:11 PM, Davies Liu dav...@databricks.com wrote:
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com wrote:
I was having trouble with memory exceptions when broadcasting a large
lookup
table, so I've resorted to processing
the runtime for each consecutive iteration is still roughly twice as long as
for the previous one -- is there a way to reduce whatever overhead is
accumulating?
On Feb 11, 2015, at 8:11 PM, Davies Liu dav...@databricks.com wrote:
On Wed, Feb 11, 2015 at 10:47 AM, rok rokros...@gmail.com
map is slower than the one
before. I'll try the checkpoint, thanks for the suggestion.
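A rough sketch of the checkpointing idea, assuming an iterative loop over an
RDD; the directory, interval, and per-iteration "step" function are placeholders:

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

for i in range(num_iterations):
    rdd = rdd.map(step)
    if i % 5 == 0:
        rdd.checkpoint()   # truncate the growing lineage every few iterations
        rdd.count()        # an action is needed to actually materialize it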
On Feb 12, 2015, at 12:13 AM, Davies Liu dav...@databricks.com wrote:
On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar rokros...@gmail.com wrote:
the runtime for each consecutive iteration is still roughly twice
next.
Is there a limit to the size of broadcast variables? This one is rather
large (a dict of a few GB). Thanks!
Rok
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
10, 2015 at 10:42 AM, rok rokros...@gmail.com wrote:
I'm trying to use a broadcasted dictionary inside a map function and am
consistently getting Java null pointer exceptions. This is inside an IPython
session connected to a standalone spark cluster. I seem to recall being able
to do this before
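For reference, the usual broadcast-dictionary pattern looks like this (names
and contents are placeholders); the dictionary is reached via .value inside the
task function:

lookup = {"a": 1, "b": 2}
b_lookup = sc.broadcast(lookup)

def translate(x):
    return b_lookup.value.get(x, -1)

print(sc.parallelize(["a", "b", "c"]).map(translate).collect())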
if this was somehow automatically corrected!
Thanks,
Rok
On Wed, Jan 28, 2015 at 7:01 PM, Davies Liu dav...@databricks.com wrote:
HadoopRDD will try to split the file into partitions of about 64 MB, so you
got 1916+ partitions
(assuming 100 KB per row, they are about 80 GB in size).
I think there is a very small chance
+ partitions (I don't know why, actually -- how does this number
come about?). How can I check if any objects/batches are exceeding 2 GB?
Thanks,
Rok
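One rough way to gauge whether any partition's pickled payload is approaching
the 2 GB mark, assuming Python 2-era cPickle to match the Spark 1.x context
above (this is only an approximation of what actually gets shipped):

import cPickle

sizes = rdd.mapPartitions(
    lambda it: [sum(len(cPickle.dumps(x, 2)) for x in it)]
).collect()
print(max(sizes))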
On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu dav...@databricks.com wrote:
Maybe it's caused by integer overflow; is it possible that one object
)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
Not really sure where to start looking for the culprit -- any suggestions
most welcome. Thanks!
Rok
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/NegativeArraySizeException-in-pyspark-when-loading-an-RDD
at 3:46 AM, Rok Roskar rokros...@gmail.com wrote:
thanks for the suggestion -- however, looks like this is even slower.
With
the small data set I'm using, my aggregate function takes ~ 9 seconds and
the colStats.mean() takes ~ 1 minute. However, I can't get it to run with
the Kryo
On Wed, Jan 7, 2015 at 9:42 AM, rok rokros...@gmail.com wrote:
I have an RDD of SparseVectors and I'd like to calculate the means returning
a dense vector. I've tried doing this with the following (using pyspark,
spark v1.2.0):
def aggregate_partition_values(vec1, vec2):
    vec1[vec2.indices] += vec2.values
    return vec1
def
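The snippet above is cut off; a hedged sketch of one way the aggregate-based
mean can be completed, accumulating the sparse values into a dense numpy array
("vectors" is the RDD of SparseVectors and "dim" their dimension, both assumed
here):

import numpy as np

def seq_op(acc, v):
    acc[v.indices] += v.values   # add the non-zero entries into the accumulator
    return acc

def comb_op(a, b):
    return a + b

totals = vectors.aggregate(np.zeros(dim), seq_op, comb_op)
mean = totals / vectors.count()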
I have an RDD that serves as a feature look-up table downstream in my
analysis. I create it using zipWithIndex(), and because I suppose that
the elements of the RDD could end up in a different order if it is
regenerated at any point, I cache it to try and ensure that the (feature --
index)
true though I was hoping to avoid having to sort... maybe there's no way
around it. Thanks!
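A short sketch of the sort-then-index idea being discussed; sorting first makes
the index assignment deterministic if the RDD ever has to be recomputed
("features" is a placeholder for the RDD of feature keys):

lookup = (features
          .sortBy(lambda f: f)
          .zipWithIndex()
          .cache())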
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/ensuring-RDD-indices-remain-immutable-tp20094p20104.html
I'm trying to understand the disk I/O patterns for Spark -- specifically, I'd
like to reduce the number of files that are being written during shuffle
operations. A couple questions:
* is the amount of file I/O performed independent of the memory I allocate
for the shuffles?
* if this is the
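For reference, the shuffle-related knobs that existed around Spark 1.1 (the
values below are examples only):

from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.shuffle.consolidateFiles", "true")  # fewer files with the hash shuffle
conf.set("spark.shuffle.manager", "sort")           # sort-based shuffle: one output file per map task
conf.set("spark.shuffle.memoryFraction", "0.3")     # memory used for shuffles before spilling to disk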
Hi, I'm using Spark 1.1.0. There is no error on the executors -- it appears
as if the job never gets properly dispatched -- the only message is the
Broken Pipe message in the driver.
)
error: [Errno 32] Broken pipe
I'm at a loss as to where to begin to debug this... any suggestions? Thanks,
Rok
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182.html
yes, the training set is fine, I've verified it.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182p18195.html
I'm processing some data using PySpark and I'd like to save the RDDs to disk
(they are (k,v) RDDs of strings and SparseVector types) and read them in
using Scala to run them through some other analysis. Is this possible?
Thanks,
Rok
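One hedged way to do this is to flatten each (key, SparseVector) record into a
plain-text line that a Scala job can parse back into Vectors.sparse; the names
and output path below are placeholders:

def to_line(kv):
    key, vec = kv
    idx = ",".join(map(str, vec.indices))
    vals = ",".join(map(str, vec.values))
    return "%s\t%d\t%s\t%s" % (key, vec.size, idx, vals)

rdd.map(to_line).saveAsTextFile("hdfs:///tmp/features_as_text")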
this strange redistribution of elements? I'm obviously misunderstanding how
Spark does the partitioning -- is it a problem with having a list of strings as
an RDD?
Help very much appreciated!
Thanks,
Rok
is setting the per-core memory limit somewhere?
Thanks,
Rok
def read_chunk(path, start, length):
    f = open(path)
    f.seek(start)
    data = f.read(length)
    # process the chunk here and return/yield an iterable of records for flatMap

# assuming "partitions" is a list of (path, start, length) tuples
rdd = sc.parallelize(partitions, len(partitions)).flatMap(lambda p: read_chunk(*p))
right... this is totally obvious in retrospect! Thanks!
Rok
* related to the first question -- when
tips anyone might have for
running pyspark with numpy data...!
Thanks!
Rok