I am facing a weird failure where "sbt/sbt assembly" shows a lot of SSL
certificate errors for repo.maven.apache.org. Is anyone else facing the same
problems? Any idea why this is happening? Yesterday I was able to run it
successfully.
Loading https://repo.maven.apache.org shows an invalid cert
I'm also seeing this. It was also working for me previously, AFAIK.
The proximate cause is my well-intentioned change that uses HTTPS to
access all artifact repos. The default for Maven Central before would have
been HTTP. While it's a good idea to use HTTPS, it may run into
complications.
Andrew, this should be fixed in 0.9.1, assuming it is the same hash
collision error we found there.
Kane, is it possible your bigger dataset is corrupt, such that any
operations on it fail?
On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote:
FWIW I've seen correctness
Hi all,
Hitting a mysterious error loading large text files, specific to PySpark
0.9.0.
In PySpark 0.8.1, this works:
data = sc.textFile("path/to/myfile")
data.count()
But in 0.9.0, it stalls. There are indications of completion up to:
14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in
Hi Koert, Patrick,
do you already have an elegant solution to combine multiple operations on a
single RDD?
Say, for example, that I want to do a sum over one column, and a count and an
average over another column.
Thanks in advance,
Richard
On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling
Yes, there was an error in the data; after fixing it, count fails with an Out of
Memory error.
I am getting these weird errors which I have not seen before:
[error] Server access Error: handshake alert: unrecognized_name url=
https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit
[info] Resolving
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
whether or not the SSL cert was bad. Let us know if they go away after
reverting to http.
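In the meantime, a rough sketch of a temporary workaround for an sbt build, assuming you just want to point resolution back at the http endpoint until that PR lands (the resolver name below is arbitrary, and overriding externalResolvers replaces sbt's defaults):

// Sketch only; the real fix is the PR above. This swaps the default
// resolvers for an http Maven Central, sidestepping the SSL handshake.
externalResolvers := Seq(
  "Maven Central (http)" at "http://repo.maven.apache.org/maven2"
)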
On Sun, Mar 23, 2014 at 11:48 AM,
Hello,
I have a weird error showing up when I run a job on my Spark cluster.
The version of Spark is 0.9, and I have 3+ GB free on the disk when this
error shows up. Any ideas what I should be looking for?
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
167.0:3 failed
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size
limit. It's possible that this limit has been exceeded. You might try
running the df command to check the free space of /tmp, or of the root file
system if /tmp isn't listed separately.
3 GB also seems pretty low for the remaining free space of a disk.
Hey All,
I think the old thread is here:
https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
The method proposed in that thread is to create a utility class for
doing single-pass aggregations. Using Algebird is a pretty good way to
do this and is a bit more flexible since
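For illustration, here is a minimal single-pass sketch that uses plain rdd.aggregate instead of Algebird; the two-column tuple schema and the local master are just assumptions for the example:

import org.apache.spark.{SparkConf, SparkContext}

object SinglePassAgg {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("single-pass-agg").setMaster("local[2]"))

    // Hypothetical rows of (colA, colB).
    val rows = sc.parallelize(Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)))

    // One pass over the data: sum of colA, row count, and sum of colB
    // (the last two give the average of colB).
    val (sumA, count, sumB) = rows.aggregate((0.0, 0L, 0.0))(
      (acc, row) => (acc._1 + row._1, acc._2 + 1L, acc._3 + row._2),
      (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))

    println(s"sumA=$sumA count=$count avgB=${sumB / count}")
    sc.stop()
  }
}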
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp
if /tmp is full. Actually it’s recommended to have multiple local
disks and set it to a comma-separated list of directories, one per disk.
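For example, a minimal sketch of setting it programmatically (the mount points below are hypothetical; one directory per physical disk):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
val sc = new SparkContext(conf)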
Matei, does the number of tasks/partitions
Hi
Thanks for reporting this. It'll be great if you can check a couple of
things:
1. Are you trying to use this with Hadoop2 by any chance? There was an
incompatible ASM version bug that we fixed for Hadoop2
https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it,
but I just
Hello,
In Spark we can use *newAPIHadoopRDD* to access different distributed
systems like HDFS, HBase, and MongoDB via different InputFormats.
Is it possible to access the *InputSplit* in Spark directly? Spark can
cache data in local memory.
Perform local computation/aggregation on the local
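Each partition of a Hadoop-backed RDD corresponds to one input split, so per-split local aggregation can be done with mapPartitions even without touching the InputSplit object itself. A minimal sketch, where the HDFS path and the per-partition record count are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val sc = new SparkContext(
  new SparkConf().setAppName("per-split-agg").setMaster("local[2]"))

// Read via a new-API InputFormat; one RDD partition per input split.
val records = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs:///path/to/data")

// Local (per-partition, roughly per-split) aggregation before any shuffle.
val perSplitCounts = records.mapPartitions(iter => Iterator(iter.size))
println("total records: " + perSplitCounts.collect().sum)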
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your
SparkContext? It tries to serialize that many objects together at a time, which
might be too much. By default the batchSize is 1024.
Matei
On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi
I don’t see the errors anymore. Thanks Aaron.
On 24-Mar-2014, at 12:52 am, Aaron Davidson ilike...@gmail.com wrote:
These errors should be fixed on master with Sean's PR:
https://github.com/apache/spark/pull/209
The orbit errors are quite possibly due to using https instead of http,
Aaron, thanks for replying. I am very much puzzled as to what is going
on. A job that used to run on the same cluster is failing with this
mysterious message about not having enough disk space when in fact I can
see through watch df -h that the free space is always hovering around
3+GB on the
Bleh, strike that, one of my slaves was at 100% inode utilization on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
Thanks
Thanks for bringing this up. 100% inode utilization is an issue I haven't
seen raised before, and it raises another issue that is not yet on our
roadmap: state cleanup (cleaning up data that was not fully cleaned up
by a crashed process).
On Sun, Mar 23, 2014 at 7:57 PM, Ognen
I would love to work on this (and other) stuff if I can bother someone
with questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up. 100% inode utilization is an issue I
haven't seen raised before, and it raises another issue
Hey there fellow Dukes of Data,
How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having a good number
of partitions is good for performance. If I'm looking to understand how my
pipeline is performing, say for a parallelized write out
It's much simpler: rdd.partitions.size
On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Hey there fellow Dukes of Data,
How can I tell how many partitions my RDD is split into?
I'm interested in knowing because, from what I gather, having a good
number
As Mark said, you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work, but performance will improve if
you coalesce the partitions using rdd.coalesce(n).
This can happen for example if
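To make the two calls above concrete, a quick sketch (the input path and the target of 16 partitions are arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("partitions-demo").setMaster("local[4]"))
val rdd = sc.textFile("path/to/input") // hypothetical path

println("partitions: " + rdd.partitions.size)

// Lots of tiny partitions add per-task overhead; shrink them without a shuffle.
val compacted = rdd.coalesce(16)
println("after coalesce: " + compacted.partitions.size)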
Ognen - just so I understand: the issue is that there weren't enough
inodes, and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
og...@nengoiksvelzud.com wrote: