sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I am facing a weird failure where "sbt/sbt assembly" shows a lot of SSL certificate errors for repo.maven.apache.org. Is anyone else facing the same problems? Any idea why this is happening? Yesterday I was able to successfully run it. Loading https://repo.maven.apache.org shows an invalid cert

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Sean Owen
I'm also seeing this. It was also working for me previously, AFAIK. The proximate cause is my well-intentioned change that uses HTTPS to access all artifact repos. The default for Maven Central before would have been HTTP. While it's a good idea to use HTTPS, it may run into complications. I

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Andrew, this should be fixed in 0.9.1, assuming it is the same hash collision error we found there. Kane, is it possible your bigger data is corrupt, such that any operations on it fail? On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash and...@andrewash.com wrote: FWIW I've seen correctness

error loading large files in PySpark 0.9.0

2014-03-23 Thread Jeremy Freeman
Hi all, Hitting a mysterious error loading large text files, specific to PySpark 0.9.0. In PySpark 0.8.1, this works: data = sc.textFile("path/to/myfile") data.count() But in 0.9.0, it stalls. There are indications of completion up to: 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in

Re: combining operations elegantly

2014-03-23 Thread Richard Siebeling
Hi Koert, Patrick, do you already have an elegant solution to combine multiple operations on a single RDD? Say for example that I want to do a sum over one column, a count and an average over another column, thanks in advance, Richard On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling

Re: distinct on huge dataset

2014-03-23 Thread Kane
Yes, there was an error in the data; after fixing it, count fails with an Out of Memory Error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Debasish Das
I am getting these weird errors which I have not seen before: [error] Server access Error: handshake alert: unrecognized_name url= https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit [info] Resolving

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Aaron Davidson
These errors should be fixed on master with Sean's PR: https://github.com/apache/spark/pull/209 The orbit errors are quite possibly due to using https instead of http, whether or not the SSL cert was bad. Let us know if they go away with reverting to http. On Sun, Mar 23, 2014 at 11:48 AM,
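
For anyone who wants to test the plain-http fallback locally before that PR lands, a minimal sbt sketch (the resolver name is arbitrary, and Spark's own build defines its resolvers in project/SparkBuild.scala, so adjust there rather than in a standalone build.sbt):

    // Sketch only: route dependency resolution through plain http to see
    // whether the handshake/cert errors disappear. A diagnostic workaround,
    // not a recommendation for production builds.
    resolvers += "Maven Central (http)" at "http://repo.maven.apache.org/maven2"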

No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Hello, I have a weird error showing up when I run a job on my Spark cluster. The version of Spark is 0.9 and I have 3+ GB free on the disk when this error shows up. Any ideas what I should be looking for? [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the df command to check the free space of /tmp, or of root if /tmp isn't listed separately. 3 GB also seems pretty low for the remaining free space of a disk.

Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
Hey All, I think the old thread is here: https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J The method proposed in that thread is to create a utility class for doing single-pass aggregations. Using Algebird is a pretty good way to do this and is a bit more flexible since
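
As a rough illustration of that single-pass idea without Algebird (the case class and the two-column layout are invented for the example), RDD.aggregate can carry all three statistics in one accumulator:

    // Sketch: one pass over an RDD[(Double, Double)], computing the sum of
    // the first column plus the count and average of the second.
    case class Acc(sum1: Double, sum2: Double, n: Long)

    val acc = rdd.aggregate(Acc(0.0, 0.0, 0L))(
      (a, row) => Acc(a.sum1 + row._1, a.sum2 + row._2, a.n + 1),
      (a, b)   => Acc(a.sum1 + b.sum1, a.sum2 + b.sum2, a.n + b.n)
    )
    val avg2 = if (acc.n > 0) acc.sum2 / acc.n else 0.0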

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
On 3/23/14, 5:49 PM, Matei Zaharia wrote: You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it’s recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk. Matei, does the number of tasks/partitions
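
A minimal sketch of Matei's suggestion (the mount points below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: point shuffle/spill files at dedicated disks instead of /tmp.
    // One directory per physical disk, comma-separated.
    val conf = new SparkConf()
      .setAppName("my-job")
      .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
    val sc = new SparkContext(conf)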

Re: Problem with SparkR

2014-03-23 Thread Shivaram Venkataraman
Hi Thanks for reporting this. It'll be great if you can check a couple of things: 1. Are you trying to use this with Hadoop2 by any chance ? There was an incompatible ASM version bug that we fixed for Hadoop2 https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it, but I just

is it possible to access the inputsplit in Spark directly?

2014-03-23 Thread hwpstorage
Hello, In Spark we can use *newAPIHadoopRDD* to access different distributed systems like HDFS, HBase, and MongoDB via different input formats. Is it possible to access the *InputSplit* in Spark directly? Spark can cache data in local memory. Perform local computation/aggregation on the local
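
One way to get at the split, sketched here under the assumption of an existing SparkContext sc and a placeholder path: newer Spark releases (not 0.9) expose mapPartitionsWithInputSplit as a developer API on HadoopRDD, which hands each partition its InputSplit:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    // Sketch: recover the file behind each partition via its InputSplit.
    // mapPartitionsWithInputSplit is a developer API and is not present in
    // every Spark release; path and types are illustrative.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///path/to/data")
    val withFile = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
      .mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
        val file = split.asInstanceOf[FileSplit].getPath.toString
        iter.map { case (_, line) => (file, line.toString) }
      }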

Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your SparkContext? It tries to serialize that many objects together at a time, which might be too much. By default the batchSize is 1024. Matei On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I don’t see the errors anymore. Thanks Aaron. On 24-Mar-2014, at 12:52 am, Aaron Davidson ilike...@gmail.com wrote: These errors should be fixed on master with Sean's PR: https://github.com/apache/spark/pull/209 The orbit errors are quite possibly due to using https instead of http,

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Aaron, thanks for replying. I am very much puzzled as to what is going on. A job that used to run on the same cluster is failing with this mysterious message about not having enough disk space when in fact I can see through "watch df -h" that the free space is always hovering around 3+GB on the

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Bleh, strike that, one of my slaves was at 100% inode utilization on the file system. It was /tmp/spark* leftovers that apparently did not get cleaned up properly after failed or interrupted jobs. Mental note - run a cron job on all slaves and master to clean up /tmp/spark* regularly. Thanks

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
Thanks for bringing this up. 100% inode utilization is an issue I haven't seen raised before, and it raises another issue which is not on our current state-cleanup roadmap (cleaning up data which was not fully cleaned up from a crashed process). On Sun, Mar 23, 2014 at 7:57 PM, Ognen

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
I would love to work on this (and other) stuff if I can bother someone with questions offline or on a dev mailing list. Ognen On 3/23/14, 10:04 PM, Aaron Davidson wrote: Thanks for bringing this up, 100% inode utilization is an issue I haven't seen raised before and this raises another issue

How many partitions is my RDD split into?

2014-03-23 Thread Nicholas Chammas
Hey there fellow Dukes of Data, How can I tell how many partitions my RDD is split into? I'm interested in knowing because, from what I gather, having a good number of partitions is good for performance. If I'm looking to understand how my pipeline is performing, say for a parallelized write out

Re: How many partitions is my RDD split into?

2014-03-23 Thread Mark Hamstra
It's much simpler: rdd.partitions.size On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hey there fellow Dukes of Data, How can I tell how many partitions my RDD is split into? I'm interested in knowing because, from what I gather, having a good number

Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
As Mark said you can actually access this easily. The main issue I've seen from a performance perspective is people having a bunch of really small partitions. This will still work but the performance will improve if you coalesce the partitions using rdd.coalesce(). This can happen for example if
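
Putting Mark's and Patrick's replies together, a small sketch (the input path, the threshold, and the target count are all placeholders):

    // Sketch: check how many partitions an RDD has, and compact it if it is
    // split into a large number of tiny partitions.
    val rdd = sc.textFile("hdfs:///path/to/input")
    println("partitions: " + rdd.partitions.size)

    val compacted =
      if (rdd.partitions.size > 1000) rdd.coalesce(200) // illustrative target
      else rdd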

Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough inodes and this was causing a "No space left on device" error? Is that correct? If so, that's good to know because it's definitely counter intuitive. On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski og...@nengoiksvelzud.com wrote: