Re: running the Terasort example

2014-12-16 Thread Ewan Higgs
Hi Tim, run-example is here: https://github.com/ehiggs/spark/blob/terasort/bin/run-example It should be in the repository that you cloned. So if you were at the top level of the checkout, run-example would be run as ./bin/run-example. Yours, Ewan Higgs On 12/12/14 01:06, Tim Harsch wrote

Re: running the Terasort example

2014-12-16 Thread Ewan Higgs
not be functioning appropriately. If you have trouble with it, I recommend using the Hadoop version. Yours, Ewan > Thanks, > Tim > > > On 12/16/14, 12:38 AM, "Ewan Higgs" wrote: > >> Hi Tim, >> run-example is here: >> https://github.com/ehiggs/spa

SparkSpark-perf terasort WIP branch

2015-01-14 Thread Ewan Higgs
Hi all, I'm trying to build the Spark-perf WIP code but there are some errors to do with Hadoop APIs. I presume this is because there is some Hadoop version set and it's referring to that. But I can't seem to find it. The errors are as follows: [info] Compiling 15 Scala sources and 2 Java sou

RDD order guarantees

2015-01-16 Thread Ewan Higgs
Hi all, Quick one: when reading files, is the order of partitions guaranteed to be preserved? I am finding some weird behaviour where I run sortByKey() on an RDD (which has 16-byte keys) and write it to disk. If I open a python shell and run the following: for part in range(29): print

Re: RDD order guarantees

2015-01-16 Thread Ewan Higgs
local file system, right? HDFS orders the files based on names, but local file systems often don't. I think that's the reason for the difference. We might be able to sort and order the partitions when we create an RDD to make this universal, though. On Fri, Jan 16, 2015 at 8:26 AM,
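
A minimal sketch of the "sort and order the partitions" idea, in plain Python rather than Spark: list the output files and sort them by name before reading, so the result does not depend on the order the local file system returns them in. The helper name `list_part_files` and the zero-padded `part-NNNNN` file names are assumptions for illustration, not taken from the thread.

```python
import os
import tempfile


def list_part_files(directory):
    """Return paths of part files in name order, whatever order the OS lists them in."""
    names = [name for name in os.listdir(directory) if name.startswith("part-")]
    # Zero-padded names (part-00000, part-00001, ...) sort correctly as strings.
    return [os.path.join(directory, name) for name in sorted(names)]


# Demo: create part files out of order in a scratch directory, then list them.
with tempfile.TemporaryDirectory() as scratch:
    for name in ["part-00002", "part-00000", "_SUCCESS", "part-00001"]:
        open(os.path.join(scratch, name), "w").close()
    ordered = [os.path.basename(path) for path in list_part_files(scratch)]

print(ordered)
```

Sorting by name only restores the partition order if the names are zero-padded to a fixed width; with unpadded names a numeric sort key would be needed instead.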

Re: RDD order guarantees

2015-01-19 Thread Ewan Higgs
ystem implementation that overrides the listStatus method, and then in Hadoop Conf set the fs.file.impl to that. Shouldn't be too hard. Would you be interested in working on it? On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs <ewan.hi...@ugent.be> wrote: Yes, I am running on

Re: Custom Cluster Managers / Standalone Recovery Mode in Spark

2015-02-01 Thread Ewan Higgs
nd [2]. Then we should be able to get slurm, pbs, and sge in one shot rather than implementing some wire formats for RPC. Thanks, Ewan Higgs [1] https://hadoop.apache.org/docs/r1.2.1/hod_scheduler.html https://github.com/glennklockwood/hpchadoop http://jaliyacgl.blogspot.be/2008/08/hadoop-as-batc

Re: Performance test for sort shuffle

2015-02-02 Thread Ewan Higgs
ing it there[1]. I put it on the back burner until someone can get back to me on it. Yours, Ewan Higgs [1] http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSpark-perf-terasort-WIP-branch-tt10105.html On 02/02/15 23:26, Kannan Rajah wrote: Is there a recommended performance test

Re: Replacing Jetty with TomCat

2015-02-19 Thread Ewan Higgs
To add to Sean and Reynold's point: Please correct me if I'm wrong, but Spark depends on hadoop-common which also uses jetty in the HttpServer2 code. So even if you remove jetty from Spark by making it an optional dependency, it will be pulled in by Hadoop. So you'll still see that your prog

Fwd: SparkSpark-perf terasort WIP branch

2015-03-06 Thread Ewan Higgs
WIP branch Date: Wed, 14 Jan 2015 14:33:45 +0100 From: Ewan Higgs To: dev@spark.apache.org Hi all, I'm trying to build the Spark-perf WIP code but there are some errors to do with Hadoop APIs. I presume this is because there is some Hadoop version set and it's referring to t

Tungsten + Flink

2015-04-29 Thread Ewan Higgs
Hi all, A quick question about Tungsten. The announcement of the Tungsten project is on the back of Hadoop Summit in Brussels where some of the Flink devs were giving talks [1] on how Flink manages memory using byte arrays and the like to avoid the overhead of all the Java types[2]. Is there a

Re: Tungsten + Flink

2015-05-01 Thread Ewan Higgs
both Flink and Spark into one. This eases industry adoption instead. Thanking you. With Regards Sree On Wednesday, April 29, 2015 3:21 AM, Ewan Higgs wrote: Hi all, A quick question about Tungsten. The announcement of the Tungsten project is on the back of Hadoop Summit in Brussels whe

Re: Multi-Line JSON in SparkSQL

2015-05-05 Thread Ewan Higgs
FWIW, CSV has the same problem that renders it immune to naive partitioning. Consider the following RFC 4180 compliant record: 1,2," all,of,these,are,just,one,field ",4,5 Now, it's probably a terrible idea to give a file system awareness of actual file types, but couldn't HDFS handle this near
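
To illustrate the point about naive partitioning, here is a small Python sketch. It assumes the quoted field in the record above contains line breaks (the archive snippet flattens whitespace, so this is a guess at the original layout). Splitting on newlines cuts the record into three fragments, while an RFC 4180 aware parser returns a single five-field row.

```python
import csv
import io

# Assumed layout: the third field is quoted and spans three physical lines.
record = '1,2,"\nall,of,these,are,just,one,field\n",4,5\n'

# Naive approach: treat every physical line as a record and split on commas.
naive_rows = [line.split(",") for line in record.splitlines()]

# Quote-aware approach: csv.reader keeps reading until the quoted field is closed.
parsed_rows = list(csv.reader(io.StringIO(record, newline="")))

print(len(naive_rows), "fragments from naive splitting")
print(len(parsed_rows), "record from csv.reader with", len(parsed_rows[0]), "fields")
```

A reader that starts at an arbitrary byte offset faces the same ambiguity as the naive split: without scanning from the start of the file it cannot tell whether a newline sits inside a quoted field.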

Terasort example

2014-11-11 Thread Ewan Higgs
helped me get through learning some rudimentary Scala to get this far. Yours, Ewan Higgs [1] https://github.com/apache/spark/pull/1242

Re: Terasort example

2014-11-11 Thread Ewan Higgs
great. I think the consensus from last time was that we would put performance stuff into spark-perf, so it is easy to test different Spark versions. On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote: Hi all, I saw that Reynold Xin had a Terasort e