setting partitioners with hadoop rdds

2014-01-27 Thread Imran Rashid
Hi, I'm trying to figure out how to get partitioners to work correctly with hadoop rdds, so that I can get narrow dependencies & avoid shuffling. I feel like I must be missing something obvious. I can create an RDD with a partitioner of my choosing, shuffle it and then save it out to hdfs. Bu
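
[A minimal sketch of the idea under discussion, assuming the standard pair-RDD API of the time; all names and paths below are illustrative, not the poster's actual code:

    import org.apache.spark.{HashPartitioner, SparkContext}
    import org.apache.spark.SparkContext._ // implicit pair-RDD functions

    // Give two pair RDDs the same partitioner so a join between them
    // becomes a narrow dependency instead of a shuffle dependency.
    val sc = new SparkContext("local", "partitioner-demo")
    val partitioner = new HashPartitioner(10)

    val left  = sc.parallelize(1 to 100).map(i => (i % 10, i)).partitionBy(partitioner)
    val right = sc.parallelize(1 to 100).map(i => (i % 10, i * 2)).partitionBy(partitioner)

    // Co-partitioned inputs: this join needs no further shuffle.
    val joined = left.join(right)
    joined.saveAsTextFile("hdfs:///tmp/joined") // placeholder path

The open question in the thread is what happens after reading such data back from HDFS, since a freshly loaded hadoop RDD carries no partitioner.]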

Re: Cannot get Hadoop dependencies

2014-01-27 Thread Kal El
Well, it seems that 0.20.2 is actually the latest version (2.2.0) I have the following problem: In build.sbt I have this: libraryDependencies ++= Seq(   ("org.apache.spark" %% "spark-core" % "0.8.0-incubating").     exclude("org.mortbay.jetty", "servlet-api").     exclude("commons-beanutils", "com
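
[For reference, the sbt exclude chain being quoted looks like this when completed; the second exclusion is cut off above, so the artifact name given for it here is a hypothetical completion:

    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-core" % "0.8.0-incubating").
        exclude("org.mortbay.jetty", "servlet-api").
        exclude("commons-beanutils", "commons-beanutils-core") // hypothetical completion of the truncated exclude
    )
]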

Re: What could be the cause of this Streaming error

2014-01-27 Thread Khanderao kand
Scala version changed in 0.9.0 to Scala 2.10 Are you using the same version? On Tue, Jan 28, 2014 at 11:30 AM, Ashish Rangole wrote: > Hi, > > I am seeing the following error message when I began testing my Streaming > application locally. Could it be due to a mismatch with > old spark jars som

What could be the cause of this Streaming error

2014-01-27 Thread Ashish Rangole
Hi, I am seeing the following error message when I began testing my Streaming application locally. Could it be due to a mismatch with old spark jars somewhere or is this something else? Thanks, Ashish SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/Users/my

SparkStreaming not read hadoop configuration from its sparkContext on Stand Alone mode?

2014-01-27 Thread robin_up
Hi I am trying to run a small piece of code on Spark Streaming. It sets the s3 keys on the sparkContext object, which is then passed into a sparkStreaming object. However, I got the below error -- it seems StreamingContext did not use the hadoop config on worker threads. It works ok if I run it in spark core (batch mode
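
[A hedged sketch of the setup being described, with placeholder credentials, master URL, and paths: the s3n keys go into the SparkContext's Hadoop configuration, and the StreamingContext is built on top of that context.

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("spark://master:7077", "s3-streaming-demo")
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    val ssc = new StreamingContext(sc, Seconds(10))
    // Per the report, this works in batch mode but the configuration may
    // not reach worker threads in streaming on a standalone cluster.
    val lines = ssc.textFileStream("s3n://my-bucket/input/")
    lines.print()
    ssc.start()
]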

Re: NoSuchMethodError: org.apache.commons.io.IOUtils.closeQuietly with cdh4 binary

2014-01-27 Thread kamatsuoka
The version of commons-io included in the Spark assembly is an old one, which doesn't have the version of closeQuietly that takes a Closeable: $ javap -cp /root/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.0.0-mr1-cdh4.2.0.jar org.apache.commons.io.IOUtils Compil
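
[Since the Closeable overload of closeQuietly only exists in commons-io 2.0+, one workaround is to avoid it entirely; a minimal sketch of a drop-in helper:

    import java.io.{Closeable, IOException}

    // Close any Closeable quietly, without depending on which commons-io
    // version happens to win on the assembly classpath.
    def closeQuietly(c: Closeable): Unit = {
      if (c != null) {
        try c.close() catch { case _: IOException => () }
      }
    }
]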

Sporadic "IOException: Class not found" in ClosureCleaner

2014-01-27 Thread John Salvatier
I am sporadically getting this error in the process of loading a couple of data files. It's frequent but not consistent. I tried executing `ulimit -n 16000` before running the script (as recommended here), but this didn't s

Re: real world streaming code

2014-01-27 Thread dachuan
thanks, Ryan. I will study Algebird first and try to adapt TopKMonoid to a spark streaming program. On Mon, Jan 27, 2014 at 2:54 PM, Ryan Weald wrote: > Hi dachuan, > > Getting top-k up and running using spark streaming is actually very easy > using Twitter's Algebird project. I gave a presentati

Re: Cannot get Hadoop dependencies

2014-01-27 Thread Jey Kottalam
I believe that Hadoop 0.20.2 is too old for compatibility with Spark. The hadoop-client dependency is available in the 0.23.x, 1.0.x, and newer releases, but not in the 0.20.x releases. Source: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client On Mon, Jan 27, 2014 at 6:20 AM, 尹绪森

Re: s3n > 5GB

2014-01-27 Thread kamatsuoka
Thanks for the replies. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/s3n-5GB-tp943p967.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: real world streaming code

2014-01-27 Thread Ryan Weald
Hi dachuan, Getting top-k up and running using spark streaming is actually very easy using Twitter's Algebird project. I gave a presentation recently at a spark user meetup that went through an example of using algebird in a spark streaming job. You can find the video and slides here - http://isurf
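
[A rough sketch of the pattern Ryan describes, assuming Algebird's TopKMonoid as it existed around that time; treat the constructor, build, and ordering direction as illustrative and verify against your Algebird version:

    import com.twitter.algebird.TopKMonoid
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair DStream ops

    val k = 10
    // Reversed ordering so the largest counts are the ones retained.
    val monoid = new TopKMonoid[(Int, String)](k)(Ordering[(Int, String)].reverse)

    val ssc = new StreamingContext("local[2]", "topk-demo", Seconds(5))
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    // Partial top-k lists combine associatively via the monoid's plus(),
    // so the fold works correctly across partitions.
    counts.foreachRDD { rdd =>
      val topK = rdd.map { case (w, c) => monoid.build((c, w)) }
                    .fold(monoid.zero)(monoid.plus)
      println(topK.items)
    }
    ssc.start()
]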

Re: What I am missing from configuration?

2014-01-27 Thread Matei Zaharia
Hi Dana, I think the problem is that your simple.sbt does not add a dependency on hadoop-client for CDH4, so you get a different version of the Hadoop library on your driver application compared to the cluster. Try adding a dependency on hadoop-client version 2.0.0-mr1-cdh4.X.X for your version
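
[For instance, a build.sbt along these lines; the CDH4 minor version is a placeholder exactly as in Matei's reply, and the Cloudera repository must be on the resolver list since CDH artifacts are not in Maven Central:

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.1-incubating"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.0.0-mr1-cdh4.X.X"

    resolvers += "Cloudera repo" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
]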

Re: s3n > 5GB

2014-01-27 Thread Ryan Weald
I have run Hadoop + spark jobs on large s3n files without an issue. That being said, if you have very large files you might want to consider using s3:// instead, as that uses an HDFS-block-format-compatible storage, which means you can more effectively split your large file between map tasks. In my e
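
[In code the difference is only the URI scheme; a small sketch with placeholder buckets and paths:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "s3-demo")

    // s3n: each object is read as a single native file; very large objects
    // split less effectively across map tasks.
    val native = sc.textFile("s3n://my-bucket/big-file.txt")

    // s3: HDFS block-format storage layered over S3, so large inputs split
    // more evenly (the data must have been written in this block format).
    val blocked = sc.textFile("s3://my-bucket/big-data")
]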

Re: Purpose of the HTTP Server?

2014-01-27 Thread Mark Hamstra
Used for broadcast variables and to distribute files or jars to worker nodes. See HttpBroadcast.scala and SparkContext.scala
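
[For example, a broadcast like the one below is what gets served to the executors over HTTP; a minimal sketch, applicable when HttpBroadcast is the configured broadcast implementation:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "broadcast-demo")

    // The broadcast value is fetched by executors from the driver's HTTP server.
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    val mapped = sc.parallelize(Seq(1, 2, 1)).map(i => lookup.value(i))
    println(mapped.collect().mkString(","))
]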

updateStateByKey Question

2014-01-27 Thread Craig Vanderborgh
Hello, I'm a big fan of updateStateByKey(), have been using it for a year, and now need to push the envelope again. My question is simply this: Can I use updateStateByKey in the following way? states = events.updateStateByKey[State](firstUpdateFcn) states2 = events.updateStateByKey[State](se
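
[A minimal sketch of the chaining being asked about, with hypothetical update functions standing in for firstUpdateFcn and secondUpdateFcn:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // stateful pair ops

    val ssc = new StreamingContext("local[2]", "state-demo", Seconds(5))
    ssc.checkpoint("/tmp/state-demo") // stateful operators require checkpointing

    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    // Illustrative update functions: a running sum and a running max.
    def firstUpdateFcn(values: Seq[Int], state: Option[Int]): Option[Int] =
      Some(values.sum + state.getOrElse(0))
    def secondUpdateFcn(values: Seq[Int], state: Option[Int]): Option[Int] =
      Some(math.max(values.sum, state.getOrElse(0)))

    // Two independent state streams over the same events, as in the question.
    val states  = events.updateStateByKey[Int](firstUpdateFcn _)
    val states2 = events.updateStateByKey[Int](secondUpdateFcn _)
    states.print(); states2.print()
    ssc.start()
]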

Re: Purpose of the HTTP Server?

2014-01-27 Thread Heiko Braun
Yes, I've seen the one used for the UI. But there are also the HttpServer and HttpFileServer. Those are the ones I was wondering about. /Heiko On 27 Jan 2014, at 15:18, Cheng Lian wrote: > It's used for the Web UI. By default, you may access http://localhost:8080 > to view the cluster informa

Problems while moving from 0.8.0 to 0.8.1

2014-01-27 Thread Archit Thakur
Hi, The implementation of the aggregation logic changed in 0.8.1 (Aggregator.scala). It now uses AppendOnlyMap, as compared to java.util.HashMap in the 0.8.0 release. Aggregator.scala def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]]): Iterator[(K, C)] = { val combiners = new Appe
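
[In simplified form, the combiner loop being discussed looks roughly like this; a paraphrase using a plain mutable map, not the verbatim 0.8.1 AppendOnlyMap code:

    // 0.8.0 used java.util.HashMap with an explicit get/put per record;
    // 0.8.1's AppendOnlyMap instead updates entries in place through a
    // (hadValue, oldValue) => newValue callback. The logical effect is:
    def combineValuesByKey[K, V, C](iter: Iterator[(K, V)],
                                    createCombiner: V => C,
                                    mergeValue: (C, V) => C): Iterator[(K, C)] = {
      val combiners = scala.collection.mutable.HashMap.empty[K, C]
      for ((k, v) <- iter) {
        combiners(k) = combiners.get(k) match {
          case Some(c) => mergeValue(c, v)  // merge into existing combiner
          case None    => createCombiner(v) // first value seen for this key
        }
      }
      combiners.iterator
    }
]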

Re: Cannot get Hadoop dependencies

2014-01-27 Thread 尹绪森
http://www.scala-sbt.org/release/docs/Getting-Started/Library-Dependencies This document might be useful. You should make sure that you specified the package with the right URI, and that the repo is added to the resolvers. On 2014-1-27 at 9:24 PM, "Kal El" wrote: > I am having some trouble with Hadoop. I cannot build my p

Re: Purpose of the HTTP Server?

2014-01-27 Thread Cheng Lian
It's used for the Web UI. By default, you may access http://localhost:8080 to view the cluster information as well as individual job details. On Mon, Jan 27, 2014 at 10:15 PM, Heiko Braun wrote: > > > Can someone briefly explain the purpose of the HTTP server in spark? > Is it related to shi

Purpose of the HTTP Server?

2014-01-27 Thread Heiko Braun
Can someone briefly explain the purpose of the HTTP server in spark? Is it related to shipping the jars around in a cluster? Regards, Heiko

Cannot get Hadoop dependencies

2014-01-27 Thread Kal El
I am having some trouble with Hadoop. I cannot build my project with sbt. According to the documentation, I added a line like this in my build.sbt file: libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>" -- my line being: libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "0.2

Re: Inaccurate Estimates from LinearRegressionWithSGD

2014-01-27 Thread Sean Owen
This fix from 8 days ago might be related: https://github.com/apache/incubator-spark/pull/459 If you are not building from HEAD, I might try again with that, or wait for the 0.9 release that will contain it. May not be the cause. On Mon, Jan 27, 2014 at 1:35 AM, herbps10 wrote: > Hello, > > I