Spark - Loading in data from CSVs and Postgres

2013-10-17 Thread Victor Hooi
Hi, *NB: I originally posted this to the Google Group, before I saw the message about how we're moving to the Apache Incubator mailing list.* I'm new to Spark, and I wanted to get some advice on the best way to load our data into it: 1. A CSV file generated each day, which contains user click
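
(For context, a minimal sketch of loading such a CSV with plain RDD operations; the file path, master URL, and column layout below are hypothetical, not taken from the original message.)

  import org.apache.spark.SparkContext

  val sc = new SparkContext("local[2]", "ClickLoader")
  // read the daily CSV as text lines and split each line into fields
  val clicks = sc.textFile("/data/clicks-2013-10-17.csv").map(_.split(","))
  // e.g. count distinct users, assuming (hypothetically) the user id is the first column
  val users = clicks.map(fields => fields(0)).distinct().count()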

help on SparkContext.sequenceFile()

2013-10-17 Thread Shay Seng
Hey gurus, I'm having a little trouble deciphering the docs for sequenceFile[K, V](path: String, minSplits: Int = defaultMinSplits)(implicit km: ClassManifest[K], vm: ClassManifest[V],
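
(A minimal usage sketch for reference; the path and key/value types here are placeholders. The implicit ClassManifest and WritableConverter arguments are filled in by the compiler once the SparkContext implicits are imported.)

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._  // brings the implicit WritableConverters into scope

  val sc = new SparkContext("local", "SeqFileExample")
  // K = String, V = Int; minSplits falls back to defaultMinSplits
  val pairs = sc.sequenceFile[String, Int]("hdfs:///data/pairs.seq")
  pairs.take(5).foreach(println)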

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
sorry one more related question: i compile against a spark build for hadoop 1.0.4, but the actual installed version of spark is built against cdh4.3.0-mr1. this also used to work, and i prefer to do this so i compile against a generic spark build. could this be the issue? On Thu, Oct 17, 2013 at

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
i have my spark and hadoop related dependencies as "provided" for my spark job. this used to work with previous versions. are these now supposed to be compile/runtime/default dependencies? On Thu, Oct 17, 2013 at 8:04 PM, Koert Kuipers wrote: > yes i did that and i can see the correct jars sitt

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
yes i did that and i can see the correct jars sitting in lib_managed On Thu, Oct 17, 2013 at 7:56 PM, Matei Zaharia wrote: > Koert, did you link your Spark job to the right version of HDFS as well? > In Spark 0.8, you have to add a Maven dependency on "hadoop-client" for > your version of Hadoop

Re: spark 0.8

2013-10-17 Thread Matei Zaharia
Koert, did you link your Spark job to the right version of HDFS as well? In Spark 0.8, you have to add a Maven dependency on "hadoop-client" for your version of Hadoop. See http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala for example. Matei On Oct 17, 2
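
(A sketch of the sbt equivalent; the Hadoop version string is a placeholder for whatever the cluster runs, and the "provided" scope matches the setup Koert describes above, keeping Spark itself out of the assembled job jar.)

  // build.sbt (sketch, not from the original thread)
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "0.8.0-incubating" % "provided",
    "org.apache.hadoop" % "hadoop-client" % "<your-hadoop-version>"
  )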

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
i got the job a little further along by also setting this: System.setProperty("spark.closure.serializer", "org.apache.spark.serializer.KryoSerializer") not sure why i need to... but anyhow, now my workers start and then they blow up on this: 13/10/17 19:22:57 ERROR Executor: Uncaught exception in

Re: spark 0.8

2013-10-17 Thread dachuan
thanks, Mark. On Thu, Oct 17, 2013 at 6:36 PM, Mark Hamstra wrote: > SNAPSHOTs are not fixed versions, but are floating names associated with > whatever is the most recent code. So, Spark 0.8.0 is the current released > version of Spark, which is exactly the same today as it was yesterday, and

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
even the simplest job will produce these EOFException errors in the workers. for example: object SimpleJob extends App { System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer") val sc = new SparkContext("spark://master:7077", "Simple Job", "/usr/local/lib/spark",
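
(The snippet above is cut off; a complete minimal version would look roughly like this. The master URL, Spark home, and serializer setting come from the truncated text, while the jar path and the action are placeholders, so treat it as a sketch rather than the exact code.)

  import org.apache.spark.SparkContext

  object SimpleJob extends App {
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext("spark://master:7077", "Simple Job",
      "/usr/local/lib/spark", Seq("target/scala-2.9.3/simple-job.jar"))
    // trivial action just to exercise the executors
    println(sc.parallelize(1 to 100).reduce(_ + _))
    sc.stop()
  }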

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
oh well i spoke too soon. spark-shell still works but any scala/java program still throws this error. On Thu, Oct 17, 2013 at 6:18 PM, dachuan wrote: > yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got.. > what's the difference? I mean SNAPSHOT and non-SNAPSHOT. > > > On Thu, O

Re: spark 0.8

2013-10-17 Thread Mark Hamstra
SNAPSHOTs are not fixed versions, but are floating names associated with whatever is the most recent code. So, Spark 0.8.0 is the current released version of Spark, which is exactly the same today as it was yesterday, and will be the same thing forever. Spark 0.8.1-SNAPSHOT is whatever is current

Re: building spark

2013-10-17 Thread dachuan
try the spark-0.8.0-incubating/ folder first. My guess. On Thu, Oct 17, 2013 at 6:17 PM, Umar Javed wrote: > But I don't know what folder to apply chmod to. How do I know that? > > > On Thu, Oct 17, 2013 at 3:13 PM, dachuan wrote: >> Permission denied, and it happened in OutputStream, so I assum

Re: spark 0.8

2013-10-17 Thread dachuan
yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got.. what's the difference? I mean SNAPSHOT and non-SNAPSHOT. On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra wrote: > Of course, you mean 0.9.0-SNAPSHOT. There is no Spark 0.9.0, and won't be > for several months. > > > > On Thu, Oc

Re: building spark

2013-10-17 Thread Umar Javed
But I don't know what folder to apply chmod to. How do I know that? On Thu, Oct 17, 2013 at 3:13 PM, dachuan wrote: > Permission denied, and it happened in OutputStream, so I assume you don't > have write permission on local file system, or at least in some folder. how > about trying chmod cmd.

Re: spark 0.8

2013-10-17 Thread Mark Hamstra
Of course, you mean 0.9.0-SNAPSHOT. There is no Spark 0.9.0, and won't be for several months. On Thu, Oct 17, 2013 at 3:11 PM, dachuan wrote: > I'm sorry if this doesn't answer your question directly, but I have tried > spark 0.9.0 and hdfs 1.0.4 just now, it works.. > > > On Thu, Oct 17, 201

Re: building spark

2013-10-17 Thread dachuan
Permission denied, and it happened in OutputStream, so I assume you don't have write permission on the local file system, or at least in some folder. how about trying the chmod cmd. On Thu, Oct 17, 2013 at 6:07 PM, Umar Javed wrote: > Hi, > > I'm trying to build spark 0.8.0 using 'sbt assembly'. But th

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
ah please disregard my question. i screwed up a single configuration setting in spark! oh well that was 3 hours of fun On Thu, Oct 17, 2013 at 6:11 PM, dachuan wrote: > I'm sorry if this doesn't answer your question directly, but I have tried > spark 0.9.0 and hdfs 1.0.4 just now, it works.. >

Re: spark 0.8

2013-10-17 Thread dachuan
I'm sorry if this doesn't answer your question directly, but I have tried spark 0.9.0 and hdfs 1.0.4 just now, it works.. On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote: > after upgrading from spark 0.7 to spark 0.8 i can no longer access any > files on HDFS. > i see the error below. any

building spark

2013-10-17 Thread Umar Javed
Hi, I'm trying to build spark 0.8.0 using 'sbt assembly'. But the build is not successful and I get this at the end: [info] SHA-1: 461c20e8753b438e9d5df194de3d9f77ea2ea918 [info] Packaging /proj/UW-PCP/spark-0.8.0-incubating/examples/target/scala-2.9.3/spark-examples-assembly-0.8.0-incubating.jar

spark 0.8

2013-10-17 Thread Koert Kuipers
after upgrading from spark 0.7 to spark 0.8 i can no longer access any files on HDFS. i see the error below. any ideas? i am running spark standalone on a cluster that also has CDH4.3.0 and rebuilt spark accordingly. the jars in lib_managed look good to me. i noticed similar errors in the mailing

Re: job reports as KILLED in standalone mode

2013-10-17 Thread Jey Kottalam
You can try calling the "close()" method on your SparkContext, which should allow for a cleaner shutdown. On Thu, Oct 17, 2013 at 2:38 PM, Ameet Kini wrote: > > I'm using the scala 2.10 branch of Spark in standalone mode, and am seeing > the job reports itself as KILLED in the UI with the below m
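
(For reference, the shutdown method on SparkContext in the released API is stop(); a minimal sketch with placeholder names:)

  val sc = new SparkContext("spark://master:7077", "MyJob")
  try {
    println(sc.parallelize(1 to 1000).count())
  } finally {
    sc.stop()  // shut the context down cleanly before the driver JVM exits
  }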

job reports as KILLED in standalone mode

2013-10-17 Thread Ameet Kini
I'm using the scala 2.10 branch of Spark in standalone mode, and am seeing that the job reports itself as KILLED in the UI with the below message in each of the executor logs, even though the job processes correctly and returns the correct result. The job is triggered by a .count on an RDD and the count

Re: Persist to disk causes memory problems

2013-10-17 Thread Kyle Ellrott
Looking at the behavior of the program, and reading the code, I think that Spark does the flatMap of all of the elements in its local partition, then serializes the results altogether at the end. At that point it realizes it doesn't have enough memory, so it starts a store to disk, but by that poi
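
(The pattern under discussion looks roughly like this; 'input' and 'expand' are hypothetical stand-ins for the actual RDD and fan-out function.)

  import org.apache.spark.storage.StorageLevel

  // each input record fans out into many records, and the result is kept on disk only
  val expanded = input.flatMap(expand).persist(StorageLevel.DISK_ONLY)
  expanded.count()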

Re: executor memory in standalone mode stays at default 512MB

2013-10-17 Thread dachuan
maybe my question is not exactly the same, but I am also confused by the memory allocation strategy in spark. I was playing with SparkPageRank.scala on a very tiny input (6 lines); after 20 iterations, I get java.lang.StackOverflowError. I am using the 0.9.0 version of spark, and scala 2.9.3. Any mater

Re: pyspark memory usage

2013-10-17 Thread Matei Zaharia
Hi there, I'm not sure I understand your problem -- is it that Spark used *less* memory than the 2 GB? That out of memory message seems to be from your operating system, so maybe there were other things using RAM on that machine, or maybe Linux is configured to kill tasks quickly when the memor

Re: Kafka dependency issues

2013-10-17 Thread Matei Zaharia
Yeah, good point. I think that in the future we'll move to having smaller "plugin" projects for various input sources (e.g. spark-kafka, spark-rabbitmq). Matei On Oct 15, 2013, at 9:50 AM, Ryan Weald wrote: > Thanks for the response. It is interesting that you must manually specify the > Kafk

executor memory in standalone mode stays at default 512MB

2013-10-17 Thread Ameet Kini
I'm using the scala 2.10 branch of Spark in standalone mode, and am finding that the executor gets started with the default 512M even after setting spark.executor.memory to 6G. This leads to my job getting an OOM. I've tried setting spark.executor.memory both programmatically (using System.setPrope
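
(For reference, a minimal sketch: in standalone mode of this era the property has to be set before the SparkContext is constructed; the master URL and memory value are placeholders.)

  // must run before new SparkContext(...), otherwise executors keep the 512m default
  System.setProperty("spark.executor.memory", "6g")
  val sc = new SparkContext("spark://master:7077", "BigJob")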

Re: Spark and Juju

2013-10-17 Thread Maarten Ectors
Hi Andre, Docker solves the packaging; Juju, the orchestration and integration. Juju and Docker are talking about integration between the two projects. What Juju can offer is an instant blueprint solution for BDAS, the ability to scale it to hundreds of nodes, and instant integration of monitoring like Ganglia

Re: Spark and Juju

2013-10-17 Thread Andre Schumacher
Hi Maarten, there is an effort underway to publish the stack as Docker images, which should take away most of the pain of getting a Spark or Shark cluster up for development and testing. Not being familiar with Juju, I understand that it kind of solves a different problem (more about managing deployment

Spark and Juju

2013-10-17 Thread Maarten Ectors
Hi all, Juju is one of the latest Ubuntu open source products. Juju allows anybody to instantly deploy, integrate and scale Hadoop, HBase, Hive, Mongo, Storm, Redis, Solr, and a hundred more solutions, called Charms. Try it yourself on jujucharms.com. We are missing the Am

Re: Another instance of Derby may have already booted the database /opt/spark/shark/bin/metastore_db.

2013-10-17 Thread Reynold Xin
Yes - you can configure MySQL as the metastore in Hive, and Shark should pick it up: https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin Make sure you have the MySQL connector jar in hive/lib. On Tue, Oct 15, 2013 at 3:05 AM, vinayak navale wrote: > Hi, > > i am getting