Re: Spark app gets slower as it gets executed more times

2014-02-04 Thread Aureliano Buendia
1, 2014 at 8:27 PM, 尹绪森 wrote: > >> Is your spark app an iterative one? If so, your app is creating a big >> DAG in every iteration. You should checkpoint it periodically, say, 10 >> iterations per checkpoint. >> >> >> 2014-02-01 Aureliano Buendia
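
A minimal sketch of the periodic-checkpoint pattern suggested above, in Scala against the 0.8-era API. The update step, checkpoint directory and iteration counts are hypothetical placeholders:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  object IterativeCheckpointSketch {
    // Stand-in for whatever transformation the iterative app applies each round.
    def updateStep(rdd: RDD[Double]): RDD[Double] = rdd.map(_ * 0.9)

    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "checkpoint-sketch")
      sc.setCheckpointDir("/tmp/spark-checkpoints")   // any HDFS or local directory

      var data = sc.parallelize(1 to 1000000).map(_.toDouble)
      for (i <- 1 to 100) {
        data = updateStep(data)
        if (i % 10 == 0) {      // every 10 iterations, cut the lineage
          data.checkpoint()
          data.count()          // force an action so the checkpoint is written now
        }
      }
      println(data.reduce(_ + _))
      sc.stop()
    }
  }

Checkpointing truncates the ever-growing DAG, so later iterations no longer carry the full lineage around.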

Re: Python API Performance

2014-02-01 Thread Aureliano Buendia
A much (much) better solution than python (and also scala, if that doesn't make you upset) is julia. Libraries like numpy and scipy are bloated when compared with julia's c-like performance. Julia comes with everything that numpy+scipy come with + more - performance hit. I h

Re: Python API Performance

2014-02-01 Thread Aureliano Buendia
On Thu, Jan 30, 2014 at 7:51 PM, Evan R. Sparks wrote: > If you just need basic matrix operations - Spark is dependent on JBlas ( > http://mikiobraun.github.io/jblas/) to have access to quick linear > algebra routines inside of MLlib and graphx. Jblas does a nice job of > avoiding boxing/unboxing

Spark app gets slower as it gets executed more times

2014-01-31 Thread Aureliano Buendia
Hi, I've noticed my spark app (on ec2) gets slower and slower as I repeatedly execute it. With a fresh ec2 cluster, it is snappy and takes about 15 mins to complete; after running the same app 4 times it gets slower and takes 40 mins or more. While the cluster gets slower, the monitoring met

Re: Spark does not retry failed tasks initiated by hadoop

2014-01-22 Thread Aureliano Buendia
s can be detected in the log. > > On Wed, Jan 22, 2014 at 4:04 PM, Aureliano Buendia > wrote: > > Hi, > > > > I've written about this issue before, but there was no reply. > > > > It seems when a task fails due to hadoop io errors, spark does no

Re: Using persistent hdfs on spark ec2 instanes

2014-01-22 Thread Aureliano Buendia
Spark needs a configuration to work with the persistent hdfs. Of course, as you mentioned, the other way is to change the persistent port to 9000. > > On Wed, Jan 22, 2014 at 4:36 PM, Aureliano Buendia > wrote: > > persistent-hdfs server is set to port 9010, instead of 9000. Does spark

Re: Union of 2 RDD's only returns the first one

2014-01-22 Thread Aureliano Buendia
Keep in mind > this will only work if you coalesce into a single partition. > Thanks! I'll give this a try. > > myRdd.coalesce(1) > .map(_.mkString(","))) > .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator) > .saveAsTextFile("
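
A self-contained sketch of the header-prepending approach quoted above (the column names and sample rows are hypothetical; as noted in the thread, this only behaves as expected when the data is coalesced into a single partition):

  import org.apache.spark.SparkContext

  object CsvHeaderSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "csv-header-sketch")
      // Hypothetical data: each row is a tuple of three columns.
      val myRdd = sc.parallelize(Seq((1, 2, 3), (4, 5, 6)))

      myRdd
        .coalesce(1)                                                    // one partition => one output file
        .map { case (a, b, c) => Seq(a, b, c).mkString(",") }
        .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator)    // prepend the header line
        .saveAsTextFile("out.csv")

      sc.stop()
    }
  }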

Re: Using persistent hdfs on spark ec2 instanes

2014-01-22 Thread Aureliano Buendia
persistent-hdfs server is set to port 9010, instead of 9000. Does spark need more config for this? On Thu, Jan 23, 2014 at 12:26 AM, Patrick Wendell wrote: > > 1. It seems by default spark ec2 uses ephemeral hdfs, how to switch this > to > > persistent hdfs? > You can stop the ephemeral one using

Spark does not retry failed tasks initiated by hadoop

2014-01-22 Thread Aureliano Buendia
Hi, I've written about this issue before, but there was no reply. It seems that when a task fails due to hadoop io errors, spark does not retry that task; it only reports it as a failed task and carries on with the other tasks. As an example: WARN ClusterTaskSetManager: Loss was due to java.io.IOException

Using persistent hdfs on spark ec2 instanes

2014-01-22 Thread Aureliano Buendia
Hi, 1. It seems by default spark ec2 uses ephemeral hdfs; how do I switch this to persistent hdfs? 2. By default the persistent hdfs server is not up; is it meant to be like this? Unless I'm missing something, the docs ( https://spark.incubator.apache.org/docs/0.8.1/ec2-scripts.html) do not point to

Re: Quality of documentation (rant)

2014-01-22 Thread Aureliano Buendia
I have to second this. Spark documentations make a lot of non-obvious assumptions. On top of this, when asking a question in the mailing list, you are often referred to those documentations by the developers. On Sun, Jan 19, 2014 at 12:52 PM, Ognen Duzlevski wrote: > Hello, > > I have been tryi

Union of 2 RDD's only returns the first one

2014-01-22 Thread Aureliano Buendia
Hi, I'm trying to find a way to create a csv header when using saveAsTextFile, and I came up with this: (sc.makeRDD(Array("col1,col2,col3"), 1) ++ myRdd.coalesce(1).map(_.mkString(","))) .saveAsTextFile("out.csv") But it only saves the header part. Why is it that the union method does not ret

Re: How to perform multi dimensional reduction in spark?

2014-01-21 Thread Aureliano Buendia
Surprisingly, this turned out to be more complicated than what I expected. I had the impression that this would be trivial in spark. Am I missing something here? On Tue, Jan 21, 2014 at 5:42 AM, Aureliano Buendia wrote: > Hi, > > It seems spark does not support nested RDD's, so I

How to perform multi dimensional reduction in spark?

2014-01-20 Thread Aureliano Buendia
Hi, It seems spark does not support nested RDD's, so I was wondering how spark can handle multi dimensional reductions. As an example, consider a dataset with these rows: ((i, j, k), value) where i, j and k are long indexes, and value is a double. How is it possible to first reduce the above rdd o
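
A minimal sketch of one way to do this without nested RDDs: re-key the rows by the dimensions you want to keep and combine with reduceByKey (the data and the sum combiner are hypothetical):

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._   // brings in reduceByKey and friends for pair RDDs

  object MultiDimReduceSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "multi-dim-reduce-sketch")

      // Hypothetical rows of the form ((i, j, k), value).
      val rows = sc.parallelize(Seq(
        ((0L, 0L, 0L), 1.0), ((0L, 0L, 1L), 2.0),
        ((0L, 1L, 0L), 3.0), ((1L, 0L, 0L), 4.0)))

      // Reduce over k: keep (i, j) as the key and combine the values.
      val overK = rows.map { case ((i, j, k), v) => ((i, j), v) }.reduceByKey(_ + _)

      // Reduce over j and k: keep only i as the key.
      val overJK = rows.map { case ((i, j, k), v) => (i, v) }.reduceByKey(_ + _)

      overK.collect().foreach(println)
      overJK.collect().foreach(println)
      sc.stop()
    }
  }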

Spark does not retry a failed task due to hdfs io error

2014-01-16 Thread Aureliano Buendia
Hi, When writing many files on s3 HDFS, at a rate of 1 in 1,000, spark tasks fail due to this error: java.io.FileNotFoundException: File does not exist: /tmp/... This is probably initiated from this bug. While this is a hadoop bug, the probl

Re: Controlling hadoop block size

2014-01-16 Thread Aureliano Buendia
might be an operation/RDD that already does the same, I am not > aware of it as of now. Please let me know, if you are able to figure it out. > > Thanks and Regards, > Archit Thakur. > > > On Tue, Jan 14, 2014 at 11:51 PM, Aureliano Buendia > wrote: > >> >> &

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-14 Thread Aureliano Buendia
, but even with doing that, it is not clear if that set of parameters is the optimal way of using the resources. Spark probably could automate this as much as possible. > Sent while mobile. Pls excuse typos etc. > On Jan 14, 2014 9:27 AM, "Aureliano Buendia" wrote:

Re: Controlling hadoop block size

2014-01-14 Thread Aureliano Buendia
en the > method getPartitions - Check that. > If not, you might have used an operation where you specify your > partitioner or no. of output partitions, eg. groupByKey() - Check that. > > "How is it possible to control the block size by spark?" Do you mean "How >

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-14 Thread Aureliano Buendia
> On Thu, Jan 9, 2014 at 11:27 PM, Aureliano Buendia > wrote: > >> The java command worked when I set SPARK_HOME and SPARK_EXAMPLES_JAR >> values. >> >> There are many issues regarding the Initial job has not accepted any >> resources... error though:

Controlling hadoop block size

2014-01-13 Thread Aureliano Buendia
Hi, Does the output hadoop block size depend on the number of spark tasks? In my application, when the number of tasks increases, the hadoop block size decreases. For a big number of tasks, I get a very high number of 1 MB files generated by saveAsSequenceFile(). How is it possible to control the block siz
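
One thing that is easy to check here: saveAsSequenceFile() writes one part file per partition of the RDD, so many small output files usually means many partitions rather than a block-size setting. A minimal sketch of collapsing the partition count before saving (the data, partition counts and output path are hypothetical):

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._   // implicit conversions for (key, value) RDDs

  object OutputFileCountSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[4]", "output-file-count-sketch")

      // A pair RDD spread over many partitions.
      val pairs = sc.parallelize(1 to 1000000, 200).map(i => (i.toLong, i.toDouble))

      // Collapsing to 8 partitions yields 8 larger part files instead of 200 small ones.
      pairs.coalesce(8).saveAsSequenceFile("/tmp/seq-output")

      sc.stop()
    }
  }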

Occasional failed tasks

2014-01-13 Thread Aureliano Buendia
Hi, While running a big spark job, in spark web ui I can see a tiny fraction of failed tasks: 26630/536568 (15 failed) Since all the tasks are the same, the failed tasks cannot be an application error. Also, spark log doesn't have any errors. - Does spark retry these tasks? - Are these due to s

Re: Spark on google compute engine

2014-01-13 Thread Aureliano Buendia
Regards >> Mayur >> >> Mayur Rustagi >> Ph: +919632149971 >> http://www.sigmoidanalytics.com >> https://twitter.com/mayur_rustagi >> >> >> >> On Mon, Jan 13, 2014 at 9:40 PM, Aureliano Buendia >

Re: Unable to load native-hadoop library

2014-01-13 Thread Aureliano Buendia
I had to explicitly use -Djava.library.path for this to work. On Mon, Jan 13, 2014 at 5:51 PM, Aureliano Buendia wrote: > I'm compiling my application against the same hadoop version as the spark ec2 > AMI: the Maven dependency org.apache.hadoop:hadoop-client:0.23.7

Re: Unable to load native-hadoop library

2014-01-13 Thread Aureliano Buendia
I'm compiling my application against the same hadoop version as the spark ec2 AMI: the Maven dependency org.apache.hadoop:hadoop-client:0.23.7. In my shaded fat jar, I do not include this library though, which shouldn't cause this problem. On Mon, Jan 13, 2014 at 5:28 PM, Aureliano Buendia wr

Unable to load native-hadoop library

2014-01-13 Thread Aureliano Buendia
Hi, I'm using the spark-ec2 scripts, and spark applications do not load the native hadoop libraries. I've set the native lib path like this: export SPARK_LIBRARY_PATH='/root/ephemeral-hdfs/lib/native/' But I get these warnings in the log: WARN NativeCodeLoader: Unable to load native-hadoop library for your p

Re: Spark on google compute engine

2014-01-13 Thread Aureliano Buendia
; https://twitter.com/mayur_rustagi > > > > On Mon, Jan 13, 2014 at 11:01 AM, Debasish Das > wrote: > >> Hi Aureliano, >> >> Look for google compute engine scripts from typesafe repo. They recently >> tested Akka Cluster on 2400 nodes from Google Compute Engine

Re: Problems with broadcast large datastructure

2014-01-12 Thread Aureliano Buendia
On Mon, Jan 13, 2014 at 4:17 AM, lihu wrote: > I have encountered the same problem as you. > I have a cluster of 20 machines, and I just ran the broadcast example; what I > did was just change the data size in the example to 400M, which is really a > small data size. > Is 400 MB a really small size f

Spark on google compute engine

2014-01-12 Thread Aureliano Buendia
Hi, Has anyone worked on a script similar to spark-ec2 for google compute engine? Google compute engine claims to have faster instance start-up times, and that, together with by-the-minute charging, makes it a desirable choice for spark.

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-09 Thread Aureliano Buendia
wouldn't work. It took me a while to find this out, making debugging very time consuming. - The error message is absolutely irrelevant. I guess the problem should be somewhere with the spark context jar delivery part. On Thu, Jan 9, 2014 at 4:17 PM, Aureliano Buendia wrote: > > >

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-09 Thread Aureliano Buendia
the script way works: spark/run-example org.apache.spark.examples.SparkPi `cat spark-ec2/cluster-url` What am I missing in the above java command? > > Matei > > On Jan 8, 2014, at 8:26 PM, Aureliano Buendia > wrote: > > > > > On Thu, Jan 9, 2014 at 4:11 AM, Matei Zaharia wrote: >

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Aureliano Buendia
t sparse, it does not address enough, and as you can see the community is confused. Are the spark users supposed to create something like run-example for their own jobs? > > Matei > > On Jan 8, 2014, at 8:06 PM, Aureliano Buendia > wrote: > > > > > On Thu, Jan 9,

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Aureliano Buendia
spark-ec2/cluster-url I think the problem might be caused by spark-class script. It seems to assign too much memory. I forgot the fact that run-example doesn't use spark-class. > > Matei > > On Jan 8, 2014, at 7:07 PM, Aureliano Buendia > wrote: > > The strange thing i

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Aureliano Buendia
sufficient memory My jar is deployed to master and then to workers by spark-ec2/copy-dir. Why would including the example in my jar cause this error? On Thu, Jan 9, 2014 at 12:41 AM, Aureliano Buendia wrote: > Could someone explain how SPARK_MEM, SPARK_WORKER_MEMORY and > spark.executor.

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Aureliano Buendia
web UI, it's just redundant. The UI is of no help. On Wed, Jan 8, 2014 at 4:31 PM, Aureliano Buendia wrote: > Hi, > > > My spark cluster is not able to run a job due to this warning: > > WARN ClusterScheduler: Initial job has not accepted any resources; check > your c

How to confirm serializer type on workers?

2014-01-08 Thread Aureliano Buendia
Hi, I'm using the kryo serializer, but I'm getting a java.io.EOFException error when broadcasting an object. It seems one of the reasons for getting this error is that the workers are not aware of the kryo serializer. In the app code, before creating the context, I have: System.setProperty("spark.seriali
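
A minimal sketch of the 0.8-era way of switching on kryo via system properties before the context is created, plus a quick way to check which serializer the executors actually picked up (the registrator class and the registered classes are hypothetical):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.{SparkContext, SparkEnv}
  import org.apache.spark.serializer.KryoRegistrator

  // Hypothetical registrator for the application's classes.
  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      kryo.register(classOf[Array[Double]])
    }
  }

  object KryoSetupSketch {
    def main(args: Array[String]) {
      // In this era these were plain system properties, read when the context starts.
      System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      System.setProperty("spark.kryo.registrator", "MyRegistrator")   // use the fully qualified name in a packaged app

      val sc = new SparkContext("local[2]", "kryo-setup-sketch")

      // Ask each task which serializer its SparkEnv holds; this runs inside the executors.
      sc.parallelize(1 to 2, 2)
        .map(_ => SparkEnv.get.serializer.getClass.getName)
        .collect()
        .foreach(println)

      sc.stop()
    }
  }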

Re: EC2 scripts documentations lacks how to actually run applications

2014-01-08 Thread Aureliano Buendia
10:31 AM, Mark Hamstra > wrote: > > https://github.com/apache/incubator-spark/pull/293 > > > > > > On Wed, Jan 8, 2014 at 10:12 AM, Aureliano Buendia > > > wrote: > >> > >> Here is a refactored version of the question: > >>

Re: EC2 scripts documentations lacks how to actually run applications

2014-01-08 Thread Aureliano Buendia
Here is a refactored version of the question: How to run spark-class for long-running applications? Why is it that spark-class doesn't launch a daemon? On Wed, Jan 8, 2014 at 3:21 AM, Aureliano Buendia wrote: > Hi, > > The EC2 > documents<http://spark.incubator.apach

WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-08 Thread Aureliano Buendia
Hi, My spark cluster is not able to run a job due to this warning: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory. The workers have this status: ALIVE, 2 (0 Used), 6.3 GB (0.0 B Used). So there

EC2 scripts documentations lacks how to actually run applications

2014-01-07 Thread Aureliano Buendia
Hi, The EC2 documentation has a section called 'Running Applications', but it actually lacks the step which should describe how to run the application. The spark_ec2 script

Re: the spark worker assignment Question?

2014-01-07 Thread Aureliano Buendia
tic way in spark, or should it be done in another way? > > > On Tue, Jan 7, 2014 at 9:20 AM, Aureliano Buendia wrote: > >> >> >> >> On Tue, Jan 7, 2014 at 5:13 PM, Andrew Ash wrote: >> >>> If small-file is hosted in HDFS I think the defa

Re: the spark worker assignment Question?

2014-01-07 Thread Aureliano Buendia
file (which only fits in one block) over 100 machines, instead of calling: sc.parallelize(..., smallInput.partitions.length) should I call?: sc.parallelize(..., System.getProperty("spark.cores.max").toInt) > Sent from my mobile phone > On Jan 7, 2014 8:46 AM, "Aureliano Buendi

Re: Problems with broadcast large datastructure

2014-01-07 Thread Aureliano Buendia
What's the size of your large object to be broadcast? On Tue, Jan 7, 2014 at 8:55 AM, Sebastian Schelter wrote: > Spark repeatedly fails to broadcast a large object on a cluster of 25 > machines for me. > > I get log messages like this: > > [spark-akka.actor.default-dispatcher-4] WARN > org.apache
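
For context, a minimal sketch of how a broadcast variable is normally used, so that the object is shipped to each worker once instead of with every closure (the lookup table here is hypothetical):

  import org.apache.spark.SparkContext

  object BroadcastSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "broadcast-sketch")

      // Hypothetical read-only lookup table that every task needs.
      val lookup = (1 to 100000).map(i => i -> math.sqrt(i)).toMap
      val bc = sc.broadcast(lookup)

      val result = sc.parallelize(1 to 10)
        .map(i => bc.value.getOrElse(i, 0.0))   // tasks read it through .value
        .collect()

      println(result.mkString(", "))
      sc.stop()
    }
  }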

How to time transformations and provide more detailed progress report?

2014-01-07 Thread Aureliano Buendia
Hi, When we time an action it includes all the transformation timings too, and it is not clear which transformation takes how long. Is there a way of timing each transformation separately? Also, does spark provide a way of more detailed progress reporting, broken down into transformation steps? For exa

Re: the spark worker assignment Question?

2014-01-07 Thread Aureliano Buendia
On Thu, Jan 2, 2014 at 5:52 PM, Andrew Ash wrote: > That sounds right Mayur. > > Also in 0.8.1 I hear there's a new repartition method that you might be > able to use to further distribute the data. But if your data is so small > that it fits in just a couple blocks, why are you using 20 machine

Re: How to access global kryo instance?

2014-01-06 Thread Aureliano Buendia
object pool for kryo instances, I'm not sure how spark handles this. > > > On Mon, Jan 6, 2014 at 5:20 PM, Aureliano Buendia wrote: > >> In a map closure, I could use: >> >> val ser = SparkEnv.get.serializer.asInstanceOf[KryoSerializer] >> >> But how

Re: How to access global kryo instance?

2014-01-06 Thread Aureliano Buendia
not be done for > every element you process. > > > On Mon, Jan 6, 2014 at 4:36 PM, Aureliano Buendia wrote: > >> Hi, >> >> Is there a way to access the global kryo instance created by spark? I'm >> referring to the one which is passed to registerClasses() in a

How to access global kryo instance?

2014-01-06 Thread Aureliano Buendia
Hi, Is there a way to access the global kryo instance created by spark? I'm referring to the one which is passed to registerClasses() in a KryoRegistrator subclass. I'd like to access this kryo instance inside a map closure, so it should be accessible from the workers' side too.
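
A minimal sketch of reaching the worker-side serializer from inside a closure, along the lines of the SparkEnv.get.serializer snippet quoted earlier in this thread (the per-partition usage is the important part; newKryo() is assumed to be public, as it was in the 0.8-era KryoSerializer):

  import org.apache.spark.{SparkContext, SparkEnv}
  import org.apache.spark.serializer.KryoSerializer

  object KryoInClosureSketch {
    def main(args: Array[String]) {
      System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      val sc = new SparkContext("local[2]", "kryo-in-closure-sketch")

      val names = sc.parallelize(1 to 4, 2).mapPartitions { iter =>
        // Grab the worker-side serializer once per partition, not once per element.
        val ser = SparkEnv.get.serializer.asInstanceOf[KryoSerializer]
        // ser.newKryo() would hand back a fresh Kryo with the registrator applied (assumption:
        // 0.8-era API); ser.newInstance() gives a generic SerializerInstance either way.
        val instance = ser.newInstance()
        iter.map(i => instance.getClass.getName + " for element " + i)
      }.collect()

      names.foreach(println)
      sc.stop()
    }
  }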

Re: A new Spark Web UI

2014-01-06 Thread Aureliano Buendia
Nice job! Any plans for including detailed spark job progress? On Mon, Jan 6, 2014 at 4:44 PM, Romain Rigaux wrote: > Hi, > > We just added a new application in Hue for submitting jobs and managing > contexts: > > http://gethue.tumblr.com/post/71963991256/a-new-spark-web-ui-spark-app > > The app

Why does saveAsObjectFile() serialize Array[T] instead of T?

2014-01-05 Thread Aureliano Buendia
Hi, Given an RDD[T] instance, saveAsObjectFile() passes an instance of Array[T] to serialize(), and not an instance of T: def saveAsObjectFile(path: String) { this.mapPartitions(iter => iter.grouped(10).map(_.toArray)) .map(x => (NullWritable.get(), new BytesWritable( Utils.serial
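
For reference, the quoted definition completed (this is roughly how the 0.8-era RDD.scala spells it out):

  def saveAsObjectFile(path: String) {
    this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
      .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
      .saveAsSequenceFile(path)
  }

So the elements are batched into groups of 10, and each batch (an Array[T]) is serialized as a single sequence-file record, which is why serialize() sees Array[T] rather than T.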

Re: Spark context jar confusions

2014-01-05 Thread Aureliano Buendia
hats the "common way" but I am doing things this way: > build the fat jar, provide some launch scripts, make debian packaging, ship > it to a node that plays the role of the driver, run it over mesos using the > launch scripts + some conf. > > > 2014/1/2 Aureliano Bu

Re: State of spark on scala 2.10

2014-01-05 Thread Aureliano Buendia
> > - Patrick > > On Sat, Jan 4, 2014 at 9:14 PM, Aureliano Buendia > wrote: > > So I should just launch an AMI from one of > > https://github.com/mesos/spark-ec2/tree/v2/ami-list and build the > > development version on it? Is it just a simple matter of

Re: ADD_JARS doesn't properly work for spark-shell

2014-01-05 Thread Aureliano Buendia
> > On Sat, Jan 4, 2014 at 8:29 PM, Aureliano Buendia wrote: > >> While myrdd.count() works, a lot of other actions and transformations do >> not still work in spark-shell. Eg myrdd.first() gives this error: >> >> java.lang.ClassCastException: mypackage.MyClass canno

Re: State of spark on scala 2.10

2014-01-04 Thread Aureliano Buendia
> > You'll have to build your own. Also there are some packaging > > differences in master (some bin/ scripts moved to sbin/) just to give > > you a heads up. > > > > On Sat, Jan 4, 2014 at 8:14 PM, Aureliano Buendia > wrote: > >> Good to know the ne

Re: ADD_JARS doesn't properly work for spark-shell

2014-01-04 Thread Aureliano Buendia
Nothing] = MappedRDD[2] Basically, type mypackage.MyClass gets converted to Nothing during any action/transformation. On Sun, Jan 5, 2014 at 4:06 AM, Aureliano Buendia wrote: > Sorry, I had a typo. I can confirm that using ADD_JARS together with > SPARK_CLASSPATH works as expected in sp

Re: State of spark on scala 2.10

2014-01-04 Thread Aureliano Buendia
outstanding PRs against that branch that haven't been moved to master. > > > On Sat, Jan 4, 2014 at 7:11 PM, Aureliano Buendia wrote: > >> Hi, >> >> I was going to give https://github.com/scala/pickling a try on spark to >> see how it would compare with kryo.

Re: ADD_JARS doesn't properly work for spark-shell

2014-01-04 Thread Aureliano Buendia
age.MyClass()) > (or just import it) > > You could also try running MASTER=local-cluster[2,1,512] which launches 2 > executors, 1 core each, with 512MB in a setup that mimics a real cluster > more closely, in case it's a bug only related to using local mode. > > > On Sat, Jan 4

State of spark on scala 2.10

2014-01-04 Thread Aureliano Buendia
Hi, I was going to give https://github.com/scala/pickling a try on spark to see how it would compare with kryo. Unfortunately, it only works with scala 2.10.3. - Is there a timeline for spark to work with scala 2.10? - Is the 2.10 branch as stable as 2.9? - What's blocking spark from working with 2

Re: ADD_JARS doesn't properly work for spark-shell

2014-01-04 Thread Aureliano Buendia
t; is the >> right way to get the jars to your executors (which is where the exception >> appears to be happening), but it would sure be interesting if it did. >> >> >> On Sat, Jan 4, 2014 at 4:50 PM, Aureliano Buendia >> wrote: >> >>> I should add that

Re: ADD_JARS doesn't properly work for spark-shell

2014-01-04 Thread Aureliano Buendia
/jyx81ctj3698wbvphxhm4dw4gn/T/fetchFileTemp8322008964976744710.tmp 14/01/04 15:34:53 INFO Executor: Adding file:/var/folders/3g/jyx81ctj3698wbvphxhm4dw4gn/T/spark-d8ac8f66-fad6-4b3f-8059-73f13b86b070/my.jar.jar to class loader On Sun, Jan 5, 2014 at 12:46 AM, Aureliano Buendia wrote: >

ADD_JARS doesn't properly work for spark-shell

2014-01-04 Thread Aureliano Buendia
Hi, I'm trying to access my stand alone spark app from spark-shell. I tried starting the shell by: MASTER=local[2] ADD_JARS=/path/to/my/jar ./spark-shell The log shows that the jar file was loaded. Also, I can access and create a new instance of mypackage.MyClass. The problem is that myRDD.coll

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-01-04 Thread Aureliano Buendia
ectFile(path: String) { > //put your def here > } > } > > object AwesomeRDD { > implicit def addAwesomeFunctions[T](rdd: RDD[T]) = new AwesomeRDD(rdd) > } > > > then, just import the implicit conversion wherever you want it: > > class Demo { > val rdd:
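
The quoted implicit-conversion pattern, completed into a self-contained sketch (AwesomeRDD and saveAsAwesomeObjectFile are the hypothetical names used in the thread; the body here just delegates to saveAsTextFile as a placeholder for the custom save logic):

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  class AwesomeRDD[T](rdd: RDD[T]) {
    def saveAsAwesomeObjectFile(path: String) {
      // placeholder for the custom serialization + save logic
      rdd.map(_.toString).saveAsTextFile(path)
    }
  }

  object AwesomeRDD {
    // the implicit conversion that bolts the new method onto every RDD[T]
    implicit def addAwesomeFunctions[T](rdd: RDD[T]): AwesomeRDD[T] = new AwesomeRDD(rdd)
  }

  object Demo {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "awesome-rdd-sketch")
      import AwesomeRDD._                     // bring the conversion into scope
      sc.parallelize(Seq(1, 2, 3)).saveAsAwesomeObjectFile("/tmp/awesome-output")
      sc.stop()
    }
  }

This sidesteps the visibility problem with SequenceFileRDDFunctions by defining the extra method on your own wrapper class instead.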

SequenceFileRDDFunctions cannot be used output of spark package

2014-01-04 Thread Aureliano Buendia
Hi, I'm trying to create a custom version of saveAsObject(). However, I do not seem to be able to use SequenceFileRDDFunctions in my package. I simply copy/pasted the saveAsObject() body into my function: out.mapPartitions(iter => iter.grouped(10).map(_.toArray)) .map(x => (NullWritable.get(), ne

Re: Turning kryo on does not decrease binary output

2014-01-04 Thread Aureliano Buendia
doopWriter.createPathFromString(path, conf)) saveAsHadoopDataset(conf) } On Fri, Jan 3, 2014 at 9:49 PM, Aureliano Buendia wrote: > > > > On Fri, Jan 3, 2014 at 8:25 PM, Guillaume Pitel < > guillaume.pi...@exensa.com> wrote: > >> Have you tried with the mapred

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
On Fri, Jan 3, 2014 at 8:25 PM, Guillaume Pitel wrote: > Have you tried with the mapred.* properties ? If saveAsObjectFile uses > saveAsSequenceFile, maybe it uses the old API ? > None of spark.hadoop.mapred.* and spark.hadoop.mapreduce.* approaches cause compression with saveAsObject. (Using sp

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
k for saveAsObject(), but it does work (according to Guillaume) for saveAsHadoopFile()? > > > > > On Fri, Jan 3, 2014 at 1:33 PM, Aureliano Buendia wrote: > >> >> >> >> On Fri, Jan 3, 2014 at 7:26 PM, Guillaume Pitel < >> guillaume.pi...@exensa.com>

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
On Fri, Jan 3, 2014 at 7:26 PM, Guillaume Pitel wrote: > Actually, the interesting part in hadoop files is the sequencefile > format which allows to split the data in various blocks. Other files in > HDFS are single-blocks. They do not scale > But the output of saveAsObjectFile looks like: part-

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
chill is already in use: https://github.com/apache/incubator-spark/blob/3713f8129a618a633a7aca8c944960c3e7ac9d3b/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L26 But what we need is something like chill-hadoop: https://github.com/twitter/chill/tree/develop/chill-hadoop > > > On Fri, Jan 3, 2014

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
RDD only defines saveAsTextFile and saveAsObjectFile. I think saveAsHadoopFile and saveAsNewAPIHadoopFile belong to the older versions. saveAsObjectFile definitely outputs hadoop format. I'm not trying to save big objects by saveAsObjectFile, I'm just trying to minimize the java serialization ove

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
Even someMap.saveAsTextFile("out", classOf[GzipCodec]) has no effect. Also, I noticed that saving sequence files has no compression option (my original question was about compressing binary output). Having said this, I still do not understand why kryo cannot be helpful when saving binary output
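
One route that is independent of kryo is to push the compression settings down to Hadoop itself: a hedged sketch of writing a compressed sequence file through saveAsHadoopFile with an explicit JobConf (the property names are the old mapred.* keys; the data, codec choice and output path are assumptions):

  import org.apache.hadoop.io.{DoubleWritable, LongWritable}
  import org.apache.hadoop.io.compress.{CompressionCodec, GzipCodec}
  import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}
  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._

  object CompressedOutputSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "compressed-output-sketch")
      val pairs = sc.parallelize(1 to 1000)
        .map(i => (new LongWritable(i.toLong), new DoubleWritable(i * 0.5)))

      // Compression is a Hadoop output setting, configured on the JobConf:
      val conf = new JobConf()
      conf.setBoolean("mapred.output.compress", true)
      conf.set("mapred.output.compression.type", "BLOCK")
      conf.setClass("mapred.output.compression.codec", classOf[GzipCodec], classOf[CompressionCodec])

      pairs.saveAsHadoopFile("/tmp/compressed-seq", classOf[LongWritable], classOf[DoubleWritable],
        classOf[SequenceFileOutputFormat[LongWritable, DoubleWritable]], conf)

      sc.stop()
    }
  }

Note that kryo only changes how spark serializes data internally; the on-disk size of a sequence file is governed by the Hadoop output format and its compression settings, which is consistent with what Guillaume describes earlier in the thread.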

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
ile(path: String) = { > val conf = new Configuration > conf.set("textinputformat.record.delimiter", "\n") > sc.newAPIHadoopFile(path, > classOf[com.hadoop.mapreduce.LzoTextInputFormat], classOf[LongWritable], > classOf[Text], conf) > .map(_

Re: Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
Thanks for clarifying this. I tried setting hadoop properties before constructing SparkContext, but it had no effect. Where is the right place to set these properties? On Fri, Jan 3, 2014 at 4:56 PM, Guillaume Pitel wrote: > Hi, > > I believe Kryo is only use during RDD serialization (i.e. co

Turning kryo on does not decrease binary output

2014-01-03 Thread Aureliano Buendia
Hi, I'm trying to call saveAsObjectFile() on an RDD[(Int, Int, Double, Double)], expecting the output binary to be smaller, but it is exactly the same size as when kryo is not on. I've checked the log, and there is no trace of kryo related errors. The code looks something like: class MyRegistr

How to deal with multidimensional keys?

2014-01-02 Thread Aureliano Buendia
Hi, How is it possible to reduce by multidimensional keys? For example, if every line is a tuple like: (i, j, k, value) or, alternatively: ((i, j, k), value), how can spark handle reducing over j, or k?

Re: Spark context jar confusions

2014-01-02 Thread Aureliano Buendia
ck (apparently akka is too fragile and sensitive to this). Also, turning off firewall on os x had no effect. > > > 2014/1/2 Aureliano Buendia > >> How about when developing the spark application, do you use "localhost", >> or "spark://localhost:7077"

Re: Spark context jar confusions

2014-01-02 Thread Aureliano Buendia
> plain old scp, etc) to run it you can do something like: > > $SPARK_HOME/spark-class SPARK_CLASSPATH=PathToYour.jar com.myproject.MyJob > > where MyJob is the entry point to your job it defines a main method. > > 3) I don't know whats the "common way" but I am doi

Re: Spark context jar confusions

2014-01-02 Thread Aureliano Buendia
depend on and then use this utility to get the path: > SparkContext.jarOfClass(YourJob.getClass) > > > 2014/1/2 Aureliano Buendia > >> Hi, >> >> I do not understand why spark context has an option for loading jars at >> runtime. >>

Spark context jar confusions

2014-01-02 Thread Aureliano Buendia
Hi, I do not understand why the spark context has an option for loading jars at runtime. As an example, consider this: object BroadcastT
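
For context, a minimal sketch of what the jars argument is for: it names the jar(s) the context ships to the workers so that closures referring to your classes can be deserialized there (the master URL and the use of SPARK_HOME are placeholders):

  import org.apache.spark.SparkContext

  object JarShippingSketch {
    def main(args: Array[String]) {
      // Locate the jar that contains this class (a Seq in the 0.8-era API).
      val jars = SparkContext.jarOfClass(this.getClass)

      val sc = new SparkContext("spark://master:7077", "jar-shipping-sketch",
        System.getenv("SPARK_HOME"), jars)

      println(sc.parallelize(1 to 100).map(_ * 2).reduce(_ + _))
      sc.stop()
    }
  }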

Upper limit of broadcast variables size

2014-01-01 Thread Aureliano Buendia
Hi, Is there a rule of thumb for the upper limit of broadcast variables?

Re: How to map each line to (line number, line)?

2014-01-01 Thread Aureliano Buendia
some interactive use cases we've >> come across. We're working on providing support for this at a higher level >> of abstraction. >> >> Sent while mobile. Pls excuse typos etc. >> On Dec 31, 2013 11:34 AM, "Aureliano Buendia" >> wrote: >>

Re: How to map each line to (line number, line)?

2013-12-31 Thread Aureliano Buendia
beginning of every line on import time into HDFS instead of doing it > afterwards in Spark. > > Cheers! > Andrew > > > On Mon, Dec 30, 2013 at 12:15 PM, Aureliano Buendia > wrote: > >> I assumed that number of lines in each partition, except the last >> partition, is

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
I assumed that the number of lines in each partition, except the last partition, is equal. Isn't this the case? In that case Guillaume's approach makes sense. All of these methods are inefficient. Spark needs to support this feature at a lower level, as Michael suggested. On Mon, Dec 30, 2013 at 8:01

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
On Mon, Dec 30, 2013 at 6:48 PM, Guillaume Pitel wrote: > > > > It won't last for long = for now my dataset are small enough, but >> I'll have to change it someday >> > > How does it depend on the dataset size? Are you saying zipWithIndex is > slow for bigger datasets? > > > No, but for

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
> On Mon, Dec 30, 2013 at 8:27 AM, Aureliano Buendia > wrote: > >> >> >> >> On Mon, Dec 30, 2013 at 4:24 PM, Michael (Bach) Bui >> wrote: >> >>> Note that, Spark uses the HDFS API to access the file. >>> The HDFS API has KeyValueTextInputForm

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
On Mon, Dec 30, 2013 at 5:02 PM, Guillaume Pitel wrote: > Hi > > > I have the same problem here, I need to map some values to ids, and I >> want a unique Int. For now I can use a local zipWithIndex, but it won't >> last for long. >> > > What do you mean by it won't last for long? You can prec

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
I can > try to pull it in. > I agree. It'd be super useful to have this feature. > > > Michael (Bach) Bui, PhD, > Senior Staff Architect, ADATAO Inc. > www.adatao.com > > > > > On Dec 30, 2013, at 6:28 AM, A

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
On Mon, Dec 30, 2013 at 4:06 PM, Guillaume Pitel wrote: > Hi, > > I have the same problem here, I need to map some values to ids, and I want > a unique Int. For now I can use a local zipWithIndex, but it won't last for > long. > What do you mean by it won't last for long? You can precisely reco

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
Mon, Dec 30, 2013 at 8:41 AM, Aureliano Buendia > wrote: > >> One thing could make this more complicated is partitioning. >> >> >> On Mon, Dec 30, 2013 at 12:28 PM, Aureliano Buendia > > wrote: >> >>> Hi, >>> >>> When reading a simple

Re: How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
One thing could make this more complicated is partitioning. On Mon, Dec 30, 2013 at 12:28 PM, Aureliano Buendia wrote: > Hi, > > When reading a simple text file in spark, what's the best way of mapping > each line to (line number, line)? RDD doesn't seem to have an equivalent of > zipWithIndex. >

How to map each line to (line number, line)?

2013-12-30 Thread Aureliano Buendia
Hi, When reading a simple text file in spark, what's the best way of mapping each line to (line number, line)? RDD doesn't seem to have an equivalent of zipWithIndex.
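
In the absence of a built-in zipWithIndex, a common two-pass workaround is to count the lines per partition first and then number the lines inside each partition with that partition's offset added. A minimal sketch, with a hypothetical input path:

  import org.apache.spark.SparkContext

  object LineNumberSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "line-number-sketch")
      val lines = sc.textFile("/tmp/input.txt")   // hypothetical input

      // Pass 1: lines per partition (a tiny result, collected to the driver).
      val counts = lines.mapPartitionsWithIndex { (idx, it) =>
        Iterator((idx, it.size))
      }.collect().sortBy(_._1).map(_._2)

      // Prefix sums give the first global line number of each partition.
      val offsets = counts.scanLeft(0L)(_ + _)

      // Pass 2: number lines within each partition, shifted by that offset.
      val numbered = lines.mapPartitionsWithIndex { (idx, it) =>
        it.zipWithIndex.map { case (line, i) => (offsets(idx) + i, line) }
      }

      numbered.take(5).foreach(println)
      sc.stop()
    }
  }

The obvious caveat is the extra pass over the data; as noted further down the thread, lower-level support in spark itself would be cheaper.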

Re: Web UI and standalone apps

2013-12-27 Thread Aureliano Buendia
until you kill the > cluster by killing Spark Master. > > Hope that long-winded explanation made sense! > Happy Holidays! > > > > On Fri, Dec 27, 2013 at 9:23 AM, Aureliano Buendia > wrote: > >> Hi, >> >> >> I'm a bit confused about web UI

Web UI and standalone apps

2013-12-27 Thread Aureliano Buendia
Hi, I'm a bit confused about web UI access for a standalone spark app. - When running a spark app, a web server is launched at localhost:4040. When the standalone app execution is finished, the web server is shut down. What's the use of this web server? There is no way of reviewing the data when

mapWith and array index as key

2013-12-24 Thread Aureliano Buendia
Hi, Given a distributed file, does mapWith provide the functionality to know the index of each line (line number - 1) across all worker nodes? Can mapWith be used to treat the index as a key when joining two RDDs?

Spark application development work flow in scala

2013-12-24 Thread Aureliano Buendia
Hi, What's a typical workflow for spark application development in scala? One option is to write a scala application with a main function, and keep executing the app after every development change. Given the big overhead of even moderately sized development data, this could mean slow iterations. Anot

ADD_JARS and jar dependencies in sbt

2013-12-23 Thread Aureliano Buendia
Hi, It seems ADD_JARS can be used to add some jars to class path of spark-shell. This works in simple cases of a few jars. But what happens when those jars depend on other jars? Do we have to list them in ADD_JARS too? Also, do we have to manually download the jars and keep them in parallel with

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Aureliano Buendia
this thread, such as >> broadcasting, mapPartitions() vs map(), parallel data loading across HDFS >> partitions, etc. But it would be exactly the right thing to do if it best >> fits your problem statement. >> -- >> Christopher T. Nguyen >> Co-founder & CEO, Adatao <http

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Aureliano Buendia
right thing to do if it best > fits your problem statement. > -- > Christopher T. Nguyen > Co-founder & CEO, Adatao <http://adatao.com> > linkedin.com/in/ctnguyen > > > > On Fri, Dec 20, 2013 at 2:01 PM, Aureliano Buendia > wrote: > >> >> >>

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Aureliano Buendia
hold a copy, or a reference. > > Put another way, I see the scale of this challenge as far more operational > than logical (when squinted at from the right angle :) > > -- > Christopher T. Nguyen > Co-founder & CEO, Adatao <http://adatao.com> > linkedin.com/in/ctnguyen

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Aureliano Buendia
Also, overthinking is appreciated in this problem, as my production data is actually near 100 x 1,000,000,000 and data duplication could get messy at this scale. Sorry about the initial misinformation, I was thinking about my development/test data. On Fri, Dec 20, 2013 at 9:04 PM, Aureliano Buendia

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Aureliano Buendia
You can use rdd.mapPartitions(func), where func will take care of >>> splitting an n*50-row partition into n sub-matrices. >>> >>> However, HDFS TextInputFormat or SequenceFileInputFormat will not guarantee >>> that each partition has a certain number of rows. What you want is >>
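
A minimal sketch of the mapPartitions idea from the quote, with hypothetical row data (and with the caveat just quoted: chunks at partition boundaries may come out shorter than 50 rows):

  import org.apache.spark.SparkContext

  object SubMatrixSketch {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "sub-matrix-sketch")

      // Hypothetical matrix: 1000 rows of 10 doubles, one row per RDD element.
      val rows = sc.parallelize(0 until 1000).map(i => Array.fill(10)(i.toDouble))

      // Group each partition's rows into chunks of 50; each chunk is one
      // 50 x 10 sub-matrix that a single task can hold in memory.
      val subMatrices = rows.mapPartitions { it =>
        it.grouped(50).map(chunk => chunk.toArray)   // Array[Array[Double]]
      }

      println("number of sub-matrices: " + subMatrices.count())
      sc.stop()
    }
  }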

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Aureliano Buendia
communication becomes the bottleneck, so writing your algorithm > to minimize communication is important. > > - Evan > > > > > On Fri, Dec 20, 2013 at 9:40 AM, Aureliano Buendia > wrote: > >> >> >> >> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek wrote: > >
