How to create an RPM package

2014-04-04 Thread Rahul Singhal
Hello Community, This is my first mail to the list and I have a small question. The Maven build page mentions a way to create a Debian package, but I was wondering if there is a simple way (preferably th

Re: Error when run Spark on mesos

2014-04-04 Thread Gino Mathews
Hi, The issue was due to a Mesos version mismatch: I am using the latest Mesos (0.17.0), but Spark uses 0.13.0. Fixed by updating SparkBuild.scala to the latest version. However, I am now faced with errors in the Mesos worker threads. I tried again after upgrading Spark to 0.9.1; the issues persist. Thanks fo

module not found: org.eclipse.paho#mqtt-client;0.4.0

2014-04-04 Thread Dear all
Hello all, I am new to Spark & Scala. Yesterday my Spark install failed, with the message below. Can anyone help? Why can't mqtt-client-0.4.0.pom be found? What should I do? Thanks a lot! command: sbt/sbt assembly [info] Updating {file:/Users/alick/spark/sp

Re: module not found: org.eclipse.paho#mqtt-client;0.4.0

2014-04-04 Thread Sean Owen
This ultimately means a problem with SSL in the version of Java you are using to run SBT. If you look around the internet, you'll see a bunch of discussion, most of which seems to boil down to reinstall, or update, Java. -- Sean Owen | Director, Data Science | London On Fri, Apr 4, 2014 at 12:21

Explain Add Input

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, Could anyone explain the lines below to me? computer1 - worker, computer8 - driver (master). 14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5 KB, free: 540.3 MB) 14/04/04 14:24:56 INFO Memor

RAM high consume

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, I am doing some tests using JavaNetworkWordcount and I have some questions about machine performance; my tests take approximately 2 min. Why does the available RAM decrease so markedly? I have done tests with 2 and 3 machines and got the same behavior. What should I

how to save RDD partitions in different folders?

2014-04-04 Thread dmpour23
Hi all, Say I have an input file which I would like to partition using HashPartitioner k times. Calling rdd.saveAsTextFile("hdfs://...") will save k files as part-0 ... part-k. Is there a way to save each partition in a specific folder? i.e. src part0/part-0 part1/part-00

How to start history tracking URL

2014-04-04 Thread zhxfl
I run the Spark client on YARN, and use "history-daemon.sh start historyserver" to start the history tracking URL, but it didn't work. Why?

Re: Spark 1.0.0 release plan

2014-04-04 Thread Tom Graves
Do we have a list of things we really want to get in for 1.X? Perhaps move any JIRAs out to a 1.1 release if we aren't targeting them for 1.0. It might be nice to send out reminders when these dates are approaching. Tom On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta wrote: Thanks a lo

Re: How to create an RPM package

2014-04-04 Thread Christophe Préaud
Hi Rahul, Spark will be available in Fedora 21 (see: https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently scheduled for 2014-10-14, but they have already produced spec files and source RPMs. If you are stuck with EL6 like me, you can have a look at the attached spec file, whi

Driver increase memory utilization

2014-04-04 Thread Eduardo Costa Alfaia
Hi Guys, Could anyone help me understand this driver behavior when I start the JavaNetworkWordCount? computer8 16:24:07 up 121 days, 22:21, 12 users, load average: 0.66, 1.27, 1.55 total used free shared buffers cached Mem: 5897 434

Re: how to save RDD partitions in different folders?

2014-04-04 Thread Evan Sparks
Have a look at MultipleOutputs in the Hadoop API. Spark can read and write arbitrary Hadoop formats. > On Apr 4, 2014, at 6:01 AM, dmpour23 wrote: > > Hi all, > Say I have an input file which I would like to partition using > HashPartitioner k times. > > Calling rdd.saveAsTextFile("hdfs:

Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Hi All, I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1, so perhaps already being addressed, but I am having a devil of a time with a Spark 0.9.0 client jar for Hadoop 2.X. If I go to the site and download: - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Rahul Singhal
Hi Erik, I am working with TOT branch-0.9 (> 0.9.1) and the following works for me for the Maven build: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package And from http://spark.apache.org/d

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Amit Tewari
I believe you have to set the following: SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is) and SPARK_YARN=true, then type sbt/sbt assembly. If you are using Maven to compile: mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package Hope this helps -A On Fri, Apr 4, 2014 a

reduceByKeyAndWindow Java

2014-04-04 Thread Eduardo Costa Alfaia
Hi guys, I would like to know if this part of the code is right for use with a window. JavaPairDStream wordCounts = words.map( new PairFunction() { @Override public Tuple2 call(String s) { return new Tuple2(s, 1); } }).reduceByKeyAndWi

Re: how to save RDD partitions in different folders?

2014-04-04 Thread Konstantin Kudryavtsev
Hi Evan, Could you please provide a code snippet? It's not clear to me: in Hadoop you would use the addNamedOutput method, and I'm stuck on how to do the same from Spark. Thank you, Konstantin Kudryavtsev On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks wrote: > Have a look at MultipleOutputs in
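For illustration, here is a minimal sketch (Scala, old-style Hadoop API) of the pattern Evan is pointing at, without addNamedOutput: subclass MultipleTextOutputFormat so each record is written under a folder derived from its key. The class name, sample data, and output path are hypothetical:

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkContext}

// Route each (key, value) record to a subfolder named after its key,
// e.g. part0/part-00000, part1/part-00000, ...
class KeyBasedOutput extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    "part" + key + "/" + name
}

val sc = new SparkContext("local", "partition-folders")
val rdd = sc.parallelize(Seq(("0", "alpha"), ("1", "beta"), ("0", "gamma")))

rdd.partitionBy(new HashPartitioner(2))  // k = 2 partitions
   .saveAsHadoopFile("hdfs://namenode/out",
     classOf[String], classOf[String], classOf[KeyBasedOutput])
```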

Re: Spark output compression on HDFS

2014-04-04 Thread Konstantin Kudryavtsev
Can anybody suggest how to change the compression type (RECORD, BLOCK) for Snappy, if it is possible? Thank you in advance. Thank you, Konstantin Kudryavtsev On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev < kudryavtsev.konstan...@gmail.com> wrote: > Thanks all, it works fine now an

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi, Can you explain a little more about what's going on? Which one submits a job to the YARN cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our YARN cluster, and it seems to work that way. Shouldn't Client.scala be running wi

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Thanks all for the update - I have actually built using those options every which way I can think of, so perhaps the problem is in how I upload the jar to our Artifactory repo server. Does anyone have a working pom file for publishing a Spark 0.9 Hadoop 2.X build to a Maven repo se

Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, I have put this line in my spark-env.sh: -Dspark.default.parallelism=20. Is this parallelism level correct? The machine's processor is a dual core. Thanks

Re: Regarding Sparkcontext object

2014-04-04 Thread Daniel Siegmann
On Wed, Apr 2, 2014 at 7:11 PM, yh18190 wrote: > Is it always required that the SparkContext object be created in the main method of > a class? Is it necessary? Can we create the "sc" object in another class and > use it by passing the object through a function? > The Spark context can be initializ

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-04 Thread Prasad
Hi Wisely, Could you please post your pom.xml here. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p3770.html Sent from the Apache Spark User List maili

RAM Increase

2014-04-04 Thread Eduardo Costa Alfaia
Hi Guys, Could anyone explain this behavior to me? After 2 min of tests: computer1 - worker, computer10 - worker, computer8 - driver (master). computer1 18:24:31 up 73 days, 7:14, 1 user, load average: 3.93, 2.45, 1.14 total used free shared buffers cached Me

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia < e.costaalf...@unibs.it> wrote: > Hi all, > > I have put this li

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-04 Thread Marcelo Vanzin
Hi Francis, This might be a long shot, but do you happen to have built Spark on an encrypted home dir? (I was running into the same error when I was doing that. Rebuilding on an unencrypted disk fixed the issue. This is a known issue / limitation with ecryptfs.) It's weird that the build doesn't f

How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
I'm trying to get a clear idea about how exceptions are handled in Spark. Is there somewhere I can read about this? I'm on Spark 0.7. For some reason I was under the impression that such exceptions are swallowed and the value that produced them ignored, but the exception is logged. However, rig

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which sho

Re: Status of MLI?

2014-04-04 Thread Yi Zou
Hi Evan, Just noticed this thread; do you mind sharing more details regarding the algorithms targeted at hyperparameter tuning/model selection, or a link to the dev git repo for that work? thanks, yi On Wed, Apr 2, 2014 at 6:03 PM, Evan R. Sparks wrote: > Targeting 0.9.0 should work out of the box (

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
> In such a construct, each operator builds on the previous one, including any > materialized results etc. If I use SQL for each of them, I suspect the > later SQL queries will not leverage the earlier ones in any way - hence these > will be inefficient compared to the first approach. Let me know if this is not corr

Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia
What do you advise me, Nicholas? On 4/4/14 19:05, Nicholas Chammas wrote: If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduar

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia wrote: > Exceptions should be sent back to the driver program and logged there > (with a SparkException thrown if a task fails more than 4 times)

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Btw, thank you for your help. On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier wrote: > Is there a way to log exceptions inside a mapping function? logError and > logInfo seem to freeze things. > > > On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia wrote: > >> Exceptions should be sent back to the

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
Minor typo in the example. The first SELECT statement should actually be: sql("SELECT * FROM src") where `src` is a Hive table with schema (key INT, value STRING). On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust wrote: > > In such a construct, each operator builds on the previous one, including
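For context, a rough sketch of the language-integrated alternative being discussed, assuming a HiveContext and the `src` table above (the 1.0-era SchemaRDD DSL; the threshold value is illustrative). Each operator builds on the previous SchemaRDD rather than issuing a separate SQL query:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local", "schemardd-dsl")
val hiveCtx = new HiveContext(sc)
import hiveCtx._  // brings in the implicits for the 'symbol expression DSL

val all       = sql("SELECT * FROM src")  // a SchemaRDD over (key INT, value STRING)
val filtered  = all.where('key > 100)     // builds on `all`, no second query
val projected = filtered.select('value)   // builds on `filtered`
projected.collect().foreach(println)
```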

Largest Spark Cluster

2014-04-04 Thread Parviz Deyhim
Spark community, What's the size of the largest Spark cluster ever deployed? I've heard Yahoo is running Spark on several hundred nodes but don't know the actual number. can someone share? Thanks

Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Nicholas Chammas
I would like to be able to use spark-ec2 to launch new slaves and add them to an existing, running cluster. Similarly, I would also like to remove slaves from an existing cluster. Use cases include: 1. Oh snap, I sized my cluster incorrectly. Let me add/remove some slaves. 2. During sche

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you want more parallelism, you need more cores. So, use a machine with more cores, or use a cluster of machines. spark-ec2 is the easiest way to do this. If you're stuck on a single machine with 2 cores, then set your default parallelism to
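To make this concrete, a small sketch of where the parallelism knobs live; the values are illustrative rather than recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")                  // a dual-core machine runs at most 2 concurrent tasks
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", "4")  // roughly 1-2x the total cores is a common starting point
val sc = new SparkContext(conf)

// Parallelism can also be set per shuffle operation:
val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _, 4)                  // 4 reduce partitions
```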

Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia
Thanks Nicholas. On 4/4/14 21:19, Nicholas Chammas wrote: If you want more parallelism, you need more cores. So, use a machine with more cores, or use a cluster of machines. spark-ec2 is the easiest way to do this. If you're stuck on a

Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia
FYI, one thing we’ve added now is support for reading multiple text files from a directory as separate records: https://github.com/apache/spark/pull/327. This should remove the need for mapPartitions discussed here. Avro and SequenceFiles look like they may not make it for 1.0, but there’s a ch
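A minimal sketch of the whole-file API added by that pull request; each file in the directory becomes a single (path, content) record (the directory path is hypothetical):

```scala
// Each file becomes one record, so a whole XML/JSON document can be
// parsed in a single map call instead of line by line.
val files = sc.wholeTextFiles("hdfs://namenode/data/docs")

val sizes = files.map { case (path, content) =>
  (path, content.length)  // stand-in for real per-document parsing
}
```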

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Matei Zaharia
This can’t be done through the script right now, but you can do it manually as long as the cluster is stopped. If the cluster is stopped, just go into the AWS Console, right click a slave and choose “launch more of these” to add more. Or select multiple slaves and delete them. When you run spark

Re: How to create an RPM package

2014-04-04 Thread Rahul Singhal
Hi Christophe, Thanks for your reply and the spec file. I have solved my issue for now. I didn't want to rely on building Spark using the spec file (%build section), as I don't want to be maintaining the list of files that need to be packaged. I ended up adding Maven build support to make-distribut

Re: reduceByKeyAndWindow Java

2014-04-04 Thread Tathagata Das
I haven't really compiled the code, but it looks good to me. Why? Is there any problem you are facing? TD On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia wrote: > > Hi guys, > > I would like to know if this part of the code is right for use with a window. > > JavaPairDStream wordCounts = words.map(

Re: reduceByKeyAndWindow Java

2014-04-04 Thread Eduardo Costa Alfaia
Hi Tathagata, You are right, the code compiles, but I am having problems with high memory consumption. I sent some emails about this today, but no response so far. Thanks On 4/4/14 22:56, Tathagata Das wrote: I haven't really compiled the code, but it looks good to me. Why? Is there any

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Andrew Or
Logging inside a map function shouldn't "freeze things." The messages should be logged on the worker logs, since the code is executed on the executors. If you throw a SparkException, however, it'll be propagated to the driver after it has failed 4 or more times (by default). On Fri, Apr 4, 2014 at

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Nicholas Chammas
Sweet, thanks for the instructions. This will do for resizing a dev cluster that you can bring down at will. I will open a JIRA issue about adding the functionality I described to spark-ec2. On Fri, Apr 4, 2014 at 3:43 PM, Matei Zaharia wrote: > This can't be done through the script right now,

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Make sure you initialize a log4j Log object on the workers and not on the driver program. If you’re somehow referencing a logInfo method on the driver program, the Log object might not get sent across the network correctly (though you’d usually get some other error there, like NotSerializableExc
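A minimal sketch of that suggestion: create the logger inside the closure so it is initialized on the worker, and nothing non-serializable is captured from the driver. Sample data and the logger name are hypothetical:

```scala
import org.apache.log4j.Logger

val rdd = sc.parallelize(Seq("1", "2", "oops", "4"))

val parsed = rdd.mapPartitions { iter =>
  // Created here, on the executor, not on the driver program.
  val log = Logger.getLogger("example.map.function")
  iter.flatMap { s =>
    try {
      Some(s.toInt)
    } catch {
      case e: NumberFormatException =>
        log.error("skipping bad record: " + s, e)  // shows up in the worker's logs
        None  // or rethrow, if the task should fail instead
    }
  }
}

parsed.collect()  // Array(1, 2, 4)
```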

Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compression type for Snappy. Sent from my iPhone 5s > On April 4, 2014, at 23:06, Konstantin Kudryavtsev > wrote: > > Can anybody suggest how to change the compression type (RECORD, BLOCK) for > Snappy, if it is possible? > > thank you in advance > > Thank you, > Konstantin Kudr
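For reference, a sketch of how Snappy output is usually switched on through the Hadoop job configuration; the RECORD/BLOCK compression type only applies to SequenceFile output, not to the codec itself. Property names are the mapred.* ones of that era; sample data and the path are hypothetical:

```scala
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.output.compress", "true")
jobConf.set("mapred.output.compression.codec", classOf[SnappyCodec].getName)
jobConf.set("mapred.output.compression.type", "BLOCK")  // SequenceFile-only setting

val pairs = sc.parallelize(Seq("spark", "snappy"))
  .map(w => (new Text(w), new IntWritable(w.length)))

pairs.saveAsHadoopFile("hdfs://namenode/out-snappy",
  classOf[Text], classOf[IntWritable],
  classOf[SequenceFileOutputFormat[Text, IntWritable]], jobConf)
```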

Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
All, Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it possible to just run Spark unmodified (without the H

Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a ne
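A minimal sketch of what this looks like in practice, assuming the filesystem is mounted at the same path on every node (the /lustre paths are hypothetical):

```scala
// No HDFS involved: plain file:// URLs against the shared mount.
val data = sc.textFile("file:///lustre/projects/dataset/input.txt")

val counts = data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("file:///lustre/projects/dataset/output")
```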

Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
Thanks will take a look... Sent from my iPad > On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT > wrote: > > We use avro objects in our project, and have a Kryo serializer for generic > Avro SpecificRecords. Take a look at: > > https://github.com/bigdatagenomics/adam/blob/master/adam-core/sr

Heartbeat exceeds

2014-04-04 Thread Debasish Das
Hi, In my ALS runs I am noticing messages that complain about heartbeats: 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(17, machine1, 53419, 0) with no recent heart beats: 48476ms exceeds 45000ms 14/04/04 20:43:09 WARN BlockManagerMasterActor: Removing Bloc

exactly once

2014-04-04 Thread Bharath Bhushan
Does Spark in general assure exactly-once semantics? What happens to those guarantees in the presence of updateStateByKey operations -- are they also assured to be exactly-once? Thanks manku.timma at outlook dot com
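For reference, a minimal updateStateByKey sketch; note that stateful operations require checkpointing, which the fault-tolerance guarantees depend on. Host, port, and paths are hypothetical:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(1))
ssc.checkpoint("hdfs://namenode/checkpoints")  // required for stateful DStreams

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val pairs = words.map((_, 1))

// Fold each batch's counts into the running per-word total.
val updateCount = (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)

val runningCounts = pairs.updateStateByKey[Int](updateCount)
runningCounts.print()

ssc.start()
ssc.awaitTermination()
```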

Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
Hey Parviz, There was a similar thread a while ago... I think that many companies like to be discreet about the size of large clusters. But of course it would be great if people wanted to share openly :) For my part - I can say that Spark has been benchmarked on hundreds-of-nodes clusters before

Re: How to create an RPM package

2014-04-04 Thread Patrick Wendell
We might be able to incorporate the maven rpm plugin into our build. If that can be done in an elegant way it would be nice to have that distribution target for people who wanted to try this with arbitrary Spark versions... Personally I have no familiarity with that plug-in, so curious if anyone i

Re: Heartbeat exceeds

2014-04-04 Thread Patrick Wendell
If you look in the Spark UI, do you see any garbage collection happening? My best guess is that some of the executors are going into GC and they are timing out. You can manually increase the timeout by setting the Spark conf: spark.storage.blockManagerSlaveTimeoutMs to a higher value. In your cas
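A small sketch of setting that property through SparkConf; the timeout value is illustrative, not a recommendation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("als-job")
  // default is 45000 ms per the warning above; raise it if GC pauses are long
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")
val sc = new SparkContext(conf)
```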

Re: Spark on other parallel filesystems

2014-04-04 Thread Jeremy Freeman
We run Spark (in Standalone mode) on top of a network-mounted file system (NFS), rather than HDFS, and find it to work great. It required no modification or special configuration to set this up; as Matei says, we just point Spark to data using the file location. -- Jeremy On Apr 4, 2014, at 8:

Re: Spark on other parallel filesystems

2014-04-04 Thread Anand Avati
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia wrote: > As long as the filesystem is mounted at the same path on every node, you > should be able to just run Spark and use a file:// URL for your files. > > The only downside with running it this way is that Lustre won't expose > data locality info t

Redirect Incubator pages

2014-04-04 Thread Andrew Ash
I occasionally see links to pages in the spark.incubator.apache.org domain. Can we HTTP 301 redirect that whole domain to spark.apache.org now that the project has graduated? The content seems identical. That would also make the eventual decommission of the incubator domain much easier as usage