Re: module not found: org.eclipse.paho#mqtt-client;0.4.0

2014-04-04 Thread Sean Owen
This ultimately means a problem with SSL in the version of Java you are using to run SBT. If you look around the internet, you'll see a bunch of discussion, most of which seems to boil down to reinstalling or updating Java. -- Sean Owen | Director, Data Science | London On Fri, Apr 4, 2014 at 12:21

Explain Add Input

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, Could anyone explain the lines below to me? computer1 - worker, computer8 - driver (master) 14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5 KB, free: 540.3 MB) 14/04/04 14:24:56 INFO

High RAM consumption

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, I am doing some tests using JavaNetworkWordCount and I have some questions about machine performance; my tests take approximately 2 min. Why does the available RAM decrease so significantly? I have run tests with 2 and 3 machines and got the same behavior. What should I

how to save RDD partitions in different folders?

2014-04-04 Thread dmpour23
Hi all, Say I have an input file which I would like to partition using HashPartitioner k times. Calling rdd.saveAsTextFile("hdfs://"); will save k files, part-0 through part-k. Is there a way to save each partition in specific folders? i.e. src part0/part-0

Re: Spark 1.0.0 release plan

2014-04-04 Thread Tom Graves
Do we have a list of things we really want to get in for 1.X? Perhaps move any jira out to a 1.1 release if we aren't targeting them for 1.0. It might be nice to send out reminders when these dates are approaching. Tom On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com

Re: How to create a RPM package

2014-04-04 Thread Christophe Préaud
Hi Rahul, Spark will be available in Fedora 21 (see: https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently scheduled for 2014-10-14, but they have already produced spec files and source RPMs. If you are stuck with EL6 like me, you can have a look at the attached spec file,

Driver increase memory utilization

2014-04-04 Thread Eduardo Costa Alfaia
Hi Guys, Could anyone help me understand the driver's behavior when I start JavaNetworkWordCount? computer8 16:24:07 up 121 days, 22:21, 12 users, load average: 0.66, 1.27, 1.55 total used free shared buffers cached Mem: 5897

Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Hi All, I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1, so perhaps it is already being addressed, but I am having a devil of a time with a Spark 0.9.0 client jar for Hadoop 2.X. If I go to the site and download: - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Rahul Singhal
Hi Erik, I am working with TOT branch-0.9 (0.9.1) and the following works for me for a maven build: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package And from

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Amit Tewari
I believe you need to set the following: SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is) and SPARK_YARN=true, then type sbt/sbt assembly. If you are using Maven to compile: mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package Hope this helps -A On Fri, Apr 4, 2014

Re: how to save RDD partitions in different folders?

2014-04-04 Thread Konstantin Kudryavtsev
Hi Evan, Could you please provide a code snippet? It's not clear to me: in Hadoop you need to use the addNamedOutput method, and I'm stuck on how to do the equivalent from Spark. Thank you, Konstantin Kudryavtsev On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote: Have a look
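For reference, a rough Scala sketch of the usual Spark-side answer to this question (not taken from the thread, untested; class and path names are invented): instead of MultipleOutputs.addNamedOutput, subclass Hadoop's old-API MultipleTextOutputFormat so that each record's key chooses the output subfolder.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner

    // Route each record into a subfolder named after its key,
    // e.g. part0/part-00000, part1/part-00000, ...
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      // Drop the key from the written line; keep only the value.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    val k = 4  // desired number of partitions/folders
    val pairs = sc.textFile("hdfs:///input")
      .map(line => ("part" + math.abs(line.hashCode % k), line))
      .partitionBy(new HashPartitioner(k))

    pairs.saveAsHadoopFile("hdfs:///output", classOf[Any], classOf[Any], classOf[KeyBasedOutput])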

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi, Can you explain a little more what's going on? Which one submits a job to the YARN cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our YARN cluster, and it seems to work that way. Shouldn't Client.scala be running

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Thanks all for the update - I have actually built using those options every which way I can think of, so perhaps this is something about how I upload the jar to our Artifactory repo server. Anyone have a working pom file for publishing a Spark 0.9 Hadoop 2.X jar to a Maven repo

Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, I have put this line in my spark-env.sh: -Dspark.default.parallelism=20 Is this parallelism level correct? The machine's processor is a dual core. Thanks

Re: Regarding Sparkcontext object

2014-04-04 Thread Daniel Siegmann
On Wed, Apr 2, 2014 at 7:11 PM, yh18190 yh18...@gmail.com wrote: Is it always necessary that the SparkContext object be created in the main method of a class? Can we create the sc object in another class and use it by passing the object through a function? The Spark context can
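For illustration only, a minimal sketch of the pattern being asked about (all names here are invented):

    import org.apache.spark.SparkContext

    // The SparkContext does not have to be created in main(); it can be
    // created anywhere and handed to other classes like any other object.
    class WordCounter(sc: SparkContext) {
      def count(path: String): Long =
        sc.textFile(path).flatMap(_.split(" ")).count()
    }

    object Main {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "sc-passing-example")
        println(new WordCounter(sc).count("input.txt"))  // same context, different class
        sc.stop()
      }
    }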

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-04 Thread Prasad
Hi Wisely, Could you please post your pom.xml here. Thanks

RAM Increase

2014-04-04 Thread Eduardo Costa Alfaia
Hi Guys, Could anyone explain this behavior to me? After 2 min of tests: computer1 - worker, computer10 - worker, computer8 - driver (master) computer1 18:24:31 up 73 days, 7:14, 1 user, load average: 3.93, 2.45, 1.14 total used free shared buffers cached

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi all, I have put this line

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-04 Thread Marcelo Vanzin
Hi Francis, This might be a long shot, but do you happen to have built Spark on an encrypted home dir? (I was running into the same error when I was doing that; rebuilding on an unencrypted disk fixed the issue. This is a known issue/limitation with ecryptfs.) It's weird that the build doesn't

How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
I'm trying to get a clear idea of how exceptions are handled in Spark. Is there somewhere I can read about this? I'm on Spark 0.7. For some reason I was under the impression that such exceptions are swallowed and the value that produced them is ignored, but the exception is logged. However,

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
In such a construct, each operator builds on the previous one, including any materialized results, etc. If I use SQL for each of them, I suspect the later SQL statements will not leverage the earlier ones in any way - hence this will be inefficient compared to the first approach. Let me know if this is not

Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia
What do you advise, Nicholas? On 4/4/14 19:05, Nicholas Chammas wrote: If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM,

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Btw, thank you for your help. On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.com wrote: Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
Minor typo in the example. The first SELECT statement should actually be: sql("SELECT * FROM src") Where `src` is a Hive table with schema (key INT, value STRING). On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.com wrote: In such a construct, each operator builds on the
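For readers following the thread, a rough sketch of the chained style under discussion; the exact DSL operators are an assumption based on the Spark SQL of that era, not taken from this thread:

    // Assumes `import sqlContext._` for the Symbol-based DSL.
    // Each step builds on the plan of the previous one instead of
    // starting over from a fresh SQL string.
    val src  = sql("SELECT * FROM src")  // SchemaRDD over (key INT, value STRING)
    val big  = src.where('key > 10)      // adds a filter to src's plan
    val vals = big.select('value)        // adds a projection on top
    vals.collect().foreach(println)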

Largest Spark Cluster

2014-04-04 Thread Parviz Deyhim
Spark community, What's the size of the largest Spark cluster ever deployed? I've heard Yahoo is running Spark on several hundred nodes but don't know the actual number. Can someone share? Thanks

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you want more parallelism, you need more cores. So, use a machine with more cores, or use a cluster of machines. spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the easiest way to do this. If you're stuck on a single machine with 2 cores, then set your default parallelism to
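In the 0.9-era configuration style used earlier in this thread, that would look something like the following sketch (not taken from the thread):

    import org.apache.spark.SparkContext

    // On a 2-core machine, parallelism beyond the core count mostly adds
    // scheduling overhead; match it to the hardware instead.
    System.setProperty("spark.default.parallelism", "2")  // 0.9-era config style
    val sc = new SparkContext("local[2]", "parallelism-example")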

Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia
FYI, one thing we’ve added now is support for reading multiple text files from a directory as separate records: https://github.com/apache/spark/pull/327. This should remove the need for mapPartitions discussed here. Avro and SequenceFiles look like they may not make it for 1.0, but there’s a
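Assuming the API shape that landed from that pull request (sc.wholeTextFiles), usage would look roughly like this, with an invented path:

    // One (path, content) pair per file, instead of one record per line.
    val files = sc.wholeTextFiles("hdfs:///data/docs")
    files.map { case (path, content) => (path, content.length) }
         .collect()
         .foreach(println)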

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Matei Zaharia
This can’t be done through the script right now, but you can do it manually as long as the cluster is stopped: just go into the AWS Console, right-click a slave and choose “launch more of these” to add more, or select multiple slaves and delete them. When you run

Re: How to create a RPM package

2014-04-04 Thread Rahul Singhal
Hi Christophe, Thanks for your reply and the spec file. I have solved my issue for now. I didn't want to rely on building Spark using the spec file (%build section), as I don't want to maintain the list of files that need to be packaged. I ended up adding maven build support to

Re: reduceByKeyAndWindow Java

2014-04-04 Thread Eduardo Costa Alfaia
Hi Tathagata, You are right, this code compiles, but I am having problems with high memory consumption. I sent some emails about this today, but have had no response so far. Thanks On 4/4/14 22:56, Tathagata Das wrote: I haven't really compiled the code, but it looks good to me. Why? Is there any
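For context, a minimal Scala sketch of the operation this thread is compiling (the DStream name is assumed, not from the thread):

    import org.apache.spark.streaming.Seconds

    // Word counts over a sliding 30s window, recomputed every 10s.
    // wordPairs is assumed to be a DStream[(String, Int)].
    val windowedCounts = wordPairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, Seconds(30), Seconds(10))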

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Andrew Or
Logging inside a map function shouldn't freeze things. The messages should be logged on the worker logs, since the code is executed on the executors. If you throw a SparkException, however, it'll be propagated to the driver after it has failed 4 or more times (by default). On Fri, Apr 4, 2014 at
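Putting the replies together, a hedged sketch of handling a bad record inside the function itself (invented example, untested):

    // Catch per-record failures instead of letting the task die; anything
    // printed here lands in the executor/worker logs, not the driver's.
    val parsed = lines.flatMap { line =>
      try Some(line.trim.toInt)
      catch {
        case e: NumberFormatException =>
          System.err.println("skipping bad record '" + line + "': " + e)
          None
      }
    }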

Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compression type setting for Snappy. Sent from my iPhone5s On April 4, 2014, at 23:06, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Can anybody suggest how to change the compression level (Record, Block) for Snappy? If it is possible, of course. Thank you in advance Thank
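For what it's worth, a sketch of how the codec is usually chosen, assuming a Spark version whose saveAsTextFile accepts a codec class (the path is invented):

    import org.apache.hadoop.io.compress.SnappyCodec

    // Snappy has no RECORD/BLOCK "type" knob of its own; you just pick the
    // codec. RECORD vs BLOCK is a SequenceFile notion, set via Hadoop config.
    rdd.saveAsTextFile("hdfs:///out-snappy", classOf[SnappyCodec])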

Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
All Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it possible to just run Spark unmodified (without the

Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a
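Concretely, with an invented mount point:

    // Works as long as /lustre is mounted at the same path on every node.
    val data = sc.textFile("file:///lustre/shared/input.txt")
    println(data.count())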

Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
Thanks, will take a look... Sent from my iPad On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu wrote: We use Avro objects in our project, and have a Kryo serializer for generic Avro SpecificRecords. Take a look at:
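This is not the serializer Frank linked, just a rough, untested sketch of the general idea - delegating Kryo (de)serialization of Avro SpecificRecords to Avro's own binary encoding so generated classes need not be Java-serializable:

    import java.io.ByteArrayOutputStream

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}
    import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}

    class AvroSpecificSerializer[T <: SpecificRecord] extends Serializer[T] {
      override def write(kryo: Kryo, output: Output, record: T): Unit = {
        // Encode the record with Avro, then frame it with a length prefix.
        val stream = new ByteArrayOutputStream()
        val encoder = EncoderFactory.get().binaryEncoder(stream, null)
        new SpecificDatumWriter[T](record.getSchema).write(record, encoder)
        encoder.flush()
        val bytes = stream.toByteArray
        output.writeInt(bytes.length)
        output.writeBytes(bytes)
      }

      override def read(kryo: Kryo, input: Input, clazz: Class[T]): T = {
        // Read the length-prefixed bytes back and let Avro decode them.
        val bytes = input.readBytes(input.readInt())
        val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
        new SpecificDatumReader[T](clazz).read(null.asInstanceOf[T], decoder)
      }
    }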

exactly once

2014-04-04 Thread Bharath Bhushan
Does Spark in general assure exactly-once semantics? What happens to those guarantees in the presence of updateStateByKey operations -- are they also assured to be exactly-once? Thanks manku.timma at outlook dot com
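For reference, a sketch of the operation in question - keeping a running count per key across batches (pairs is assumed to be a DStream[(String, Int)]):

    // Merge each batch's new values into the state carried across batches.
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + values.sum)
    val runningCounts = pairs.updateStateByKey[Int](updateFunc)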

Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
Hey Parviz, There was a similar thread a while ago... I think that many companies like to be discreet about the size of large clusters. But of course it would be great if people wanted to share openly :) For my part - I can say that Spark has been benchmarked on hundreds-of-nodes clusters before

Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
We might be able to incorporate the maven rpm plugin into our build. If that can be done in an elegant way it would be nice to have that distribution target for people who wanted to try this with arbitrary Spark versions... Personally I have no familiarity with that plug-in, so curious if anyone

Re: Spark on other parallel filesystems

2014-04-04 Thread Jeremy Freeman
We run Spark (in Standalone mode) on top of a network-mounted file system (NFS), rather than HDFS, and find it to work great. It required no modification or special configuration to set this up; as Matei says, we just point Spark to data using the file location. -- Jeremy On Apr 4, 2014, at

Re: Spark on other parallel filesystems

2014-04-04 Thread Anand Avati
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won't expose