Spark 1.0.0 release plan

2014-04-03 Thread Bhaskar Dutta
Hi, Is there any change in the release plan for Spark 1.0.0-rc1 release date from what is listed in the Proposal for Spark Release Strategy thread? == Tentative Release Window for 1.0.0 == Feb 1st - April 1st: General development April 1st: Code freeze for new features April 15th: RC1 Thanks,

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-03 Thread Kevin Markey
We are now testing precisely what you ask about in our environment. But Sandy's questions are relevant. The bigger issue is not Spark vs. Yarn but "client" vs. "standalone" and where the client is located on the network relative to the cluster. The "client" options

Re: Spark 1.0.0 release plan

2014-04-03 Thread Matei Zaharia
Hey Bhaskar, this is still the plan, though QAing might take longer than 15 days. Right now since we’ve passed April 1st, the only features considered for a merge are those that had pull requests in review before. (Some big ones are things like annotating the public APIs and simplifying

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-03 Thread Vipul Pandey
Any word on this one? On Apr 2, 2014, at 12:26 AM, Vipul Pandey vipan...@gmail.com wrote: I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus generated also has both the shaded and the real versions of the protobuf classes Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv

Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Jan-Paul Bultmann
Hey, Does somebody know the kinds of dependencies that the new SQL operators produce? I’m specifically interested in the relational join operation as it seems substantially more optimized. The old join was narrow on two RDDs with the same partitioner. Is the relational join narrow as well?

Re: Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Michael Armbrust
I'm sorry, but I don't really understand what you mean when you say wide in this context. For a HashJoin, the only dependencies of the produced RDD are the two input RDDs. For BroadcastNestedLoopJoin The only dependence will be on the streamed RDD. The other RDD will be distributed to all

Re: Optimal Server Design for Spark

2014-04-03 Thread Matei Zaharia
To run multiple workers with Spark’s standalone mode, set SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES in conf/spark-env.sh. For example, if you have 16 cores and want 2 workers, you could add export SPARK_WORKER_INSTANCES=2 export SPARK_WORKER_CORES=8 Matei On Apr 3, 2014, at 12:38 PM,

Re: Optimal Server Design for Spark

2014-04-03 Thread Debasish Das
@Mayur...I am hitting ulimits on the cluster if I go beyond 4 cores per worker, and I don't think I can change the ulimit due to sudo issues etc... If I have more workers, in ALS, I can go for 20 blocks (right now I am running 10 blocks on 10 nodes with 4 cores each, and now I can go up to 20 blocks

Re: module not found: org.eclipse.paho#mqtt-client;0.4.0

2014-04-04 Thread Sean Owen
This ultimately means a problem with SSL in the version of Java you are using to run SBT. If you look around the internet, you'll see a bunch of discussion, most of which seems to boil down to reinstall, or update, Java. -- Sean Owen | Director, Data Science | London On Fri, Apr 4, 2014 at 12:21

Explain Add Input

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, Could anyone explain the lines below to me? computer1 - worker, computer8 - driver (master) 14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5 KB, free: 540.3 MB) 14/04/04 14:24:56 INFO

RAM high consume

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, I am doing some tests using JavaNetworkWordCount and I have some questions about machine performance; my tests run for approximately 2 minutes. Why does the RAM usage decrease so markedly? I have run tests with 2 and 3 machines and observed the same behavior. What should I

how to save RDD partitions in different folders?

2014-04-04 Thread dmpour23
Hi all, Say I have an input file which I would like to partition using HashPartitioner k times. Calling rdd.saveAsTextFile(hdfs://); will save k files as part-0 part-k Is there a way to save each partition in specific folders? i.e. src part0/part-0

Re: Spark 1.0.0 release plan

2014-04-04 Thread Tom Graves
Do we have a list of things we really want to get in for 1.X? Perhaps move any JIRAs out to a 1.1 release if we aren't targeting them for 1.0. It might be nice to send out reminders when these dates are approaching. Tom On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com

Re: How to create a RPM package

2014-04-04 Thread Christophe Préaud
Hi Rahul, Spark will be available in Fedora 21 (see: https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently scheduled for 2014-10-14, but they have already produced spec files and source RPMs. If you are stuck with EL6 like me, you can have a look at the attached spec file,

Driver increase memory utilization

2014-04-04 Thread Eduardo Costa Alfaia
Hi Guys, Could anyone help me understand this driver behavior when I start the JavaNetworkWordCount? computer8 16:24:07 up 121 days, 22:21, 12 users, load average: 0.66, 1.27, 1.55 total used free shared buffers cached Mem: 5897

Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Hi All, I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps already being addressed, but I am having a devil of a time with a spark 0.9.0 client jar for hadoop 2.X. If I go to the site and download: - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Rahul Singhal
Hi Erik, I am working with TOT branch-0.9 (0.9.1) and the following works for me for a maven build: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package And from

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Amit Tewari
I believe you got to set following SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is) SPARK_YARN=true then type sbt/sbt assembly If you are using Maven to compile mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package Hope this helps -A On Fri, Apr 4, 2014

Re: how to save RDD partitions in different folders?

2014-04-04 Thread Konstantin Kudryavtsev
Hi Evan, Could you please provide a code snippet? It is not clear to me; in Hadoop you need to use the addNamedOutput method, and I'm stuck on how to do the same from Spark. Thank you, Konstantin Kudryavtsev On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote: Have a look
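
A minimal sketch of the kind of snippet being asked for, assuming a pair RDD whose String key names the target folder; the class name, paths, and the tab-separated key field are illustrative, not part of the thread:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner

    // Write each record under <basePath>/<key>/part-NNNNN instead of a flat part-NNNNN.
    class KeyAsFolderOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      // Drop the key from the written line; keep only the value.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    // sc is the SparkContext from spark-shell; the first tab-separated field names the folder.
    val pairs = sc.textFile("hdfs:///input").map(line => (line.split("\t")(0), line))
    pairs
      .partitionBy(new HashPartitioner(3))
      .saveAsHadoopFile("hdfs:///output", classOf[String], classOf[String], classOf[KeyAsFolderOutput])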

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi, Can you explain a little more what's going on? Which one submits a job to the yarn cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our yarn cluster and it seems to work that way. Shouldn't Client.scala be running

Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Thanks all for the update - I have actually built using those options every which way I can think of so perhaps this is something I am doing about how I upload the jar to our artifactory repo server. Anyone have a working pom file for the publish of a spark 0.9 hadoop 2.X publish to a maven repo

Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia
Hi all, I have put this line in my spark-env.sh: -Dspark.default.parallelism=20. Is this parallelism level correct? The machine's processor is a dual core. Thanks

Re: Regarding Sparkcontext object

2014-04-04 Thread Daniel Siegmann
On Wed, Apr 2, 2014 at 7:11 PM, yh18190 yh18...@gmail.com wrote: Is it always needed that the SparkContext object be created in the main method of a class? Is it necessary? Can we create the sc object in another class and use it by passing the object through a function? The Spark context can
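
A small sketch of the point being made, with made-up class and file names: the context is an ordinary object that can be created in one class and handed to another.

    import org.apache.spark.SparkContext

    // Any class can accept an existing SparkContext and use it.
    class WordCounter(sc: SparkContext) {
      def countWords(path: String): Long =
        sc.textFile(path).flatMap(_.split("\\s+")).count()
    }

    object Main {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "PassingScAround")
        val counter = new WordCounter(sc)   // sc handed to another class
        println(counter.countWords(args(0)))
        sc.stop()
      }
    }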

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-04 Thread Prasad
Hi Wisely, Could you please post your pom.xml here. Thanks

RAM Increase

2014-04-04 Thread Eduardo Costa Alfaia
Hi guys, Could anyone explain this behavior to me? After 2 minutes of tests: computer1 - worker, computer10 - worker, computer8 - driver (master). computer1 18:24:31 up 73 days, 7:14, 1 user, load average: 3.93, 2.45, 1.14 total used free shared buffers cached

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi all, I have put this line

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-04 Thread Marcelo Vanzin
Hi Francis, This might be a long shot, but do you happen to have built spark on an encrypted home dir? (I was running into the same error when I was doing that. Rebuilding on an unencrypted disk fixed the issue. This is a known issue / limitation with ecryptfs. It's weird that the build doesn't

How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
I'm trying to get a clear idea about how exceptions are handled in Spark. Is there somewhere I can read about this? I'm on Spark 0.7. For some reason I was under the impression that such exceptions are swallowed, the value that produced them is ignored, and the exception is logged. However,

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
In such construct, each operator builds on the previous one, including any materialized results etc. If I use a SQL for each of them, I suspect the later SQLs will not leverage the earlier SQLs by any means - hence these will be inefficient to first approach. Let me know if this is not

Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia
What do you advise, Nicholas? On 4/4/14, 19:05, Nicholas Chammas wrote: If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM,

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Btw, thank you for your help. On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.com wrote: Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
Minor typo in the example. The first SELECT statement should actually be: sql("SELECT * FROM src"), where `src` is a Hive table with schema (key INT, value STRING). On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.com wrote: In such construct, each operator builds on the
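
For readers landing here, a rough sketch of the expression style under discussion, assuming a Hive table `src(key INT, value STRING)` is already registered and that `sc` is an existing SparkContext; exact imports and operators may differ slightly between releases:

    val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveCtx._   // brings in sql() and the Symbol-based expression DSL

    val all = sql("SELECT * FROM src")               // a SchemaRDD
    val small = all.where('key > 10).select('value)  // the same data narrowed via SchemaRDD methods
    small.collect().foreach(println)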

Largest Spark Cluster

2014-04-04 Thread Parviz Deyhim
Spark community, What's the size of the largest Spark cluster ever deployed? I've heard Yahoo is running Spark on several hundred nodes but don't know the actual number. Can someone share? Thanks

Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you want more parallelism, you need more cores. So, use a machine with more cores, or use a cluster of machines. spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the easiest way to do this. If you're stuck on a single machine with 2 cores, then set your default parallelism to

Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia
FYI, one thing we’ve added now is support for reading multiple text files from a directory as separate records: https://github.com/apache/spark/pull/327. This should remove the need for mapPartitions discussed here. Avro and SequenceFiles look like they may not make it for 1.0, but there’s a

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Matei Zaharia
This can’t be done through the script right now, but you can do it manually as long as the cluster is stopped. If the cluster is stopped, just go into the AWS Console, right click a slave and choose “launch more of these” to add more. Or select multiple slaves and delete them. When you run

Re: How to create a RPM package

2014-04-04 Thread Rahul Singhal
Hi Christophe, Thanks for your reply and the spec file. I have solved my issue for now. I didn't want to rely on building Spark using the spec file (%build section), as I don't want to be maintaining the list of files that need to be packaged. I ended up adding maven build support to

Re: reduceByKeyAndWindow Java

2014-04-04 Thread Eduardo Costa Alfaia
Hi Tathagata, You are right, this code compiles, but I am having some problems with high memory consumption; I sent some email about this today, but no response so far. Thanks On 4/4/14, 22:56, Tathagata Das wrote: I haven't really compiled the code, but it looks good to me. Why? Is there any

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Andrew Or
Logging inside a map function shouldn't freeze things. The messages should be logged on the worker logs, since the code is executed on the executors. If you throw a SparkException, however, it'll be propagated to the driver after it has failed 4 or more times (by default). On Fri, Apr 4, 2014 at
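
A common pattern for keeping track of per-record failures without digging through executor logs, sketched here with Scala's Try; `lines` is assumed to be an RDD[String] of numeric strings, and the parse is only an example:

    import scala.util.{Try, Success, Failure}

    val parsed = lines.map { line =>
      Try(line.toInt) match {
        case Success(n)  => Right(n)
        case Failure(ex) => Left(s"bad record '$line': ${ex.getMessage}")  // recorded, not thrown
      }
    }

    val errors = parsed.collect { case Left(msg) => msg }   // RDD of error messages
    val values = parsed.collect { case Right(n) => n }      // RDD of good values
    errors.take(10).foreach(println)                        // inspect failures on the driver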

Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compression type for Snappy. Sent from my iPhone 5s On April 4, 2014, at 23:06, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Can anybody suggest how to change the compression level (Record, Block) for Snappy, if it is possible? Thank you in advance. Thank

Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
All, Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it just possible to run Spark unmodified (without the

Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a
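
For concreteness, reading from a cluster-wide mount that way looks like the following; the mount point and file name are made up:

    // /mnt/lustre must be mounted at the same path on the driver and on every worker.
    val lines = sc.textFile("file:///mnt/lustre/datasets/events.log")
    println(lines.count())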

Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
Thanks will take a look... Sent from my iPad On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu wrote: We use avro objects in our project, and have a Kryo serializer for generic Avro SpecificRecords. Take a look at:

exactly once

2014-04-04 Thread Bharath Bhushan
Does spark in general assure exactly once semantics? What happens to those guarantees in the presence of updateStateByKey operations -- are they also assured to be exactly once? Thanks manku.timma at outlook dot com
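
For reference, the operation being asked about looks like this (a running word count, as in the StatefulNetworkWordCount example); the socket source, host, port, and batch interval are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("checkpoint")   // updateStateByKey requires a checkpoint directory
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // Add this batch's counts to whatever state was accumulated so far.
    val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
      Some(newValues.sum + runningCount.getOrElse(0))
    val runningCounts = words.map(w => (w, 1)).updateStateByKey[Int](updateFunc)
    runningCounts.print()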

Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
Hey Parviz, There was a similar thread a while ago... I think that many companies like to be discreet about the size of large clusters. But of course it would be great if people wanted to share openly :) For my part - I can say that Spark has been benchmarked on hundreds-of-nodes clusters before

Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
We might be able to incorporate the maven rpm plugin into our build. If that can be done in an elegant way it would be nice to have that distribution target for people who wanted to try this with arbitrary Spark versions... Personally I have no familiarity with that plug-in, so curious if anyone

Re: Spark on other parallel filesystems

2014-04-04 Thread Jeremy Freeman
We run Spark (in Standalone mode) on top of a network-mounted file system (NFS), rather than HDFS, and find it to work great. It required no modification or special configuration to set this up; as Matei says, we just point Spark to data using the file location. -- Jeremy On Apr 4, 2014, at

Re: Spark on other parallel filesystems

2014-04-04 Thread Anand Avati
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won't expose

Redirect Incubator pages

2014-04-05 Thread Andrew Ash
I occasionally see links to pages in the spark.incubator.apache.org domain. Can we HTTP 301 redirect that whole domain to spark.apache.org now that the project has graduated? The content seems identical. That would also make the eventual decommission of the incubator domain much easier as usage

Re: Redirect Incubator pages

2014-04-05 Thread Pat McDonough
I'm looking forward to that myself! Seems to be hung up with Apache infrastructure though. https://issues.apache.org/jira/plugins/servlet/mobile#issue/INFRA-7398 On Apr 4, 2014 11:19 PM, Andrew Ash and...@andrewash.com wrote: I occasionally see links to pages in the

Re: Heartbeat exceeds

2014-04-05 Thread Debasish Das
@patrick I think there is a bug...when this timeout happens then suddenly I see some negative ms numbers in the Spark UI...I tried to send a pic showing the negative ms numbers but it was rejected by the mailing list...I will send it to your gmail... From the archive I saw some more suggestions: It

Re: Heartbeat exceeds

2014-04-05 Thread azurecoder
Interested in a resolution to this. I'm building a large triangular matrix so doing similar to ALS - lots of work on the worker nodes and keep timing out. Tried a few updates to akka frame sizes, timeouts and blockmanager but unable to complete. Will try the blockmanagerslaves property now and

Re: Heartbeat exceeds

2014-04-05 Thread Debasish Das
From the documentation this is what I understood: 1. spark.worker.timeout: Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats. default: 60 I increased it to be 600 It was pointed before that if there is GC overload and the worker

Re: Spark on other parallel filesystems

2014-04-05 Thread Christopher Nguyen
Avati, depending on your specific deployment config, there can be up to a 10X difference in data loading time. For example, we routinely parallel load 10+GB data files across small 8-node clusters in 10-20 seconds, which would take about 100s if bottlenecked over a 1GigE network. That's about the

Re: How to create a RPM package

2014-04-05 Thread Will Benton
Hi Rahul, As Christophe pointed out, Spark has been in Fedora Rawhide (which will become Fedora 21) for a little while now. (I haven't announced it here because Rawhide is a little too bleeding-edge for most end-users.) With native packages of any kind, there are a couple of considerations:

Re: Heartbeat exceeds

2014-04-05 Thread Debasish Das
This does not seem to help: export SPARK_JAVA_OPTS=-Dspark.local.dir=/app/spark/tmp -Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.storage.blockManagerSlaveTimeoutMs=30 Getting the message leads to GC failure followed by master declaring the worker as dead ! This is related to

Re: Heartbeat exceeds

2014-04-05 Thread Andrew Or
Setting spark.worker.timeout should not help you. What this value means is that the master checks every 60 seconds whether the workers are still alive, as the documentation describes. But this value also determines how often the workers send HEARTBEAT messages to notify the master of their

Re: Spark on other parallel filesystems

2014-04-05 Thread Venkat Krishnamurthy
Christopher Just to clarify - by 'load ops' do you mean RDD actions that result in IO? Venkat From: Christopher Nguyen c...@adatao.com Reply-To: user@spark.apache.org Date: Saturday, April 5,

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-06 Thread Rafal Kwasny
Hi, This will work nicely unless you're using spot instances; in that case the start does not work, as slaves are lost on shutdown. I feel like the spark-ec2 script needs a major refactor to cope with new features/more users using it in dynamic environments. Are there any current plans to migrate it to

Spark Worker in different machine doesnt work

2014-04-06 Thread subacini Arunkumar
Hi All, I am using spark-0.9.0 and am able to run my program successfully if the Spark master and worker are on the same machine. If I run the same program with the Spark master on machine A and the worker on machine B, I get the below exception. I am running the program with java -cp ... instead of the scala command

Re: any work around to support nesting of RDDs other than join

2014-04-06 Thread nkd
It worked when I converted the nested RDD to an array -- case class TradingTier(tierId: String, lowerLimit: Int, upperLimit: Int, transactionFees: Double) //userTransactions Seq[(accountId,numTransactions)] val userTransactionsRDD =
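
A sketch of the same workaround in a more general form, assuming the inner data set is small enough to collect and broadcast; the sample tiers and accounts below are made up, only `TradingTier` and `userTransactionsRDD` come from the message:

    // Collect the small RDD once and broadcast it, instead of referencing it
    // inside another RDD's transformation (which Spark does not allow).
    case class TradingTier(tierId: String, lowerLimit: Int, upperLimit: Int, transactionFees: Double)

    val tiersRDD = sc.parallelize(Seq(TradingTier("t1", 0, 10, 1.0), TradingTier("t2", 11, 100, 0.5)))
    val userTransactionsRDD = sc.parallelize(Seq(("acct1", 5), ("acct2", 42)))

    val tiers = sc.broadcast(tiersRDD.collect())   // assumed small enough for the driver

    val fees = userTransactionsRDD.map { case (accountId, numTransactions) =>
      val tier = tiers.value.find(t => numTransactions >= t.lowerLimit && numTransactions <= t.upperLimit)
      (accountId, tier.map(_.transactionFees).getOrElse(0.0))
    }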

hang caused by memory threshold?

2014-04-06 Thread Stuart Zakon
I am seeing a small standalone cluster (master, slave) hang when I reach a certain memory threshold, but I cannot detect how to configure memory to avoid this. I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this allocated, but it does not help. The reduce is by key to get

Re: How to create a RPM package

2014-04-06 Thread Rahul Singhal
Hi Will, For issue #2 I was concerned that the build packaging had to be internal. So I am using the already packaged make-distribution.sh (modified to use a maven build) to create a tar ball, which I then package using an RPM spec file. On a side note, it would be interesting to learn

Re: how to save RDD partitions in different folders?

2014-04-07 Thread dmpour23
Can you provide an example?

hang on sorting operation

2014-04-07 Thread Stuart Zakon
I am seeing a small standalone cluster (master, slave) hang when I reach a certain memory threshold, but I cannot detect how to configure memory to avoid this. I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this allocated, but it does not help. The reduce is by key to get

Recommended way to develop spark application with both java and python

2014-04-07 Thread Wush Wu
Dear all, We have a spark 0.8.1 cluster on mesos 0.15. Some of my colleagues are familiar with Python, but some features are developed in Java. I am looking for a way to integrate Java and Python on Spark. I notice that the initialization of pyspark does not include a field to distribute

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
Hi, Any thoughts on this? Thanks. -Suren On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations
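
For reference, the kind of persist call being discussed, with an explicit storage level that spills to local disk; the input path and key field are made up:

    import org.apache.spark.storage.StorageLevel

    // Keep the grouped data in memory, spilling partitions to local disk when they do not fit.
    val pairs = sc.textFile("hdfs:///input").map(line => (line.split("\t")(0), line))
    val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)
    grouped.count()   // the first action materializes and persists the RDD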

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark, Should I assume that Shark users should not use the Shark APIs, since there is no documentation for them? If there is documentation, can you point it out? Best Regards, Jerry On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote: Hello everyone, I have

Require some clarity on partitioning

2014-04-07 Thread Sanjay Awatramani
Hi, I was going through Matei's Advanced Spark presentation at https://www.youtube.com/watch?v=w0Tisli7zn4 , and had a few questions. The slides for this video are at http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf The PageRank example

Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Hi All, I wanted to get Spark on YARN up and running. I did *SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly* Then i ran *SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar

Re: How to create a RPM package

2014-04-07 Thread Will Benton
For issue #2 I was concerned that the build packaging had to be internal. So I am using the already packaged make-distribution.sh (modified to use a maven build) to create a tar ball which I then package it using a RPM spec file. Hi Rahul, so the issue for downstream operating system

PySpark SocketConnect Issue in Cluster

2014-04-07 Thread Surendranauth Hiraman
Hi, We have a situation where a Pyspark script works fine as a local process (local url) on the Master and the Worker nodes, which would indicate that all python dependencies are set up properly on each machine. But when we try to run the script at the cluster level (using the master's url), if

Re: Spark Disk Usage

2014-04-07 Thread Surendranauth Hiraman
It might help if I clarify my questions. :-) 1. Is persist() applied during the transformation right before the persist() call in the graph? Or is it applied after the transform's processing is complete? In the case of things like GroupBy, is the Seq backed by disk as it is being created? We're

Re: reduceByKeyAndWindow Java

2014-04-07 Thread Eduardo Costa Alfaia
Hi TD, Could you explain this part of the code to me? .reduceByKeyAndWindow(new Function2<Integer, Integer, Integer>() { public Integer call(Integer i1, Integer i2) { return i1 + i2; } }, new Function2<Integer, Integer, Integer>() { public
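
The two Function2 arguments are the reduce function and its inverse reduce function. A small Scala sketch of the same call follows; the socket source, host, port, and the window and slide durations are made up:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("checkpoint")   // the inverse-reduce form requires checkpointing
    val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

    // First function: add counts entering the window; second: subtract counts leaving it.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      (a: Int, b: Int) => a - b,
      Seconds(30), Seconds(10))
    windowedCounts.print()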

AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi all, On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh. Also, it is the default user for the Spark-EC2 script. Currently, the Amazon Linux images have an 'ec2-user' set up for ssh instead of 'root'. I can see that the Spark-EC2 script allows you to specify which user to

SparkContext.addFile() and FileNotFoundException

2014-04-07 Thread Thierry Herrmann
Hi, I'm trying to use SparkContext.addFile() to propagate a file to worker nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the master). I don't have HDFS or any distributed file system. Just playing with basic stuff. Here's the code in my driver (actually spark-shell
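
For reference, the usual pairing of addFile on the driver with SparkFiles.get inside tasks; the file path below is made up:

    import org.apache.spark.SparkFiles

    sc.addFile("/home/me/lookup.txt")          // driver side: register the file

    val sizes = sc.parallelize(1 to 4).map { i =>
      // worker side: resolve the local copy shipped by addFile
      val path = SparkFiles.get("lookup.txt")
      scala.io.Source.fromFile(path).getLines().length
    }
    sizes.collect()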

Re: Status of MLI?

2014-04-07 Thread Evan R. Sparks
That work is under submission at an academic conference and will be made available if/when the paper is published. In terms of algorithms for hyperparameter tuning, we consider Grid Search, Random Search, a couple of older derivative-free optimization methods, and a few newer methods - TPE (aka

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Yana Kadiyska
I might be wrong here but I don't believe it's discouraged. Maybe part of the reason there's not a lot of examples is that sql2rdd returns an RDD (TableRDD that is https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala). I haven't done anything too complicated yet but

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Marco Costantini
Hi Shivaram, OK so let's assume the script CANNOT take a different user and that it must be 'root'. The typical workaround is, as you said, to allow ssh with the root user. Now, don't laugh, but this worked last Friday, and today (Monday) it no longer works. :D Why? ...It seems that NOW,

Driver Out of Memory

2014-04-07 Thread Eduardo Costa Alfaia
Hi guys, I would like to understand why the driver's RAM goes down. Does the processing occur only in the workers? Thanks # Start Tests computer1 (Worker/Source Stream) 23:57:18 up 12:03, 1 user, load average: 0.03, 0.31, 0.44 total used free shared

Re: AWS Spark-ec2 script with different user

2014-04-07 Thread Shivaram Venkataraman
Hmm -- That is strange. Can you paste the command you are using to launch the instances ? The typical workflow is to use the spark-ec2 wrapper script using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html Shivaram On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini

CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating version 0.9.0 without any Hadoop at all, and need some help. I run into the following error with the StatefulNetworkWordCount example (and similarly in my prototype app, when I use the updateStateByKey operation). I get

Re: Creating a SparkR standalone job

2014-04-07 Thread pawan kumar
Thanks Shivaram! Will give it a try and let you know. Regards, Pawan Venugopal On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: You can create standalone jobs in SparkR as just R files that are run using the sparkR script. These commands will be sent

job offering

2014-04-07 Thread Rault, Severan
Hi, I am looking for users of Spark to join my teams here at Amazon. If you are reading this you probably qualify. I am looking for developers of ANY level, but with an interest in Spark. My teams are leveraging Spark to solve real business scenarios. If you are interested, just shoot me a note

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Tathagata Das
Few things that would be helpful. 1. Environment settings - you can find them on the environment tab in the Spark application UI 2. Are you setting the HDFS configuration correctly in your Spark program? For example, can you write a HDFS file from a Spark program (say spark-shell) to your HDFS

RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
Any reason why RDDInfo suddenly became private in SPARK-1132? We are using it to show users the status of RDDs

Re: RDDInfo visibility SPARK-1132

2014-04-07 Thread Koert Kuipers
ok yeah we are using StageInfo and TaskInfo too... On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote: Hi Koert, Other users have expressed interest for us to expose similar classes too (i.e. StageInfo, TaskInfo). In the newest release, they will be available as part of

RE: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Paul Mogren
1.: I will paste the full content of the environment page of the example application running against the cluster at the end of this message. 2. and 3.: Following #2 I was able to see that the count was incorrectly 0 when running against the cluster, and following #3 I was able to get the

Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-07 Thread Francis . Hu
Great!!! When I built it on another disk formatted as ext4, it works now. hadoop@ubuntu-1:~$ df -Th Filesystem Type Size Used Avail Use% Mounted on /dev/sdb6 ext4 135G 8.6G 119G 7% / udev devtmpfs 7.7G 4.0K 7.7G 1% /dev tmpfs

Re: trouble with join on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.edu wrote: I am running the latest version of PySpark branch-0.9 and having some trouble with join. One RDD is about 100G (25GB compressed and serialized in memory) with 130K records, the other RDD is about 10G (2.5G

[BLOG] For Beginners

2014-04-07 Thread prabeesh k
Hi all, I am sharing a blog post for beginners about creating a standalone Spark Streaming application and bundling the app as a single runnable jar. Take a look and drop your comments on the blog page. http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html

Mongo-Hadoop Connector with Spark

2014-04-07 Thread Pavan Kumar
Hi everyone, I saved a 2GB PDF file into MongoDB using GridFS. Now I want to process that GridFS collection data using Java Spark MapReduce. Previously I successfully processed normal MongoDB collections (not GridFS) with Apache Spark using the Mongo-Hadoop connector. Now I'm unable to handle the input

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-08 Thread Tathagata Das
Yes, that is correct. If you are executing a Spark program across multiple machines, then you need to use a distributed file system (HDFS API compatible) for reading and writing data. In your case, your setup spans multiple machines. So what is probably happening is that the RDD data is
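
In practice that means pointing the checkpoint directory at an HDFS URL rather than a local path; the host, port, and path here are made up, and `ssc` is the StreamingContext:

    // A local path only works when everything runs on one machine. Across machines,
    // the checkpoint directory must be on a filesystem every node can reach.
    ssc.checkpoint("hdfs://namenode:8020/user/paul/checkpoints")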

How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Adnan
Hello, I am running a Cloudera 4-node cluster with 1 master and 3 slaves. I am connecting to the Spark master from Scala using SparkContext. I am trying to execute a simple Java function from the distributed jar on every Spark worker but haven't found a way to communicate with each worker or a Spark
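
One common workaround, sketched under the assumption that running the function once per partition is acceptable: spread a dummy RDD over many partitions and invoke the function in foreachPartition. This is best-effort; there is no strict guarantee that every worker receives a partition. The called class and method are hypothetical stand-ins for the function in the distributed jar.

    val slices = sc.defaultParallelism * 4
    sc.parallelize(1 to slices, slices).foreachPartition { _ =>
      com.example.MyUtil.initNativeLib()   // hypothetical function from the distributed jar
    }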

Only TraversableOnce?

2014-04-08 Thread wxhsdp
In my application, data parts inside an RDD partition have relations, so I need to do some operations between them. For example, RDD T1 has several partitions; each partition has three parts A, B and C. Then I transform T1 to T2. After the transform, T2 also has three parts D, E and F, where D = A+B, E =

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
So the data structure looks like: D consists of D1, D2, D3 (DX is a partition) and DX consists of d1, d2, d3 (dx is a part in your context)? What you want to do is transform DX to (d1 + d2, d1 + d3, d2 + d3)? Best, -- Nan Zhu On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
Yes, how can I do this conveniently? I can use filter, but there will be so many RDDs and it's not concise.

Re: Only TraversableOnce?

2014-04-08 Thread Nan Zhu
If that's the case, I think mapPartitions is what you need, but it seems that you have to load the partition into memory as a whole via toArray: rdd.mapPartitions{D => {val p = D.toArray; ...}} -- Nan Zhu On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote: yes, how can i do this
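
A slightly fuller sketch of that idea, assuming a numeric RDD whose partitions each hold exactly three parts A, B, C and fit in memory; the sample data and the specific sums are illustrative:

    // Combine parts within each partition without shuffling across partitions.
    val rdd = sc.parallelize(1 to 9, 3)          // 3 partitions of 3 parts each
    val combined = rdd.mapPartitions { iter =>
      val parts = iter.toArray                   // load the partition as a whole
      Iterator(parts(0) + parts(1),              // D = A + B
               parts(0) + parts(2),              // E = A + C
               parts(1) + parts(2))              // F = B + C
    }
    combined.collect()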
