Re: NPE using saveAsTextFile

2014-04-10 Thread Matei Zaharia
I haven’t seen this but it may be a bug in Typesafe Config, since this is serializing a Config object. We don’t actually use Typesafe Config ourselves. Do you have any nulls in the data itself by any chance? And do you know how that Config object is getting there? Matei On Apr 9, 2014, at

Re: NPE using saveAsTextFile

2014-04-10 Thread Nick Pentreath
Ok, I thought it might be closing over the config object. I am using config for job configuration, but extracting vals from it, so I'm not sure why, as I thought I'd avoided closing over it. Will go back to the source and see where it is creeping in. On Thu, Apr 10, 2014 at 8:42 AM, Matei Zaharia
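For readers hitting the same NPE, a minimal sketch of the pattern being described here: read plain vals out of the Config on the driver so the closure captures only serializable primitives, never the Config object itself. The config keys, paths, and class name are illustrative, not from the thread:

  import com.typesafe.config.ConfigFactory
  import org.apache.spark.SparkContext

  object ConfigSafeJob {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "config-safe-job")
      val config = ConfigFactory.load()
      // Pull plain values out on the driver: an Int and a String
      // serialize fine, whereas closing over `config` would ship
      // the whole Config object to the workers.
      val minLength  = config.getInt("job.min.length")
      val outputPath = config.getString("job.output.path")
      sc.textFile("hdfs:///input/data")
        .filter(_.length > minLength)
        .saveAsTextFile(outputPath)
    }
  }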

Re: Where does println output go?

2014-04-10 Thread wxhsdp
rdd.foreach(p => { print(p) }) The above closure gets executed on the workers, so you need to look at the workers' logs to see the output. But if I'm in local mode, where are the logs of the local driver? There are no /logs or /work dirs in $SPARK_HOME like the ones set up in standalone mode.
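For what it's worth, a small sketch of what happens in local mode, assuming the spark-shell's built-in sc with a local[n] master: the "workers" are threads inside the driver JVM, so the output lands on the console you launched the shell from rather than in any worker log.

  val rdd = sc.parallelize(1 to 5)
  // In local mode this runs in threads of the driver JVM,
  // so it prints straight to the driver's stdout.
  rdd.foreach(p => print(p))
  // Cluster-safe alternative: collect and print on the driver.
  rdd.collect().foreach(println)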

Re: Shark CDH5 Final Release

2014-04-10 Thread chutium
Hi, you can take a look here: http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html

Re: Pig on Spark

2014-04-10 Thread Konstantin Kudryavtsev
Hi Mayur, I wondered if you could share your findings in some way (GitHub, blog post, etc.). I guess your experience would be very interesting/useful for many people. On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi Ankit, Thanx for all the work

Re: How does Spark handle RDD via HDFS ?

2014-04-10 Thread gtanguy
Yes, that helps me better understand how Spark works. But it is also what I was afraid of: I think the network communication will take too much time for my job. I will keep looking for a trick to avoid network communication. I saw on the Hadoop website that: To minimize global

Re: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Adnan
You need to use a proper HDFS URI with saveAsTextFile. For example: rdd.saveAsTextFile("hdfs://NameNode:Port/tmp/Iris/output.tmp") Regards, Adnan. Asaf Lahav wrote: Hi, We are using Spark with data files on HDFS. The files are stored as files for a predefined hadoop user (hdfs). The folder is

Re: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Adnan
Then the problem is not on the Spark side. You have three options; choose any one of them: 1. Change permissions on the /tmp/Iris folder from the shell on the NameNode with the hdfs dfs -chmod command. 2. Run your Hadoop service as the hdfs user. 3. Disable dfs.permissions in conf/hdfs-site.xml. Regards, Adnan avito

Fwd: Spark - ready for prime time?

2014-04-10 Thread Andras Nemeth
Hello Spark Users, With the recent graduation of Spark to a top-level project (grats, btw!), maybe a well-timed question. :) We are at the very beginning of a large-scale big data project, and after two months of exploration work we'd like to settle on the technologies to use, roll up our sleeves

Re: Spark - ready for prime time?

2014-04-10 Thread Debasish Das
When you say Spark is one of the forerunners for our technology choice, what are the other options you are looking into? I started cross-validation runs for a 40-core, 160 GB Spark job using a script... I woke up in the morning and none of the jobs had crashed! And the project just came out of incubation

RE: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Shao, Saisai
Hi Asaf, The user who runs SparkContext is decided by the code below in SparkContext. Normally this user.name is the user who started the JVM; you can start your application with -Duser.name=xxx to specify the username you want, and that username will be the one used to communicate with HDFS. val
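A minimal sketch of the launch-time override Saisai describes. The flag has to reach the JVM at startup; the class name and HDFS paths here are illustrative:

  import org.apache.spark.SparkContext

  // Launched with, e.g.:  java -Duser.name=hdfs -cp ... HdfsAsUser
  object HdfsAsUser {
    def main(args: Array[String]) {
      // SparkContext picks user.name up at construction time.
      println("Running as: " + System.getProperty("user.name"))
      val sc = new SparkContext("local[2]", "hdfs-as-user")
      sc.textFile("hdfs://namenode:8020/tmp/Iris/input")
        .saveAsTextFile("hdfs://namenode:8020/tmp/Iris/output.tmp")
    }
  }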

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Spark has been endorsed by Cloudera as the successor to MapReduce. That says a lot... On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: Hello Spark Users, With the recent graduation of Spark to a top level project (grats, btw!), maybe a well timed

Spark 0.9.1 PySpark ImportError

2014-04-10 Thread aazout
I am getting a Python ImportError on a Spark standalone cluster. I have set the PYTHONPATH on both master and slave, and the package imports properly when I run the PySpark command line on both machines. This only happens with master-to-slave communication. Here is the error below: 14/04/10 13:40:19

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Here are several good ones: https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8 On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com wrote: Do you have the link to the Cloudera

Re: Spark - ready for prime time?

2014-04-10 Thread Sean Owen
Mike Olson's comment: http://vision.cloudera.com/mapreduce-spark/ Here's the partnership announcement: http://databricks.com/blog/2013/10/28/databricks-and-cloudera-partner-to-support-spark.html On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira ianferre...@hotmail.com wrote: Do you have the

Re: Spark - ready for prime time?

2014-04-10 Thread Alex Boisvert
I'll provide answers from our own experience at Bizo. We've been using Spark for 1+ year now and have found it generally better than previous approaches (Hadoop + Hive mostly). On Thu, Apr 10, 2014 at 7:11 AM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: I. Is it too much magic? Lots

Re: Pig on Spark

2014-04-10 Thread Mayur Rustagi
Bam !!! http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Apr 10, 2014 at 3:07 AM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com

Re: is it possible to initiate Spark jobs from Oozie?

2014-04-10 Thread Mayur Rustagi
I don't think it'll do failure detection etc. of the Spark job in Oozie as of yet. You should be able to trigger it from Oozie (worst case, as a shell script). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Apr 10, 2014 at

Re: Spark on YARN performance

2014-04-10 Thread Flavio Pompermaier
Thank you for the reply, Mayur. It would be nice to have a comparison about that; I hope one day it will be available, or that I'll have the time to test it myself :) So you're using Mesos for the moment, right? What are the main differences in your experience? YARN seems to be more flexible and

Re: Spark operators on Objects

2014-04-10 Thread Flavio Pompermaier
Probably for the XML case the best resources I found are http://stevenskelton.ca/real-time-data-mining-spark/ and http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/ . And what about JSON? What if I have to work with JSON and want to use the fasterxml implementation?
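Not from this thread, but a common pattern for using fasterxml/Jackson inside Spark is to build one ObjectMapper per partition, since the mapper itself isn't serializable. A sketch assuming jackson-databind on the classpath, one JSON object per line, and an illustrative path and field name:

  import com.fasterxml.jackson.databind.ObjectMapper
  import org.apache.spark.SparkContext

  object JsonLines {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "json-lines")
      val names = sc.textFile("hdfs:///data/events.json").mapPartitions { lines =>
        // One mapper per partition: ObjectMapper is not serializable,
        // so don't close over a driver-side instance.
        val mapper = new ObjectMapper()
        lines.map(line => mapper.readTree(line).path("name").asText())
      }
      names.take(10).foreach(println)
    }
  }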

Re: NPE using saveAsTextFile

2014-04-10 Thread Nick Pentreath
There was a closure over the config object lurking around - but in any case, upgrading config to 1.2.0 did the trick, as it seems to have been a bug in Typesafe Config. Thanks Matei! On Thu, Apr 10, 2014 at 8:46 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Ok I thought it may be

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Ash
The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often still get OOMs. I had to carefully modify some of the space tuning parameters

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
I would echo much of what Andrew has said. I manage a small/medium sized cluster (48 cores, 512G ram, 512G disk space dedicated to spark, data storage in separate HDFS shares). I've been using spark since 0.7, and as with Andrew I've observed significant and consistent improvements in stability

Re: Spark - ready for prime time?

2014-04-10 Thread Dmitriy Lyubimov
On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often

/bin/java not found: JAVA_HOME ignored launching shark executor

2014-04-10 Thread Ken Ellinwood
14/04/10 08:00:42 INFO AppClient$ClientActor: Executor added: app-20140410080041-0017/9 on worker-20140409145028-ken-VirtualBox-39159 (ken-VirtualBox:39159) with 4 cores 14/04/10 08:00:42 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140410080041-0017/9 on hostPort

Re: Spark - ready for prime time?

2014-04-10 Thread Roger Hoover
Can anyone comment on their experience running Spark Streaming in production? On Thu, Apr 10, 2014 at 10:33 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: On Thu, Apr 10, 2014 at 9:24 AM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come across is that the cluster is

Re: /bin/java not found: JAVA_HOME ignored launching shark executor

2014-04-10 Thread Ken Ellinwood
Sorry, I forgot to mention this is spark-0.9.1 and shark-0.9.1. Ken On Thursday, April 10, 2014 9:02 AM, Ken Ellinwood kellinw...@yahoo.com wrote: 14/04/10 08:00:42 INFO AppClient$ClientActor: Executor added: app-20140410080041-0017/9 on worker-20140409145028-ken-VirtualBox-39159

Re: Spark - ready for prime time?

2014-04-10 Thread Andrew Or
Here are answers to a subset of your questions: 1. Memory management: The general direction of these questions is whether it's possible to take RDD-caching-related memory management more into our own hands, as LRU eviction is nice most of the time but can be very suboptimal in some of our use
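A sketch of the manual control that already exists today: an explicit storage level plus a deterministic unpersist, instead of waiting for LRU eviction. The dataset, path, and thresholds are made up for illustration:

  import org.apache.spark.SparkContext
  import org.apache.spark.storage.StorageLevel

  object CacheControl {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "cache-control")
      // Spill cached partitions to disk rather than dropping them.
      val parsed = sc.textFile("hdfs:///data/big.txt")
        .map(_.split("\t"))
        .persist(StorageLevel.MEMORY_AND_DISK)
      println(parsed.count())
      println(parsed.filter(_.length > 3).count())
      // Evict explicitly when finished instead of relying on LRU.
      parsed.unpersist()
    }
  }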

Re: Spark - ready for prime time?

2014-04-10 Thread Brad Miller
4. Shuffle on disk Is it true - I couldn't find it in official docs, but did see this mentioned in various threads - that shuffle _always_ hits disk? (Disregarding OS caches.) Why is this the case? Are you planning to add a function to do shuffle in memory or are there some intrinsic reasons

Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread DiData
Hello friends: I recently compiled and installed Spark v0.9 from the Apache distribution. Note: I have the Cloudera/CDH5 Spark RPMs co-installed as well (actually, the entire big-data suite from CDH is installed), but for the moment I'm using my manually built Apache Spark for 'ground-up'

Re: Spark - ready for prime time?

2014-04-10 Thread Matei Zaharia
To add onto the discussion about memory working space, 0.9 introduced the ability to spill data within a task to disk, and in 1.0 we’re also changing the interface to allow spilling data within the same *group* to disk (e.g. when you do groupBy and get a key with lots of values). The main
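Until per-group spilling lands, one way to dodge a key with lots of values is to aggregate incrementally rather than materialize the whole group. A sketch with made-up data, not a recipe from the thread:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._

  object AggregateNotGroup {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "aggregate-not-group")
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("a", 4)))
      // groupByKey buffers every value for a key in memory at once;
      // reduceByKey combines map-side and never holds the full group.
      val sums = pairs.reduceByKey(_ + _)
      sums.collect().foreach(println)
    }
  }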

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Alton Alexander
I am doing the exact same thing for the purpose of learning. I also don't have a Hadoop cluster and plan to scale on EC2 as soon as I get it working locally. I am having good success just using the binaries and not compiling from source... Is there a reason why you aren't just using the

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
I think this was solved in a recent merge: https://github.com/apache/spark/pull/204/files#diff-364713d7776956cb8b0a771e9b62f82dR779 Is that what you are looking for? If so, mind marking the JIRA as resolved? On Wed, Apr 9, 2014 at 3:30 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Re: programmatic way to tell Spark version

2014-04-10 Thread Pierre Borckmans
I see that this was fixed using a fixed string in SparkContext.scala. Wouldn’t it be better to use something like: getClass.getPackage.getImplementationVersion to get the version from the jar manifest (and thus from the sbt definition)? The same holds for SparkILoopInit.scala in the welcome

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
Pierre - I'm not sure that would work. I just opened a Spark shell and did this:

scala> classOf[SparkContext].getClass.getPackage.getImplementationVersion
res4: String = 1.7.0_25

It looks like this is the JVM version. - Patrick On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans

Re: Using pyspark shell in local[n] (single machine) mode unnecessarily tries to connect to HDFS NameNode ...

2014-04-10 Thread Aaron Davidson
This is likely because hdfs's core-site.xml (or something similar) provides an fs.default.name, which changes the default FileSystem, and Spark uses the Hadoop FileSystem API to resolve paths. Anyway, your solution is definitely a good one -- another would be to remove hdfs from Spark's classpath if
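A sketch of a third option: fully qualify the scheme so the fs.default.name from core-site.xml never comes into play. The path is illustrative:

  import org.apache.spark.SparkContext

  object ExplicitScheme {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "explicit-scheme")
      // An unqualified path like "/tmp/data.txt" resolves against
      // fs.default.name and may try to reach an HDFS NameNode;
      // a file:// URI pins it to the local filesystem.
      val lines = sc.textFile("file:///tmp/data.txt")
      println(lines.count())
    }
  }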

Is Branch 1.0 build broken ?

2014-04-10 Thread Chester Chen
I just updated and got the following: [error] (external-mqtt/*:update) sbt.ResolveException: unresolved dependency: org.eclipse.paho#mqtt-client;0.4.0: not found [error] Total time: 7 s, completed Apr 10, 2014 4:27:09 PM Chesters-MacBook-Pro:spark chester$ git branch * branch-1.0 master

Re: Spark 0.9.1 PySpark ImportError

2014-04-10 Thread Matei Zaharia
Kind of strange because we haven’t updated CloudPickle AFAIK. Is this a package you added on the PYTHONPATH? How did you set the path, was it in conf/spark-env.sh? Matei On Apr 10, 2014, at 7:39 AM, aazout albert.az...@velos.io wrote: I am getting a python ImportError on Spark standalone

Re: programmatic way to tell Spark version

2014-04-10 Thread Nicholas Chammas
Looks like it. I'm guessing this didn't make the cut for 0.9.1, and will instead be included with 1.0.0. So would you access it just by calling sc.version from the shell? And will this automatically make it into the Python API? I'll mark the JIRA issue as resolved. On Thu, Apr 10, 2014 at 5:05
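If it works the way the merged diff suggests, shell usage would be a one-liner; the output shown here is illustrative:

scala> sc.version
res0: String = 1.0.0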