Re: Exception adding resource files in latest Spark

2014-12-04 Thread Patrick Wendell
Thanks for flagging this. I reverted the relevant YARN fix in the Spark 1.2 release. We can try to debug this in master. On Thu, Dec 4, 2014 at 9:51 PM, Jianshi Huang wrote: > I created a ticket for this: > > https://issues.apache.org/jira/browse/SPARK-4757 > > > Jianshi > > On Fri, Dec 5, 2014 at

Re: spark streaming kafka best practices ?

2014-12-05 Thread Patrick Wendell
The second choice is better. Once you call collect() you are pulling all of the data onto a single node; you want to do most of the processing in parallel on the cluster, which is what map() will do. Ideally you'd try to summarize the data or reduce it before calling collect(). On Fri, Dec 5, 201
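
A minimal sketch of the second choice, assuming an RDD of text lines and an existing SparkContext `sc` (paths and names are illustrative):

    import org.apache.spark.SparkContext._        // pair-RDD functions (pre-1.3 style)

    val lines = sc.textFile("hdfs:///logs/input") // assumed input path
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)         // summarization runs in parallel on the cluster
    val summary = counts.collect()                // only the reduced result reaches the driver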

Re: Stateful mapPartitions

2014-12-05 Thread Patrick Wendell
Yeah the main way to do this would be to have your own static cache of connections. These could be using an object in Scala or just a static variable in Java (for instance a set of connections that you can borrow from). - Patrick On Thu, Dec 4, 2014 at 5:26 PM, Tobias Pfeiffer wrote: > Hi, > > O
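
A sketch of that static cache in Scala, assuming a JDBC data store and an RDD of strings; the singleton object is initialized at most once per executor JVM, so tasks reuse the connection instead of opening one per record:

    import java.sql.{Connection, DriverManager}

    object ConnectionCache {
      // a lazy val is evaluated once per JVM, i.e. once per executor
      lazy val connection: Connection =
        DriverManager.getConnection("jdbc:postgresql://dbhost/mydb") // assumed URL
    }

    rdd.foreachPartition { records =>             // assumes rdd: RDD[String]
      val stmt = ConnectionCache.connection.createStatement()
      records.foreach(r => stmt.executeUpdate(s"INSERT INTO events VALUES ('$r')"))
    }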

Re: Spark Server - How to implement

2014-12-12 Thread Patrick Wendell
Hey Manoj, One proposal potentially of interest is the Spark Kernel project from IBM - you should look it up. The interface in that project is more of a "remote REPL" interface, i.e. you submit commands (as strings) and get back results (as strings), but you don't have direct programmatic acce

Re: spark streaming kafka best practices ?

2014-12-17 Thread Patrick Wendell
rd. > > On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell wrote: >> >> The second choice is better. Once you call collect() you are pulling >> all of the data onto a single node, you want to do most of the >> processing in parallel on the cluster, which is what map() w

Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits! This release brings operational and performance improvements in Spark core

Re: Announcing Spark 1.2!

2014-12-19 Thread Patrick Wendell
2.0 and v1.2.0-rc2 are pointed to different commits in >> https://github.com/apache/spark/releases >> >> Best Regards, >> >> Shixiong Zhu >> >> 2014-12-19 16:52 GMT+08:00 Patrick Wendell : >>> >>> I'm happy to announce the availability of S

Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Xiangrui asked me to report that it's back and running :) On Mon, Dec 22, 2014 at 3:21 PM, peng wrote: > Me 2 :) > > > On 12/22/2014 06:14 PM, Andrew Ash wrote: > > Hi Xiangrui, > > That link is currently returning a 503 Over Quota error message. Would you > mind pinging back out when the page i

Re: Announcing Spark Packages

2014-12-22 Thread Patrick Wendell
Hey Nick, I think Hitesh was just trying to be helpful and point out the policy - not necessarily saying there was an issue. We've taken a close look at this and I think we're in good shape here vis-a-vis this policy. - Patrick On Mon, Dec 22, 2014 at 5:29 PM, Nicholas Chammas wrote: > Hitesh, >

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
Is it sufficient to set "spark.hadoop.validateOutputSpecs" to false? http://spark.apache.org/docs/latest/configuration.html - Patrick On Wed, Dec 24, 2014 at 10:52 PM, Shao, Saisai wrote: > Hi, > > > > We have such requirements to save RDD output to HDFS with saveAsTextFile > like API, but need
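
One way to set the flag programmatically, assuming Spark 1.x (it can equally go in spark-defaults.conf or on the command line):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("overwrite-example")                  // assumed app name
      .set("spark.hadoop.validateOutputSpecs", "false") // skip the existing-output check
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 10).saveAsTextFile("hdfs:///tmp/out") // succeeds even if the path exists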

Re: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Patrick Wendell
ble as any alternatives. This is already pretty easy IMO. - Patrick On Wed, Dec 24, 2014 at 11:28 PM, Cheng, Hao wrote: > I am wondering if we can provide more friendly API, other than configuration > for this purpose. What do you think Patrick? > > Cheng Hao > > -Original

Re: Long-running job cleanup

2014-12-28 Thread Patrick Wendell
What do you mean when you say "the overhead of spark shuffles start to accumulate"? Could you elaborate more? In newer versions of Spark shuffle data is cleaned up automatically when an RDD goes out of scope. It is safe to remove shuffle data at this point because the RDD can no longer be referenc

Re: action progress in ipython notebook?

2014-12-28 Thread Patrick Wendell
Hey Eric, I'm just curious - which specific features in 1.2 do you find most help with usability? This is a theme we're focusing on for 1.3 as well, so it's helpful to hear what makes a difference. - Patrick On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman wrote: > Hi Josh, > > Thanks for the inf

Re: Accumulator value in Spark UI

2015-01-14 Thread Patrick Wendell
It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip wrote: > Hello, > > From accumulator documentation, it says that if the accumulator is named, it > will be displayed in the WebUI. However, I cannot find it anywhere. > > Do I
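
A quick sketch of a named accumulator, assuming a SparkContext `sc`; the name passed here is what the stage page displays, and it only appears once a stage that updates the accumulator has actually run:

    val errors = sc.accumulator(0, "error count")  // name shown in the web UI
    sc.textFile("hdfs:///logs").foreach { line =>  // assumed input path
      if (line.contains("ERROR")) errors += 1
    }
    println(errors.value)                          // read back on the driver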

Re: Bouncing Mails

2015-01-17 Thread Patrick Wendell
Akhil, Those are handled by ASF infrastructure, not anyone in the Spark project. So this list is not the appropriate place to ask for help. - Patrick On Sat, Jan 17, 2015 at 12:56 AM, Akhil Das wrote: > My mails to the mailing list are getting rejected, have opened a Jira issue, > can someone t

Re: spark-ec2 login expects at least 1 slave

2014-03-01 Thread Patrick Wendell
Yep, currently it only supports running with at least 1 slave. On Sat, Mar 1, 2014 at 4:47 PM, nicholas.chammas wrote: > I successfully launched a Spark EC2 "cluster" with 0 slaves using spark-ec2. > When trying to login to the master node with spark-ec2 login, I get the > following: > > Searching for

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Patrick Wendell
Hey All, We have a fix for this but it didn't get merged yet. I'll put it as a blocker for Spark 0.9.1. https://github.com/pwendell/incubator-spark/commit/66594e88e5be50fca073a7ef38fa62db4082b3c8 https://spark-project.atlassian.net/browse/SPARK-1190 Sergey if you could try compiling Spark with

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Patrick Wendell
e > org.slf4j.impl.StaticLoggerBinder. Adding slf4j-log4j12 with compile scope > helps, and I confirm the logging is redirected to slf4j/Logback correctly > now with the patched module. I'm not sure however if using compile scope for > slf4j-log4j12 is a good idea. > > --

Re: Python 2.7 + numpy break sortByKey()

2014-03-06 Thread Patrick Wendell
The difference between your two jobs is that take() is optimized and only runs on the machine where you are using the shell, whereas sortByKey requires using many machines. It seems like maybe python didn't get upgraded correctly on one of the slaves. I would look in the /root/spark/work/ folder (f

Re: Kryo serialization does not compress

2014-03-06 Thread Patrick Wendell
Hey There, This is interesting... thanks for sharing this. If you are storing in MEMORY_ONLY then you are just directly storing Java objects in the JVM. So they can't be compressed because they aren't really stored in a known format; it's just left up to the JVM. To answer your other question, it's

Re: no stdout output from worker

2014-03-09 Thread Patrick Wendell
Hey Sen, Is your code in the driver code or inside one of the tasks? If it's in the tasks, the place you would expect these to be is in the stdout file under //work/[stdout/stderr]. Are you seeing at least stderr logs in that folder? If not then the tasks might not be running on the worker machines.

Re: [External] Re: no stdout output from worker

2014-03-10 Thread Patrick Wendell
n more System.out.println (E) >> >> } // main >> >> } //class >> >> I get the console outputs on the master for (B) and (E). I do not see any >> stdout in the worker node. I find the stdout and stderr in the >> /work//0/. I see output >> in stde

Re: "Too many open files" exception on reduceByKey

2014-03-10 Thread Patrick Wendell
Hey Matt, The best way is definitely just to increase the ulimit if possible, this is sort of an assumption we make in Spark that clusters will be able to move it around. You might be able to hack around this by decreasing the number of reducers but this could have some performance implications f
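
If raising the ulimit isn't an option, the reducer-count workaround is a one-line change (a sketch, assuming a pair RDD `pairs`; 32 is an arbitrary choice):

    import org.apache.spark.SparkContext._

    val counts = pairs.reduceByKey(_ + _, 32) // 32 reduce partitions instead of the default,
                                              // so fewer shuffle files are open at once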

Re: Block

2014-03-11 Thread Patrick Wendell
A block is an internal construct that isn't directly exposed to users. Internally though, each partition of an RDD is mapped to one block. - Patrick On Mon, Mar 10, 2014 at 11:06 PM, David Thomas wrote: > What is the concept of Block and BlockManager in Spark? How is a Block > related to a Parti

Re: building Spark docs

2014-03-12 Thread Patrick Wendell
Diana, I'm forwarding this to the dev list since it might be useful there as well. On Wed, Mar 12, 2014 at 11:39 AM, Diana Carroll wrote: > Hi all. I needed to build the Spark docs. The basic instructions to do > this are in spark/docs/README.md but it took me quite a bit of playing > around to

Re: Changing number of workers for benchmarking purposes

2014-03-12 Thread Patrick Wendell
Hey Pierre, Currently modifying the "slaves" file is the best way to do this because in general we expect that users will want to launch workers on any slave. I think you could hack something together pretty easily to allow this. For instance if you modify the line in slaves.sh from this: for

Re: Round Robin Partitioner

2014-03-13 Thread Patrick Wendell
In Spark 1.0 we've added better randomization to the scheduling of tasks so they are distributed more evenly by default. https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e However having specific policies like that isn't really supported unless you subclass the RDD it

Re: best practices for pushing an RDD into a database

2014-03-13 Thread Patrick Wendell
Hey Nicholas, The best way to do this is to do rdd.mapPartitions() and pass a function that will open a JDBC connection to your database and write the range in each partition. On the input path there is something called JdbcRDD that is relevant: http://spark.incubator.apache.org/docs/latest/api/
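
A sketch of the per-partition write Patrick describes, with an assumed JDBC URL, table, and pair RDD of strings; one connection is opened per partition rather than per row:

    import java.sql.DriverManager

    rdd.mapPartitions { rows =>
      val conn = DriverManager.getConnection("jdbc:mysql://dbhost/mydb") // assumed URL
      try {
        val stmt = conn.prepareStatement("INSERT INTO kv (k, v) VALUES (?, ?)")
        rows.foreach { case (k, v) =>
          stmt.setString(1, k); stmt.setString(2, v); stmt.executeUpdate()
        }
      } finally {
        conn.close()
      }
      Iterator.empty
    }.count() // mapPartitions is lazy; force it with an action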

Help vote for Spark talks at the Hadoop Summit

2014-03-13 Thread Patrick Wendell
Hey All, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed Spark talks in each of the tracks -

Re: Maximum memory limits

2014-03-16 Thread Patrick Wendell
Sean - was this merged into the 0.9 branch as well? (It seems so based on the message from rxin.) If so it might make sense to try out the head of branch-0.9 as well. Unless there are *also* other changes relevant to this in master. - Patrick On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen wrote: > Y

Re: slf4j and log4j loop

2014-03-16 Thread Patrick Wendell
This is not released yet but we're planning to cut a 0.9.1 release very soon (e.g. most likely this week). In the meantime you'll have to check out branch-0.9 of Spark and publish it locally, then depend on the snapshot version. Or just wait it out... On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu wr

Re: Log analyzer and other Spark tools

2014-03-18 Thread Patrick Wendell
Hey Roman, Ya, definitely check out pull request 42 - one cool thing is this patch now includes information about in-memory storage in the listener interface, so you can see directly which blocks are cached/on-disk etc. - Patrick On Mon, Mar 17, 2014 at 5:34 PM, Matei Zaharia wrote: > Take a look

Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
ard >>> >>> >>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers wrote: >>>> >>>> not that long ago there was a nice example on here about how to combine >>>> multiple operations on a single RDD. so basically if you want to do a >>>> count() and something else, how to roll them into a single job. i think >>>> patrick wendell gave the examples. >>>> >>>> i cant find them anymore patrick can you please repost? thanks! >>> >>> >> >

Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
As Mark said you can actually access this easily. The main issue I've seen from a performance perspective is people having a bunch of really small partitions. This will still work but the performance will improve if you coalesce the partitions using rdd.coalesce(). This can happen for example if y
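
Checking the count and then shrinking it, as a sketch:

    println(rdd.partitions.size) // how many partitions the RDD currently has
    val fewer = rdd.coalesce(64) // merge into 64 partitions without a full shuffle
    println(fewer.partitions.size)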

Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough inodes and this was causing a "No space left on device" error? Is that correct? If so, that's good to know because it's definitely counter intuitive. On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski wrote: > I would love to work

Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah we should just add this directly in pyspark - it's as simple as the code Shivaram just wrote. - Patrick On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman wrote: > There is no direct way to get this in pyspark, but you can get it from the > underlying java rdd. For example > > a = sc.para

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Patrick Wendell
Starting with Spark 0.9 the protobuf dependency we use is shaded and cannot interfere with other protobuf libraries, including those in Hadoop. Not sure what's going on in this case. Would someone who is having this problem post exactly how they are building Spark? - Patrick On Fri, Mar 21, 2014 at

Re: Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-25 Thread Patrick Wendell
I'm not sure exactly how your cluster is configured. But as far as I can tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find the exact CDH version you have and link against the `mr1` version of their published dependencies in that version. So I think you want "2.3.0-mr1-cd

Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
Hey Rohit, I think external tables based on Cassandra or other datastores will work out-of-the box if you build Catalyst with Hive support. Michael may have feelings about this but I'd guess the longer term design for having schema support for Cassandra/HBase etc likely wouldn't rely on hive exte

Re: KafkaInputDStream mapping of partitions to tasks

2014-03-27 Thread Patrick Wendell
If you call repartition() on the original stream you can set the level of parallelism after it's ingested from Kafka. I'm not sure how it maps Kafka topic partitions to tasks for the ingest, though. On Thu, Mar 27, 2014 at 11:09 AM, Scott Clasen wrote: > I have a simple streaming job that create
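
A sketch of repartitioning right after ingest, assuming a StreamingContext `ssc` and the Kafka receiver API of that era; the topic map and the parallelism of 32 are illustrative:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
    val parallel = stream.repartition(32) // spread records across 32 partitions
    parallel.map(_._2).count().print()    // downstream stages now run with 32 tasks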

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark applications can persist their state so that the UI can be reloaded after they have completed. - Patrick On Sun, Mar 30, 2014 at 10:30 AM, David Thomas wrote: > Is there a way to see 'Application Detail UI' page (at mast

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-31 Thread Patrick Wendell
Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be getting pulled in unless you are directly using akka yourself. Are you? Does your project have other dependencies that might be indirectly pulling in protobuf 2.4.1? It would be helpful if you could list all of your depende

Re: batching the output

2014-03-31 Thread Patrick Wendell
Ya this is a good way to do it. On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey wrote: > Hi, > > I need to batch the values in my final RDD before writing out to hdfs. The > idea is to batch multiple "rows" in a protobuf and write those batches out > - mostly to save some space as a lot of metad
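
One way to do the batching inside mapPartitions, with an assumed batch size of 100 and a simple string join standing in for a real protobuf encoder:

    val batched = rdd.mapPartitions(_.grouped(100).map(_.mkString("|"))) // assumes rdd: RDD[String]
    batched.saveAsTextFile("hdfs:///tmp/batched")                        // one output row per batch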

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
dependency but it still failed whenever I use the > jar with ScalaBuf dependency. > Spark version is 0.9.0 > > > ~Vipul > > On Mar 31, 2014, at 4:51 PM, Patrick Wendell wrote: > > Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be > getting pull

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with Maven? On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey wrote: > SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly > > That's all I do. > > On Apr 1, 2014, at 11:41 AM, Patrick Wendell wrote: > > Vipul - could you show e

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies: > No assembly descriptors found. -> [Help 1] > upon runnning > mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly > > > On Apr 1, 2014, at 4:13 PM, Patrick Wendell wrote: > > D

Re: Spark output compression on HDFS

2014-04-02 Thread Patrick Wendell
For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra wrote
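
The codec overload in question, as a one-liner (assuming gzip and an RDD of strings):

    import org.apache.hadoop.io.compress.GzipCodec

    rdd.saveAsTextFile("hdfs:///tmp/out", classOf[GzipCodec]) // compressed text output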

Re: Resilient nature of RDD

2014-04-02 Thread Patrick Wendell
The driver stores the meta-data associated with the partition, but the re-computation will occur on an executor. So if several partitions are lost, e.g. due to a few machines failing, the re-computation can be striped across the cluster making it fast. On Wed, Apr 2, 2014 at 11:27 AM, David Thoma

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Patrick Wendell
Hey Phillip, Right now there is no mechanism for this. You have to go in through the low level listener interface. We could consider exposing the JobProgressListener directly - I think it's been factored nicely so it's fairly decoupled from the UI. The concern is this is a semi-internal piece of

Re: Spark 1.0.0 release plan

2014-04-03 Thread Patrick Wendell
Btw - after that initial thread I proposed a slightly more detailed set of dates: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage - Patrick On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia wrote: > Hey Bhaskar, this is still the plan, though QAing might take longer than > 15 days.

Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
Hey Parviz, There was a similar thread a while ago... I think that many companies like to be discreet about the size of large clusters. But of course it would be great if people wanted to share openly :) For my part - I can say that Spark has been benchmarked on hundreds-of-nodes clusters before

Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
We might be able to incorporate the maven rpm plugin into our build. If that can be done in an elegant way it would be nice to have that distribution target for people who wanted to try this with arbitrary Spark versions... Personally I have no familiarity with that plug-in, so curious if anyone i

Re: Heartbeat exceeds

2014-04-04 Thread Patrick Wendell
If you look in the Spark UI, do you see any garbage collection happening? My best guess is that some of the executors are going into GC and they are timing out. You can manually increase the timeout by setting the Spark conf: spark.storage.blockManagerSlaveTimeoutMs to a higher value. In your cas
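
Setting the timeout to, say, two minutes (the value is in milliseconds; pick something comfortably above the GC pauses you observe):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")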

Re: trouble with "join" on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller wrote: > I am running the latest version of PySpark branch-0.9 and having some > trouble with join. > > One RDD is about 100G (25GB compressed and serialized in memory) with > 130K records, the other RDD is about 10G (2.5G compressed and > serialized in

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-10 Thread Patrick Wendell
Okay so I think the issue here is just a conflict between your application code and the Hadoop code. Hadoop 2.0.0 depends on protobuf 2.4.0a: https://svn.apache.org/repos/asf/hadoop/common/tags/release-2.0.0-alpha/hadoop-project/pom.xml Your code is depending on protobuf 2.5.X The protobuf libra

Re: hbase scan performance

2014-04-10 Thread Patrick Wendell
This job might still be faster... in MapReduce there will be other overheads in addition to the fact that doing sequential reads from HBase is slow. But it's possible the bottleneck is the HBase scan performance. - Patrick On Wed, Apr 9, 2014 at 10:10 AM, Jerry Lam wrote: > Hi Dave, > > This i

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
couldn't >> find one in the docs. I also couldn't find an environment variable with the >> version. After futzing around a bit I realized it was printed out (quite >> conspicuously) in the shell startup banner. >> >> >> On Sat, Feb 22, 2014 at 7:15 PM

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
nalytics *| *Brussels Office > www.realimpactanalytics.com *| > *pierre.borckm...@realimpactanalytics.com > > *FR *+32 485 91 87 31 *| **Skype* pierre.borckmans > > > > > > On 10 Apr 2014, at 23:05, Patrick Wendell wrote: > > I think this was solved in a recent merge: > > &g

Re: Spark on YARN performance

2014-04-11 Thread Patrick Wendell
To reiterate what Tom was saying - the code that runs inside of Spark on YARN is exactly the same code that runs in any deployment mode. There shouldn't be any performance difference once your application starts (assuming you are comparing apples-to-apples in terms of hardware). The differences ar

Re: Spark-ec2 asks for password

2014-04-18 Thread Patrick Wendell
Unfortunately - I think a lot of this is due to generally increased latency on ec2 itself. I've noticed that it's way more common than it used to be for instances to come online past the "wait" timeout in the ec2 script. On Fri, Apr 18, 2014 at 9:11 PM, FRANK AUSTIN NOTHAFT wrote: > Aureliano,

Re: running tests selectively

2014-04-20 Thread Patrick Wendell
I put some notes in this doc: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Sun, Apr 20, 2014 at 8:58 PM, Arun Ramakrishnan < sinchronized.a...@gmail.com> wrote: > I would like to run some of the tests selectively. I am in branch-1.0 > > Tried the following two comm

Re: Task splitting among workers

2014-04-20 Thread Patrick Wendell
For a HadoopRDD, first the spark scheduler calculates the number of tasks based on input splits. Usually people use this with HDFS data so in that case it's based on HDFS blocks. If the HDFS datanodes are co-located with the Spark cluster then it will try to run the tasks on the data node that cont

Re: compile spark 0.9.1 in hadoop 2.2 above exception

2014-04-24 Thread Patrick Wendell
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou wrote: > > > occure exception when compile spark 0.9.1 using sbt,env: hadoop 2.3 > > 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly > > > > 2.found Exception: > > found : org.apache

Re: is it okay to reuse objects across RDD's?

2014-04-26 Thread Patrick Wendell
Hey Todd, This approach violates the normal semantics of RDD transformations as you point out. I think you pointed out some issues already, and there are others. For instance say you cache originalRDD and some of the partitions end up in memory and others end up on disk. The ones that end up in me

Re: pySpark memory usage

2014-04-28 Thread Patrick Wendell
Hey Jim, This IOException thing is a general issue that we need to fix and your observation is spot-on. There is actually a JIRA for it here that I created a few days ago: https://issues.apache.org/jira/browse/SPARK-1579 Aaron is assigned on that one but not actively working on it, so we'd welcome a P

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Patrick Wendell
What about if you run ./bin/spark-shell --driver-class-path=/path/to/your/jar.jar I think either this or the --jars flag should work, but it's possible there is a bug with the --jars flag when calling the REPL. On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover wrote: > A couple of issues: > 1) the

Re: MLLib - libgfortran LD_LIBRARY_PATH

2014-04-28 Thread Patrick Wendell
Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set spark.executor.extraLibraryPath. On Mon, Apr 28, 2014 at 9:16 AM, Shubham Chopra wrote: > I am trying to use Spark/MLLib on Yarn and do not have libgfortran > installed on my cluster. Is there any way I can set LD_LIBRARY_PATH s
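
In 1.0 that would look something like this (the path is illustrative):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraLibraryPath", "/usr/lib/gfortran") // assumed location of libgfortran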

Re: MLLib - libgfortran LD_LIBRARY_PATH

2014-04-28 Thread Patrick Wendell
This can only be a local filesystem though, it can't refer to an HDFS location. This is because it gets passed directly to the JVM. On Mon, Apr 28, 2014 at 9:55 PM, Patrick Wendell wrote: > Yes, you can set SPARK_LIBRARY_PATH in 0.9.X and in 1.0 you can set > spark.executor.extra

Re: NullPointerException when run SparkPI using YARN env

2014-04-28 Thread Patrick Wendell
This was fixed in master. I think this happens if you don't set HADOOP_CONF_DIR to the location where your hadoop configs are (e.g. yarn-site.xml). On Sun, Apr 27, 2014 at 7:40 PM, martin.ou wrote: > 1.my hadoop 2.3.0 > 2.SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly > 3.SPARK_YARN

Re: launching concurrent jobs programmatically

2014-04-28 Thread Patrick Wendell
In general, as Andrew points out, it's possible to submit jobs from multiple threads and many Spark applications do this. One thing to check out is the job server from Ooyala; this is an application on top of Spark that has an automated submission API: https://github.com/ooyala/spark-jobserver Yo
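
The multi-threaded pattern itself is small; a sketch with two jobs sharing one SparkContext (job submission is thread-safe):

    import org.apache.spark.SparkContext._

    val sc = new org.apache.spark.SparkContext(conf) // assumed conf
    val t1 = new Thread(new Runnable {
      def run() { println(sc.parallelize(1 to 1000000).sum()) }
    })
    val t2 = new Thread(new Runnable {
      def run() { println(sc.textFile("hdfs:///data").count()) } // assumed path
    })
    t1.start(); t2.start()
    t1.join(); t2.join()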

Re: Shuffle Spill Issue

2014-04-28 Thread Patrick Wendell
Could you explain more what your job is doing and what data types you are using? These numbers alone don't necessarily indicate something is wrong. The relationship between the in-memory and on-disk shuffle amount is definitely a bit strange, the data gets compressed when written to disk, but unles

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond wrote: > Hi > > I am running a WordCount program which count words from HDFS, and I > noticed that the serializer part of code takes a lot of CPU

Re: NoSuchMethodError from Spark Java

2014-04-29 Thread Patrick Wendell
The signature of this function was changed in spark 1.0... is there any chance that somehow you are actually running against a newer version of Spark? On Tue, Apr 29, 2014 at 8:58 PM, wxhsdp wrote: > i met with the same question when update to spark 0.9.1 > (svn checkout https://github.com/apache

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
ut I'm no expert. On Tue, Apr 29, 2014 at 10:14 PM, Liu, Raymond wrote: > For all the tasks, say 32 task on total > > Best Regards, > Raymond Liu > > > -Original Message- > From: Patrick Wendell [mailto:pwend...@gmail.com] > > Is this the serialization throug

Re: JavaSparkConf

2014-04-29 Thread Patrick Wendell
This class was made to be "java friendly" so that we wouldn't have to use two versions. The class itself is simple. But I agree adding java setters would be nice. On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth wrote: > There is a JavaSparkContext, but no JavaSparkConf object. I know SparkConf > i

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
partition order. RDD order is also what allows me to get the > top k out of RDD by doing RDD.sort().take(). > > Am I misunderstanding it? Or, is it just when RDD is written to disk that > the order is not well preserved? Thanks in advance! > > Mingyu > > > > > On 1/22

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
ions returned by RDD.getPartitions() > and the row orders within the partitions determine the row order, I’m not > sure why union doesn’t respect the order because union operation simply > concatenates the two lists of partitions from the two RDDs. > > Mingyu > > > > > On 4/

Re: Union of 2 RDD's only returns the first one

2014-04-30 Thread Patrick Wendell
2, 3”, “4, 5, 6”], then > rdd1.union(rdd2).saveAsTextFile(…) should’ve resulted in a file with three > lines “a, b, c”, “1, 2, 3”, and “4, 5, 6” because the partitions from the > two RDDs are concatenated. > > Mingyu > > > > > On 4/29/14, 10:55 PM, "Patrick W

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Patrick Wendell
This is a consequence of the way the Hadoop files API works. However, you can (fairly easily) add code to just rename the file because it will always produce the same filename. (heavy use of pseudo code) dir = "/some/dir" rdd.coalesce(1).saveAsTextFile(dir) f = new File(dir + "/part-0") f.move
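
Filling out that pseudo code with Hadoop's FileSystem API, since a plain java.io.File won't reach HDFS or S3; the part-file name and target path are assumptions:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val dir = "/some/dir"
    rdd.coalesce(1).saveAsTextFile(dir)             // a single partition yields a single part file
    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.rename(new Path(dir + "/part-00000"), new Path("/some/output.txt"))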

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Patrick Wendell
tPrefix) >>> s3Client.deleteObjects(new >>> DeleteObjectsRequest("upload").withKeys(objectListing.getObjectSummaries.asScala.map(s >>> => new KeyVersion(s.getKey)).asJava)) >>> >>> Using a 3GB object I achieved about 33MB/s betwee

Re: Setting the Scala version in the EC2 script?

2014-05-03 Thread Patrick Wendell
Spark will only work with Scala 2.10... are you trying to do a minor version upgrade or upgrade to a totally different version? You could do this as follows if you want: 1. Fork the spark-ec2 repository and change the file here: https://github.com/mesos/spark-ec2/blob/v2/scala/init.sh 2. Modify yo

Re: when to use broadcast variables

2014-05-03 Thread Patrick Wendell
Broadcast variables need to fit entirely in memory - so that's a pretty good litmus test for whether or not to broadcast a smaller dataset or turn it into an RDD. On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma wrote: > I had like to be corrected on this but I am just trying to say small enough >
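
The pattern in miniature, assuming a small lookup map and a large RDD of keys:

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))             // must fit in memory on each executor
    val joined = bigRdd.map(k => (k, lookup.value.getOrElse(k, 0)))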

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
Hey Jeremy, This is actually a big problem - thanks for reporting it, I'm going to revert this change until we can make sure it is backwards compatible. - Patrick On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman wrote: > Hi all, > > A heads up in case others hit this and are confused... This nic

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
PM, Patrick Wendell wrote: > Hey Jeremy, > > This is actually a big problem - thanks for reporting it, I'm going to > revert this change until we can make sure it is backwards compatible. > > - Patrick > > On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman > wrote

Re: same log4j slf4j error in spark 9.1

2014-05-13 Thread Patrick Wendell
Hey Adrian, If you are including log4j-over-slf4j.jar in your application, you'll still need to manually exclude slf4j-log4j12.jar from Spark. However, it should work once you do that. Before 0.9.1 you couldn't make it work, even if you added an exclude. - Patrick On Thu, May 8, 2014 at 1:52 PM,
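
In an sbt build the exclude might be expressed like this (the version number is illustrative):

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" exclude("org.slf4j", "slf4j-log4j12")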

Re: 1.0.0 Release Date?

2014-05-14 Thread Patrick Wendell
Hey Brian, We've had a fairly stable 1.0 branch for a while now. I started a vote on the dev list last night... voting can take some time but it usually wraps up anywhere from a few days to a few weeks. However, you can get started right now with the release candidates. These are likely to be almost

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Patrick Wendell
Just wondering - how are you launching your application? If you want to set values like this, the right way is to add them to the SparkConf when you create a SparkContext. val conf = new SparkConf().set("spark.akka.frameSize", "1").setAppName(...).setMaster(...) val sc = new SparkContext(conf)

Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyon

Re: Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
;> >>> -- >>> Christopher T. Nguyen >>> Co-founder & CEO, Adatao <http://adatao.com> >>> linkedin.com/in/ctnguyen <http://linkedin.com/in/ctnguyen> >>> >>> >>> >>> >>> On Fri, May 30, 2014 at 3:12 AM, Patrick

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Patrick Wendell
Hi Jeremy, That's interesting, I don't think anyone has ever reported an issue running these scripts due to Python incompatibility, but they may require Python 2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that might be a good place to start. But if there is a straightforward way to

Re: Monitoring / Instrumenting jobs in 1.0

2014-05-30 Thread Patrick Wendell
The main change here was refactoring the SparkListener interface which is where we expose internal state about a Spark job to other applications. We've cleaned up these API's a bunch and also added a way to log all data as JSON for post-hoc analysis: https://github.com/apache/spark/blob/master/cor

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-31 Thread Patrick Wendell
I've removed my docs from my site to avoid confusion... somehow that link propagated all over the place! On Sat, May 31, 2014 at 1:58 AM, Xiangrui Meng wrote: > The documentation you looked at is not official, though it is from > @pwendell's website. It was for the Spark SQL release. Please find

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure. Also in the future, for this type of e-mail please only e-mail the "user@" list and not both lists. - Patrick On Sat, May 31, 2014 at 3:22 AM, prabeesh k wrote: > Hi, > >

Re: How can I dispose an Accumulator?

2014-05-31 Thread Patrick Wendell
Hey There, You can remove an accumulator by just letting it go out of scope and it will be garbage collected. For broadcast variables we actually store extra information for it, so we provide hooks for users to remove the associated state. There is no such need for accumulators, though. - Patrick

Re: Spark hook to create external process

2014-05-31 Thread Patrick Wendell
Currently, an executor is always run in its own JVM, so it should be possible to just use some static initialization to e.g. launch a sub-process and set up a bridge with which to communicate. This would be a fairly advanced use case, however. - Patrick On Thu, May 29, 2014 at 8:39 PM, ans
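
A sketch of that static-initialization bridge; the helper binary and the protocol for talking to it are assumptions, and the lazy val ensures the process is launched at most once per executor JVM:

    object ExternalBridge {
      lazy val process: Process =
        new ProcessBuilder("/usr/local/bin/helper").start() // assumed external binary
    }

    rdd.mapPartitions { iter =>
      val p = ExternalBridge.process // started on first use in this JVM
      iter.map(x => x)               // placeholder: exchange x with p over stdin/stdout here
    }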

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Patrick Wendell
> 1. ctx is an instance of JavaSQLContext but the textFile method is called as > a member of ctx. > According to the API JavaSQLContext does not have such a member, so I'm > guessing this should be sc instead. Yeah, I think you are correct. > 2. In that same code example the object sqlCtx is refer

Re: getPreferredLocations

2014-05-31 Thread Patrick Wendell
> 1) Is there a guarantee that a partition will only be processed on a node > which is in the "getPreferredLocations" set of nodes returned by the RDD ? No there isn't, by default Spark may schedule in a "non preferred" location after `spark.locality.wait` has expired. http://spark.apache.org/doc

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Patrick Wendell
I think there are a few ways to do this... the simplest one might be to manually build a set of comma-separated paths that excludes the bad file, and pass that to textFile(). When you call textFile() under the hood it is going to pass your filename string to hadoopFile() which calls setInputPaths(
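
A sketch of the comma-separated workaround, with an assumed file listing and one known-bad object:

    val all = Seq("s3n://bucket/day1.gz", "s3n://bucket/day2.gz", "s3n://bucket/day3.gz")
    val good = all.filterNot(_ == "s3n://bucket/day2.gz") // drop the file with the bad header
    val counts = sc.textFile(good.mkString(",")).count()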

Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
I would agree with your guess; it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml do you definitely see the resource manager address/addresses? Also, you can try running this command with SPARK_PRINT_LAUNCH_COMMAND=1 to make sure the classpath

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Patrick Wendell
Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of Python. Spark-ec2 itself has a flag "-a" that allows you to give a spec
