Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-26 Thread Patrick Wendell
/Preconditions.checkArgument:(ZLjava/lang/Object;)V 50: invokestatic #502 // Method org/spark-project/guava/common/base/Preconditions.checkArgument:(ZLjava/lang/Object;)V On Wed, Nov 26, 2014 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: Hi Judy, Are you somehow

Question about resource sharing in Spark Standalone

2014-11-23 Thread Patrick Liu
Dear all, Currently, I am running a Spark standalone cluster with ~100 nodes. Multiple users can connect to the cluster by Spark-shell or PyShell. However, I can't find an efficient way to control the resources among multiple users. I can set spark.deploy.defaultCores on the server side to

Re: toLocalIterator in Spark 1.0.0

2014-11-13 Thread Patrick Wendell
It looks like you are trying to directly import the toLocalIterator function. You can't import functions, it should just appear as a method of an existing RDD if you have one. - Patrick On Thu, Nov 13, 2014 at 10:21 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am using Spark 1.0.0

Re: Still struggling with building documentation

2014-11-11 Thread Patrick Wendell
The doc build appears to be broken in master. We'll get it patched up before the release: https://issues.apache.org/jira/browse/SPARK-4326 On Tue, Nov 11, 2014 at 10:50 AM, Alessandro Baretta alexbare...@gmail.com wrote: Nichols and Patrick, Thanks for your help, but, no, it still does

Re: Spark and Play

2014-11-11 Thread Patrick Wendell
Hi There, Because Akka versions are not binary compatible with one another, it might not be possible to integrate Play with Spark 1.1.0. - Patrick On Tue, Nov 11, 2014 at 8:21 AM, Akshat Aranya aara...@gmail.com wrote: Hi, Sorry if this has been asked before; I didn't find a satisfactory

Re: Support Hive 0.13.1 in Spark SQL

2014-10-28 Thread Patrick Wendell
/browse/SPARK-4114 This is a very important issue for Spark SQL, so I'd welcome comments on that JIRA from anyone who is familiar with Hive/HCatalog internals. - Patrick On Mon, Oct 27, 2014 at 9:54 PM, Cheng, Hao hao.ch...@intel.com wrote: Hi, all I have some PRs blocked by hive upgrading

Re: Ending a job early

2014-10-28 Thread Patrick Wendell
or two cases we've exposed functions that rely on this: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L334 I would expect more robust support for online aggregation to show up in a future version of Spark. - Patrick On Tue, Oct 28

Fwd: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
the following error is logged by the worker who tries to use Akka Camel: -- Forwarded message -- From: Patrick McGloin mcgloin.patr...@gmail.com Date: 24 October 2014 15:09 Subject: Re: [akka-user] Akka Camel plus Spark Streaming To: akka-u...@googlegroups.com Hi Patrik, Thanks

Re: [akka-user] Akka Camel plus Spark Streaming

2014-10-27 Thread Patrick McGloin
it is in the assembled jar file. Please see the mails below, which I sent to the Akka group for details. Is there something I am doing wrong? Is there a way to get the Akka Cluster to load the reference.conf from Camel? Any help greatly appreciated! Best regards, Patrick On 27 October 2014 11:33, Patrick

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-23 Thread Patrick Wendell
do a mvn install first then (I think) you can test sub-modules independently: mvn test -pl streaming ... - Patrick On Wed, Oct 22, 2014 at 10:00 PM, Ryan Williams ryan.blake.willi...@gmail.com wrote: I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run

Re: About Memory usage in the Spark UI

2014-10-23 Thread Patrick Wendell
It shows the amount of memory used to store RDD blocks, which are created when you run .cache()/.persist() on an RDD. On Wed, Oct 22, 2014 at 10:07 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, please take a look at the attached screen-shot. I wonder what the Memory Used column means. I

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-08 Thread Patrick Wendell
IIRC - the random is seeded with the index, so it will always produce the same result for the same index. Maybe I don't totally follow though. Could you give a small example of how this might change the RDD ordering in a way that you don't expect? In general repartition() will not preserve the
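A minimal sketch of the seeding idea in this reply, written in plain Python rather than against Spark itself: when the random generator for a partition is seeded only with the partition index, recomputing that partition reproduces the same choices. The helper name `assign_targets` and its parameters are hypothetical and are not Spark internals.

```python
import random


def assign_targets(partition_index, num_items, num_output_partitions):
    """Pick a pseudo-random output partition for each item.

    Hypothetical helper: the generator is seeded with the partition
    index only, so a recomputed partition makes identical choices.
    """
    rng = random.Random(partition_index)  # seed depends only on the index
    return [rng.randrange(num_output_partitions) for _ in range(num_items)]


# Simulate an original computation and a later recomputation of partition 3.
first_run = assign_targets(partition_index=3, num_items=5, num_output_partitions=4)
retry_run = assign_targets(partition_index=3, num_items=5, num_output_partitions=4)
```

The two runs agree only if the items arrive in the same order both times, which is the assumption the thread is questioning.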

Re: sparksql connect remote hive cluster

2014-10-08 Thread Patrick Wendell
Spark will need to connect both to the hive metastore and to all HDFS nodes (NN and DN's). If that is all in place then it should work. In this case it looks like maybe it can't connect to a datanode in HDFS to get the raw data. Keep in mind that the performance might not be very good if you are

Re: Spark SQL + Hive + JobConf NoClassDefFoundError

2014-10-01 Thread Patrick McGloin
FYI, in case anybody else has this problem, we switched to Spark 1.1 (outside CDH) and the same Spark application worked first time (once recompiled with Spark 1.1 libs of course). I assume this is because Spark 1.1 is compiled with Hive. On 29 September 2014 17:41, Patrick McGloin mcgloin.patr

Spark SQL + Hive + JobConf NoClassDefFoundError

2014-09-29 Thread Patrick McGloin
doesn't find the class. Here is the command: sudo ./spark-submit --class aac.main.SparkDriver --master spark://localhost:7077 --jars AAC-assembly-1.0.jar aacApp_2.10-1.0.jar Any pointers would be appreciated! Best regards, Patrick

Re: Spot instances on Amazon EMR

2014-09-18 Thread Patrick Wendell
Hey Grzegorz, EMR is a service that is not maintained by the Spark community. So this list isn't the right place to ask EMR questions. - Patrick On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek grzegorz.bia...@codilime.com wrote: Hi, I would like to run Spark application on Amazon EMR. I have

Re: partitioned groupBy

2014-09-17 Thread Patrick Wendell
...@gmail.com wrote: Patrick, If I understand this correctly, I won't be able to do this in the closure provided to mapPartitions() because that's going to be stateless, in the sense that a hash map that I create within the closure would only be useful for one call of MapPartitionsRDD.compute(). I

Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote: I have a use case where
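The per-partition hash map suggested here might look like the following plain-Python sketch. The record shape, (key, list of values), and the function name `invert_partition` are assumptions made for illustration; the original use case is cut off in the snippet.

```python
from collections import defaultdict


def invert_partition(records):
    """Build a value -> [keys] mapping for the records of one partition.

    `records` is an iterator of (key, values) pairs; this shape is an
    assumption for the example. The whole mapping is held in memory,
    so the partition must fit in memory, as the reply notes.
    """
    inverse = defaultdict(list)  # the hash map built within the partition
    for key, values in records:
        for value in values:
            inverse[value].append(key)
    # Iterator in, iterator out, as a mapPartitions-style function expects.
    return iter(inverse.items())


# Stand-in for the contents of a single partition.
partition = [("a", [1, 2]), ("b", [2, 3])]
result = dict(invert_partition(iter(partition)))
```

With PySpark, a function of this shape could be passed as `rdd.mapPartitions(invert_partition)`; each call sees only one partition, so values that span partitions would still need a later merge step.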

Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Patrick Wendell
Yeah that issue has been fixed by adding better docs, it just didn't make it in time for the release: https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54 On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo huozhanf...@gmail.com wrote: resolved: ./make-distribution.sh --name

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@] This would typically be accomplished with a union() operation. You can't mutate an RDD in-place, but you can create a new RDD with a union() which is an inexpensive operator. On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur archit279tha...@gmail.com wrote: Hi, We have a use

Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread Patrick Wendell
Hey SK, Yeah, the documented format is the same (we expect users to add the jar at the end) but the old spark-submit had a bug where it would actually accept inputs that did not match the documented format. Sorry if this was difficult to find! - Patrick On Fri, Sep 12, 2014 at 1:50 PM, SK

Announcing Spark 1.1.0!

2014-09-11 Thread Patrick Wendell
, and congratulations! - Patrick

Re: Deployment model popularity - Standard vs. YARN vs. Mesos vs. SIMR

2014-09-07 Thread Patrick Wendell
I would say that the first three are all used pretty heavily. Mesos was the first one supported (long ago), the standalone is the simplest and most popular today, and YARN is newer but growing a lot in activity. SIMR is not used as much... it was designed mostly for environments where users had

Re: memory size for caching RDD

2014-09-03 Thread Patrick Wendell
Changing this is not supported; it is immutable, similar to other Spark configuration settings. On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 nzjem...@gmail.com wrote: Dear all: Spark uses memory to cache RDD and the memory size is specified by spark.storage.memoryFraction. Once the Executor starts,

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD. On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Thanks. Just to double check, rdd.id would be unique for a batch in a DStream? On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD

Submit to the Powered By Spark Page!

2014-08-26 Thread Patrick Wendell
: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark - Patrick

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Patrick Wendell
Hey Andrew, We might create a new JIRA for it, but it doesn't exist yet. We'll create JIRAs for the major 1.2 issues at the beginning of September. - Patrick On Mon, Aug 25, 2014 at 8:53 AM, Andrew Ash and...@andrewash.com wrote: Hi Patrick, For the spilling within one key work you mention

Re: Advantage of using cache()

2014-08-23 Thread Patrick Wendell
Yep - that's correct. As an optimization we save the shuffle output and re-use it if you execute a stage twice. So this can make A/B tests like this a bit confusing. - Patrick On Friday, August 22, 2014, Nieyuan qiushuiwuh...@gmail.com wrote: Because map-reduce tasks like join will save

Re: Advantage of using cache()

2014-08-20 Thread Patrick Wendell
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact effect of caching. In rdd3, in addition to the fact that rdd will be cached, you are also doing a bunch of extra random number generation. So it will be hard to isolate the effect of caching. On Wed, Aug 20, 2014 at 7:48 AM,

Re: Broadcast vs simple variable

2014-08-20 Thread Patrick Wendell
For large objects, it will be more efficient to broadcast it. If your array is small it won't really matter. How many centers do you have? Unless you are finding that you have very large tasks (and Spark will print a warning about this), it could be okay to just reference it directly. On Wed,

Re: Web UI doesn't show some stages

2014-08-20 Thread Patrick Wendell
The reason is that some operators get pipelined into a single stage. rdd.map(XX).filter(YY) - this executes in a single stage since there is no data movement needed in between these operations. If you call toDebugString on the final RDD it will give you some information about the exact lineage.
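To illustrate what pipelining means here, the sketch below applies a map and a filter in a single pass over the records, without materializing an intermediate collection. It is plain Python, not Spark's scheduler; `pipelined`, `double`, and `is_big` are hypothetical names used only for this example.

```python
def pipelined(records, map_fn, filter_fn):
    """Apply map then filter to each record in one pass."""
    for record in records:
        mapped = map_fn(record)
        if filter_fn(mapped):
            yield mapped


trace = []  # records the order in which the two functions are called


def double(x):
    trace.append(("map", x))
    return x * 2


def is_big(x):
    trace.append(("filter", x))
    return x > 2


out = list(pipelined(iter([1, 2, 3]), double, is_big))
```

The `trace` list shows the map and filter calls interleaved record by record, which is why the two operators need no data movement between them.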

Re: type issue: found RDD[T] expected RDD[A]

2014-08-19 Thread Patrick McGloin
for a collection of types I had. Best regards, Patrick On 6 August 2014 07:58, Amit Kumar kumarami...@gmail.com wrote: Hi All, I am having some trouble trying to write generic code that uses sqlContext and RDDs. Can you suggest what might be wrong? class SparkTable[T : ClassTag](val

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-05 Thread Patrick Wendell
out sequentially on disk on one big file, you can call `sortByKey` with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release). - Patrick On Tue, Aug 5, 2014 at 2:39 PM, Jens Kristian Geyti sp...@jkg.dk wrote: Patrick Wendell wrote In the latest

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
/spark/pull/1165 A (potential) workaround would be to first persist your data to disk, then re-partition it, then cache it. I'm not 100% sure whether that will work though. val a = sc.textFile("s3n://some-path/*.json").persist(DISK_ONLY).repartition(<larger nr of partitions>).cache() - Patrick On Fri

Re: What should happen if we try to cache more data than the cluster can hold in memory?

2014-08-04 Thread Patrick Wendell
BTW - the reason why the workaround could help is because when persisting to DISK_ONLY, we explicitly avoid materializing the RDD partition in memory... we just pass it through to disk On Mon, Aug 4, 2014 at 1:10 AM, Patrick Wendell pwend...@gmail.com wrote: It seems possible that you

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Patrick Wendell
For hortonworks, I believe it should work to just link against the corresponding upstream version. I.e. just set the Hadoop version to 2.4.0 Does that work? - Patrick On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid wrote: Hi, Not sure whose issue

Re: Cached RDD Block Size - Uneven Distribution

2014-08-04 Thread Patrick Wendell
Are you directly caching files from Hadoop or are you doing some transformation on them first? If you are doing a groupBy or some type of transformation, then you could be causing data skew that way. On Sun, Aug 3, 2014 at 1:19 PM, iramaraju iramar...@gmail.com wrote: I am running spark 1.0.0,

Re: disable log4j for spark-shell

2014-08-03 Thread Patrick Wendell
If you want to customize the logging behavior - the simplest way is to copy conf/log4j.properties.template to conf/log4j.properties. Then you can go and modify the log level in there. The spark shells should pick this up. On Sun, Aug 3, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote:

Re: Spark SQL, Parquet and Impala

2014-08-02 Thread Patrick McGloin
of the best practice for loading data into Parquet tables. Is the way we are doing the Spark part correct in your opinion? Best regards, Patrick On 1 August 2014 19:32, Michael Armbrust mich...@databricks.com wrote: So is the only issue that impala does not see changes until you refresh

Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala. We've tried to come up with a solution and it is working but it doesn't seem good. So I was wondering if you guys could tell us what is the correct way to do this. We are using Spark 1.0

Re: Spark SQL, Parquet and Impala

2014-08-01 Thread Patrick McGloin
insert data from SparkSQL into a Parquet table which can be directly queried by Impala? Best regards, Patrick On 1 August 2014 16:18, Patrick McGloin mcgloin.patr...@gmail.com wrote: Hi, We would like to use Spark SQL to store data in Parquet format and then query that data using Impala

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
This is a Scala bug - I filed something upstream, hopefully they can fix it soon and/or we can provide a work around: https://issues.scala-lang.org/browse/SI-8772 - Patrick On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote: Currently scala 2.10.2 can't be pulled in from

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Patrick Wendell
I've had intermittent access to the artifacts themselves, but for me the directory listing always 404's. I think if sbt hits a 404 on the directory, it sends a somewhat confusing error message that it can't download the artifact. - Patrick On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman

Re: how to publish spark inhouse?

2014-07-28 Thread Patrick Wendell
All of the scripts we use to publish Spark releases are in the Spark repo itself, so you could follow these as a guideline. The publishing process in Maven is similar to in SBT: https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65 On Mon, Jul 28, 2014 at 12:39 PM,

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yeah, I'd just add

Announcing Spark 1.0.1

2014-07-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.0.1! This release includes contributions from 70 developers. Spark 1.0.1 includes fixes across several areas of Spark, including the core API, PySpark, and MLlib. It also includes new features in Spark's (alpha) SQL library, including support for

Re: How to clear the list of Completed Applications in Spark web UI?

2014-07-09 Thread Patrick Wendell
There isn't currently a way to do this, but it will start dropping older applications once more than 200 are stored. On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote: Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?

Re: Purpose of spark-submit?

2014-07-09 Thread Patrick Wendell
It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of

Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Patrick Wendell
Hey Mikhail, I think (hope?) the -em and -dm options were never in an official Spark release. They were just in the master branch at some point. Did you use these during a previous Spark release or were you just on master? - Patrick On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov streb

Re: hadoop + yarn + spark

2014-06-27 Thread Patrick Wendell
Hi There, There is an issue with PySpark-on-YARN that requires users build with Java 6. The issue has to do with how Java 6 and 7 package jar files differently. Can you try building spark with Java 6 and trying again? - Patrick On Fri, Jun 27, 2014 at 5:00 PM, sdeb sangha...@gmail.com wrote

Re: 1.0.1 release plan

2014-06-20 Thread Patrick Wendell
Hey There, I'd like to start voting on this release shortly because there are a few important fixes that have queued up. We're just waiting to fix an akka issue. I'd guess we'll cut a vote in the next few days. - Patrick On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim m...@palantir.com wrote: Hi

Re: Trailing Tasks Saving to HDFS

2014-06-19 Thread Patrick Wendell
I'll make a comment on the JIRA - thanks for reporting this, let's get to the bottom of it. On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: I've created an issue for this but if anyone has any advice, please let me know. Basically, on about 10 GBs of

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just checkout the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
will be present in the 1.0 branch of Spark. - Patrick On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: I am about to spin up some new clusters, so I may give that a go... any special instructions for making them work? I assume I use the --spark-git-repo= option

Re: Wildcard support in input path

2014-06-17 Thread Patrick Wendell
These paths get passed directly to the Hadoop FileSystem API and I think it supports globbing out of the box. So AFAIK it should just work. On Tue, Jun 17, 2014 at 9:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Jianshi, I have used wild card characters (*) in my program and it

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-17 Thread Patrick Wendell
Out of curiosity - are you guys using speculation, shuffle consolidation, or any other non-default option? If so that would help narrow down what's causing this corruption. On Tue, Jun 17, 2014 at 10:40 AM, Surendranauth Hiraman suren.hira...@velos.io wrote: Matt/Ryan, Did you make any headway

Re: Setting spark memory limit

2014-06-09 Thread Patrick Wendell
If you run locally then Spark doesn't launch remote executors. However, in this case you can set the memory with the --driver-memory flag to spark-submit. Does that work? - Patrick On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui cuihengg...@gmail.com wrote: Hi, I'm trying to run the SimpleApp

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTextFile

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Also I should add - thanks for taking time to help narrow this down! On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA: https://issues.apache.org/jira/browse/SPARK-2075 On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- Java 7 on the development machines: » java -version 1 ↵ java version 1.7.0_51

Re: Setting executor memory when using spark-shell

2014-06-06 Thread Patrick Wendell
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell. On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Thank you, Hassan! On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote: just use -Dspark.executor.memory= -- View this

Re: Spark 1.0 embedded Hive libraries

2014-06-06 Thread Patrick Wendell
it work. I think it's being tracked by this JIRA: https://issues.apache.org/jira/browse/HIVE-5733 - Patrick On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Patrick Wendell
Hey, thanks a lot for reporting this. Do you mind making a JIRA with the details so we can track it? - Patrick On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Exactly the same story - it used to work with 0.9.1 and does not work anymore with 1.0.0. I ran tests

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Patrick Wendell
Hey There, This is only possible in Scala right now. However, this is almost never needed since the core API is fairly flexible. I have the same question as Andrew... what are you trying to do with your RDD? - Patrick On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash and...@andrewash.com wrote: Just

Re: error with cdh 5 spark installation

2014-06-04 Thread Patrick Wendell
Hey Chirag, Those init scripts are part of the Cloudera Spark package (they are not in the Spark project itself) so you might try e-mailing their support lists directly. - Patrick On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani chirag.lakh...@gmail.com wrote: I recently spun up an AWS cluster

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Patrick Wendell
): https://github.com/pwendell/kafka-spark-example You'll want to make an uber jar that includes these packages (run sbt assembly) and then submit that jar to spark-submit. Also, I'd try running it locally first (if you aren't already) just to make the debugging simpler. - Patrick On Wed, Jun 4, 2014

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Patrick Wendell
If that's still an issue, one thing to try is just changing the name of the cluster. We create groups that are identified with the cluster name, and there might be something that just got screwed up with the original group creation and AWS isn't happy. - Patrick On Wed, Jun 4, 2014 at 12:55 PM, Sam

Re: spark 1.0 not using properties file from SPARK_CONF_DIR

2014-06-03 Thread Patrick Wendell
You can set an arbitrary properties file by adding the --properties-file argument to spark-submit. It would be nice to have spark-submit also look in SPARK_CONF_DIR by default. If you opened a JIRA for that I'm sure someone would pick it up. On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi

Re: spark 1.0.0 on yarn

2014-06-02 Thread Patrick Wendell
. -Simon On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote: As a debugging step, does it work if you use a single resource manager with the key yarn.resourcemanager.address instead of using two named resource managers? I wonder if somehow the YARN client can't

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1 However, it would be very easy to add an option that allows preserving the old behavior. Is anyone here interested in contributing that? I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-1993 - Patrick On Mon, Jun 2

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix. On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Patrick, I think https://issues.apache.org/jira/browse/SPARK

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Are you building Spark with Java 6 or Java 7? Java 6 uses the extended Zip format and Java 7 uses Zip64. I think we've tried to add some build warnings if Java 7 is used, for this reason: https://github.com/apache/spark/blob/master/make-distribution.sh#L102 Any luck if you use JDK 6 to compile?

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
data by mistake if they don't understand the exact semantics. 2. It would introduce a third set of semantics here for saveAsXX... 3. It's trivial for users to implement this with two lines of code (if output dir exists, delete it) before calling saveAsHadoopFile. - Patrick On Mon, Jun 2, 2014 at 2
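The "if output dir exists, delete it" step from point 3 could be sketched as below. This version assumes a local filesystem path and uses only the Python standard library; for HDFS or S3 output the equivalent check would go through the Hadoop FileSystem API instead. The helper name `clear_output_dir` is hypothetical.

```python
import os
import shutil
import tempfile


def clear_output_dir(path):
    """Delete an existing output directory so a later save can recreate it.

    Assumes `path` is a directory on the local filesystem.
    """
    if os.path.exists(path):
        shutil.rmtree(path)


# Demonstrate on a throwaway directory standing in for a job's output path.
out_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(out_dir)
clear_output_dir(out_dir)
```

Keeping the delete explicit in user code, rather than implicit in the save call, is the design preference argued in this thread: the caller has to opt in to losing the old data.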

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
/clobber an existing destination directory if it exists, then fully over-write it with new data. I'm fine to add a flag that allows (B) for backwards-compatibility reasons, but my point was I'd prefer not to have (C) even though I see some cases where it would be useful. - Patrick On Mon, Jun 2

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
. The standard installation guide didn't say anything about java 7 and suggested to do -DskipTests for the build.. http://spark.apache.org/docs/latest/building-with-maven.html So, I didn't see the warning message... On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell pwend...@gmail.com wrote

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote: (B) Semantics in Spark 1.0 and earlier: Do you mean 1.0 and later? Option (B) with the exception-on-clobber sounds fine to me, btw. My use pattern is probably common but not universal, and deleting user files is indeed

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Patrick Wendell
Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of python. Spark-ec2 itself has a flag -a that allows you to give a

Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
One potential issue here is that mesos is using classifiers now to publish their jars. It might be that sbt-pack has trouble with dependencies that are published using classifiers. I'm pretty sure mesos is the only dependency in Spark that is using classifiers, so that's why I mention it. On Sun,

Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350 On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote: One potential issue here is that mesos is using classifiers now to publish their jars. It might be that sbt-pack has trouble with dependencies

Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
.. -Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml do you definitely see the resource manager address/addresses? Also, you

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure. Also in the future, for this type of e-mail please only e-mail the user@ list and not both lists. - Patrick On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma

Re: How can I dispose an Accumulator?

2014-05-31 Thread Patrick Wendell
. - Patrick On Thu, May 29, 2014 at 2:13 AM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: Hi, How can I dispose an Accumulator? It has no method like 'unpersist()' which Broadcast provides. Thanks.

Re: Spark hook to create external process

2014-05-31 Thread Patrick Wendell
Currently, an executor is always run in its own JVM, so it should be possible to just use some static initialization to e.g. launch a sub-process and set up a bridge with which to communicate. This would be a fairly advanced use case, however. - Patrick On Thu, May 29, 2014 at 8:39 PM

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Patrick Wendell
the change. - Patrick

Re: getPreferredLocations

2014-05-31 Thread Patrick Wendell
1) Is there a guarantee that a partition will only be processed on a node which is in the getPreferredLocations set of nodes returned by the RDD ? No there isn't, by default Spark may schedule in a non preferred location after `spark.locality.wait` has expired.

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Patrick Wendell
this (this is pseudo-code): files = fs.listStatus("s3n://bucket/stuff/*.gz") files = files.filter(not the bad file) fileStr = files.map(f => f.getPath.toString).mkString(",") sc.textFile(fileStr)... - Patrick On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: YES, your
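The path-building part of this pseudo-code can be sketched in plain Python as follows. The listing step is replaced by a hard-coded list, and the file names and the helper `build_input_paths` are hypothetical. The result is a comma-separated string, the form `sc.textFile` accepts for multiple input paths.

```python
def build_input_paths(paths, bad_paths):
    """Drop known-bad files and join the rest into one comma-separated string."""
    good = [p for p in paths if p not in bad_paths]
    return ",".join(good)


# Hypothetical listing of the bucket; in practice this would come from
# a filesystem listing call rather than a literal list.
all_paths = [
    "s3n://bucket/stuff/a.gz",
    "s3n://bucket/stuff/b.gz",
    "s3n://bucket/stuff/c.gz",
]
path_arg = build_input_paths(all_paths, bad_paths={"s3n://bucket/stuff/b.gz"})
```

The resulting `path_arg` would then be handed to `sc.textFile(path_arg)` so that the corrupt file is never opened.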

Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick

Re: Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
, Patrick Wendell pwend...@gmail.com wrote: I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Patrick Wendell
to make them compatible with 2.6 we should do that. For r3.large, we can add that to the script. It's a newer type. Any interest in contributing this? - Patrick On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Hi there! I'm relatively new to the list, so sorry

pyspark python exceptions / py4j exceptions

2014-05-15 Thread Patrick Donovan
Hello, I'm trying to write a python function that does something like: def foo(line): try: return stuff(line) except Exception: raise MoreInformativeException(line) and then use it in a map like so: rdd.map(foo) and have my MoreInformativeException make it back if/when
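Written out in full, the function from the question might look like the sketch below. `stuff` here is a hypothetical stand-in for the real per-line work, and the exception class is an assumed definition. How the exception is reported back to the driver is not covered.

```python
class MoreInformativeException(Exception):
    """Carries the input line that could not be processed."""

    def __init__(self, line):
        super().__init__(line)
        self.line = line

    def __str__(self):
        return "failed to process line: %r" % (self.line,)


def stuff(line):
    # Hypothetical stand-in for the real work: parse the line as an integer.
    return int(line)


def foo(line):
    try:
        return stuff(line)
    except Exception as exc:
        # Chain the original error so its traceback is not lost.
        raise MoreInformativeException(line) from exc
```

Used as `rdd.map(foo)`, each failing record would then raise an exception that names the offending line.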

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Patrick Wendell
) - Patrick On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers ko...@tresata.com wrote: i have some settings that i think are relevant for my application. they are spark.akka settings so i assume they are relevant for both executors and my driver program. i used to do: SPARK_JAVA_OPTS

Re: 1.0.0 Release Date?

2014-05-14 Thread Patrick Wendell
to be almost identical to the final release. - Patrick On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote: Can anyone comment on the anticipated date or worst-case timeframe for when Spark 1.0.0 will be released? -- View this message in context: http://apache-spark-user-list

Spark Streaming and JMS

2014-05-05 Thread Patrick McGloin
) Is this the best way to go? Best regards, Patrick

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
Hey Jeremy, This is actually a big problem - thanks for reporting it, I'm going to revert this change until we can make sure it is backwards compatible. - Patrick On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi all, A heads up in case others hit

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is actually a big problem - thanks for reporting it, I'm going to revert this change until we can make sure it is backwards compatible. - Patrick On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com wrote

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Patrick Wendell
with many partitions, since often there are bottlenecks at the granularity of a file. Is there a reason you need this to be exactly one file? - Patrick On Sat, May 3, 2014 at 4:14 PM, Chris Fregly ch...@fregly.com wrote: not sure if this directly addresses your issue, peter, but it's worth mentioning

Re: Setting the Scala version in the EC2 script?

2014-05-03 Thread Patrick Wendell
your spark-ec2.py script to check out spark-ec2 from a forked version. - Patrick On Thu, May 1, 2014 at 2:14 PM, Ian Ferreira ianferre...@hotmail.com wrote: Is this possible? It is very annoying to have such a great script, but still have to manually update stuff afterwards.

Re: when to use broadcast variables

2014-05-03 Thread Patrick Wendell
Broadcast variables need to fit entirely in memory - so that's a pretty good litmus test for whether or not to broadcast a smaller dataset or turn it into an RDD. On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma scrapco...@gmail.com wrote: I'd like to be corrected on this but I am just trying

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Patrick Wendell
This is a consequence of the way the Hadoop files API works. However, you can (fairly easily) add code to just rename the file because it will always produce the same filename. (heavy use of pseudo code) dir = "/some/dir" rdd.coalesce(1).saveAsTextFile(dir) f = new File(dir + "part-0")
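The rename step suggested above could be sketched in Python as follows. This assumes the output directory is on a local file system and holds exactly one file whose name starts with `part-`. `promote_single_part` is a hypothetical helper name, not a Spark or Hadoop API.

```python
import glob
import os
import shutil


def promote_single_part(output_dir, target_path):
    """Move the single part file out of output_dir to target_path."""
    # Files such as _SUCCESS do not match this pattern and are left alone.
    parts = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(parts) != 1:
        raise ValueError(
            "expected exactly one part file in %s, found %d"
            % (output_dir, len(parts))
        )
    shutil.move(parts[0], target_path)
    return target_path
```

It would be called once the `rdd.coalesce(1).saveAsTextFile(dir)` job has finished, with `output_dir` set to the same directory.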
