Re: Palantir releases under org.apache.spark?

2018-01-09 Thread Andrew Ash
That source repo is at https://github.com/palantir/spark/ with artifacts published to Palantir's bintray at https://palantir.bintray.com/releases/org/apache/spark/ If you're seeing any of them in Maven Central please flag, as that's a mistake! Andrew On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen

Re: Spark ANSI SQL Support

2017-01-17 Thread Andrew Ash
Rishabh, Have you come across any ANSI SQL queries that Spark SQL didn't support? I'd be interested to hear if you have. Andrew On Tue, Jan 17, 2017 at 8:14 PM, Deepak Sharma wrote: > From spark documentation page: > Spark SQL can now run all 99 TPC-DS queries. > > On

Re: How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Andrew Ash
You sometimes have to hard refresh to get the page to update. On Wed, Jul 27, 2016 at 5:12 PM, Jim O'Flaherty wrote: > Nevermind, it literally just appeared right after I posted this. > > > > -- > View this message in context: >

Re: Spark JVM default memory

2015-05-04 Thread Andrew Ash
It's unlikely you need to increase the amount of memory on your master node since it does simple bookkeeping. The majority of the memory pressure across a cluster is on executor nodes. See the conf/spark-env.sh file for configuring heap sizes, and this section in the docs for more information on

Re: Which OutputCommitter to use for S3?

2015-02-21 Thread Andrew Ash
Josh, is that class something you guys would consider open sourcing, or would you rather the community step up and create an OutputCommitter implementation optimized for S3? On Fri, Feb 20, 2015 at 4:02 PM, Josh Rosen rosenvi...@gmail.com wrote: We (Databricks) use our own DirectOutputCommitter

Re: Discourse: A proposed alternative to the Spark User list

2015-01-17 Thread Andrew Ash
People can continue using the stack exchange sites as is with no additional work from the Spark team. I would not support migrating our mailing lists yet again to another system like Discourse because I fear fragmentation of the community between the many sites. On Sat, Jan 17, 2015 at 6:24 AM,

Re: use netty shuffle for network cause high gc time

2015-01-13 Thread Andrew Ash
To confirm, lihu, are you using Spark version 1.2.0 ? On Tue, Jan 13, 2015 at 9:26 PM, lihu lihu...@gmail.com wrote: Hi, I just test groupByKey method on a 100GB data, the cluster is 20 machine, each with 125GB RAM. At first I set conf.set(spark.shuffle.use.netty, false) and run

Re: IndexedRDD

2015-01-13 Thread Andrew Ash
Hi Jem, Linear scaling in time on the big table doesn't seem that surprising to me. What were you expecting? I assume you're doing normalRDD.join(indexedRDD). If you were to replace the indexedRDD with a normal RDD, what times do you get? On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Andrew Ash
Note also for short circuit reads that early versions are actually net-negative in performance. Only after a second hadoop release of the feature did it turn towards being a positive change. See earlier threads on this mailing list where short circuit reads are discussed. On Fri, Jan 9, 2015 at

Re: Cleaning up spark.local.dir automatically

2015-01-09 Thread Andrew Ash
That's a worker setting which cleans up the files left behind by executors, so spark.cleaner.ttl isn't at the RDD level. After https://issues.apache.org/jira/browse/SPARK-1860 the cleaner won't clean up directories left by running executors. On Fri, Jan 9, 2015 at 7:38 AM,

Re: FW: No APPLICATION_COMPLETE file created in history server log location upon pyspark job success

2015-01-07 Thread Andrew Ash
Hi Michael, I think you need to explicitly call sc.stop() on the spark context for it to close down properly (this doesn't happen automatically). See https://issues.apache.org/jira/browse/SPARK-2972 for more details Andrew On Wed, Jan 7, 2015 at 3:38 AM, michael.engl...@nomura.com wrote:
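
As a minimal sketch of that advice (app name and input path are hypothetical), stopping the context in a finally block ensures the event log is finalized even if the job throws:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("history-server-demo"))
    try {
      sc.textFile("hdfs:///data/input").count()  // the actual work
    } finally {
      sc.stop()  // closes the context so the APPLICATION_COMPLETE marker / event log is written
    }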

Re: Data Locality

2015-01-06 Thread Andrew Ash
You can also read about locality here in the docs: http://spark.apache.org/docs/latest/tuning.html#data-locality On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger c...@koeninger.org wrote: No, not all rdds have location information, and in any case tasks may be scheduled on non-local nodes if

Re: Cannot see RDDs in Spark UI

2015-01-06 Thread Andrew Ash
Hi Manoj, I've noticed that the storage tab only shows RDDs that have been cached. Did you call .cache() or .persist() on any of the RDDs? Andrew On Tue, Jan 6, 2015 at 6:48 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, I create a bunch of RDDs, including schema RDDs. When I run the
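
A tiny sketch (hypothetical path) of why an RDD only shows up on the Storage tab once it has been cached and then materialized by an action:

    val events = sc.textFile("hdfs:///data/events")
    events.count()   // runs fine, but nothing appears under Storage
    events.cache()   // equivalent to persist(StorageLevel.MEMORY_ONLY)
    events.count()   // first action after cache() materializes it; now it appears in the UI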

Re: Single worker locked at 100% CPU

2014-12-23 Thread Andrew Ash
Hi Phil, This sounds a lot like a deadlock in Hadoop's Configuration object that I ran into a while back. If you jstack the JVM and see a thread that looks like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546 Executor task launch worker-6 daemon prio=10

Re: When will spark 1.2 released?

2014-12-18 Thread Andrew Ash
Patrick is working on the release as we speak -- I expect it'll be out later tonight (US west coast) or tomorrow at the latest. On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu yuzhih...@gmail.com wrote: Interesting, the maven artifacts were dated Dec 10th. However vote for RC2 closed recently:

Re: when will the spark 1.3.0 be released?

2014-12-16 Thread Andrew Ash
Releases are roughly every 3mo so you should expect around March if the pace stays steady. 2014-12-16 22:56 GMT-05:00 Marco Shaw marco.s...@gmail.com: When it is ready. On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote: Hi! When will the spark 1.3.0 be released? I

Re: Specifying number of executors in Mesos

2014-12-11 Thread Andrew Ash
Gerard, Are you familiar with spark.deploy.spreadOut http://spark.apache.org/docs/latest/spark-standalone.html in Standalone mode? It sounds like you want the same thing in Mesos mode. On Thu, Dec 11, 2014 at 6:48 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Not that I am aware of.

Re: Why is this operation so expensive

2014-11-25 Thread Andrew Ash
Hi Steve, You changed the first value in a Tuple2, which is the one that Spark uses to hash and determine where in the cluster to place the value. By changing the first part of the PairRDD, you've implicitly asked Spark to reshuffle the data according to the new keys. I'd guess that you would
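
A small sketch (made-up data) of the difference between rewriting keys and rewriting only values: the first invalidates the existing partitioning and forces a reshuffle on the next key-based operation, the second does not:

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))

    // New keys: Spark can no longer trust the old partitioning, so the
    // next reduceByKey/join on this RDD has to reshuffle the data.
    val rekeyed = pairs.map { case (k, v) => (k + 1, v) }

    // New values only: the partitioner is preserved and no reshuffle is needed.
    val revalued = pairs.mapValues(_.toUpperCase)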

Re: Another accumulator question

2014-11-21 Thread Andrew Ash
Hi Nathan, It sounds like what you're asking for has already been filed as https://issues.apache.org/jira/browse/SPARK-664 Does that ticket match what you're proposing? Andrew On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: We've done this with reduce -

Re: Nightly releases

2014-11-18 Thread Andrew Ash
I can see this being valuable for users wanting to live on the cutting edge without building CI infrastructure themselves, myself included. I think Patrick's recent work on the build scripts for 1.2.0 will make delivering nightly builds to a public maven repo easier. On Tue, Nov 18, 2014 at

Re: toLocalIterator in Spark 1.0.0

2014-11-14 Thread Andrew Ash
Deep, toLocalIterator is a method on the RDD class. So try this instead: rdd.toLocalIterator() On Fri, Nov 14, 2014 at 12:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: val iter = toLocalIterator (rdd) This is what I am doing and it says error: not found On Fri, Nov 14, 2014 at
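
In other words (a minimal sketch):

    val rdd = sc.parallelize(1 to 1000, 10)
    // toLocalIterator is called on the RDD itself, not as a free-standing function;
    // it streams one partition at a time back to the driver.
    val iter: Iterator[Int] = rdd.toLocalIterator
    iter.take(5).foreach(println)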

Re: Spark Memory Hungry?

2014-11-14 Thread Andrew Ash
TJ, what was your expansion factor between image size on disk and in memory in pyspark? I'd expect in memory to be larger due to Java object overhead, but don't know the exact amounts you should expect. On Fri, Nov 14, 2014 at 12:50 AM, TJ Klein tjkl...@gmail.com wrote: Hi, I am using

Re: Scala vs Python performance differences

2014-11-12 Thread Andrew Ash
Jeremy, Did you complete this benchmark in a way that's shareable with those interested here? Andrew On Tue, Apr 15, 2014 at 2:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'd also be interested in seeing such a benchmark. On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira

Re: worker_instances vs worker_cores

2014-10-20 Thread Andrew Ash
Hi Anny, SPARK_WORKER_INSTANCES is the number of copies of the Spark worker running on a single box. If you change the number, you change how the hardware you have is split up (useful for breaking large servers into 32GB heaps each, which perform better), but it doesn't change the amount of hardware you

Re: Could Spark make use of Intel Xeon Phi?

2014-10-18 Thread Andrew Ash
that Spark could recognize Phi as one worker and run workloads on it? Thanks On Oct 10, 2014, at 4:54 AM, Andrew Ash and...@andrewash.com wrote: Hi Lang, What special features of the Xeon Phi do you want Spark to take advantage of? On Thu, Oct 9, 2014 at 4:50 PM, Lang Yu lysubscr

Re: Designed behavior when master is unreachable.

2014-10-17 Thread Andrew Ash
I'm not sure what the design is, but I think the current behavior if the driver can't reach the master is to attempt to connect once and fail if that attempt fails. Is that what you're observing? (What version of Spark also?) On Fri, Oct 17, 2014 at 3:51 AM, preeze etan...@gmail.com wrote: Hi

Re: Could Spark make use of Intel Xeon Phi?

2014-10-09 Thread Andrew Ash
Hi Lang, What special features of the Xeon Phi do you want Spark to take advantage of? On Thu, Oct 9, 2014 at 4:50 PM, Lang Yu lysubscr...@gmail.com wrote: Hi, I have set up Spark 1.0.2 on the cluster using standalone mode and the input is managed by HDFS. One node of the cluster has Intel

Re: Shuffle files

2014-10-07 Thread Andrew Ash
You will need to restart your Mesos workers to pick up the new limits as well. On Tue, Oct 7, 2014 at 4:02 PM, Sunny Khatri sunny.k...@gmail.com wrote: @SK: Make sure ulimit has taken effect as Todd mentioned. You can verify via ulimit -a. Also make sure you have proper kernel parameters set

Re: Same code --works in spark 1.0.2-- but not in spark 1.1.0

2014-10-07 Thread Andrew Ash
Hi Meethu, I believe you may be hitting a regression in https://issues.apache.org/jira/browse/SPARK-3633 If you are able, could you please try running a patched version of Spark 1.1.0 that has commit 4fde28c reverted and see if the errors go away? Posting your results on that bug would be

Re: java.library.path

2014-10-05 Thread Andrew Ash
You're putting those into spark-env.sh? Try setting LD_LIBRARY_PATH as well, that might help. Also where is the exception coming from? You have to set this properly for both the cluster and the driver, which are independently set. Cheers! Andrew On Sun, Oct 5, 2014 at 1:06 PM, Tom

Re: window every n elements instead of time based

2014-10-05 Thread Andrew Ash
Hi Michael, I couldn't find anything in Jira for it -- https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20%22window%22%20AND%20component%20%3D%20Streaming Could you or Adrian please file a Jira ticket explaining the functionality and maybe a proposed API? This

Re: Larger heap leads to perf degradation due to GC

2014-10-05 Thread Andrew Ash
Hi Mingyu, Maybe we should be limiting our heaps to 32GB max and running multiple workers per machine to avoid large GC issues. For a 128GB memory, 32 core machine, this could look like: SPARK_WORKER_INSTANCES=4 SPARK_WORKER_MEMORY=32g SPARK_WORKER_CORES=8 Are people running with large (32GB+)

Re: still GC overhead limit exceeded after increasing heap space

2014-10-05 Thread Andrew Ash
You may also be writing your algorithm in a way that it requires high peak memory usage. An example of this could be using .groupByKey() where .reduceByKey() might suffice instead. Maybe you can express the algorithm in a different way that's more efficient? On Thu, Oct 2, 2014 at 4:30 AM, Sean
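
For example (hypothetical word-count-style data), both expressions below compute the same per-key sums, but the reduceByKey version combines values map-side and keeps peak memory much lower:

    val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // High peak memory: all values for a key are buffered before summing.
    val viaGroup  = counts.groupByKey().mapValues(_.sum)

    // Lower peak memory: values are combined incrementally on each node.
    val viaReduce = counts.reduceByKey(_ + _)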

Re: Short Circuit Local Reads

2014-09-30 Thread Andrew Ash
Hi Gary, I gave this a shot on a test cluster of CDH4.7 and actually saw a regression in performance when running the numbers. Have you done any benchmarking? Below are my numbers: Experimental method: 1. Write 14GB of data to HDFS via [1] 2. Read data multiple times via [2] *Experiment 1:

Re: shuffle memory requirements

2014-09-30 Thread Andrew Ash
Hi Maddenpj, Right now the best estimate I've heard for the open file limit is that you'll need the square of the largest partition count in your dataset. I filed a ticket to log the ulimit value when it's too low at https://issues.apache.org/jira/browse/SPARK-3750 On Mon, Sep 29, 2014 at 6:20

Re: Short Circuit Local Reads

2014-09-30 Thread Andrew Ash
verified it. -Kay -- Forwarded message -- From: Andrew Ash and...@andrewash.com Date: Tue, Sep 30, 2014 at 1:33 PM Subject: Re: Short Circuit Local Reads To: Matei Zaharia matei.zaha...@gmail.com Cc: user@spark.apache.org user@spark.apache.org, Gary Malouf malouf.g

Re: Workers disconnected from master sometimes and never reconnect back

2014-09-29 Thread Andrew Ash
Hi Romi, I've observed this many times as well. So much so that on some clusters I restart the workers every night in order to maintain these worker - master connections. I couldn't find an open SPARK ticket on it so filed https://issues.apache.org/jira/browse/SPARK-3736 with you and Piotr

Re: Log hdfs blocks sending

2014-09-27 Thread Andrew Ash
26, 2014 at 10:35 AM, Andrew Ash and...@andrewash.com wrote: Hi Alexey, You should see in the logs a locality measure like NODE_LOCAL, PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node on them and you're reading out of HDFS, then you should be seeing almost all

Re: Shuffle files

2014-09-25 Thread Andrew Ash
Hi SK, For the problem with lots of shuffle files and the too many open files exception there are a couple options: 1. The linux kernel has a limit on the number of open files at once. This is set with ulimit -n, and can be set permanently in /etc/sysctl.conf or /etc/sysctl.d/. Try increasing

Re: SPARK UI - Details post job processiong

2014-09-25 Thread Andrew Ash
Matt you should be able to set an HDFS path so you'll get logs written to a unified place instead of to local disk on a random box on the cluster. On Thu, Sep 25, 2014 at 1:38 PM, Matt Narrell matt.narr...@gmail.com wrote: How does this work with a cluster manager like YARN? mn On Sep 25,

Re: Optimal Partition Strategy

2014-09-25 Thread Andrew Ash
Hi Vinay, What I'm guessing is happening is that Spark is taking the locality of files into account and you don't have node-local data on all your machines. This might be the case if you're reading out of HDFS and your 600 files are somehow skewed to only be on about 200 of your 400 machines. A

Re: Working on LZOP Files

2014-09-25 Thread Andrew Ash
Hi Harsha, I use LZOP files extensively on my Spark cluster -- see my writeup for how to do this on this mailing list post: http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCAOoZ679ehwvT1g8=qHd2n11Z4EXOBJkP+q=Aj0qE_=shhyl...@mail.gmail.com%3E Maybe we should better document how

Re: quick start guide: building a standalone scala program

2014-09-25 Thread Andrew Ash
Hi Christy, I'm more of a Gradle fan but I know SBT fits better into the Scala ecosystem as a build tool. If you'd like to give Gradle a shot try this skeleton Gradle+Spark repo from my coworker Punya. https://github.com/punya/spark-gradle-test-example Good luck! Andrew On Thu, Sep 25, 2014

Re: Log hdfs blocks sending

2014-09-25 Thread Andrew Ash
Hi Alexey, You should see in the logs a locality measure like NODE_LOCAL, PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node on them and you're reading out of HDFS, then you should be seeing almost all NODE_LOCAL accesses. One cause I've seen for mismatches is if Spark

Re: Worker Random Port

2014-09-23 Thread Andrew Ash
Hi Paul, There are several ports you need to configure in order to run in a tight network environment. It sounds like the DMZ that contains the Spark cluster is wide open internally, but you have to poke holes between that and the driver. You should take a look at the port configuration
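
As a rough sketch of what that looks like (the property names should be checked against the port configuration docs for your Spark version, and the port numbers here are arbitrary), the randomly-chosen ports can be pinned so the firewall only needs a few known holes:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("firewalled-driver")           // hypothetical name
      .set("spark.driver.port", "40000")         // executors connect back to the driver here
      .set("spark.blockManager.port", "40010")   // block transfers between driver and executors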

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread Andrew Ash
Hi Theodore, What do you mean by module diagram? A high level architecture diagram of how the classes are organized into packages? Andrew On Tue, Sep 23, 2014 at 12:46 AM, Theodore Si sjyz...@gmail.com wrote: Hi, Please help me with that. BR, Theodore Si

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-23 Thread Andrew Ash
Also you'd rather have 2-3 tasks per core than 1 task per core because if 1 task per core is actually 1.01 tasks per core, then you get one full wave of tasks followed by a second wave containing only a few straggler tasks. You get better utilization when you're higher than 1. Aaron Davidson goes

Re: ParquetRecordReader warnings: counter initialization

2014-09-22 Thread Andrew Ash
the parquet library and as far as I know can be safely ignored. On Mon, Sep 22, 2014 at 3:27 AM, Andrew Ash and...@andrewash.com wrote: Hi All, I'm seeing the below WARNINGs in stdout using Spark SQL in Spark 1.1.0 -- is this warning a known issue? I don't see any open Jira tickets for it. Sep 22

Re: Spark and disk usage.

2014-09-21 Thread Andrew Ash
in Spark Streaming, and some MLlib algorithms. If you can help with the guide, I think it would be a nice feature to have! Burak - Original Message - From: Andrew Ash and...@andrewash.com To: Burak Yavuz bya...@stanford.edu Cc: Макар Красноперов connector@gmail.com, user user

Re: Questions about Spark speculation

2014-09-17 Thread Andrew Ash
Hi Nicolas, I've had suspicions about speculation causing problems on my cluster but don't have any hard evidence of it yet. I'm also interested in why it's turned off by default. On Tue, Sep 16, 2014 at 3:01 PM, Nicolas Mai nicolas@gmail.com wrote: Hi, guys My current project is using

Re: Adjacency List representation in Spark

2014-09-17 Thread Andrew Ash
Hi Harsha, You could look through the GraphX source to see the approach taken there for ideas in your own. I'd recommend starting at https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala#L385 to see the storage technique. Why do you want to avoid

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Hi Burak, Most discussions of checkpointing in the docs are related to Spark Streaming. Are you talking about the sparkContext.setCheckpointDir()? What effect does that have? https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing On Wed, Sep 17, 2014 at 7:44 AM,

Re: Spark and disk usage.

2014-09-17 Thread Andrew Ash
Thanks for the info! Are there performance impacts with writing to HDFS instead of local disk? I'm assuming that's why ALS checkpoints every third iteration instead of every iteration. Also I can imagine that checkpointing should be done every N shuffles instead of every N operations (counting

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-15 Thread Andrew Ash
nicholas.cham...@gmail.com wrote: Andrew, This email was pretty helpful. I feel like this stuff should be summarized in the docs somewhere, or perhaps in a blog post. Do you know if it is? Nick On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote: The locality

Re: Multiple spark shell sessions

2014-09-05 Thread Andrew Ash
Hi Dhimant, We also cleaned up these needless warnings on port failover in Spark 1.1 -- see https://issues.apache.org/jira/browse/SPARK-1902 Andrew On Thu, Sep 4, 2014 at 7:38 AM, Dhimant dhimant84.jays...@gmail.com wrote: Thanks Yana, I am able to execute application and command via

Re: Out of memory on large RDDs

2014-08-26 Thread Andrew Ash
Hi Grega, Did you ever get this figured out? I'm observing the same issue in Spark 1.0.2. For me it was after 1.5hr of a large .distinct call, followed by a .saveAsTextFile() 14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 18500 14/08/26 20:57:43 INFO

Re: Understanding RDD.GroupBy OutOfMemory Exceptions

2014-08-25 Thread Andrew Ash
Hi Patrick, For the spilling-within-one-key work you mention might land in Spark 1.2, is that being tracked in https://issues.apache.org/jira/browse/SPARK-1823 or is there another ticket I should be following? Thanks! Andrew On Tue, Aug 5, 2014 at 3:39 PM, Patrick Wendell pwend...@gmail.com

Re: heterogeneous cluster hardware

2014-08-21 Thread Andrew Ash
I'm actually not sure the Spark+Mesos integration supports dynamically allocating memory (it does support dynamically allocating cores though). Has anyone here actually used Spark+Mesos on heterogenous hardware and done dynamic memory allocation? My understanding is that each Spark executor

Re: heterogeneous cluster hardware

2014-08-21 Thread Andrew Ash
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L114 - where Spark accepts sc.executorMemory of a resource offer, regardless of how much more memory was available On Thu, Aug 21, 2014 at 2:12 PM, Andrew Ash and...@andrewash.com

Re: Segmented fold count

2014-08-18 Thread Andrew Ash
What happens when a run of numbers is spread across a partition boundary? I think you might end up with two adjacent groups of the same value in that situation. On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu dav...@databricks.com wrote: import itertools l = [1,1,1,2,2,3,4,4,5,1] gs =

Re: SPARK_LOCAL_DIRS option

2014-08-13 Thread Andrew Ash
Hi Deb, If you don't have long-running Spark applications (those taking more than spark.worker.cleanup.appDataTtl) then the TTL-based cleaner is a good solution. If however you have a mix of long-running and short-running applications, then the TTL-based solution will fail. It will clean up

Re: saveAsTextFiles file not found exception

2014-08-12 Thread Andrew Ash
Hi Chen, Please see the bug I filed at https://issues.apache.org/jira/browse/SPARK-2984 with the FileNotFoundException on _temporary directory issue. Andrew On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash and...@andrewash.com wrote: Not sure which stalled HDFS client issue you're referring

Re: set SPARK_LOCAL_DIRS issue

2014-08-12 Thread Andrew Ash
// assuming Spark 1.0 Hi Baoqiang, In my experience for the standalone cluster you need to set SPARK_WORKER_DIR not SPARK_LOCAL_DIRS to control where shuffle files are written. I think this is a documentation issue that could be improved, as

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
I've also been seeing similar stacktraces on Spark core (not streaming) and have a theory it's related to spark.speculation being turned on. Do you have that enabled by chance? On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com wrote: Bill Did you get this resolved somehow?

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
:13 AM, Andrew Ash and...@andrewash.com wrote: I've also been seeing similar stacktraces on Spark core (not streaming) and have a theory it's related to spark.speculation being turned on. Do you have that enabled by chance? On Mon, Aug 11, 2014 at 8:10 AM, Chen Song chen.song...@gmail.com

Re: Spark: Could not load native gpl library

2014-08-07 Thread Andrew Ash
Hi Jikai, It looks like you're trying to run a Spark job on data that's stored in HDFS in .lzo format. Spark can handle this (I do it all the time), but you need to configure your Spark installation to know about the .lzo format. There are two parts to the hadoop lzo library -- the first is the

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Andrew Ash
Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote: Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary

Re: How to use spark-cassandra-connector in spark-shell?

2014-08-07 Thread Andrew Ash
7, 2014 at 10:20 PM, Andrew Ash and...@andrewash.com wrote: Yes, I've done it before. On Thu, Aug 7, 2014 at 10:18 PM, Gary Zhao garyz...@gmail.com wrote: Hello Is it possible to use spark-cassandra-connector in spark-shell? Thanks Gary

Re: Spark 0.9.1 - saveAsTextFile() exception: _temporary doesn't exist!

2014-07-30 Thread Andrew Ash
Hi Oleg, Did you ever figure this out? I'm observing the same exception also in 0.9.1 and think it might be related to setting spark.speculation=true. My theory is that multiple attempts at the same task start, the first finishes and cleans up the _temporary directory, and then the second fails

Re: Configuring Spark Memory

2014-07-23 Thread Andrew Ash
Hi Martin, In standalone mode, each SparkContext you initialize gets its own set of executors across the cluster. So for example if you have two shells open, they'll each get two JVMs on each worker machine in the cluster. As far as the other docs, you can configure the total number of cores

Re: How to map each line to (line number, line)?

2014-07-21 Thread Andrew Ash
I'm not sure if you guys ever picked a preferred method for doing this, but I just encountered it and came up with this method that's working reasonably well on a small dataset. It should be quite easily generalizable to non-String RDDs. def addRowNumber(r: RDD[String]): RDD[Tuple2[Long,String]]
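
For what it's worth, there's also a built-in that covers the common case; a minimal sketch (hypothetical path):

    import org.apache.spark.rdd.RDD

    val lines: RDD[String] = sc.textFile("hdfs:///some/file")
    // zipWithIndex returns (element, index); swap the pair to get (line number, line).
    val numbered: RDD[(Long, String)] =
      lines.zipWithIndex().map { case (line, idx) => (idx, line) }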

Re: hdfs replication on saving RDD

2014-07-15 Thread Andrew Ash
In general it would be nice to be able to configure replication on a per-job basis. Is there a way to do that without changing the config values in the Hadoop conf/ directory between jobs? Maybe by modifying OutputFormats or the JobConf ? On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Andrew Ash
Hi Nan, Great digging in -- that makes sense to me for when a job is producing some output handled by Spark like a .count or .distinct or similar. For the other part of the question, I'm also interested in side effects like an HDFS disk write. If one task is writing to an HDFS path and another

Re: reading compress lzo files

2014-07-06 Thread Andrew Ash
Hi Nick, The cluster I was working on in those linked messages was a private data center cluster, not on EC2. I'd imagine that the setup would be pretty similar, but I'm not familiar with the EC2 init scripts that Spark uses. Also I upgraded that cluster to 1.0 recently and am continuing to use

Re: RDD join: composite keys

2014-07-03 Thread Andrew Ash
Hi Sameer, If you set those two IDs to be a Tuple2 in the key of the RDD, then you can join on that tuple. Example: val rdd1: RDD[Tuple3[Int, Int, String]] = ... val rdd2: RDD[Tuple3[Int, Int, String]] = ... val resultRDD = rdd1.map(k => ((k._1, k._2), k._3)).join( rdd2.map(k =>
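
Written out in full (a sketch with made-up data), that approach looks like:

    import org.apache.spark.rdd.RDD

    val rdd1: RDD[(Int, Int, String)] = sc.parallelize(Seq((1, 2, "left")))
    val rdd2: RDD[(Int, Int, String)] = sc.parallelize(Seq((1, 2, "right")))

    // Treat the (id1, id2) pair as a composite key, then join on it.
    val resultRDD: RDD[((Int, Int), (String, String))] =
      rdd1.map(k => ((k._1, k._2), k._3))
          .join(rdd2.map(k => ((k._1, k._2), k._3)))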

Re: 1.0.1 release plan

2014-06-20 Thread Andrew Ash
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the below issues without running a patched version of Spark: https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version conflicts for client applications https://issues.apache.org/jira/browse/SPARK-2043 --

Re: Spark is now available via Homebrew

2014-06-18 Thread Andrew Ash
What's the advantage of Apache maintaining the brew installer vs users? Apache handling it means more work on this dev team, but probably a better experience for brew users. Just wanted to weigh pros/cons before committing to support this installation method. Andrew On Wed, Jun 18, 2014 at

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-18 Thread Andrew Ash
Wait, so the file only has four lines and the job running out of heap space? Can you share the code you're running that does the processing? I'd guess that you're doing some intense processing on every line but just writing parsed case classes back to disk sounds very lightweight. I On Wed,

Re: Memory footprint of Calliope: Spark - Cassandra writes

2014-06-17 Thread Andrew Ash
Gerard, Strings in particular are very inefficient because they're stored in a two-byte format by the JVM. If you use the Kryo serializer and use StorageLevel.MEMORY_ONLY_SER then Kryo stores Strings in UTF8, which for ASCII-like strings will take half the space. Andrew On Tue, Jun 17,
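
A minimal sketch of that combination (app name and path are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("kryo-strings")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val names = sc.textFile("hdfs:///data/names")
    names.persist(StorageLevel.MEMORY_ONLY_SER)  // serialized in-memory storage, so Kryo's compact encoding applies
    names.count()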

Re: Wildcard support in input path

2014-06-17 Thread Andrew Ash
In Spark you can use the normal globs supported by Hadoop's FileSystem, which are documented here: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path) On Wed, Jun 18, 2014 at 12:09 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
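
For example (hypothetical directory layout), glob characters can go directly into the path passed to textFile:

    // All part files from every day in March 2014:
    val march = sc.textFile("hdfs:///logs/2014-03-*/part-*")
    // Brace alternation and comma-separated paths also work:
    val q1 = sc.textFile("hdfs:///logs/2014-0{1,2,3}-*")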

Re: Comprehensive Port Configuration reference?

2014-06-09 Thread Andrew Ash
Andrew, This is a standalone cluster. And, yes, if my understanding of Spark terminology is correct, you are correct about the port ownerships. Jacob Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
Hi Oleg, I set the size of my executors on a standalone cluster when using the shell like this: ./bin/spark-shell --master $MASTER --total-executor-cores $CORES_ACROSS_CLUSTER --driver-java-options -Dspark.executor.memory=$MEMORY_PER_EXECUTOR It doesn't seem particularly clean, but it works.

Re: Setting executor memory when using spark-shell

2014-06-05 Thread Andrew Ash
-Dspark.executor.memory=$MEMORY_PER_EXECUTOR I get bad option: '--driver-java-options' There must be something different in my setup. Any ideas? Thank you again, Oleg On 5 June 2014 22:28, Andrew Ash and...@andrewash.com wrote: Hi Oleg, I set the size of my executors on a standalone cluster when

Re: Join : Giving incorrect result

2014-06-05 Thread Andrew Ash
Hi Ajay, Can you please try running the same code with spark.shuffle.spill=false and see if the numbers turn out correctly? That parameter controls whether or not the buggy code that Matei fixed in ExternalAppendOnlyMap is used. FWIW I saw similar issues in 0.9.0 but no longer in 0.9.1 after I
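
One way to try that (a sketch; the property can equally be set in the config file or on the command line):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("join-debug")              // hypothetical name
      .set("spark.shuffle.spill", "false")   // avoid the ExternalAppendOnlyMap spill path
    val sc = new org.apache.spark.SparkContext(conf)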

Re: Using Spark on Data size larger than Memory size

2014-06-05 Thread Andrew Ash
Hi Roger, You should be able to sort within partitions using the rdd.mapPartitions() method, and that shouldn't require holding all data in memory at once. It does require holding the entire partition in memory though. Do you need the partition to never be held in memory all at once? As far as

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Andrew Ash
Just curious, what do you want your custom RDD to do that the normal ones don't? On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote: hi, folks, is there any easier way to define a custom RDD in Java? I am wondering if I have to define a new java class which

Re: Error related to serialisation in spark streaming

2014-06-04 Thread Andrew Ash
nilmish, To confirm your code is using kryo, go to the web ui of your application (defaults to :4040) and look at the environment tab. If your serializer settings are there then things should be working properly. I'm not sure how to confirm that it works against typos in the setting, but you

Re: How to change default storage levels

2014-06-04 Thread Andrew Ash
You can change storage level on an individual RDD with .persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change what the default persistency level is for RDDs. Andrew On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan karda...@gmail.com wrote: Hi I'm using Spark 0.9.1 and Shark
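
In code (a minimal sketch, hypothetical path):

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///data/input")
    data.persist(StorageLevel.MEMORY_AND_DISK)  // set per RDD; there's no global default override
    data.count()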

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
When you group by IP address in step 1 to get this: (ip1,(lat1,lon1),(lat2,lon2)) (ip2,(lat3,lon3),(lat4,lon4)) How many lat/lon locations do you expect for each IP address? avg and max are interesting. Andrew On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov

Re: WebUI's Application count doesn't get updated

2014-06-03 Thread Andrew Ash
Your applications are probably not connecting to your existing cluster and instead running in local mode. Are you passing the master URL to the SparkPi application? Andrew On Tue, Jun 3, 2014 at 12:30 AM, MrAsanjar . afsan...@gmail.com wrote: - HI all, - Application running and

Re: How to create RDDs from another RDD?

2014-06-03 Thread Andrew Ash
current conclusion is that the best option would be to roll an own saveHdfsFile(...) Would you agree? -greetz, Gerard. [1] http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Mon, Jun 2, 2014 at 11:44 PM, Andrew Ash and...@andrewash.com wrote

Re: Error related to serialisation in spark streaming

2014-06-03 Thread Andrew Ash
Hi Mayur, is that closure cleaning a JVM issue or a Spark issue? I'm used to thinking of closure cleaner as something Spark built. Do you have somewhere I can read more about this? On Tue, Jun 3, 2014 at 12:47 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: So are you using Java 7 or 8. 7

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-27 Thread Andrew Ash
Also see this context from February. We started working with Chill to get Avro records automatically registered with Kryo. I'm not sure the final status, but from the Chill PR #172 it looks like this might be much less friction than before. Issue we filed:

Re: K-nearest neighbors search in Spark

2014-05-27 Thread Andrew Ash
Hi Carter, In Spark 1.0 there will be an implementation of k-means available as part of MLLib. You can see the documentation for that below (until 1.0 is fully released). https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html Maybe diving into the source here will help

Re: Dead lock running multiple Spark jobs on Mesos

2014-05-25 Thread Andrew Ash
. Martin Am 13.05.2014 08:48, schrieb Andrew Ash: Are you setting a core limit with spark.cores.max? If you don't, in coarse mode each Spark job uses all available cores on Mesos and doesn't let them go until the job is terminated. At which point the other job can access the cores. https

Re: problem about broadcast variable in iteration

2014-05-25 Thread Andrew Ash
Hi Randy, In Spark 1.0 there was a lot of work done to allow unpersisting data that's no longer needed. See the below pull request. Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the next variable to see if you can cut the dependency there.

Re: KryoSerializer Exception

2014-05-25 Thread Andrew Ash
Hi Andrea, What version of Spark are you using? There were some improvements in how Spark uses Kryo in 0.9.1 and to-be 1.0 that I would expect to improve this. Also, can you share your registrator's code? Another possibility is that Kryo can have some difficulty serializing very large objects.

Re: Comprehensive Port Configuration reference?

2014-05-25 Thread Andrew Ash
it aligns! Jacob Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 Andrew Ash ---05/23/2014 10:30:58 AM---Hi everyone, I've also

Re: Comprehensive Port Configuration reference?

2014-05-23 Thread Andrew Ash
Hi everyone, I've also been interested in better understanding what ports are used where and the direction the network connections go. I've observed a running cluster and read through code, and came up with the below documentation addition. https://github.com/apache/spark/pull/856 Scott and

Re: Computing cosine similiarity using pyspark

2014-05-23 Thread Andrew Ash
Hi Jamal, I don't believe there are pre-written algorithms for Cosine similarity or Pearson correlation in PySpark that you can re-use. If you end up writing your own implementation of the algorithm though, the project would definitely appreciate if you shared that code back with the project for

  1   2   >