Re: Efficient Spark-Sql queries when only nth Column changes

2017-02-19 Thread Patrick
by col1,col2,col3,col4,col5").cache
df_base.registerTempTable("df_base")
val df1 = sqlContext.sql("select col1, col2, count(*) from df_base group by col1, col2")
val df2 = // similar logic
Yong
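A minimal sketch of the pattern the (truncated) reply above describes — pre-aggregate once over all five columns, cache that, and derive each narrower roll-up from the small cached result. Table and column names ("events", col1..col5) are assumptions, not the thread's actual code; note that row counts have to be re-derived as sum over the pre-computed counts, not count(*):

    val df_base = sqlContext.sql(
      "select col1, col2, col3, col4, col5, count(*) as cnt " +
      "from events group by col1, col2, col3, col4, col5").cache()
    df_base.registerTempTable("df_base")

    // each roll-up now scans the much smaller cached aggregate
    val df1 = sqlContext.sql("select col1, col2, sum(cnt) from df_base group by col1, col2")
    val df2 = sqlContext.sql("select col1, col3, sum(cnt) from df_base group by col1, col3")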

Efficient Spark-Sql queries when only nth Column changes

2017-02-18 Thread Patrick
Hi, I have read 5 columns from parquet into a data frame. My queries on the parquet table are of the type below: val df1 = sqlContext.sql("select col1,col2,count(*) from table group by col1,col2") val df2 = sqlContext.sql("select col1,col3,count(*) from table group by col1,col3") val df3 =

groupByKey vs mapPartitions for efficient grouping within a Partition

2017-01-16 Thread Patrick
Hi, Does groupByKey have any intelligence associated with it, such that if all the keys reside in the same partition, it does not do a shuffle? Or should the user write mapPartitions (with Scala groupBy code)? Which would be more efficient, and what are the memory considerations? Thanks
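A sketch of the mapPartitions alternative asked about above, assuming an RDD of (String, Int) pairs whose keys are already co-located per partition; this avoids a shuffle, but the main memory cost is that each partition is materialized in full (iter.toSeq) before grouping:

    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)), 2)

    // group within each partition only -- no shuffle is triggered
    val grouped = pairs.mapPartitions({ iter =>
      iter.toSeq.groupBy(_._1).iterator.map { case (k, vs) => (k, vs.map(_._2)) }
    }, preservesPartitioning = true)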

Re: Broadcast Join and Inner Join giving different result on same DataFrame

2017-01-03 Thread Patrick
Hi, An update on the above question: in local[*] mode the code works fine. The broadcast size is 200MB, but on YARN the broadcast join gives an empty result. The SQL query in the UI does show BroadcastHint. Thanks On Fri, Dec 30, 2016 at 9:15 PM, titli batali wrote:
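For reference, a minimal sketch of an explicitly hinted broadcast join (largeDf, smallDf and the join key "key" are assumed names, not the thread's code); comparing the physical plans of the local[*] and YARN runs via explain is a reasonable first diagnostic step:

    import org.apache.spark.sql.functions.broadcast

    val joined = largeDf.join(broadcast(smallDf), Seq("key"), "inner")
    joined.explain(true)   // compare the plan produced locally vs. on YARN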

Projection Pushdown and Predicate Pushdown in Parquet for Nested Column

2017-08-02 Thread Patrick
Hi, I would like to know whether Spark has support for projection pushdown and predicate pushdown in Parquet for nested columns. I can see two JIRA tasks with PRs: https://issues.apache.org/jira/browse/SPARK-17636 https://issues.apache.org/jira/browse/SPARK-4502 If not, are we seeing these

Querying on Deeply Nested JSON Structures

2017-07-15 Thread Patrick
Hi, We need to query a deeply nested JSON structure. However, each query is on a single field at a nested level, such as mean, median, or mode. I am aware of the SQL explode function: df = df_nested.withColumn('exploded', explode(top)) But this is too slow. Is there any other strategy that could give us
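One strategy worth noting: when the target field sits inside nested structs (not arrays), it can be addressed directly with a dotted column path and no explode is needed. The path "payload.stats.mean" below is purely an assumed example; array levels still require explode (or a UDF) to reach their elements:

    import org.apache.spark.sql.functions.col

    val means = df_nested.select(col("payload.stats.mean").as("mean"))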

Re: Nested JSON Handling in Spark 2.1

2017-07-25 Thread Patrick
Hi, I would appreciate some suggestions on how to give top-level struct treatment to nested JSON when it is stored in Parquet format, or any other solution for best performance using Spark 2.1. Thanks in advance On Mon, Jul 24, 2017 at 4:11 PM, Patrick <titlibat...@gmail.com> wrote: >
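One way to get top-level treatment is to flatten the nested fields into ordinary columns before writing Parquet. A minimal sketch, assuming struct columns named "header" and "payload" as in the schema quoted later in this thread (the output path is also an assumption):

    import org.apache.spark.sql.functions.col

    val flat = df.select(
      col("header.deviceId").as("deviceId"),
      col("header.sessionId").as("sessionId"),
      col("payload.deviceObjects").as("deviceObjects"))
    flat.write.parquet("/tmp/flattened")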

Re: Complex types projection handling with Spark 2 SQL and Parquet

2017-07-27 Thread Patrick
Hi, I am having the same issue. Has anyone found a solution to this? When I convert the nested JSON to Parquet, I don't see the projection working correctly; it still reads all the nested structure columns. Parquet does support nested column projection. Does Spark 2 SQL provide the column

Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
Hi, On reading a complex JSON, Spark infers the schema as follows:
root
 |-- header: struct (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- sessionId: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- deviceObjects: array (nullable = true)
 |    |

Re: Complex JSON Handling in Spark 2.1

2017-07-24 Thread Patrick
To avoid confusion, the query I am referring to above is over some numeric element inside a: struct (nullable = true). On Mon, Jul 24, 2017 at 4:04 PM, Patrick <titlibat...@gmail.com> wrote: > Hi, > > On reading a complex JSON, Spark infers schema as following: > > root

Builder Pattern used by Spark source code architecture

2017-09-18 Thread Patrick
Hi, A lot of Spark's code base is built around the Builder pattern, so I was wondering what benefits the Builder pattern brings to Spark. Some of the things that come to mind are that it is easy on garbage collection and also gives user-friendly APIs. Are there any other advantages with code running on
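A familiar example of the pattern from Spark's own user-facing API (shown here only as an illustration of the style being asked about): each call returns the builder, optional settings chain fluently, and the object is only constructed, or reused, at getOrCreate():

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("builder-example")
      .master("local[*]")
      .config("spark.sql.shuffle.partitions", "8")
      .getOrCreate()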

Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Hi, I have two lists: list one contains the names of the columns on which I want to do aggregate operations; list two contains the aggregate operations I want to perform on each column, e.g. (min, max, mean). I am trying to use the Spark 2.0 Dataset API to achieve this. Spark provides
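A minimal sketch of one way to do this with the DataFrame/Dataset API: build the aggregate expressions from the two lists and run them in a single job, then read the single result row back as a map. Column names ("colA", "colB") and the alias scheme are assumptions for illustration:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, min, max, mean}

    val cols = Seq("colA", "colB")
    val aggFns: Seq[(String, Column => Column)] =
      Seq(("min", c => min(c)), ("max", c => max(c)), ("mean", c => mean(c)))

    // one expression per (column, function) pair, aliased so results are addressable
    val exprs: Seq[Column] = for {
      c <- cols
      (name, fn) <- aggFns
    } yield fn(col(c)).as(s"${name}_$c")

    val row = df.agg(exprs.head, exprs.tail: _*).first()
    val resultMap = row.getValuesMap[Any](row.schema.fieldNames)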

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
Ah, does it work with the Dataset API or do I need to convert it to an RDD first? On Mon, Aug 28, 2017 at 10:40 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote: > What about the RDD stat counter? https://spark.apache.org/docs/0.6.2/api/core/spark/util/StatCounter.html > >

Re: Collecting Multiple Aggregation query result on one Column as collectAsMap

2017-08-28 Thread Patrick
on the particular column. I was thinking that if we could write some custom code which does this in one action (job), that would work for me. On Tue, Aug 29, 2017 at 12:02 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote: > Rdd only > Patrick <titlibat...@gmail.com> schrieb am Mo. 28.

Out of memory Error when using Collection Accumulator Spark 2.2

2018-02-26 Thread Patrick
Hi, We were getting an OOM error when accumulating the results of each worker. We were trying to avoid collecting data to the driver node and instead used an accumulator, as per the code snippet below. Is there any Spark config to tune the accumulator settings, or am I going the wrong way about collecting the huge
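For context, a sketch of typical CollectionAccumulator usage in Spark 2.x (names are assumptions, not the thread's snippet). The key point is that every added element is still materialized on the driver when the value is read, so this does not avoid driver memory pressure; for large result sets, writing out from the executors or collecting a bounded aggregate is usually safer:

    val acc = spark.sparkContext.collectionAccumulator[String]("rowsOfInterest")
    df.foreach(row => acc.add(row.mkString(",")))   // each element ends up on the driver
    println(acc.value.size())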

Spark Mllib logistic regression setWeightCol illegal argument exception

2020-01-09 Thread Patrick
Hi Spark Users, I am trying to solve a class imbalance problem. I figured out that Spark supports setting a weight column in its API, but I get an IllegalArgumentException saying the weight column does not exist, although it does exist in the dataset. Any recommendation on how to go about this problem? I am using the Pipeline API with
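A minimal sketch of setWeightCol usage for class imbalance; "label", "features" and "classWeight" are assumed column names. One common cause of the error described above is that the weight column does not survive the upstream transformers, i.e. it must still be present in the DataFrame that actually reaches the estimator inside the Pipeline:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.functions.{col, when}

    // up-weight the minority class (assumed to be label == 1.0)
    val weighted = train.withColumn("classWeight",
      when(col("label") === 1.0, 10.0).otherwise(1.0))

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setWeightCol("classWeight")
    val model = lr.fit(weighted)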

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Patrick Wendell
- Patrick On Wed, Mar 5, 2014 at 1:52 PM, Sergey Parhomenko sparhome...@gmail.com wrote: Hi Patrick, Thanks for the patch. I tried building a patched version of spark-core_2.10-0.9.0-incubating.jar but the Maven build fails: [ERROR] /home/das/Work/thx/incubator-spark/core/src/main/scala/org

Re: Python 2.7 + numpy break sortByKey()

2014-03-06 Thread Patrick Wendell
The difference between your two jobs is that take() is optimized and only runs on the machine where you are using the shell, whereas sortByKey requires using many machines. It seems like maybe python didn't get upgraded correctly on one of the slaves. I would look in the /root/spark/work/ folder

Re: no stdout output from worker

2014-03-09 Thread Patrick Wendell
on the worker machines. If you see stderr but not stdout that's a bit of a puzzler since they both go through the same mechanism. - Patrick On Sun, Mar 9, 2014 at 2:32 PM, Sen, Ranjan [USA] sen_ran...@bah.com wrote: Hi I have some System.out.println in my Java code that is working ok in a local environment

Re: [External] Re: no stdout output from worker

2014-03-10 Thread Patrick Wendell
Hey Sen, Suarav is right, and I think all of your print statements are inside of the driver program rather than inside of a closure. How are you running your program (i.e. what do you run that starts this job)? Where you run the driver you should expect to see the output. - Patrick On Mon, Mar

Re: Too many open files exception on reduceByKey

2014-03-10 Thread Patrick Wendell
change so it won't help the ulimit problem. This means you'll have to use fewer reducers (e.g. pass reduceByKey a number of reducers) or use fewer cores on each machine. - Patrick On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah matthew.c.ch...@gmail.com wrote: Hi everyone, My team (cc'ed
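For example, the reducer count mentioned above can be capped by passing an explicit partition count to reduceByKey (pairs and the count 64 are assumptions), which limits how many shuffle files are open concurrently per task:

    val counts = pairs.reduceByKey(_ + _, 64)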

Re: Round Robin Partitioner

2014-03-13 Thread Patrick Wendell
itself and override getPreferredLocations. Keep in mind this is tricky because the set of executors might change during the lifetime of a Spark job. - Patrick On Thu, Mar 13, 2014 at 11:50 AM, David Thomas dt5434...@gmail.com wrote: Is it possible to parition the RDD elements in a round robin

Re: slf4j and log4j loop

2014-03-16 Thread Patrick Wendell
This is not released yet but we're planning to cut a 0.9.1 release very soon (e.g. most likely this week). In the mean time you'll have to check out branch-0.9 of Spark and publish it locally, then depend on the snapshot version. Or just wait it out... On Fri, Mar 14, 2014 at 2:01 PM, Adrian Mocanu

Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
... but that's not quite released yet :) - Patrick On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers ko...@tresata.com wrote: i currently typically do something like this: scala val rdd = sc.parallelize(1 to 10) scala import com.twitter.algebird.Operators._ scala import com.twitter.algebird.{Max, Min

Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
if you do a highly selective filter on an RDD. For instance, you filter out one day of data from a dataset of a year. - Patrick On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra m...@clearstorydata.com wrote: It's much simpler: rdd.partitions.size On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas

Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand. The issue is that there weren't enough inodes and this was causing a No space left on device error? Is that correct? If so, that's good to know because it's definitely counter intuitive. On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski og...@nengoiksvelzud.com wrote:

Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah we should just add this directly in pyspark - it's as simple as the code Shivaram just wrote. - Patrick On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: There is no direct way to get this in pyspark, but you can get it from the underlying java

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Patrick Wendell
Starting with Spark 0.9 the protobuf dependency we use is shaded and cannot interfere with other protobuf libraries including those in Hadoop. Not sure what's going on in this case. Would someone who is having this problem post exactly how they are building spark? - Patrick On Fri, Mar 21, 2014

Re: Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-26 Thread Patrick Wendell
I'm not sure exactly how your cluster is configured. But as far as I can tell Cloudera's MR1 CDH5 dependencies are against Hadoop 2.3. I'd just find the exact CDH version you have and link against the `mr1` version of their published dependencies in that version. So I think you want

Re: Announcing Spark SQL

2014-03-27 Thread Patrick Wendell
to the respective cassandra columns. I think all of this would be fairly easy to implement on SchemaRDD and likely will make it into Spark 1.1 - Patrick On Wed, Mar 26, 2014 at 10:59 PM, Rohit Rai ro...@tuplejump.com wrote: Great work guys! Have been looking forward to this . . . In the blog it mentions

Re: Spark webUI - application details page

2014-03-30 Thread Patrick Wendell
This will be a feature in Spark 1.0 but is not yet released. In 1.0 Spark applications can persist their state so that the UI can be reloaded after they have completed. - Patrick On Sun, Mar 30, 2014 at 10:30 AM, David Thomas dt5434...@gmail.com wrote: Is there a way to see 'Application

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Patrick Grinaway
Also in NYC, definitely interested in a spark meetup! Sent from my iPhone On Mar 31, 2014, at 3:07 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Happy to help with an NYC meet up (just emailed Andy). I recently moved to VA, but am back in NYC quite often, and have been turning several

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-31 Thread Patrick Wendell
dependencies including the exact Spark version and other libraries. - Patrick On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey vipan...@gmail.com wrote: I'm using ScalaBuff (which depends on protobuf2.5) and facing the same issue. any word on this one? On Mar 27, 2014, at 6:41 PM, Kanwaldeep kanwal

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-01 Thread Patrick Wendell
Do you get the same problem if you build with maven? On Tue, Apr 1, 2014 at 12:23 PM, Vipul Pandey vipan...@gmail.com wrote: SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 sbt/sbt assembly That's all I do. On Apr 1, 2014, at 11:41 AM, Patrick Wendell pwend...@gmail.com wrote: Vidal - could you show

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-02 Thread Patrick Wendell
(default-cli) on project spark-0.9.0-incubating: Error reading assemblies: No assembly descriptors found. - [Help 1] upon running mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean assembly:assembly On Apr 1, 2014, at 4:13 PM, Patrick Wendell pwend...@gmail.com wrote: Do you get the same

Re: Spark output compression on HDFS

2014-04-02 Thread Patrick Wendell
For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
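The codec-taking overload referred to above, in a minimal sketch (output path is an assumption):

    import org.apache.hadoop.io.compress.GzipCodec

    rdd.saveAsTextFile("hdfs:///out/compressed", classOf[GzipCodec])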

Re: Resilient nature of RDD

2014-04-02 Thread Patrick Wendell
The driver stores the meta-data associated with the partition, but the re-computation will occur on an executor. So if several partitions are lost, e.g. due to a few machines failing, the re-computation can be striped across the cluster making it fast. On Wed, Apr 2, 2014 at 11:27 AM, David

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Patrick Wendell
of functionality and something we might, e.g. want to change the API of over time. - Patrick On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.comwrote: What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking

Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
and on jobs that crunch hundreds of terabytes (uncompressed) of data. - Patrick On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim pdey...@gmail.com wrote: Spark community, What's the size of the largest Spark cluster ever deployed? I've heard Yahoo is running Spark on several hundred nodes

Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
in the community has feedback from trying this. - Patrick On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal rahul.sing...@guavus.comwrote: Hi Christophe, Thanks for your reply and the spec file. I have solved my issue for now. I didn't want to rely building spark using the spec file (%build

Re: trouble with join on large RDDs

2014-04-07 Thread Patrick Wendell
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.eduwrote: I am running the latest version of PySpark branch-0.9 and having some trouble with join. One RDD is about 100G (25GB compressed and serialized in memory) with 130K records, the other RDD is about 10G (2.5G

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
: Hey Patrick, I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to track this request, in case the team/community wants to implement it in the future. Nick On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: No use case at the moment

Re: programmatic way to tell Spark version

2014-04-10 Thread Patrick Wendell
Pierre - I'm not sure that would work. I just opened a Spark shell and did this:
scala> classOf[SparkContext].getClass.getPackage.getImplementationVersion
res4: String = 1.7.0_25
It looks like this is the JVM version. - Patrick On Thu, Apr 10, 2014 at 2:08 PM, Pierre Borckmans pierre.borckm

Re: Hybrid GPU CPU computation

2014-04-11 Thread Patrick Grinaway
I've actually done it using PySpark and python libraries which call cuda code, though I've never done it from scala directly. The only major challenge I've hit is assigning tasks to gpus on multiple gpu machines. Sent from my iPhone On Apr 11, 2014, at 8:38 AM, Jaonary Rabarisoa

Re: Spark on YARN performance

2014-04-11 Thread Patrick Wendell
To reiterate what Tom was saying - the code that runs inside of Spark on YARN is exactly the same code that runs in any deployment mode. There shouldn't be any performance difference once your application starts (assuming you are comparing apples-to-apples in terms of hardware). The differences

Re: running tests selectively

2014-04-20 Thread Patrick Wendell
I put some notes in this doc: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools On Sun, Apr 20, 2014 at 8:58 PM, Arun Ramakrishnan sinchronized.a...@gmail.com wrote: I would like to run some of the tests selectively. I am in branch-1.0 Tried the following two

Re: Task splitting among workers

2014-04-20 Thread Patrick Wendell
For a HadoopRDD, first the spark scheduler calculates the number of tasks based on input splits. Usually people use this with HDFS data so in that case it's based on HDFS blocks. If the HDFS datanodes are co-located with the Spark cluster then it will try to run the tasks on the data node that

Re: compile spark 0.9.1 in hadoop 2.2 above exception

2014-04-24 Thread Patrick Wendell
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cn wrote: an exception occurs when compiling spark 0.9.1 using sbt; env: hadoop 2.3 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly 2. found Exception:

Re: pySpark memory usage

2014-04-28 Thread Patrick Wendell
the error first before the reader knows what is going on. Anyways maybe if you have a simpler solution you could sketch it out in the JIRA and we could talk over there. The current proposal in the JIRA is somewhat complicated... - Patrick On Mon, Apr 28, 2014 at 1:01 PM, Jim Blomo jim.bl

Re: Running a spark-submit compatible app in spark-shell

2014-04-28 Thread Patrick Wendell
What about if you run ./bin/spark-shell --driver-class-path=/path/to/your/jar.jar I think either this or the --jars flag should work, but it's possible there is a bug with the --jars flag when calling the Repl. On Mon, Apr 28, 2014 at 4:30 PM, Roger Hoover roger.hoo...@gmail.comwrote: A

Re: launching concurrent jobs programmatically

2014-04-28 Thread Patrick Wendell
You can also accomplish this by just having a separate service that submits multiple jobs to a cluster where those jobs e.g. use different jars. - Patrick On Mon, Apr 28, 2014 at 4:44 PM, Andrew Ash and...@andrewash.com wrote: For the second question, you can submit multiple jobs through

Re: How fast would you expect shuffle serialize to be?

2014-04-29 Thread Patrick Wendell
Is this the serialization throughput per task or the serialization throughput for all the tasks? On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond raymond@intel.com wrote: Hi I am running a WordCount program which count words from HDFS, and I noticed that the serializer part of code

Re: JavaSparkConf

2014-04-29 Thread Patrick Wendell
This class was made to be java friendly so that we wouldn't have to use two versions. The class itself is simple. But I agree adding java setters would be nice. On Tue, Apr 29, 2014 at 8:32 PM, Soren Macbeth so...@yieldbot.com wrote: There is a JavaSparkContext, but no JavaSparkConf object. I

Re: Union of 2 RDD's only returns the first one

2014-04-29 Thread Patrick Wendell
You are right, once you sort() the RDD, then yes it has a well defined ordering. But that ordering is lost as soon as you transform the RDD, including if you union it with another RDD. On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim m...@palantir.com wrote: Hi Patrick, I'm a little confused

Re: Reading multiple S3 objects, transforming, writing back one

2014-04-30 Thread Patrick Wendell
This is a consequence of the way the Hadoop files API works. However, you can (fairly easily) add code to just rename the file because it will always produce the same filename. (heavy use of pseudo code)
dir = /some/dir
rdd.coalesce(1).saveAsTextFile(dir)
f = new File(dir + part-0)
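A runnable take on that pseudo-code using the Hadoop FileSystem API, which works for HDFS paths as well as local ones; the directory, the single-part file name (part-00000 for a coalesce(1) text output), and the final destination are assumptions:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val dir = "/some/dir"
    rdd.coalesce(1).saveAsTextFile(dir)

    val fs = FileSystem.get(sc.hadoopConfiguration)
    fs.rename(new Path(dir + "/part-00000"), new Path("/some/dir-single-file.txt"))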

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-03 Thread Patrick Wendell
with many partitions, since often there are bottlenecks at the granularity of a file. Is there a reason you need this to be exactly one file? - Patrick On Sat, May 3, 2014 at 4:14 PM, Chris Fregly ch...@fregly.com wrote: not sure if this directly addresses your issue, peter, but it's worth mentioned

Re: Setting the Scala version in the EC2 script?

2014-05-03 Thread Patrick Wendell
your spark-ec2.py script to checkout spark-ec2 from forked version. - Patrick On Thu, May 1, 2014 at 2:14 PM, Ian Ferreira ianferre...@hotmail.com wrote: Is this possible, it is very annoying to have such a great script, but still have to manually update stuff afterwards.

Re: when to use broadcast variables

2014-05-03 Thread Patrick Wendell
Broadcast variables need to fit entirely in memory - so that's a pretty good litmus test for whether or not to broadcast a smaller dataset or turn it into an RDD. On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma scrapco...@gmail.com wrote: I had like to be corrected on this but I am just trying
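A small sketch of the "fits in memory, so broadcast it" case described above; the lookup map and codesRdd are assumed example data, not anything from the thread:

    val countryNames = Map("US" -> "United States", "DE" -> "Germany")
    val bc = sc.broadcast(countryNames)   // shipped once per executor, not per task
    val resolved = codesRdd.map(code => bc.value.getOrElse(code, "unknown"))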

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
Hey Jeremy, This is actually a big problem - thanks for reporting it, I'm going to revert this change until we can make sure it is backwards compatible. - Patrick On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi all, A heads up in case others hit

Re: spark ec2 error

2014-05-04 Thread Patrick Wendell
PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is actually a big problem - thanks for reporting it, I'm going to revert this change until we can make sure it is backwards compatible. - Patrick On Sun, May 4, 2014 at 2:00 PM, Jeremy Freeman freeman.jer...@gmail.com wrote

Spark Streaming and JMS

2014-05-05 Thread Patrick McGloin
) Is this the best way to go? Best regards, Patrick

Re: 1.0.0 Release Date?

2014-05-14 Thread Patrick Wendell
to be almost identical to the final release. - Patrick On Tue, May 13, 2014 at 9:40 AM, bhusted brian.hus...@gmail.com wrote: Can anyone comment on the anticipated date or worse case timeframe for when Spark 1.0.0 will be released? -- View this message in context: http://apache-spark-user-list

pyspark python exceptions / py4j exceptions

2014-05-15 Thread Patrick Donovan
Hello, I'm trying to write a python function that does something like:
def foo(line):
    try:
        return stuff(line)
    except Exception:
        raise MoreInformativeException(line)
and then use it in a map like so: rdd.map(foo) and have my MoreInformativeException make it back if/when

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Patrick Wendell
) - Patrick On Wed, May 14, 2014 at 9:09 AM, Koert Kuipers ko...@tresata.com wrote: i have some settings that i think are relevant for my application. they are spark.akka settings so i assume they are relevant for both executors and my driver program. i used to do: SPARK_JAVA_OPTS

Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick

Re: Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
, Patrick Wendell pwend...@gmail.com mailto:pwend...@gmail.com wrote: I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Patrick Wendell
to make them compatible with 2.6 we should do that. For r3.large, we can add that to the script. It's a newer type. Any interest in contributing this? - Patrick On May 30, 2014 5:08 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Hi there! I'm relatively new to the list, so sorry

Re: Unable to execute saveAsTextFile on multi node mesos

2014-05-31 Thread Patrick Wendell
Can you look at the logs from the executor or in the UI? They should give an exception with the reason for the task failure. Also in the future, for this type of e-mail please only e-mail the user@ list and not both lists. - Patrick On Sat, May 31, 2014 at 3:22 AM, prabeesh k prabsma

Re: How can I dispose an Accumulator?

2014-05-31 Thread Patrick Wendell
. - Patrick On Thu, May 29, 2014 at 2:13 AM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: Hi, How can I dispose an Accumulator? It has no method like 'unpersist()' which Broadcast provides. Thanks.

Re: Spark hook to create external process

2014-05-31 Thread Patrick Wendell
Currently, an executor always runs in its own JVM, so it should be possible to just use some static initialization to e.g. launch a sub-process and set up a bridge with which to communicate. This would be a fairly advanced use case, however. - Patrick On Thu, May 29, 2014 at 8:39 PM

Re: possible typos in spark 1.0 documentation

2014-05-31 Thread Patrick Wendell
the change. - Patrick

Re: getPreferredLocations

2014-05-31 Thread Patrick Wendell
1) Is there a guarantee that a partition will only be processed on a node which is in the getPreferredLocations set of nodes returned by the RDD ? No there isn't, by default Spark may schedule in a non preferred location after `spark.locality.wait` has expired.

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-31 Thread Patrick Wendell
this (this is pseudo-code):
files = fs.listStatus(s3n://bucket/stuff/*.gz)
files = files.filter(not the bad file)
fileStr = files.map(f = f.getPath.toString).mkstring(,)
sc.textFile(fileStr)...
- Patrick On Fri, May 30, 2014 at 4:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: YES, your
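A runnable sketch along the lines of that pseudo-code (bucket, wildcard, and the bad-file name are assumptions): expand the glob with globStatus, drop the corrupt file, and hand a comma-separated list of the survivors to textFile:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    val pattern = "s3n://bucket/stuff/*.gz"
    val fs = FileSystem.get(new URI(pattern), sc.hadoopConfiguration)
    val good = fs.globStatus(new Path(pattern))
      .map(_.getPath.toString)
      .filterNot(_.endsWith("bad-file.gz"))   // skip the file with the corrupt header
    val rdd = sc.textFile(good.mkString(","))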

Re: Yay for 1.0.0! EC2 Still has problems.

2014-06-01 Thread Patrick Wendell
Hey just to clarify this - my understanding is that the poster (Jeremy) was using a custom AMI to *launch* spark-ec2. I normally launch spark-ec2 from my laptop. And he was looking for an AMI that had a high enough version of python. Spark-ec2 itself has a flag -a that allows you to give a

Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
One potential issue here is that mesos is using classifiers now to publish there jars. It might be that sbt-pack has trouble with dependencies that are published using classifiers. I'm pretty sure mesos is the only dependency in Spark that is using classifiers, so that's why I mention it. On Sun,

Re: Using sbt-pack with Spark 1.0.0

2014-06-01 Thread Patrick Wendell
https://github.com/apache/spark/blob/master/project/SparkBuild.scala#L350 On Sun, Jun 1, 2014 at 11:03 AM, Patrick Wendell pwend...@gmail.com wrote: One potential issue here is that mesos is using classifiers now to publish there jars. It might be that sbt-pack has trouble with dependencies

Re: spark 1.0.0 on yarn

2014-06-01 Thread Patrick Wendell
.. -Simon On Sun, Jun 1, 2014 at 1:57 PM, Patrick Wendell pwend...@gmail.com wrote: I would agree with your guess, it looks like the yarn library isn't correctly finding your yarn-site.xml file. If you look in yarn-site.xml do you definitely the resource manager address/addresses? Also, you

Re: spark 1.0.0 on yarn

2014-06-02 Thread Patrick Wendell
. -Simon On Sun, Jun 1, 2014 at 9:03 PM, Patrick Wendell pwend...@gmail.com wrote: As a debugging step, does it work if you use a single resource manager with the key yarn.resourcemanager.address instead of using two named resource managers? I wonder if somehow the YARN client can't

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
/apache/spark/commit/3a8b698e961ac05d9d53e2bbf0c2844fcb1010d1 However, it would be very easy to add an option that allows preserving the old behavior. Is anyone here interested in contributing that? I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-1993 - Patrick On Mon, Jun 2

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
Thanks for pointing that out. I've assigned you to SPARK-1677 (I think I accidentally assigned myself way back when I created it). This should be an easy fix. On Mon, Jun 2, 2014 at 12:19 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Patrick, I think https://issues.apache.org/jira/browse/SPARK

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
Are you building Spark with Java 6 or Java 7. Java 6 uses the extended Zip format and Java 7 uses Zip64. I think we've tried to add some build warnings if Java 7 is used, for this reason: https://github.com/apache/spark/blob/master/make-distribution.sh#L102 Any luck if you use JDK 6 to compile?

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
data by mistake if they don't understand the exact semantics. 2. It would introduce a third set of semantics here for saveAsXX... 3. It's trivial for users to implement this with two lines of code (if output dir exists, delete it) before calling saveAsHadoopFile. - Patrick On Mon, Jun 2, 2014 at 2
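The "two lines of code" alternative mentioned in point 3, sketched with the Hadoop FileSystem API (the output path is an assumption): remove the target directory if it exists, then save as usual:

    import org.apache.hadoop.fs.{FileSystem, Path}

    val out = new Path("/data/output")
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (fs.exists(out)) fs.delete(out, true)   // recursive delete of the old output
    rdd.saveAsTextFile(out.toString)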

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
/clobber an existing destination directory if it exists, then fully over-write it with new data. I'm fine to add a flag that allows (B) for backwards-compatibility reasons, but my point was I'd prefer not to have (C) even though I see some cases where it would be useful. - Patrick On Mon, Jun 2

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

2014-06-02 Thread Patrick Wendell
. The standard installation guide didn't say anything about java 7 and suggested to do -DskipTests for the build.. http://spark.apache.org/docs/latest/building-with-maven.html So, I didn't see the warning message... On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell pwend...@gmail.com wrote

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Patrick Wendell
, Jun 2, 2014 at 10:39 PM, Patrick Wendell pwend...@gmail.com wrote: (B) Semantics in Spark 1.0 and earlier: Do you mean 1.0 and later? Option (B) with the exception-on-clobber sounds fine to me, btw. My use pattern is probably common but not universal, and deleting user files is indeed

Re: spark 1.0 not using properties file from SPARK_CONF_DIR

2014-06-03 Thread Patrick Wendell
You can set an arbitrary properties file by adding --properties-file argument to spark-submit. It would be nice to have spark-submit also look in SPARK_CONF_DIR as well by default. If you opened a JIRA for that I'm sure someone would pick it up. On Tue, Jun 3, 2014 at 7:47 AM, Eugen Cepoi

Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Patrick Wendell
Hey, thanks a lot for reporting this. Do you mind making a JIRA with the details so we can track it? - Patrick On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka marek.wiewio...@gmail.com wrote: Exactly the same story - it used to work with 0.9.1 and does not work anymore with 1.0.0. I ran tests

Re: is there any easier way to define a custom RDD in Java

2014-06-04 Thread Patrick Wendell
Hey There, This is only possible in Scala right now. However, this is almost never needed since the core API is fairly flexible. I have the same question as Andrew... what are you trying to do with your RDD? - Patrick On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash and...@andrewash.com wrote: Just

Re: error with cdh 5 spark installation

2014-06-04 Thread Patrick Wendell
Hey Chirag, Those init scripts are part of the Cloudera Spark package (they are not in the Spark project itself) so you might try e-mailing their support lists directly. - Patrick On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani chirag.lakh...@gmail.com wrote: I recently spun up an AWS cluster

Re: Can't seem to link external/twitter classes from my own app

2014-06-04 Thread Patrick Wendell
): https://github.com/pwendell/kafka-spark-example You'll want to make an uber jar that includes these packages (run sbt assembly) and then submit that jar to spark-submit. Also, I'd try running it locally first (if you aren't already) just to make the debugging simpler. - Patrick On Wed, Jun 4, 2014

Re: Trouble launching EC2 Cluster with Spark

2014-06-04 Thread Patrick Wendell
If that's still an issue, one thing to try is just changing the name of the cluster. We create groups that are identified with the cluster name, and there might be something that just got screwed up with the original group creation and AWS isn't happy. - Patrick On Wed, Jun 4, 2014 at 12:55 PM, Sam

Re: Setting executor memory when using spark-shell

2014-06-06 Thread Patrick Wendell
In 1.0+ you can just pass the --executor-memory flag to ./bin/spark-shell. On Fri, Jun 6, 2014 at 12:32 AM, Oleg Proudnikov oleg.proudni...@gmail.com wrote: Thank you, Hassan! On 6 June 2014 03:23, hassan hellfire...@gmail.com wrote: just use -Dspark.executor.memory= -- View this

Re: Spark 1.0 embedded Hive libraries

2014-06-06 Thread Patrick Wendell
it work. I think it's being tracked by this JIRA: https://issues.apache.org/jira/browse/HIVE-5733 - Patrick On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
are not in the jar because they go beyond the extended zip boundary `jar tvf` won't list them. - Patrick On Sun, Jun 8, 2014 at 12:45 PM, Paul Brown p...@mult.ifario.us wrote: Moving over to the dev list, as this isn't a user-scope issue. I just ran into this issue with the missing saveAsTestFile

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Also I should add - thanks for taking time to help narrow this down! On Sun, Jun 8, 2014 at 1:02 PM, Patrick Wendell pwend...@gmail.com wrote: Paul, Could you give the version of Java that you are building with and the version of Java you are running with? Are they the same? Just off

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Patrick Wendell
Okay I think I've isolated this a bit more. Let's discuss over on the JIRA: https://issues.apache.org/jira/browse/SPARK-2075 On Sun, Jun 8, 2014 at 1:16 PM, Paul Brown p...@mult.ifario.us wrote: Hi, Patrick -- Java 7 on the development machines: » java -version 1 ↵ java version 1.7.0_51

Re: Setting spark memory limit

2014-06-09 Thread Patrick Wendell
If you run locally then Spark doesn't launch remote executors. However, in this case you can set the memory with the --driver-memory flag to spark-submit. Does that work? - Patrick On Mon, Jun 9, 2014 at 3:24 PM, Henggang Cui cuihengg...@gmail.com wrote: Hi, I'm trying to run the SimpleApp

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1 release soon (this patch being one of the main reasons), but if you are itching for this sooner, you can just checkout the head of branch-1.0 and you will be able to use r3.XXX instances. - Patrick

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
By the way, in case it's not clear, I mean our maintenance branches: https://github.com/apache/spark/tree/branch-1.0 On Tue, Jun 17, 2014 at 8:35 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, This is patched in the 1.0 and 0.9 branches of Spark. We're likely to make a 1.0.1

Re: Enormous EC2 price jump makes r3.large patch more important

2014-06-17 Thread Patrick Wendell
will be present in the 1.0 branch of Spark. - Patrick On Tue, Jun 17, 2014 at 9:29 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: I am about to spin up some new clusters, so I may give that a go... any special instructions for making them work? I assume I use the --spark-git-repo= option
