OpenStack Swift integration with Spark

2014-06-13 Thread Reynold Xin
If you are interested in OpenStack/Swift integration with Spark, please drop me a line. We are looking into improving the integration. Thanks.

Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening, but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5TB/node and stuffed

Re: Powered By Spark: Can you please add our org?

2014-07-08 Thread Reynold Xin
I added you to the list. Cheers. On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio adgau...@gmail.com wrote: Hi, Sailthru is also using Spark. Could you please add us to the Powered By Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark page when you have a chance?

Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread Reynold Xin
This is because the HadoopRDD (and the underlying Hadoop InputFormat) reuses objects to avoid allocation. It is sort of tricky to fix. However, in most cases you can clone the records to make sure you are not collecting the same object over and over again.
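
A minimal sketch of that cloning workaround, assuming an existing SparkContext sc and an illustrative HDFS path; the key is to copy each reused Writable into an immutable value before collect():

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // hadoopFile hands back Writable objects that the InputFormat reuses
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
    // copy each record into immutable Scala values before collecting
    val safe = raw.map { case (offset, line) => (offset.get, line.toString) }.collect()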

Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers, Some of the most active Spark developers (including Matei Zaharia, Michael Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to host a meetup event. This might be the event

Re: driver memory management

2014-09-28 Thread Reynold Xin
The storage fraction only limits the amount of memory used for storage. It doesn't actually limit anything else, i.e., you can use all the memory if you want in collect. On Sunday, September 28, 2014, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I am interested to collect() a large RDD

Re: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Reynold Xin
Hi Daoyuan, Do you mind applying this patch and looking at the exception again? https://github.com/apache/spark/pull/2580 It has also been merged into master, so if you pull from master, you should have it. On Mon, Sep 29, 2014 at 1:17 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hi all,

Re: something about rdd.collect

2014-10-14 Thread Reynold Xin
Hi Randy, collect essentially transfers all the data to the driver node. You definitely wouldn’t want to collect 200 million words. It is a pretty large number and you can run out of memory on your driver with that much data. --  Reynold Xin On October 14, 2014 at 9:26:13 PM, randylu (randyl

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Reynold Xin
None of your tuning will help here because the problem is actually the way you are saving the output. If you take a look at the stacktrace, it is trying to build a single string that is too large for the VM to allocate. The VM is not actually running out of memory; rather, the JVM cannot
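
A sketch of the failure mode and a workaround, assuming an RDD[(String, String)] named pairs and a hypothetical output path; the point is to avoid materializing one giant string per key:

    import org.apache.spark.SparkContext._

    // problematic: a single String per key grows with the group size and can
    // exceed the largest allocation the JVM can make
    val bad = pairs.groupByKey().map { case (k, vs) => k + "\t" + vs.mkString(",") }
    // safer: emit one line per value, so no allocation grows with the group
    val good = pairs.map { case (k, v) => k + "\t" + v }
    good.saveAsTextFile("hdfs:///data/out")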

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Reynold Xin
/10/spark-breaks-previous-large-scale-sort-record.html. Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes. I want to thank Reynold Xin

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Reynold Xin
Darin, You might want to increase these config options also: spark.akka.timeout 300 spark.storage.blockManagerSlaveTimeoutMs 30 On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: For one of my Spark jobs, my workers/executors are dying and leaving the

Re: 2GB limit for partitions?

2015-02-03 Thread Reynold Xin
cc dev list How are you saving the data? There are two relevant 2GB limits: 1. Caching 2. Shuffle For caching, a partition is turned into a single block. For shuffle, each map partition is partitioned into R blocks, where R = number of reduce tasks. It is unlikely a shuffle block exceeds 2G,
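
A sketch of the usual workaround, under the assumption that a block has grown past 2GB because there are too few partitions: raise the partition count so each block stays well under the limit.

    // e.g. for ~1 TB of data, 8192 partitions keeps blocks around 128 MB
    val repartitioned = rdd.repartition(8192)
    repartitioned.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)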

Re: How to access OpenHashSet in my standalone program?

2015-01-14 Thread Reynold Xin
. Yes, I can incorporate it to my package and use it. But I am still wondering why you designed such useful functions as private. On Tue, Jan 13, 2015 at 3:33 PM, Reynold Xin r...@databricks.com wrote: It is not meant to be a public API. If you want to use it, maybe copy the code out

Re: Spark ML pipeline

2015-02-11 Thread Reynold Xin
Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar. On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I really like the pipeline in the spark.ml in Spark1.2 release. Will there be more machine learning algorithms implemented for the pipeline

Re: How to retreive the value from sql.row by column name

2015-02-16 Thread Reynold Xin
BTW we merged this today: https://github.com/apache/spark/pull/4640 This should allow us in the future to address column by name in a Row. On Mon, Feb 16, 2015 at 11:39 AM, Michael Armbrust mich...@databricks.com wrote: I can unpack the code snippet a bit: caper.select('ran_id) is the same

Re: saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Reynold Xin
It is just calling RDD's saveAsTextFile. I guess we should really override the saveAsTextFile in SchemaRDD (or make Row.toString comma separated). Do you mind filing a JIRA ticket and copy me? On Tue, Jan 13, 2015 at 12:03 AM, Kevin Burton bur...@spinn3r.com wrote: This is almost funny. I

Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Reynold Xin
What query did you run? Parquet should have predicate and column pushdown, i.e. if your query only needs to read 3 columns, then only 3 will be read. On Mon, Jan 12, 2015 at 10:20 PM, Ajay Srivastava a_k_srivast...@yahoo.com.invalid wrote: Hi, I am trying to read a parquet file using - val

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Reynold Xin
I think we made the binary protocol compatible across all versions, so you should be fine with using any one of them. 1.2.1 is probably the best since it is the most recent stable release. On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I need to use

Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
It depends on your use case. If the use case is to extract a small amount of data out of Teradata, then you can use the JdbcRDD, and soon a JDBC input source based on the new Spark SQL external data source API. On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote: Hi, I have a
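
A sketch of the JdbcRDD approach; the Teradata URL, credentials, table, and bounds are illustrative, and the query must contain the two ? placeholders that JdbcRDD binds per partition:

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val customers = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:teradata://dbhost/DATABASE=mydb", "user", "pass"),
      "SELECT id, name FROM customers WHERE id >= ? AND id <= ?",
      1L, 1000000L, 10, // lower bound, upper bound, number of partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))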

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Reynold Xin
They should have the same performance, as they are compiled down to the same execution plan. Note that starting in Spark 1.3, SchemaRDD is renamed DataFrame: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html On Tue, Mar 10, 2015 at 2:13

Re: Build fails on 1.3 Branch

2015-03-29 Thread Reynold Xin
I pushed a hotfix to the branch. Should work now. On Sun, Mar 29, 2015 at 9:23 AM, Marty Bower sp...@mjhb.com wrote: Yes, that worked - thank you very much. On Sun, Mar 29, 2015 at 9:05 AM Ted Yu yuzhih...@gmail.com wrote: Jenkins build failed too:

Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.

Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed 3 talks that are important to

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Reynold Xin
The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by running jekyll build in docs. Alternatively, just look at dataframe.py as Ted pointed out. On Tue, Feb 24, 2015 at 6:56 AM, Ted Yu yuzhih...@gmail.com wrote: Have you

Re: New guide on how to write a Spark job in Clojure

2015-02-24 Thread Reynold Xin
Thanks for sharing, Chris. On Tue, Feb 24, 2015 at 4:39 AM, Christian Betz christian.b...@performance-media.de wrote: Hi all, Maybe some of you are interested: I wrote a new guide on how to start using Spark from Clojure. The tutorial covers - setting up a project, - doing REPL-

Re: How to access OpenHashSet in my standalone program?

2015-01-13 Thread Reynold Xin
It is not meant to be a public API. If you want to use it, maybe copy the code out of the package and put it in your own project. On Fri, Jan 9, 2015 at 7:19 AM, Tae-Hyuk Ahn ahn@gmail.com wrote: Hi, I would like to use OpenHashSet (org.apache.spark.util.collection.OpenHashSet) in my

Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.

Re: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Reynold Xin
You totally can. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792 There is also an attempt at adding stddev here already: https://github.com/apache/spark/pull/5228 On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang hw...@qilinsoft.com
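
A small sketch against the 1.3-era API, assuming a DataFrame df with columns "dept" and "salary":

    import org.apache.spark.sql.functions._

    // built-in aggregate functions can be mixed freely inside agg()
    df.groupBy("dept").agg(avg("salary"), max("salary")).show()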

Re: Expected behavior for DataFrame.unionAll

2015-04-14 Thread Reynold Xin
I think what happened was that the narrowest possible common type was applied. Type widening is required, and as a result, the narrowest common type between a string and an int is string.

Re: [Spark1.3] UDF registration issue

2015-04-14 Thread Reynold Xin
You can do this: strLen = udf((s: String) => s.length()) cleanProcessDF.withColumn("dii", strLen(col("di"))) (You might need to play with the type signature a little bit to get it to compile) On Fri, Apr 10, 2015 at 11:30 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, I'm running into some

Re: how to make a spark cluster ?

2015-04-21 Thread Reynold Xin
Actually, if you only have one machine, just use Spark local mode. Download the Spark tarball, untar it, and set master to local[N], where N = number of cores. You are good to go; there is no job tracker or Hadoop to set up. On Mon, Apr 20, 2015 at 3:21 PM, haihar nahak harihar1...@gmail.com
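
A minimal sketch of local mode, assuming only that Spark is on the classpath; no cluster services are required:

    import org.apache.spark.{SparkConf, SparkContext}

    // local[4] runs the driver and 4 worker threads in a single JVM
    val conf = new SparkConf().setMaster("local[4]").setAppName("local-test")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).sum())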

Re: Updating a Column in a DataFrame

2015-04-21 Thread Reynold Xin
You can use df.withColumn("a", df("b")) to make column a have the same value as column b. On Mon, Apr 20, 2015 at 3:38 PM, ARose ashley.r...@telarix.com wrote: In my Java application, I want to update the values of a Column in a given DataFrame. However, I realize DataFrames are immutable, and

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread Reynold Xin
You can use the more verbose syntax: d.groupBy("_1").agg(d("_1"), sum("_1").as("sum_1"), sum("_2").as("sum_2")) On Tue, Apr 21, 2015 at 1:06 AM, Justin Yip yipjus...@prediction.io wrote: Hello, I would like to rename a column after aggregation. In the following code, the column name is SUM(_1#179), is

Re: How to distribute Spark computation recipes

2015-04-27 Thread Reynold Xin
The code itself is the recipe, no? On Mon, Apr 27, 2015 at 2:49 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, I know that any RDD is related to its SparkContext and the associated variables (broadcast, accumulators), but I'm looking for a way to

[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-05 Thread Reynold Xin
Hi all, We will drop support for Java 6 starting with Spark 1.5, tentatively scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015, will be the last minor release that supports Java 6. That is to say: Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8. Spark 1.5+ (~

Re: Is the AMP lab done next February?

2015-05-11 Thread Reynold Xin
Relaying an answer from AMP director Mike Franklin: One year into the lab we got a 5-year Expeditions in Computing Award as part of the White House Big Data initiative in 2012, so we extended the lab for a year. We intend to start winding it down at the end of 2016, while supporting existing

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Reynold Xin
Looks like it is spending a lot of time doing hash probing. It could be a number of the following: 1. hash probing itself is inherently expensive compared with rest of your workload 2. murmur3 doesn't work well with this key distribution 3. quadratic probing (triangular sequence) with a

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Reynold Xin
, Olivier. On Mon, May 11, 2015 at 22:07, Reynold Xin r...@databricks.com wrote: Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Reynold Xin
Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result fields of groupBy columns are always nullable even

Manning looking for a co-author for the GraphX in Action book

2015-04-13 Thread Reynold Xin
Hi all, Manning (the publisher) is looking for a co-author for the GraphX in Action book. The book currently has one author (Michael Malak), but they are looking for a co-author to work closely with Michael to improve the writing and make it more consumable. Early access page for the book:

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition -- which rearranges all the data. Parquet uses all kinds of compression techniques such as dictionary encoding and run-length encoding, which would result in the size difference when the data is ordered differently. On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei
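
A sketch of how ordering interacts with Parquet's encodings, using the 1.3-era writer API and a hypothetical low-cardinality column "category"; keeping similar values adjacent is what lets run-length and dictionary encoding compress well:

    // sorting before the write groups equal values into long runs
    df.sort("category").saveAsParquetFile("hdfs:///data/out.parquet")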

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Reynold Xin
There is already an explode function on DataFrame btw https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712 I think something like this would work. You might need to play with the type. df.explode("arrayBufferColumn") { x => x } On Fri,

Re: Building scaladoc using build/sbt unidoc failure

2015-06-13 Thread Reynold Xin
Try build/sbt clean first. On Tue, May 26, 2015 at 4:45 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I am trying to build scala doc from the 1.4 branch. But it failed due to [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode,

Re: Exception when using CLUSTER BY or ORDER BY

2015-06-12 Thread Reynold Xin
Tom, Can you file a JIRA and attach a small reproducible test case if possible? On Tue, May 19, 2015 at 1:50 PM, Thomas Dudziak tom...@gmail.com wrote: Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing a HQL query using HiveContext

Re: rdd.sample() methods very slow

2015-05-22 Thread Reynold Xin
You can do something like this: val myRdd = ... val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1) // this samples 10% of the partitions rddSampledByPartition.mapPartitions { iter => iter.take(10) } // take the first 10 elements out of each partition

Re: DataFrame Column Alias problem

2015-05-22 Thread Reynold Xin
In 1.4 it actually shows col1 by default. In 1.3, you can add col1 to the output, i.e. df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu sliznmail...@gmail.com wrote: However this returns a single column of c, without showing the original

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Reynold Xin
I'm not sure if it is possible to overload the map function twice, once for just KV pairs, and another for K and V separately. On Fri, May 22, 2015 at 10:26 AM, Justin Pihony justin.pih...@gmail.com wrote: This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the RDD API, but

Re: DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Reynold Xin
You've probably hit this bug: https://issues.apache.org/jira/browse/SPARK-7180 It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to false and see if it goes away. On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov evgeny.a.moro...@gmail.com wrote: Hi, I'm using spark

Re: Grouping runs of elements in a RDD

2015-06-30 Thread Reynold Xin
Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I have a problem where I have a RDD of elements: Item1 Item2 Item3 Item4 Item5 Item6 ... and I want to run a function over
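
A sketch of the iterator-in, iterator-out pattern for this, assuming an RDD[String] named rdd; runs that span partition boundaries would still need an extra stitching step:

    val runs = rdd.mapPartitions { iter =>
      val buf = iter.buffered
      new Iterator[List[String]] {
        def hasNext = buf.hasNext
        def next() = {
          // start a run with the next element, then absorb equal neighbors
          val head = buf.next()
          val run = scala.collection.mutable.ListBuffer(head)
          while (buf.hasNext && buf.head == head) run += buf.next()
          run.toList
        }
      }
    }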

Re: Memory allocation error with Spark 1.5

2015-08-05 Thread Reynold Xin
In Spark 1.5, we have a new way to manage memory (part of Project Tungsten). The default unit of memory allocation is 64MB, which is way too high when you have 1G of memory allocated in total and have more than 4 threads. We will reduce the default page size before releasing 1.5. For now, you
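
A sketch of the interim workaround, under the assumption that the 1.5-era knob is spark.buffer.pageSize; verify the name against your release before relying on it:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.buffer.pageSize", "2m") // well below the 64MB default mentioned above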

[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All, Spark 1.5.2 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.x users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.2

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
It's a completely different path. On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar wrote: > I would like to know if Hive on Spark uses or shares the execution code > with Spark SQL or DataFrames? > > More specifically, does Hive on Spark benefit from the changes made to >

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
No it does not -- although it'd benefit from some of the work to make shuffle more robust. On Sun, Nov 15, 2015 at 10:45 PM, kiran lonikar <loni...@gmail.com> wrote: > So does not benefit from Project Tungsten right? > > > On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin &l

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-01 Thread Reynold Xin
(quoted stack trace from the report) ...$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.s

Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
in turn kill the entire executor, causing entire > stages to be retried. In fine-grained mode, only the task fails and > subsequently gets retried without taking out an entire stage or worse. > > On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin <r...@databricks.com> wrote: > >>

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > The Tungsten project has mentioned that they are applying code generation > to speed up the conversion of data

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
Are you looking for this? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69 On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote: > Hi, > > I'd like to write a parquet file from the

Re: [SQL] Memory leak with spark streaming and spark sql in spark 1.5.1

2015-10-14 Thread Reynold Xin
+dev list On Wed, Oct 14, 2015 at 1:07 AM, Terry Hoo wrote: > All, > > Does anyone meet memory leak issue with spark streaming and spark sql in > spark 1.5.1? I can see the memory is increasing all the time when running > this simple sample: > > val sc = new

If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-14 Thread Reynold Xin
Can you reply to this email and provide us with reasons why you disable it? Thanks.

Re: orc read issue n spark

2015-11-18 Thread Reynold Xin
What do you mean by "starts delay scheduling"? Are you saying it is no longer doing local reads? If that's the case, you can increase the spark.locality.wait timeout. On Wednesday, November 18, 2015, Renu Yadav wrote: > Hi , > I am using spark 1.4.1 and saving orc file using >

Re: How to avoid shuffle errors for a large join ?

2015-08-29 Thread Reynold Xin
Can you try 1.5? This should work much, much better in 1.5 out of the box. For 1.4, I think you'd want to turn on sort-merge-join, which is off by default. However, the sort-merge join in 1.4 can still trigger a lot of garbage, making it slower. SMJ performance is probably 5x - 1000x better in
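
A sketch of the 1.4 workaround; the flag name spark.sql.planner.sortMergeJoin is from that era and, as described above, the feature is off by default:

    // enable sort-merge join before running the large join
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
    val joined = bigDf.join(otherDf, "key") // hypothetical DataFrames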

Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Reynold Xin
On Sat, Aug 29, 2015 at 7:17 PM, Reynold Xin <r...@databricks.com> wrote: > >> Can you try 1.5? This should work much, much better in 1.5 out of the box. >> >> For 1.4, I think you'd want to turn on sort-merge-join, which is off by >> default. However, the so

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-07 Thread Reynold Xin
On Wed, Sep 2, 2015 at 12:03 AM, Anders Arpteg wrote: > > BTW, is it possible (or will it be) to use Tungsten with dynamic > allocation and the external shuffle manager? > > Yes - I think this already works. There isn't anything specific here related to Tungsten.

Re: Perf impact of BlockManager byte[] copies

2015-09-10 Thread Reynold Xin
This is one problem I'd like to address soon - providing a binary block management interface for shuffle (and maybe other things) that avoids serialization/copying. On Fri, Feb 27, 2015 at 3:39 PM, Paul Wais wrote: > Dear List, > > I'm investigating some problems related to

Re: Best way to import data from Oracle to Spark?

2015-09-09 Thread Reynold Xin
Using the JDBC data source is probably the best way. http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin wrote: > What's the best way to import data from Oracle to Spark? Thanks! > > > -- > Best
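
A sketch of the JDBC data source read, with illustrative Oracle connection values; the Oracle JDBC driver must be on the classpath:

    val df = sqlContext.read.format("jdbc").options(Map(
      "url" -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
      "dbtable" -> "SCOTT.EMP",
      "user" -> "scott",
      "password" -> "tiger")).load()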

[ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Reynold Xin
Hi All, Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page. A huge thanks go to all of the individuals and organizations involved in development and testing of this

Re: How to avoid shuffle errors for a large join ?

2015-09-16 Thread Reynold Xin
Only SQL and DataFrame for now. We are thinking about how to apply that to a more general distributed collection based API, but it's not in 1.5. On Sat, Sep 5, 2015 at 11:56 AM, Gurvinder Singh <gurvinder.si...@uninett.no > wrote: > On 09/05/2015 11:22 AM, Reynold Xin wrote: > &g

[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.0 users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
Java 7 / 8? On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza wrote: > I just upgraded the spark-timeseries > project to run on top of > 1.5, and I'm noticing that tests are failing with OOMEs. > > I ran a jmap -histo on the

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
<sandy.r...@cloudera.com> wrote: > Java 7. > > FWIW I was just able to get it to work by increasing MaxPermSize to 256m. > > -Sandy > > On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin <r...@databricks.com> wrote: > >> Java 7 / 8? >> >> On Wed, Sep 9,

Re: in joins, does one side stream?

2015-09-19 Thread Reynold Xin
Reynold, > Can you please elaborate on this. I thought RDD also opens only an > iterator. Does it get materialized for joins? > > Rishi > > On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com> wrote: > >> Yes for RDD -- both are materializ

Re: in joins, does one side stream?

2015-09-20 Thread Reynold Xin
n DataFrame >> but not in RDD? >> >> they dont seem specific to structured data analysis to me. >> >> On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra < >> rishi80.mis...@gmail.com> wrote: >> >>> Got it..thnx Reynold.. >>&g

Re: in joins, does one side stream?

2015-09-18 Thread Reynold Xin
Yes for RDD -- both are materialized. No for DataFrame/SQL - one side streams. On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers wrote: > in scalding we join with the smaller side on the left, since the smaller > side will get buffered while the bigger side streams through the

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception is better or null is better depends on your use case. If you are debugging and want to find bugs with your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is

[discuss] dropping Python 2.6 support

2016-01-04 Thread Reynold Xin
Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when compared with Python 2.7. Some libraries that Spark depends on stopped supporting 2.6. We can still convince the library maintainers to

Re: Please add us to the Powered by Spark page

2015-11-24 Thread Reynold Xin
I just updated the page to say "email dev" instead of "email user". On Tue, Nov 24, 2015 at 1:16 AM, Sean Owen wrote: > Not sure who generally handles that, but I just made the edit. > > On Mon, Nov 23, 2015 at 6:26 PM, Sujit Pal wrote: > > Sorry to

Re: XML column not supported in Database

2016-01-11 Thread Reynold Xin
Can you file a JIRA ticket? Thanks. The URL is issues.apache.org/jira/browse/SPARK On Mon, Jan 11, 2016 at 1:44 AM, Gaini Rajeshwar < raja.rajeshwar2...@gmail.com> wrote: > Hi All, > > I am using PostgreSQL database. I am using the following jdbc call to > access a customer table (*customer_id

Re: Spark 2.0 Release Date

2016-06-07 Thread Reynold Xin
It'd be great to cut an RC as soon as possible. Looking at the blocker/critical issue list, the majority of them are API audits. I think people will get back to those once Spark Summit is over, and then we should see some good progress towards an RC. On Tue, Jun 7, 2016 at 6:20 AM, Jacek Laskowski

Re: Pros and Cons

2016-05-25 Thread Reynold Xin
On Wed, May 25, 2016 at 9:52 AM, Jörn Franke wrote: > Spark is more for machine learning working iteravely over the whole same > dataset in memory. Additionally it has streaming and graph processing > capabilities that can be used together. > Hi Jörn, The first part is

Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
agree, since they can be easily replaced by .flatMap (to do explosion) and > .select (to rename output columns) > > Cheng > > > On 5/25/16 12:30 PM, Reynold Xin wrote: > > Based on this discussion I'm thinking we should deprecate the two explode > functions. > > On We

Re: feedback on dataset api explode

2016-05-25 Thread Reynold Xin
Based on this discussion I'm thinking we should deprecate the two explode functions. On Wednesday, May 25, 2016, Koert Kuipers wrote: > wenchen, > that definition of explode seems identical to flatMap, so you dont need it > either? > > michael, > i didn't know about the

Re: JDBC Dialect for saving DataFrame into Vertica Table

2016-05-26 Thread Reynold Xin
It's probably a good idea to have the vertica dialect too, since it doesn't seem like it'd be too difficult to maintain. It is not going to be as performant as the native Vertica data source, but is going to be much lighter weight. On Thu, May 26, 2016 at 3:09 PM, Mohammed Guller
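
A sketch of what a minimal dialect registration could look like; the overrides a real Vertica dialect would need (type mappings, quoting, etc.) are omitted:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    // claim any jdbc:vertica URL; unoverridden behavior falls back to the defaults
    case object VerticaDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")
    }
    JdbcDialects.registerDialect(VerticaDialect)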

Re: Thanks For a Job Well Done !!!

2016-06-18 Thread Reynold Xin
Thanks for the kind words, Krishna! Please keep the feedback coming. On Saturday, June 18, 2016, Krishna Sankar wrote: > Hi all, >Just wanted to thank all for the dataset API - most of the times we see > only bugs in these lists ;o). > >- Putting some context, this

[discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

2016-01-13 Thread Reynold Xin
We've dropped Hadoop 1.x support in Spark 2.0. There is also a proposal to drop Hadoop 2.2 and 2.3, i.e. the minimal Hadoop version we support would be Hadoop 2.4. The main advantage is then we'd be able to focus our Jenkins resources (and the associated maintenance of Jenkins) to create builds

[ANNOUNCE] Announcing Spark 1.6.2

2016-06-27 Thread Reynold Xin
We are happy to announce the availability of Spark 1.6.2! This maintenance release includes fixes across several areas of Spark. You can find the list of changes here: https://s.apache.org/spark-1.6.2 And download the release here: http://spark.apache.org/downloads.html

Re: Is spark.driver.maxResultSize used correctly ?

2016-02-27 Thread Reynold Xin
But sometimes you might have skew and almost all the result data are in one or a few tasks though. On Friday, February 26, 2016, Jeff Zhang wrote: > > My job get this exception very easily even when I set large value of > spark.driver.maxResultSize. After checking the spark
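
A sketch of raising the cap when the collected result is legitimately large; the 4g value is illustrative:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.driver.maxResultSize", "4g")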

Re: DirectFileOutputCommiter

2016-02-26 Thread Reynold Xin
It could lose data in speculation mode, or if any job fails. On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman wrote: > Takeshi, do you know the reason why they wanted to remove this commiter in > SPARK-10063? > the jira has no info inside > as far as I understand the direct

Spark Summit (San Francisco, June 6-8) call for presentations due in less than a week

2016-02-24 Thread Reynold Xin
Just want to send a reminder in case people don't know about it. If you are working on (or with, using) Spark, consider submitting your work to Spark Summit, coming up in June in San Francisco. https://spark-summit.org/2016/call-for-presentations/ Cheers.

Re: Is spark.driver.maxResultSize used correctly ?

2016-03-01 Thread Reynold Xin
data skew might be possible, but not the common case. I think we should > design for the common case; for the skew case, we can perhaps expose a > fraction parameter to let the user tune it. > > On Sat, Feb 27, 2016 at 4:51 PM, Reynold Xin <r...@databricks.com> wrote:

Re: [Proposal] Enabling time series analysis on spark metrics

2016-03-01 Thread Reynold Xin
Is the suggestion just to use a different config (and maybe fallback to appid) in order to publish metrics? Seems reasonable. On Tue, Mar 1, 2016 at 8:17 AM, Karan Kumar wrote: > +dev mailing list > > Time series analysis on metrics becomes quite useful when running

Re: Spark Scheduler creating Straggler Node

2016-03-08 Thread Reynold Xin
You just want to be able to replicate hot cached blocks, right? On Tuesday, March 8, 2016, Prabhu Joseph wrote: > Hi All, > > When a Spark Job is running, and one of the Spark Executors on Node A > has some partitions cached. Later for some other stage, Scheduler
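
A sketch of block replication via storage level, assuming an already-built RDD; the _2 levels keep two replicas of each cached block, so reads of a hot partition can be served from two nodes:

    import org.apache.spark.storage.StorageLevel
    rdd.persist(StorageLevel.MEMORY_ONLY_2)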

[discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Reynold Xin
Any objections? Please articulate your use case. SparkEnv is a weird one because it was documented as "private" but not marked as such in class visibility. * NOTE: This is not intended for external use. This is exposed for Shark and may be made private * in a future release. I do see Hive

Re: [discuss] making SparkEnv private in Spark 2.0

2016-03-19 Thread Reynold Xin
On Wed, Mar 16, 2016 at 3:29 PM, Mridul Muralidharan wrote: > b) Shuffle manager (to get shuffle reader) > What's the use case for shuffle manager/reader? This seems like using super internal APIs in applications.

Re: Executor shutdown hooks?

2016-04-06 Thread Reynold Xin
On Wed, Apr 6, 2016 at 4:39 PM, Sung Hwan Chung wrote: > My option so far seems to be using JVM's shutdown hook, but I was > wondering if Spark itself had an API for tasks. > Spark would be using that under the hood anyway, so you might as well just use the jvm
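
A sketch of the JVM-level hook being suggested, using a lazily initialized singleton so each executor JVM registers it at most once; the cleanup body is illustrative:

    object Cleanup {
      // sys.addShutdownHook registers a plain java.lang.Thread shutdown hook
      lazy val registered: Unit = sys.addShutdownHook {
        println("executor JVM shutting down; flush/close resources here")
      }
    }
    rdd.foreachPartition { _ => Cleanup.registered } // forces registration on each executor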

Re: How Spark handles dead machines during a job.

2016-04-09 Thread Reynold Xin
The driver has the data and wouldn't need to rerun. On Friday, April 8, 2016, Sung Hwan Chung wrote: > Hello, > > Say, that I'm doing a simple rdd.map followed by collect. Say, also, that > one of the executors finish all of its tasks, but there are still other > executors

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Reynold Xin
+1 This is a no brainer IMO. On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley wrote: > +1 By the way, the JIRA for tracking (Scala) API parity is: > https://issues.apache.org/jira/browse/SPARK-4591 > > On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia

Re: df.dtypes -> pyspark.sql.types

2016-03-20 Thread Reynold Xin
We probably should have the alias. Is this still a problem on master branch? On Wed, Mar 16, 2016 at 9:40 AM, Ruslan Dautkhanov wrote: > Running following: > > #fix schema for gaid which should not be Double >> from pyspark.sql.types import * >> customSchema = StructType()

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
Are you looking for a "relaxed" mode that simply returns nulls for fields that don't exist or have an incompatible schema? On Wed, Mar 2, 2016 at 11:12 AM, Ewan Leith wrote: > Thanks Michael, it's not a great example really, as the data I'm working with > has some

Re: Selecting column in dataframe created with incompatible schema causes AnalysisException

2016-03-02 Thread Reynold Xin
I don't think that exists right now, but it's definitely a good option to have. I myself have run into this issue a few times. Can you create a JIRA ticket so we can track it? Would be even better if you are interested in working on a patch! Thanks. On Wed, Mar 2, 2016 at 11:51 AM, Ewan Leith
