Re: Best way to import data from Oracle to Spark?

2015-09-09 Thread Reynold Xin
Using the JDBC data source is probably the best way. http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#jdbc-to-other-databases On Tue, Sep 8, 2015 at 10:11 AM, Cui Lin wrote: > What's the best way to import data from Oracle to Spark? Thanks! > > > -- > Best
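A minimal sketch of that JDBC route (Spark 1.4 API, run in spark-shell where sqlContext is predefined; the Oracle URL, credentials, and table name are placeholders, and the Oracle JDBC driver jar must be on the classpath):

    val oracleDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder host/service
      .option("dbtable", "MY_SCHEMA.MY_TABLE")               // placeholder table
      .option("user", "scott")
      .option("password", "tiger")
      .load()                                                // yields a DataFrame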

[ANNOUNCE] Announcing Spark 1.5.0

2015-09-09 Thread Reynold Xin
Hi All, Spark 1.5.0 is the sixth release on the 1.x line. This release represents 1400+ patches from 230+ contributors and 80+ institutions. To download Spark 1.5.0 visit the downloads page. A huge thanks go to all of the individuals and organizations involved in development and testing of this

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
Java 7 / 8? On Wed, Sep 9, 2015 at 10:10 AM, Sandy Ryza wrote: > I just upgraded the spark-timeseries > project to run on top of > 1.5, and I'm noticing that tests are failing with OOMEs. > > I ran a jmap -histo on the

Re: Driver OOM after upgrading to 1.5

2015-09-09 Thread Reynold Xin
<sandy.r...@cloudera.com> wrote: > Java 7. > > FWIW I was just able to get it to work by increasing MaxPermSize to 256m. > > -Sandy > > On Wed, Sep 9, 2015 at 11:37 AM, Reynold Xin <r...@databricks.com> wrote: > >> Java 7 / 8? >> >> On Wed, Sep 9,
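For anyone hitting the same PermGen exhaustion on Java 7, one way to apply the fix from this thread is through the driver's JVM options, e.g. in conf/spark-defaults.conf (256m is the value that worked for Sandy):

    spark.driver.extraJavaOptions  -XX:MaxPermSize=256m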

Re: Problems with Tungsten in Spark 1.5.0-rc2

2015-09-07 Thread Reynold Xin
On Wed, Sep 2, 2015 at 12:03 AM, Anders Arpteg wrote: > > BTW, is it possible (or will it be) to use Tungsten with dynamic > allocation and the external shuffle manager? > > Yes - I think this already works. There isn't anything specific here related to Tungsten.
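The combination being asked about is just the two standard settings, sketched here for conf/spark-defaults.conf (on YARN the external shuffle service must also be registered with the node managers):

    spark.dynamicAllocation.enabled  true
    spark.shuffle.service.enabled    true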

Re: How to avoid shuffle errors for a large join ?

2015-09-05 Thread Reynold Xin
On Sat, Aug 29, 2015 at 7:17 PM, Reynold Xin <r...@databricks.com> wrote: > >> Can you try 1.5? This should work much, much better in 1.5 out of the box. >> >> For 1.4, I think you'd want to turn on sort-merge-join, which is off by >> default. However, the so

Re: How to avoid shuffle errors for a large join ?

2015-08-29 Thread Reynold Xin
Can you try 1.5? This should work much, much better in 1.5 out of the box. For 1.4, I think you'd want to turn on sort-merge-join, which is off by default. However, the sort-merge join in 1.4 can still trigger a lot of garbage, making it slower. SMJ performance is probably 5x - 1000x better in
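A sketch of turning the 1.4 flag on (assuming a SQLContext named sqlContext; later releases enable sort-merge join by default):

    // Spark 1.4: sort-merge join is off by default
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")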

Re: DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Reynold Xin
You've probably hit this bug: https://issues.apache.org/jira/browse/SPARK-7180 It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to false and see if it goes away. On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov evgeny.a.moro...@gmail.com wrote: Hi, I'm using spark
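A sketch of the suggested workaround, applied before the SparkContext is created:

    import org.apache.spark.SparkConf

    // disable the serialization debug info that triggers SPARK-7180
    val conf = new SparkConf().set("spark.serializer.extraDebugInfo", "false")
    // then build the context from it: new SparkContext(conf)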

Re: Memory allocation error with Spark 1.5

2015-08-05 Thread Reynold Xin
In Spark 1.5, we have a new way to manage memory (part of Project Tungsten). The default unit of memory allocation is 64MB, which is way too high when you have 1G of memory allocated in total and have more than 4 threads. We will reduce the default page size before releasing 1.5. For now, you
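The snippet cuts off before the workaround; a hedged sketch, assuming the 1.5-snapshot page-size key spark.buffer.pageSize (verify the key against your build's configuration docs):

    // shrink Tungsten's allocation page size for small heaps
    // NOTE: the key name is an assumption for 1.5-era snapshots
    val conf = new org.apache.spark.SparkConf().set("spark.buffer.pageSize", "16m")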

Re: Grouping runs of elements in a RDD

2015-06-30 Thread Reynold Xin
Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I have a problem where I have a RDD of elements: Item1 Item2 Item3 Item4 Item5 Item6 ... and I want to run a function over
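A minimal sketch of the suggestion, grouping consecutive elements into runs; rdd is assumed to be an RDD[String] and sameRun is a hypothetical predicate deciding whether two adjacent items belong to the same run:

    val runs = rdd.mapPartitions { iter =>
      new Iterator[List[String]] {
        private val in = iter.buffered
        def hasNext: Boolean = in.hasNext
        def next(): List[String] = {
          val run = scala.collection.mutable.ListBuffer(in.next())
          // extend the current run while the next element still belongs to it
          while (in.hasNext && sameRun(run.last, in.head)) run += in.next()
          run.toList
        }
      }
    }

One caveat: a run that straddles a partition boundary is split in two, which may or may not matter for the use case.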

Re: Building scaladoc using build/sbt unidoc failure

2015-06-13 Thread Reynold Xin
Try build/sbt clean first. On Tue, May 26, 2015 at 4:45 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I am trying to build scala doc from the 1.4 branch. But it failed due to [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode,

Re: Exception when using CLUSTER BY or ORDER BY

2015-06-12 Thread Reynold Xin
Tom, Can you file a JIRA and attach a small reproducible test case if possible? On Tue, May 19, 2015 at 1:50 PM, Thomas Dudziak tom...@gmail.com wrote: Under certain circumstances that I haven't yet been able to isolate, I get the following error when doing a HQL query using HiveContext

Re: rdd.sample() methods very slow

2015-05-22 Thread Reynold Xin
You can do something like this: val myRdd = ... val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1) // this samples 10% of the partitions rddSampledByPartition.mapPartitions { iter => iter.take(10) } // take the first 10 elements out of each partition
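The same idea as a self-contained sketch (myRdd is assumed to exist):

    import scala.util.Random
    import org.apache.spark.rdd.PartitionPruningRDD

    // keep ~10% of the partitions, then take 10 elements from each kept partition
    val sampledParts = PartitionPruningRDD.create(myRdd, _ => Random.nextDouble() < 0.1)
    val quickSample = sampledParts.mapPartitions(iter => iter.take(10))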

Re: DataFrame Column Alias problem

2015-05-22 Thread Reynold Xin
In 1.4 it actually shows col1 by default. In 1.3, you can add col1 to the output, i.e. df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu sliznmail...@gmail.com wrote: However this returns a single column of c, without showing the original

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Reynold Xin
I'm not sure if it is possible to overload the map function twice, once for just KV pairs, and another for K and V separately. On Fri, May 22, 2015 at 10:26 AM, Justin Pihony justin.pih...@gmail.com wrote: This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the RDD API, but

Re: Is the AMP lab done next February?

2015-05-11 Thread Reynold Xin
Relaying an answer from AMP director Mike Franklin: One year into the lab we got a 5 yr Expeditions in Computing Award as part of the White House Big Data initiative in 2012, so we extended the lab for a year. We intend to start winding it down at the end of 2016, while supporting existing

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-11 Thread Reynold Xin
Looks like it is spending a lot of time doing hash probing. It could be due to a number of the following: 1. hash probing itself is inherently expensive compared with the rest of your workload 2. murmur3 doesn't work well with this key distribution 3. quadratic probing (triangular sequence) with a

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Reynold Xin
, Olivier. On Mon, May 11, 2015 at 22:07, Reynold Xin r...@databricks.com wrote: Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API

Re: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Reynold Xin
Not by design. Would you be interested in submitting a pull request? On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote: I try to get the result schema of aggregate functions using DataFrame API. However, I find the result field of groupBy columns are always nullable even

[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-05 Thread Reynold Xin
Hi all, We will drop support for Java 6 starting with Spark 1.5, tentatively scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015, will be the last minor release that supports Java 6. That is to say: Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8. Spark 1.5+ (~

Re: How to distribute Spark computation recipes

2015-04-27 Thread Reynold Xin
The code itself is the recipe, no? On Mon, Apr 27, 2015 at 2:49 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi everyone, I know that any RDD is related to its SparkContext and the associated variables (broadcast, accumulators), but I'm looking for a way to

Re: how to make a spark cluster ?

2015-04-21 Thread Reynold Xin
Actually, if you only have one machine, just use the Spark local mode. Just download the Spark tarball, untar it, and set master to local[N], where N = number of cores. You are good to go. There is no job tracker or Hadoop setup needed. On Mon, Apr 20, 2015 at 3:21 PM, haihar nahak harihar1...@gmail.com
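A sketch of the same thing from a standalone program (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("local-demo")
      .setMaster("local[4]") // 4 = number of cores to use on this one machine
    val sc = new SparkContext(conf)

For the shell, the equivalent is ./bin/spark-shell --master local[4].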

Re: Updating a Column in a DataFrame

2015-04-21 Thread Reynold Xin
You can use df.withColumn("a", df.b) to make column a have the same value as column b. On Mon, Apr 20, 2015 at 3:38 PM, ARose ashley.r...@telarix.com wrote: In my Java application, I want to update the values of a Column in a given DataFrame. However, I realize DataFrames are immutable, and
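A sketch of the same pattern with a derived value rather than a straight copy (column names are placeholders):

    // DataFrames are immutable: "updating" a column means producing a new
    // DataFrame whose column is computed from the existing ones
    val updated = df.withColumn("a", df("b") * 2)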

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread Reynold Xin
You can use the more verbose syntax: d.groupBy("_1").agg(d("_1"), sum("_1").as("sum_1"), sum("_2").as("sum_2")) On Tue, Apr 21, 2015 at 1:06 AM, Justin Yip yipjus...@prediction.io wrote: Hello, I would like to rename a column after aggregation. In the following code, the column name is SUM(_1#179), is

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition -- which rearranges all the data. Parquet uses all kinds of compression techniques such as dictionary encoding and run-length encoding, which would result in the size difference when the data is ordered differently. On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei

Re: Expected behavior for DataFrame.unionAll

2015-04-14 Thread Reynold Xin
I think what happened was applying the narrowest common type. Type widening is required, and between a string and an int, the narrowest common type that fits both is string.

Re: [Spark1.3] UDF registration issue

2015-04-14 Thread Reynold Xin
You can do this: val strLen = udf((s: String) => s.length()) cleanProcessDF.withColumn("dii", strLen(col("di"))) (You might need to play with the type signature a little bit to get it to compile) On Fri, Apr 10, 2015 at 11:30 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi, I'm running into some
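Spelled out with the imports it needs (cleanProcessDF and the column names come from the thread):

    import org.apache.spark.sql.functions.{col, udf}

    val strLen = udf((s: String) => s.length)
    val withLen = cleanProcessDF.withColumn("dii", strLen(col("di")))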

Manning looking for a co-author for the GraphX in Action book

2015-04-13 Thread Reynold Xin
Hi all, Manning (the publisher) is looking for a co-author for the GraphX in Action book. The book currently has one author (Michael Malak), but they are looking for a co-author to work closely with Michael, improve the writing, and make it more consumable. Early access page for the book:

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Reynold Xin
There is already an explode function on DataFrame btw https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L712 I think something like this would work. You might need to play with the type. df.explode("arrayBufferColumn") { x => x } On Fri,

Re: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Reynold Xin
You totally can. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L792 There is also an attempt at adding stddev here already: https://github.com/apache/spark/pull/5228 On Thu, Mar 26, 2015 at 12:37 AM, Haopu Wang hw...@qilinsoft.com

Re: Build fails on 1.3 Branch

2015-03-29 Thread Reynold Xin
I pushed a hotfix to the branch. Should work now. On Sun, Mar 29, 2015 at 9:23 AM, Marty Bower sp...@mjhb.com wrote: Yes, that worked - thank you very much. On Sun, Mar 29, 2015 at 9:05 AM Ted Yu yuzhih...@gmail.com wrote: Jenkins build failed too:

Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
at 1:34 AM, Reynold Xin r...@databricks.com wrote: On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.

Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers ko...@tresata.com wrote: so finally i can resort to: rdd.saveAsObjectFile(...) sc.objectFile(...) but that seems like a rather broken abstraction. This seems like a fine solution to me.

Re: SchemaRDD: SQL Queries vs Language Integrated Queries

2015-03-10 Thread Reynold Xin
They should have the same performance, as they are compiled down to the same execution plan. Note that starting in Spark 1.3, SchemaRDD is renamed DataFrame: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html On Tue, Mar 10, 2015 at 2:13

Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed 3 talks that are important to

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Reynold Xin
The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by running jekyll build in docs. Alternatively, just look at dataframe.py as Ted pointed out. On Tue, Feb 24, 2015 at 6:56 AM, Ted Yu yuzhih...@gmail.com wrote: Have you

Re: New guide on how to write a Spark job in Clojure

2015-02-24 Thread Reynold Xin
Thanks for sharing, Chris. On Tue, Feb 24, 2015 at 4:39 AM, Christian Betz christian.b...@performance-media.de wrote: Hi all, Maybe some of you are interested: I wrote a new guide on how to start using Spark from Clojure. The tutorial covers - setting up a project, - doing REPL-

Re: How to retreive the value from sql.row by column name

2015-02-16 Thread Reynold Xin
BTW we merged this today: https://github.com/apache/spark/pull/4640 This should allow us in the future to address columns by name in a Row. On Mon, Feb 16, 2015 at 11:39 AM, Michael Armbrust mich...@databricks.com wrote: I can unpack the code snippet a bit: caper.select('ran_id) is the same
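A hedged sketch of the by-name access that change enables (availability depends on the Spark version; caper and ran_id come from the thread):

    val row = caper.first()
    val ranId = row.getAs[String]("ran_id") // look a field up by name instead of by position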

Re: Spark ML pipeline

2015-02-11 Thread Reynold Xin
Yes. Next release (Spark 1.3) is coming out end of Feb / early Mar. On Wed, Feb 11, 2015 at 7:22 AM, Jianguo Li flyingfromch...@gmail.com wrote: Hi, I really like the pipeline in the spark.ml in Spark1.2 release. Will there be more machine learning algorithms implemented for the pipeline

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Reynold Xin
I think we made the binary protocol compatible across all versions, so you should be fine with using any one of them. 1.2.1 is probably the best since it is the most recent stable release. On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I need to use

Re: 2GB limit for partitions?

2015-02-03 Thread Reynold Xin
cc dev list How are you saving the data? There are two relevant 2GB limits: 1. Caching 2. Shuffle For caching, a partition is turned into a single block. For shuffle, each map partition is partitioned into R blocks, where R = number of reduce tasks. It is unlikely a shuffle block > 2G,

Re: How to access OpenHashSet in my standalone program?

2015-01-14 Thread Reynold Xin
Yes, I can incorporate it to my package and use it. But I am still wondering why you designed such useful functions as private. On Tue, Jan 13, 2015 at 3:33 PM, Reynold Xin r...@databricks.com wrote: It is not meant to be a public API. If you want to use it, maybe copy the code out

Re: saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Reynold Xin
It is just calling RDD's saveAsTextFile. I guess we should really override the saveAsTextFile in SchemaRDD (or make Row.toString comma separated). Do you mind filing a JIRA ticket and copy me? On Tue, Jan 13, 2015 at 12:03 AM, Kevin Burton bur...@spinn3r.com wrote: This is almost funny. I

Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Reynold Xin
What query did you run? Parquet should have predicate and column pushdown, i.e. if your query only needs to read 3 columns, then only 3 will be read. On Mon, Jan 12, 2015 at 10:20 PM, Ajay Srivastava a_k_srivast...@yahoo.com.invalid wrote: Hi, I am trying to read a parquet file using - val
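A sketch of what column pruning means in practice (1.3-style API; the path and column names are placeholders):

    // only the three projected columns are actually read from the Parquet files
    val threeCols = sqlContext.parquetFile("/data/events.parquet").select("a", "b", "c")
    threeCols.count()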

Re: How to access OpenHashSet in my standalone program?

2015-01-13 Thread Reynold Xin
It is not meant to be a public API. If you want to use it, maybe copy the code out of the package and put it in your own project. On Fri, Jan 9, 2015 at 7:19 AM, Tae-Hyuk Ahn ahn@gmail.com wrote: Hi, I would like to use OpenHashSet (org.apache.spark.util.collection.OpenHashSet) in my

Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
It depends on your use case. If the use case is to extract a small amount of data out of Teradata, then you can use the JdbcRDD, and soon a JDBC input source based on the new Spark SQL external data source API. On Wed, Jan 7, 2015 at 7:14 AM, gen tang gen.tan...@gmail.com wrote: Hi, I have a
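A hedged sketch of the JdbcRDD route (the Teradata URL, credentials, and query are placeholders; the Teradata JDBC driver jar must be on the classpath, and the SQL must contain the two ? markers that JdbcRDD binds the partition bounds into):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:teradata://dbhost/DATABASE=mydb", "user", "pass"),
      "SELECT id, name FROM my_table WHERE id >= ? AND id <= ?",
      1L, 100000L, // lower and upper bounds substituted into the two ?s
      4,           // number of partitions
      rs => (rs.getLong(1), rs.getString(2)))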

Re: Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Reynold Xin
Darin, You might want to increase these config options also: spark.akka.timeout 300 spark.storage.blockManagerSlaveTimeoutMs 30 On Thu, Nov 13, 2014 at 11:31 AM, Darin McBeath ddmcbe...@yahoo.com.invalid wrote: For one of my Spark jobs, my workers/executors are dying and leaving the

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Reynold Xin
/10/spark-breaks-previous-large-scale-sort-record.html. Summary: while Hadoop MapReduce held last year's 100 TB world record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes. I want to thank Reynold Xin

Re: OOM with groupBy + saveAsTextFile

2014-11-02 Thread Reynold Xin
None of your tuning will help here because the problem is actually the way you are saving the output. If you take a look at the stack trace, it is trying to build a single string that is too large for the JVM to allocate. The JVM is actually not running out of memory, but rather, it cannot

Re: something about rdd.collect

2014-10-14 Thread Reynold Xin
Hi Randy, collect essentially transfers all the data to the driver node. You definitely wouldn’t want to collect 200 million words. It is a pretty large number and you can run out of memory on your driver with that much data. --  Reynold Xin On October 14, 2014 at 9:26:13 PM, randylu (randyl

Re: SQL queries fail in 1.2.0-SNAPSHOT

2014-09-29 Thread Reynold Xin
Hi Daoyuan, Do you mind applying this patch and look at the exception again? https://github.com/apache/spark/pull/2580 It has also been merged in master so if you pull from master, you should have that. On Mon, Sep 29, 2014 at 1:17 AM, Wang, Daoyuan daoyuan.w...@intel.com wrote: Hi all,

Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers, Some of the most active Spark developers (including Matei Zaharia, Michael Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to host a meetup event. This might be the event

Re: driver memory management

2014-09-28 Thread Reynold Xin
The storage fraction only limits the amount of memory used for storage; it doesn't actually limit anything else, i.e. you can use all the memory in collect if you want. On Sunday, September 28, 2014, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi All, I am interested to collect() a large RDD

Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread Reynold Xin
This is due to the HadoopRDD (and the underlying Hadoop InputFormat) reusing objects to avoid allocation. It is sort of tricky to fix. However, in most cases you can clone the records to make sure you are not collecting the same object over and over again.
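A sketch of the cloning workaround (the path and input format are assumptions): map each reused Writable to an immutable value before collecting.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/data/input")
      .map { case (offset, text) => (offset.get, text.toString) } // copy out of the reused objects
      .collect()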

Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening, but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5TB/node and stuffed

Re: Powered By Spark: Can you please add our org?

2014-07-08 Thread Reynold Xin
I added you to the list. Cheers. On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio adgau...@gmail.com wrote: Hi, Sailthru is also using Spark. Could you please add us to the Powered By Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark page when you have a chance?

openstack swift integration with Spark

2014-06-13 Thread Reynold Xin
If you are interested in openstack/swift integration with Spark, please drop me a line. We are looking into improving the integration. Thanks.

<    1   2