Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
y test(s) be disabled, strengthened and enabled again? > > Cheers > > On Fri, Nov 13, 2015 at 11:20 AM, Reynold Xin <r...@databricks.com> wrote: > >> It only runs tests that are impacted by the change. E.g. if you only >> modify SQL, it won't run the core or streaming te

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
In the interim, you can just build it off branch-1.4 if you want. On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin <r...@databricks.com> wrote: > I actually tried to build a binary for 1.4.2 and wanted to start voting, > but there was an issue with the release script that failed the

Re: Support for local disk columnar storage for DataFrames

2015-11-11 Thread Reynold Xin
Thanks for the email. Can you explain what the difference is between this and existing formats such as Parquet/ORC? On Wed, Nov 11, 2015 at 4:59 AM, Cristian O wrote: > Hi, > > I was wondering if there's any planned support for local disk columnar > storage. >

Re: Choreographing a Kryo update

2015-11-11 Thread Reynold Xin
We should consider this for Spark 2.0. On Wed, Nov 11, 2015 at 2:01 PM, Steve Loughran wrote: > > > Spark is currently on a fairly dated version of Kryo 2.x; it's trailing on > the fixes in Hive and, as the APIs are incompatible, resulted in that > mutant

[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All, Spark 1.5.2 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* that all 1.5.x users upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.2

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > > > 3. Assembly-free distribution of Spark: don’t require building an > enormous assembly jar in order to run Spark. > > Could you elaborate a bit on this? I'm not sure what an assembly-free > distribution

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
t of turmoil over the >>> Python 2 -> Python 3 transition because the upgrade process was too painful >>> for too long. The Spark community will benefit greatly from our explicitly >>> looking to avoid a similar situation.

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
er usage (e.g. I wouldn't >> be surprised if mapPartitionsWithContext was baked into a number of apps) >> and merit a little extra consideration. >> >> Maybe also obvious, but I think a migration guide with API equivalents and >> the like would be incredibly useful i

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
ade at the outset of 2.0 while > trying to guess what we'll need. > > On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote: > >> I’m starting a new thread since the other one got intermixed with feature >> requests. Please refrain from making feature

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-08 Thread Reynold Xin
Thanks everybody for voting. I'm going to close the vote now. The vote passes with 14 +1 votes and no -1 vote. I will work on packaging this asap. +1: Jean-Baptiste Onofré Egor Pahomov Luc Bourlier Tom Graves* Chester Chen Michael Armbrust* Krishna Sankar Robin East Reynold Xin* Joseph Bradley

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Reynold Xin
> +1 > Tested against CDH5.4.2 with hadoop 2.6.0 version using yesterday's code, > built locally. > > Regression running in Yarn Cluster mode against a few internal ML (logistic > regression, linear regression, random forest and statistic summary) as well > as MLlib KMeans. all seems to

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
Are you looking for this? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69 On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote: > Hi, > > I'd like to write a parquet file from the

Re: How to force statistics calculation of Dataframe?

2015-11-05 Thread Reynold Xin
hint is only available on dataframe api. > > On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin <r...@databricks.com> wrote: > >> Can you use the broadcast hint? >> >> e.g. >> >> df1.join(broadcast(df2)) >> >> the broadcast function is in org.apache.spa

Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Reynold Xin
You can hack around this by constructing logical plans yourself and then creating a DataFrame in order to execute them. Note that this all depends on internals of the framework and can break when Spark upgrades. On Thu, Nov 5, 2015 at 4:18 PM, Yana Kadiyska wrote:

Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Reynold Xin
Can you use the broadcast hint? e.g. df1.join(broadcast(df2)) the broadcast function is in org.apache.spark.sql.functions On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote: > Hi, > > If I have a hive table, analyze table compute statistics will ensure Spark > SQL has
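Editor's note: a minimal Scala sketch of the broadcast hint described above, assuming Spark 1.5-era APIs; df1 (large) and df2 (small) are hypothetical DataFrames sharing an "id" column.

    import org.apache.spark.sql.functions.broadcast

    // Mark the small side for broadcast regardless of table statistics.
    val joined = df1.join(broadcast(df2), "id")
    joined.explain()  // the physical plan should show a broadcast hash join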

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > The Tungsten project has mentioned that they are applying code generation > to speed up the conversion of data

Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
in turn kill the entire executor, causing entire > stages to be retried. In fine-grained mode, only the task fails and > subsequently gets retried without taking out an entire stage or worse. > > On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin <r...@databricks.com> wrote: > >>

[VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-03 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-01 Thread Reynold Xin
$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131) >>> at >>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) >>> at >>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.s

Re: Pickle Spark DataFrame

2015-10-28 Thread Reynold Xin
What are you trying to accomplish by pickling a Spark DataFrame? If your dataset is large, it doesn't make much sense to pickle it. If your dataset is small, maybe it's best to just pickle a Pandas dataframe. On Tue, Oct 27, 2015 at 9:47 PM, agg212 wrote: > Hi, I'd like to

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
>> OPTIONS ( >>>>>> path '/tmp/partitioned' >>>>>> )""") >>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show() >>>>>> >>>>>> Ch

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
t if you can clarify this. > > On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote: > >> I don't think these are bugs. The SQL standard for average is "avg", not >> "mean". Similarly, a distinct count is supposed to be written as >

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Reynold Xin
t 3:08 AM, Krishna Sankar <ksanka...@gmail.com> wrote: > >> Guys, >> The sc.version returns 1.5.1 in python and scala. Is anyone getting >> the same results? Probably I am doing something wrong. >> Cheer

Re: Exception when using some aggregate operators

2015-10-27 Thread Reynold Xin
Try count(distinct columnName). In SQL, distinct is not part of the function name. On Tuesday, October 27, 2015, Shagun Sodhani wrote: > Oops seems I made a mistake. The error message is: Exception in thread > "main" org.apache.spark.sql.AnalysisException: undefined
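Editor's note: a short Scala sketch contrasting the two forms, assuming a hypothetical table `foo` with column `a` registered in `sqlContext`.

    // Correct: DISTINCT is a SQL modifier inside the aggregate call,
    // not part of the function name.
    sqlContext.sql("SELECT count(DISTINCT a) FROM foo").show()

    // DataFrame API equivalent:
    import org.apache.spark.sql.functions.countDistinct
    df.agg(countDistinct("a")).show()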

[VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-25 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The

Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-22 Thread Reynold Xin
Why do you do a glom? It seems unnecessarily expensive to materialize each partition in memory. On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 wrote: > Hi, spark community > I have an application which I try to migrate from MR to Spark. > It will do some calculations from

Re: Exception when using cosh

2015-10-21 Thread Reynold Xin
I think we made a mistake and forgot to register the function in the registry: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala Do you mind submitting a pull request to fix this? Should be a one-line change. I
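Editor's note: a hedged sketch of what the one-line fix might look like, modeled on the registration style in Spark 1.5's FunctionRegistry, where each built-in function is a name-to-builder entry; the surrounding entries are illustrative, not an actual diff.

    // Inside FunctionRegistry.scala's expressions map (sketch, not the actual file):
    expression[Cos]("cos"),
    expression[Cosh]("cosh"),  // the missing registration this thread is about
    expression[Sinh]("sinh"),
    expression[Tanh]("tanh"),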

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-21 Thread Reynold Xin
er.java:74) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:56) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339) > > > On Tue, Oct 20, 2015 at 9:

Re: MapStatus too large for drvier

2015-10-20 Thread Reynold Xin
How big is your driver heap size? And any reason why you'd need 200k map and 200k reduce tasks? On Mon, Oct 19, 2015 at 11:59 PM, yaoqin wrote: > Hi everyone, > > When I run a spark job that contains quite a lot of tasks (in my case > 200,000*200,000), the driver occurred

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Reynold Xin
Jerry - I think that's been fixed in 1.5.1. Do you still see it? On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: > I disabled it because of the "Could not acquire 65536 bytes of memory". It > happens to fail the job. So for now, I'm not touching it. > > On Tue, Oct 20,

Fwd: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-20 Thread Reynold Xin
With Jerry's permission, sending this back to the dev list to close the loop. -- Forwarded message -- From: Jerry Lam <chiling...@gmail.com> Date: Tue, Oct 20, 2015 at 3:54 PM Subject: Re: If you use Spark 1.5 and disabled Tungsten mode ... To: Reynold Xin <r...@datab

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Reynold Xin
That could break a lot of applications. In particular, a lot of input data sources (csv, json) don't have clean schemas, and can have duplicate column names. For the case of join, maybe a better solution is to ask for a left/right prefix/suffix in the user code, similar to what Pandas does. On Wed,
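Editor's note: in lieu of built-in suffixes, a hand-rolled Scala sketch of the Pandas-style disambiguation suggested above; df1/df2 and the column names are hypothetical.

    // Rename the clashing column on each side before joining,
    // mimicking Pandas' suffixes=("_left", "_right").
    val left   = df1.select(df1("k"), df1("v").as("v_left"))
    val right  = df2.select(df2("k"), df2("v").as("v_right"))
    val joined = left.join(right, "k")  // result has no ambiguous "v"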

a few major changes / improvements for Spark 1.6

2015-10-12 Thread Reynold Xin
Hi Spark devs, It is hard to track everything going on in Spark with so many pull requests and JIRA tickets. Below are 4 major improvements that will likely be in Spark 1.6. We have already done prototyping for all of them, and want feedback on their design. 1. SPARK-9850 Adaptive query

Re: Scala 2.11 builds broken/ Can the PR build run also 2.11?

2015-10-08 Thread Reynold Xin
The problem only applies to the sbt build because it treats warnings as errors. @Iulian - how about we disable warnings -> errors for 2.11? That would seem better until we switch 2.11 to be the default build. On Thu, Oct 8, 2015 at 7:55 AM, Ted Yu wrote: > I tried

Re: spark over drill

2015-10-08 Thread Reynold Xin
You probably saw that in a presentation given by the Drill team. You should check with them on that. On Thu, Oct 8, 2015 at 11:51 AM, Pranay Tonpay wrote: > hi, > Is spark-drill integration already done? if yes, which spark version > supports it ... it was in the "upcoming

Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too. -- Forwarded message -- From: Reynold Xin <r...@databricks.com> Date: Tue, Oct 6, 2015 at 5:54 PM Subject: Re: multiple count distinct in SQL/DataFrame? To: "dev@spark.apache.org" <dev@spark.apache.org> To provide more co

Re: Pyspark dataframe read

2015-10-06 Thread Reynold Xin
I think the problem is that comma is actually a legitimate character in a file name, and as a result ... On Tuesday, October 6, 2015, Josh Rosen wrote: > Could someone please file a JIRA to track this? > https://issues.apache.org/jira/browse/SPARK > > On Tue, Oct 6, 2015 at

multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
The current implementation of multiple count distinct in a single query is very inferior in terms of performance and robustness, and it is also hard to guarantee correctness of the implementation in some of the refactorings for Tungsten. Supporting a better version of it is possible in the future,

Re: multiple count distinct in SQL/DataFrame?

2015-10-06 Thread Reynold Xin
(distinct colA, colB) from foo; On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin <r...@databricks.com> wrote: > The current implementation of multiple count distinct in a single query is > very inferior in terms of performance and robustness, and it is also hard > to guaran

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
You can write the data to local hdfs (or local disk) and just load it from there. On Mon, Oct 5, 2015 at 4:37 PM, Jegan wrote: > Thanks for your suggestion Ted. > > Unfortunately at this point of time I cannot go beyond 1000 partitions. I > am writing this data to BigQuery
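Editor's note: a minimal Scala sketch of the staging workaround suggested above, assuming a DataFrame `df`; the HDFS path is hypothetical.

    // Persist to durable storage and read it back, instead of holding
    // oversized in-memory blocks (the "Size exceeds Integer.MAX_VALUE"
    // error comes from a single block exceeding the 2 GB ByteBuffer limit).
    df.write.parquet("hdfs:///tmp/staging/my_data")
    val reloaded = sqlContext.read.parquet("hdfs:///tmp/staging/my_data")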

Re: IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2015-10-05 Thread Reynold Xin
nks, > Jegan > > On Mon, Oct 5, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote: > >> You can write the data to local hdfs (or local disk) and just load it >> from there. >> >> >> On Mon, Oct 5, 2015 at 4:37 PM, Jegan <jega...@gmail.com>

Re: [Build] repo1.maven.org: spark libs 1.5.0 for scala 2.10 poms are broken (404)

2015-10-02 Thread Reynold Xin
Both work for me. It's possible maven.org is having problems with some servers. On Fri, Oct 2, 2015 at 11:08 AM, Ted Yu wrote: > Andy: > 1.5.1 has been released. > > Maybe you can use this: > >

Re: Python UDAFs

2015-10-02 Thread Reynold Xin
No, not yet. On Fri, Oct 2, 2015 at 12:20 PM, Justin Uang wrote: > Hi, > > Is there a Python API for UDAFs? > > Thanks! > > Justin >

Re: Dataframe nested schema inference from Json without type conflicts

2015-10-01 Thread Reynold Xin
You can pass the schema into json directly, can't you? On Thu, Oct 1, 2015 at 10:33 AM, Ewan Leith wrote: > Hi all, > > > > We really like the ability to infer a schema from JSON contained in an > RDD, but when we’re using Spark Streaming on small batches of data, we
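Editor's note: a sketch of supplying the schema up front, assuming Spark 1.5-era APIs; the field names and `jsonRDD: RDD[String]` are hypothetical.

    import org.apache.spark.sql.types._

    // An explicit schema skips inference entirely, so small streaming
    // batches can no longer disagree about inferred types.
    val schema = StructType(Seq(
      StructField("user", StringType),
      StructField("count", LongType)))
    val df = sqlContext.read.schema(schema).json(jsonRDD)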

[ANNOUNCE] Announcing Spark 1.5.1

2015-10-01 Thread Reynold Xin
Hi All, Spark 1.5.1 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* that all 1.5.0 users upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.1

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-27 Thread Reynold Xin
Thanks everybody for voting. I'm going to close the vote now. The vote passes with 17 +1 votes and 1 -1 vote. I will work on packaging this asap. +1: Reynold Xin* Sean Owen Hossein Falaki Xiangrui Meng* Krishna Sankar Joseph Bradley Sean McNamara* Luciano Resende Doug Balog Eugene Zhulenev

[VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.1 [ ] -1 Do not release this package because ...

[Discuss] NOTICE file for transitive "NOTICE"s

2015-09-24 Thread Reynold Xin
Richard, Thanks for bringing this up and this is a great point. Let's start another thread for it so we don't hijack the release thread. On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote: > On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas > wrote: >

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
I'm going to +1 this myself. Tested on my laptop. On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin <r...@databricks.com> wrote: > I forked a new thread for this. Please discuss NOTICE file related things > there so it doesn't hijack this thread. > > > On Thu, Sep 24, 2015 a

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Reynold Xin
2. FYI, UDFs getM and getY work now (Thanks). Slower; saturates the CPU. A > non-scientific snapshot below. I know that this really has to be done more > rigorously, on a bigger machine, with more cores et al. > > On Thu, Sep 24, 2015 at 12:

Re: Why Filter return a DataFrame object in DataFrame.scala?

2015-09-23 Thread Reynold Xin
There is an implicit conversion in scope https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L153 /** * An implicit conversion function internal to this class for us to avoid doing * "new DataFrame(...)" everywhere. */ @inline
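Editor's note: for context, the body that this snippet truncates, as it appears in the Spark 1.5 source (DataFrame.scala):

    /**
     * An implicit conversion function internal to this class for us to avoid doing
     * "new DataFrame(...)" everywhere.
     */
    @inline private implicit def logicalPlanToDataFrame(logicalPlan: LogicalPlan): DataFrame = {
      new DataFrame(sqlContext, logicalPlan)
    }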

Re: DataFrames Aggregate does not spill?

2015-09-21 Thread Reynold Xin
What's the plan if you run explain? In 1.5 the default should be TungstenAggregate, which does spill (switching from hash-based aggregation to sort-based aggregation). On Mon, Sep 21, 2015 at 5:34 PM, Matt Cheah wrote: > Hi everyone, > > I’m debugging some slowness and
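Editor's note: a quick Scala sketch of the explain() check suggested above; `df` and the column names are hypothetical.

    import org.apache.spark.sql.functions.sum

    val agg = df.groupBy("key").agg(sum("value"))
    agg.explain()  // in 1.5 the physical plan should list TungstenAggregate,
                   // which falls back from hash- to sort-based aggregation (spilling)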

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, Whether throwing an exception or returning null is better depends on your use case. If you are debugging and want to find bugs in your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is

Re: RDD: Execution and Scheduling

2015-09-20 Thread Reynold Xin
On Sun, Sep 20, 2015 at 3:58 PM, gsvic wrote: > Concerning answers 1 and 2: > > 1) How Spark determines a node as a "slow node" and how slow is that? > There are two cases here: 1. If a node is busy (e.g. all slots are already occupied), the scheduler cannot schedule
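Editor's note: for the "slow task" case, a sketch of the speculative-execution knobs involved (names and defaults per the Spark 1.5 configuration docs):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "true")           // off by default
      .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish first
      .set("spark.speculation.multiplier", "1.5") // "slow" = this many times the median runtime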

Re: BUILD SYSTEM: fire and power event at UC berkeley's IST colo, jenkins offline

2015-09-19 Thread Reynold Xin
Great! Jon / Shane: Thanks for handling this. On Saturday, September 19, 2015, shane knapp wrote: > we're up and building! time for breakfast... :) > > https://amplab.cs.berkeley.edu/jenkins/ > > On Sat, Sep 19, 2015 at 7:35 AM, shane knapp

Re: spark-shell 1.5 doesn't seem to work in local mode

2015-09-19 Thread Reynold Xin
Maybe you have a hdfs-site.xml lying around somewhere? On Sat, Sep 19, 2015 at 9:14 AM, Madhu wrote: > I downloaded spark-1.5.0-bin-hadoop2.6.tgz recently and installed on > CentOS. > All my local Spark code works fine locally. > > For some odd reason, spark-shell doesn't work

Re: And.eval short circuiting

2015-09-18 Thread Reynold Xin
re’s one. > > Thanks, > Mingyu > > From: Reynold Xin > Date: Wednesday, September 16, 2015 at 1:17 PM > To: Zack Sampson > Cc: "dev@spark.apache.org", Mingyu Kim, Peter Faiman, Matt Cheah, Michael > Armbrust > > Subject: Re: And.eval short circuiting > >

Re: One element per node

2015-09-18 Thread Reynold Xin
Use a global atomic boolean and return nothing from that partition if the boolean is true. Note that your result won't be deterministic. On Sep 18, 2015, at 4:11 PM, Ulanov, Alexander wrote: Thank you! How can I guarantee that I have only one element per executor (per
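Editor's note: a minimal Scala sketch of the trick described above; `rdd` is hypothetical. A flag held in an `object` is instantiated once per executor JVM, which is what makes this (non-deterministically) one element per executor.

    import java.util.concurrent.atomic.AtomicBoolean

    object FirstOnExecutor {
      val taken = new AtomicBoolean(false) // one instance per executor JVM
    }

    val onePerExecutor = rdd.mapPartitions { iter =>
      // compareAndSet succeeds for exactly one partition per JVM
      if (FirstOnExecutor.taken.compareAndSet(false, true)) iter.take(1)
      else Iterator.empty
    }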

Re: One element per node

2015-09-18 Thread Reynold Xin
rministic by using > global long value and get the element on partition only if > someFunction(partitionId, globalLong)==true? Or by using some specific > partitioner that creates such partitionIds that can be decomposed into > nodeId and number of partitions per node? > > > &

Re: RDD: Execution and Scheduling

2015-09-17 Thread Reynold Xin
Your understanding is mostly correct. Replies inline. On Thu, Sep 17, 2015 at 5:23 AM, gsvic wrote: > After reading some parts of Spark source code I would like to make some > questions about RDD execution and scheduling. > > At first, please correct me if I am wrong at

Re: SparkR streaming source code

2015-09-16 Thread Reynold Xin
You should reach out to the speakers directly. On Wed, Sep 16, 2015 at 9:52 AM, Renyi Xiong wrote: > SparkR streaming is mentioned at about page 17 in below pdf, can anyone > share source code? (could not find it on GitHub) > > > >

Re: JENKINS: downtime next week, wed and thurs mornings (9-23 and 9-24)

2015-09-16 Thread Reynold Xin
Thanks Shane and Jon for the heads up. On Wednesday, September 16, 2015, shane knapp wrote: > good morning, denizens of the aether! > > your hard working build system (and some associated infrastructure) > has been in need of some updates and housecleaning for quite a while

Re: New Spark json endpoints

2015-09-16 Thread Reynold Xin
Do we need to increment the version number if it is just strict additions? On Wed, Sep 16, 2015 at 7:10 PM, Kevin Chen wrote: > Just wanted to bring this email up again in case there were any thoughts. > Having all the information from the web UI accessible through a

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
n to that limit. With >> tasks 4 times the number of cores there will be some contention and so they >> remain active for longer. >> >> So I think this is a test case issue configuring the number of executors >> too high. >> >> On 15 September 2015 at 18:54, Reynold Xin

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
bin...@gmail.com> wrote: > Reynold, thanks for replying. > > getPageSize parameters: maxMemory=515396075, numCores=0 > Calculated values: cores=8, default=4194304 > > So am I getting a large page size as I only have 8 cores? > > On 15 September 2015 at 00:40, Reynold X

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-15 Thread Reynold Xin
>> That test explicitly sets the number of executor cores to 32. >> >> object TestHive >> extends TestHiveContext( >> new SparkContext( >> System.getProperty("spark.sql.test.master", "local[32]"), >> >> >> On Mon, Sep 1

Re: JDBC Dialect tests

2015-09-14 Thread Reynold Xin
SPARK-9818 you link to actually links to a pull request trying to bring them back. On Mon, Sep 14, 2015 at 1:34 PM, Luciano Resende wrote: > I was looking for the code mentioned in SPARK-9818 and SPARK-6136 that > supposedly is testing MySQL and PostgreSQL using Docker

Re: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-14 Thread Reynold Xin
Pete - can you do me a favor? https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala#L174 Print the parameters that are passed into the getPageSize function, and check their values. On Mon, Sep 14, 2015 at 4:32 PM, Reynold Xin &l

Spark 1.5.1 release

2015-09-14 Thread Reynold Xin
Hi devs, FYI - we have already accumulated an "interesting" list of issues found with the 1.5.0 release. I will work on an RC in the next week or two, depending on how many blocker/critical issues are fixed. https://issues.apache.org/jira/issues/?filter=1221

Re: [ANNOUNCE] Announcing Spark 1.5.0

2015-09-11 Thread Reynold Xin
It is already there, but the search is not updated. Not sure what's going on with maven central search. http://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.10/1.5.0/ On Fri, Sep 11, 2015 at 10:21 AM, Ryan Williams < ryan.blake.willi...@gmail.com> wrote: > Any idea why 1.5.0 is not

Re: ClassCastException using DataFrame only when num-executors > 2 ...

2015-09-10 Thread Reynold Xin
Does this still happen on 1.5.0 release? On Mon, Aug 31, 2015 at 9:31 AM, Olivier Girardot wrote: > tested now against Spark 1.5.0 rc2, and same exceptions happen when > num-executors > 2 : > > 15/08/25 10:31:10 WARN scheduler.TaskSetManager: Lost task 0.1 in stage > 5.0

Re: Did the 1.5 release complete?

2015-09-09 Thread Reynold Xin
Dev/user announcement was made just now. For Maven, I did publish it this afternoon (so it's been a few hours). If it is still not there tomorrow morning, I will look into it. On Wed, Sep 9, 2015 at 2:42 AM, Sean Owen wrote: > I saw the end of the RC3 vote: > >

Re: groupByKey() and keys with many values

2015-09-08 Thread Reynold Xin
On Tue, Sep 8, 2015 at 6:51 AM, Antonio Piccolboni wrote: > As far as the DB writes, remember spark can retry a computation, so your > writes have to be idempotent (see this thread, in > which Reynold

Re: Fast Iteration while developing

2015-09-07 Thread Reynold Xin
I usually write a test case for what I want to test, and then run sbt/sbt "~module/test:test-only *MyTestSuite" On Mon, Sep 7, 2015 at 6:02 PM, Justin Uang wrote: > Hi, > > What is the normal workflow for the core devs? > > - Do we need to build the assembly jar to be

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-04 Thread Reynold Xin
: >> >> 1. The synthetic column names are lowercase (i.e. now ‘sum(OrderPrice)’; >> previously ‘SUM(OrderPrice)’; now ‘avg(Total)’; previously ‘AVG(Total)’). >> So programs that depend on the case of the synthetic column names would >> fail. >> 2. orders_3.groupBy

Re: Code generation for GPU

2015-09-03 Thread Reynold Xin
See responses inline. On Thu, Sep 3, 2015 at 1:58 AM, kiran lonikar wrote: > Hi, > >1. I found where the code generation > >

[VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-01 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ...

Re: Tungsten off heap memory access for C++ libraries

2015-09-01 Thread Reynold Xin
Please do. Thanks. On Mon, Aug 31, 2015 at 5:00 AM, Paul Weiss <paulweiss@gmail.com> wrote: > Sounds good, want me to create a jira and link it to SPARK-9697? Will put > down some ideas to start. > On Aug 31, 2015 4:14 AM, "Reynold Xin" <r...@databricks

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-31 Thread Reynold Xin
") OK >> 5.0. Packages >> 5.1. com.databricks.spark.csv - read/write OK >> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn’t work. But >> com.databricks:spark-csv_2.11:1.2.0 worked) >> 6.0. DataFrames >> 6.1. cast,dtypes OK >> 6.2. groupBy

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss wrote: > > Also, is this work being done on a branch I could look into further and > try out? > > We don't have a branch yet -- because there is no code nor design for this yet. As I said, it is one of the motivations behind

Re: Tungsten off heap memory access for C++ libraries

2015-08-31 Thread Reynold Xin
:12 AM, Reynold Xin <r...@databricks.com> wrote: > > On Sun, Aug 30, 2015 at 5:58 AM, Paul Weiss <paulweiss@gmail.com> > wrote: > >> >> Also, is this work being done on a branch I could look into further and >> try out? >> >> > We don't

Re: Tungsten off heap memory access for C++ libraries

2015-08-29 Thread Reynold Xin
Supporting non-JVM code without memory copying and serialization is actually one of the motivations behind Tungsten. We didn't talk much about it since it is not end-user-facing and it is still too early. There are a few challenges still: 1. Spark cannot run entirely in off-heap mode (by entirely

Re: Research of Spark scalability / performance issues

2015-08-29 Thread Reynold Xin
Both 2 and 3 are pretty good topics for a master's project, I think. You can also look into how one can improve Spark's scheduler throughput. A couple of years ago Kay measured it, but things have changed. It would be great to start with measurement, and then look at where the bottlenecks are, and see how

Re: Opening up metrics interfaces

2015-08-27 Thread Reynold Xin
I'd like this to happen, but it hasn't been super high priority on anybody's mind. There are a couple of things that could be good to do: 1. At the application level: consolidate task metrics and accumulators. They have substantial overlap, and from a high level should just be consolidated. Maybe

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-27 Thread Reynold Xin
Marcelo - please submit a patch anyway. If we don't include it in this release, it will go into 1.5.1. On Thu, Aug 27, 2015 at 4:56 PM, Marcelo Vanzin van...@cloudera.com wrote: On Thu, Aug 27, 2015 at 4:42 PM, Marcelo Vanzin van...@cloudera.com wrote: The Windows issue Sen raised could

Re: SQLContext.read.json(path) throws java.io.IOException

2015-08-26 Thread Reynold Xin
Any reason why you have more than 2G in a single line? There is a limit of 2G in the Hadoop library we use. Also the JVM doesn't work when your string is that long. On Wed, Aug 26, 2015 at 11:38 AM, gsvic victora...@gmail.com wrote: Yes, it contain one line On Wed, Aug 26, 2015 at 8:20 PM,

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Reynold Xin
One small update -- the vote should close Saturday Aug 29. Not Friday Aug 29. On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-26 Thread Reynold Xin
)) }) was false. (DirectKafkaStreamSuite.scala:249) On Wed, Aug 26, 2015 at 5:28 AM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes if a majority of at least 3

[VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-25 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ...

Re: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Reynold Xin
On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: It seems that there is a nice improvement with Tungsten enabled given that data is persisted in memory 2x and 3x. However, the improvement is not that nice for parquet, it is 1.5x. What’s interesting, with

Re: [jira] [Commented] (INFRA-10191) git pushing for Spark fails

2015-08-24 Thread Reynold Xin
This has been resolved. On Mon, Aug 24, 2015 at 11:58 AM, Reynold Xin r...@databricks.com wrote: FYI -- Forwarded message -- From: Geoffrey Corey (JIRA) j...@apache.org Date: Mon, Aug 24, 2015 at 11:54 AM Subject: [jira] [Commented] (INFRA-10191) git pushing for Spark

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Reynold Xin
)) }) was false. (DirectKafkaStreamSuite.scala:249) On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority

Fwd: [jira] [Commented] (INFRA-10191) git pushing for Spark fails

2015-08-24 Thread Reynold Xin
: Reynold Xin Assignee: Geoffrey Corey Not sure what's going on, but it happened to at least two committers with the following errors: Using Spark's merge script: {code} Exception while pushing: Command '[u'git', u'push', u'apache', u'PR_TOOL_MERGE_PR_8373_MASTER:master']' returned

Re: DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Reynold Xin
You've probably hit this bug: https://issues.apache.org/jira/browse/SPARK-7180 It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to false and see if it goes away. On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov evgeny.a.moro...@gmail.com wrote: Hi, I'm using spark
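Editor's note: a sketch of applying the suggested workaround; the config key is the one named above, the rest is generic SparkConf usage.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer.extraDebugInfo", "false") // disable the serialization
                                                       // debug path hit by SPARK-7180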

Re: Tungsten and sun.misc.Unsafe

2015-08-21 Thread Reynold Xin
I'm actually somewhat involved with the Google Docs you linked to. I don't think Oracle will remove Unsafe in JVM 9. As you said, JEP 260 already proposes making Unsafe available. Given the widespread use of Unsafe for performance and advanced functionalities, I don't think Oracle can just remove
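Editor's note: for readers unfamiliar with the API under discussion, a sketch of the standard reflection route to sun.misc.Unsafe (Unsafe.getUnsafe() itself is caller-restricted); this is generic JVM code, not Spark's internal wrapper.

    import sun.misc.Unsafe

    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    val unsafe = f.get(null).asInstanceOf[Unsafe]

    val addr = unsafe.allocateMemory(1024) // raw off-heap allocation
    unsafe.putLong(addr, 42L)
    println(unsafe.getLong(addr))          // 42
    unsafe.freeMemory(addr)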

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-21 Thread Reynold Xin
Problem noted. Apparently the release script doesn't automate the replacement of all version strings yet. I'm going to publish a new RC over the weekend with the release version properly assigned. Please continue the testing and report any problems you find. Thanks! On Fri, Aug 21, 2015 at 2:20

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
don't see the change in time if I unset the unsafe flags. Could you explain why it might happen? On Aug 20, 2015, at 15:32, Reynold Xin <r...@databricks.com> wrote: I didn't wait long enough earlier. Actually it did finish when I raised memory to 8g. In 1.5

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
BTW one other thing -- don't use count() to benchmark, since the optimizer is smart enough to figure out that you don't actually need to run the sum. For the purpose of benchmarking, you can use df.foreach(i => do nothing) On Thu, Aug 20, 2015 at 3:31 PM, Reynold Xin r
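Editor's note: a sketch of the benchmarking pattern suggested above; `df` is a hypothetical DataFrame.

    // Forces every row to be produced without giving the optimizer
    // a chance to skip work the way count() can.
    val start = System.nanoTime()
    df.foreach(_ => ())
    println(s"full evaluation took ${(System.nanoTime() - start) / 1e6} ms")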
