Re: [SQL] Write parquet files under partition directories?

2015-06-01 Thread Reynold Xin
There will be in 1.4: df.write.partitionBy("year", "month", "day").parquet("/path/to/output"). On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote: Hi there, I noticed in the latest Spark SQL programming guide https://spark.apache.org/docs/latest/sql-programming-guide.html, there
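
For reference, a minimal sketch of the 1.4+ partitioned-write API mentioned above, assuming a spark-shell session where sqlContext is available and df has year/month/day columns:

    // Writes one directory per partition value,
    // e.g. /path/to/output/year=2015/month=6/day=1/part-*.parquet
    df.write
      .partitionBy("year", "month", "day")
      .parquet("/path/to/output")

    // Reading the output back discovers the partition columns automatically.
    val reloaded = sqlContext.read.parquet("/path/to/output")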

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-30 Thread Reynold Xin
+1 On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.1! This release fixes a handful of known issues in Spark 1.4.0, listed here: http://s.apache.org/spark-1.4.1 The tag to be voted on

Re: [pyspark] What is the best way to run a minimum unit testing related to our developing module?

2015-07-01 Thread Reynold Xin
Run ./python/run-tests --help and you will see. :) On Wed, Jul 1, 2015 at 9:10 PM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: Hi all, When I develop pyspark modules, such as adding a spark.ml API in Python, I'd like to run a minimum unit testing related to the developing module again

Re: HyperLogLogUDT

2015-07-02 Thread Reynold Xin
Yes - it's very interesting. However, ideally we should have a version of hyperloglog that can work directly against some raw bytes in memory (rather than java objects), in order for this to fit the Tungsten execution model where everything is operating directly against some memory address. On

asf git merge currently not working

2015-07-06 Thread Reynold Xin
FYI there are some problems with ASF's git or ldap infra. As a result, we cannot merge anything into Spark right now. An infra ticket has been created: https://issues.apache.org/jira/browse/INFRA-9932 Please watch/vote on that ticket for progress. Thanks.

Re: Tungsten's Vectorized Execution

2015-05-22 Thread Reynold Xin
Yijie, As Davies said, it will take us a while to get to vectorized execution. However, before that, we are going to refactor code generation to push it into each expression: https://issues.apache.org/jira/browse/SPARK-7813 Once this one is in (probably in the next 2 or 3 weeks), there will be

Re: Tungsten's Vectorized Execution

2015-05-25 Thread Reynold Xin
Yes that's exactly the reason. On Sat, May 23, 2015 at 12:37 AM, Yijie Shen henry.yijies...@gmail.com wrote: Davies and Reynold, Glad to hear about the status. I’ve seen [SPARK-7813](https://issues.apache.org/jira/browse/SPARK-7813) and watching it now. If I understand correctly, it’s

Re: spark packages

2015-05-23 Thread Reynold Xin
That's the nice thing about Spark packages. It is just a package index for libraries and applications built on top of Spark and not part of the Spark codebase, so it is not restricted to follow only ASF-compatible licenses. On Sat, May 23, 2015 at 10:12 PM, DB Tsai dbt...@dbtsai.com wrote: I

Re: Testing spark applications

2015-05-21 Thread Reynold Xin
It is just 15 lines of code to copy, isn't it? On Thu, May 21, 2015 at 7:46 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: see discussions about Spark not really liking multiple contexts in the same JVM Speaking of this - is there a standard way of writing unit tests that

Re: SparkR and RDDs

2015-05-26 Thread Reynold Xin
You definitely don't want to implement kmeans in R, since it would be very slow. Just providing R wrappers for the MLlib implementation is the way to go. I believe one of the major items in SparkR next is the MLlib wrappers. On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis

Re: DataFrame. SparkPlan / Project serialization issue: ArrayIndexOutOfBounds.

2015-08-21 Thread Reynold Xin
You've probably hit this bug: https://issues.apache.org/jira/browse/SPARK-7180 It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to false and see if it goes away. On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov evgeny.a.moro...@gmail.com wrote: Hi, I'm using spark
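
A minimal sketch of the suggested workaround, assuming the flag is set before the SparkContext is created:

    import org.apache.spark.{SparkConf, SparkContext}

    // Disabling the serializer's extra debug info sidesteps the code path
    // affected by SPARK-7180 on unpatched 1.4.0 builds.
    val conf = new SparkConf()
      .setAppName("serializer-workaround")
      .set("spark.serializer.extraDebugInfo", "false")
    val sc = new SparkContext(conf)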

Re: Tungsten and sun.misc.Unsafe

2015-08-21 Thread Reynold Xin
I'm actually somewhat involved with the Google Docs you linked to. I don't think Oracle will remove Unsafe in JVM 9. As you said, JEP 260 already proposes making Unsafe available. Given the widespread use of Unsafe for performance and advanced functionalities, I don't think Oracle can just remove

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-21 Thread Reynold Xin
Problem noted. Apparently the release script doesn't automate the replacement of all version strings yet. I'm going to publish a new RC over the weekend with the release version properly assigned. Please continue the testing and report any problems you find. Thanks! On Fri, Aug 21, 2015 at 2:20

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
I don't see the change in time if I unset the unsafe flags. Could you explain why it might happen? On Aug 20, 2015, at 15:32, Reynold Xin r...@databricks.com wrote: I didn't wait long enough earlier. Actually it did finish when I raised memory to 8g. In 1.5

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
BTW one other thing -- don't use count() to benchmark, since the optimizer is smart enough to figure out that you don't actually need to run the sum. For the purpose of benchmarking, you can use a no-op foreach, e.g. df.foreach(_ => ()). On Thu, Aug 20, 2015 at 3:31 PM, Reynold Xin r
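
A minimal sketch of that benchmarking pattern, assuming a DataFrame df built in a spark-shell session:

    // count() can be answered without evaluating the aggregation being measured,
    // so force a full materialization with a no-op foreach instead.
    val start = System.nanoTime()
    df.foreach(_ => ())  // evaluates every row, discards the results
    println(s"scan took ${(System.nanoTime() - start) / 1e6} ms")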

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
*From:* Reynold Xin [mailto:r...@databricks.com] *Sent:* Thursday, August 20, 2015 4:22 PM *To:* Ulanov, Alexander *Cc:* dev@spark.apache.org *Subject:* Re: Dataframe aggregation with Tungsten unsafe I think you might need to turn codegen on also in order for the unsafe stuff to work

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
Please git pull :) On Thu, Aug 20, 2015 at 5:35 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe feature was added to Spark on April 29.) *From:* Reynold Xin [mailto:r...@databricks.com] *Sent:* Thursday

[VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-20 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ...

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
Unsafe on: spark.sql.codegen true, spark.sql.unsafe.enabled true, spark.unsafe.offHeap true. Unsafe off: spark.sql.codegen false, spark.sql.unsafe.enabled false, spark.unsafe.offHeap false. *From:* Reynold Xin [mailto:r...@databricks.com] *Sent:* Thursday, August 20, 2015 5:43 PM

Re: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-20 Thread Reynold Xin
Thanks for reporting back, Mark. I will soon post a release candidate. On Thursday, August 20, 2015, mkhaitman mark.khait...@chango.com wrote: Turns out it was a mix of user-error as well as a bug in the sbt/sbt build that has since been fixed in the current 1.5 branch (I built from this

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
How did you run this? I couldn't run your query with 4G of RAM in 1.4, but in 1.5 it ran. Also I recommend just dumping the data to parquet on disk to evaluate, rather than using the in-memory cache, which is super slow and we are thinking of removing/replacing with something else. val size =
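
A sketch of the parquet-dump approach suggested above, with hypothetical column names; benchmark against the on-disk copy rather than the in-memory cache:

    // Write the generated data out once ...
    df.write.parquet("/tmp/bench_data")

    // ... then run the measured query against the parquet copy.
    val data = sqlContext.read.parquet("/tmp/bench_data")
    data.groupBy("key").sum("value").show()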

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Reynold Xin
On Thu, Aug 20, 2015 at 3:22 PM, Reynold Xin r...@databricks.com wrote: How did you run this? I couldn't run your query with 4G of RAM in 1.4, but in 1.5 it ran. Also I recommend just dumping the data to parquet on disk to evaluate, rather than using the in-memory cache, which is super

Re: Question about Spark process and thread

2015-06-29 Thread Reynold Xin
Most of those threads are not for task execution. They are for RPC, scheduling, ... On Sun, Jun 28, 2015 at 8:32 AM, Dogtail Ray spark.ru...@gmail.com wrote: Hi, I was looking at Spark source code, and I found that when launching an Executor, Spark is actually launching a threadpool; each

Re: Dataframes filter by count fails with python API

2015-06-29 Thread Reynold Xin
Hi Andrew, Thanks for the email. This is a known bug with the expression parser. We will hopefully fix this in 1.5. The expression parser has more reserved keywords, and we have already gotten rid of most of them. Count is still there due to the handling of count distinct, but we plan to get rid

Re: Grouping runs of elements in a RDD

2015-06-30 Thread Reynold Xin
Try mapPartitions, which gives you an iterator, and you can produce an iterator back. On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I have a problem where I have an RDD of elements: Item1 Item2 Item3 Item4 Item5 Item6 ... and I want to run a function over
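
A sketch of the mapPartitions approach, assuming runs never span partition boundaries (if they can, the partition edges need a separate stitching pass):

    val rdd = sc.parallelize(Seq("a", "a", "b", "b", "b", "c"), 1)

    // Consume the partition's iterator and emit one (value, runLength)
    // pair per run of consecutive equal elements.
    val runs = rdd.mapPartitions { iter =>
      new Iterator[(String, Int)] {
        private val in = iter.buffered
        def hasNext: Boolean = in.hasNext
        def next(): (String, Int) = {
          val head = in.next()
          var len = 1
          while (in.hasNext && in.head == head) { in.next(); len += 1 }
          (head, len)
        }
      }
    }
    runs.collect().foreach(println)  // (a,2), (b,3), (c,1)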

Re: What are 'Buckets' referred in Spark Core code

2015-08-02 Thread Reynold Xin
There are two usages of buckets in Spark core. The first is in histograms, used to perform sorting. Basically we build an approximate histogram of the data in order to decide how to partition the data in sorting. Each bucket is a range in the histogram. The 2nd is used in shuffle, where

Reminder about Spark 1.5.0 code freeze deadline of Aug 1st

2015-07-28 Thread Reynold Xin
Hey All, Just a friendly reminder that Aug 1st is the feature freeze for Spark 1.5, meaning major outstanding changes will need to land this week. After Aug 1st we'll package a release for testing and then go into the normal triage process where bugs are prioritized and some smaller

Re: Custom UDFs with zero parameters support

2015-07-29 Thread Reynold Xin
BTW for 1.5, there is already a now()-like function being added, so it should work out of the box in 1.5.0, to be released end of Aug/early Sep. On Tue, Jul 28, 2015 at 11:38 PM, Reynold Xin r...@databricks.com wrote: Yup - would you be willing to submit a patch to add UDF0? Should be pretty

Re: Custom UDFs with zero parameters support

2015-07-29 Thread Reynold Xin
/main/java/org/apache/spark/sql/api/java ). But currently there is no UDF0 adapter. Any suggestions? I'm new to Spark and any help would be appreciated. -- Thanks, Sachith Withana On Tue, Jul 28, 2015 at 10:18 PM, Reynold Xin r...@databricks.com wrote: I think we do support 0 arg UDFs

Re: Custom UDFs with zero parameters support

2015-07-29 Thread Reynold Xin
On Wed, Jul 29, 2015 at 11:46 AM, Reynold Xin r...@databricks.com wrote: We should add UDF0 to it. For now, can you just create a one-arg UDF and not use the argument? On Tue, Jul 28, 2015 at 10:59 PM, Sachith Withana swsach...@gmail.com wrote: Hi Reynold, I'm implementing the interfaces
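
A sketch of that workaround in Scala (the thread concerns the Java UDF interfaces, but the trick is the same): register a one-argument UDF and feed it a dummy literal. Note callUDF is the 1.5 spelling; 1.4 named it callUdf.

    import org.apache.spark.sql.functions.{callUDF, lit}

    // No UDF0 adapter exists yet, so accept one argument and ignore it.
    sqlContext.udf.register("nowMillis", (_: Int) => System.currentTimeMillis())

    val withTs = df.withColumn("ts", callUDF("nowMillis", lit(0)))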

Re: Custom UDFs with zero parameters support

2015-07-28 Thread Reynold Xin
I think we do support 0 arg UDFs: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2165 How are you using UDFs? On Tue, Jul 28, 2015 at 2:15 AM, Sachith Withana swsach...@gmail.com wrote: Hi all, Currently I need to support custom

[ANNOUNCE] Spark branch-1.5

2015-08-03 Thread Reynold Xin
Hi Devs, Just an announcement that I've cut Spark's branch-1.5 to form the basis of the 1.5 release. Other than a few stragglers, this represents the end of active feature development for Spark 1.5. *If committers are merging any features (outside of alpha modules), please shoot me an email so I

Re: Came across Spark SQL hang/Error issue with Spark 1.5 Tungsten feature

2015-07-31 Thread Reynold Xin
Is this deterministically reproducible? Can you try this on the latest master branch? Would be great to turn on debug logging and dump the generated code. Also would be great to dump the array size at your line 314 in UnsafeRow (and whatever master branch's appropriate line is). On Fri, Jul 31,

Re: [discuss] Removing individual commit messages from the squash commit message

2015-08-11 Thread Reynold Xin
Sandy Ryza sandy.r...@cloudera.com wrote: +1 On Sat, Jul 18, 2015 at 4:00 PM, Mridul Muralidharan mri...@gmail.com wrote: Thanks for detailing, definitely sounds better. +1 Regards Mridul On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote: A single commit message

Re: Fixed number of partitions in RangePartitioner

2015-08-06 Thread Reynold Xin
Any reason why you need exactly a certain number of partitions? One way we can make that work is for RangePartitioner to return a bunch of empty partitions if the number of distinct elements is small. That would require changing Spark. If you want a quick work around, you can also append some
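
A sketch of the quick workaround described above — salt the keys with a random component so RangePartitioner sees enough distinct values, then strip the salt; this is an illustration, not a Spark API:

    import org.apache.spark.RangePartitioner
    import scala.util.Random

    val numPartitions = 64
    val pairs = sc.parallelize(1 to 1000).map(i => (i % 3, i))  // only 3 distinct keys

    // Append a random salt, range-partition on (key, salt), then drop the salt.
    val salted = pairs.map { case (k, v) => ((k, Random.nextInt(numPartitions)), v) }
    val partitioned = salted
      .partitionBy(new RangePartitioner(numPartitions, salted))
      .map { case ((k, _), v) => (k, v) }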

Re: possible bug: user SparkConf properties not copied to worker process

2015-08-13 Thread Reynold Xin
That was intentional - what's your use case that requires configs not starting with spark? On Thu, Aug 13, 2015 at 8:16 AM, rfarrjr rfar...@gmail.com wrote: Ran into an issue setting a property on the SparkConf that wasn't made available on the worker. After some digging[1] I noticed that

Re: Developer API plugins for Hive Hadoop ?

2015-08-13 Thread Reynold Xin
I believe for Hive, there is already a client interface that can be used to build clients for different Hive metastores. That should also work for your heavily forked one. For Hadoop, it is definitely a bigger project to refactor. A good way to start evaluating this is to list what needs to be

Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-13 Thread Reynold Xin
Retry sending this again ... -- Forwarded message -- From: Reynold Xin r...@databricks.com Date: Thu, Aug 13, 2015 at 12:15 AM Subject: [ANNOUNCE] Spark 1.5.0-preview package To: dev@spark.apache.org dev@spark.apache.org In order to facilitate community testing of the 1.5.0

Re: possible bug: user SparkConf properties not copied to worker process

2015-08-13 Thread Reynold Xin
Is this through Java properties? For java properties, you can pass them using spark.executor.extraJavaOptions. On Thu, Aug 13, 2015 at 2:11 PM, rfarrjr rfar...@gmail.com wrote: Thanks for the response. In this particular case we passed a url that would be leveraged when configuring some
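
A sketch of that suggestion with a hypothetical property name — the option string is passed to the executor JVM at launch, so it must be set before the application starts:

    import org.apache.spark.{SparkConf, SparkContext}

    // "my.service.url" is a made-up property name for illustration.
    val conf = new SparkConf()
      .setAppName("executor-props")
      .set("spark.executor.extraJavaOptions", "-Dmy.service.url=http://example.com/api")
    val sc = new SparkContext(conf)

    // On the executors it is then visible as an ordinary JVM system property.
    sc.parallelize(1 to 4).foreach { _ =>
      val url = System.getProperty("my.service.url")
      // ... configure the component with url ...
    }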

Re: avoid creating small objects

2015-08-14 Thread Reynold Xin
You can use mapPartitions to do that. On Friday, August 14, 2015, 周千昊 qhz...@apache.org wrote: I am thinking of creating a shared object outside the closure and using this object to hold the byte array. Will this work? 周千昊 qhz...@apache.org
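
A minimal sketch of that pattern — one scratch buffer allocated per partition and reused for every record, instead of one small object per record:

    val rdd = sc.parallelize(1 to 1000)

    val results = rdd.mapPartitions { iter =>
      val buffer = new Array[Byte](8)  // shared by all records in this partition
      iter.map { n =>
        java.nio.ByteBuffer.wrap(buffer).putLong(n.toLong)  // fill the shared buffer
        buffer.count(_ != 0)  // derive the result before the next record overwrites it
      }
    }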

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Reynold Xin
This is already supported with the new partitioned data sources in DataFrame/SQL right? On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini alex.angel...@shopify.com wrote: Speaking about Shopify's deployment, this would be a really nice to have feature. We would like to write data to folders

SPARK-10000 + now

2015-08-14 Thread Reynold Xin
Five months ago we reached 10,000 commits on GitHub. Today we reached 10,000 JIRA tickets. https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC Hopefully the extra character we have to type doesn't bring our productivity down much.

Re: Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-14 Thread Reynold Xin
Is it possible that you have only upgraded some set of nodes but not the others? We have run some performance benchmarks on this so it definitely runs in some configuration. Could still be buggy in some other configurations though. On Fri, Aug 14, 2015 at 6:37 AM, mkhaitman

Re: Intermittent timeout failure org/apache/spark/sql/hive/thriftserver/CliSuite.scala

2015-08-12 Thread Reynold Xin
Thanks for finding this. Should we just switch to Java's process library for now? On Wed, Aug 12, 2015 at 1:30 AM, Tim Preece tepre...@mail.com wrote: I was just debugging an intermittent timeout failure in the testsuite CliSuite.scala I traced it down to a timing window in the Scala

Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-13 Thread Reynold Xin
(I tried to send this last night but somehow ASF mailing list rejected my mail) In order to facilitate community testing of the 1.5.0 release, I've built a preview package. This is not a release candidate, so there is no voting involved. However, it'd be great if community members can start

Re: Converting DataFrame to RDD of case class

2015-07-27 Thread Reynold Xin
There is this pull request: https://github.com/apache/spark/pull/5713 We mean to merge it for 1.5. Maybe you can help review it too? On Mon, Jul 27, 2015 at 11:23 AM, Vyacheslav Baranov slavik.bara...@gmail.com wrote: Hi all, For now it's possible to convert RDD of case class to DataFrame:

Re: Is `dev/lint-python` broken?

2015-07-27 Thread Reynold Xin
I just pushed a hotfix to disable Pylint. On Mon, Jul 27, 2015 at 1:09 PM, Pedro Rodriguez ski.rodrig...@gmail.com wrote: I am having the same issue, but the python style checks are failing on the Jenkins build server. Is anyone else having this problem? Failed build is here:

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Reynold Xin
Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com wrote: 100% I would love to do it. Who a good person to review the

Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-24 Thread Reynold Xin
That way we could turn on more stringent checks for the other ones. Punya On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote: Hi all, FYI, we just merged a patch that fails a build if there is a scala compiler warning (if it is not a deprecation warning). In the past

Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-24 Thread Reynold Xin
Reynold Xin r...@databricks.com wrote: Hi all, FYI, we just merged a patch that fails a build if there is a scala compiler warning (if it is not a deprecation warning). I'm a bit confused, since I see quite a lot of warnings in semi-legitimate code. For instance, @transient (plenty of instances

non-deprecation compiler warnings are upgraded to build errors now

2015-07-22 Thread Reynold Xin
Hi all, FYI, we just merged a patch that fails a build if there is a scala compiler warning (if it is not a deprecation warning). In the past, many compiler warnings were actually caused by legitimate bugs that we needed to address. However, if we don't fail the build with warnings, people don't pay

Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread Reynold Xin
with DataFrames. RDDs can easily be extended from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add special columns? On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin r...@databricks.com wrote: How about just using two fields, one boolean field to mark good/bad, and another to get

Re: Use of non-standard LIMIT keyword in JDBC tableExists code

2015-07-15 Thread Reynold Xin
Hi Bob, Thanks for the email. You can select Spark as the project when you file a JIRA ticket at https://issues.apache.org/jira/browse/SPARK For select 1 from $table where 0=1 -- if the database's optimizer doesn't do constant folding and short-circuit execution, could the query end up

Re: Slight API incompatibility caused by SPARK-4072

2015-07-15 Thread Reynold Xin
It's bad that we expose a trait - even though we want to mix in stuff. We should really audit all of these and expose only abstract classes for anything beyond an extremely simple interface. That itself however would break binary compatibility. On Wed, Jul 15, 2015 at 12:15 PM, Patrick Wendell

[discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Reynold Xin
I took a look at the commit messages in git log -- it looks like the individual commit messages are not that useful to include, but do make the commit messages more verbose. They are usually just a bunch of extremely concise descriptions of bug fixes, merges, etc: cb3f12d [xxx] add whitespace

Re: [discuss] Removing individual commit messages from the squash commit message

2015-07-18 Thread Reynold Xin
Reynold Xin r...@databricks.com wrote: I took a look at the commit messages in git log -- it looks like the individual commit messages are not that useful to include, but do make the commit messages more verbose. They are usually just a bunch of extremely concise descriptions of bug fixes, merges

Re: Foundation policy on releases and Spark nightly builds

2015-07-20 Thread Reynold Xin
Thanks, Sean. On Mon, Jul 20, 2015 at 12:22 AM, Sean Owen so...@cloudera.com wrote: This is done, and yes I believe that resolves the issue as far all here know. http://spark.apache.org/downloads.html -

Re: How to Read Excel file in Spark 1.4

2015-07-13 Thread Reynold Xin
What Sandy meant was there was no out-of-the-box support in Spark for reading excel files. However, you can still read excel: If you are using Python, you can use Pandas to load an excel file and then convert it into a Spark DataFrame. If you are using the JVM, you can find any excel library for
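
As an illustration of the JVM route, a sketch using the Apache POI library (an arbitrary choice) to load a small sheet on the driver and convert it; header rows and cell-type handling are glossed over:

    import java.io.File
    import org.apache.poi.ss.usermodel.WorkbookFactory
    import scala.collection.JavaConverters._

    // Read the first sheet eagerly on the driver -- fine for small files only.
    val sheet = WorkbookFactory.create(new File("/path/to/data.xlsx")).getSheetAt(0)
    val rows = sheet.iterator().asScala.map { row =>
      (row.getCell(0).getStringCellValue, row.getCell(1).getNumericCellValue)
    }.toList

    val df = sqlContext.createDataFrame(sc.parallelize(rows)).toDF("name", "value")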

Re: Make off-heap store pluggable

2015-07-20 Thread Reynold Xin
They are already pluggable. On Mon, Jul 20, 2015 at 9:32 PM, Prashant Sharma scrapco...@gmail.com wrote: +1 Looks like a nice idea(I do not see any harm). Would you like to work on the patch to support it ? Prashant Sharma On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk

Re: Make off-heap store pluggable

2015-07-20 Thread Reynold Xin
to the codebase. On Mon, Jul 20, 2015 at 9:34 PM, Reynold Xin r...@databricks.com wrote: They are already pluggable. On Mon, Jul 20, 2015 at 9:32 PM, Prashant Sharma scrapco...@gmail.com wrote: +1 Looks like a nice idea(I do not see any harm). Would you like to work on the patch to support

Re: [jira] [Commented] (INFRA-10191) git pushing for Spark fails

2015-08-24 Thread Reynold Xin
This has been resolved. On Mon, Aug 24, 2015 at 11:58 AM, Reynold Xin r...@databricks.com wrote: FYI -- Forwarded message -- From: Geoffrey Corey (JIRA) j...@apache.org Date: Mon, Aug 24, 2015 at 11:54 AM Subject: [jira] [Commented] (INFRA-10191) git pushing for Spark

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-24 Thread Reynold Xin
was false (DirectKafkaStreamSuite.scala:249). On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority

Fwd: [jira] [Commented] (INFRA-10191) git pushing for Spark fails

2015-08-24 Thread Reynold Xin
Reporter: Reynold Xin. Assignee: Geoffrey Corey. Not sure what's going on, but it happened to at least two committers with the following errors. Using Spark's merge script: {code} Exception while pushing: Command '[u'git', u'push', u'apache', u'PR_TOOL_MERGE_PR_8373_MASTER:master']' returned

[VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-25 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The

Re: repartitionAndSortWithinPartitions task shuffle phase is very slow

2015-10-22 Thread Reynold Xin
Why do you do a glom? It seems unnecessarily expensive to materialize each partition in memory. On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 wrote: > Hi, spark community > I have an application which I try to migrate from MR to Spark. > It will do some calculations from

Re: Exception when using cosh

2015-10-21 Thread Reynold Xin
I think we made a mistake and forgot to register the function in the registry: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala Do you mind submitting a pull request to fix this? Should be a one-line change. I

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Reynold Xin
at 3:08 AM, Krishna Sankar <ksanka...@gmail.com> wrote: Guys, The sc.version returns 1.5.1 in python and scala. Is anyone getting the same results? Probably I am doing something wrong. Cheer

Re: Exception when using some aggregate operators

2015-10-27 Thread Reynold Xin
Try count(distinct columnName). In SQL, distinct is not part of the function name. On Tuesday, October 27, 2015, Shagun Sodhani wrote: Oops, seems I made a mistake. The error message is: Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined
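
A minimal sketch of the corrected call in both forms, assuming a DataFrame df with a category column:

    // SQL: DISTINCT is a keyword inside the aggregate, not part of its name.
    df.registerTempTable("events")
    sqlContext.sql("SELECT count(DISTINCT category) FROM events").show()

    // DataFrame API equivalent:
    import org.apache.spark.sql.functions.countDistinct
    df.agg(countDistinct("category")).show()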

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-21 Thread Reynold Xin
er.java:74) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:56) at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339) On Tue, Oct 20, 2015 at 9:

Re: Pickle Spark DataFrame

2015-10-28 Thread Reynold Xin
What are you trying to accomplish by pickling a Spark DataFrame? If your dataset is large, it doesn't make much sense to pickle it. If your dataset is small, maybe it's best to just pickle a Pandas dataframe. On Tue, Oct 27, 2015 at 9:47 PM, agg212 wrote: Hi, I'd like to

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
OPTIONS ( path '/tmp/partitioned' )""") sqlContext.sql("""select avg(a) from partitionedParquet""").show() Ch

Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
t if you can clarify this. On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote: I don't think these are bugs. The SQL standard for average is "avg", not "mean". Similarly, a distinct count is supposed to be written as

[ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Reynold Xin
Hi All, Spark 1.5.2 is a maintenance release containing stability fixes. This release is based on the branch-1.5 maintenance branch of Spark. We *strongly recommend* all 1.5.x users to upgrade to this release. The full list of bug fixes is here: http://s.apache.org/spark-1.5.2

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > > > 3. Assembly-free distribution of Spark: don’t require building an > enormous assembly jar in order to run Spark. > > Could you elaborate a bit on this? I'm not sure what an assembly-free > distribution

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
a lot of turmoil over the Python 2 -> Python 3 transition because the upgrade process was too painful for too long. The Spark community will benefit greatly from our explicitly looking to avoid a similar situation.

Re: Support for local disk columnar storage for DataFrames

2015-11-11 Thread Reynold Xin
Thanks for the email. Can you explain what the difference is between this and existing formats such as Parquet/ORC? On Wed, Nov 11, 2015 at 4:59 AM, Cristian O wrote: > Hi, > > I was wondering if there's any planned support for local disk columnar > storage. >

Re: Choreographing a Kryo update

2015-11-11 Thread Reynold Xin
We should consider this for Spark 2.0. On Wed, Nov 11, 2015 at 2:01 PM, Steve Loughran wrote: > > > Spark is currently on a fairly dated version of Kryo 2.x; it's trailing on > the fixes in Hive and, as the APIs are incompatible, resulted in that > mutant

Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
It only runs tests that are impacted by the change. E.g. if you only modify SQL, it won't run the core or streaming tests. On Fri, Nov 13, 2015 at 11:17 AM, Ted Yu wrote: > Hi, > I noticed that SparkPullRequestBuilder completes much faster than maven > Jenkins build. > >

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
I actually tried to build a binary for 1.4.2 and wanted to start voting, but there was an issue with the release script that failed the jenkins job. Would be great to kick off a 1.4.2 release. On Fri, Nov 13, 2015 at 1:00 PM, Andrew Lee wrote: > Hi All, > > > I'm wondering

Re: SparkPullRequestBuilder coverage

2015-11-13 Thread Reynold Xin
y test(s) be disabled, strengthened and enabled again? Cheers On Fri, Nov 13, 2015 at 11:20 AM, Reynold Xin <r...@databricks.com> wrote: It only runs tests that are impacted by the change. E.g. if you only modify SQL, it won't run the core or streaming te

Re: Spark 1.4.2 release and votes conversation?

2015-11-13 Thread Reynold Xin
In the interim, you can just build it off branch-1.4 if you want. On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin <r...@databricks.com> wrote: > I actually tried to build a binary for 1.4.2 and wanted to start voting, > but there was an issue with the release script that failed the

Re: Are map tasks spilling data to disk?

2015-11-15 Thread Reynold Xin
It depends on what the next operator is. If the next operator is just an aggregation, then no, the hash join won't write anything to disk. It will just stream the data through to the next operator. If the next operator is shuffle (exchange), then yes. On Sun, Nov 15, 2015 at 10:52 AM, gsvic

Re: Support for local disk columnar storage for DataFrames

2015-11-15 Thread Reynold Xin
streaming apps can take advantage of the compact columnar representation and Tungsten optimisations. I'm not quite sure if something like this can be achieved by other means or has been investigated before, hence why I'm looking for feedback here.

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
It's a completely different path. On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar wrote: > I would like to know if Hive on Spark uses or shares the execution code > with Spark SQL or DataFrames? > > More specifically, does Hive on Spark benefit from the changes made to >

Re: Hive on Spark Vs Spark SQL

2015-11-15 Thread Reynold Xin
No it does not -- although it'd benefit from some of the work to make shuffle more robust. On Sun, Nov 15, 2015 at 10:45 PM, kiran lonikar <loni...@gmail.com> wrote: So it does not benefit from Project Tungsten, right? On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
er usage (e.g. I wouldn't be surprised if mapPartitionsWithContext was baked into a number of apps) and merit a little extra consideration. Maybe also obvious, but I think a migration guide with API equivalents and the like would be incredibly useful i

Re: A proposal for Spark 2.0

2015-11-10 Thread Reynold Xin
made at the outset of 2.0 while trying to guess what we'll need. On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote: I'm starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-08 Thread Reynold Xin
Thanks everybody for voting. I'm going to close the vote now. The vote passes with 14 +1 votes and no -1 vote. I will work on packaging this asap. +1: Jean-Baptiste Onofré Egor Pahomov Luc Bourlier Tom Graves* Chester Chen Michael Armbrust* Krishna Sankar Robin East Reynold Xin* Joseph Bradley

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-11-01 Thread Reynold Xin
$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.s

Re: How to force statistics calculation of Dataframe?

2015-11-04 Thread Reynold Xin
Can you use the broadcast hint? e.g. df1.join(broadcast(df2)) the broadcast function is in org.apache.spark.sql.functions On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote: > Hi, > > If I have a hive table, analyze table compute statistics will ensure Spark > SQL has
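
A sketch of the hint in context, assuming df2 is small enough to ship to every executor:

    import org.apache.spark.sql.functions.broadcast

    // Marks df2 for a broadcast join regardless of computed statistics.
    val joined = df1.join(broadcast(df2), df1("key") === df2("key"))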

Re: How to force statistics calculation of Dataframe?

2015-11-05 Thread Reynold Xin
hint is only available on the dataframe api. On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin <r...@databricks.com> wrote: Can you use the broadcast hint? e.g. df1.join(broadcast(df2)) the broadcast function is in org.apache.spa

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Reynold Xin
+1 Tested against CDH5.4.2 with hadoop 2.6.0 using yesterday's code, built locally. Regression running in Yarn Cluster mode against a few internal ML workloads (logistic regression, linear regression, random forest and statistics summary) as well as MLlib KMeans. All seems to

Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
If you are using Spark with Mesos fine grained mode, can you please respond to this email explaining why you use it over the coarse grained mode? Thanks.

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Reynold Xin
in turn kill the entire executor, causing entire stages to be retried. In fine-grained mode, only the task fails and subsequently gets retried without taking out an entire stage or worse. On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin <r...@databricks.com> wrote:

[VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-03 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.2 [ ] -1 Do not release this package because ... The

Re: Codegen In Shuffle

2015-11-04 Thread Reynold Xin
GenerateUnsafeProjection -- projects any internal row data structure directly into bytes (UnsafeRow). On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote: > Dear all: > > Tungsten project has mentioned that they are applying code generation is > to speed up the conversion of data

Re: Need advice on hooking into Sql query plan

2015-11-05 Thread Reynold Xin
You can hack around this by constructing logical plans yourself and then creating a DataFrame in order to execute them. Note that this is all depending on internals of the framework and can break when Spark upgrades. On Thu, Nov 5, 2015 at 4:18 PM, Yana Kadiyska wrote:

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
Are you looking for this? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69 On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote: > Hi, > > I'd like to write a parquet file from the

Re: Should enforce the uniqueness of field name in DataFrame ?

2015-10-15 Thread Reynold Xin
That could break a lot of applications. In particular, a lot of input data sources (csv, json) don't have clean schemas, and can have duplicate column names. For the case of join, maybe a better solution is to ask the user for a left/right prefix/suffix, similar to what Pandas does. On Wed,
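
Pending such a feature, a manual sketch of Pandas-style suffixing — rename one side's columns before the join (column names here are assumptions):

    // Suffix every column of the right-hand frame to avoid duplicate names.
    val rightRenamed = right.columns.foldLeft(right) { (df, c) =>
      df.withColumnRenamed(c, c + "_right")
    }
    val joined = left.join(rightRenamed, left("id") === rightRenamed("id_right"))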

Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too. -- Forwarded message -- From: Reynold Xin <r...@databricks.com> Date: Tue, Oct 6, 2015 at 5:54 PM Subject: Re: multiple count distinct in SQL/DataFrame? To: "dev@spark.apache.org" <dev@spark.apache.org> To provide more co
