There will be in 1.4.
df.write.partitionBy("year", "month", "day").parquet("/path/to/output")
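For context, a slightly fuller sketch of the 1.4-style API (the column names and output path are placeholders):

// Assumes `df` has year/month/day columns; the output path is made up.
df.write
  .partitionBy("year", "month", "day")
  .parquet("/path/to/output")

// Reading it back lets Spark prune partitions from filters on those columns.
val events = sqlContext.read.parquet("/path/to/output")
events.filter(events("year") === 2015 && events("month") === 6).count()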
On Mon, Jun 1, 2015 at 10:21 PM, Matt Cheah mch...@palantir.com wrote:
Hi there,
I noticed in the latest Spark SQL programming guide
https://spark.apache.org/docs/latest/sql-programming-guide.html, there
+1
On Tue, Jun 23, 2015 at 10:37 PM, Patrick Wendell pwend...@gmail.com
wrote:
Please vote on releasing the following candidate as Apache Spark version
1.4.1!
This release fixes a handful of known issues in Spark 1.4.0, listed here:
http://s.apache.org/spark-1.4.1
The tag to be voted on
Run
./python/run-tests --help
and you will see. :)
On Wed, Jul 1, 2015 at 9:10 PM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com
wrote:
Hi all,
When I develop PySpark modules, such as adding a spark.ml API in Python, I'd
like to run only the minimal set of unit tests related to the module I'm developing, again
Yes - it's very interesting. However, ideally we should have a version of
HyperLogLog that can work directly against some raw bytes in memory (rather
than Java objects), in order for this to fit the Tungsten execution model
where everything is operating directly against some memory address.
On
FYI there are some problems with ASF's git or ldap infra. As a result, we
cannot merge anything into Spark right now.
An infra ticket has been created:
https://issues.apache.org/jira/browse/INFRA-9932
Please watch/vote on that ticket for progress. Thanks.
Yijie,
As Davies said, it will take us a while to get to vectorized execution.
However, before that, we are going to refactor code generation to push it
into each expression: https://issues.apache.org/jira/browse/SPARK-7813
Once this one is in (probably in the next 2 or 3 weeks), there will be
Yes that's exactly the reason.
On Sat, May 23, 2015 at 12:37 AM, Yijie Shen henry.yijies...@gmail.com
wrote:
Davies and Reynold,
Glad to hear about the status.
I’ve seen [SPARK-7813](https://issues.apache.org/jira/browse/SPARK-7813)
and watching it now.
If I understand correctly, it’s
That's the nice thing about Spark packages. It is just a package index for
libraries and applications built on top of Spark and not part of the Spark
codebase, so it is not restricted to follow only ASF-compatible licenses.
On Sat, May 23, 2015 at 10:12 PM, DB Tsai dbt...@dbtsai.com wrote:
I
It is just 15 lines of code to copy, isn't it?
On Thu, May 21, 2015 at 7:46 PM, Nathan Kronenfeld
nkronenfeld@uncharted.software wrote:
see discussions about Spark not really liking multiple contexts in the
same JVM
Speaking of this - is there a standard way of writing unit tests that
You definitely don't want to implement kmeans in R, since it would be very
slow. Just providing R wrappers for the MLlib implementation is the way to
go. I believe one of the major items in SparkR next is the MLlib wrappers.
On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis
You've probably hit this bug:
https://issues.apache.org/jira/browse/SPARK-7180
It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to
false and see if it goes away.
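A minimal sketch of setting that flag programmatically, assuming you build your own SparkConf (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Disable the serialization debugging info implicated in SPARK-7180.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.serializer.extraDebugInfo", "false")
val sc = new SparkContext(conf)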
On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov evgeny.a.moro...@gmail.com
wrote:
Hi,
I'm using spark
I'm actually somewhat involved with the Google Docs you linked to.
I don't think Oracle will remove Unsafe in JVM 9. As you said, JEP 260
already proposes making Unsafe available. Given the widespread use of
Unsafe for performance and advanced functionalities, I don't think Oracle
can just remove
Problem noted. Apparently the release script doesn't automate the
replacement of all version strings yet. I'm going to publish a new RC over
the weekend with the release version properly assigned.
Please continue the testing and report any problems you find. Thanks!
On Fri, Aug 21, 2015 at 2:20
don't see the change in time if I
unset the unsafe flags. Could you explain why it might happen?
On Aug 20, 2015, at 15:32, Reynold Xin <r...@databricks.com> wrote:
I didn't wait long enough earlier. Actually it did finish when I raised
memory to 8g.
In 1.5
BTW, one other thing -- don't use count() for benchmarking, since the
optimizer is smart enough to figure out that it doesn't actually need to run
the sum.
For the purpose of benchmarking, you can use
df.foreach(_ => ())
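A rough sketch of what I mean, with a throwaway timing helper (not a rigorous benchmark):

// Forces every row to be produced, without giving the optimizer a shortcut.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

time("full scan")(df.foreach(_ => ()))  // materializes all rows
time("count only")(df.count())          // the optimizer may skip most of the work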
On Thu, Aug 20, 2015 at 3:31 PM, Reynold Xin r
*From:* Reynold Xin [mailto:r...@databricks.com]
*Sent:* Thursday, August 20, 2015 4:22 PM
*To:* Ulanov, Alexander
*Cc:* dev@spark.apache.org
*Subject:* Re: Dataframe aggregation with Tungsten unsafe
I think you might need to turn codegen on also in order for the unsafe
stuff to work
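For reference, a sketch of the switches I'd expect to need together in a current 1.5 build (these are internal flags and may change):

// SQL-level flags go through the SQLContext.
sqlContext.setConf("spark.sql.codegen", "true")
sqlContext.setConf("spark.sql.unsafe.enabled", "true")
// spark.unsafe.offHeap is a core setting, set on the SparkConf instead:
// new SparkConf().set("spark.unsafe.offHeap", "true")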
Please git pull :)
On Thu, Aug 20, 2015 at 5:35 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe
feature was added to Spark on April 29.)
*From:* Reynold Xin [mailto:r...@databricks.com]
*Sent:* Thursday
Please vote on releasing the following candidate as Apache Spark version
1.5.0!
The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...
Unsafe on:
spark.sql.codegen true
spark.sql.unsafe.enabled true
spark.unsafe.offHeap true
Unsafe off:
spark.sql.codegen false
spark.sql.unsafe.enabled false
spark.unsafe.offHeap false
*From:* Reynold Xin [mailto:r...@databricks.com]
*Sent:* Thursday, August 20, 2015 5:43 PM
Thanks for reporting back, Mark.
I will soon post a release candidate.
On Thursday, August 20, 2015, mkhaitman mark.khait...@chango.com wrote:
Turns out it was a mix of user-error as well as a bug in the sbt/sbt build
that has since been fixed in the current 1.5 branch (I built from this
How did you run this? I couldn't run your query with 4G of RAM in 1.4, but
in 1.5 it ran.
Also I recommend just dumping the data to parquet on disk to evaluate,
rather than using the in-memory cache, which is super slow and we are
thinking of removing/replacing with something else.
val size =
.
On Thu, Aug 20, 2015 at 3:22 PM, Reynold Xin r...@databricks.com wrote:
How did you run this? I couldn't run your query with 4G of RAM in 1.4, but
in 1.5 it ran.
Also I recommend just dumping the data to parquet on disk to evaluate,
rather than using the in-memory cache, which is super
Most of those threads are not for task execution. They are for RPC,
scheduling, ...
On Sun, Jun 28, 2015 at 8:32 AM, Dogtail Ray spark.ru...@gmail.com wrote:
Hi,
I was looking at the Spark source code, and I found that when launching an
Executor, Spark actually launches a thread pool; each
Hi Andrew,
Thanks for the email. This is a known bug with the expression parser. We
will hopefully fix this in 1.5.
The expression parser has other reserved keywords too, and we have already
gotten rid of most of them. Count is still there due to the handling of count
distinct, but we plan to get rid
Try mapPartitions, which gives you an iterator, and you can produce an
iterator back.
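A minimal sketch of the iterator-in / iterator-out shape, assuming an RDD[Int] and a stand-in windowing function (note that windows will not span partition boundaries here):

// Each partition's elements arrive as one iterator; return an iterator of results.
val result = rdd.mapPartitions { iter =>
  iter.sliding(3).map(window => window.sum)  // e.g. sum over a sliding window of 3
}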
On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling rnowl...@gmail.com wrote:
Hi all,
I have a problem where I have an RDD of elements:
Item1 Item2 Item3 Item4 Item5 Item6 ...
and I want to run a function over
There are two usages of buckets in Spark core.
The first usage is in histogram, used to perform sorting. Basically we
build an approximate histogram of the data in order to decide how to
partition the data in sorting. Each bucket is a range in the histogram.
The 2nd is used in shuffle, where
Hey All,
Just a friendly reminder that Aug 1st is the feature freeze for Spark
1.5, meaning major outstanding changes will need to land this week.
After Aug 1st we'll package a release for testing and then go into the
normal triage process where bugs are prioritized and some smaller
BTW, for 1.5 there is already a now()-like function being added, so it should
work out of the box in 1.5.0, to be released end of Aug / early Sep.
On Tue, Jul 28, 2015 at 11:38 PM, Reynold Xin r...@databricks.com wrote:
Yup - would you be willing to submit a patch to add UDF0?
Should be pretty
/main/java/org/apache/spark/sql/api/java
).
But currently there is no UDF0 adapter.
Any suggestions? I'm new to Spark and any help would be appreciated.
--
Thanks,
Sachith Withana
On Tue, Jul 28, 2015 at 10:18 PM, Reynold Xin r...@databricks.com wrote:
I think we do support 0 arg UDFs
.
On Wed, Jul 29, 2015 at 11:46 AM, Reynold Xin r...@databricks.com wrote:
We should add UDF0 to it.
For now, can you just create a one-arg UDF and not use the argument?
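A sketch of that workaround (Scala shown for brevity; the function name, constant, and column are made up, and the same idea applies to the Java UDF1 interface):

// Register a one-argument UDF that ignores its input.
sqlContext.udf.register("myConstant", (ignored: String) => "some-constant")
// Call it with any column you like; the argument is never used.
sqlContext.sql("SELECT myConstant(anyColumn) FROM myTable").show()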
On Tue, Jul 28, 2015 at 10:59 PM, Sachith Withana swsach...@gmail.com
wrote:
Hi Reynold,
I'm implementing the interfaces
I think we do support 0 arg UDFs:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2165
How are you using UDFs?
On Tue, Jul 28, 2015 at 2:15 AM, Sachith Withana swsach...@gmail.com
wrote:
Hi all,
Currently I need to support custom
Hi Devs,
Just an announcement that I've cut Spark's branch-1.5 to form the basis of
the 1.5 release. Other than a few stragglers, this represents the end of
active feature development for Spark 1.5. *If committers are merging any
features (outside of alpha modules), please shoot me an email so I
Is this deterministically reproducible? Can you try this on the latest
master branch?
Would be great to turn on debug logging and dump the generated code. Also
would be great to dump the array size at your line 314 in UnsafeRow (and
whatever master branch's appropriate line is).
On Fri, Jul 31,
, Sandy Ryza sandy.r...@cloudera.com
wrote:
+1
On Sat, Jul 18, 2015 at 4:00 PM, Mridul Muralidharan mri...@gmail.com
wrote:
Thanks for detailing, definitely sounds better.
+1
Regards
Mridul
On Saturday, July 18, 2015, Reynold Xin r...@databricks.com wrote:
A single commit message
Any reason why you need exactly a certain number of partitions?
One way we can make that work is for RangePartitioner to return a bunch of
empty partitions if the number of distinct elements is small. That would
require changing Spark.
If you want a quick workaround, you can also append some
That was intentional - what's your use case that requires configs not
starting with "spark."?
On Thu, Aug 13, 2015 at 8:16 AM, rfarrjr rfar...@gmail.com wrote:
Ran into an issue setting a property on the SparkConf that wasn't made
available on the worker. After some digging[1] I noticed that
I believe for Hive, there is already a client interface that can be used to
build clients for different Hive metastores. That should also work for your
heavily forked one.
For Hadoop, it is definitely a bigger project to refactor. A good way to
start evaluating this is to list what needs to be
Retry sending this again ...
-- Forwarded message --
From: Reynold Xin r...@databricks.com
Date: Thu, Aug 13, 2015 at 12:15 AM
Subject: [ANNOUNCE] Spark 1.5.0-preview package
To: dev@spark.apache.org dev@spark.apache.org
In order to facilitate community testing of the 1.5.0
Is this through Java properties? For Java properties, you can pass them
using spark.executor.extraJavaOptions.
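A sketch of passing a Java system property to the executors that way (the property name and URL are placeholders):

import org.apache.spark.SparkConf

// The executor JVMs are launched with this extra -D flag.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Dmy.service.url=http://example.com")
// On the executor side, read it back with sys.props.get("my.service.url").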
On Thu, Aug 13, 2015 at 2:11 PM, rfarrjr rfar...@gmail.com wrote:
Thanks for the response.
In this particular case we passed a url that would be leveraged when
configuring some
You can use mapPartitions to do that.
On Friday, August 14, 2015, 周千昊 qhz...@apache.org wrote:
I am thinking of creating a shared object outside the closure and using this
object to hold the byte array.
Will this work?
周千昊 qhz...@apache.org
This is already supported with the new partitioned data sources in
DataFrame/SQL right?
On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini alex.angel...@shopify.com
wrote:
Speaking about Shopify's deployment, this would be a really nice-to-have
feature.
We would like to write data to folders
Five months ago we reached 10,000 commits on GitHub. Today we reached 10,000
JIRA tickets.
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC
Hopefully the extra character we have to type doesn't hurt our
productivity much.
Is it possible that you have only upgraded some set of nodes but not the
others?
We have run some performance benchmarks on this, so it definitely runs in
some configuration. Could still be buggy in some other configurations
though.
On Fri, Aug 14, 2015 at 6:37 AM, mkhaitman
Thanks for finding this. Should we just switch to Java's process library
for now?
On Wed, Aug 12, 2015 at 1:30 AM, Tim Preece tepre...@mail.com wrote:
I was just debugging an intermittent timeout failure in the testsuite
CliSuite.scala
I traced it down to a timing window in the Scala
(I tried to send this last night but somehow ASF mailing list rejected my
mail)
In order to facilitate community testing of the 1.5.0 release, I've built a
preview package. This is not a release candidate, so there is no voting
involved. However, it'd be great if community members can start
There is this pull request: https://github.com/apache/spark/pull/5713
We mean to merge it for 1.5. Maybe you can help review it too?
On Mon, Jul 27, 2015 at 11:23 AM, Vyacheslav Baranov
slavik.bara...@gmail.com wrote:
Hi all,
For now it's possible to convert RDD of case class to DataFrame:
I just pushed a hotfix to disable Pylint.
On Mon, Jul 27, 2015 at 1:09 PM, Pedro Rodriguez ski.rodrig...@gmail.com
wrote:
I am having the same issue, but the python style checks are failing on the
Jenkins build server. Is anyone else having this problem? Failed build is
here:
Is this just frequent items?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
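If it is frequent items, a sketch of the existing hook (the column name and support threshold are placeholders):

// Approximate frequent items for a column via DataFrameStatFunctions.
val freq = df.stat.freqItems(Seq("category"), support = 0.01)
freq.show()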
On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
wrote:
100% I would love to do it. Who's a good person to review the
? That way we could turn on more stringent checks for the other
ones.
Punya
On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote:
Hi all,
FYI, we just merged a patch that fails the build if there is a Scala
compiler warning (if it is not a deprecation warning).
In the past
, Reynold Xin r...@databricks.com wrote:
Hi all,
FYI, we just merged a patch that fails the build if there is a Scala
compiler warning (if it is not a deprecation warning).
I’m a bit confused, since I see quite a lot of warnings in semi-legitimate
code.
For instance, @transient (plenty of instances
Hi all,
FYI, we just merged a patch that fails the build if there is a Scala compiler
warning (if it is not a deprecation warning).
In the past, many compiler warnings were actually caused by legitimate bugs
that we needed to address. However, if we don't fail the build on warnings,
people don't pay
with DataFrames. RDDs can easily be extended
from RDD[T] to RDD[Record[T]]. I guess with DataFrames, I could add
special columns?
On Wed, Jul 15, 2015 at 12:36 PM, Reynold Xin r...@databricks.com wrote:
How about just using two fields, one boolean field to mark good/bad, and
another to get
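Picking up that two-field suggestion, a minimal sketch (the type, field names, and validate function are hypothetical):

// Wrap each element with a validity flag instead of changing the element type.
case class Flagged[T](isGood: Boolean, value: T)

// validate(v): Boolean is whatever check marks a record as good (hypothetical).
val flagged = rdd.map(v => Flagged(isGood = validate(v), value = v))
val good = flagged.filter(_.isGood).map(_.value)
val bad  = flagged.filter(r => !r.isGood).map(_.value)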
Hi Bob,
Thanks for the email. You can select Spark as the project when you file a
JIRA ticket at https://issues.apache.org/jira/browse/SPARK
For select 1 from $table where 0=1 -- if the database's optimizer doesn't
do constant folding and short-circuit execution, could the query end up
It's bad that we expose a trait - even though we want to mix in stuff. We
should really audit all of these and expose only abstract classes for
anything beyond an extremely simple interface. That itself however would
break binary compatibility.
On Wed, Jul 15, 2015 at 12:15 PM, Patrick Wendell
I took a look at the commit messages in git log -- it looks like the
individual commit messages are not that useful to include, but do make the
commit messages more verbose. They are usually just a bunch of extremely
concise descriptions of bug fixes, merges, etc:
cb3f12d [xxx] add whitespace
, Reynold Xin r...@databricks.com wrote:
I took a look at the commit messages in git log -- it looks like the
individual commit messages are not that useful to include, but do make the
commit messages more verbose. They are usually just a bunch of extremely
concise descriptions of bug fixes, merges
Thanks, Sean.
On Mon, Jul 20, 2015 at 12:22 AM, Sean Owen so...@cloudera.com wrote:
This is done, and yes I believe that resolves the issue as far all here
know.
http://spark.apache.org/downloads.html
-
What Sandy meant was that there is no out-of-the-box support in Spark for
reading Excel files. However, you can still read Excel:
If you are using Python, you can use Pandas to load an Excel file and then
convert it into a Spark DataFrame.
If you are using the JVM, you can find an Excel library for
They are already pluggable.
On Mon, Jul 20, 2015 at 9:32 PM, Prashant Sharma scrapco...@gmail.com
wrote:
+1 Looks like a nice idea (I do not see any harm). Would you like to work
on the patch to support it ?
Prashant Sharma
On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk
to the codebase.
On Mon, Jul 20, 2015 at 9:34 PM, Reynold Xin r...@databricks.com wrote:
They are already pluggable.
On Mon, Jul 20, 2015 at 9:32 PM, Prashant Sharma scrapco...@gmail.com
wrote:
+1 Looks like a nice idea (I do not see any harm). Would you like to work
on the patch to support
This has been resolved.
On Mon, Aug 24, 2015 at 11:58 AM, Reynold Xin r...@databricks.com wrote:
FYI
-- Forwarded message --
From: Geoffrey Corey (JIRA) j...@apache.org
Date: Mon, Aug 24, 2015 at 11:54 AM
Subject: [jira] [Commented] (INFRA-10191) git pushing for Spark
))
}) was false. (DirectKafkaStreamSuite.scala:249)
On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com
wrote:
Please vote on releasing the following candidate as Apache Spark
version
1.5.0!
The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if
a
majority
: Reynold Xin
Assignee: Geoffrey Corey
Not sure what's going on, but it happened to at least two committers with
the following errors:
Using Spark's merge script:
{code}
Exception while pushing: Command '[u'git', u'push', u'apache',
u'PR_TOOL_MERGE_PR_8373_MASTER:master']' returned
Please vote on releasing the following candidate as Apache Spark
version 1.5.2. The vote is open until Wed Oct 28, 2015 at 08:00 UTC and
passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.2
[ ] -1 Do not release this package because ...
The
Why do you do a glom? It seems unnecessarily expensive to materialize each
partition in memory.
On Thu, Oct 22, 2015 at 2:02 AM, 周千昊 wrote:
> Hi, spark community
> I have an application which I try to migrate from MR to Spark.
> It will do some calculations from
I think we made a mistake and forgot to register the function in the
registry:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
Do you mind submitting a pull request to fix this? Should be a one-line
change. I
t 3:08 AM, Krishna Sankar <ksanka...@gmail.com> wrote:
>
>> Guys,
>>The sc.version returns 1.5.1 in python and scala. Is anyone getting
>> the same results ? Probably I am doing something wrong.
>> Cheer
Try
count(distinct columnName)
In SQL, distinct is not part of the function name.
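Concretely, a sketch of both forms (the table and column names are placeholders):

import org.apache.spark.sql.functions.countDistinct

// SQL: DISTINCT is a modifier inside count(...), not part of the function name.
sqlContext.sql("SELECT count(DISTINCT columnName) FROM myTable").show()

// DataFrame API equivalent.
df.agg(countDistinct("columnName")).show()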
On Tuesday, October 27, 2015, Shagun Sodhani
wrote:
> Oops seems I made a mistake. The error message is : Exception in thread
> "main" org.apache.spark.sql.AnalysisException: undefined
er.java:74)
> at
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:56)
> at
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:339)
>
>
> On Tue, Oct 20, 2015 at 9:
What are you trying to accomplish by pickling a Spark DataFrame? If your
dataset is large, it doesn't make much sense to pickle it. If your dataset
is small, maybe it's best to just pickle a Pandas dataframe.
On Tue, Oct 27, 2015 at 9:47 PM, agg212 wrote:
> Hi, I'd like to
>>>>>> OPTIONS (
>>>>>> path '/tmp/partitioned'
>>>>>> )""")
>>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>>>>>>
>>>>>> Ch
t if you can clarify this.
>
> On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> I don't think these are bugs. The SQL standard for average is "avg", not
>> "mean". Similarly, a distinct count is supposed to be written as
>
Hi All,
Spark 1.5.2 is a maintenance release containing stability fixes. This
release is based on the branch-1.5 maintenance branch of Spark. We
*strongly recommend* all 1.5.x users to upgrade to this release.
The full list of bug fixes is here: http://s.apache.org/spark-1.5.2
On Tue, Nov 10, 2015 at 3:35 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
>
> > 3. Assembly-free distribution of Spark: don’t require building an
> enormous assembly jar in order to run Spark.
>
> Could you elaborate a bit on this? I'm not sure what an assembly-free
> distribution
t of turmoil over
> the
> >>> Python 2 -> Python 3 transition because the upgrade process was too
> painful
> >>> for too long. The Spark community will benefit greatly from our
> explicitly
> >>> looking to avoid a similar situation.
> >&g
Thanks for the email. Can you explain what the difference is between this
and existing formats such as Parquet/ORC?
On Wed, Nov 11, 2015 at 4:59 AM, Cristian O wrote:
> Hi,
>
> I was wondering if there's any planned support for local disk columnar
> storage.
>
We should consider this for Spark 2.0.
On Wed, Nov 11, 2015 at 2:01 PM, Steve Loughran
wrote:
>
>
> Spark is currently on a fairly dated version of Kryo 2.x; it's trailing on
> the fixes in Hive and, as the APIs are incompatible, resulted in that
> mutant
It only runs tests that are impacted by the change. E.g. if you only modify
SQL, it won't run the core or streaming tests.
On Fri, Nov 13, 2015 at 11:17 AM, Ted Yu wrote:
> Hi,
> I noticed that SparkPullRequestBuilder completes much faster than maven
> Jenkins build.
>
>
I actually tried to build a binary for 1.4.2 and wanted to start voting,
but there was an issue with the release script that failed the jenkins job.
Would be great to kick off a 1.4.2 release.
On Fri, Nov 13, 2015 at 1:00 PM, Andrew Lee wrote:
> Hi All,
>
>
> I'm wondering
y test(s) be disabled, strengthened and enabled again ?
>
> Cheers
>
> On Fri, Nov 13, 2015 at 11:20 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> It only runs tests that are impacted by the change. E.g. if you only
>> modify SQL, it won't run the core or streaming te
In the interim, you can just build it off branch-1.4 if you want.
On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin <r...@databricks.com> wrote:
> I actually tried to build a binary for 1.4.2 and wanted to start voting,
> but there was an issue with the release script that failed the
It depends on what the next operator is. If the next operator is just an
aggregation, then no, the hash join won't write anything to disk. It will
just stream the data through to the next operator. If the next operator is
shuffle (exchange), then yes.
On Sun, Nov 15, 2015 at 10:52 AM, gsvic
treaming
>> apps can take advantage of the compact columnar representation and Tungsten
>> optimisations.
>>
>> I'm not quite sure if something like this can be achieved by other means
>> or has been investigated before, hence why I'm looking for feedback here.
>&g
It's a completely different path.
On Sun, Nov 15, 2015 at 10:37 PM, kiran lonikar wrote:
> I would like to know if Hive on Spark uses or shares the execution code
> with Spark SQL or DataFrames?
>
> More specifically, does Hive on Spark benefit from the changes made to
>
No it does not -- although it'd benefit from some of the work to make
shuffle more robust.
On Sun, Nov 15, 2015 at 10:45 PM, kiran lonikar <loni...@gmail.com> wrote:
> So does not benefit from Project Tungsten right?
>
>
> On Mon, Nov 16, 2015 at 12:07 PM, Reynold Xin &l
er usage (e.g. I wouldn't
>> be surprised if mapPartitionsWithContext was baked into a number of apps)
>> and merit a little extra consideration.
>>
>> Maybe also obvious, but I think a migration guide with API equivlents and
>> the like would be incredibly useful i
ade at the outset of 2.0 while
> trying to guess what we'll need.
>
> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> I’m starting a new thread since the other one got intermixed with feature
>> requests. Please refrain from making feature
Thanks everybody for voting. I'm going to close the vote now. The vote
passes with 14 +1 votes and no -1 vote. I will work on packaging this asap.
+1:
Jean-Baptiste Onofré
Egor Pahomov
Luc Bourlier
Tom Graves*
Chester Chen
Michael Armbrust*
Krishna Sankar
Robin East
Reynold Xin*
Joseph Bradley
$sql$execution$TungstenSort$$preparePartition$1(sort.scala:131)
>>> at
>>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>>> at
>>> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.s
Can you use the broadcast hint?
e.g.
df1.join(broadcast(df2))
the broadcast function is in org.apache.spark.sql.functions
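For completeness, a sketch with the import spelled out (df1, df2, and the join key are placeholders):

import org.apache.spark.sql.functions.broadcast

// Hint that df2 is small enough to ship to every executor, forcing a broadcast join.
val joined = df1.join(broadcast(df2), df1("key") === df2("key"))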
On Wed, Nov 4, 2015 at 10:19 AM, Charmee Patel wrote:
> Hi,
>
> If I have a hive table, analyze table compute statistics will ensure Spark
> SQL has
hint is only available on dataframe api.
>
> On Wed, Nov 4, 2015 at 6:49 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Can you use the broadcast hint?
>>
>> e.g.
>>
>> df1.join(broadcast(df2))
>>
>> the broadcast function is in org.apache.spa
> +1
> Tested against CDH 5.4.2 with Hadoop 2.6.0 using yesterday's code,
> built locally.
>
> Regression running in YARN cluster mode against a few internal ML workloads (logistic
> regression, linear regression, random forest and statistics summary) as well as
> MLlib KMeans. All seems to
If you are using Spark with Mesos fine grained mode, can you please respond
to this email explaining why you use it over the coarse grained mode?
Thanks.
in turn kill the entire executor, causing entire
> stages to be retried. In fine-grained mode, only the task fails and
> subsequently gets retried without taking out an entire stage or worse.
>
> On Tue, Nov 3, 2015 at 3:54 PM, Reynold Xin <r...@databricks.com> wrote:
>
>>
Please vote on releasing the following candidate as Apache Spark version
1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.2
[ ] -1 Do not release this package because ...
The
GenerateUnsafeProjection -- projects any internal row data structure
directly into bytes (UnsafeRow).
On Wed, Nov 4, 2015 at 12:21 AM, 牛兆捷 wrote:
> Dear all:
>
> The Tungsten project has mentioned that they are applying code generation
> to speed up the conversion of data
You can hack around this by constructing logical plans yourself and then
creating a DataFrame in order to execute them. Note that this is all
depending on internals of the framework and can break when Spark upgrades.
On Thu, Nov 5, 2015 at 4:18 PM, Yana Kadiyska
wrote:
Are you looking for this?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69
On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote:
> Hi,
>
> I'd like to write a parquet file from the
That could break a lot of applications. In particular, a lot of input data
sources (csv, json) don't have clean schema, and can have duplicate column
names.
For the case of join, maybe a better solution is to ask the left/right
prefix/suffix in the user code, similar to what Pandas does.
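A sketch of what that user-side workaround could look like today (the suffix and join key are arbitrary):

// Rename the right side's columns before joining so nothing collides.
val rightRenamed = df2.columns.foldLeft(df2) { (d, c) =>
  d.withColumnRenamed(c, c + "_right")
}
val joined = df1.join(rightRenamed, df1("id") === rightRenamed("id_right"))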
On Wed,
Adding user list too.
-- Forwarded message --
From: Reynold Xin <r...@databricks.com>
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org" <dev@spark.apache.org>
To provide more co