You can actually just use df1['a'] in projection to differentiate.
e.g. in Scala (similar things work in Python):
scala> val df1 = Seq((1, "one")).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: string]
scala> val df2 = Seq((2, "two")).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: string]
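For example (an illustrative sketch, not from the original message -- it assumes df1 and df2 as defined above):

    // Qualify each reference through its parent frame to disambiguate
    // columns whose names collide across the two DataFrames.
    val joined = df1.join(df2, df1("a") === df2("a"))
    val projected = joined.select(df1("a"), df2("b"))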
In 1.4, you can do
row.getInt(colName)
In 1.5, some variant of this will come to allow you to turn a DataFrame
into a typed RDD, where the case class's field names match the column
names. https://github.com/apache/spark/pull/5713
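A sketch of what by-name access looks like, assuming the 1.4-era additions mentioned above (the field name "age" is hypothetical):

    import org.apache.spark.sql.Row

    def readAge(row: Row): Int = row.getAs[Int]("age")               // direct lookup by name
    def readAgeAt(row: Row): Int = row.getInt(row.fieldIndex("age")) // name -> index -> value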
On Fri, May 8, 2015 at 11:01 AM, Will Benton wi...@redhat.com
In 1.5, we will most likely just rewrite distinct in SQL to either use the
Aggregate operator which will benefit from all the Tungsten optimizations,
or have a Tungsten version of distinct for SQL/DataFrame.
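The rewrite rests on DISTINCT being a group-by over every selected column; a sketch of the equivalence (table and column names illustrative, sqlContext assumed):

    sqlContext.sql("SELECT DISTINCT a, b FROM t")
    sqlContext.sql("SELECT a, b FROM t GROUP BY a, b")  // equivalent result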
On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
What's the use case?
I'm wondering if we should even expose fromJSON. I think it's more a bug
than a feature.
On Thu, May 7, 2015 at 1:55 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Observe, my fellow Sparkophiles (Spark 1.3.1):
json_rdd =
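The snippet is cut off above; as one guess at the mechanism involved (a sketch only -- df is a hypothetical DataFrame), schemas do round-trip through JSON:

    import org.apache.spark.sql.types.DataType

    val schemaJson = df.schema.json               // schema serialized to a JSON string
    val restored = DataType.fromJson(schemaJson)  // ...and reconstructed from it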
Is this related to the s3a update in 2.6?
On Thursday, May 7, 2015, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Details are here: https://issues.apache.org/jira/browse/SPARK-7442
It looks like something specific to building against Hadoop 2.6?
Nick
and PR builder in Jenkins
should simply continue to use Java 7 then.
On Tue, May 5, 2015 at 11:25 PM, Reynold Xin r...@databricks.com wrote:
, it will use zip64. Would Python 2.x (or even 3.x) be able to load zip64
files on PYTHONPATH? -Xiangrui
On Tue, May 5, 2015 at 3:25 PM, Reynold Xin r...@databricks.com wrote:
They are usually pretty responsive. We can ping chill to get them to do a
release.
On Wed, May 6, 2015 at 10:32 AM, Tom Graves tgraves...@yahoo.com.invalid
wrote:
Hey folks,
I had a customer ask about updating the version of kryo to get this fix:
https://github.com/EsotericSoftware/kryo/pull/164
Thanks for doing this. Testing infra is one of the most important parts of
a project, and this will make it easier to identify flaky tests.
On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com wrote:
Dear all,
I'm sure you have all noticed that the Spark tests have been fairly
OK I sent an email.
On Tue, May 5, 2015 at 2:47 PM, shane knapp skn...@berkeley.edu wrote:
+1 to an announce to user and dev. java6 is so old and sad.
On Tue, May 5, 2015 at 2:24 PM, Tom Graves tgraves...@yahoo.com wrote:
+1. I haven't seen major objections here so I would say send
Hi all,
We will drop support for Java 6 starting Spark 1.5, tentatively scheduled to
be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015,
will be the last minor release that supports Java 6. That is to say:
Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8.
Spark 1.5+ (~
I took a quick look at that implementation. I'm not sure if it actually
handles JSON correctly, because it attempts to find the first { starting
from a random point. However, that random point could be in the middle of a
string, and thus the first { might just be part of a string, rather than a
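A small illustration of the failure mode (constructed example, not from the thread):

    // Scanning for '{' from an arbitrary offset can land inside a string
    // literal rather than at the start of a record.
    val record = """{"name": "curly {brace} fan", "id": 1}"""
    val idx = record.indexOf('{', 10)  // returns 16: the '{' inside the string value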
to allow importing a namespace into SQL somehow?
I ask because if we have to keep worrying about name collisions then I'm
not sure what the added complexity of #2 and #3 buys us.
Punya
On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com
wrote:
Scaladoc isn't much
method with an optional Seq of column names.
Regards,
Olivier.
On Sun, May 3, 2015 at 07:44, Reynold Xin r...@databricks.com wrote:
How does the pivotal format decide where to split the files? It seems to
me the challenge is to decide that, and off the top of my head the only way
to do this is to scan from the beginning and parse the JSON properly, which
makes it not possible with large files (doable for whole input with a lot
I've personally prototyped a completely in-memory shuffle for Spark three
times. However, it is unclear how big a gain it would be to put all of this
in memory under newer file systems (ext4, xfs). If the shuffle data is
small, it is still in the file system buffer cache anyway. Note that
Part of the reason is that it is really easy to just call toDF on Scala,
and we already have a lot of createDataFrame functions.
(You might find some of the cross-language differences confusing, but I'd
argue most real users just stick to one language, and developers or
trainers are the only ones
.
Tom
On Thursday, April 30, 2015 2:04 PM, Reynold Xin r...@databricks.com
wrote:
I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280
Would you like to give it a shot?
On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com
wrote:
Hi All:
Is there any plan to add drop column(s) functionality in the DataFrame API?
One can use the select function to work around it for now.
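A sketch of that select-based workaround (the helper name is mine):

    import org.apache.spark.sql.DataFrame

    // Keep every column except the one being dropped.
    def dropColumn(df: DataFrame, name: String): DataFrame =
      df.select(df.columns.filter(_ != name).map(df.apply): _*)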
This has been discussed a few times in the past, but now that Oracle has
ended support for Java 6 for over a year, I wonder if we should just drop
Java 6
support.
There is one outstanding issue Tom has brought to my attention: PySpark on
YARN doesn't work well with Java 7/8, but we have an outstanding
We should change the trait to an abstract class, and then your problem will go
away.
Do you want to submit a pull request?
On Wed, Apr 29, 2015 at 11:02 PM, Niranda Perera niranda.per...@gmail.com
wrote:
Hi,
this follows the feature described in [1]
I'm trying to implement a
In this case it's fine to discuss whether this would fit in Spark
DataFrames' high level direction before putting it in JIRA. Otherwise we
might end up creating a lot of tickets just for querying whether something
might be a good idea.
About this specific feature -- I'm not sure what it means in
is that we should have a handful of namespaces (say 4 or 5). It
becomes too cumbersome to import / remember more package names and having
everything in one package makes it hard to read scaladoc etc.
Thanks
Shivaram
On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com wrote
?)
On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote:
Actually I'm doing some cleanups related to type coercion, and I will take
care of this.
On Wed, Apr 29, 2015 at 5:10 PM, lonely Feb lonely8...@gmail.com wrote:
OK, I'll try.
On Apr 30, 2015 06:54, Reynold Xin r...@databricks.com wrote:
We added the ExpectedInputConversion rule recently
Before we make DataFrame non-alpha, it would be great to decide how we want
to namespace all the functions. There are 3 alternatives:
1. Put all in org.apache.spark.sql.functions. This is how SQL does it,
since SQL doesn't have namespaces. I estimate eventually we will have ~ 200
functions.
2.
somehow?
I ask because if we have to keep worrying about name collisions then I'm
not sure what the added complexity of #2 and #3 buys us.
Punya
On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote:
Scaladoc isn't much of a problem because scaladocs are grouped.
Java/Python
Shane - can we purge all the outstanding builds so we are not running stuff
against stale PRs?
On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
And unfortunately, many Jenkins executor slots are being taken by stale
Spark PRs...
On Mon, Apr 27, 2015 at
I like that idea (having a new-issues list instead of directly forwarding
them to dev).
On Fri, Apr 24, 2015 at 11:08 AM, Patrick Wendell pwend...@gmail.com
wrote:
It's a bit of a digression - but Steve's suggestion that we have a
mailing list for new issues is a great idea and we can do it
I'd love to see more design discussions consolidated in a single place as
well. That said, there are many practical challenges to overcome. Some of
them are out of our control:
1. For large features, it is fairly common to open a PR for discussion,
close the PR taking some feedback into account,
Thanks for looking into this, Shane.
On Fri, Apr 24, 2015 at 3:18 PM, shane knapp skn...@berkeley.edu wrote:
ok, jenkins is back up and building. we have a few things to mop up here
(ganglia is sad), but i think we'll be good for the afternoon.
shane
On Fri, Apr 24, 2015 at 2:17 PM, shane
This looks like a specific Spray configuration issue (or how Spray reads
config files). Maybe Spray is reading some local config file that doesn't
exist on your executors?
You might need to email the Spray list.
On Fri, Apr 24, 2015 at 2:38 PM, Yang Lei genia...@gmail.com wrote:
forward to
-7118
thx
On Fri, Apr 24, 2015 at 07:34, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
I'll try thanks
On Fri, Apr 24, 2015 at 00:09, Reynold Xin r...@databricks.com wrote:
You can do it similarly to the way countDistinct is done, can't you?
https://github.com/apache/spark
Can you elaborate what you mean by that? (what's already available in
Python?)
On Fri, Apr 24, 2015 at 2:24 PM, Shuai Zheng szheng.c...@gmail.com wrote:
Hi All,
I want to ask whether there is a plan to implement the feature to access
the Row in SQL by name? Currently we are only allowed to
Ah damn. We need to add it to the Python list. Would you like to give it a
shot?
On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Yep no problem, but I can't seem to find the coalesce function in
pyspark.sql.{*, functions, types or whatever :) }
possible
with Hadoop's InputFormat.getSplits()?
Thanks,
Mingyu
On 4/21/15, 4:33 PM, Soren Macbeth so...@yieldbot.com wrote:
I'm also super interested in this. Flambo (our Clojure DSL) wraps the Java
API and it would be great to have this.
On Tue, Apr 21, 2015 at 4:10 PM, Reynold Xin r
at 11:56, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Where should this *coalesce* come from? Is it related to the partition
manipulation coalesce method?
Thanks!
On Mon, Apr 20, 2015 at 22:48, Reynold Xin r...@databricks.com wrote:
Ah, I see. You can do something like
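Presumably something along these lines (a sketch, not the original reply -- df is hypothetical), which also shows why the two coalesces are unrelated:

    import org.apache.spark.sql.functions.{coalesce, lit}

    // The SQL expression: per row, the first non-null of its arguments.
    val filled = df.select(coalesce(df("a"), lit(0)))
    // The partition-manipulation coalesce lives on the RDD side instead.
    val shrunk = df.rdd.coalesce(4)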
Whoa, hold on a minute.
Spark has been among the projects that are the most welcoming to new
contributors. And thanks to this, the sheer number of activities in Spark
is much larger than in other projects, and our workflow has to accommodate
this fact.
In practice, people just create pull requests on
I replied on JIRA. Let's move the discussion there.
On Tue, Apr 21, 2015 at 8:13 AM, Karlson ksonsp...@siberie.de wrote:
I think the __getattr__ method should be removed from the DataFrame API in
pyspark.
May I draw the Python folk's attention to the issue
It runs tons of integration tests. I think most developers just let Jenkins
run the full suite of them.
On Tue, Apr 21, 2015 at 12:54 PM, Olivier Girardot ssab...@gmail.com
wrote:
Hi everyone,
I was just wondering about the Spark full build time (including tests);
1h48 seems to me quite...
I created a pull request last night for a new InputSource API that is
essentially a stripped down version of the RDD API for providing data into
Spark. Would be great to hear the community's feedback.
Spark currently has two de facto input source APIs:
1. RDD
2. Hadoop MapReduce InputFormat
() if they
want to use it later?
Punya
On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin r...@databricks.com wrote:
:)
On Mon, Apr 20, 2015 at 22:22, Reynold Xin r...@databricks.com wrote:
You can just create fillna function based on the 1.3.1 implementation of
fillna, no?
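A sketch of the 1.3.1-style API being referenced (column names are illustrative):

    // Replace nulls with constants, globally or per column.
    val cleaned = df.na.fill(0.0)
    val cleaned2 = df.na.fill(Map("age" -> 0, "name" -> "unknown"))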
On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
a UDF might be a good idea, no?
On Mon
Definitely a bug. I just checked and it looks like we don't actually have a
function that takes a Scala RDD and Seq[String].
cc Davies who added this code a while back.
On Sun, Apr 19, 2015 at 2:56 PM, Justin Uang justin.u...@gmail.com wrote:
Hi,
I have a question regarding
I think in 1.3 and above, you'd need to do
.sql(...).javaRDD().map(...)
On Fri, Apr 17, 2015 at 9:22 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Yes thanks !
On Fri, Apr 17, 2015 at 16:20, Ted Yu yuzhih...@gmail.com wrote:
The image didn't go through.
I think you
Please do! Thanks.
On Fri, Apr 17, 2015 at 2:36 PM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
Ok, do you want me to open a pull request to fix the dedicated
documentation?
On Fri, Apr 17, 2015 at 18:14, Reynold Xin r...@databricks.com wrote:
I think in 1.3 and above
in the documentation?
It's because you did a repartition -- which rearranges all the data.
Parquet uses all kinds of compression techniques such as dictionary
encoding and run-length encoding, which would result in the size difference
when the data is ordered differently.
On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei
This is strange. cc the dev list since it might be a bug.
On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote:
Never mind. I found the solution:
val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
hiveLoadedDataFrame.schema)
which translates to converting the data
There is a jdbc in the SQLContext scala doc:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
Note that this is more of a user list question
On Thu, Apr 16, 2015 at 5:11 AM, Suraj Shetiya surajshet...@gmail.com
wrote:
Hi,
Is there any means of
Welcome, Dmitriy, to the Spark dev list!
On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org
wrote:
Hello Everyone,
I am one of the committers to Apache Ignite and have noticed some talks on
this dev list about integrating Ignite In-Memory File System (IgniteFS)
with
+1
On Fri, Apr 10, 2015 at 11:07 PM -0700, Patrick Wendell pwend...@gmail.com
wrote:
Please vote on releasing the following candidate as Apache Spark version 1.3.1!
The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
Take a look at the following two files:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
and
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
On
)
2015-04-09 10:24 GMT+02:00 Reynold Xin r...@databricks.com:
Why is this a bug? Each RDD implementation should know whether it has a
parent or not.
For example, if you are a MapPartitionsRDD, there is always a parent since
it is a unary operator.
On Wed, Apr 8, 2015 at 6:19 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote:
It does not seem to be safe
+1 too
On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote:
Please vote on releasing the following candidate as Apache Spark version
1.2.2!
The tag to be voted on is v1.2.2-rc1 (commit 7531b50):
Note that we can do this in DataFrames and use Catalyst to push Sample down
beneath Projection :)
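In user-facing terms, a sketch of what the optimizer is then free to reorder (df is hypothetical):

    // Written as project-then-sample; Catalyst may execute the sample first,
    // since dropping rows commutes with selecting columns.
    val s = df.select("a", "b").sample(withReplacement = false, fraction = 0.1)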
On Mon, Apr 6, 2015 at 12:42 PM, Xiangrui Meng men...@gmail.com wrote:
The gap sampling is triggered when the sampling probability is small
and the directly underlying storage has constant time
I think those are great to have. I would put them in the DataFrame API
though, since this is applying to structured data. Many of the advanced
functions on the PairRDDFunctions should really go into the DataFrame API
now that we have it.
One thing that would be great to understand is what
Adding Jianping Wang to the thread, since he contributed the SVDPlusPlus
implementation.
Jianping,
Can you take a look at this message? Thanks.
On Fri, Apr 3, 2015 at 8:41 AM, Michael Malak
michaelma...@yahoo.com.invalid wrote:
I believe that in the initialization portion of GraphX
Yup - we merged the Java and Scala APIs so there is now a single set of APIs
to support both languages.
See more at
http://spark.apache.org/docs/latest/sql-programming-guide.html#unification-of-the-java-and-scala-apis
On Tue, Mar 31, 2015 at 11:40 PM, Niranda Perera niranda.per...@gmail.com
Reviving this to see if others would like to chime in about this
expression language for config options.
On Fri, Mar 13, 2015 at 7:57 PM, Dale Richardson dale...@hotmail.com
wrote:
Mridul, I may have added some confusion by giving examples in completely
different areas. For example the number
The only reason I can think of right now is that you might want to change
the config parameter to change the behavior of the optimizer and regenerate
the plan. However, maybe that's not a strong enough reason to regenerate
the RDD every time.
On Mon, Mar 30, 2015 at 5:38 AM, Cheng Lian
Igor,
Welcome -- everything is open here:
https://issues.apache.org/jira/browse/SPARK
You should be able to see them even if you are not an ASF member.
On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote:
Hi there Guys.
I want to contribute more to Spark, but I
If scaladoc can show the Java enum types, I do think the best way is then
to just use Java enum types.
On Mon, Mar 23, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote:
If the official solution from the Scala community is to use Java
enums, then it seems strange they aren't generated in
I created a ticket to separate the API refactoring from the implementation.
Would be great to have these as two separate patches to make it easier to
review (similar to the way we are doing the RPC refactoring -- first
introducing an internal RPC API, porting Akka to it, and then adding an
alternative
), it seems to us that it
is accepting it. Also, in IBM's J9 health center, I see it reserve the
900g, and use up to 68g.
Thanks,
Tom
On 13 March 2015 at 02:05, Reynold Xin r...@databricks.com wrote:
How did you run the Spark command? Maybe the memory setting didn't
actually apply? How much memory
This is an interesting idea.
Are there well known libraries for doing this? Config is the one place
where it would be great to have something ridiculously simple, so it is
more or less bug free. I'm concerned about the complexity in this patch and
subtle bugs that it might introduce to config
Thanks for the email and encouragement, Devl. Responses to the 3 requests:
- tonnes of configuration properties and 'go faster' type flags. For example,
Hadoop and HBase users will know that there is a whole catalogue of
properties for regions, caches, network properties, block sizes, etc.
Hi all,
The Hadoop Summit uses community choice voting to decide which talks to
feature. It would be great if the community could help vote for Spark talks
so that Spark has a good showing at this event. You can make three votes on
each track. Below I've listed 3 talks that are important to
Yes, that's a bug and should be using the standard serializer.
On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen so...@cloudera.com wrote:
That looks, at the least, inconsistent. As far as I know this should
be changed so that the zero value is always cloned via the non-closure
serializer. Any
Michael - it is already transient. This should probably be considered a bug
in the Scala compiler, but we can easily work around it by removing the use
of destructuring binding.
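A sketch of the workaround described (names are illustrative):

    class Ctx extends Serializable {
      private def initialize(): (Int, String) = (1, "one")
      // Before: @transient val (a, b) = initialize()
      // -- the compiler-generated tuple field behind the destructuring
      //    pattern is not covered by the annotation and still gets serialized.
      @transient private val pair = initialize()
      @transient val a: Int = pair._1
      @transient val b: String = pair._2
    }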
On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust mich...@databricks.com
wrote:
I'd suggest marking the HiveContext as
this through the tuple extraction. This is only a workaround. We can also
remove the tuple extraction.
Most likely no. We are using the embedded mode of Jetty, rather than using
servlets.
Even if it is possible, you probably wouldn't want to embed Spark in your
application server ...
On Sun, Feb 15, 2015 at 9:08 PM, Niranda Perera niranda.per...@gmail.com
wrote:
Hi,
We are thinking of
Spark SQL is not the same as Hive on Spark.
Spark SQL is a query engine that is designed from ground up for Spark
without the historic baggage of Hive. It also does more than SQL now -- it
is meant for structured data processing (e.g. the new DataFrame API) and
SQL. Spark SQL is mostly compatible
server inside Spark? Is it used for Spark core functionality or is it there
for Spark jobs UI purposes?
cheers
On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote:
Evan articulated it well.
On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Well, you can always join as many RDDs as you want by chaining them
together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of
RDDs in this way but 10 is probably doable.
Can you use the new aggregateNeighbors method? I suspect the null is coming
from automatic join elimination, which detects bytecode to see if you
need the src or dst vertex data. Occasionally it can fail to detect. In the
new aggregateNeighbors API, the caller needs to explicitly specify that,
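The message breaks off here, but a sketch of that explicitness in the aggregateMessages form (graph is any hypothetical Graph[VD, ED]):

    import org.apache.spark.graphx.{Graph, TripletFields}

    // Degree counting: TripletFields.None declares up front that neither
    // vertex attribute is read, so nothing is inferred from bytecode.
    def degrees[VD, ED](graph: Graph[VD, ED]) =
      graph.aggregateMessages[Int](
        ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) },
        _ + _,
        TripletFields.None)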
Then maybe you actually had a null in your vertex attribute?
On Thu, Feb 12, 2015 at 10:47 PM, James alcaid1...@gmail.com wrote:
I changed the mapReduceTriplets() func to aggregateMessages(), but it
still failed.
2015-02-13 6:52 GMT+08:00 Reynold Xin r...@databricks.com:
Can you use
It seems to me having a version that is 2+ is good for that? Once we move
to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1
or 2.1.0.
On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen so...@cloudera.com wrote:
Patrick and I were chatting about how to handle several issues
this makes sense.
Thanks,
Aniket
On Sat, Feb 7, 2015, 4:50 AM Reynold Xin r...@databricks.com wrote:
://www.r-bloggers.com/r-na-vs-null/
On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com
wrote:
Isn't that just null in SQL?
On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan
velvia.git...@gmail.com
wrote:
I believe that most DataFrame implementations out
10, 2015 at 2:58 PM, Reynold Xin r...@databricks.com wrote:
Koert,
Don't get too hung up on the name SQL. This is exactly what you want: a
collection with record-like objects with field names and runtime types.
Almost all of the 40 methods are transformations for structured data
it is easier for IDEs to
recognize it as a block comment. If you press enter in the comment
block with the `//` style, IDEs won't add `//` for you. -Xiangrui
On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com
wrote:
We should update the style doc to reflect what we have
We thought about this today after seeing this email. I actually built a
patch for this (adding filter/column to data source stat estimation), but
ultimately dropped it due to the potential problems the change could cause.
The main problem I see is that column pruning/predicate pushdowns are
This is the original ticket:
https://issues.apache.org/jira/browse/SPARK-1442
I believe it will happen, one way or another :)
On Fri, Feb 6, 2015 at 5:29 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Currently there's no standard way of handling time series data in Spark. We
were kicking
We should update the style doc to reflect what we have in most places
(which I think is //).
On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
FWIW I like the multi-line // over /* */ from a purely style standpoint.
The Google Java style guide[1] has
Haven't synced anything for the last 4 hours. Seems like this little piece
of infrastructure always stops working around our own code freeze time ...
I filed an INFRA ticket: https://issues.apache.org/jira/browse/INFRA-9115
I wish ASF would reconsider requests like this in order to handle downtime
gracefully https://issues.apache.org/jira/browse/INFRA-8738
On Tue, Feb 3, 2015 at 9:09 PM, Reynold Xin r...@databricks.com wrote:
Haven't sync
We can use ScalaTest's PrivateMethodTester also instead of exposing that.
On Tue, Feb 3, 2015 at 2:18 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi Jay,
On Tue, Feb 3, 2015 at 6:28 AM, jayhutfles jayhutf...@gmail.com wrote:
// Exposed for testing
private[spark] var printStream:
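A sketch of the PrivateMethodTester approach (the target class and member are illustrative, not Spark's):

    import org.scalatest.{FunSuite, PrivateMethodTester}

    class Target { private def count: Int = 0 }  // stand-in for the real class

    class TargetSuite extends FunSuite with PrivateMethodTester {
      test("reads private state without widening visibility") {
        val getCount = PrivateMethod[Int]('count)
        assert((new Target invokePrivate getCount()) == 0)
      }
    }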
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util
methods.
The case sensitivity issues seem orthogonal, and would be great to be able
to control that with a flag.
On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:
Hey Spark developers,
Is
Once the DataFrame API is released for 1.3, you can write your thing in
Python and get the same performance. It can't express everything, but for
basic things like projection, filter, join, aggregate and simple numeric
computation, it should work pretty well.
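Shown in Scala below, but the Python version builds the identical plan (a sketch; people and depts are hypothetical DataFrames):

    import org.apache.spark.sql.functions.avg

    // Projection, filter, join, and aggregation all compile down to the
    // same optimized plan regardless of the front-end language.
    val result = people.filter(people("age") > 21)
      .join(depts, people("deptId") === depts("id"))
      .groupBy(depts("name"))
      .agg(avg(people("salary")))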
On Thu, Jan 29, 2015 at 12:45 PM,
are we talking about pandas or is this something
internal to the Spark Python API?
If you could elaborate a bit on this or point me to alternate
documentation.
Thanks much --sasha
On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin r...@databricks.com wrote:
Once the data frame API is released for 1.3, you can
Hopefully problems like this will go away entirely in the next couple of
releases. https://issues.apache.org/jira/browse/SPARK-5293
On Wed, Jan 28, 2015 at 3:12 PM, jay vyas jayunit100.apa...@gmail.com
wrote:
Hi Spark. Where is Akka coming from in Spark?
I see the distribution referenced
DataFrame and SchemaRDD
2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com:
Dirceu,
That is not possible because one cannot overload return types.
SQLContext.parquetFile (and many other methods) needs to return some type,
and that type cannot be both
Thanks for doing that, Shane!
On Wed, Jan 28, 2015 at 10:29 PM, shane knapp skn...@berkeley.edu wrote:
jenkins is back up and all builds have been retriggered... things are
building and looking good, and i'll keep an eye on the spark master builds
tonite and tomorrow.
On Wed, Jan 28, 2015
It's an interesting idea, but there are major challenges with per-row
schema.
1. Performance - the query optimizer and execution engine use assumptions
about schema and data to generate optimized query plans. Having to
re-reason about schema for each row can substantially slow down the
engine, but due to
on this idea (mostly from Patrick and Reynold :-).
https://www.youtube.com/watch?v=YWppYPWznSQ
From: Patrick Wendell pwend...@gmail.com
To: Reynold Xin r...@databricks.com
Cc: dev@spark.apache.org dev@spark.apache.org
Sent: Monday, January 26, 2015 4:01 PM