Re: DataFrames equivalent to SQL table namespacing and aliases

2015-05-08 Thread Reynold Xin
You can actually just use df1['a'] in projection to differentiate. e.g. in Scala (similar things work in Python): scala> val df1 = Seq((1, "one")).toDF("a", "b") df1: org.apache.spark.sql.DataFrame = [a: int, b: string] scala> val df2 = Seq((2, "two")).toDF("a", "b") df2: org.apache.spark.sql.DataFrame =
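The reason df1['a'] disambiguates is that a column reference carries a pointer to the DataFrame it came from, not just a name. A minimal pure-Python sketch of that idea (toy Frame/Column classes for illustration only, nothing like Spark's real internals):

```python
class Column:
    """A column reference that remembers the frame it came from."""
    def __init__(self, frame, name):
        self.frame, self.name = frame, name

    def values(self):
        return self.frame.data[self.name]

class Frame:
    """Toy DataFrame: column name -> list of values."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, name):
        return Column(self, name)

df1 = Frame({"a": [1], "b": ["one"]})
df2 = Frame({"a": [2], "b": ["two"]})

# Both frames define a column named "a", but each reference resolves
# against its own parent frame, so there is no ambiguity.
cols = [df1["a"], df2["a"]]
print([c.values() for c in cols])  # [[1], [2]]
```

The same principle is why a bare string column name is ambiguous after a self-join while a frame-qualified reference is not.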

Re: Easy way to convert Row back to case class

2015-05-08 Thread Reynold Xin
In 1.4, you can do row.getInt(colName). In 1.5, some variant of this will come to allow you to turn a DataFrame into a typed RDD, where the case class's field names match the column names. https://github.com/apache/spark/pull/5713 On Fri, May 8, 2015 at 11:01 AM, Will Benton wi...@redhat.com
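The linked patch is Spark-side, but the core idea — matching a record type's field names against the row's column names — can be sketched in plain Python (Person, row_to_record, and the column layout here are illustrative, not Spark APIs):

```python
from collections import namedtuple

Person = namedtuple("Person", ["name", "age"])

def row_to_record(record_type, column_names, row):
    """Build a record by looking each field up by column name,
    so the column order in the row does not matter."""
    index = {name: i for i, name in enumerate(column_names)}
    return record_type(*(row[index[f]] for f in record_type._fields))

# Columns arrive in a different order than the record's fields.
cols = ["age", "name"]
print(row_to_record(Person, cols, (42, "Ada")))  # Person(name='Ada', age=42)
```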

Re: DataFrame distinct vs RDD distinct

2015-05-07 Thread Reynold Xin
In 1.5, we will most likely just rewrite distinct in SQL to either use the Aggregate operator which will benefit from all the Tungsten optimizations, or have a Tungsten version of distinct for SQL/DataFrame. On Thu, May 7, 2015 at 1:32 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
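Conceptually, the rewrite is simple: distinct is equivalent to a group-by on all columns with no aggregate expressions, which is what lets it reuse the Aggregate operator. A pure-Python sketch of that equivalence (illustrative only, not the Tungsten code path):

```python
def distinct(rows):
    """Direct dedup, preserving first-seen order."""
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def distinct_via_aggregate(rows):
    """Group by every column with no aggregate functions;
    each group key is itself one distinct row."""
    groups = {}
    for r in rows:
        groups.setdefault(r, None)
    return list(groups)

rows = [(1, "a"), (1, "a"), (2, "b")]
assert distinct(rows) == distinct_via_aggregate(rows) == [(1, "a"), (2, "b")]
```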

Re: pyspark.sql.types.StructType.fromJson() is a lie

2015-05-07 Thread Reynold Xin
What's the use case? I'm wondering if we should even expose fromJSON. I think it's more a bug than feature. On Thu, May 7, 2015 at 1:55 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Observe, my fellow Sparkophiles (Spark 1.3.1): json_rdd =
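For what it's worth, the reason people reach for fromJson is schema round-tripping: serialize a schema, persist it, and rebuild the same structure later. A toy illustration of the round-trip idea using plain dicts (not the actual pyspark.sql.types classes):

```python
import json

# A hypothetical schema shaped loosely like a struct type.
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False},
        {"name": "name", "type": "string", "nullable": True},
    ],
}

# Round-trip: the rebuilt schema is structurally identical, which is
# exactly what makes a fromJson-style constructor useful for
# persisting schemas alongside data.
rebuilt = json.loads(json.dumps(schema))
assert rebuilt == schema
```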

Re: Spark 1.3.1 / Hadoop 2.6 package has broken S3 access

2015-05-07 Thread Reynold Xin
Is this related to s3a update in 2.6? On Thursday, May 7, 2015, Nicholas Chammas nicholas.cham...@gmail.com wrote: Details are here: https://issues.apache.org/jira/browse/SPARK-7442 It looks like something specific to building against Hadoop 2.6? Nick

Re: [ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-06 Thread Reynold Xin
and PR builder in Jenkins should simply continue to use Java 7 then. On Tue, May 5, 2015 at 11:25 PM, Reynold Xin r...@databricks.com wrote: Hi all, We will drop support for Java 6 starting Spark 1.5, tentatively scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released

Re: [discuss] ending support for Java 6?

2015-05-06 Thread Reynold Xin
, it will use zip64. Can Python 2.x (or even 3.x) load zip64 files on PYTHONPATH? -Xiangrui On Tue, May 5, 2015 at 3:25 PM, Reynold Xin r...@databricks.com wrote: OK I sent an email. On Tue, May 5, 2015 at 2:47 PM, shane knapp skn...@berkeley.edu wrote: +1 to an announce

Re: kryo version?

2015-05-06 Thread Reynold Xin
They are usually pretty responsive. We can ping chill to get them to do a release. On Wed, May 6, 2015 at 10:32 AM, Tom Graves tgraves...@yahoo.com.invalid wrote: Hey folks, I had a customer ask about updating the version of kryo to get fix: https://github.com/EsotericSoftware/kryo/pull/164

Re: Recent Spark test failures

2015-05-06 Thread Reynold Xin
Thanks for doing this. Testing infra is one of the most important parts of a project, and this will make it easier to identify flaky tests. On Wed, May 6, 2015 at 5:41 PM, Andrew Or and...@databricks.com wrote: Dear all, I'm sure you have all noticed that the Spark tests have been fairly

Re: [discuss] ending support for Java 6?

2015-05-05 Thread Reynold Xin
OK I sent an email. On Tue, May 5, 2015 at 2:47 PM, shane knapp skn...@berkeley.edu wrote: +1 to an announce to user and dev. java6 is so old and sad. On Tue, May 5, 2015 at 2:24 PM, Tom Graves tgraves...@yahoo.com wrote: +1. I haven't seen major objections here so I would say send

[ANNOUNCE] Ending Java 6 support in Spark 1.5 (Sep 2015)

2015-05-05 Thread Reynold Xin
Hi all, We will drop support for Java 6 starting Spark 1.5, tentatively scheduled to be released in Sep 2015. Spark 1.4, scheduled to be released in June 2015, will be the last minor release that supports Java 6. That is to say: Spark 1.4.x (~ Jun 2015): will work with Java 6, 7, 8. Spark 1.5+ (~

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might just be part of a string, rather than a
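The failure mode is easy to reproduce: scan for the next { from an arbitrary offset and you may land on a brace that sits inside a string literal, where parsing fails or yields garbage. A small stdlib-only illustration:

```python
import json

doc = '{"note": "this string contains a { brace", "value": 1}'

# A naive splitter seeks to some offset and scans for the next '{'.
offset = 10                   # an arbitrary split point
pos = doc.index("{", offset)  # finds the brace *inside* the string
try:
    json.loads(doc[pos:])
    ok = True
except json.JSONDecodeError:
    ok = False
print(ok)  # False: the brace was part of a string, not a record start
```

Distinguishing a structural brace from one inside a string requires tracking quoting and escapes from a known-good position, which is precisely why splitting multi-line JSON at arbitrary offsets is hard.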

Re: Multi-Line JSON in SparkSQL

2015-05-04 Thread Reynold Xin
, Reynold Xin r...@databricks.com wrote: I took a quick look at that implementation. I'm not sure if it actually handles JSON correctly, because it attempts to find the first { starting from a random point. However, that random point could be in the middle of a string, and thus the first { might

Re: [discuss] DataFrame function namespacing

2015-05-04 Thread Reynold Xin
to allow importing a namespace into SQL somehow? I ask because if we have to keep worrying about name collisions then I'm not sure what the added complexity of #2 and #3 buys us. Punya On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote: Scaladoc isn't much

Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-03 Thread Reynold Xin
method with an optional Seq of column names. Regards, Olivier. On Sun, May 3, 2015 at 07:44, Reynold Xin r...@databricks.com wrote: Part of the reason is that it is really easy to just call toDF in Scala, and we already have a lot of createDataFrame functions. (You might find some

Re: Multi-Line JSON in SparkSQL

2015-05-03 Thread Reynold Xin
How does the pivotal format decide where to split the files? It seems to me the challenge is to decide that, and off the top of my head the only way to do this is to scan from the beginning and parse the JSON properly, which makes it not possible with large files (doable for whole input with a lot

Re: Why does SortShuffleWriter write to disk always?

2015-05-02 Thread Reynold Xin
I've personally prototyped completely in-memory shuffle for Spark 3 times. However, it is unclear how big of a gain it would be to put all of these in memory, under newer file systems (ext4, xfs). If the shuffle data is small, they are still in the file system buffer cache anyway. Note that

Re: createDataFrame allows column names as second param in Python not in Scala

2015-05-02 Thread Reynold Xin
Part of the reason is that it is really easy to just call toDF in Scala, and we already have a lot of createDataFrame functions. (You might find some of the cross-language differences confusing, but I'd argue most real users just stick to one language, and developers or trainers are the only ones

Re: [discuss] ending support for Java 6?

2015-05-02 Thread Reynold Xin
. Tom On Thursday, April 30, 2015 2:04 PM, Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought

Re: Drop column/s in DataFrame

2015-04-30 Thread Reynold Xin
I filed a ticket: https://issues.apache.org/jira/browse/SPARK-7280 Would you like to give it a shot? On Thu, Apr 30, 2015 at 10:22 AM, rakeshchalasani vnit.rak...@gmail.com wrote: Hi All: Is there any plan to add drop column/s functionality in the data frame? One can use the select function to

[discuss] ending support for Java 6?

2015-04-30 Thread Reynold Xin
This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding

Re: Custom PersistanceEngine and LeaderAgent implementation in Java

2015-04-30 Thread Reynold Xin
We should change the trait to abstract class, and then your problem will go away. Do you want to submit a pull request? On Wed, Apr 29, 2015 at 11:02 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, this follows the following feature in this feature [1] I'm trying to implement a

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Reynold Xin
In this case it's fine to discuss whether this would fit in Spark DataFrames' high level direction before putting it in JIRA. Otherwise we might end up creating a lot of tickets just for querying whether something might be a good idea. About this specific feature -- I'm not sure what it means in

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
is that we should have a handful of namespaces (say 4 or 5). It becomes too cumbersome to import / remember more package names and having everything in one package makes it hard to read scaladoc etc. Thanks Shivaram On Wed, Apr 29, 2015 at 3:30 PM, Reynold Xin r...@databricks.com wrote

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
?) On Wed, Apr 29, 2015 at 3:21 PM, Reynold Xin r...@databricks.com wrote: Before we make DataFrame non-alpha, it would be great to decide how we want to namespace all the functions. There are 3 alternatives: 1. Put all in org.apache.spark.sql.functions. This is how SQL does it, since SQL

Re: Spark SQL cannot tolerate regexp with BIGINT

2015-04-29 Thread Reynold Xin
Actually I'm doing some cleanups related to type coercion, and I will take care of this. On Wed, Apr 29, 2015 at 5:10 PM, lonely Feb lonely8...@gmail.com wrote: OK, I'll try. On Apr 30, 2015 06:54, Reynold Xin r...@databricks.com wrote: We added ExpectedInputConversion rule recently

[discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
Before we make DataFrame non-alpha, it would be great to decide how we want to namespace all the functions. There are 3 alternatives: 1. Put all in org.apache.spark.sql.functions. This is how SQL does it, since SQL doesn't have namespaces. I estimate eventually we will have ~ 200 functions. 2.

Re: [discuss] DataFrame function namespacing

2015-04-29 Thread Reynold Xin
somehow? I ask because if we have to keep worrying about name collisions then I'm not sure what the added complexity of #2 and #3 buys us. Punya On Wed, Apr 29, 2015 at 3:52 PM Reynold Xin r...@databricks.com wrote: Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python

Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread Reynold Xin
Shane - can we purge all the outstanding builds so we are not running stuff against stale PRs? On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: And unfortunately, many Jenkins executor slots are being taken by stale Spark PRs... On Mon, Apr 27, 2015 at

Re: Should we let everyone set Assignee?

2015-04-24 Thread Reynold Xin
I like that idea (having a new-issues list instead of directly forwarding them to dev). On Fri, Apr 24, 2015 at 11:08 AM, Patrick Wendell pwend...@gmail.com wrote: It's a bit of a digression - but Steve's suggestion that we have a mailing list for new issues is a great idea and we can do it

Re: Design docs: consolidation and discoverability

2015-04-24 Thread Reynold Xin
I'd love to see more design discussions consolidated in a single place as well. That said, there are many practical challenges to overcome. Some of them are out of our control: 1. For large features, it is fairly common to open a PR for discussion, close the PR taking some feedback into account,

Re: Jenkins down

2015-04-24 Thread Reynold Xin
Thanks for looking into this, Shane. On Fri, Apr 24, 2015 at 3:18 PM, shane knapp skn...@berkeley.edu wrote: ok, jenkins is back up and building. we have a few things to mop up here (ganglia is sad), but i think we'll be good for the afternoon. shane On Fri, Apr 24, 2015 at 2:17 PM, shane

Re: Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos

2015-04-24 Thread Reynold Xin
This looks like a specific Spray configuration issue (or how Spray reads config files). Maybe Spray is reading some local config file that doesn't exist on your executors? You might need to email the Spray list. On Fri, Apr 24, 2015 at 2:38 PM, Yang Lei genia...@gmail.com wrote: forward to

Re: Dataframe.fillna from 1.3.0

2015-04-24 Thread Reynold Xin
-7118 thx On Fri, Apr 24, 2015 at 07:34, Olivier Girardot o.girar...@lateral-thoughts.com wrote: I'll try, thanks. On Fri, Apr 24, 2015 at 00:09, Reynold Xin r...@databricks.com wrote: You can do it similar to the way countDistinct is done, can't you? https://github.com/apache/spark

Re: [SQL][Feature] Access row by column name instead of index

2015-04-24 Thread Reynold Xin
Can you elaborate what you mean by that? (what's already available in Python?) On Fri, Apr 24, 2015 at 2:24 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I want to ask whether there is a plan to implement the feature to access the Row in sql by name? Currently we can only allow to
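Name-based row access typically reduces to keeping one schema-level name-to-index map shared by the positional rows. A sketch of what such an API could look like (hypothetical Row class, not the accessor Spark eventually shipped):

```python
class Row:
    """Positional row plus a name -> index lookup built from the schema."""
    def __init__(self, schema, values):
        self._index = {name: i for i, name in enumerate(schema)}
        self._values = values

    def __getitem__(self, key):
        # Accept either a position or a column name.
        if isinstance(key, int):
            return self._values[key]
        return self._values[self._index[key]]

r = Row(["id", "name"], (7, "grace"))
print(r["name"], r[0])  # grace 7
```

In a real engine the name-to-index map would be built once per schema and shared by every row, so the per-row cost stays a single dict lookup.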

Re: Dataframe.fillna from 1.3.0

2015-04-23 Thread Reynold Xin
Ah damn. We need to add it to the Python list. Would you like to give it a shot? On Thu, Apr 23, 2015 at 4:31 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yep no problem, but I can't seem to find the coalesce function in pyspark.sql.{*, functions, types or whatever :) }

Re: [discuss] new Java friendly InputSource API

2015-04-23 Thread Reynold Xin
possible with Hadoop's InputFormat.getSplits()? Thanks, Mingyu On 4/21/15, 4:33 PM, Soren Macbeth so...@yieldbot.com wrote: I'm also super interested in this. Flambo (our clojure DSL) wraps the java api and it would be great to have this. On Tue, Apr 21, 2015 at 4:10 PM, Reynold Xin r

Re: Dataframe.fillna from 1.3.0

2015-04-22 Thread Reynold Xin
at 11:56, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Where should this *coalesce* come from? Is it related to the partition manipulation coalesce method? Thanks! On Mon, Apr 20, 2015 at 22:48, Reynold Xin r...@databricks.com wrote: Ah, I see. You can do something like

Re: Should we let everyone set Assignee?

2015-04-22 Thread Reynold Xin
Woh hold on a minute. Spark has been among the projects that are the most welcoming to new contributors. And thanks to this, the sheer number of activities in Spark is much larger than other projects, and our workflow has to accommodate this fact. In practice, people just create pull requests on

Re: [pyspark] Drop __getattr__ on DataFrame

2015-04-21 Thread Reynold Xin
I replied on JIRA. Let's move the discussion there. On Tue, Apr 21, 2015 at 8:13 AM, Karlson ksonsp...@siberie.de wrote: I think the __getattr__ method should be removed from the DataFrame API in pyspark. May I draw the Python folk's attention to the issue

Re: Spark build time

2015-04-21 Thread Reynold Xin
It runs tons of integration tests. I think most developers just let Jenkins run the full suite of them. On Tue, Apr 21, 2015 at 12:54 PM, Olivier Girardot ssab...@gmail.com wrote: Hi everyone, I was just wondering about the Spark full build time (including tests), 1h48 seems to me quite...

[discuss] new Java friendly InputSource API

2015-04-21 Thread Reynold Xin
I created a pull request last night for a new InputSource API that is essentially a stripped down version of the RDD API for providing data into Spark. Would be great to hear the community's feedback. Spark currently has two de facto input source API: 1. RDD 2. Hadoop MapReduce InputFormat

Re: [discuss] new Java friendly InputSource API

2015-04-21 Thread Reynold Xin
() if they want to use it later? Punya On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin r...@databricks.com wrote: I created a pull request last night for a new InputSource API that is essentially a stripped down version of the RDD API for providing data into Spark. Would be great to hear

Re: Dataframe.fillna from 1.3.0

2015-04-20 Thread Reynold Xin
:) On Mon, Apr 20, 2015 at 22:22, Reynold Xin r...@databricks.com wrote: You can just create a fillna function based on the 1.3.1 implementation of fillna, no? On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: a UDF might be a good idea, no? On Mon

Re: Dataframe.fillna from 1.3.0

2015-04-20 Thread Reynold Xin
You can just create a fillna function based on the 1.3.1 implementation of fillna, no? On Mon, Apr 20, 2015 at 2:48 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: a UDF might be a good idea, no? On Mon, Apr 20, 2015 at 11:17, Olivier Girardot o.girar...@lateral-thoughts.com
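As a starting point for such a backport, the semantics of fillna are small: replace nulls per column, optionally restricted to a subset of columns. A pure-Python sketch of those semantics (rows as dicts for illustration; the real 1.3.1 code lives on DataFrame):

```python
def fillna(rows, value, subset=None):
    """Replace None with `value` in each row, optionally only in the
    columns named in `subset` (mirroring fillna's shape)."""
    out = []
    for row in rows:
        cols = subset if subset is not None else row.keys()
        out.append({k: (value if k in cols and v is None else v)
                    for k, v in row.items()})
    return out

rows = [{"a": None, "b": 1}, {"a": 2, "b": None}]
print(fillna(rows, 0, subset=["a"]))
# [{'a': 0, 'b': 1}, {'a': 2, 'b': None}]
```

Note how the subset restriction leaves the second row's null untouched, which is the behavior a column-scoped fillna call needs.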

Re: Infinite recursion when using SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String])

2015-04-19 Thread Reynold Xin
Definitely a bug. I just checked and it looks like we don't actually have a function that takes a Scala RDD and Seq[String]. cc Davies who added this code a while back. On Sun, Apr 19, 2015 at 2:56 PM, Justin Uang justin.u...@gmail.com wrote: Hi, I have a question regarding

Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}

2015-04-17 Thread Reynold Xin
I think in 1.3 and above, you'd need to do .sql(...).javaRDD().map(..) On Fri, Apr 17, 2015 at 9:22 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yes, thanks! On Fri, Apr 17, 2015 at 16:20, Ted Yu yuzhih...@gmail.com wrote: The image didn't go through. I think you

Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}

2015-04-17 Thread Reynold Xin
Please do! Thanks. On Fri, Apr 17, 2015 at 2:36 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Ok, do you want me to open a pull request to fix the dedicated documentation? On Fri, Apr 17, 2015 at 18:14, Reynold Xin r...@databricks.com wrote: I think in 1.3 and above

Re: [Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}

2015-04-17 Thread Reynold Xin
in the documentation? On Fri, Apr 17, 2015 at 21:39, Reynold Xin r...@databricks.com wrote: Please do! Thanks. On Fri, Apr 17, 2015 at 2:36 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Ok, do you want me to open a pull request to fix the dedicated documentation? On Fri

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition -- which rearranges all the data. Parquet uses all kinds of compression techniques such as dictionary encoding and run-length encoding, which would result in the size difference when the data is ordered differently. On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei
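This size effect is easy to demonstrate with any run-length-friendly compressor: identical values compress to very different sizes depending on row order. A quick illustration with zlib standing in for Parquet's dictionary and run-length encodings:

```python
import random
import zlib

def as_bytes(vs):
    return ",".join(map(str, vs)).encode()

values = [i // 100 for i in range(10_000)]  # long runs when sorted

random.seed(0)
shuffled = values[:]
random.shuffle(shuffled)

sorted_size = len(zlib.compress(as_bytes(values)))
shuffled_size = len(zlib.compress(as_bytes(shuffled)))

# Identical data, different order: the run-friendly order is far smaller.
print(sorted_size < shuffled_size)  # True
```

The same mechanism explains why a repartition, which destroys ordering, can noticeably change the on-disk Parquet footprint.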

Re: dataframe can not find fields after loading from hive

2015-04-17 Thread Reynold Xin
This is strange. cc the dev list since it might be a bug. On Thu, Apr 16, 2015 at 3:18 PM, Cesar Flores ces...@gmail.com wrote: Never mind. I found the solution: val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema) which translates to converting the data

Re: Dataframe from mysql database in pyspark

2015-04-16 Thread Reynold Xin
There is a jdbc in the SQLContext scala doc: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext Note that this is more of a user list question On Thu, Apr 16, 2015 at 5:11 AM, Suraj Shetiya surajshet...@gmail.com wrote: Hi, Is there any means of

Re: Integrating Spark with Ignite File System

2015-04-11 Thread Reynold Xin
Welcome, Dmitriy, to the Spark dev list! On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan dsetrak...@apache.org wrote: Hello Everyone, I am one of the committers to Apache Ignite and have noticed some talks on this dev list about integrating Ignite In-Memory File System (IgniteFS) with

Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-11 Thread Reynold Xin
+1 On Fri, Apr 10, 2015 at 11:07 PM -0700, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc2 (commit 3e83913):

Re: Spark remote communication pattern

2015-04-09 Thread Reynold Xin
Take a look at the following two files: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala and https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala On

Re: Spark remote communication pattern

2015-04-09 Thread Reynold Xin
) 2015-04-09 10:24 GMT+02:00 Reynold Xin r...@databricks.com: Take a look at the following two files: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala and https://github.com/apache/spark/blob/master/core/src/main

Re: RDD firstParent

2015-04-08 Thread Reynold Xin
Why is this a bug? Each RDD implementation should know whether they have a parent or not. For example, if you are a MapPartitionedRDD, there is always a parent since it is a unary operator. On Wed, Apr 8, 2015 at 6:19 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote: It does not seem to be safe

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Reynold Xin
+1 too On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.2! The tag to be voted on is v1.2.2-rc1 (commit 7531b50):

Re: Stochastic gradient descent performance

2015-04-06 Thread Reynold Xin
Note that we can do this in DataFrames and use Catalyst to push Sample down beneath Projection :) On Mon, Apr 6, 2015 at 12:42 PM, Xiangrui Meng men...@gmail.com wrote: The gap sampling is triggered when the sampling probability is small and the directly underlying storage has constant time
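The payoff of pushing Sample beneath Projection is that the projection runs only on sampled rows while the result is unchanged. A pure-Python sketch with a deterministic sampler and a call counter (hypothetical helper names, not Catalyst code):

```python
calls = {"n": 0}

def project(row):
    calls["n"] += 1
    return row * 2          # stands in for per-row projection work

def sample(rows):
    # Deterministic "sample": keep every 10th row, so the two plan
    # orders select exactly the same rows.
    return [r for i, r in enumerate(rows) if i % 10 == 0]

rows = list(range(1000))

calls["n"] = 0
plan_a = sample([project(r) for r in rows])   # project, then sample
cost_a = calls["n"]

calls["n"] = 0
plan_b = [project(r) for r in sample(rows)]   # sample pushed down
cost_b = calls["n"]

assert plan_a == plan_b       # same answer...
print(cost_a, cost_b)         # 1000 100  ...at a tenth of the work
```

A real optimizer must also check the projection is deterministic and row-wise before reordering, which is the kind of rule Catalyst encodes.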

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-04-06 Thread Reynold Xin
I think those are great to have. I would put them in the DataFrame API though, since this is applying to structured data. Many of the advanced functions on the PairRDDFunctions should really go into the DataFrame API now we have it. One thing that would be great to understand is what

Re: Wrong initial bias in GraphX SVDPlusPlus?

2015-04-05 Thread Reynold Xin
Adding Jianping Wang to the thread, since he contributed the SVDPlusPlus implementation. Jianping, can you take a look at this message? Thanks. On Fri, Apr 3, 2015 at 8:41 AM, Michael Malak michaelma...@yahoo.com.invalid wrote: I believe that in the initialization portion of GraphX

Re: Migrating from 1.2.1 to 1.3.0 - org.apache.spark.sql.api.java.Row

2015-04-01 Thread Reynold Xin
Yup - we merged the Java and Scala API so there is now a single set of API to support both languages. See more at http://spark.apache.org/docs/latest/sql-programming-guide.html#unification-of-the-java-and-scala-apis On Tue, Mar 31, 2015 at 11:40 PM, Niranda Perera niranda.per...@gmail.com

Re: Spark config option 'expression language' feedback request

2015-03-31 Thread Reynold Xin
Reviving this to see if others would like to chime in about this expression language for config options. On Fri, Mar 13, 2015 at 7:57 PM, Dale Richardson dale...@hotmail.com wrote: Mridul, I may have added some confusion by giving examples in completely different areas. For example the number

Re: [sql] How to uniquely identify Dataframe?

2015-03-30 Thread Reynold Xin
The only reason I can think of right now is that you might want to change the config parameter to change the behavior of the optimizer and regenerate the plan. However, maybe that's not a strong enough reasons to regenerate the RDD everytime. On Mon, Mar 30, 2015 at 5:38 AM, Cheng Lian

Re: Jira Issues

2015-03-25 Thread Reynold Xin
Igor, Welcome -- everything is open here: https://issues.apache.org/jira/browse/SPARK You should be able to see them even if you are not an ASF member. On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa igorco...@apache.org wrote: Hi there Guys. I want to be more collaborative to Spark, but I

Re: enum-like types in Spark

2015-03-23 Thread Reynold Xin
If scaladoc can show the Java enum types, I do think the best way is then just Java enum types. On Mon, Mar 23, 2015 at 2:11 PM, Patrick Wendell pwend...@gmail.com wrote: If the official solution from the Scala community is to use Java enums, then it seems strange they aren't generated in

Re: Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Reynold Xin
I created a ticket to separate the API refactoring from the implementation. Would be great to have these as two separate patches to make it easier to review (similar to the way we are doing RPC refactoring -- first introducing an internal RPC api, port akka to it, and then add an alternative

Re: Spilling when not expected

2015-03-17 Thread Reynold Xin
), it seems to us that it is accepting it. Also, in IBM's J9 health center, I see it reserve the 900g, and use up to 68g. Thanks, Tom On 13 March 2015 at 02:05, Reynold Xin r...@databricks.com wrote: How did you run the Spark command? Maybe the memory setting didn't actually apply? How much memory

Re: Spark config option 'expression language' feedback request

2015-03-13 Thread Reynold Xin
This is an interesting idea. Are there well known libraries for doing this? Config is the one place where it would be great to have something ridiculously simple, so it is more or less bug free. I'm concerned about the complexity in this patch and subtle bugs that it might introduce to config

Re: Some praise and comments on Spark

2015-02-25 Thread Reynold Xin
Thanks for the email and encouragement, Devl. Responses to the 3 requests: -tonnes of configuration properties and go faster type flags. For example Hadoop and Hbase users will know that there are a whole catalogue of properties for regions, caches, network properties, block sizes, etc etc.

Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed 3 talks that are important to

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Reynold Xin
Yes, that's a bug and should be using the standard serializer. On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen so...@cloudera.com wrote: That looks, at the least, inconsistent. As far as I know this should be changed so that the zero value is always cloned via the non-closure serializer. Any
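The bug matters because a mutable zero value that is not cloned per partition gets shared across partitions. A pure-Python reproduction of that failure mode, with copy.deepcopy standing in for the serializer round-trip Spark performs on the zero value:

```python
import copy

partitions = [[1, 2], [3, 4]]

def fold_partitions(zero, clone):
    """Fold each partition into an accumulator seeded from `zero`."""
    results = []
    for part in partitions:
        acc = copy.deepcopy(zero) if clone else zero  # fresh zero per partition?
        for x in part:
            acc.append(x)
        results.append(acc)
    return results

# Shared mutable zero: partitions bleed into one another.
print(fold_partitions([], clone=False))  # [[1, 2, 3, 4], [1, 2, 3, 4]]
# Cloned zero (what serializing the zero value gives you): correct.
print(fold_partitions([], clone=True))   # [[1, 2], [3, 4]]
```

Serializing with the standard serializer rather than the closure serializer gives exactly this clone-per-use behavior, consistently with how zero values are handled elsewhere.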

Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
Michael - it is already transient. This should probably be considered a bug in the Scala compiler, but we can easily work around it by removing the use of destructuring binding. On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust mich...@databricks.com wrote: I'd suggest marking the HiveContext as

Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
this through the tuple extraction. This is only a workaround. We can also remove the tuple extraction. On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote: Michael - it is already transient. This should probably be considered a bug in the Scala compiler, but we can easily work around

Re: Replacing Jetty with TomCat

2015-02-15 Thread Reynold Xin
Most likely no. We are using the embedded mode of Jetty, rather than using servlets. Even if it is possible, you probably wouldn't want to embed Spark in your application server ... On Sun, Feb 15, 2015 at 9:08 PM, Niranda Perera niranda.per...@gmail.com wrote: Hi, We are thinking of

Re: Spark Hive

2015-02-15 Thread Reynold Xin
Spark SQL is not the same as Hive on Spark. Spark SQL is a query engine that is designed from ground up for Spark without the historic baggage of Hive. It also does more than SQL now -- it is meant for structured data processing (e.g. the new DataFrame API) and SQL. Spark SQL is mostly compatible

Re: Replacing Jetty with TomCat

2015-02-15 Thread Reynold Xin
server inside Spark? Is it used for Spark core functionality or is it there for Spark jobs UI purposes? cheers On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin r...@databricks.com wrote: Most likely no. We are using the embedded mode of Jetty, rather than using servlets. Even if it is possible

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Reynold Xin
Evan articulated it well. On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Well, you can always join as many RDDs as you want by chaining them together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of RDDs in this way but 10 is probably doable.

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Can you use the new aggregateNeighbors method? I suspect the null is coming from automatic join elimination, which detects bytecode to see if you need the src or dst vertex data. Occasionally it can fail to detect. In the new aggregateNeighbors API, the caller needs to explicitly specifying that,

Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Then maybe you actually had a null in your vertex attribute? On Thu, Feb 12, 2015 at 10:47 PM, James alcaid1...@gmail.com wrote: I changed the mapReduceTriplets() func to aggregateMessages(), but it still failed. 2015-02-13 6:52 GMT+08:00 Reynold Xin r...@databricks.com: Can you use

Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Reynold Xin
It seems to me having a version that is 2+ is good for that? Once we move to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1 or 2.1.0 . On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen so...@cloudera.com wrote: Patrick and I were chatting about how to handle several issues

Re: Data source API | sizeInBytes should be to *Scan

2015-02-11 Thread Reynold Xin
this makes sense. Thanks, Aniket On Sat, Feb 7, 2015, 4:50 AM Reynold Xin r...@databricks.com wrote: We thought about this today after seeing this email. I actually built a patch for this (adding filter/column to data source stat estimation), but ultimately dropped it due

Re: renaming SchemaRDD - DataFrame

2015-02-10 Thread Reynold Xin
://www.r-bloggers.com/r-na-vs-null/ On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin r...@databricks.com wrote: Isn't that just null in SQL? On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan velvia.git...@gmail.com wrote: I believe that most DataFrame implementations out

Re: renaming SchemaRDD - DataFrame

2015-02-10 Thread Reynold Xin
10, 2015 at 2:58 PM, Reynold Xin r...@databricks.com wrote: Koert, Don't get too hang up on the name SQL. This is exactly what you want: a collection with record-like objects with field names and runtime types. Almost all of the 40 methods are transformations for structured data

Re: multi-line comment style

2015-02-09 Thread Reynold Xin
it is easier for IDEs to recognize it as a block comment. If you press enter in the comment block with the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the style doc to reflect what we have

Re: Data source API | sizeInBytes should be to *Scan

2015-02-08 Thread Reynold Xin
We thought about this today after seeing this email. I actually built a patch for this (adding filter/column to data source stat estimation), but ultimately dropped it due to the potential problems the change could cause. The main problem I see is that column pruning/predicate pushdowns are

Re: Spark SQL Window Functions

2015-02-08 Thread Reynold Xin
This is the original ticket: https://issues.apache.org/jira/browse/SPARK-1442 I believe it will happen, one way or another :) On Fri, Feb 6, 2015 at 5:29 PM, Evan R. Sparks evan.spa...@gmail.com wrote: Currently there's no standard way of handling time series data in Spark. We were kicking

Re: multi-line comment style

2015-02-04 Thread Reynold Xin
We should update the style doc to reflect what we have in most places (which I think is //). On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: FWIW I like the multi-line // over /* */ from a purely style standpoint. The Google Java style guide[1] has

ASF Git / GitHub sync is down

2015-02-03 Thread Reynold Xin
Haven't sync-ed anything for the last 4 hours. Seems like this little piece of infrastructure always stops working around our own code freeze time ...

Re: ASF Git / GitHub sync is down

2015-02-03 Thread Reynold Xin
I filed an INFRA ticket: https://issues.apache.org/jira/browse/INFRA-9115 I wish ASF would reconsider requests like this in order to handle downtime gracefully https://issues.apache.org/jira/browse/INFRA-8738 On Tue, Feb 3, 2015 at 9:09 PM, Reynold Xin r...@databricks.com wrote: Haven't sync

Re: SparkSubmit.scala and stderr

2015-02-03 Thread Reynold Xin
We can also use ScalaTest's PrivateMethodTester instead of exposing that. On Tue, Feb 3, 2015 at 2:18 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Jay, On Tue, Feb 3, 2015 at 6:28 AM, jayhutfles jayhutf...@gmail.com wrote: // Exposed for testing private[spark] var printStream:
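A minimal sketch of the ScalaTest approach mentioned here — `PrivateMethodTester` invokes a private member by name without widening its visibility to `private[spark]`. The class and method names below are hypothetical, not SparkSubmit's actual members:

```scala
import org.scalatest.{FunSuite, PrivateMethodTester}

class Submitter {
  // Stays truly private; no test-only visibility escape hatch needed.
  private def buildArgs(n: Int): Seq[String] = Seq.fill(n)("--arg")
}

class SubmitterSuite extends FunSuite with PrivateMethodTester {
  test("private method is reachable from the suite") {
    val buildArgs = PrivateMethod[Seq[String]]('buildArgs)
    val result = new Submitter invokePrivate buildArgs(2)
    assert(result === Seq("--arg", "--arg"))
  }
}
```

The trade-off is that the call is resolved reflectively by symbol name, so a rename of the private method only fails at test runtime rather than at compile time.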

Re: [spark-sql] JsonRDD

2015-02-02 Thread Reynold Xin
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util methods. The case sensitivity issues seem orthogonal, and it would be great to be able to control that with a flag. On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Hey Spark developers, Is

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
Once the data frame API is released for 1.3, you can write your thing in Python and get the same performance. It can't express everything, but for basic things like projection, filter, join, aggregate and simple numeric computation, it should work pretty well. On Thu, Jan 29, 2015 at 12:45 PM,
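The operations listed — projection, filter, join, aggregate — all compile to the same Catalyst logical plans whichever language builds them, which is why Python closes the performance gap. A sketch in Scala (the Python calls are near-identical), assuming a live SQLContext and hypothetical `users` and `orders` DataFrames:

```scala
// Each call below only builds up a logical plan; no user-language
// code runs per row, so the driver language barely matters.
val result = users
  .select("id", "age")                           // projection
  .filter(users("age") > 21)                     // filter
  .join(orders, users("id") === orders("uid"))   // join
  .groupBy("age").count()                        // aggregate
```

The caveat in the message stands: anything expressed as an opaque Python lambda (a UDF) drops out of this plan-only path and pays serialization costs again.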

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
Are we talking about pandas, or is this something internal to the Spark Python API? If you could elaborate a bit on this or point me to alternate documentation. Thanks much --sasha On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin r...@databricks.com wrote: Once the data frame API is released for 1.3, you can

Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Reynold Xin
Hopefully problems like this will go away entirely in the next couple of releases. https://issues.apache.org/jira/browse/SPARK-5293 On Wed, Jan 28, 2015 at 3:12 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi spark. Where is akka coming from in spark ? I see the distribution referenced

Re: renaming SchemaRDD - DataFrame

2015-01-28 Thread Reynold Xin
DataFrame and SchemaRDD 2015-01-27 17:18 GMT-02:00 Reynold Xin r...@databricks.com: Dirceu, That is not possible because one cannot overload return types. SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both

Re: emergency jenkins restart soon

2015-01-28 Thread Reynold Xin
Thanks for doing that, Shane! On Wed, Jan 28, 2015 at 10:29 PM, shane knapp skn...@berkeley.edu wrote: jenkins is back up and all builds have been retriggered... things are building and looking good, and i'll keep an eye on the spark master builds tonite and tomorrow. On Wed, Jan 28, 2015

Re: Data source API | Support for dynamic schema

2015-01-28 Thread Reynold Xin
It's an interesting idea, but there are major challenges with per row schema. 1. Performance - query optimizer and execution use assumptions about schema and data to generate optimized query plans. Having to re-reason about schema for each row can substantially slow down the engine, but due to

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Reynold Xin
on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell pwend...@gmail.com To: Reynold Xin r...@databricks.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January 26, 2015 4:01 PM

Re: renaming SchemaRDD - DataFrame

2015-01-27 Thread Reynold Xin
(mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell pwend...@gmail.com To: Reynold Xin r...@databricks.com Cc: dev@spark.apache.org dev@spark.apache.org Sent: Monday, January
