Re: emergency jenkins restart soon

2015-01-29 Thread shane knapp
the master builds triggered around 1am last night (according to the logs), so it looks like we're back in business. On Wed, Jan 28, 2015 at 10:32 PM, shane knapp skn...@berkeley.edu wrote: np! the master builds haven't triggered yet, but let's give the rube goldberg machine a minute to get

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-29 Thread Robert C Senkbeil
+1 I verified that the published REPL jars work fine with the Spark Kernel project (can build/test against them). Signed, Chip Senkbeil From: Krishna Sankar ksanka...@gmail.com To: Sean Owen so...@cloudera.com Cc: Patrick Wendell pwend...@gmail.com, dev@spark.apache.org

Re: Data source API | Support for dynamic schema

2015-01-29 Thread Aniket Bhatnagar
Thanks Reynold and Cheng. It does seem quite a bit of heavy lifting to have a schema per row. For now I will settle for building a union schema of all the schema versions and complaining about any incompatibilities :-) Looking forward to doing great things with the API! Thanks, Aniket On Thu Jan 29 2015
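
[A minimal sketch of the "union schema" approach described above: fold all schema versions into one StructType, failing fast on a field that appears with incompatible types. Written against the org.apache.spark.sql.types names that landed around Spark 1.3; the 1.2 package layout differs slightly.]

    import org.apache.spark.sql.types.{StructField, StructType}

    def unionSchema(versions: Seq[StructType]): StructType =
      versions.reduce { (left, right) =>
        val byName = left.fields.map(f => f.name -> f).toMap
        val merged = right.fields.foldLeft(byName) { (acc, f) =>
          acc.get(f.name) match {
            case Some(existing) if existing.dataType != f.dataType =>
              // "complain about any incompatibilities"
              sys.error(s"Incompatible types for '${f.name}': " +
                s"${existing.dataType} vs ${f.dataType}")
            case Some(_) => acc                    // same field, same type
            case None    => acc + (f.name -> f)    // new field from this version
          }
        }
        // Fields missing from some versions must be nullable in the union;
        // sorting by name just gives the merged schema a stable field order.
        StructType(merged.values.toSeq.sortBy(_.name).map(_.copy(nullable = true)))
      }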

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread Mohit Jaggi
Francois, RDD.aggregate() does not support aggregation by key. But, indeed, that is the kind of implementation I am looking for, one that does not allocate intermediate space for storing (K,V) pairs. When working with large datasets this type of intermediate memory allocation wreaks havoc with
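
[A minimal sketch in the spirit of this thread: aggregate by key without allocating a (K, V) tuple per input record, by folding each partition into a mutable map first so pairs are only materialized once per distinct key per partition. The Event record type and its fields are hypothetical.]

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed pre-1.3)

    case class Event(userId: Long, amount: Double)

    def sumByUser(events: RDD[Event]): RDD[(Long, Double)] =
      events
        .mapPartitions { iter =>
          val sums = mutable.HashMap.empty[Long, Double]
          iter.foreach { e =>
            sums(e.userId) = sums.getOrElse(e.userId, 0.0) + e.amount
          }
          sums.iterator  // one (key, partialSum) pair per distinct key
        }
        .reduceByKey(_ + _)  // combine partial sums across partitions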

Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Evan Chan
+1 having proper NA support is much cleaner than using null, at least the Java null. On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks evan.spa...@gmail.com wrote: You've got to be a little bit careful here. NA in systems like R or pandas may have special meaning that is distinct from null.
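
[A small Scala illustration of the distinction raised here: a Java null is an absent reference, while an R/pandas-style NA behaves more like NaN, propagating through arithmetic and comparing unequal even to itself, so the two cannot stand in for each other.]

    val x: java.lang.Double = null   // "missing" as a null reference
    val y: Double = Double.NaN       // "missing" as a floating-point NaN

    println(y == y)                  // false: NaN is not equal even to itself
    println(x == null)               // true: a plain reference check

    // Aggregations must decide explicitly how to treat missing values:
    val values = Seq(1.0, Double.NaN, 3.0)
    println(values.sum)                      // NaN -- NaN poisons the sum
    println(values.filterNot(_.isNaN).sum)   // 4.0 -- explicit NA handling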

TimeoutException on tests

2015-01-29 Thread Dirceu Semighini Filho
Hi All, I'm trying to use a locally built Spark, adding PR 1290 to the 1.2.0 build, and after I do the build my tests start to fail. should create labeledpoint *** FAILED *** (10 seconds, 50 milliseconds) [info] java.util.concurrent.TimeoutException: Futures timed out after [1

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-29 Thread Octavian Geagla
Thanks for the responses. Would something like HadamardProduct or similar work, to keep it explicit? It would still be a VectorTransformer, so the name and trait would hopefully lead to a somewhat self-documenting class. Xiangrui, do you mean Hadamard product or transform? My initial
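
[A minimal sketch of the proposal: a VectorTransformer whose transform is a Hadamard (element-wise) product with a fixed weight vector. The class name is illustrative; MLlib later shipped this idea as ElementwiseProduct.]

    import org.apache.spark.mllib.feature.VectorTransformer
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    class HadamardProduct(val scalingVector: Vector) extends VectorTransformer {
      override def transform(vector: Vector): Vector = {
        require(vector.size == scalingVector.size,
          s"vector size ${vector.size} != scaling vector size ${scalingVector.size}")
        val values = vector.toArray.clone()  // clone: toArray may expose the backing array
        var i = 0
        while (i < values.length) {
          values(i) *= scalingVector(i)      // component-wise scaling
          i += 1
        }
        Vectors.dense(values)
      }
    }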

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Davies Liu
Hey, Without having Python as fast as Scala/Java, I think it's impossible to get similar performance in PySpark as in Scala/Java. Jython is also much slower than Scala/Java. With Jython we can avoid the cost of managing multiple processes and RPC, but we may still need to do the data conversion between

Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Cheng Lian
Yes, when a DataFrame is cached in memory, it's stored in an efficient columnar format. And you can also easily persist it on disk using Parquet, which is also columnar. Cheng On 1/29/15 1:24 PM, Koert Kuipers wrote: to me the word DataFrame does come with certain expectations. one of them
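
[A minimal sketch of the two persistence paths Cheng mentions, in memory and on disk, using the Spark 1.3-era method names (parquetFile/saveAsParquetFile; later versions use read/write). The Parquet paths are hypothetical.]

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("columnar-demo"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.parquetFile("events.parquet")  // hypothetical input path
    df.cache()   // in memory: Spark SQL's compressed columnar format
    df.count()   // an action, to materialize the cached columns

    df.saveAsParquetFile("events-copy.parquet")  // on disk: columnar, via Parquet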

Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Koert Kuipers
to me the word DataFrame does come with certain expectations. one of them is that the data is stored columnar. in R, data.frame internally uses a list of sequences i think, but since lists can have labels it's more like a SortedMap[String, Array[_]]. this makes certain operations very cheap (such as
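
[A toy sketch of the representation Koert describes: modeling a frame as a sorted map of column name to column array makes whole-column operations cheap, since adding or selecting a column never touches the other columns' data. Runnable as a Scala script; all names are illustrative.]

    import scala.collection.immutable.SortedMap

    type Frame = SortedMap[String, Array[_]]

    val df: Frame = SortedMap(
      "age"  -> Array(31, 25, 47),
      "name" -> Array("ann", "bob", "cec")
    )

    // Adding a derived column copies only map structure, not the existing arrays:
    val withAge2 = df + ("age2" -> df("age").asInstanceOf[Array[Int]].map(_ * 2))

    // Selecting columns costs O(#columns), independent of the row count:
    val projected = df.filter { case (name, _) => name == "age" }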

How to speed PySpark to match Scala/Java performance

2015-01-29 Thread rtshadow
Hi, In my company, we've been trying to use PySpark to run ETLs on our data. Alas, it turned out to be terribly slow compared to the Java or Scala API (which we ended up using to meet performance criteria). To be more quantitative, let's consider a simple case: I've generated a test file (848MB): /seq

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
Once the data frame API is released for 1.3, you can write your thing in Python and get the same performance. It can't express everything, but for basic things like projection, filter, join, aggregate and simple numeric computation, it should work pretty well. On Thu, Jan 29, 2015 at 12:45 PM,
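
[A sketch of the operations Reynold lists, against the Spark 1.3 DataFrame API. Shown in Scala for concreteness; the Python API builds the same logical plan, which is why it reaches the same performance. The column names and input frames are hypothetical.]

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.avg

    def report(people: DataFrame, depts: DataFrame): DataFrame =
      people
        .filter(people("age") > 21)                     // filter
        .select("name", "deptId", "salary")             // projection
        .join(depts, people("deptId") === depts("id"))  // join
        .groupBy(depts("deptName"))                     // aggregate...
        .agg(avg("salary"))                             // ...with simple numerics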

Re: renaming SchemaRDD - DataFrame

2015-01-29 Thread Cheng Lian
Forgot to mention that you can find it here https://github.com/apache/spark/blob/f9e569452e2f0ae69037644170d8aa79ac6b4ccf/sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala. On 1/29/15 1:59 PM, Cheng Lian wrote: Yes, when a DataFrame is cached in memory, it's

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
It is something like this: https://issues.apache.org/jira/browse/SPARK-5097 On the master branch, we have a pandas-like API already. On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski skacan...@gmail.com wrote: Hi Reynold, In my project I want to use Python API too. When you mention DF's are