Re: PySpark on PyPi

2015-08-06 Thread Davies Liu
We could do that after 1.5 is released; it will have the same release cycle as Spark in the future. On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 (once again :) ) 2015-07-28 14:51 GMT+02:00 Justin Uang justin.u...@gmail.com: // ping do we have any

Re: Make ML Developer APIs public (post-1.4)

2015-08-06 Thread Joseph Bradley
Eron, Thanks for sending out this list! We can make some of the critical ones public for 1.5, but they will be marked DeveloperApi since they may require changes in the future. I just filed the JIRA: https://issues.apache.org/jira/browse/SPARK-9704 and I'll send a PR soon. Joseph On Mon, Aug
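For readers unfamiliar with the annotation mentioned above, here is a minimal sketch of what exposing an internal abstraction as a DeveloperApi looks like; the class name is purely illustrative, not one of the APIs tracked in SPARK-9704:

    import org.apache.spark.annotation.DeveloperApi

    // @DeveloperApi marks the class public but explicitly unstable:
    // its contract may change between releases.
    // `MyFeatureTransformer` is a hypothetical example name.
    @DeveloperApi
    abstract class MyFeatureTransformer {
      /** Transforms one feature vector; signature may evolve. */
      def transform(features: Array[Double]): Array[Double]
    }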

Bucket mappings of map stage output

2015-08-06 Thread cheez
Hey all. I was trying to understand Spark internals by looking into (and hacking on) the code. I was basically trying to explore the buckets which are generated when we partition the output of each map task and then let the reduce side fetch them on the basis of partitionId. I went into the write()
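As a rough sketch of the mapping being explored (illustrative only, not the write() internals themselves): each map-side record is assigned a bucket, i.e. a reduce partition, by the partitioner, and the reduce side later fetches the buckets matching its partitionId:

    import org.apache.spark.HashPartitioner

    // A map task routes each record's key to a bucket; reducers then
    // fetch every map output bucket that shares their partitionId.
    val partitioner = new HashPartitioner(8)
    val bucketId = partitioner.getPartition("someKey") // an Int in [0, 8)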

Re: Fixed number of partitions in RangePartitioner

2015-08-06 Thread Reynold Xin
Any reason why you need exactly a certain number of partitions? One way we can make that work is for RangePartitioner to return a bunch of empty partitions if the number of distinct elements is small. That would require changing Spark. If you want a quick workaround, you can also append some
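The message is truncated above, but a hedged sketch of that kind of workaround might look like the following: pad the RDD with synthetic keys so the sampled key range yields exactly the desired number of partitions, then filter the padding back out. All names and the sentinel scheme here are illustrative assumptions, not the suggestion's actual text:

    import org.apache.spark.{RangePartitioner, SparkContext}
    import org.apache.spark.rdd.RDD

    def rangePartitionExactly(sc: SparkContext,
                              data: RDD[(Int, String)],
                              numParts: Int): RDD[(Int, String)] = {
      // Synthetic keys at the low end of the key space widen the
      // sampled range so RangePartitioner produces numParts bounds.
      val padding = sc.parallelize(0 until numParts)
        .map(i => (Int.MinValue + i, "pad"))
      val padded = data.union(padding)
      val partitioner = new RangePartitioner(numParts, padded)
      // Drop the padding afterwards; a real version would use a
      // sentinel that cannot collide with genuine records.
      padded.partitionBy(partitioner).filter { case (_, v) => v != "pad" }
    }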

SparkR driver side JNI

2015-08-06 Thread Renyi Xiong
Why did SparkR eventually choose an inter-process socket solution on the driver side instead of the in-process JNI shown in one of its docs below (around page 20)? https://spark-summit.org/wp-content/uploads/2014/07/SparkR-Interactive-R-Programs-at-Scale-Shivaram-Vankataraman-Zongheng-Yang.pdf

Re: SparkR driver side JNI

2015-08-06 Thread Shivaram Venkataraman
The in-process JNI only works out when the R process comes up first and we launch a JVM inside it. In many deploy modes like YARN (or actually in anything using spark-submit), the JVM comes up first and we launch R after that. Using an inter-process solution helps us cover both use cases. Thanks
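A toy sketch (not SparkR's actual code) of the spark-submit ordering that motivates the socket approach: the JVM starts first, listens on a socket, then launches the R process and tells it where to connect back. The script name is a placeholder:

    import java.net.ServerSocket

    // JVM side comes up first and picks a free port to listen on.
    val server = new ServerSocket(0)
    val port = server.getLocalPort

    // Launch R afterwards, passing the port so it can dial back.
    // "worker.R" is an illustrative script name.
    val rProcess = new ProcessBuilder("Rscript", "worker.R", port.toString)
      .inheritIO()
      .start()

    // The R process connects to the JVM over this socket.
    val channel = server.accept()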

Workflow manager tool for scheduling spark jobs on cassandra

2015-08-06 Thread Vikram Kone
Hi, I'm looking for open source workflow tools/engines that allow us to schedule Spark jobs on a Cassandra cluster. Since there are tonnes of alternatives out there like Oozie, Azkaban, Luigi, Chronos etc., I wanted to check with people here to see what they are using today. Some of the

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-08-06 Thread Ted Yu
What is the JIRA number, if a JIRA has been logged for this? Thanks On Jan 20, 2015, at 11:30 AM, Cheng Lian lian.cs@gmail.com wrote: Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open an

Why SparkR didn't reuse PythonRDD

2015-08-06 Thread Daniel Li
On behalf of Renyi Xiong - When reading the Spark codebase, it looks to me like PythonRDD.scala is reusable, so I wonder why SparkR chose to implement its own RRDD.scala? Thanks, Daniel

Re:

2015-08-06 Thread Jonathan Winandy
Hello ! I think I found a performant and nice solution based on take's source code: def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = { if (num == 0) { true } else { var count: Int = 0 val totalParts: Int = rdd.partitions.length var partsScanned: Int = 0
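The archive truncates the code above; here is a hedged reconstruction of the rest, modeled on how RDD.take scans a growing number of partitions per pass. The loop body beyond the excerpt is an assumption, not the original message:

    import org.apache.spark.rdd.RDD

    def exists[T](rdd: RDD[T])(qualif: T => Boolean, num: Int): Boolean = {
      if (num == 0) {
        true
      } else {
        var count = 0
        val totalParts = rdd.partitions.length
        var partsScanned = 0
        while (count < num && partsScanned < totalParts) {
          // Like take(): scan an increasing slice of partitions per
          // pass rather than running a job over the whole RDD up front.
          val numPartsToTry = math.max(1, partsScanned)
          val p = partsScanned until
            math.min(partsScanned + numPartsToTry, totalParts)
          val counts = rdd.sparkContext.runJob(
            rdd, (it: Iterator[T]) => it.count(qualif), p)
          count += counts.sum
          partsScanned += p.size
        }
        count >= num
      }
    }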

Re: Why SparkR didn't reuse PythonRDD

2015-08-06 Thread Shivaram Venkataraman
PythonRDD.scala has a number of PySpark-specific conventions (for example, worker reuse, exception handling, etc.) and PySpark-specific protocols (e.g., for communicating accumulators and broadcasts between the JVM and Python). While it might be possible to refactor the two classes to share some more code