Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Reynold Xin
I'm going to start this with a +1! On Thu, Dec 15, 2016 at 9:42 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote: > In addition to the usual binary artifacts, this is the first release where > we have installable packages for Python [1] and R [2] that are part of > the release. I'm

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Shivaram Venkataraman
In addition to the usual binary artifacts, this is the first release where we have installable packages for Python [1] and R [2] that are part of the release. I'm including instructions to test the R package below. Holden / other Python developers can chime in if there are special instructions to

[VOTE] Apache Spark 2.1.0 (RC5)

2016-12-15 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.0 [ ] -1 Do not release this package because ...

Re: spark-core "compile"-scope transitive-dependency on scalatest

2016-12-15 Thread Marcelo Vanzin
I posted a PR; the solution I suggested seems to work (and is simpler than breaking spark-tags into multiple artifacts). On Thu, Dec 15, 2016 at 4:46 PM, Ryan Williams wrote: > ah I see this thread now, thanks; interestingly I don't think the solution > I've

Re: spark-core "compile"-scope transitive-dependency on scalatest

2016-12-15 Thread Ryan Williams
Ah, I see this thread now, thanks; interestingly, I don't think the solution I've proposed here (splitting spark-tags' test-bits into a "-tests" JAR and having spark-core

Re: spark-core "compile"-scope transitive-dependency on scalatest

2016-12-15 Thread Marcelo Vanzin
You're right; we had a discussion here recently about this. I'll re-open that bug, if you want to send a PR. (I think it's just a matter of making the scalatest dependency "provided" in spark-tags, if I remember the discussion.) On Thu, Dec 15, 2016 at 4:15 PM, Ryan Williams
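A minimal sketch of the fix Marcelo describes, written here in sbt syntax for illustration (Spark's primary build is Maven, where the equivalent change is setting <scope>provided</scope> on spark-tags' scalatest dependency; the version string below is a placeholder):

    // In spark-tags' build definition: keep scalatest off the compile classpath
    // that downstream users of spark-core would otherwise inherit.
    libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "provided"  // version is a placeholder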

spark-core "compile"-scope transitive-dependency on scalatest

2016-12-15 Thread Ryan Williams
spark-core depends on spark-tags (compile scope) which depends on scalatest (compile scope), so spark-core leaks test-deps into downstream libraries' "compile"-scope classpath. The cause is that spark-core has logical "test->test" and "compile->compile" dependencies on spark-tags, but spark-tags

Spark 2.1.0-rc3 cut

2016-12-15 Thread Reynold Xin
Committers please use 2.1.1 as the fix version for patches merged into the branch. I will post a voting email once the packaging is done.

Re: Expand the Spark SQL programming guide?

2016-12-15 Thread Anton Okolnychyi
I think it would make sense to show a sample implementation of UserDefinedAggregateFunction for DataFrames, and an example of the Aggregator API for typed Datasets. Jim, what if I submit a PR and you join the review process? I also do not mind splitting this if you want, but it seems to be an
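As a rough sketch of the kind of typed-Dataset Aggregator example being proposed (the Record case class and AverageAgg object are made-up names for illustration):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Record(value: Double)

    // Typed aggregation: a running (sum, count) buffer, finished as an average.
    object AverageAgg extends Aggregator[Record, (Double, Long), Double] {
      def zero: (Double, Long) = (0.0, 0L)
      def reduce(b: (Double, Long), r: Record): (Double, Long) = (b._1 + r.value, b._2 + 1)
      def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) = (b1._1 + b2._1, b1._2 + b2._2)
      def finish(b: (Double, Long)): Double = if (b._2 == 0) 0.0 else b._1 / b._2
      def bufferEncoder: Encoder[(Double, Long)] = Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // Usage on a Dataset[Record]:
    //   ds.select(AverageAgg.toColumn.name("avg_value"))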

Re: Is restarting of SparkContext allowed?

2016-12-15 Thread Alexey Klimov
I also wanted to ask: if this is not the intended way to use SparkContext, how much work would it take to get it working completely correctly? (e.g., are there any other singletons that can preserve state between running different SparkContexts?)

Is restarting of SparkContext allowed?

2016-12-15 Thread Alexey Klimov
Hello, my question is the continuation of a problem I described here. I've done some investigation and found out that nameNode.getDelegationToken is called during
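For reference, the stop-and-recreate pattern in question looks roughly like this (app name and master are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("restart-sketch").setMaster("local[2]")
    var sc = new SparkContext(conf)
    // ... run some jobs ...
    sc.stop()                      // shut down the first context
    sc = new SparkContext(conf)    // then create a fresh one in the same JVM
    // ... run more jobs ...
    sc.stop()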

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Chawla,Sumit
I am already creating these files on the slaves. How can I create an RDD from these slaves? Regards Sumit Chawla On Thu, Dec 15, 2016 at 11:42 AM, Reynold Xin wrote: > You can just write some files out directly (and idempotently) in your > map/mapPartitions functions. It is

Re: Expand the Spark SQL programming guide?

2016-12-15 Thread Jim Hughes
Hi Anton, I'd like to see this as well. I've been working on implementing geospatial user-defined types and functions. Having examples of aggregations and window functions would be awesome! I did test out implementing a distributed convex hull as a UserDefinedAggregateFunction, and that
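For illustration, a minimal UserDefinedAggregateFunction skeleton showing the shape such an aggregation takes (a plain sum is used here as a stand-in; a convex hull aggregate would carry geometry in the buffer instead):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class SumUDAF extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
      def dataType: DataType = DoubleType
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
      def evaluate(buffer: Row): Any = buffer.getDouble(0)
    }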

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Reynold Xin
You can just write some files out directly (and idempotently) in your map/mapPartitions functions. It is just a function in which you can run arbitrary code, after all. On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit wrote: > Any suggestions on this one? > > Regards > Sumit
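A rough, self-contained sketch of that suggestion; the /tmp path and toy data are placeholders, and overwriting the per-partition file keeps a task retry idempotent:

    import java.nio.file.{Files, Paths}
    import org.apache.spark.{SparkConf, SparkContext, TaskContext}

    object SideEffectSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("side-effect-sketch").setMaster("local[4]"))
        val data = sc.parallelize(1 to 100, numSlices = 4)
        val withFiles = data.mapPartitions { iter =>
          val rows = iter.toVector
          // One temp file per partition; Files.write overwrites by default, so a retry is idempotent.
          val out = Paths.get(s"/tmp/step-b-part-${TaskContext.getPartitionId()}.txt")
          Files.write(out, rows.mkString("\n").getBytes("UTF-8"))
          rows.iterator
        }
        withFiles.count()  // force evaluation so the files are actually written
        sc.stop()
      }
    }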

Re: Output Side Effects for different chain of operations

2016-12-15 Thread Chawla,Sumit
Any suggestions on this one? Regards Sumit Chawla On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit wrote: > Hi All > > I have a workflow with different steps in my program. Let's say these are > steps A, B, C, D. Step B produces some temp files on each executor node. >

Mistake in Apache Spark Java.

2016-12-15 Thread Mario Fernandez Villa
Hello, my name is Mario Fernández and I'm a Big Data developer. I usually program in Apache Spark in Java, and we have a big problem reading a CSV file properly. The issue is that when I want to read a CSV file, for instance with a semicolon delimiter, the DataFrame takes the semicolon as
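One possible workaround, sketched here in Scala (the path and reader options are placeholders; the Java DataFrameReader takes the same options):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-sketch").getOrCreate()
    val df = spark.read
      .option("sep", ";")           // tell the CSV reader the delimiter is ';'
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/file.csv")
    df.show()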

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-15 Thread Reynold Xin
In general this falls directly into the domain of external cluster managers (YARN, Mesos, Kub). The standalone thing was meant as a simple way to deploy Spark, and we gotta be careful with introducing a lot more features to it, because then it becomes just a full-fledged cluster manager and is

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-15 Thread Hegner, Travis
Thanks for the response, Jörn. This patch is intended only for Spark standalone. My understanding of the YARN cgroup support is that it only limits CPU, rather than allocating it based on a priority or shares system. This could be old documentation that I'm remembering, however. Another issue

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-15 Thread Jörn Franke
Hi, what about YARN or Mesos used in combination with Spark? They also have cgroups. Or a Kubernetes etc. deployment. > On 15 Dec 2016, at 17:37, Hegner, Travis wrote: > > Hello Spark Devs, > > > I have finally completed a mostly working proof of concept. I do not want

Forking or upgrading Apache Parquet in Spark

2016-12-15 Thread Dongjoon Hyun
Hi, All. I made a PR to upgrade Parquet to 1.9.0 for Apache Spark 2.2 in late March. - https://github.com/apache/spark/pull/16281 Currently, there are some important options around that. Here is the summary. 1. Fork Parquet 1.8.X and maintain it, like the Spark Hive fork. 2. Wait and see for

Re: SPARK-18689: A proposal for priority based app scheduling utilizing linux cgroups.

2016-12-15 Thread Hegner, Travis
Hello Spark Devs, I have finally completed a mostly working proof of concept. I do not want to create a pull request for this code, as I don't believe it's production worthy at the moment. My intent is to better communicate what I'd like to accomplish. Please review the following patch:

Re: Document Similarity -Spark Mllib

2016-12-15 Thread Liang-Chi Hsieh
OK. I went to check the DIMSUM implementation in Spark MLlib. The probability that a column is sampled is decided by math.sqrt(10 * math.log(nCol) / threshold) / colMagnitude. The most influential parameter is colMagnitude. If, in your dataset, the colMagnitude for most columns is very low, then it looks
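For context, a small sketch of how the DIMSUM threshold is passed in MLlib (sc is assumed to be an existing SparkContext and the vectors are toy data); lowering the threshold samples more columns at higher cost:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.dense(0.0, 2.0, 1.0)))
    val mat = new RowMatrix(rows)
    val exact  = mat.columnSimilarities()       // brute force, no sampling
    val approx = mat.columnSimilarities(0.1)    // DIMSUM with threshold 0.1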

Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Bertrand Dechoux
Nice! I am especially interested in Bayesian Networks, which are only one of the many models that can be expressed by a factor graph representation. Do you do Bayesian network learning at scale (parameters and structure) with latent variables? Are you using publicly available tools for that?

Expand the Spark SQL programming guide?

2016-12-15 Thread Anton Okolnychyi
Hi, I am wondering whether it makes sense to expand the Spark SQL programming guide with examples of aggregations (including user-defined ones via the Aggregator API) and window functions. For instance, there might be a separate subsection under "Getting Started" for each functionality. SPARK-16046