Adding/Using More Resolution Types on JIRA

2015-05-12 Thread Patrick Wendell
In Spark we sometimes close issues as something other than Fixed, and this is an important part of maintaining our JIRA. The current resolution types we use are the following:
Won't Fix - bug fix or (more often) feature we don't want to add
Invalid - issue is underspecified or not appropriate

Re: Adding/Using More Resolution Types on JIRA

2015-05-12 Thread Nicholas Chammas
I tend to find that any large project has a lot of walking dead JIRAs, and pretending they are simply Open causes problems. Any state is better for these, so I favor this. Agreed. 1. Inactive: A way to clear out inactive/dead JIRA’s without indicating a decision has been made one way or

s3 vfs on Mesos Slaves

2015-05-12 Thread Stephen Carman
We have a small Mesos cluster, and these slaves need to have a vfs set up on them so that they can pull down the data they need from S3 when Spark runs. There doesn’t seem to be any obvious guidance online on how to do this or how to accomplish it easily. Does anyone have some best practices or

Sharing memory across applications/integration

2015-05-12 Thread Alexey Goncharuk
Hello Spark community, I am currently trying to implement a proof-of-concept RDD that will allow integrating Apache Spark and Apache Ignite (incubating) [1]. My original idea was to embed an Ignite node in Spark's worker process, in order for the user code to have direct access to in-memory

Re: Change for submitting to yarn in 1.3.1

2015-05-12 Thread Marcelo Vanzin
On Tue, May 12, 2015 at 11:34 AM, Kevin Markey kevin.mar...@oracle.com wrote: I understand that SparkLauncher was supposed to address these issues, but it really doesn't. Yarn already provides indirection and an arm's length transaction for starting Spark on a cluster. The launcher introduces

[build system] brief downtime tomorrow morning (5-12-15, 7am PDT)

2015-05-12 Thread shane knapp
i will need to restart jenkins to finish a plugin install and resolve https://issues.apache.org/jira/browse/SPARK-7561 this will be very brief, and i'll retrigger any errant jobs i kill. please let me know if there are any comments/questions/concerns. thanks! shane

[IMPORTANT] Committers please update merge script

2015-05-12 Thread Patrick Wendell
Due to an ASF infrastructure change (bug?) [1] the default JIRA resolution status has switched to Pending Closed. I've made a change to our merge script to coerce the correct status of Fixed when resolving [2]. Please upgrade the merge script to master. I've manually corrected JIRA's that were

Re: [PySpark DataFrame] When a Row is not a Row

2015-05-12 Thread Davies Liu
The class (called Row) for rows from Spark SQL is created on the fly and is different from pyspark.sql.Row (which is a public API for users to create Rows). The reason we did it this way is that we want better performance when accessing the columns. Basically, the rows are just named tuples
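The distinction described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not PySpark's actual implementation: `PublicRow` and `make_row_class` are hypothetical names standing in for a user-facing Row class and for whatever builds the per-schema class on the fly.

```python
from collections import namedtuple


class PublicRow(tuple):
    """Hypothetical stand-in for a user-facing Row class."""


def make_row_class(field_names):
    # Build a tuple subclass on the fly; its fields are reachable
    # both by attribute name and by position.
    return namedtuple("Row", field_names)


# The generated class is also called "Row", but it is a separate class.
GeneratedRow = make_row_class(["name", "age"])
row = GeneratedRow("Alice", 30)

# Name-based and index-based access both work on the generated row.
name_by_attr = row.name
age_by_index = row[1]

# An isinstance check against the other class fails, even though
# both classes present themselves as rows.
is_public_row = isinstance(row, PublicRow)
```

Under these assumptions, a check such as `isinstance(row, PublicRow)` is false even though `type(row).__name__` is `"Row"`, which is one way a "Row" can fail to be a "Row".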

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-12 Thread fightf...@163.com
Hi there, which version are you using? Actually, the problem seems to have gone away after we changed our Spark version from 1.2.0 to 1.3.0. Not sure what the internal changes did. Best, Sun. fightf...@163.com From: Night Wolf Date: 2015-05-12 22:05 To: fightf...@163.com CC: Patrick Wendell; user; dev

Re: Getting Access is denied error while cloning Spark source using Eclipse

2015-05-12 Thread Akhil Das
Maybe you should check where exactly it's throwing the permission denied error (possibly while trying to write to some directory). You can also try manually cloning the git repo to a directory and then opening that in Eclipse. Thanks Best Regards On Tue, May 12, 2015 at 3:46 PM, Chandrashekhar Kotekar

Getting Access is denied error while cloning Spark source using Eclipse

2015-05-12 Thread Chandrashekhar Kotekar
Hi, I am trying to clone the Spark source using Eclipse. After I provide the Spark source URL, Eclipse downloads some code, which I can see in the download location, but as soon as the download reaches 99% Eclipse throws a "Git repository clone failed. Access is denied" error. Has anyone encountered such a

@since version tag for all dataframe/sql methods

2015-05-12 Thread Reynold Xin
I added @since version tag for all public dataframe/sql methods/classes in this patch: https://github.com/apache/spark/pull/6101/files From now on, if you merge anything related to DF/SQL, please make sure the public functions have @since tag. Thanks.

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Night Wolf
I'm seeing a similar thing with a slightly different stack trace. Ideas? org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150) org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-12 Thread Night Wolf
Seeing similar issues, did you find a solution? One would be to increase the number of partitions if you're doing lots of object creation. On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com fightf...@163.com wrote: Hi, patrick Really glad to get your reply. Yes, we are doing group by

Re: Adding/Using More Resolution Types on JIRA

2015-05-12 Thread Sean Owen
I tend to find that any large project has a lot of walking dead JIRAs, and pretending they are simply Open causes problems. Any state is better for these, so I favor this. The possible objection is that this will squash or hide useful issues, but in practice we have the opposite problem. Resolved

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Matei Zaharia
It could also be that your hash function is expensive. What is the key class you have for the reduceByKey / groupByKey? Matei On May 12, 2015, at 10:08 AM, Night Wolf nightwolf...@gmail.com wrote: I'm seeing a similar thing with a slightly different stack trace. Ideas?
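To see why the cost of the key's hash function matters here, consider a minimal plain-Python sketch of a hash-map-based reduce by key. It is not Spark's AppendOnlyMap; `CountingKey` and `reduce_by_key` are hypothetical names used only for illustration, and the sketch simply counts how often the key's hash is computed.

```python
class CountingKey:
    """Hypothetical key class that counts how often it is hashed."""

    hash_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # Every map lookup or update on this key lands here, so an
        # expensive implementation is paid at least once per record.
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.value == other.value


def reduce_by_key(pairs, func):
    # Merge values that share a key using an in-memory hash map.
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return merged


pairs = [(CountingKey("a"), 1), (CountingKey("b"), 2), (CountingKey("a"), 3)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
calls_during_reduce = CountingKey.hash_calls
```

In this sketch the hash is computed at least once for every input record, so a key class with a slow `__hash__` (or, on the JVM, a slow `hashCode`) would add that cost to every record passing through the map.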