RE: Handling questions in the mailing lists

2016-11-23 Thread assaf.mendelson
Sorry to reawaken this, but I just noticed it is possible to propose new topic-specific 
Stack Exchange sites (http://area51.stackexchange.com/faq). So for example we might 
have a spark.stackexchange.com Spark-specific site.
The advantages of such a site are many. First of all, it is Spark specific. 
Secondly, people's reputation would be based on Spark rather than on general 
questions. Lastly (and most importantly, in my opinion), it would have Spark-focused 
moderators rather than moderators for general technology.

The process of creating such a site is not complicated. Basically, someone creates 
a proposal (I have no problem doing so), then creates 5 example questions (the kind 
we want on the site), and 5 people need to 'follow' it within 3 days. This starts a 
"definition" phase, whose goal is to get at least 40 questions that embody the goal 
of the site, each with at least 10 net votes, and enough people following. When 
enough traction has been gained (enough questions and enough followers), the site 
moves to a commitment phase. In this phase users "commit" to being on the site 
(basically this is aimed at verifying that the community of experts is big enough). 
Once all this happens the site moves into beta: the site becomes active, and it will 
become a full site if it sees enough traction.

I would suggest trying to set this up.

Thanks,
Assaf


From: Denny Lee [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19916...@n3.nabble.com]
Sent: Wednesday, November 16, 2016 4:33 PM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

Awesome stuff! Thanks Sean! :-)
On Wed, Nov 16, 2016 at 05:57 Sean Owen <[hidden email]> wrote:
I updated the wiki to point to the /community.html page. (We're going to 
migrate the wiki real soon now anyway)

I updated the /community.html page per this thread too. PR: 
https://github.com/apache/spark-website/pull/16


On Tue, Nov 15, 2016 at 2:49 PM assaf.mendelson <[hidden email]> wrote:

Should probably also update the helping others section in the how to contribute 
page (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingbyHelpingOtherUsers)

Assaf.



From: Denny Lee [via Apache Spark Developers List] [mailto:[hidden email]]
Sent: Sunday, November 13, 2016 8:52 AM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

Hey Reynold,

Looks like we have all of the proposed changes in the Proposed Community Mailing 
Lists / StackOverflow Changes doc. Anything else we can do to update the Spark 
Community page / welcome email?

Meanwhile, let's all start answering questions on SO, eh?! :)

Denny

On Thu, Nov 10, 2016 at 1:54 PM Holden Karau <[hidden email]> wrote:

That's a good question. Looking at http://stackoverflow.com/tags/apache-spark/topusers 
shows a few contributors who have already been active on SO, including some committers 
and PMC members with very high overall SO reputations who could handle any administrative 
needs (as well as a number of other contributors besides just PMC/committers).

On Wed, Nov 9, 2016 at 2:18 AM, assaf.mendelson <[hidden email]> wrote:

I was just wondering, before we move on to SO:

Do we have enough contributors with enough reputation to manage things on SO?

We would need contributors with enough reputation to have the relevant privileges.

For example: creating tags (requires 1,500 reputation), editing questions and 
answers (2,000), creating tag synonyms (2,500), approving tag wiki edits (5,000), 
access to moderator tools (10,000; this is required to delete questions etc.), 
and protecting questions (15,000).

All of these are important if we plan to have SO as a main resource.

I know I originally suggested SO; however, if we do not have contributors with 
the required privileges and the willingness to help manage everything, then I am 
not sure this is a good fit.

Assaf.

From: Denny Lee [via Apache Spark Developers List] [mailto:[hidden email]]
Sent: Wednesday, November 09, 2016 9:54 AM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

Agreed that by simply just moving the questions to SO will not solve anything 
but I think the call out about the meta-tags is that we need to abide by SO 
rules and if we were

FOSDEM 2017 HPC, Bigdata and Data Science DevRoom CFP is closing soon

2016-11-23 Thread Roman Shaposhnik
Hi!

apologies for the extra wide distribution (this exhausts my once
a year ASF mail-to-all-bigdata-projects quota ;-)) but I wanted
to suggest that all of you should consider submitting talks
to FOSDEM 2017 HPC, Bigdata and Data Science DevRoom:
https://hpc-bigdata-fosdem17.github.io/

It was a great success this year and we hope to make it an even
bigger success in 2017.

Besides -- FOSDEM is the biggest gathering of open source
developers on the face of the earth -- don't miss it!

Thanks,
Roman.

P.S. If you have any questions -- please email me directly and
see you all in Brussels!




Re: Memory leak warnings in Spark 2.0.1

2016-11-23 Thread Nicholas Chammas
👍 Thanks for the reference and PR.

On Wed, Nov 23, 2016 at 2:59 AM Reynold Xin  wrote:

> See https://issues.apache.org/jira/browse/SPARK-18557
> 
>
> On Mon, Nov 21, 2016 at 1:16 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> I'm also curious about this. Is there something we can do to help
> troubleshoot these leaks and file useful bug reports?
>
> On Wed, Oct 12, 2016 at 4:33 PM vonnagy  wrote:
>
> I am getting excessive memory leak warnings when running multiple mappings
> and aggregations using Datasets. Is there anything I should be looking for
> to resolve this, or is this a known issue?
>
> WARN  [Executor task launch worker-0]
> org.apache.spark.memory.TaskMemoryManager - leak 16.3 MB memory from
> org.apache.spark.unsafe.map.BytesToBytesMap@33fb6a15
> WARN  [Executor task launch worker-0]
> org.apache.spark.memory.TaskMemoryManager - leak a page:
> org.apache.spark.unsafe.memory.MemoryBlock@29e74a69 in task 88341
> WARN  [Executor task launch worker-0]
> org.apache.spark.memory.TaskMemoryManager - leak a page:
> org.apache.spark.unsafe.memory.MemoryBlock@22316bec in task 88341
> WARN  [Executor task launch worker-0] org.apache.spark.executor.Executor -
> Managed memory leak detected; size = 17039360 bytes, TID = 88341
>
> Thanks,
>
> Ivan
>
>
>


Re: Spark Wiki now migrated to spark.apache.org

2016-11-23 Thread Nicholas Chammas
Same here. Nice to be able to deprecate most of the docs living on the wiki
and refer to them on GitHub.

On Wed, Nov 23, 2016 at 11:54 AM Holden Karau  wrote:

> That's awesome thanks for doing the migration :)
>
> On Wed, Nov 23, 2016 at 3:29 AM Sean Owen  wrote:
>
> I completed the migration. You can see the results live right now at
> http://spark.apache.org, and
> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>
> A summary of the changes:
> https://issues.apache.org/jira/browse/SPARK-18073
>
> The substance of the changes:
> https://github.com/apache/spark-website/pull/19
> https://github.com/apache/spark-website/pull/25
>
> No information has been lost. Old wikis still either exist as they were
> with an "end of life" notice, or point to the new location of the
> information.
>
> I ported the content basically as-is, only making minor changes to fix
> obviously out of date content.
>
> I did alter the menu structure, most significantly to add a "Developer"
> menu.
>
> Of course, we can change it further. Please comment if you see any errors,
> don't like some of the choices, etc.
>
> Note that all the release docs are now also updated accordingly in branch
> 2.1.
>
>


Re: Spark Wiki now migrated to spark.apache.org

2016-11-23 Thread Holden Karau
That's awesome thanks for doing the migration :)

On Wed, Nov 23, 2016 at 3:29 AM Sean Owen  wrote:

> I completed the migration. You can see the results live right now at
> http://spark.apache.org, and
> https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>
> A summary of the changes:
> https://issues.apache.org/jira/browse/SPARK-18073
>
> The substance of the changes:
> https://github.com/apache/spark-website/pull/19
> https://github.com/apache/spark-website/pull/25
>
> No information has been lost. Old wikis still either exist as they were
> with an "end of life" notice, or point to the new location of the
> information.
>
> I ported the content basically as-is, only making minor changes to fix
> obviously out of date content.
>
> I did alter the menu structure, most significantly to add a "Developer"
> menu.
>
> Of course, we can change it further. Please comment if you see any errors,
> don't like some of the choices, etc.
>
> Note that all the release docs are now also updated accordingly in branch
> 2.1.
>


[Spark Thriftserver] connection timeout option?

2016-11-23 Thread Artur Sukhenko
Hello devs,

Let's say there are users who are using Tableau desktop + Spark + Spark SQL.
Hive Server 2 is installed, but they use Spark for the Thrift Server connection.

They are trying to configure Spark to drop the Thrift connection when a specific
user is inactive and the timer expires.

The available parameters (mentioned below) which could achieve this do not work
for the Spark Thrift Server:

  1- hive.server2.long.polling.timeout
  2- hive.server2.idle.session.timeout
  3- hive.server2.idle.operation.timeout


Do we have any other available parameter which can achieve an idle timeout for
the Spark Thrift Server?
Would this require new development, or is it even possible?
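
To make the experiment concrete, here is a minimal sketch (assuming the
spark-hive-thriftserver module is on the classpath; the timeout values are
placeholders) of starting the Thrift Server programmatically and passing the
Hive idle-timeout properties through the session config. Whether the Spark
Thrift Server actually honors these properties is exactly the open question above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerIdleTimeoutSketch {
  def main(args: Array[String]): Unit = {
    // Hedged sketch only: pass the Hive idle-timeout properties through the
    // session config and start a HiveServer2-compatible endpoint backed by it.
    val spark = SparkSession.builder()
      .appName("thriftserver-idle-timeout-sketch")
      .config("hive.server2.idle.session.timeout", "1800000")   // 30 min, in ms (placeholder)
      .config("hive.server2.idle.operation.timeout", "600000")  // 10 min, in ms (placeholder)
      .enableHiveSupport()
      .getOrCreate()

    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}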

Sincerely,

Artur
-- 
--
Artur Sukhenko


PowerIterationClustering can't handle "large" files

2016-11-23 Thread Lydia Ickler
Hi all,

I have a question regarding Power Iteration Clustering. I have an input file (a 
tab-separated edge list) which I read in and map to the required format of 
RDD[(Long, Long, Double)] to then apply PIC. So far so good… The implementation 
works fine if the input is small (up to 50 MB), but it crashes if I try to apply 
it to a file of size 650 MB.

My technical setup is a compute cluster with 1 master and 2 workers. The executor 
memory is set to 50 GB and in total 24 cores are available. Is it normal that the 
program crashes at such a file size?

I attached my program code as well as the error output. I hope someone can help me!

Best regards,
Lydia
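
For reference, a minimal sketch of the pipeline described above (the attached
PIC.scala is not readable here, so the input path, delimiter handling, k, and
iteration count below are placeholders rather than Lydia's actual code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.PowerIterationClustering

object PICSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pic-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Tab-separated edge list: srcId \t dstId \t similarity
    val similarities = sc.textFile("hdfs:///data/edges.tsv").map { line =>
      val Array(src, dst, weight) = line.split("\t")
      (src.toLong, dst.toLong, weight.toDouble)
    }

    val model = new PowerIterationClustering()
      .setK(10)              // number of clusters (placeholder)
      .setMaxIterations(20)  // placeholder
      .run(similarities)

    model.assignments.take(10).foreach(a => println(s"${a.id} -> ${a.cluster}"))
    spark.stop()
  }
}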

PIC.scala
Description: Binary data
16/11/23 13:34:19 INFO spark.SparkContext: Running Spark version 2.1.0-SNAPSHOT
16/11/23 13:34:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
16/11/23 13:34:20 INFO spark.SecurityManager: Changing view acls to: icklerly
16/11/23 13:34:20 INFO spark.SecurityManager: Changing modify acls to: icklerly
16/11/23 13:34:20 INFO spark.SecurityManager: Changing view acls groups to: 
16/11/23 13:34:20 INFO spark.SecurityManager: Changing modify acls groups to: 
16/11/23 13:34:20 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(icklerly); groups 
with view permissions: Set(); users  with modify permissions: Set(icklerly); 
groups with modify permissions: Set()
16/11/23 13:34:20 INFO util.Utils: Successfully started service 'sparkDriver' 
on port 36371.
16/11/23 13:34:20 INFO spark.SparkEnv: Registering MapOutputTracker
16/11/23 13:34:20 INFO spark.SparkEnv: Registering BlockManagerMaster
16/11/23 13:34:20 INFO storage.BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
16/11/23 13:34:20 INFO storage.BlockManagerMasterEndpoint: 
BlockManagerMasterEndpoint up
16/11/23 13:34:20 INFO storage.DiskBlockManager: Created local directory at 
/tmp/blockmgr-80b089a7-be21-4d14-ab6f-7e0ef1f14396
16/11/23 13:34:20 INFO memory.MemoryStore: MemoryStore started with capacity 
396.3 MB
16/11/23 13:34:20 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/11/23 13:34:20 INFO util.log: Logging initialized @1120ms
16/11/23 13:34:20 INFO server.Server: jetty-9.2.z-SNAPSHOT
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@3543df7d{/jobs,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@7c541c15{/jobs/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@3542162a{/jobs/job,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@698122b2{/jobs/job/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@4212a0c8{/stages,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@1e7aa82b{/stages/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@3b2c0e88{/stages/stage,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@5bd82fed{/stages/stage/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@c1bd0be{/stages/pool,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@476b0ae6{/stages/pool/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@1c6804cd{/storage,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@655f7ea{/storage/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@549949be{/storage/rdd,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@4b3a45f1{/storage/rdd/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@17a87e37{/environment,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@3eeb318f{/environment/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@20a14b55{/executors,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@39ad977d{/executors/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@6da00fb9{/executors/threadDump,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@a202ccb{/executors/threadDump/json,null,AVAILABLE}
16/11/23 13:34:20 INFO handler.ContextHandler: Started 
o.s.j.s.ServletContextHandler@20f12539{/static,null,AV

Aggregating over sorted data

2016-11-23 Thread assaf.mendelson
Hi,
An issue I have encountered frequently is the need to look at data in an 
ordered manner per key.
A common way of doing this can be seen in classic MapReduce, where the shuffle 
stage provides sorted data per key, so one can do a lot with that.
It is of course relatively easy to achieve this using RDDs, but that would mean 
moving to RDDs and back, which has a non-negligible performance penalty (beyond 
the fact that we lose any Catalyst optimization).
We can use SQL window functions, but that is not an ideal solution either. 
Beyond the fact that window functions are much slower than aggregate functions 
(as we need to generate a result for every record), we also can't combine them 
(i.e. if we have two window functions on the same window, it is still two 
separate scans).

Ideally, I would have liked to see something like 
df.groupBy(col1).sortBy(col2).agg(...) and have the aggregations work on the 
sorted data. That would make it possible to use both the existing functions and 
UDAFs where we can assume the order (and do any windowing we need as part of the 
function itself, which is relatively easy in many cases).
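
To make the idea concrete, here is a minimal sketch of the Dataset-level
workaround being discussed (the column names and the "first value per key in
col2 order" aggregation are placeholders, not from the original mail):
repartition by the key, sort within partitions, and walk each partition relying
on that order.

import org.apache.spark.sql.SparkSession

object SortedAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sorted-agg-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 2, 10.0), ("a", 1, 5.0), ("b", 1, 7.0))
      .toDF("col1", "col2", "value")

    // All rows for a given col1 land in the same partition, sorted by col2,
    // so the per-partition iterator below sees each key's records in order.
    val result = df
      .repartition($"col1")
      .sortWithinPartitions($"col1", $"col2")
      .mapPartitions { rows =>
        // Example "ordered" aggregation: keep the first value per key in col2 order.
        rows.map(r => (r.getString(0), r.getDouble(2)))
          .foldLeft(Map.empty[String, Double]) { case (acc, (k, v)) =>
            if (acc.contains(k)) acc else acc + (k -> v)
          }
          .iterator
      }
      .toDF("col1", "first_value")

    result.show()
  }
}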

I have tried to look for this and seen many questions on the subject but no 
answers.

I was hoping I missed something (I have seen that the SQL CLUSTER BY command 
repartitions and sorts accordingly but from my understanding it does not 
promise that this would remain true if we do a groupby afterwards). If I 
didn't, I believe this should be a feature to add (I can open a JIRA if people 
think it is a good idea).
Assaf.






Spark Wiki now migrated to spark.apache.org

2016-11-23 Thread Sean Owen
I completed the migration. You can see the results live right now at
http://spark.apache.org, and
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage

A summary of the changes:
https://issues.apache.org/jira/browse/SPARK-18073

The substance of the changes:
https://github.com/apache/spark-website/pull/19
https://github.com/apache/spark-website/pull/25

No information has been lost. Old wikis still either exist as they were
with an "end of life" notice, or point to the new location of the
information.

I ported the content basically as-is, only making minor changes to fix
obviously out of date content.

I did alter the menu structure, most significantly to add a "Developer"
menu.

Of course, we can change it further. Please comment if you see any errors,
don't like some of the choices, etc.

Note that all the release docs are now also updated accordingly in branch 2.1.


Re: Is it possible to pass "-javaagent=customAgent.jar" into spark as a JAVA_OPTS

2016-11-23 Thread Artur Sukhenko
Hello Zak,

I believe this video from Spark Summit would be useful for you:
https://youtu.be/EB1-7AXQOhM

They are talking about extending Spark with Java agents.
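
Not from the video, just a hedged illustration of the usual knobs: the
executor-side agent can be attached through spark.executor.extraJavaOptions
(settable in code before the executors launch), while the driver-side option
generally has to go through spark-submit or spark-defaults.conf because the
driver JVM is already running; SPARK_DAEMON_JAVA_OPTS only affects the
standalone master/worker daemons. The agent path below is a placeholder.

import org.apache.spark.sql.SparkSession

object AgentOptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("javaagent-sketch")
      // Executors are launched after this conf is set, so this takes effect;
      // the jar path is hypothetical and must exist on every worker node.
      .config("spark.executor.extraJavaOptions",
        "-javaagent:/opt/agents/customAgent.jar")
      // For the driver, pass the equivalent flag via spark-submit
      // (e.g. --conf spark.driver.extraJavaOptions=... or --driver-java-options),
      // because the driver JVM has already started by the time this code runs.
      .getOrCreate()

    spark.range(100).count() // trivial job so the executors actually start
    spark.stop()
  }
}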

On Tue, Nov 22, 2016, 23:50 Zak H  wrote:

> Hi,
>
> I'm interested in passing an agent that will expose jmx metrics from spark
> to my agent. I wanted to know if anyone has tried this and if so what
> environment variable do I need to set ?
>
> Do I set: $SPARK_DAEMON_JAVA_OPTS ??
>
>
> http://docs.oracle.com/javase/7/docs/api/java/lang/instrument/package-summary.html
>
>
> Thank you,
> Zak Hassan
>
> --
--
Artur Sukhenko


Re: view canonicalization - looking for database gurus to chime in

2016-11-23 Thread Jiang Xingbo
Hi all,

I have recently prepared a design document for robust Spark SQL view
canonicalization. In the doc we define the expected behavior and describe
a late-binding approach; views created by older versions of Spark/Hive are
still supposed to work under this new approach.
For more details, please review:
https://docs.google.com/document/d/16NEDA9ejzAQAkm_tRqEVpaUAXgXuUo5gTQsbnItOUsA

This doc was written under the guidance of Herman van Hovell; I've learned
a lot from the process.
Thanks to Reynold Xin and Srinath Shankar for the valuable advice and for
helping improve the quality of this design doc.
Any suggestions and discussion are welcome. Thank you all!

Sincerely,
Xingbo


