Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread Pramod Biligiri
I hadn't turned on codegen. I enabled it and ran it again, and it is running 4-5 times faster now! :) Since my log statements are no longer appearing, I presume the code path is quite different from the earlier hashmap-related stuff in Aggregates.scala? Pramod On Wed, May 20, 2015 at 9:18 PM,
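For reference, the codegen switch discussed in this thread is a session configuration in Spark 1.x SparkSQL. A minimal sketch of enabling it (the `sc` value and the `logs` table name are illustrative assumptions, not from the thread):

```scala
// Sketch, Spark 1.x: enable SparkSQL code generation so GROUP BY aggregates
// run through GeneratedAggregate rather than the interpreted, hashmap-based
// path in Aggregates.scala. Assumes an existing SparkContext `sc` and a
// registered table named "logs".
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.codegen", "true")
sqlContext.sql("SELECT key, COUNT(*) AS c FROM logs GROUP BY key").collect()
```

The same setting can be toggled from the SQL shell with `SET spark.sql.codegen=true`.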

Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread Reynold Xin
Yup it is a different path. It runs GeneratedAggregate. On Wed, May 20, 2015 at 11:43 PM, Pramod Biligiri pramodbilig...@gmail.com wrote: I hadn't turned on codegen. I enabled it and ran it again, it is running 4-5 times faster now! :) Since my log statements are no longer appearing, I

Re: Spark Streaming - Design considerations/Knobs

2015-05-21 Thread Hemant Bhanawat
Honestly, given the length of my email, I didn't expect a reply. :-) Thanks for reading and replying. However, I have a follow-up question: I don't think I understand block replication completely. Are the blocks replicated immediately after they are received by the receiver? Or are they

Re: Resource usage of a spark application

2015-05-21 Thread Peter Prettenhofer
Thanks Akhil, Ryan! @Akhil: YARN can only tell me how many vcores my app has been granted, but not actual CPU usage, right? Pulling mem/cpu usage from the OS means I need to map JVM executor processes to the context they belong to, right? @Ryan: what a great blog post -- this is super relevant

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Tathagata Das
Looks like the file size reported by the FSInputDStream of Tachyon's FileSystem interface is somehow returning zero. On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Just to follow up this thread further. I was doing some fault tolerance testing

Re: Adding/Using More Resolution Types on JIRA

2015-05-21 Thread Santiago Mola
Some examples to illustrate my point. A couple of issues from the oldest open issues in the SQL component: [SQL] spark-sql exits while encountered an error https://issues.apache.org/jira/browse/SPARK-4572 This is an incomplete report that nobody can take action on. It can be resolved as

Re: Tungsten's Vectorized Execution

2015-05-21 Thread Davies Liu
We have not started to prototype the vectorized one yet; it will be evaluated in 1.5 and may be targeted for 1.6. We'd be glad to hear feedback/suggestions/comments from your side! On Thu, May 21, 2015 at 9:37 AM, Yijie Shen henry.yijies...@gmail.com wrote: Hi all, I’ve seen the Blog of Project

Re: Adding/Using More Resolution Types on JIRA

2015-05-21 Thread Sean Owen
On Thu, May 21, 2015 at 9:06 PM, Santiago Mola sm...@stratio.com wrote: Inactive - A feature or bug that has had no activity from users or developers in a long time Why is this needed? Every JIRA listing can be sorted by activity. That gets the inactive ones out of your view quickly. I do not

Re: Adding/Using More Resolution Types on JIRA

2015-05-21 Thread Santiago Mola
2015-05-12 9:50 GMT+02:00 Patrick Wendell pwend...@gmail.com: Inactive - A feature or bug that has had no activity from users or developers in a long time Why is this needed? Every JIRA listing can be sorted by activity. That gets the inactive ones out of your view quickly. I do not see any

Re: Adding/Using More Resolution Types on JIRA

2015-05-21 Thread Sean Owen
On Thu, May 21, 2015 at 10:03 PM, Santiago Mola sm...@stratio.com wrote: Sure. That is why I was talking about the Inactive resolution specifically. The combination of Priority + other statuses is enough to solve these issues. A minor/trivial issue that is incomplete is probably not going to

Re: Contribute code to MLlib

2015-05-21 Thread Trevor Grant
Thank you Ram and Joseph. I am also hoping to contribute to MLlib once my Scala gets up to snuff; this is the guidance I needed for how to proceed when ready. Best wishes, Trevor On Wed, May 20, 2015 at 1:55 PM, Joseph Bradley jos...@databricks.com wrote: Hi Trevor, I may be repeating what

Re: Resource usage of a spark application

2015-05-21 Thread Akhil Das
Yes, Peter, that's correct: you need to identify the processes, and with that you can pull the actual usage metrics. Thanks Best Regards On Thu, May 21, 2015 at 2:52 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Thanks Akhil, Ryan! @Akhil: YARN can only tell me how much vcores my

Re: Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread zhangxiongfei
Hi Pramod, Is your data compressed? I encountered a similar problem; however, after turning codegen on, the GC time was still very long. The input data for my map task is about a 100M lzo file. My query is: select ip, count(*) as c from stage_bitauto_adclick_d group by ip sort by c limit 100

Re: Resource usage of a spark application

2015-05-21 Thread Ryan Williams
On Thu, May 21, 2015 at 5:22 AM Peter Prettenhofer peter.prettenho...@gmail.com wrote: Thanks Akhil, Ryan! @Akhil: YARN can only tell me how much vcores my app has been granted but not actual cpu usage, right? Pulling mem/cpu usage from the OS means i need to map JVM executor processes to

Why use lib_managed for the Sbt build?

2015-05-21 Thread Iulian Dragoș
I’m trying to understand why Sbt is configured to pull all libs under lib_managed. - it seems like unnecessary duplication (I will have those libraries under ~/.m2, via Maven, anyway) - every time I call make-distribution I lose lib_managed (via mvn clean install) and have to wait to
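For context, the lib_managed behavior being questioned comes from SBT's `retrieveManaged` setting key; a minimal sketch of the relevant build configuration (the key is SBT's own, but whether Spark's build enables it in exactly this form is an assumption):

```scala
// build.sbt (sbt 0.13.x era): copy every managed dependency under lib_managed/.
// Setting this to false leaves jars only in the Ivy cache (~/.ivy2/cache),
// avoiding the duplication and the re-fetch after the directory is wiped.
retrieveManaged := true
```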

Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Kevin Markey
This is an excellent discussion. As mentioned in an earlier email, we agree with a number of Chester's suggestions, but we have yet other concerns. I've researched this further in the past several days, and I've queried my team. This email attempts to

Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Marcelo Vanzin
Hi Kevin, I read through your e-mail and I see two main things you're talking about. - You want a public YARN Client class and don't really care about anything else. In your message you already mention why that's not a good idea. It's much better to have a standardized submission API. As you

Re: Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread Pramod Biligiri
Hi Zhang, No, my data is not compressed. I'm trying to minimize the load on the CPU. The GC time reduced for me after codegen. Pramod On Thu, May 21, 2015 at 3:43 AM, zhangxiongfei zhangxiongfei0...@163.com wrote: Hi Pramod Is your data compressed? I encountered similar problem,however,

Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Nathan Kronenfeld
In researching and discussing these issues with Cloudera and others, we've been told that only one mechanism is supported for starting Spark jobs: the *spark-submit* scripts. Is this new? We've been submitting jobs directly from a programmatically created spark context (instead of through
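For illustration, the "programmatically created spark context" pattern described here looks roughly like this in Spark 1.x (the master URL and app name are placeholders; this is a sketch, not a claim about which launch modes are officially supported):

```scala
// Sketch: launching a job by constructing SparkContext directly, bypassing
// the spark-submit scripts. Straightforward for local/standalone masters;
// "yarn-client" additionally needs the Hadoop configuration on the driver's
// classpath, which is part of what this thread is debating.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")            // or "spark://host:7077", "yarn-client"
  .setAppName("embedded-driver")
val sc = new SparkContext(conf)
try {
  val total = sc.parallelize(1 to 100).reduce(_ + _)
  println(total)                    // 5050
} finally {
  sc.stop()
}
```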

Testing spark applications

2015-05-21 Thread Nathan Kronenfeld
see discussions about Spark not really liking multiple contexts in the same JVM Speaking of this - is there a standard way of writing unit tests that require a SparkContext? We've ended up copying out the code of SharedSparkContext to our own testing hierarchy, but it occurs to me someone
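The trait being copied can be sketched roughly as follows. The names mirror Spark's internal SharedSparkContext, but the body is an illustrative reconstruction for ScalaTest, not Spark's exact code:

```scala
// Sketch: one local SparkContext shared across all tests in a suite,
// created before the first test and stopped after the last, sidestepping
// the multiple-contexts-per-JVM problem mentioned in the thread.
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient private var _sc: SparkContext = _
  def sc: SparkContext = _sc

  override def beforeAll(): Unit = {
    super.beforeAll()
    _sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName(suiteName))
  }

  override def afterAll(): Unit = {
    try {
      if (_sc != null) _sc.stop()
      _sc = null
    } finally {
      super.afterAll()
    }
  }
}
```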

Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Marcelo Vanzin
Hi Nathan, On Thu, May 21, 2015 at 7:30 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: In researching and discussing these issues with Cloudera and others, we've been told that only one mechanism is supported for starting Spark jobs: the *spark-submit* scripts. Is this new?

Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Nathan Kronenfeld
Thanks, Marcelo Instantiating SparkContext directly works. Well, sorta: it has limitations. For example, see discussions about Spark not really liking multiple contexts in the same JVM. It also does not work in cluster deploy mode. That's fine - when one is doing something out of

Re: Change for submitting to yarn in 1.3.1

2015-05-21 Thread Koert Kuipers
We also launch jobs programmatically, in both standalone mode and yarn-client mode. In standalone mode it always worked; in yarn-client mode we ran into some issues and were forced to use spark-submit, but I still have on my todo list to move back to a normal java launch without spark-submit at

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Dibyendu Bhattacharya
Hi Tathagata, Thanks for looking into this. Investigating further, I found that the issue is that Tachyon does not support file append. When the streaming receiver that writes to the WAL fails and is restarted, it is not able to append to the same WAL file after the restart. I raised this with the Tachyon user

Re: Testing spark applications

2015-05-21 Thread Reynold Xin
It is just 15 lines of code to copy, isn't it? On Thu, May 21, 2015 at 7:46 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: see discussions about Spark not really liking multiple contexts in the same JVM -- Speaking of this, is there a standard way of writing unit tests that

Tungsten's Vectorized Execution

2015-05-21 Thread Yijie Shen
Hi all, I’ve seen the blog post on Project Tungsten here, and it sounds awesome to me! I’ve also noticed there is a plan to change code generation from record-at-a-time evaluation to a vectorized one, which interests me most. What’s the status of vectorized evaluation? Is this an inner effort of

Customizing Akka configuration for Spark

2015-05-21 Thread Akshat Aranya
Hi, Is there some way to customize the Akka configuration for Spark? Specifically, I want to experiment with custom serialization for messages that are sent between the driver and executors in standalone mode. Thanks, Akshat
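For what it's worth, Spark 1.x exposes a handful of Akka knobs as spark.akka.* settings on SparkConf; a sketch of the usual starting point. The frameSize/threads keys are documented Spark 1.x settings, while passing raw akka.* keys through SparkConf is version-dependent and should be treated as an assumption to verify against AkkaUtils:

```scala
// Sketch, Spark 1.x standalone mode: tuning the driver/executor actor
// systems through Spark's own spark.akka.* settings.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("akka-tuning")
  .set("spark.akka.frameSize", "64")           // max message size, in MB
  .set("spark.akka.threads", "8")              // actor system threads
  // Raw Akka key: honored only in versions whose AkkaUtils forwards
  // "akka."-prefixed entries; verify before relying on it.
  .set("akka.actor.serialize-messages", "on")
```

Custom message serialization beyond these knobs would likely mean building Spark with a modified AkkaUtils, since the actor systems are created internally.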