Use cases for kafka direct stream messageHandler

2016-03-04 Thread Cody Koeninger
I wanted to survey what people are using the direct stream
messageHandler for, besides just extracting key / value / offset.

Would your use case still work if that argument was removed, and the
stream just contained ConsumerRecord objects
(http://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/ConsumerRecord.html)
which you could then use normal map transformations to access?
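
For illustration, a minimal sketch of what that might look like, assuming a
hypothetical direct stream of ConsumerRecord objects (the wrapper class and
the stream itself are illustrative; the accessors are from the linked 0.9
javadoc):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.spark.streaming.api.java.JavaDStream;

    public final class ConsumerRecordStreamSketch {
      // Given a direct stream of ConsumerRecord objects, key / value /
      // offset all fall out of an ordinary map transformation, with no
      // messageHandler argument needed.
      static JavaDStream<String> summarize(
          JavaDStream<ConsumerRecord<String, String>> stream) {
        return stream.map(record ->
            record.topic() + "-" + record.partition() + "@" + record.offset()
                + ": " + record.key() + " -> " + record.value());
      }
    }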

The only other valid use of messageHandler that I can think of is
catching serialization problems on a per-message basis.  But with the
new Kafka consumer library, doing that through a messageHandler doesn't
seem feasible anyway; it could instead be handled with a custom
(de)serializer.
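
As a hedged sketch of that (de)serializer idea, written against the new
consumer's Deserializer interface (the class name and the null-on-failure
convention below are my own illustration, and the delegate could be any
deserializer):

    import java.util.Map;
    import org.apache.kafka.common.serialization.Deserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    // Turn a per-message serialization failure into null, which downstream
    // code can filter with a normal transformation, instead of relying on
    // a messageHandler to catch it.
    public class SafeStringDeserializer implements Deserializer<String> {
      private final StringDeserializer delegate = new StringDeserializer();

      @Override
      public void configure(Map<String, ?> configs, boolean isKey) {
        delegate.configure(configs, isKey);
      }

      @Override
      public String deserialize(String topic, byte[] data) {
        try {
          return delegate.deserialize(topic, data);
        } catch (RuntimeException e) {
          return null;  // mark the record as undecodable rather than failing the batch
        }
      }

      @Override
      public void close() {
        delegate.close();
      }
    }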




GraphX optimizations

2016-03-04 Thread Khaled Ammar
Hi all,

I wonder if the optimizations mentioned in the GraphX paper (
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf ) are
currently implemented. In particular, I am looking for mrTriplets
optimizations and memory-based shuffle.

-- 
Thanks,
-Khaled


Re: Set up a Coverity scan for Spark

2016-03-04 Thread Sean Owen
No. Those are all in the Java examples, and while we should show
stopping the context, it has no big impact. Still, it's worth touching up.

I'm concerned about the ones with a potential correctness implication.
They are easy to fix and already identified, so why wouldn't we fix them?
We take PRs to fix typos in comments.
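
For reference, the touch-up in question is just the usual pattern below
(a minimal sketch; the class name is illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public final class JavaExampleSkeleton {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
          // ... example logic ...
        } finally {
          sc.stop();  // stop the context even if the example body throws
        }
      }
    }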




Re: Set up a Coverity scan for Spark

2016-03-04 Thread Ted Yu
Is there a JIRA for fixing the resource leaks w.r.t. unclosed SparkContexts?

I wonder if such defects are really high priority.

Cheers



Re: Set up a Coverity scan for Spark

2016-03-04 Thread Sean Owen
Hi Ted, I've already marked them. You should be able to see the ones
marked "Fix Required" if you click through to the defects. Most are
just bad form and probably have no impact. The few that looked
reasonably important were:

- using platform char encoding, not UTF-8
- Incorrect notify/wait
- volatile count with non-atomic update
- bad equals/hashCode
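
For anyone browsing the scan, here are generic sketches of the usual fix
for each class of defect (illustrative patterns only, not the actual Spark
code):

    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.util.Objects;
    import java.util.concurrent.atomic.AtomicLong;

    public final class DefectFixPatterns {
      // 1. Platform char encoding: name the charset instead of relying on
      // the JVM default.
      static Reader utf8Reader(byte[] bytes) {
        return new InputStreamReader(
            new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
      }

      // 2. Incorrect notify/wait: hold the lock, and wait in a loop that
      // re-checks the condition, since wait() can wake spuriously.
      private final Object lock = new Object();
      private boolean ready = false;

      void awaitReady() throws InterruptedException {
        synchronized (lock) {
          while (!ready) {
            lock.wait();
          }
        }
      }

      void signalReady() {
        synchronized (lock) {
          ready = true;
          lock.notifyAll();
        }
      }

      // 3. Volatile count with non-atomic update: "volatile long n; n++" is
      // a read-modify-write and can drop updates; use an atomic instead.
      private final AtomicLong count = new AtomicLong();

      void increment() {
        count.incrementAndGet();
      }

      // 4. Bad equals/hashCode: override both, computed from the same fields.
      static final class Key {
        final String name;
        Key(String name) { this.name = name; }

        @Override public boolean equals(Object o) {
          return o instanceof Key && Objects.equals(name, ((Key) o).name);
        }

        @Override public int hashCode() {
          return Objects.hash(name);
        }
      }
    }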




Re: Set up a Coverity scan for Spark

2016-03-04 Thread Ted Yu
Last time I checked, there weren't any high-impact defects.

Mind pointing out the defects you think should be fixed?

Thanks



Re: Set up a Coverity scan for Spark

2016-03-04 Thread Sean Owen
Yeah, it's not going to help with Scala, but it can at least find
stuff in the Java code. I'm not suggesting anyone run it regularly,
but one run to catch some bugs is useful.

I've already triaged ~70 issues there just in the Java code, of which
a handful are important.




Re: Set up a Coverity scan for Spark

2016-03-04 Thread Ted Yu
Since the majority of the code is written in Scala, which is not analyzed by
Coverity, the efficacy of the tool seems limited.




Re: Mapper side join with DataFrames API

2016-03-04 Thread Deepak Gopalakrishnan
I've added this to SO; can you guys share any thoughts?

http://stackoverflow.com/questions/35795518/spark-1-6-spills-to-disk-even-when-there-is-enough-memory


On Thu, Mar 3, 2016 at 7:06 AM, Deepak Gopalakrishnan 
wrote:

> Hello,
>
> I'm using 1.6.0 on EMR
>
> On Thu, Mar 3, 2016 at 12:34 AM, Yong Zhang  wrote:
>
>> What version of Spark are you using?
>>
>> I am also trying to figure out how to do a map-side join in Spark.
>>
>> In 1.5.x, there is a broadcast function for DataFrames, but it caused an
>> OOM in my simple test case, even though one side of the join is very small.
>>
>> I am still trying to find the root cause.
>>
>> Yong
>>
>> --
>> Date: Wed, 2 Mar 2016 15:38:29 +0530
>> Subject: Re: Mapper side join with DataFrames API
>> From: dgk...@gmail.com
>> To: mich...@databricks.com
>> CC: u...@spark.apache.org
>>
>>
>> Thanks for the help guys.
>>
>> Just to ask part of my question in a slightly different way.
>>
>> I have attached my screenshots here. There is so much memory that is
>> unused, and yet there is a spill (as in the screenshots). Any idea why?
>>
>> Thanks
>> Deepak
>>
>> On Wed, Mar 2, 2016 at 5:14 AM, Michael Armbrust 
>> wrote:
>>
>> It's helpful to always include the output of df.explain(true) when you
>> are asking about performance.
>>
>> On Mon, Feb 29, 2016 at 6:14 PM, Deepak Gopalakrishnan 
>> wrote:
>>
>> Hello All,
>>
>> I'm trying to join two DataFrames, A and B, with a
>>
>> sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a=B.a");
>>
>> What I have done is register temp tables for A and B after loading these
>> DataFrames from different sources. I need the join to be really fast, and I
>> was wondering if there is a way to use the SQL statement while still getting
>> a mapper-side join (say my table B is small)?
>>
>> I read some articles on using broadcast to do mapper-side joins. Could I
>> do something like this and then execute my SQL statement to achieve a
>> mapper-side join?
>>
>> DataFrame B = sparkContext.broadcast(B);
>> B.registerTempTable("B");
>>
>>
>> I have a join as stated above, and I see the below in my executor logs:
>>
>> 16/02/29 17:02:35 INFO TaskSetManager: Finished task 198.0 in stage 7.0
>> (TID 1114) in 20354 ms on localhost (196/200)
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 200 non-empty
>> blocks out of 200 blocks
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote
>> fetches in 0 ms
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty
>> blocks out of 128 blocks
>> 16/02/29 17:02:35 INFO ShuffleBlockFetcherIterator: Started 0 remote
>> fetches in 0 ms
>> 16/02/29 17:03:03 INFO Executor: Finished task 199.0 in stage 7.0 (TID
>> 1115). 2511 bytes result sent to driver
>> 16/02/29 17:03:03 INFO TaskSetManager: Finished task 199.0 in stage 7.0
>> (TID 1115) in 27621 ms on localhost (197/200)
>>
>> *16/02/29 17:07:06 INFO UnsafeExternalSorter: Thread 124 spilling sort
>> data of 256.0 KB to disk (0  time so far)*
>>
>>
>> Now, I have around 10G of executor memory, and my memory fraction should be
>> the default (0.75 as per the documentation). My memory usage is < 1.5G
>> (obtained from the Storage tab on the Spark dashboard), but it still says it
>> is spilling sort data. I'm a little surprised this happens even when I
>> have enough memory free.
>> Any inputs will be greatly appreciated!
>> Thanks
>> --
>> Regards,
>> *Deepak Gopalakrishnan*
>> *Mobile*:+918891509774
>> *Skype* : deepakgk87
>> http://myexps.blogspot.com
>>
>>
>>
>>
>>
>> --
>> Regards,
>> *Deepak Gopalakrishnan*
>> *Mobile*:+918891509774
>> *Skype* : deepakgk87
>> http://myexps.blogspot.com
>>
>>
>
>
>
> --
> Regards,
> *Deepak Gopalakrishnan*
> *Mobile*:+918891509774
> *Skype* : deepakgk87
> http://myexps.blogspot.com
>
>


-- 
Regards,
*Deepak Gopalakrishnan*
*Mobile*:+918891509774
*Skype* : deepakgk87
http://myexps.blogspot.com
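
A note on the broadcast question upthread: sparkContext.broadcast on a
DataFrame only broadcasts the object; it does not influence the SQL planner.
A hedged sketch of the 1.x pattern instead, using the broadcast hint from
org.apache.spark.sql.functions (the class and column names are illustrative):

    import org.apache.spark.sql.DataFrame;
    import static org.apache.spark.sql.functions.broadcast;

    public final class BroadcastJoinSketch {
      // Hint that b is small enough to ship to every executor, so the
      // planner can choose a broadcast (map-side) hash join instead of a
      // shuffle join.
      static DataFrame joinWithSmallSide(DataFrame a, DataFrame b) {
        DataFrame joined = a.join(broadcast(b), a.col("a").equalTo(b.col("a")));
        joined.explain(true);  // look for BroadcastHashJoin in the physical plan
        return joined;
      }
    }

Tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default)
should also be broadcast automatically, even without the hint.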


Set up a Coverity scan for Spark

2016-03-04 Thread Sean Owen
https://scan.coverity.com/projects/apache-spark-2f9d080d-401d-47bc-9dd1-7956c411fbb4?tab=overview

This has to be run manually, and is Java-only, but the inspection
results are pretty good. Anyone should be able to browse them; let
me know if anyone would like more access.
Most are false positives, but it has found some reasonable little bugs.

When my stack of things to do clears I'll try to address them, but I
bring it up as an FYI for anyone interested in static analysis.




Fwd: spark master ui to proxy app and worker ui

2016-03-04 Thread Gurvinder Singh
Forwarding to the development mailing list, as it might be more relevant
to ask here. I am wondering if I missed something in the documentation
and this is already possible. If so, please point me to the documentation
on how to achieve it. If not, would it make sense to implement it?

Thanks,
Gurvinder


-------- Forwarded Message --------
Subject: spark master ui to proxy app and worker ui
Date: Thu, 3 Mar 2016 20:12:07 +0100
From: Gurvinder Singh 
To: user 

Hi,

I am wondering if it is possible for the Spark standalone master UI to
proxy the app/driver UI and the worker UI. The reason for this is that
currently, if you want to access the driver and worker UIs to see logs,
you need access to their IP:port, which makes them harder to open up from
a networking point of view. So operationally it would make life easier if
the master could simply proxy those connections and allow access to both
app and worker UI details from the master UI itself.

The master would not need content streamed to it all the time; only when
a user wants to access content from the other UIs would it proxy the
request/response for that duration. Thus the master would not incur extra
load all the time.

Thanks,
Gurvinder




