Re: Unable to access Resource Manager /Name Node on port 9026 / 9101 on a Spark EMR Cluster

2016-04-15 Thread Wei-Shun Lo
Hi Chanda,

You may want to use nmap to check whether the port and service are
correctly started locally, e.g.:
nmap localhost
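
If nmap isn't available on the box, the same local check can be done
programmatically. Below is a minimal Scala sketch (run it on the master node
itself; the port list is just the ports mentioned in this thread) that tries
to open a TCP connection to each port:

import java.net.{InetSocketAddress, Socket}

// Ports taken from this thread: 9026 (ResourceManager UI), 9101 (NameNode UI), 8890 (Zeppelin).
val portsToCheck = Seq(9026, 9101, 8890)

portsToCheck.foreach { port =>
  val socket = new Socket()
  try {
    // A short timeout is enough for a loopback check.
    socket.connect(new InetSocketAddress("localhost", port), 2000)
    println(s"Port $port is listening locally")
  } catch {
    case e: Exception => println(s"Port $port is NOT reachable locally: ${e.getMessage}")
  } finally {
    socket.close()
  }
}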

If the port is already listening successfully internally, the issue might be
related to the inbound/outbound traffic rules in your security group settings.

Just fyi.


On Fri, Apr 15, 2016 at 7:29 AM, Chadha Pooja  wrote:

> Hi ,
>
>
>
>
>
> We have setup a Spark Cluster (3 node) on Amazon EMR.
>
>
>
> We aren't able to use ports 9026 and 9101 on the existing Spark EMR cluster,
> which are part of the Web UIs offered with Amazon EMR. I was able to use
> other ports, such as 8890 (Zeppelin), HUE, etc.
>
>
>
> We checked that the security settings currently are open to everyone, and
> it is not an issue with security.
>
>
>
> URLs
>
>
>
> Hadoop ResourceManager
>
> http://master-node-IP:9026/
>
> Hadoop HDFS NameNode
>
> http://master-node-IP:9101/
>
>
>
> Errors Observed on Fiddler:
>
> *Port 9026: *
>
> [Fiddler] The connection to 'masternodeIP' failed.
> Error: TimedOut (0x274c).
> System.Net.Sockets.SocketException A connection attempt failed because the
> connected party did not properly respond after a period of time, or
> established connection failed because connected host has failed to respond
> <>:9026
>
>
>
> *Port 9101:*
>
> [Fiddler] The connection to <>: failed.
> Error: TimedOut (0x274c).
> System.Net.Sockets.SocketException A connection attempt failed because the
> connected party did not properly respond after a period of time, or
> established connection failed because connected host has failed to respond
> <>:9101
>
>
>
> Does anyone have any experiences or pointers? Appreciate your help!
>
>
>
> Thanks!
>
>
>



-- 
Best Luck,
Ralic Lo
---
Phone: 408-609-7628
Email: rali...@gmail.com
INSPIRATION FILLS THE GAP OF KNOWLEDGE!


Will not store rdd_16_4383 as it would require dropping another block from the same RDD

2016-04-15 Thread Alexander Pivovarov
I run Spark 1.6.1 on YARN   (EMR-4.5.0)

I call RDD.count on a MEMORY_ONLY_SER-cached RDD (spark.serializer is
KryoSerializer).

After the count job is done, I noticed that the Spark UI shows the RDD
Fraction Cached as only 6%, with
Size in Memory = 65.3 GB.

I looked at the executors' stderr in the Spark UI and saw lots of messages like:

16/04/15 19:08:03 INFO storage.MemoryStore: Will not store rdd_16_4383
as it would require dropping another block from the same RDD
16/04/15 19:08:03 WARN storage.MemoryStore: Not enough space to cache
rdd_16_4383 in memory! (computed 1462.4 MB so far)
16/04/15 19:08:03 INFO storage.MemoryStore: Memory use = 11.0 KB
(blocks) + 33.8 GB (scratch space shared across 17 tasks(s)) = 33.8
GB. Storage limit = 33.8 GB.
16/04/15 19:08:06 INFO storage.MemoryStore: Will not store rdd_16_4306
as it would require dropping another block from the same RDD
16/04/15 19:08:06 WARN storage.MemoryStore: Not enough space to cache
rdd_16_4306 in memory! (computed 1920.6 MB so far)
16/04/15 19:08:06 INFO storage.MemoryStore: Memory use = 11.0 KB
(blocks) + 33.8 GB (scratch space shared across 17 tasks(s)) = 33.8
GB. Storage limit = 33.8 GB.



But the cluster has enough memory to cache three RDDs of that size:

spark.executor.instances - 100

spark.executor.memory - 48524M

Storage Memory on each executor - 33.8 GB

Executors Memory: 67.2 GB Used (3.3 TB Total)


If my RDD takes 65.3 GB of memory storage when RDD Fraction Cached = 6%,
then the total size in memory should be about 1.1 TB (65.3 GB / 0.06).


The cluster has 3.3 TB of total storage memory and only one application is
running right now (this RDD is the first RDD cached in my program).

Why can't Spark store the entire RDD in memory?


BTW, the previous Spark 1.5.2 stored 100% of the RDD.


Should I switch to legacy mode? spark.memory.useLegacyMode=true
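
For reference, a minimal sketch of the caching pattern described above plus
the legacy-mode switch being considered (the app name, input path, and config
values here are illustrative, not the actual job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("cache-fraction-test")  // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Candidate workaround discussed above: fall back to the pre-1.6 memory manager.
  .set("spark.memory.useLegacyMode", "true")

val sc = new SparkContext(conf)

// Cache serialized in memory, as in the question (MEMORY_ONLY_SER + Kryo).
val rdd = sc.textFile("hdfs:///some/input")  // hypothetical input path
  .persist(StorageLevel.MEMORY_ONLY_SER)

rdd.count()  // materializes the cache; Fraction Cached then shows up on the Storage tab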




Re: Skipping Type Conversion and using InternalRows for UDF

2016-04-15 Thread Michael Armbrust
This would also probably improve performance:
https://github.com/apache/spark/pull/9565

On Fri, Apr 15, 2016 at 8:44 AM, Hamel Kothari 
wrote:

> Hi all,
>
> So we have these UDFs which take <1ms to operate and we're seeing pretty
> poor performance around them in practice, the overhead being >10ms for the
> projections (this data is deeply nested with ArrayTypes and MapTypes so
> that could be the cause). Looking at the logs and code for ScalaUDF, I
> noticed that there are a series of projections which take place before and
> after in order to make the Rows safe and then unsafe again. Is there any
> way to opt out of this and input/return InternalRows to skip the
> performance hit of the type conversion? It doesn't immediately appear to be
> possible but I'd like to make sure that I'm not missing anything.
>
> I suspect we could make this possible by checking if typetags in the
> register function are all internal types, if they are, passing a false
> value for "needs[Input|Output]Conversion" to ScalaUDF and then in ScalaUDF
> checking for that flag to figure out if the conversion process needs to
> take place. We're still left with the issue of missing a schema in the case
> of outputting InternalRows, but we could expose the DataType parameter
> rather than inferring it in the register function. Is there anything else
> in the code that would prevent this from working?
>
> Regards,
> Hamel
>


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Mridul Muralidharan
On Friday, April 15, 2016, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Yeah in support of this statement I think that my primary interest in
> this Spark Extras and the good work by Luciano here is that anytime we
> take bits out of a code base and “move it to GitHub” I see a bad precedent
> being set.


Can't agree more!



>
> Creating this project at the ASF creates a synergy between *Apache Spark*
> which is *at the ASF*.


In addition, this will give all the "goodness" of being an Apache project
from a user/consumer point of view, compared to a general GitHub project.




>
> We welcome comments and as Luciano said, this is meant to invite and be
> open to those in the Apache Spark PMC to join and help.
>
>
This would definitely be something worthwhile to explore.
+1

Regards
Mridul



> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov 
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
> On 4/15/16, 9:39 AM, "Luciano Resende"  > wrote:
>
> >
> >
> >On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger
> >> wrote:
> >
> >Given that not all of the connectors were removed, I think this
> >creates a weird / confusing three tier system
> >
> >1. connectors in the official project's spark/extras or spark/external
> >2. connectors in "Spark Extras"
> >3. connectors in some random organization's github
> >
> >
> >
> >
> >
> >
> >
> >Agree Cody, and I think this is one of the goals of "Spark Extras",
> centralize the development of these connectors under one central place at
> Apache, and that's why one of our asks is to invite the Spark PMC to
> continue developing the remaining connectors
> > that stayed in Spark proper, in "Spark Extras". We will also discuss
> some process policies on enabling lowering the bar to allow proposal of
> these other github extensions to be part of "Spark Extras" while also
> considering a way to move code to a maintenance
> > mode location.
> >
> >
> >
> >
> >--
> >Luciano Resende
> >http://twitter.com/lresende1975
> >http://lresende.blogspot.com/
> >
> >
> >
> >
>


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Cody Koeninger
100% agree with Sean & Reynold's comments on this.

Adding this as a TLP would just cause more confusion as to "official"
endorsement.



On Fri, Apr 15, 2016 at 11:50 AM, Sean Owen  wrote:
> On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende  wrote:
>> I know the name might be confusing, but I also think that the projects have
>> a very big synergy, more like sibling projects, where "Spark Extras" extends
>> the Spark community and develop/maintain components for, and pretty much
>> only for, Apache Spark.  Based on your comment above, if making the project
>> "Spark-Extras" a more acceptable name, I believe this is ok as well.
>
> This also grants special status to a third-party project. It's not
> clear this should be *the* official unofficial third-party Spark
> project over some other one. If something's to be blessed, it should
> be in the Spark project.
>
> And why isn't it in the Spark project? the argument was that these
> bits were not used and pretty de minimis as code. It's not up to me or
> anyone else to tell you code X isn't useful to you. But arguing X
> should be a TLP asserts it is substantial and of broad interest, since
> there's non-zero effort for volunteers to deal with it. I am not sure
> I've heard anyone argue that -- or did I miss it? because removing
> bits of unused code happens all the time and isn't a bad precedent or
> even unusual.
>
> It doesn't actually enable any more cooperation than is already
> possible with any other project (like Kafka, Mesos, etc). You can run
> the same governance model anywhere you like. I realize literally being
> operated under the ASF banner is something different.
>
> What I hear here is a proposal to make an unofficial official Spark
> project as a TLP, that begins with these fairly inconsequential
> extras. I question the value of that on its face. Example: what goes
> into this project? deleted Spark code only? or is this a glorified
> "contrib" folder with a lower and somehow different bar determined by
> different people?
>
> And at that stage... is it really helping to give that special status?
>




Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Mattmann, Chris A (3980)
Yeah in support of this statement I think that my primary interest in
this Spark Extras and the good work by Luciano here is that anytime we
take bits out of a code base and “move it to GitHub” I see a bad precedent
being set.

Creating this project at the ASF creates a synergy with *Apache Spark*,
which is *at the ASF*.

We welcome comments and as Luciano said, this is meant to invite and be
open to those in the Apache Spark PMC to join and help.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++










On 4/15/16, 9:39 AM, "Luciano Resende"  wrote:

>
>
>On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger 
> wrote:
>
>Given that not all of the connectors were removed, I think this
>creates a weird / confusing three tier system
>
>1. connectors in the official project's spark/extras or spark/external
>2. connectors in "Spark Extras"
>3. connectors in some random organization's github
>
>
>
>
>
>
>
>Agree Cody, and I think this is one of the goals of "Spark Extras", centralize 
>the development of these connectors under one central place at Apache, and 
>that's why one of our asks is to invite the Spark PMC to continue developing 
>the remaining connectors
> that stayed in Spark proper, in "Spark Extras". We will also discuss some 
> process policies on enabling lowering the bar to allow proposal of these 
> other github extensions to be part of "Spark Extras" while also considering a 
> way to move code to a maintenance
> mode location.
>
> 
>
>
>-- 
>Luciano Resende
>http://twitter.com/lresende1975
>http://lresende.blogspot.com/
>
>
>
>


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Mattmann, Chris A (3980)
Hey Reynold,

Thanks. Getting to the heart of this, I think that this project would
be successful if the Apache Spark PMC decided to participate and there
was some overlap. As much as I think it would be great to stand up another
project, the goal here from Luciano and crew (myself included) is to
suggest that it's just as easy to start an Apache Incubator project to
manage "extra" pieces of Apache Spark code outside of the release cycle,
given the other reasons already stated for moving this code out of the
code base. This isn't a competing effort to the code on GitHub that was
moved out of Apache source control from Apache Spark - it's meant to be
an enabler, to suggest that code could be managed here just as easily
(see the difference?).

Let me know what you think. Thanks, Reynold.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++









On 4/15/16, 9:47 AM, "Reynold Xin"  wrote:

>
>
>
>Anybody is free and welcome to create another ASF project, but I don't think
>"Spark extras" is a good name. It unnecessarily creates another tier of code
>that the ASF is "endorsing".
>On Friday, April 15, 2016, Mattmann, Chris A (3980) 
> wrote:
>
>Yeah in support of this statement I think that my primary interest in
>this Spark Extras and the good work by Luciano here is that anytime we
>take bits out of a code base and “move it to GitHub” I see a bad precedent
>being set.
>
>Creating this project at the ASF creates a synergy between *Apache Spark*
>which is *at the ASF*.
>
>We welcome comments and as Luciano said, this is meant to invite and be
>open to those in the Apache Spark PMC to join and help.
>
>Cheers,
>Chris
>
>++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: 
>chris.a.mattm...@nasa.gov 
>WWW:  http://sunset.usc.edu/~mattmann/
>++
>Director, Information Retrieval and Data Science Group (IRDS)
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>WWW: http://irds.usc.edu/
>++
>
>On 4/15/16, 9:39 AM, "Luciano Resende" > 
>wrote:
>
>>
>>
>>On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger
>>> wrote:
>>
>>Given that not all of the connectors were removed, I think this
>>creates a weird / confusing three tier system
>>
>>1. connectors in the official project's spark/extras or spark/external
>>2. connectors in "Spark Extras"
>>3. connectors in some random organization's github
>>
>>
>>
>>
>>
>>
>>
>>Agree Cody, and I think this is one of the goals of "Spark Extras", 
>>centralize the development of these connectors under one central place at 
>>Apache, and that's why one of our asks is to invite the Spark PMC to continue 
>>developing the remaining connectors
>> that stayed in Spark proper, in "Spark Extras". We will also discuss some 
>> process policies on enabling lowering the bar to allow proposal of these 
>> other github extensions to be part of "Spark Extras" while also considering 
>> a way to move code to a maintenance
>> mode location.
>>
>>
>>
>>
>>--
>>Luciano Resende
>>http://twitter.com/lresende1975
>>http://lresende.blogspot.com/
>>
>>
>>
>>
>
>
>


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Jean-Baptiste Onofré

+1

Regards
JB

On 04/15/2016 06:41 PM, Mattmann, Chris A (3980) wrote:

Yeah in support of this statement I think that my primary interest in
this Spark Extras and the good work by Luciano here is that anytime we
take bits out of a code base and “move it to GitHub” I see a bad precedent
being set.

Creating this project at the ASF creates a synergy between *Apache Spark*
which is *at the ASF*.

We welcome comments and as Luciano said, this is meant to invite and be
open to those in the Apache Spark PMC to join and help.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++










On 4/15/16, 9:39 AM, "Luciano Resende"  wrote:




On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger
 wrote:

Given that not all of the connectors were removed, I think this
creates a weird / confusing three tier system

1. connectors in the official project's spark/extras or spark/external
2. connectors in "Spark Extras"
3. connectors in some random organization's github







Agree Cody, and I think this is one of the goals of "Spark Extras", centralize 
the development of these connectors under one central place at Apache, and that's why one 
of our asks is to invite the Spark PMC to continue developing the remaining connectors
that stayed in Spark proper, in "Spark Extras". We will also discuss some process 
policies on enabling lowering the bar to allow proposal of these other github extensions to be part 
of "Spark Extras" while also considering a way to move code to a maintenance
mode location.




--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/









--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com




Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Sean Owen
On Fri, Apr 15, 2016 at 5:34 PM, Luciano Resende  wrote:
> I know the name might be confusing, but I also think that the projects have
> a very big synergy, more like sibling projects, where "Spark Extras" extends
> the Spark community and develop/maintain components for, and pretty much
> only for, Apache Spark.  Based on your comment above, if making the project
> "Spark-Extras" a more acceptable name, I believe this is ok as well.

This also grants special status to a third-party project. It's not
clear this should be *the* official unofficial third-party Spark
project over some other one. If something's to be blessed, it should
be in the Spark project.

And why isn't it in the Spark project? the argument was that these
bits were not used and pretty de minimis as code. It's not up to me or
anyone else to tell you code X isn't useful to you. But arguing X
should be a TLP asserts it is substantial and of broad interest, since
there's non-zero effort for volunteers to deal with it. I am not sure
I've heard anyone argue that -- or did I miss it? because removing
bits of unused code happens all the time and isn't a bad precedent or
even unusual.

It doesn't actually enable any more cooperation than is already
possible with any other project (like Kafka, Mesos, etc). You can run
the same governance model anywhere you like. I realize literally being
operated under the ASF banner is something different.

What I hear here is a proposal to make an unofficial official Spark
project as a TLP, that begins with these fairly inconsequential
extras. I question the value of that on its face. Example: what goes
into this project? deleted Spark code only? or is this a glorified
"contrib" folder with a lower and somehow different bar determined by
different people?

And at that stage... is it really helping to give that special status?




ClassFormatError in latest spark 2 SNAPSHOT build

2016-04-15 Thread Koert Kuipers
Not sure why, but I am getting this today using Spark 2 snapshots...
I am on Java 7 and Scala 2.11.

16/04/15 12:35:46 WARN TaskSetManager: Lost task 2.0 in stage 3.0 (TID 15,
localhost): java.lang.ClassFormatError: Duplicate field name in
class file
org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificMutableProjection
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at
org.codehaus.janino.ByteArrayClassLoader.findClass(ByteArrayClassLoader.java:66)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass.generate(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$create$2.apply(GenerateMutableProjection.scala:140)
at
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$create$2.apply(GenerateMutableProjection.scala:139)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:178)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator.<init>(AggregationIterator.scala:197)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.<init>(SortBasedAggregationIterator.scala:39)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregate$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregate.scala:80)
at
org.apache.spark.sql.execution.aggregate.SortBasedAggregate$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregate.scala:71)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:768)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:768)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$


Re: ClassFormatError in latest spark 2 SNAPSHOT build

2016-04-15 Thread Reynold Xin
Can you post the generated code?

df.queryExecution.debug.codeGen()

(Or something similar to that)
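
For reference, a hedged sketch of how the generated code can be dumped on a
recent 2.0 snapshot (this assumes the debug helpers in
org.apache.spark.sql.execution.debug; the exact helper names may differ
between snapshots):

import org.apache.spark.sql.execution.debug._  // adds debug helpers to Dataset/DataFrame

// Assuming `df` is the DataFrame whose aggregation triggers the ClassFormatError:
df.debugCodegen()  // prints the generated Java code for the codegen'd parts of the plan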

On Friday, April 15, 2016, Koert Kuipers  wrote:

> not sure why, but i am getting this today using spark 2 snapshots...
> i am on java 7 and scala 2.11
>
> 16/04/15 12:35:46 WARN TaskSetManager: Lost task 2.0 in stage 3.0 (TID 15,
> localhost): java.lang.ClassFormatError: Duplicate field name in
> class file
> org/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificMutableProjection
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> at
> org.codehaus.janino.ByteArrayClassLoader.findClass(ByteArrayClassLoader.java:66)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass.generate(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$create$2.apply(GenerateMutableProjection.scala:140)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$$anonfun$create$2.apply(GenerateMutableProjection.scala:139)
> at
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:178)
> at
> org.apache.spark.sql.execution.aggregate.AggregationIterator.<init>(AggregationIterator.scala:197)
> at
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.<init>(SortBasedAggregationIterator.scala:39)
> at
> org.apache.spark.sql.execution.aggregate.SortBasedAggregate$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregate.scala:80)
> at
> org.apache.spark.sql.execution.aggregate.SortBasedAggregate$$anonfun$doExecute$1$$anonfun$3.apply(SortBasedAggregate.scala:71)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:768)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:768)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$
>


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Luciano Resende
On Fri, Apr 15, 2016 at 9:34 AM, Cody Koeninger  wrote:

> Given that not all of the connectors were removed, I think this
> creates a weird / confusing three tier system
>
> 1. connectors in the official project's spark/extras or spark/external
> 2. connectors in "Spark Extras"
> 3. connectors in some random organization's github
>
>
Agreed, Cody, and I think this is one of the goals of "Spark Extras":
to centralize the development of these connectors in one place at
Apache. That's why one of our asks is to invite the Spark PMC to
continue developing, in "Spark Extras", the remaining connectors that
stayed in Spark proper. We will also discuss some process policies on
lowering the bar for proposing these other GitHub extensions as part of
"Spark Extras", while also considering a way to move code to a
maintenance-mode location.


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
I am curious if all Spark unit tests pass with the forced true value for
unaligned.
If that is the case, it seems we can add s390x to the known architectures.

It would also give us some more background if you can describe how
java.nio.Bits#unaligned()
is implemented on s390x.

Josh / Andrew / Davies / Ryan are more familiar with related code. It would
be good to hear what they think.

Thanks

On Fri, Apr 15, 2016 at 8:47 AM, Adam Roberts  wrote:

> Ted, yeah with the forced true value the tests in that suite all pass and
> I know they're being executed thanks to prints I've added
>
> Cheers,
>
>
>
>
> From:Ted Yu 
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"dev@spark.apache.org" 
> Date:15/04/2016 16:43
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
> Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with
> the forced true value for unaligned ?
>
> If the test failed, please pastebin the failure(s).
>
> Thanks
>
> On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts <*arobe...@uk.ibm.com*
> > wrote:
> Ted, yep I'm working from the latest code which includes that unaligned
> check, for experimenting I've modified that code to ignore the unaligned
> check (just go ahead and say we support it anyway, even though our JDK
> returns false: the return value of java.nio.Bits.unaligned()).
>
> My Platform.java for testing contains:
>
> private static final boolean unaligned;
>
> static {
>   boolean _unaligned;
>   // use reflection to access unaligned field
>   try {
>     System.out.println("Checking unaligned support");
>     Class<?> bitsClass =
>       Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader());
>     Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
>     unalignedMethod.setAccessible(true);
>     _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
>     System.out.println("Used reflection and _unaligned is: " + _unaligned);
>     System.out.println("Setting to true anyway for experimenting");
>     _unaligned = true;
>   } catch (Throwable t) {
>     // We at least know x86 and x64 support unaligned access.
>     String arch = System.getProperty("os.arch", "");
>     //noinspection DynamicRegexReplaceableByCompiledPattern
>     // We don't actually get here since we find the unaligned method OK
>     // and it returns false (I override with true anyway),
>     // but add s390x incase we somehow fail anyway.
>     System.out.println("Checking for s390x, os.arch is: " + arch);
>     _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$");
>   }
>   unaligned = _unaligned;
>   System.out.println("returning: " + unaligned);
> }
>
> Output is, as you'd expect, "used reflection and _unaligned is false,
> setting to true anyway for experimenting", and the tests pass.
>
> No other problems on the platform (pending a different pull request).
>
> Cheers,
>
>
>
>
>
>
>
> From:Ted Yu <*yuzhih...@gmail.com* >
> To:Adam Roberts/UK/IBM@IBMGB
> Cc:"*dev@spark.apache.org* " <
> *dev@spark.apache.org* >
> Date:15/04/2016 15:32
> Subject:Re: BytesToBytes and unaligned memory
> --
>
>
>
>
> I assume you tested 2.0 with SPARK-12181 .
>
> Related code from Platform.java if java.nio.Bits#unaligned() throws
> exception:
>
>   // We at least know x86 and x64 support unaligned access.
>   String arch = System.getProperty("os.arch", "");
>   //noinspection DynamicRegexReplaceableByCompiledPattern
>   _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$");
>
> Can you give us some detail on how the code runs for JDKs on zSystems ?
>
> Thanks
>
> On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts <*arobe...@uk.ibm.com*
> > wrote:
> Hi, I'm testing Spark 2.0.0 on various architectures and have a question,
> are we sure if
> *core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java*
> 
> really is attempting to use unaligned memory access (for the
> BytesToBytesMapOffHeapSuite tests specifically)?
>
> Our JDKs on zSystems for example return false for the
> java.nio.Bits.unaligned() method and yet if I skip this check and add s390x
> to the supported architectures (for zSystems), all thirteen tests here
> pass.
>
> The 13 tests here all fail as we do not pass the unaligned requirement
> (but perhaps incorrectly):
>
> *core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java*
> 

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Luciano Resende
On Fri, Apr 15, 2016 at 9:18 AM, Sean Owen  wrote:

> Why would this need to be an ASF project of its own? I don't think
> it's possible to have a yet another separate "Spark Extras" TLP (?)
>
> There is already a project to manage these bits of code on Github. How
> about all of the interested parties manage the code there, under the
> same process, under the same license, etc?
>

This whole discussion started when some of the connectors were moved from
Apache to GitHub, which makes the point that the "Spark governance" of
these bits is something the community, consumers, and other companies
consuming open source code value highly. Being an Apache project also
allows the project to use and share the Apache infrastructure to run the
project.


>
> I'm not against calling it Spark Extras myself but I wonder if that
> needlessly confuses the situation. They aren't part of the Spark TLP
> on purpose, so trying to give it some special middle-ground status
> might just be confusing. The thing that comes to mind immediately is
> "Connectors for Apache Spark", spark-connectors, etc.
>
>
I know the name might be confusing, but I also think that the projects have
a very big synergy, more like sibling projects, where "Spark Extras"
extends the Spark community and develops/maintains components for, and
pretty much only for, Apache Spark. Based on your comment above, if
"Spark-Extras" would be a more acceptable name for the project, I believe
that is OK as well.

I also understand that the Spark PMC might have concerns with branding, and
that's why we are inviting all members of the Spark PMC to join the project
and help oversee and manage the project.



>
> On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende 
> wrote:
> > After some collaboration with other community members, we have created a
> > initial draft for Spark Extras which is available for review at
> >
> >
> https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing
> >
> > We would like to invite other community members to participate in the
> > project, particularly the Spark Committers and PMC (feel free to express
> > interest and I will update the proposal). Another option here is just to
> > give ALL Spark committers write access to "Spark Extras".
> >
> >
> > We also have couple asks from the Spark PMC :
> >
> > - Permission to use "Spark Extras" as the project name. We already
> checked
> > this with Apache Brand Management, and the recommendation was to discuss
> and
> > reach consensus with the Spark PMC.
> >
> > - We would also want to check with the Spark PMC that, in case of
> > successfully creation of  "Spark Extras", if the PMC would be willing to
> > continue the development of the remaining connectors that stayed in Spark
> > 2.0 codebase in the "Spark Extras" project.
> >
> >
> > Thanks in advance, and we welcome any feedback around this proposal
> before
> > we present to the Apache Board for consideration.
> >
> >
> >
> > On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende 
> > wrote:
> >>
> >> I believe some of this has been resolved in the context of some parts
> that
> >> had interest in one extra connector, but we still have a few removed,
> and as
> >> you mentioned, we still don't have a simple way or willingness to
> manage and
> >> be current on new packages like kafka. And based on the fact that this
> >> thread is still alive, I believe that other community members might have
> >> other concerns as well.
> >>
> >> After some thought, I believe having a separate project (what was
> >> mentioned here as Spark Extras) to handle Spark Connectors and Spark
> add-ons
> >> in general could be very beneficial to Spark and the overall Spark
> >> community, which would have a central place in Apache to collaborate
> around
> >> related Spark components.
> >>
> >> Some of the benefits on this approach
> >>
> >> - Enables maintaining the connectors inside Apache, following the Apache
> >> governance and release rules, while allowing Spark proper to focus on
> the
> >> core runtime.
> >> - Provides more flexibility in controlling the direction (currency) of
> the
> >> existing connectors (e.g. willing to find a solution and maintain
> multiple
> >> versions of same connectors like kafka 0.8x and 0.9x)
> >> - Becomes a home for other types of Spark related connectors helping
> >> expanding the community around Spark (e.g. Zeppelin see most of it's
> current
> >> contribution around new/enhanced connectors)
> >>
> >> What are some requirements for Spark Extras to be successful:
> >>
> >> - Be up to date with Spark Trunk APIs (based on daily CIs against
> >> SNAPSHOT)
> >> - Adhere to Spark release cycles (have a very little window compared to
> >> Spark release)
> >> - Be more open and flexible to the set of connectors it will accept and
> >> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we
> >> have today)
> >>

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Cody Koeninger
Given that not all of the connectors were removed, I think this
creates a weird / confusing three tier system

1. connectors in the official project's spark/extras or spark/external
2. connectors in "Spark Extras"
3. connectors in some random organization's github



On Fri, Apr 15, 2016 at 11:18 AM, Sean Owen  wrote:
> Why would this need to be an ASF project of its own? I don't think
> it's possible to have a yet another separate "Spark Extras" TLP (?)
>
> There is already a project to manage these bits of code on Github. How
> about all of the interested parties manage the code there, under the
> same process, under the same license, etc?
>
> I'm not against calling it Spark Extras myself but I wonder if that
> needlessly confuses the situation. They aren't part of the Spark TLP
> on purpose, so trying to give it some special middle-ground status
> might just be confusing. The thing that comes to mind immediately is
> "Connectors for Apache Spark", spark-connectors, etc.
>
>
> On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende  wrote:
>> After some collaboration with other community members, we have created a
>> initial draft for Spark Extras which is available for review at
>>
>> https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing
>>
>> We would like to invite other community members to participate in the
>> project, particularly the Spark Committers and PMC (feel free to express
>> interest and I will update the proposal). Another option here is just to
>> give ALL Spark committers write access to "Spark Extras".
>>
>>
>> We also have couple asks from the Spark PMC :
>>
>> - Permission to use "Spark Extras" as the project name. We already checked
>> this with Apache Brand Management, and the recommendation was to discuss and
>> reach consensus with the Spark PMC.
>>
>> - We would also want to check with the Spark PMC that, in case of
>> successfully creation of  "Spark Extras", if the PMC would be willing to
>> continue the development of the remaining connectors that stayed in Spark
>> 2.0 codebase in the "Spark Extras" project.
>>
>>
>> Thanks in advance, and we welcome any feedback around this proposal before
>> we present to the Apache Board for consideration.
>>
>>
>>
>> On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende 
>> wrote:
>>>
>>> I believe some of this has been resolved in the context of some parts that
>>> had interest in one extra connector, but we still have a few removed, and as
>>> you mentioned, we still don't have a simple way or willingness to manage and
>>> be current on new packages like kafka. And based on the fact that this
>>> thread is still alive, I believe that other community members might have
>>> other concerns as well.
>>>
>>> After some thought, I believe having a separate project (what was
>>> mentioned here as Spark Extras) to handle Spark Connectors and Spark add-ons
>>> in general could be very beneficial to Spark and the overall Spark
>>> community, which would have a central place in Apache to collaborate around
>>> related Spark components.
>>>
>>> Some of the benefits on this approach
>>>
>>> - Enables maintaining the connectors inside Apache, following the Apache
>>> governance and release rules, while allowing Spark proper to focus on the
>>> core runtime.
>>> - Provides more flexibility in controlling the direction (currency) of the
>>> existing connectors (e.g. willing to find a solution and maintain multiple
>>> versions of same connectors like kafka 0.8x and 0.9x)
>>> - Becomes a home for other types of Spark related connectors helping
>>> expanding the community around Spark (e.g. Zeppelin see most of it's current
>>> contribution around new/enhanced connectors)
>>>
>>> What are some requirements for Spark Extras to be successful:
>>>
>>> - Be up to date with Spark Trunk APIs (based on daily CIs against
>>> SNAPSHOT)
>>> - Adhere to Spark release cycles (have a very little window compared to
>>> Spark release)
>>> - Be more open and flexible to the set of connectors it will accept and
>>> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we
>>> have today)
>>>
>>> Where to start Spark Extras
>>>
>>> Depending on the interest here, we could follow the steps of (Apache
>>> Arrow) and start this directly as a TLP, or start as an incubator project. I
>>> would consider the first option first.
>>>
>>> Who would participate
>>>
>>> Have thought about this for a bit, and if we go to the direction of TLP, I
>>> would say Spark Committers and Apache Members can request to participate as
>>> PMC members, while other committers can request to become committers. Non
>>> committers would be added based on meritocracy after the start of the
>>> project.
>>>
>>> Project Name
>>>
>>> It would be ideal if we could have a project name that shows close ties to
>>> Spark (e.g. Spark Extras or Spark Connectors) but we will need permission
>>> and 

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Chris Fregly
and how does this all relate to the existing 1-and-a-half-class citizen
known as spark-packages.org?

support for this citizen is buried deep in the Spark source (which was
always a bit odd, in my opinion):

https://github.com/apache/spark/search?utf8=%E2%9C%93=spark-packages


On Fri, Apr 15, 2016 at 12:18 PM, Sean Owen  wrote:

> Why would this need to be an ASF project of its own? I don't think
> it's possible to have a yet another separate "Spark Extras" TLP (?)
>
> There is already a project to manage these bits of code on Github. How
> about all of the interested parties manage the code there, under the
> same process, under the same license, etc?
>
> I'm not against calling it Spark Extras myself but I wonder if that
> needlessly confuses the situation. They aren't part of the Spark TLP
> on purpose, so trying to give it some special middle-ground status
> might just be confusing. The thing that comes to mind immediately is
> "Connectors for Apache Spark", spark-connectors, etc.
>
>
> On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende 
> wrote:
> > After some collaboration with other community members, we have created a
> > initial draft for Spark Extras which is available for review at
> >
> >
> https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing
> >
> > We would like to invite other community members to participate in the
> > project, particularly the Spark Committers and PMC (feel free to express
> > interest and I will update the proposal). Another option here is just to
> > give ALL Spark committers write access to "Spark Extras".
> >
> >
> > We also have couple asks from the Spark PMC :
> >
> > - Permission to use "Spark Extras" as the project name. We already
> checked
> > this with Apache Brand Management, and the recommendation was to discuss
> and
> > reach consensus with the Spark PMC.
> >
> > - We would also want to check with the Spark PMC that, in case of
> > successfully creation of  "Spark Extras", if the PMC would be willing to
> > continue the development of the remaining connectors that stayed in Spark
> > 2.0 codebase in the "Spark Extras" project.
> >
> >
> > Thanks in advance, and we welcome any feedback around this proposal
> before
> > we present to the Apache Board for consideration.
> >
> >
> >
> > On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende 
> > wrote:
> >>
> >> I believe some of this has been resolved in the context of some parts
> that
> >> had interest in one extra connector, but we still have a few removed,
> and as
> >> you mentioned, we still don't have a simple way or willingness to
> manage and
> >> be current on new packages like kafka. And based on the fact that this
> >> thread is still alive, I believe that other community members might have
> >> other concerns as well.
> >>
> >> After some thought, I believe having a separate project (what was
> >> mentioned here as Spark Extras) to handle Spark Connectors and Spark
> add-ons
> >> in general could be very beneficial to Spark and the overall Spark
> >> community, which would have a central place in Apache to collaborate
> around
> >> related Spark components.
> >>
> >> Some of the benefits on this approach
> >>
> >> - Enables maintaining the connectors inside Apache, following the Apache
> >> governance and release rules, while allowing Spark proper to focus on
> the
> >> core runtime.
> >> - Provides more flexibility in controlling the direction (currency) of
> the
> >> existing connectors (e.g. willing to find a solution and maintain
> multiple
> >> versions of same connectors like kafka 0.8x and 0.9x)
> >> - Becomes a home for other types of Spark related connectors helping
> >> expanding the community around Spark (e.g. Zeppelin see most of it's
> current
> >> contribution around new/enhanced connectors)
> >>
> >> What are some requirements for Spark Extras to be successful:
> >>
> >> - Be up to date with Spark Trunk APIs (based on daily CIs against
> >> SNAPSHOT)
> >> - Adhere to Spark release cycles (have a very little window compared to
> >> Spark release)
> >> - Be more open and flexible to the set of connectors it will accept and
> >> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we
> >> have today)
> >>
> >> Where to start Spark Extras
> >>
> >> Depending on the interest here, we could follow the steps of (Apache
> >> Arrow) and start this directly as a TLP, or start as an incubator
> project. I
> >> would consider the first option first.
> >>
> >> Who would participate
> >>
> >> Have thought about this for a bit, and if we go to the direction of
> TLP, I
> >> would say Spark Committers and Apache Members can request to
> participate as
> >> PMC members, while other committers can request to become committers.
> Non
> >> committers would be added based on meritocracy after the start of the
> >> project.
> >>
> >> Project Name
> >>
> >> It would be ideal if we could 

Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Sean Owen
Why would this need to be an ASF project of its own? I don't think
it's possible to have a yet another separate "Spark Extras" TLP (?)

There is already a project to manage these bits of code on Github. How
about all of the interested parties manage the code there, under the
same process, under the same license, etc?

I'm not against calling it Spark Extras myself but I wonder if that
needlessly confuses the situation. They aren't part of the Spark TLP
on purpose, so trying to give it some special middle-ground status
might just be confusing. The thing that comes to mind immediately is
"Connectors for Apache Spark", spark-connectors, etc.


On Fri, Apr 15, 2016 at 5:01 PM, Luciano Resende  wrote:
> After some collaboration with other community members, we have created a
> initial draft for Spark Extras which is available for review at
>
> https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing
>
> We would like to invite other community members to participate in the
> project, particularly the Spark Committers and PMC (feel free to express
> interest and I will update the proposal). Another option here is just to
> give ALL Spark committers write access to "Spark Extras".
>
>
> We also have couple asks from the Spark PMC :
>
> - Permission to use "Spark Extras" as the project name. We already checked
> this with Apache Brand Management, and the recommendation was to discuss and
> reach consensus with the Spark PMC.
>
> - We would also want to check with the Spark PMC that, in case of
> successfully creation of  "Spark Extras", if the PMC would be willing to
> continue the development of the remaining connectors that stayed in Spark
> 2.0 codebase in the "Spark Extras" project.
>
>
> Thanks in advance, and we welcome any feedback around this proposal before
> we present to the Apache Board for consideration.
>
>
>
> On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende 
> wrote:
>>
>> I believe some of this has been resolved in the context of some parts that
>> had interest in one extra connector, but we still have a few removed, and as
>> you mentioned, we still don't have a simple way or willingness to manage and
>> be current on new packages like kafka. And based on the fact that this
>> thread is still alive, I believe that other community members might have
>> other concerns as well.
>>
>> After some thought, I believe having a separate project (what was
>> mentioned here as Spark Extras) to handle Spark Connectors and Spark add-ons
>> in general could be very beneficial to Spark and the overall Spark
>> community, which would have a central place in Apache to collaborate around
>> related Spark components.
>>
>> Some of the benefits on this approach
>>
>> - Enables maintaining the connectors inside Apache, following the Apache
>> governance and release rules, while allowing Spark proper to focus on the
>> core runtime.
>> - Provides more flexibility in controlling the direction (currency) of the
>> existing connectors (e.g. willing to find a solution and maintain multiple
>> versions of same connectors like kafka 0.8x and 0.9x)
>> - Becomes a home for other types of Spark related connectors helping
>> expanding the community around Spark (e.g. Zeppelin see most of it's current
>> contribution around new/enhanced connectors)
>>
>> What are some requirements for Spark Extras to be successful:
>>
>> - Be up to date with Spark Trunk APIs (based on daily CIs against
>> SNAPSHOT)
>> - Adhere to Spark release cycles (have a very little window compared to
>> Spark release)
>> - Be more open and flexible to the set of connectors it will accept and
>> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we
>> have today)
>>
>> Where to start Spark Extras
>>
>> Depending on the interest here, we could follow the steps of (Apache
>> Arrow) and start this directly as a TLP, or start as an incubator project. I
>> would consider the first option first.
>>
>> Who would participate
>>
>> Have thought about this for a bit, and if we go to the direction of TLP, I
>> would say Spark Committers and Apache Members can request to participate as
>> PMC members, while other committers can request to become committers. Non
>> committers would be added based on meritocracy after the start of the
>> project.
>>
>> Project Name
>>
>> It would be ideal if we could have a project name that shows close ties to
>> Spark (e.g. Spark Extras or Spark Connectors) but we will need permission
>> and support from whoever is going to evaluate the project proposal (e.g.
>> Apache Board)
>>
>>
>> Thoughts ?
>>
>> Does anyone have any big disagreement or objection to moving into this
>> direction ?
>>
>> Otherwise, who would be interested in joining the project, so I can start
>> working on some concrete proposal ?
>>
>>
>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/


Re: Creating Spark Extras project, was Re: SPARK-13843 and future of streaming backends

2016-04-15 Thread Luciano Resende
After some collaboration with other community members, we have created an
initial draft for Spark Extras, which is available for review at

https://docs.google.com/document/d/1zRFGG4414LhbKlGbYncZ13nyX34Rw4sfWhZRA5YBtIE/edit?usp=sharing

We would like to invite other community members to participate in the
project, particularly the Spark Committers and PMC (feel free to express
interest and I will update the proposal). Another option here is just to
give ALL Spark committers write access to "Spark Extras".


We also have a couple of asks of the Spark PMC:

- Permission to use "Spark Extras" as the project name. We already checked
this with Apache Brand Management, and the recommendation was to discuss
and reach consensus with the Spark PMC.

- We would also like to check with the Spark PMC whether, in case of
successful creation of "Spark Extras", the PMC would be willing to
continue the development of the remaining connectors that stayed in the
Spark 2.0 codebase in the "Spark Extras" project.


Thanks in advance, and we welcome any feedback on this proposal before
we present it to the Apache Board for consideration.



On Sat, Mar 26, 2016 at 10:07 AM, Luciano Resende 
wrote:

> I believe some of this has been resolved in the context of some parts that
> had interest in one extra connector, but we still have a few removed, and
> as you mentioned, we still don't have a simple way or willingness to manage
> and be current on new packages like kafka. And based on the fact that this
> thread is still alive, I believe that other community members might have
> other concerns as well.
>
> After some thought, I believe having a separate project (what was
> mentioned here as Spark Extras) to handle Spark Connectors and Spark
> add-ons in general could be very beneficial to Spark and the overall Spark
> community, which would have a central place in Apache to collaborate around
> related Spark components.
>
> Some of the benefits on this approach
>
> - Enables maintaining the connectors inside Apache, following the Apache
> governance and release rules, while allowing Spark proper to focus on the
> core runtime.
> - Provides more flexibility in controlling the direction (currency) of the
> existing connectors (e.g. willing to find a solution and maintain multiple
> versions of same connectors like kafka 0.8x and 0.9x)
> - Becomes a home for other types of Spark related connectors helping
> expanding the community around Spark (e.g. Zeppelin see most of it's
> current contribution around new/enhanced connectors)
>
> What are some requirements for Spark Extras to be successful:
>
> - Be up to date with Spark Trunk APIs (based on daily CIs against SNAPSHOT)
> - Adhere to Spark release cycles (have a very little window compared to
> Spark release)
> - Be more open and flexible to the set of connectors it will accept and
> maintain (e.g. also handle multiple versions like the kafka 0.9 issue we
> have today)
>
> Where to start Spark Extras
>
> Depending on the interest here, we could follow the steps of (Apache
> Arrow) and start this directly as a TLP, or start as an incubator project.
> I would consider the first option first.
>
> Who would participate
>
> Have thought about this for a bit, and if we go to the direction of TLP, I
> would say Spark Committers and Apache Members can request to participate as
> PMC members, while other committers can request to become committers. Non
> committers would be added based on meritocracy after the start of the
> project.
>
> Project Name
>
> It would be ideal if we could have a project name that shows close ties to
> Spark (e.g. Spark Extras or Spark Connectors) but we will need permission
> and support from whoever is going to evaluate the project proposal (e.g.
> Apache Board)
>
>
> Thoughts ?
>
> Does anyone have any big disagreement or objection to moving into this
> direction ?
>
> Otherwise, who would be interested in joining the project, so I can start
> working on some concrete proposal ?
>
>
>



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Skipping Type Conversion and using InternalRows for UDF

2016-04-15 Thread Hamel Kothari
Hi all,

So we have these UDFs which take <1ms to run, yet we're seeing pretty poor
performance around them in practice, with the overhead being >10ms for the
projections (this data is deeply nested with ArrayTypes and MapTypes, so
that could be the cause). Looking at the logs and the code for ScalaUDF, I
noticed that there is a series of projections which take place before and
after the call in order to make the rows safe and then unsafe again. Is there
any way to opt out of this and input/return InternalRows to skip the
performance hit of the type conversion? It doesn't immediately appear to be
possible, but I'd like to make sure that I'm not missing anything.
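
For context, a minimal sketch of the kind of setup where the overhead shows
up (the SparkSession `spark`, the table, and the column names are placeholders
rather than our actual job); printing the plan makes the extra projections
around ScalaUDF visible:

import org.apache.spark.sql.functions.{col, udf}

// Minimal sketch, not our real pipeline: a trivial UDF over a nested column.
// Assumes a SparkSession named `spark` and a table "events" with an
// array<string> column called "items" (both hypothetical).
val itemCount = udf((xs: Seq[String]) => xs.size)            // <1ms of actual work
val df = spark.table("events")
val withCount = df.withColumn("n", itemCount(col("items")))

// The physical plan shows the safe/unsafe conversions and projections
// inserted around the ScalaUDF call.
withCount.explain(true)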

I suspect we could make this possible by checking whether the TypeTags in the
register function are all internal types and, if they are, passing a false
value for "needs[Input|Output]Conversion" to ScalaUDF, and then in ScalaUDF
checking for that flag to figure out whether the conversion process needs to
take place. We're still left with the issue of a missing schema in the case
of outputting InternalRows, but we could expose the DataType parameter
rather than inferring it in the register function. Is there anything else
in the code that would prevent this from working?

Regards,
Hamel


Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with
the forced true value for unaligned ?

If the test failed, please pastebin the failure(s).

Thanks

On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts  wrote:

> Ted, yep I'm working from the latest code which includes that unaligned
> check; for experimenting I've modified that code to ignore the unaligned
> check (just go ahead and say we support it anyway, even though our JDK
> returns false: the return value of java.nio.Bits.unaligned()).
>
> My Platform.java for testing contains:
>
> private static final boolean unaligned;
>
> static {
>   boolean _unaligned;
>   // use reflection to access unaligned field
>   try {
>     System.out.println("Checking unaligned support");
>     Class bitsClass =
>         Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader());
>     Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
>     unalignedMethod.setAccessible(true);
>     _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
>     System.out.println("Used reflection and _unaligned is: " + _unaligned);
>     System.out.println("Setting to true anyway for experimenting");
>     _unaligned = true;
>   } catch (Throwable t) {
>     // We at least know x86 and x64 support unaligned access.
>     String arch = System.getProperty("os.arch", "");
>     //noinspection DynamicRegexReplaceableByCompiledPattern
>     // We don't actually get here since we find the unaligned method OK
>     // and it returns false (I override with true anyway),
>     // but add s390x in case we somehow fail anyway.
>     System.out.println("Checking for s390x, os.arch is: " + arch);
>     _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$");
>   }
>   unaligned = _unaligned;
>   System.out.println("returning: " + unaligned);
> }
>
> Output is, as you'd expect, "used reflection and _unaligned is false,
> setting to true anyway for experimenting", and the tests pass.
>
> No other problems on the platform (pending a different pull request).
>
> Cheers,
>
>
>
>
>
>
>
> From: Ted Yu
> To: Adam Roberts/UK/IBM@IBMGB
> Cc: "dev@spark.apache.org"
> Date: 15/04/2016 15:32
> Subject: Re: BytesToBytes and unaligned memory
> --
>
>
>
> I assume you tested 2.0 with SPARK-12181 .
>
> Related code from Platform.java if java.nio.Bits#unaligned() throws
> exception:
>
>   // We at least know x86 and x64 support unaligned access.
>   String arch = System.getProperty("os.arch", "");
>   //noinspection DynamicRegexReplaceableByCompiledPattern
>   _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$");
>
> Can you give us some detail on how the code runs for JDKs on zSystems ?
>
> Thanks
>
> On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts <*arobe...@uk.ibm.com*
> > wrote:
> Hi, I'm testing Spark 2.0.0 on various architectures and have a question,
> are we sure if
> *core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java*
> 
> really is attempting to use unaligned memory access (for the
> BytesToBytesMapOffHeapSuite tests specifically)?
>
> Our JDKs on zSystems for example return false for the
> java.nio.Bits.unaligned() method and yet if I skip this check and add s390x
> to the supported architectures (for zSystems), all thirteen tests here
> pass.
>
> The 13 tests here all fail as we do not pass the unaligned requirement
> (but perhaps incorrectly):
>
> *core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java*
> 
> and I know the unaligned checking is at
> *common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java*
> 
>
> Either our JDK's method is returning false incorrectly or this test isn't
> using unaligned memory access (so the requirement is invalid), there's no
> mention of alignment in the test itself.
>
> Any guidance would be very much appreciated, cheers
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>


Re: BytesToBytes and unaligned memory

2016-04-15 Thread Ted Yu
I assume you tested 2.0 with SPARK-12181 .

Related code from Platform.java if java.nio.Bits#unaligned() throws
exception:

  // We at least know x86 and x64 support unaligned access.
  String arch = System.getProperty("os.arch", "");
  //noinspection DynamicRegexReplaceableByCompiledPattern
  _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$");

Can you give us some detail on how the code runs for JDKs on zSystems ?

Thanks

On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts  wrote:

> Hi, I'm testing Spark 2.0.0 on various architectures and have a question,
> are we sure if
> core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
> 
> really is attempting to use unaligned memory access (for the
> BytesToBytesMapOffHeapSuite tests specifically)?
>
> Our JDKs on zSystems for example return false for the
> java.nio.Bits.unaligned() method and yet if I skip this check and add s390x
> to the supported architectures (for zSystems), all thirteen tests here
> pass.
>
> The 13 tests here all fail as we do not pass the unaligned requirement
> (but perhaps incorrectly):
>
> core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java
> 
> and I know the unaligned checking is at
> common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
> 
>
> Either our JDK's method is returning false incorrectly or this test isn't
> using unaligned memory access (so the requirement is invalid), there's no
> mention of alignment in the test itself.
>
> Any guidance would be very much appreciated, cheers
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>


Unable to access Resource Manager /Name Node on port 9026 / 9101 on a Spark EMR Cluster

2016-04-15 Thread Chadha Pooja
Hi ,


We have setup a Spark Cluster (3 node) on Amazon EMR.

We aren't able to use port 9026 and 9101 on the existing Spark EMR Cluster 
which are part of the Web UIs offered with Amazon EMR. I was able to use other 
ports like Zeppelin port, 8890, HUE etc

We checked that the security settings currently are open to everyone, and it is 
not an issue with security.

URLs

Hadoop ResourceManager

http://master-node-IP:9026/

Hadoop HDFS NameNode

http://master-node-IP:9101/


Errors Observed on Fiddler:
Port 9026:
[Fiddler] The connection to 'masternodeIP' failed.
Error: TimedOut (0x274c).
System.Net.Sockets.SocketException A connection attempt failed because the 
connected party did not properly respond after a period of time, or established 
connection failed because connected host has failed to respond 
<>:9026

Port 9101:
[Fiddler] The connection to <>: failed.
Error: TimedOut (0x274c).
System.Net.Sockets.SocketException A connection attempt failed because the 
connected party did not properly respond after a period of time, or established 
connection failed because connected host has failed to respond 
<>:9101

Does anyone have any experiences or pointers? Appreciate your help!


Thanks!

__
The Boston Consulting Group, Inc.
 
This e-mail message may contain confidential and/or privileged information.
If you are not an addressee or otherwise authorized to receive this message,
you should not use, copy, disclose or take any action based on this e-mail or
any information contained in the message. If you have received this material
in error, please advise the sender immediately by reply e-mail and delete this
message. Thank you.


BytesToBytes and unaligned memory

2016-04-15 Thread Adam Roberts
Hi, I'm testing Spark 2.0.0 on various architectures and have a question, 
are we sure if 
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java
 
really is attempting to use unaligned memory access (for the 
BytesToBytesMapOffHeapSuite tests specifically)?

Our JDKs on zSystems for example return false for the 
java.nio.Bits.unaligned() method and yet if I skip this check and add 
s390x to the supported architectures (for zSystems), all thirteen tests 
here pass. 

The 13 tests here all fail as we do not pass the unaligned requirement 
(but perhaps incorrectly):
core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java 
and I know the unaligned checking is at 
common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java

Either our JDK's method is returning false incorrectly or this test isn't 
using unaligned memory access (so the requirement is invalid), there's no 
mention of alignment in the test itself.

Any guidance would be very much appreciated, cheers
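
For reference, a minimal sketch of a standalone check (the object name is
arbitrary) that mirrors the reflection in Platform.java and prints what a
given JDK reports:

// Minimal sketch: query java.nio.Bits.unaligned() via reflection, the same
// way Platform.java does, and print the result alongside os.arch.
object UnalignedCheck {
  def main(args: Array[String]): Unit = {
    val bits = Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader)
    val unalignedMethod = bits.getDeclaredMethod("unaligned")
    unalignedMethod.setAccessible(true)
    val reported = unalignedMethod.invoke(null)
    println(s"os.arch=${System.getProperty("os.arch")}, java.nio.Bits.unaligned()=$reported")
  }
}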


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: Should localProperties be inheritable? Should we change that or document it?

2016-04-15 Thread Marcin Tustin
It would be a pleasure. That said, what do you think about adding the
non-inheritable feature? I think that would be a big win for everything
that doesn't specifically need inheritability.
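
To make the inheritance behaviour concrete, here is a minimal sketch
(assuming an existing SparkContext named `sc`; the property key is made up)
showing a freshly spawned thread picking up a property it never set:

// Minimal sketch: a local property set on the parent thread is visible in a
// child thread spawned afterwards, because localProperties is backed by an
// InheritableThreadLocal. Assumes an existing SparkContext `sc`.
sc.setLocalProperty("demo.prop", "set-on-parent")   // "demo.prop" is a made-up key

val child = new Thread(new Runnable {
  override def run(): Unit = {
    // Prints "set-on-parent" even though this thread never called setLocalProperty.
    println("child sees: " + sc.getLocalProperty("demo.prop"))
  }
})
child.start()
child.join()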

On Friday, April 15, 2016, Reynold Xin  wrote:

> I think this was added a long time ago by me in order to make certain
> things work for Shark (good old times ...). You are probably right that by
> now some apps depend on the fact that this is inheritable, and changing
> that could break them in weird ways.
>
> Do you mind documenting this, and also adding a test case?
>
>
> On Wed, Apr 13, 2016 at 6:15 AM, Marcin Tustin  > wrote:
>
>> *Tl;dr: *SparkContext.setLocalProperty is implemented with
>> InheritableThreadLocal.
>> This has unexpected consequences, not least because the method
>> documentation doesn't say anything about it:
>>
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L605
>>
>> I'd like to propose that we do one of: (1) document explicitly that these
>> properties are inheritable; (2) stop them being inheritable; or (3)
>> introduce the option to set these in a non-inheritable way.
>>
>> *Motivation: *This started with me investigating a last vestige of the
>> leaking spark.sql.execution.id issue in Spark 1.5.2 (it's not
>> reproducible under controlled conditions, and given the many and excellent
>> fixes on this issue it's completely mysterious that this hangs around; the
>> bug itself is largely beside the point).
>>
>> The specific contribution that inheritable localProperties makes to this
>> problem is that if a localProperty like spark.sql.execution.id leaks
>> (i.e. remains set when it shouldn't) because those properties are inherited
>> by spawned threads, that pollution affects all subsequently spawned threads.
>>
>> This doesn't sound like a big deal - why would worker threads be spawning
>> other threads? It turns out that Java's ThreadPoolExecutor has worker
>> threads spawn other worker threads (it has no master dispatcher thread; the
>> workers themselves run all the housekeeping). JavaDoc here:
>> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
>> and source code here:
>> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ThreadPoolExecutor.java#ThreadPoolExecutor
>>
>> Accordingly, if using Scala Futures and any kind of thread pool that
>> comes built-in with Java, it's impossible to avoid localproperties
>> propagating haphazardly to different threads. For localProperties
>> explicitly set by user code this isn't nice, and requires work arounds like
>> explicitly clearing known properties at the start of every future, or in a
>> beforeExecute hook on the threadpool. For leaky properties the work around
>> is pretty much the same: defensively clear them in the threadpool.
>>
>> *Options:*
>> (0) Do nothing at all. Unattractive, because documenting this would still
>> be better;
>> (1) Update the scaladoc to explicitly say that localProperties are
>> inherited by spawned threads and note that caution should be exercised with
>> thread pools.
>> (2) Switch to using ordinary, non-inheritable thread locals. I assume
>> this would break something for somebody, but if not, this would be my
>> preferred option. Also a very simple change to implement if no-one is
>> relying on property inheritance.
>> (3) Introduce a second localProperty facility which is not inherited.
>> This would not break any existing code, and should not be too hard to
>> implement. localProperties which need cleanup could be migrated to using
>> this non-inheritable facility, helping to limit the impact of failing to
>> clean up.
>> The way I envisage this working is that non-inheritable localProperties
>> would be checked first, then inheritable, then global properties.
>>
>> *Actions:*
>> I'm happy to do the coding and open such Jira tickets as desirable or
>> necessary. Before I do any of that, I'd like to know if there's any support
>> for this, and ideally secure a committer who can help shepherd this change
>> through.
>>
>> Marcin Tustin
>>
> Want to work at Handy? Check out our culture deck and open roles
> Latest news at Handy
> Handy just raised $50m, led by Fidelity
>>
>>
>

-- 
Want to work at Handy? Check out our culture deck and open roles
Latest news at Handy
Handy just raised $50m, led by Fidelity



Re: Should localProperties be inheritable? Should we change that or document it?

2016-04-15 Thread Reynold Xin
I think this was added a long time ago by me in order to make certain
things work for Shark (good old times ...). You are probably right that by
now some apps depend on the fact that this is inheritable, and changing
that could break them in weird ways.

Do you mind documenting this, and also adding a test case?


On Wed, Apr 13, 2016 at 6:15 AM, Marcin Tustin 
wrote:

> *Tl;dr: *SparkContext.setLocalProperty is implemented with
> InheritableThreadLocal.
> This has unexpected consequences, not least because the method
> documentation doesn't say anything about it:
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L605
>
> I'd like to propose that we do one of: (1) document explicitly that these
> properties are inheritable; (2) stop them being inheritable; or (3)
> introduce the option to set these in a non-inheritable way.
>
> *Motivation: *This started with me investigating a last vestige of the
> leaking spark.sql.execution.id issue in Spark 1.5.2 (it's not
> reproducible under controlled conditions, and given the many and excellent
> fixes on this issue it's completely mysterious that this hangs around; the
> bug itself is largely beside the point).
>
> The specific contribution that inheritable localProperties makes to this
> problem is that if a localProperty like spark.sql.execution.id leaks
> (i.e. remains set when it shouldn't) because those properties are inherited
> by spawned threads, that pollution affects all subsequently spawned threads.
>
> This doesn't sound like a big deal - why would worker threads be spawning
> other threads? It turns out that Java's ThreadPoolExecutor has worker
> threads spawn other worker threads (it has no master dispatcher thread; the
> workers themselves run all the housekeeping). JavaDoc here:
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ThreadPoolExecutor.html
> and source code here:
> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ThreadPoolExecutor.java#ThreadPoolExecutor
>
> Accordingly, if using Scala Futures and any kind of thread pool that comes
> built-in with Java, it's impossible to avoid localproperties propagating
> haphazardly to different threads. For localProperties explicitly set by
> user code this isn't nice, and requires work arounds like explicitly
> clearing known properties at the start of every future, or in a
> beforeExecute hook on the threadpool. For leaky properties the work around
> is pretty much the same: defensively clear them in the threadpool.
>
> *Options:*
> (0) Do nothing at all. Unattractive, because documenting this would still
> be better;
> (1) Update the scaladoc to explicitly say that localProperties are
> inherited by spawned threads and note that caution should be exercised with
> thread pools.
> (2) Switch to using ordinary, non-inheritable thread locals. I assume this
> would break something for somebody, but if not, this would be my preferred
> option. Also a very simple change to implement if no-one is relying on
> property inheritance.
> (3) Introduce a second localProperty facility which is not inherited. This
> would not break any existing code, and should not be too hard to implement.
> localProperties which need cleanup could be migrated to using this
> non-inheritable facility, helping to limit the impact of failing to clean
> up.
> The way I envisage this working is that non-inheritable localProperties
> would be checked first, then inheritable, then global properties.
>
> *Actions:*
> I'm happy to do the coding and open such Jira tickets as desirable or
> necessary. Before I do any of that, I'd like to know if there's any support
> for this, and ideally secure a committer who can help shepherd this change
> through.
>
> Marcin Tustin
>
> Want to work at Handy? Check out our culture deck and open roles
> Latest news at Handy
> Handy just raised $50m, led by Fidelity
>
>
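
As a concrete illustration of the beforeExecute workaround mentioned above,
a minimal sketch (assuming an existing SparkContext named `sc`; the pool
sizing is arbitrary) of a thread pool that defensively clears a known-leaky
property before running each task:

import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Minimal sketch: clear a known local property before every task, as a
// defence against values inherited from whichever worker thread happened to
// spawn the current one. Assumes an existing SparkContext `sc`, and that
// setLocalProperty(key, null) removes the property, as the SparkContext
// scaladoc describes.
val pool = new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS,
    new LinkedBlockingQueue[Runnable]()) {
  override def beforeExecute(t: Thread, r: Runnable): Unit = {
    // beforeExecute runs on the worker thread itself, so this clears the
    // (possibly inherited) property for that thread before the task starts.
    sc.setLocalProperty("spark.sql.execution.id", null)
    super.beforeExecute(t, r)
  }
}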