Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
cc Matthias

In the past we talked with Matthias and there were some discussions about
this.

On Thu, Nov 6, 2014 at 11:34 AM, York, Brennon 
wrote:

> All, was wondering if there had been any discussion around this topic yet?
> TinkerPop  is a great abstraction for graph
> databases and has been implemented across various graph database backends
> / gaining traction. Has anyone thought about integrating the TinkerPop
> framework with GraphX to enable GraphX as another backend? Not sure if
> this has been brought up or not, but would certainly volunteer to
> spearhead this effort if the community thinks it to be a good idea!
>
> As an aside, wasn't sure if this discussion should happen on the board
> here or on JIRA, but I made a ticket as well for reference:
> https://issues.apache.org/jira/browse/SPARK-4279
>
> 
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed.  If the reader of this message is not the
> intended recipient, you are hereby notified that any review,
> retransmission, dissemination, distribution, copying or other use of, or
> taking of any action in reliance upon this information is strictly
> prohibited. If you have received this communication in error, please
> contact the sender and delete the material from your computer.
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Implementing TinkerPop on top of GraphX

2014-11-06 Thread Reynold Xin
Some form of graph querying support would be great to have. This can be a
great community project hosted outside of Spark initially, both due to the
maturity of the component itself as well as the maturity of query language
standards (there isn't really a dominant standard for graph ql).

One thing is that GraphX API will need to evolve and probably need to
provide more primitives in order to support the new ql implementation.
There might also be inherent mismatches in the way the external API is
defined vs what GraphX can support. We should discuss those on a
case-by-case basis.


On Thu, Nov 6, 2014 at 5:42 PM, Kyle Ellrott  wrote:

> I think it's best to look to an existing standard rather than try to make your
> own. Of course small additions would need to be added to make it valuable
> for the Spark community, like a method similar to Gremlin's 'table'
> function, that produces an RDD instead.
> But there may be a lot of extra code and data structures that would need
> to be added to make it work, and those may not be directly applicable to
> all GraphX users. I think it would be best run as a separate module/project
> that builds directly on top of GraphX.
>
> Kyle
>
>
>
> On Thu, Nov 6, 2014 at 4:39 PM, York, Brennon  > wrote:
>
>> My personal 2c is that, since GraphX is just beginning to provide a full
>> featured graph API, I think it would be better to align with the TinkerPop
> >> group rather than roll our own. In my mind the benefits outweigh the
>> detriments as follows:
>>
>> Benefits:
>> * GraphX gains the ability to become another core tenant within the
>> TinkerPop community allowing a more diverse group of users into the Spark
>> ecosystem.
>> * TinkerPop can continue to maintain and own a solid / feature-rich graph
>> API that has already been accepted by a wide audience, relieving the
>> pressure of “one off” API additions from the GraphX team.
>> * GraphX can demonstrate its ability to be a key player in the GraphDB
>> space sitting inline with other major distributions (Neo4j, Titan, etc.).
>> * Allows for the abstract graph traversal logic (query API) to be owned
>> and maintained by a group already proven on the topic.
>>
>> Drawbacks:
>> * GraphX doesn’t own the API for its graph query capability. This could
>> be seen as good or bad, but it might make GraphX-specific implementation
>> additions more tricky (possibly). Also, GraphX will need to maintain the
>> features described within the TinkerPop API as that might change in the
>> future.
>>
>> From: Kushal Datta 
>> Date: Thursday, November 6, 2014 at 4:00 PM
>> To: "York, Brennon" 
>> Cc: Kyle Ellrott , Reynold Xin <
>> r...@databricks.com>, "dev@spark.apache.org" ,
>> Matthias Broecheler 
>>
>> Subject: Re: Implementing TinkerPop on top of GraphX
>>
>> Before we dive into the implementation details, what are the high level
>> thoughts on Gremlin/GraphX? Scala already provides the procedural way to
>> query graphs in GraphX today. So, today I can run
>> g.vertices().filter().join() queries as OLAP in GraphX just like Tinkerpop3
>> Gremlin, of course sans the useful operators that Gremlin offers such as
>> outE, inE, loop, as, dedup, etc. In that case is mapping Gremlin operators
>> to GraphX api's a better approach or should we extend the existing set of
>> transformations/actions that GraphX already offers with the useful
>> operators from Gremlin? For example, we add as(), loop() and dedup()
>> methods in VertexRDD and EdgeRDD.
>>
>> Either way we get a desperately needed graph query interface in GraphX.
>>
>> On Thu, Nov 6, 2014 at 3:25 PM, York, Brennon <
>> brennon.y...@capitalone.com> wrote:
>>
>>> This was my thought exactly with the TinkerPop3 release. Looks like, to
>>> move this forward, we’d need to implement gremlin-core per <
>>> http://www.tinkerpop.com/docs/3.0.0.M1/#_implementing_gremlin_core>.
>>> The real question lies in whether GraphX can only support the OLTP
>>> functionality, or if we can bake into it the OLAP requirements as well. At
>>> a first glance I believe we could create an entire OLAP system. If so, I
>>> believe we could do this in a set of parallel subtasks, those being the
>>> implementation of each of the individual API’s (Structure, Process, and, if
>>> OLAP, GraphComputer) necessary for gremlin-core. Thoughts?
>>>
>>>
>>> From: Kyle Ellrott 
>>> Date: Thursday, November 6, 2014 at 12:10 PM
>>> To: Kushal Datta 
>>> Cc: Reynold Xin , "York, Brennon" <

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Reynold Xin
Greg,

Thanks a lot for commenting on this, but I feel we are splitting hairs
here. Matei did mention -1, followed by "or give feedback". The original
process outlined by Matei was exactly about review, rather than fighting.
Nobody wants to spend their energy fighting.  Everybody is doing it to
improve the project.


In particular, quoting you in your email

"Be careful here. "Responsibility" is pretty much a taboo word. All of
Apache is a group of volunteers. People can disappear at any point, which
is why you need multiple (as my fellow Director warned, on your private
list). And multiple people can disappear."

Take a look at this page: http://www.apache.org/dev/pmc.html

This Project Management Committee Guide outlines the general
***responsibilities*** of PMC members in managing their projects.

Are you suggesting the wording used by the PMC guideline itself is taboo?





On Thu, Nov 6, 2014 at 11:27 PM, Greg Stein  wrote:

> [last reply for tonite; let others read; and after the next drink or three,
> I shouldn't be replying...]
>
> On Thu, Nov 6, 2014 at 11:38 PM, Matei Zaharia 
> wrote:
>
> > Alright, Greg, I think I understand how Subversion's model is different,
> > which is that the PMC members are all full committers. However, I still
> > think that the model proposed here is purely organizational (how the PMC
> > and committers organize themselves), and in no way changes peoples'
> > ownership or rights.
>
>
> That was not my impression, when your proposal said that maintainers need
> to provide "sign-off".
>
> Okay. Now my next item of feedback starts here:
>
>
> > Certainly the reason I proposed it was organizational, to make sure
> > patches get seen by the right people. I believe that every PMC member
> still
> > has the same responsibility for two reasons:
> >
> > 1) The PMC is actually what selects the maintainers, so basically this
> > mechanism is a way for the PMC to make sure certain people review each
> > patch.
> >
> > 2) Code changes are all still made by consensus, where any individual has
> > veto power over the code. The maintainer model mentioned here is only
> meant
> > to make sure that the "experts" in an area get to see each patch *before*
> > it is merged, and choose whether to exercise their veto power.
> >
> > Let me give a simple example, which is a patch to the Spark core public
> > API. Say I'm a maintainer in this API. Without the maintainer model, the
> > decision on the patch would be made as follows:
> >
> > - Any committer could review the patch and merge it
> > - At any point during this process, I (as the main expert on this) could
> > come in and -1 it, or give feedback
> > - In addition, any other committer beyond me is allowed to -1 this patch
> >
> > With the maintainer model, the process is as follows:
> >
> > - Any committer could review the patch and merge it, but they would need
> > to forward it to me (or another core API maintainer) to make sure we also
> > approve
> > - At any point during this process, I could come in and -1 it, or give
> > feedback
> > - In addition, any other committer beyond me is still allowed to -1 this
> > patch
> >
> > The only change in this model is that committers are responsible to
> > forward patches in these areas to certain other committers. If every
> > committer had perfect oversight of the project, they could have also seen
> > every patch to their component on their own, but this list ensures that
> > they see it even if they somehow overlooked it.
> >
> > It's true that technically this model might "gate" development in the
> > sense of adding some latency, but it doesn't "gate" it any more than
> > consensus as a whole does, where any committer (not even PMC member) can
> -1
> > any code change. In fact I believe this will speed development by
> > motivating the maintainers to be active in reviewing their areas and by
> > reducing the chance that mistakes happen that require a revert.
> >
> > I apologize if this wasn't clear in any way, but I do think it's pretty
> > clear in the original wording of the proposal. The sign-off by a
> maintainer
> > is simply an extra step in the merge process, it does *not* mean that
> other
> > committers can't -1 a patch, or that the maintainers get to review all
> > patches, or that they somehow have more "ownership" of the component
> (since
> > they already had the ability to -1). I also wanted to clarify another
> thing
> > -- it seems there is a misunderstanding that only PMC members can be
> > maintainers, but this was not the point; the PMC *assigns* maintainers
> but
> > they can do it out of the whole committer pool (and if we move to
> > separating the PMC from the committers, I fully expect some non-PMC
> > committers to be made maintainers).
> >
>
> ... and ends here.
>
> All of that text is about a process for applying Vetoes. ... That is just
> the wrong focus (IMO).
>
> Back around 2000, in httpd, we ran into vetoes. It was horrible. The
> community suffered. We

Re: proposal / discuss: multiple Serializers within a SparkContext?

2014-11-07 Thread Reynold Xin
Technically you can already use a custom serializer for each shuffle operation
(it is part of the ShuffledRDD). I've seen Matei suggest in the past, on JIRA
issues (or GitHub), a "storage policy" in which you can specify how
data should be stored. I think that would be a great API to have in the
long run. Designing it won't be trivial though.
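
(For reference, a rough sketch of that per-shuffle hook, assuming the 1.x
ShuffledRDD constructor and setSerializer signatures; this is only an
illustration, not a recommendation to build ShuffledRDDs by hand:)

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.ShuffledRDD
import org.apache.spark.serializer.KryoSerializer

object PerShuffleSerializerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("per-shuffle-serializer")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

    // Construct the shuffle directly and attach a serializer to this one
    // shuffle only, independent of the global spark.serializer setting.
    val shuffled = new ShuffledRDD[Int, Int, Int](pairs, new HashPartitioner(4))
      .setSerializer(new KryoSerializer(conf))

    println(shuffled.count())
    sc.stop()
  }
}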


On Fri, Nov 7, 2014 at 1:05 AM, Sandy Ryza  wrote:

> Hey all,
>
> Was messing around with Spark and Google FlatBuffers for fun, and it got me
> thinking about Spark and serialization.  I know there's been work / talk
> about in-memory columnar formats for Spark SQL, so maybe there are ways to
> provide this flexibility already that I've missed?  Either way, my
> thoughts:
>
> Java and Kryo serialization are really nice in that they require almost no
> extra work on the part of the user.  They can also represent complex object
> graphs with cycles etc.
>
> There are situations where other serialization frameworks are more
> efficient:
> * A Hadoop Writable style format that delineates key-value boundaries and
> allows for raw comparisons can greatly speed up some shuffle operations by
> entirely avoiding deserialization until the object hits user code.
> Writables also probably ser / deser faster than Kryo.
> * "No-deserialization" formats like FlatBuffers and Cap'n Proto address the
> tradeoff between (1) Java objects that offer fast access but take lots of
> space and stress GC and (2) Kryo-serialized buffers that are more compact
> but take time to deserialize.
>
> The drawbacks of these frameworks are that they require more work from the
> user to define types.  And that they're more restrictive in the reference
> graphs they can represent.
>
> In large applications, there are probably a few points where a
> "specialized" serialization format is useful. But requiring Writables
> everywhere because they're needed in a particularly intense shuffle is
> cumbersome.
>
> In light of that, would it make sense to enable varying Serializers within
> an app? It could make sense to choose a serialization framework both based
> on the objects being serialized and what they're being serialized for
> (caching vs. shuffle).  It might be possible to implement this underneath
> the Serializer interface with some sort of multiplexing serializer that
> chooses between subserializers.
>
> Nothing urgent here, but curious to hear other's opinions.
>
> -Sandy
>


Re: Terasort example

2014-11-11 Thread Reynold Xin
This is great. I think the consensus from last time was that we would put
performance stuff into spark-perf, so it is easy to test different Spark
versions.


On Tue, Nov 11, 2014 at 5:03 AM, Ewan Higgs  wrote:

> Hi all,
> I saw that Reynold Xin had a Terasort example PR on Github[1]. It didn't
> appear to be similar to the Hadoop Terasort example, so I've tried to brush
> it into shape so it can generate Terasort files (teragen), sort the files
> (terasort) and validate the files (teravalidate). My branch is available
> here:
>
> https://github.com/ehiggs/spark/tree/terasort
>
> With this code, you can run the following:
>
> # Generate 1M 100 byte records:
>  ./bin/run-example terasort.TeraGen 100M ~/data/terasort_in
>
> # Sort the file:
> MASTER=local[4] ./bin/run-example terasort.TeraSort ~/data/terasort_in
> ~/data/terasort_out
>
> # Validate the file
> MASTER=local[4] ./bin/run-example terasort.TeraValidate
> ~/data/terasort_out  ~/data/terasort_validate
>
> # Validate that an unsorted file is indeed not correctly sorted:
>
> MASTER=local[4] ./bin/run-example terasort.TeraValidate
> ~/data/terasort_in  ~/data/terasort_validate_bad
>
> This matches the interface for the Hadoop version of Terasort, except I
> added the ability to use K,M,G,T for record sizes in TeraGen. This code
> therefore makes a good example of how to use Spark, how to read and write
> Hadoop files, and also a way to test some of the performance claims of
> Spark.
>
> > That's great, but why is this on the mailing list and not submitted as a
> PR?
>
> I suspect there are some rough edges and I'd really appreciate reviews. I
> would also like to know if others can try it out on clusters and tell me if
> it's performing as it should.
>
> For example, I find it runs fine on my local machine, but when I try to
> sort 100G of data on a cluster of 16 nodes, I get >2900 file splits. This
> really eats into the sort time.
>
> Another issue is that in TeraValidate, to work around SPARK-1018 I had to
> clone each element. Does this /really/ need to be done? It's pretty lame.
>
> In any event, I know the Spark 1.2 merge window closed on Friday but as
> this is only for the examples directory maybe we can slip it in if we can
> bash it into shape quickly enough?
>
> Anyway, thanks to everyone on #apache-spark and #scala who helped me get
> through learning some rudimentary Scala to get this far.
>
> Yours,
> Ewan Higgs
>
> [1] https://github.com/apache/spark/pull/1242
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
Do people usually import o.a.spark.rdd._ ?

Also in order to maintain source and binary compatibility, we would need to
keep both right?


On Thu, Nov 6, 2014 at 3:12 AM, Shixiong Zhu  wrote:

> I saw many people asked how to convert a RDD to a PairRDDFunctions. I would
> like to ask a question about it. Why not put the following implicit into
> "pacakge object rdd" or "object rdd"?
>
>   implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
>   (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
> = {
> new PairRDDFunctions(rdd)
>   }
>
> If so, the converting will be automatic and not need to
> import org.apache.spark.SparkContext._
>
> I tried to search some discussion but found nothing.
>
> Best Regards,
> Shixiong Zhu
>


Re: About implicit rddToPairRDDFunctions

2014-11-13 Thread Reynold Xin
That seems like a great idea. Can you submit a pull request?
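
(For reference, a minimal standalone sketch of the implicit-scope rule
described in the quoted message below. The names mylib, Box, and RichBox are
made up for illustration; this is not the actual Spark change.)

// A library type defined in package `mylib`.
package mylib {
  class Box[A](val value: A)
}

// Implicits placed in the package object of the package that defines Box are
// part of Box's implicit scope, so callers never need an import to use them.
package object mylib {
  implicit class RichBox[A](box: Box[A]) {
    def describe: String = s"Box(${box.value})"
  }
}

// A completely separate package, with no `import mylib._` anywhere.
package client {
  object Demo {
    def main(args: Array[String]): Unit = {
      val b = new mylib.Box(42)
      println(b.describe) // resolved via Box's implicit scope; prints Box(42)
    }
  }
}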


On Thu, Nov 13, 2014 at 7:13 PM, Shixiong Zhu  wrote:

> If we put the `implicit` into "package object rdd" or "object rdd", then when
> we write `rdd.groupByKey()`, because rdd is an instance of RDD, the Scala
> compiler will search `object rdd` (the companion object) and `package object
> rdd` (the package object) by default. We don't need to import them
> explicitly. Here is a post about the implicit search logic:
> http://eed3si9n.com/revisiting-implicits-without-import-tax
>
> To maintain compatibility, we can keep `rddToPairRDDFunctions` in
> SparkContext but remove the `implicit`. The disadvantage is that there are
> two copies of the same code.
>
>
>
>
> Best Regards,
> Shixiong Zhu
>
> 2014-11-14 3:57 GMT+08:00 Reynold Xin :
>
>> Do people usually import o.a.spark.rdd._ ?
>>
>> Also in order to maintain source and binary compatibility, we would need
>> to keep both right?
>>
>>
>> On Thu, Nov 6, 2014 at 3:12 AM, Shixiong Zhu  wrote:
>>
>>> I saw many people asked how to convert a RDD to a PairRDDFunctions. I
>>> would
>>> like to ask a question about it. Why not put the following implicit into
>>> "pacakge object rdd" or "object rdd"?
>>>
>>>   implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
>>>   (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] =
>>> null)
>>> = {
>>> new PairRDDFunctions(rdd)
>>>   }
>>>
>>> If so, the converting will be automatic and not need to
>>> import org.apache.spark.SparkContext._
>>>
>>> I tried to search some discussion but found nothing.
>>>
>>> Best Regards,
>>> Shixiong Zhu
>>>
>>
>>
>


Re: send currentJars and currentFiles to executor with actor?

2014-11-16 Thread Reynold Xin
The current design is not ideal, but the size of dependencies should be
fairly small since we only send the path and timestamp, not the jars
themselves.

Executors can come and go. This is essentially a state replication problem
where you have to be very careful about consistency.

On Sun, Nov 16, 2014 at 4:24 AM, scwf  wrote:

> I notice that Spark serializes each task with the dependencies (files and
> JARs
> added to the SparkContext):
>   def serializeWithDependencies(
>   task: Task[_],
>   currentFiles: HashMap[String, Long],
>   currentJars: HashMap[String, Long],
>   serializer: SerializerInstance)
> : ByteBuffer = {
>
> val out = new ByteArrayOutputStream(4096)
> val dataOut = new DataOutputStream(out)
>
> // Write currentFiles
> dataOut.writeInt(currentFiles.size)
> for ((name, timestamp) <- currentFiles) {
>   dataOut.writeUTF(name)
>   dataOut.writeLong(timestamp)
> }
>
> // Write currentJars
> dataOut.writeInt(currentJars.size)
> for ((name, timestamp) <- currentJars) {
>   dataOut.writeUTF(name)
>   dataOut.writeLong(timestamp)
> }
>
> // Write the task itself and finish
> dataOut.flush()
> val taskBytes = serializer.serialize(task).array()
> out.write(taskBytes)
> ByteBuffer.wrap(out.toByteArray)
>   }
>
> Why not send currentJars and currentFiles to the executor using an actor? I
> it's not necessary to serialize them for each task.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/send-currentJars-and-currentFiles-to-exetutor-with-actor-tp9381.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Reynold Xin
That's a great idea and it is also a pain point for some users. However, it
is not possible to solve this problem at compile time, because the content
of serialization can only be determined at runtime.

There are some efforts in Scala to help users avoid mistakes like this. One
example project that is more researchy is Spore:
http://docs.scala-lang.org/sips/pending/spores.html
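
(Since this can't be caught at compile time, the usual fix is a coding
pattern at the call site: avoid capturing the enclosing instance in the
closure. A small sketch, with a made-up Helper class:)

import org.apache.spark.rdd.RDD

class Helper(val factor: Int) {   // Helper itself is not Serializable
  def scale(rdd: RDD[Int]): RDD[Int] = {
    // rdd.map(_ * this.factor) would capture `this` (the whole Helper) and
    // fail at runtime with a NotSerializableException when the task is
    // shipped to executors. Copying the field into a local val means only
    // an Int is captured:
    val localFactor = factor
    rdd.map(_ * localFactor)
  }
}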



On Sun, Nov 16, 2014 at 4:12 PM, jay vyas 
wrote:

> This is more a curiosity than an immediate problem.
>
> Here is my question: I ran into this easily solved issue
>
> http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
> recently.  The solution was to replace my "class" with a Scala singleton,
> which I guess is readily serializable.
>
> So it's clear that Spark needs to serialize objects which carry the driver
> methods for an app in order to run... but I'm wondering, maybe there is
> a way to change or update the Spark API to catch unserializable Spark apps
> at compile time?
>
>
> --
> jay vyas
>


Re: Regarding RecordReader of spark

2014-11-16 Thread Reynold Xin
I don't think the code is immediately obvious.

Davies - I think you added the code, and Josh reviewed it. Can you guys
explain and maybe submit a patch to add more documentation on the whole
thing?

Thanks.


On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad 
wrote:

> Hello Everyone,
>
> I am going through the source code of the RDD and record readers.
> There I found 2 classes:
>
> 1. WholeTextFileRecordReader
> 2. WholeCombineFileRecordReader  ( extends CombineFileRecordReader )
>
> The descriptions of the two classes are practically identical.
>
> I am not able to understand why we have 2 classes. Is
> CombineFileRecordReader providing some extra advantage?
>
> Regards
> Vibhanshu
>


Re: Apache infra github sync down

2014-11-18 Thread Reynold Xin
This basically stops us from merging patches. I'm wondering if it is
possible for ASF to give some Spark committers write permission to the GitHub
repo. In that case, if the sync tool is down, we can manually push
periodically.

On Tue, Nov 18, 2014 at 10:24 PM, Patrick Wendell 
wrote:

> Hey All,
>
> The Apache-->github mirroring is not working right now and hasn't been
> working for more than 24 hours. This means that pull requests will not
> appear as closed even though they have been merged. It also causes
> diffs to display incorrectly in some cases. If you'd like to follow
> progress by Apache infra on this issue you can watch this JIRA:
>
> https://issues.apache.org/jira/browse/INFRA-8654
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Eliminate copy while sending data : any Akka experts here ?

2014-11-20 Thread Reynold Xin
Can you elaborate? Not 100% sure if I understand what you mean.

On Thu, Nov 20, 2014 at 7:14 PM, Shixiong Zhu  wrote:

> Is it possible that Spark buffers the messages
> of mapOutputStatuses(Array[Byte]) according to the size
> of mapOutputStatuses which have already been sent but not yet ACKed? The
> buffer will be cheap since the mapOutputStatuses messages are the same and the memory
> cost is only a few pointers.
>
> Best Regards,
> Shixiong Zhu
>
> 2014-09-20 16:24 GMT+08:00 Reynold Xin :
>
>> BTW - a partial solution here: https://github.com/apache/spark/pull/2470
>>
>> This doesn't address the 0 size block problem yet, but makes my large job
>> on hundreds of terabytes of data much more reliable.
>>
>>
>> On Fri, Jul 4, 2014 at 2:28 AM, Mridul Muralidharan 
>> wrote:
>>
>> > In our clusters, number of containers we can get is high but memory
>> > per container is low : which is why avg_nodes_not_hosting data is
>> > rarely zero for ML tasks :-)
>> >
>> > To update - to unblock our current implementation efforts, we went
>> > with broadcast - since it is intuitively easier and a minimal change; and
>> > compress the array as bytes in TaskResult.
>> > This is then stored in disk backed maps - to remove memory pressure on
>> > master and workers (else MapOutputTracker becomes a memory hog).
>> >
>> > But I agree, compressed bitmap to represent 'large' blocks (anything
>> > larger that maxBytesInFlight actually) and probably existing to track
>> > non zero should be fine (we should not really track zero output for
>> > reducer - just waste of space).
>> >
>> >
>> > Regards,
>> > Mridul
>> >
>> > On Fri, Jul 4, 2014 at 3:43 AM, Reynold Xin 
>> wrote:
>> > > Note that in my original proposal, I was suggesting we could track
>> > whether
>> > > block size = 0 using a compressed bitmap. That way we can still avoid
>> > > requests for zero-sized blocks.
>> > >
>> > >
>> > >
>> > > On Thu, Jul 3, 2014 at 3:12 PM, Reynold Xin 
>> wrote:
>> > >
>> > >> Yes, that number is likely == 0 in any real workload ...
>> > >>
>> > >>
>> > >> On Thu, Jul 3, 2014 at 8:01 AM, Mridul Muralidharan <
>> mri...@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> On Thu, Jul 3, 2014 at 11:32 AM, Reynold Xin 
>> > wrote:
>> > >>> > On Wed, Jul 2, 2014 at 3:44 AM, Mridul Muralidharan <
>> > mri...@gmail.com>
>> > >>> > wrote:
>> > >>> >
>> > >>> >>
>> > >>> >> >
>> > >>> >> > The other thing we do need is the location of blocks. This is
>> > >>> actually
>> > >>> >> just
>> > >>> >> > O(n) because we just need to know where the map was run.
>> > >>> >>
>> > >>> >> For well partitioned data, won't this involve a lot of
>> unwanted
>> > >>> >> requests to nodes which are not hosting data for a reducer (and
>> lack
>> > >>> >> of ability to throttle).
>> > >>> >>
>> > >>> >
>> > >>> > Was that a question? (I'm guessing it is). What do you mean
>> exactly?
>> > >>>
>> > >>>
>> > >>> I was not sure if I understood the proposal correctly - hence the
>> > >>> query : if I understood it right - the number of wasted requests
>> goes
>> > >>> up by num_reducers * avg_nodes_not_hosting data.
>> > >>>
>> > >>> Ofcourse, if avg_nodes_not_hosting data == 0, then we are fine !
>> > >>>
>> > >>> Regards,
>> > >>> Mridul
>> > >>>
>> > >>
>> > >>
>> >
>>
>
>


Re: Troubleshooting JVM OOM during Spark Unit Tests

2014-11-22 Thread Reynold Xin
What does /tmp/jvm-21940/hs_error.log tell you? It might give hints to what
threads are allocating the extra off-heap memory.


On Fri, Nov 21, 2014 at 1:50 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Howdy folks,
>
> I’m trying to understand why I’m getting “insufficient memory” errors when
> trying to run Spark unit tests within a CentOS Docker container.
>
> I’m building Spark and running the tests as follows:
>
> # build
> sbt/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl
> -Phive -Phive-thriftserver package assembly/assembly
>
> # Scala unit tests
> sbt/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl
> -Phive -Phive-thriftserver catalyst/test sql/test hive/test mllib/test
>
> The build completes successfully. After humming along for many minutes, the
> unit tests fail with this:
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x00074a58, 30932992, 0) failed;
> error='Cannot allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to
> continue.
> # Native memory allocation (malloc) failed to allocate 30932992 bytes
> for committing reserved memory.
> # An error report file with more information is saved as:
> # /tmp/jvm-21940/hs_error.log
> Exception in thread "Thread-20" Exception in thread "Thread-16"
> java.io.EOFException
> at
> java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at
> org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1$React.react(Framework.scala:945)
> at
> org.scalatest.tools.Framework$ScalaTestRunner$Skeleton$1.run(Framework.scala:934)
> at java.lang.Thread.run(Thread.java:745)
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at java.net.SocketInputStream.read(SocketInputStream.java:210)
> at
> java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
> at
> java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
> at
> java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at sbt.React.react(ForkTests.scala:114)
> at
> sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:74)
> at java.lang.Thread.run(Thread.java:745)
>
> Here are some (I think) relevant environment variables I have set:
>
> export
> JAVA_HOME="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.71-2.5.3.1.el7_0.x86_64"
> export JAVA_OPTS="-Xms128m -Xmx1g -XX:MaxPermSize=128m"
> export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128m"
>
> How do I narrow down why this is happening? I know that running this thing
> within a Docker container may be playing a role here, but before poking
> around with Docker configs I want to make an effort at getting the Java
> setup right within the container.
>
> I’ve already tried giving the container 2GB of memory, so I don’t think at
> this point it’s a restriction on the container.
>
> Any pointers on how to narrow the problem down?
>
> Nick
>
> P.S. If you’re wondering why I’m trying to run unit tests within a Docker
> container, I’m exploring a different angle on SPARK-3431
> .
> ​
>


Re: How to resolve Spark site issues?

2014-11-25 Thread Reynold Xin
The website is hosted on an SVN server by ASF and unfortunately it
doesn't have a GitHub mirror, so we will have to manually patch it ...


On Tue, Nov 25, 2014 at 11:12 AM, York, Brennon  wrote:

> For JIRA tickets like SPARK-4046<
> https://issues.apache.org/jira/browse/SPARK-4046> (Incorrect Java example
> on site) is there a way to go about fixing those things? It's a trivial fix,
> but I’m not seeing that code in the codebase anywhere. Is this something
> the admins are going to have to take care of? Just want to clarify before I
> let it go and let the example sit on the site :)
>


Re: How to resolve Spark site issues?

2014-11-26 Thread Reynold Xin
Thanks, Brennon. I pushed the change and updated the website.

On Wed, Nov 26, 2014 at 8:17 AM, York, Brennon 
wrote:

> A diff you say?! Done and done. If someone with privileges to push to the
> SVN site repo could check it out I think we'd be good to go. @Sean, thanks
> for the repo URL!
>
> On 11/25/14, 2:51 PM, "Sean Owen"  wrote:
>
> >For the interested, the SVN repo for the site is viewable at
> >http://svn.apache.org/viewvc/spark/site/ and to check it out, you can
> >"svn co https://svn.apache.org/repos/asf/spark/site";
> >
> >I assume the best process is to make a diff and attach it to the JIRA.
> >How old school.
> >
> >On Tue, Nov 25, 2014 at 7:30 PM, Reynold Xin  wrote:
> >> The website is hosted on some svn server by ASF and unfortunately it
> >> doesn't have a github mirror, so we will have to manually patch it ...
> >>
> >>
> >> On Tue, Nov 25, 2014 at 11:12 AM, York, Brennon
> >> >>> wrote:
> >>
> >>> For JIRA tickets like SPARK-4046<
> >>> https://issues.apache.org/jira/browse/SPARK-4046> (Incorrect Java
> >>>example
> >>> on site) is there a way to go about fixing those things? It's a trivial
> >>>fix,
> >>> but I'm not seeing that code in the codebase anywhere. Is this
> >>>something
> >>> the admins are going to have to take care of? Just want to clarify
> >>>before I
> >>> let it go and let the example sit on the site :)
>
>


Re: Standalone scheduling - document inconsistent

2014-11-27 Thread Reynold Xin
The first statement refers to scheduling across different Spark applications
connecting to the standalone cluster manager, while the second refers to
scheduling within a single Spark application, where jobs can be scheduled
using a fair scheduler.
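
(For reference, a minimal sketch of the within-application side, using the
spark.scheduler.mode setting and a per-thread pool; the pool name "reports"
and the object name are made up for the example:)

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    // Fair scheduling of jobs *within* this application; scheduling *across*
    // applications on the standalone cluster manager is still FIFO.
    val conf = new SparkConf()
      .setAppName("fair-scheduling-sketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go to the named pool; per-pool weights
    // can additionally be configured via spark.scheduler.allocation.file.
    sc.setLocalProperty("spark.scheduler.pool", "reports")
    sc.parallelize(1 to 100).count()

    sc.stop()
  }
}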


On Thu, Nov 27, 2014 at 3:47 AM, Praveen Sripati 
wrote:

> Hi,
>
> There is a bit of an inconsistency in the documentation. Which is the correct
> statement?
>
> `http://spark.apache.org/docs/latest/spark-standalone.html` says
>
> The standalone cluster mode currently only supports a simple FIFO scheduler
> across applications.
>
> while `http://spark.apache.org/docs/latest/job-scheduling.html` says
>
> Starting in Spark 0.8, it is also possible to configure fair sharing
> between jobs.
>
> Thanks,
> Praveen
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-28 Thread Reynold Xin
Krishna,

Docs don't block the rc voting because docs can be updated in parallel with
release candidates, until the point a release is made.


On Fri, Nov 28, 2014 at 9:55 PM, Krishna Sankar  wrote:

> Looks like the documentation hasn't caught up with the new features.
> On the machine learning side, for example org.apache.spark.ml,
> RandomForest, gbtree and so forth. Is a refresh of the documentation
> planned ?
> Am happy to see these capabilities, but these would need good explanations
> as well, especially the new thinking around the ml ... pipelines,
> transformations et al.
> IMHO, the documentation is a -1.
> Will check out the compilation, mlib et al
>
> Cheers
> 
>
> On Fri, Nov 28, 2014 at 9:16 PM, Patrick Wendell 
> wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.2.0!
> >
> > The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.0-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1048/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.2.0!
> >
> > The vote is open until Tuesday, December 02, at 05:15 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > == What justifies a -1 vote for this release? ==
> > This vote is happening very late into the QA period compared with
> > previous votes, so -1 votes should only occur for significant
> > regressions from 1.0.2. Bugs already present in 1.1.X, minor
> > regressions, or bugs related to new features will not block this
> > release.
> >
> > == What default changes should I be aware of? ==
> > 1. The default value of "spark.shuffle.blockTransferService" has been
> > changed to "netty"
> > --> Old behavior can be restored by switching to "nio"
> >
> > 2. The default value of "spark.shuffle.manager" has been changed to
> "sort".
> > --> Old behavior can be restored by setting "spark.shuffle.manager" to
> > "hash".
> >
> > == Other notes ==
> > Because this vote is occurring over a weekend, I will likely extend
> > the vote if this RC survives until the end of the vote period.
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: Can the Scala classes in the spark source code, be inherited in Java classes?

2014-12-01 Thread Reynold Xin
Oops my previous response wasn't sent properly to the dev list. Here you go
for archiving.


Yes you can. Scala classes are compiled down to classes in bytecode. Take a
look at this: https://twitter.github.io/scala_school/java.html

Note that questions like this are not exactly what this dev list is meant
for  ...

On Mon, Dec 1, 2014 at 9:22 PM, Niranda Perera  wrote:

> Hi,
>
> Can the Scala classes in the spark source code, be inherited (and other OOP
> concepts) in Java classes?
>
> I want to customize some part of the code, but I would like to do it in a
> Java environment.
>
> Rgds
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 
>


Re: HA support for Spark

2014-12-10 Thread Reynold Xin
This would be plausible for specific purposes such as Spark streaming or
Spark SQL, but I don't think it is doable for a general Spark driver since it
is just a normal JVM process with arbitrary program state.

On Wed, Dec 10, 2014 at 12:25 AM, Jun Feng Liu  wrote:

> Do we have any high availability support at the Spark driver level? For
> example, if we want the Spark driver to move to another node and continue
> execution when a failure happens. I can see that RDD checkpointing can help
> to serialize the status of an RDD. I can imagine loading the checkpoint from
> another node when an error happens, but it seems like we would lose track of
> all task status or even the executor information maintained in the Spark
> context. I am not sure if there is any existing stuff I can leverage to do
> that. Thanks for any suggestions.
>
> Best Regards
>
>
> *Jun Feng Liu*
> IBM China Systems & Technology Laboratory in Beijing
>
>   --
>  [image: 2D barcode - encoded with contact information] *Phone: 
> *86-10-82452683
>
> * E-mail:* *liuj...@cn.ibm.com* 
> [image: IBM]
>
> BLD 28,ZGC Software Park
> No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> China
>
>
>
>
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-11 Thread Reynold Xin
+1

Tested on OS X.

On Wednesday, December 10, 2014, Patrick Wendell  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.2.0!
>
> The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1055/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
>
> Please vote on releasing this package as Apache Spark 1.2.0!
>
> The vote is open until Saturday, December 13, at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == What justifies a -1 vote for this release? ==
> This vote is happening relatively late into the QA period, so
> -1 votes should only occur for significant regressions from
> 1.0.2. Bugs already present in 1.1.X, minor
> regressions, or bugs related to new features will not block this
> release.
>
> == What default changes should I be aware of? ==
> 1. The default value of "spark.shuffle.blockTransferService" has been
> changed to "netty"
> --> Old behavior can be restored by switching to "nio"
>
> 2. The default value of "spark.shuffle.manager" has been changed to "sort".
> --> Old behavior can be restored by setting "spark.shuffle.manager" to
> "hash".
>
> == How does this differ from RC1 ==
> This has fixes for a handful of issues identified - some of the
> notable fixes are:
>
> [Core]
> SPARK-4498: Standalone Master can fail to recognize completed/failed
> applications
>
> [SQL]
> SPARK-4552: Query for empty parquet table in spark sql hive get
> IllegalArgumentException
> SPARK-4753: Parquet2 does not prune based on OR filters on partition
> columns
> SPARK-4761: With JDBC server, set Kryo as default serializer and
> disable reference tracking
> SPARK-4785: When called with arguments referring column fields, PMOD
> throws NPE
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>


Re: Tachyon in Spark

2014-12-11 Thread Reynold Xin
I don't think the lineage thing is even turned on in Tachyon - it was
mostly a research prototype, so I don't think it'd make sense for us to use
that.


On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash  wrote:

> I'm interested in understanding this as well.  One of the main ways Tachyon
> is supposed to realize performance gains without sacrificing durability is
> by storing the lineage of data rather than full copies of it (similar to
> Spark).  But if Spark isn't sending lineage information into Tachyon, then
> I'm not sure how this isn't a durability concern.
>
> On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu  wrote:
>
> > Does Spark today really leverage Tachyon lineage to process data? It seems
> > like the application should call the createDependency function in TachyonFS
> > to create a new lineage node. But I did not find any place that calls it in
> > the Spark code. Did I miss anything?
> >
> > Best Regards
> >
> >
> > *Jun Feng Liu*
> > IBM China Systems & Technology Laboratory in Beijing
> >
> >   --
> >  [image: 2D barcode - encoded with contact information] *Phone:
> *86-10-82452683
> >
> > * E-mail:* *liuj...@cn.ibm.com* 
> > [image: IBM]
> >
> > BLD 28,ZGC Software Park
> > No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
> > China
> >
> >
> >
> >
> >
>


Re: Tachyon in Spark

2014-12-11 Thread Reynold Xin
Actually HY emailed me offline about this and this is supported in the
latest version of Tachyon. It is a hard problem to push this into storage;
need to think about how to handle isolation, resource allocation, etc.

https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java

On Thu, Dec 11, 2014 at 3:54 PM, Reynold Xin  wrote:

> I don't think the lineage thing is even turned on in Tachyon - it was
> mostly a research prototype, so I don't think it'd make sense for us to use
> that.
>
>
> On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash  wrote:
>
>> I'm interested in understanding this as well.  One of the main ways
>> Tachyon
>> is supposed to realize performance gains without sacrificing durability is
>> by storing the lineage of data rather than full copies of it (similar to
>> Spark).  But if Spark isn't sending lineage information into Tachyon, then
>> I'm not sure how this isn't a durability concern.
>>
>> On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu  wrote:
>>
>> > Does Spark today really leverage Tachyon lineage to process data? It
>> seems
>> > like the application should call the createDependency function in TachyonFS
>> > to create a new lineage node. But I did not find any place that calls it in
>> > the Spark code. Did I miss anything?
>> >
>> > Best Regards
>> >
>> >
>> > *Jun Feng Liu*
>> > IBM China Systems & Technology Laboratory in Beijing
>> >
>> >   --
>> >  [image: 2D barcode - encoded with contact information] *Phone:
>> *86-10-82452683
>> >
>> > * E-mail:* *liuj...@cn.ibm.com* 
>> > [image: IBM]
>> >
>> > BLD 28,ZGC Software Park
>> > No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
>> > China
>> >
>> >
>> >
>> >
>> >
>>
>
>


Re: Scala's Jenkins setup looks neat

2014-12-16 Thread Reynold Xin
It's worth trying :)


On Tue, Dec 16, 2014 at 6:02 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
>
> News flash!
>
> From the latest version of the GitHub API
> <https://developer.github.com/v3/repos/statuses/>:
>
> Note that the repo:status OAuth scope
> <https://developer.github.com/v3/oauth/#scopes> grants targeted access to
> Statuses *without* also granting access to repository code, while the repo
> scope grants permission to code as well as statuses.
>
> As I understand it, ASF Infra has said no in the past to granting access
> to statuses because it also granted push access.
>
> If so, this no longer appears to be the case.
>
> 1) Did I understand correctly and 2) should I open a new request with ASF
> Infra to give us OAuth keys with repo:status access?
>
> Nick
>
> On Sat Sep 06 2014 at 1:29:53 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Aww, that's a bummer...
>>
>>
>> On Sat, Sep 6, 2014 at 1:10 PM, Reynold Xin  wrote:
>>
>>> that would require github hooks permission and unfortunately asf infra
>>> wouldn't allow that.
>>>
>>> Maybe they will change their mind one day, but so far we asked about
>>> this and the answer has been no for security reasons.
>>>
>>> On Saturday, September 6, 2014, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> After reading Erik's email, I found this Scala PR
>>>> <https://github.com/scala/scala/pull/3963> and immediately noticed a
>>>> few
>>>> cool things:
>>>>
>>>>- Jenkins is hooked directly into GitHub somehow, so you get the
>>>> "All is
>>>>well" message in the merge status window, presumably based on the
>>>> last test
>>>>status
>>>>- Jenkins is also tagging the PR based on its test status or need for
>>>>review
>>>>- Jenkins is also tagging the PR for a specific milestone
>>>>
>>>> Do any of these things make sense to add to our setup? Or perhaps
>>>> something
>>>> inspired by these features?
>>>>
>>>> Nick
>>>>
>>>
>>  ​
>


Re: Scala's Jenkins setup looks neat

2014-12-16 Thread Reynold Xin
This was the ticket: https://issues.apache.org/jira/browse/INFRA-7918

On Tue, Dec 16, 2014 at 6:23 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:
>
> Actually, reading through the existing issue opened for this
> <https://issues.apache.org/jira/browse/INFRA-7367> back in February, I
> don’t see any explanation from ASF Infra as to why they won’t grant
> permission against the Status API. They just recommended transitioning to
> the Apache Jenkins instance.
>
> Reynold/Patrick, was there any discussion elsewhere about this, or should
> I just go ahead and try reopening the issue with the appropriate
> explanation?
>
> Nick
> ​
>
> On Tue Dec 16 2014 at 9:10:59 PM Patrick Wendell 
> wrote:
>
>> Yeah you can do it - just make sure they understand it is a new
>> feature so we're asking them to revisit it. They looked at it in the
>> past and they concluded they couldn't give us access without giving us
>> push access.
>>
>> - Patrick
>>
>> On Tue, Dec 16, 2014 at 6:06 PM, Reynold Xin  wrote:
>> > It's worth trying :)
>> >
>> >
>> > On Tue, Dec 16, 2014 at 6:02 PM, Nicholas Chammas <
>> > nicholas.cham...@gmail.com> wrote:
>> >>
>> >> News flash!
>> >>
>> >> From the latest version of the GitHub API
>> >> <https://developer.github.com/v3/repos/statuses/>:
>> >>
>> >> Note that the repo:status OAuth scope
>> >> <https://developer.github.com/v3/oauth/#scopes> grants targeted
>> access to
>> >> Statuses *without* also granting access to repository code, while the
>> repo
>> >> scope grants permission to code as well as statuses.
>> >>
>> >> As I understand it, ASF Infra has said no in the past to granting
>> access
>> >> to statuses because it also granted push access.
>> >>
>> >> If so, this no longer appears to be the case.
>> >>
>> >> 1) Did I understand correctly and 2) should I open a new request with
>> ASF
>> >> Infra to give us OAuth keys with repo:status access?
>> >>
>> >> Nick
>> >>
>> >> On Sat Sep 06 2014 at 1:29:53 PM Nicholas Chammas <
>> >> nicholas.cham...@gmail.com> wrote:
>> >>
>> >> Aww, that's a bummer...
>> >>>
>> >>>
>> >>> On Sat, Sep 6, 2014 at 1:10 PM, Reynold Xin 
>> wrote:
>> >>>
>> >>>> that would require github hooks permission and unfortunately asf
>> infra
>> >>>> wouldn't allow that.
>> >>>>
>> >>>> Maybe they will change their mind one day, but so far we asked about
>> >>>> this and the answer has been no for security reasons.
>> >>>>
>> >>>> On Saturday, September 6, 2014, Nicholas Chammas <
>> >>>> nicholas.cham...@gmail.com> wrote:
>> >>>>
>> >>>>> After reading Erik's email, I found this Scala PR
>> >>>>> <https://github.com/scala/scala/pull/3963> and immediately noticed
>> a
>> >>>>> few
>> >>>>> cool things:
>> >>>>>
>> >>>>>- Jenkins is hooked directly into GitHub somehow, so you get the
>> >>>>> "All is
>> >>>>>well" message in the merge status window, presumably based on the
>> >>>>> last test
>> >>>>>status
>> >>>>>- Jenkins is also tagging the PR based on its test status or
>> need for
>> >>>>>review
>> >>>>>- Jenkins is also tagging the PR for a specific milestone
>> >>>>>
>> >>>>> Do any of these things make sense to add to our setup? Or perhaps
>> >>>>> something
>> >>>>> inspired by these features?
>> >>>>>
>> >>>>> Nick
>> >>>>>
>> >>>>
>> >>>
>> >>
>>
>


Re: What RDD transformations trigger computations?

2014-12-18 Thread Reynold Xin
Alessandro was probably referring to some transformations whose
implementations depend on some actions. For example, sortByKey requires
sampling the data to build a histogram of the keys for range partitioning.

There is a ticket tracking this:
https://issues.apache.org/jira/browse/SPARK-2992
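
(A small sketch of the asymmetry, assuming the current 1.x behavior where
sortByKey builds a RangePartitioner eagerly; the object and app names are
made up:)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair/ordered RDD functions in 1.2

object SortByKeySamplingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sortByKey-sketch").setMaster("local[2]"))
    val pairs = sc.parallelize(1 to 100000).map(i => (i % 100, i))

    // Purely lazy: no job is launched by this line.
    val grouped = pairs.groupByKey()

    // Also a "transformation", but constructing it launches a small sampling
    // job, because the RangePartitioner samples the keys up front to compute
    // partition boundaries (the behavior tracked in SPARK-2992).
    val sorted = pairs.sortByKey()

    sc.stop()
  }
}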






On Thu, Dec 18, 2014 at 11:52 AM, Josh Rosen  wrote:
>
> Could you provide an example?  These operations are lazy, in the sense
> that they don’t trigger Spark jobs:
>
>
> scala> val a = sc.parallelize(1 to 1, 1).mapPartitions{ x =>
> println("computed a!"); x}
> a: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[14] at mapPartitions
> at :18
>
> scala> a.union(a)
> res4: org.apache.spark.rdd.RDD[Int] = UnionRDD[15] at union at :22
>
> scala> a.map(x => (x, x)).groupByKey()
> res5: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[17] at
> groupByKey at :22
>
> scala> a.map(x => (x, x)).groupByKey().count()
> computed a!
> res6: Long = 1
>
>
> On December 18, 2014 at 1:04:54 AM, Alessandro Baretta (
> alexbare...@gmail.com) wrote:
>
> All,
>
> I noticed that while some operations that return RDDs are very cheap, such
> as map and flatMap, some are quite expensive, such as union and groupByKey.
> I'm referring here to the cost of constructing the RDD scala value, not the
> cost of collecting the values contained in the RDD. This does not match my
> understanding that RDD transformations only set up a computation without
> actually running it. Oh, Spark developers, can you please provide some
> clarity?
>
> Alex
>


Re: Highly interested in contributing to spark

2015-01-01 Thread Reynold Xin
Hi Manoj,

Thanks for the email.

Yes - you should start with the starter task before attempting larger ones.
Last year I signed up as a mentor for GSoC, but no student signed up. I
don't think I'd have time to be a mentor this year, but others might.


On Thu, Jan 1, 2015 at 4:54 PM, Manoj Kumar 
wrote:

> Hello,
>
> I am Manoj (https://github.com/MechCoder), an undergraduate student highly
> interested in Machine Learning. I have contributed to SymPy and
> scikit-learn as part of Google Summer of Code projects and my bachelor's
> thesis. I have a few quick (non-technical) questions before I dive into the
> issue tracker.
>
> Are the ones marked trivial easy to fix ones, that I could try before
> attempting slightly more ambitious ones? Also I would like to know if
> Apache Spark takes part in Google Summer of Code projects under the Apache
> Software Foundation. It would be really great if it does!
>
> Looking forward!
>
> --
> Godspeed,
> Manoj Kumar,
> Mech Undergrad
> http://manojbits.wordpress.com
>


Re: Missing Catalyst API docs

2015-01-07 Thread Reynold Xin
I'm in the middle of revamping the SchemaRDD public API and in 1.3, we will
have a public, documented version of the expression library. The Catalyst
expression library will remain hidden.

You can track it with this ticket:
https://issues.apache.org/jira/browse/SPARK-5097



On Wed, Jan 7, 2015 at 3:04 PM, Alessandro Baretta 
wrote:

> Gents,
>
> It looks like some of the Catalyst classes' API docs are missing. For
> instance, the Expression class, referred to by the SchemaRDD docs seems to
> be missing. (See here:
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD
> )
>
> Is this intended or is it due to a failure in the doc creation process?
>
> Alex
>


Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
It depends on your use case. If the use case is to extract a small amount of
data out of Teradata, then you can use the JdbcRDD, and soon a JDBC input
source based on the new Spark SQL external data source API.
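
(A rough sketch of the JdbcRDD route. The JDBC URL, driver class name, table
and bounds below are placeholders to adjust for your setup; the query must
contain two '?' markers that JdbcRDD fills in with each partition's bounds:)

import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object TeradataJdbcSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("teradata-jdbc-sketch"))

    val rows = new JdbcRDD(
      sc,
      () => {
        Class.forName("com.teradata.jdbc.TeraDriver")  // placeholder driver
        DriverManager.getConnection(
          "jdbc:teradata://host/DATABASE=mydb", "user", "password")
      },
      "SELECT id, amount FROM sales WHERE id >= ? AND id <= ?",
      1L,        // lower bound, bound to the first '?'
      1000000L,  // upper bound, bound to the second '?'
      10,        // number of partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getDouble(2)))

    println(rows.count())
    sc.stop()
  }
}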



On Wed, Jan 7, 2015 at 7:14 AM, gen tang  wrote:

> Hi,
>
> I have a stupid question:
> Is it possible to use Spark on a Teradata data warehouse? I read
> some news on the internet which says yes. However, I didn't find any
> examples about this.
>
> Thanks in advance.
>
> Cheers
> Gen
>
>


Re: create a SchemaRDD from a custom datasource

2015-01-13 Thread Reynold Xin
Depends on what the other side is doing. You can create your own RDD
implementation by subclassing RDD, or it might work if you use
sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data
and return an iterator */ ) where n is the number of partitions.
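
(A minimal sketch of the second approach; DataSourceClient below is a made-up
stand-in for whatever client the custom datasources API exposes:)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

object PartitionedScanSketch {
  // Stand-in for the custom datasource client; one instance per partition.
  class DataSourceClient(partition: Int, totalPartitions: Int) {
    def readRows(): Iterator[Row] =
      Iterator(Row(partition, s"record-from-partition-$partition"))
  }

  def buildRowRDD(sc: SparkContext, numPartitions: Int): RDD[Row] = {
    // One element per partition just so each partition index gets a task; the
    // actual rows are produced inside mapPartitionsWithIndex on the executors.
    sc.parallelize(1 to numPartitions, numPartitions).mapPartitionsWithIndex {
      (index, _) => new DataSourceClient(index, numPartitions).readRows()
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("scan-sketch").setMaster("local[2]"))
    buildRowRDD(sc, numPartitions = 4).collect().foreach(println)
    sc.stop()
  }
}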

On Tue, Jan 13, 2015 at 12:51 AM, Niranda Perera 
wrote:

> Hi,
>
> We have a custom datasources API, which connects to various data sources
> and exposes them out as a common API. We are now trying to implement the
> Spark datasources API released in 1.2.0 to connect Spark for analytics.
>
> Looking at the sources API, we figured out that we should extend a scan
> class (table scan etc). While doing so, we would have to implement the
> 'schema' and 'buildScan' methods.
>
> say, we can infer the schema of the underlying data and take data out as
> Row elements. Is there any way we could create RDD[Row] (needed in the
> buildScan method) using these Row elements?
>
> Cheers
> --
> Niranda
>


Re: create a SchemaRDD from a custom datasource

2015-01-13 Thread Reynold Xin
If it is a small collection of them on the driver, you can just use
sc.parallelize to create an RDD.
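
A small sketch of that, assuming `sc` and `sqlContext` are an existing SparkContext
and SQLContext; applySchema (the 1.2 API) then attaches the schema to the RDD[Row]:

```
import org.apache.spark.sql._

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// Rows already materialized on the driver.
val localRows: Seq[Row] = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))

val rowRDD = sc.parallelize(localRows)
val schemaRDD = sqlContext.applySchema(rowRDD, schema) // usable from buildScan, or register as a table
```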


On Tue, Jan 13, 2015 at 7:56 AM, Malith Dhanushka 
wrote:

> Hi Reynold,
>
> Thanks for the response. I am just wondering, lets say we have set of Row
> objects. Isn't there a straightforward way of creating RDD[Row] out of it
> without writing a custom RDD?
>
> ie - a utility method
>
> Thanks
> Malith
>
> On Tue, Jan 13, 2015 at 2:29 PM, Reynold Xin  wrote:
>
>> Depends on what the other side is doing. You can create your own RDD
>> implementation by subclassing RDD, or it might work if you use
>> sc.parallelize(1 to n, n).mapPartitionsWithIndex( /* code to read the data
>> and return an iterator */ ) where n is the number of partitions.
>>
>> On Tue, Jan 13, 2015 at 12:51 AM, Niranda Perera <
>> niranda.per...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a custom datasources API, which connects to various data sources
>>> and exposes them out as a common API. We are now trying to implement the
>>> Spark datasources API released in 1.2.0 to connect Spark for analytics.
>>>
>>> Looking at the sources API, we figured out that we should extend a scan
>>> class (table scan etc). While doing so, we would have to implement the
>>> 'schema' and 'buildScan' methods.
>>>
>>> say, we can infer the schema of the underlying data and take data out as
>>> Row elements. Is there any way we could create RDD[Row] (needed in the
>>> buildScan method) using these Row elements?
>>>
>>> Cheers
>>> --
>>> Niranda
>>>
>>
>>
> 
>
>


Re: SciSpark: NASA AIST14 proposal

2015-01-14 Thread Reynold Xin
Chris,

This is really cool. Congratulations and thanks for sharing the news.


On Wed, Jan 14, 2015 at 6:08 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Spark Devs,
>
> Just wanted to FYI that I was funded on a 2 year NASA proposal
> to build out the concept of a scientific RDD (create by space/time,
> and other operations) for use in some neat climate related NASA
> use cases.
>
> http://esto.nasa.gov/files/solicitations/AIST_14/ROSES2014_AIST_A41_awards.
> html
>
>
> I will keep everyone posted and plan on interacting with the list
> over here to get it done. I expect that we’ll start work in March.
> In the meanwhile you guys can scope the abstract at the link provided.
> Happy
> to chat about it if you have any questions too.
>
> Cheers!
>
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>


Spark SQL API changes and stabilization

2015-01-14 Thread Reynold Xin
Hi Spark devs,

Given the growing number of developers that are building on Spark SQL, we
would like to stabilize the API in 1.3 so users and developers can be
confident to build on it. This also gives us a chance to improve the API.

In particular, we are proposing the following major changes. This should
have no impact on most users (i.e. those running SQL through the JDBC
client or the SQLContext.sql method).

1. Everything in sql.catalyst package is private to the project.

2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
SchemaRDD and logical plans in order to construct test cases. We have
received feedback from a lot of users that the DSL can be incredibly
powerful. In 1.3, we’d like to refactor the DSL to make it suitable not
only for constructing test cases, but also for everyday data pipelines. The new
SchemaRDD API is inspired by the data frame concept in Pandas and R.

3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose one
set of APIs that will work for both Java and Scala. The current Java API
(sql.api.java) does not share any common ancestor with the Scala API. This
led to high maintenance burden for us as Spark developers and for library
developers. We propose to eliminate the Java specific API, and simply work
on the existing Scala API to make it also usable for Java. This will make
Java a first-class citizen alongside Scala. This effectively means that all public
classes should be usable for both Scala and Java, including SQLContext,
HiveContext, SchemaRDD, data types, and the aforementioned DSL.


Again, this should have no impact on most users since the existing DSL is
rarely used by end users. However, library developers might need to change
the import statements because we are moving certain classes around. We will
keep you posted as patches are merged.


Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
Alex,

I didn't communicate properly. By "private", I simply meant the expectation
that it is not a public API. The plan is to still omit it from the
scaladoc/javadoc generation, but no language visibility modifier will be
applied on them.

After 1.3, you will likely no longer need to use things in the sql.catalyst
package directly. Programmatically constructing SchemaRDDs is going to be a
first-class public API. Data types have already been moved out of the
sql.catalyst package and now live in sql.types. They are becoming stable
public APIs. When the "data frame" patch is submitted, you will see a
public expression library as well. There will be little reason for end users or
library developers to hook into things in sql.catalyst. The bravest and
the most advanced can still use them, with the expectation that they are
subject to change.





On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta 
wrote:

> Reynold,
>
> Thanks for the heads up. In general, I strongly oppose the use of
> "private" to restrict access to certain parts of the API, the reason being
> that I might find the need to use some of the internals of a library from
> my own project. I find that a @DeveloperAPI annotation serves the same
> purpose as "private" without imposing unnecessary restrictions: it
> discourages people from using the annotated API and reserves the right for
> the core developers to change it suddenly in backwards incompatible ways.
>
> In particular, I would like to express the desire that the APIs to
> programmatically construct SchemaRDDs from an RDD[Row] and a StructType
> remain public. All the SparkSQL data type objects should be exposed by the
> API, and the jekyll build should not hide the docs as it does now.
>
> Thanks.
>
> Alex
>
> On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin  wrote:
>
>> Hi Spark devs,
>>
>> Given the growing number of developers that are building on Spark SQL, we
>> would like to stabilize the API in 1.3 so users and developers can be
>> confident to build on it. This also gives us a chance to improve the API.
>>
>> In particular, we are proposing the following major changes. This should
>> have no impact for most users (i.e. those running SQL through the JDBC
>> client or SQLContext.sql method).
>>
>> 1. Everything in sql.catalyst package is private to the project.
>>
>> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
>> SchemaRDD and logical plans in order to construct test cases. We have
>> received feedback from a lot of users that the DSL can be incredibly
>> powerful. In 1.3, we’d like to refactor the DSL to make it suitable for
>> not
>> only constructing test cases, but also in everyday data pipelines. The new
>> SchemaRDD API is inspired by the data frame concept in Pandas and R.
>>
>> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose one
>> set of APIs that will work for both Java and Scala. The current Java API
>> (sql.api.java) does not share any common ancestor with the Scala API. This
>> led to high maintenance burden for us as Spark developers and for library
>> developers. We propose to eliminate the Java specific API, and simply work
>> on the existing Scala API to make it also usable for Java. This will make
>> Java a first class citizen as Scala. This effectively means that all
>> public
>> classes should be usable for both Scala and Java, including SQLContext,
>> HiveContext, SchemaRDD, data types, and the aforementioned DSL.
>>
>>
>> Again, this should have no impact on most users since the existing DSL is
>> rarely used by end users. However, library developers might need to change
>> the import statements because we are moving certain classes around. We
>> will
>> keep you posted as patches are merged.
>>
>
>


Re: Join implementation in SparkSQL

2015-01-15 Thread Reynold Xin
It's a bunch of strategies defined here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala

In most common use cases (e.g. inner equi join), filters are pushed below
the join or into the join. Doing a cartesian product followed by a filter
is too expensive.
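
As an illustration (the tables and columns are hypothetical, and `sqlContext` is an
existing SQLContext with `people` and `addresses` registered as tables), for a query
like the one below the planner picks a hash- or sort-based join on the equality key
and pushes the age filter down to the people side instead of building a cartesian
product and filtering it:

```
val joined = sqlContext.sql(
  """SELECT p.name, a.city
    |FROM people p JOIN addresses a ON p.id = a.person_id
    |WHERE p.age > 21""".stripMargin)

joined.collect().foreach(println)
```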


On Thu, Jan 15, 2015 at 7:39 AM, Alessandro Baretta 
wrote:

> Hello,
>
> Where can I find docs about how joins are implemented in SparkSQL? In
> particular, I'd like to know whether they are implemented according to
> their relational algebra definition as filters on top of a cartesian
> product.
>
> Thanks,
>
> Alex
>


Re: Graphx TripletFields written in Java?

2015-01-15 Thread Reynold Xin
The static fields - Scala can't express JVM static fields unfortunately.
Those will be important once we provide the Java API.



On Thu, Jan 15, 2015 at 8:58 AM, Jay Hutfles  wrote:

> Hi all,
>   Does anyone know the reasoning behind implementing
> org.apache.spark.graphx.TripletFields in Java instead of Scala?  It doesn't
> look like there's anything in there that couldn't be done in Scala.
> Nothing serious, just curious.  Thanks!
>-Jay
>


Re: Spark SQL API changes and stabilization

2015-01-15 Thread Reynold Xin
We can look into some sort of util class in sql.types for general type
inference. In general many methods in JsonRDD might be useful enough to
extract. Those will probably be marked as DeveloperAPI with less stability
guarantees.

On Thu, Jan 15, 2015 at 12:16 PM, Corey Nolet  wrote:

> Reynold,
>
> One thing I'd like worked into the public portion of the API is the json
> inferencing logic that creates a Set[(String, StructType)] out of
> Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators
> to infer my schema instead of forcing a map/reduce phase to occur on an RDD
> in order to get the final schema. Do you (or anyone else) see a path
> forward in exposing this to users? A utility class perhaps?
>
> On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin  wrote:
>
>> Alex,
>>
>> I didn't communicate properly. By "private", I simply meant the
>> expectation
>> that it is not a public API. The plan is to still omit it from the
>> scaladoc/javadoc generation, but no language visibility modifier will be
>> applied on them.
>>
>> After 1.3, you will likely no longer need to use things in sql.catalyst
>> package directly. Programmatically construct SchemaRDDs is going to be a
>> first class public API. Data types have already been moved out of the
>> sql.catalyst package and now lives in sql.types. They are becoming stable
>> public APIs. When the "data frame" patch is submitted, you will see a
>> public expression library also. There will be few reason for end users or
>> library developers to hook into things in sql.catalyst. For the bravest
>> and
>> the most advanced, they can still use them, with the expectation that it
>> is
>> subject to change.
>>
>>
>>
>>
>>
>> On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta <
>> alexbare...@gmail.com>
>> wrote:
>>
>> > Reynold,
>> >
>> > Thanks for the heads up. In general, I strongly oppose the use of
>> > "private" to restrict access to certain parts of the API, the reason
>> being
>> > that I might find the need to use some of the internals of a library
>> from
>> > my own project. I find that a @DeveloperAPI annotation serves the same
>> > purpose as "private" without imposing unnecessary restrictions: it
>> > discourages people from using the annotated API and reserves the right
>> for
>> > the core developers to change it suddenly in backwards incompatible
>> ways.
>> >
>> > In particular, I would like to express the desire that the APIs to
>> > programmatically construct SchemaRDDs from an RDD[Row] and a StructType
>> > remain public. All the SparkSQL data type objects should be exposed by
>> the
>> > API, and the jekyll build should not hide the docs as it does now.
>> >
>> > Thanks.
>> >
>> > Alex
>> >
>> > On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin 
>> wrote:
>> >
>> >> Hi Spark devs,
>> >>
>> >> Given the growing number of developers that are building on Spark SQL,
>> we
>> >> would like to stabilize the API in 1.3 so users and developers can be
>> >> confident to build on it. This also gives us a chance to improve the
>> API.
>> >>
>> >> In particular, we are proposing the following major changes. This
>> should
>> >> have no impact for most users (i.e. those running SQL through the JDBC
>> >> client or SQLContext.sql method).
>> >>
>> >> 1. Everything in sql.catalyst package is private to the project.
>> >>
>> >> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
>> >> SchemaRDD and logical plans in order to construct test cases. We have
>> >> received feedback from a lot of users that the DSL can be incredibly
>> >> powerful. In 1.3, we’d like to refactor the DSL to make it suitable for
>> >> not
>> >> only constructing test cases, but also in everyday data pipelines. The
>> new
>> >> SchemaRDD API is inspired by the data frame concept in Pandas and R.
>> >>
>> >> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose
>> one
>> >> set of APIs that will work for both Java and Scala. The current Java
>> API
>> >> (sql.api.java) does not share any common ancestor with the Scala API.
>> This
>> >> led to high maintenance burden for us as Spark developers and for
>> library
>> >> developers. We propose to eliminate the Java specific API, and simply
>> work
>> >> on the existing Scala API to make it also usable for Java. This will
>> make
>> >> Java a first class citizen as Scala. This effectively means that all
>> >> public
>> >> classes should be usable for both Scala and Java, including SQLContext,
>> >> HiveContext, SchemaRDD, data types, and the aforementioned DSL.
>> >>
>> >>
>> >> Again, this should have no impact on most users since the existing DSL
>> is
>> >> rarely used by end users. However, library developers might need to
>> change
>> >> the import statements because we are moving certain classes around. We
>> >> will
>> >> keep you posted as patches are merged.
>> >>
>> >
>> >
>>
>
>


Re: RDD order guarantees

2015-01-16 Thread Reynold Xin
You are running on a local file system, right? HDFS orders the files based on
names, but local file systems often don't. I think that's why you see the difference.

We might be able to do a sort and order the partitions when we create an RDD
to make this universal, though.

On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs  wrote:

> Hi all,
> Quick one: when reading files, are the orders of partitions guaranteed to
> be preserved? I am finding some weird behaviour where I run sortByKeys() on
> an RDD (which has 16 byte keys) and write it to disk. If I open a python
> shell and run the following:
>
> for part in range(29):
> print map(ord, 
> open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part),
> 'r').read(16))
>
> Then each partition is in order based on the first value of each partition.
>
> I can also call TeraValidate.validate from TeraSort and it is happy with
> the results. It seems to be on loading the file that the reordering
> happens. If this is expected, is there a way to ask Spark nicely to give me
> the RDD in the order it was saved?
>
> This is based on trying to fix my TeraValidate code on this branch:
> https://github.com/ehiggs/spark/tree/terasort
>
> Thanks,
> Ewan
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark SQL API changes and stabilization

2015-01-16 Thread Reynold Xin
That's a good idea. We didn't intentionally break the doc generation. The
doc generation for Catalyst is broken because we use Scala macros and we
haven't had time to investigate how to fix it yet.

If you have a minute and want to investigate, I can merge it in as soon as
possible.





On Fri, Jan 16, 2015 at 2:11 PM, Alessandro Baretta 
wrote:

> Reynold,
>
> Your clarification is much appreciated. One issue though, that I would
> strongly encourage you to work on, is to make sure that the Scaladoc CAN be
> generated manually if needed (a "Use at your own risk" clause would be
> perfectly legitimate here). The reason I say this is that currently even
> hacking SparkBuild.scala to include SparkSQL in the unidoc target doesn't
> help, as scaladoc itself fails with errors such as these.
>
> [error]
> /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:359:
> polymorphic expression cannot be instantiated to expected type;
> [error]  found   : [T(in method
> apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
> [error]  required:
> org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method
> functionToUdfBuilder)]
> [error]   implicit def functionToUdfBuilder[T: TypeTag](func:
> Function22[_, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
> _, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
> [error]
>
>   ^
> [error]
> /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:147:
> value q is not a member of StringContext
> [error]  Note: implicit class Evaluate2 is not applicable here because it
> comes after the application point and it lacks an explicit result type
> [error] q"""
> [error] ^
> [error]
> /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:181:
> value q is not a member of StringContext
> [error] q"""
> [error] ^
> [error]
> /Users/alex/git/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:198:
> value q is not a member of StringContext
>
> While I understand you desire to discourage users from relying on the
> internal "private" APIs, there is no reason to prevent people from gaining
> a better understanding of how things work by allow them--with some
> effort--to get to the docs.
>
> Thanks,
>
> Alex
>
> On Thu, Jan 15, 2015 at 10:33 AM, Reynold Xin  wrote:
>
>> Alex,
>>
>> I didn't communicate properly. By "private", I simply meant the
>> expectation that it is not a public API. The plan is to still omit it from
>> the scaladoc/javadoc generation, but no language visibility modifier will
>> be applied on them.
>>
>> After 1.3, you will likely no longer need to use things in sql.catalyst
>> package directly. Programmatically construct SchemaRDDs is going to be a
>> first class public API. Data types have already been moved out of the
>> sql.catalyst package and now lives in sql.types. They are becoming stable
>> public APIs. When the "data frame" patch is submitted, you will see a
>> public expression library also. There will be few reason for end users or
>> library developers to hook into things in sql.catalyst. For the bravest and
>> the most advanced, they can still use them, with the expectation that it is
>> subject to change.
>>
>>
>>
>>
>>
>> On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta <
>> alexbare...@gmail.com> wrote:
>>
>>> Reynold,
>>>
>>> Thanks for the heads up. In general, I strongly oppose the use of
>>> "private" to restrict access to certain parts of the API, the reason being
>>> that I might find the need to use some of the internals of a library from
>>> my own project. I find that a @DeveloperAPI annotation serves the same
>>> purpose as "private" without imposing unnecessary restrictions: it
>>> discourages people from using the annotated API and reserves the right for
>>> the core developers to change it suddenly in backwards incompatible ways.
>>>
>>> In particular, I would like to express the desire that the APIs to
>>> programmatically construct SchemaRDDs from an RDD[Row] and a StructType
>>> remain public. All the SparkSQL data type objects should be exposed by the
>>> API, and the jekyll build should not hide the docs as it does now.
>>>
>>> Thanks.
>>>

Re: Semantics of LGTM

2015-01-18 Thread Reynold Xin
Maybe we should just avoid LGTM as a single token when it doesn't actually
meet Patrick's definition, but anybody can still leave comments
like:

"The direction of the PR looks good to me." or "+1 on the direction"

"The build part looks good to me"

...


On Sat, Jan 17, 2015 at 8:49 PM, Kay Ousterhout 
wrote:

> +1 to Patrick's proposal of strong LGTM semantics.  On past projects, I've
> heard the semantics of "LGTM" expressed as "I've looked at this thoroughly
> and take as much ownership as if I wrote the patch myself".  My
> understanding is that this is the level of review we expect for all patches
> that ultimately go into Spark, so it's important to have a way to concisely
> describe when this has been done.
>
> Aaron / Sandy, when have you found the weaker LGTM to be useful?  In the
> cases I've seen, if someone else says "I looked at this very quickly and
> didn't see any glaring problems", it doesn't add any value for subsequent
> reviewers (someone still needs to take a thorough look).
>
> -Kay
>
> On Sat, Jan 17, 2015 at 8:04 PM,  wrote:
>
> > Yeah, the ASF +1 has become partly overloaded to mean both "I would like
> > to see this feature" and "this patch should be committed", although, at
> > least in Hadoop, using +1 on JIRA (as opposed to, say, in a release vote)
> > should unambiguously mean the latter unless qualified in some other way.
> >
> > I don't have any opinion on the specific characters, but I agree with
> > Aaron that it would be nice to have some sort of abbreviation for both
> the
> > strong and weak forms of approval.
> >
> > -Sandy
> >
> > > On Jan 17, 2015, at 7:25 PM, Patrick Wendell 
> wrote:
> > >
> > > I think the ASF +1 is *slightly* different than Google's LGTM, because
> > > it might convey wanting the patch/feature to be merged but not
> > > necessarily saying you did a thorough review and stand behind it's
> > > technical contents. For instance, I've seen people pile on +1's to try
> > > and indicate support for a feature or patch in some projects, even
> > > though they didn't do a thorough technical review. This +1 is
> > > definitely a useful mechanism.
> > >
> > > There is definitely much overlap though in the meaning, though, and
> > > it's largely because Spark had it's own culture around reviews before
> > > it was donated to the ASF, so there is a mix of two styles.
> > >
> > > Nonetheless, I'd prefer to stick with the stronger LGTM semantics I
> > > proposed originally (unlike the one Sandy proposed, e.g.). This is
> > > what I've seen every project using the LGTM convention do (Google, and
> > > some open source projects such as Impala) to indicate technical
> > > sign-off.
> > >
> > > - Patrick
> > >
> > >> On Sat, Jan 17, 2015 at 7:09 PM, Aaron Davidson 
> > wrote:
> > >> I think I've seen something like +2 = "strong LGTM" and +1 = "weak
> LGTM;
> > >> someone else should review" before. It's nice to have a shortcut which
> > isn't
> > >> a sentence when talking about weaker forms of LGTM.
> > >>
> > >> On Sat, Jan 17, 2015 at 6:59 PM,  wrote:
> > >>>
> > >>> I think clarifying these semantics is definitely worthwhile. Maybe
> this
> > >>> complicates the process with additional terminology, but the way I've
> > used
> > >>> these has been:
> > >>>
> > >>> +1 - I think this is safe to merge and, barring objections from
> others,
> > >>> would merge it immediately.
> > >>>
> > >>> LGTM - I have no concerns about this patch, but I don't necessarily
> > feel
> > >>> qualified to make a final call about it.  The TM part acknowledges
> the
> > >>> judgment as a little more subjective.
> > >>>
> > >>> I think having some concise way to express both of these is useful.
> > >>>
> > >>> -Sandy
> > >>>
> >  On Jan 17, 2015, at 5:40 PM, Patrick Wendell 
> > wrote:
> > 
> >  Hey All,
> > 
> >  Just wanted to ping about a minor issue - but one that ends up
> having
> >  consequence given Spark's volume of reviews and commits. As much as
> >  possible, I think that we should try and gear towards "Google Style"
> >  LGTM on reviews. What I mean by this is that LGTM has the following
> >  semantics:
> > 
> >  "I know this code well, or I've looked at it close enough to feel
> >  confident it should be merged. If there are issues/bugs with this
> code
> >  later on, I feel confident I can help with them."
> > 
> >  Here is an alternative semantic:
> > 
> >  "Based on what I know about this part of the code, I don't see any
> >  show-stopper problems with this patch".
> > 
> >  The issue with the latter is that it ultimately erodes the
> >  significance of LGTM, since subsequent reviewers need to reason
> about
> >  what the person meant by saying LGTM. In contrast, having strong
> >  semantics around LGTM can help streamline reviews a lot, especially
> as
> >  reviewers get more experienced and gain trust from the comittership.
> > 
> >  There are several easy ways to

Re: GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Reynold Xin
We will merge https://issues.apache.org/jira/browse/SPARK-3650  for 1.3.
Thanks for reminding!
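
Until that lands, a hedged sketch of working around the documented requirement by
rewriting every edge into canonical orientation (srcId < dstId) before calling
triangleCount; it assumes the input graph has no self-loops:

```
import scala.reflect.ClassTag

import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexRDD}

// Per-vertex triangle counts for an arbitrary graph, after canonicalizing edges.
def canonicalTriangleCounts[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
  val canonicalEdges = graph.edges.map { e =>
    if (e.srcId < e.dstId) Edge(e.srcId, e.dstId, e.attr) else Edge(e.dstId, e.srcId, e.attr)
  }
  Graph(graph.vertices, canonicalEdges)
    .partitionBy(PartitionStrategy.RandomVertexCut) // triangleCount expects a partitioned graph
    .triangleCount()
    .vertices
}
```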


On Sun, Jan 18, 2015 at 8:34 PM, Michael Malak <
michaelma...@yahoo.com.invalid> wrote:

> According to:
>
> https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting
>
> "Note that TriangleCount requires the edges to be in canonical orientation
> (srcId < dstId)"
>
> But isn't this overstating the requirement? Isn't the requirement really
> that IF there are duplicate edges between two vertices, THEN those edges
> must all be in the same direction (in order for the groupEdges() at the
> beginning of triangleCount() to produce the intermediate results that
> triangleCount() expects)?
>
> If so, should I enter a JIRA ticket to clarify the documentation?
>
> Or is it the case that https://issues.apache.org/jira/browse/SPARK-3650
> will make it into Spark 1.3 anyway?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: RDD order guarantees

2015-01-18 Thread Reynold Xin
Hi Ewan,

Not sure if there is a JIRA ticket (there are too many that I lose track).

I chatted briefly with Aaron on this. The way we can solve it is to create
a new FileSystem implementation that overrides the listStatus method, and
then in Hadoop Conf set the fs.file.impl to that.

Shouldn't be too hard. Would you be interested in working on it?
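
A rough sketch of the idea (the class name and wiring are assumptions, not existing
Spark code): wrap the local file system so directory listings come back sorted by
name, then point the "file" scheme at the wrapper.

```
import org.apache.hadoop.fs.{FileStatus, LocalFileSystem, Path}

// Returns directory listings in lexicographic order, so partition files come
// back as part-r-00000, part-r-00001, ...
class SortedLocalFileSystem extends LocalFileSystem {
  override def listStatus(path: Path): Array[FileStatus] =
    super.listStatus(path).sortBy(_.getPath.getName)
}

// Registering it for local paths before creating the RDD:
// sc.hadoopConfiguration.set("fs.file.impl", classOf[SortedLocalFileSystem].getName)
```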




On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs  wrote:

>  Yes, I am running on a local file system.
>
> Is there a bug open for this? Mingyu Kim reported the problem last April:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html
>
> -Ewan
>
>
> On 01/16/2015 07:41 PM, Reynold Xin wrote:
>
> You are running on a local file system right? HDFS orders the file based
> on names, but local file system often don't. I think that's why the
> difference.
>
>  We might be able to do a sort and order the partitions when we create a
> RDD to make this universal though.
>
> On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs  wrote:
>
>> Hi all,
>> Quick one: when reading files, are the orders of partitions guaranteed to
>> be preserved? I am finding some weird behaviour where I run sortByKeys() on
>> an RDD (which has 16 byte keys) and write it to disk. If I open a python
>> shell and run the following:
>>
>> for part in range(29):
>> print map(ord,
>> open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part),
>> 'r').read(16))
>>
>> Then each partition is in order based on the first value of each
>> partition.
>>
>> I can also call TeraValidate.validate from TeraSort and it is happy with
>> the results. It seems to be on loading the file that the reordering
>> happens. If this is expected, is there a way to ask Spark nicely to give me
>> the RDD in the order it was saved?
>>
>> This is based on trying to fix my TeraValidate code on this branch:
>> https://github.com/ehiggs/spark/tree/terasort
>>
>> Thanks,
>> Ewan
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>


Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Reynold Xin
Definitely go for a pull request!


On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies 
wrote:

>
> Looking at Parquet code - it looks like hooks are already in place to
> support this.
>
> In particular PrimitiveConverter has methods hasDictionarySupport and
> addValueFromDictionary for this purpose. These are not used by
> CatalystPrimitiveConverter.
>
> I think that it would be pretty straightforward to add this. Has anyone
> considered this? Shall I get a pull request  together for it.
>
> Mick
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-tp10141p10195.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Will Spark-SQL support vectorized query engine someday?

2015-01-19 Thread Reynold Xin
It will probably eventually make its way into part of the query engine, one
way or another. Note that there is in general a lot of other lower-hanging
fruit before you have to do vectorization.

As far as I know, Hive doesn't really have true vectorization: the
vectorization in Hive simply processes everything in small batches, in
order to avoid the virtual function call overhead, hoping the JVM can
unroll some of the loops. There is no SIMD involved.

Something that is pretty useful, which isn't exactly from vectorization but
comes from similar lines of research, is being able to push predicates down
into the columnar compression encoding. For example, one can turn string
comparisons into integer comparisons. These will probably give much larger
performance improvements in common queries.
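
A toy illustration (plain Scala, not Spark internals) of what pushing a predicate
into a dictionary-encoded column buys: the string literal is resolved against the
small dictionary once, and the per-row work becomes an integer comparison.

```
val dictionary = Array("bronze", "gold", "silver") // distinct values stored once
val encoded    = Array(1, 0, 2, 1, 1, 0)           // the column, stored as dictionary codes

val target   = dictionary.indexOf("gold")          // one lookup against the dictionary
val matching = encoded.count(_ == target)          // the scan itself only compares Ints

println(s"rows matching 'gold': $matching")
```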


On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao  wrote:

> Hi,
>
>  Correct me if I were wrong. It looks like, the current version of
> Spark-SQL is *tuple-at-a-time* module. Basically, each time the physical
> operator produces a tuple by recursively call child->execute .
>
>  There are papers that illustrate the benefits of vectorized query
> engine. And Hive-Stinger also embrace this style.
>
>  So, the question is, will Spark-SQL give a support to vectorized query
> execution someday?
>
>  Thanks
>


Re: Will Spark-SQL support vectorized query engine someday?

2015-01-20 Thread Reynold Xin
I don't know if there is a list, but in general running a performance
profiler can identify a lot of things...

On Tue, Jan 20, 2015 at 12:30 AM, Xuelin Cao 
wrote:

>
> Thanks, Reynold
>
>   Regarding the "lower hanging fruits", can you give me some example?
> Where can I find them in JIRA?
>
>
> On Tue, Jan 20, 2015 at 3:55 PM, Reynold Xin  wrote:
>
>> It will probably eventually make its way into part of the query engine,
>> one way or another. Note that there are in general a lot of other lower
>> hanging fruits before you have to do vectorization.
>>
>> As far as I know, Hive doesn't really have vectorization because the
>> vectorization in Hive is simply writing everything in small batches, in
>> order to avoid the virtual function call overhead, and hoping the JVM can
>> unroll some of the loops. There is no SIMD involved.
>>
>> Something that is pretty useful, which isn't exactly from vectorization
>> but comes from similar lines of research, is being able to push predicates
>> down into the columnar compression encoding. For example, one can turn
>> string comparisons into integer comparisons. These will probably give much
>> larger performance improvements in common queries.
>>
>>
>> On Mon, Jan 19, 2015 at 6:27 PM, Xuelin Cao 
>> wrote:
>>
>>> Hi,
>>>
>>>  Correct me if I were wrong. It looks like, the current version of
>>> Spark-SQL is *tuple-at-a-time* module. Basically, each time the physical
>>> operator produces a tuple by recursively call child->execute .
>>>
>>>  There are papers that illustrate the benefits of vectorized query
>>> engine. And Hive-Stinger also embrace this style.
>>>
>>>  So, the question is, will Spark-SQL give a support to vectorized
>>> query
>>> execution someday?
>>>
>>>  Thanks
>>>
>>
>>
>


Re: not found: type LocalSparkContext

2015-01-20 Thread Reynold Xin
You don't need the LocalSparkContext. It is only for Spark's own unit tests.

You can just create a SparkContext and use it in your unit tests, e.g.

val sc = new SparkContext("local", "my test app", new SparkConf)
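
A minimal sketch of a self-contained suite along those lines (the suite name is taken
from your example and the test itself is illustrative):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class HyperANFSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    sc = new SparkContext("local", "HyperANF test", new SparkConf)
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop() // always stop the context so suites don't leak it
  }

  test("parallelize and count") {
    assert(sc.parallelize(1 to 100).count() === 100)
  }
}
```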

On Tue, Jan 20, 2015 at 7:27 PM, James  wrote:

> I could not correctly import org.apache.spark.LocalSparkContext,
>
> I use sbt on Intellij for developing,here is my build sbt.
>
> ```
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
>
> libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.0"
>
> libraryDependencies += "com.clearspring.analytics" % "stream" % "2.7.0"
>
> libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0"
>
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
> ```
>
> I think maybe I have make some mistakes on the library setting, as a new
> developer of spark application, I wonder what is the standard procedure of
> developing a spark application.
>
> Any reply is appreciated.
>
>
> Alcaid
>
>
> 2015-01-21 2:05 GMT+08:00 Will Benton :
>
> > It's declared here:
> >
> >
> >
> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/LocalSparkContext.scala
> >
> > I assume you're already importing LocalSparkContext, but since the test
> > classes aren't included in Spark packages, you'll also need to package
> them
> > up in order to use them in your application (viz., outside of Spark).
> >
> >
> >
> > best,
> > wb
> >
> > - Original Message -
> > > From: "James" 
> > > To: dev@spark.apache.org
> > > Sent: Tuesday, January 20, 2015 6:35:07 AM
> > > Subject: not found: type LocalSparkContext
> > >
> > > Hi all,
> > >
> > > When I was trying to write a test on my spark application I met
> > >
> > > ```
> > > Error:(14, 43) not found: type LocalSparkContext
> > > class HyperANFSuite extends FunSuite with LocalSparkContext {
> > > ```
> > >
> > > At the source code of spark-core I could not found "LocalSparkContext",
> > > thus I wonder how to write a test like [this] (
> > >
> >
> https://github.com/apache/spark/blob/master/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
> > > )
> > >
> > > Alcaid
> > >
> >
>


renaming SchemaRDD -> DataFrame

2015-01-26 Thread Reynold Xin
Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
get the community's opinion.

The context is that SchemaRDD is becoming a common data format used for
bringing data into Spark from external systems, and used for various
components of Spark, e.g. MLlib's new pipeline API. We also expect more and
more users to be programming directly against the SchemaRDD API rather than the
core RDD API. SchemaRDD, through its less commonly used DSL originally
designed for writing test cases, has always had a data-frame-like API. In
1.3, we are redesigning that API to make it usable for end users.


There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
though it would contain some RDD functions like map, flatMap, etc), and
calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
DataFrame.rdd will return the underlying RDD for all RDD methods.


My understanding is that very few users program directly against the
SchemaRDD API at the moment, because it is not well documented. However,
to maintain backward compatibility, we can create a type alias for DataFrame
that is still named SchemaRDD. This will maintain source compatibility for
Scala. That said, we will have to update all existing materials to use
DataFrame rather than SchemaRDD.


talk on interface design

2015-01-26 Thread Reynold Xin
Hi all,

In Spark, we have historically done reasonably well in interface and API
design, especially compared with some other Big Data systems. However, we
have also made mistakes along the way. I want to share a talk I gave about
interface design at Databricks' internal retreat.

https://speakerdeck.com/rxin/interface-design-for-spark-community

Interface design is a vital part of Spark becoming a long-term sustainable,
thriving framework. Good interfaces can be the project's biggest asset,
while bad interfaces can be the worst technical debt. As the project scales
bigger and bigger, the community is expanding and we are getting a wider
range of contributors that have not thought about this as their everyday
development experience outside Spark.

It is part art, part science, and in some sense an acquired taste. However, I
think there are common issues that can be spotted easily, and common
principles that can address a lot of the low-hanging fruit. Through this
presentation, I hope to bring to everybody's attention the issue of
interface design and encourage everybody to think hard about interface
design in their contributions.


Re: talk on interface design

2015-01-27 Thread Reynold Xin
Thanks, Andrew. That's great material.


On Mon, Jan 26, 2015 at 10:23 PM, Andrew Ash  wrote:

> In addition to the references you have at the end of the presentation,
> there's a great set of practical examples based on the learnings from Qt
> posted here: http://www21.in.tum.de/~blanchet/api-design.pdf
>
> Chapter 4's way of showing a principle and then an example from Qt is
> particularly instructional.
>
> On Tue, Jan 27, 2015 at 1:05 AM, Reynold Xin  wrote:
>
>> Hi all,
>>
>> In Spark, we have done reasonable well historically in interface and API
>> design, especially compared with some other Big Data systems. However, we
>> have also made mistakes along the way. I want to share a talk I gave about
>> interface design at Databricks' internal retreat.
>>
>> https://speakerdeck.com/rxin/interface-design-for-spark-community
>>
>> Interface design is a vital part of Spark becoming a long-term
>> sustainable,
>> thriving framework. Good interfaces can be the project's biggest asset,
>> while bad interfaces can be the worst technical debt. As the project
>> scales
>> bigger and bigger, the community is expanding and we are getting a wider
>> range of contributors that have not thought about this as their everyday
>> development experience outside Spark.
>>
>> It is part-art part-science and in some sense acquired taste. However, I
>> think there are common issues that can be spotted easily, and common
>> principles that can address a lot of the low hanging fruits. Through this
>> presentation, I hope to bring to everybody's attention the issue of
>> interface design and encourage everybody to think hard about interface
>> design in their contributions.
>>
>
>


Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
Dirceu,

That is not possible because one cannot overload return types.

SQLContext.parquetFile (and many other methods) needs to return some type,
and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
break source compatibility for Scala.
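
A self-contained toy (with stand-in class names, not the real Spark ones) of why the
alias preserves source compatibility: both identifiers denote the same type, so code
written against the old name keeps compiling, and methods keep a single return type.

```
class DataFrame(val rows: Seq[Any]) {
  def count(): Long = rows.size.toLong
}

object compat {
  // The 1.3-style alias: SchemaRDD and DataFrame are the same type.
  type SchemaRDD = DataFrame
}

object Demo extends App {
  import compat._
  val df: SchemaRDD = new DataFrame(Seq(1, 2, 3)) // old name, same class
  println(df.count())
}
```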


On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Can't the SchemaRDD remain the same, but deprecated, and be removed in the
> release 1.5(+/- 1)  for example, and the new code been added to DataFrame?
> With this, we don't impact in existing code for the next few releases.
>
>
>
> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
>
> > I want to address the issue that Matei raised about the heavy lifting
> > required for a full SQL support. It is amazing that even after 30 years
> of
> > research there is not a single good open source columnar database like
> > Vertica. There is a column store option in MySQL, but it is not nearly as
> > sophisticated as Vertica or MonetDB. But there's a true need for such a
> > system. I wonder why so and it's high time to change that.
> > On Jan 26, 2015 5:47 PM, "Sandy Ryza"  wrote:
> >
> > > Both SchemaRDD and DataFrame sound fine to me, though I like the former
> > > slightly better because it's more descriptive.
> > >
> > > Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
> would
> > > be more clear from a user-facing perspective to at least choose a
> package
> > > name for it that omits "sql".
> > >
> > > I would also be in favor of adding a separate Spark Schema module for
> > Spark
> > > SQL to rely on, but I imagine that might be too large a change at this
> > > point?
> > >
> > > -Sandy
> > >
> > > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> matei.zaha...@gmail.com>
> > > wrote:
> > >
> > > > (Actually when we designed Spark SQL we thought of giving it another
> > > name,
> > > > like Spark Schema, but we decided to stick with SQL since that was
> the
> > > most
> > > > obvious use case to many users.)
> > > >
> > > > Matei
> > > >
> > > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> matei.zaha...@gmail.com>
> > > > wrote:
> > > > >
> > > > > While it might be possible to move this concept to Spark Core
> > > long-term,
> > > > supporting structured data efficiently does require quite a bit of
> the
> > > > infrastructure in Spark SQL, such as query planning and columnar
> > storage.
> > > > The intent of Spark SQL though is to be more than a SQL server --
> it's
> > > > meant to be a library for manipulating structured data. Since this is
> > > > possible to build over the core API, it's pretty natural to organize
> it
> > > > that way, same as Spark Streaming is a library.
> > > > >
> > > > > Matei
> > > > >
> > > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers 
> > wrote:
> > > > >>
> > > > >> "The context is that SchemaRDD is becoming a common data format
> used
> > > for
> > > > >> bringing data into Spark from external systems, and used for
> various
> > > > >> components of Spark, e.g. MLlib's new pipeline API."
> > > > >>
> > > > >> i agree. this to me also implies it belongs in spark core, not sql
> > > > >>
> > > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> > > > >> michaelma...@yahoo.com.invalid> wrote:
> > > > >>
> > > > >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13
> > Bay
> > > > Area
> > > > >>> Spark Meetup YouTube contained a wealth of background information
> > on
> > > > this
> > > > >>> idea (mostly from Patrick and Reynold :-).
> > > > >>>
> > > > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> > > > >>>
> > > > >>> 
> > > > >>> From: Patrick Wendell 
> > > > >>> To: Reynold Xin 
> > > > >>> Cc: "dev@spark.apache.org" 
> > > > >>> Sent: Monday, January 26, 2015 4:01 PM
> >

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
Koert,

As Mark said, I have already refactored the API so that nothing in catalyst
is exposed (and users won't need it anyway). Data types and the Row interface
are both outside the catalyst package, in org.apache.spark.sql.

On Tue, Jan 27, 2015 at 9:08 AM, Koert Kuipers  wrote:

> hey matei,
> i think that stuff such as SchemaRDD, columar storage and perhaps also
> query planning can be re-used by many systems that do analysis on
> structured data. i can imagine panda-like systems, but also datalog or
> scalding-like (which we use at tresata and i might rebase on SchemaRDD at
> some point). SchemaRDD should become the interface for all these. and
> columnar storage abstractions should be re-used between all these.
>
> currently the sql tie in is way beyond just the (perhaps unfortunate)
> naming convention. for example a core part of the SchemaRD abstraction is
> Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone
> that want to build on top of SchemaRDD to dig into catalyst, a SQL Parser
> (if i understand it correctly, i have not used catalyst, but it looks
> neat). i should not need to include a SQL parser just to use structured
> data in say a panda-like framework.
>
> best, koert
>
>
> On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia 
> wrote:
>
>> While it might be possible to move this concept to Spark Core long-term,
>> supporting structured data efficiently does require quite a bit of the
>> infrastructure in Spark SQL, such as query planning and columnar storage.
>> The intent of Spark SQL though is to be more than a SQL server -- it's
>> meant to be a library for manipulating structured data. Since this is
>> possible to build over the core API, it's pretty natural to organize it
>> that way, same as Spark Streaming is a library.
>>
>> Matei
>>
>> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers  wrote:
>> >
>> > "The context is that SchemaRDD is becoming a common data format used for
>> > bringing data into Spark from external systems, and used for various
>> > components of Spark, e.g. MLlib's new pipeline API."
>> >
>> > i agree. this to me also implies it belongs in spark core, not sql
>> >
>> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> > michaelma...@yahoo.com.invalid> wrote:
>> >
>> >> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
>> Area
>> >> Spark Meetup YouTube contained a wealth of background information on
>> this
>> >> idea (mostly from Patrick and Reynold :-).
>> >>
>> >> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >>
>> >> 
>> >> From: Patrick Wendell 
>> >> To: Reynold Xin 
>> >> Cc: "dev@spark.apache.org" 
>> >> Sent: Monday, January 26, 2015 4:01 PM
>> >> Subject: Re: renaming SchemaRDD -> DataFrame
>> >>
>> >>
>> >> One thing potentially not clear from this e-mail, there will be a 1:1
>> >> correspondence where you can get an RDD to/from a DataFrame.
>> >>
>> >>
>> >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
>> wrote:
>> >>> Hi,
>> >>>
>> >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
>> to
>> >>> get the community's opinion.
>> >>>
>> >>> The context is that SchemaRDD is becoming a common data format used
>> for
>> >>> bringing data into Spark from external systems, and used for various
>> >>> components of Spark, e.g. MLlib's new pipeline API. We also expect
>> more
>> >> and
>> >>> more users to be programming directly against SchemaRDD API rather
>> than
>> >> the
>> >>> core RDD API. SchemaRDD, through its less commonly used DSL originally
>> >>> designed for writing test cases, always has the data-frame like API.
>> In
>> >>> 1.3, we are redesigning the API to make the API usable for end users.
>> >>>
>> >>>
>> >>> There are two motivations for the renaming:
>> >>>
>> >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>> >>>
>> >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
>> (even
>> >>> though it would contain some RDD functions like map, flatMap, etc),
>> and
>> >>> calling it Schema*RDD* while it is not

Re: [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-27 Thread Reynold Xin
+1

Tested on Mac OS X

On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar 
wrote:

> +1
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
>  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
> -Dhadoop.version=2.6.0 -Phive -DskipTests
> 2. Tested pyspark, mlib - running as well as compare results with 1.1.x &
> 1.2.0
> 2.1. statistics OK
> 2.2. Linear/Ridge/Laso Regression OK
> 2.3. Decision Tree, Naive Bayes OK
> 2.4. KMeans OK
>Center And Scale OK
>Fixed : org.apache.spark.SparkException in zip !
> 2.5. rdd operations OK
>State of the Union Texts - MapReduce, Filter,sortByKey (word count)
> 2.6. recommendation OK
>
> Cheers
> 
>
> On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell 
> wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> > 1.2.1!
> >
> > The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
> >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1061/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.2.1!
> >
> > The vote is open until Friday, January 30, at 07:00 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.1
> > [ ] -1 Do not release this package because ...
> >
> > For a list of fixes in this release, see http://s.apache.org/Mpn.
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Reynold Xin
Alright I have merged the patch ( https://github.com/apache/spark/pull/4173
) since I don't see any strong opinions against it (as a matter of fact
most were for it). We can still change it if somebody lays out a strong
argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia 
wrote:

> The type alias means your methods can specify either type and they will
> work. It's just another name for the same type. But Scaladocs and such will
> show DataFrame as the type.
>
> Matei
>
> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
> >
> > Reynold,
> > But with type alias we will have the same problem, right?
> > If the methods doesn't receive schemardd anymore, we will have to change
> > our code to migrade from schema to dataframe. Unless we have an implicit
> > conversion between DataFrame and SchemaRDD
> >
> >
> >
> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> >
> >> Dirceu,
> >>
> >> That is not possible because one cannot overload return types.
> >>
> >> SQLContext.parquetFile (and many other methods) needs to return some
> type,
> >> and that type cannot be both SchemaRDD and DataFrame.
> >>
> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
> not
> >> break source compatibility for Scala.
> >>
> >>
> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> dirceu.semigh...@gmail.com> wrote:
> >>
> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
> the
> >>> release 1.5(+/- 1)  for example, and the new code been added to
> DataFrame?
> >>> With this, we don't impact in existing code for the next few releases.
> >>>
> >>>
> >>>
> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
> >>>
> >>>> I want to address the issue that Matei raised about the heavy lifting
> >>>> required for a full SQL support. It is amazing that even after 30
> years
> >>> of
> >>>> research there is not a single good open source columnar database like
> >>>> Vertica. There is a column store option in MySQL, but it is not nearly
> >>> as
> >>>> sophisticated as Vertica or MonetDB. But there's a true need for such
> a
> >>>> system. I wonder why so and it's high time to change that.
> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" 
> wrote:
> >>>>
> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the
> >>> former
> >>>>> slightly better because it's more descriptive.
> >>>>>
> >>>>> Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
> >>> would
> >>>>> be more clear from a user-facing perspective to at least choose a
> >>> package
> >>>>> name for it that omits "sql".
> >>>>>
> >>>>> I would also be in favor of adding a separate Spark Schema module for
> >>>> Spark
> >>>>> SQL to rely on, but I imagine that might be too large a change at
> this
> >>>>> point?
> >>>>>
> >>>>> -Sandy
> >>>>>
> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >>> matei.zaha...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> (Actually when we designed Spark SQL we thought of giving it another
> >>>>> name,
> >>>>>> like Spark Schema, but we decided to stick with SQL since that was
> >>> the
> >>>>> most
> >>>>>> obvious use case to many users.)
> >>>>>>
> >>>>>> Matei
> >>>>>>
> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >>> matei.zaha...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> While it might be possible to move this concept to Spark Core
> >>>>> long-term,
> >>>>>> supporting structured data efficiently does require quite a bit of
> >>> the
> >>>>>> infrastructure in Spark SQL, such as query planning and columnar
> >>>> storage.
> >>>>>> The intent of Spark SQL though is to be more than a SQL server --

Re: Data source API | Support for dynamic schema

2015-01-28 Thread Reynold Xin
It's an interesting idea, but there are major challenges with per row
schema.

1. Performance - the query optimizer and execution engine use assumptions about
schema and data to generate optimized query plans. Having to re-reason about
the schema for each row can substantially slow down the engine, both because
optimizations are lost and because of the overhead of carrying schema
information with each row.

2. Data model: per-row schema is fundamentally a different data model. The
current relational model has gone through 40 years of research and has
very well defined semantics. I don't think there are well defined semantics
for a per-row schema data model. For example, what are the semantics of a
UDF operating on a data cell that has an incompatible schema?
Should we coerce or convert the data type? If yes, will that lead to
conflicting semantics with some other rules? We need to answer questions
like this in order to have a robust data model.





On Wed, Jan 28, 2015 at 11:26 AM, Cheng Lian  wrote:

> Hi Aniket,
>
> In general the schema of all rows in a single table must be same. This is
> a basic assumption made by Spark SQL. Schema union does make sense, and
> we're planning to support this for Parquet. But as you've mentioned, it
> doesn't help if types of different versions of a column differ from each
> other. Also, you need to reload the data source table after schema changes
> happen.
>
> Cheng
>
>
> On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:
>
>> I saw the talk on Spark data sources and looking at the interfaces, it
>> seems that the schema needs to be provided upfront. This works for many
>> data sources but I have a situation in which I would need to integrate a
>> system that supports schema evolutions by allowing users to change schema
>> without affecting existing rows. Basically, each row contains a schema
>> hint
>> (id and version) and this allows developers to evolve schema over time and
>> perform migration at will. Since the schema needs to be specified upfront
>> in the data source API, one possible way would be to build a union of all
>> schema versions and handle populating row values appropriately. This works
>> in case columns have been added or deleted in the schema but doesn't work
>> if types have changed. I was wondering if it is possible to change the API
>>   to provide schema for each row instead of expecting data source to
>> provide
>> schema upfront?
>>
>> Thanks,
>> Aniket
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Reynold Xin
Hopefully problems like this will go away entirely in the next couple of
releases. https://issues.apache.org/jira/browse/SPARK-5293



On Wed, Jan 28, 2015 at 3:12 PM, jay vyas 
wrote:

> Hi spark. Where is akka coming from in spark ?
>
> I see the distribution referenced is a spark artifact... but not in the
> apache namespace.
>
>  org.spark-project.akka
>  2.3.4-spark
>
> Clearly this is a deliberate, thought-out change (see SPARK-1812), but it's
> not clear where 2.3.4-spark is coming from and who is maintaining its
> release?
>
> --
> jay vyas
>
> PS
>
> I've had some conversations with will benton as well about this, and it's
> clear that some modifications to akka are needed, or else a protobuf error
> occurs, which amounts to serialization incompatibilities. Hence, if one wants
> to build spark from sources, the patched akka is required (or else manual
> patching needs to be done)...
>
> 15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread
> [sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down
> ActorSystem [sparkWorker] java.lang.VerifyError: class
> akka.remote.WireFormats$AkkaControlMessage overrides final method
> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>


Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Reynold Xin
It shouldn't change the data source api at all because data sources create
RDD[Row], and that gets converted into a DataFrame automatically
(previously to SchemaRDD).

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location of
types. Types were previously defined in sql.catalyst.types, and now moved
to sql.types. After 1.3, sql.catalyst is hidden from users, and all public
APIs have first class classes/objects defined in sql directly.
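
Concretely, for data source authors the change is just the import location (a
small sketch; the schema classes themselves are unchanged):

// Before 1.3 the types lived in the (now hidden) catalyst package:
//   import org.apache.spark.sql.catalyst.types.{StringType, StructField, StructType}
// From 1.3 on, the public classes are defined directly under sql.types:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("name", StringType, nullable = true)))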



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan  wrote:

> Hey guys,
>
> How does this impact the data sources API?  I was planning on using
> this for a project.
>
> +1 that many things from spark-sql / DataFrame is universally
> desirable and useful.
>
> By the way, one thing that prevents the columnar compression stuff in
> Spark SQL from being more useful is, at least from previous talks with
> Reynold and Michael et al., that the format was not designed for
> persistence.
>
> I have a new project that aims to change that.  It is a
> zero-serialisation, high performance binary vector library, designed
> from the outset to be persistent-storage friendly.  Maybe one day
> it can replace the Spark SQL columnar compression.
>
> Michael told me this would be a lot of work, and recreates parts of
> Parquet, but I think it's worth it.  LMK if you'd like more details.
>
> -Evan
>
> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin  wrote:
> > Alright I have merged the patch (
> https://github.com/apache/spark/pull/4173
> > ) since I don't see any strong opinions against it (as a matter of fact
> > most were for it). We can still change it if somebody lays out a strong
> > argument.
> >
> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia  >
> > wrote:
> >
> >> The type alias means your methods can specify either type and they will
> >> work. It's just another name for the same type. But Scaladocs and such
> will
> >> show DataFrame as the type.
> >>
> >> Matei
> >>
> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> dirceu.semigh...@gmail.com> wrote:
> >> >
> >> > Reynold,
> >> > But with type alias we will have the same problem, right?
> >> > If the methods doesn't receive schemardd anymore, we will have to
> change
> >> > our code to migrade from schema to dataframe. Unless we have an
> implicit
> >> > conversion between DataFrame and SchemaRDD
> >> >
> >> >
> >> >
> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> >> >
> >> >> Dirceu,
> >> >>
> >> >> That is not possible because one cannot overload return types.
> >> >>
> >> >> SQLContext.parquetFile (and many other methods) needs to return some
> >> type,
> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >>
> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
> >> not
> >> >> break source compatibility for Scala.
> >> >>
> >> >>
> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >>
> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed
> in
> >> the
> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> >> DataFrame?
> >> >>> With this, we don't impact in existing code for the next few
> releases.
> >> >>>
> >> >>>
> >> >>>
> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
> >> >>>
> >> >>>> I want to address the issue that Matei raised about the heavy
> lifting
> >> >>>> required for a full SQL support. It is amazing that even after 30
> >> years
> >> >>> of
> >> >>>> research there is not a single good open source columnar database
> like
> >> >>>> Vertica. There is a column store option in MySQL, but it is not
> nearly
> >> >>> as
> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need for
> such
> >> a
> >> >>>> system. I wonder why so and it's high time to change that.
> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" 

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Reynold Xin
Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan  wrote:

> I believe that most DataFrame implementations out there, like Pandas,
> supports the idea of missing values / NA, and some support the idea of
> Not Meaningful as well.
>
> Does Row support anything like that?  That is important for certain
> applications.  I thought that Row worked by being a mutable object,
> but haven't looked into the details in a while.
>
> -Evan
>
> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin  wrote:
> > It shouldn't change the data source api at all because data sources
> create
> > RDD[Row], and that gets converted into a DataFrame automatically
> (previously
> > to SchemaRDD).
> >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >
> > One thing that will break the data source API in 1.3 is the location of
> > types. Types were previously defined in sql.catalyst.types, and now
> moved to
> > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
> APIs
> > have first class classes/objects defined in sql directly.
> >
> >
> >
> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
> wrote:
> >>
> >> Hey guys,
> >>
> >> How does this impact the data sources API?  I was planning on using
> >> this for a project.
> >>
> >> +1 that many things from spark-sql / DataFrame is universally
> >> desirable and useful.
> >>
> >> By the way, one thing that prevents the columnar compression stuff in
> >> Spark SQL from being more useful is, at least from previous talks with
> >> Reynold and Michael et al., that the format was not designed for
> >> persistence.
> >>
> >> I have a new project that aims to change that.  It is a
> >> zero-serialisation, high performance binary vector library, designed
> >> from the outset to be a persistent storage friendly.  May be one day
> >> it can replace the Spark SQL columnar compression.
> >>
> >> Michael told me this would be a lot of work, and recreates parts of
> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> >>
> >> -Evan
> >>
> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
> wrote:
> >> > Alright I have merged the patch (
> >> > https://github.com/apache/spark/pull/4173
> >> > ) since I don't see any strong opinions against it (as a matter of
> fact
> >> > most were for it). We can still change it if somebody lays out a
> strong
> >> > argument.
> >> >
> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> > 
> >> > wrote:
> >> >
> >> >> The type alias means your methods can specify either type and they
> will
> >> >> work. It's just another name for the same type. But Scaladocs and
> such
> >> >> will
> >> >> show DataFrame as the type.
> >> >>
> >> >> Matei
> >> >>
> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >> >
> >> >> > Reynold,
> >> >> > But with type alias we will have the same problem, right?
> >> >> > If the methods doesn't receive schemardd anymore, we will have to
> >> >> > change
> >> >> > our code to migrade from schema to dataframe. Unless we have an
> >> >> > implicit
> >> >> > conversion between DataFrame and SchemaRDD
> >> >> >
> >> >> >
> >> >> >
> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> >> >> >
> >> >> >> Dirceu,
> >> >> >>
> >> >> >> That is not possible because one cannot overload return types.
> >> >> >>
> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> some
> >> >> type,
> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >> >>
> >> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD
> >> >> >> to
> >> >> not
> >> >> >> break source compatibility for Scala.
> >> >> >>
> >> >> >>
> >> >>

Re: emergency jenkins restart soon

2015-01-28 Thread Reynold Xin
Thanks for doing that, Shane!


On Wed, Jan 28, 2015 at 10:29 PM, shane knapp  wrote:

> jenkins is back up and all builds have been retriggered...  things are
> building and looking good, and i'll keep an eye on the spark master builds
> tonite and tomorrow.
>
> On Wed, Jan 28, 2015 at 9:56 PM, shane knapp  wrote:
>
> > the spark master builds stopped triggering ~yesterday and the logs don't
> > show anything.  i'm going to give the current batch of spark pull request
> > builder jobs a little more time (~30 mins) to finish, then kill whatever
> is
> > left and restart jenkins.  anything that was queued or killed will be
> > retriggered once jenkins is back up.
> >
> > sorry for the inconvenience, we'll get this sorted asap.
> >
> > thanks,
> >
> > shane
> >
>


Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
Once the data frame API is released for 1.3, you can write your thing in
Python and get the same performance. It can't express everything, but for
basic things like projection, filter, join, aggregate and simple numeric
computation, it should work pretty well.


On Thu, Jan 29, 2015 at 12:45 PM, rtshadow 
wrote:

> Hi,
>
> In my company, we've been trying to use PySpark to run ETLs on our data.
> Alas, it turned out to be terribly slow compared to Java or Scala API
> (which
> we ended up using to meet performance criteria).
>
> To be more quantitative, let's consider simple case:
> I've generated test file (848MB): /seq 1 1 > /tmp/test/
>
> and tried to run simple computation on it, which includes three steps: read
> -> multiply each row by 2 -> take max
> Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
> Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
>
> Here are the results of this simple benchmark:
> CPython - 59s
> PyPy - 26s
> Scala version - 7s
>
> I didn't dig into what exactly contributes to execution times of CPython /
> PyPy, but it seems that serialization / deserialization, when sending data
> to the worker may be the issue.
> I know some guys already have been asking about using Jython
> (
> http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
> ,
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
> ),
> but it seems, that no one have really done this with Spark.
> It looks like performance gain from using jython can be huge - you wouldn't
> need to spawn PythonWorkers, all the code would be just executed inside
> SparkExecutor JVM, using python code compiled to java bytecode. Do you
> think
> that's possible to achieve? Do you see any obvious obstacles? Of course,
> jython doesn't have C extensions, but if one doesn't need them, then it
> should fit here nicely.
>
> I'm willing to try to marry Spark with Jython and see how it goes.
>
> What do you think about this?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Reynold Xin
It is something like this: https://issues.apache.org/jira/browse/SPARK-5097

On the master branch, we have a Pandas-like API already.


On Thu, Jan 29, 2015 at 4:31 PM, Sasha Kacanski  wrote:

> Hi Reynold,
> In my project I want to use Python API too.
> When you mention DF's are we talking about pandas or this is something
> internal to spark py api.
> If you could elaborate a bit on this or point me to alternate
> documentation.
> Thanks much --sasha
>
> On Thu, Jan 29, 2015 at 4:12 PM, Reynold Xin  wrote:
>
>> Once the data frame API is released for 1.3, you can write your thing in
>> Python and get the same performance. It can't express everything, but for
>> basic things like projection, filter, join, aggregate and simple numeric
>> computation, it should work pretty well.
>>
>>
>> On Thu, Jan 29, 2015 at 12:45 PM, rtshadow <
>> pastuszka.przemys...@gmail.com>
>> wrote:
>>
>> > Hi,
>> >
>> > In my company, we've been trying to use PySpark to run ETLs on our data.
>> > Alas, it turned out to be terribly slow compared to Java or Scala API
>> > (which
>> > we ended up using to meet performance criteria).
>> >
>> > To be more quantitative, let's consider simple case:
>> > I've generated test file (848MB): /seq 1 1 > /tmp/test/
>> >
>> > and tried to run simple computation on it, which includes three steps:
>> read
>> > -> multiply each row by 2 -> take max
>> > Code in python: /sc.textFile("/tmp/test").map(lambda x: x * 2).max()/
>> > Code in scala: /sc.textFile("/tmp/test").map(x => x * 2).max()/
>> >
>> > Here are the results of this simple benchmark:
>> > CPython - 59s
>> > PyPy - 26s
>> > Scala version - 7s
>> >
>> > I didn't dig into what exactly contributes to execution times of
>> CPython /
>> > PyPy, but it seems that serialization / deserialization, when sending
>> data
>> > to the worker may be the issue.
>> > I know some guys already have been asking about using Jython
>> > (
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Jython-importing-pyspark-td8654.html#a8658
>> > ,
>> >
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-Driver-from-Jython-td7142.html
>> > ),
>> > but it seems, that no one have really done this with Spark.
>> > It looks like performance gain from using jython can be huge - you
>> wouldn't
>> > need to spawn PythonWorkers, all the code would be just executed inside
>> > SparkExecutor JVM, using python code compiled to java bytecode. Do you
>> > think
>> > that's possible to achieve? Do you see any obvious obstacles? Of course,
>> > jython doesn't have C extensions, but if one doesn't need them, then it
>> > should fit here nicely.
>> >
>> > I'm willing to try to marry Spark with Jython and see how it goes.
>> >
>> > What do you think about this?
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/How-to-speed-PySpark-to-match-Scala-Java-performance-tp10356.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> > Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>>
>
>
>
> --
> Aleksandar Kacanski
>


Re: [spark-sql] JsonRDD

2015-02-02 Thread Reynold Xin
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util
methods.

The case sensitivity issues seem orthogonal, and it would be great to be able
to control that with a flag.
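
If such a flag existed, usage would look something like the sketch below (the
config key is hypothetical here, purely to illustrate the idea):

// Hypothetical knob: ask Spark SQL to treat column names case-insensitively so
// the inferred JSON schema lines up with Hive's expectations.
sqlContext.setConf("spark.sql.caseSensitive", "false")
val events = sqlContext.jsonFile("/tmp/events.json")
events.saveAsTable("events")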


On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov 
wrote:

> Hey Spark developers,
>
> Is there a good reason for JsonRDD being a Scala object as opposed to
> class? Seems most other RDDs are classes, and can be extended.
>
> The reason I'm asking is that there is a problem with Hive interoperability
> with JSON DataFrames where jsonFile generates case sensitive schema, while
> Hive expects case insensitive and fails with an exception during
> saveAsTable if there are two columns with the same name in different case.
>
> I'm trying to resolve the problem, but that requires me to extend JsonRDD,
> which I can't do. Other RDDs are subclass friendly, why is JsonRDD
> different?
>
> Dan
>


Re: SparkSubmit.scala and stderr

2015-02-03 Thread Reynold Xin
We can use ScalaTest's privateMethodTester also instead of exposing that.

On Tue, Feb 3, 2015 at 2:18 PM, Marcelo Vanzin  wrote:

> Hi Jay,
>
> On Tue, Feb 3, 2015 at 6:28 AM, jayhutfles  wrote:
> > // Exposed for testing
> > private[spark] var printStream: PrintStream = System.err
>
> > But as the comment states that it's for testing, maybe I'm
> > misunderstanding its intent...
>
> The comment is there to tell someone reading the code that this field
> is a `var` and not private just because test code (SparkSubmitSuite in
> this case) needs to modify it, otherwise it wouldn't exist or would be
> private. Similar in spirit to this annotation:
>
>
> http://guava-libraries.googlecode.com/svn/tags/release09/javadoc/com/google/common/annotations/VisibleForTesting.html
>
> (Which I'd probably have used in this case, but is not really common
> in Spark code.)
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: 2GB limit for partitions?

2015-02-03 Thread Reynold Xin
cc dev list


How are you saving the data? There are two relevant 2GB limits:

1. Caching

2. Shuffle


For caching, a partition is turned into a single block.

For shuffle, each map partition is partitioned into R blocks, where R =
number of reduce tasks. It is unlikely a shuffle block > 2G, although it
can still happen.

I think the 2nd problem is easier to fix than the 1st, because we can
handle that in the network transport layer. It'd require us to divide the
transfer of a very large block into multiple smaller blocks.
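
As a rough rule of thumb until then, you can size the number of partitions so
that no single cached partition comes anywhere near the 2GB ceiling. A
back-of-the-envelope sketch (the numbers are illustrative and `rdd` stands in
for whatever you are caching):

val totalBytes = 100L * 1024 * 1024 * 1024                 // say ~100GB of input
val maxBlockBytes = 2L * 1024 * 1024 * 1024                // the per-block ceiling being discussed
val minPartitions = math.ceil(totalBytes.toDouble / maxBlockBytes).toInt  // 50 here
val repartitioned = rdd.repartition(minPartitions * 4)     // leave headroom for skew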



On Tue, Feb 3, 2015 at 3:00 PM, Imran Rashid  wrote:

> Michael,
>
> you are right, there is definitely some limit at 2GB.  Here is a trivial
> example to demonstrate it:
>
> import org.apache.spark.storage.StorageLevel
> val d = sc.parallelize(1 to 1e6.toInt, 1).map{i => new
> Array[Byte](5e3.toInt)}.persist(StorageLevel.DISK_ONLY)
> d.count()
>
> It gives the same error you are observing.  I was under the same
> impression as Sean about the limits only being on blocks, not partitions --
> but clearly that isn't the case here.
>
> I don't know the whole story yet, but I just wanted to at least let you
> know you aren't crazy :)
> At the very least this suggests that you might need to make smaller
> partitions for now.
>
> Imran
>
>
> On Tue, Feb 3, 2015 at 4:58 AM, Michael Albert <
> m_albert...@yahoo.com.invalid> wrote:
>
>> Greetings!
>>
>> Thanks for the response.
>>
>> Below is an example of the exception I saw.
>> I'd rather not post code at the moment, so I realize it is completely
>> unreasonable to ask for a diagnosis.
>> However, I will say that adding a "partitionBy()" was the last change
>> before this error was created.
>>
>>
>> Thanks for your time and any thoughts you might have.
>>
>> Sincerely,
>>  Mike
>>
>>
>>
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 4 in stage 5.0 failed 4 times, most recent
>> failure: Lost task 4.3 in stage 5.0 (TID 6012,
>> ip-10-171-0-31.ec2.internal): java.lang.RuntimeException:
>> java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
>> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
>> at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
>> at
>> org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
>> at
>> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307)
>> at
>> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>> at
>> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>> org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
>>
>>
>>   --
>>  *From:* Sean Owen 
>> *To:* Michael Albert 
>> *Cc:* "u...@spark.apache.org" 
>> *Sent:* Monday, February 2, 2015 10:13 PM
>> *Subject:* Re: 2GB limit for partitions?
>>
>> The limit is on blocks, not partitions. Partitions have many blocks.
>>
>> It sounds like you are creating very large values in memory, but I'm
>> not sure given your description. You will run into problems if a
>> single object is more than 2GB, of course. More of the stack trace
>> might show what is mapping that much memory.
>>
>> If you simply want data into 1000 files it's a lot simpler. Just
>> repartition into 1000 partitions and save the data. If you need more
>> control over what goes into which partition, use a Partitioner, yes.
>>
>>
>>
>> On Mon, Feb 2, 2015 at 8:40 PM, Michael Albert
>>  wrote:
>> > Greetings!
>> >
>> > SPARK-1476 says that there is a 2G limit for "blocks".
>> > Is this the same as a 2G limit for partitions (or approximately so?)?
>> >
>> >
>> > What I had been attempting to do is the following.
>> > 1) Start with a moderately large data set (currently about 100GB, but
>> > growing).
>> > 2) Create about 1,000 files (yes, files) each representing a subset of
>> the
>> > data.
>> >
>> > The current attempt I am working on is something like this.
>> > 1) Do a "map" whose output key indicates which of the 1,000 files it
>> will go
>> > into and whose value is what I will want to stick into the file.
>> > 2) Partition the data and use the body of mapPartition to open a file
>> and
>> > save the data.
>> >
>> > My apologies, this is actually embedded in a bigger mess, so I won't
>> post
>> >

ASF Git / GitHub sync is down

2015-02-03 Thread Reynold Xin
Haven't sync-ed anything for the last 4 hours. Seems like this little piece
of infrastructure always stops working around our own code freeze time ...


Re: ASF Git / GitHub sync is down

2015-02-03 Thread Reynold Xin
I filed an INFRA ticket: https://issues.apache.org/jira/browse/INFRA-9115



I wish ASF can reconsider requests like this in order to handle downtime
gracefully https://issues.apache.org/jira/browse/INFRA-8738

On Tue, Feb 3, 2015 at 9:09 PM, Reynold Xin  wrote:

> Haven't sync-ed anything for the last 4 hours. Seems like this little
> piece of infrastructure always stops working around our own code freeze
> time ...
>
>


Re: multi-line comment style

2015-02-04 Thread Reynold Xin
We should update the style doc to reflect what we have in most places
(which I think is //).



On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> FWIW I like the multi-line // over /* */ from a purely style standpoint.
> The Google Java style guide[1] has some comment about code formatting tools
> working better with /* */ but there doesn't seem to be any strong arguments
> for one over the other I can find
>
> Thanks
> Shivaram
>
> [1]
>
> https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
>
> On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell 
> wrote:
>
> > Personally I have no opinion, but agree it would be nice to standardize.
> >
> > - Patrick
> >
> > On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen  wrote:
> > > One thing Marcelo pointed out to me is that the // style does not
> > > interfere with commenting out blocks of code with /* */, which is a
> > > small good thing. I am also accustomed to // style for multiline, and
> > > reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */ style
> > > inline always looks a little funny to me.
> > >
> > > On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout <
> kayousterh...@gmail.com>
> > wrote:
> > >> Hi all,
> > >>
> > >> The Spark Style Guide
> > >> <
> > https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
> >
> > >> says multi-line comments should formatted as:
> > >>
> > >> /*
> > >>  * This is a
> > >>  * very
> > >>  * long comment.
> > >>  */
> > >>
> > >> But in my experience, we almost always use "//" for multi-line
> comments:
> > >>
> > >> // This is a
> > >> // very
> > >> // long comment.
> > >>
> > >> Here are some examples:
> > >>
> > >>- Recent commit by Reynold, king of style:
> > >>
> >
> https://github.com/apache/spark/commit/bebf4c42bef3e75d31ffce9bfdb331c16f34ddb1#diff-d616b5496d1a9f648864f4ab0db5a026R58
> > >>- RDD.scala:
> > >>
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L361
> > >>- DAGScheduler.scala:
> > >>
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L281
> > >>
> > >>
> > >> Any objections to me updating the style guide to reflect this?  As
> with
> > >> other style issues, I think consistency here is helpful (and
> formatting
> > >> multi-line comments as "//" does nicely visually distinguish code
> > comments
> > >> from doc comments).
> > >>
> > >> -Kay
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: Data source API | sizeInBytes should be to *Scan

2015-02-08 Thread Reynold Xin
We thought about this today after seeing this email. I actually built a
patch for this (adding filter/column information to data source stat estimation),
but ultimately dropped it due to the potential problems the change could cause.

The main problem I see is that column pruning/predicate pushdowns are
advisory, i.e. the data source might or might not apply those filters.

Without significantly complicating the data source API, it is hard for the
optimizer (and future cardinality estimation) to know whether the pushed-down
filters/columns were actually applied, and whether to incorporate that in
cardinality estimation.

Imagine this scenario: a data source applies a filter and estimates the
filter's selectivity at 0.1, so the data set is reduced to 10% of its original
size. Catalyst's own cardinality estimation estimates the filter selectivity
at 0.1 again, and thus the estimated data size is now 1% of the original data
size, dropping below some threshold. Catalyst decides to broadcast the table,
but the actual size of the filtered data is 10x the estimate.
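
In numbers, the double counting looks like this (sizes are illustrative):

val original = 1000L * 1000 * 1000                    // a ~1GB relation
val afterPushdown = (original * 0.1).toLong           // the source already applied the filter: ~100MB
val catalystEstimate = (afterPushdown * 0.1).toLong   // Catalyst applies the same 0.1 again: ~10MB
// ~10MB is right around the default autoBroadcastJoinThreshold, so Catalyst may pick a
// broadcast join even though the real post-filter size is 10x the estimate.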





On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar  wrote:

> Hi Spark SQL committers
>
> I have started experimenting with data sources API and I was wondering if
> it makes sense to move the method sizeInBytes from BaseRelation to Scan
> interfaces. This is because that a relation may be able to leverage filter
> push down to estimate size potentially making a very large relation
> broadcast-able. Thoughts?
>
> Aniket
>


Re: Spark SQL Window Functions

2015-02-08 Thread Reynold Xin
This is the original ticket:
https://issues.apache.org/jira/browse/SPARK-1442

I believe it will happen, one way or another :)


On Fri, Feb 6, 2015 at 5:29 PM, Evan R. Sparks 
wrote:

> Currently there's no standard way of handling time series data in Spark. We
> were kicking around some ideas in the lab today and one thing that came up
> was SQL Window Functions as a way to support them and query over time
> series (do things like moving average, etc.)
>
> These don't seem to be implemented in Spark SQL yet, but there's some
> discussion on JIRA (https://issues.apache.org/jira/browse/SPARK-3587)
> asking for them, and there have also been a couple of pull requests -
> https://github.com/apache/spark/pull/3703 and
> https://github.com/apache/spark/pull/2953.
>
> Is any work currently underway here?
>


Re: multi-line comment style

2015-02-09 Thread Reynold Xin
Why don't we just pick // as the default (by encouraging it in the style
guide), since it is the style mostly used, and then not disallow /* */? I don't
think it is that big of a deal to have slight deviations here, since it is
dead simple to understand what's going on.


On Mon, Feb 9, 2015 at 1:33 PM, Patrick Wendell  wrote:

> Clearly there isn't a strictly optimal commenting format (pro's and
> cons for both '//' and '/*'). My thought is for consistency we should
> just chose one and put in the style guide.
>
> On Mon, Feb 9, 2015 at 12:25 PM, Xiangrui Meng  wrote:
> > Btw, I think allowing `/* ... */` without the leading `*` in lines is
> > also useful. Check this line:
> >
> https://github.com/apache/spark/pull/4259/files#diff-e9dcb3b5f3de77fc31b3aff7831110eaR55
> ,
> > where we put the R commands that can reproduce the test result. It is
> > easier if we write in the following style:
> >
> > ~~~
> > /*
> >  Using the following R code to load the data and train the model using
> > glmnet package.
> >
> >  library("glmnet")
> >  data <- read.csv("path", header=FALSE, stringsAsFactors=FALSE)
> >  features <- as.matrix(data.frame(as.numeric(data$V2),
> as.numeric(data$V3)))
> >  label <- as.numeric(data$V1)
> >  weights <- coef(glmnet(features, label, family="gaussian", alpha = 0,
> > lambda = 0))
> >  */
> > ~~~
> >
> > So people can copy & paste the R commands directly.
> >
> > Xiangrui
> >
> > On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng  wrote:
> >> I like the `/* .. */` style more. Because it is easier for IDEs to
> >> recognize it as a block comment. If you press enter in the comment
> >> block with the `//` style, IDEs won't add `//` for you. -Xiangrui
> >>
> >> On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin 
> wrote:
> >>> We should update the style doc to reflect what we have in most places
> >>> (which I think is //).
> >>>
> >>>
> >>>
> >>> On Wed, Feb 4, 2015 at 2:09 PM, Shivaram Venkataraman <
> >>> shiva...@eecs.berkeley.edu> wrote:
> >>>
> >>>> FWIW I like the multi-line // over /* */ from a purely style
> standpoint.
> >>>> The Google Java style guide[1] has some comment about code formatting
> tools
> >>>> working better with /* */ but there doesn't seem to be any strong
> arguments
> >>>> for one over the other I can find
> >>>>
> >>>> Thanks
> >>>> Shivaram
> >>>>
> >>>> [1]
> >>>>
> >>>>
> https://google-styleguide.googlecode.com/svn/trunk/javaguide.html#s4.8.6.1-block-comment-style
> >>>>
> >>>> On Wed, Feb 4, 2015 at 2:05 PM, Patrick Wendell 
> >>>> wrote:
> >>>>
> >>>> > Personally I have no opinion, but agree it would be nice to
> standardize.
> >>>> >
> >>>> > - Patrick
> >>>> >
> >>>> > On Wed, Feb 4, 2015 at 1:58 PM, Sean Owen 
> wrote:
> >>>> > > One thing Marcelo pointed out to me is that the // style does not
> >>>> > > interfere with commenting out blocks of code with /* */, which is
> a
> >>>> > > small good thing. I am also accustomed to // style for multiline,
> and
> >>>> > > reserve /** */ for javadoc / scaladoc. Meaning, seeing the /* */
> style
> >>>> > > inline always looks a little funny to me.
> >>>> > >
> >>>> > > On Wed, Feb 4, 2015 at 3:53 PM, Kay Ousterhout <
> >>>> kayousterh...@gmail.com>
> >>>> > wrote:
> >>>> > >> Hi all,
> >>>> > >>
> >>>> > >> The Spark Style Guide
> >>>> > >> <
> >>>> >
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
> >>>> >
> >>>> > >> says multi-line comments should formatted as:
> >>>> > >>
> >>>> > >> /*
> >>>> > >>  * This is a
> >>>> > >>  * very
> >>>> > >>  * long comment.
> >>>> > >>  */
> >>>> > >>
> >>>> > >> But in my experience, we almost always use "//" for m

Re: renaming SchemaRDD -> DataFrame

2015-02-10 Thread Reynold Xin
Koert,

Don't get too hung up on the name SQL. This is exactly what you want: a
collection with record-like objects with field names and runtime types.

Almost all of the 40 methods are transformations for structured data, such
as aggregation on a field, or filtering on a field. If all you have is the
old RDD style map/flatMap, then any transformation would lose the schema
information, making the extra schema information useless.
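
A small sketch of the contrast (1.3-style API; `people` is a hypothetical
DataFrame with name and age fields):

// Structured transformations keep the schema attached, so later steps can still
// refer to fields by name and the optimizer can reason about them:
val adults = people.filter(people("age") > 21).select("name", "age")

// Dropping down to a plain RDD map loses the schema entirely:
val names = people.rdd.map(row => row.getString(0))   // just an RDD[String] now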




On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers  wrote:

> so i understand the success of spark.sql. besides the fact that anything
> with the words SQL in its name will have thousands of developers running
> towards it because of the familiarity, there is also a genuine need for a
> generic RDD that holds record-like objects, with field names and runtime
> types. after all that is a successful generic abstraction used in many
> structured data tools.
>
> but to me that abstraction is as simple as:
>
> trait SchemaRDD extends RDD[Row] {
>   def schema: StructType
> }
>
> and perhaps another abstraction to indicate it intends to be column
> oriented (with a few methods to efficiently extract a subset of columns).
> so that could be DataFrame.
>
> such simple contracts would allow many people to write loaders for this
> (say from csv) and whatnot.
>
> what i do not understand is why it has to be much more complex than this. but
> if i look at DataFrame it has so much additional stuff, that has (in my
> eyes) nothing to do with generic structured data analysis.
>
> for example to implement DataFrame i need to implement about 40 additional
> methods!? and for some the SQLness is obviously leaking into the
> abstraction. for example why would i care about:
>   def registerTempTable(tableName: String): Unit
>
>
> best, koert
>
> On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan  wrote:
>
> > It is true that you can persist SchemaRdds / DataFrames to disk via
> > Parquet, but a lot of time and inefficiencies is lost.   The in-memory
> > columnar cached representation is completely different from the
> > Parquet file format, and I believe there has to be a translation into
> > a Row (because ultimately Spark SQL traverses Row's -- even the
> > InMemoryColumnarTableScan has to then convert the columns into Rows
> > for row-based processing).   On the other hand, traditional data
> > frames process in a columnar fashion.   Columnar storage is good, but
> > nowhere near as good as columnar processing.
> >
> > Another issue, which I don't know if it is solved yet, but it is
> > difficult for Tachyon to efficiently cache Parquet files without
> > understanding the file format itself.
> >
> > I gave a talk at last year's Spark Summit on this topic.
> >
> > I'm working on efforts to change this, however.  Shoot me an email at
> > velvia at gmail if you're interested in joining forces.
> >
> > On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian 
> wrote:
> > > Yes, when a DataFrame is cached in memory, it's stored in an efficient
> > > columnar format. And you can also easily persist it on disk using
> > Parquet,
> > > which is also columnar.
> > >
> > > Cheng
> > >
> > >
> > > On 1/29/15 1:24 PM, Koert Kuipers wrote:
> > >>
> > >> to me the word DataFrame does come with certain expectations. one of
> > them
> > >> is that the data is stored columnar. in R data.frame internally uses a
> > >> list
> > >> of sequences i think, but since lists can have labels its more like a
> > >> SortedMap[String, Array[_]]. this makes certain operations very cheap
> > >> (such
> > >> as adding a column).
> > >>
> > >> in Spark the closest thing would be a data structure where per
> Partition
> > >> the data is also stored columnar. does spark SQL already use something
> > >> like
> > >> that? Evan mentioned "Spark SQL columnar compression", which sounds
> like
> > >> it. where can i find that?
> > >>
> > >> thanks
> > >>
> > >> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan 
> > >> wrote:
> > >>
> > >>> +1 having proper NA support is much cleaner than using null, at
> > >>> least the Java null.
> > >>>
> > >>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <
> evan.spa...@gmail.com
> > >
> > >>> wrote:
> > >>>>
> > >>>> You've got to be a little bit careful here. "NA" in systems like R

Re: renaming SchemaRDD -> DataFrame

2015-02-10 Thread Reynold Xin
It's a good point. I will update the documentation to say that this is not
meant to be subclassed externally.


On Tue, Feb 10, 2015 at 12:10 PM, Koert Kuipers  wrote:

> thanks matei its good to know i can create them like that
>
> reynold, yeah somehow the words sql gets me going :) sorry...
> yeah agreed that you need new transformations to preserve the schema info.
> i misunderstood and thought i had to implement the bunch but that is
> clearly not necessary as matei indicated.
>
> allright i am clearly being slow/dense here, but now it makes sense to
> me
>
>
>
>
>
>
> On Tue, Feb 10, 2015 at 2:58 PM, Reynold Xin  wrote:
>
> > Koert,
> >
> > Don't get too hang up on the name SQL. This is exactly what you want: a
> > collection with record-like objects with field names and runtime types.
> >
> > Almost all of the 40 methods are transformations for structured data,
> such
> > as aggregation on a field, or filtering on a field. If all you have is
> the
> > old RDD style map/flatMap, then any transformation would lose the schema
> > information, making the extra schema information useless.
> >
> >
> >
> >
> > On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers 
> wrote:
> >
> >> so i understand the success or spark.sql. besides the fact that anything
> >> with the words SQL in its name will have thousands of developers running
> >> towards it because of the familiarity, there is also a genuine need for
> a
> >> generic RDD that holds record-like objects, with field names and runtime
> >> types. after all that is a successfull generic abstraction used in many
> >> structured data tools.
> >>
> >> but to me that abstraction is as simple as:
> >>
> >> trait SchemaRDD extends RDD[Row] {
> >>   def schema: StructType
> >> }
> >>
> >> and perhaps another abstraction to indicate it intends to be column
> >> oriented (with a few methods to efficiently extract a subset of
> columns).
> >> so that could be DataFrame.
> >>
> >> such simple contracts would allow many people to write loaders for this
> >> (say from csv) and whatnot.
> >>
> >> what i do not understand why it has to be much more complex than this.
> but
> >> if i look at DataFrame it has so much additional stuff, that has (in my
> >> eyes) nothing to do with generic structured data analysis.
> >>
> >> for example to implement DataFrame i need to implement about 40
> additional
> >> methods!? and for some the SQLness is obviously leaking into the
> >> abstraction. for example why would i care about:
> >>   def registerTempTable(tableName: String): Unit
> >>
> >>
> >> best, koert
> >>
> >> On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan 
> >> wrote:
> >>
> >> > It is true that you can persist SchemaRdds / DataFrames to disk via
> >> > Parquet, but a lot of time and inefficiencies is lost.   The in-memory
> >> > columnar cached representation is completely different from the
> >> > Parquet file format, and I believe there has to be a translation into
> >> > a Row (because ultimately Spark SQL traverses Row's -- even the
> >> > InMemoryColumnarTableScan has to then convert the columns into Rows
> >> > for row-based processing).   On the other hand, traditional data
> >> > frames process in a columnar fashion.   Columnar storage is good, but
> >> > nowhere near as good as columnar processing.
> >> >
> >> > Another issue, which I don't know if it is solved yet, but it is
> >> > difficult for Tachyon to efficiently cache Parquet files without
> >> > understanding the file format itself.
> >> >
> >> > I gave a talk at last year's Spark Summit on this topic.
> >> >
> >> > I'm working on efforts to change this, however.  Shoot me an email at
> >> > velvia at gmail if you're interested in joining forces.
> >> >
> >> > On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian 
> >> wrote:
> >> > > Yes, when a DataFrame is cached in memory, it's stored in an
> efficient
> >> > > columnar format. And you can also easily persist it on disk using
> >> > Parquet,
> >> > > which is also columnar.
> >> > >
> >> > > Cheng
> >> > >
> >> > >
> >> > > On 1/29/15 1:24 PM, Koert Kuipers wrote:
> >> >

Re: Data source API | sizeInBytes should be to *Scan

2015-02-11 Thread Reynold Xin
Unfortunately this is not going to happen for 1.3 (as a snapshot release has
already been cut). We need to figure out how we are going to do cardinality
estimation before implementing this. If we do this in the future, I think we
can do it in a way that doesn't break existing APIs. Given that I think this
won't bring much benefit right now (the only use for it is broadcast joins), I
think it is ok to push this until later.

The issue I asked still stands. What should the optimizer do w.r.t. filters
that are pushed into the data source? Should it ignore those filters, or
apply statistics again?

This also depends on how we want to do statistics. Hive (and a lot of other
database systems) does a scan to figure out statistics, and puts all of
those statistics in a catalog. That is a more unified way to solve the
stats problem.

That said, in the world of federated databases, I can see why we might want
to push cardinality estimation to the data sources, since if the use case
is selecting a very small subset of the data from the sources, then it
might be hard for the statistics to be accurate in the catalog built from
data scan.
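
For context, the relation-level estimate is the only knob the data source API
exposes today; a source that can do better simply overrides it (a sketch with a
made-up relation and a hard-coded size):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class CountingRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("id", LongType)))
  // Relation-level estimate: this is all Catalyst sees today, independent of any
  // filters that may later be pushed down.
  override def sizeInBytes: Long = 64L * 1024 * 1024
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1L to 1000L).map(Row(_))
}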



On Wed, Feb 11, 2015 at 10:47 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> Circling back on this. Did you get a chance to re-look at this?
>
> Thanks,
> Aniket
>
> On Sun, Feb 8, 2015, 2:53 AM Aniket Bhatnagar 
> wrote:
>
>> Thanks for looking into this. If this true, isn't this an issue today?
>> The default implementation of sizeInBytes is 1 + broadcast threshold. So,
>> if catalyst's cardinality estimation estimates even a small filter
>> selectivity, it will result in broadcasting the relation. Therefore,
>> shouldn't the default be much higher than broadcast threshold?
>>
>> Also, since the default implementation of sizeInBytes already exists in
>> BaseRelation, I am not sure why the same/similar default implementation
>> can't be provided with in *Scan specific sizeInBytes functions and have
>> Catalyst always trust the size returned by DataSourceAPI (with default
>> implementation being to never broadcast). Another thing that could be done
>> is have sizeInBytes return Option[Long] so that Catalyst explicitly knows
>> when DataSource was able to optimize the size. The reason why I would push
>> for sizeInBytes in *Scan interfaces is because at times the data source
>> implementation can more accurately predict the size output. For example,
>> DataSource implementations for MongoDB, ElasticSearch, Cassandra, etc can
>> easy use filter push downs to query the underlying storage to predict the
>> size. Such predictions will be more accurate than Catalyst's prediction.
>> Therefore, if its not a fundamental change in Catalyst, I would think this
>> makes sense.
>>
>>
>> Thanks,
>> Aniket
>>
>>
>> On Sat, Feb 7, 2015, 4:50 AM Reynold Xin  wrote:
>>
>>> We thought about this today after seeing this email. I actually built a
>>> patch for this (adding filter/column to data source stat estimation), but
>>> ultimately dropped it due to the potential problems the change the cause.
>>>
>>> The main problem I see is that column pruning/predicate pushdowns are
>>> advisory, i.e. the data source might or might not apply those filters.
>>>
>>> Without significantly complicating the data source API, it is hard for
>>> the optimizer (and future cardinality estimation) to know whether the
>>> filter/column pushdowns are advisory, and whether to incorporate that in
>>> cardinality estimation.
>>>
>>> Imagine this scenario: a data source applies a filter and estimates the
>>> filter's selectivity is 0.1, then the data set is reduced to 10% of the
>>> size. Catalyst's own cardinality estimation estimates the filter
>>> selectivity to 0.1 again, and thus the estimated data size is now 1% of the
>>> original data size, lowering than some threshold. Catalyst decides to
>>> broadcast the table. The actual table size is actually 10x the size.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar <
>>> aniket.bhatna...@gmail.com> wrote:
>>>
>>>> Hi Spark SQL committers
>>>>
>>>> I have started experimenting with data sources API and I was wondering
>>>> if
>>>> it makes sense to move the method sizeInBytes from BaseRelation to Scan
>>>> interfaces. This is because that a relation may be able to leverage
>>>> filter
>>>> push down to estimate size potentially making a very large relation
>>>> broadcast-able. Thoughts?
>>>>
>>>> Aniket
>>>>
>>>
>>>


Re: How to track issues that must wait for Spark 2.x in JIRA?

2015-02-12 Thread Reynold Xin
It seems to me having a version that is 2+ is good for that? Once we move
to 2.0, we can retag those that are not going to be fixed in 2.0 as 2.0.1
or 2.1.0 .

On Thu, Feb 12, 2015 at 12:42 AM, Sean Owen  wrote:

> Patrick and I were chatting about how to handle several issues which
> clearly need a fix, and are easy, but can't be implemented until a
> next major release like Spark 2.x since it would change APIs.
> Examples:
>
> https://issues.apache.org/jira/browse/SPARK-3266
> https://issues.apache.org/jira/browse/SPARK-3369
> https://issues.apache.org/jira/browse/SPARK-4819
>
> We could simply make version 2.0.0 in JIRA. Although straightforward,
> it might imply that release planning has begun for 2.0.0.
>
> The version could be called "2+" for now to better indicate its status.
>
> There is also a "Later" JIRA resolution. Although resolving the above
> seems a little wrong, it might be reasonable if we're sure to revisit
> "Later", well, at some well defined later. The three issues above risk
> getting lost in the shuffle.
>
> We also wondered whether using "Later" is good or bad. It takes items
> off the radar that aren't going to be acted on anytime soon -- and
> there are lots of those right now. It might send a message that these
> will be revisited when they are even less likely to if resolved.
>
> Any opinions?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Reynold Xin
Evan articulated it well.


On Thu, Feb 12, 2015 at 9:29 AM, Evan R. Sparks 
wrote:

> Well, you can always join as many RDDs as you want by chaining them
> together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of
> RDDs in this way but 10 is probably doable.
>
> That said - SparkSQL has an optimizer under the covers that can make clever
> decisions e.g. pushing the predicates in the WHERE clause down to the base
> data (even to external data sources if you have them), ordering joins, and
> choosing between join implementations (like using broadcast joins instead
> of the default shuffle-based hash join in RDD.join). These decisions can
> make your queries run orders of magnitude faster than they would if you
> implemented them using basic RDD transformations. The best part is at this
> stage, I'd expect the optimizer will continue to improve - meaning many of
> your queries will get faster with each new release.
>
> I'm sure the SparkSQL devs can enumerate many other benefits - but as soon
> as you're working with multiple tables and doing fairly textbook SQL stuff
> - you likely want the engine figuring this stuff out for you rather than
> hand coding it yourself. That said - with Spark, you can always drop back
> to plain old RDDs and use map/reduce/filter/cogroup, etc. when you need to.
>
> On Thu, Feb 12, 2015 at 8:56 AM, vha14  wrote:
>
> > My team is building a batch data processing pipeline using Spark API and
> > trying to understand if Spark SQL can help us. Below are what we found so
> > far:
> >
> > - SQL's declarative style may be more readable in some cases (e.g.
> joining
> > of more than two RDDs), although some devs prefer the fluent style
> > regardless.
> > - Cogrouping of more than 4 RDDs is not supported and it's not clear if
> > Spark SQL supports joining of arbitrary number of RDDs.
> > - It seems that Spark SQL's features such as optimization based on
> > predicate
> > pushdown and dynamic schema inference are less applicable in a batch
> > environment.
> >
> > Your inputs/suggestions are most welcome!
> >
> > Thanks,
> > Vu Ha
> > CTO, Semantic Scholar
> > http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-value-proposition-in-batch-pipelines-tp10607.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Can you use the new aggregateNeighbors method? I suspect the null is coming
from "automatic join elimination", which inspects bytecode to see if you
need the src or dst vertex data. Occasionally it can fail to detect that. In the
new aggregateNeighbors API, the caller needs to explicitly specify that,
making it more robust.
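
For what it's worth, this is roughly what declaring the needed fields explicitly
looks like with the aggregateMessages / TripletFields form of the API in GraphX
1.2+ (the graph and the message type here are illustrative):

import org.apache.spark.graphx._

// Compute vertex degrees; TripletFields.None states that sendMsg touches no vertex
// attributes, so no src/dst attribute join is needed (and none can be mis-eliminated).
val degrees: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) },
  _ + _,
  TripletFields.None)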


On Thu, Feb 12, 2015 at 6:26 AM, James  wrote:

> Hello,
>
> When I am running the code on a much bigger size graph, I met
> NullPointerException.
>
> I found that is because the sendMessage() function receive a triplet that
> edge.srcAttr or edge.dstAttr is null. Thus I wonder why it will happen as I
> am sure every vertices have a attr.
>
> Any returns is appreciated.
>
> Alcaid
>
>
> 2015-02-11 19:30 GMT+08:00 James :
>
> > Hello,
> >
> > Recently  I am trying to estimate the average distance of a big graph
> > using spark with the help of [HyperAnf](
> > http://dl.acm.org/citation.cfm?id=1963493).
> >
> > It works like Connect Componenet algorithm, while the attribute of a
> > vertex is a HyperLogLog counter that at k-th iteration it estimates the
> > number of vertices it could reaches less than k hops.
> >
> > I have successfully run the code on a graph with 20M vertices. But I
> still
> > need help:
> >
> >
> > *I think the code could work more efficiently especially the "Send
> > message" function, but I am not sure about what will happen if a vertex
> > receive no message at a iteration.*
> >
> > Here is my code: https://github.com/alcaid1801/Erdos
> >
> > Any returns is appreciated.
> >
>


Re: Why a program would receive null from send message of mapReduceTriplets

2015-02-12 Thread Reynold Xin
Then maybe you actually had a null in your vertex attribute?


On Thu, Feb 12, 2015 at 10:47 PM, James  wrote:

> I changed the mapReduceTriplets() func to aggregateMessages(), but it
> still failed.
>
>
> 2015-02-13 6:52 GMT+08:00 Reynold Xin :
>
>> Can you use the new aggregateNeighbors method? I suspect the null is
>> coming from "automatic join elimination", which detects bytecode to see if
>> you need the src or dst vertex data. Occasionally it can fail to detect. In
>> the new aggregateNeighbors API, the caller needs to explicitly specifying
>> that, making it more robust.
>>
>>
>> On Thu, Feb 12, 2015 at 6:26 AM, James  wrote:
>>
>>> Hello,
>>>
>>> When I am running the code on a much bigger size graph, I met
>>> NullPointerException.
>>>
>>> I found that is because the sendMessage() function receive a triplet that
>>> edge.srcAttr or edge.dstAttr is null. Thus I wonder why it will happen
>>> as I
>>> am sure every vertices have a attr.
>>>
>>> Any returns is appreciated.
>>>
>>> Alcaid
>>>
>>>
>>> 2015-02-11 19:30 GMT+08:00 James :
>>>
>>> > Hello,
>>> >
>>> > Recently  I am trying to estimate the average distance of a big graph
>>> > using spark with the help of [HyperAnf](
>>> > http://dl.acm.org/citation.cfm?id=1963493).
>>> >
>>> > It works like Connect Componenet algorithm, while the attribute of a
>>> > vertex is a HyperLogLog counter that at k-th iteration it estimates the
>>> > number of vertices it could reaches less than k hops.
>>> >
>>> > I have successfully run the code on a graph with 20M vertices. But I
>>> still
>>> > need help:
>>> >
>>> >
>>> > *I think the code could work more efficiently especially the "Send
>>> > message" function, but I am not sure about what will happen if a vertex
>>> > receive no message at a iteration.*
>>> >
>>> > Here is my code: https://github.com/alcaid1801/Erdos
>>> >
>>> > Any returns is appreciated.
>>> >
>>>
>>
>>
>


Re: Spark & Hive

2015-02-15 Thread Reynold Xin
Spark SQL is not the same as Hive on Spark.

Spark SQL is a query engine that is designed from ground up for Spark
without the historic baggage of Hive. It also does more than SQL now -- it
is meant for structured data processing (e.g. the new DataFrame API) and
SQL. Spark SQL is mostly compatible with Hive, but 100% compatibility is
not a goal (nor desired, since Hive has a lot of weird SQL semantics in the
course of its evolution).

Hive on Spark is meant to replace Hive's MapReduce runtime with Spark's.

For more information, see this blog post:
https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
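
In code, the distinction is roughly this (a sketch; `sc` is an existing
SparkContext and the table name is made up):

// Spark SQL: Spark's own engine and planner, pointed at the Hive metastore.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT count(*) FROM my_table").collect()

// Hive on Spark, by contrast, is configured inside Hive itself (e.g.
//   set hive.execution.engine=spark;
// in the Hive CLI) and keeps Hive's planner and semantics, swapping only the
// execution backend from MapReduce to Spark.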



On Sun, Feb 15, 2015 at 3:03 AM, The Watcher  wrote:

> I'm a little confused around Hive & Spark, can someone shed some light ?
>
> Using Spark, I can access the Hive metastore and run Hive queries. Since I
> am able to do this in stand-alone mode, it can't be using map-reduce to run
> the Hive queries and I suppose it's building a query plan and executing it
> all in Spark.
>
> So, is this the same as
>
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
> ?
> If not, why not and aren't they likely to merge at some point ?
>
> If Spark really builds its own query plan, joins, etc without Hive's then
> is everything that requires special SQL syntax in Hive supported : window
> functions, cubes, rollups, skewed tables, etc
>
> Thanks
>


Re: Replacing Jetty with TomCat

2015-02-15 Thread Reynold Xin
Most likely no. We are using the embedded mode of Jetty, rather than using
servlets.

Even if it is possible, you probably wouldn't want to embed Spark in your
application server ...


On Sun, Feb 15, 2015 at 9:08 PM, Niranda Perera 
wrote:

> Hi,
>
> We are thinking of integrating Spark server inside a product. Our current
> product uses Tomcat as its webserver.
>
> Is it possible to switch the Jetty webserver in Spark to Tomcat
> off-the-shelf?
>
> Cheers
>
> --
> Niranda
>


Re: Replacing Jetty with TomCat

2015-02-15 Thread Reynold Xin
Mostly UI.

However, we are also using Jetty as a file server I believe (for
distributing files from the driver to workers).


On Sun, Feb 15, 2015 at 9:24 PM, Niranda Perera 
wrote:

> Hi Reynold,
>
> Thank you for the response. Could you please clarify the need of Jetty
> server inside Spark? Is it used for Spark core functionality or is it there
> for Spark jobs UI purposes?
>
> cheers
>
> On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin  wrote:
>
>> Most likely no. We are using the embedded mode of Jetty, rather than
>> using servlets.
>>
>> Even if it is possible, you probably wouldn't want to embed Spark in your
>> application server ...
>>
>>
>> On Sun, Feb 15, 2015 at 9:08 PM, Niranda Perera > > wrote:
>>
>>> Hi,
>>>
>>> We are thinking of integrating Spark server inside a product. Our current
>>> product uses Tomcat as its webserver.
>>>
>>> Is it possible to switch the Jetty webserver in Spark to Tomcat
>>> off-the-shelf?
>>>
>>> Cheers
>>>
>>> --
>>> Niranda
>>>
>>
>>
>
>
> --
> Niranda
>


Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
Michael - it is already transient. This should probably be considered a bug in
the Scala compiler, but we can easily work around it by removing the use of
the destructuring binding.
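
A sketch of what removing the destructuring binding looks like (illustrative,
not the exact patch): two separate transient lazy vals keep scalac from
synthesizing the non-transient Tuple2 field that holds the HiveConf.

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.session.SessionState

// Inside HiveContext, instead of `lazy val (hiveconf, sessionState) = ...`:
@transient protected[hive] lazy val sessionState: SessionState =
  Option(SessionState.get()).getOrElse(new SessionState(new HiveConf(classOf[SessionState])))

@transient protected[hive] lazy val hiveconf: HiveConf = sessionState.getConf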

On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust 
wrote:

> I'd suggest marking the HiveContext as @transient since its not valid to
> use it on the slaves anyway.
>
> On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang  wrote:
>
> > When I'm investigating this issue (in the end of this email), I take a
> > look at HiveContext's code and find this change
> > (https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
> > da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
> >
> >
> >
> > -  @transient protected[hive] lazy val hiveconf = new
> > HiveConf(classOf[SessionState])
> >
> > -  @transient protected[hive] lazy val sessionState = {
> >
> > -val ss = new SessionState(hiveconf)
> >
> > -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
> > initial set of HiveConf.
> >
> > -ss
> >
> > -  }
> >
> > +  @transient protected[hive] lazy val (hiveconf, sessionState) =
> >
> > +Option(SessionState.get())
> >
> > +  .orElse {
> >
> >
> >
> > With the new change, Scala compiler always generate a Tuple2 field of
> > HiveContext as below:
> >
> >
> >
> > private Tuple2 x$3;
> >
> > private transient OutputStream outputBuffer;
> >
> > private transient HiveConf hiveconf;
> >
> > private transient SessionState sessionState;
> >
> > private transient HiveMetastoreCatalog catalog;
> >
> >
> >
> > That "x$3" field's key is HiveConf object that cannot be serialized. So
> > can you suggest how to resolve this issue? Thank you very much!
> >
> >
> >
> > 
> >
> >
> >
> > I have a streaming application which registered temp table on a
> > HiveContext for each batch duration.
> >
> > The application runs well in Spark 1.1.0. But I get below error from
> > 1.1.1.
> >
> > Do you have any suggestions to resolve it? Thank you!
> >
> >
> >
> > java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
> >
> > - field (class "scala.Tuple2", name: "_1", type: "class
> > java.lang.Object")
> >
> > - object (class "scala.Tuple2", (Configuration: core-default.xml,
> > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
> > yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
> > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23,org.apa
> > che.hadoop.hive.ql.session.SessionState@49b6eef9))
> >
> > - field (class "org.apache.spark.sql.hive.HiveContext", name: "x$3",
> > type: "class scala.Tuple2")
> >
> > - object (class "org.apache.spark.sql.hive.HiveContext",
> > org.apache.spark.sql.hive.HiveContext@4e6e66a4)
> >
> > - field (class
> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
> > "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
> >
> >- object (class
> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2",
> > )
> >
> > - field (class
> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> > name: "foreachFunc$1", type: "interface scala.Function1")
> >
> > - object (class
> > "org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1",
> > )
> >
> > - field (class "org.apache.spark.streaming.dstream.ForEachDStream",
> > name: "org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc",
> > type: "interface scala.Function2")
> >
> > - object (class "org.apache.spark.streaming.dstream.ForEachDStream",
> > org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20)
> >
> > - element of array (index: 0)
> >
> > - array (class "[Ljava.lang.Object;", size: 16)
> >
> > - field (class "scala.collection.mutable.ArrayBuffer", name:
> > "array", type: "class [Ljava.lang.Object;")
> >
> > - object (class "scala.collection.mutable.ArrayBuffer",
> > ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@5ccbdc20))
> >
> > - field (class "org.apache.spark.streaming.DStreamGraph", name:
> > "outputStreams", type: "class scala.collection.mutable.ArrayBuffer")
> >
> > - custom writeObject data (class
> > "org.apache.spark.streaming.DStreamGraph")
> >
> > - object (class "org.apache.spark.streaming.DStreamGraph",
> > org.apache.spark.streaming.DStreamGraph@776ae7da)
> >
> > - field (class "org.apache.spark.streaming.Checkpoint", name:
> > "graph", type: "class org.apache.spark.streaming.DStreamGraph")
> >
> > - root object (class "org.apache.spark.streaming.Checkpoint",
> > org.apache.spark.streaming.Checkpoint@5eade065)
> >
> > at java.io.ObjectOutputStream.writeObject0(Unknown Source)
> >
> >
> >
> >
> >
> >
> >
> >
>


Re: HiveContext cannot be serialized

2015-02-16 Thread Reynold Xin
I submitted a patch

https://github.com/apache/spark/pull/4628

On Mon, Feb 16, 2015 at 10:59 AM, Michael Armbrust 
wrote:

> I was suggesting you mark the variable that is holding the HiveContext
> '@transient' since the scala compiler is not correctly propagating this
> through the tuple extraction.  This is only a workaround.  We can also
> remove the tuple extraction.
>
> On Mon, Feb 16, 2015 at 10:47 AM, Reynold Xin  wrote:
>
>> Michael - it is already transient. This should probably be considered a bug
>> in the Scala compiler, but we can easily work around it by removing the use
>> of destructuring binding.
>>
>> On Mon, Feb 16, 2015 at 10:41 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> I'd suggest marking the HiveContext as @transient since its not valid to
>>> use it on the slaves anyway.
>>>
>>> On Mon, Feb 16, 2015 at 4:27 AM, Haopu Wang  wrote:
>>>
>>> > When I'm investigating this issue (in the end of this email), I take a
>>> > look at HiveContext's code and find this change
>>> > (
>>> https://github.com/apache/spark/commit/64945f868443fbc59cb34b34c16d782d
>>> > da0fb63d#diff-ff50aea397a607b79df9bec6f2a841db):
>>> >
>>> >
>>> >
>>> > -  @transient protected[hive] lazy val hiveconf = new
>>> > HiveConf(classOf[SessionState])
>>> >
>>> > -  @transient protected[hive] lazy val sessionState = {
>>> >
>>> > -val ss = new SessionState(hiveconf)
>>> >
>>> > -setConf(hiveconf.getAllProperties)  // Have SQLConf pick up the
>>> > initial set of HiveConf.
>>> >
>>> > -ss
>>> >
>>> > -  }
>>> >
>>> > +  @transient protected[hive] lazy val (hiveconf, sessionState) =
>>> >
>>> > +Option(SessionState.get())
>>> >
>>> > +  .orElse {
>>> >
>>> >
>>> >
>>> > With the new change, Scala compiler always generate a Tuple2 field of
>>> > HiveContext as below:
>>> >
>>> >
>>> >
>>> > private Tuple2 x$3;
>>> >
>>> > private transient OutputStream outputBuffer;
>>> >
>>> > private transient HiveConf hiveconf;
>>> >
>>> > private transient SessionState sessionState;
>>> >
>>> > private transient HiveMetastoreCatalog catalog;
>>> >
>>> >
>>> >
>>> > That "x$3" field's key is HiveConf object that cannot be serialized. So
>>> > can you suggest how to resolve this issue? Thank you very much!
>>> >
>>> >
>>> >
>>> > 
>>> >
>>> >
>>> >
>>> > I have a streaming application which registered temp table on a
>>> > HiveContext for each batch duration.
>>> >
>>> > The application runs well in Spark 1.1.0. But I get below error from
>>> > 1.1.1.
>>> >
>>> > Do you have any suggestions to resolve it? Thank you!
>>> >
>>> >
>>> >
>>> > java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf
>>> >
>>> > - field (class "scala.Tuple2", name: "_1", type: "class
>>> > java.lang.Object")
>>> >
>>> > - object (class "scala.Tuple2", (Configuration: core-default.xml,
>>> > core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml,
>>> > yarn-site.xml, hdfs-default.xml, hdfs-site.xml,
>>> > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@2158ce23
>>> ,org.apa
>>> > che.hadoop.hive.ql.session.SessionState@49b6eef9))
>>> >
>>> > - field (class "org.apache.spark.sql.hive.HiveContext", name:
>>> "x$3",
>>> > type: "class scala.Tuple2")
>>> >
>>> > - object (class "org.apache.spark.sql.hive.HiveContext",
>>> > org.apache.spark.sql.hive.HiveContext@4e6e66a4)
>>> >
>>> > - field (class
>>> > "example.BaseQueryableDStream$$anonfun$registerTempTable$2", name:
>>> > "sqlContext$1", type: "class org.apache.spark.sql.SQLContext")
>>> >
>>> >- object (class
>>> > "

Re: JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-18 Thread Reynold Xin
Yes, that's a bug and should be using the standard serializer.
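
For context, a self-contained sketch (Spark 1.x API; the app name and values
are illustrative) of the pattern under discussion: the zero value handed to
aggregate() is cloned and shipped to every task, which is why the serializer
used for that clone matters once spark.serializer is set to Kryo.

import org.apache.spark.{SparkConf, SparkContext}

object AggregateZeroValueExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("aggregate-zero-value")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100, 4)
    // The (sum, count) zero value below is the piece that gets cloned per task.
    val (sum, count) = rdd.aggregate((0L, 0L))(
      (acc, x) => (acc._1 + x, acc._2 + 1),  // fold an element into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))  // merge two accumulators
    println(s"mean = ${sum.toDouble / count}")
    sc.stop()
  }
}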

On Wed, Feb 18, 2015 at 2:58 PM, Sean Owen  wrote:

> That looks, at the least, inconsistent. As far as I know this should
> be changed so that the zero value is always cloned via the non-closure
> serializer. Any objection to that?
>
> On Wed, Feb 18, 2015 at 10:28 PM, Matt Cheah  wrote:
> > But RDD.aggregate() has this code:
> >
> > // Clone the zero value since we will also be serializing it as part
> of
> > tasks
> > var jobResult = Utils.clone(zeroValue,
> > sc.env.closureSerializer.newInstance())
> >
> > I do see the SparkEnv.get.serializer used in aggregateByKey however.
> Perhaps
> > we just missed it and need to apply the change to aggregate()? It seems
> > appropriate to target a fix for 1.3.0.
> >
> > -Matt Cheah
> > From: Josh Rosen 
> > Date: Wednesday, February 18, 2015 at 6:12 AM
> > To: Matt Cheah 
> > Cc: "dev@spark.apache.org" , Mingyu Kim
> > , Andrew Ash 
> > Subject: Re: JavaRDD Aggregate initial value - Closure-serialized zero
> value
> > reasoning?
> >
> > It looks like this was fixed in
> > https://issues.apache.org/jira/browse/SPARK-4743 /
> > https://github.com/apache/spark/pull/3605.  Can you see whether that
> patch
> > fixes this issue for you?
> >
> >
> >
> > On Tue, Feb 17, 2015 at 8:31 PM, Matt Cheah  wrote:
> >>
> >> Hi everyone,
> >>
> >> I was using JavaPairRDD’s combineByKey() to compute all of my
> aggregations
> >> before, since I assumed that every aggregation required a key. However,
> I
> >> realized I could do my analysis using JavaRDD’s aggregate() instead and
> not
> >> use a key.
> >>
> >> I have set spark.serializer to use Kryo. As a result, JavaRDD’s
> >> combineByKey requires that a “createCombiner” function is provided, and
> the
> >> return value from that function must be serializable using Kryo. When I
> >> switched to using rdd.aggregate I assumed that the zero value would
> also be
> >> strictly Kryo serialized, as it is a data item and not part of a
> closure or
> >> the aggregation functions. However, I got a serialization exception as
> the
> >> closure serializer (only valid serializer is the Java serializer) was
> used
> >> instead.
> >>
> >> I was wondering the following:
> >>
> >> What is the rationale for making the zero value be serialized using the
> >> closure serializer? This isn’t part of the closure, but is an initial
> data
> >> item.
> >> Would it make sense for us to perhaps write a version of rdd.aggregate()
> >> that takes a function as a parameter, that generates the zero value?
> This
> >> would be more intuitive to be serialized using the closure serializer.
> >>
> >> I believe aggregateByKey is also affected.
> >>
> >> Thanks,
> >>
> >> -Matt Cheah
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all,

The Hadoop Summit uses community choice voting to decide which talks to
feature. It would be great if the community could help vote for Spark talks
so that Spark has a good showing at this event. You can make three votes on
each track. Below I've listed 3 talks that are important to Spark's
roadmap. Please give 3 votes to each of the following talks.

Committer Track: Lessons from Running Ultra Large Scale Spark Workloads on
Hadoop
https://hadoopsummit.uservoice.com/forums/283260-committer-track/suggestions/7074016

Data Science track: DataFrames: large-scale data science on Hadoop data
with Spark
https://hadoopsummit.uservoice.com/forums/283261-data-science-and-hadoop/suggestions/7074147

Future of Hadoop track: Online Approximate OLAP in SparkSQL
https://hadoopsummit.uservoice.com/forums/283266-the-future-of-apache-hadoop/suggestions/7074424


Thanks!


Re: Some praise and comments on Spark

2015-02-25 Thread Reynold Xin
Thanks for the email and encouragement, Devl. Responses to the 3 requests:

-tonnes of configuration properties and "go faster" type flags. For example
Hadoop and Hbase users will know that there are a whole catalogue of
properties for regions, caches, network properties, block sizes, etc etc.
Please don't end up here for example:
https://hadoop.apache.org/docs/r1.0.4/mapred-default.html, it is painful
having to configure all of this and then create a set of properties for
each environment and then tie this into CI and deployment tools.

As the project grows, it is unavoidable to introduce more config options;
in particular, we often use config options to test new modules that are
still experimental before making them the default (e.g. sort-based shuffle).

The philosophy here is to set a very high bar for introducing new config
options, to make the default values sensible for most deployments, and
then, whenever possible, to figure out the right setting automatically.
Note that the latter is hard in general, but we expect that 99% of users
will only need to know a very small number of options (e.g. setting the
serializer).
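
For instance (a hedged sketch; the app name and values are illustrative), the
handful of options a typical application touches can stay this small:

import org.apache.spark.SparkConf

// Everything not set here keeps its default; only a few knobs are commonly tuned.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.executor.memory", "4g")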


-no more daemons and processes to have to monitor and manipulate and
restart and crash.

At the very least you'd need the cluster manager itself to be a daemon
process, because we can't defy the laws of physics. But I don't think we want
to introduce anything beyond that.


-a project that penalises developers (that will ultimately help promote
Spark to their managers and budget holders) with expensive training,
certification, books and accreditation. Ideally this open source should be
free, free training= more users = more commercial uptake.

I definitely agree with you on making it easier to learn Spark. We are
making a lot of materials freely available, including two freely available
MOOCs on edX:
https://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html



On Wed, Feb 25, 2015 at 2:13 PM, Devl Devel 
wrote:

> Hi Spark Developers,
>
> First, apologies if this doesn't belong on this list but the
> comments/praise are relevant to all developers. This is just a small note
> about what we really like about Spark, I/we don't mean to start a whole
> long discussion thread in this forum but just share our positive
> experiences with Spark thus far.
>
> To start, as you can tell, we think that the Spark project is amazing and
> we love it! Having put in nearly half a decade worth of sweat and tears
> into production Hadoop, MapReduce clusters and application development it's
> so refreshing to see something arguably simpler and more elegant to
> supersede it.
>
> These are the things we love about Spark and hope these principles
> continue:
>
> -the one command build; make-distribution.sh. Simple, clean  and ideal for
> deployment and devops and rebuilding on different environments and nodes.
> -not having too much runtime and deploy config; as admins and developers we
> are sick of setting props like io.sort and mapred.job.shuffle.merge.percent
> and dfs file locations and temp directories and so on and on again and
> again every time we deploy a job, new cluster, environment or even change
> company.
> -a fully built-in stack, one global project for SQL, dataframes, MLlib etc,
> so there is no need to add on projects to it on as per Hive, Hue, Hbase
> etc. This helps life and keeps everything in one place.
> -single (global) user based operation - no creation of a hdfs mapred unix
> user, makes life much simpler
> -single quick-start daemons; master and slaves. Not having to worry about
> JT, NN, DN , TT, RM, Hbase master ... and doing netstat and jps on hundreds
> of clusters makes life much easier.
> -proper code versioning, feature releases and release management.
> - good & well organised documentation with good examples.
>
> In addition to the comments above this is where we hope Spark never ends
> up:
>
> -tonnes of configuration properties and "go faster" type flags. For example
> Hadoop and Hbase users will know that there are a whole catalogue of
> properties for regions, caches, network properties, block sizes, etc etc.
> Please don't end up here for example:
> https://hadoop.apache.org/docs/r1.0.4/mapred-default.html, it is painful
> having to configure all of this and then create a set of properties for
> each environment and then tie this into CI and deployment tools.
> -no more daemons and processes to have to monitor and manipulate and
> restart and crash.
> -a project that penalises developers (that will ultimately help promote
> Spark to their managers and budget holders) with expensive training,
> certification, books and accreditation. Ideally this open source should be
> free, free training= more users = more commercial uptake.
>
> Anyway, those are our thoughts for what they are worth, keep up the good
> work, we just had to mention it. Again sorry if this is not the right place
> or if there is another forum for this stuff.
>
> Cheers
>


over 10000 commits!

2015-03-05 Thread Reynold Xin
We reached a new milestone today.

https://github.com/apache/spark


10,001 commits now. Congratulations to Xiangrui for making the 10,000th
commit!


Re: SparkSpark-perf terasort WIP branch

2015-03-10 Thread Reynold Xin
Hi Ewan,

Sorry it took a while for us to reply. I don't know spark-perf that well,
but I think this would be problematic if it works with only a specific
version of Hadoop. Maybe we can take a different approach -- just have a
bunch of tasks using the HDFS client API to read data, and not relying on
input formats?


On Fri, Mar 6, 2015 at 1:41 AM, Ewan Higgs  wrote:

> Hi all,
> I never heard from anyone on this and have received emails in private that
> people would like to add terasort to their spark-perf installs so it
> becomes part of their cluster validation checks.
>
> Yours,
> Ewan
>
>
>  Forwarded Message 
> Subject:SparkSpark-perf terasort WIP branch
> Date:   Wed, 14 Jan 2015 14:33:45 +0100
> From:   Ewan Higgs 
> To: dev@spark.apache.org 
>
>
>
> Hi all,
> I'm trying to build the Spark-perf WIP code but there are some errors to
> do with Hadoop APIs. I presume this is because there is some Hadoop
> version set and it's referring to that. But I can't seem to find it.
>
> The errors are as follows:
>
> [info] Compiling 15 Scala sources and 2 Java sources to
> /home/ehiggs/src/spark-perf/spark-tests/target/scala-2.10/classes...
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraInputFormat.scala:40:
> object task is not a member of package org.apache.hadoop.mapreduce
> [error] import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraInputFormat.scala:132:
> not found: type TaskAttemptContextImpl
> [error] val context = new TaskAttemptContextImpl(
> [error]   ^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraScheduler.scala:37:
> object TTConfig is not a member of package
> org.apache.hadoop.mapreduce.server.tasktracker
> [error] import org.apache.hadoop.mapreduce.server.tasktracker.TTConfig
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraScheduler.scala:91:
> not found: value TTConfig
> [error]   var slotsPerHost : Int = conf.getInt(TTConfig.TT_MAP_SLOTS, 4)
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraSortAll.scala:7:
> value run is not a member of org.apache.spark.examples.terasort.TeraGen
> [error] tg.run(Array[String]("10M", "/tmp/terasort_in"))
> [error]^
> [error]
> /home/ehiggs/src/spark-perf/spark-tests/src/main/scala/
> spark/perf/terasort/TeraSortAll.scala:9:
> value run is not a member of org.apache.spark.examples.terasort.TeraSort
> [error] ts.run(Array[String]("/tmp/terasort_in", "/tmp/terasort_out"))
> [error]^
> [error] 6 errors found
> [error] (compile:compile) Compilation failed
> [error] Total time: 13 s, completed 05-Jan-2015 12:21:47
>
> I can build the same code if it's in the Spark tree using the following
> command:
> mvn -Dhadoop.version=2.5.0 -DskipTests=true install
>
> Is there a way I can convince spark-perf to build this code with the
> appropriate Hadoop library version? I tried to apply the following to
> spark-tests/project/SparkTestsBuild.scala but it didn't seem to work as
> I expected:
>
> $ git diff project/SparkTestsBuild.scala
> diff --git a/spark-tests/project/SparkTestsBuild.scala
> b/spark-tests/project/SparkTestsBuild.scala
> index 4116326..4ed5f0c 100644
> --- a/spark-tests/project/SparkTestsBuild.scala
> +++ b/spark-tests/project/SparkTestsBuild.scala
> @@ -16,7 +16,9 @@ object SparkTestsBuild extends Build {
>   "org.scalatest" %% "scalatest" % "2.2.1" % "test",
>   "com.google.guava" % "guava" % "14.0.1",
>   "org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
> -"org.json4s" %% "json4s-native" % "3.2.9"
> +"org.json4s" %% "json4s-native" % "3.2.9",
> +"org.apache.hadoop" % "hadoop-common" % "2.5.0",
> +"org.apache.hadoop" % "hadoop-mapreduce" % "2.5.0"
> ),
> test in assembly := {},
> outputPath in assembly :=
> file("target/spark-perf-tests-assembly.jar"),
> @@ -36,4 +38,4 @@ object SparkTestsBuild extends Build {
>   case _ => MergeStrategy.first
> }
>   ))
> -}
> \ No newline at end of file
> +}
>
>
> Yours,
> Ewan
>
>
>
>


Re: Using CUDA within Spark / boosting linear algebra

2015-03-12 Thread Reynold Xin
Thanks for chiming in, John. I missed your meetup last night - do you have
any writeups or slides about roofline design? In particular, I'm curious
about what optimizations are available for power-law dense * sparse? (I
don't have any background in optimizations)



On Thu, Mar 12, 2015 at 8:50 PM, jfcanny  wrote:

> If you're contemplating GPU acceleration in Spark, its important to look
> beyond BLAS. Dense BLAS probably account for only 10% of the cycles in the
> datasets we've tested in BIDMach, and we've tried to make them
> representative of industry machine learning workloads. Unless you're
> crunching images or audio, the majority of data will be very sparse and
> power law distributed. You need a good sparse BLAS, and in practice it
> seems
> like you need a sparse BLAS tailored for power-law data. We had to write
> our
> own since the NVIDIA libraries didnt perform well on typical power-law
> data.
> Intel MKL sparse BLAS also have issues and we only use some of them.
>
> You also need 2D reductions, scan operations, slicing, element-wise
> transcendental functions and operators, many kinds of sort, random number
> generators etc, and some kind of memory management strategy. Some of this
> was layered on top of Thrust in BIDMat, but most had to be written from
> scratch. Its all been rooflined, typically to memory throughput of current
> GPUs (around 200 GB/s).
>
> When you have all this you can write Learning Algorithms in the same
> high-level primitives available in Breeze or Numpy/Scipy. Its literally the
> same in BIDMat, since the generic matrix operations are implemented on both
> CPU and GPU, so the same code runs on either platform.
>
> A lesser known fact is that GPUs are around 10x faster for *all* those
> operations, not just dense BLAS. Its mostly due to faster streaming memory
> speeds, but some kernels (random number generation and transcendentals) are
> more than an order of magnitude thanks to some specialized hardware for
> power series on the GPU chip.
>
> When you have all this there is no need to move data back and forth across
> the PCI bus. The CPU only has to pull chunks of data off disk, unpack them,
> and feed them to the available GPUs. Most models fit comfortably in GPU
> memory these days (4-12 GB). With minibatch algorithms you can push TBs of
> data through the GPU this way.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Using-CUDA-within-Spark-boosting-linear-algebra-tp10481p11021.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spilling when not expected

2015-03-13 Thread Reynold Xin
How did you run the Spark command? Maybe the memory setting didn't actually
apply? How much memory does the web ui say is available?

BTW - I don't think any JVM can actually handle 700G heap ... (maybe Zing).

On Thu, Mar 12, 2015 at 4:09 PM, Tom Hubregtsen 
wrote:

> Hi all,
>
> I'm running the teraSort benchmark with a relatively small input set: 5GB.
> During profiling, I can see I am using a total of 68GB. I've got a terabyte
> of memory in my system, and set
> spark.executor.memory 900g
> spark.driver.memory 900g
> I use the default for
> spark.shuffle.memoryFraction
> spark.storage.memoryFraction
> I believe that I now have 0.2*900=180GB for shuffle and 0.6*900=540GB for
> storage.
>
> I noticed a lot of variation in runtime (under the same load), and tracked
> this down to this function in
> core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala
>   private def spillToPartitionFiles(collection:
> SizeTrackingPairCollection[(Int, K), C]): Unit = {
> spillToPartitionFiles(collection.iterator)
>   }
> In a slow run, it would loop through this function 12000 times, in a fast
> run only 700 times, even though the settings in both runs are the same and
> there are no other users on the system. When I look at the function calling
> this (insertAll, also in ExternalSorter), I see that spillToPartitionFiles
> is only called 700 times in both fast and slow runs, meaning that the
> function recursively calls itself very often. Because of the function name,
> I assume the system is spilling to disk. As I have sufficient memory, I
> assume that I forgot to set a certain memory setting. Anybody any idea
> which
> other setting I have to set, in order to not spill data in this scenario?
>
> Thanks,
>
> Tom
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spilling-when-not-expected-tp11017.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark config option 'expression language' feedback request

2015-03-13 Thread Reynold Xin
This is an interesting idea.

Are there well-known libraries for doing this? Config is the one place
where it would be great to have something ridiculously simple, so that it is
more or less bug free. I'm concerned about the complexity in this patch and
the subtle bugs it might introduce into config options, for which users would
have no workaround. Also, I believe it is fairly hard to propagate nice error
messages when using Scala's parser combinators.
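
To make the discussion concrete, a toy sketch (not the PR's implementation; the
attribute names are illustrative) of a parser-combinator grammar for expressions
like "0.75 * physicalMemoryBytes":

import scala.util.parsing.combinator.JavaTokenParsers

object ConfigExpr extends JavaTokenParsers {
  // System attributes exposed to expressions (illustrative subset).
  private val attrs = Map[String, Double](
    "numCores"          -> Runtime.getRuntime.availableProcessors.toDouble,
    "jvmMaxMemoryBytes" -> Runtime.getRuntime.maxMemory.toDouble)

  def expr: Parser[Double] = term ~ rep(("+" | "-") ~ term) ^^ {
    case t ~ ops => ops.foldLeft(t) {
      case (acc, "+" ~ x) => acc + x
      case (acc, "-" ~ x) => acc - x
    }
  }
  def term: Parser[Double] = factor ~ rep(("*" | "/") ~ factor) ^^ {
    case f ~ ops => ops.foldLeft(f) {
      case (acc, "*" ~ x) => acc * x
      case (acc, "/" ~ x) => acc / x
    }
  }
  def factor: Parser[Double] =
    floatingPointNumber ^^ (_.toDouble) |
    ident ^^ (name => attrs.getOrElse(name, sys.error("unknown attribute: " + name))) |
    "(" ~> expr <~ ")"

  // On a parse failure, .get surfaces a fairly low-level combinator message,
  // which is part of the error-reporting concern mentioned above.
  def eval(s: String): Double = parseAll(expr, s).get
}
// e.g. ConfigExpr.eval("numCores - 1")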


On Fri, Mar 13, 2015 at 3:07 AM, Dale Richardson 
wrote:

>
> PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature to
> allow for Spark configuration options (whether on command line, environment
> variable or a configuration file) to be specified via a simple expression
> language.
>
>
> Such a feature has the following end-user benefits:
> - Allows for flexibility in specifying time intervals or byte
> quantities in appropriate and easy-to-follow units, e.g. 1 week rather
> than 604800 seconds
>
> - Allows for the scaling of a configuration option in relation to a system
> attributes. e.g.
>
> SPARK_WORKER_CORES = numCores - 1
>
> SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
>
> - Gives the ability to scale multiple configuration options together eg:
>
> spark.driver.memory = 0.75 * physicalMemoryBytes
>
> spark.driver.maxResultSize = spark.driver.memory * 0.8
>
>
> The following functions are currently supported by this PR:
> NumCores: Number of cores assigned to the JVM (usually ==
> Physical machine cores)
> PhysicalMemoryBytes:  Memory size of hosting machine
>
> JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
>
> JVMMaxMemoryBytes:Maximum number of bytes of memory available to the
> JVM
>
> JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
>
>
> I was wondering if anybody on the mailing list has any further ideas on
> other functions that could be useful to have when specifying spark
> configuration options?
> Regards, Dale.
>


Re: broadcast hang out

2015-03-16 Thread Reynold Xin
It would be great to add a timeout. Do you mind submitting a pull request?
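
For illustration, a hedged sketch of bounding such a wait instead of using
Duration.Inf (the default timeout value here is illustrative, not an existing
Spark setting):

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Bounded wait: throws java.util.concurrent.TimeoutException instead of hanging forever.
def awaitResult[T](p: Promise[T], timeout: FiniteDuration = 120.seconds): T =
  Await.result(p.future, timeout)
// e.g. in fetchBlockSync below: awaitResult(result, timeout) rather than
// Await.result(result.future, Duration.Inf)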


On Sun, Mar 15, 2015 at 10:41 PM, lonely Feb  wrote:

> Anyone can help? Thanks a lot !
>
> 2015-03-16 11:45 GMT+08:00 lonely Feb :
>
> > yes
> >
> > 2015-03-16 11:43 GMT+08:00 Mridul Muralidharan :
> >
> >> Cross region as in different data centers ?
> >>
> >> - Mridul
> >>
> >> On Sun, Mar 15, 2015 at 8:08 PM, lonely Feb 
> wrote:
> >> > Hi all, i meet up with a problem that torrent broadcast hang out in my
> >> > spark cluster (1.2, standalone) , particularly serious when driver and
> >> > executors are cross-region. when i read the code of broadcast i found
> >> that
> >> > a sync block read here:
> >> >
> >> >   def fetchBlockSync(host: String, port: Int, execId: String, blockId:
> >> > String): ManagedBuffer = {
> >> > // A monitor for the thread to wait on.
> >> > val result = Promise[ManagedBuffer]()
> >> > fetchBlocks(host, port, execId, Array(blockId),
> >> >   new BlockFetchingListener {
> >> > override def onBlockFetchFailure(blockId: String, exception:
> >> > Throwable): Unit = {
> >> >   result.failure(exception)
> >> > }
> >> > override def onBlockFetchSuccess(blockId: String, data:
> >> > ManagedBuffer): Unit = {
> >> >   val ret = ByteBuffer.allocate(data.size.toInt)
> >> >   ret.put(data.nioByteBuffer())
> >> >   ret.flip()
> >> >   result.success(new NioManagedBuffer(ret))
> >> > }
> >> >   })
> >> >
> >> > Await.result(result.future, Duration.Inf)
> >> >   }
> >> >
> >> > it seems that fetchBlockSync method does not have a timeout limit but
> >> wait
> >> > forever ? Anybody can show me how to control the timeout here?
> >>
> >
> >
>


Re: Spilling when not expected

2015-03-17 Thread Reynold Xin
Tom - sorry for the delay. If you try OpenJDK (on a smaller heap), do you
see the same problem? Would be great to isolate whether the problem is
related to J9 or not. In either case we should fix it though.

On Fri, Mar 13, 2015 at 9:33 AM, Tom Hubregtsen 
wrote:

> I use the spark-submit script and the config files in a conf directory. I
> see the memory settings reflected in the stdout, as well as in the webUI.
> (it prints all variables from spark-default.conf, and metions I have 540GB
> free memory available when trying to store a broadcast variable or RDD). I
> also run "ps -aux | grep java | grep th", which show me that I called java
> with "-Xms1000g -Xmx1000g"
>
> I also tested if these numbers are realistic for the J9 JVM. Outside of
> Spark, when setting just the initial heapsize (Xms), it gives an error, but
> if I also define the maximum option with it (Xmx), it seems to us that it
> is accepting it. Also, in IBM's J9 health center, I see it reserve the
> 900g, and use up to 68g.
>
> Thanks,
>
> Tom
>
> On 13 March 2015 at 02:05, Reynold Xin  wrote:
>
>> How did you run the Spark command? Maybe the memory setting didn't
>> actually apply? How much memory does the web ui say is available?
>>
>> BTW - I don't think any JVM can actually handle 700G heap ... (maybe
>> Zing).
>>
>> On Thu, Mar 12, 2015 at 4:09 PM, Tom Hubregtsen 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm running the teraSort benchmark with a relatively small input set: 5GB.
>>> During profiling, I can see I am using a total of 68GB. I've got a
>>> terabyte
>>> of memory in my system, and set
>>> spark.executor.memory 900g
>>> spark.driver.memory 900g
>>> I use the default for
>>> spark.shuffle.memoryFraction
>>> spark.storage.memoryFraction
>>> I believe that I now have 0.2*900=180GB for shuffle and 0.6*900=540GB for
>>> storage.
>>>
>>> I noticed a lot of variation in runtime (under the same load), and
>>> tracked
>>> this down to this function in
>>> core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala
>>>   private def spillToPartitionFiles(collection:
>>> SizeTrackingPairCollection[(Int, K), C]): Unit = {
>>> spillToPartitionFiles(collection.iterator)
>>>   }
>>> In a slow run, it would loop through this function 12000 times, in a fast
>>> run only 700 times, even though the settings in both runs are the same
>>> and
>>> there are no other users on the system. When I look at the function
>>> calling
>>> this (insertAll, also in ExternalSorter), I see that
>>> spillToPartitionFiles
>>> is only called 700 times in both fast and slow runs, meaning that the
>>> function recursively calls itself very often. Because of the function
>>> name,
>>> I assume the system is spilling to disk. As I have sufficient memory, I
>>> assume that I forgot to set a certain memory setting. Anybody any idea
>>> which
>>> other setting I have to set, in order to not spill data in this scenario?
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Spilling-when-not-expected-tp11017.html
>>> Sent from the Apache Spark Developers List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: enum-like types in Spark

2015-03-23 Thread Reynold Xin
If scaladoc can show the Java enum types, then I do think the best way is
just to use Java enum types.
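
For readers skimming the thread, a hedged sketch (names are illustrative, not
Spark's actual code) of the Scala pattern being weighed against Java enums,
which is where the hand-maintained boilerplate discussed below comes from:

// Scala "enum-like" pattern: sealed type + case objects + hand-written values/valueOf.
sealed abstract class StageStatus
object StageStatus {
  case object Active extends StageStatus
  case object Complete extends StageStatus
  case object Failed extends StageStatus

  // Unlike a Java enum, values and fromString must be written and kept in sync by hand.
  val values: Seq[StageStatus] = Seq(Active, Complete, Failed)
  def fromString(s: String): StageStatus =
    values.find(_.toString.equalsIgnoreCase(s))
      .getOrElse(throw new IllegalArgumentException("Unknown status: " + s))
}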


On Mon, Mar 23, 2015 at 2:11 PM, Patrick Wendell  wrote:

> If the official solution from the Scala community is to use Java
> enums, then it seems strange they aren't generated in scaladoc? Maybe
> we can just fix that w/ Typesafe's help and then we can use them.
>
> On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen  wrote:
> > Yeah the fully realized #4, which gets back the ability to use it in
> > switch statements (? in Scala but not Java?) does end up being kind of
> > huge.
> >
> > I confess I'm swayed a bit back to Java enums, seeing what it
> > involves. The hashCode() issue can be 'solved' with the hash of the
> > String representation.
> >
> > On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid 
> wrote:
> >> I've just switched some of my code over to the new format, and I just
> want
> >> to make sure everyone realizes what we are getting into.  I went from 10
> >> lines as java enums
> >>
> >>
> https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
> >>
> >> to 30 lines with the new format:
> >>
> >>
> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
> >>
> >> It's not just that it's verbose: each name has to be repeated 4 times, with
> >> potential typos in some locations that won't be caught by the compiler.
> >> Also, you have to manually maintain the "values" as you update the set of
> >> enums; the compiler won't do it for you.
> >>
> >> The only downside I've heard for java enums is enum.hashcode().  OTOH,
> the
> >> downsides for this version are: maintainability / verbosity, no
> values(),
> >> more cumbersome to use from java, no enum map / enumset.
> >>
> >> I did put together a little util to at least get back the equivalent of
> >> enum.valueOf() with this format
> >>
> >>
> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
> >>
> >> I'm not trying to prevent us from moving forward on this, its fine if
> this
> >> is still what everyone wants, but I feel pretty strongly java enums make
> >> more sense.
> >>
> >> thanks,
> >> Imran
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Review request for SPARK-6112:Provide OffHeap support through HDFS RAM_DISK

2015-03-23 Thread Reynold Xin
I created a ticket to separate the API refactoring from the implementation.
Would be great to have these as two separate patches to make it easier to
review (similar to the way we are doing RPC refactoring -- first
introducing an internal RPC API, porting Akka to it, and then adding an
alternative implementation).

https://issues.apache.org/jira/browse/SPARK-6479

Can you upload your design doc there so we can discuss the block store api?
Thanks.


On Mon, Mar 23, 2015 at 3:47 PM, Zhan Zhang  wrote:

> Hi Folks,
>
> I am planning to implement hdfs off heap support for spark, and have
> uploaded the design doc for the off heap support through hdfs ramdisk in
> jira SPARK-6112. Please review it and provide your feedback if anybody are
> interested.
>
> https://issues.apache.org/jira/browse/SPARK-6112
>
> Thanks.
>
> Zhan Zhang
>


Re: Spark SQL(1.3.0) "import sqlContext.implicits._" seems not work for converting a case class RDD to DataFrame

2015-03-24 Thread Reynold Xin
In particular:

http://spark.apache.org/docs/latest/sql-programming-guide.html


"Additionally, the implicit conversions now only augment RDDs that are
composed of Products (i.e., case classes or tuples) with a method toDF,
instead of applying automatically."
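
Concretely, a hedged sketch of the 1.3-style fix (assuming a spark-shell
SparkContext `sc`; the path and schema are taken from the example quoted
below): call toDF() explicitly before registering the temp table.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

case class Person(name: String, age: Int)

val people = sc.textFile("hdfs://heju:8020/user/root/magic/poolInfo.txt")
  .flatMap(_.split("\n")).map(_.split(" ")).map(p => Person(p(0), 1))

// The implicit now only adds toDF() to RDDs of Products; it no longer converts them automatically.
people.toDF().registerTempTable("people")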



On Tue, Mar 24, 2015 at 9:07 PM, Ted Yu  wrote:

> Please take a look at:
> ./sql/core/src/main/scala/org/apache/spark/sql/DataFrameHolder.scala
> ./sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala
>
> Cheers
>
> On Tue, Mar 24, 2015 at 8:46 PM, Zhiwei Chan 
> wrote:
>
> > Hi all,
> >
> >   I just upgraded spark from 1.2.1 to 1.3.0, and changed the "import
> > sqlContext.createSchemaRDD" to "import sqlContext.implicits._" in my
> code.
> > (I scan the programming guide and it seems this is the only change I need
> > to do). But it come to an error when run compile as following:
> > >>>
> > [ERROR] ...\magic.scala:527: error: value registerTempTable is not a
> member
> > of org.apache.spark.rdd.RDD[com.yhd.ycache.magic.Table]
> > [INFO] tableRdd.registerTempTable(tableName)
> > <<<
> >
> > Then I try the exactly example in the programming guide of 1.3  in
> > spark-shell, it come to the same error.
> > >>>
> > scala> sys.env.get("CLASSPATH")
> > res7: Option[String] =
> >
> >
> Some(:/root/scala/spark-1.3.0-bin-hadoop2.4/conf:/root/scala/spark-1.3.0-bin-hadoop2.4/lib/spark-assembly-1.3.0-hadoop2.4.0.jar:/root/scala/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/root/scala/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/root/scala/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar)
> >
> > scala>  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> > sqlContext: org.apache.spark.sql.SQLContext =
> > org.apache.spark.sql.SQLContext@4b05b3ff
> >
> > scala>  import sqlContext.implicits._
> > import sqlContext.implicits._
> >
> > scala>  case class Person(name: String, age: Int)
> > defined class Person
> >
> > scala>   val t1 =
> > sc.textFile("hdfs://heju:8020/user/root/magic/poolInfo.txt")
> > 15/03/25 11:13:35 INFO MemoryStore: ensureFreeSpace(81443) called with
> > curMem=186397, maxMem=278302556
> > 15/03/25 11:13:35 INFO MemoryStore: Block broadcast_3 stored as values in
> > memory (estimated size 79.5 KB, free 265.2 MB)
> > 15/03/25 11:13:35 INFO MemoryStore: ensureFreeSpace(31262) called with
> > curMem=267840, maxMem=278302556
> > 15/03/25 11:13:35 INFO MemoryStore: Block broadcast_3_piece0 stored as
> > bytes in memory (estimated size 30.5 KB, free 265.1 MB)
> > 15/03/25 11:13:35 INFO BlockManagerInfo: Added broadcast_3_piece0 in
> memory
> > on heju:48885 (size: 30.5 KB, free: 265.4 MB)
> > 15/03/25 11:13:35 INFO BlockManagerMaster: Updated info of block
> > broadcast_3_piece0
> > 15/03/25 11:13:35 INFO SparkContext: Created broadcast 3 from textFile at
> > :34
> > t1: org.apache.spark.rdd.RDD[String] =
> > hdfs://heju:8020/user/root/magic/poolInfo.txt MapPartitionsRDD[9] at
> > textFile at :34
> >
> > scala>  val t2 = t1.flatMap(_.split("\n")).map(_.split(" ")).map(p =>
> > Person(p(0),1))
> > t2: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[12] at map at
> > :38
> >
> > scala>  t2.registerTempTable("people")
> > :41: error: value registerTempTable is not a member of
> > org.apache.spark.rdd.RDD[Person]
> >t2.registerTempTable("people")
> >   ^
> > <<<
> >
> > I found the following explanation in programming guide about implicit
> > convert case class to DataFrams, but I don't understand what I should do.
> > Could any one tell me how should I do if I want to convert a case class
> RDD
> > to DataFrame?
> >
> > >>>
> > Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)
> >
> > Many of the code examples prior to Spark 1.3 started with import
> > sqlContext._, which brought all of the functions from sqlContext into
> > scope. In Spark 1.3 we have isolated the implicit conversions for
> > converting RDDs into DataFrames into an object inside of the SQLContext.
> > Users should now write import sqlContext.implicits._.
> >
> > Additionally, the implicit conversions now only augment RDDs that are
> > composed of Products (i.e., case classes or tuples) with a method toDF,
> > instead of applying automatically.
> >
> > <<<
> > Thanks
> > Jason
> >
>


Re: Jira Issues

2015-03-25 Thread Reynold Xin
Igor,

Welcome -- everything is open here:
https://issues.apache.org/jira/browse/SPARK

You should be able to see them even if you are not an ASF member.


On Wed, Mar 25, 2015 at 1:51 PM, Igor Costa  wrote:

> Hi there Guys.
>
> I want to contribute more to Spark, but I have two questions.
>
>
> Are issues tracked on GitHub or in JIRA?
>
> If on JIRA, is there a way I can get in to see the issues?
>
> I've tried to log in, but with no success.
>
>
> I'm PMC from another Apache project, flex.apache.org
>
>
> Best Regards
> Igor
>


Re: Iterative pyspark / scala codebase development

2015-03-27 Thread Reynold Xin
Python is tough if you need to change Scala at the same time.

sbt/sbt assembly/assembly

can be slightly faster than just assembly.


On Fri, Mar 27, 2015 at 10:02 AM, Stephen Boesch  wrote:

> I am iteratively making changes to the scala side of some new pyspark code
> and re-testing from the python/pyspark side.
>
> Presently my only solution is to rebuild completely
>
>   sbt assembly
>
> after any scala side change - no matter how small.
>
> Any better / expedited way for pyspark to see small scala side updates?
>


Re: RDD.count

2015-03-28 Thread Reynold Xin
I think the worry here is that people often use count() to force execution,
and when coupled with transformations that have side effects, it is no longer
safe to not run it.

However, maybe we can add a new lazy val .size that doesn't require
recomputation.
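
A hedged sketch (not an existing Spark API; just one possible reading of the
idea) of a memoized size available from user code:

import org.apache.spark.rdd.RDD
import scala.collection.concurrent.TrieMap

object CachedCount {
  private val counts = TrieMap.empty[Int, Long]

  // A repeated lookup is free; the plain count() keeps its "force the whole DAG" semantics.
  implicit class SizedRDD[T](rdd: RDD[T]) {
    def cachedSize: Long = counts.getOrElseUpdate(rdd.id, rdd.count())
  }
}
// usage: import CachedCount._; val n = someRdd.cachedSize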


On Sat, Mar 28, 2015 at 7:42 AM, Sandy Ryza  wrote:

> I definitely see the value in this.  However, I think at this point it
> would be an incompatible behavioral change.  People often use count in
> Spark to exercise their DAG.  Omitting processing steps that were
> previously included would likely mislead many users into thinking their
> pipeline was running faster.
>
> It's possible there might be room for something like a new smartCount API
> or a new argument to count that allows it to avoid unnecessary
> transformations.
>
> -Sandy
>
> On Sat, Mar 28, 2015 at 6:10 AM, Sean Owen  wrote:
>
> > No, I'm not saying side effects change the count. But not executing
> > the map() function at all certainly has an effect on the side effects
> > of that function: the side effects which should take place never do. I
> > am not sure that is something to be 'fixed'; it's a legitimate
> > question.
> >
> > You can persist an RDD if you do not want to compute it twice.
> >
> > On Sat, Mar 28, 2015 at 1:05 PM, jimfcarroll 
> > wrote:
> > > Hi Sean,
> > >
> > > Thanks for the response.
> > >
> > > I can't imagine a case (though my imagination may be somewhat limited)
> > where
> > > even map side effects could change the number of elements in the
> > resulting
> > > map.
> > >
> > > I guess "count" wouldn't officially be an 'action' if it were
> implemented
> > > this way. At least it wouldn't ALWAYS be one.
> > >
> > > My example was contrived. We're passing RDDs to functions. If that RDD
> > is an
> > > instance of my class, then its count() may take a shortcut. If I
> > > map/zip/zipWithIndex/mapPartition/etc. first then I'm stuck with a call
> > that
> > > literally takes 100s to 1000s of times longer (seconds vs hours on some
> > of
> > > our datasets) and since my custom RDDs are immutable they cache the
> count
> > > call so a second invocation is the cost of a method call's overhead.
> > >
> > > I could fix this in Spark if there's any interest in that change.
> > Otherwise
> > > I'll need to overload more RDD methods for my own purposes (like all of
> > the
> > > transformations). Of course, that will be more difficult because those
> > > intermediate classes (like MappedRDD) are private, so I can't extend
> > them.
> > >
> > > Jim
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-count-tp11298p11302.html
> > > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: [sql] How to uniquely identify Dataframe?

2015-03-30 Thread Reynold Xin
The only reason I can think of right now is that you might want to change
the config parameter to change the behavior of the optimizer and regenerate
the plan. However, maybe that's not a strong enough reason to regenerate
the RDD every time.
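
For what it's worth, a hedged sketch (Spark 1.3; the cache shape is
illustrative) of keying application-side caching on the logical plan, along the
lines Cheng suggests below:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import scala.collection.concurrent.TrieMap

// df.rdd returns a fresh RDD (and id) on every call, but the logical plan is stable
// for a given DataFrame instance, so it can serve as a cache key.
val counts = TrieMap.empty[LogicalPlan, Long]
def cachedCount(df: DataFrame): Long =
  counts.getOrElseUpdate(df.queryExecution.logical, df.count())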


On Mon, Mar 30, 2015 at 5:38 AM, Cheng Lian  wrote:

> This is because unlike SchemaRDD, DataFrame itself is no longer an RDD
> now. Meanwhile, DataFrame.rdd is a function, which always returns a
> new RDD. I think you may use DataFrame.queryExecution.logical (the
> logical plan) as an ID. Maybe we should make it a "lazy val" rather than a
> "def". Personally I don't find a good reason that it has to be a "def", but
> maybe I'm missing something here.
>
> Filed JIRA ticket and PR for this:
>
> - https://issues.apache.org/jira/browse/SPARK-6608
> - https://github.com/apache/spark/pull/5265
>
> Cheng
>
>
> On 3/30/15 8:02 PM, Peter Rudenko wrote:
>
>> Hi, I have some custom caching logic in my application. I need to identify
>> a DataFrame somehow, to check whether I have seen it previously. Here’s the problem:
>>
>> scala> val data = sc.parallelize(1 to 1000)
>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :21
>> scala> data.id
>> res0: Int = 0
>> scala> data.id
>> res1: Int = 0
>> scala> val dataDF = data.toDF
>> dataDF: org.apache.spark.sql.DataFrame = [_1: int]
>> scala> dataDF.rdd.id
>> res3: Int = 2
>> scala> dataDF.rdd.id
>> res4: Int = 3
>>
>> For some reason it generates a new ID on each call. With schemaRDD i was
>> able to call SchemaRDD.id.
>>
>> Thanks,
>> Peter Rudenko
>>
>> ​
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark config option 'expression language' feedback request

2015-03-30 Thread Reynold Xin
Reviving this to see if others would like to chime in about this
"expression language" for config options.


On Fri, Mar 13, 2015 at 7:57 PM, Dale Richardson 
wrote:

> Mridul, I may have added some confusion by giving examples in completely
> different areas. For example the number of cores available for tasking on
> each worker machine is a resource-controller level configuration variable.
> In standalone mode (ie using Spark's home-grown resource manager) the
> configuration variable SPARK_WORKER_CORES is an item that spark admins can
> set (and we can use expressions for). The equivalent variable for YARN
> (Yarn.nodemanager.resource.cpu-vcores) is only used by Yarn's node manager
> setup and is set by Yarn administrators and outside of control of spark
> (and most users).  If you are not a cluster administrator then both
> variables are irrelevant to you. The same goes for SPARK_WORKER_MEMORY.
>
> As for spark.executor.memory: as there is no way to know the attributes
> of a machine before a task is allocated to it, we cannot use any of the
> JVMInfo functions. For options like that the expression parser can easily
> be limited to supporting different byte units of scale (kb/mb/gb etc) and
> other configuration variables only.
> Regards, Dale.
>
>
>
>
> > Date: Fri, 13 Mar 2015 17:30:51 -0700
> > Subject: Re: Spark config option 'expression language' feedback request
> > From: mri...@gmail.com
> > To: dale...@hotmail.com
> > CC: dev@spark.apache.org
> >
> > Let me try to rephrase my query.
> > How can a user specify, for example, what the executor memory should
> > be or number of cores should be.
> >
> > I dont want a situation where some variables can be specified using
> > one set of idioms (from this PR for example) and another set cannot
> > be.
> >
> >
> > Regards,
> > Mridul
> >
> >
> >
> >
> > On Fri, Mar 13, 2015 at 4:06 PM, Dale Richardson 
> wrote:
> > >
> > >
> > >
> > > Thanks for your questions Mridul.
> > > I assume you are referring to how the functionality to query system
> state works in Yarn and Mesos?
> > > The API's used are the standard JVM API's so the functionality will
> work without change. There is no real use case for using
> 'physicalMemoryBytes' in these cases though, as the JVM size has already
> been limited by the resource manager.
> > > Regards,Dale.
> > >> Date: Fri, 13 Mar 2015 08:20:33 -0700
> > >> Subject: Re: Spark config option 'expression language' feedback
> request
> > >> From: mri...@gmail.com
> > >> To: dale...@hotmail.com
> > >> CC: dev@spark.apache.org
> > >>
> > >> I am curious how you are going to support these over mesos and yarn.
> > >> Any configure change like this should be applicable to all of them,
> not
> > >> just local and standalone modes.
> > >>
> > >> Regards
> > >> Mridul
> > >>
> > >> On Friday, March 13, 2015, Dale Richardson 
> wrote:
> > >>
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > PR#4937 ( https://github.com/apache/spark/pull/4937) is a feature
> to
> > >> > allow for Spark configuration options (whether on command line,
> environment
> > >> > variable or a configuration file) to be specified via a simple
> expression
> > >> > language.
> > >> >
> > >> >
> > >> > Such a feature has the following end-user benefits:
> > >> > - Allows for flexibility in specifying time intervals or byte
> > >> > quantities in appropriate and easy-to-follow units, e.g. 1 week
> > >> > rather than 604800 seconds
> > >> >
> > >> > - Allows for the scaling of a configuration option in relation to a
> system
> > >> > attributes. e.g.
> > >> >
> > >> > SPARK_WORKER_CORES = numCores - 1
> > >> >
> > >> > SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB
> > >> >
> > >> > - Gives the ability to scale multiple configuration options
> together eg:
> > >> >
> > >> > spark.driver.memory = 0.75 * physicalMemoryBytes
> > >> >
> > >> > spark.driver.maxResultSize = spark.driver.memory * 0.8
> > >> >
> > >> >
> > >> > The following functions are currently supported by this PR:
> > >> > NumCores: Number of cores assigned to the JVM (usually
> ==
> > >> > Physical machine cores)
> > >> > PhysicalMemoryBytes:  Memory size of hosting machine
> > >> >
> > >> > JVMTotalMemoryBytes:  Current bytes of memory allocated to the JVM
> > >> >
> > >> > JVMMaxMemoryBytes:Maximum number of bytes of memory available
> to the
> > >> > JVM
> > >> >
> > >> > JVMFreeMemoryBytes:   maxMemoryBytes - totalMemoryBytes
> > >> >
> > >> >
> > >> > I was wondering if anybody on the mailing list has any further
> ideas on
> > >> > other functions that could be useful to have when specifying spark
> > >> > configuration options?
> > >> > Regards,Dale.
> > >> >
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>


Re: Migrating from 1.2.1 to 1.3.0 - org.apache.spark.sql.api.java.Row

2015-03-31 Thread Reynold Xin
Yup - we merged the Java and Scala API so there is now a single set of API
to support both languages.

See more at
http://spark.apache.org/docs/latest/sql-programming-guide.html#unification-of-the-java-and-scala-apis
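
For example, a brief sketch (assuming an existing SQLContext `sqlContext` and a
registered "people" table; both are illustrative) of consuming results through
the unified org.apache.spark.sql.Row:

import org.apache.spark.sql.Row

// In 1.3, query results come back as org.apache.spark.sql.Row for both Scala and Java callers.
val rows: Array[Row] = sqlContext.sql("SELECT name, age FROM people").collect()
rows.foreach(r => println(r.getString(0) + " is " + r.getInt(1)))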



On Tue, Mar 31, 2015 at 11:40 PM, Niranda Perera 
wrote:

> Hi,
>
> previously in 1.2.1, the result row from a Spark SQL query was
> a org.apache.spark.sql.api.java.Row.
>
> In 1.3.0 I do not see a sql.api.java package. so does it mean that even the
> SQL query result row is an implementation of org.apache.spark.sql.Row such
> as GenericRow etc?
>
> --
> Niranda
>


Spark 2.0: Rearchitecting Spark for Mobile, Local, Social

2015-04-01 Thread Reynold Xin
Hi Spark devs,

I've spent the last few months investigating the feasibility of
re-architecting Spark for mobile platforms, considering the growing
population of Android/iOS users. I'm happy to share with you my findings at
https://issues.apache.org/jira/browse/SPARK-6646

The tl;dr is that we should support running Spark on Android/iOS, and the
best way to do this at the moment is to use Scala.js to compile Spark code
into JavaScript, and then run it in Safari or Chrome (and even node.js
potentially for servers).

If you are on your phones right now and prefer reading a blog post rather
than a PDF file, you can read more about the design doc at
https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html


This is done in collaboration with TD, Xiangrui, Patrick. Look forward to
your feedback!


<    2   3   4   5   6   7   8   9   10   11   >