
Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Tim Hunter
Hello Enzo,

Since this question is also relevant to Spark, I will answer it here. The
goal of GraphFrames is to provide graph capabilities along with excellent
integration with the rest of the Spark ecosystem (using modern APIs such as
DataFrames). As you seem to be well aware, a large number of graph
algorithms can be implemented in terms of a small subset of graph
primitives. These graph primitives can be translated to Spark operations,
but we feel that some important low-level optimizations should be added to
the Catalyst engine in order to realize the true potential of GraphFrames.
You can find a flavor of this work in this presentation by Ankur Dave [1].
This is still an area of collaboration with the Spark core team, and we
would like to merge GraphFrames into Spark 2.x eventually.
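
For a flavor of what this looks like in user code, here is a minimal sketch
against the published GraphFrames Scala API (the graph and the data below are
purely illustrative):

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().appName("graphframes-sketch").getOrCreate()
import spark.implicits._

// A tiny graph expressed as two DataFrames: vertices need an "id" column,
// edges need "src" and "dst" columns.
val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val edges = Seq(("a", "b", "follows"), ("b", "c", "follows")).toDF("src", "dst", "relationship")
val g = GraphFrame(vertices, edges)

// A motif query: graph pattern matching compiled down to DataFrame operations,
// which is exactly where Catalyst-level optimizations would pay off.
g.find("(x)-[e1]->(y); (y)-[e2]->(z)").show()

// A classic algorithm built on the same primitives.
g.pageRank.resetProbability(0.15).maxIter(10).run().vertices.show()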

Where does that leave us for the time being? GraphFrames is actively
supported, and we implemented a highly scalable version of GraphFrames in
November. As you mentioned, there are a number of distributed Graph
frameworks out there, but to my knowledge they are not as easy to integrate
with Spark. The current approach has been to reach parity with GraphX first
and then add new algorithms based on popular demand. Along these lines,
GraphBLAS could be added on top of it if someone is willing to step up.

Tim

[1]
https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/

On Mon, Mar 13, 2017 at 2:58 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Since GraphFrames is not part of the Spark project, your
> GraphFrames-specific questions are probably better directed at the
> GraphFrames issue tracker:
>
> https://github.com/graphframes/graphframes/issues
>
> As far as I know, GraphFrames is an active project, though not as active
> as Spark of course. There will be lulls in development since the people
> driving that project forward also have major commitments to other projects.
> This is natural.
>
> If you post on GitHub, I would wager someone there (maybe Joseph or Tim?)
> should be able to answer your questions about GraphFrames.
>
>
>1. The page you linked refers to a *plan* to move GraphFrames to the
>standard Spark release cycle. Is this *plan* publicly available /
>visible?
>
> I didn’t see any such reference to a plan in the page I linked you to.
> Rather, the page says:
>
> The current plan is to keep GraphFrames separate from core Apache Spark
> for the time being.
>
> Nick
> ​
>
> On Mon, Mar 13, 2017 at 5:46 PM enzo 
> wrote:
>
>> Nick
>>
>> Thanks for the quick answer :)
>>
>> Sadly, the comment in the page doesn’t answer my questions. More
>> specifically:
>>
>> 1. GraphFrames' last activity on GitHub was 2 months ago, and its last release
>> was on 12 Nov 2016.  Until recently, 2 months was close to a Spark release cycle.
>> Why has there been no major development since mid-November?
>>
>> 2. The page you linked refers to a *plan* to move GraphFrames to the
>> standard Spark release cycle.  Is this *plan* publicly available / visible?
>>
>> 3. I couldn’t find any statement of intent to preserve one API or the
>> other, or to merge them: in other words, there seems to be no
>> overarching plan for a cohesive & comprehensive graph API (I apologise in
>> advance if I’m wrong).
>>
>> 4. I was initially impressed by GraphFrames' syntax, in places similar to
>> Neo4j Cypher (now open source), but later I understood it was an incomplete,
>> lightweight experiment (with no intention to move to full compatibility,
>> perhaps for good reasons).  To me it sort of gave the wrong message.
>>
>> 5. In the meantime, the world of graphs is changing. The GraphBLAS forum
>> seems to be gaining some traction: a library based on GraphBLAS has been made
>> available on Accumulo (Graphulo).  Assuming that Spark is NOT going to
>> adopt similar lines, nor to follow DataStax with TinkerPop and Gremlin,
>> again, what is the new, cohesive & comprehensive API that Spark is going
>> to deliver?
>>
>>
>> Sadly, the API uncertainty may force developers towards more stable
>> APIs, platforms & roadmaps.
>>
>>
>>
>> Thanks Enzo
>>
>> On 13 Mar 2017, at 22:09, Nicholas Chammas 
>> wrote:
>>
>> Your question is answered here under "Will GraphFrames be part of Apache
>> Spark?", no?
>>
>> http://graphframes.github.io/#what-are-graphframes
>>
>> Nick
>>
>> On Mon, Mar 13, 2017 at 4:56 PM enzo 
>> wrote:
>>
>> Please see this email  trail:  no answer so far on the user@spark
>> board.  Trying the developer board for better luck
>>
>> The question:
>>
>> I am a bit confused by the current roadmap for graph and graph analytics
>> in Apache Spark.
>>
>> I understand that we have had for some time two libraries (the following
>> is my understanding - please amend as appropriate!):
>>
>> . GraphX, part of Spark 

Adding the executor ID to Spark logs when launching an executor in a YARN container

2017-03-13 Thread Rodriguez Hortala, Juan
Hi Spark developers,

For Spark running on YARN, I would like to be able to find out the container 
where an executor is running by looking at the logs. I haven't been able to 
find a way to do this, not even with the Spark UI, as neither the Executors tab 
nor the stage information page shows the container ID. I was thinking of 
modifying the log messages in YarnAllocator to include the executor ID on 
container start, as follows:

@@ -494,7 +494,8 @@ private[yarn] class YarnAllocator(
      val containerId = container.getId
      val executorId = executorIdCounter.toString
      assert(container.getResource.getMemory >= resource.getMemory)
-      logInfo(s"Launching container $containerId on host $executorHostname")
+      logInfo(s"Launching container $containerId on host $executorHostname " +
+        s"for executor with ID $executorId")

      def updateInternalState(): Unit = synchronized {
        numExecutorsRunning += 1
@@ -528,7 +529,8 @@ private[yarn] class YarnAllocator(
            updateInternalState()
          } catch {
            case NonFatal(e) =>
-              logError(s"Failed to launch executor $executorId on container $containerId", e)
+              logError(s"Failed to launch executor $executorId on container $containerId " +
+                s"for executor with ID $executorId", e)
              // Assigned container should be released immediately to avoid
              // unnecessary resource occupation.
              amClient.releaseAssignedContainer(containerId)
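
For comparison, here is a hedged sketch of getting a similar executor-to-container
mapping from the driver side without patching Spark, using only the public listener
API (on YARN the executor log URLs typically embed the container ID in their path);
the class name and message format below are just illustrative:

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

class ExecutorContainerLogger extends SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    // The log URL map comes from the cluster manager; on YARN each URL
    // contains the container ID as part of its path.
    val logUrls = event.executorInfo.logUrlMap.values.mkString(", ")
    println(s"Executor ${event.executorId} added on host ${event.executorInfo.executorHost}; " +
      s"log URLs: $logUrls")
  }
}

// Registered from the driver, e.g.: sc.addSparkListener(new ExecutorContainerLogger)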

Do you think this is a good idea, or is there a better way to achieve this?

Thanks in advance,

Juan



Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
Since GraphFrames is not part of the Spark project, your
GraphFrames-specific questions are probably better directed at the
GraphFrames issue tracker:

https://github.com/graphframes/graphframes/issues

As far as I know, GraphFrames is an active project, though not as active as
Spark of course. There will be lulls in development since the people
driving that project forward also have major commitments to other projects.
This is natural.

If you post on GitHub, I would wager someone there (maybe Joseph or Tim?)
should be able to answer your questions about GraphFrames.


   1. The page you linked refers to a *plan* to move GraphFrames to the
   standard Spark release cycle. Is this *plan* publicly available /
   visible?

I didn’t see any such reference to a plan in the page I linked you to.
Rather, the page says:

The current plan is to keep GraphFrames separate from core Apache Spark for
the time being.

Nick
​

On Mon, Mar 13, 2017 at 5:46 PM enzo  wrote:

> Nick
>
> Thanks for the quick answer :)
>
> Sadly, the comment in the page doesn’t answer my questions. More
> specifically:
>
> 1. GraphFrames' last activity on GitHub was 2 months ago, and its last release
> was on 12 Nov 2016.  Until recently, 2 months was close to a Spark release cycle.
> Why has there been no major development since mid-November?
>
> 2. The page you linked refers to a *plan* to move GraphFrames to the
> standard Spark release cycle.  Is this *plan* publicly available / visible?
>
> 3. I couldn’t find any statement of intent to preserve one API or the
> other, or to merge them: in other words, there seems to be no
> overarching plan for a cohesive & comprehensive graph API (I apologise in
> advance if I’m wrong).
>
> 4. I was initially impressed by GraphFrames' syntax, in places similar to
> Neo4j Cypher (now open source), but later I understood it was an incomplete,
> lightweight experiment (with no intention to move to full compatibility,
> perhaps for good reasons).  To me it sort of gave the wrong message.
>
> 5. In the meantime, the world of graphs is changing. The GraphBLAS forum seems
> to be gaining some traction: a library based on GraphBLAS has been made available
> on Accumulo (Graphulo).  Assuming that Spark is NOT going to adopt similar
> lines, nor to follow DataStax with TinkerPop and Gremlin, again, what is
> the new, cohesive & comprehensive API that Spark is going to deliver?
>
>
> Sadly, the API uncertainty may force developers towards more stable
> APIs, platforms & roadmaps.
>
>
>
> Thanks Enzo
>
> On 13 Mar 2017, at 22:09, Nicholas Chammas 
> wrote:
>
> Your question is answered here under "Will GraphFrames be part of Apache
> Spark?", no?
>
> http://graphframes.github.io/#what-are-graphframes
>
> Nick
>
> On Mon, Mar 13, 2017 at 4:56 PM enzo 
> wrote:
>
> Please see this email  trail:  no answer so far on the user@spark board.
> Trying the developer board for better luck
>
> The question:
>
> I am a bit confused by the current roadmap for graph and graph analytics
> in Apache Spark.
>
> I understand that we have had for some time two libraries (the following
> is my understanding - please amend as appropriate!):
>
> . GraphX, part of the Spark project.  This library is based on RDDs and it is
> only accessible via Scala.  It doesn’t look like this library has been
> enhanced recently.
> . GraphFrames, an independent (at the moment?) library for Spark.  This
> library is based on Spark DataFrames and accessible from Scala & Python. Last
> commit on GitHub was 2 months ago.
>
> GraphFrames came about with the promise of being integrated into Apache Spark
> at some point.
>
> I can see other projects coming up with interesting libraries and ideas
> (e.g. Graphulo on Accumulo, a new project with the goal of implementing
> the GraphBlas building blocks for graph algorithms on top of Accumulo).
>
> Where is Apache Spark going?
>
> Where are graph libraries in the roadmap?
>
>
>
> Thanks for any clarity brought to this matter.
>
> Thanks Enzo
>
> Begin forwarded message:
>
> *From: *"Md. Rezaul Karim" 
> *Subject: **Re: Question on Spark's graph libraries*
> *Date: *10 March 2017 at 13:13:15 CET
> *To: *Robin East 
> *Cc: *enzo , spark users <
> u...@spark.apache.org>
>
> +1
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 10 March 2017 at 12:10, Robin East  wrote:
>
> I would love to know the answer to that too.
>
> 

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread enzo
Nick

Thanks for the quick answer :)

Sadly, the comment in the page doesn’t answer my questions. More specifically:

1. GraphFrames' last activity on GitHub was 2 months ago, and its last release was 
on 12 Nov 2016.  Until recently, 2 months was close to a Spark release cycle.  Why 
has there been no major development since mid-November?

2. The page you linked refers to a *plan* to move GraphFrames to the standard 
Spark release cycle.  Is this *plan* publicly available / visible?

3. I couldn’t find any statement of intent to preserve one API or the other, or 
to merge them: in other words, there seems to be no overarching plan 
for a cohesive & comprehensive graph API (I apologise in advance if I’m wrong).

4. I was initially impressed by GraphFrames' syntax, in places similar to Neo4j 
Cypher (now open source), but later I understood it was an incomplete, lightweight 
experiment (with no intention to move to full compatibility, perhaps for good 
reasons).  To me it sort of gave the wrong message.

5. In the meantime, the world of graphs is changing. The GraphBLAS forum seems to 
be gaining some traction: a library based on GraphBLAS has been made available on 
Accumulo (Graphulo).  Assuming that Spark is NOT going to adopt similar lines, 
nor to follow DataStax with TinkerPop and Gremlin, again, what is the new, 
cohesive & comprehensive API that Spark is going to deliver?


Sadly, the API uncertainty may force developers towards more stable APIs, 
platforms & roadmaps.



Thanks Enzo

> On 13 Mar 2017, at 22:09, Nicholas Chammas  wrote:
> 
> Your question is answered here under "Will GraphFrames be part of Apache 
> Spark?", no?
> 
> http://graphframes.github.io/#what-are-graphframes 
> 
> 
> Nick
> 
> On Mon, Mar 13, 2017 at 4:56 PM enzo wrote:
> Please see this email  trail:  no answer so far on the user@spark board.  
> Trying the developer board for better luck
> 
> The question:
> 
> I am a bit confused by the current roadmap for graph and graph analytics in 
> Apache Spark.
> 
> I understand that we have had for some time two libraries (the following is 
> my understanding - please amend as appropriate!):
> 
> . GraphX, part of the Spark project.  This library is based on RDDs and it is only 
> accessible via Scala.  It doesn’t look like this library has been enhanced 
> recently.
> . GraphFrames, an independent (at the moment?) library for Spark.  This library 
> is based on Spark DataFrames and accessible from Scala & Python. Last commit on 
> GitHub was 2 months ago.
> 
> GraphFrames came about with the promise of being integrated into Apache Spark 
> at some point.
> 
> I can see other projects coming up with interesting libraries and ideas (e.g. 
> Graphulo on Accumulo, a new project with the goal of implementing the 
> GraphBlas building blocks for graph algorithms on top of Accumulo).
> 
> Where is Apache Spark going?
> 
> Where are graph libraries in the roadmap?
> 
> 
> 
> Thanks for any clarity brought to this matter.
> 
> Thanks Enzo
> 
>> Begin forwarded message:
>> 
>> From: "Md. Rezaul Karim"
>> Subject: Re: Question on Spark's graph libraries
>> Date: 10 March 2017 at 13:13:15 CET
>> To: Robin East
>> Cc: enzo, spark users
>> 
>> +1
>> 
>> Regards,
>> _
>> Md. Rezaul Karim, BSc, MSc
>> PhD Researcher, INSIGHT Centre for Data Analytics 
>> National University of Ireland, Galway
>> IDA Business Park, Dangan, Galway, Ireland
>> Web: http://www.reza-analytics.eu/index.html 
>> 
>> 
>> On 10 March 2017 at 12:10, Robin East wrote:
>> I would love to know the answer to that too.
>> ---
>> Robin East
>> Spark GraphX in Action Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 9 Mar 2017, at 17:42, enzo wrote:
>>> 
>>> I am a bit confused by the current roadmap for graph and graph analytics in 
>>> Apache Spark.
>>> 
>>> I understand that we have had for some time two libraries (the following is 
>>> my understanding - please amend as appropriate!):
>>> 
>>> . GraphX, part of the Spark project.  This library is based on RDDs and it is 
>>> only accessible via Scala.  It doesn’t look like this library has been 
>>> enhanced recently.
>>> . GraphFrames, independent (at the moment?) 

Re: Question on Spark's graph libraries roadmap

2017-03-13 Thread Nicholas Chammas
Your question is answered here under "Will GraphFrames be part of Apache
Spark?", no?

http://graphframes.github.io/#what-are-graphframes

Nick

On Mon, Mar 13, 2017 at 4:56 PM enzo  wrote:

> Please see this email  trail:  no answer so far on the user@spark board.
> Trying the developer board for better luck
>
> The question:
>
> I am a bit confused by the current roadmap for graph and graph analytics
> in Apache Spark.
>
> I understand that we have had for some time two libraries (the following
> is my understanding - please amend as appropriate!):
>
> . GraphX, part of the Spark project.  This library is based on RDDs and it is
> only accessible via Scala.  It doesn’t look like this library has been
> enhanced recently.
> . GraphFrames, an independent (at the moment?) library for Spark.  This
> library is based on Spark DataFrames and accessible from Scala & Python. Last
> commit on GitHub was 2 months ago.
>
> GraphFrames came about with the promise of being integrated into Apache Spark
> at some point.
>
> I can see other projects coming up with interesting libraries and ideas
> (e.g. Graphulo on Accumulo, a new project with the goal of implementing
> the GraphBlas building blocks for graph algorithms on top of Accumulo).
>
> Where is Apache Spark going?
>
> Where are graph libraries in the roadmap?
>
>
>
> Thanks for any clarity brought to this matter.
>
> Thanks Enzo
>
> Begin forwarded message:
>
> *From: *"Md. Rezaul Karim" 
> *Subject: **Re: Question on Spark's graph libraries*
> *Date: *10 March 2017 at 13:13:15 CET
> *To: *Robin East 
> *Cc: *enzo , spark users <
> u...@spark.apache.org>
>
> +1
>
> Regards,
> _
> *Md. Rezaul Karim*, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
> On 10 March 2017 at 12:10, Robin East  wrote:
>
> I would love to know the answer to that too.
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 9 Mar 2017, at 17:42, enzo  wrote:
>
> I am a bit confused by the current roadmap for graph and graph analytics
> in Apache Spark.
>
> I understand that we have had for some time two libraries (the following
> is my understanding - please amend as appropriate!):
>
> . GraphX, part of the Spark project.  This library is based on RDDs and it is
> only accessible via Scala.  It doesn’t look like this library has been
> enhanced recently.
> . GraphFrames, an independent (at the moment?) library for Spark.  This
> library is based on Spark DataFrames and accessible from Scala & Python. Last
> commit on GitHub was 2 months ago.
>
> GraphFrames came about with the promise of being integrated into Apache Spark
> at some point.
>
> I can see other projects coming up with interesting libraries and ideas
> (e.g. Graphulo on Accumulo, a new project with the goal of implementing
> the GraphBlas building blocks for graph algorithms on top of Accumulo).
>
> Where is Apache Spark going?
>
> Where are graph libraries in the roadmap?
>
>
>
> Thanks for any clarity brought to this matter.
>
> Enzo
>
>
>
>
>


Fwd: Question on Spark's graph libraries roadmap

2017-03-13 Thread enzo
Please see this email  trail:  no answer so far on the user@spark board.  
Trying the developer board for better luck

The question:

I am a bit confused by the current roadmap for graph and graph analytics in 
Apache Spark.

I understand that we have had for some time two libraries (the following is my 
understanding - please amend as appropriate!):

. GraphX, part of the Spark project.  This library is based on RDDs and it is only 
accessible via Scala.  It doesn’t look like this library has been enhanced 
recently.
. GraphFrames, an independent (at the moment?) library for Spark.  This library is 
based on Spark DataFrames and accessible from Scala & Python. Last commit on 
GitHub was 2 months ago.

GraphFrames came about with the promise of being integrated into Apache Spark 
at some point.

I can see other projects coming up with interesting libraries and ideas (e.g. 
Graphulo on Accumulo, a new project with the goal of implementing the GraphBlas 
building blocks for graph algorithms on top of Accumulo).

Where is Apache Spark going?

Where are graph libraries in the roadmap?



Thanks for any clarity brought to this matter.

Thanks Enzo

> Begin forwarded message:
> 
> From: "Md. Rezaul Karim" 
> Subject: Re: Question on Spark's graph libraries
> Date: 10 March 2017 at 13:13:15 CET
> To: Robin East 
> Cc: enzo , spark users 
> 
> +1
> 
> Regards,
> _
> Md. Rezaul Karim, BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics 
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html 
> 
> 
> On 10 March 2017 at 12:10, Robin East wrote:
> I would love to know the answer to that too.
> ---
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action 
> 
> 
> 
> 
> 
> 
>> On 9 Mar 2017, at 17:42, enzo wrote:
>> 
>> I am a bit confused by the current roadmap for graph and graph analytics in 
>> Apache Spark.
>> 
>> I understand that we have had for some time two libraries (the following is 
>> my understanding - please amend as appropriate!):
>> 
>> . GraphX, part of the Spark project.  This library is based on RDDs and it is 
>> only accessible via Scala.  It doesn’t look like this library has been 
>> enhanced recently.
>> . GraphFrames, an independent (at the moment?) library for Spark.  This library 
>> is based on Spark DataFrames and accessible from Scala & Python. Last commit 
>> on GitHub was 2 months ago.
>> 
>> GraphFrames came about with the promise of being integrated into Apache Spark 
>> at some point.
>> 
>> I can see other projects coming up with interesting libraries and ideas 
>> (e.g. Graphulo on Accumulo, a new project with the goal of implementing the 
>> GraphBlas building blocks for graph algorithms on top of Accumulo).
>> 
>> Where is Apache Spark going?
>> 
>> Where are graph libraries in the roadmap?
>> 
>> 
>> 
>> Thanks for any clarity brought to this matter.
>> 
>> Enzo
> 
> 



Re: Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Holden Karau
I'd be happy to do the work of coordinating a 2.1.1 release if that's a
thing a committer can do (I think the release coordinator for the most
recent Arrow release was a committer and the final publish step took a PMC
member to upload but other than that I don't remember any issues).

On Mon, Mar 13, 2017 at 1:05 PM Sean Owen  wrote:

> It seems reasonable to me, in that other x.y.1 releases have followed ~2
> months after the x.y.0 release and it's been about 3 months since 2.1.0.
>
> Related: creating releases is tough work, so I feel kind of bad voting for
> someone else to do that much work. Would it make sense to deputize another
> release manager to help get out just the maintenance releases? this may in
> turn mean maintenance branches last longer. Experienced hands can continue
> to manage new minor and major releases as they require more coordination.
>
> I know most of the release process is written down; I know it's also still
> going to be work to make it 100% documented. Eventually it'll be necessary
> to make sure it's entirely codified anyway.
>
> Not pushing for it myself, just noting I had heard this brought up in side
> conversations before.
>
>
> On Mon, Mar 13, 2017 at 7:07 PM Holden Karau  wrote:
>
> Hi Spark Devs,
>
> Spark 2.1 has been out since end of December
> 
> and we've got quite a few fixes merged for 2.1.1
> 
> .
>
> On the Python side one of the things I'd like to see us get out into a
> patch release is a packaging fix (now merged) before we upload to PyPI &
> Conda, and we also have the normal batch of fixes like toLocalIterator for
> large DataFrames in PySpark.
>
> I've chatted with Felix & Shivaram who seem to think the R side is looking
> close to in good shape for a 2.1.1 release to submit to CRAN (if I've
> miss-spoken my apologies). The two outstanding issues that are being
> tracked for R are SPARK-18817, SPARK-19237.
>
> Looking at the other components quickly it seems like structured streaming
> could also benefit from a patch release.
>
> What do others think - are there any issues people are actively targeting
> for 2.1.1? Is this too early to be considering a patch release?
>
> Cheers,
>
> Holden
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
> --
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Sean Owen
It seems reasonable to me, in that other x.y.1 releases have followed ~2
months after the x.y.0 release and it's been about 3 months since 2.1.0.

Related: creating releases is tough work, so I feel kind of bad voting for
someone else to do that much work. Would it make sense to deputize another
release manager to help get out just the maintenance releases? this may in
turn mean maintenance branches last longer. Experienced hands can continue
to manage new minor and major releases as they require more coordination.

I know most of the release process is written down; I know it's also still
going to be work to make it 100% documented. Eventually it'll be necessary
to make sure it's entirely codified anyway.

Not pushing for it myself, just noting I had heard this brought up in side
conversations before.

On Mon, Mar 13, 2017 at 7:07 PM Holden Karau  wrote:

> Hi Spark Devs,
>
> Spark 2.1 has been out since end of December
> 
> and we've got quite a few fixes merged for 2.1.1
> 
> .
>
> On the Python side one of the things I'd like to see us get out into a
> patch release is a packaging fix (now merged) before we upload to PyPI &
> Conda, and we also have the normal batch of fixes like toLocalIterator for
> large DataFrames in PySpark.
>
> I've chatted with Felix & Shivaram who seem to think the R side is looking
> close to in good shape for a 2.1.1 release to submit to CRAN (if I've
> miss-spoken my apologies). The two outstanding issues that are being
> tracked for R are SPARK-18817, SPARK-19237.
>
> Looking at the other components quickly it seems like structured streaming
> could also benefit from a patch release.
>
> What do others think - are there any issues people are actively targeting
> for 2.1.1? Is this too early to be considering a patch release?
>
> Cheers,
>
> Holden
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Spark Improvement Proposals

2017-03-13 Thread Sean Owen
Responding to your request for a vote, I meant that this isn't required per
se and the consensus here was not to vote on it. Hence the jokes about
meta-voting protocol. In that sense nothing new happened process-wise,
nothing against ASF norms, if that's your concern.

I think it's just an agreed convention now, that we will VOTE, as normal,
on particular types of changes that we call SPIPs. I mean it's no new
process in the ASF sense because VOTEs are an existing mechanic. I
personally view it as, simply, additional guidance about how to manage huge
JIRAs in a way that makes them stand a chance of moving forward. I suppose
we could VOTE about any JIRA if we wanted. They all proceed via lazy
consensus at the moment.

Practically -- I heard support for codifying this process and no objections
to the final form. This was bouncing around in process purgatory, when no
particular new process was called for.

It takes effect immediately, implicitly, like anything else I guess, like
amendments to code style guidelines. Please uses SPIPs to propose big
changes from here.

As to finding it hard to pick out of the noise, sure, I sympathize. Many
big things happen without a VOTE tag though. It does take a time investment
to triage these email lists. I don't know that this by itself means a VOTE
should have happened.

On Mon, Mar 13, 2017 at 6:15 PM Tom Graves  wrote:

> Another thing I think you should send out is when exactly this takes
> effect.  Is it any major new feature without a pull request?  Is it
> anything major starting with the 2.3 release?
>
> Tom
>
>
> On Monday, March 13, 2017 1:08 PM, Tom Graves 
> wrote:
>
>
> I'm not sure how you can say it's not a new process.  If that is the case,
> why do we need a page documenting it?
> As a developer, if I want to put up a major improvement I now have to
> follow the SPIP, whereas before I didn't; that certainly seems like a new
> process.  As a PMC member I now have the ability to vote on these SPIPs, and
> that seems like something new again.
>
> There are Apache bylaws and then there are project-specific bylaws.  As
> far as I know Spark doesn't document any of its project-specific bylaws, so
> I guess this isn't officially a change to them, but it was implicit before
> that you didn't need any review for major improvements; now you need
> an explicit vote for them to be approved.  It certainly seems to fall under
> the "Procedural" section in the voting link you sent.
>
> I understand this was under discussion for a while and you have asked for
> people's feedback multiple times.  But sometimes long threads are easy to
> ignore.  That is why personally I like to see things labelled [VOTE],
> [ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like
> this.
>
> I don't really want to draw this out or argue anymore about it, if I
> really wanted a vote I guess I would -1 the change. I'm not going to do
> that.
> I would at least like to see an announcement go out about it.  The last
> thing I saw you say was you were going to call a vote.  A few people chimed
> in with their thoughts on that vote, but nothing was said after that.
>
> Tom
>
>
>
> On Monday, March 13, 2017 12:36 PM, Sean Owen  wrote:
>
>
> It's not a new process, in that it doesn't entail anything not already in
> http://apache.org/foundation/voting.html . We're just deciding to call a
> VOTE for this type of code modification.
>
> To your point -- yes, it's been around a long time with no further
> comment, and I called several times for more input. That's pretty strong
> lazy consensus of the form we use every day.
>
> On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:
>
> It seems like if you are adding responsibilities you should do a vote.
> SPIP'S require votes from PMC members so you are now putting more
> responsibility on them. It feels like we should have an official vote to
> make sure they (PMC members) agree with that and to make sure everyone pays
> attention to it.  That thread has been there for a while just as discussion
> and now all of a sudden its implemented without even an announcement being
> sent out about it.
>
> Tom
>
>
>
>
>
>


Re: Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Felix Cheung
+1
there are a lot of good fixes in overall and we need a release for Python and R 
packages.



From: Holden Karau 
Sent: Monday, March 13, 2017 12:06:47 PM
To: Felix Cheung; Shivaram Venkataraman; dev@spark.apache.org
Subject: Should we consider a Spark 2.1.1 release?

Hi Spark Devs,

Spark 2.1 has been out since end of 
December
 and we've got quite a few fixes merged for 
2.1.1.

On the Python side one of the things I'd like to see us get out into a patch 
release is a packaging fix (now merged) before we upload to PyPI & Conda, and 
we also have the normal batch of fixes like toLocalIterator for large 
DataFrames in PySpark.

I've chatted with Felix & Shivaram who seem to think the R side is looking 
close to in good shape for a 2.1.1 release to submit to CRAN (if I've 
miss-spoken my apologies). The two outstanding issues that are being tracked 
for R are SPARK-18817, SPARK-19237.

Looking at the other components quickly it seems like structured streaming 
could also benefit from a patch release.

What do others think - are there any issues people are actively targeting for 
2.1.1? Is this too early to be considering a patch release?

Cheers,

Holden
--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Should we consider a Spark 2.1.1 release?

2017-03-13 Thread Holden Karau
Hi Spark Devs,

Spark 2.1 has been out since end of December

and we've got quite a few fixes merged for 2.1.1

.

On the Python side one of the things I'd like to see us get out into a
patch release is a packaging fix (now merged) before we upload to PyPI &
Conda, and we also have the normal batch of fixes like toLocalIterator for
large DataFrames in PySpark.

I've chatted with Felix & Shivaram who seem to think the R side is looking
close to in good shape for a 2.1.1 release to submit to CRAN (if I've
miss-spoken my apologies). The two outstanding issues that are being
tracked for R are SPARK-18817, SPARK-19237.

Looking at the other components quickly it seems like structured streaming
could also benefit from a patch release.

What do others think - are there any issues people are actively targeting
for 2.1.1? Is this too early to be considering a patch release?

Cheers,

Holden
-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
Another thing I think you should send out is when exactly this takes 
effect.  Is it any major new feature without a pull request?  Is it anything 
major starting with the 2.3 release?
Tom 

On Monday, March 13, 2017 1:08 PM, Tom Graves 
 wrote:
 

 I'm not sure how you can say it's not a new process.  If that is the case, why 
do we need a page documenting it?
As a developer, if I want to put up a major improvement I now have to follow the 
SPIP, whereas before I didn't; that certainly seems like a new process.  As a PMC 
member I now have the ability to vote on these SPIPs, and that seems like something 
new again.
There are Apache bylaws and then there are project-specific bylaws.  As far as 
I know Spark doesn't document any of its project-specific bylaws, so I guess 
this isn't officially a change to them, but it was implicit before that you 
didn't need any review for major improvements; now you need an explicit 
vote for them to be approved.  It certainly seems to fall under the "Procedural" 
section in the voting link you sent.
I understand this was under discussion for a while and you have asked for 
people's feedback multiple times.  But sometimes long threads are easy to 
ignore.  That is why personally I like to see things labelled [VOTE], 
[ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like this. 
I don't really want to draw this out or argue anymore about it, if I really 
wanted a vote I guess I would -1 the change. I'm not going to do that. I would 
at least like to see an announcement go out about it.  The last thing I saw you 
say was you were going to call a vote.  A few people chimed in with their 
thoughts on that vote, but nothing was said after that. 
Tom

 

On Monday, March 13, 2017 12:36 PM, Sean Owen  wrote:
 

 It's not a new process, in that it doesn't entail anything not already in 
http://apache.org/foundation/voting.html . We're just deciding to call a VOTE 
for this type of code modification.
To your point -- yes, it's been around a long time with no further comment, and 
I called several times for more input. That's pretty strong lazy consensus of 
the form we use every day. 

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:

It seems like if you are adding responsibilities you should do a vote.  SPIPs 
require votes from PMC members so you are now putting more responsibility on 
them. It feels like we should have an official vote to make sure they (PMC 
members) agree with that and to make sure everyone pays attention to it.  That 
thread has been there for a while just as discussion and now all of a sudden 
its implemented without even an announcement being sent out about it. 
Tom 



   

   

Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
I'm not sure how you can say it's not a new process.  If that is the case, why do 
we need a page documenting it?
As a developer, if I want to put up a major improvement I now have to follow the 
SPIP, whereas before I didn't; that certainly seems like a new process.  As a PMC 
member I now have the ability to vote on these SPIPs, and that seems like something 
new again.
There are Apache bylaws and then there are project-specific bylaws.  As far as 
I know Spark doesn't document any of its project-specific bylaws, so I guess 
this isn't officially a change to them, but it was implicit before that you 
didn't need any review for major improvements; now you need an explicit 
vote for them to be approved.  It certainly seems to fall under the "Procedural" 
section in the voting link you sent.
I understand this was under discussion for a while and you have asked for 
people's feedback multiple times.  But sometimes long threads are easy to 
ignore.  That is why personally I like to see things labelled [VOTE], 
[ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like this. 
I don't really want to draw this out or argue anymore about it, if I really 
wanted a vote I guess I would -1 the change. I'm not going to do that. I would 
at least like to see an announcement go out about it.  The last thing I saw you 
say was you were going to call a vote.  A few people chimed in with their 
thoughts on that vote, but nothing was said after that. 
Tom

 

On Monday, March 13, 2017 12:36 PM, Sean Owen  wrote:
 

 It's not a new process, in that it doesn't entail anything not already in 
http://apache.org/foundation/voting.html . We're just deciding to call a VOTE 
for this type of code modification.
To your point -- yes, it's been around a long time with no further comment, and 
I called several times for more input. That's pretty strong lazy consensus of 
the form we use every day. 

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:

It seems like if you are adding responsibilities you should do a vote.  SPIPs 
require votes from PMC members so you are now putting more responsibility on 
them. It feels like we should have an official vote to make sure they (PMC 
members) agree with that and to make sure everyone pays attention to it.  That 
thread has been there for a while just as discussion and now all of a sudden 
its implemented without even an announcement being sent out about it. 
Tom 



   

Re: Spark Improvement Proposals

2017-03-13 Thread Sean Owen
It's not a new process, in that it doesn't entail anything not already in
http://apache.org/foundation/voting.html . We're just deciding to call a
VOTE for this type of code modification.

To your point -- yes, it's been around a long time with no further comment,
and I called several times for more input. That's pretty strong lazy
consensus of the form we use every day.

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves  wrote:

> It seems like if you are adding responsibilities you should do a vote.
> SPIPs require votes from PMC members so you are now putting more
> responsibility on them. It feels like we should have an official vote to
> make sure they (PMC members) agree with that and to make sure everyone pays
> attention to it.  That thread has been there for a while just as discussion
> and now all of a sudden its implemented without even an announcement being
> sent out about it.
>
> Tom
>
>


Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
It seems like if you are adding responsibilities you should do a vote.  SPIPs 
require votes from PMC members so you are now putting more responsibility on 
them. It feels like we should have an official vote to make sure they (PMC 
members) agree with that and to make sure everyone pays attention to it.  That 
thread has been there for a while just as discussion and now all of a sudden 
its implemented without even an announcement being sent out about it. 
Tom 

On Monday, March 13, 2017 11:37 AM, Sean Owen  wrote:
 

 This ended up proceeding as a normal doc change, instead of precipitating a 
meta-vote. However, the text that's on the web site now can certainly be further 
amended if anyone wants to propose a change from here.
On Mon, Mar 13, 2017 at 1:50 PM Tom Graves  wrote:

I think a vote here would be good. I think most of the discussion was done by 4 
or 5 people and it's a long thread.  If nothing else it summarizes everything 
and gets people's attention to the change.
Tom 

On Thursday, March 9, 2017 10:55 AM, Sean Owen  wrote:
 

 I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. Nah, 
anyone can call a vote. This really isn't that formal. We just want to declare 
and document consensus.
I think SPIP is just a remix of existing process anyway, and don't think it 
will actually do much anyway, which is why I am sanguine about the whole thing.
To bring this to a conclusion, I will just put the contents of the doc in an 
email tomorrow for a VOTE. Raise any objections now.
On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:

I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)




   


   

Re: Spark Improvement Proposals

2017-03-13 Thread Sean Owen
This ended up proceeding as a normal doc change, instead of precipitating a
meta-vote.
However, the text that's on the web site now can certainly be further
amended if anyone wants to propose a change from here.

On Mon, Mar 13, 2017 at 1:50 PM Tom Graves  wrote:

> I think a vote here would be good. I think most of the discussion was done
> by 4 or 5 people and it's a long thread.  If nothing else it summarizes
> everything and gets people's attention to the change.
>
> Tom
>
>
> On Thursday, March 9, 2017 10:55 AM, Sean Owen  wrote:
>
>
> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>
> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:
>
> I started this idea as a fork with a merge-able change to docs.
> Reynold moved it to his google doc, and has suggested during this
> email thread that a vote should occur.
> If a vote needs to occur, I can't see anything on
> http://apache.org/foundation/voting.html suggesting that I can call
> for a vote, which is why I'm asking PMC members to do it since they're
> the ones who would vote anyway.
> Now Sean is saying this is a code/doc change that can just be reviewed
> and merged as usual...which is what I tried to do to begin with.
>
> The fact that you haven't agreed on a process to agree on your process
> is, I think, an indication that the process really does need
> improvement ;)
>
>
>
>


Re: Spark Local Pipelines

2017-03-13 Thread Asher Krim
Thanks for the feedback.

If we strip away all of the fancy stuff, my proposal boils down to exposing
the logic used in Spark's ML library. In an ideal world, Spark would
possibly have relied on an existing ML implementation rather than
reimplementing one, since there's very little that's Spark-specific about
applying ML models. As Sean says, it may make the most sense to have local
pipelines live outside of Spark. However, it would be really beneficial for
Spark ML pipeline adoption if they used non-Spark logic. This would eliminate
issues with train-serve skew and close the potential for bugs.
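
To make the single-request cost concrete, here is a minimal sketch (the model
path and schema are hypothetical) of what scoring one record through a fitted
PipelineModel looks like today -- every call wraps the record in a one-row
DataFrame and plans a Spark query:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-request-scoring").getOrCreate()
val model = PipelineModel.load("/models/example-pipeline")  // hypothetical path

def scoreOne(features: Vector): Double = {
  val df = spark.createDataFrame(Seq(Tuple1(features))).toDF("features")
  // transform() plans and executes a (tiny) Spark query per request --
  // the overhead a Spark-free local pipeline would avoid.
  model.transform(df).select("prediction").head().getDouble(0)
}

println(scoreOne(Vectors.dense(0.5, 1.2, 3.4)))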

I'll leave some more comments inline on Sean's response:

I'm skeptical.  Serving synchronous queries from a model at scale is a
fundamentally different activity. As you note, it doesn't logically involve
Spark. If it has to happen in milliseconds it's going to be in-core.
Scoring even 10qps with a Spark job per request is probably a non-starter;
think of the thousands of tasks per second and the overhead of just
tracking them.

When you say the RDDs support point prediction, I think you mean that those
older models expose a method to score a Vector. They are not somehow
exposing distributed point prediction. You could add this to the newer
models, but it raises the question of how to make the Row to feed it? the
.mllib punts on this and assumes you can construct the Vector.
AK: In my mind, punting is exactly the right solution - no overhead, full
control to the user

I think this sweeps a lot under the rug in assuming that there can just be
a "local" version of every Transformer -- but, even if there could be,
consider how much extra implementation that is. Lots of them probably could
be but I'm not sure that all can.
AK: I'm not aware of models for which this is not possible - there are no
Spark-only algorithms that I'm aware of. The work to convert Spark models to
local ones may be more involved for some implementations, sure, but I don't
think any would be too bad. However if there is something that's
impossible, then that's fine too. I'm not sure we have to commit to having
local versions for every single model

The bigger problem in my experience is the Pipelines don't generally
encapsulate the entire pipeline from source data to score. They encapsulate
the part after computing underlying features. That is, if one of your
features is "total clicks from this user", that's the product of a
DataFrame operation that precedes a Pipeline. This can't be turned into a
non-distributed, non-Spark local version.
AK: That's a great point, and a really good argument for keeping any local
pipeline logic outside of Spark

Solving subsets of this problem could still be useful, and you've
highlighted some external projects that try. I'd also highlight PMML as an
established interchange format for just the model part, and for cases that
don't involve much or any pipeline, it's a better fit paired with a library
that can score from PMML.
AK: The problem with solutions like PMML is that they can tell you WHAT to
do, but not HOW EXACTLY to do it. At the end of the day, the best
model-description possible would be the metadata+ the code itself. That's
the crux of my proposal - expose the implementation so users can use Spark
models with the same exact code that was used to train

I think this is one of those things that could live outside the project,
because it's more not-Spark than Spark. Remember too that building a
solution into the project blesses one at the expense of others.

Asher Krim
Senior Software Engineer

On Mon, Mar 13, 2017 at 11:08 AM, Dongjin Lee  wrote:

> Although I love the cool idea of Asher, I'd rather +1 for Sean's view; I
> think it would be much better to live outside of the project.
>
> Best,
> Dongjin
>
> On Mon, Mar 13, 2017 at 5:39 PM, Sean Owen  wrote:
>
>> I'm skeptical.  Serving synchronous queries from a model at scale is a
>> fundamentally different activity. As you note, it doesn't logically involve
>> Spark. If it has to happen in milliseconds it's going to be in-core.
>> Scoring even 10qps with a Spark job per request is probably a non-starter;
>> think of the thousands of tasks per second and the overhead of just
>> tracking them.
>>
>> When you say the RDDs support point prediction, I think you mean that
>> those older models expose a method to score a Vector. They are not somehow
>> exposing distributed point prediction. You could add this to the newer
>> models, but it raises the question of how to make the Row to feed it? the
>> .mllib punts on this and assumes you can construct the Vector.
>>
>> I think this sweeps a lot under the rug in assuming that there can just
>> be a "local" version of every Transformer -- but, even if there could be,
>> consider how much extra implementation that is. Lots of them probably could
>> be but I'm not sure that all can.
>>
>> The bigger problem in my experience is the Pipelines don't generally
>> encapsulate the entire pipeline 

Re: Spark Local Pipelines

2017-03-13 Thread Dongjin Lee
Although I love the cool idea of Asher, I'd rather +1 for Sean's view; I
think it would be much better to live outside of the project.

Best,
Dongjin

On Mon, Mar 13, 2017 at 5:39 PM, Sean Owen  wrote:

> I'm skeptical.  Serving synchronous queries from a model at scale is a
> fundamentally different activity. As you note, it doesn't logically involve
> Spark. If it has to happen in milliseconds it's going to be in-core.
> Scoring even 10qps with a Spark job per request is probably a non-starter;
> think of the thousands of tasks per second and the overhead of just
> tracking them.
>
> When you say the RDDs support point prediction, I think you mean that
> those older models expose a method to score a Vector. They are not somehow
> exposing distributed point prediction. You could add this to the newer
> models, but it raises the question of how to make the Row to feed it? the
> .mllib punts on this and assumes you can construct the Vector.
>
> I think this sweeps a lot under the rug in assuming that there can just be
> a "local" version of every Transformer -- but, even if there could be,
> consider how much extra implementation that is. Lots of them probably could
> be but I'm not sure that all can.
>
> The bigger problem in my experience is the Pipelines don't generally
> encapsulate the entire pipeline from source data to score. They encapsulate
> the part after computing underlying features. That is, if one of your
> features is "total clicks from this user", that's the product of a
> DataFrame operation that precedes a Pipeline. This can't be turned into a
> non-distributed, non-Spark local version.
>
> Solving subsets of this problem could still be useful, and you've
> highlighted some external projects that try. I'd also highlight PMML as an
> established interchange format for just the model part, and for cases that
> don't involve much or any pipeline, it's a better fit paired with a library
> that can score from PMML.
>
> I think this is one of those things that could live outside the project,
> because it's more not-Spark than Spark. Remember too that building a
> solution into the project blesses one at the expense of others.
>
>
> On Sun, Mar 12, 2017 at 10:15 PM Asher Krim  wrote:
>
>> Hi All,
>>
>> I spent a lot of time at Spark Summit East this year talking with Spark
>> developers and committers about challenges with productizing Spark. One of
>> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
>> of a way to serve single requests with any reasonable performance.
>> SPARK-10413 explores adding methods for single item prediction, but I'd
>> like to explore a more holistic approach - a separate local api, with
>> models that support transformations without depending on Spark at all.
>>
>> I've written up a doc
>> 
>> detailing the approach, and I'm happy to discuss alternatives. If this
>> gains traction, I can create a branch with a minimal example on a simple
>> transformer (probably something like CountVectorizerModel) so we have
>> something concrete to continue the discussion on.
>>
>> Thanks,
>> Asher Krim
>> Senior Software Engineer
>>
>


-- 
Dongjin Lee

Software developer in Line+. So interested in massive-scale machine learning.
facebook: www.facebook.com/dongjin.lee.kr
linkedin: kr.linkedin.com/in/dongjinleekr
github: github.com/dongjinleekr
twitter: www.twitter.com/dongjinleekr


Re: Spark Improvement Proposals

2017-03-13 Thread Tom Graves
I think a vote here would be good. I think most of the discussion was done by 4 
or 5 people and it's a long thread.  If nothing else it summarizes everything 
and gets people's attention to the change.
Tom 

On Thursday, March 9, 2017 10:55 AM, Sean Owen  wrote:
 

 I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. Nah, 
anyone can call a vote. This really isn't that formal. We just want to declare 
and document consensus.
I think SPIP is just a remix of existing process anyway, and don't think it 
will actually do much anyway, which is why I am sanguine about the whole thing.
To bring this to a conclusion, I will just put the contents of the doc in an 
email tomorrow for a VOTE. Raise any objections now.
On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger  wrote:

I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)




   

Re: how to construct parameter for model.transform() from datafile

2017-03-13 Thread jinhong lu
Anyone help?

> On 13 Mar 2017, at 19:38, jinhong lu wrote:
> 
> After training the model, I got results that look like this:
> 
> 
>   scala>  predictionResult.show()
>   
> +-++++--+
>   |label|features|   rawPrediction| 
> probability|prediction|
>   
> +-++++--+
>   |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|  
>  0.0|
>   |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|  
>  0.0|
>   |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|  
>  1.0|
> 
> And then, I transform() the data with this code:
> 
>   import org.apache.spark.ml.linalg.Vectors
>   import org.apache.spark.ml.linalg.Vector
>   import scala.collection.mutable
> 
>  def lineToVector(line:String ):Vector={
>   val seq = new mutable.Queue[(Int,Double)]
>   val content = line.split(" ");
>   for( s <- content){
> val index = s.split(":")(0).toInt
> val value = s.split(":")(1).toDouble
>  seq += ((index,value))
>   }
>   return Vectors.sparse(144109, seq)
> }
> 
>val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, 
> org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
>   .map(line => line._2)
>   .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
>   .toDF("udid", "features")
>val predictionResult = model.transform(df)
>predictionResult.show()
> 
> 
> But I got an error that looks like this:
> 
> Caused by: java.lang.IllegalArgumentException: requirement failed: You may 
> not write an element to index 804201 because the declared size of your vector 
> is 144109
>  at scala.Predef$.require(Predef.scala:224)
>  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
>  at lineToVector(:55)
>  at $anonfun$4.apply(:50)
>  at $anonfun$4.apply(:50)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> 
> So I change
> 
>   return Vectors.sparse(144109, seq)
> 
> to 
> 
>   return Vectors.sparse(804202, seq)
> 
> Another error occurs:
> 
>   Caused by: java.lang.IllegalArgumentException: requirement failed: The 
> columns of A don't match the number of elements of x. A: 144109, x: 804202
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
> at 
> org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
> at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
> 
> what should I do?
>> On 13 Mar 2017, at 16:31, jinhong lu wrote:
>> 
>> Hi, all:
>> 
>> I got these training data:
>> 
>>  0 31607:17
>>  0 111905:36
>>  0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 
>> 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 
>> 112109:4 123305:48 142509:1
>>  0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
>>  0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 
>> 31607:19
>>  0 19109:7 29705:4 123305:32
>>  0 15309:1 43005:1 108509:1
>>  1 604:1 6401:1 6503:1 15207:4 31607:40
>>  0 1807:19
>>  0 301:14 501:1 1502:14 2507:12 123305:4
>>  0 607:14 19109:460 123305:448
>>  0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 
>> 128209:1
>>  1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 
>> 27709:2 56509:8 122705:62 123305:31 124005:2
>> 
>> And then I train the model with Spark:
>> 
>>  import org.apache.spark.ml.classification.NaiveBayes
>>  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>>  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
>>  import org.apache.spark.sql.SparkSession
>> 
>>  val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()

Re: how to construct parameter for model.transform() from datafile

2017-03-13 Thread jinhong lu
After training the model, I got a result that looks like this:

scala> predictionResult.show()
+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
|  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|

And then I transform() the data with this code:

  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.ml.linalg.Vector
  import scala.collection.mutable

  def lineToVector(line: String): Vector = {
    val seq = new mutable.Queue[(Int, Double)]
    val content = line.split(" ")
    for (s <- content) {
      val index = s.split(":")(0).toInt
      val value = s.split(":")(1).toDouble
      seq += ((index, value))
    }
    return Vectors.sparse(144109, seq)
  }

  val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text](
      "/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
    .map(line => line._2)
    .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
    .toDF("udid", "features")
  val predictionResult = model.transform(df)
  predictionResult.show()


But I got an error that looks like this:

 Caused by: java.lang.IllegalArgumentException: requirement failed: You may not 
write an element to index 804201 because the declared size of your vector is 
144109
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
  at lineToVector(:55)
  at $anonfun$4.apply(:50)
  at $anonfun$4.apply(:50)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)

So I change

return Vectors.sparse(144109, seq)

to 

return Vectors.sparse(804202, seq)

Another error occurs:

Caused by: java.lang.IllegalArgumentException: requirement failed: The 
columns of A don't match the number of elements of x. A: 144109, x: 804202
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
  at 
org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
  at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)

What should I do?
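
One possible direction, sketched under two assumptions: the new file uses the
same 1-based libsvm index convention as the training data, and feature indices
the model has never seen (anything at or above model.numFeatures) can simply be
dropped, since the model has no weights for them anyway. The helper below is
hypothetical, not part of any Spark API:

  import org.apache.spark.ml.linalg.{Vector, Vectors}

  // Keep the vector size consistent with the trained model instead of
  // hard-coding it, e.g. pass in model.numFeatures (144109 for the model
  // trained earlier in this thread).
  def lineToVector(line: String, numFeatures: Int): Vector = {
    val pairs = line.split(" ").map { s =>
      val Array(index, value) = s.split(":")
      // libsvm indices are 1-based; Vectors.sparse expects 0-based indices
      (index.toInt - 1, value.toDouble)
    }
    // Drop features outside the model's feature space; the model cannot use them.
    val inRange = pairs.filter { case (i, _) => i >= 0 && i < numFeatures }
    Vectors.sparse(numFeatures, inRange)
  }

If the new file genuinely contains features the model should be using, the
cleaner fix is to retrain with a feature space large enough to cover both the
training and the scoring data, so both sides agree on numFeatures.
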
> On 13 Mar 2017, at 16:31, jinhong lu wrote:
> 
> Hi, all:
> 
> I have this training data:
> 
>   0 31607:17
>   0 111905:36
>   0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 
> 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 
> 112109:4 123305:48 142509:1
>   0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
>   0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 
> 31607:19
>   0 19109:7 29705:4 123305:32
>   0 15309:1 43005:1 108509:1
>   1 604:1 6401:1 6503:1 15207:4 31607:40
>   0 1807:19
>   0 301:14 501:1 1502:14 2507:12 123305:4
>   0 607:14 19109:460 123305:448
>   0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 
> 128209:1
>   1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 
> 27709:2 56509:8 122705:62 123305:31 124005:2
> 
> And then I train the model with Spark:
> 
>   import org.apache.spark.ml.classification.NaiveBayes
>   import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>   import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
>   import org.apache.spark.sql.SparkSession
> 
>   val spark = 
> SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
>   val data = 
> spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
>   val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 
> 

Re: Spark Local Pipelines

2017-03-13 Thread Sean Owen
I'm skeptical.  Serving synchronous queries from a model at scale is a
fundamentally different activity. As you note, it doesn't logically involve
Spark. If it has to happen in milliseconds it's going to be in-core.
Scoring even 10qps with a Spark job per request is probably a non-starter;
think of the thousands of tasks per second and the overhead of just
tracking them.

When you say the RDDs support point prediction, I think you mean that those
older models expose a method to score a single Vector; they are not somehow
exposing distributed point prediction. You could add this to the newer
models, but it raises the question of how to construct the Row to feed them.
The .mllib API punts on this and assumes you can construct the Vector.
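
A rough sketch of that difference, using NaiveBayes only because it is small
enough to train on a couple of made-up rows (`sc` and `spark` are the usual
spark-shell handles; the data and names are invented for illustration):

  import org.apache.spark.mllib.classification.NaiveBayes
  import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.ml.classification.{NaiveBayes => NewNaiveBayes}
  import org.apache.spark.ml.linalg.{Vectors => NewVectors}

  // Tiny made-up data, just to have one model of each flavor to compare.
  val oldTrain = sc.parallelize(Seq(
    LabeledPoint(0.0, OldVectors.dense(1.0, 0.0)),
    LabeledPoint(1.0, OldVectors.dense(0.0, 1.0))
  ))
  val oldModel = NaiveBayes.train(oldTrain)

  // spark.mllib: a local method that scores a single Vector -- no job per request.
  val score: Double = oldModel.predict(OldVectors.dense(1.0, 0.0))

  val newTrain = spark.createDataFrame(Seq(
    (0.0, NewVectors.dense(1.0, 0.0)),
    (1.0, NewVectors.dense(0.0, 1.0))
  )).toDF("label", "features")
  val newModel = new NewNaiveBayes().fit(newTrain)

  // spark.ml: even a single record has to go through a one-row DataFrame
  // and transform(), i.e. a Spark job.
  val oneRow = spark.createDataFrame(Seq(Tuple1(NewVectors.dense(1.0, 0.0)))).toDF("features")
  val scored = newModel.transform(oneRow)
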

I think this sweeps a lot under the rug in assuming that there can just be
a "local" version of every Transformer -- but, even if there could be,
consider how much extra implementation that is. Lots of them probably could
be but I'm not sure that all can.

The bigger problem in my experience is the Pipelines don't generally
encapsulate the entire pipeline from source data to score. They encapsulate
the part after computing underlying features. That is, if one of your
features is "total clicks from this user", that's the product of a
DataFrame operation that precedes a Pipeline. This can't be turned into a
non-distributed, non-Spark local version.
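
A toy illustration of that split (the table, column names, and aggregation
are all invented; `spark` is the spark-shell SparkSession):

  import org.apache.spark.ml.feature.VectorAssembler

  // Made-up raw event table standing in for the click data.
  val clicks = spark.createDataFrame(Seq(
    ("u1", "a.html"), ("u1", "b.html"), ("u2", "a.html")
  )).toDF("user_id", "url")

  // Plain DataFrame aggregation that happens *before* any Pipeline stage,
  // so a purely "local" pipeline would never see it.
  val perUser = clicks.groupBy("user_id").count().withColumnRenamed("count", "total_clicks")

  // Only from here on does the ML Pipeline take over.
  val assembled = new VectorAssembler()
    .setInputCols(Array("total_clicks"))
    .setOutputCol("features")
    .transform(perUser)
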

Solving subsets of this problem could still be useful, and you've
highlighted some external projects that try. I'd also highlight PMML as an
established interchange format for just the model part, and for cases that
don't involve much or any pipeline, it's a better fit paired with a library
that can score from PMML.
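
For what it's worth, a minimal sketch of that route, assuming you start from
one of the spark.mllib models that implement PMMLExportable (the spark.ml
models do not expose toPMML; `sc` is the spark-shell SparkContext and the
training data and output path are made up):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Tiny made-up training set, just to have a model to export.
  val training = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
    LabeledPoint(0.0, Vectors.dense(0.5, 1.5)),
    LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
    LabeledPoint(1.0, Vectors.dense(1.5, 0.5))
  ))
  val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

  // PMMLExportable: export just the model part as PMML XML, which a
  // Spark-free scorer (e.g. a JPMML-based library) can then load.
  model.toPMML("/tmp/lr-model.pmml")   // or model.toPMML() to get the XML as a String

The pipeline/feature-engineering part still has to be reproduced on the
serving side, which is exactly the limitation discussed above.
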

I think this is one of those things that could live outside the project,
because it's more not-Spark than Spark. Remember too that building a
solution into the project blesses one at the expense of others.


On Sun, Mar 12, 2017 at 10:15 PM Asher Krim  wrote:

> Hi All,
>
> I spent a lot of time at Spark Summit East this year talking with Spark
> developers and committers about challenges with productizing Spark. One of
> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
> of a way to serve single requests with any reasonable performance.
> SPARK-10413 explores adding methods for single item prediction, but I'd
> like to explore a more holistic approach - a separate local api, with
> models that support transformations without depending on Spark at all.
>
> I've written up a doc
> 
> detailing the approach, and I'm happy to discuss alternatives. If this
> gains traction, I can create a branch with a minimal example on a simple
> transformer (probably something like CountVectorizerModel) so we have
> something concrete to continue the discussion on.
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>


Re: Spark Local Pipelines

2017-03-13 Thread Georg Heiler
Great idea. I see the same problem.
I would also suggest checking the following projects as a kick-start (not
only mleap):
https://github.com/ucbrise/clipper and
https://github.com/Hydrospheredata/mist

Regards Georg
Asher Krim wrote on Sun, 12 Mar 2017 at 23:21:

> Hi All,
>
> I spent a lot of time at Spark Summit East this year talking with Spark
> developers and committers about challenges with productizing Spark. One of
> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
> of a way to serve single requests with any reasonable performance.
> SPARK-10413 explores adding methods for single item prediction, but I'd
> like to explore a more holistic approach - a separate local api, with
> models that support transformations without depending on Spark at all.
>
> I've written up a doc
> 
> detailing the approach, and I'm happy to discuss alternatives. If this
> gains traction, I can create a branch with a minimal example on a simple
> transformer (probably something like CountVectorizerModel) so we have
> something concrete to continue the discussion on.
>
> Thanks,
> Asher Krim
> Senior Software Engineer
>