Re: DataSourceV2 sync tomorrow

2018-11-13 Thread Arun Mahadevan
IMO, the currentOffset should not be optional.
For continuous mode I assume this offset gets periodically checkpointed
(so it's mandatory)?
For the micro-batch mode the currentOffset would be the start offset for a
micro-batch.

And if the micro-batch could be executed without knowing the 'latest'
offset (say, until 'next' returns false), we only need the current offset
(to figure out the offset boundaries of a micro-batch), and maybe the
'latest' offset is not needed at all.
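
To make that concrete, here is a rough Scala-flavored sketch against the
StreamPartitionReader interface quoted below; runBoundedTask and process are
illustrative names, not part of the proposal:

def runBoundedTask[T](reader: StreamPartitionReader[T],
                      start: Offset,
                      process: T => Unit): Offset = {
  var last: Offset = start
  while (reader.next()) {                        // no 'latest' offset needed:
    process(reader.get())                        // the reader itself bounds the batch
    reader.currentOffset().ifPresent(o => last = o)
  }
  last                                           // checkpointed as the micro-batch boundary
}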

- Arun


On Tue, 13 Nov 2018 at 16:01, Ryan Blue  wrote:

> Hi everyone,
> I just wanted to send out a reminder that there’s a DSv2 sync tomorrow at
> 17:00 PST, which is 01:00 UTC.
>
> Here are some of the topics under discussion in the last couple of weeks:
>
>    - Read API for v2 - see Wenchen’s doc
>    - Capabilities API - see the dev list thread
>    - Using CatalogTableIdentifier to reliably separate v2 code paths -
>    see PR #21978
>    - A replacement for InternalRow
>
> I know that a lot of people are also interested in combining the source
> API for micro-batch and continuous streaming. Wenchen and I have been
> discussing a way to do that and Wenchen has added it to the Read API doc as
> Alternative #2. I think this would be a good thing to plan on discussing.
>
> rb
>
> Here’s some additional background on combining micro-batch and continuous
> APIs:
>
> The basic idea is to update how tasks end so that the same tasks can be
> used in micro-batch or streaming. For tasks that are naturally limited like
> data files, when the data is exhausted, Spark stops reading. For tasks that
> are not limited, like a Kafka partition, Spark decides when to stop in
> micro-batch mode by hitting a pre-determined LocalOffset or Spark can just
> keep running in continuous mode.
>
> Note that a task deciding to stop can happen in both modes, either when a
> task is exhausted in micro-batch or when a stream needs to be reconfigured
> in continuous.
>
> Here’s the task reader API. The offset returned is optional so that a task
> can avoid stopping if there isn’t a resumable offset, like if it is in the
> middle of an input file:
>
> interface StreamPartitionReader<T> extends InputPartitionReader<T> {
>   Optional<Offset> currentOffset();
>   boolean next();  // from InputPartitionReader
>   T get();         // from InputPartitionReader
> }
>
> The streaming code would look something like this:
>
> Stream stream = scan.toStream()
> StreamReaderFactory factory = stream.createReaderFactory()
>
> while (true) {
>   Offset start = stream.currentOffset()
>   Offset end = if (isContinuousMode) {
> None
>   } else {
> // rate limiting would happen here
> Some(stream.latestOffset())
>   }
>
>   InputPartition[] parts = stream.planInputPartitions(start)
>
>   // returns when needsReconfiguration is true or all tasks finish
>   runTasks(parts, factory, end)
>
>   // the stream's current offset has been updated at the last epoch
> }
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


New PySpark test style

2018-11-13 Thread Hyukjin Kwon
Hi all,

Lately, https://github.com/apache/spark/pull/23021 was merged, which splits
a big single file that contains all the tests into smaller files.

I picked one example to follow: NumPy, because the new style is close to
NumPy's structure and looks easier to follow. Please see
https://github.com/numpy/numpy/tree/master/numpy.

I would like to say:

1. If you were working on PySpark changes, I am terribly sorry that I
caused some conflicts. Please take a look at the new style and get used
to it.

2. I probably rushed a bit. I was worried it would keep causing conflicts
again and again, but please take a look and see if there are outstanding
structural issues. I am willing to fix them, or revert in the worst case.

3. For now only PySpark's SQL tests were split; other tests will be split
in the same manner too.

Thank you so much.


Re: DataSourceV2 sync tomorrow

2018-11-13 Thread Cody Koeninger
Am I the only one for whom the livestream link didn't work last time?
Would like to be able to at least watch the discussion this time
around.
On Tue, Nov 13, 2018 at 6:01 PM Ryan Blue  wrote:
>
> Hi everyone,
> I just wanted to send out a reminder that there’s a DSv2 sync tomorrow at 
> 17:00 PST, which is 01:00 UTC.
>
> [...]




Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Reynold Xin
I used to, before each release during the RC phase, go through every single doc
page to make sure we don’t unintentionally leave things public. I no longer
have time to do that, unfortunately. I found it very useful because I always
caught some mistakes introduced through organic development.

> On Nov 13, 2018, at 8:00 PM, Wenchen Fan  wrote:
> 
> > Could you clarify what you mean here? Mima has some known limitations such 
> > as not handling "private[blah]" very well
> 
> Yes that's what I mean.
> 
> What I want to know here is, which classes/methods we expect them to be 
> private. I think things marked as "private[blabla]" are expected to be 
> private for sure, it's just the MiMa and doc generator can't handle it well. 
> We can fix them later, by using the @Private annotation probably.
> 
> > seems like it's tracked by a bunch of exclusions in the Unidoc object
> 
> That's good. At least we have a clear definition about which packages are 
> meant to be private. We should make it consistent between MiMa and doc 
> generator though.
> 
>> On Wed, Nov 14, 2018 at 10:41 AM Marcelo Vanzin  wrote:
>> On Tue, Nov 13, 2018 at 6:26 PM Wenchen Fan  wrote:
>> > Recently I updated the MiMa exclusion rules, and found MiMa tracks some 
>> > private classes/methods unexpectedly.
>> 
>> Could you clarify what you mean here? Mima has some known limitations
>> such as not handling "private[blah]" very well (because that means
>> public in Java). Spark has (had?) this tool to generate an exclusions
>> file for Mima, but not sure how up-to-date it is.
>> 
>> > AFAIK, we have several rules:
>> > 1. everything which is really private that end users can't access, e.g. 
>> > package private classes, private methods, etc.
>> > 2. classes under certain packages. I don't know if we have a list, the 
>> > catalyst package is considered as a private package.
>> > 3. everything which has a @Private annotation.
>> 
>> That's my understanding of the scope of the rules.
>> 
>> (2) to me means "things that show up in the public API docs". That's,
>> AFAIK, tracked in SparkBuild.scala; seems like it's tracked by a bunch
>> of exclusions in the Unidoc object (I remember that being different in
>> the past).
>> 
>> (3) might be a limitation of the doc generation tool? Not sure if it's
>> easy to say "do not document classes that have @Private". At the very
>> least, that annotation seems to be missing the "@Documented"
>> annotation, which would make that info present in the javadoc. I do
>> not know if the scala doc tool handles that.
>> 
>> -- 
>> Marcelo


Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Wenchen Fan
> Could you clarify what you mean here? Mima has some known limitations
such as not handling "private[blah]" very well

Yes that's what I mean.

What I want to know here is which classes/methods we expect to be
private. I think things marked as "private[blabla]" are expected to be
private for sure; it's just that MiMa and the doc generator can't handle them
well. We can fix them later, probably by using the @Private annotation.

> seems like it's tracked by a bunch of exclusions in the Unidoc object

That's good. At least we have a clear definition of which packages are
meant to be private. We should make it consistent between MiMa and the doc
generator though.

On Wed, Nov 14, 2018 at 10:41 AM Marcelo Vanzin  wrote:

> On Tue, Nov 13, 2018 at 6:26 PM Wenchen Fan  wrote:
> > Recently I updated the MiMa exclusion rules, and found MiMa tracks some
> private classes/methods unexpectedly.
>
> Could you clarify what you mean here? Mima has some known limitations
> such as not handling "private[blah]" very well (because that means
> public in Java). Spark has (had?) this tool to generate an exclusions
> file for Mima, but not sure how up-to-date it is.
>
> > AFAIK, we have several rules:
> > 1. everything which is really private that end users can't access, e.g.
> package private classes, private methods, etc.
> > 2. classes under certain packages. I don't know if we have a list, the
> catalyst package is considered as a private package.
> > 3. everything which has a @Private annotation.
>
> That's my understanding of the scope of the rules.
>
> (2) to me means "things that show up in the public API docs". That's,
> AFAIK, tracked in SparkBuild.scala; seems like it's tracked by a bunch
> of exclusions in the Unidoc object (I remember that being different in
> the past).
>
> (3) might be a limitation of the doc generation tool? Not sure if it's
> easy to say "do not document classes that have @Private". At the very
> least, that annotation seems to be missing the "@Documented"
> annotation, which would make that info present in the javadoc. I do
> not know if the scala doc tool handles that.
>
> --
> Marcelo
>


Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Marcelo Vanzin
On Tue, Nov 13, 2018 at 6:26 PM Wenchen Fan  wrote:
> Recently I updated the MiMa exclusion rules, and found MiMa tracks some 
> private classes/methods unexpectedly.

Could you clarify what you mean here? Mima has some known limitations
such as not handling "private[blah]" very well (because that means
public in Java). Spark has (had?) this tool to generate an exclusions
file for Mima, but not sure how up-to-date it is.
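
(For reference, those exclusion rules are MiMa ProblemFilters. An entry in the
exclusions file looks roughly like the sketch below; the class and method names
are made up purely for illustration.)

import com.typesafe.tools.mima.core._

// Tell MiMa to ignore a compatibility "problem" for a member that was never
// meant to be public API.
ProblemFilters.exclude[DirectMissingMethodProblem](
  "org.apache.spark.somepackage.SomeInternalClass.someRemovedMethod")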

> AFAIK, we have several rules:
> 1. everything which is really private that end users can't access, e.g. 
> package private classes, private methods, etc.
> 2. classes under certain packages. I don't know if we have a list, the 
> catalyst package is considered as a private package.
> 3. everything which has a @Private annotation.

That's my understanding of the scope of the rules.

(2) to me means "things that show up in the public API docs". That's,
AFAIK, tracked in SparkBuild.scala; seems like it's tracked by a bunch
of exclusions in the Unidoc object (I remember that being different in
the past).

(3) might be a limitation of the doc generation tool? Not sure if it's
easy to say "do not document classes that have @Private". At the very
least, that annotation seems to be missing the "@Documented"
annotation, which would make that info present in the javadoc. I do
not know if the scala doc tool handles that.

-- 
Marcelo




Re: which classes/methods are considered as private in Spark?

2018-11-13 Thread Sean Owen
You should find that 'surprisingly public' classes are there because
of language technicalities. For example DummySerializerInstance is
public because it's a Java class, and can't be used outside its
package otherwise.

Likewise, I think MiMa just looks at bytecode, and private[spark]
classes are public in the bytecode for similar reasons (although Scala
enforces the access within Scala as expected). Hence it will flag
changes to "nonpublic" private[spark] classes.

I think things that are meant to be marked private are, well, marked
private, or else as private as possible and flagged with annotations
like @Private. (It does sound like DummySerializerInstance should be
so annotated?) Yes, the catalyst package in its entirety is one big
exception - private by fiat, not by painstaking flagging of every
class.
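
A small illustration of both points, with hypothetical class names (this is
not actual Spark code):

package org.apache.spark.util

import org.apache.spark.annotation.Private

// Scala enforces private[spark], but the emitted bytecode is a public class,
// which is why a bytecode-based tool like MiMa still flags changes to it.
private[spark] class SomeInternalHelper {
  def doWork(): Unit = ()
}

// The annotation route: public at the language level, but explicitly flagged
// as exempt from compatibility guarantees.
@Private
class SomeOtherHelper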

The issue to me is really docs. If we have java/scaladoc of private
classes, and there's a way to avoid that like with annotations, that
should be fixed.
On Tue, Nov 13, 2018 at 6:26 PM Wenchen Fan  wrote:
>
> Hi all,
>
> Recently I updated the MiMa exclusion rules, and found MiMa tracks some 
> private classes/methods unexpectedly.
>
> Note that, "private" here means that, we have no guarantee about 
> compatibility. We don't provide documents and users need to take the risk 
> when using them.
>
> In the API document, it has some obvious private classes, e.g. 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.serializer.DummySerializerInstance
>  , which is not expected either.
>
> I looked around and can't find a clear definition of "private" in Spark.
>
> AFAIK, we have several rules:
> 1. everything which is really private that end users can't access, e.g. 
> package private classes, private methods, etc.
> 2. classes under certain packages. I don't know if we have a list, the 
> catalyst package is considered as a private package.
> 3. everything which has a @Private annotation.
>
> I'm sending this email to collect more feedback, and hope we can come up with 
> a clear definition about what is "private".
>
> Thanks,
> Wenchen




which classes/methods are considered as private in Spark?

2018-11-13 Thread Wenchen Fan
Hi all,

Recently I updated the MiMa exclusion rules, and found MiMa tracks some
private classes/methods unexpectedly.

Note that, "private" here means that, we have no guarantee about
compatibility. We don't provide documents and users need to take the risk
when using them.

The API documentation also shows some obviously private classes, e.g.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.serializer.DummySerializerInstance
, which is not expected either.

I looked around and can't find a clear definition of "private" in Spark.

AFAIK, we have several rules:
1. everything which is really private that end users can't access, e.g.
package private classes, private methods, etc.
2. classes under certain packages. I don't know if we have a list, the
catalyst package is considered as a private package.
3. everything which has a @Private annotation.

I'm sending this email to collect more feedback, and hope we can come up
with a clear definition of what is "private".

Thanks,
Wenchen


DataSourceV2 sync tomorrow

2018-11-13 Thread Ryan Blue
Hi everyone,
I just wanted to send out a reminder that there’s a DSv2 sync tomorrow at
17:00 PST, which is 01:00 UTC.

Here are some of the topics under discussion in the last couple of weeks:

   - Read API for v2 - see Wenchen’s doc
   - Capabilities API - see the dev list thread
   - Using CatalogTableIdentifier to reliably separate v2 code paths - see PR
   #21978
   - A replacement for InternalRow

I know that a lot of people are also interested in combining the source API
for micro-batch and continuous streaming. Wenchen and I have been
discussing a way to do that and Wenchen has added it to the Read API doc as
Alternative #2. I think this would be a good thing to plan on discussing.

rb

Here’s some additional background on combining micro-batch and continuous
APIs:

The basic idea is to update how tasks end so that the same tasks can be
used in micro-batch or streaming. For tasks that are naturally limited like
data files, when the data is exhausted, Spark stops reading. For tasks that
are not limited, like a Kafka partition, Spark decides when to stop in
micro-batch mode by hitting a pre-determined LocalOffset or Spark can just
keep running in continuous mode.

Note that a task deciding to stop can happen in both modes, either when a
task is exhausted in micro-batch or when a stream needs to be reconfigured
in continuous.

Here’s the task reader API. The offset returned is optional so that a task
can avoid stopping if there isn’t a resumable offset, like if it is in the
middle of an input file:

interface StreamPartitionReader<T> extends InputPartitionReader<T> {
  Optional<Offset> currentOffset();
  boolean next();  // from InputPartitionReader
  T get();         // from InputPartitionReader
}
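
As a concrete (and purely hypothetical) example of why the offset is optional,
a reader over a single data file might only be able to resume at the file
boundary, so it reports an offset only once the whole file has been consumed.
DataFile, FileCompletedOffset and openIterator are made-up names for this
sketch:

class FilePartitionReader(file: DataFile) extends StreamPartitionReader[InternalRow] {
  private val rows = file.openIterator()
  private var current: InternalRow = _
  private var exhausted = false

  override def next(): Boolean = {
    if (rows.hasNext) { current = rows.next(); true }
    else { exhausted = true; false }
  }

  override def get(): InternalRow = current

  override def currentOffset(): Optional[Offset] =
    if (exhausted) Optional.of(FileCompletedOffset(file))   // a resumable point
    else Optional.empty()                                    // mid-file: nothing to resume from
}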

The streaming code would look something like this:

Stream stream = scan.toStream()
StreamReaderFactory factory = stream.createReaderFactory()

while (true) {
  Offset start = stream.currentOffset()
  Offset end = if (isContinuousMode) {
None
  } else {
// rate limiting would happen here
Some(stream.latestOffset())
  }

  InputPartition[] parts = stream.planInputPartitions(start)

  // returns when needsReconfiguration is true or all tasks finish
  runTasks(parts, factory, end)

  // the stream's current offset has been updated at the last epoch
}
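
And one way the per-task end condition behind runTasks could work (again just a
sketch; how offsets are compared is not settled, so the `reached` comparator and
the helper names here are assumptions, not part of the proposal):

def runTask[T](reader: StreamPartitionReader[T],
               end: Option[Offset],
               reached: (Offset, Offset) => Boolean,
               process: T => Unit): Unit = {
  var done = false
  while (!done && reader.next()) {
    process(reader.get())
    // Micro-batch: stop once the reader reports a resumable offset that has
    // reached the pre-determined end. Continuous: end is None, so keep running
    // until the stream needs reconfiguration.
    done = end.exists { e =>
      val cur = reader.currentOffset()
      cur.isPresent && reached(cur.get, e)
    }
  }
}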

-- 
Ryan Blue
Software Engineer
Netflix


Re: time for Apache Spark 3.0?

2018-11-13 Thread Matt Cheah
I just added the label to https://issues.apache.org/jira/browse/SPARK-25908. 
Unsure if there are any others. I’ll look through the tickets and see if there 
are any that are missing the label.

 

-Matt Cheah

 

From: Sean Owen 
Date: Tuesday, November 13, 2018 at 12:09 PM
To: Matt Cheah 
Cc: Sean Owen , Vinoo Ganesh , dev 

Subject: Re: time for Apache Spark 3.0?

 

As far as I know any JIRA that has implications for users is tagged this way 
but I haven't examined all of them. All that are going in for 3.0 should have 
it as Fix Version . Most changes won't have a user visible impact. Do you see 
any that seem to need the tag? Call em out or even fix them by adding the tag 
and proposed release notes. 

 

On Tue, Nov 13, 2018, 11:49 AM Matt Cheah  wrote:

My non-definitive takes --

I would personally like to remove all deprecated methods for Spark 3.
I started by removing 'old' deprecated methods in that commit. Things
deprecated in 2.4 are maybe less clear, whether they should be removed

Everything's fair game for removal or change in a major release. So
far some items in discussion seem to be Scala 2.11 support, Python 2
support, R support before 3.4. I don't know about other APIs.

Generally, take a look at JIRA for items targeted at version 3.0. Not
everything targeted for 3.0 is going in, but ones from committers are
more likely than others. Breaking changes ought to be tagged
'release-notes' with a description of the change. The release itself
has a migration guide that's being updated as we go.


On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah  wrote:
>
> I wanted to clarify what categories of APIs are eligible to be broken in 
Spark 3.0. Specifically:
>
>
>
> Are we removing all deprecated methods? If we’re only removing some 
subset of deprecated methods, what is that subset? I see a bunch were removed 
in https://github.com/apache/spark/pull/22921
 for example. Are we only committed to removing methods that were deprecated in 
some Spark version and earlier?
> Aside from removing support for Scala 2.11, what other kinds of 
(non-experimental and non-evolving) APIs are eligible to be broken?
> Is there going to be a way to track the current list of all proposed 
breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket 
that can be filtered down to somehow?
>







Re: time for Apache Spark 3.0?

2018-11-13 Thread Sean Owen
As far as I know, any JIRA that has implications for users is tagged this
way, but I haven't examined all of them. All that are going in for 3.0
should have it as the Fix Version. Most changes won't have a user-visible
impact. Do you see any that seem to need the tag? Call them out, or even fix
them by adding the tag and proposed release notes.

On Tue, Nov 13, 2018, 11:49 AM Matt Cheah wrote:
> The release-notes label on JIRA sounds good. Can we make it a point to
> have that done retroactively now, and then moving forward?
>
> On 11/12/18, 4:01 PM, "Sean Owen"  wrote:
>
> My non-definitive takes --
>
> I would personally like to remove all deprecated methods for Spark 3.
> I started by removing 'old' deprecated methods in that commit. Things
> deprecated in 2.4 are maybe less clear, whether they should be removed
>
> Everything's fair game for removal or change in a major release. So
> far some items in discussion seem to be Scala 2.11 support, Python 2
> support, R support before 3.4. I don't know about other APIs.
>
> Generally, take a look at JIRA for items targeted at version 3.0. Not
> everything targeted for 3.0 is going in, but ones from committers are
> more likely than others. Breaking changes ought to be tagged
> 'release-notes' with a description of the change. The release itself
> has a migration guide that's being updated as we go.
>
>
> On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah 
> wrote:
> >
> > I wanted to clarify what categories of APIs are eligible to be
> broken in Spark 3.0. Specifically:
> >
> >
> >
> > Are we removing all deprecated methods? If we’re only removing some
> subset of deprecated methods, what is that subset? I see a bunch were
> removed in https://github.com/apache/spark/pull/22921
> for example. Are we only committed to removing methods that were deprecated
> in some Spark version and earlier?
> > Aside from removing support for Scala 2.11, what other kinds of
> (non-experimental and non-evolving) APIs are eligible to be broken?
> > Is there going to be a way to track the current list of all proposed
> breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
> ticket that can be filtered down to somehow?
> >
>
>
>
>


Re: time for Apache Spark 3.0?

2018-11-13 Thread Matt Cheah
The release-notes label on JIRA sounds good. Can we make it a point to have 
that done retroactively now, and then moving forward?

On 11/12/18, 4:01 PM, "Sean Owen"  wrote:

My non-definitive takes --

I would personally like to remove all deprecated methods for Spark 3.
I started by removing 'old' deprecated methods in that commit. Things
deprecated in 2.4 are maybe less clear, whether they should be removed

Everything's fair game for removal or change in a major release. So
far some items in discussion seem to be Scala 2.11 support, Python 2
support, R support before 3.4. I don't know about other APIs.

Generally, take a look at JIRA for items targeted at version 3.0. Not
everything targeted for 3.0 is going in, but ones from committers are
more likely than others. Breaking changes ought to be tagged
'release-notes' with a description of the change. The release itself
has a migration guide that's being updated as we go.


On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah  wrote:
>
> I wanted to clarify what categories of APIs are eligible to be broken in 
Spark 3.0. Specifically:
>
>
>
> Are we removing all deprecated methods? If we’re only removing some 
subset of deprecated methods, what is that subset? I see a bunch were removed 
in https://github.com/apache/spark/pull/22921
 for example. Are we only committed to removing methods that were deprecated in 
some Spark version and earlier?
> Aside from removing support for Scala 2.11, what other kinds of 
(non-experimental and non-evolving) APIs are eligible to be broken?
> Is there going to be a way to track the current list of all proposed 
breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket 
that can be filtered down to somehow?
>







Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-11-13 Thread Kazuaki Ishizaki
Hi all,
I spent some time considering these great points. Sorry for my delay.
I put comments in green into
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit

Here is a summary of the comments:
1) For simplicity and expressiveness, introduce nodes to represent a 
structure (e.g. for, while)
2) For simplicity, measure some statistics (e.g. node / java bytecode, 
memory consumption)
3) For ease of understanding, use simple APIs like the original statements 
(op2, for, while, ...)
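
Purely as an illustration of points 1) and 3) above (the node and API names
below are hypothetical, not taken from the proposal), a structured IR along
these lines might look like:

sealed trait Node
case class Literal(v: String)                 extends Node
case class Op2(op: String, l: Node, r: Node)  extends Node
case class While(cond: Node, body: Seq[Node]) extends Node
case class For(init: Node, cond: Node, update: Node, body: Seq[Node]) extends Node

// Today the generated Java is assembled as strings, e.g. "a" + " * " + "b" + " / " + "c".
// With a structured IR the same expression becomes a tree that can be inspected,
// measured and optimized before code generation:
val expr = Op2("/", Op2("*", Literal("a"), Literal("b")), Literal("c"))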

We would appreciate any comments/suggestions on the Google Doc or the dev
list going forward.

Kazuaki Ishizaki, 



From: "Kazuaki Ishizaki"
To: Reynold Xin
Cc: dev, Takeshi Yamamuro, Xiao Li
Date: 2018/10/31 00:56
Subject: Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code



Hi Reynold,
Thank you for your comments. They are great points.

1) Yes, it is not easy to design an IR that is both expressive enough and
simple enough. We can learn concepts from good examples like HyPer, Weld, and
others. They are expressive and not complicated. The details cannot be
captured yet.
2) Introducing another layer takes some time to learn new things. This
SPIP tries to reduce the learning time by preparing clean APIs for
constructing generated code. I will try to add some examples for APIs that
are equivalent to the current string concatenations (e.g. "a" + " * " + "b" +
" / " + "c").

It is more important for us to learn from failures than from successes.
We would appreciate it if you could list the failures that you have seen.

Best Regards,
Kazuaki Ishizaki



From: Reynold Xin
To: Kazuaki Ishizaki
Cc: Xiao Li, dev, Takeshi Yamamuro
Date: 2018/10/26 03:46
Subject: Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code



I have some pretty serious concerns over this proposal. I agree that there 
are many things that can be improved, but at the same time I also think 
the cost of introducing a new IR in the middle is extremely high. Having 
participated in designing some of the IRs in other systems, I've seen more 
failures than successes. The failures typically come from two sources: (1) 
in general it is extremely difficult to design IRs that are both 
expressive enough and are simple enough; (2) typically another layer of 
indirection increases the complexity a lot more, beyond the level of 
understanding and expertise that most contributors can obtain without 
spending years in the code base and learning about all the gotchas.

In either case, I'm not saying "no please don't do this". This is one of 
those cases in which the devils are in the details that cannot be captured 
by a high level document, and I want to explicitly express my concern 
here.




On Thu, Oct 25, 2018 at 12:10 AM Kazuaki Ishizaki  
wrote:
Hi Xiao,
Thank you very much for becoming a shepherd.
If you feel the discussion settles, we would appreciate it if you would 
start a voting.

Regards,
Kazuaki Ishizaki



From: Xiao Li
To: Kazuaki Ishizaki
Cc: dev, Takeshi Yamamuro
Date: 2018/10/22 16:31
Subject: Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code



Hi, Kazuaki, 

Thanks for your great SPIP! I am willing to be the shepherd of this SPIP. 

Cheers,

Xiao


On Mon, Oct 22, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hi Yamamuro-san,
Thank you for your comments. This SPIP has received several valuable comments
and feedback on the Google Doc:
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing
I hope that this SPIP can go forward based on this feedback.

Based on the SPIP procedure at
http://spark.apache.org/improvement-proposals.html, can I ask one or more
PMCs to become a shepherd of this SPIP?
I would appreciate your kindness and cooperation.

Best Regards,
Kazuaki Ishizaki



From: Takeshi Yamamuro
To: Spark dev list
Cc: ishiz...@jp.ibm.com
Date: 2018/10/15 12:12
Subject: Re: SPIP: SPARK-25728 Structured Intermediate Representation (Tungsten IR) for generating Java code



Hi, ishizaki-san,

Cool activity, I left some comments on the doc.

best,
takeshi


On Mon, Oct 15, 2018 at 12:05 AM Kazuaki Ishizaki  
wrote:
Hello community,

I am writing this e-mail in order to start a discussion about adding a
structured intermediate representation for generating Java code from a
program using the DataFrame or Dataset API, in addition to the current
String-based representation.
This addition is based on the discussions in a thread at 
https://github.com/apache/spark/pull/21537#issuecomment-413268196

Please feel free to comment on the JIRA ticket or Google Doc.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25728
Google Doc: 

RE: Looking for spark connector for SQS

2018-11-13 Thread Jagwani, Prakash
Did you try the SQS JMS client?
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-java-message-service-jms-client.html


Thanks,
Prakash Jagwani

From: Pawan Gandhi 
Sent: Tuesday, November 13, 2018 1:14 PM
To: dev@spark.apache.org
Subject: Looking for spark connector for SQS

Hi All,

Searched for connector to connect spark with SQS but could not find any. So 
please provide pointer for the same.

Regards
Pawan



Looking for spark connector for SQS

2018-11-13 Thread Pawan Gandhi
Hi All,

I searched for a connector to connect Spark with SQS but could not find any,
so please provide a pointer for the same.

Regards
Pawan


Re: SPIP: Property Graphs, Cypher Queries, and Algorithms

2018-11-13 Thread Xiangrui Meng
+Joseph Gonzalez  +Ankur Dave


On Tue, Nov 13, 2018 at 2:55 AM Martin Junghanns 
wrote:

> Hi Spark community,
>
> We would like to propose a new graph module for Apache Spark with support
> for Property Graphs, Cypher graph queries and graph algorithms built on top
> of the DataFrame API.
>
> Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-25994
> Google Doc:
> https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit?usp=sharing
>
> Jira issue for a first design sketch:
> https://issues.apache.org/jira/browse/SPARK-26028
> Google Doc:
> https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing
>
> Thanks,
>
> Martin (on behalf of the Neo4j Cypher for Apache Spark team)
>
-- 

Xiangrui Meng

Software Engineer

Databricks Inc. (http://databricks.com)