Pickling error when attempting to add a method in pyspark

2015-04-27 Thread Stephen Boesch
My intention is to add pyspark support for certain Spark MLlib methods. I
have been unable to resolve pickling errors of the form

   Pyspark py4j PickleException: “expected zero arguments for construction
of ClassDict”


These are occurring during python-to-java conversion of python named
tuples. The details are rather hard to provide here, so I have created an
SOF question:

http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class

In any case I have included the text here. The SOF is easier to read though
;)

--

This question is directed towards persons familiar with py4j who can help
to resolve a pickling error. I am trying to add a method to the pyspark
PythonMLLibAPI that accepts an RDD of a namedtuple, does some work, and
returns a result in the form of an RDD.

This method is modeled after the PythonMLLibAPI.trainALSModel() method,
whose analogous *existing* relevant portions are:

  def trainALSModel(
      ratingsJRDD: JavaRDD[Rating],
      .. )

The *existing* python Rating class used to model the new code is:

class Rating(namedtuple("Rating", ["user", "product", "rating"])):
    def __reduce__(self):
        return Rating, (int(self.user), int(self.product), float(self.rating))

Here is the attempt. The relevant classes are:

*New* python class pyspark.mllib.clustering.MatrixEntry:

from collections import namedtuple

class MatrixEntry(namedtuple("MatrixEntry", ["x", "y", "weight"])):
    def __reduce__(self):
        return MatrixEntry, (long(self.x), long(self.y), float(self.weight))
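
As a sanity check, the class round-trips under Python's own pickle (a quick
sketch; this localizes the failure to the JVM-side unpickler rather than the
Python side):

  # Quick sanity check: MatrixEntry survives a pure-Python pickle round trip,
  # so the failure must be in the JVM-side unpickling, not in __reduce__.
  import pickle
  e = MatrixEntry(1, 2, 3.0)
  assert pickle.loads(pickle.dumps(e)) == e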

*New* method *foobarRdd* in PythonMLLibAPI:

  def foobarRdd(
      data: JavaRDD[MatrixEntry]): RDD[FooBarResult] = {
    val rdd = data.rdd.map { d =>
      FooBarResult(d.i, d.j, d.value, d.i * 100 + d.j * 10 + d.value)
    }
    rdd
  }

Now let us try it out:

from pyspark.mllib.clustering import MatrixEntry

def convert_to_MatrixEntry(tuple):
    return MatrixEntry(*tuple)

from pyspark.mllib.clustering import *
pic = PowerIterationClusteringModel(2)
tups = [(1,2,3),(4,5,6),(12,13,14),(15,7,8),(16,17,16.5)]
trdd = sc.parallelize(map(convert_to_MatrixEntry, tups))

# print out the RDD on python side just for validation
print "%s" % (repr(trdd.collect()))

from pyspark.mllib.common import callMLlibFunc
pic = callMLlibFunc("foobar", trdd)

Relevant portions of results:

[(1,2)=3.0, (4,5)=6.0, (12,13)=14.0, (15,7)=8.0, (16,17)=16.5]

which shows the input RDD is 'whole'. However, the pickling was unhappy:

15/04/27 21:15:44 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 14)
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.mllib.clustering.MatrixEntry)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1167)
at org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1166)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:212)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
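
One possible workaround (a sketch only, not a confirmed fix): bypass the
namedtuple subclass entirely and ship plain tuples, so the JVM-side unpickler
never has to construct a ClassDict for the Python class. This assumes the
Scala side of "foobar" is adapted to accept 3-element tuples rather than
MatrixEntry:

  # Sketch: send plain (long, long, float) tuples instead of MatrixEntry,
  # so net.razorvine.pickle never needs a registered constructor for
  # pyspark.mllib.clustering.MatrixEntry.
  tups = [(1, 2, 3), (4, 5, 6), (12, 13, 14), (15, 7, 8), (16, 17, 16.5)]
  plain = sc.parallelize(tups).map(
      lambda t: (long(t[0]), long(t[1]), float(t[2])))
  result = callMLlibFunc("foobar", plain)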

java.lang.StackOverflowError when recovering from a checkpoint in Streaming

2015-04-27 Thread wyphao.2007
 Hi everyone, I am using

  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder](ssc, kafkaParams, topicsSet)

to read data from Kafka (1k/second) and store the data in windows. The code
snippet is as follows:

  val windowedStreamChannel = streamChannel
    .combineByKey[TreeSet[Obj]](TreeSet[Obj](_), _ += _, _ ++= _,
      new HashPartitioner(numPartition))
    .reduceByKeyAndWindow((x: TreeSet[Obj], y: TreeSet[Obj]) => x ++= y,
      (x: TreeSet[Obj], y: TreeSet[Obj]) => x --= y, Minutes(60),
      Seconds(2), numPartition,
      (item: (String, TreeSet[Obj])) => item._2.size != 0)

After the application had run for an hour, I killed it and restarted it from
the checkpoint directory, but I encountered an exception:

2015-04-27 17:52:40,955 INFO  [Driver] - Slicing from 1430126222000 ms to 1430126222000 ms (aligned to 1430126222000 ms and 1430126222000 ms)
2015-04-27 17:52:40,958 ERROR [Driver] - User class threw exception: null
java.lang.StackOverflowError
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
at java.io.File.exists(File.java:813)
at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1080)
at sun.misc.URLClassPath.getResource(URLClassPath.java:199)
at java.net.URLClassLoader$1.run(URLClassLoader.java:358)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
at org.apache.spark.rdd.RDD.filter(RDD.scala:303)
at org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
at org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
at scala.Option.map(Option.scala:145)
at org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
   
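A common cause of failures like this on recovery (an assumption here, not a
verified diagnosis of this job) is rebuilding the DStream graph outside the
factory function passed to getOrCreate; deep getOrCompute recursion over a
long window (60 minutes of 2-second batches is 1800 slices) can also simply
exhaust the default thread stack, in which case raising the driver stack size
via -Xss is a common mitigation. A minimal sketch of the checkpoint-safe
setup pattern, written in PySpark for brevity (assuming a version that
provides StreamingContext.getOrCreate; the Scala API is analogous):

  # Sketch: define the *entire* DStream graph inside the factory so that
  # recovery from the checkpoint rebuilds the same graph. The path and the
  # pipeline body are hypothetical placeholders.
  from pyspark.streaming import StreamingContext

  checkpoint_dir = "hdfs:///spark/ck/example"  # hypothetical path

  def create_context():
      ssc = StreamingContext(sc, 2)
      ssc.checkpoint(checkpoint_dir)
      # build the stream -> combine -> reduceByKeyAndWindow pipeline here ...
      return ssc

  ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
  ssc.start()
  ssc.awaitTermination()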

Re: Plans for upgrading Hive dependency?

2015-04-27 Thread Punyashloka Biswal
Thanks Marcelo and Patrick - I don't know how I missed that ticket in my
Jira search earlier. Is anybody working on the sub-issues yet, or is there
a design doc I should look at before taking a stab?

Regards,
Punya

On Mon, Apr 27, 2015 at 3:56 PM Patrick Wendell  wrote:

> Hey Punya,
>
> There is some ongoing work to help make Hive upgrades more manageable
> and allow us to support multiple versions of Hive. Once we do that, it
> will be much easier for us to upgrade.
>
> https://issues.apache.org/jira/browse/SPARK-6906
>
> - Patrick
>
> On Mon, Apr 27, 2015 at 12:47 PM, Marcelo Vanzin 
> wrote:
> > That's a lot more complicated than you might think.
> >
> > We've done some basic work to get HiveContext to compile against Hive
> > 1.1.0. Here's the code:
> >
> https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec
> >
> > We didn't send that upstream because that only solves half of the
> > problem; the hive-thriftserver is disabled in our CDH build because it
> > uses a lot of Hive APIs that have been removed in 1.1.0, so even
> > getting it to compile is really complicated.
> >
> > If there's interest in getting the HiveContext part fixed up I can
> > send a PR for that code. But at this time I don't really have plans to
> > look at the thrift server.
> >
> >
> > On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal
> >  wrote:
> >> Dear Spark devs,
> >>
> >> Is there a plan for staying up-to-date with current (and future)
> versions
> >> of Hive? Spark currently supports version 0.13 (June 2014), but the
> latest
> >> version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets
> about
> >> updating beyond 0.13, so I was wondering if this was intentional or it
> was
> >> just that nobody had started work on this yet.
> >>
> >> I'd be happy to work on a PR for the upgrade if one of the core
> developers
> >> can tell me what pitfalls to watch out for.
> >>
> >> Punya
> >
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>


Re: Plans for upgrading Hive dependency?

2015-04-27 Thread Patrick Wendell
Hey Punya,

There is some ongoing work to help make Hive upgrades more manageable
and allow us to support multiple versions of Hive. Once we do that, it
will be much easier for us to upgrade.

https://issues.apache.org/jira/browse/SPARK-6906

- Patrick

On Mon, Apr 27, 2015 at 12:47 PM, Marcelo Vanzin  wrote:
> That's a lot more complicated than you might think.
>
> We've done some basic work to get HiveContext to compile against Hive
> 1.1.0. Here's the code:
> https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec
>
> We didn't send that upstream because that only solves half of the
> problem; the hive-thriftserver is disabled in our CDH build because it
> uses a lot of Hive APIs that have been removed in 1.1.0, so even
> getting it to compile is really complicated.
>
> If there's interest in getting the HiveContext part fixed up I can
> send a PR for that code. But at this time I don't really have plans to
> look at the thrift server.
>
>
> On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal
>  wrote:
>> Dear Spark devs,
>>
>> Is there a plan for staying up-to-date with current (and future) versions
>> of Hive? Spark currently supports version 0.13 (June 2014), but the latest
>> version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about
>> updating beyond 0.13, so I was wondering if this was intentional or it was
>> just that nobody had started work on this yet.
>>
>> I'd be happy to work on a PR for the upgrade if one of the core developers
>> can tell me what pitfalls to watch out for.
>>
>> Punya
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Plans for upgrading Hive dependency?

2015-04-27 Thread Marcelo Vanzin
That's a lot more complicated than you might think.

We've done some basic work to get HiveContext to compile against Hive
1.1.0. Here's the code:
https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec

We didn't send that upstream because that only solves half of the
problem; the hive-thriftserver is disabled in our CDH build because it
uses a lot of Hive APIs that have been removed in 1.1.0, so even
getting it to compile is really complicated.

If there's interest in getting the HiveContext part fixed up I can
send a PR for that code. But at this time I don't really have plans to
look at the thrift server.


On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal
 wrote:
> Dear Spark devs,
>
> Is there a plan for staying up-to-date with current (and future) versions
> of Hive? Spark currently supports version 0.13 (June 2014), but the latest
> version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about
> updating beyond 0.13, so I was wondering if this was intentional or it was
> just that nobody had started work on this yet.
>
> I'd be happy to work on a PR for the upgrade if one of the core developers
> can tell me what pitfalls to watch out for.
>
> Punya



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Plans for upgrading Hive dependency?

2015-04-27 Thread Punyashloka Biswal
Dear Spark devs,

Is there a plan for staying up-to-date with current (and future) versions
of Hive? Spark currently supports version 0.13 (June 2014), but the latest
version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about
updating beyond 0.13, so I was wondering if this was intentional or it was
just that nobody had started work on this yet.

I'd be happy to work on a PR for the upgrade if one of the core developers
can tell me what pitfalls to watch out for.

Punya


Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
never mind, looks like you guys are already on it.  :)

On Mon, Apr 27, 2015 at 11:35 AM, shane knapp  wrote:

> sure, i'll kill all of the current spark prb build...
>
> On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin  wrote:
>
>> Shane - can we purge all the outstanding builds so we are not running
>> stuff against stale PRs?
>>
>>
>> On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> And unfortunately, many Jenkins executor slots are being taken by stale
>>> Spark PRs...
>>>
>>> On Mon, Apr 27, 2015 at 2:25 PM shane knapp  wrote:
>>>
>>> > anyways, the build queue is SLAMMED...  we're going to need at least a
>>> day
>>> > to catch up w/this.  i'll be keeping an eye on system loads and
>>> whatnot all
>>> > day today.
>>> >
>>> > whee!
>>> >
>>> > On Mon, Apr 27, 2015 at 11:18 AM, shane knapp 
>>> wrote:
>>> >
>>> > > somehow, the power outage on friday caused the pull request builder
>>> to
>>> > > lose its config entirely...  i'm not sure why, but after i added the
>>> > oauth
>>> > > token back, we're now catching up on the weekend's pull request
>>> builds.
>>> > >
>>> > > have i mentioned how much i hate this plugin?  ;)
>>> > >
>>> > > sorry for the inconvenience...
>>> > >
>>> > > shane
>>> > >
>>> >
>>>
>>
>>
>


Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
sure, i'll kill all of the current spark prb build...

On Mon, Apr 27, 2015 at 11:34 AM, Reynold Xin  wrote:

> Shane - can we purge all the outstanding builds so we are not running
> stuff against stale PRs?
>
>
> On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> And unfortunately, many Jenkins executor slots are being taken by stale
>> Spark PRs...
>>
>> On Mon, Apr 27, 2015 at 2:25 PM shane knapp  wrote:
>>
>> > anyways, the build queue is SLAMMED...  we're going to need at least a
>> day
>> > to catch up w/this.  i'll be keeping an eye on system loads and whatnot
>> all
>> > day today.
>> >
>> > whee!
>> >
>> > On Mon, Apr 27, 2015 at 11:18 AM, shane knapp 
>> wrote:
>> >
>> > > somehow, the power outage on friday caused the pull request builder to
>> > > lose its config entirely...  i'm not sure why, but after i added the
>> > oauth
>> > > token back, we're now catching up on the weekend's pull request
>> builds.
>> > >
>> > > have i mentioned how much i hate this plugin?  ;)
>> > >
>> > > sorry for the inconvenience...
>> > >
>> > > shane
>> > >
>> >
>>
>
>


Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread Reynold Xin
Shane - can we purge all the outstanding builds so we are not running stuff
against stale PRs?


On Mon, Apr 27, 2015 at 11:30 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> And unfortunately, many Jenkins executor slots are being taken by stale
> Spark PRs...
>
> On Mon, Apr 27, 2015 at 2:25 PM shane knapp  wrote:
>
> > anyways, the build queue is SLAMMED...  we're going to need at least a
> day
> > to catch up w/this.  i'll be keeping an eye on system loads and whatnot
> all
> > day today.
> >
> > whee!
> >
> > On Mon, Apr 27, 2015 at 11:18 AM, shane knapp 
> wrote:
> >
> > > somehow, the power outage on friday caused the pull request builder to
> > > lose its config entirely...  i'm not sure why, but after i added the
> > oauth
> > > token back, we're now catching up on the weekend's pull request builds.
> > >
> > > have i mentioned how much i hate this plugin?  ;)
> > >
> > > sorry for the inconvenience...
> > >
> > > shane
> > >
> >
>


Re: Design docs: consolidation and discoverability

2015-04-27 Thread Punyashloka Biswal
Github's wiki is just another Git repo. If we use a separate repo, it's
probably easiest to use the wiki git repo rather than the "primary" git
repo.

Punya

On Mon, Apr 27, 2015 at 1:50 PM Nicholas Chammas 
wrote:

> Oh, a GitHub wiki (which is separate from having docs in a repo) is yet
> another approach we could take, though if we want to do that on the main
> Spark repo we'd need permission from Apache, which may be tough to get...
>
> On Mon, Apr 27, 2015 at 1:47 PM Punyashloka Biswal 
> wrote:
>
>> Nick, I like your idea of keeping it in a separate git repository. It
>> seems to combine the advantages of the present Google Docs approach with
>> the crisper history, discoverability, and text format simplicity of GitHub
>> wikis.
>>
>> Punya
>> On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I like the idea of having design docs be kept up to date and tracked in
>>> git.
>>>
>>> If the Apache repo isn't a good fit, perhaps we can have a separate repo
>>> just for design docs? Maybe something like
>>> github.com/spark-docs/spark-docs/
>>> ?
>>>
>>> If there's other stuff we want to track but haven't, perhaps we can
>>> generalize the purpose of the repo a bit and rename it accordingly (e.g.
>>> spark-misc/spark-misc).
>>>
>>> Nick
>>>
>>> On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza 
>>> wrote:
>>>
>>> > My only issue with Google Docs is that they're mutable, so it's
>>> difficult
>>> > to follow a design's history through its revisions and link up JIRA
>>> > comments with the relevant version.
>>> >
>>> > -Sandy
>>> >
>>> > On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran <
>>> ste...@hortonworks.com>
>>> > wrote:
>>> >
>>> > >
>>> > > One thing to consider is that while docs as PDFs in JIRAs do
>>> document the
>>> > > original proposal, that's not the place to keep living
>>> specifications.
>>> > That
>>> > > stuff needs to live in SCM, in a format which can be easily
>>> maintained,
>>> > can
>>> > > generate readable documents, and, in an unrealistically ideal world,
>>> even
>>> > > be used by machines to validate compliance with the design. Test
>>> suites
>>> > > tend to be the implicit machine-readable part of the specification,
>>> > though
>>> > > they aren't usually viewed as such.
>>> > >
>>> > > PDFs of word docs in JIRAs are not the place for ongoing work, even
>>> if
>>> > the
>>> > > early drafts can contain them. Given it's just as easy to point to
>>> > markdown
>>> > > docs in github by commit ID, that could be an alternative way to
>>> publish
>>> > > docs, with the document itself being viewed as one of the
>>> deliverables.
>>> > > When the time comes to update a document, then it's there in the
>>> source
>>> > tree
>>> > > to edit.
>>> > >
>>> > > If there's a flaw here, it's that design docs are that: the design.
>>> The
>>> > > implementation may not match, ongoing work will certainly diverge.
>>> If the
>>> > > design docs aren't kept in sync, then they can mislead people.
>>> > Accordingly,
>>> > > once the design docs are incorporated into the source tree, keeping
>>> them
>>> > in
>>> > > sync with changes has to be viewed as essential as keeping tests up to
>>> date
>>> > >
>>> > > > On 26 Apr 2015, at 22:34, Patrick Wendell 
>>> wrote:
>>> > > >
>>> > > > I actually don't totally see why we can't use Google Docs provided
>>> it
>>> > > > is clearly discoverable from the JIRA. It was my understanding that
>>> > > > many projects do this. Maybe not (?).
>>> > > >
>>> > > > If it's a matter of maintaining public record on ASF
>>> infrastructure,
>>> > > > perhaps we can just automate that if an issue is closed we capture
>>> the
>>> > > > doc content and attach it to the JIRA as a PDF.
>>> > > >
>>> > > > My sense is that in general the ASF infrastructure policy is
>>> becoming
>>> > > > more and more lenient with regards to using third party services,
>>> > > > provided they are broadly accessible (such as a public google doc)
>>> and
>>> > > > can be definitively archived on ASF controlled storage.
>>> > > >
>>> > > > - Patrick
>>> > > >
>>> > > > On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen 
>>> wrote:
>>> > > >> I know I recently used Google Docs from a JIRA, so am guilty as
>>> > > >> charged. I don't think there are a lot of design docs in general,
>>> but
>>> > > >> the ones I've seen have simply pushed docs to a JIRA. (I did the
>>> same,
>>> > > >> mirroring PDFs of the Google Doc.) I don't think this is hard to
>>> > > >> follow.
>>> > > >>
>>> > > >> I think you can do what you like: make a JIRA and attach files.
>>> Make a
>>> > > >> WIP PR and attach your notes. Make a Google Doc if you're feeling
>>> > > >> transgressive.
>>> > > >>
>>> > > >> I don't see much of a problem to solve here. In practice there are
>>> > > >> plenty of workable options, all of which are mainstream, and so I
>>> do
>>> > > >> not see an argument that somehow this is solved by letting people
>>> make
>>> > > >> wikis.
>>> > > >>
>>> > > >> On Fri

Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread Nicholas Chammas
And unfortunately, many Jenkins executor slots are being taken by stale
Spark PRs...

On Mon, Apr 27, 2015 at 2:25 PM shane knapp  wrote:

> anyways, the build queue is SLAMMED...  we're going to need at least a day
> to catch up w/this.  i'll be keeping an eye on system loads and whatnot all
> day today.
>
> whee!
>
> On Mon, Apr 27, 2015 at 11:18 AM, shane knapp  wrote:
>
> > somehow, the power outage on friday caused the pull request builder to
> > lose its config entirely...  i'm not sure why, but after i added the
> oauth
> > token back, we're now catching up on the weekend's pull request builds.
> >
> > have i mentioned how much i hate this plugin?  ;)
> >
> > sorry for the inconvenience...
> >
> > shane
> >
>


Re: github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
anyways, the build queue is SLAMMED...  we're going to need at least a day
to catch up w/this.  i'll be keeping an eye on system loads and whatnot all
day today.

whee!

On Mon, Apr 27, 2015 at 11:18 AM, shane knapp  wrote:

> somehow, the power outage on friday caused the pull request builder to
> lose its config entirely...  i'm not sure why, but after i added the oauth
> token back, we're now catching up on the weekend's pull request builds.
>
> have i mentioned how much i hate this plugin?  ;)
>
> sorry for the inconvenience...
>
> shane
>


github pull request builder FAIL, now WIN(-ish)

2015-04-27 Thread shane knapp
somehow, the power outage on friday caused the pull request builder to lose
its config entirely...  i'm not sure why, but after i added the oauth
token back, we're now catching up on the weekend's pull request builds.

have i mentioned how much i hate this plugin?  ;)

sorry for the inconvenience...

shane


Re: Design docs: consolidation and discoverability

2015-04-27 Thread Nicholas Chammas
Oh, a GitHub wiki (which is separate from having docs in a repo) is yet
another approach we could take, though if we want to do that on the main
Spark repo we'd need permission from Apache, which may be tough to get...

On Mon, Apr 27, 2015 at 1:47 PM Punyashloka Biswal 
wrote:

> Nick, I like your idea of keeping it in a separate git repository. It
> seems to combine the advantages of the present Google Docs approach with
> the crisper history, discoverability, and text format simplicity of GitHub
> wikis.
>
> Punya
> On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I like the idea of having design docs be kept up to date and tracked in
>> git.
>>
>> If the Apache repo isn't a good fit, perhaps we can have a separate repo
>> just for design docs? Maybe something like
>> github.com/spark-docs/spark-docs/
>> ?
>>
>> If there's other stuff we want to track but haven't, perhaps we can
>> generalize the purpose of the repo a bit and rename it accordingly (e.g.
>> spark-misc/spark-misc).
>>
>> Nick
>>
>> On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza 
>> wrote:
>>
>> > My only issue with Google Docs is that they're mutable, so it's
>> difficult
>> > to follow a design's history through its revisions and link up JIRA
>> > comments with the relevant version.
>> >
>> > -Sandy
>> >
>> > On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran > >
>> > wrote:
>> >
>> > >
>> > > One thing to consider is that while docs as PDFs in JIRAs do document
>> the
>> > > original proposal, that's not the place to keep living specifications.
>> > That
>> > > stuff needs to live in SCM, in a format which can be easily
>> maintained,
>> > can
>> > > generate readable documents, and, in an unrealistically ideal world,
>> even
>> > > be used by machines to validate compliance with the design. Test
>> suites
>> > > tend to be the implicit machine-readable part of the specification,
>> > though
>> > > they aren't usually viewed as such.
>> > >
>> > > PDFs of word docs in JIRAs are not the place for ongoing work, even if
>> > the
>> > > early drafts can contain them. Given it's just as easy to point to
>> > markdown
>> > > docs in github by commit ID, that could be an alternative way to
>> publish
>> > > docs, with the document itself being viewed as one of the
>> deliverables.
>> > > When the time comes to update a document, then it's there in the source
>> > tree
>> > > to edit.
>> > >
>> > > If there's a flaw here, it's that design docs are that: the design. The
>> > > implementation may not match, ongoing work will certainly diverge. If
>> the
>> > > design docs aren't kept in sync, then they can mislead people.
>> > Accordingly,
>> > > once the design docs are incorporated into the source tree, keeping
>> them
>> > in
>> > > sync with changes has to be viewed as essential as keeping tests up to
>> date
>> > >
>> > > > On 26 Apr 2015, at 22:34, Patrick Wendell 
>> wrote:
>> > > >
>> > > > I actually don't totally see why we can't use Google Docs provided
>> it
>> > > > is clearly discoverable from the JIRA. It was my understanding that
>> > > > many projects do this. Maybe not (?).
>> > > >
>> > > > If it's a matter of maintaining public record on ASF infrastructure,
>> > > > perhaps we can just automate that if an issue is closed we capture
>> the
>> > > > doc content and attach it to the JIRA as a PDF.
>> > > >
>> > > > My sense is that in general the ASF infrastructure policy is
>> becoming
>> > > > more and more lenient with regards to using third party services,
>> > > > provided they are broadly accessible (such as a public google doc)
>> and
>> > > > can be definitively archived on ASF controlled storage.
>> > > >
>> > > > - Patrick
>> > > >
>> > > > On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen 
>> wrote:
>> > > >> I know I recently used Google Docs from a JIRA, so am guilty as
>> > > >> charged. I don't think there are a lot of design docs in general,
>> but
>> > > >> the ones I've seen have simply pushed docs to a JIRA. (I did the
>> same,
>> > > >> mirroring PDFs of the Google Doc.) I don't think this is hard to
>> > > >> follow.
>> > > >>
>> > > >> I think you can do what you like: make a JIRA and attach files.
>> Make a
>> > > >> WIP PR and attach your notes. Make a Google Doc if you're feeling
>> > > >> transgressive.
>> > > >>
>> > > >> I don't see much of a problem to solve here. In practice there are
>> > > >> plenty of workable options, all of which are mainstream, and so I
>> do
>> > > >> not see an argument that somehow this is solved by letting people
>> make
>> > > >> wikis.
>> > > >>
>> > > >> On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
>> > > >>  wrote:
>> > > >>> Okay, I can understand wanting to keep Git history clean, and
>> avoid
>> > > >>> bottlenecking on committers. Is it reasonable to establish a
>> > > convention of
>> > > >>> having a label, component or (best of all) an issue type for
>> issues
>> > > that are
>> > > >>> associated with design docs? For example, if w

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Punyashloka Biswal
Nick, I like your idea of keeping it in a separate git repository. It seems
to combine the advantages of the present Google Docs approach with the
crisper history, discoverability, and text format simplicity of GitHub
wikis.

Punya
On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas 
wrote:

> I like the idea of having design docs be kept up to date and tracked in
> git.
>
> If the Apache repo isn't a good fit, perhaps we can have a separate repo
> just for design docs? Maybe something like
> github.com/spark-docs/spark-docs/
> ?
>
> If there's other stuff we want to track but haven't, perhaps we can
> generalize the purpose of the repo a bit and rename it accordingly (e.g.
> spark-misc/spark-misc).
>
> Nick
>
> On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza 
> wrote:
>
> > My only issue with Google Docs is that they're mutable, so it's difficult
> > to follow a design's history through its revisions and link up JIRA
> > comments with the relevant version.
> >
> > -Sandy
> >
> > On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran 
> > wrote:
> >
> > >
> > > One thing to consider is that while docs as PDFs in JIRAs do document
> the
> > > original proposal, that's not the place to keep living specifications.
> > That
> > > stuff needs to live in SCM, in a format which can be easily maintained,
> > can
> > > generate readable documents, and, in an unrealistically ideal world,
> even
> > > be used by machines to validate compliance with the design. Test suites
> > > tend to be the implicit machine-readable part of the specification,
> > though
> > > they aren't usually viewed as such.
> > >
> > > PDFs of word docs in JIRAs are not the place for ongoing work, even if
> > the
> > > early drafts can contain them. Given it's just as easy to point to
> > markdown
> > > docs in github by commit ID, that could be an alternative way to
> publish
> > > docs, with the document itself being viewed as one of the deliverables.
> > > When the time comes to update a document, then it's there in the source
> > tree
> > > to edit.
> > >
> > > If there's a flaw here, it's that design docs are that: the design. The
> > > implementation may not match, ongoing work will certainly diverge. If
> the
> > > design docs aren't kept in sync, then they can mislead people.
> > Accordingly,
> > > once the design docs are incorporated into the source tree, keeping
> them
> > in
> > > sync with changes has to be viewed as essential as keeping tests up to
> date
> > >
> > > > On 26 Apr 2015, at 22:34, Patrick Wendell 
> wrote:
> > > >
> > > > I actually don't totally see why we can't use Google Docs provided it
> > > > is clearly discoverable from the JIRA. It was my understanding that
> > > > many projects do this. Maybe not (?).
> > > >
> > > > If it's a matter of maintaining public record on ASF infrastructure,
> > > > perhaps we can just automate that if an issue is closed we capture
> the
> > > > doc content and attach it to the JIRA as a PDF.
> > > >
> > > > My sense is that in general the ASF infrastructure policy is becoming
> > > > more and more lenient with regards to using third party services,
> > > > provided they are broadly accessible (such as a public google doc) and
> > > > can be definitively archived on ASF controlled storage.
> > > >
> > > > - Patrick
> > > >
> > > > On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen 
> wrote:
> > > >> I know I recently used Google Docs from a JIRA, so am guilty as
> > > >> charged. I don't think there are a lot of design docs in general,
> but
> > > >> the ones I've seen have simply pushed docs to a JIRA. (I did the
> same,
> > > >> mirroring PDFs of the Google Doc.) I don't think this is hard to
> > > >> follow.
> > > >>
> > > >> I think you can do what you like: make a JIRA and attach files.
> Make a
> > > >> WIP PR and attach your notes. Make a Google Doc if you're feeling
> > > >> transgressive.
> > > >>
> > > >> I don't see much of a problem to solve here. In practice there are
> > > >> plenty of workable options, all of which are mainstream, and so I do
> > > >> not see an argument that somehow this is solved by letting people
> make
> > > >> wikis.
> > > >>
> > > >> On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
> > > >>  wrote:
> > > >>> Okay, I can understand wanting to keep Git history clean, and avoid
> > > >>> bottlenecking on committers. Is it reasonable to establish a
> > > convention of
> > > >>> having a label, component or (best of all) an issue type for issues
> > > that are
> > > >>> associated with design docs? For example, if we used the existing
> > > >>> "Brainstorming" issue type, and people put their design doc in the
> > > >>> description of the ticket, it would be relatively easy to figure
> out
> > > what
> > > >>> designs are in progress.
> > > >>>
> > > >>> Given the push-back against design docs in Git or on the wiki and
> the
> > > strong
> > > >>> preference for keeping docs on ASF property, I'm a bit surprised
> that
> > > all
> > > >>> the existing design docs are on Googl

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Nicholas Chammas
I like the idea of having design docs be kept up to date and tracked in
git.

If the Apache repo isn't a good fit, perhaps we can have a separate repo
just for design docs? Maybe something like github.com/spark-docs/spark-docs/
?

If there's other stuff we want to track but haven't, perhaps we can
generalize the purpose of the repo a bit and rename it accordingly (e.g.
spark-misc/spark-misc).

Nick

On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza  wrote:

> My only issue with Google Docs is that they're mutable, so it's difficult
> to follow a design's history through its revisions and link up JIRA
> comments with the relevant version.
>
> -Sandy
>
> On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran 
> wrote:
>
> >
> > One thing to consider is that while docs as PDFs in JIRAs do document the
> > original proposal, that's not the place to keep living specifications.
> That
> > stuff needs to live in SCM, in a format which can be easily maintained,
> can
> > generate readable documents, and, in an unrealistically ideal world, even
> > be used by machines to validate compliance with the design. Test suites
> > tend to be the implicit machine-readable part of the specification,
> though
> > they aren't usually viewed as such.
> >
> > PDFs of word docs in JIRAs are not the place for ongoing work, even if
> the
> > early drafts can contain them. Given it's just as easy to point to
> markdown
> > docs in github by commit ID, that could be an alternative way to publish
> > docs, with the document itself being viewed as one of the deliverables.
> > When the time comes to update a document, then it's there in the source
> tree
> > to edit.
> >
> > If there's a flaw here, it's that design docs are that: the design. The
> > implementation may not match, ongoing work will certainly diverge. If the
> > design docs aren't kept in sync, then they can mislead people.
> Accordingly,
> > once the design docs are incorporated into the source tree, keeping them
> in
> > sync with changes has to be viewed as essential as keeping tests up to date
> >
> > > On 26 Apr 2015, at 22:34, Patrick Wendell  wrote:
> > >
> > > I actually don't totally see why we can't use Google Docs provided it
> > > is clearly discoverable from the JIRA. It was my understanding that
> > > many projects do this. Maybe not (?).
> > >
> > > If it's a matter of maintaining public record on ASF infrastructure,
> > > perhaps we can just automate that if an issue is closed we capture the
> > > doc content and attach it to the JIRA as a PDF.
> > >
> > > My sense is that in general the ASF infrastructure policy is becoming
> > > more and more lenient with regards to using third party services,
> > > provided they are broadly accessible (such as a public google doc) and
> > > can be definitively archived on ASF controlled storage.
> > >
> > > - Patrick
> > >
> > > On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen  wrote:
> > >> I know I recently used Google Docs from a JIRA, so am guilty as
> > >> charged. I don't think there are a lot of design docs in general, but
> > >> the ones I've seen have simply pushed docs to a JIRA. (I did the same,
> > >> mirroring PDFs of the Google Doc.) I don't think this is hard to
> > >> follow.
> > >>
> > >> I think you can do what you like: make a JIRA and attach files. Make a
> > >> WIP PR and attach your notes. Make a Google Doc if you're feeling
> > >> transgressive.
> > >>
> > >> I don't see much of a problem to solve here. In practice there are
> > >> plenty of workable options, all of which are mainstream, and so I do
> > >> not see an argument that somehow this is solved by letting people make
> > >> wikis.
> > >>
> > >> On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
> > >>  wrote:
> > >>> Okay, I can understand wanting to keep Git history clean, and avoid
> > >>> bottlenecking on committers. Is it reasonable to establish a
> > convention of
> > >>> having a label, component or (best of all) an issue type for issues
> > that are
> > >>> associated with design docs? For example, if we used the existing
> > >>> "Brainstorming" issue type, and people put their design doc in the
> > >>> description of the ticket, it would be relatively easy to figure out
> > what
> > >>> designs are in progress.
> > >>>
> > >>> Given the push-back against design docs in Git or on the wiki and the
> > strong
> > >>> preference for keeping docs on ASF property, I'm a bit surprised that
> > all
> > >>> the existing design docs are on Google Docs. Perhaps Apache should
> > consider
> > >>> opening up parts of the wiki to a larger group, to better serve this
> > use
> > >>> case.
> > >>>
> > >>> Punya
> > >>>
> > >>> On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell 
> > wrote:
> > 
> >  Using our ASF git repository as a working area for design docs, it
> >  seems potentially concerning to me. It's difficult process wise
> >  because all commits need to go through committers and also, we'd
> >  pollute our git history a lot with random incremen

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Sandy Ryza
My only issue with Google Docs is that they're mutable, so it's difficult
to follow a design's history through its revisions and link up JIRA
comments with the relevant version.

-Sandy

On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran 
wrote:

>
> One thing to consider is that while docs as PDFs in JIRAs do document the
> original proposal, that's not the place to keep living specifications. That
> stuff needs to live in SCM, in a format which can be easily maintained, can
> generate readable documents, and, in an unrealistically ideal world, even
> be used by machines to validate compliance with the design. Test suites
> tend to be the implicit machine-readable part of the specification, though
> they aren't usually viewed as such.
>
> PDFs of word docs in JIRAs are not the place for ongoing work, even if the
> early drafts can contain them. Given it's just as easy to point to markdown
> docs in github by commit ID, that could be an alternative way to publish
> docs, with the document itself being viewed as one of the deliverables.
> When the time comes to update a document, then it's there in the source tree
> to edit.
>
> If there's a flaw here, it's that design docs are that: the design. The
> implementation may not match, ongoing work will certainly diverge. If the
> design docs aren't kept in sync, then they can mislead people. Accordingly,
> once the design docs are incorporated into the source tree, keeping them in
> sync with changes has to be viewed as essential as keeping tests up to date
>
> > On 26 Apr 2015, at 22:34, Patrick Wendell  wrote:
> >
> > I actually don't totally see why we can't use Google Docs provided it
> > is clearly discoverable from the JIRA. It was my understanding that
> > many projects do this. Maybe not (?).
> >
> > If it's a matter of maintaining public record on ASF infrastructure,
> > perhaps we can just automate that if an issue is closed we capture the
> > doc content and attach it to the JIRA as a PDF.
> >
> > My sense is that in general the ASF infrastructure policy is becoming
> > more and more lenient with regards to using third party services,
> > provided they are broadly accessible (such as a public google doc) and
> > can be definitively archived on ASF controlled storage.
> >
> > - Patrick
> >
> > On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen  wrote:
> >> I know I recently used Google Docs from a JIRA, so am guilty as
> >> charged. I don't think there are a lot of design docs in general, but
> >> the ones I've seen have simply pushed docs to a JIRA. (I did the same,
> >> mirroring PDFs of the Google Doc.) I don't think this is hard to
> >> follow.
> >>
> >> I think you can do what you like: make a JIRA and attach files. Make a
> >> WIP PR and attach your notes. Make a Google Doc if you're feeling
> >> transgressive.
> >>
> >> I don't see much of a problem to solve here. In practice there are
> >> plenty of workable options, all of which are mainstream, and so I do
> >> not see an argument that somehow this is solved by letting people make
> >> wikis.
> >>
> >> On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
> >>  wrote:
> >>> Okay, I can understand wanting to keep Git history clean, and avoid
> >>> bottlenecking on committers. Is it reasonable to establish a
> convention of
> >>> having a label, component or (best of all) an issue type for issues
> that are
> >>> associated with design docs? For example, if we used the existing
> >>> "Brainstorming" issue type, and people put their design doc in the
> >>> description of the ticket, it would be relatively easy to figure out
> what
> >>> designs are in progress.
> >>>
> >>> Given the push-back against design docs in Git or on the wiki and the
> strong
> >>> preference for keeping docs on ASF property, I'm a bit surprised that
> all
> >>> the existing design docs are on Google Docs. Perhaps Apache should
> consider
> >>> opening up parts of the wiki to a larger group, to better serve this
> use
> >>> case.
> >>>
> >>> Punya
> >>>
> >>> On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell 
> wrote:
> 
>  Using our ASF git repository as a working area for design docs, it
>  seems potentially concerning to me. It's difficult process wise
>  because all commits need to go through committers and also, we'd
>  pollute our git history a lot with random incremental design updates.
> 
>  The git history is used a lot by downstream packagers, us during our
>  QA process, etc... we really try to keep it oriented around code
>  patches:
> 
>  https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog
> 
>  Committing a polished design doc along with a feature, maybe that's
>  something we could consider. But I still think JIRA is the best
>  location for these docs, consistent with what most other ASF projects
>  do that I know.
> 
>  On Fri, Apr 24, 2015 at 1:19 PM, Cody Koeninger 
>  wrote:
> > Why can't pull requests be used for design docs in Git

Re: Design docs: consolidation and discoverability

2015-04-27 Thread Steve Loughran

One thing to consider is that while docs as PDFs in JIRAs do document the 
original proposal, that's not the place to keep living specifications. That 
stuff needs to live in SCM, in a format which can be easily maintained, can 
generate readable documents, and, in an unrealistically ideal world, even be 
used by machines to validate compliance with the design. Test suites tend to be 
the implicit machine-readable part of the specification, though they aren't 
usually viewed as such.

PDFs of word docs in JIRAs are not the place for ongoing work, even if the 
early drafts can contain them. Given it's just as easy to point to markdown 
docs in github by commit ID, that could be an alternative way to publish docs, 
with the document itself being viewed as one of the deliverables. When the time 
comes to update a document, then it's there in the source tree to edit.

If there's a flaw here, it's that design docs are that: the design. The 
implementation may not match, ongoing work will certainly diverge. If the 
design docs aren't kept in sync, then they can mislead people. Accordingly, 
once the design docs are incorporated into the source tree, keeping them in 
sync with changes has to be viewed as essential as keeping tests up to date

> On 26 Apr 2015, at 22:34, Patrick Wendell  wrote:
> 
> I actually don't totally see why we can't use Google Docs provided it
> is clearly discoverable from the JIRA. It was my understanding that
> many projects do this. Maybe not (?).
> 
> If it's a matter of maintaining public record on ASF infrastructure,
> perhaps we can just automate that if an issue is closed we capture the
> doc content and attach it to the JIRA as a PDF.
> 
> My sense is that in general the ASF infrastructure policy is becoming
> more and more lenient with regards to using third party services,
> provided they are broadly accessible (such as a public google doc) and
> can be definitively archived on ASF controlled storage.
> 
> - Patrick
> 
> On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen  wrote:
>> I know I recently used Google Docs from a JIRA, so am guilty as
>> charged. I don't think there are a lot of design docs in general, but
>> the ones I've seen have simply pushed docs to a JIRA. (I did the same,
>> mirroring PDFs of the Google Doc.) I don't think this is hard to
>> follow.
>> 
>> I think you can do what you like: make a JIRA and attach files. Make a
>> WIP PR and attach your notes. Make a Google Doc if you're feeling
>> transgressive.
>> 
>> I don't see much of a problem to solve here. In practice there are
>> plenty of workable options, all of which are mainstream, and so I do
>> not see an argument that somehow this is solved by letting people make
>> wikis.
>> 
>> On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal
>>  wrote:
>>> Okay, I can understand wanting to keep Git history clean, and avoid
>>> bottlenecking on committers. Is it reasonable to establish a convention of
>>> having a label, component or (best of all) an issue type for issues that are
>>> associated with design docs? For example, if we used the existing
>>> "Brainstorming" issue type, and people put their design doc in the
>>> description of the ticket, it would be relatively easy to figure out what
>>> designs are in progress.
>>> 
>>> Given the push-back against design docs in Git or on the wiki and the strong
>>> preference for keeping docs on ASF property, I'm a bit surprised that all
>>> the existing design docs are on Google Docs. Perhaps Apache should consider
>>> opening up parts of the wiki to a larger group, to better serve this use
>>> case.
>>> 
>>> Punya
>>> 
>>> On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell  wrote:
 
 Using our ASF git repository as a working area for design docs, it
 seems potentially concerning to me. It's difficult process wise
 because all commits need to go through committers and also, we'd
 pollute our git history a lot with random incremental design updates.
 
 The git history is used a lot by downstream packagers, us during our
 QA process, etc... we really try to keep it oriented around code
 patches:
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog
 
 Committing a polished design doc along with a feature, maybe that's
 something we could consider. But I still think JIRA is the best
 location for these docs, consistent with what most other ASF projects
 do that I know.
 
 On Fri, Apr 24, 2015 at 1:19 PM, Cody Koeninger 
 wrote:
> Why can't pull requests be used for design docs in Git if people who
> aren't
> committers want to contribute changes (as opposed to just comments)?
> 
> On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen  wrote:
> 
>> Only catch there is it requires commit access to the repo. We need a
>> way for people who aren't committers to write and collaborate (for
>> point #1)
>> 
>> On Fri, Apr 24, 2015 at 3:56 PM, Punyashloka Biswal
>>

Exception in using updateStateByKey

2015-04-27 Thread Sea
Hi, all:
I use the function updateStateByKey in Spark Streaming. I need to store the
states for one minute, so I set "spark.cleaner.ttl" to 120; the batch duration
is 2 seconds, but it throws an exception:

Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)

at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


My code is:


ssc = StreamingContext(sc,2)
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60,2).map(lambda x: analyzeMessage(x[1]))\
.filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \
.filter(lambda x: x[1]['isExisted'] != 1) \
.foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))
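
One plausible reading (an assumption, not a verified diagnosis):
spark.cleaner.ttl=120 is too close to the lifetime of the 60-second window
plus the keyed state, so the TTL cleaner deletes received-data files that the
window still references, which would produce the FileNotFoundException above.
A minimal sketch of the usual setup, leaving the TTL unset and relying on
checkpointing for cleanup:

  # Sketch: leave spark.cleaner.ttl unset (or set it well above the window
  # and state lifetime). updateStateByKey already requires checkpointing,
  # and the checkpointing machinery cleans up old data on its own.
  ssc = StreamingContext(sc, 2)
  ssc.checkpoint("spark/ck/hdfsaudit")  # the directory seen in the log above
  kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})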

Re: Is there any particular reason why there's no Java counterpart in Streaming Guide's "Design Patterns for using foreachRDD" section?

2015-04-27 Thread Sean Owen
My guess is since it says "for example (in Scala)" that this started
as Scala-only and then Python was tacked on as a one-off, and Java
never got added. I think you'd be welcome to add it. It's not an
obscure example and one people might want to see in Java.

On Mon, Apr 27, 2015 at 4:34 AM, Emre Sevinc  wrote:
> Hello,
>
> Is there any particular reason why there's no Java counterpart in Streaming
> Guide's "Design Patterns for using foreachRDD" section?
>
>https://spark.apache.org/docs/latest/streaming-programming-guide.html
>
> Up to that point, each source code example includes corresponding Java (and
> sometimes Python) source code for the Scala examples, but in section
> "Design Patterns for using foreachRDD", the code examples are only in Scala
> and Python.
>
> After that section comes "DataFrame and SQL Operations", and it continues
> giving examples in Scala, Java, and Python.
>
> The reason I'm asking: if there's no particular reason, maybe I can open a
> JIRA ticket and contribute to that part of the documentation?
>
> --
> Emre Sevinç

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
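
For reference, the pattern that section documents amortizes connection setup
per partition rather than per record. A sketch of the Python variant, with
createNewConnection() as a hypothetical stand-in for a real client factory
(a Java translation would follow the same shape):

  # Sketch of the foreachRDD design pattern: one connection per partition,
  # stream all records through it, then close it.
  def send_partition(records):
      connection = createNewConnection()  # hypothetical helper
      for record in records:
          connection.send(record)
      connection.close()

  dstream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))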



Re: creating hive packages for spark

2015-04-27 Thread yash datta
Hi,

you can build spark-project hive from here :

https://github.com/pwendell/hive/tree/0.13.1-shaded-protobuf

Hope this helps.


On Mon, Apr 27, 2015 at 3:23 PM, Manku Timma  wrote:

> Hello Spark developers,
> I want to understand the procedure to create the org.spark-project.hive
> jars. Is this documented somewhere? I am having issues with -Phive-provided
> with my private hive13 jars and want to check if using spark's procedure
> helps.
>



-- 
When events unfold with calm and ease
When the winds that blow are merely breeze
Learn from nature, from birds and bees
Live your life in love, and let joy not cease.


creating hive packages for spark

2015-04-27 Thread Manku Timma
Hello Spark developers,
I want to understand the procedure to create the org.spark-project.hive
jars. Is this documented somewhere? I am having issues with -Phive-provided
with my private hive13 jars and want to check if using spark's procedure
helps.


Is there any particular reason why there's no Java counterpart in Streaming Guide's "Design Patterns for using foreachRDD" section?

2015-04-27 Thread Emre Sevinc
Hello,

Is there any particular reason why there's no Java counterpart in Streaming
Guide's "Design Patterns for using foreachRDD" section?

   https://spark.apache.org/docs/latest/streaming-programming-guide.html

Up to that point, each source code example includes corresponding Java (and
sometimes Python) source code for the Scala examples, but in section
"Design Patterns for using foreachRDD", the code examples are only in Scala
and Python.

After that section comes "DataFrame and SQL Operations", and it continues
giving examples in Scala, Java, and Python.

The reason I'm asking: if there's no particular reason, maybe I can open a
JIRA ticket and contribute to that part of the documentation?

-- 
Emre Sevinç