Re: non-deprecation compiler warnings are upgraded to build errors now
Would it make sense to isolate the use of deprecated APIs to a subset of projects? That way we could turn on more stringent checks for the other ones. Punya On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote: Hi all, FYI, we just merged a patch that fails the build if there is a Scala compiler warning (unless it is a deprecation warning). In the past, many compiler warnings were actually caused by legitimate bugs that we needed to address. However, if we don't fail the build on warnings, people don't pay attention to them at all (it is also tough to pay attention, since there are a lot of deprecation warnings from unit tests exercising deprecated APIs and from our reliance on deprecated Hadoop APIs). Note that ideally we should be able to mark deprecation warnings as errors as well. However, because the Scala compiler offers no way to suppress individual warning messages, we cannot do that (since we do need to access deprecated APIs in Hadoop).
Re: PySpark on PyPi
I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is the fact that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas. Punya On Wed, Jul 22, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote: // + *Davies* for his comments // + Punya for SA For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py that declares its dependencies, and a published distribution, depending on pyspark would just be a matter of adding pyspark to my setup.py dependencies. Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying. I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the Spark distribution, the user would be alerted immediately. *Davies*, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464? On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot o.girar...@lateral-thoughts.com wrote: Ok, I get it. Now what can we do to improve the current situation? Because right now, if I want to set up a CI env for PySpark, I have to: 1- download a pre-built version of pyspark and unzip it somewhere on every agent 2- define the SPARK_HOME env variable 3- symlink this distribution's pyspark dir into the Python install's site-packages/ directory and, if I rely on additional packages (like databricks' Spark-CSV project), I have to (except if I'm mistaken) 4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent 5- add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults.conf file Then finally we can launch our unit/integration tests. Some issues are related to spark-packages, some to the lack of Python-based dependency management, and some to the way SparkContexts are launched when using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions; I didn't check, and considering spark-shell downloads such dependencies automatically, I think that if nothing's done yet, it will be (I guess?). For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300MB Spark distribution on PyPI; maybe there's a better compromise? Regards, Olivier.
On Fri, Jun 5, 2015 at 10:12 PM Jey Kottalam j...@cs.berkeley.edu wrote: Couldn't we have a pip-installable pyspark package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER=local[4]? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook. I did a test and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH: PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py -Jey On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen rosenvi...@gmail.com wrote: This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267 There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components or vice versa. On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot o.girar...@lateral-thoughts.com
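To make the packaging idea concrete, here is a minimal sketch of the kind of setup.py Justin describes for the python/ directory. The package name, version pin, and dependency list are illustrative assumptions for this example, not the actual packaging metadata that any Spark release ships:

    # Hypothetical setup.py for Spark's python/ directory; names and version
    # pins below are illustrative only.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.4.0",  # would have to track the Spark release exactly
        packages=find_packages(exclude=["*.tests", "*.tests.*"]),
        # Declaring dependencies here is what would let pip pull in Py4J (and
        # Pandas for df.toPandas()) instead of users hitting the gap at runtime.
        install_requires=[
            "py4j==0.8.2.1",
            "pandas>=0.13",
        ],
    )

The exact-version check Justin proposes could then live at import time, comparing the installed package's version string against the version reported by the JVM side (e.g. sc.version) and failing fast on any mismatch.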
Re: Python UDF performance at large scale
Hi Davies, In general, do we expect people to use CPython only for heavyweight UDFs that invoke an external library? Are there any examples of using Jython, especially performance comparisons to Java/Scala and CPython? When using Jython, do you expect the driver to send code to the executor as a string, or is there a good way to serialize Jython lambdas? (For context, I was unable to serialize Nashorn lambdas when I tried to use them in Spark.) Punya On Wed, Jun 24, 2015 at 2:26 AM Davies Liu dav...@databricks.com wrote: Fair points, I also like simpler solutions. The overhead of a Python task could be a few milliseconds, which means we should also evaluate them in batches (one Python task per batch). Decreasing the batch size for UDFs sounds reasonable to me, together with other tricks to reduce the data in the socket/pipe buffer. BTW, what does your UDF look like? How about using Jython to run simple Python UDFs (those without external libraries)? On Tue, Jun 23, 2015 at 8:21 PM, Justin Uang justin.u...@gmail.com wrote: // + punya Thanks for your quick response! I'm not sure that using an unbounded buffer is a good solution to the locking problem. For example, in the situation where I had 500 columns, I am in fact storing 499 extra columns on the Java side, which might make me OOM if I have to store many rows. In addition, if I am using an AutoBatchedSerializer, the Java side might have to write 1 << 16 == 65536 rows before Python starts outputting elements, in which case the Java side has to buffer 65536 complete rows. In general it seems fragile to rely on blocking behavior in the Python coprocess. By contrast, it's very easy to verify the correctness and performance characteristics of the synchronous blocking solution. On Tue, Jun 23, 2015 at 7:21 PM Davies Liu dav...@databricks.com wrote: Thanks for looking into it. I like the idea of having a ForkingIterator. If we have an unlimited buffer in it, then we will not have the deadlock problem, I think. The writing thread will be blocked by the Python process, so there will not be many rows buffered (though it could still be a reason to OOM). At least, this approach is better than the current one. Could you create a JIRA and send out the PR? On Tue, Jun 23, 2015 at 3:27 PM, Justin Uang justin.u...@gmail.com wrote: BLUF: BatchPythonEvaluation's implementation is unusable at large scale, but I have a proof-of-concept implementation that avoids caching the entire dataset. Hi, We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse the PythonRDD code. It caches the entire child RDD so that it can do two passes over the data: one to feed the PythonRDD, and one to join the Python lambda results with the original rows (which may have Java objects that should be passed through). In addition, it caches all the columns, even the ones that don't need to be processed by the Python UDF. In the case I was working with, I had a 500-column table, and I wanted to use a Python UDF for one column, and it ended up caching all 500 columns. I have a working solution over here that does it in one pass over the data, avoiding caching ( https://github.com/justinuang/spark/commit/c1a415a18d31226ac580f1a9df7985571d03199b ). With this patch, I go from a job that takes 20 minutes and then OOMs, to a job that finishes completely in 3 minutes.
It is indeed quite hacky and prone to deadlocks, since there is buffering in many locations: - NEW: the ForkingIterator LinkedBlockingDeque - batching the rows before pickling them - OS buffers on both sides - pyspark.serializers.BatchedSerializer We can avoid deadlock by being very disciplined. For example, we can have the ForkingIterator always check whether the LinkedBlockingDeque is full and, if so: Java - flush the Java pickling buffer - send a flush command to the Python process - flush the Java-side OS buffer Python - flush the BatchedSerializer - os.flush() I haven't added this yet. This is getting very complex, however. Another model would be to change the protocol between the Java side and the worker to be a synchronous request/response. This has the disadvantage that the CPU isn't doing anything while a batch is being sent across, but it has the huge advantage of simplicity. In addition, I imagine that the bottleneck isn't the actual IO between the processes, but rather the serialization of Java objects into pickled bytes and the deserialization/serialization plus the Python loops on the Python side. Another advantage is that we won't be taking more than 100% CPU, since only one thread is doing CPU work at a time.
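For illustration, here is a stripped-down sketch of the synchronous request/response model described above. The length-prefixed pickle framing and function names are assumptions made for this example, not the actual PythonRDD/worker protocol:

    # Illustrative Python worker loop for synchronous, batched UDF evaluation.
    # The JVM side (not shown) would write one length-prefixed pickled batch and
    # then block until it has read the corresponding response, so at most one
    # batch is in flight on either side at any time.
    import pickle
    import struct

    def run_worker(infile, outfile, udf):
        while True:
            header = infile.read(4)
            if not header:                          # stream closed: no more batches
                break
            (length,) = struct.unpack(">i", header)
            rows = pickle.loads(infile.read(length))   # request: one batch of rows
            results = [udf(row) for row in rows]       # evaluate the UDF per row
            payload = pickle.dumps(results)
            outfile.write(struct.pack(">i", len(payload)))
            outfile.write(payload)                     # response: one batch of results
            outfile.flush()                            # nothing lingers in buffers

The trade-off is as stated above: the two processes never compute concurrently, but there is no unbounded queue anywhere, so the deadlock and OOM scenarios disappear by construction.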
Re: Spark 1.4.0 pyspark and pylint breaking
Davies: Can we use relative imports (e.g. `from . import types`) in the unit tests in order to disambiguate between the global and local module? Punya On Tue, May 26, 2015 at 3:09 PM Justin Uang justin.u...@gmail.com wrote: Thanks for clarifying! I don't understand Python package and module names that well, but I thought that the package namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote: There is a module called 'types' in Python 3: davies@localhost:~/work/spark$ python3 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import types >>> types <module 'types' from '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py'> Without renaming, our `types.py` will conflict with it when you run the unit tests in pyspark/sql/. On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com wrote: In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is due to some naming conflict with Python 3, but what was the error that showed up? The reason I'm asking about this is that it's messing with pylint, since pylint can no longer statically find the module. I also tried importing the package in an init-hook so that __init__ would be run, but that isn't what the discovery mechanism uses; I imagine it's probably just crawling the directory structure. One way to work around this would be something akin to this ( http://stackoverflow.com/questions/9602811/how-to-tell-pylint-to-ignore-certain-imports ), where I would have to create a fake module, but I would probably be missing a ton of pylint features for users of that module, and it's pretty hacky.
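For readers following along, the "magic" being discussed boils down to a sys.modules alias. A simplified sketch of what a package __init__.py can do (an approximation of the idea, not the literal code in pyspark/sql/__init__.py) is:

    # Simplified sketch: keep the implementation in _types.py so it never shadows
    # the standard library's 'types' module during test discovery, then re-expose
    # it under the old name so `import pyspark.sql.types` keeps working.
    import sys
    from . import _types as types

    types.__name__ = __name__ + ".types"       # cosmetic: repr() shows the public name
    sys.modules[__name__ + ".types"] = types   # make the dotted import resolve again

Because the alias only exists at runtime, a purely static tool like pylint never sees a module named pyspark.sql.types, which is exactly why the linting described above breaks.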
Re: [VOTE] Release Apache Spark 1.4.0 (RC1)
Thanks! I realize that manipulating the published version in the pom is a bit inconvenient, but it's really useful to have clear version identifiers when we're juggling different versions and testing them out. For example, this will come in handy when we compare 1.4.0-rc1 and 1.4.0-rc2 in a couple of weeks :) Punya On Tue, May 19, 2015 at 12:39 PM Patrick Wendell pwend...@gmail.com wrote: Punya, Let me see if I can publish these under rc1 as well. In the future this will all be automated, but currently it's a somewhat manual task. - Patrick On Tue, May 19, 2015 at 9:32 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: When publishing future RCs to the staging repository, would it be possible to use a version number that includes the rc1 designation? In the current setup, when I run a build against the artifacts at https://repository.apache.org/content/repositories/orgapachespark-1092/org/apache/spark/spark-core_2.10/1.4.0/ , my local Maven cache will get polluted with things that claim to be 1.4.0 but aren't. It would be preferable for the version number to be 1.4.0-rc1 instead. Thanks! Punya On Tue, May 19, 2015 at 12:20 PM Sean Owen so...@cloudera.com wrote: Before I vote, I wanted to point out there are still 9 Blockers for 1.4.0. I'd like to use this status to really mean "must happen before the release." Many of these may be already fixed, or aren't really blockers -- they can just be updated accordingly. I bet at least one will require further work if it's really meant for 1.4, so all this means is there is likely to be another RC. We should still kick the tires on RC1. (I also assume we should be extra conservative about what is merged into 1.4 at this point.)
SPARK-6784 (SQL) Clean up all the inbound/outbound conversions for DateType (Adrian Wang)
SPARK-6811 (SparkR) Building binary R packages for SparkR (Shivaram Venkataraman)
SPARK-6941 (SQL) Provide a better error message to explain that tables created from RDDs are immutable
SPARK-7158 (SQL) collect and take return different results
SPARK-7478 (SQL) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext (Tathagata Das)
SPARK-7616 (SQL) Overwriting a partitioned parquet table corrupts data (Cheng Lian)
SPARK-7654 (SQL) DataFrameReader and DataFrameWriter for input/output API (Reynold Xin)
SPARK-7662 (SQL) Exception of multi-attribute generator analysis in projection
SPARK-7713 (SQL) Use shared broadcast hadoop conf for partitioned table scan (Yin Huai)
On Tue, May 19, 2015 at 5:10 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.4.0! The tag to be voted on is v1.4.0-rc1 (commit 777a081): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=777a08166f1fb144146ba32581d4632c3466541e The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.4.0-rc1/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/pwendell.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1092/ The documentation corresponding to this release can be found at: http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/ Please vote on releasing this package as Apache Spark 1.4.0! The vote is open until Friday, May 22, at 17:03 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.4.0 [ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/ == How can I help test this release? == If you are a Spark user, you can help us test this release by taking a Spark 1.3 workload and running on this release candidate, then reporting any regressions. == What justifies a -1 vote for this release? == This vote is happening towards the end of the 1.4 QA period, so -1 votes should only occur for significant regressions from 1.3.1. Bugs already present in 1.3.X, minor regressions, or bugs related to new features will not block this release.
Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__
Is there a foolproof way to access methods exclusively (instead of picking between columns and methods at runtime)? Here are two ideas, neither of which seems particularly Pythonic: - pyspark.sql.methods(df).name() - df.__methods__.name() Punya On Fri, May 8, 2015 at 10:06 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: And a link to SPARK-7035 https://issues.apache.org/jira/browse/SPARK-7035 (which Xiangrui mentioned in his initial email) for the lazy. On Fri, May 8, 2015 at 3:41 AM Xiangrui Meng men...@gmail.com wrote: On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I don't know much about Python style, but I think the point Wes made about usability on the JIRA is pretty powerful. IMHO the number of methods on a Spark DataFrame might not be much more than on Pandas. Given that it looks like users are okay with the possibility of collisions in Pandas, I think sticking with (1) is not a bad idea. This is true for interactive work. Spark's DataFrames can handle really large datasets, which might be used in production workflows. So I think it is reasonable for us to care more about compatibility issues than Pandas does. Also, is it possible to detect such collisions in Python? A fourth option might be to detect that `df` contains a column named `name` and print a warning in `df.name` which tells the user that the method is overriding the column. Maybe we can inspect the frame in which `df.name` gets called and warn users in `df.select(df.name)` but not in `name = df.name`. This could be tricky to implement. -Xiangrui Thanks Shivaram On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng men...@gmail.com wrote: Hi all, In PySpark, a DataFrame column can be referenced using df["abcd"] (__getitem__) and df.abcd (__getattr__). There is a discussion on SPARK-7035 on compatibility issues with the __getattr__ approach, and I want to collect more inputs on this. Basically, if in the future we introduce a new method to DataFrame, it may break user code that uses the same attr to reference a column, or silently change its behavior. For example, if we add name() to DataFrame in the next release, all existing code using `df.name` to reference a column called "name" will break. If we add `name()` as a property instead of a method, all existing code using `df.name` may still work but with a different meaning: `df.select(df.name)` no longer selects the column called "name" but the column that has the same name as `df.name`. There are several proposed solutions: 1. Keep both df.abcd and df["abcd"], and encourage users to use the latter, which is future proof. This is the current solution in master (https://github.com/apache/spark/pull/5971). But I think users may still be unaware of the compatibility issue and prefer `df.abcd` to `df["abcd"]` because the former can be auto-completed. 2. Drop df.abcd and support df["abcd"] only. From Wes' comment on the JIRA page: I actually dragged my feet on the __getattr__ issue for several months back in the day, then finally added it (and tab completion in IPython with __dir__), and immediately noticed a huge quality-of-life improvement when using pandas for actual (esp. interactive) work. 3. Replace df.abcd by df.abcd_ (with a suffix _). Both df.abcd_ and df["abcd"] would be future proof, and df.abcd_ could be auto-completed. The tradeoff is apparently the extra _ appearing in the code. My preference is 3 > 1 > 2. Your inputs would be greatly appreciated. Thanks!
Best, Xiangrui
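As a rough illustration of the fourth option Shivaram floats (and of why df["abcd"] is the future-proof spelling), the sketch below warns whenever a column is served through attribute access. It is a toy stand-in, not the real pyspark.sql.DataFrame, which resolves columns through the JVM:

    # Toy DataFrame showing how attribute access could warn about column/method
    # collisions; the dict-backed column storage is a simplified stand-in.
    import warnings

    class DataFrame(object):
        def __init__(self, columns):
            self._columns = dict(columns)      # name -> column object

        def __getitem__(self, name):
            return self._columns[name]         # df["abcd"]: always means the column

        def __getattr__(self, name):
            # Only reached when 'name' is not an existing method or attribute,
            # so a method added in a future release would silently win here.
            try:
                column = self._columns[name]
            except KeyError:
                raise AttributeError(name)
            warnings.warn("df.%s refers to a column; prefer df[%r], which cannot "
                          "be shadowed by methods added later" % (name, name))
            return column

With this in place, something like DataFrame({"name": object()}).name still works but nudges users toward the bracket form, which addresses the silent-behavior-change concern without dropping attribute access outright.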
Re: [build infra] quick downtime again tomorrow morning for DOCKER
Just curious: will docker allow new capabilities for the Spark build? (Where can I read more?) Punya On Fri, May 8, 2015 at 10:00 AM shane knapp skn...@berkeley.edu wrote: this is happening now. On Thu, May 7, 2015 at 3:40 PM, shane knapp skn...@berkeley.edu wrote: yes, docker. that wonderful little wrapper for linux containers will be installed and ready for play on all of the jenkins workers tomorrow morning. the downtime will be super quick: i just need to kill the jenkins slaves' ssh connections and relaunch to add the jenkins user to the docker group. this will begin at around 7am PDT and shouldn't take long at all. shane
Re: [discuss] ending support for Java 6?
I'm in favor of ending support for Java 6. We should also articulate a policy on how long we want to support current and future versions of Java after Oracle declares them EOL (Java 7 will be in that bucket in a matter of days). Punya On Thu, Apr 30, 2015 at 1:18 PM shane knapp skn...@berkeley.edu wrote: something to keep in mind: we can easily support java 6 for the build environment, particularly if there's a definite EOL. i'd like to fix our java versioning 'problem', and this could be a big instigator... right now we're hackily setting java_home in test invocation on jenkins, which really isn't the best. if i decide, within jenkins, to reconfigure every build to 'do the right thing' WRT java version, then i will clean up the old mess and pay down on some technical debt. or i can just install java 6 and we use that as JAVA_HOME on a build-by-build basis. this will be a few days of prep and another morning-long downtime if i do the right thing (within jenkins), and only a couple of hours the hacky way (system level). either way, we can test on java 6. :) On Thu, Apr 30, 2015 at 1:00 PM, Koert Kuipers ko...@tresata.com wrote: nicholas started it! :) for java 6 i would have said the same thing about 1 year ago: it is foolish to drop it. but i think the time is right about now. about half our clients are on java 7 and the other half have active plans to migrate to it within 6 months. On Thu, Apr 30, 2015 at 3:57 PM, Reynold Xin r...@databricks.com wrote: Guys thanks for chiming in, but please focus on Java here. Python is an entirely separate issue. On Thu, Apr 30, 2015 at 12:53 PM, Koert Kuipers ko...@tresata.com wrote: i am not sure eol means much if it is still actively used. we have a lot of clients with centos 5 (for which we still support python 2.4 in some form or another, fun!). most of them are on centos 6, which means python 2.6. by cutting out python 2.6 you would cut out the majority of the actual clusters i am aware of. unless you intention is to truly make something academic i dont think that is wise. On Thu, Apr 30, 2015 at 3:48 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that’s for another thread.) (To continue the parenthetical, Python 2.6 was in fact EOL-ed in October of 2013. https://www.python.org/download/releases/2.6.9/) On Thu, Apr 30, 2015 at 3:18 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I understand the concern about cutting out users who still use Java 6, and I don't have numbers about how many people are still using Java 6. But I want to say at a high level that I support deprecating older versions of stuff to reduce our maintenance burden and let us use more modern patterns in our code. Maintenance always costs way more than initial development over the lifetime of a project, and for that reason anti-support is just as important as support. (On that note, I think Python 2.6 should be next on the chopping block sometime later this year, but that's for another thread.) Nick On Thu, Apr 30, 2015 at 3:03 PM Reynold Xin r...@databricks.com wrote: This has been discussed a few times in the past, but now Oracle has ended support for Java 6 for over a year, I wonder if we should just drop Java 6 support. There is one outstanding issue Tom has brought to my attention: PySpark on YARN doesn't work well with Java 7/8, but we have an outstanding pull request to fix that. 
https://issues.apache.org/jira/browse/SPARK-6869 https://issues.apache.org/jira/browse/SPARK-1920
Re: Plans for upgrading Hive dependency?
Thanks Marcelo and Patrick - I don't know how I missed that ticket in my Jira search earlier. Is anybody working on the sub-issues yet, or is there a design doc I should look at before taking a stab? Regards, Punya On Mon, Apr 27, 2015 at 3:56 PM Patrick Wendell pwend...@gmail.com wrote: Hey Punya, There is some ongoing work to help make Hive upgrades more manageable and allow us to support multiple versions of Hive. Once we do that, it will be much easier for us to upgrade. https://issues.apache.org/jira/browse/SPARK-6906 - Patrick On Mon, Apr 27, 2015 at 12:47 PM, Marcelo Vanzin van...@cloudera.com wrote: That's a lot more complicated than you might think. We've done some basic work to get HiveContext to compile against Hive 1.1.0. Here's the code: https://github.com/cloudera/spark/commit/00e2c7e35d4ac236bcfbcd3d2805b483060255ec We didn't send that upstream because it only solves half of the problem; the hive-thriftserver is disabled in our CDH build because it uses a lot of Hive APIs that have been removed in 1.1.0, so even getting it to compile is really complicated. If there's interest in getting the HiveContext part fixed up, I can send a PR for that code. But at this time I don't really have plans to look at the thrift server. On Mon, Apr 27, 2015 at 11:58 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear Spark devs, Is there a plan for staying up-to-date with current (and future) versions of Hive? Spark currently supports version 0.13 (June 2014), but the latest version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about updating beyond 0.13, so I was wondering if this was intentional or it was just that nobody had started work on this yet. I'd be happy to work on a PR for the upgrade if one of the core developers can tell me what pitfalls to watch out for. Punya -- Marcelo
Re: Design docs: consolidation and discoverability
Nick, I like your idea of keeping it in a separate git repository. It seems to combine the advantages of the present Google Docs approach with the crisper history, discoverability, and text format simplicity of GitHub wikis. Punya On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I like the idea of having design docs be kept up to date and tracked in git. If the Apache repo isn't a good fit, perhaps we can have a separate repo just for design docs? Maybe something like github.com/spark-docs/spark-docs/ ? If there's other stuff we want to track but haven't, perhaps we can generalize the purpose of the repo a bit and rename it accordingly (e.g. spark-misc/spark-misc). Nick On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza sandy.r...@cloudera.com wrote: My only issue with Google Docs is that they're mutable, so it's difficult to follow a design's history through its revisions and link up JIRA comments with the relevant version. -Sandy On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran ste...@hortonworks.com wrote: One thing to consider is that while docs as PDFs in JIRAs do document the original proposal, that's not the place to keep living specifications. That stuff needs to live in SCM, in a format which can be easily maintained, can generate readable documents, and, in an unrealistically ideal world, even be used by machines to validate compliance with the design. Test suites tend to be the implicit machine-readable part of the specification, though they aren't usually viewed as such. PDFs of word docs in JIRAs are not the place for ongoing work, even if the early drafts can contain them. Given it's just as easy to point to markdown docs in github by commit ID, that could be an alternative way to publish docs, with the document itself being viewed as one of the deliverables. When the time comes to update a document, then its there in the source tree to edit. If there's a flaw here, its that design docs are that: the design. The implementation may not match, ongoing work will certainly diverge. If the design docs aren't kept in sync, then they can mislead people. Accordingly, once the design docs are incorporated into the source tree, keeping them in sync with changes has be viewed as essential as keeping tests up to date On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com wrote: I actually don't totally see why we can't use Google Docs provided it is clearly discoverable from the JIRA. It was my understanding that many projects do this. Maybe not (?). If it's a matter of maintaining public record on ASF infrastructure, perhaps we can just automate that if an issue is closed we capture the doc content and attach it to the JIRA as a PDF. My sense is that in general the ASF infrastructure policy is becoming more and more lenient with regards to using third party services, provided the are broadly accessible (such as a public google doc) and can be definitively archived on ASF controlled storage. - Patrick On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com wrote: I know I recently used Google Docs from a JIRA, so am guilty as charged. I don't think there are a lot of design docs in general, but the ones I've seen have simply pushed docs to a JIRA. (I did the same, mirroring PDFs of the Google Doc.) I don't think this is hard to follow. I think you can do what you like: make a JIRA and attach files. Make a WIP PR and attach your notes. Make a Google Doc if you're feeling transgressive. I don't see much of a problem to solve here. 
In practice there are plenty of workable options, all of which are mainstream, and so I do not see an argument that somehow this is solved by letting people make wikis. On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Okay, I can understand wanting to keep Git history clean, and avoid bottlenecking on committers. Is it reasonable to establish a convention of having a label, component or (best of all) an issue type for issues that are associated with design docs? For example, if we used the existing Brainstorming issue type, and people put their design doc in the description of the ticket, it would be relatively easy to figure out what designs are in progress. Given the push-back against design docs in Git or on the wiki and the strong preference for keeping docs on ASF property, I'm a bit surprised that all the existing design docs are on Google Docs. Perhaps Apache should consider opening up parts of the wiki to a larger group, to better serve this use case. Punya On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell pwend...@gmail.com wrote
Plans for upgrading Hive dependency?
Dear Spark devs, Is there a plan for staying up-to-date with current (and future) versions of Hive? Spark currently supports version 0.13 (June 2014), but the latest version of Hive is 1.1.0 (March 2015). I don't see any Jira tickets about updating beyond 0.13, so I was wondering if this was intentional or it was just that nobody had started work on this yet. I'd be happy to work on a PR for the upgrade if one of the core developers can tell me what pitfalls to watch out for. Punya
Re: Design docs: consolidation and discoverability
Github's wiki is just another Git repo. If we use a separate repo, it's probably easiest to use the wiki git repo rather than the primary git repo. Punya On Mon, Apr 27, 2015 at 1:50 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: Oh, a GitHub wiki (which is separate from having docs in a repo) is yet another approach we could take, though if we want to do that on the main Spark repo we'd need permission from Apache, which may be tough to get... On Mon, Apr 27, 2015 at 1:47 PM Punyashloka Biswal punya.bis...@gmail.com wrote: Nick, I like your idea of keeping it in a separate git repository. It seems to combine the advantages of the present Google Docs approach with the crisper history, discoverability, and text format simplicity of GitHub wikis. Punya On Mon, Apr 27, 2015 at 1:30 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I like the idea of having design docs be kept up to date and tracked in git. If the Apache repo isn't a good fit, perhaps we can have a separate repo just for design docs? Maybe something like github.com/spark-docs/spark-docs/ ? If there's other stuff we want to track but haven't, perhaps we can generalize the purpose of the repo a bit and rename it accordingly (e.g. spark-misc/spark-misc). Nick On Mon, Apr 27, 2015 at 1:21 PM Sandy Ryza sandy.r...@cloudera.com wrote: My only issue with Google Docs is that they're mutable, so it's difficult to follow a design's history through its revisions and link up JIRA comments with the relevant version. -Sandy On Mon, Apr 27, 2015 at 7:54 AM, Steve Loughran ste...@hortonworks.com wrote: One thing to consider is that while docs as PDFs in JIRAs do document the original proposal, that's not the place to keep living specifications. That stuff needs to live in SCM, in a format which can be easily maintained, can generate readable documents, and, in an unrealistically ideal world, even be used by machines to validate compliance with the design. Test suites tend to be the implicit machine-readable part of the specification, though they aren't usually viewed as such. PDFs of word docs in JIRAs are not the place for ongoing work, even if the early drafts can contain them. Given it's just as easy to point to markdown docs in github by commit ID, that could be an alternative way to publish docs, with the document itself being viewed as one of the deliverables. When the time comes to update a document, then its there in the source tree to edit. If there's a flaw here, its that design docs are that: the design. The implementation may not match, ongoing work will certainly diverge. If the design docs aren't kept in sync, then they can mislead people. Accordingly, once the design docs are incorporated into the source tree, keeping them in sync with changes has be viewed as essential as keeping tests up to date On 26 Apr 2015, at 22:34, Patrick Wendell pwend...@gmail.com wrote: I actually don't totally see why we can't use Google Docs provided it is clearly discoverable from the JIRA. It was my understanding that many projects do this. Maybe not (?). If it's a matter of maintaining public record on ASF infrastructure, perhaps we can just automate that if an issue is closed we capture the doc content and attach it to the JIRA as a PDF. My sense is that in general the ASF infrastructure policy is becoming more and more lenient with regards to using third party services, provided the are broadly accessible (such as a public google doc) and can be definitively archived on ASF controlled storage. 
- Patrick On Fri, Apr 24, 2015 at 4:57 PM, Sean Owen so...@cloudera.com wrote: I know I recently used Google Docs from a JIRA, so am guilty as charged. I don't think there are a lot of design docs in general, but the ones I've seen have simply pushed docs to a JIRA. (I did the same, mirroring PDFs of the Google Doc.) I don't think this is hard to follow. I think you can do what you like: make a JIRA and attach files. Make a WIP PR and attach your notes. Make a Google Doc if you're feeling transgressive. I don't see much of a problem to solve here. In practice there are plenty of workable options, all of which are mainstream, and so I do not see an argument that somehow this is solved by letting people make wikis. On Fri, Apr 24, 2015 at 7:42 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Okay, I can understand wanting to keep Git history clean, and avoid bottlenecking on committers. Is it reasonable to establish a convention of having a label, component or (best of all) an issue type for issues that are associated with design docs? For example, if we used the existing Brainstorming issue type, and people put their design doc
Re: Design docs: consolidation and discoverability
Okay, I can understand wanting to keep Git history clean, and avoid bottlenecking on committers. Is it reasonable to establish a convention of having a label, component or (best of all) an issue type for issues that are associated with design docs? For example, if we used the existing Brainstorming issue type, and people put their design doc in the description of the ticket, it would be relatively easy to figure out what designs are in progress. Given the push-back against design docs in Git or on the wiki and the strong preference for keeping docs on ASF property, I'm a bit surprised that all the existing design docs are on Google Docs. Perhaps Apache should consider opening up parts of the wiki to a larger group, to better serve this use case. Punya On Fri, Apr 24, 2015 at 5:01 PM Patrick Wendell pwend...@gmail.com wrote: Using our ASF git repository as a working area for design docs, it seems potentially concerning to me. It's difficult process wise because all commits need to go through committers and also, we'd pollute our git history a lot with random incremental design updates. The git history is used a lot by downstream packagers, us during our QA process, etc... we really try to keep it oriented around code patches: https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog Committing a polished design doc along with a feature, maybe that's something we could consider. But I still think JIRA is the best location for these docs, consistent with what most other ASF projects do that I know. On Fri, Apr 24, 2015 at 1:19 PM, Cody Koeninger c...@koeninger.org wrote: Why can't pull requests be used for design docs in Git if people who aren't committers want to contribute changes (as opposed to just comments)? On Fri, Apr 24, 2015 at 2:57 PM, Sean Owen so...@cloudera.com wrote: Only catch there is it requires commit access to the repo. We need a way for people who aren't committers to write and collaborate (for point #1) On Fri, Apr 24, 2015 at 3:56 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: Sandy, doesn't keeping (in-progress) design docs in Git satisfy the history requirement? Referring back to my Gradle example, it seems that https://github.com/gradle/gradle/commits/master/design-docs/build-comparison.md is a really good way to see why the design doc evolved the way it did. When keeping the doc in Jira (presumably as an attachment) it's not easy to see what changed between successive versions of the doc. Punya On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza sandy.r...@cloudera.com wrote: I think there are maybe two separate things we're talking about? 1. Design discussions and in-progress design docs. My two cents are that JIRA is the best place for this. It allows tracking the progression of a design across multiple PRs and contributors. A piece of useful feedback that I've gotten in the past is to make design docs immutable. When updating them in response to feedback, post a new version rather than editing the existing one. This enables tracking the history of a design and makes it possible to read comments about previous designs in context. Otherwise it's really difficult to understand why particular approaches were chosen or abandoned. 2. Completed design docs for features that we've implemented. Perhaps less essential to project progress, but it would be really lovely to have a central repository to all the projects design doc. If anyone wants to step up to maintain it, it would be cool to have a wiki page with links to all the final design docs posted on JIRA.
Re: Design docs: consolidation and discoverability
The Gradle dev team keep their design documents *checked into* their Git repository -- see https://github.com/gradle/gradle/blob/master/design-docs/build-comparison.md for example. The advantages I see to their approach are: - design docs stay on ASF property (since Github is synced to the Apache-run Git repository) - design docs have a lifetime across PRs, but can still be modified and commented on through the mechanism of PRs - keeping a central location helps people to find good role models and converge on conventions Sean, I find it hard to use the central Jira as a jumping-off point for understanding ongoing design work because a tiny fraction of the tickets actually relate to design docs, and it's not easy from the outside to figure out which ones are relevant. Punya On Fri, Apr 24, 2015 at 2:49 PM Sean Owen so...@cloudera.com wrote: I think it's OK to have design discussions on github, as emails go to ASF lists. After all, loads of PR discussions happen there. It's easy for anyone to follow. I also would rather just discuss on Github, except for all that noise. It's not great to put discussions in something like Google Docs actually; the resulting doc needs to be pasted back to JIRA promptly if so. I suppose it's still better than a private conversation or not talking at all, but the principle is that one should be able to access any substantive decision or conversation by being tuned in to only the project systems of record -- mailing list, JIRA. On Fri, Apr 24, 2015 at 2:30 PM, Reynold Xin r...@databricks.com wrote: I'd love to see more design discussions consolidated in a single place as well. That said, there are many practical challenges to overcome. Some of them are out of our control: 1. For large features, it is fairly common to open a PR for discussion, close the PR taking some feedback into account, and reopen another one. You sort of lose the discussions that way. 2. With the way Jenkins is setup currently, Jenkins testing introduces a lot of noise to GitHub pull requests, making it hard to differentiate legitimate comments from noise. This is unfortunately due to the fact that ASF won't allow our Jenkins bot to have API privilege to post messages. 3. The Apache Way is that all development discussions need to happen on ASF property, i.e. dev lists and JIRA. As a result, technically we are not allowed to have development discussions on GitHub. On Fri, Apr 24, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org wrote: My 2 cents - I'd rather see design docs in github pull requests (using plain text / markdown). That doesn't require changing access or adding people, and github PRs already allow for conversation / email notifications. Conversation is already split between jira and github PRs. Having a third stream of conversation in Google Docs just leads to things being ignored. On Fri, Apr 24, 2015 at 7:21 AM, Sean Owen so...@cloudera.com wrote: That would require giving wiki access to everyone or manually adding people any time they make a doc. I don't see how this helps though. They're still docs on the internet and they're still linked from the central project JIRA, which is what you should follow. On Apr 24, 2015 8:14 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear Spark devs, Right now, design docs are stored on Google docs and linked from tickets. For someone new to the project, it's hard to figure out what subjects are being discussed, what organization to follow for new feature proposals, etc. 
Would it make sense to consolidate future design docs in either a designated area on the Apache Confluence Wiki, or on GitHub's Wiki pages? If people have a strong preference to keep the design docs on Google Docs, then could we have a top-level page on the confluence wiki that lists all active and archived design docs? Punya
Re: Design docs: consolidation and discoverability
Sandy, doesn't keeping (in-progress) design docs in Git satisfy the history requirement? Referring back to my Gradle example, it seems that https://github.com/gradle/gradle/commits/master/design-docs/build-comparison.md is a really good way to see why the design doc evolved the way it did. When keeping the doc in Jira (presumably as an attachment) it's not easy to see what changed between successive versions of the doc. Punya On Fri, Apr 24, 2015 at 3:53 PM Sandy Ryza sandy.r...@cloudera.com wrote: I think there are maybe two separate things we're talking about? 1. Design discussions and in-progress design docs. My two cents are that JIRA is the best place for this. It allows tracking the progression of a design across multiple PRs and contributors. A piece of useful feedback that I've gotten in the past is to make design docs immutable. When updating them in response to feedback, post a new version rather than editing the existing one. This enables tracking the history of a design and makes it possible to read comments about previous designs in context. Otherwise it's really difficult to understand why particular approaches were chosen or abandoned. 2. Completed design docs for features that we've implemented. Perhaps less essential to project progress, but it would be really lovely to have a central repository to all the projects design doc. If anyone wants to step up to maintain it, it would be cool to have a wiki page with links to all the final design docs posted on JIRA. -Sandy On Fri, Apr 24, 2015 at 12:01 PM, Punyashloka Biswal punya.bis...@gmail.com wrote: The Gradle dev team keep their design documents *checked into* their Git repository -- see https://github.com/gradle/gradle/blob/master/design-docs/build-comparison.md for example. The advantages I see to their approach are: - design docs stay on ASF property (since Github is synced to the Apache-run Git repository) - design docs have a lifetime across PRs, but can still be modified and commented on through the mechanism of PRs - keeping a central location helps people to find good role models and converge on conventions Sean, I find it hard to use the central Jira as a jumping-off point for understanding ongoing design work because a tiny fraction of the tickets actually relate to design docs, and it's not easy from the outside to figure out which ones are relevant. Punya On Fri, Apr 24, 2015 at 2:49 PM Sean Owen so...@cloudera.com wrote: I think it's OK to have design discussions on github, as emails go to ASF lists. After all, loads of PR discussions happen there. It's easy for anyone to follow. I also would rather just discuss on Github, except for all that noise. It's not great to put discussions in something like Google Docs actually; the resulting doc needs to be pasted back to JIRA promptly if so. I suppose it's still better than a private conversation or not talking at all, but the principle is that one should be able to access any substantive decision or conversation by being tuned in to only the project systems of record -- mailing list, JIRA. On Fri, Apr 24, 2015 at 2:30 PM, Reynold Xin r...@databricks.com wrote: I'd love to see more design discussions consolidated in a single place as well. That said, there are many practical challenges to overcome. Some of them are out of our control: 1. For large features, it is fairly common to open a PR for discussion, close the PR taking some feedback into account, and reopen another one. You sort of lose the discussions that way. 2. 
With the way Jenkins is setup currently, Jenkins testing introduces a lot of noise to GitHub pull requests, making it hard to differentiate legitimate comments from noise. This is unfortunately due to the fact that ASF won't allow our Jenkins bot to have API privilege to post messages. 3. The Apache Way is that all development discussions need to happen on ASF property, i.e. dev lists and JIRA. As a result, technically we are not allowed to have development discussions on GitHub. On Fri, Apr 24, 2015 at 7:09 AM, Cody Koeninger c...@koeninger.org wrote: My 2 cents - I'd rather see design docs in github pull requests (using plain text / markdown). That doesn't require changing access or adding people, and github PRs already allow for conversation / email notifications. Conversation is already split between jira and github PRs. Having a third stream of conversation in Google Docs just leads to things being ignored. On Fri, Apr 24, 2015 at 7:21 AM, Sean Owen so...@cloudera.com wrote: That would require giving wiki access to everyone or manually adding people any time they make a doc. I don't see how this helps though. They're still docs on the internet and they're still linked from the central project JIRA
Re: Graphical display of metrics on application UI page
Thanks for the pointers! It looks like others are pretty active on this so I'll comment on those PRs and try to coordinate before starting any new work. Punya On Wed, Apr 22, 2015 at 2:49 AM Akhil Das ak...@sigmoidanalytics.com wrote: There were some PR's about graphical representation with D3.js, you can possibly see it on the github. Here's a few of them https://github.com/apache/spark/pulls?utf8=%E2%9C%93q=d3 Thanks Best Regards On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear Spark devs, Would people find it useful to have a graphical display of metrics (such as duration, GC time, etc) on the application UI page? Has anybody worked on this before? Punya
Graphical display of metrics on application UI page
Dear Spark devs, Would people find it useful to have a graphical display of metrics (such as duration, GC time, etc) on the application UI page? Has anybody worked on this before? Punya
Re: [discuss] new Java friendly InputSource API
Reynold, thanks for this! At Palantir we're heavy users of the Java APIs and appreciate being able to stop hacking around with fake ClassTags :) Regarding this specific proposal, is the contract of RecordReader#get intended to be that it returns a fresh object each time? Or is it allowed to mutate a fixed object and return a pointer to it each time? Put another way, is a caller supposed to clone the output of get() if they want to use it later? Punya On Tue, Apr 21, 2015 at 4:35 PM Reynold Xin r...@databricks.com wrote: I created a pull request last night for a new InputSource API that is essentially a stripped down version of the RDD API for providing data into Spark. Would be great to hear the community's feedback. Spark currently has two de facto input source API: 1. RDD 2. Hadoop MapReduce InputFormat Neither of the above is ideal: 1. RDD: It is hard for Java developers to implement RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source can be binary compatible in years to come. 2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics, and does not support running arbitrary code on the driver side (an example of why this is useful is broadcast). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first. My patch creates a new InputSource interface, described by: - an array of InputPartition that specifies the data partitioning - a RecordReader that specifies how data on each partition can be read This interface is similar to Hadoop's InputFormat, except that there is no explicit key/value separation. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025 Pull request: https://github.com/apache/spark/pull/5603