GitHub user AnthonyTruchet opened a pull request: https://github.com/apache/spark/pull/16042
Fix of dev scripts and new, Criteo-specific ones (WIP)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AnthonyTruchet/spark dev-tools

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16042.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #16042

----

commit 18661a2bb527adbd01e98158696a16f6d8162411
Author: Tommy YU <tumm...@163.com>
Date: 2016-02-12T02:38:49Z

[SPARK-13153][PYSPARK] ML persistence failed when handling a parameter with no default value

Fix this defect by checking whether a default value exists or not. yanboliang Please help to review.

Author: Tommy YU <tumm...@163.com>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.

(cherry picked from commit d3e2e202994e063856c192e9fdd0541777b88e0e)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 93a55f3df3c9527ecf4143cb40ac7212bc3a975a
Author: markpavey <mark.pa...@thefilter.com>
Date: 2016-02-13T08:39:43Z

[SPARK-13142][WEB UI] Problem accessing Web UI /logPage/ on Microsoft Windows

Due to being on a Windows platform, I have been unable to run the tests as described in the "Contributing to Spark" instructions. As the change is only to two lines of code in the Web UI, which I have manually built and tested, I am submitting this pull request anyway. I hope this is OK. Is it worth considering also including this fix in any future 1.5.x releases (if any)?

I confirm this is my own original work and license it to the Spark project under its open source license.

Author: markpavey <mark.pa...@thefilter.com>

Closes #11135 from markpavey/JIRA_SPARK-13142_WindowsWebUILogFix.
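The SPARK-13153 fix above boils down to guarding on whether a default value actually exists before persisting it, instead of assuming every parameter has one. A minimal sketch of that idea in plain Python (the helper and dictionary layout are hypothetical, not Spark's actual persistence code):

```python
# Hypothetical sketch: persist a param's default only when one exists.
# params_to_save and its dict layout are illustrative, not Spark's API.
def params_to_save(param_values, defaults):
    saved = {}
    for name, value in param_values.items():
        entry = {"value": value}
        if name in defaults:  # a param may have no default value
            entry["default"] = defaults[name]
        saved[name] = entry
    return saved

print(params_to_save({"maxIter": 10, "seed": 42}, {"maxIter": 100}))
```

The guard is the whole fix: without the `if name in defaults` check, looking up a missing default would raise an error during save.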
(cherry picked from commit 374c4b2869fc50570a68819cf0ece9b43ddeb34b)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 107290c94312524bfc4560ebe0de268be4ca56af
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date: 2016-02-13T23:56:20Z

[SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test

JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work.

Author: Liang-Chi Hsieh <vii...@gmail.com>
Author: Xiangrui Meng <m...@databricks.com>

Closes #10539 from viirya/fix-poweriter.

(cherry picked from commit e3441e3f68923224d5b576e6112917cf1fe1f89a)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit ec40c5a59fe45e49496db6e0082ddc65c937a857
Author: Amit Dev <amit...@gmail.com>
Date: 2016-02-14T11:41:27Z

[SPARK-13300][DOCUMENTATION] Added pygments.rb dependency

It looks like the pygments.rb gem is also required for the jekyll build to work. At least on Ubuntu/RHEL I could not build without this dependency, so I added it to the steps.

Author: Amit Dev <amit...@gmail.com>

Closes #11180 from amitdev/master.

(cherry picked from commit 331293c30242dc43e54a25171ca51a1c9330ae44)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 71f53edc0e39bc907755153b9603be8c6fcc1d93
Author: JeremyNixon <jnix...@gmail.com>
Date: 2016-02-15T09:25:13Z

[SPARK-13312][MLLIB] Update java train-validation-split example in ml-guide

Response to JIRA https://issues.apache.org/jira/browse/SPARK-13312. This contribution is my original work and I license the work to this project.

Author: JeremyNixon <jnix...@gmail.com>

Closes #11199 from JeremyNixon/update_train_val_split_example.
(cherry picked from commit adb548365012552e991d51740bfd3c25abf0adec)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit d95089190d714e3e95579ada84ac42d463f824b5
Author: Miles Yucht <mi...@databricks.com>
Date: 2016-02-16T13:01:21Z

Correct SparseVector.parse documentation

There's a small typo in the SparseVector.parse docstring: it says that the method returns a DenseVector rather than a SparseVector.

Author: Miles Yucht <mi...@databricks.com>

Closes #11213 from mgyucht/fix-sparsevector-docs.

(cherry picked from commit 827ed1c06785692d14857bd41f1fd94a0853874a)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 98354cae984e3719a49050e7a6aa75dae78b12bb
Author: Sital Kedia <ske...@fb.com>
Date: 2016-02-17T06:27:34Z

[SPARK-13279] Remove O(n^2) operation from scheduler.

This commit removes an unnecessary duplicate check in addPendingTask that meant that scheduling a task set took time proportional to (# tasks)^2.

Author: Sital Kedia <ske...@fb.com>

Closes #11175 from sitalkedia/fix_stuck_driver.

(cherry picked from commit 1e1e31e03df14f2e7a9654e640fb2796cf059fe0)
Signed-off-by: Kay Ousterhout <kayousterh...@gmail.com>

commit 66106a660149607348b8e51994eb2ce29d67abc0
Author: Christopher C. Aycock <ch...@chrisaycock.com>
Date: 2016-02-17T19:24:18Z

[SPARK-13350][DOCS] Config doc updated to state that PYSPARK_PYTHON's default is "python2.7"

Author: Christopher C. Aycock <ch...@chrisaycock.com>

Closes #11239 from chrisaycock/master.

(cherry picked from commit a7c74d7563926573c01baf613708a0f105a03e57)
Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit 16f35c4c6e7e56bdb1402eab0877da6e8497cb3f
Author: Sean Owen <so...@cloudera.com>
Date: 2016-02-18T20:14:30Z

[SPARK-13371][CORE][STRING] TaskSetManager.dequeueSpeculativeTask compares Option and String directly.

## What changes were proposed in this pull request?
Fix some comparisons between unequal types that cause IJ warnings and, in at least one case, a likely bug (TaskSetManager).

## How was this patch tested?

Running Jenkins tests.

Author: Sean Owen <so...@cloudera.com>

Closes #11253 from srowen/SPARK-13371.

(cherry picked from commit 78562535feb6e214520b29e0bbdd4b1302f01e93)
Signed-off-by: Andrew Or <and...@databricks.com>

commit 699644c692472e5b78baa56a1a6c44d8d174e70e
Author: Michael Armbrust <mich...@databricks.com>
Date: 2016-02-22T23:27:29Z

[SPARK-12546][SQL] Change default number of open parquet files

A common problem that users encounter with Spark 1.6.0 is that writing to a partitioned parquet table OOMs. The root cause is that parquet allocates a significant amount of memory that is not accounted for by our own mechanisms. As a workaround, we can ensure that only a single file is open per task unless the user explicitly asks for more.

Author: Michael Armbrust <mich...@databricks.com>

Closes #11308 from marmbrus/parquetWriteOOM.

(cherry picked from commit 173aa949c309ff7a7a03e9d762b9108542219a95)
Signed-off-by: Michael Armbrust <mich...@databricks.com>

commit 85e6a2205d4549c81edbc2238fd15659120cee78
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-02-23T01:42:30Z

[SPARK-13298][CORE][UI] Escape "label" to avoid DAG being broken by some special character

## What changes were proposed in this pull request?

When there are some special characters (e.g., `"`, `\`) in `label`, the DAG will be broken. This patch escapes `label` to avoid the DAG being broken by special characters.

## How was this patch tested?

Jenkins tests.

Author: Shixiong Zhu <shixi...@databricks.com>

Closes #11309 from zsxwing/SPARK-13298.
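The escaping problem in SPARK-13298 above is generic: any string embedded in generated markup must have its backslash and quote characters escaped first, or the surrounding syntax breaks. A minimal sketch of the idea in plain Python (illustrative only, not the actual Spark UI code):

```python
# Hypothetical sketch: escape backslashes and double quotes before
# embedding a label in generated markup. The backslash must be escaped
# first, otherwise the quote escapes would be double-escaped.
def escape_label(label: str) -> str:
    return label.replace("\\", "\\\\").replace('"', '\\"')

print(escape_label('stage "map" \\ shuffle'))
```

The order of the two `replace` calls is the design point: swapping them would turn `"` into `\\"` instead of `\"`.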
(cherry picked from commit a11b3995190cb4a983adcc8667f7b316cce18d24)
Signed-off-by: Andrew Or <and...@databricks.com>

commit f7898f9e2df131fa78200f6034508e74a78c2a44
Author: Daoyuan Wang <daoyuan.w...@intel.com>
Date: 2016-02-23T02:13:32Z

[SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec

In SparkSQLCLI, we have created a `CliSessionState`, but then we call `SparkSQLEnv.init()`, which starts another `SessionState`. This leads to an exception because `processCmd` needs to get the `CliSessionState` instance by calling `SessionState.get()`, but the return value is an instance of `SessionState`. See the exception below.

spark-sql> !echo "test";
Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.hive.ql.session.SessionState cannot be cast to org.apache.hadoop.hive.cli.CliSessionState
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:112)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:301)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:242)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:691)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Author: Daoyuan Wang <daoyuan.w...@intel.com>

Closes #9589
from adrian-wang/clicommand.

(cherry picked from commit 5d80fac58f837933b5359a8057676f45539e53af)
Signed-off-by: Michael Armbrust <mich...@databricks.com>

Conflicts:
	sql/hive/src/main/scala/org/apache/spark/sql/hive/client/ClientWrapper.scala

commit 40d11d0492bcdf4aa442e527e69804e53b4135e9
Author: Michael Armbrust <mich...@databricks.com>
Date: 2016-02-23T02:25:48Z

Update branch-1.6 for 1.6.1 release

commit 152252f15b7ee2a9b0d53212474e344acd8a55a9
Author: Patrick Wendell <pwend...@gmail.com>
Date: 2016-02-23T02:30:24Z

Preparing Spark release v1.6.1-rc1

commit 290279808e5e9e91d7c349ccec12ff12b99a4556
Author: Patrick Wendell <pwend...@gmail.com>
Date: 2016-02-23T02:30:30Z

Preparing development version 1.6.1-SNAPSHOT

commit d31854da5155550f4e9c5e717c92dfec87d0ff6a
Author: Earthson Lu <earthson...@gmail.com>
Date: 2016-02-23T07:40:36Z

[SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false)

Fix for branch-1.6. https://issues.apache.org/jira/browse/SPARK-13359

Author: Earthson Lu <earthson...@gmail.com>

Closes #11237 from Earthson/SPARK-13359.

commit 0784e02fd438e5fa2e6639d6bba114fa647dad23
Author: Xiangrui Meng <m...@databricks.com>
Date: 2016-02-23T07:54:21Z

[SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply

`GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We call it in LDA without validating this requirement, so it might introduce errors. Replacing it with `Graph.apply` would be safer and more proper because it is a public API. The tests still pass, so maybe it is safe to use `fromExistingRDDs` here (though it doesn't seem so based on the implementation) or the test cases are special. jkbradley ankurdave

Author: Xiangrui Meng <m...@databricks.com>

Closes #11226 from mengxr/SPARK-13355.
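The ArrayType change above (SPARK-12746) encodes a one-way compatibility rule: an array type that may contain nulls should accept data typed as null-free, but not the reverse. A standalone sketch of that rule (a toy class, not Spark's actual `ArrayType` from `pyspark.sql.types`):

```python
# Hypothetical sketch of the SPARK-12746 acceptance rule. Spark's real
# ArrayType lives in pyspark.sql.types and differs from this toy class.
class ToyArrayType:
    def __init__(self, element_type, contains_null):
        self.element_type = element_type
        self.contains_null = contains_null

    def accepts(self, other):
        # ArrayType(_, true) accepts ArrayType(_, false), not vice versa:
        # a nullable slot can hold null-free data, but not the other way.
        if self.element_type != other.element_type:
            return False
        return self.contains_null or not other.contains_null

print(ToyArrayType("int", True).accepts(ToyArrayType("int", False)))  # True
```

The same widening logic appears in many type systems: a value of the stricter (null-free) type is always a valid value of the looser (nullable) type.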
(cherry picked from commit 764ca18037b6b1884fbc4be9a011714a81495020)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 573a2c97e9a9b8feae22f8af173fb158d59e5332
Author: Franklyn D'souza <frankl...@gmail.com>
Date: 2016-02-23T23:34:04Z

[SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.

## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations. This was previously causing `AnalysisException: u"unresolved operator 'Union;"` when trying to unionAll two dataframes with UDT columns, as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
c = a.unionAll(b)
```

## How was this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite. Additional information here: https://issues.apache.org/jira/browse/SPARK-13410 rxin

Author: Franklyn D'souza <frankl...@gmail.com>

Closes #11333 from damnMeddlingKid/udt-union-patch.

commit 06f4fce29227f9763d9f9abff6e7459542dce261
Author: Shixiong Zhu <shixi...@databricks.com>
Date: 2016-02-24T13:35:36Z

[SPARK-13390][SQL][BRANCH-1.6] Fix the issue that Iterator.map().toSeq is not Serializable

## What changes were proposed in this pull request?

`scala.collection.Iterator`'s methods (e.g., map, filter) return an `AbstractIterator`, which is not Serializable. For example:

```Scala
scala> val iter = Array(1, 2, 3).iterator.map(_ + 1)
iter: Iterator[Int] = non-empty iterator

scala> println(iter.isInstanceOf[Serializable])
false
```

If we call something like `Iterator.map(...).toSeq`, it will create a `Stream` that contains a non-serializable `AbstractIterator` field and make the `Stream` non-serializable.
This PR uses `toArray` instead of `toSeq` to fix this issue in `def createDataFrame(data: java.util.List[_], beanClass: Class[_]): DataFrame`.

## How was this patch tested?

Jenkins tests.

Author: Shixiong Zhu <shixi...@databricks.com>

Closes #11334 from zsxwing/SPARK-13390.

commit fe71cabd46e4d384e8790dbfdda892df24b48e92
Author: Yin Huai <yh...@databricks.com>
Date: 2016-02-24T21:34:53Z

[SPARK-13475][TESTS][SQL] HiveCompatibilitySuite should still run in PR builder even if a PR only changes sql/core

## What changes were proposed in this pull request?

`HiveCompatibilitySuite` should still run in the PR builder even if a PR only changes sql/core. So, I am going to remove the `ExtendedHiveTest` annotation from `HiveCompatibilitySuite`. https://issues.apache.org/jira/browse/SPARK-13475

Author: Yin Huai <yh...@databricks.com>

Closes #11351 from yhuai/SPARK-13475.

(cherry picked from commit bc353805bd98243872d520e05fa6659da57170bf)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit 897599601a5ca0f95fd70f16e89df58b9b17705c
Author: huangzhaowei <carlmartin...@gmail.com>
Date: 2016-02-25T07:52:17Z

[SPARK-13482][MINOR][CONFIGURATION] Make the configuration named in TransportConf consistent.

`spark.storage.memoryMapThreshold` has two kinds of values: one is 2*1024*1024 as an integer and the other is '2m' as a string. "2m" is the recommended form in the documentation, but it goes wrong if the code reaches `TransportConf#memoryMapBytes`. [Jira](https://issues.apache.org/jira/browse/SPARK-13482)

Author: huangzhaowei <carlmartin...@gmail.com>

Closes #11360 from SaintBacchus/SPARK-13482.

(cherry picked from commit 264533b553be806b6c45457201952e83c028ec78)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 3cc938ac8124b8445f171baa365fa44a47962cc9
Author: Cheng Lian <l...@databricks.com>
Date: 2016-02-25T12:43:03Z

[SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s)

## What changes were proposed in this pull request?
Predicates shouldn't be pushed through a project with nondeterministic field(s). See https://github.com/graphframes/graphframes/pull/23 and SPARK-13473 for more details. This PR targets master, branch-1.6, and branch-1.5.

## How was this patch tested?

A test case is added in `FilterPushdownSuite`. It constructs a query plan where a filter is over a project with a nondeterministic field. The optimized query plan shouldn't change in this case.

Author: Cheng Lian <l...@databricks.com>

Closes #11348 from liancheng/spark-13473-no-ppd-through-nondeterministic-project-field.

(cherry picked from commit 3fa6491be66dad690ca5329dd32e7c82037ae8c1)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit cb869a143d338985c3d99ef388dd78b1e3d90a73
Author: Oliver Pierson <o...@gatech.edu>
Date: 2016-02-25T13:24:46Z

[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames

Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This fixes the division in line 114, which currently results in zero if `requiredSamples < dataset.count`.

Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected.

Author: Oliver Pierson <o...@gatech.edu>
Author: Oliver Pierson <opier...@umd.edu>

Closes #11319 from oliverpierson/SPARK-13444.

(cherry picked from commit 6f8e835c68dff6fcf97326dc617132a41ff9d043)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 1f031635ffb4df472ad0d9c00bc82ebb601ebbb5
Author: Terence Yim <tere...@cask.co>
Date: 2016-02-25T13:29:30Z

[SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method

## What changes were proposed in this pull request?

Instead of using the result of File.listFiles() directly, which may throw an NPE, check for null first. If it is null, log a warning instead.

## How was this patch tested?
Ran ./dev/run-tests locally. Tested manually on a cluster.

Author: Terence Yim <tere...@cask.co>

Closes #11337 from chtyim/fixes/SPARK-13441-null-check.

(cherry picked from commit fae88af18445c5a88212b4644e121de4b30ce027)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit e3802a7522a83b91c84d0ee6f721a768a485774b
Author: Michael Gummelt <mgumm...@mesosphere.io>
Date: 2016-02-25T13:32:09Z

[SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated

Author: Michael Gummelt <mgumm...@mesosphere.io>

Closes #11311 from mgummelt/document_csv.

(cherry picked from commit c98a93ded36db5da2f3ebd519aa391de90927688)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 5f7440b2529a0f6edfed5038756c004acecbce39
Author: huangzhaowei <carlmartin...@gmail.com>
Date: 2016-02-25T15:14:19Z

[SPARK-12316] Wait a minute to avoid a calling cycle.

When the application ends, the AM will clean the staging dir. But if the driver triggers an update of the delegation token, it can't find the right token file, and then it will endlessly call the method 'updateCredentialsIfRequired'. This leads to a driver StackOverflowError. https://issues.apache.org/jira/browse/SPARK-12316

Author: huangzhaowei <carlmartin...@gmail.com>

Closes #10475 from SaintBacchus/SPARK-12316.

(cherry picked from commit 5fcf4c2bfce4b7e3543815c8e49ffdec8072c9a2)
Signed-off-by: Tom Graves <tgra...@yahoo-inc.com>

commit d59a08f7c1c455d86e7ee3d6522a3e9c55f9ee02
Author: Xiangrui Meng <m...@databricks.com>
Date: 2016-02-25T20:28:03Z

Revert "[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames"

This reverts commit cb869a143d338985c3d99ef388dd78b1e3d90a73.

commit abe8f991a32bef92fbbcd2911836bb7d8e61ca97
Author: Yu ISHIKAWA <yuu.ishik...@gmail.com>
Date: 2016-02-25T21:21:33Z

[SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication

## What changes were proposed in this pull request?

ML StringIndexer does not protect itself from column name duplication.
We should still improve the way we validate the schema of `StringIndexer` and `StringIndexerModel`; however, it would be great to fix that in another issue.

## How was this patch tested?

Unit test.

Author: Yu ISHIKAWA <yuu.ishik...@gmail.com>

Closes #11370 from yu-iskw/SPARK-12874.

(cherry picked from commit 14e2700de29d06460179a94cc9816bcd37344cf7)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit a57f87ee4aafdb97c15f4076e20034ea34c7e2e5
Author: Yin Huai <yh...@databricks.com>
Date: 2016-02-26T20:34:03Z

[SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore.

## What changes were proposed in this pull request?

This change adds a workaround to allow users to drop a table with a name starting with an underscore. Without this patch, we can create such a table but cannot drop it. The reason is that Hive's parser unquotes a quoted identifier (see https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g#L453). So, when we issue a drop table command to Hive, a table name starting with an underscore is actually not quoted. Then, Hive will complain about it because it does not support a table name starting with an underscore without using backticks (underscores are allowed as long as one is not the first character).

## How was this patch tested?

Added a test to make sure we can drop a table with a name starting with an underscore. https://issues.apache.org/jira/browse/SPARK-13454

Author: Yin Huai <yh...@databricks.com>

Closes #11349 from yhuai/fixDropTable.

commit 8a43c3bfbcd9d6e3876e09363dba604dc7e63dc3
Author: Josh Rosen <joshro...@databricks.com>
Date: 2016-02-27T02:40:00Z

[SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org

Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server.
Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync.

Author: Josh Rosen <joshro...@databricks.com>

Closes #11350 from JoshRosen/update-release-scripts-for-apache-home.

(cherry picked from commit f77dc4e1e202942aa8393fb5d8f492863973fe17)
Signed-off-by: Josh Rosen <joshro...@databricks.com>

----