Yin, it is this notebook: https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb. Cheers <k/>
On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai <yh...@databricks.com> wrote:

> Hi Krishna,
>
> Can you share your code to reproduce the memory allocation issue?
>
> Thanks,
>
> Yin
>
> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar <ksanka...@gmail.com> wrote:
>
>> Thanks Tom. Interestingly, it happened between RC2 and RC3.
>> Now my vote is +1/2 unless the memory error is known and has a workaround.
>>
>> Cheers
>> <k/>
>>
>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves <tgraves...@yahoo.com> wrote:
>>
>>> The upper/lower case thing is known:
>>> https://issues.apache.org/jira/browse/SPARK-9550
>>> I assume it was decided to be OK and it's going to be in the release
>>> notes, but Reynold or Josh can probably speak to it more.
>>>
>>> Tom
>>>
>>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar
>>> <ksanka...@gmail.com> wrote:
>>>
>>> +?
>>>
>>> 1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 26:09 min
>>>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>> 2. Tested pyspark, mllib
>>> 2.1. statistics (min, max, mean, Pearson, Spearman) OK
>>> 2.2. Linear/Ridge/Lasso Regression OK
>>> 2.3. Decision Tree, Naive Bayes OK
>>> 2.4. KMeans OK
>>>        Center and Scale OK
>>> 2.5. RDD operations OK
>>>        State of the Union texts - MapReduce, Filter, sortByKey (word count)
>>> 2.6. Recommendation (MovieLens medium dataset, ~1M ratings) OK
>>>        Model evaluation/optimization (rank, numIter, lambda) with
>>>        itertools OK
>>> 3. Scala - MLlib
>>> 3.1. statistics (min, max, mean, Pearson, Spearman) OK
>>> 3.2. LinearRegressionWithSGD OK
>>> 3.3. Decision Tree OK
>>> 3.4. KMeans OK
>>> 3.5. Recommendation (MovieLens medium dataset, ~1M ratings) OK
>>> 3.6. saveAsParquetFile OK
>>> 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
>>>        registerTempTable, sql OK
>>> 3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID, ShipCountry,
>>>        UnitPrice, Qty, Discount FROM Orders INNER JOIN OrderDetails ON
>>>        Orders.OrderID = OrderDetails.OrderID") OK
>>> 4.0. Spark SQL from Python OK
>>> 4.1. result = sqlContext.sql("SELECT * FROM people WHERE State = 'WA'") OK
>>> 5.0. Packages
>>> 5.1. com.databricks.spark.csv - read/write OK
>>>        (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn't work,
>>>        but com.databricks:spark-csv_2.11:1.2.0 did)
>>> 6.0. DataFrames
>>> 6.1. cast, dtypes OK
>>> 6.2. groupBy, avg, crosstab, corr, isNull, na.drop OK
>>> 6.3. All joins, sql, set operations, udf OK
>>>
>>> Two problems:
>>>
>>> 1. The synthetic column names are now lowercase, e.g. 'sum(OrderPrice)'
>>>    where it was previously 'SUM(OrderPrice)', and 'avg(Total)' where it
>>>    was previously 'AVG(Total)'. Programs that depend on the case of the
>>>    synthetic column names will fail; a sketch of a workaround follows
>>>    below.
>>> 2. orders_3.groupBy("Year","Month").sum('Total').show() fails with the
>>>    error 'java.io.IOException: Unable to acquire 4194304 bytes of memory'.
>>>    orders_3.groupBy("CustomerID","Year").sum('Total').show() fails with
>>>    the same error. Is this a known bug?
>>>
>>> Cheers
>>> <k/>
>>> P.S.: Sorry for the spam, forgot Reply All.
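>>>
>>> For anyone reproducing outside the notebook, a minimal pyspark sketch of
>>> problem 1 and a workaround. The orders_3 frame below is made-up stand-in
>>> data, not the notebook's Orders set; a frame this small is unlikely to
>>> trigger the memory error from problem 2, so the snippet only demonstrates
>>> the column-name change:
>>>
>>>     from pyspark import SparkContext
>>>     from pyspark.sql import SQLContext
>>>     from pyspark.sql import functions as F
>>>
>>>     sc = SparkContext(appName="rc3-colname-check")
>>>     sqlContext = SQLContext(sc)
>>>
>>>     # Tiny stand-in for the notebook's orders DataFrame
>>>     orders_3 = sqlContext.createDataFrame(
>>>         [("ALFKI", 1997, 7, 100.0), ("ANATR", 1997, 8, 43.9)],
>>>         ["CustomerID", "Year", "Month", "Total"])
>>>
>>>     agg = orders_3.groupBy("Year", "Month").sum("Total")
>>>     # RC3 prints ['Year', 'Month', 'sum(Total)']; 1.4 used 'SUM(Total)'
>>>     print(agg.columns)
>>>
>>>     # Aliasing the aggregate removes any dependence on the case of the
>>>     # synthetic column name
>>>     orders_3.groupBy("Year", "Month") \
>>>             .agg(F.sum("Total").alias("SumTotal")).show()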
>>>
>>> On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin <r...@databricks.com> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v1.5.0-rc3:
>>> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release (published as 1.5.0-rc3) can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1143/
>>>
>>> The staging repository for this release (published as 1.5.0) can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1142/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>>>
>>> =======================================
>>> How can I help test this release?
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload, running it on this release candidate, and
>>> reporting any regressions. A minimal smoke test is sketched below.
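>>>
>>> For example, a minimal sketch (the README.md path and the toy DataFrame
>>> are placeholders; point it at whatever workload and data you already
>>> have):
>>>
>>>     # smoke_test.py - run with: bin/spark-submit smoke_test.py
>>>     from pyspark import SparkContext
>>>     from pyspark.sql import SQLContext
>>>
>>>     sc = SparkContext(appName="rc3-smoke-test")
>>>     sqlContext = SQLContext(sc)
>>>
>>>     # Classic word count over any local text file
>>>     counts = (sc.textFile("README.md")
>>>                 .flatMap(lambda line: line.split())
>>>                 .map(lambda w: (w, 1))
>>>                 .reduceByKey(lambda a, b: a + b))
>>>     print(counts.takeOrdered(10, key=lambda kv: -kv[1]))
>>>
>>>     # A small DataFrame aggregation to exercise the 1.5 execution path
>>>     df = sqlContext.createDataFrame(
>>>         [("a", 1), ("b", 2), ("a", 3)], ["k", "v"])
>>>     df.groupBy("k").sum("v").show()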
>>>
>>> ================================================
>>> What justifies a -1 vote for this release?
>>> ================================================
>>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>>> should only occur for significant regressions from 1.4. Bugs already
>>> present in 1.4, minor regressions, or bugs related to new features will
>>> not block this release.
>>>
>>> ===============================================================
>>> What should happen to JIRA tickets still targeting 1.5.0?
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>>>    branch-1.5, since documentation will be packaged separately from the
>>>    release.
>>> 2. New features for non-alpha modules should target 1.6+.
>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the
>>>    target version.
>>>
>>> ==================================================
>>> Major changes to help you focus your testing
>>> ==================================================
>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>>> contributors. I've curated a list of important changes for 1.5. For the
>>> complete list, please refer to the Apache JIRA changelog.
>>>
>>> RDD/DataFrame/SQL APIs
>>>
>>> - New UDAF interface
>>> - DataFrame hints for broadcast join
>>> - expr function for turning a SQL expression into a DataFrame column
>>>   (a quick taste is sketched at the end of this mail)
>>> - Improved support for NaN values
>>> - StructType now supports ordering
>>> - TimestampType precision is reduced to 1us
>>> - 100 new built-in expressions, including date/time, string, math
>>> - Memory and local-disk-only checkpointing
>>>
>>> DataFrame/SQL Backend Execution
>>>
>>> - Code generation on by default
>>> - Improved join, aggregation, shuffle, and sorting with cache-friendly
>>>   and external algorithms
>>> - Improved window function performance
>>> - Better metrics instrumentation and reporting for DF/SQL execution plans
>>>
>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>>
>>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>>>   Standalone)
>>> - Improved Mesos support (framework authentication, roles, dynamic
>>>   allocation, constraints)
>>> - Improved YARN support (dynamic allocation with preferred locations)
>>> - Improved Hive support (metastore partition pruning, metastore
>>>   connectivity to 0.13 through 1.2, internal Hive upgrade to 1.2)
>>> - Support for persisting data in a Hive-compatible format in the metastore
>>> - Support for data partitioning for JSON data sources
>>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>>>   metadata discovery and schema merging, support for reading non-standard
>>>   legacy Parquet files generated by other libraries)
>>> - Faster and more robust dynamic partition insert
>>> - DataSourceRegister interface for external data sources to specify
>>>   short names
>>>
>>> SparkR
>>>
>>> - YARN cluster mode in R
>>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>>>   regularization
>>> - Improved error messages
>>> - Aliases to make DataFrame functions more R-like
>>>
>>> Streaming
>>>
>>> - Backpressure for handling bursty input streams
>>> - Improved Python support for streaming sources (Kafka offsets, Kinesis,
>>>   MQTT, Flume)
>>> - Improved Python streaming machine learning algorithms (K-Means, linear
>>>   regression, logistic regression)
>>> - Native reliable Kinesis stream support
>>> - Input metadata, such as Kafka offsets, made visible in the batch
>>>   details UI
>>> - Better load balancing and scheduling of receivers across the cluster
>>> - Streaming storage included in the web UI
>>>
>>> Machine Learning and Advanced Analytics
>>>
>>> - Feature transformers: CountVectorizer, Discrete Cosine Transform,
>>>   MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer
>>> - Estimators under the pipeline API: naive Bayes, k-means, and isotonic
>>>   regression
>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
>>>   sequential pattern mining, association rule generation, 1-sample
>>>   Kolmogorov-Smirnov test
>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>>> - More efficient Pregel API implementation for GraphX
>>> - Model summaries for linear and logistic regression
>>> - Python API: distributed matrices, streaming k-means and linear models,
>>>   LDA, power iteration clustering, etc.
>>> - Tuning and evaluation: train-validation split and multiclass
>>>   classification evaluator
>>> - Documentation: document the release version of public API methods
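>>>
>>> As a quick taste of the new expr function from the API list above, a
>>> minimal pyspark sketch (the DataFrame and its columns are made-up
>>> illustration data):
>>>
>>>     from pyspark import SparkContext
>>>     from pyspark.sql import SQLContext
>>>     from pyspark.sql.functions import expr
>>>
>>>     sc = SparkContext(appName="expr-demo")
>>>     sqlContext = SQLContext(sc)
>>>
>>>     df = sqlContext.createDataFrame(
>>>         [(1, 10.0, 3, 0.0), (2, 25.0, 1, 0.1)],
>>>         ["OrderID", "UnitPrice", "Qty", "Discount"])
>>>
>>>     # expr turns a SQL expression string into a Column (new in 1.5)
>>>     df.select("OrderID",
>>>               expr("UnitPrice * Qty * (1 - Discount)").alias("Total")).show()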