[GitHub] zeppelin issue #1216: [ZEPPELIN-919] Apply new mechanism to Markdown
Github user prabhjyotsingh commented on the issue: https://github.com/apache/zeppelin/pull/1216 CI green. Merging this if no more discussion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] zeppelin pull request #1163: [ZEPPELIN-1149] %sh interpreter kerberos suppor...
Github user asfgit closed the pull request at: https://github.com/apache/zeppelin/pull/1163
[GitHub] zeppelin pull request #1155: [ZEPPELIN-1143] Interpreter dependencies are no...
Github user asfgit closed the pull request at: https://github.com/apache/zeppelin/pull/1155
[GitHub] zeppelin issue #1214: ZEPPELIN-1224: Fix typo in method name
Github user Leemoonsoo commented on the issue: https://github.com/apache/zeppelin/pull/1214 LGTM. CI failure irrelevant. Merge it into master and branch-0.6 if there're no more discussions.
[jira] [Created] (ZEPPELIN-1230) Summary function in R does not display proper output
Abul Basar created ZEPPELIN-1230:

Summary: Summary function in R does not display proper output
Key: ZEPPELIN-1230
URL: https://issues.apache.org/jira/browse/ZEPPELIN-1230
Project: Zeppelin
Issue Type: Bug
Components: GUI
Affects Versions: 0.6.0
Environment: Zeppelin launched on CentOS 6.7, Mac OS
JDK 1.8
Tested with Chrome and FF
Reporter: Abul Basar

%r
#using Spark R interpreter
require(ggplot2) #purpose is use diamonds dataset
summary(lm(price ~ carat, data = diamonds))
---

Above command does not produce the model summary as it does in R IDE. Also, summary(diamonds) does not show the header on the first line.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets
That sounds great Anish!
Please keep it up :)

--
Alex

On Wed, Jul 20, 2016, 18:07 anish singh wrote:

> Alex, some good news!
>
> I just tried the first option you mentioned in the previous mail, increased
> the driver memory to 16g, reduced caching space to 0.1% of total memory and
> additionally trimmed the warc content to include only three domains and it's
> working (everything including reduceByKey()). Although, I had tried this
> earlier, a few days ago, but it had not worked then.
>
> I even understood the core problem: the original rdd (~2GB) contained
> exactly 53307 rdd elements and when I ran
> 'flatMap(r => ExtractLinks(r.getUrl(), r.getContentstring()))' on this rdd
> it resulted in an explosion of data extracted from these many elements (web
> pages), which the available memory was perhaps unable to handle. This also
> means that the rest of the analysis in the notebook must be done on domains
> extracted from the original warc files so it reduces the size of data to be
> processed. In case more RAM is needed I will try to use an m4.2xlarge (32GB)
> instance.
>
> Thrilled to have it working after struggling for so many days, so now I can
> proceed with the notebook.
>
> Thanks again,
> Anish.
>
> On Wed, Jul 20, 2016 at 7:08 AM, Alexander Bezzubov wrote:
>
> > Hi Anish,
> >
> > thank you for sharing your progress and totally know what you mean - that's
> > an expected pain of working with real BigData.
> >
> > I would advise to conduct a series of experiments:
> >
> > *1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (1Gb)
> >  - Spark in local mode is a single JVM process, so fine-tune it and make
> >    sure it uses ALL available memory (i.e 16Gb)
> >  - We are not going to use in-memory caching, so storage part can be turned
> >    off [1] and [2]
> >  - AFAIK DataFrames use memory more efficiently than RDDs but not sure if we
> >    can benefit from it here
> >  - Start with something simple, like `val mayBegLinks =
> >    mayBegData.keepValidPages().count()` and make sure it works
> >  - Proceed further until few more complex queries work
> >
> > *Cluster of N machines*, Spark 1.6 in standalone cluster mode
> >  - process fraction of the whole dataset i.e 1 segment
> >
> > I know that is not easy, but it's worth to try for 1 more week and see if
> > the approach outlined above works.
> > Last, but not least - do not hesitate to reach out to CommonCrawl community
> > [3] for an advice, there are people using Apache Spark there as well.
> >
> > Please keep us posted!
> >
> > 1. http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
> > 2. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> > 3. https://groups.google.com/forum/#!forum/common-crawl
> >
> > --
> > Alex
> >
> > On Wed, Jul 20, 2016 at 2:27 AM, anish singh wrote:
> >
> > > Hello,
> > >
> > > The last two weeks have been tough and full of learning, the code in the
> > > previous mail which performed only simple transformation and reduceByKey()
> > > to count similar domain links did not work even on the first segment (1005
> > > MB) of data. So I studied and read extensively on the web: blogs (cloudera,
> > > databricks and stack overflow) and books on Spark, tried all the options
> > > and configurations on memory and performance tuning but the code did not
> > > run. My current configurations to SPARK_SUBMIT_OPTIONS are set to
> > > "--driver-memory 9g --driver-java-options -XX:+UseG1GC
> > > -XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and even
> > > this does not work. Even simple operations such as rdd.count() after the
> > > transformations in the previous mail do not work. All this on an
> > > m4.xlarge machine.
> > >
> > > Moreover, in trying to set up standalone cluster on single machine by
> > > following instructions in the book 'Learning Spark', I messed with the
> > > '~/.ssh/authorized_keys' file which cut me out of the instance so I had to
> > > terminate it and start all over again after losing all the work done in one
> > > week.
> > >
> > > Today, I performed a comparison of memory and cpu load values using the
> > > size of data and the machine configurations between two conditions: (when I
> > > worked on my local machine) vs. (m4.xlarge single instance), where
> > >
> > > memory load = (data size) / (memory available for processing),
> > > cpu load = (data size) / (cores available for processing)
> > >
> > > the results of the comparison indicate that with the amount of data, the
> > > AWS instance is 100 times more constrained than the analysis that I
> > > previously did on my machine (for calculations, please see sheet [0] ).
> > > This has completely stalled work as I'm unable to perform any further
> > > operations on the data sets. Further, choosing another instance (such as 32
> > > GiB)
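The two ratios defined in the oldest message of this thread (memory load and cpu load) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Setup` class and the function names are hypothetical, the AWS figures use the 1005 MB segment and the 9g driver memory quoted in the thread together with the 4 vCPUs of an m4.xlarge, and the local-machine figures are placeholders, because the sheet [0] with the real calculations is not part of this archive.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Setup:
    """One machine configuration plus the data it has to process."""
    name: str
    data_mb: float    # size of the input data, in MB
    memory_mb: float  # memory available for processing, in MB
    cores: int        # cores available for processing


def memory_load(s: Setup) -> float:
    # memory load = (data size) / (memory available for processing)
    return s.data_mb / s.memory_mb


def cpu_load(s: Setup) -> float:
    # cpu load = (data size) / (cores available for processing)
    return s.data_mb / s.cores


def times_more_constrained(a: Setup, b: Setup) -> Tuple[float, float]:
    """How many times higher the memory and cpu loads of `a` are than those of `b`."""
    return memory_load(a) / memory_load(b), cpu_load(a) / cpu_load(b)


# Figures taken from the thread: 1005 MB segment, --driver-memory 9g, m4.xlarge (4 vCPUs).
aws = Setup("m4.xlarge", data_mb=1005, memory_mb=9 * 1024, cores=4)

# HYPOTHETICAL placeholder values: the real local-machine numbers are in sheet [0].
local = Setup("local machine", data_mb=50, memory_mb=4 * 1024, cores=4)

mem_ratio, cpu_ratio = times_more_constrained(aws, local)
print(f"memory load ratio: {mem_ratio:.2f}, cpu load ratio: {cpu_ratio:.2f}")
```

The ratios only compare input size against resources. They do not capture the data growth after `flatMap` described in the later reply, so they are best read as a lower bound on how constrained a setup is.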
Re: [GSoC 2016] Notebooks
Thanks for sharing your progress Paul, the notebook looks great!

By the way, did you know that in latest Apache Zeppelin instead of
```
print(titanic.head())
```
one can use
```
z.show(titanic)
```
?

It would be a good opportunity to showcase this [1] and other features of the Python interpreter, like the recent SQL over Pandas DataFrames with built-in visualizations for easy exploratory analysis [2], through this work. What do you think?

1. http://zeppelin.apache.org/docs/0.6.0/interpreter/python.html#pandas-integration
2. https://github.com/apache/zeppelin/blob/master/docs/interpreter/python.md#sql-over-pandas-dataframes

--
Alex

On Sat, Jul 23, 2016, 12:54 Paul Bustios Belizario wrote:

> Thanks Moon,
>
> Here is my third notebook using the Titanic dataset:
>
> https://www.zeppelinhub.com/viewer/notebooks/bm90ZTovL2J1c3Rpb3MvbG9jYWwvYmI0Y2EwNjVkMTI1NDY2Y2EzNTIzNThiZjViYzIxOWQvbm90ZS5qc29u
>
> Now, I'm working on the fourth notebook and updating my first notebook to
> use z.show()
>
> Regards,
> Paul
>
> On Sat, Jul 16, 2016 at 7:42 PM moon soo Lee wrote:
>
> > Hi Paul,
> >
> > That would be very interesting!
> > And like you mentioned, it's a dataset for starters. I think it's super
> > reasonable to have notebooks with those data.
> >
> > Thanks,
> > moon
> >
> > On Sat, Jul 9, 2016 at 11:09 AM Paul Bustios Belizario <pbust...@gmail.com>
> > wrote:
> >
> > > Hi community,
> > >
> > > I was searching some databases and chose [1,2] for the next notebooks.
> > > These databases are not big, but are classic and educational for people who
> > > are starting the path of data science. Additionally, through the process of
> > > machine learning, these databases can provide many graphics.
> > >
> > > What do you think?
> > >
> > > Regards,
> > > Paul
> > >
> > > [1] https://www.kaggle.com/c/titanic
> > > [2] https://www.kaggle.com/c/digit-recognizer
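For readers wondering why `z.show(titanic)` gives a richer result than `print(titanic.head())`: Zeppelin's display system renders paragraph output that begins with `%table` as an interactive table, with tab-separated columns and newline-separated rows. The sketch below builds such a string in plain Python. The helper `to_zeppelin_table` is hypothetical (it is not part of Zeppelin and only approximates what `z.show()` emits), and the sample rows are illustrative values, not a real extract of the notebook's data.

```python
from typing import Iterable, Sequence


def to_zeppelin_table(header: Sequence[str], rows: Iterable[Sequence[object]]) -> str:
    """Format rows as Zeppelin '%table' output: tab-separated cells, one row per line."""
    lines = ["%table " + "\t".join(header)]
    for row in rows:
        lines.append("\t".join(str(cell) for cell in row))
    return "\n".join(lines) + "\n"


# Illustrative rows using Titanic-style column names (assumed, not taken from the notebook).
sample = to_zeppelin_table(
    ["PassengerId", "Survived", "Pclass"],
    [(1, 0, 3), (2, 1, 1)],
)
print(sample)
```

Printed from a Zeppelin paragraph, text in this shape is picked up by the front end and shown with the built-in table and chart controls, which is what makes `z.show()` useful for exploratory analysis.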
[jira] [Created] (ZEPPELIN-1229) Versioning zeppelin-web resources
Lee moon soo created ZEPPELIN-1229:

Summary: Versioning zeppelin-web resources
Key: ZEPPELIN-1229
URL: https://issues.apache.org/jira/browse/ZEPPELIN-1229
Project: Zeppelin
Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Lee moon soo

When a newer version of zeppelin-web updates dependencies or any resources, the browser cache is not invalidated automatically. That results in a broken GUI until each individual user clears the browser cache manually.

It would be really helpful if Zeppelin generated each resource file name with a version number, so we can make sure the newer version is loaded correctly without clearing the browser cache.
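The idea in this issue, putting a version into each resource file name so that a new release gets a new URL, can be sketched as follows. This is a minimal illustration and not how zeppelin-web's build does or will do it: the function names and the example paths are hypothetical, and the content-hash variant is an alternative technique that the issue itself does not mention.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path: str, version: str) -> str:
    """Insert a version before the file extension: scripts/app.js -> scripts/app.0.6.0.js."""
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{version}{p.suffix}"))


def content_hash(content: bytes, length: int = 8) -> str:
    """Short hex digest of the file content, usable in place of a release version."""
    return hashlib.sha256(content).hexdigest()[:length]


# Hypothetical resource paths, versioned with the release number from the issue.
print(versioned_name("scripts/app.js", "0.6.0"))
print(versioned_name("styles/main.css", content_hash(b"body { margin: 0; }")))
```

Because the browser caches by URL, a renamed file is fetched fresh after an upgrade, while unchanged files (when a content hash is used) keep their cached copies.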
[GitHub] zeppelin issue #1218: [Zeppelin-1213] Customize editor configuration
Github user corneadoug commented on the issue: https://github.com/apache/zeppelin/pull/1218 @cloverhearts it would be nice to divide it into multiple PRs and commits. For the theme, I'm not against users wanting to change it; however, in this case it crowds the build by importing all the Ace themes, even though 70% of them will probably not be used.
[GitHub] zeppelin pull request #1224: [ZEPPELIN-1228] Make z.show() work with Dataset
GitHub user Leemoonsoo opened a pull request:

https://github.com/apache/zeppelin/pull/1224

[ZEPPELIN-1228] Make z.show() work with Dataset

### What is this PR for?
z.show() does not work in spark 2.0

### What type of PR is it?
Bug Fix

### Todos
* [x] - Make z.show() work with dataset
* [x] - add a unittest

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-1228

### How should this be tested?
```
case class Data(n:Int)
val data = sc.parallelize(1 to 10).map(i=>Data(i)).toDF
data.registerTempTable("data")
z.show(spark.sql("select * from data"))
```

### Questions:
* Does the licenses files need update? no
* Is there breaking changes for older versions? no
* Does this needs documentation? no

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Leemoonsoo/zeppelin ZEPPELIN-1228

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/1224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1224

commit 486e00a091a5d9d40aed154c2e37d08402534558
Author: Lee moon soo
Date:   2016-07-24T23:02:03Z

    Make z.show() work with Dataset
[GitHub] zeppelin pull request #1209: [ZEPPELIN-1180] Update publish_release.sh to pu...
Github user asfgit closed the pull request at: https://github.com/apache/zeppelin/pull/1209
[GitHub] zeppelin issue #1209: [ZEPPELIN-1180] Update publish_release.sh to publish s...
Github user minahlee commented on the issue: https://github.com/apache/zeppelin/pull/1209 Merge if there is no more discussion
[GitHub] zeppelin issue #1220: [MINOR] Make scala version definition consistent in Tr...
Github user minahlee commented on the issue: https://github.com/apache/zeppelin/pull/1220 LGTM. Could you resolve the conflict?
[GitHub] zeppelin issue #1222: [HOTFIX] add scala version to spark-dependencies artif...
Github user minahlee commented on the issue: https://github.com/apache/zeppelin/pull/1222 Fixed by #1195
[GitHub] zeppelin pull request #1222: [HOTFIX] add scala version to spark-dependencie...
Github user minahlee closed the pull request at: https://github.com/apache/zeppelin/pull/1222