[GitHub] zeppelin issue #1216: [ZEPPELIN-919] Apply new mechanism to Markdown

2016-07-24 Thread prabhjyotsingh
Github user prabhjyotsingh commented on the issue:

https://github.com/apache/zeppelin/pull/1216
  
CI is green.
Merging this if there is no more discussion.




[GitHub] zeppelin pull request #1163: [ZEPPELIN-1149] %sh interpreter kerberos suppor...

2016-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/zeppelin/pull/1163




[GitHub] zeppelin pull request #1155: [ZEPPELIN-1143] Interpreter dependencies are no...

2016-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/zeppelin/pull/1155




[GitHub] zeppelin issue #1214: ZEPPELIN-1224: Fix typo in method name

2016-07-24 Thread Leemoonsoo
Github user Leemoonsoo commented on the issue:

https://github.com/apache/zeppelin/pull/1214
  
LGTM.

The CI failure is irrelevant. Merging this into master and branch-0.6 if there are
no more discussions.




[jira] [Created] (ZEPPELIN-1230) Summary function in R does not display proper output

2016-07-24 Thread Abul Basar (JIRA)
Abul Basar created ZEPPELIN-1230:


 Summary: Summary function in R does not display proper output
 Key: ZEPPELIN-1230
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-1230
 Project: Zeppelin
  Issue Type: Bug
  Components: GUI
Affects Versions: 0.6.0
 Environment: Zeppelin launched on CentOS 6.7, Mac OS
JDK 1.8
Tested with Chrome and FF
Reporter: Abul Basar


%r # using the Spark R interpreter
require(ggplot2) # loaded to use the diamonds dataset
summary(lm(price ~ carat, data = diamonds))

---
The above command does not produce the model summary as it does in an R IDE.
Also, summary(diamonds) does not show the header on the first line.





Re: [GSoC - 2016][Zeppelin Notebooks] Issues with Common Crawl Datasets

2016-07-24 Thread Alexander Bezzubov
That sounds great, Anish!

Please keep it up :)

--
Alex

On Wed, Jul 20, 2016, 18:07 anish singh  wrote:

> Alex, some good news!
>
> I just tried the first option you mentioned in the previous mail: increased
> the driver memory to 16g, reduced the storage memory fraction to 0.1, and
> additionally trimmed the WARC content to include only three domains, and it's
> working (everything including reduceByKey()). I had tried this a few days
> earlier, but it had not worked then.
>
> I even understood the core problem: the original RDD (~2 GB) contained
> exactly 53307 elements, and when I ran
> flatMap(r => ExtractLinks(r.getUrl(), r.getContentString())) on this RDD, it
> resulted in an explosion of data extracted from that many elements (web
> pages), which the available memory was perhaps unable to handle. This also
> means that the rest of the analysis in the notebook must be done on domains
> extracted from the original WARC files, so that the size of the data to be
> processed stays small. In case more RAM is needed, I will try an m4.2xlarge
> (32 GB) instance.
>
> I'm thrilled to have it working after struggling for so many days, so now I
> can proceed with the notebook.
>
> Thanks again,
> Anish.
>
> On Wed, Jul 20, 2016 at 7:08 AM, Alexander Bezzubov 
> wrote:
>
> > Hi Anish,
> >
> > thank you for sharing your progress; I totally know what you mean -
> > that's the expected pain of working with real Big Data.
> >
> > I would advise conducting a series of experiments:
> >
> > *1 moderate machine*, Spark 1.6 in local mode, 1 WARC input file (1 GB)
> >  - Spark in local mode is a single JVM process, so fine-tune it and make
> >    sure it uses ALL available memory (i.e. 16 GB)
> >  - We are not going to use in-memory caching, so the storage part can be
> >    turned off [1], [2]
> >  - AFAIK DataFrames use memory more efficiently than RDDs, but I'm not
> >    sure whether we can benefit from that here
> >  - Start with something simple, like `val mayBegLinks =
> >    mayBegData.keepValidPages().count()`, and make sure it works
> >  - Proceed further until a few more complex queries work
> >
> > *Cluster of N machines*, Spark 1.6 in standalone cluster mode
> >  - process a fraction of the whole dataset, i.e. 1 segment
> >
> >
> > I know that is not easy, but it's worth trying for 1 more week to see if
> > the approach outlined above works.
> > Last but not least - do not hesitate to reach out to the Common Crawl
> > community [3] for advice; there are people using Apache Spark there as
> > well.
> >
> > Please keep us posted!
> >
> >  1. http://spark.apache.org/docs/latest/tuning.html#memory-management-overview
> >  2. http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
> >  3. https://groups.google.com/forum/#!forum/common-crawl
> >
> > --
> > Alex
> >
> >
> > On Wed, Jul 20, 2016 at 2:27 AM, anish singh 
> wrote:
> >
> > > Hello,
> > >
> > > The last two weeks have been tough and full of learning. The code in
> > > the previous mail, which performed only a simple transformation and
> > > reduceByKey() to count similar domain links, did not work even on the
> > > first segment (1005 MB) of data. So I studied and read extensively on
> > > the web: blogs (Cloudera, Databricks, Stack Overflow) and books on
> > > Spark, and tried all the options and configurations for memory and
> > > performance tuning, but the code did not run. My current
> > > SPARK_SUBMIT_OPTIONS is set to
> > > "--driver-memory 9g --driver-java-options -XX:+UseG1GC
> > > -XX:+UseCompressedOops --conf spark.storage.memoryFraction=0.1" and
> > > even this does not work. Even simple operations such as rdd.count()
> > > after the transformations in the previous mail do not work. All this is
> > > on an m4.xlarge machine.
> > >
> > > Moreover, while trying to set up a standalone cluster on a single
> > > machine by following the instructions in the book 'Learning Spark', I
> > > messed up the '~/.ssh/authorized_keys' file, which locked me out of the
> > > instance, so I had to terminate it and start all over again after
> > > losing a week's work.
> > >
> > > Today, I compared memory and CPU load, using the size of the data and
> > > the machine configurations, between two conditions (my local machine
> > > vs. a single m4.xlarge instance), where
> > >
> > > memory load = (data size) / (memory available for processing),
> > > cpu load = (data size) / (cores available for processing)
> > >
> > > The results of the comparison indicate that, with this amount of data,
> > > the AWS instance is 100 times more constrained than my machine was in
> > > the earlier analysis (for calculations, please see sheet [0]). This has
> > > completely stalled work, as I'm unable to perform any further
> > > operations on the datasets. Further, choosing another instance (such as
> > > 32 GiB) 
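
A minimal Scala sketch of the approach discussed in this thread: trim the WARC
records to an allow-list of domains before the link-extraction flatMap and the
reduceByKey() shuffle. It assumes the `sc` (SparkContext) of a Zeppelin %spark
paragraph or spark-shell session; `Page`, `extractLinks`, and the domain list
are illustrative stand-ins, not the warcbase API used in the actual notebook.

```
// Sketch only: filter records to a few domains *before* the link-extraction
// flatMap, so far less data reaches the reduceByKey() shuffle. In Spark 1.6
// local mode the driver JVM does all the work, so the relevant memory knobs
// are e.g. --driver-memory 16g and --conf spark.storage.memoryFraction=0.1
// in SPARK_SUBMIT_OPTIONS.
import java.net.URL
import scala.util.Try

case class Page(url: String, content: String)   // stand-in for a WARC record

// Stand-in link extractor: returns (sourceDomain, targetDomain) pairs.
def extractLinks(url: String, content: String): Seq[(String, String)] = {
  val src = Try(new URL(url).getHost).getOrElse("")
  "href=\"(http[^\"]+)\"".r
    .findAllMatchIn(content)
    .map(_.group(1))
    .flatMap(t => Try(new URL(t).getHost).toOption.map(dst => (src, dst)))
    .toSeq
}

val keepDomains = Set("example.org", "example.com", "example.net")  // illustrative

val pages = sc.parallelize(Seq(
  Page("http://example.org/a", "<a href=\"http://example.com/x\">x</a>"),
  Page("http://other.org/b",   "<a href=\"http://example.net/y\">y</a>")
))

// Only pages from the allow-listed domains feed the expensive flatMap and shuffle.
val domainPairCounts = pages
  .filter(p => Try(new URL(p.url).getHost).toOption.exists(keepDomains.contains))
  .flatMap(p => extractLinks(p.url, p.content))
  .map(pair => (pair, 1L))
  .reduceByKey(_ + _)

domainPairCounts.collect().foreach(println)
```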

Re: [GSoC 2016] Notebooks

2016-07-24 Thread Alexander Bezzubov
Thanks for sharing your progress, Paul; the notebook looks great!

By the way, did you know that in the latest Apache Zeppelin, instead of
```
print(titanic.head())
```
one can use

```
z.show(titanic)
```
?

It would be a good opportunity to showcase this [1] and other features of
the Python interpreter, like the recent SQL over pandas DataFrames with
built-in visualizations for easy exploratory analysis [2], through this work.
What do you think?

1.
http://zeppelin.apache.org/docs/0.6.0/interpreter/python.html#pandas-integration
2.
https://github.com/apache/zeppelin/blob/master/docs/interpreter/python.md#sql-over-pandas-dataframes
--
Alex

On Sat, Jul 23, 2016, 12:54 Paul Bustios Belizario 
wrote:

> Thanks Moon,
>
> Here is my third notebook using the Titanic dataset:
>
>
> https://www.zeppelinhub.com/viewer/notebooks/bm90ZTovL2J1c3Rpb3MvbG9jYWwvYmI0Y2EwNjVkMTI1NDY2Y2EzNTIzNThiZjViYzIxOWQvbm90ZS5qc29u
>
> Now, I'm working on the fourth notebook and updating my first notebook to
> use z.show()
>
> Regards,
> Paul
>
> On Sat, Jul 16, 2016 at 7:42 PM moon soo Lee  wrote:
>
> > Hi Paul,
> >
> > That would be very interesting!
> > And like you mentioned, those are datasets for starters. I think it's
> > super reasonable to have notebooks with that data.
> >
> > Thanks,
> > moon
> >
> > On Sat, Jul 9, 2016 at 11:09 AM Paul Bustios Belizario <
> pbust...@gmail.com
> > >
> > wrote:
> >
> > > Hi community,
> > >
> > > I was searching for some datasets and chose [1, 2] for the next
> > > notebooks. These datasets are not big, but they are classic and
> > > educational for people who are starting down the path of data science.
> > > Additionally, through the machine learning process, these datasets can
> > > yield many graphics.
> > >
> > > What do you think?
> > >
> > > Regards,
> > > Paul
> > >
> > > [1] https://www.kaggle.com/c/titanic
> > > [2] https://www.kaggle.com/c/digit-recognizer
> > >
> >
>


[jira] [Created] (ZEPPELIN-1229) Versioning zeppelin-web resources

2016-07-24 Thread Lee moon soo (JIRA)
Lee moon soo created ZEPPELIN-1229:
--

 Summary: Versioning zeppelin-web resources
 Key: ZEPPELIN-1229
 URL: https://issues.apache.org/jira/browse/ZEPPELIN-1229
 Project: Zeppelin
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Lee moon soo


When a newer version of zeppelin-web updates dependencies or any resources, the
browser cache is not invalidated automatically. That results in a broken GUI until
each individual user clears the browser cache manually.

It would be really helpful if Zeppelin generated each resource file name with a
version number, so we can make sure the newer version is loaded correctly
without clearing the browser cache.
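
A minimal sketch of the content-hash fingerprinting idea (not Zeppelin's actual
build setup; the file names and the helper below are illustrative only):

```
// Illustrative sketch: app.js becomes app.<hash>.js, so any change to the
// file's bytes changes the URL the browser requests and stale caches are
// bypassed.
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import java.security.MessageDigest

def fingerprint(resource: Path): Path = {
  val bytes = Files.readAllBytes(resource)
  val hash  = MessageDigest.getInstance("SHA-1")
    .digest(bytes)
    .map("%02x".format(_))
    .mkString
    .take(8)

  // Assumes the file name has an extension, e.g. "app.js" -> "app.1a2b3c4d.js".
  val name   = resource.getFileName.toString
  val dot    = name.lastIndexOf('.')
  val hashed = s"${name.substring(0, dot)}.$hash${name.substring(dot)}"

  // A real build step would also rewrite references (e.g. in index.html)
  // to point at the hashed name.
  val target = resource.resolveSibling(hashed)
  Files.copy(resource, target, StandardCopyOption.REPLACE_EXISTING)
  target
}

// Hypothetical usage:
// fingerprint(Paths.get("zeppelin-web/dist/scripts/app.js"))
```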





[GitHub] zeppelin issue #1218: [Zeppelin-1213] Customize editor configuration

2016-07-24 Thread corneadoug
Github user corneadoug commented on the issue:

https://github.com/apache/zeppelin/pull/1218
  
@cloverhearts it would be nice to divide this into multiple PRs and commits.
As for the themes, I'm not against users wanting to change them; however, in this
case importing all of the Ace themes would crowd the build, even though 70% of
them will probably never be used.




[GitHub] zeppelin pull request #1224: [ZEPPELIN-1228] Make z.show() work with Dataset

2016-07-24 Thread Leemoonsoo
GitHub user Leemoonsoo opened a pull request:

https://github.com/apache/zeppelin/pull/1224

[ZEPPELIN-1228] Make z.show() work with Dataset

### What is this PR for?
z.show() does not work in Spark 2.0.


### What type of PR is it?
Bug Fix

### Todos
* [x] - Make z.show() work with Dataset
* [x] - Add a unit test

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-1228

### How should this be tested?
```
case class Data(n:Int)
val data = sc.parallelize(1 to 10).map(i=>Data(i)).toDF
data.registerTempTable("data")
z.show(spark.sql("select * from data"))
```
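
A minimal sketch of the same check with a typed Dataset rather than a DataFrame,
assuming the `spark` (SparkSession) and `z` (ZeppelinContext) objects of a
Zeppelin %spark paragraph running Spark 2.0:

```
// Sketch: pass a typed Dataset[Data] directly to z.show().
import spark.implicits._

case class Data(n: Int)

val ds = (1 to 10).toDS().map(n => Data(n))   // Dataset[Data], not a DataFrame

z.show(ds)   // with this change, should render the same table view as a DataFrame
```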

### Questions:
* Do the license files need an update? no
* Are there breaking changes for older versions? no
* Does this need documentation? no



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Leemoonsoo/zeppelin ZEPPELIN-1228

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zeppelin/pull/1224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1224


commit 486e00a091a5d9d40aed154c2e37d08402534558
Author: Lee moon soo 
Date:   2016-07-24T23:02:03Z

Make z.show() work with Dataset






[GitHub] zeppelin pull request #1209: [ZEPPELIN-1180] Update publish_release.sh to pu...

2016-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/zeppelin/pull/1209




[GitHub] zeppelin issue #1209: [ZEPPELIN-1180] Update publish_release.sh to publish s...

2016-07-24 Thread minahlee
Github user minahlee commented on the issue:

https://github.com/apache/zeppelin/pull/1209
  
Merging if there is no more discussion.




[GitHub] zeppelin issue #1220: [MINOR] Make scala version definition consistent in Tr...

2016-07-24 Thread minahlee
Github user minahlee commented on the issue:

https://github.com/apache/zeppelin/pull/1220
  
LGTM. Could you resolve the conflict?




[GitHub] zeppelin issue #1222: [HOTFIX] add scala version to spark-dependencies artif...

2016-07-24 Thread minahlee
Github user minahlee commented on the issue:

https://github.com/apache/zeppelin/pull/1222
  
Fixed by #1195 




[GitHub] zeppelin pull request #1222: [HOTFIX] add scala version to spark-dependencie...

2016-07-24 Thread minahlee
Github user minahlee closed the pull request at:

https://github.com/apache/zeppelin/pull/1222

