Re: csv dependencies loaded in %spark but not %sql in spark 1.6/zeppelin 0.5.6

2016-02-03 Thread Benjamin Kim
Same here. I want to know the answer too.


> On Feb 2, 2016, at 12:32 PM, Jonathan Kelly wrote:
> 
> Hey, I just ran into that same exact issue yesterday and wasn't sure if I was 
> doing something wrong or what. Glad to know it's not just me! Unfortunately I 
> have not yet had the time to look any deeper into it. Would you mind filing a 
> JIRA if there isn't already one?
> 
> On Tue, Feb 2, 2016 at 12:29 PM Lin, Yunfeng wrote:
> Hi guys,
> 
>  
> 
> The spark-csv dependencies I load are picked up in %spark, but not in %sql, using 
> Apache Zeppelin 0.5.6 with Spark 1.6.0. Everything works fine in Zeppelin 0.5.5 
> with Spark 1.5, though.
> 
>  
> 
> Do you have similar problems?
> 
>  
> 
> I am loading the spark-csv dependencies (https://github.com/databricks/spark-csv)
> 
>  
> 
> Using:
> 
> %dep
> z.load("PATH/commons-csv-1.1.jar")
> z.load("PATH/spark-csv_2.10-1.3.0.jar")
> z.load("PATH/univocity-parsers-1.5.1.jar")
> z.load("PATH/scala-library-2.10.5.jar")
> 
>  
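> The same libraries can also be loaded from Maven coordinates instead of local 
> jars, which should pull in the transitive dependencies (commons-csv, 
> univocity-parsers) automatically. A minimal sketch, assuming the Scala 2.10 / 
> spark-csv 1.3.0 artifact is the one in use:
> 
> %dep
> // Sketch: resolve spark-csv and its transitive dependencies from the default
> // Maven repository; the coordinate below is an assumption.
> z.load("com.databricks:spark-csv_2.10:1.3.0")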
> 
> I am able to load a CSV from HDFS using the DataFrame API in Spark. It runs 
> perfectly fine.
> 
> %spark
> val df = sqlContext.read
>   .format("com.databricks.spark.csv")
>   .option("header", "false")      // do not use the first line as a header
>   .option("inferSchema", "true")  // automatically infer data types
>   .load("hdfs://sd-6f48-7fe6:8020/tmp/people.txt") // this is a file in HDFS
> df.registerTempTable("people")
> df.show()
> 
>  
> 
> This also works:
> 
> %spark
> val df2 = sqlContext.sql("select * from people")
> df2.show()
> 
>  
> 
> But this doesn’t work….
> 
> %sql
> 
> select * from people
> 
>  
> 
> java.lang.ClassNotFoundException: com.databricks.spark.csv.CsvRelation$$anonfun$1$$anonfun$2
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:435)
>   at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
>   at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
>   at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>   at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
>   at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:84)
>   at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:187)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at com.databricks.spark.csv.CsvRelation.tokenRdd(CsvRelation.scala:90)
>   at com.databricks.spark.csv.CsvRelation.buildScan(CsvRelation.scala:104)
>   at com.databricks.spark.csv.CsvRelation.buildScan(CsvRelation.scala:152)
>   at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$4.apply(DataSourceStrategy.scala:64)
>   at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$4.apply(DataSourceStrategy.scala:64)
>   at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:274)
>   at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:273)
>   at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:352)
>   at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:269)
>   at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:60)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan…
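
One workaround worth trying (a sketch only, I have not verified it): put the jars on 
the Spark interpreter's classpath when it is launched, rather than loading them 
through %dep, so that %spark and %sql see the same classes. This assumes the build 
in use starts the Spark interpreter through spark-submit and honors 
SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh:

# conf/zeppelin-env.sh
# Assumption: the Spark interpreter is started via spark-submit, so these
# options reach the interpreter JVM shared by %spark and %sql.
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.3.0"

The Spark interpreter would need to be restarted for the new classpath to take effect.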

Re: [DISCUSS] Update Roadmap

2016-02-29 Thread Benjamin Kim
I concur with this suggestion. In the enterprise, management would like to see 
scheduled runs tracked, monitored, and given SLA constraints for mission-critical 
work. Alerts and notifications are crucial for DevOps to respond quickly and 
clarify errors. If Zeppelin notebooks can be executed by a third-party scheduling 
application, such as Oozie, then this requirement can be satisfied even if there 
are no immediate plans for a built-in scheduler.
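
For example, an external scheduler could kick off a notebook run through Zeppelin's 
REST API. A minimal sketch, assuming the server exposes a notebook job endpoint of 
the form POST /api/notebook/job/{noteId} (host, port, and note ID below are 
placeholders; the exact endpoint should be checked against the REST API docs for 
the Zeppelin version in use):

import java.net.{HttpURLConnection, URL}

// Placeholders: replace with the real Zeppelin host/port and the note ID
// shown in the notebook's URL.
val zeppelinUrl = "http://zeppelin-host:8080"
val noteId = "2A94M5J1Z"

// Assumed endpoint: POST /api/notebook/job/{noteId} runs all paragraphs of the
// note; verify against the REST API documentation for your version.
val conn = new URL(s"$zeppelinUrl/api/notebook/job/$noteId")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
conn.getOutputStream.close() // empty request body
println(s"Zeppelin responded with HTTP ${conn.getResponseCode}")
conn.disconnect()

An Oozie shell or Java action could run this (or an equivalent curl call) as one 
step of a larger workflow.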

> On Feb 29, 2016, at 1:17 AM, Eran Witkon wrote:
> 
> @Vinayak Agrawal I would suggest adding the ability to connect Zeppelin to 
> existing scheduling/workflow tools such as https://oozie.apache.org/. This 
> requires better hooks and status reporting, but doesn't make Zeppelin an 
> ETL/scheduler tool by itself.
> 
> 
> On Mon, Feb 29, 2016 at 10:21 AM Vinayak Agrawal wrote:
> Moon,
> The new roadmap looks very promising. I am very happy to see security in the 
> list.
> I have some suggestions regarding Enterprise Ready features:
> 
> 1. Job Scheduler - Can this be improved?
> Currently the scheduler can be used with a cron expression or a pre-set time. 
> But in an enterprise solution, a notebook might be one piece of a workflow. 
> Can we look toward scheduling notebooks based on other notebooks finishing 
> their jobs successfully?
> This requirement would arise in any ETL workflow, where all the downstream 
> users wait for the ETL notebook to finish successfully. Only after that can 
> the other business-oriented notebooks be executed.
> 
> 2. Importing a notebook - Is there a current requirement or future plan to 
> implement a feature that allows import-notebook-from-github? This would allow 
> users to share notebooks seamlessly. 
> 
> Thanks 
> Vinayak
> 
> On Sun, Feb 28, 2016 at 11:22 PM, moon soo Lee wrote:
> Zhong Wang, 
> Right, Folder support would be quite useful. Thanks for the opinion. 
> Hope I can finish the work in pr-190.
> 
> Sourav,
> Regarding concurrent running, Zeppelin doesn't have a limitation on running 
> paragraphs/queries concurrently. Each interpreter can implement its own 
> scheduling policy. For example, the SparkSQL interpreter and ShellInterpreter 
> can already run paragraphs/queries concurrently.
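> 
> For instance, the SparkSQL interpreter can be switched to a parallel scheduler 
> through an interpreter property; a sketch of the relevant setting (the property 
> name is assumed here and worth double-checking in the Spark interpreter docs):
> 
> # In the Spark interpreter settings; assumed property name, default is false
> zeppelin.spark.concurrentSQL = true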
> 
> SparkInterpreter is implemented with a FIFO scheduler, considering the nature 
> of the Scala compiler. That's why users cannot run multiple paragraphs 
> concurrently when they work with SparkInterpreter.
> But as Zhong Wang mentioned, pr-703 gives each notebook its own Scala compiler, 
> so paragraphs can run concurrently as long as they are in different notebooks.
> Thanks for the feedback!
> 
> Best,
> moon
> On Sat, Feb 27, 2016 at 8:59 PM Zhong Wang wrote:
> Sourav: I think this newly merged PR can help you 
> https://github.com/apache/incubator-zeppelin/pull/703#issuecomment-185582537 
> 
> 
> On Sat, Feb 27, 2016 at 1:46 PM, Sourav Mazumder wrote:
> Hi Moon,
> 
> This looks great.
> 
> My only suggestion would be to include a PR/feature - Support for Running 
> Concurrent paragraphs/queries in Zeppelin. 
> 
> Right now, if more than one user tries to run paragraphs in multiple notebooks 
> concurrently through a single Zeppelin instance (and a single interpreter 
> instance), performance is very slow. It is obvious that a queue builds up 
> within the Zeppelin process and the interpreter process in that scenario, as 
> the time taken to move a paragraph's status from start to pending and from 
> pending to running is very high compared to its actual running time.
> 
> Without this, multi-tenancy support would be meaningless, as no one can 
> practically use it when multiple users are trying to connect to the same 
> Zeppelin instance (and the related interpreter). A possible solution would be 
> to spawn a separate instance of the same interpreter per notebook or per user.
> 
> Regards,
> Sourav
> On Sat, Feb 27, 2016 at 12:48 PM, moon soo Lee wrote:
> Hi Zeppelin users and developers,
> 
> The roadmap we have published at
> https://cwiki.apache.org/confluence/display/ZEPPELIN/Zeppelin+Roadmap 
> 
> is almost 9 months old and no longer reflects where the community is going. 
> It's time to update it.
> 
> Based on the mailing list, JIRA issues, pull requests, and feedback from users, 
> conferences, and meetings, I could summarize the major interests of users and 
> developers in 7 categories: Enterprise ready, Usability improvement, 
> Pluggability, Documentation, Backend integration, Notebook storage, and 
> Visualization.
> 
> And I could list related subjects under each category.
> E