Re: Running a notebook in a standalone cluster mode issues
Or you can try the Livy interpreter, which supports yarn-cluster mode: https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/interpreter/livy.html

On Thu, May 4, 2017 at 3:49 AM, Sofiane Cherchalli wrote:
> Hi Moon,
> Great, I am keen to see ZEPPELIN-2040 resolved soon. But meanwhile, is there any workaround?
> Thanks,
> Sofiane
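For reference, a Livy-backed paragraph might look like the sketch below. It assumes a Livy server is running and reachable at the URL configured in the interpreter's zeppelin.livy.url property, and that the spark-csv package is available to the Livy session; the driver's deploy mode is then handled on the Livy side, and the read/write logic simply mirrors the notebook code from the original message.

%livy.pyspark
# Same CSV round-trip as the original notebook, submitted through Livy
# instead of a local spark-submit (sketch; paths as in the original message)
df = sqlContext.read.load('/data/01.csv',
                          format='com.databricks.spark.csv',
                          mode='PERMISSIVE', header='false',
                          inferSchema='true')
df.write.mode('overwrite') \
    .format('com.databricks.spark.csv') \
    .options(delimiter=',', header='true') \
    .save('/data/02.csv')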
Re: Running a notebook in a standalone cluster mode issues
A workaround other than using client mode is hard to think of at the moment.

Thanks,
moon

On Wed, May 3, 2017 at 3:49 PM, Sofiane Cherchalli wrote:
> Hi Moon,
> Great, I am keen to see ZEPPELIN-2040 resolved soon. But meanwhile, is there any workaround?
> Thanks,
> Sofiane
Re: Running a notebook in a standalone cluster mode issues
Hi Moon,

Great, I am keen to see ZEPPELIN-2040 resolved soon. But meanwhile, is there any workaround?

Thanks,
Sofiane

On Wed, May 3, 2017 at 8:40 PM, moon soo Lee wrote:
> Zeppelin doesn't need to be installed on every worker. Until ZEPPELIN-2040 is resolved, the SparkInterpreter in Zeppelin works very much like spark-shell, which runs in client mode.
> Therefore, if spark-shell works on a machine against your standalone cluster, Zeppelin will work on that same machine with the standalone cluster.
Re: Running a notebook in a standalone cluster mode issues
Zeppelin doesn't need to be installed on every worker.
Until ZEPPELIN-2040 is resolved, you can think of the SparkInterpreter in Zeppelin as working very much like spark-shell, which runs in client mode.

Therefore, if spark-shell works on a machine against your standalone cluster, Zeppelin will work on that same machine with the standalone cluster.

Thanks,
moon

On Wed, May 3, 2017 at 2:28 PM, Sofiane Cherchalli wrote:
> Hi Moon,
> So in my case, with a standalone or YARN cluster, would the workaround be to install Zeppelin alongside every worker, proxy them, and run each Zeppelin in client mode?
> Thanks,
> Sofiane
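In practice, the client-mode workaround amounts to dropping the cluster deploy-mode settings from zeppelin-env.sh, along the lines of the sketch below. It is based on the configuration posted in the original message; <master-host> stands in for the master address, which is elided there, and it assumes the /data volume is mounted into the Zeppelin container, since the driver then runs next to Zeppelin rather than on a worker.

# zeppelin-env.sh, client-mode sketch
export SPARK_HOME=/usr/local/spark
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
# no --deploy-mode cluster: the driver stays with Zeppelin (client mode)
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0"
export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g"
export MASTER="spark://<master-host>:7077"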
Re: Running a notebook in a standalone cluster mode issues
Hi Moon,

So in my case, with a standalone or YARN cluster, would the workaround be to install Zeppelin alongside every worker, proxy them, and run each Zeppelin in client mode?

Thanks,
Sofiane

On Wed, May 3, 2017 at 7:12 PM, moon soo Lee wrote:
> Zeppelin does not support cluster-mode deploy at the moment. Fortunately, there will be support for cluster mode soon!
> Please keep an eye on https://issues.apache.org/jira/browse/ZEPPELIN-2040.
Re: Running a notebook in a standalone cluster mode issues
Hi,

Zeppelin does not support cluster-mode deploy at the moment. Fortunately, there will be support for cluster mode soon!
Please keep an eye on https://issues.apache.org/jira/browse/ZEPPELIN-2040.

Thanks,
moon

On Wed, May 3, 2017 at 11:00 AM, Sofiane Cherchalli wrote:
> Shall I configure a remote interpreter for my notebook to run on the worker?
> Mayday!
Re: Running a notebook in a standalone cluster mode issues
Shall I configure a remote interpreter for my notebook to run on the worker?

Mayday!

On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli wrote:
> What port does the remote interpreter use?
Re: Running a notebook in a standalone cluster mode issues
What port does the remote interpreter use?

On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli wrote:
> Hi Moon and all,
> I have a standalone cluster with one master and one worker, and I submit jobs through Zeppelin. The master, the worker, and Zeppelin each run in a separate container. (Full zeppelin-env.sh, notebook code, and error output are in the original message below.)
Running a notebook in a standalone cluster mode issues
Hi Moon and all,

I have a standalone cluster with one master and one worker, and I submit jobs through Zeppelin. The master, the worker, and Zeppelin each run in a separate container.

My zeppelin-env.sh:

# spark home
export SPARK_HOME=/usr/local/spark

# set hadoop conf dir
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop

# set options to pass to the spark-submit command
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"

# driver memory and deploy mode
export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g -Dspark.submit.deployMode=cluster"

# master
export MASTER="spark://:7077"

My notebook code is very simple. It reads a CSV file and writes it back into the /data directory, which was created beforehand:

%spark.pyspark
def read_input(fin):
    '''
    Read input file from filesystem and return dataframe
    '''
    df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
                              mode='PERMISSIVE', header='false',
                              inferSchema='true')
    return df

def write_output(df, fout):
    '''
    Write dataframe to filesystem
    '''
    df.write.mode('overwrite').format('com.databricks.spark.csv') \
        .options(delimiter=',', header='true').save(fout)

data_in = '/data/01.csv'
data_out = '/data/02.csv'
df = read_input(data_in)
newdf = del_columns(df)
write_output(newdf, data_out)

I set --deploy-mode to *cluster* so that the driver runs on the worker and can read the CSV from the /data directory there, rather than on the Zeppelin host.

When running the notebook it complains that /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is missing:

org.apache.zeppelin.interpreter.InterpreterException:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.5.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 310ms :: artifacts dl 6ms
    :: modules in use:
    com.databricks#spark-csv_2.11;1.5.0 from central in [default]
    com.univocity#univocity-parsers;1.5.1 from central in [default]
    org.apache.commons#commons-csv;1.1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 3 already retrieved (0kB/8ms)
Running Spark using the REST application submission protocol.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Warning: Master endpoint spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077 was not a REST server. Falling back to legacy submission gateway instead.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.5.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 69ms :: artifacts dl 5ms
    :: modules in use:
    com.databricks#spark-csv_2.11;1.5.0 from central in [default]
    com.univocity#univocity-parsers;1.5.1 from central in [default]
    org.apache.commons#commons-csv;1.1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 3 already retrieved (0kB/4ms)
java.nio.file.NoSuchFileException: /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.r
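As a side note, once the configuration is switched back to client mode, a quick paragraph like the sketch below can confirm where the driver actually runs and that /data is visible to it; spark.submit.deployMode may not be set explicitly, hence the default value passed to get.

%spark.pyspark
import os
# Sanity-check sketch (assumes client mode has been restored):
# print the effective master and deploy mode, and check that /data
# is visible from the driver process, which in client mode runs
# alongside Zeppelin.
print(sc.master)
print(sc.getConf().get('spark.submit.deployMode', 'client'))
print(os.path.exists('/data/01.csv'))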