Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Jeff Zhang
Or you can try the Livy interpreter, which supports yarn-cluster mode:

https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/interpreter/livy.html
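
For example, assuming a Livy server is reachable from Zeppelin (the host, port
and values below are placeholders), cluster mode is configured on the Livy side
and Zeppelin only needs the endpoint:

# livy.conf on the Livy server: run sessions on YARN in cluster mode
livy.spark.master = yarn
livy.spark.deploy-mode = cluster

# Zeppelin livy interpreter property (set on the interpreter settings page)
zeppelin.livy.url = http://livy-host:8998

The notebook would then use %livy.pyspark (or %livy.spark) instead of
%spark.pyspark.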


On Thu, May 4, 2017 at 3:49 AM Sofiane Cherchalli wrote:

> Hi Moon,
>
> Great, I am keen to see Zeppelin-2040 resolved soon. But meanwhile is
> there any workaround?
>
> Thanks.
>
> Sofiane
>
>
> El El mié, 3 may 2017 a las 20:40, moon soo Lee 
> escribió:
>
>> Zeppelin don't need to be installed in every workers.
>> You can think the way SparkInterpreter in Zeppelin work is very similar
>> to spark-shell (which works in client mode), until ZEPPELIN-2040 is
>> resolved.
>>
>> Therefore, if spark-shell works in a machine with your standalone
>> cluster, Zeppelin will work in the same machine with the standalone cluster.
>>
>> Thanks,
>> moon
>>
>> On Wed, May 3, 2017 at 2:28 PM Sofiane Cherchalli 
>> wrote:
>>
>>> Hi Moon,
>>>
>>> So in my case, if II have standalone or yarn cluster, the workaround
>>> would be to install zeppelin along every worker, proxy them,  and run each
>>> zeppelin in client mode ?
>>>
>>> Thanks,
>>> Sofiane
>>>
>>> El El mié, 3 may 2017 a las 19:12, moon soo Lee 
>>> escribió:
>>>
 Hi,

 Zeppelin does not support cluster mode deploy at the moment.
 Fortunately, there will be a support for cluster mode, soon!
 Please keep an eye on
 https://issues.apache.org/jira/browse/ZEPPELIN-2040.

 Thanks,
 moon

 On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli 
 wrote:

> Shall I configure a remote interpreter to my notebook to run on the
> worker?
>
> Mayday!
>
> On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli <
> sofian...@gmail.com> wrote:
>
>> What port does the remote interpreter use?
>>
>> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli <
>> sofian...@gmail.com> wrote:
>>
>>> Hi Moon and al,
>>>
>>> I have a standalone cluster with one master, one worker. I submit
>>> jobs through zeppelin. master, worker, and zeppelin run in a separate
>>> container.
>>>
>>> My zeppelin-env.sh:
>>>
>>> # spark home
>>> export SPARK_HOME=/usr/local/spark
>>>
>>> # set hadoop conf dir
>>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>>>
>>> # set options to pass spark-submit command
>>> export SPARK_SUBMIT_OPTIONS="--packages
>>> com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>>>
>>> # worker memory
>>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
>>> -Dspark.submit.deployMode=cluster"
>>>
>>> # master
>>> export MASTER="spark://:7077"
>>>
>>> My notebook code is very simple. It read csv and write it again in
>>> directory /data previously created:
>>> %spark.pyspark
>>> def read_input(fin):
>>> '''
>>> Read input file from filesystem and return dataframe
>>> '''
>>> df = sqlContext.read.load(fin,
>>> format='com.databricks.spark.csv', mode='PERMISSIVE', header='false',
>>> inferSchema='true')
>>> return df
>>>
>>> def write_output(df, fout):
>>> '''
>>> Write dataframe to filesystem
>>> '''
>>>
>>> df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
>>> header='true').save(fout)
>>>
>>> data_in = '/data/01.csv'
>>> data_out = '/data/02.csv'
>>> df = read_input(data_in)
>>> newdf = del_columns(df)
>>> write_output(newdf, data_out)
>>>
>>>
>>> I used --deploy-mode to *cluster* so that the driver is run in the
>>> worker in order to read the CSV in the /data directory and not in 
>>> zeppelin.
>>> When running the notebook it complains that
>>> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
>>> missing:
>>> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default
>>> Cache set to: /root/.ivy2/cache The jars for the packages stored in:
>>> /root/.ivy2/jars :: loading settings :: url =
>>> jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>>> com.databricks#spark-csv_2.11 added as a dependency :: resolving
>>> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: 
>>> [default]
>>> found com.databricks#spark-csv_2.11;1.5.0 in central found
>>> org.apache.commons#commons-csv;1.1 in central found
>>> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
>>> resolve 310ms :: artifacts dl 6ms :: modules in use:
>>> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>> com.univocity#univocity-parsers;1.5.1 from central in [default]
>>> org.apache.commons#commons-csv;1.1 from central in [default]
>>> - | 
>>> |
>>> modules || artifacts | | conf | number| search|dwnlded|evicted||
>>> number|dwnlded|
>>> --

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread moon soo Lee
Other than using client mode, it's difficult to think of a workaround ...
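
In practice that means letting the driver run next to Zeppelin. A minimal
client-mode variant of the zeppelin-env.sh from the original message (the
master hostname is a placeholder) would be:

# set options to pass spark-submit command; no --deploy-mode cluster,
# since client mode is the default
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0"

# driver memory only; no spark.submit.deployMode override
export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g"

# master
export MASTER="spark://spark-master:7077"

With the driver running in the Zeppelin container, the /data directory would
then need to be mounted or otherwise visible there as well.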

Thanks,
moon

On Wed, 3 May 2017 at 3:49 PM Sofiane Cherchalli 
wrote:

> Hi Moon,
>
> Great, I am keen to see Zeppelin-2040 resolved soon. But meanwhile is
> there any workaround?
>
> Thanks.
>
> Sofiane
>
>
> El El mié, 3 may 2017 a las 20:40, moon soo Lee 
> escribió:
>
>> Zeppelin don't need to be installed in every workers.
>> You can think the way SparkInterpreter in Zeppelin work is very similar
>> to spark-shell (which works in client mode), until ZEPPELIN-2040 is
>> resolved.
>>
>> Therefore, if spark-shell works in a machine with your standalone
>> cluster, Zeppelin will work in the same machine with the standalone cluster.
>>
>> Thanks,
>> moon
>>
>> On Wed, May 3, 2017 at 2:28 PM Sofiane Cherchalli 
>> wrote:
>>
>>> Hi Moon,
>>>
>>> So in my case, if II have standalone or yarn cluster, the workaround
>>> would be to install zeppelin along every worker, proxy them,  and run each
>>> zeppelin in client mode ?
>>>
>>> Thanks,
>>> Sofiane
>>>
>>> El El mié, 3 may 2017 a las 19:12, moon soo Lee 
>>> escribió:
>>>
 Hi,

 Zeppelin does not support cluster mode deploy at the moment.
 Fortunately, there will be a support for cluster mode, soon!
 Please keep an eye on
 https://issues.apache.org/jira/browse/ZEPPELIN-2040.

 Thanks,
 moon

 On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli 
 wrote:

> Shall I configure a remote interpreter to my notebook to run on the
> worker?
>
> Mayday!
>
> On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli <
> sofian...@gmail.com> wrote:
>
>> What port does the remote interpreter use?
>>
>> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli <
>> sofian...@gmail.com> wrote:
>>
>>> Hi Moon and al,
>>>
>>> I have a standalone cluster with one master, one worker. I submit
>>> jobs through zeppelin. master, worker, and zeppelin run in a separate
>>> container.
>>>
>>> My zeppelin-env.sh:
>>>
>>> # spark home
>>> export SPARK_HOME=/usr/local/spark
>>>
>>> # set hadoop conf dir
>>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>>>
>>> # set options to pass spark-submit command
>>> export SPARK_SUBMIT_OPTIONS="--packages
>>> com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>>>
>>> # worker memory
>>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
>>> -Dspark.submit.deployMode=cluster"
>>>
>>> # master
>>> export MASTER="spark://:7077"
>>>
>>> My notebook code is very simple. It read csv and write it again in
>>> directory /data previously created:
>>> %spark.pyspark
>>> def read_input(fin):
>>> '''
>>> Read input file from filesystem and return dataframe
>>> '''
>>> df = sqlContext.read.load(fin,
>>> format='com.databricks.spark.csv', mode='PERMISSIVE', header='false',
>>> inferSchema='true')
>>> return df
>>>
>>> def write_output(df, fout):
>>> '''
>>> Write dataframe to filesystem
>>> '''
>>>
>>> df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
>>> header='true').save(fout)
>>>
>>> data_in = '/data/01.csv'
>>> data_out = '/data/02.csv'
>>> df = read_input(data_in)
>>> newdf = del_columns(df)
>>> write_output(newdf, data_out)
>>>
>>>
>>> I used --deploy-mode to *cluster* so that the driver is run in the
>>> worker in order to read the CSV in the /data directory and not in 
>>> zeppelin.
>>> When running the notebook it complains that
>>> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
>>> missing:
>>> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default
>>> Cache set to: /root/.ivy2/cache The jars for the packages stored in:
>>> /root/.ivy2/jars :: loading settings :: url =
>>> jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>>> com.databricks#spark-csv_2.11 added as a dependency :: resolving
>>> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: 
>>> [default]
>>> found com.databricks#spark-csv_2.11;1.5.0 in central found
>>> org.apache.commons#commons-csv;1.1 in central found
>>> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
>>> resolve 310ms :: artifacts dl 6ms :: modules in use:
>>> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>> com.univocity#univocity-parsers;1.5.1 from central in [default]
>>> org.apache.commons#commons-csv;1.1 from central in [default]
>>> - | 
>>> |
>>> modules || artifacts | | conf | number| search|dwnlded|evicted||
>>> number|dwnlded|
>>> 

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
Hi Moon,

Great, I am keen to see ZEPPELIN-2040 resolved soon. But is there any
workaround in the meantime?

Thanks.
Sofiane


On Wed, 3 May 2017 at 20:40, moon soo Lee wrote:

> Zeppelin don't need to be installed in every workers.
> You can think the way SparkInterpreter in Zeppelin work is very similar to
> spark-shell (which works in client mode), until ZEPPELIN-2040 is resolved.
>
> Therefore, if spark-shell works in a machine with your standalone cluster,
> Zeppelin will work in the same machine with the standalone cluster.
>
> Thanks,
> moon
>
> On Wed, May 3, 2017 at 2:28 PM Sofiane Cherchalli 
> wrote:
>
>> Hi Moon,
>>
>> So in my case, if II have standalone or yarn cluster, the workaround
>> would be to install zeppelin along every worker, proxy them,  and run each
>> zeppelin in client mode ?
>>
>> Thanks,
>> Sofiane
>>
>> El El mié, 3 may 2017 a las 19:12, moon soo Lee 
>> escribió:
>>
>>> Hi,
>>>
>>> Zeppelin does not support cluster mode deploy at the moment.
>>> Fortunately, there will be a support for cluster mode, soon!
>>> Please keep an eye on
>>> https://issues.apache.org/jira/browse/ZEPPELIN-2040.
>>>
>>> Thanks,
>>> moon
>>>
>>> On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli 
>>> wrote:
>>>
 Shall I configure a remote interpreter to my notebook to run on the
 worker?

 Mayday!

 On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli >>> > wrote:

> What port does the remote interpreter use?
>
> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli <
> sofian...@gmail.com> wrote:
>
>> Hi Moon and al,
>>
>> I have a standalone cluster with one master, one worker. I submit
>> jobs through zeppelin. master, worker, and zeppelin run in a separate
>> container.
>>
>> My zeppelin-env.sh:
>>
>> # spark home
>> export SPARK_HOME=/usr/local/spark
>>
>> # set hadoop conf dir
>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>>
>> # set options to pass spark-submit command
>> export SPARK_SUBMIT_OPTIONS="--packages
>> com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>>
>> # worker memory
>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
>> -Dspark.submit.deployMode=cluster"
>>
>> # master
>> export MASTER="spark://:7077"
>>
>> My notebook code is very simple. It read csv and write it again in
>> directory /data previously created:
>> %spark.pyspark
>> def read_input(fin):
>> '''
>> Read input file from filesystem and return dataframe
>> '''
>> df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
>> mode='PERMISSIVE', header='false', inferSchema='true')
>> return df
>>
>> def write_output(df, fout):
>> '''
>> Write dataframe to filesystem
>> '''
>>
>> df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
>> header='true').save(fout)
>>
>> data_in = '/data/01.csv'
>> data_out = '/data/02.csv'
>> df = read_input(data_in)
>> newdf = del_columns(df)
>> write_output(newdf, data_out)
>>
>>
>> I used --deploy-mode to *cluster* so that the driver is run in the
>> worker in order to read the CSV in the /data directory and not in 
>> zeppelin.
>> When running the notebook it complains that
>> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
>> missing:
>> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default
>> Cache set to: /root/.ivy2/cache The jars for the packages stored in:
>> /root/.ivy2/jars :: loading settings :: url =
>> jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>> com.databricks#spark-csv_2.11 added as a dependency :: resolving
>> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
>> found com.databricks#spark-csv_2.11;1.5.0 in central found
>> org.apache.commons#commons-csv;1.1 in central found
>> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
>> resolve 310ms :: artifacts dl 6ms :: modules in use:
>> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>> com.univocity#univocity-parsers;1.5.1 from central in [default]
>> org.apache.commons#commons-csv;1.1 from central in [default]
>> - | |
>> modules || artifacts | | conf | number| search|dwnlded|evicted||
>> number|dwnlded|
>> - |
>> default | 3 | 0 | 0 | 0 || 3 | 0 |
>> - ::
>> retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0
>> artifacts copied, 3 already retrieved (0kB/8ms) Running Spark using the
>> REST application submission pr

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread moon soo Lee
Zeppelin doesn't need to be installed on every worker.
Until ZEPPELIN-2040 is resolved, you can think of the way the SparkInterpreter
in Zeppelin works as very similar to spark-shell (which runs in client mode).

Therefore, if spark-shell works on a machine against your standalone cluster,
Zeppelin will work on that same machine with the standalone cluster.
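
A quick way to verify this is to run spark-shell in client mode from the same
host/container that runs Zeppelin, against the same master URL (the hostname
below is a placeholder):

$SPARK_HOME/bin/spark-shell \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --packages com.databricks:spark-csv_2.11:1.5.0

If that shell starts and can reach the workers, the SparkInterpreter that
Zeppelin launches on the same machine should behave the same way.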

Thanks,
moon

On Wed, May 3, 2017 at 2:28 PM Sofiane Cherchalli 
wrote:

> Hi Moon,
>
> So in my case, if II have standalone or yarn cluster, the workaround would
> be to install zeppelin along every worker, proxy them,  and run each
> zeppelin in client mode ?
>
> Thanks,
> Sofiane
>
> El El mié, 3 may 2017 a las 19:12, moon soo Lee 
> escribió:
>
>> Hi,
>>
>> Zeppelin does not support cluster mode deploy at the moment. Fortunately,
>> there will be a support for cluster mode, soon!
>> Please keep an eye on https://issues.apache.org/jira/browse/ZEPPELIN-2040
>> .
>>
>> Thanks,
>> moon
>>
>> On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli 
>> wrote:
>>
>>> Shall I configure a remote interpreter to my notebook to run on the
>>> worker?
>>>
>>> Mayday!
>>>
>>> On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli 
>>> wrote:
>>>
 What port does the remote interpreter use?

 On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli >>> > wrote:

> Hi Moon and al,
>
> I have a standalone cluster with one master, one worker. I submit jobs
> through zeppelin. master, worker, and zeppelin run in a separate 
> container.
>
> My zeppelin-env.sh:
>
> # spark home
> export SPARK_HOME=/usr/local/spark
>
> # set hadoop conf dir
> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>
> # set options to pass spark-submit command
> export SPARK_SUBMIT_OPTIONS="--packages
> com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>
> # worker memory
> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
> -Dspark.submit.deployMode=cluster"
>
> # master
> export MASTER="spark://:7077"
>
> My notebook code is very simple. It read csv and write it again in
> directory /data previously created:
> %spark.pyspark
> def read_input(fin):
> '''
> Read input file from filesystem and return dataframe
> '''
> df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
> mode='PERMISSIVE', header='false', inferSchema='true')
> return df
>
> def write_output(df, fout):
> '''
> Write dataframe to filesystem
> '''
>
> df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
> header='true').save(fout)
>
> data_in = '/data/01.csv'
> data_out = '/data/02.csv'
> df = read_input(data_in)
> newdf = del_columns(df)
> write_output(newdf, data_out)
>
>
> I used --deploy-mode to *cluster* so that the driver is run in the
> worker in order to read the CSV in the /data directory and not in 
> zeppelin.
> When running the notebook it complains that
> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
> missing:
> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default
> Cache set to: /root/.ivy2/cache The jars for the packages stored in:
> /root/.ivy2/jars :: loading settings :: url =
> jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> com.databricks#spark-csv_2.11 added as a dependency :: resolving
> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
> found com.databricks#spark-csv_2.11;1.5.0 in central found
> org.apache.commons#commons-csv;1.1 in central found
> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
> resolve 310ms :: artifacts dl 6ms :: modules in use:
> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
> com.univocity#univocity-parsers;1.5.1 from central in [default]
> org.apache.commons#commons-csv;1.1 from central in [default]
> - | |
> modules || artifacts | | conf | number| search|dwnlded|evicted||
> number|dwnlded|
> - |
> default | 3 | 0 | 0 | 0 || 3 | 0 |
> - ::
> retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0
> artifacts copied, 3 already retrieved (0kB/8ms) Running Spark using the
> REST application submission protocol. SLF4J: Class path contains multiple
> SLF4J bindings. SLF4J: Found binding in
> [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/S

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
Hi Moon,

So in my case, with a standalone or YARN cluster, the workaround would be to
install Zeppelin alongside every worker, proxy them, and run each Zeppelin in
client mode?

Thanks,
Sofiane

On Wed, 3 May 2017 at 19:12, moon soo Lee wrote:

> Hi,
>
> Zeppelin does not support cluster mode deploy at the moment. Fortunately,
> there will be a support for cluster mode, soon!
> Please keep an eye on https://issues.apache.org/jira/browse/ZEPPELIN-2040.
>
> Thanks,
> moon
>
> On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli 
> wrote:
>
>> Shall I configure a remote interpreter to my notebook to run on the
>> worker?
>>
>> Mayday!
>>
>> On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli 
>> wrote:
>>
>>> What port does the remote interpreter use?
>>>
>>> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli 
>>> wrote:
>>>
 Hi Moon and al,

 I have a standalone cluster with one master, one worker. I submit jobs
 through zeppelin. master, worker, and zeppelin run in a separate container.

 My zeppelin-env.sh:

 # spark home
 export SPARK_HOME=/usr/local/spark

 # set hadoop conf dir
 export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop

 # set options to pass spark-submit command
 export SPARK_SUBMIT_OPTIONS="--packages
 com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"

 # worker memory
 export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
 -Dspark.submit.deployMode=cluster"

 # master
 export MASTER="spark://:7077"

 My notebook code is very simple. It read csv and write it again in
 directory /data previously created:
 %spark.pyspark
 def read_input(fin):
 '''
 Read input file from filesystem and return dataframe
 '''
 df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
 mode='PERMISSIVE', header='false', inferSchema='true')
 return df

 def write_output(df, fout):
 '''
 Write dataframe to filesystem
 '''

 df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
 header='true').save(fout)

 data_in = '/data/01.csv'
 data_out = '/data/02.csv'
 df = read_input(data_in)
 newdf = del_columns(df)
 write_output(newdf, data_out)


 I used --deploy-mode to *cluster* so that the driver is run in the
 worker in order to read the CSV in the /data directory and not in zeppelin.
 When running the notebook it complains that
 /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
 missing:
 org.apache.zeppelin.interpreter.InterpreterException: Ivy Default Cache
 set to: /root/.ivy2/cache The jars for the packages stored in:
 /root/.ivy2/jars :: loading settings :: url =
 jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
 com.databricks#spark-csv_2.11 added as a dependency :: resolving
 dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
 found com.databricks#spark-csv_2.11;1.5.0 in central found
 org.apache.commons#commons-csv;1.1 in central found
 com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
 resolve 310ms :: artifacts dl 6ms :: modules in use:
 com.databricks#spark-csv_2.11;1.5.0 from central in [default]
 com.univocity#univocity-parsers;1.5.1 from central in [default]
 org.apache.commons#commons-csv;1.1 from central in [default]
 - | |
 modules || artifacts | | conf | number| search|dwnlded|evicted||
 number|dwnlded|
 - |
 default | 3 | 0 | 0 | 0 || 3 | 0 |
 - ::
 retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0
 artifacts copied, 3 already retrieved (0kB/8ms) Running Spark using the
 REST application submission protocol. SLF4J: Class path contains multiple
 SLF4J bindings. SLF4J: Found binding in
 [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in
 [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in
 [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation. SLF4J: Actual binding is of type
 [org.slf4j.impl.Log4jLoggerFactory] Warning: Master endpoint
 spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077 was not a
 REST server. Falling back to legacy submission gateway instead. Ivy Default
 Cache set to: /root/.ivy2/cache

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread moon soo Lee
Hi,

Zeppelin does not support cluster-mode deploys at the moment. Fortunately,
there will be support for cluster mode soon!
Please keep an eye on https://issues.apache.org/jira/browse/ZEPPELIN-2040.

Thanks,
moon

On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli 
wrote:

> Shall I configure a remote interpreter to my notebook to run on the worker?
>
> Mayday!
>
> On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli 
> wrote:
>
>> What port does the remote interpreter use?
>>
>> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli 
>> wrote:
>>
>>> Hi Moon and al,
>>>
>>> I have a standalone cluster with one master, one worker. I submit jobs
>>> through zeppelin. master, worker, and zeppelin run in a separate container.
>>>
>>> My zeppelin-env.sh:
>>>
>>> # spark home
>>> export SPARK_HOME=/usr/local/spark
>>>
>>> # set hadoop conf dir
>>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>>>
>>> # set options to pass spark-submit command
>>> export SPARK_SUBMIT_OPTIONS="--packages
>>> com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>>>
>>> # worker memory
>>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
>>> -Dspark.submit.deployMode=cluster"
>>>
>>> # master
>>> export MASTER="spark://:7077"
>>>
>>> My notebook code is very simple. It read csv and write it again in
>>> directory /data previously created:
>>> %spark.pyspark
>>> def read_input(fin):
>>> '''
>>> Read input file from filesystem and return dataframe
>>> '''
>>> df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
>>> mode='PERMISSIVE', header='false', inferSchema='true')
>>> return df
>>>
>>> def write_output(df, fout):
>>> '''
>>> Write dataframe to filesystem
>>> '''
>>>
>>> df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
>>> header='true').save(fout)
>>>
>>> data_in = '/data/01.csv'
>>> data_out = '/data/02.csv'
>>> df = read_input(data_in)
>>> newdf = del_columns(df)
>>> write_output(newdf, data_out)
>>>
>>>
>>> I used --deploy-mode to *cluster* so that the driver is run in the
>>> worker in order to read the CSV in the /data directory and not in zeppelin.
>>> When running the notebook it complains that
>>> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
>>> missing:
>>> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default Cache
>>> set to: /root/.ivy2/cache The jars for the packages stored in:
>>> /root/.ivy2/jars :: loading settings :: url =
>>> jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>>> com.databricks#spark-csv_2.11 added as a dependency :: resolving
>>> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
>>> found com.databricks#spark-csv_2.11;1.5.0 in central found
>>> org.apache.commons#commons-csv;1.1 in central found
>>> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
>>> resolve 310ms :: artifacts dl 6ms :: modules in use:
>>> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>> com.univocity#univocity-parsers;1.5.1 from central in [default]
>>> org.apache.commons#commons-csv;1.1 from central in [default]
>>> - | |
>>> modules || artifacts | | conf | number| search|dwnlded|evicted||
>>> number|dwnlded|
>>> - |
>>> default | 3 | 0 | 0 | 0 || 3 | 0 |
>>> - ::
>>> retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0
>>> artifacts copied, 3 already retrieved (0kB/8ms) Running Spark using the
>>> REST application submission protocol. SLF4J: Class path contains multiple
>>> SLF4J bindings. SLF4J: Found binding in
>>> [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in
>>> [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in
>>> [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>> explanation. SLF4J: Actual binding is of type
>>> [org.slf4j.impl.Log4jLoggerFactory] Warning: Master endpoint
>>> spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077 was not a
>>> REST server. Falling back to legacy submission gateway instead. Ivy Default
>>> Cache set to: /root/.ivy2/cache The jars for the packages stored in:
>>> /root/.ivy2/jars com.databricks#spark-csv_2.11 added as a dependency ::
>>> resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs:
>>> [default] found com.databricks#spark-csv_2.11;1.5.0 in central found
>>> org.apache.commons#commons-csv;1.1 in central found
>>> com.univocity#univocity-parsers;1.5.1 in central :: resol

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
Shall I configure a remote interpreter for my notebook so that it runs on the worker?

Mayday!

On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli 
wrote:

> What port does the remote interpreter use?
>
> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli 
> wrote:
>
>> Hi Moon and al,
>>
>> I have a standalone cluster with one master, one worker. I submit jobs
>> through zeppelin. master, worker, and zeppelin run in a separate container.
>>
>> My zeppelin-env.sh:
>>
>> # spark home
>> export SPARK_HOME=/usr/local/spark
>>
>> # set hadoop conf dir
>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>>
>> # set options to pass spark-submit command
>> export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0
>> --deploy-mode cluster"
>>
>> # worker memory
>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
>> -Dspark.submit.deployMode=cluster"
>>
>> # master
>> export MASTER="spark://:7077"
>>
>> My notebook code is very simple. It read csv and write it again in
>> directory /data previously created:
>> %spark.pyspark
>> def read_input(fin):
>> '''
>> Read input file from filesystem and return dataframe
>> '''
>> df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
>> mode='PERMISSIVE', header='false', inferSchema='true')
>> return df
>>
>> def write_output(df, fout):
>> '''
>> Write dataframe to filesystem
>> '''
>> 
>> df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
>> header='true').save(fout)
>>
>> data_in = '/data/01.csv'
>> data_out = '/data/02.csv'
>> df = read_input(data_in)
>> newdf = del_columns(df)
>> write_output(newdf, data_out)
>>
>>
>> I used --deploy-mode to *cluster* so that the driver is run in the
>> worker in order to read the CSV in the /data directory and not in zeppelin.
>> When running the notebook it complains that /opt/zeppelin-0.7.1/inter
>> preter/spark/zeppelin-spark_2.11-0.7.1.jar is missing:
>> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default Cache
>> set to: /root/.ivy2/cache The jars for the packages stored in:
>> /root/.ivy2/jars :: loading settings :: url = jar:file:/opt/spark-2.1.0/jars
>> /ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>> com.databricks#spark-csv_2.11 added as a dependency :: resolving
>> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs:
>> [default] found com.databricks#spark-csv_2.11;1.5.0 in central found
>> org.apache.commons#commons-csv;1.1 in central found
>> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
>> resolve 310ms :: artifacts dl 6ms :: modules in use:
>> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>> com.univocity#univocity-parsers;1.5.1 from central in [default]
>> org.apache.commons#commons-csv;1.1 from central in [default]
>> - |
>> | modules || artifacts | | conf | number| search|dwnlded|evicted||
>> number|dwnlded| --
>> --- | default | 3 | 0 | 0 | 0 || 3 |
>> 0 | -
>> :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0
>> artifacts copied, 3 already retrieved (0kB/8ms) Running Spark using the
>> REST application submission protocol. SLF4J: Class path contains multiple
>> SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/
>> interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/
>> lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/sh
>> are/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation. SLF4J: Actual binding is of type 
>> [org.slf4j.impl.Log4jLoggerFactory]
>> Warning: Master endpoint spark://spark-drone-master-sof
>> iane.autoetl.svc.cluster.local:7077 was not a REST server. Falling back
>> to legacy submission gateway instead. Ivy Default Cache set to:
>> /root/.ivy2/cache The jars for the packages stored in: /root/.ivy2/jars
>> com.databricks#spark-csv_2.11 added as a dependency :: resolving
>> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs:
>> [default] found com.databricks#spark-csv_2.11;1.5.0 in central found
>> org.apache.commons#commons-csv;1.1 in central found
>> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
>> resolve 69ms :: artifacts dl 5ms :: modules in use:
>> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>> com.univocity#univocity-parsers;1.5.1 from central in [default]
>> org.apache.commons#commons-csv;1.1 from central in [default]
>> - |
>> | modules || artifacts | | conf 

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
What port does the remote interpreter use?

On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli 
wrote:

> Hi Moon and al,
>
> I have a standalone cluster with one master, one worker. I submit jobs
> through zeppelin. master, worker, and zeppelin run in a separate container.
>
> My zeppelin-env.sh:
>
> # spark home
> export SPARK_HOME=/usr/local/spark
>
> # set hadoop conf dir
> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>
> # set options to pass spark-submit command
> export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0
> --deploy-mode cluster"
>
> # worker memory
> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
> -Dspark.submit.deployMode=cluster"
>
> # master
> export MASTER="spark://:7077"
>
> My notebook code is very simple. It read csv and write it again in
> directory /data previously created:
> %spark.pyspark
> def read_input(fin):
> '''
> Read input file from filesystem and return dataframe
> '''
> df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
> mode='PERMISSIVE', header='false', inferSchema='true')
> return df
>
> def write_output(df, fout):
> '''
> Write dataframe to filesystem
> '''
> 
> df.write.mode('overwrite').format('com.databricks.spark.csv').options(delimiter=',',
> header='true').save(fout)
>
> data_in = '/data/01.csv'
> data_out = '/data/02.csv'
> df = read_input(data_in)
> newdf = del_columns(df)
> write_output(newdf, data_out)
>
>
> I used --deploy-mode to *cluster* so that the driver is run in the worker
> in order to read the CSV in the /data directory and not in zeppelin. When
> running the notebook it complains that /opt/zeppelin-0.7.1/
> interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is missing:
> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default Cache
> set to: /root/.ivy2/cache The jars for the packages stored in:
> /root/.ivy2/jars :: loading settings :: url = jar:file:/opt/spark-2.1.0/
> jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> com.databricks#spark-csv_2.11 added as a dependency :: resolving
> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
> found com.databricks#spark-csv_2.11;1.5.0 in central found
> org.apache.commons#commons-csv;1.1 in central found
> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
> resolve 310ms :: artifacts dl 6ms :: modules in use:
> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
> com.univocity#univocity-parsers;1.5.1 from central in [default]
> org.apache.commons#commons-csv;1.1 from central in [default]
> - | |
> modules || artifacts | | conf | number| search|dwnlded|evicted||
> number|dwnlded| --
> --- | default | 3 | 0 | 0 | 0 || 3 |
> 0 | -
> :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0
> artifacts copied, 3 already retrieved (0kB/8ms) Running Spark using the
> REST application submission protocol. SLF4J: Class path contains multiple
> SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/
> interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/
> slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in
> [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-
> 1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding
> in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-
> log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See
> http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Warning: Master endpoint spark://spark-drone-master-
> sofiane.autoetl.svc.cluster.local:7077 was not a REST server. Falling
> back to legacy submission gateway instead. Ivy Default Cache set to:
> /root/.ivy2/cache The jars for the packages stored in: /root/.ivy2/jars
> com.databricks#spark-csv_2.11 added as a dependency :: resolving
> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
> found com.databricks#spark-csv_2.11;1.5.0 in central found
> org.apache.commons#commons-csv;1.1 in central found
> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
> resolve 69ms :: artifacts dl 5ms :: modules in use:
> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
> com.univocity#univocity-parsers;1.5.1 from central in [default]
> org.apache.commons#commons-csv;1.1 from central in [default]
> - | |
> modules || artifacts | | conf | number| search|dwnlded|evicted||
> number|dwnlded| --
> --- | default | 3 | 0 | 0 | 0 || 3 |
> 0 | -
> :: retrieving ::

Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
Hi Moon and all,

I have a standalone cluster with one master and one worker. I submit jobs
through Zeppelin. The master, the worker, and Zeppelin each run in a separate
container.

My zeppelin-env.sh:

# spark home
export SPARK_HOME=/usr/local/spark

# set hadoop conf dir
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop

# set options to pass spark-submit command
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0
--deploy-mode cluster"

# driver memory and deploy mode
export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
-Dspark.submit.deployMode=cluster"

# master
export MASTER="spark://:7077"

My notebook code is very simple. It reads a CSV and writes it back into the
/data directory, which was created beforehand:
%spark.pyspark
def read_input(fin):
    '''
    Read input file from filesystem and return dataframe
    '''
    df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
                              mode='PERMISSIVE', header='false',
                              inferSchema='true')
    return df

def write_output(df, fout):
    '''
    Write dataframe to filesystem
    '''
    df.write.mode('overwrite').format('com.databricks.spark.csv').options(
        delimiter=',', header='true').save(fout)

data_in = '/data/01.csv'
data_out = '/data/02.csv'
df = read_input(data_in)
newdf = del_columns(df)
write_output(newdf, data_out)
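
del_columns is defined elsewhere in the notebook and isn't shown above; a
hypothetical stand-in that just drops the first column, only to make the
snippet self-contained, could be:

def del_columns(df):
    '''
    Hypothetical placeholder for the real del_columns: drop the first column.
    '''
    return df.drop(df.columns[0])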


I set --deploy-mode to *cluster* so that the driver runs on the worker and
reads the CSV from the /data directory there, rather than in Zeppelin. When
running the notebook it complains that
/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
missing:
org.apache.zeppelin.interpreter.InterpreterException:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.5.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 310ms :: artifacts dl 6ms
    :: modules in use:
    com.databricks#spark-csv_2.11;1.5.0 from central in [default]
    com.univocity#univocity-parsers;1.5.1 from central in [default]
    org.apache.commons#commons-csv;1.1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 3 already retrieved (0kB/8ms)
Running Spark using the REST application submission protocol.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Warning: Master endpoint spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077 was not a REST server. Falling back to legacy submission gateway instead.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found com.databricks#spark-csv_2.11;1.5.0 in central
    found org.apache.commons#commons-csv;1.1 in central
    found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 69ms :: artifacts dl 5ms
    :: modules in use:
    com.databricks#spark-csv_2.11;1.5.0 from central in [default]
    com.univocity#univocity-parsers;1.5.1 from central in [default]
    org.apache.commons#commons-csv;1.1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 3 already retrieved (0kB/4ms)
java.nio.file.NoSuchFileException: /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.r