Re: Spark-CSV - Zeppelin tries to read CSV locally in Standalone mode

2017-05-10 Thread Sofiane Cherchalli
I've put the CSV on the worker node, since the job runs on the worker. I
didn't put the CSV on the master because I believe it doesn't run jobs.

If I put the CSV on the Zeppelin node with the same path as on the worker, it
reads the CSV and writes a _SUCCESS file locally. The job runs on the
worker too but doesn't terminate: the result stays under a _temporary
directory on the worker.

worker - ls -laRt /data/02.csv/


02.csv/:
total 0
drwxr-xr-x. 3 root root 24 Apr 28 09:55 .
drwxr-xr-x. 3 root root 15 Apr 28 09:55 _temporary
drwxr-xr-x. 3 root root 64 Apr 28 09:55 ..

02.csv/_temporary:
total 0
drwxr-xr-x. 5 root root 106 Apr 28 09:56 0
drwxr-xr-x. 3 root root  15 Apr 28 09:55 .
drwxr-xr-x. 3 root root  24 Apr 28 09:55 ..

02.csv/_temporary/0:
total 0
drwxr-xr-x. 5 root root 106 Apr 28 09:56 .
drwxr-xr-x. 2 root root   6 Apr 28 09:56 _temporary
drwxr-xr-x. 2 root root 129 Apr 28 09:56 task_20170428095632_0005_m_00
drwxr-xr-x. 2 root root 129 Apr 28 09:55 task_20170428095516_0002_m_00
drwxr-xr-x. 3 root root  15 Apr 28 09:55 ..

02.csv/_temporary/0/_temporary:
total 0
drwxr-xr-x. 2 root root   6 Apr 28 09:56 .
drwxr-xr-x. 5 root root 106 Apr 28 09:56 ..

02.csv/_temporary/0/task_20170428095632_0005_m_00:
total 52
drwxr-xr-x. 5 root root   106 Apr 28 09:56 ..
-rw-r--r--. 1 root root   376 Apr 28 09:56 .part-0-e39ebc76-5343-407e-b42e-c33e69b8fd1a.csv.crc
-rw-r--r--. 1 root root 46605 Apr 28 09:56 part-0-e39ebc76-5343-407e-b42e-c33e69b8fd1a.csv
drwxr-xr-x. 2 root root   129 Apr 28 09:56 .

02.csv/_temporary/0/task_20170428095516_0002_m_00:
total 52
drwxr-xr-x. 5 root root   106 Apr 28 09:56 ..
-rw-r--r--. 1 root root   376 Apr 28 09:55 .part-0-c2ac5299-26f6-4b23-a74b-b3dc96464271.csv.crc
-rw-r--r--. 1 root root 46605 Apr 28 09:55 part-0-c2ac5299-26f6-4b23-a74b-b3dc96464271.csv


zeppelin - ls -laRt 02.csv/


02.csv/:
total 12
drwxr-sr-x 2 root 1700 4096 Apr 28 09:56 .
-rw-r--r-- 1 root 1700    8 Apr 28 09:56 ._SUCCESS.crc
-rw-r--r-- 1 root 1700    0 Apr 28 09:56 _SUCCESS
drwxrwsr-x 5 root 1700 4096 Apr 28 09:56 ..
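
For what it's worth, that layout is consistent with the driver and the executor
each writing to its own local disk: the tasks commit their part files under
_temporary on the worker, while the driver (the Zeppelin interpreter, in client
mode) writes _SUCCESS on its own node and never sees the worker's files. A
minimal sketch to confirm from a paragraph which hosts the driver and the
executors actually run on (it only assumes the sc SparkContext that
%spark.pyspark already provides):

%spark.pyspark
import socket

# host running the driver (the Zeppelin interpreter in client mode)
print('driver host: %s' % socket.gethostname())

# hosts running executor tasks (each task reports its own hostname)
hosts = (sc.parallelize(range(sc.defaultParallelism))
           .map(lambda _: socket.gethostname())
           .distinct()
           .collect())
print('executor hosts: %s' % hosts)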




On Wed, May 10, 2017 at 2:06 PM, Meethu Mathew <meethu.mat...@flytxt.com>
wrote:

> Try putting the csv in the same path in all the nodes or in a mount point
> path which is accessible by all the nodes
>
> Regards,
>
>
> Meethu Mathew
>
>
> On Wed, May 10, 2017 at 3:36 PM, Sofiane Cherchalli <sofian...@gmail.com>
> wrote:
>
>> Yes, I already tested with spark-shell and pyspark, with the same result.
>>
>> Can't I use the Linux filesystem to read the CSV, such as file:///data/file.csv?
>> My understanding is that the job is sent to and interpreted on the worker,
>> isn't it?
>>
>> Thanks.
>>
>> On Tue, May 9, 2017 at 8:23 PM, Jongyoul Lee <jongy...@gmail.com>
>> wrote:
>>
>>> Could you test if it works with spark-shell?
>>>
>>> On Sun, May 7, 2017 at 5:22 PM, Sofiane Cherchalli <sofian...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a standalone cluster, one master and one worker, running on
>>>> separate nodes. Zeppelin is running on a separate node too, in client
>>>> mode.
>>>>
>>>> When I run a notebook that reads a CSV file located on the worker
>>>> node with the Spark-CSV package, Zeppelin tries to read the CSV locally and
>>>> fails because the CSV is on the worker node and not on the Zeppelin node.
>>>>
>>>> Is this the expected behavior?
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>>
>>> --
>>> 이종열, Jongyoul Lee, 李宗烈
>>> http://madeng.net
>>>
>>
>


Spark-CSV - Zeppelin tries to read CSV locally in Standalone mode

2017-05-07 Thread Sofiane Cherchalli
Hi,

I have a standalone cluster, one master and one worker, running on separate
nodes. Zeppelin is running on a separate node too, in client mode.

When I run a notebook that reads a CSV file located on the worker node with
the Spark-CSV package, Zeppelin tries to read the CSV locally and fails because
the CSV is on the worker node and not on the Zeppelin node.

Is this the expected behavior?

Thanks.
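
Not definitive, but the behaviour described above is consistent with client
mode: the driver runs inside the Zeppelin container, so a plain path (or a
file:/// URI) is resolved on the Zeppelin node first, and each executor then
resolves the same string against its own local filesystem. The usual way
around it is to point the notebook at storage every node can reach. A sketch,
assuming an HDFS namenode reachable as hdfs://namenode:9000 (a hypothetical
host; a directory mounted at the same path in every container works the same
way):

%spark.pyspark
# read and write through storage the driver and all executors can see
df = sqlContext.read.load('hdfs://namenode:9000/data/01.csv',
                          format='com.databricks.spark.csv',
                          mode='PERMISSIVE', header='false',
                          inferSchema='true')
df.write.mode('overwrite') \
    .format('com.databricks.spark.csv') \
    .options(delimiter=',', header='true') \
    .save('hdfs://namenode:9000/data/02.csv')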


Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
Hi Moon,

Great, I am keen to see ZEPPELIN-2040 resolved soon. But meanwhile, is there
any workaround?

Thanks.
Sofiane


On Wed, May 3, 2017 at 8:40 PM, moon soo Lee <m...@apache.org> wrote:

> Zeppelin doesn't need to be installed on every worker.
> You can think of the way the SparkInterpreter in Zeppelin works as very similar
> to spark-shell (which works in client mode), until ZEPPELIN-2040 is resolved.
>
> Therefore, if spark-shell works on a machine against your standalone cluster,
> Zeppelin will work on that same machine with the standalone cluster.
>
> Thanks,
> moon
>
> On Wed, May 3, 2017 at 2:28 PM Sofiane Cherchalli <sofian...@gmail.com>
> wrote:
>
>> Hi Moon,
>>
>> So in my case, if I have a standalone or YARN cluster, the workaround
>> would be to install Zeppelin alongside every worker, proxy them, and run each
>> Zeppelin in client mode?
>>
>> Thanks,
>> Sofiane
>>
>> On Wed, May 3, 2017 at 7:12 PM, moon soo Lee <m...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> Zeppelin does not support cluster-mode deploy at the moment.
>>> Fortunately, there will be support for cluster mode soon!
>>> Please keep an eye on
>>> https://issues.apache.org/jira/browse/ZEPPELIN-2040.
>>>
>>> Thanks,
>>> moon
>>>
>>> On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli <sofian...@gmail.com>
>>> wrote:
>>>
>>>> Shall I configure a remote interpreter for my notebook to run on the
>>>> worker?
>>>>
>>>> Mayday!
>>>>
>>>> On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli <sofian...@gmail.com
>>>> > wrote:
>>>>
>>>>> What port does the remote interpreter use?
>>>>>
>>>>> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli <
>>>>> sofian...@gmail.com> wrote:
>>>>>
>>>>>> Hi Moon and all,
>>>>>>
>>>>>> I have a standalone cluster with one master and one worker. I submit
>>>>>> jobs through Zeppelin. The master, the worker, and Zeppelin each run in a
>>>>>> separate container.
>>>>>>
>>>>>> My zeppelin-env.sh:
>>>>>>
>>>>>> # spark home
>>>>>> export SPARK_HOME=/usr/local/spark
>>>>>>
>>>>>> # set hadoop conf dir
>>>>>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>>>>>>
>>>>>> # set options to pass spark-submit command
>>>>>> export SPARK_SUBMIT_OPTIONS="--packages
>>>>>> com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>>>>>>
>>>>>> # worker memory
>>>>>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
>>>>>> -Dspark.submit.deployMode=cluster"
>>>>>>
>>>>>> # master
>>>>>> export MASTER="spark://:7077"
>>>>>>
>>>>>> My notebook code is very simple. It reads a CSV and writes it again into
>>>>>> the /data directory, which was created beforehand:
>>>>>> %spark.pyspark
>>>>>> def read_input(fin):
>>>>>>     '''
>>>>>>     Read input file from filesystem and return dataframe
>>>>>>     '''
>>>>>>     df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
>>>>>>                               mode='PERMISSIVE', header='false',
>>>>>>                               inferSchema='true')
>>>>>>     return df
>>>>>>
>>>>>> def write_output(df, fout):
>>>>>>     '''
>>>>>>     Write dataframe to filesystem
>>>>>>     '''
>>>>>>     df.write.mode('overwrite').format('com.databricks.spark.csv') \
>>>>>>         .options(delimiter=',', header='true').save(fout)
>>>>>>
>>>>>> data_in = '/data/01.csv'
>>>>>> data_out = '/data/02.csv'
>>>>>> df = read_input(data_in)
>>>>>> newdf = del_columns(df)
>>>>>> write_output(newdf, data_out)
>>>>>>
>>>>>>
>>>>>> I set --deploy-mode to *cluster* so that the driver runs on the
>>>>>> worker, in order to read the CSV from the /data directory there and not in
>>>>>> Zeppelin.
>>>>>> When running the notebook it complains that
>>>>>> /opt/zeppe

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
Hi Moon,

So in my case, if I have a standalone or YARN cluster, the workaround would
be to install Zeppelin alongside every worker, proxy them, and run each
Zeppelin in client mode?

Thanks,
Sofiane

On Wed, May 3, 2017 at 7:12 PM, moon soo Lee <m...@apache.org> wrote:

> Hi,
>
> Zeppelin does not support cluster-mode deploy at the moment. Fortunately,
> there will be support for cluster mode soon!
> Please keep an eye on https://issues.apache.org/jira/browse/ZEPPELIN-2040.
>
> Thanks,
> moon
>
> On Wed, May 3, 2017 at 11:00 AM Sofiane Cherchalli <sofian...@gmail.com>
> wrote:
>
>> Shall I configure a remote interpreter for my notebook to run on the
>> worker?
>>
>> Mayday!
>>
>> On Wed, May 3, 2017 at 4:18 PM, Sofiane Cherchalli <sofian...@gmail.com>
>> wrote:
>>
>>> What port does the remote interpreter use?
>>>
>>> On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli <sofian...@gmail.com>
>>> wrote:
>>>
>>>> Hi Moon and all,
>>>>
>>>> I have a standalone cluster with one master and one worker. I submit jobs
>>>> through Zeppelin. The master, the worker, and Zeppelin each run in a separate container.
>>>>
>>>> My zeppelin-env.sh:
>>>>
>>>> # spark home
>>>> export SPARK_HOME=/usr/local/spark
>>>>
>>>> # set hadoop conf dir
>>>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>>>>
>>>> # set options to pass spark-submit command
>>>> export SPARK_SUBMIT_OPTIONS="--packages
>>>> com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>>>>
>>>> # worker memory
>>>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
>>>> -Dspark.submit.deployMode=cluster"
>>>>
>>>> # master
>>>> export MASTER="spark://:7077"
>>>>
>>>> My notebook code is very simple. It reads a CSV and writes it again into
>>>> the /data directory, which was created beforehand:
>>>> %spark.pyspark
>>>> def read_input(fin):
>>>>     '''
>>>>     Read input file from filesystem and return dataframe
>>>>     '''
>>>>     df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
>>>>                               mode='PERMISSIVE', header='false',
>>>>                               inferSchema='true')
>>>>     return df
>>>>
>>>> def write_output(df, fout):
>>>>     '''
>>>>     Write dataframe to filesystem
>>>>     '''
>>>>     df.write.mode('overwrite').format('com.databricks.spark.csv') \
>>>>         .options(delimiter=',', header='true').save(fout)
>>>>
>>>> data_in = '/data/01.csv'
>>>> data_out = '/data/02.csv'
>>>> df = read_input(data_in)
>>>> newdf = del_columns(df)
>>>> write_output(newdf, data_out)
>>>>
>>>>
>>>> I set --deploy-mode to *cluster* so that the driver runs on the
>>>> worker, in order to read the CSV from the /data directory there and not in Zeppelin.
>>>> When running the notebook it complains that
>>>> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
>>>> missing:
>>>> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default Cache
>>>> set to: /root/.ivy2/cache The jars for the packages stored in:
>>>> /root/.ivy2/jars :: loading settings :: url =
>>>> jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>>>> com.databricks#spark-csv_2.11 added as a dependency :: resolving
>>>> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
>>>> found com.databricks#spark-csv_2.11;1.5.0 in central found
>>>> org.apache.commons#commons-csv;1.1 in central found
>>>> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
>>>> resolve 310ms :: artifacts dl 6ms :: modules in use:
>>>> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>>> com.univocity#univocity-parsers;1.5.1 from central in [default]
>>>> org.apache.commons#commons-csv;1.1 from central in [default]
>>>> ---------------------------------------------------------------------
>>>> |                  |            modules            ||   artifacts   |
>>>> |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
>>>> ---------------------------------------------------------------------
>>>> |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
>>>> ---------------------------------------------------------------------

Re: Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
What port does the remote interpreter use?

On Wed, May 3, 2017 at 2:14 PM, Sofiane Cherchalli <sofian...@gmail.com>
wrote:

> Hi Moon and all,
>
> I have a standalone cluster with one master and one worker. I submit jobs
> through Zeppelin. The master, the worker, and Zeppelin each run in a separate container.
>
> My zeppelin-env.sh:
>
> # spark home
> export SPARK_HOME=/usr/local/spark
>
> # set hadoop conf dir
> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>
> # set options to pass spark-submit command
> export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0
> --deploy-mode cluster"
>
> # worker memory
> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
> -Dspark.submit.deployMode=cluster"
>
> # master
> export MASTER="spark://:7077"
>
> My notebook code is very simple. It reads a CSV and writes it again into
> the /data directory, which was created beforehand:
> %spark.pyspark
> def read_input(fin):
>     '''
>     Read input file from filesystem and return dataframe
>     '''
>     df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
>                               mode='PERMISSIVE', header='false',
>                               inferSchema='true')
>     return df
>
> def write_output(df, fout):
>     '''
>     Write dataframe to filesystem
>     '''
>     df.write.mode('overwrite').format('com.databricks.spark.csv') \
>         .options(delimiter=',', header='true').save(fout)
>
> data_in = '/data/01.csv'
> data_out = '/data/02.csv'
> df = read_input(data_in)
> newdf = del_columns(df)
> write_output(newdf, data_out)
>
>
> I set --deploy-mode to *cluster* so that the driver runs on the worker,
> in order to read the CSV from the /data directory there and not in Zeppelin. When
> running the notebook it complains that /opt/zeppelin-0.7.1/
> interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is missing:
> org.apache.zeppelin.interpreter.InterpreterException: Ivy Default Cache
> set to: /root/.ivy2/cache The jars for the packages stored in:
> /root/.ivy2/jars :: loading settings :: url = jar:file:/opt/spark-2.1.0/
> jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> com.databricks#spark-csv_2.11 added as a dependency :: resolving
> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
> found com.databricks#spark-csv_2.11;1.5.0 in central found
> org.apache.commons#commons-csv;1.1 in central found
> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
> resolve 310ms :: artifacts dl 6ms :: modules in use:
> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
> com.univocity#univocity-parsers;1.5.1 from central in [default]
> org.apache.commons#commons-csv;1.1 from central in [default]
> ---------------------------------------------------------------------
> |                  |            modules            ||   artifacts   |
> |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
> ---------------------------------------------------------------------
> |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
> ---------------------------------------------------------------------
> :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0
> artifacts copied, 3 already retrieved (0kB/8ms) Running Spark using the
> REST application submission protocol. SLF4J: Class path contains multiple
> SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/
> interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/
> slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in
> [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-
> 1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding
> in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-
> log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See
> http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> Warning: Master endpoint spark://spark-drone-master-
> sofiane.autoetl.svc.cluster.local:7077 was not a REST server. Falling
> back to legacy submission gateway instead. Ivy Default Cache set to:
> /root/.ivy2/cache The jars for the packages stored in: /root/.ivy2/jars
> com.databricks#spark-csv_2.11 added as a dependency :: resolving
> dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default]
> found com.databricks#spark-csv_2.11;1.5.0 in central found
> org.apache.commons#commons-csv;1.1 in central found
> com.univocity#univocity-parsers;1.5.1 in central :: resolution report ::
> resolve 69ms :: artifacts dl 5ms :: modules in use:
> com.databricks#spark-csv_2.11;1.5.0 from central in [default]
> com.univocity#univocity-parsers;1.5.1 from central in [default]
> org.

Running a notebook in a standalone cluster mode issues

2017-05-03 Thread Sofiane Cherchalli
Hi Moon and all,

I have a standalone cluster with one master and one worker. I submit jobs
through Zeppelin. The master, the worker, and Zeppelin each run in a separate container.

My zeppelin-env.sh:

# spark home
export SPARK_HOME=/usr/local/spark

# set hadoop conf dir
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop

# set options to pass spark-submit command
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0
--deploy-mode cluster"

# worker memory
export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g
-Dspark.submit.deployMode=cluster"

# master
export MASTER="spark://:7077"

My notebook code is very simple. It reads a CSV and writes it again into the
/data directory, which was created beforehand:
%spark.pyspark
def read_input(fin):
    '''
    Read input file from filesystem and return dataframe
    '''
    df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
                              mode='PERMISSIVE', header='false',
                              inferSchema='true')
    return df

def write_output(df, fout):
    '''
    Write dataframe to filesystem
    '''
    df.write.mode('overwrite').format('com.databricks.spark.csv') \
        .options(delimiter=',', header='true').save(fout)

data_in = '/data/01.csv'
data_out = '/data/02.csv'
df = read_input(data_in)
newdf = del_columns(df)
write_output(newdf, data_out)


I set --deploy-mode to *cluster* so that the driver runs on the worker,
in order to read the CSV from the /data directory there and not in Zeppelin. When
running the notebook it complains that
/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is
missing:
org.apache.zeppelin.interpreter.InterpreterException:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-csv_2.11;1.5.0 in central
        found org.apache.commons#commons-csv;1.1 in central
        found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 310ms :: artifacts dl 6ms
        :: modules in use:
        com.databricks#spark-csv_2.11;1.5.0 from central in [default]
        com.univocity#univocity-parsers;1.5.1 from central in [default]
        org.apache.commons#commons-csv;1.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/8ms)
Running Spark using the REST application submission protocol.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Warning: Master endpoint spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077 was not a REST server. Falling back to legacy submission gateway instead.
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
com.databricks#spark-csv_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
        confs: [default]
        found com.databricks#spark-csv_2.11;1.5.0 in central
        found org.apache.commons#commons-csv;1.1 in central
        found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 69ms :: artifacts dl 5ms
        :: modules in use:
        com.databricks#spark-csv_2.11;1.5.0 from central in [default]
        com.univocity#univocity-parsers;1.5.1 from central in [default]
        org.apache.commons#commons-csv;1.1 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/4ms)
java.nio.file.NoSuchFileException: /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at
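
For context, not a definitive diagnosis: with --deploy-mode cluster the
standalone master launches the driver on a worker, and that worker has no
/opt/zeppelin-0.7.1 installation, which is one plausible way to end up with
the NoSuchFileException above. A quick sketch to check, from a paragraph, how
the interpreter was actually submitted (it only assumes the sc SparkContext
that %spark.pyspark provides):

%spark.pyspark
# how was this interpreter actually submitted?
print('master: %s' % sc.master)
print('deploy mode: %s' % sc.getConf().get('spark.submit.deployMode', 'client'))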

zeppelin 0.7.1 and Spark cluster standalone - Reading and writing csv

2017-04-28 Thread Sofiane Cherchalli
Hi,

I have a Spark cluster in standalone mode with one worker. Each of
Zeppelin, the Spark master, and the Spark slave runs in its own Docker container.

I am trying to read and write a CSV from a notebook, but I'm having issues.

First, my zeppelin-env.sh:
# spark home
export SPARK_HOME=/opt/spark-2.1.0

# set hadoop conf dir
export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop

# set options to pass spark-submit command
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0"

# worker memory
export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g"

# master
export MASTER="spark://master:7077"


The notebook:
%spark.pyspark
data_in = '/data/01.csv'
data_out = '/data/02.csv'

def read_input(fin):
    '''
    Read input file from filesystem and return dataframe
    '''
    df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
                              mode='PERMISSIVE', header='false',
                              inferSchema='true')
    return df

def write_output(df, fout):
    '''
    Write dataframe to filesystem
    '''
    df.write.mode('overwrite').format('com.databricks.spark.csv') \
        .options(delimiter=',', header='true').save(fout)

df = read_input(data_in)
write_output(df, data_out)

I copied the /data/01.csv file onto the Spark worker.

When I run the notebook it fails, complaining that /data/01.csv was not
found in the Zeppelin container:
Traceback (most recent call last):
File "/opt/spark-2.1.0/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark-2.1.0/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o46.load.
: org.apache.spark.sql.AnalysisException: Path does not exist:
file:/data/01.csv;
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-5030675375956428180.py", line 337, in 
exec(code)
File "", line 1, in 
File "", line 5, in read_input
File "/opt/spark-2.1.0/python/pyspark/sql/readwriter.py", line 149, in load
return self._df(self._jreader.load(path))
File
"/opt/spark-2.1.0/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/opt/spark-2.1.0/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Path does not exist:
file:/data/01.csv;'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-5030675375956428180.py", line 349, in 
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/opt/spark-2.1.0/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark-2.1.0/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o46.load.
: org.apache.spark.sql.AnalysisException: Path does not exist:
file:/data/01.csv;
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
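
The AnalysisException above is raised on the driver side: in client mode the
driver is the Zeppelin interpreter process, so file:/data/01.csv is checked
inside the Zeppelin container before any task ever reaches the worker. A
one-paragraph sketch to verify what the driver actually sees (nothing beyond
the standard library is assumed):

%spark.pyspark
import os
# this runs in the driver, i.e. inside the Zeppelin container in client mode
print(os.path.exists('/data/01.csv'))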

Re: Helium - Algorithm

2017-04-18 Thread Sofiane Cherchalli
Thanks Moon. I'll have a look at it.


On Wed, Apr 19, 2017 at 6:06 AM, moon soo Lee <m...@apache.org> wrote:

> Hi,
>
> If you take a look at the Helium Proof of Concept video [1] in the proposal [2],
> you'll see one Helium app load data and then visualize the data with another
> Helium app (from the 1-minute mark).
>
> So I would say it's totally possible, although we might need some
> improvements to do it more smoothly.
>
> What do you think?
>
> Thanks,
> moon
>
> [1] https://www.youtube.com/watch?time_continue=10=8Wdc70e6QVI
> [2] https://cwiki.apache.org/confluence/display/ZEPPELIN/Helium+proposal
>
> On Tue, Apr 18, 2017 at 7:49 AM Sofiane Cherchalli <sofian...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Is it possible to use Helium to just implement algorithms without views?
>> My idea is to have a catalog of algorithms that could be chained together.
>> Each algorithm would read input from hdfs, process, and write output to
>> hdfs. This could be very useful for data preprocessing.
>>
>> Any thought or suggestions about that?
>>
>> Thanks.
>> Sofiane
>>
>


Helium - Algorithm

2017-04-18 Thread Sofiane Cherchalli
Hi,

Is it possible to use Helium to just implement algorithms without views? My
idea is to have a catalog of algorithms that could be chained together.
Each algorithm would read input from hdfs, process, and write output to
hdfs. This could be very useful for data preprocessing.
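
Helium apps aside, the data flow described above can be sketched as plain
chainable steps, each one reading from HDFS and writing back to HDFS. A
sketch only: the hdfs://namenode:9000 paths and the drop_nulls step are
hypothetical placeholders for a catalogued algorithm.

%spark.pyspark
# each "algorithm" is a function from DataFrame to DataFrame, so steps chain
def drop_nulls(df):
    return df.dropna()

def run_step(step, fin, fout):
    df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
                              header='true', inferSchema='true')
    step(df).write.mode('overwrite') \
        .format('com.databricks.spark.csv') \
        .options(delimiter=',', header='true').save(fout)

run_step(drop_nulls,
         'hdfs://namenode:9000/data/01.csv',
         'hdfs://namenode:9000/data/02.csv')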

Any thought or suggestions about that?

Thanks.
Sofiane


Re: Zeppelin Notebook API - reporting errors

2017-04-03 Thread Sofiane Cherchalli
Hi Moon,

I just created the issue: ZEPPELIN-2345
<https://issues.apache.org/jira/browse/ZEPPELIN-2345>

Best regards,
Sofiane

On Sat, Apr 1, 2017 at 1:53 AM, Sofiane Cherchalli <sofian...@gmail.com>
wrote:

> No problem. I'll file the issue.
>
> Thanks
>
>
> On Fri, Mar 31, 2017 at 10:32 PM, moon soo Lee <m...@apache.org>
> wrote:
>
>> Thanks for the suggestion.
>>
>> As far as I know, there's no related issue in JIRA.
>> Do you mind creating one?
>>
>> Thanks,
>> moon
>>
>>
>> On Fri, Mar 31, 2017 at 2:49 AM Sofiane Cherchalli <sofian...@gmail.com>
>> wrote:
>>
>> Any taker? Is this an issue or expected behaviour?
>>
>> Thanks.
>>
>> On Thu, Mar 30, 2017 at 9:52 PM, Sofiane Cherchalli <sofian...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> I am running notebooks through the Notebook API by running every paragraph
>> of the notebook synchronously, but it seems that if something fails
>> during the execution of a paragraph, due to an exception for instance, the
>> API returns a 500 Server Error. To get more detail about the exception, one
>> has to open the notebook to see the stack trace. Is that the expected
>> behaviour? Shouldn't the API catch the exception and return the 500 with
>> the stack trace text of the exception?
>>
>> Thanks.
>> Sofiane
>>
>>
>>


Re: Zeppelin Notebook API - reporting errors

2017-03-31 Thread Sofiane Cherchalli
Any taker? Is this an issue or expected behaviour?

Thanks.

On Thu, Mar 30, 2017 at 9:52 PM, Sofiane Cherchalli <sofian...@gmail.com>
wrote:

> Hi,
>
> I am running notebooks through the Notebook API by running every paragraph
> of the notebook synchronously, but it seems that if something fails
> during the execution of a paragraph, due to an exception for instance, the
> API returns a 500 Server Error. To get more detail about the exception, one
> has to open the notebook to see the stack trace. Is that the expected
> behaviour? Shouldn't the API catch the exception and return the 500 with
> the stack trace text of the exception?
>
> Thanks.
> Sofiane
>
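
One workaround until the API itself returns the error detail: run the
paragraph, and when the call comes back with a 500, fetch the paragraph
through the REST API and read the stored result there. A sketch with the
Python requests library; the base URL, note ID, and paragraph ID are
placeholders, and the exact shape of the returned result field may differ
between Zeppelin versions.

import requests

base = 'http://localhost:8080/api/notebook'                        # placeholder
note_id, paragraph_id = '2A94M5J1Z', '20170330-215200_123456789'   # placeholders

# run the paragraph synchronously
run = requests.post('%s/run/%s/%s' % (base, note_id, paragraph_id))

if run.status_code == 500:
    # fetch the paragraph back to read whatever result / stack trace was stored
    info = requests.get('%s/%s/paragraph/%s' % (base, note_id, paragraph_id))
    print(info.json().get('body', {}).get('results'))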


Angular FrontEnd API - Drag and Drop

2017-03-28 Thread Sofiane Cherchalli
Hi,

Would it be possible to use the Angular front-end API to, for example, list the
notebooks and display them in a Zeppelin paragraph? Also, would it be
possible to use drag and drop?

Thanks
Sofiane


Re: Release on 0.7.1 and 0.7.2

2017-03-15 Thread Sofiane Cherchalli
Hi Zeppelin team,

What's the release forecast? Shall we expect 0.7.1 or 0.7.2 by Friday?

Thanks.

On Tue, 14 Mar 2017 at 13:09, Jianfeng (Jeff) Zhang 
wrote:


+1

Best Regard,
Jeff Zhang


From: Jun Kim 
Reply-To: "users@zeppelin.apache.org" 
Date: Tuesday, March 14, 2017 at 11:38 AM
To: "users@zeppelin.apache.org" 
Subject: Re: Release on 0.7.1 and 0.7.2

Cool! I look forward to it!

On Tue, Mar 14, 2017 at 12:31 PM, moon soo Lee wrote:

Sounds like a plan!


On Mon, Mar 13, 2017 at 8:22 PM Xiaohui Liu  wrote:

This is the right action. In fact, the 0.7.0 release binary did not work for my
team. We started using the 0.7.1-SNAPSHOT almost immediately after the 0.7.0
release.

I guess many of us are taking the same route.

But for new Zeppelin users, starting with 0.7.0 will give them the wrong
first impression.


On Tue, 14 Mar 2017 at 10:28 AM, Jongyoul Lee  wrote:

Hi dev and users,

Since we released 0.7.0, many users and devs have reported a lot of critical
bugs. For that reason, the community, including me, started to prepare a
new minor release with an umbrella issue [1]. Thanks to contributors' efforts, we
have resolved some of the issues and reviewed almost all of the unresolved ones. I
want to talk about the new minor release at this point. Generally, we
resolve all issues reported as bugs before we release, but some issues
are very critical and cause serious problems when using Apache Zeppelin. So
I think, this time, it's better to release 0.7.1 as soon as we can and
prepare another minor release with the rest of the unresolved issues.

I'd like to start the process this Friday, and if some issues are not merged
by then, I hope they will be included in 0.7.2.

Feel free to talk to me if you have a better plan to improve users'
experiences.

Regards,
Jongyoul Lee

[1] https://issues.apache.org/jira/browse/ZEPPELIN-2134


-- 
이종열, Jongyoul Lee, 李宗烈
http://madeng.net

-- 
Taejun Kim

Data Mining Lab.
School of Electrical and Computer Engineering
University of Seoul


Re: Running all paragraphs with dynamic form's value parameters

2017-02-28 Thread Sofiane Cherchalli
After playing around with running notes and paragraphs, it seems that passing
dynamic form values works only with paragraphs but not with notes. It would be
nice to run a note with dynamic form values, allowing paragraphs to
override them.
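
Until that is supported, a workaround consistent with the observation above is
to drive it from outside: list the note's paragraphs and run them one by one,
passing the form values to each run call. A sketch with the Python requests
library; the base URL and note ID are placeholders, and the payload shape
simply follows the curl examples quoted below.

import requests

base = 'http://localhost:8080/api/notebook'   # placeholder
note_id = '2A94M5J1Z'                         # placeholder note id
params = {'form_field': 'value'}              # dynamic form values

# list the paragraphs of the note, then run each one with the form values
note = requests.get('%s/%s' % (base, note_id)).json()['body']
for p in note['paragraphs']:
    r = requests.post('%s/run/%s/%s' % (base, note_id, p['id']), json=params)
    r.raise_for_status()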

On Tue, Feb 28, 2017 at 8:33 PM, Sofiane Cherchalli <sofian...@gmail.com>
wrote:

> The API allows running a paragraph with dynamic form values. Example:
> curl -sL -X POST -d "{'form_field': 'value'}"
> http://localhost:8080/api/notebook/run//
>
>
> Is it possible to run all paragraphs with dynamic form values? For
> instance:
> curl -sL -X POST -d "{'form_field': 'value'}"
> http://localhost:8080/api/notebook/run/
>
> I can't get it to work... Any hint?
>
>
>


Running all paragraphs with dynamic form's value parameters

2017-02-28 Thread Sofiane Cherchalli
The API allows running a paragraph with dynamic form values. Example:

curl -sL -X POST -d "{'form_field': 'value'}"
http://localhost:8080/api/notebook/run//

Is it possible to run all paragraphs with dynamic form values? For
instance:
curl -sL -X POST -d "{'form_field': 'value'}"
http://localhost:8080/api/notebook/run/

I can't get it to work... Any hint?