Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation 
to the user list and BCC-ing the dev list.

Also, this statement

> We are not validating against table or column existence.

is not correct. When you call spark.sql(…), Spark will look up the table 
references and fail with TABLE_OR_VIEW_NOT_FOUND if it cannot find them.

Also, when you run DDL via spark.sql(…), Spark will actually run it. So 
spark.sql(“drop table my_table”) will actually drop my_table. It’s not a 
validation-only operation.
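
A minimal sketch of that distinction, assuming Spark 3.x (the query strings are placeholders, not from this thread): a parse error surfaces as ParseException, a missing table or column surfaces as AnalysisException, and DDL should be kept out of such a check because spark.sql() executes it.

    from pyspark.sql import SparkSession
    from pyspark.sql.utils import AnalysisException, ParseException

    spark = SparkSession.builder.appName("sql-check").getOrCreate()

    def check_query(sql: str) -> str:
        # Only pass plain queries here: spark.sql() runs DDL/DML commands eagerly.
        try:
            spark.sql(sql)  # parsing and analysis happen here; the query is not executed
            return "OK"
        except ParseException as e:
            return f"syntax error: {e}"
        except AnalysisException as e:
            return f"analysis error (e.g. missing table or column): {e}"

    print(check_query("SELEC 1"))                  # misspelled keyword -> syntax error
    print(check_query("SELECT * FROM no_such_table"))  # TABLE_OR_VIEW_NOT_FOUND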

This question of validating SQL is already discussed on Stack Overflow. You may 
find some useful tips there.

Nick


> On Dec 24, 2023, at 4:52 AM, Mich Talebzadeh  
> wrote:
> 
>   
> Yes, you can validate the syntax of your PySpark SQL queries without 
> connecting to an actual dataset or running the queries on a cluster.
> PySpark provides a method for syntax validation without executing the query. 
> Something like below
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.4.0
>       /_/
> 
> Using Python version 3.9.16 (main, Apr 24 2023 10:36:11)
> Spark context Web UI available at http://rhes75:4040 
> Spark context available as 'sc' (master = local[*], app id = 
> local-1703410019374).
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SparkSession
> >>> spark = SparkSession.builder.appName("validate").getOrCreate()
> 23/12/24 09:28:02 WARN SparkSession: Using an existing Spark session; only 
> runtime SQL configurations will take effect.
> >>> sql = "SELECT * FROM  WHERE  = some value"
> >>> try:
> ...   spark.sql(sql)
> ...   print("is working")
> ... except Exception as e:
> ...   print(f"Syntax error: {e}")
> ...
> Syntax error:
> [PARSE_SYNTAX_ERROR] Syntax error at or near '<'.(line 1, pos 14)
> 
> == SQL ==
> SELECT * FROM <table_name> WHERE <column_name> = some value
> --------------^^^
> 
> Here we only check for syntax errors, not the semantics of the query. We are 
> not validating against table or column existence.
> 
> This method is useful when you want to catch obvious syntax errors before 
> submitting your PySpark job to a cluster, especially when you don't have 
> access to the actual data.
> In summary:
> - This method validates syntax but will not catch semantic errors.
> - If you need more comprehensive validation, consider using a testing
>   framework and a small dataset (see the sketch below).
> - For complex queries, a linter or code analysis tool can help identify
>   potential issues.
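
A minimal sketch of that testing-framework idea, with assumed table and column names: register a tiny in-memory DataFrame as a temp view with the expected schema, and let spark.sql() analyze the real query against it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("validate-semantics").getOrCreate()

    # Stand-in for the real table; "my_table", "id" and "name" are assumed names.
    spark.createDataFrame([(1, "a")], ["id", "name"]).createOrReplaceTempView("my_table")

    try:
        # Analysis resolves the table and columns against the temp view,
        # without touching any real data.
        spark.sql("SELECT id, name FROM my_table WHERE id = 1").collect()
        print("query parses and resolves against the expected schema")
    except Exception as e:
        print(f"validation failed: {e}")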
> HTH
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
> 
>view my Linkedin profile 
> 
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> 
> On Sun, 24 Dec 2023 at 07:57, ram manickam  > wrote:
>> Hello,
>> Is there a way to validate PySpark SQL for syntax errors only? I cannot
>> connect to the actual dataset to perform this validation. Any help would
>> be appreciated.
>> 
>> 
>> Thanks
>> Ram



Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation 
<https://github.com/PyMySQL/PyMySQL/blob/f13f054abcc18b39855a760a84be0a517f0da658/pymysql/protocol.py>
 of the MySQL client-server protocol. It does not use JDBC.
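
To exercise the actual JDBC path without Spark, as suggested earlier in this thread, one option is a Python-to-JDBC bridge such as jaydebeapi driven through the same SSH tunnel. A rough sketch, reusing the variable names from the code below; the driver jar path is an assumption:

    import jaydebeapi
    from sshtunnel import SSHTunnelForwarder

    with SSHTunnelForwarder(
            (ssh_host, ssh_port),
            ssh_username=ssh_user,
            ssh_pkey=ssh_key_file,
            remote_bind_address=(sql_hostname, sql_port)) as tunnel:
        # Connect through the tunnel with the real MySQL JDBC driver, no Spark involved.
        conn = jaydebeapi.connect(
            "com.mysql.cj.jdbc.Driver",
            f"jdbc:mysql://127.0.0.1:{tunnel.local_bind_port}/b2b",
            [sql_username, sql_password],
            "/path/to/mysql-connector-j.jar")  # assumed location of the driver jar
        curs = conn.cursor()
        curs.execute("SELECT 1")
        print(curs.fetchall())
        curs.close()
        conn.close()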


> On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan 
>  wrote:
> 
> Thanks for the advice Nicholas. 
> 
> As mentioned in the original email, I have tried JDBC + SSH Tunnel using 
> pymysql and sshtunnel and it worked fine. The problem happens only with Spark.
> 
> Thanks,
> Venkat
> 
> 
> 
> On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> This is not a question for the dev list. Moving dev to bcc.
>> 
>> One thing I would try is to connect to this database using JDBC + SSH 
>> tunnel, but without Spark. That way you can focus on getting the JDBC 
>> connection to work without Spark complicating the picture for you.
>> 
>> 
>>> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan 
>>> mailto:venkatesa...@noonacademy.com>> wrote:
>>> 
>>> Hi Team,
>>> 
>>> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is 
>>> same as the one in this Stackoverflow question 
>>> <https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
>>>  but there are no answers there.
>>> 
>>> This is what I am trying:
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>>         (ssh_host, ssh_port),
>>>         ssh_username=ssh_user,
>>>         ssh_pkey=ssh_key_file,
>>>         remote_bind_address=(sql_hostname, sql_port),
>>>         local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
>>>     tunnel.local_bind_port
>>>     b1_semester_df = spark.read \
>>>         .format("jdbc") \
>>>         .option("url", b2b_mysql_url.replace("<>", str(tunnel.local_bind_port))) \
>>>         .option("query", b1_semester_sql) \
>>>         .option("database", 'b2b') \
>>>         .option("password", b2b_mysql_password) \
>>>         .option("driver", "com.mysql.cj.jdbc.Driver") \
>>>         .load()
>>>     b1_semester_df.count()
>>> 
>>> Here, the b1_semester_df is loaded but when I try count on the same Df it 
>>> fails saying this
>>> 
>>> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
>>> aborting job
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
>>> print(self._jdf.showString(n, 20, vertical))
>>>   File 
>>> "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
>>> 1257, in __call__
>>>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
>>> return f(*a, **kw)
>>>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
>>> line 328, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling 
>>> o284.showString.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>>> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
>>> 2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3): 
>>> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link 
>>> failure
>>> 
>>> However, the same is working fine with pandas df. I have tried this below 
>>> and it worked.
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>>         (ssh_host, ssh_port),
>>>         ssh_username=ssh_user,
>>>         ssh_pkey=ssh_key_file,
>>>         remote_bind_address=(sql_hostname, sql_port)) as tunnel:
>>>     conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>>>                            passwd=sql_password, db=sql_main_database,
>>>                            port=tunnel.local_bind_port)
>>>     df = pd.read_sql_query(b1_semester_sql, conn)
>>>     spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
>>> 
>>> So wanted to check what I am missing with my Spark usage. Please help.
>>> 
>>> Thanks,
>>> Venkat
>>> 
>> 



Suppressing output from Apache Ivy (?) when calling spark-submit with --packages

2018-02-27 Thread Nicholas Chammas
I’m not sure whether this is something controllable via Spark, but when you
call spark-submit with --packages you get a lot of output. Is there any way
to suppress it? Does it come from Apache Ivy?

I posted more details about what I’m seeing on Stack Overflow.

Nick


Re: Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nicholas Chammas
Here’s a repro for a very similar issue where Spark hangs on the UDF, which
I think is related to the SPARK_HOME issue. I posted the repro on the EMR
forum, but in case you can’t access it:

   1. I’m running EMR 5.6.0, Spark 2.1.1, and Python 3.5.1.
   2. Create a simple Python package by creating a directory called udftest.
   3. Inside udftest put an empty __init__.py and a nothing.py.
   4.

   nothing.py should have the following contents:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def do_nothing(s: int) -> int:
return s

do_nothing_udf = udf(do_nothing, IntegerType())

   5.

   From your home directory (the one that contains your udftest package),
   create a ZIP that we will ship to YARN.

pushd udftest/
zip -rq ../udftest.zip *
popd

   6.

   Start a PySpark shell with our test package.

export PYSPARK_PYTHON=python3
pyspark \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYSPARK_PYTHON" \
  --archives "udftest.zip#udftest"

   7.

   Now try to use the UDF. It will hang.

from udftest.nothing import do_nothing_udf
spark.range(10).select(do_nothing_udf('id')).show()  # hangs

   8.

   The strange thing is, if you define the exact same UDF directly in the
   active PySpark shell, it works fine! It’s only when you import it from a
   user-defined module that you see this issue.
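
One pattern that may be worth trying here (a sketch, not a confirmed fix from this thread) is to defer the udf() call so that importing the module does not evaluate anything Spark-related:

   # nothing.py, lazier variant: udf() is only called when the driver asks for
   # the wrapper, so importing this module by itself does not touch Spark.
   from pyspark.sql.functions import udf
   from pyspark.sql.types import IntegerType

   def do_nothing(s: int) -> int:
       return s

   def do_nothing_udf():
       return udf(do_nothing, IntegerType())

   # On the driver:
   #   from udftest.nothing import do_nothing_udf
   #   spark.range(10).select(do_nothing_udf()('id')).show()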

​

On Thu, Jun 22, 2017 at 12:08 PM Nick Chammas 
wrote:

> I’m seeing a strange issue on EMR which I posted about here.
>
> In brief, when I try to import a UDF I’ve defined, Python somehow fails to
> find Spark. This exact code works for me locally and works on our
> on-premises CDH cluster under YARN.
>
> This is the traceback:
>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
> print(self._jdf.showString(n, 20))
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o89.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, ip-10-97-35-12.ec2.internal, executor 1): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 161, in main
> func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 91, in read_udfs
> _, udf = read_single_udf(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 78, in read_single_udf
> f, return_type = read_command(pickleSer, infile)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
>  line 54, in read_command
> command = serializer._read_with_length(file)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
>  line 169, in _read_with_length
> return self.loads(obj)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
>  line 451, in loads
> return pickle.loads(obj, encoding=encoding)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/person.py",
>  line 7, in 
> from splinkr.util import repartition_to_size
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/util.py",
>  line 34, in 
> containsNull=False,
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
>  line 1872, in udf
> return UserDefinedFunction(f, returnType)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
>  line 1830, in __init__
> self._judf = 

Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Ah, that's why all the stuff about scheduler pools is under the
section "Scheduling
Within an Application
<https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application>".
 I am so used to talking to my coworkers about jobs in the sense of
applications that I forgot that your typical Spark application submits multiple
"jobs", each of which has multiple stages, etc.

So in my case I need to read up more closely about YARN queues
<https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>
since I want to share resources *across* applications. Thanks Mark!
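
For reference, a minimal sketch of the within-application case Mark describes below; the pool names are arbitrary, and spark.scheduler.mode must be set to FAIR for pools to matter:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("pool-demo")
             .config("spark.scheduler.mode", "FAIR")
             .getOrCreate())
    sc = spark.sparkContext

    # Jobs launched from this thread now go to the "etl" pool.
    sc.setLocalProperty("spark.scheduler.pool", "etl")
    spark.range(10 ** 6).count()

    # Subsequent jobs go to a different pool; None reverts to the default pool.
    sc.setLocalProperty("spark.scheduler.pool", "adhoc")
    spark.range(10 ** 6).count()
    sc.setLocalProperty("spark.scheduler.pool", None)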

On Wed, Apr 5, 2017 at 4:31 PM Mark Hamstra <m...@clearstorydata.com> wrote:

> `spark-submit` creates a new Application that will need to get resources
> from YARN. Spark's scheduler pools will determine how those resources are
> allocated among whatever Jobs run within the new Application.
>
> Spark's scheduler pools are only relevant when you are submitting multiple
> Jobs within a single Application (i.e., you are using the same SparkContext
> to launch multiple Jobs) and you have used SparkContext#setLocalProperty to
> set "spark.scheduler.pool" to something other than the default pool before
> a particular Job intended to use that pool is started via that SparkContext.
>
> On Wed, Apr 5, 2017 at 1:11 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> Hmm, so when I submit an application with `spark-submit`, I need to
> guarantee it resources using YARN queues and not Spark's scheduler pools.
> Is that correct?
>
> When are Spark's scheduler pools relevant/useful in this context?
>
> On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
> grrr... s/your/you're/
>
> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jobs.
>
> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> I'm having trouble understanding the difference between Spark fair
> scheduler pools
> <https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools>
> and YARN queues
> <https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html>.
> Do they conflict? Does one override the other?
>
> I posted a more detailed question about an issue I'm having with this on
> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>
> Nick
>
>
> --
> View this message in context: Spark fair scheduler pools vs. YARN queues
> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-fair-scheduler-pools-vs-YARN-queues-tp28572.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
>
>
>
>


Re: Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nicholas Chammas
Hmm, so when I submit an application with `spark-submit`, I need to
guarantee it resources using YARN queues and not Spark's scheduler pools.
Is that correct?

When are Spark's scheduler pools relevant/useful in this context?

On Wed, Apr 5, 2017 at 3:54 PM Mark Hamstra  wrote:

> grrr... s/your/you're/
>
> On Wed, Apr 5, 2017 at 12:54 PM, Mark Hamstra 
> wrote:
>
> Your mixing up different levels of scheduling. Spark's fair scheduler
> pools are about scheduling Jobs, not Applications; whereas YARN queues with
> Spark are about scheduling Applications, not Jobs.
>
> On Wed, Apr 5, 2017 at 12:27 PM, Nick Chammas 
> wrote:
>
> I'm having trouble understanding the difference between Spark fair
> scheduler pools and YARN queues. Do they conflict? Does one override the
> other?
>
> I posted a more detailed question about an issue I'm having with this on
> Stack Overflow: http://stackoverflow.com/q/43239921/877069
>
> Nick
>
>
> --
> View this message in context: Spark fair scheduler pools vs. YARN queues
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
>


Re: New Amazon AMIs for EC2 script

2017-02-23 Thread Nicholas Chammas
spark-ec2 has moved to GitHub and is no longer part of the Spark project. A
related issue from the current issue tracker that you may want to
follow/comment on is this one: https://github.com/amplab/spark-ec2/issues/74

As I said there, I think requiring custom AMIs is one of the major
maintenance headaches of spark-ec2. I solved this problem in my own
project, Flintrock, by working with
the default Amazon Linux AMIs and letting people more freely bring their
own AMI.

Nick


On Thu, Feb 23, 2017 at 7:23 AM in4maniac  wrote:

> Hyy all,
>
> I have been using the EC2 script to launch R pyspark clusters for a while
> now. As we use a lot of packages such as numpy and scipy with openblas,
> scikit-learn, bokeh, vowpal wabbit, pystan, etc., all this time we have
> been building AMIs on top of the standard spark-AMIs at
> https://github.com/amplab/spark-ec2/tree/branch-1.6/ami-list/us-east-1
>
> Mainly, I have done the following:
> - updated yum
> - changed the standard python to python 2.7
> - changed pip to 2.7 and installed a lot of libraries on top of the
> existing AMIs, and created my own AMIs to avoid having to bootstrap.
>
> But the standard EC2 AMIs are from *early February 2014* and have now
> become extremely fragile. For example, when I update a certain library,
> ipython breaks, or pip breaks, and so forth.
>
> Can someone please direct me to a more up-to-date AMI that I can use with
> more confidence? I am also interested to know what needs to be in the AMI
> if I wanted to build one from scratch (last resort :( ).
>
> And isn't it time to have a ticket in the spark project to build a new
> suite
> of AMIs for the EC2 script?
> https://issues.apache.org/jira/browse/SPARK-922
>
> Many thanks
> in4maniac
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/New-Amazon-AMIs-for-EC2-script-tp28419.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread Nicholas Chammas
RDDs and DataFrames do not guarantee any specific ordering of data. They
are like tables in a SQL database. The only way to get a guaranteed
ordering of rows is to explicitly specify an orderBy() clause in your
statement. Any ordering you see otherwise is incidental.
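
A minimal sketch of the point:

    df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
    df.cache()
    df.count()
    # Only an explicit orderBy() guarantees the order of the result; caching,
    # counting and coalescing are all free to change where a row ends up.
    df.orderBy('n').show()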
​

On Mon, Feb 13, 2017 at 7:52 AM David Haglund (external) <
david.hagl...@husqvarnagroup.com> wrote:

> Hi,
>
>
>
> I found something that surprised me, I expected the order of the rows to
> be preserved, so I suspect this might be a bug. The problem is illustrated
> with the Python example below:
>
>
>
> In [1]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.cache()
>
> df.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[1]:
>
> [[Row(n=1)], [Row(n=0), Row(n=2)]]
>
>
>
> Note how n=1 comes before n=0, above.
>
>
>
>
>
> If I remove the cache line I get the rows in the correct order and the
> same if I use df.rdd.count() instead of df.count(), see examples below:
>
>
>
> In [2]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[2]:
>
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
>
>
> In [3]:
>
> df = spark.createDataFrame([(i,) for i in range(3)], ['n'])
>
> df.cache()
>
> df.rdd.count()
>
> df.coalesce(2).rdd.glom().collect()
>
> Out[3]:
>
> [[Row(n=0)], [Row(n=1), Row(n=2)]]
>
>
>
>
>
> I use spark 2.1.0 and pyspark.
>
>
>
> Regards,
>
> /David
>
> The information in this email may be confidential and/or legally
> privileged. It has been sent for the sole use of the intended recipient(s).
> If you are not an intended recipient, you are strictly prohibited from
> reading, disclosing, distributing, copying or using this email or any of
> its contents, in any way whatsoever. If you have received this email in
> error, please contact the sender by reply email and destroy all copies of
> the original message. Please also be advised that emails are not a secure
> form for communication, and may contain errors.
>


Re: Debugging a PythonException with no details

2017-01-17 Thread Nicholas Chammas
Hey Marco,

I stopped seeing this error once I started round-tripping intermediate
DataFrames to disk.

You can read more about what I saw here:
https://github.com/graphframes/graphframes/issues/159
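
As a sketch of what round-tripping to disk means here (the path is a placeholder):

    # Persist the intermediate result to files and read it back, so downstream
    # stages start from disk instead of replaying the long lineage (JDBC read + UDFs).
    intermediate_path = "/tmp/intermediate.parquet"  # placeholder path
    df.write.mode("overwrite").parquet(intermediate_path)
    df = spark.read.parquet(intermediate_path)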

Nick

On Sat, Jan 14, 2017 at 4:02 PM Marco Mistroni <mmistr...@gmail.com> wrote:

> It seems it has to do with a UDF. Could you share a snippet of the code you
> are running?
> Kr
>
> On 14 Jan 2017 1:40 am, "Nicholas Chammas" <nicholas.cham...@gmail.com>
> wrote:
>
> I’m looking for tips on how to debug a PythonException that’s very sparse
> on details. The full exception is below, but the only interesting bits
> appear to be the following lines:
>
> org.apache.spark.api.python.PythonException:
> ...
> py4j.protocol.Py4JError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext
>
> Otherwise, the only other clue from the traceback I can see is that the
> problem may involve a UDF somehow.
>
> I’ve tested this code against many datasets (stored as ORC) and it works
> fine. The same code only seems to throw this error on a few datasets that
> happen to be sourced via JDBC. I can’t seem to get a lead on what might be
> going wrong here.
>
> Does anyone have tips on how to debug a problem like this? How do I find
> more specifically what is going wrong?
>
> Nick
>
> Here’s the full exception:
>
> 17/01/13 17:12:14 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID 15, 
> devlx023.private.massmutual.com, executor 4): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 
> 161, in main
> func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 97, 
> in read_udfs
> arg_offsets, udf = read_single_udf(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 78, 
> in read_single_udf
> f, return_type = read_command(pickleSer, infile)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py", line 54, 
> in read_command
> command = serializer._read_with_length(file)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 169, in _read_with_length
> return self.loads(obj)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 431, in loads
> return pickle.loads(obj, encoding=encoding)
>   File 
> "/hadoop/yarn/nm/usercache/jenkins/appcache/application_1483203887152_1207/container_1483203887152_1207_01_05/splinkr/person.py",
>  line 111, in 
> py_normalize_udf = udf(py_normalize, StringType())
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", 
> line 1868, in udf
> return UserDefinedFunction(f, returnType)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", 
> line 1826, in __init__
> self._judf = self._create_judf(name)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py", 
> line 1830, in _create_judf
> sc = SparkContext.getOrCreate()
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 307, in getOrCreate
> SparkContext(conf=conf or SparkConf())
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 118, in __init__
> conf, jsc, profiler_cls)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 179, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py", line 
> 246, in _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1401, in __call__
> answer, self._gateway_client, None, self._fqn)
>   File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 327, in get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext
>
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at 
> org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala

Debugging a PythonException with no details

2017-01-13 Thread Nicholas Chammas
I’m looking for tips on how to debug a PythonException that’s very sparse
on details. The full exception is below, but the only interesting bits
appear to be the following lines:

org.apache.spark.api.python.PythonException:
...
py4j.protocol.Py4JError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext

Otherwise, the only other clue from the traceback I can see is that the
problem may involve a UDF somehow.

I’ve tested this code against many datasets (stored as ORC) and it works
fine. The same code only seems to throw this error on a few datasets that
happen to be sourced via JDBC. I can’t seem to get a lead on what might be
going wrong here.

Does anyone have tips on how to debug a problem like this? How do I find
more specifically what is going wrong?

Nick

Here’s the full exception:

17/01/13 17:12:14 WARN TaskSetManager: Lost task 7.0 in stage 9.0 (TID
15, devlx023.private.massmutual.com, executor 4):
org.apache.spark.api.python.PythonException: Traceback (most recent
call last):
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 161, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 97, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 78, in read_single_udf
f, return_type = read_command(pickleSer, infile)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/worker.py",
line 54, in read_command
command = serializer._read_with_length(file)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py",
line 169, in _read_with_length
return self.loads(obj)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/serializers.py",
line 431, in loads
return pickle.loads(obj, encoding=encoding)
  File 
"/hadoop/yarn/nm/usercache/jenkins/appcache/application_1483203887152_1207/container_1483203887152_1207_01_05/splinkr/person.py",
line 111, in 
py_normalize_udf = udf(py_normalize, StringType())
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py",
line 1868, in udf
return UserDefinedFunction(f, returnType)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py",
line 1826, in __init__
self._judf = self._create_judf(name)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/sql/functions.py",
line 1830, in _create_judf
sc = SparkContext.getOrCreate()
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 307, in getOrCreate
SparkContext(conf=conf or SparkConf())
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 118, in __init__
conf, jsc, profiler_cls)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 179, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/hadoop/spark/2.1/python/lib/pyspark.zip/pyspark/context.py",
line 246, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
  File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1401, in __call__
answer, self._gateway_client, None, self._fqn)
  File "/hadoop/spark/2.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
line 327, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at 
org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:796)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:973)
at 

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
I wish I could provide additional suggestions. Maybe one of the admins can
step in and help. I'm just another random user trying (with mixed success)
to be helpful. 

Sorry again to everyone about my spam, which just added to the problem.

On Thu, Dec 8, 2016 at 11:22 AM Chen, Yan I <yani.c...@rbc.com> wrote:

> I’m pretty sure I didn’t.
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:56 AM
> *To:* Chen, Yan I; Di Zhu
>
>
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Oh, hmm...
>
> Did you perhaps subscribe with a different address than the one you're
> trying to unsubscribe from?
>
> For example, you subscribed with myemail+sp...@gmail.com but you send the
> unsubscribe email from myem...@gmail.com
>
> On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I <yani.c...@rbc.com> wrote:
>
> The reason I sent that email is because I did sent emails to
> user-unsubscr...@spark.apache.org and dev-unsubscr...@spark.apache.org
> two months ago. But I can still receive a lot of emails every day. I even
> did that again before 10AM EST and got confirmation that I’m unsubscribed,
> but I still received this email.
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:02 AM
> *To:* Di Zhu
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Yes, sorry about that. I didn't think before responding to all those who
> asked to unsubscribe.
>
>
>
> On Thu, Dec 8, 2016 at 10:00 AM Di Zhu <jason4zhu.bigd...@gmail.com>
> wrote:
>
> Could you send to individual privately without cc to all users every time?
>
>
>
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
>
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
>
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
>
>
> 

Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Oh, hmm...

Did you perhaps subscribe with a different address than the one you're
trying to unsubscribe from?

For example, you subscribed with myemail+sp...@gmail.com but you send the
unsubscribe email from myem...@gmail.com
On Thu, Dec 8, 2016 at 10:35 AM, Chen, Yan I <yani.c...@rbc.com> wrote:

> The reason I sent that email is because I did sent emails to
> user-unsubscr...@spark.apache.org and dev-unsubscr...@spark.apache.org
> two months ago. But I can still receive a lot of emails every day. I even
> did that again before 10AM EST and got confirmation that I’m unsubscribed,
> but I still received this email.
>
>
>
>
>
> *From:* Nicholas Chammas [mailto:nicholas.cham...@gmail.com]
> *Sent:* Thursday, December 08, 2016 10:02 AM
> *To:* Di Zhu
> *Cc:* user @spark
> *Subject:* Re: unsubscribe
>
>
>
> Yes, sorry about that. I didn't think before responding to all those who
> asked to unsubscribe.
>
>
>
> On Thu, Dec 8, 2016 at 10:00 AM Di Zhu <jason4zhu.bigd...@gmail.com>
> wrote:
>
> Could you send to individual privately without cc to all users every time?
>
>
>
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
>
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
>
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
>
>
>
>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
Yes, sorry about that. I didn't think before responding to all those who
asked to unsubscribe.

On Thu, Dec 8, 2016 at 10:00 AM Di Zhu <jason4zhu.bigd...@gmail.com> wrote:

> Could you send to individual privately without cc to all users every time?
>
>
> On 8 Dec 2016, at 3:58 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> This is explained here:
> http://spark.apache.org/community.html#mailing-lists
>
> On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva <
> ramon.si...@neogrid.com> wrote:
>
>
>
>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 7:46 AM Ramon Rosa da Silva 
wrote:

>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:46 AM Tao Lu  wrote:

>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 8:01 AM Niki Pavlopoulou  wrote:

> unsubscribe
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 7:50 AM Juan Caravaca 
wrote:

> unsubscribe
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:54 AM Kishorkumar Patil
 wrote:

>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 9:42 AM Chen, Yan I  wrote:

>
>
>
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:17 AM Prashant Singh Thakur <
prashant.tha...@impetus.co.in> wrote:

>
>
>
>
> Best Regards,
>
> Prashant Thakur
>
> Work : 6046
>
> Mobile: +91-9740266522
>
>
>
> --
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:08 AM Kranthi Gmail 
wrote:

>
>
> --
> Kranthi
>
> PS: Sent from mobile, pls excuse the brevity and typos.
>
> On Dec 7, 2016, at 8:05 PM, Siddhartha Khaitan <
> siddhartha.khai...@gmail.com> wrote:
>
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 6:27 AM Vinicius Barreto <
vinicius.s.barr...@gmail.com> wrote:

> Unsubscribe
>
> On Dec 7, 2016 at 17:46, "map reduced" wrote:
>
> Hi,
>
> I am trying to solve this problem - in my streaming flow, every day few
> jobs fail due to some (say kafka cluster maintenance etc, mostly
> unavoidable) reasons for few batches and resumes back to success.
> I want to reprocess those failed jobs programmatically (assume I have a
> way of getting start-end offsets for kafka topics for failed jobs). I was
> thinking of these options:
> 1) Somehow pause streaming job when it detects failing jobs - this seems
> not possible.
> 2) From driver - run additional processing to check every few minutes
> using driver rest api (/api/v1/applications...) what jobs have failed and
> submit batch jobs for those failed jobs
>
> 1 - doesn't seem to be possible, and I don't want to kill streaming
> context just for few failing batches to stop the job for some time and
> resume after few minutes.
> 2 - seems like a viable option, but a little complicated, since even the
> batch job can fail due to whatever reasons and I am back to tracking that
> separately etc.
>
> Has anyone faced this issue, or does anyone have any suggestions?
>
> Thanks,
> KP
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:54 AM Roger Holenweger  wrote:

>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 1:34 AM smith_666  wrote:

>
>
>
>


Re: Unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Thu, Dec 8, 2016 at 12:12 AM Ajit Jaokar 
wrote:

>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: unsubscribe

2016-12-08 Thread Nicholas Chammas
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

This is explained here: http://spark.apache.org/community.html#mailing-lists

On Wed, Dec 7, 2016 at 10:53 PM Ajith Jose  wrote:

>
>


Re: Strongly Connected Components

2016-11-13 Thread Nicholas Chammas
FYI: There is a new connected components implementation coming in
GraphFrames 0.3.

See: https://github.com/graphframes/graphframes/pull/119

Implementation is based on:
https://mmds-data.org/presentations/2014/vassilvitskii_mmds14.pdf

Nick

On Sat, Nov 12, 2016 at 3:01 PM Koert Kuipers  wrote:

> oh ok i see now its not the same
>
> On Sat, Nov 12, 2016 at 2:48 PM, Koert Kuipers  wrote:
>
> not sure i see the faster algo in the paper you mention.
>
> i see this in section 6.1.2:
> "In what follows we give a simple labeling algorithm that computes
> connectivity  on  sparse  graphs  in O(log N) rounds."
> N here is the size of the graph, not the largest component diameter.
>
> that is the exact same algo as is implemented in graphx i think. or is it
> not?
>
> On Fri, Nov 11, 2016 at 7:58 PM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
> Hi Shreya,
> GraphFrames just calls the GraphX strongly connected components code. (
> https://github.com/graphframes/graphframes/blob/release-0.2.0/src/main/scala/org/graphframes/lib/StronglyConnectedComponents.scala#L51
> )
>
> For choosing the number of iterations: If the number of iterations is less
> than the diameter of the graph, you may get an incorrect result. But
> running for more iterations than that buys you nothing. The algorithm is
> basically to broadcast your ID to all your neighbors in the first round,
> and then broadcast the smallest ID that you have seen so far in the next
> rounds. So with only 1 round you will get a wrong result unless each vertex
> is connected to the vertex with the lowest ID in that component. (Unlikely
> in a real graph.)
>
> See
> https://github.com/apache/spark/blob/v2.0.2/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala
> for the actual implementation.
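
As a toy sketch of that labeling idea in plain Python (not the GraphX code): every vertex starts with its own ID and repeatedly adopts the smallest ID it has seen among its neighbors until nothing changes.

    def connected_components(vertices, edges):
        # In a synchronous implementation the number of rounds is bounded by the
        # component diameter; this sequential loop just illustrates the label rule.
        labels = {v: v for v in vertices}
        changed = True
        while changed:
            changed = False
            for a, b in edges:
                smallest = min(labels[a], labels[b])
                if labels[a] != smallest or labels[b] != smallest:
                    labels[a] = labels[b] = smallest
                    changed = True
        return labels

    print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
    # -> {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}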
>
> A better algorithm exists for this problem that only requires O(log(N))
> iterations when N is the largest component diameter. (It is described in "A
> Model of Computation for MapReduce",
> http://www.sidsuri.com/Publications_files/mrc.pdf.) This outperforms
> GraphX's implementation immensely. (See the last slide of
> http://www.slideshare.net/SparkSummit/interactive-graph-analytics-daniel-darabos#33.)
> The large advantage is due to the lower number of necessary iterations.
>
> For why this is failing even with one iteration: I would first check your
> partitioning. Too many or too few partitions could equally cause the issue.
> If you are lucky, there is no overlap between the "too many" and "too few"
> domains :).
>
> On Fri, Nov 11, 2016 at 7:39 PM, Shreya Agarwal 
> wrote:
>
> Tried GraphFrames. Still faced the same issue – the job died after a few
> hours. The errors I see (and I see tons of them) are –
>
> (I ran with 3 times the partitions as well, which was 12 times the number
> of executors, but still the same.)
>
>
> -
>
> ERROR NativeAzureFileSystem: Encountered Storage Exception for write on
> Blob : hdp/spark2-events/application_1478717432179_0021.inprogress
> Exception details: null Error Code : RequestBodyTooLarge
>
>
>
> -
>
>
>
> 16/11/11 09:21:46 ERROR TransportResponseHandler: Still have 3 requests
> outstanding when connection from /10.0.0.95:43301 is closed
>
> 16/11/11 09:21:46 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 2
> outstanding blocks after 5000 ms
>
> 16/11/11 09:21:46 INFO ShuffleBlockFetcherIterator: Getting 1500 non-empty
> blocks out of 1500 blocks
>
> 16/11/11 09:21:46 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
>
> java.io.IOException: Connection from /10.0.0.95:43301 closed
>
>
>
> -
>
>
>
> 16/11/11 09:21:46 ERROR OneForOneBlockFetcher: Failed while starting block
> fetches
>
> java.lang.RuntimeException: java.io.FileNotFoundException:
> /mnt/resource/hadoop/yarn/local/usercache/shreyagrssh/appcache/application_1478717432179_0021/blockmgr-b1dde30d-359e-4932-b7a4-a5e138a52360/37/shuffle_1346_21_0.index
> (No such file or directory)
>
>
>
> -
>
>
>
> org.apache.spark.SparkException: Exception thrown in awaitResult
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
>
> at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>
> at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>
>

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
I apologize for my harsh tone. You are right, it was unnecessary and
discourteous.

On Fri, Sep 2, 2016 at 11:01 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> You made such statement:
>
> "That's complete nonsense."
>
> That is a strong language and void of any courtesy. Only dogmatic
> individuals make such statements, engaging the keyboard before thinking
> about it.
>
> You are perfectly in your right to agree to differ. However, that does not
> give you the right to call other peoples opinion nonsense.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:54, Nicholas Chammas <nicholas.cham...@gmail.com
> > wrote:
>
>> You made a specific claim -- that Spark will move away from Python --
>> which I responded to with clear references and data. How on earth is that a
>> "religious argument"?
>>
>> I'm not saying that Python is better than Scala or anything like that.
>> I'm just addressing your specific claim about its future in the Spark
>> project.
>>
>> On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Right so. We are back into religious arguments. Best of luck
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 September 2016 at 15:35, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I believe as we progress in time Spark is going to move away from
>>>>> Python. If you look at 2014 Databricks code examples, they were
>>>>> mostly in Python. Now they are mostly in Scala for a reason.
>>>>>
>>>>
>>>> That's complete nonsense.
>>>>
>>>> First off, you can find dozens and dozens of Python code examples here:
>>>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>>>
>>>> The Python API was added to Spark in 0.7.0
>>>> <http://spark.apache.org/news/spark-0-7-0-released.html>, back in
>>>> February of 2013, before Spark was even accepted into the Apache incubator.
>>>> Since then it's undergone major and continuous development. Though it does
>>>> lag behind the Scala API in some areas, it's a first-class language and
>>>> bringing it up to parity with Scala is an explicit project goal. A quick
>>>> example off the top of my head is all the work that's going into model
>>>> import/export for Python: SPARK-11939
>>>> <https://issues.apache.org/jira/browse/SPARK-11939>
>>>>
>>>> Additionally, according to the 2015 Spark Survey
>>>> <http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf?t=1472746902480>,
>>>> 58% of Spark users use the Python API, more than any other language save
>>>> for Scala (71%). (Users can select multiple languages on the survey.)
>>>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>>>> after Windows and Spark Streaming users.
>>>>
>>>> Any notion that Spark is going to "move away from Python" is completely
>>>> contradicted by the facts.
>>>>
>>>> Nick
>>>>
>>>>
>>>
>


Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
You made a specific claim -- that Spark will move away from Python -- which
I responded to with clear references and data. How on earth is that a
"religious argument"?

I'm not saying that Python is better than Scala or anything like that. I'm
just addressing your specific claim about its future in the Spark project.

On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Right so. We are back into religious arguments. Best of luck
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:35, Nicholas Chammas <nicholas.cham...@gmail.com
> > wrote:
>
>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> I believe as we progress in time Spark is going to move away from
>>> Python. If you look at 2014 Databricks code examples, they were mostly
>>> in Python. Now they are mostly in Scala for a reason.
>>>
>>
>> That's complete nonsense.
>>
>> First off, you can find dozens and dozens of Python code examples here:
>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>
>> The Python API was added to Spark in 0.7.0
>> <http://spark.apache.org/news/spark-0-7-0-released.html>, back in
>> February of 2013, before Spark was even accepted into the Apache incubator.
>> Since then it's undergone major and continuous development. Though it does
>> lag behind the Scala API in some areas, it's a first-class language and
>> bringing it up to parity with Scala is an explicit project goal. A quick
>> example off the top of my head is all the work that's going into model
>> import/export for Python: SPARK-11939
>> <https://issues.apache.org/jira/browse/SPARK-11939>
>>
>> Additionally, according to the 2015 Spark Survey
>> <http://cdn2.hubspot.net/hubfs/438089/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf?t=1472746902480>,
>> 58% of Spark users use the Python API, more than any other language save
>> for Scala (71%). (Users can select multiple languages on the survey.)
>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>> after Windows and Spark Streaming users.
>>
>> Any notion that Spark is going to "move away from Python" is completely
>> contradicted by the facts.
>>
>> Nick
>>
>>
>


Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
wrote:

> I believe as we progress in time Spark is going to move away from Python. If
> you look at 2014 Databricks code examples, they were mostly in Python. Now
> they are mostly in Scala for a reason.
>

That's complete nonsense.

First off, you can find dozens and dozens of Python code examples here:
https://github.com/apache/spark/tree/master/examples/src/main/python

The Python API was added to Spark in 0.7.0
, back in February
of 2013, before Spark was even accepted into the Apache incubator. Since
then it's undergone major and continuous development. Though it does lag
behind the Scala API in some areas, it's a first-class language and
bringing it up to parity with Scala is an explicit project goal. A quick
example off the top of my head is all the work that's going into model
import/export for Python: SPARK-11939


Additionally, according to the 2015 Spark Survey
,
58% of Spark users use the Python API, more than any other language save
for Scala (71%). (Users can select multiple languages on the survey.)
Python users were also the 3rd-fastest growing "demographic" for Spark,
after Windows and Spark Streaming users.

Any notion that Spark is going to "move away from Python" is completely
contradicted by the facts.

Nick


Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html


On Tue, Aug 9, 2016 at 5:14 PM abhishek singh  wrote:

>
>


Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html


On Tue, Aug 9, 2016 at 8:03 PM James Ding  wrote:

>
>


Re: UNSUBSCRIBE

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html


On Wed, Aug 10, 2016 at 2:46 AM Martin Somers  wrote:

>
>
> --
> M
>


Re: Unsubscribe

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html

On Tue, Aug 9, 2016 at 3:02 PM Hogancamp, Aaron <
aaron.t.hoganc...@leidos.com> wrote:

> Unsubscribe.
>
>
>
> Thanks,
>
>
>
> Aaron Hogancamp
>
> Data Scientist
>
>
>


Re: Unsubscribe.

2016-08-10 Thread Nicholas Chammas
Please follow the links here to unsubscribe:
http://spark.apache.org/community.html

On Tue, Aug 9, 2016 at 3:05 PM Martin Somers  wrote:

> Unsubscribe.
>
> Thanks
> M
>


Re: Add column sum as new column in PySpark dataframe

2016-08-05 Thread Nicholas Chammas
I think this is what you need:

from functools import reduce
from operator import add

df.withColumn('total', reduce(add, [df[c] for c in df.columns]))

(Note that pyspark.sql.functions.sum aggregates down a single column; to sum
across columns you add the Column objects together.)

Nick
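
If you also need nulls treated as zero (the question mentions summing only
non-null values), here is a minimal, self-contained sketch; the SparkSession
setup and the example DataFrame are just for illustration, assuming Spark 2.x:

from functools import reduce
from operator import add

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, lit

spark = SparkSession.builder.appName("row-wise-sum").getOrCreate()
df = spark.createDataFrame([(1, None, 3), (4, 5, None)], ["a", "b", "c"])

# coalesce() substitutes 0 for null so a single null doesn't null out the row's sum.
total = reduce(add, [coalesce(df[c], lit(0)) for c in df.columns])
df.withColumn("total", total).show()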

On Thu, Aug 4, 2016 at 9:41 AM Javier Rey jre...@gmail.com
 wrote:

Hi everybody,
>
> Sorry, I sent the last message incomplete; this is the complete version:
>
> I'm using PySpark and I have a Spark dataframe with a bunch of numeric
> columns. I want to add a column that is the sum of all the other columns.
>
> Suppose my dataframe had columns "a", "b", and "c". I know I can do this:
>
> df.withColumn('total_col', df.a + df.b + df.c)
>
> The problem is that I don't want to type out each column individually and
> add them, especially if I have a lot of columns. I want to be able to do
> this automatically or by specifying a list of column names that I want to
> add. Is there another way to do this?
>
> I find this solution:
>
> df.withColumn('total', sum(df[col] for col in df.columns))
>
> But I get this error:
>
> "AttributeError: 'generator' object has no attribute '_get_object_id"
>
> Additionally, I want to sum only non-null values.
>
> Thanks in advance,
>
> Samir
>
​


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
No, SQLContext is not disappearing. The top-level class is replaced by
SparkSession, but you can always get the underlying context from the
session.

You can also use SparkSession.udf.register()
<http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SparkSession.udf>,
which is just a wrapper for sqlContext.registerFunction
<https://github.com/apache/spark/blob/2182e4322da6ba732f99ae75dce00f76f1cdc4d9/python/pyspark/sql/context.py#L511-L520>
.
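
For reference, a minimal sketch of the SparkSession route (assuming Spark 2.x;
the squareIt UDF name and the temp view name are just for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

def square_it(x):
    return x * x

# Register the Python function under a SQL-visible name with an explicit return type.
spark.udf.register("squareIt", square_it, LongType())

spark.range(5).createOrReplaceTempView("numbers")
spark.sql("SELECT id, squareIt(id) AS id_squared FROM numbers").show()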
​

On Thu, Aug 4, 2016 at 12:04 PM Ben Teeuwen <bteeu...@gmail.com> wrote:

> Yes, but I don’t want to use it in a select() call.
> Either selectExpr() or spark.sql(), with the udf being called inside a
> string.
>
> Now I got it to work using
> "sqlContext.registerFunction('encodeOneHot_udf',encodeOneHot, VectorUDT())”
> But this sqlContext approach will disappear, right? So I’m curious what to
> use instead.
>
> On Aug 4, 2016, at 3:54 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> Have you looked at pyspark.sql.functions.udf and the associated examples?
> On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen <bteeu...@gmail.com> wrote:
>
>> Hi,
>>
>> I’d like to use a UDF in pyspark 2.0. As in ..
>> 
>>
>> def squareIt(x):
>>   return x * x
>>
>> # register the function and define return type
>> ….
>>
>> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as
>> function_result from df’)
>>
>> _
>>
>> How can I register the function? I only see registerFunction in the
>> deprecated sqlContext at
>> http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
>> As the ‘spark’ object unifies hiveContext and sqlContext, what is the new
>> way to go?
>>
>> Ben
>>
>
>


Re: registering udf to use in spark.sql('select...

2016-08-04 Thread Nicholas Chammas
Have you looked at pyspark.sql.functions.udf and the associated examples?
On Thu, Aug 4, 2016 at 9:10 AM, Ben Teeuwen wrote:

> Hi,
>
> I’d like to use a UDF in pyspark 2.0. As in ..
> 
>
> def squareIt(x):
>   return x * x
>
> # register the function and define return type
> ….
>
> spark.sql(“”"select myUdf(adgroupid, 'extra_string_parameter') as
> function_result from df’)
>
> _
>
> How can I register the function? I only see registerFunction in the
> deprecated sqlContext at
> http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html.
> As the ‘spark’ object unifies hiveContext and sqlContext, what is the new
> way to go?
>
> Ben
>


Re: spark-2.0 support for spark-ec2 ?

2016-07-27 Thread Nicholas Chammas
Yes, spark-ec2 has been removed from the main project, as called out in the
Release Notes:

http://spark.apache.org/releases/spark-release-2-0-0.html#removals

You can still discuss spark-ec2 here or on Stack Overflow, as before. Bug
reports and the like should now go on that AMPLab GitHub project as opposed
to JIRA, though.

You should use branch-2.0.

On Wed, Jul 27, 2016 at 2:30 PM Andy Davidson 
wrote:

> Congratulations on releasing 2.0!
>
>
> spark-2.0.0-bin-hadoop2.7 no longer includes the spark-ec2 script. However,
> http://spark.apache.org/docs/latest/index.html has a link to the
> spark-ec2 GitHub repo https://github.com/amplab/spark-ec2
>
>
> Is this the right group to discuss spark-ec2?
>
> Any idea how stable spark-ec2 is on spark-2.0?
>
> Should we use master or branch-2.0? It looks like the default might be the
> branch-1.6 ?
>
> Thanks
>
> Andy
>
>
> P.s. The new stand alone documentation is a big improvement. I have a
> much better idea of what spark-ec2 does and how to upgrade my system.
>
>
>
>
>
>
>
>
>
>
>
>


Re: Unsubscribe - 3rd time

2016-06-29 Thread Nicholas Chammas
> I'm not sure I've ever come across an email list that allows you to
unsubscribe by responding to the list with "unsubscribe".

Many noreply lists (e.g. companies sending marketing email) actually work
that way, which is probably what most people are used to these days.

What this list needs is an unsubscribe link in the footer, like most modern
mailing lists have. Work to add this in is already in progress here:
https://issues.apache.org/jira/browse/INFRA-12185

Nick

On Wed, Jun 29, 2016 at 12:57 PM Jonathan Kelly 
wrote:

> If at first you don't succeed, try, try again. But please don't. :)
>
> See the "unsubscribe" link here: http://spark.apache.org/community.html
>
> I'm not sure I've ever come across an email list that allows you to
> unsubscribe by responding to the list with "unsubscribe". At least, all of
> the Apache ones have a separate address to which you send
> subscribe/unsubscribe messages. And yet people try to send "unsubscribe"
> messages to the actual list almost every day.
>
> On Wed, Jun 29, 2016 at 9:03 AM Mich Talebzadeh 
> wrote:
>
>> LOL. Bravely said Joaquin.
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 June 2016 at 16:54, Joaquin Alzola 
>> wrote:
>>
>>> And 3rd time is not enough to know that unsubscribe is done through
>>> user-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>> *From:* Steve Florence [mailto:sflore...@ypm.com]
>>> *Sent:* 29 June 2016 16:47
>>> *To:* user@spark.apache.org
>>> *Subject:* Unsubscribe - 3rd time
>>>
>>>
>>>
>>>
>>> This email is confidential and may be subject to privilege. If you are
>>> not the intended recipient, please do not copy or disclose its content but
>>> contact the sender immediately upon receipt.
>>>
>>
>>


Re: Writing output of key-value Pair RDD

2016-05-04 Thread Nicholas Chammas
You're looking for this discussion:
http://stackoverflow.com/q/23995040/877069

Also, a simpler alternative with DataFrames:
https://github.com/apache/spark/pull/8375#issuecomment-202458325
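
To make the DataFrame alternative concrete, here is a rough sketch (assuming a
Spark 2.x SparkSession; the output path is hypothetical, and an s3a:// URI
would work the same way given the right credentials):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
df = spark.createDataFrame(pairs, ["key", "value"])

# Each distinct key lands in its own subdirectory, e.g. .../key=a/, .../key=b/
df.write.partitionBy("key").json("/tmp/pairs-by-key")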

On Wed, May 4, 2016 at 4:09 PM Afshartous, Nick 
wrote:

> Hi,
>
>
> Is there any way to write out to S3 the values of a key-value Pair RDD?
>
>
> I'd like each value of a pair to be written to its own file where the file
> name corresponds to the key name.
>
>
> Thanks,
>
> --
>
> Nick
>


Re: spark-ec2 hitting yum install issues

2016-04-14 Thread Nicholas Chammas
If you log into the cluster and manually try that step does it still fail?
Can you yum install anything else?

You might want to report this issue directly on the spark-ec2 repo, btw:
https://github.com/amplab/spark-ec2

Nick

On Thu, Apr 14, 2016 at 9:08 PM sanusha  wrote:

>
> I am using spark-1.6.1-prebuilt-with-hadoop-2.6 on mac. I am using the
> spark-ec2 to launch a cluster in
> Amazon VPC. The setup.sh script [run first thing on master after launch]
> uses pssh and tries to install it
> via 'yum install -y pssh'. This step always fails on the master AMI that
> the
> script uses by default as it is
> not able to find it in the repo mirrors - hits 403.
>
> Has anyone faced this and know what's causing it? For now, I have changed
> the script to not use pssh
> as a workaround. But would like to fix the root cause.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-hitting-yum-install-issues-tp26786.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Nicholas Chammas
Yes, this is a known issue. The core devs are already aware of it. [CC dev]

FWIW, I believe the Spark 1.6.1 / Hadoop 2.6 package on S3 is not corrupt.
It may be the only 1.6.1 package that is not corrupt, though. :/

Nick


On Tue, Apr 12, 2016 at 9:00 PM Augustus Hong 
wrote:

> Hi all,
>
> I'm trying to launch a cluster with the spark-ec2 script but seeing the
> error below.  Are the packages on S3 corrupted / not in the correct format?
>
> Initializing spark
>
> --2016-04-13 00:25:39--
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz
>
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.11.67
>
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.11.67|:80...
> connected.
>
> HTTP request sent, awaiting response... 200 OK
>
> Length: 277258240 (264M) [application/x-compressed]
>
> Saving to: ‘spark-1.6.1-bin-hadoop1.tgz’
>
> 100%[==>]
> 277,258,240 37.6MB/s   in 9.2s
>
> 2016-04-13 00:25:49 (28.8 MB/s) - ‘spark-1.6.1-bin-hadoop1.tgz’ saved
> [277258240/277258240]
>
> Unpacking Spark
>
>
> gzip: stdin: not in gzip format
>
> tar: Child returned status 1
>
> tar: Error is not recoverable: exiting now
>
> mv: missing destination file operand after `spark'
>
> Try `mv --help' for more information.
>
>
>
>
>
>
> --
> [image: Branch] 
> Augustus Hong
> Software Engineer
>
>


Re: Reading Back a Cached RDD

2016-03-24 Thread Nicholas Chammas
Isn’t persist() only for reusing an RDD within an active application? Maybe
checkpoint() is what you’re looking for instead?
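
For reference, a rough sketch of what checkpointing looks like in PySpark; the
checkpoint directory here is hypothetical and should live on durable storage
such as HDFS or S3:

from pyspark import SparkContext

sc = SparkContext(appName="checkpoint-example")
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  # hypothetical durable location

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.checkpoint()  # marks the RDD for checkpointing
rdd.count()       # an action triggers the actual write to the checkpoint directory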
​

On Thu, Mar 24, 2016 at 2:02 PM Afshartous, Nick 
wrote:

>
> Hi,
>
>
> After calling RDD.persist(), is then possible to come back later and
> access the persisted RDD.
>
> Let's say for instance coming back and starting a new Spark shell
> session.  How would one access the persisted RDD in the new shell session ?
>
>
> Thanks,
>
> --
>
>Nick
>


Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
We’re veering off from the original question of this thread, but to
clarify, my comment earlier was this:

So in short, DataFrames are the “new RDD”—i.e. the new base structure you
should be using in your Spark programs wherever possible.

RDDs are not going away, and clearly in your case DataFrames are not that
helpful, so sure, continue to use RDDs. There’s nothing wrong with that.
No-one is saying you *must* use DataFrames, and Spark will continue to
offer its RDD API.

However, my original comment to Jules still stands: If you can, use
DataFrames. In most cases they will offer you a better development
experience and better performance across languages, and future Spark
optimizations will mostly be enabled by the structure that DataFrames
provide.

DataFrames are the “new RDD” in the sense that they are the new foundation
for much of the new work that has been done in recent versions and that is
coming in Spark 2.0 and beyond.

Many people work with semi-structured data and have a relatively easy path
to DataFrames, as I explained in my previous email. If, however, you’re
working with data that has very little structure, like in Darren’s case,
then yes, DataFrames are probably not going to help that much. Stick with
RDDs and you’ll be fine.
​

On Wed, Mar 2, 2016 at 6:28 PM Darren Govoni <dar...@ontrenet.com> wrote:

> Our data is made up of single text documents scraped off the web. We store
> these in an RDD. A Dataframe or similar structure makes no sense at that
> point. And the RDD is transient.
>
> So my point is. Dataframes should not replace plain old rdd since rdds
> allow for more flexibility and sql etc is not even usable on our data while
> in rdd. So all those nice dataframe apis aren't usable until it's
> structured. Which is the core problem anyway.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
> ---- Original message 
> From: Nicholas Chammas <nicholas.cham...@gmail.com>
> Date: 03/02/2016 5:43 PM (GMT-05:00)
> To: Darren Govoni <dar...@ontrenet.com>, Jules Damji <dmat...@comcast.net>,
> Joshua Sorrell <jsor...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
> features
>
> Plenty of people get their data in Parquet, Avro, or ORC files; or from a
> database; or do their initial loading of un- or semi-structured data using
> one of the various data source libraries
> <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help
> with type-/schema-inference.
>
> All of these paths help you get to a DataFrame very quickly.
>
> Nick
>
> On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:
>
> Dataframes are essentially structured tables with schemas. So where does
>> the non typed data sit before it becomes structured if not in a traditional
>> RDD?
>>
>> For us almost all the processing comes before there is structure to it.
>>
>>
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>> Date: 03/02/2016 5:13 PM (GMT-05:00)
>> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>>
>> Cc: user@spark.apache.org
>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
>> features
>>
>> > However, I believe, investing (or having some members of your group)
>> learn and invest in Scala is worthwhile for few reasons. One, you will get
>> the performance gain, especially now with Tungsten (not sure how it relates
>> to Python, but some other knowledgeable people on the list, please chime
>> in).
>>
>> The more your workload uses DataFrames, the less of a difference there
>> will be between the languages (Scala, Java, Python, or R) in terms of
>> performance.
>>
>> One of the main benefits of Catalyst (which DFs enable) is that it
>> automatically optimizes DataFrame operations, letting you focus on _what_
>> you want while Spark will take care of figuring out _how_.
>>
>> Tungsten takes things further by tightly managing memory using the type
>> information made available to it via DataFrames. This benefit comes into
>> play regardless of the language used.
>>
>> So in short, DataFrames are the "new RDD"--i.e. the new base structure
>> you should be using in your Spark programs wherever possible. And with
>> DataFrames, what language you use matters much less in terms of performance.
>>
>> Nick
>>
>> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>>

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
Plenty of people get their data in Parquet, Avro, or ORC files; or from a
database; or do their initial loading of un- or semi-structured data using
one of the various data source libraries
<http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help with
type-/schema-inference.

All of these paths help you get to a DataFrame very quickly.

Nick
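
As a quick illustration of one such path (a sketch assuming Spark 2.x; the file
name is hypothetical and refers to newline-delimited JSON):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-df").getOrCreate()

# The JSON data source infers a schema from the semi-structured input,
# so you get typed columns without declaring them up front.
df = spark.read.json("events.json")
df.printSchema()
df.show(5)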

On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:

Dataframes are essentially structured tables with schemas. So where does
> the non typed data sit before it becomes structured if not in a traditional
> RDD?
>
> For us almost all the processing comes before there is structure to it.
>
>
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
> ---- Original message 
> From: Nicholas Chammas <nicholas.cham...@gmail.com>
> Date: 03/02/2016 5:13 PM (GMT-05:00)
> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
> features
>
> > However, I believe, investing (or having some members of your group)
> learn and invest in Scala is worthwhile for few reasons. One, you will get
> the performance gain, especially now with Tungsten (not sure how it relates
> to Python, but some other knowledgeable people on the list, please chime
> in).
>
> The more your workload uses DataFrames, the less of a difference there
> will be between the languages (Scala, Java, Python, or R) in terms of
> performance.
>
> One of the main benefits of Catalyst (which DFs enable) is that it
> automatically optimizes DataFrame operations, letting you focus on _what_
> you want while Spark will take care of figuring out _how_.
>
> Tungsten takes things further by tightly managing memory using the type
> information made available to it via DataFrames. This benefit comes into
> play regardless of the language used.
>
> So in short, DataFrames are the "new RDD"--i.e. the new base structure you
> should be using in your Spark programs wherever possible. And with
> DataFrames, what language you use matters much less in terms of performance.
>
> Nick
>
> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>
>> Hello Joshua,
>>
>> comments are inline...
>>
>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
>>
>> I haven't used Spark in the last year and a half. I am about to start a
>> project with a new team, and we need to decide whether to use pyspark or
>> Scala.
>>
>>
>> Indeed, good questions, and they do come up a lot in trainings that I have
>> attended, where this inevitable question is raised.
>> I believe, it depends on your level of comfort zone or adventure into
>> newer things.
>>
>> True, for the most part the Apache Spark committers have been committed
>> to keeping the APIs at parity across all the language offerings, even though
>> in some cases, in particular Python, they have lagged by a minor release.
>> That they’re committed to level-parity is a good sign. It
>> might be the case with some experimental APIs, where they lag behind,
>> but for the most part, they have been admirably consistent.
>>
>> With Python there’s a minor performance hit, since there’s an extra level
>> of indirection in the architecture and an additional Python PID that the
>> executors launch to execute your pickled Python lambdas. Other than that it
>> boils down to your comfort zone. I recommend looking at Sameer’s slides on
>> (Advanced Spark for DevOps Training) where he walks through the pySpark and
>> Python architecture.
>>
>>
>> We are NOT a java shop. So some of the build tools/procedures will
>> require some learning overhead if we go the Scala route. What I want to
>> know is: is the Scala version of Spark still far enough ahead of pyspark to
>> be well worth any initial training overhead?
>>
>>
>> If you are a very advanced Python shop and if you’ve in-house libraries
>> that you have written in Python that don’t exist in Scala or some ML libs
>> that don’t exist in the Scala version and will require fair amount of
>> porting and gap is too large, then perhaps it makes sense to stay put with
>> Python.
>>
>> However, I believe, investing (or having some members of your group)
>> learn and invest in Scala is worthwhile for few reasons. One, you will get
>> the performance gain, especially now with Tungsten (not sure how it relates
>> to Python, but some other knowledgeable people on the list, please ch

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
> However, I believe, investing (or having some members of your group)
learn and invest in Scala is worthwhile for few reasons. One, you will get
the performance gain, especially now with Tungsten (not sure how it relates
to Python, but some other knowledgeable people on the list, please chime
in).

The more your workload uses DataFrames, the less of a difference there will
be between the languages (Scala, Java, Python, or R) in terms of
performance.

One of the main benefits of Catalyst (which DFs enable) is that it
automatically optimizes DataFrame operations, letting you focus on _what_
you want while Spark will take care of figuring out _how_.

Tungsten takes things further by tightly managing memory using the type
information made available to it via DataFrames. This benefit comes into
play regardless of the language used.

So in short, DataFrames are the "new RDD"--i.e. the new base structure you
should be using in your Spark programs wherever possible. And with
DataFrames, what language you use matters much less in terms of performance.

Nick

On Tue, Mar 1, 2016 at 12:07 PM Jules Damji  wrote:

> Hello Joshua,
>
> comments are inline...
>
> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell  wrote:
>
> I haven't used Spark in the last year and a half. I am about to start a
> project with a new team, and we need to decide whether to use pyspark or
> Scala.
>
>
> Indeed, good questions, and they do come up a lot in trainings that I have
> attended, where this inevitable question is raised.
> I believe, it depends on your level of comfort zone or adventure into
> newer things.
>
> True, for the most part the Apache Spark committers have been committed
> to keeping the APIs at parity across all the language offerings, even though
> in some cases, in particular Python, they have lagged by a minor release.
> That they’re committed to level-parity is a good sign. It
> might be the case with some experimental APIs, where they lag behind,
> but for the most part, they have been admirably consistent.
>
> With Python there’s a minor performance hit, since there’s an extra level
> of indirection in the architecture and an additional Python PID that the
> executors launch to execute your pickled Python lambdas. Other than that it
> boils down to your comfort zone. I recommend looking at Sameer’s slides on
> (Advanced Spark for DevOps Training) where he walks through the pySpark and
> Python architecture.
>
>
> We are NOT a java shop. So some of the build tools/procedures will require
> some learning overhead if we go the Scala route. What I want to know is: is
> the Scala version of Spark still far enough ahead of pyspark to be well
> worth any initial training overhead?
>
>
> If you are a very advanced Python shop and if you’ve in-house libraries
> that you have written in Python that don’t exist in Scala or some ML libs
> that don’t exist in the Scala version and will require fair amount of
> porting and gap is too large, then perhaps it makes sense to stay put with
> Python.
>
> However, I believe, investing (or having some members of your group) learn
> and invest in Scala is worthwhile for few reasons. One, you will get the
> performance gain, especially now with Tungsten (not sure how it relates to
> Python, but some other knowledgeable people on the list, please chime in).
> Two, since Spark is written in Scala, it gives you an enormous advantage to
> read sources (which are well documented and highly readable) should you
> have to consult or learn nuances of certain API method or action not
> covered comprehensively in the docs. And finally, there’s a long term
> benefit in learning Scala for reasons other than Spark. For example,
> writing other scalable and distributed applications.
>
>
> Particularly, we will be using Spark Streaming. I know a couple of years
> ago that practically forced the decision to use Scala.  Is this still the
> case?
>
>
> You’ll notice that certain APIs call are not available, at least for now,
> in Python.
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>
>
> Cheers
> Jules
>
> --
> The Best Ideas Are Simple
> Jules S. Damji
> e-mail:dmat...@comcast.net
> e-mail:jules.da...@gmail.com
>
>


Re: Is this likely to cause any problems?

2016-02-19 Thread Nicholas Chammas
The docs mention spark-ec2 because it is part of the Spark project. There
are many, many alternatives to spark-ec2 out there like EMR, but it's
probably not the place of the official docs to promote any one of those
third-party solutions.

On Fri, Feb 19, 2016 at 11:05 AM James Hammerton  wrote:

> Hi,
>
> Having looked at how easy it is to use EMR, I reckon you may be right,
> especially if using Java 8 is no more difficult with that than with
> spark-ec2 (where I had to install it on the master and slaves and edit the
> spark-env.sh).
>
> I'm now curious as to why the Spark documentation (
> http://spark.apache.org/docs/latest/index.html) mentions EC2 but not EMR.
>
> Regards,
>
> James
>
>
> On 19 February 2016 at 14:25, Daniel Siegmann  > wrote:
>
>> With EMR supporting Spark, I don't see much reason to use the spark-ec2
>> script unless it is important for you to be able to launch clusters using
>> the bleeding edge version of Spark. EMR does seem to do a pretty decent job
>> of keeping up to date - the latest version (4.3.0) supports the latest
>> Spark version (1.6.0).
>>
>> So I'd flip the question around and ask: is there any reason to continue
>> using the spark-ec2 script rather than EMR?
>>
>> On Thu, Feb 18, 2016 at 11:39 AM, James Hammerton  wrote:
>>
>>> I have now... So far  I think the issues I've had are not related to
>>> this, but I wanted to be sure in case it should be something that needs to
>>> be patched. I've had some jobs run successfully but this warning appears in
>>> the logs.
>>>
>>> Regards,
>>>
>>> James
>>>
>>> On 18 February 2016 at 12:23, Ted Yu  wrote:
>>>
 Have you seen this ?

 HADOOP-10988

 Cheers

 On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton 
 wrote:

> HI,
>
> I am seeing warnings like this in the logs when I run Spark jobs:
>
> OpenJDK 64-Bit Server VM warning: You have loaded library 
> /root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0 which might have 
> disabled stack guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c 
> ', or link it with '-z noexecstack'.
>
>
> I used spark-ec2 to launch the cluster with the default AMI, Spark
> 1.5.2, hadoop major version 2.4. I altered the jdk to be openjdk 8 as I'd
> written some jobs in Java 8. The 6 workers nodes are m4.2xlarge and master
> is m4.large.
>
> Could this contribute to any problems running the jobs?
>
> Regards,
>
> James
>


>>>
>>
>


Re: Is spark-ec2 going away?

2016-01-27 Thread Nicholas Chammas
I noticed that in the main branch, the ec2 directory along with the
spark-ec2 script is no longer present.

It’s been moved out of the main repo to its own location:
https://github.com/amplab/spark-ec2/pull/21

Is spark-ec2 going away in the next release? If so, what would be the best
alternative at that time?

It’s not going away. It’s just being removed from the main Spark repo and
maintained separately.

There are many alternatives like EMR, which was already mentioned, as well
as more full-service solutions like Databricks. It depends on what you’re
looking for.

If you want something as close to spark-ec2 as possible but more actively
developed, you might be interested in checking out Flintrock
, which I built.

Is there any way to add/remove additional workers while the cluster is
running without stopping/starting the EC2 cluster?

Not currently possible with spark-ec2 and a bit difficult to add. See:
https://issues.apache.org/jira/browse/SPARK-2008

For 1, if no such capability is provided with the current script., do we
have to write it ourselves? Or is there any plan in the future to add such
functions?

No "official" plans to add this to spark-ec2. It’s up to a contributor to
step up and implement this feature, basically. Otherwise it won’t happen.

Nick

On Wed, Jan 27, 2016 at 5:13 PM Alexander Pivovarov 
wrote:

you can use EMR-4.3.0 run on spot instances to control the price
>
> yes, you can add/remove instances to the cluster on fly  (CORE instances
> support add only, TASK instances - add and remove)
>
>
>
> On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung  > wrote:
>
>> I noticed that in the main branch, the ec2 directory along with the
>> spark-ec2 script is no longer present.
>>
>> Is spark-ec2 going away in the next release? If so, what would be the
>> best alternative at that time?
>>
>> A couple more additional questions:
>> 1. Is there any way to add/remove additional workers while the cluster is
>> running without stopping/starting the EC2 cluster?
>> 2. For 1, if no such capability is provided with the current script., do
>> we have to write it ourselves? Or is there any plan in the future to add
>> such functions?
>> 2. In PySpark, is it possible to dynamically change driver/executor
>> memory, number of cores per executor without having to restart it? (e.g.
>> via changing sc configuration or recreating sc?)
>>
>> Our ideal scenario is to keep running PySpark (in our case, as a
>> notebook) and connect/disconnect to any spark clusters on demand.
>>
>
> ​


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1

Red Hat supports Python 2.6 on RHEL 5 until 2020
<https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>, but
otherwise yes, Python 2.6 is ancient history and the core Python developers
stopped supporting it in 2013. RHEL 5 is not a good enough reason to
continue support for Python 2.6 IMO.

We should aim to support Python 2.7 and Python 3.3+ (which I believe we
currently do).

Nick

On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:

> plus 1,
>
> we are currently using python 2.7.2 in production environment.
>
>
>
>
>
> On 2016-01-05 18:11:45, "Meethu Mathew" wrote:
>
> +1
> We use Python 2.7
>
> Regards,
>
> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in Spark
>> 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>> stopped supporting 2.6. We can still convince the library maintainers to
>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>> Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until
2020. So I'm assuming these large companies will have the option of riding
out Python 2.6 until then.

Are we seriously saying that Spark should likewise support Python 2.6 for
the next several years? Even though the core Python devs stopped supporting
it in 2013?

If that's not what we're suggesting, then when, roughly, can we drop
support? What are the criteria?

I understand the practical concern here. If companies are stuck using 2.6,
it doesn't matter to them that it is deprecated. But balancing that concern
against the maintenance burden on this project, I would say that "upgrade
to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
There are many tiny annoyances one has to put up with to support 2.6.

I suppose if our main PySpark contributors are fine putting up with those
annoyances, then maybe we don't need to drop support just yet...

Nick
On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6 is
> old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> On Jan 5, 2016, at 8:07 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the only
> option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <juliet.hougl...@gmail.com
> > wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>> point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind the
>> version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>> but otherwise yes, Python 2.6 is ancient history and the core Python
>>> developers stopped supporting it in 2013. RHEL 5 is not a good enough
>>> reason to continue support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>>> currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <allenzhang...@126.com>
>>> wrote:
>>>
>>>> plus 1,
>>>>
>>>> we are currently using python 2.7.2 in production environment.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 2016-01-05 18:11:45, "Meethu Mathew" <meethu.mat...@flytxt.com> wrote:
>>>>
>>>> +1
>>>> We use Python 2.7
>>>>
>>>> Regards,
>>>>
>>>> Meethu Mathew
>>>>
>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Does anybody here care about us dropping support for Python 2.6 in
>>>>> Spark 2.0?
>>>>>
>>>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>>>> parsing) when compared with Python 2.7. Some libraries that Spark depend 
>>>>> on
>>>>> stopped supporting 2.6. We can still convince the library maintainers to
>>>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>>>> Python 2.6 to run Spark.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python
installed since they run Python code in PySpark jobs natively.

On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL <https://docs.python.org/3/license.html>:
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under the
>> GPL. All Python licenses, unlike the GPL, let you distribute a modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do not
>>> have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its apache 2
>>> licensed, and it only needs to be present on the machine that launches the
>>> app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed, so the
>>> client would have to download it and install it themselves, and this would
>>> mean its an independent install which has to be audited and approved and
>>> now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com>
>>> wrote:
>>>
>>>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>>>> imagine that they're also capable of installing a standalone Python
>>>> alongside that Spark version (without changing Python systemwide). For
>>>> instance, Anaconda/Miniconda make it really easy to install Python
>>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>>> require any special permissions to install (you don't need root / sudo
>>>> access). Does this address the Python versioning concerns for RHEL users?
>>>>
>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> yeah, the practical concern is that we have no control over java or
>>>>> python version on large company clusters. our current reality for the vast
>>>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>>>
>>>>> i dont like it either, but i cannot change it.
>>>>>
>>>>> we currently don't use pyspark so i have no stake in this, but if we
>>>>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>>>> dropped. no point in developing something that doesnt run for majority of
>>>>> customers.
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>>>>> until 2020. So I'm assuming these large companies will have the option of
>>>>>> riding out Python 2.6 until then.
>>>>>>
>>>>>> Are we seriously saying that Spark should likewise support Python 2.6
>>>>>> for the next several years? Even though the core Python devs stopped
>>>>>> supporting it in 2013?
>>>>>>
>>>>>> If that's not what we're suggesting, then when, roughly, can we drop
>>>>>> support? What are the criteria?
>>>>>>
>>>>>> I understand the practical concern here. If companies are stuck using
>>>>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>>>>> concern against the maintenance burden on this project, I would say that
>>>>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position 
>>>>>> to
>>>>>> take. T

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the
app we can not ship it with our software because its gpl licensed

Not to nitpick, but maybe this is important. The Python license is
GPL-compatible
but not GPL <https://docs.python.org/3/license.html>:

Note GPL-compatible doesn’t mean that we’re distributing Python under the
GPL. All Python licenses, unlike the GPL, let you distribute a modified
version without making your changes open source. The GPL-compatible
licenses make it possible to combine Python with other software that is
released under the GPL; the others don’t.

Nick
​

On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com> wrote:

> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the
> app we can not ship it with our software because its gpl licensed, so the
> client would have to download it and install it themselves, and this would
> mean its an independent install which has to be audited and approved and
> now you are in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python
>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>> require any special permissions to install (you don't need root / sudo
>> access). Does this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>>> until 2020. So I'm assuming these large companies will have the option of
>>>> riding out Python 2.6 until then.
>>>>
>>>> Are we seriously saying that Spark should likewise support Python 2.6
>>>> for the next several years? Even though the core Python devs stopped
>>>> supporting it in 2013?
>>>>
>>>> If that's not what we're suggesting, then when, roughly, can we drop
>>>> support? What are the criteria?
>>>>
>>>> I understand the practical concern here. If companies are stuck using
>>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>>> concern against the maintenance burden on this project, I would say that
>>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>>>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>>>
>>>> I suppose if our main PySpark contributors are fine putting up with
>>>> those annoyances, then maybe we don't need to drop support just yet...
>>>>
>>>> Nick
>>>> On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <ju...@esbet.es> wrote:
>>>>
>>>>> Unfortunately, Koert is right.
>>>>>
>>>>> I've been in a couple of projects using Spark (banking industry) where
>>>>> CentOS + Python 2.6 is the toolbox available.
>>>>>
>>>>> That said, I believe it should not be a concern for Spark. Python 2.6
>>>>> is old and busted, which is totally opposite to the Spark philosophy IMO.
>>>>>
>>>>>
>>>>> On Jan 5, 2016, at 8:07 PM, Koert Kuipers <ko...@tresata.com>
>>>>> 

Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Nicholas Chammas
Quick question: Are you processing gzipped files by any chance? It's a
common stumbling block people hit.

See: http://stackoverflow.com/q/27531816/877069

Nick
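
For context, gzip is not a splittable format, so a single .gz file is read as
a single partition by a single task. A rough sketch of checking for and working
around that (the input path is hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="gzip-check")

rdd = sc.textFile("s3a://my-bucket/big-input.gz")  # hypothetical path
print(rdd.getNumPartitions())  # typically 1 for a single gzipped file

# Repartition after the initial read so downstream stages use the whole cluster.
rdd = rdd.repartition(sc.defaultParallelism * 2)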

On Fri, Dec 4, 2015 at 2:28 PM Kyohey Hamaguchi 
wrote:

> Hi,
>
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
>
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> available.)
>
> This is the command used to launch:
>
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
>
> And the command to run application:
>
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
>
> How do I let the all of the workers execute the app?
>
> Or do I have wrong understanding on what workers, slaves and executors are?
>
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
>
> Thanks,
>
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Adding more slaves to a running cluster

2015-11-25 Thread Nicholas Chammas
spark-ec2 does not directly support adding instances to an existing
cluster, apart from the special case of adding slaves to a cluster with a
master but no slaves. There is an open issue to track adding this support,
SPARK-2008 , but it
doesn't have any momentum at the moment.

Your best bet currently is to do what you did and hack your way through
using spark-ec2's various scripts.

You probably already know this, but to be clear, note that Spark itself
supports adding slaves to a running cluster. It's just that spark-ec2
hasn't implemented a feature to do this work for you.

Nick

On Wed, Nov 25, 2015 at 2:27 PM Dillian Murphey 
wrote:

> It appears start-slave.sh works on a running cluster.  I'm surprised I
> can't find more info on this. Maybe I'm not looking hard enough?
>
> Using AWS and spot instances is incredibly more efficient, which begs for
> the need of dynamically adding more nodes while the cluster is up, yet
> everything I've found so far seems to indicate it isn't supported yet.
>
> But yet here I am with 1.5 and it at least appears to be working. Am I
> missing something?
>
> On Tue, Nov 24, 2015 at 4:40 PM, Dillian Murphey 
> wrote:
>
>> What's the current status on adding slaves to a running cluster?  I want
>> to leverage spark-ec2 and autoscaling groups.  I want to launch slaves as
>> spot instances when I need to do some heavy lifting, but I don't want to
>> bring down my cluster in order to add nodes.
>>
>> Can this be done by just running start-slave.sh??
>>
>> What about using Mesos?
>>
>> I just want to create an AMI for a slave and on some trigger launch it
>> and have it automatically add itself to the cluster.
>>
>> thanks
>>
>
>


Re: spark-ec2 script to launch cluster running Spark 1.5.2 built with HIVE?

2015-11-23 Thread Nicholas Chammas
Don't the Hadoop builds include Hive already? Like
spark-1.5.2-bin-hadoop2.6.tgz?

On Mon, Nov 23, 2015 at 7:49 PM Jeff Schecter  wrote:

> Hi all,
>
> As far as I can tell, the bundled spark-ec2 script provides no way to
> launch a cluster running Spark 1.5.2 pre-built with HIVE.
>
> That is to say, all of the pre-built versions of Spark 1.5.2 in the S3 bucket
> spark-related-packages are missing HIVE.
>
> aws s3 ls s3://spark-related-packages/ | grep 1.5.2
>
>
> Am I missing something here? I'd rather avoid resorting to whipping up
> hacky patching scripts that might break with the next Spark point release
> if at all possible.
>


Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Nicholas Chammas
spark-ec2 does not offer a way to upgrade an existing cluster, and from
what I gather, it wasn't intended to be used to manage long-lasting
infrastructure. The recommended approach really is to just destroy your
existing cluster and launch a new one with the desired configuration.

If you want to upgrade the cluster in place, you'll probably have to do
that manually. Otherwise, perhaps spark-ec2 is not the right tool, and
instead you want one of those "grown-up" management tools like Ansible
which can be setup to allow in-place upgrades. That'll take a bit of work,
though.

Nick

On Wed, Nov 11, 2015 at 6:01 PM Augustus Hong 
wrote:

> Hey All,
>
> I have a Spark cluster(running version 1.5.0) on EC2 launched with the
> provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same
> cluster, what's the safest / recommended way to do that?
>
>
> I know I can spin up a new cluster running 1.5.2, but it doesn't seem
> efficient to spin up a new cluster every time we need to upgrade.
>
>
> Thanks,
> Augustus
>
>
>
>
>
> --
> [image: Branch Metrics mobile deep linking] * Augustus
> Hong*
>  Data Analytics | Branch Metrics
>  m 650-391-3369 | e augus...@branch.io
>


Re: Spark EC2 script on Large clusters

2015-11-05 Thread Nicholas Chammas
Yeah, as Shivaram mentioned, this issue is well-known. It's documented in
SPARK-5189  and a bunch
of related issues. Unfortunately, it's hard to resolve this issue in
spark-ec2 without rewriting large parts of the project. But if you take a
crack at it and succeed I'm sure a lot of people will be happy.

I've started a separate project  --
which Shivaram also mentioned -- which aims to solve the problem of long
launch times and other issues
 with spark-ec2. It's
still very young and lacks several critical features, but we are making
steady progress.

Nick

On Thu, Nov 5, 2015 at 12:30 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> It is a known limitation that spark-ec2 is very slow for large
> clusters and as you mention most of this is due to the use of rsync to
> transfer things from the master to all the slaves.
>
> Nick cc'd has been working on an alternative approach at
> https://github.com/nchammas/flintrock that is more scalable.
>
> Thanks
> Shivaram
>
> On Thu, Nov 5, 2015 at 8:12 AM, Christian  wrote:
> > For starters, thanks for the awesome product!
> >
> > When creating ec2-clusters of 20-40 nodes, things work great. When we
> create
> > a cluster with the provided spark-ec2 script, it takes hours. When
> creating
> > a 200 node cluster, it takes 2 1/2 hours and for a 500 node cluster it
> takes
> > over 5 hours. One other problem we are having is that some nodes don't
> come
> > up when the other ones do, the process seems to just move on, skipping
> the
> > rsync and any installs on those ones.
> >
> > My guess as to why it takes so long to set up a large cluster is because
> of
> > the use of rsync. What if instead of using rsync, you synched to s3 and
> then
> > did a pdsh to pull it down on all of the machines. This is a big deal
> for us
> > and if we can come up with a good plan, we might be able help out with
> the
> > required changes.
> >
> > Are there any suggestions on how to deal with some of the nodes not being
> > ready when the process starts?
> >
> > Thanks for your time,
> > Christian
> >
>


Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Nicholas Chammas
Nabble is an unofficial archive of this mailing list. I don't know who runs
it, but it's not Apache. There are often delays between when things get
posted to the list and updated on Nabble, and sometimes things never make
it over for whatever reason.

This mailing list is, I agree, very 1980s. Unfortunately, it's required by
the Apache Software Foundation (ASF).

There was a discussion earlier this year about migrating to Discourse that
explained why we're stuck with what we have for now. Ironically, that
discussion is hard to follow on the Apache archives (which is precisely one of
the motivations for proposing to migrate to Discourse), but there is a more
readable archive on another unofficial site.

Nick

On Sat, Oct 31, 2015 at 12:20 PM Martin Senne 
wrote:

> Having written a post last Tuesday, I'm still not able to see my post
> under Nabble. And yeah, subscription to u...@apache.spark.org was
> successful (rechecked a minute ago).
>
> Even more, I have no way (and no confirmation) that my post was accepted,
> rejected, whatever.
>
> This is very L4M3 and so 80ies.
>
> Any help appreciated. Thx!
>


Can we add an unsubscribe link in the footer of every email?

2015-10-21 Thread Nicholas Chammas
Every week or so someone emails the list asking to unsubscribe.

Of course, that's not the right way to do it. You're supposed to email a
different address than this one to unsubscribe, yet this is not in-your-face
obvious, so many people miss it.
And someone steps up almost every time to point people in the right
direction.

The vast majority of mailing lists I'm familiar with include a small footer
at the bottom of each email with a link to unsubscribe. I think this is
what most people expect, and it's where they check first.

Can we add a footer like that?

I think it would cut down on the weekly emails from people wanting to
unsubscribe, and it would match existing mailing list conventions elsewhere.

Nick


Re: stability of Spark 1.4.1 with Python 3 versions

2015-10-14 Thread Nicholas Chammas
The Spark 1.4 release notes
 say that
Python 3 is supported. The 1.4 docs are incorrect, and the 1.5 programming
guide has been updated to indicate Python 3 support.

On Wed, Oct 14, 2015 at 7:06 AM shoira.mukhsin...@bnpparibasfortis.com <
shoira.mukhsin...@bnpparibasfortis.com> wrote:

> Dear Spark Community,
>
>
>
> The official documentation of Spark 1.4.1 mentions that Spark runs on Python
> 2.6+ http://spark.apache.org/docs/1.4.1/
>
> It is not clear if by “Python 2.6+” you also mean Python 3.4 or not.
>
>
>
> There is a resolved issue on this point which makes me believe that it
> does run on Python 3.4: https://issues.apache.org/jira/i#browse/SPARK-9705
>
> Maybe the documentation is simply not up to date ? The programming guide
> mentions that it does not work for Python 3:
> https://spark.apache.org/docs/1.4.1/programming-guide.html
>
>
>
> Do you confirm that Spark 1.4.1 does run on Python3.4?
>
>
>
> Thanks in advance for your reaction!
>
>
>
> Regards,
>
> Shoira
>
>
>
>
>
>
>
> ==
> BNP Paribas Fortis disclaimer:
> http://www.bnpparibasfortis.com/e-mail-disclaimer.html
>
> BNP Paribas Fortis privacy policy:
> http://www.bnpparibasfortis.com/privacy-policy.html
>
> ==
>


Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-28 Thread Nicholas Chammas
Hi Everybody!

Thanks for participating in the spark-ec2 survey. The full results are
publicly viewable here:

https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics

The gist of the results is as follows:

Most people found spark-ec2 useful as an easy way to get a working Spark
cluster to run a quick experiment or do some benchmarking without having to
do a lot of manual configuration or setup work.

Many people lamented the slow launch times of spark-ec2, problems getting
it to launch clusters within a VPC, and broken Ganglia installs. Some also
mentioned that Hadoop 2 didn't work as expected.

Wish list items for spark-ec2 included faster launches, selectable Hadoop 2
versions, and more configuration options.

If you'd like to add your own feedback to what's already there, I've
decided to leave the survey open for a few more days:

http://goo.gl/forms/erct2s6KRR

As noted before, your results are anonymous and public.

Thanks again for participating! I hope this has been useful to the
community.

Nick

On Tue, Aug 25, 2015 at 1:31 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Final chance to fill out the survey!

 http://goo.gl/forms/erct2s6KRR

 I'm gonna close it to new responses tonight and send out a summary of the
 results.

 Nick

 On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'm planning to close the survey to further responses early next week.

 If you haven't chimed in yet, the link to the survey is here:

 http://goo.gl/forms/erct2s6KRR

 We already have some great responses, which you can view. I'll share a
 summary after the survey is closed.

 Cheers!

 Nick


 On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link
 to get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​




Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey!

http://goo.gl/forms/erct2s6KRR

I'm gonna close it to new responses tonight and send out a summary of the
results.

Nick

On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 I'm planning to close the survey to further responses early next week.

 If you haven't chimed in yet, the link to the survey is here:

 http://goo.gl/forms/erct2s6KRR

 We already have some great responses, which you can view. I'll share a
 summary after the survey is closed.

 Cheers!

 Nick


 On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link to
 get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​




Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-20 Thread Nicholas Chammas
I'm planning to close the survey to further responses early next week.

If you haven't chimed in yet, the link to the survey is here:

http://goo.gl/forms/erct2s6KRR

We already have some great responses, which you can view. I'll share a
summary after the survey is closed.

Cheers!

Nick


On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link to
 get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​



[survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Nicholas Chammas
Howdy folks!

I’m interested in hearing about what people think of spark-ec2
http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
formal JIRA process. Your answers will all be anonymous and public.

If the embedded form below doesn’t work for you, you can use this link to
get the same survey:

http://goo.gl/forms/erct2s6KRR

Cheers!
Nick
​


Re: spark spark-ec2 credentials using aws_security_token

2015-07-27 Thread Nicholas Chammas
You refer to `aws_security_token`, but I'm not sure where you're specifying
it. Can you elaborate? Is it an environment variable?

On Mon, Jul 27, 2015 at 4:21 AM Jan Zikeš jan.zi...@centrum.cz wrote:

 Hi,

 I would like to ask if it is currently possible to use the spark-ec2 script
 together with credentials that consist not only of
 aws_access_key_id and aws_secret_access_key, but also contain an
 aws_security_token.

 When I try to run the script I am getting following error message:

 ERROR:boto:Caught exception reading instance data
 Traceback (most recent call last):
   File /Users/zikes/opensource/spark/ec2/lib/boto-2.34.0/boto/utils.py,
 line 210, in retry_url
 r = opener.open(req, timeout=timeout)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 404, in open
 response = self._open(req, data)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 422, in _open
 '_open', req)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 382, in _call_chain
 result = func(*args)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 1214, in http_open
 return self.do_open(httplib.HTTPConnection, req)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 1184, in do_open
 raise URLError(err)
 URLError: urlopen error [Errno 64] Host is down
 ERROR:boto:Unable to read instance data, giving up
 No handler was ready to authenticate. 1 handlers were checked.
 ['QuerySignatureV2AuthHandler'] Check your credentials

 Does anyone have some idea what could possibly be wrong? Is aws_security_token
 the problem?
 I know that it seems more like a boto problem, but still I would like to
 ask
 if anybody has some experience with this?

 My launch command is:
 ./spark-ec2 -k my_key -i my_key.pem --additional-tags
 mytag:tag1,mytag2:tag2 --instance-profile-name profile1 -s 1 launch
 test

 Thank you in advance for any help.
 Best regards,

 Jan

 Note:
 I have also asked at

 http://stackoverflow.com/questions/31583513/spark-spark-ec2-credentials-using-aws-security-token?noredirect=1#comment51151822_31583513
 without any success.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-spark-ec2-credentials-using-aws-security-token-tp24007.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: spark ec2 as non-root / any plan to improve that in the future ?

2015-07-09 Thread Nicholas Chammas
No plans to change that at the moment, but agreed it is against accepted
convention. It would be a lot of work to change the tool, change the AMIs,
and test everything. My suggestion is not to hold your breath for such a
change.

spark-ec2, as far as I understand, is not intended for spinning up
permanent or production infrastructure (though people may use it for those
purposes), so there isn't a big impetus to fix this kind of issue. It works
really well for what it was intended for: spinning up clusters for testing,
prototyping, and experimenting.

Nick

On Thu, Jul 9, 2015 at 3:25 AM matd matd...@gmail.com wrote:

 Hi,

 Spark ec2 scripts are useful, but they install everything as root.
 AFAIK, it's not a good practice ;-)

 Why is it so ?
 Should these scripts be reserved for test/demo purposes, and not be used for
 a production system?
 Is it planned in some roadmap to improve that, or to replace ec2-scripts
 with something else ?

 Would it be difficult to change them to use a sudo-er instead ?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-as-non-root-any-plan-to-improve-that-in-the-future-tp23734.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
Yeah, you shouldn't have to rename the columns before joining them.

Do you see the same behavior on 1.3 vs 1.4?

Nick
On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl a...@whisperstream.com wrote:

 still feels like a bug to have to create unique names before a join.

 On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote:

 You can declare the schema with unique names before creation of df.
 On 27 Jun 2015 13:01, Axel Dahl a...@whisperstream.com wrote:


 I have the following code:

 from pyspark import SQLContext

 d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
 d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice',
 'country': 'ire', 'colour':'green'}]

 r1 = sc.parallelize(d1)
 r2 = sc.parallelize(d2)

 sqlContext = SQLContext(sc)
 df1 = sqlContext.createDataFrame(d1)
 df2 = sqlContext.createDataFrame(d2)
 df1.join(df2, df1.name == df2.name and df1.country == df2.country,
 'left_outer').collect()


 When I run it I get the following (notice in the first row, all join
 keys are taken from the right side and so are blanked out):

 [Row(age=2, country=None, name=None, colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 I would expect to get (though ideally without duplicate columns):
 [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 The workaround for now is this rather clunky piece of code:
 df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
 'name2').withColumnRenamed('country', 'country2')
 df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
 'left_outer').collect()

 So to me it looks like a bug, but am I doing something wrong?

 Thanks,

 -Axel
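
A minimal PySpark sketch (not from the thread above; it assumes an existing
SparkContext `sc`) of the same join written with the Column `&` operator:
Python's `and` cannot combine two Column expressions into one compound
condition, so that is worth ruling out before treating this as a bug.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
      {'name': 'alice', 'country': 'jpn', 'age': 2},
      {'name': 'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
      {'name': 'alice', 'country': 'ire', 'colour': 'green'}]

df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)

# Build the compound condition with & (not `and`) and parenthesize each
# comparison so operator precedence doesn't bite.
cond = (df1.name == df2.name) & (df1.country == df2.country)
df1.join(df2, cond, 'left_outer').collect()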








Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
I would test it against 1.3 to be sure, because it could -- though unlikely
-- be a regression. For example, I recently stumbled upon this issue
https://issues.apache.org/jira/browse/SPARK-8670 which was specific to
1.4.

On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl a...@whisperstream.com wrote:

 I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's
 code would be failing right now.

 On Saturday, June 27, 2015, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Yeah, you shouldn't have to rename the columns before joining them.

 Do you see the same behavior on 1.3 vs 1.4?

 Nick
 On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl a...@whisperstream.com wrote:

 still feels like a bug to have to create unique names before a join.

 On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote:

 You can declare the schema with unique names before creation of df.
 On 27 Jun 2015 13:01, Axel Dahl a...@whisperstream.com wrote:


 I have the following code:

 from pyspark import SQLContext

 d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
 d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
 {'name':'alice', 'country': 'ire', 'colour':'green'}]

 r1 = sc.parallelize(d1)
 r2 = sc.parallelize(d2)

 sqlContext = SQLContext(sc)
 df1 = sqlContext.createDataFrame(d1)
 df2 = sqlContext.createDataFrame(d2)
 df1.join(df2, df1.name == df2.name and df1.country == df2.country,
 'left_outer').collect()


 When I run it I get the following (notice in the first row, all join
 keys are taken from the right side and so are blanked out):

 [Row(age=2, country=None, name=None, colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 I would expect to get (though ideally without duplicate columns):
 [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 The workaround for now is this rather clunky piece of code:
 df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
 'name2').withColumnRenamed('country', 'country2')
 df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
 'left_outer').collect()

 So to me it looks like a bug, but am I doing something wrong?

 Thanks,

 -Axel








Re: Required settings for permanent HDFS Spark on EC2

2015-06-05 Thread Nicholas Chammas
If your problem is that stopping/starting the cluster resets configs, then
you may be running into this issue:

https://issues.apache.org/jira/browse/SPARK-4977

Nick

On Thu, Jun 4, 2015 at 2:46 PM barmaley o...@solver.com wrote:

 Hi - I'm having a similar problem with switching from ephemeral to persistent
 HDFS - it always looks for port 9000 regardless of the options I set for 9010
 persistent HDFS. Have you figured out a solution? Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Required-settings-for-permanent-HDFS-Spark-on-EC2-tp22860p23157.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-20 Thread Nicholas Chammas
To put this on the devs' radar, I suggest creating a JIRA for it (and
checking first if one already exists).

issues.apache.org/jira/

Nick

On Tue, May 19, 2015 at 1:34 PM Matei Zaharia matei.zaha...@gmail.com
wrote:

 Yeah, this definitely seems useful there. There might also be some ways to
 cap the application in Mesos, but I'm not sure.

 Matei

 On May 19, 2015, at 1:11 PM, Thomas Dudziak tom...@gmail.com wrote:

 I'm using fine-grained for a multi-tenant environment which is why I would
 welcome the limit of tasks per job :)

 cheers,
 Tom

 On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Hey Tom,

 Are you using the fine-grained or coarse-grained scheduler? For the
 coarse-grained scheduler, there is a spark.cores.max config setting that
 will limit the total # of cores it grabs. This was there in earlier
 versions too.

 Matei

  On May 19, 2015, at 12:39 PM, Thomas Dudziak tom...@gmail.com wrote:
 
  I read the other day that there will be a fair number of improvements
 in 1.4 for Mesos. Could I ask for one more (if it isn't already in there):
 a configurable limit for the number of tasks for jobs run on Mesos ? This
 would be a very simple yet effective way to prevent a job dominating the
 cluster.
 
  cheers,
  Tom
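
For reference, a minimal PySpark sketch of capping an application under the
coarse-grained Mesos scheduler mentioned above; the master URL and the numbers
are illustrative only.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
        .setAppName("capped-app")
        .set("spark.mesos.coarse", "true")   # coarse-grained mode
        .set("spark.cores.max", "48"))       # total cores this app may take
sc = SparkContext(conf=conf)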
 






Re: Virtualenv pyspark

2015-05-08 Thread Nicholas Chammas
This is an interesting question. I don't have a solution for you, but you
may be interested in taking a look at Anaconda Cluster
http://continuum.io/anaconda-cluster.

It's made by the same people behind Conda (an alternative to pip focused on
data science packages) and may offer a better way of doing this. Haven't
used it though.

On Thu, May 7, 2015 at 5:20 PM alemagnani ale.magn...@gmail.com wrote:

 I am currently using pyspark with a virtualenv.
 Unfortunately I don't have access to the nodes' file system and therefore I
 cannot manually copy the virtualenv over there.

 I have been using this technique:

 I first add a tar ball with the venv
 sc.addFile(virtual_env_tarball_file)

 Then in the code used on the node to do the computation I activate the venv
 like this:
 venv_location = SparkFiles.get(venv_name)
 activate_env = "%s/bin/activate_this.py" % venv_location
 execfile(activate_env, dict(__file__=activate_env))

 Is there a better way to do this?
 One of the problem with this approach is that in
 spark/python/pyspark/statcounter.py numpy is imported
 before the venv is activated and this can cause conflicts with the venv
 numpy.

 Moreover this requires the venv to be sent around in the cluster all the
 time.
 Any suggestions?




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Virtualenv-pyspark-tp22803.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
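
For what it's worth, a condensed sketch of the ship-and-activate pattern
described above (Python 2 era). The archive name, the extraction step, and the
top-level "venv" directory are illustrative assumptions, not part of the
original post, and concurrent tasks extracting into the same work dir is not
handled here.

import os
import tarfile

from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="venv-sketch")
sc.addFile("/local/path/venv.tar.gz")  # shipped to each executor's work dir

def with_venv(rows):
    archive = SparkFiles.get("venv.tar.gz")
    venv_dir = os.path.join(os.path.dirname(archive), "venv")
    if not os.path.isdir(venv_dir):               # extract once per work dir
        tarfile.open(archive).extractall(os.path.dirname(archive))
    activate = os.path.join(venv_dir, "bin", "activate_this.py")
    execfile(activate, dict(__file__=activate))   # activate in this worker
    for row in rows:
        yield row  # imports from the venv resolve from here on

sc.parallelize(range(4), 2).mapPartitions(with_venv).collect()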




Re: How to deploy self-build spark source code on EC2

2015-04-28 Thread Nicholas Chammas
[-dev] [+user]

This is a question for the user list, not the dev list.

Use the --spark-version and --spark-git-repo options to specify your own
repo and hash to deploy.

Source code link.
https://github.com/apache/spark/blob/268c419f1586110b90e68f98cd000a782d18828c/ec2/spark_ec2.py#L189-L195

Nick

On Tue, Apr 28, 2015 at 12:14 PM Bo Fu b...@uchicago.edu wrote:

Hi all,

 I have an issue. I added some timestamps in Spark source code and built it
 using:

 mvn package -DskipTests

 I checked the new version on my own computer and it works. However, when I
 ran Spark on EC2, the Spark code the EC2 machines ran was the original version.

 Does anyone know how to deploy the changed Spark source code to EC2?
 Thx a lot


 Bo Fu

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org

  ​


Re: Querying Cluster State

2015-04-26 Thread Nicholas Chammas
The Spark web UI offers a JSON interface with some of this information.

http://stackoverflow.com/a/29659630/877069

It's not an official API, so be warned that it may change unexpectedly
between versions, but you might find it helpful.

Nick
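
A small sketch of reading that JSON endpoint from a standalone master
(Python 2 era). The host name is illustrative, and the field names below are
unofficial observations that may differ between Spark versions.

import json
import urllib2

state = json.load(
    urllib2.urlopen("http://spark-master.example.com:8080/json"))

alive_workers = [w for w in state.get("workers", [])
                 if w.get("state") == "ALIVE"]
print("master status: %s, alive workers: %d"
      % (state.get("status"), len(alive_workers)))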

On Sun, Apr 26, 2015 at 9:46 AM michal.klo...@gmail.com 
michal.klo...@gmail.com wrote:

 Not sure if there's a spark native way but we've been using consul for
 this.

 M



 On Apr 26, 2015, at 5:17 AM, James King jakwebin...@gmail.com wrote:

 Thanks for the response.

 But no this does not answer the question.

 The question was: Is there a way (via some API call) to query the number
 and type of daemons currently running in the Spark cluster.

 Regards


 On Sun, Apr 26, 2015 at 10:12 AM, ayan guha guha.a...@gmail.com wrote:

 In my limited understanding, there must be a single leader master in
 the cluster. If there are multiple leaders, it will lead to an unstable
 cluster as each master will keep scheduling independently. You should use
 ZooKeeper for HA, so that standby masters can vote to find a new leader if
 the primary goes down.

 Now, you can still have multiple masters running as leaders but
 conceptually they should be thought as different clusters.

 Regarding workers, they should follow their master.

 Not sure if this answers your question, as I am sure you have read the
 documentation thoroughly.

 Best
 Ayan

 On Sun, Apr 26, 2015 at 6:31 PM, James King jakwebin...@gmail.com
 wrote:

 Say I have 5 nodes and I wish to maintain 1 Master and 2 Workers on each
 node, so in total I will have 5 Masters and 10 Workers.

 Now to maintain that setup I would like to query Spark regarding the
 number of Masters and Workers that are currently available using API calls and
 then take some appropriate action based on the information I get back, like
 restarting a dead Master or Worker.

 Is this possible? does Spark provide such API?




 --
 Best Regards,
 Ayan Guha





Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Nabble is a third-party site that tries its best to archive mail sent out
over the list. Nothing guarantees it will be in sync with the real mailing
list.

To get the truth on what was sent over this, Apache-managed list, you
unfortunately need to go the Apache archives:
http://mail-archives.apache.org/mod_mbox/spark-user/

Nick

On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
 wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not, for
 example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Sorry if this is a total noob question but is there a reason why I'm only
 seeing folks' responses to my posts in emails but not in the browser view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a matter of
 setting your preferences such that your responses only go to email and
 never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Sure, you can use Nabble or search-hadoop or whatever you prefer.

My point is just that the source of truth is the Apache archives, and
these other sites may or may not be in sync with that truth.

On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote:

 I prefer using search-hadoop.com which provides better search capability.

 Cheers

 On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Nabble is a third-party site that tries its best to archive mail sent out
 over the list. Nothing guarantees it will be in sync with the real mailing
 list.

 To get the truth on what was sent over this, Apache-managed list, you
 unfortunately need to go the Apache archives:
 http://mail-archives.apache.org/mod_mbox/spark-user/

 Nick

 On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
 wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not,
 for example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg dgoldenberg...@gmail.com
 wrote:

 Sorry if this is a total noob question but is there a reason why I'm
 only
 seeing folks' responses to my posts in emails but not in the browser
 view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a matter
 of
 setting your preferences such that your responses only go to email and
 never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Yes, that is mostly why these third-party sites have sprung up around the
official archives--to provide better search. Did you try the link Ted
posted?

On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:

 It seems that those archives are not necessarily easy to find stuff in. Is
 there a search engine on top of them? so as to find e.g. your own posts
 easily?

 On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Sure, you can use Nabble or search-hadoop or whatever you prefer.

 My point is just that the source of truth is the Apache archives, and
 these other sites may or may not be in sync with that truth.

 On Thu, Mar 19, 2015 at 10:20 AM Ted Yu yuzhih...@gmail.com wrote:

 I prefer using search-hadoop.com which provides better search
 capability.

 Cheers

 On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Nabble is a third-party site that tries its best to archive mail sent
 out over the list. Nothing guarantees it will be in sync with the real
 mailing list.

 To get the truth on what was sent over this, Apache-managed list, you
 unfortunately need to go the Apache archives:
 http://mail-archives.apache.org/mod_mbox/spark-user/

 Nick

 On Thu, Mar 19, 2015 at 5:18 AM Ted Yu yuzhih...@gmail.com wrote:

 There might be some delay:


 http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responsessubj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view


 On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 Thanks, Ted. Well, so far even there I'm only seeing my post and not,
 for example, your response.

 On Wed, Mar 18, 2015 at 7:28 PM, Ted Yu yuzhih...@gmail.com wrote:

 Was this one of the threads you participated ?
 http://search-hadoop.com/m/JW1q5w0p8x1

 You should be able to find your posts on search-hadoop.com

 On Wed, Mar 18, 2015 at 3:21 PM, dgoldenberg 
 dgoldenberg...@gmail.com wrote:

 Sorry if this is a total noob question but is there a reason why I'm
 only
 seeing folks' responses to my posts in emails but not in the browser
 view
 under apache-spark-user-list.1001560.n3.nabble.com?  Is this a
 matter of
 setting your preferences such that your responses only go to email
 and never
 to the browser-based view of the list? I don't seem to see such a
 preference...



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-User-List-people-s-responses-not-showing-in-the-browser-view-tp22135.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: Processing of text file in large gzip archive

2015-03-16 Thread Nicholas Chammas
You probably want to update this line as follows:

lines = sc.textFile('file.gz').repartition(sc.defaultParallelism * 3)

For more details on why, see this answer
http://stackoverflow.com/a/27631722/877069.

Nick
​

On Mon, Mar 16, 2015 at 6:50 AM Marius Soutier mps@gmail.com wrote:

 1. I don't think textFile is capable of unpacking a .gz file. You need to
 use hadoopFile or newAPIHadoopFile for this.


 Sorry that’s incorrect, textFile works fine on .gz files. What it can’t do
 is compute splits on gz files, so if you have a single file, you'll have a
 single partition.

 Processing 30 GB of gzipped data should not take that long, at least with
 the Scala API. Python not sure, especially under 1.2.1.
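
A quick way to see the effect described above (a single .gz file is not
splittable, so it arrives as one partition until it is repartitioned); the
file path is illustrative.

lines = sc.textFile("s3n://my-bucket/big-file.gz")
print(lines.getNumPartitions())          # 1 for a single gzip file
lines = lines.repartition(sc.defaultParallelism * 3)
print(lines.getNumPartitions())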




Re: Posting to the list

2015-02-23 Thread Nicholas Chammas
Nabble is a third-party site. If you send stuff through Nabble, Nabble has
to forward it along to the Apache mailing list. If something goes wrong
with that, you will have a message show up on Nabble that no-one saw.

The reverse can also happen, where something actually goes out on the list
and doesn't make it to Nabble.

Nabble is a nicer, third-party interface to the Apache list archives. No
more. It works best for reading through old threads.

Apache is the source of truth. Post through there.

Unfortunately, this is what we're stuck with. For a related
discussion, see this
thread about Discourse
http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-td20851.html
.

Nick

On Sun Feb 22 2015 at 8:07:08 PM haihar nahak harihar1...@gmail.com wrote:

 I checked it but I didn't see any mail from user list. Let me do it one
 more time.

 [image: Inline image 1]

 --Harihar

 On Mon, Feb 23, 2015 at 11:50 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. i didnt get any new subscription mail in my inbox.

 Have you checked your Spam folder ?

 Cheers

 On Sun, Feb 22, 2015 at 2:36 PM, hnahak harihar1...@gmail.com wrote:

 I'm also facing the same issue. This is the third time: whenever I post
 anything, it is never accepted by the community, and at the same time I get a
 failure mail at my registered mail ID.

 And when I click the "subscribe to this mailing list" link, I didn't get any
 new subscription mail in my inbox.

 Please can anyone suggest the best way to subscribe the email ID?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 {{{H2N}}}-(@:



Re: Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-23 Thread Nicholas Chammas
I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
but still, it should work…

Nope, it shouldn’t, unfortunately. The Spark base AMIs are custom-built for
spark-ec2. No other AMI will work unless it was built with that goal in
mind. Using a random AMI from the Amazon marketplace is unlikely to work
because there are several tools and packages (e.g. git) that need to
be on the AMI.

Furthermore, the spark-ec2 scripts all assume a yum-based Linux
distribution, so you won’t be able to use Ubuntu (an apt-get-based distro)
without some significant changes to the shell scripts used to build the AMI.

There is some work ongoing as part of SPARK-3821
https://issues.apache.org/jira/browse/SPARK-3821 to make it easier to
generate AMIs that work with spark-ec2.

Nick
​

On Sun Feb 22 2015 at 7:42:52 PM Ted Yu yuzhih...@gmail.com wrote:

 bq. bash: git: command not found

 Looks like the AMI doesn't have git pre-installed.

 Cheers

 On Sun, Feb 22, 2015 at 4:29 PM, olegshirokikh o...@solver.com wrote:

 I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu)
 using
 the following:

 ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem'
 --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2
 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch
 spark-ubuntu-cluster

 Everything starts OK and instances are launched:

 Found 1 master(s), 2 slaves
 Waiting for all instances in cluster to enter 'ssh-ready' state.
 Generating cluster's SSH key on master.

 But then I'm getting the following SSH errors until it stops trying and
 quits:

 bash: git: command not found
 Connection to ***.us-west-2.compute.amazonaws.com closed.
 Error executing remote command, retrying after 30 seconds: Command
 '['ssh',
 '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o',
 'UserKnownHostsFile=/dev/null', '-t', '-t',
 u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2  git
 clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero
 exit
 status 127

 I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
 but still, it should work... Any advice would be greatly appreciated!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: SQLContext.applySchema strictness

2015-02-14 Thread Nicholas Chammas
Would it make sense to add an optional validate parameter to applySchema()
which defaults to False, both to give users the option to check the schema
immediately and to make the default behavior clearer?
​

On Sat Feb 14 2015 at 9:18:59 AM Michael Armbrust mich...@databricks.com
wrote:

 Doing runtime type checking is very expensive, so we only do it when
 necessary (i.e. you perform an operation like adding two columns together)

 On Sat, Feb 14, 2015 at 2:19 AM, nitin nitin2go...@gmail.com wrote:

 AFAIK, this is the expected behavior. You have to make sure that the
 schema
 matches the row. It won't give any error when you apply the schema as it
 doesn't validate the nature of data.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-applySchema-strictness-tp21650p21653.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
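
In the meantime, a rough user-side sketch of eager validation before
applySchema (not an existing Spark option). It costs an extra pass over the
data, which is the expense mentioned above; the schema and checks are
illustrative, imports follow the Spark 1.3-style pyspark.sql.types module, and
an existing sc and sqlContext are assumed.

from pyspark.sql.types import IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), False),
])

def check(row):
    # Minimal structural checks: right arity, no nulls in non-nullable fields.
    if len(row) != len(schema.fields):
        raise ValueError("Row does not match schema: %r" % (row,))
    for value, field in zip(row, schema.fields):
        if value is None and not field.nullable:
            raise ValueError("Null in non-nullable field %s: %r"
                             % (field.name, row))
    return row

rdd = sc.parallelize([("bob", 1), ("alice", 2)])
validated = rdd.map(check)
validated.count()  # force the pass so bad rows surface now
df = sqlContext.applySchema(validated, schema)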





Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
OK, good luck!

On Mon Feb 09 2015 at 6:41:14 PM Guodong Wang wangg...@gmail.com wrote:

 Hi Nicholas,

 Thanks for your quick reply.

 I'd like to try to build an image with create_image.sh. Then let's see how
 we can launch a Spark cluster in region cn-north-1.



 Guodong

 On Tue, Feb 10, 2015 at 3:59 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Guodong,

 spark-ec2 does not currently support the cn-north-1 region, but you can
 follow [SPARK-4241](https://issues.apache.org/jira/browse/SPARK-4241) to
 find out when it does.

 The base AMI used to generate the current Spark AMIs is very old. I'm not
 sure anyone knows what it is anymore. What I know is that it is an Amazon
 Linux AMI.

 Yes, the create_image.sh script is what is used to generate the current
 Spark AMI.

 Nick

 On Mon Feb 09 2015 at 3:27:13 AM Franc Carter 
 franc.car...@rozettatech.com wrote:


 Hi,

 I'm very new to Spark, but experienced with AWS - so take that into
 account with my suggestions.

 I started with an AWS base image and then added the pre-built Spark 1.2.
 I then made a 'Master' version and a 'Worker' version and then made
 AMIs for them.

 The Master comes up with a static IP and the Worker image has this baked
 in. I haven't completed everything I am planning to do, but so far I can
 bring up the Master and a bunch of Workers inside an ASG and run Spark
 code successfully.

 cheers


 On Mon, Feb 9, 2015 at 10:06 PM, Guodong Wang wangg...@gmail.com
 wrote:

 Hi guys,

 I want to launch spark cluster in AWS. And I know there is a
 spark_ec2.py script.

 I am using the AWS service in China. But I can not find the AMI in the
 region of China.

 So, I have to build one. My question is
 1. Where is the bootstrap script to create the Spark AMI? Is it here(
 https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ?
 2. What is the base image of the Spark AMI? Eg, the base image of this (
 https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm
 )
 3. Shall I install scala during building the AMI?


 Thanks.

 Guodong




 --

 *Franc Carter* | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com | www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA





Re: How to create spark AMI in AWS

2015-02-09 Thread Nicholas Chammas
Guodong,

spark-ec2 does not currently support the cn-north-1 region, but you can
follow [SPARK-4241](https://issues.apache.org/jira/browse/SPARK-4241) to
find out when it does.

The base AMI used to generate the current Spark AMIs is very old. I'm not
sure anyone knows what it is anymore. What I know is that it is an Amazon
Linux AMI.

Yes, the create_image.sh script is what is used to generate the current
Spark AMI.

Nick

On Mon Feb 09 2015 at 3:27:13 AM Franc Carter franc.car...@rozettatech.com
wrote:


 Hi,

 I'm very new to Spark, but experienced with AWS - so take that into
 account with my suggestions.

 I started with an AWS base image and then added the pre-built Spark 1.2. I
 then made a 'Master' version and a 'Worker' version and then made
 AMIs for them.

 The Master comes up with a static IP and the Worker image has this baked
 in. I haven't completed everything I am planning to do, but so far I can
 bring up the Master and a bunch of Workers inside an ASG and run Spark
 code successfully.

 cheers


 On Mon, Feb 9, 2015 at 10:06 PM, Guodong Wang wangg...@gmail.com wrote:

 Hi guys,

 I want to launch spark cluster in AWS. And I know there is a spark_ec2.py
 script.

 I am using the AWS service in China. But I can not find the AMI in the
 region of China.

 So, I have to build one. My question is
 1. Where is the bootstrap script to create the Spark AMI? Is it here(
 https://github.com/mesos/spark-ec2/blob/branch-1.3/create_image.sh) ?
 2. What is the base image of the Spark AMI? Eg, the base image of this (
 https://github.com/mesos/spark-ec2/blob/branch-1.3/ami-list/us-west-1/hvm
 )
 3. Shall I install scala during building the AMI?


 Thanks.

 Guodong




 --

 *Franc Carter* | Systems Architect | Rozetta Technology

 franc.car...@rozettatech.com | www.rozettatechnology.com

 Tel: +61 2 8355 2515

 Level 4, 55 Harrington St, The Rocks NSW 2000

 PO Box H58, Australia Square, Sydney NSW 1215

 AUSTRALIA




Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Hmm, I can’t see why using ~ would be problematic, especially if you
confirm that echo ~/path/to/pem expands to the correct path to your
identity file.

If you have a simple reproduction of the problem, please send it over. I’d
love to look into this. When I pass paths with ~ to spark-ec2 on my system,
it works fine. I’m using bash, but zsh handles tilde expansion the same as
bash.

Nick
​

On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke charles.fed...@gmail.com
wrote:

 It was only hanging when I specified the path with ~ I never tried
 relative.

 Hanging on the waiting for ssh to be ready on all hosts. I let it sit for
 about 10 minutes then I found the StackOverflow answer that suggested
 specifying an absolute path, cancelled, and re-run with --resume and the
 absolute path and all slaves were up in a couple minutes.

 (I've stood up 4 integration clusters and 2 production clusters on EC2
 since with no problems.)

 On Wed Jan 28 2015 at 12:05:43 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Ey-chih,

 That makes more sense. This is a known issue that will be fixed as part
 of SPARK-5242 https://issues.apache.org/jira/browse/SPARK-5242.

 Charles,

 Thanks for the info. In your case, when does spark-ec2 hang? Only when
 the specified path to the identity file doesn't exist? Or also when you
 specify the path as a relative path or with ~?

 Nick


 On Wed Jan 28 2015 at 9:29:34 AM ey-chih chow eyc...@hotmail.com wrote:

 We found the problem and already fixed it.  Basically, spark-ec2
 requires EC2 instances to have external IP addresses. You need to specify
 this in the AWS console.
 --
 From: nicholas.cham...@gmail.com
 Date: Tue, 27 Jan 2015 17:19:21 +
 Subject: Re: spark 1.2 ec2 launch script hang
 To: charles.fed...@gmail.com; pzybr...@gmail.com; eyc...@hotmail.com
 CC: user@spark.apache.org


 For those who found that absolute vs. relative path for the pem file
 mattered, what OS and shell are you using? What version of Spark are you
 using?

 ~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/ to
 the absolute path before sending it to spark-ec2. (i.e. tilde
 expansion.)

 Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t
 matter, since we fixed that for Spark 1.2.0
 https://issues.apache.org/jira/browse/SPARK-4137. Maybe there’s some
 case that we missed?

 Nick

 On Tue Jan 27 2015 at 10:10:29 AM Charles Feduke 
 charles.fed...@gmail.com wrote:


 Absolute path means no ~ and also verify that you have the path to the
 file correct. For some reason the Python code does not validate that the
 file exists and will hang (this is the same reason why ~ hangs).
 On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick pzybr...@gmail.com
 wrote:

 Try using an absolute path to the pem file



  On Jan 26, 2015, at 8:57 PM, ey-chih chow eyc...@hotmail.com wrote:
 
  Hi,
 
  I used the spark-ec2 script of spark 1.2 to launch a cluster.  I have
  modified the script according to
 
  https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab
 9753aae939b3bb33be953e2c13a70
 
  But the script was still hung at the following message:
 
  Waiting for cluster to enter 'ssh-ready'
  state.
 
  Any additional thing I should do to make it succeed?  Thanks.
 
 
  Ey-Chih Chow
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


 ​




Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
Thanks for sending this over, Peter.

What if you try this? (i.e. Remove the = after --identity-file.)

ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file
~/.pzkeys/spark-streaming-kp.pem  --region=us-east-1 login
pz-spark-cluster

If that works, then I think the problem in this case is simply that Bash
cannot expand the tilde because it’s stuck to the --identity-file=. This
isn’t a problem with spark-ec2.

Bash sees the --identity-file=~/.pzkeys/spark-streaming-kp.pem as one big
argument, so it can’t do tilde expansion.

Nick
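
A tiny illustration of the point (purely a sketch, not spark-ec2 code): a
script can both expand a stray ~ itself and fail fast when the identity file
does not exist, instead of hanging later.

import os.path
import sys

identity_file = "~/.pzkeys/spark-streaming-kp.pem"   # as received from the CLI
identity_file = os.path.expanduser(identity_file)    # "~" -> home directory
if not os.path.isfile(identity_file):
    sys.exit("Identity file not found: %s" % identity_file)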
​

On Wed Jan 28 2015 at 9:17:06 PM Peter Zybrick pzybr...@gmail.com wrote:

 Below is trace from trying to access with ~/path.  I also did the echo as
 per Nick (see the last line), looks ok to me.  This is my development box
 with Spark 1.2.0 running CentOS 6.5, Python 2.6.6

 [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ ec2/spark-ec2
 --key-pair=spark-streaming-kp
 --identity-file=~/.pzkeys/spark-streaming-kp.pem  --region=us-east-1 login
 pz-spark-cluster
 Searching for existing cluster pz-spark-cluster...
 Found 1 master(s), 3 slaves
 Logging into master ec2-54-152-95-129.compute-1.amazonaws.com...
 Warning: Identity file ~/.pzkeys/spark-streaming-kp.pem not accessible: No
 such file or directory.
 Permission denied (publickey).
 Traceback (most recent call last):
   File ec2/spark_ec2.py, line 1082, in module
 main()
   File ec2/spark_ec2.py, line 1074, in main
 real_main()
   File ec2/spark_ec2.py, line 1007, in real_main
 ssh_command(opts) + proxy_opt + ['-t', '-t', %s@%s % (opts.user,
 master)])
   File /usr/lib64/python2.6/subprocess.py, line 505, in check_call
 raise CalledProcessError(retcode, cmd)
 subprocess.CalledProcessError: Command '['ssh', '-o',
 'StrictHostKeyChecking=no', '-i', '~/.pzkeys/spark-streaming-kp.pem', '-t',
 '-t', u'r...@ec2-54-152-95-129.compute-1.amazonaws.com']' returned
 non-zero exit status 255
 [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ echo
 ~/.pzkeys/spark-streaming-kp.pem
 /home/pete.zybrick/.pzkeys/spark-streaming-kp.pem


 On Wed, Jan 28, 2015 at 3:49 PM, Charles Feduke charles.fed...@gmail.com
 wrote:

 Yeah, I agree ~ should work. And it could have been [read: probably was]
 the fact that one of the EC2 hosts was in my known_hosts (don't know, never
 saw an error message, but the behavior is no error message for that state),
 which I had fixed later with Pete's patch. But the second execution when
 things worked with an absolute path could have worked because the random
 hosts that came up on EC2 were never in my known_hosts.


 On Wed Jan 28 2015 at 3:45:36 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Hmm, I can’t see why using ~ would be problematic, especially if you
 confirm that echo ~/path/to/pem expands to the correct path to your
 identity file.

 If you have a simple reproduction of the problem, please send it over.
 I’d love to look into this. When I pass paths with ~ to spark-ec2 on my
 system, it works fine. I’m using bash, but zsh handles tilde expansion the
 same as bash.

 Nick
 ​

 On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke 
 charles.fed...@gmail.com wrote:

 It was only hanging when I specified the path with ~ I never tried
 relative.

 Hanging on the waiting for ssh to be ready on all hosts. I let it sit
 for about 10 minutes then I found the StackOverflow answer that suggested
 specifying an absolute path, cancelled, and re-run with --resume and the
 absolute path and all slaves were up in a couple minutes.

 (I've stood up 4 integration clusters and 2 production clusters on EC2
 since with no problems.)

 On Wed Jan 28 2015 at 12:05:43 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Ey-chih,

 That makes more sense. This is a known issue that will be fixed as
 part of SPARK-5242 https://issues.apache.org/jira/browse/SPARK-5242.

 Charles,

 Thanks for the info. In your case, when does spark-ec2 hang? Only when
 the specified path to the identity file doesn't exist? Or also when you
 specify the path as a relative path or with ~?

 Nick


 On Wed Jan 28 2015 at 9:29:34 AM ey-chih chow eyc...@hotmail.com
 wrote:

 We found the problem and already fixed it.  Basically, spark-ec2
 requires EC2 instances to have external IP addresses. You need to specify
 this in the AWS console.
 --
 From: nicholas.cham...@gmail.com
 Date: Tue, 27 Jan 2015 17:19:21 +
 Subject: Re: spark 1.2 ec2 launch script hang
 To: charles.fed...@gmail.com; pzybr...@gmail.com; eyc...@hotmail.com
 CC: user@spark.apache.org


 For those who found that absolute vs. relative path for the pem file
 mattered, what OS and shell are you using? What version of Spark are you
 using?

 ~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/
 to the absolute path before sending it to spark-ec2. (i.e. tilde
 expansion.)

 Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t
 matter, since we fixed that for Spark 1.2.0
 https://issues.apache.org/jira/browse/SPARK

Re: spark 1.2 ec2 launch script hang

2015-01-28 Thread Nicholas Chammas
If that was indeed the problem, I suggest updating your answer on SO
http://stackoverflow.com/a/28005151/877069 to help others who may run
into this same problem.
​

On Wed Jan 28 2015 at 9:40:39 PM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Thanks for sending this over, Peter.

 What if you try this? (i.e. Remove the = after --identity-file.)

 ec2/spark-ec2 --key-pair=spark-streaming-kp --identity-file 
 ~/.pzkeys/spark-streaming-kp.pem  --region=us-east-1 login pz-spark-cluster

 If that works, then I think the problem in this case is simply that Bash
 cannot expand the tilde because it’s stuck to the --identity-file=. This
 isn’t a problem with spark-ec2.

 Bash sees the --identity-file=~/.pzkeys/spark-streaming-kp.pem as one big
 argument, so it can’t do tilde expansion.

 Nick
 ​

 On Wed Jan 28 2015 at 9:17:06 PM Peter Zybrick pzybr...@gmail.com wrote:

 Below is trace from trying to access with ~/path.  I also did the echo as
 per Nick (see the last line), looks ok to me.  This is my development box
 with Spark 1.2.0 running CentOS 6.5, Python 2.6.6

 [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ ec2/spark-ec2
 --key-pair=spark-streaming-kp 
 --identity-file=~/.pzkeys/spark-streaming-kp.pem
 --region=us-east-1 login pz-spark-cluster
 Searching for existing cluster pz-spark-cluster...
 Found 1 master(s), 3 slaves
 Logging into master ec2-54-152-95-129.compute-1.amazonaws.com...
 Warning: Identity file ~/.pzkeys/spark-streaming-kp.pem not accessible:
 No such file or directory.
 Permission denied (publickey).
 Traceback (most recent call last):
   File ec2/spark_ec2.py, line 1082, in module
 main()
   File ec2/spark_ec2.py, line 1074, in main
 real_main()
   File ec2/spark_ec2.py, line 1007, in real_main
 ssh_command(opts) + proxy_opt + ['-t', '-t', %s@%s % (opts.user,
 master)])
   File /usr/lib64/python2.6/subprocess.py, line 505, in check_call
 raise CalledProcessError(retcode, cmd)
 subprocess.CalledProcessError: Command '['ssh', '-o',
 'StrictHostKeyChecking=no', '-i', '~/.pzkeys/spark-streaming-kp.pem',
 '-t', '-t', u'r...@ec2-54-152-95-129.compute-1.amazonaws.com']' returned
 non-zero exit status 255
 [pete.zybrick@pz-lt2-ipc spark-1.2.0]$ echo ~/.pzkeys/spark-streaming-kp.
 pem
 /home/pete.zybrick/.pzkeys/spark-streaming-kp.pem


 On Wed, Jan 28, 2015 at 3:49 PM, Charles Feduke charles.fed...@gmail.com
  wrote:

 Yeah, I agree ~ should work. And it could have been [read: probably was]
 the fact that one of the EC2 hosts was in my known_hosts (don't know, never
 saw an error message, but the behavior is no error message for that state),
 which I had fixed later with Pete's patch. But the second execution when
 things worked with an absolute path could have worked because the random
 hosts that came up on EC2 were never in my known_hosts.


 On Wed Jan 28 2015 at 3:45:36 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Hmm, I can’t see why using ~ would be problematic, especially if you
 confirm that echo ~/path/to/pem expands to the correct path to your
 identity file.

 If you have a simple reproduction of the problem, please send it over.
 I’d love to look into this. When I pass paths with ~ to spark-ec2 on my
 system, it works fine. I’m using bash, but zsh handles tilde expansion the
 same as bash.

 Nick
 ​

 On Wed Jan 28 2015 at 3:30:08 PM Charles Feduke 
 charles.fed...@gmail.com wrote:

 It was only hanging when I specified the path with ~ I never tried
 relative.

 Hanging on the waiting for ssh to be ready on all hosts. I let it sit
 for about 10 minutes then I found the StackOverflow answer that suggested
 specifying an absolute path, cancelled, and re-run with --resume and the
 absolute path and all slaves were up in a couple minutes.

 (I've stood up 4 integration clusters and 2 production clusters on EC2
 since with no problems.)

 On Wed Jan 28 2015 at 12:05:43 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Ey-chih,

 That makes more sense. This is a known issue that will be fixed as
 part of SPARK-5242 https://issues.apache.org/jira/browse/SPARK-5242
 .

 Charles,

 Thanks for the info. In your case, when does spark-ec2 hang? Only
 when the specified path to the identity file doesn't exist? Or also when
 you specify the path as a relative path or with ~?

 Nick


 On Wed Jan 28 2015 at 9:29:34 AM ey-chih chow eyc...@hotmail.com
 wrote:

 We found the problem and already fixed it.  Basically, spark-ec2
 requires EC2 instances to have external IP addresses. You need to specify
 this in the AWS console.
 --
 From: nicholas.cham...@gmail.com
 Date: Tue, 27 Jan 2015 17:19:21 +
 Subject: Re: spark 1.2 ec2 launch script hang
 To: charles.fed...@gmail.com; pzybr...@gmail.com; eyc...@hotmail.com
 CC: user@spark.apache.org


 For those who found that absolute vs. relative path for the pem file
 mattered, what OS and shell are you using? What version of Spark are you
 using?

 ~/ vs. absolute path shouldn’t matter

Re: saving rdd to multiple files named by the key

2015-01-27 Thread Nicholas Chammas
There is also SPARK-3533 https://issues.apache.org/jira/browse/SPARK-3533,
which proposes to add a convenience method for this.
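Until such a method exists, one simple workaround can be sketched in PySpark as
below (the sample data and output path are made up; it makes one pass over the
data per key, so it is only reasonable when the number of distinct keys is small):

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)]).cache()
for k in pairs.keys().distinct().collect():
    # write the values for one key at a time into a directory named after the key
    (pairs.filter(lambda kv, k=k: kv[0] == k)
          .values()
          .saveAsTextFile("hdfs:///tmp/by-key/%s" % k))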

On Mon Jan 26 2015 at 10:38:56 PM Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 This might be helpful:
 http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job

 On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport sha...@plaid.com wrote:

 Hi,

 I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k].
 I got them by combining many [k,v] by [k]. I could then save to file by
 partitions, but that still doesn't allow me to choose the name, and leaves
 me stuck with foo/part-...

 Any tips?

 Thanks,
 Sharon




Re: spark 1.2 ec2 launch script hang

2015-01-27 Thread Nicholas Chammas
For those who found that absolute vs. relative path for the pem file
mattered, what OS and shell are you using? What version of Spark are you
using?

~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/ to the
absolute path before sending it to spark-ec2. (i.e. tilde expansion.)

Absolute vs. relative path (e.g. ../../path/to/pem) also shouldn’t matter,
since we fixed that for Spark 1.2.0
https://issues.apache.org/jira/browse/SPARK-4137. Maybe there’s some case
that we missed?

Nick

On Tue Jan 27 2015 at 10:10:29 AM Charles Feduke charles.fed...@gmail.com
wrote:

Absolute path means no ~; also verify that the path to the file is
 correct. For some reason the Python code does not validate that the file
 exists and will hang (this is the same reason why ~ hangs).
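 A minimal sketch of that kind of up-front check (not the actual spark_ec2.py
 patch, just the idea): expand a literal ~ in Python and fail fast if the file
 is missing, instead of letting ssh retry until the script looks hung:

 import os
 import sys

 def check_identity_file(path):
     # handles a '~' that the shell did not expand (e.g. because it was quoted)
     expanded = os.path.expanduser(path)
     if not os.path.isfile(expanded):
         sys.stderr.write("ERROR: identity file %s not found\n" % expanded)
         sys.exit(1)
     return expanded
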
 On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick pzybr...@gmail.com wrote:

 Try using an absolute path to the pem file



  On Jan 26, 2015, at 8:57 PM, ey-chih chow eyc...@hotmail.com wrote:
 
  Hi,
 
  I used the spark-ec2 script of spark 1.2 to launch a cluster.  I have
  modified the script according to
 
  https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab
 9753aae939b3bb33be953e2c13a70
 
  But the script was still hung at the following message:
 
  Waiting for cluster to enter 'ssh-ready'
  state.
 
  Any additional thing I should do to make it succeed?  Thanks.
 
 
  Ey-Chih Chow
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/spark-1-2-ec2-launch-script-hang-tp21381.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-24 Thread Nicholas Chammas
I believe databricks provides an rdd interface to redshift. Did you check
spark-packages.org?
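For reference, later Spark releases (the JDBC data source arrived in 1.4; the
snippet below uses the 2.x `spark` session) can push a bounded query down to
Redshift and read it in parallel instead of collecting everything on one
machine. The host, table, credentials, and bounds below are hypothetical, and a
Redshift/PostgreSQL JDBC driver has to be on the classpath (e.g. via --jars):

df = (spark.read.format("jdbc")
      .option("url", "jdbc:redshift://example-cluster:5439/analytics")
      .option("dbtable", "(SELECT id, event_type, ts FROM events) AS t")
      .option("user", "analyst")
      .option("password", "...")
      .option("partitionColumn", "id")   # numeric column to split the read on
      .option("lowerBound", "0")
      .option("upperBound", "10000000")
      .option("numPartitions", "32")
      .load())
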
On 2015년 1월 24일 (토) at 오전 6:45 Denis Mikhalkin deni...@yahoo.com.invalid
wrote:

 Hello,

 we've got some analytics data in AWS Redshift. The data is being
 constantly updated.

 I'd like to be able to write a query against Redshift which would return a
 subset of data, and then run a Spark job (Pyspark) to do some analysis.

 I could not find an RDD which would let me do it OOB (Python), so I tried
 writing my own. For example, I tried a combination of a generator (via yield)
 with parallelize. It appears though that parallelize reads all the data
 first into memory as I get either OOM or Python swaps as soon as I increase
 the number of rows beyond trivial limits.

 I've also looked at Java RDDs (there is an example of MySQL RDD) but it
 seems that it also reads all the data into memory.

 So my question is - how to correctly feed Spark with huge datasets which
 don't initially reside in HDFS/S3 (ideally for Pyspark, but would
 appreciate any tips)?

 Thanks.

 Denis





Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Nicholas Chammas
That sounds good to me. Shall I open a JIRA / PR about updating the site
community page?
On 2015년 1월 23일 (금) at 오전 4:37 Patrick Wendell patr...@databricks.com
wrote:

 Hey Nick,

 So I think what we can do is encourage people to participate on the
 stack overflow topic, and this I think we can do on the Spark website
 as a first class community resource for Spark. We should probably be
 spending more time on that site given its popularity.

 In terms of encouraging this explicitly *to replace* the ASF mailing
 list, that I think is harder to do. The ASF makes a lot of effort to
 host its own infrastructure that is neutral and not associated with
 any corporation. And by and large the ASF policy is to consider that
 as the de-facto forum of communication for any project.

 Personally, I wish the ASF would update this policy - for instance, by
 allowing the use of third party lists or communication fora - provided
 that they allow exporting the conversation if those sites were to
 change course. However, the state of the art stands as such.

 - Patrick

 On Wed, Jan 21, 2015 at 8:43 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Josh / Patrick,
 
  What do y’all think of the idea of promoting Stack Overflow as a place to
  ask questions over this list, as long as the questions fit SO’s
 guidelines
  (how-to-ask, dont-ask)?
 
  The apache-spark tag is very active on there.
 
  Discussions of all types are still on-topic here, but when possible we
 want
  to encourage people to use SO.
 
  Nick
 
  On Wed Jan 21 2015 at 8:37:05 AM Jay Vyas jayunit100.apa...@gmail.com
 wrote:
 
  Its a very valid  idea indeed, but... It's a tricky  subject since the
  entire ASF is run on mailing lists , hence there are so many different
 but
  equally sound ways of looking at this idea, which conflict with one
 another.
 
   On Jan 21, 2015, at 7:03 AM, btiernay btier...@hotmail.com wrote:
  
   I think this is a really great idea for really opening up the
   discussions
   that happen here. Also, it would be nice to know why there doesn't
 seem
   to
   be much interest. Maybe I'm misunderstanding some nuance of Apache
   projects.
  
   Cheers
  
  
  
   --
   View this message in context:
   http://apache-spark-user-list.1001560.n3.nabble.com/
 Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851p21288.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



Re: Discourse: A proposed alternative to the Spark User list

2015-01-23 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK-5390

On Fri Jan 23 2015 at 12:05:00 PM Gerard Maas gerard.m...@gmail.com wrote:

 +1

 On Fri, Jan 23, 2015 at 5:58 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 That sounds good to me. Shall I open a JIRA / PR about updating the site
 community page?
 On 2015년 1월 23일 (금) at 오전 4:37 Patrick Wendell patr...@databricks.com
 wrote:

 Hey Nick,

 So I think what we can do is encourage people to participate on the
 stack overflow topic, and this I think we can do on the Spark website
 as a first class community resource for Spark. We should probably be
 spending more time on that site given its popularity.

 In terms of encouraging this explicitly *to replace* the ASF mailing
 list, that I think is harder to do. The ASF makes a lot of effort to
 host its own infrastructure that is neutral and not associated with
 any corporation. And by and large the ASF policy is to consider that
 as the de-facto forum of communication for any project.

 Personally, I wish the ASF would update this policy - for instance, by
 allowing the use of third party lists or communication fora - provided
 that they allow exporting the conversation if those sites were to
 change course. However, the state of the art stands as such.

 - Patrick


 On Wed, Jan 21, 2015 at 8:43 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Josh / Patrick,
 
  What do y’all think of the idea of promoting Stack Overflow as a place
 to
  ask questions over this list, as long as the questions fit SO’s
 guidelines
  (how-to-ask, dont-ask)?
 
  The apache-spark tag is very active on there.
 
  Discussions of all types are still on-topic here, but when possible we
 want
  to encourage people to use SO.
 
  Nick
 
  On Wed Jan 21 2015 at 8:37:05 AM Jay Vyas jayunit100.apa...@gmail.com
 wrote:
 
  Its a very valid  idea indeed, but... It's a tricky  subject since the
  entire ASF is run on mailing lists , hence there are so many
 different but
  equally sound ways of looking at this idea, which conflict with one
 another.
 
   On Jan 21, 2015, at 7:03 AM, btiernay btier...@hotmail.com wrote:
  
   I think this is a really great idea for really opening up the
   discussions
   that happen here. Also, it would be nice to know why there doesn't
 seem
   to
   be much interest. Maybe I'm misunderstanding some nuance of Apache
   projects.
  
   Cheers
  
  
  
   --
   View this message in context:
   http://apache-spark-user-list.1001560.n3.nabble.com/
 Discourse-A-proposed-alternative-to-the-Spark-User-
 list-tp20851p21288.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
   
 -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 





Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
I agree with Sean that a Spark-specific Stack Exchange likely won't help
and almost certainly won't make it out of Area 51. The idea certainly
sounds nice from our perspective as Spark users, but it doesn't mesh with
the structure of Stack Exchange or the criteria for creating new sites.

On Thu Jan 22 2015 at 1:23:14 PM Sean Owen so...@cloudera.com wrote:

 FWIW I am a moderator for datascience.stackexchange.com, and even that
 hasn't really achieved the critical mass that SE sites are supposed
 to: http://area51.stackexchange.com/proposals/55053/data-science

 I think a Spark site would have a lot less traffic. One annoyance is
 that people can't figure out when to post on SO vs Data Science vs
 Cross Validated. A Spark site would have the same problem,
 fragmentation and cross posting with SO. I don't think this would be
 accepted as a StackExchange site and don't think it helps.

 On Thu, Jan 22, 2015 at 6:16 PM, pierred pie...@demartines.com wrote:
 
  A dedicated stackexchange site for Apache Spark sounds to me like the
  logical solution.  Less trolling, more enthusiasm, and with the
  participation of the people on this list, I think it would very quickly
  become the reference for many technical questions, as well as a great
  vehicle to promote the awesomeness of Spark.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Nicholas Chammas
we could implement some ‘load balancing’ policies:

I think Gerard’s suggestions are good. We need some “official” buy-in from
the project’s maintainers and heavy contributors and we should move forward
with them.

I know that at least Josh Rosen, Sean Owen, and Tathagata Das, who are
active on this list, are also active on SO
http://stackoverflow.com/tags/apache-spark/topusers. So perhaps we’re
already part of the way there.

Nick

On Thu Jan 22 2015 at 5:32:40 AM Gerard Maas gerard.m...@gmail.com wrote:

 I have been contributing to SO for a while now.  Here are a few
 observations I'd like to contribute to the discussion:

 The level of questions on SO is often more entry-level. Harder
 questions (that require expertise in a certain area) remain unanswered for
 a while. Same questions here on the list (as they are often cross-posted)
 receive faster turnaround.
 Roughly speaking, there're two groups of questions: Implementing things on
 Spark and Running Spark.  The second one is borderline on SO guidelines as
 they often involve cluster setups, long logs and little idea of what's
 going on (mind you, often those questions come from people starting with
 Spark)

 In my opinion, Stack Overflow offers a better Q/A experience, in
 particular, they have tooling in place to reduce duplicates, something that
 often overloads this list (same getting started issues or how to map,
 filter, flatmap over and over again).  That said, this list offers a
 richer forum, where the expertise pool is a lot deeper.
 Also, while SO is fairly strict in requiring posters to show a
 minimal amount of effort in the question being asked, this list is quite
 friendly to the same behavior. This is probably an element that makes
 the list 'lower impedance'.
 One additional thing on SO is that the [apache-spark] tag is a 'low rep'
 tag. Neither questions nor answers get significant voting, reducing the
 'rep gaming' factor  (discouraging participation?)

 Thinking about how to improve both platforms: SO[apache-spark] and this
 ML, and get back the list to not overwhelming message volumes, we could
 implement some 'load balancing' policies:
 - encourage new users to use Stack Overflow, in particular, redirect
 newbie questions to SO the friendly way: did you search SO already? or
 link to an existing question.
   - most how to map, flatmap, filter, aggregate, reduce, ... would fall
 under  this category
 - encourage domain experts to hang on SO more often  (my impression is
 that MLLib, GraphX are fairly underserved)
 - have an 'escalation process' in place, where we could post
 'interesting/hard/bug' questions from SO back to the list (or encourage the
 poster to do so)
 - update our community guidelines on [
 http://spark.apache.org/community.html] to implement such policies.

 Those are just some ideas on how to improve the community and better serve
 the newcomers while avoiding overload of our existing expertise pool.

 kr, Gerard.


 On Thu, Jan 22, 2015 at 10:42 AM, Sean Owen so...@cloudera.com wrote:

 Yes, there is some project business, like votes of record on releases, that
 needs to be carried on in a standard, simply accessible place, and SO is not
 at all suitable.

 Nobody is stuck with Nabble. The suggestion is to enable a different
 overlay on the existing list. SO remains a place you can ask questions too.
 So I agree with Nick's take.

 BTW are there perhaps plans to split this mailing list into
 subproject-specific lists? That might also help tune in/out the subset of
 conversations of interest.
 On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com
 wrote:


 Ok, thanks for the clarifications. I didn't know this list has to remain
 as the only official list.

 Nabble is really not the best solution in the world, but we're stuck
 with it, I guess.

 That's it from me on this subject.

 Petar


 On 22.1.2015. 3:55, Nicholas Chammas wrote:

  I think a few things need to be laid out clearly:

1. This mailing list is the “official” user discussion platform.
That is, it is sponsored and managed by the ASF.
2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in Stack 
 Overflow
under the apache-spark and related tags. Stack Overflow works quite
well.
3. The ASF will not agree to deprecating or migrating this user list
to a platform that they do not control.
4. This mailing list has grown to an unwieldy size and discussions
are hard to find or follow; discussion tooling is also lacking. We want 
 to
improve the utility and user experience of this mailing list.
5. We don’t want to fragment this “official” discussion community.
6. Nabble is an independent product not affiliated with the ASF. It
offers a slightly better interface to the Apache mailing list archives.

 So to respond to some of your points, pzecevic:

 Apache user group could be frozen (not accepting new questions

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
Josh / Patrick,

What do y’all think of the idea of promoting Stack Overflow as a place to
ask questions over this list, as long as the questions fit SO’s guidelines (
how-to-ask http://stackoverflow.com/help/how-to-ask, dont-ask
http://stackoverflow.com/help/dont-ask)?

The apache-spark http://stackoverflow.com/questions/tagged/apache-spark
tag is very active on there.

Discussions of all types are still on-topic here, but when possible we want
to encourage people to use SO.

Nick

On Wed Jan 21 2015 at 8:37:05 AM Jay Vyas jayunit100.apa...@gmail.com
http://mailto:jayunit100.apa...@gmail.com wrote:

Its a very valid  idea indeed, but... It's a tricky  subject since the
 entire ASF is run on mailing lists , hence there are so many different but
 equally sound ways of looking at this idea, which conflict with one another.

  On Jan 21, 2015, at 7:03 AM, btiernay btier...@hotmail.com wrote:
 
  I think this is a really great idea for really opening up the discussions
  that happen here. Also, it would be nice to know why there doesn't seem
 to
  be much interest. Maybe I'm misunderstanding some nuance of Apache
 projects.
 
  Cheers
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-
 list-tp20851p21288.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Nicholas Chammas
I think a few things need to be laid out clearly:

   1. This mailing list is the “official” user discussion platform. That
   is, it is sponsored and managed by the ASF.
   2. Users are free to organize independent discussion platforms focusing
   on Spark, and there is already one such platform in Stack Overflow under
   the apache-spark and related tags. Stack Overflow works quite well.
   3. The ASF will not agree to deprecating or migrating this user list to
   a platform that they do not control.
   4. This mailing list has grown to an unwieldy size and discussions are
   hard to find or follow; discussion tooling is also lacking. We want to
   improve the utility and user experience of this mailing list.
   5. We don’t want to fragment this “official” discussion community.
   6. Nabble is an independent product not affiliated with the ASF. It
   offers a slightly better interface to the Apache mailing list archives.

So to respond to some of your points, pzecevic:

Apache user group could be frozen (not accepting new questions, if that’s
possible) and redirect users to Stack Overflow (automatic reply?).

From what I understand of the ASF’s policies, this is not possible. :( This
mailing list must remain the official Spark user discussion platform.

Other thing, about new Stack Exchange site I proposed earlier. If a new
site is created, there is no problem with guidelines, I think, because
Spark community can apply different guidelines for the new site.

I think Stack Overflow and the various Spark tags are working fine. I don’t
see a compelling need for a Stack Exchange dedicated to Spark, either now
or in the near future. Also, I doubt a Spark-specific site can pass the 4
tests in the Area 51 FAQ http://area51.stackexchange.com/faq:

   - Almost all Spark questions are on-topic for Stack Overflow
   - Stack Overflow already exists, it already has a tag for Spark, and
   nobody is complaining
   - You’re not creating such a big group that you don’t have enough
   experts to answer all possible questions
   - There’s a high probability that users of Stack Overflow would enjoy
   seeing the occasional question about Spark

I think complaining won’t be sufficient. :)

Someone expressed a concern that they won’t allow creating a
project-specific site, but there already exist some project-specific sites,
like Tor, Drupal, Ubuntu…

The communities for these projects are many, many times larger than the
Spark community is or likely ever will be, simply due to the nature of the
problems they are solving.

What we need is an improvement to this mailing list. We need better tooling
than Nabble to sit on top of the Apache archives, and we also need some way
to control the volume and quality of mail on the list so that it remains a
useful resource for the majority of users.

Nick

On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com wrote:

 Hi,
 I tried to find the last reply by Nick Chammas (that I received in the
 digest) using the Nabble web interface, but I cannot find it (perhaps he
 didn't reply directly to the user list?). That's one example of Nabble's
 usability.

 Anyhow, I wanted to add my two cents...

 Apache user group could be frozen (not accepting new questions, if that's
 possible) and redirect users to Stack Overflow (automatic reply?). Old
 questions remain (and are searchable) on Nabble, new questions go to Stack
 Exchange, so no need for migration. That's the idea, at least, as I'm not
 sure if that's technically doable... Is it?
 dev mailing list could perhaps stay on Nabble (it's not that busy), or have
 a special tag on Stack Exchange.

 Other thing, about new Stack Exchange site I proposed earlier. If a new
 site
 is created, there is no problem with guidelines, I think, because Spark
 community can apply different guidelines for the new site.

 There is a FAQ about creating new sites: http://area51.stackexchange.
 com/faq
 It says: Stack Exchange sites are free to create and free to use. All we
 ask is that you have an enthusiastic, committed group of expert users who
 check in regularly, asking and answering questions.
 I think this requirement is satisfied...
 Someone expressed a concern that they won't allow creating a
 project-specific site, but there already exist some project-specific sites,
 like Tor, Drupal, Ubuntu...

 Later, though, the FAQ also says:
 If Y already exists, it already has a tag for X, and nobody is
 complaining
 (then you should not create a new site). But we could complain :)

 The advantage of having a separate site is that users, who should have more
 privileges, would need to earn them through Spark questions and answers
 only. The other thing, already mentioned, is that the community could
 create
 Spark specific guidelines. There are also  'meta' sites for asking
 questions
 like this one, etc.

 There is a process for starting a site - it's not instantaneous. New site
 needs to go through private beta and public beta, so that could be a

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2015-01-20 Thread Nicholas Chammas
Are the gz files roughly equal in size? Do you know that your partitions
are roughly balanced? Perhaps some cores get assigned tasks that end very
quickly, while others get most of the work.
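A rough way to check both (a sketch only; the path is hypothetical and sc is the
usual SparkContext): count the records in each partition and look at the spread.

rdd = sc.textFile("s3n://my-bucket/logs/*.gz")
print(rdd.getNumPartitions())
# one count per partition; a large spread means a few cores do most of the work
sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(sorted(sizes))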

On Sat Jan 17 2015 at 2:02:49 AM Gautham Anil gautham.a...@gmail.com
wrote:

 Hi,

 Thanks for getting back to me. Sorry for the delay. I am still having
 this issue.

 @sun: To clarify, the machine actually has 16 usable threads and the
 job has more than 100 gzip files. So, there are enough partitions to
 use all threads.

 @nicholas: The number of partitions match the number of files:  100.

 @Sebastian: I understand the lazy loading behavior. For this reason, I
 usually use a .count() to force the transformation (.first() will not
 be enough). Still, during the transformation, only 4 cores are used
 for processing the input files.

 I don't know if this issue is noticed by other people. Can anyone
 reproduce it with v1.1?


 On Wed, Dec 17, 2014 at 2:14 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Rui is correct.
 
  Check how many partitions your RDD has after loading the gzipped files.
 e.g.
  rdd.getNumPartitions().
 
  If that number is way less than the number of cores in your cluster (in
 your
  case I suspect the number is 4), then explicitly repartition the RDD to
  match the number of cores in your cluster, or some multiple thereof.
 
  For example:
 
  new_rdd = rdd.repartition(sc.defaultParallelism * 3)
 
  Operations on new_rdd should utilize all the cores in your cluster.
 
  Nick
 
 
  On Wed Dec 17 2014 at 1:42:16 AM Sun, Rui rui@intel.com wrote:
 
  Gautham,
 
  How many gz files do you have?  Maybe the reason is that a gz file
  is compressed and so can't be split for processing by MapReduce. A single
  gz file can only be processed by a single Mapper, so the CPU threads
  can't be fully utilized.
 
  -Original Message-
  From: Gautham [mailto:gautham.a...@gmail.com]
  Sent: Wednesday, December 10, 2014 3:00 AM
  To: u...@spark.incubator.apache.org
  Subject: pyspark sc.textFile uses only 4 out of 32 threads per node
 
  I am having an issue with pyspark launched in ec2 (using spark-ec2)
 with 5
  r3.4xlarge machines where each has 32 threads and 240GB of RAM. When I
 do
  sc.textFile to load data from a number of gz files, it does not
 progress as
  fast as expected. When I log in to a child node and run top, I see only 4
  threads at 100% CPU. All remaining 28 cores were idle. This is not an
 issue
  when processing the strings after loading, when all the cores are used
 to
  process the data.
 
  Please help me with this? What setting can be changed to get the CPU
 usage
  back up to full?
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-
 sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
 additional
  commands, e-mail: user-h...@spark.apache.org
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



 --
 Gautham Anil

 The first principle is that you must not fool yourself. And you are
 the easiest person to fool - Richard P. Feynman



Re: Cluster hangs in 'ssh-ready' state using Spark 1.2 EC2 launch script

2015-01-18 Thread Nicholas Chammas
Nathan,

I posted a bunch of questions for you as a comment on your question
http://stackoverflow.com/q/28002443/877069 on Stack Overflow. If you
answer them (don't forget to @ping me) I may be able to help you.

Nick

On Sat Jan 17 2015 at 3:49:54 PM gen tang gen.tan...@gmail.com wrote:

 Hi,

 This is because 'ssh-ready' in the ec2 script means that all the instances
 are in the 'running' state and all their status checks are OK.
 In other words, the instances are ready to download and install
 software, just as EMR is ready for bootstrap actions.
 Before, the script just repeatedly printed messages showing that we
 were waiting for every instance to be launched. That was quite ugly, so they
 changed what gets printed.
 However, you can ssh into an instance even while it is still in the
 'pending' state. If you wait patiently a little longer, the script will
 finish launching the cluster.
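 For reference, the same check can be done by hand with boto 2.x (the library
 spark-ec2 bundled at the time); the region and instance id below are made up,
 and 'ssh-ready' roughly corresponds to state 'running' plus both status checks
 reporting 'ok':

 import boto.ec2

 conn = boto.ec2.connect_to_region("us-west-2")
 for s in conn.get_all_instance_status(instance_ids=["i-0abc1234"]):
     print(s.id, s.state_name, s.system_status.status, s.instance_status.status)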

 Cheers
 Gen


 On Sat, Jan 17, 2015 at 7:00 PM, Nathan Murthy nathan.mur...@gmail.com
 wrote:

 Originally posted here:
 http://stackoverflow.com/questions/28002443/cluster-hangs-in-ssh-ready-state-using-spark-1-2-ec2-launch-script

 I'm trying to launch a standalone Spark cluster using its pre-packaged
 EC2 scripts, but it just indefinitely hangs in an 'ssh-ready' state:

 ubuntu@machine:~/spark-1.2.0-bin-hadoop2.4$ ./ec2/spark-ec2 -k
 key-pair -i identity-file.pem -r us-west-2 -s 3 launch test
 Setting up security groups...
 Searching for existing cluster test...
 Spark AMI: ami-ae6e0d9e
 Launching instances...
 Launched 3 slaves in us-west-2c, regid = r-b___6
 Launched master in us-west-2c, regid = r-0__0
 Waiting for all instances in cluster to enter 'ssh-ready'
 state..

 Yet I can SSH into these instances without compaint:

 ubuntu@machine:~$ ssh -i identity-file.pem root@master-ip
 Last login: Day MMM DD HH:mm:ss 20YY from
 c-AA-BBB--DDD.eee1.ff.provider.net

__|  __|_  )
_|  ( /   Amazon Linux AMI
   ___|\___|___|

 https://aws.amazon.com/amazon-linux-ami/2013.03-release-notes/
 There are 59 security update(s) out of 257 total update(s) available
 Run sudo yum update to apply all updates.
 Amazon Linux version 2014.09 is available.
 root@ip-internal ~]$

 I'm trying to figure out if this is a problem in AWS or with the Spark
 scripts. I never had this issue until recently.


 --
 Nathan Murthy // 713.884.7110 (mobile) // @natemurthy





Re: Discourse: A proposed alternative to the Spark User list

2015-01-17 Thread Nicholas Chammas
The Stack Exchange community will not support creating a whole new site
just for Spark (otherwise you’d see dedicated sites for much larger topics
like “Python”). Their tagging system works well enough to separate
questions about different topics, and the apache-spark
http://stackoverflow.com/questions/tagged/apache-spark tag on Stack
Overflow is already doing pretty well.

The ASF as well as this community will also not support any migration of
the mailing list to another system due to ASF rules
http://www.apache.org/foundation/how-it-works.html and community
fragmentation.

Realistically, the only options available to us that I see are options 1
and 3 from my original email (which can be used together).

Option 3: Change the culture around the user list. Encourage people to use
Stack Overflow whenever possible, and this list only when their question
doesn’t fit SO’s strict rules.

Option 1: Work with the ASF and the Discourse teams to allow Discourse to
be deployed as an overlay on top of this existing mailing list. (e.g. Like
a new UI on top of an old database.)

The goal of both changes would be to make the user list more usable.

Nick

On 2015년 1월 17일 (토) at 오전 8:51 Andrew Ash and...@andrewash.com wrote:

People can continue using the stack exchange sites as is with no additional
 work from the Spark team.  I would not support migrating our mailing lists
 yet again to another system like Discourse because I fear fragmentation of
 the community between the many sites.

 On Sat, Jan 17, 2015 at 6:24 AM, pzecevic petar.zece...@gmail.com wrote:

 Hi, guys!

 I'm reviving this old question from Nick Chammas with a new proposal: what
 do you think about creating a separate Stack Exchange 'Apache Spark' site
 (like 'philosophy' and 'English' etc.)?

 I'm not sure what would be the best way to deal with user and dev lists,
 though - to merge them into one or create two separate sites...

 And I don't know if it's at all possible to migrate current lists to stack
 exchange, but I believe it would be an improvement over the current
 situation. People are used to stack exchange, it's easy to use and search,
 topics (Spark SQL, Streaming, Graphx) could be marked with tags for easy
 filtering, code formatting is super easy etc.

 What do you all think?



 Nick Chammas wrote
  When people have questions about Spark, there are 2 main places (as far
 as
  I can tell) where they ask them:
 
 - Stack Overflow, under the apache-spark tag
  <http://stackoverflow.com/questions/tagged/apache-spark>
 - This mailing list
 
  The mailing list is valuable as an independent place for discussion that
  is
  part of the Spark project itself. Furthermore, it allows for a broader
  range of discussions than would be allowed on Stack Overflow
  <http://stackoverflow.com/help/dont-ask>.
 
  As the Spark project has grown in popularity, I see that a few problems
  have emerged with this mailing list:
 
 - It’s hard to follow topics (e.g. Streaming vs. SQL) that you’re
 interested in, and it’s hard to know when someone has mentioned you
 specifically.
 - It’s hard to search for existing threads and link information
 across
 disparate threads.
 - It’s hard to format code and log snippets nicely, and by extension,
 hard to read other people’s posts with this kind of information.
 
  There are existing solutions to all these (and other) problems based
  around
  straight-up discipline or client-side tooling, which users have to
 conjure
  up for themselves.
 
  I’d like us as a community to consider using Discourse
  <http://www.discourse.org/> as an alternative to, or overlay on
 top
  of,
  this mailing list, that provides better out-of-the-box solutions to
 these
  problems.
 
  Discourse is a modern discussion platform built by some of the same
 people
  who created Stack Overflow. It has many neat features
  <http://v1.discourse.org/about/> that I believe this community
 would
  benefit from.
 
  For example:
 
 - When a user starts typing up a new post, they get a panel *showing
 existing conversations that look similar*, just like on Stack
 Overflow.
 - It’s easy to search for posts and link between them.
 - *Markdown support* is built-in to composer.
 - You can *specifically mention people* and they will be notified.
 - Posts can be categorized (e.g. Streaming, SQL, etc.).
 - There is a built-in option for mailing list support which forwards
  all
 activity on the forum to a user’s email address and which allows for
 creation of new posts via email.
 
  What do you think of Discourse as an alternative, more manageable way to
  discus Spark?
 
  There are a few options we can consider:
 
 1. Work with the ASF as well as the Discourse team to allow Discourse
  to
 act as an overlay on top of this mailing list
 
  <https://meta.discourse.org/t/discourse-as-a-front-end-for-existing-asf-mailing-lists/23167?u=nicholaschammas>,
 

Re: dockerized spark executor on mesos?

2015-01-15 Thread Nicholas Chammas
The AMPLab maintains a bunch of Docker files for Spark here:
https://github.com/amplab/docker-scripts

Hasn't been updated since 1.0.0, but might be a good starting point.

On Wed Jan 14 2015 at 12:14:13 PM Josh J joshjd...@gmail.com wrote:

 We have dockerized Spark Master and worker(s) separately and are using it
 in
 our dev environment.


 Is this setup available on github or dockerhub?

 On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian vsubr...@gmail.com
 wrote:

 We have dockerized Spark Master and worker(s) separately and are using it
 in
 our dev environment. We don't use Mesos though, running it in Standalone
 mode, but adding Mesos should not be that difficult I think.

 Regards

 Venkat



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/dockerized-spark-executor-on-mesos-tp20276p20603.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Accidental kill in UI

2015-01-09 Thread Nicholas Chammas
As Sean said, this definitely sounds like something worth a JIRA issue (and
PR).

On Fri Jan 09 2015 at 8:17:34 AM Sean Owen so...@cloudera.com wrote:

 (FWIW yes I think this should certainly be a POST. The link can become
 a miniature form to achieve this and then the endpoint just needs to
 accept POST only. You should propose a pull request.)
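 As a rough illustration of that pattern (a standalone Python 3 sketch, not
 Spark's actual Scala UI code): the page served on GET only renders a small
 form, and the state change happens only behind POST, so a refresh or crawler
 hit can't kill anything.

 from http.server import BaseHTTPRequestHandler, HTTPServer

 class StageHandler(BaseHTTPRequestHandler):
     def do_GET(self):
         if self.path.startswith("/stages/stage/kill"):
             # visiting or refreshing the kill URL no longer changes state
             self.send_error(405, "Use the kill button (POST) instead")
             return
         self.send_response(200)
         self.send_header("Content-Type", "text/html")
         self.end_headers()
         # the stage page renders a tiny form that POSTs to the kill endpoint
         self.wfile.write(b'<form method="post" action="/stages/stage/kill?id=1">'
                          b'<button>kill</button></form>')

     def do_POST(self):
         if self.path.startswith("/stages/stage/kill"):
             # the actual state change lives only here
             self.send_response(200)
             self.end_headers()
             self.wfile.write(b"stage killed")
         else:
             self.send_error(404)

 if __name__ == "__main__":
     HTTPServer(("localhost", 8808), StageHandler).serve_forever()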

 On Fri, Jan 9, 2015 at 12:51 PM, Joe Wass jw...@crossref.org wrote:
  So I had a Spark job with various failures, and I decided to kill it and
  start again. I clicked the 'kill' link in the web console, restarted the
 job
  on the command line and headed back to the web console and refreshed to
 see
  how my job was doing... the URL at the time was:
 
  /stages/stage/kill?id=1terminate=true
 
  Which of course terminated the stage again. No loss, but if I'd waited a
 few
  hours before doing that, I would have lost data.
 
  I know to be careful next time, but isn't 'don't modify state as a
 result of
  a GET request' the first rule of HTTP? It could lead to an expensive
  mistake. Making this a POST would be a simple fix.
 
  Does anyone else think this is worth creating an issue for?
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



