Re: Save a spark RDD to disk

2016-11-08 Thread Andrew Holway
That's around 750MB/s, which seems quite respectable even in this day and age! How many and what kind of disks do you have attached to your nodes? What are you expecting? On Tue, Nov 8, 2016 at 11:08 PM, Elf Of Lothlorein wrote: > Hi > I am trying to save a RDD to disk and

Live data visualisations with Spark

2016-11-08 Thread Andrew Holway
. Is this something that could be accomplished with shiny server for instance? Thanks, Andrew Holway

Re: sanboxing spark executors

2016-11-04 Thread Andrew Holway
I think running it on a Mesos cluster could give you better control over this kinda stuff. On Fri, Nov 4, 2016 at 7:41 AM, blazespinnaker wrote: > Is there a good method / discussion / documentation on how to sandbox a > spark > executor? Assume the code is

Re: Python - Spark Cassandra Connector on DC/OS

2016-11-01 Thread Andrew Holway
Sorry: Spark 2.0.0 On Tue, Nov 1, 2016 at 10:04 AM, Andrew Holway < andrew.hol...@otternetworks.de> wrote: > Hello, > > I've been getting pretty serious with DC/OS which I guess could be > described as a somewhat polished distribution of Mesos. I'm not sure ho

Python - Spark Cassandra Connector on DC/OS

2016-11-01 Thread Andrew Holway
j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o33.showString. Full output to stdout and stderr. : http://bit.ly/2f80f9e (gist) Versions: Spark 2.0.1 Python Version: 3.4.3 (default, Sep 14 2016, 12:36:27) [cqlsh 5.0.1 | Cassandra 2.2.8 | CQL spec 3.3.1 |

Re: Anyone attending spark summit?

2016-10-13 Thread Andrew Gelinas
ANDREW! Thank you. The code worked, you're a legend. I was going to register today and now saved €€€. Owe you a beer. Gregory 2016-10-12 10:04 GMT+09:00 Andrew James <attainablecodi...@gmail.com>: > Hey, I just found a promo code for Spark Summit Europe that saves 20

Anyone attending spark summit?

2016-10-11 Thread Andrew James
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s "Summit16" - I love Brussels and just registered! Who’s coming with me to get their Spark on?! Cheers, Andrew

Can't connect to remote spark standalone cluster: getting WARN TaskSchedulerImpl: Initial job has not accepted any resources

2016-08-16 Thread Andrew Vykhodtsev
Dear all, I am trying to connect a remote Windows machine to a standalone Spark cluster (a single VM running on Ubuntu server with 8 cores and 64GB RAM). Both client and server have Spark 2.0 software prebuilt for Hadoop 2.6, and Hadoop 2.7. I have the following settings on the cluster: export

Re: Changing Spark configuration midway through application.

2016-08-10 Thread Andrew Ehrlich
If you're changing properties for the SparkContext, then I believe you will have to start a new SparkContext with the new properties. On Wed, Aug 10, 2016 at 8:47 AM, Jestin Ma wrote: > If I run an application, for example with 3 joins: > > [join 1] > [join 2] > [join
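A minimal sketch of that restart pattern (Scala; the app name and property are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    sc.stop()                                     // stop the existing context first
    val conf = new SparkConf()
      .setAppName("my-app")                       // illustrative
      .set("spark.default.parallelism", "400")    // whatever property you need to change
    val sc2 = new SparkContext(conf)              // the new context picks up the new properties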

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Andrew Ehrlich
15000 seems like a lot of tasks for that size. Test it out with a .coalesce(50) placed right after loading the data. It will probably either run faster or crash with out of memory errors. > On Jul 29, 2016, at 9:02 AM, Jestin Ma wrote: > > I am processing ~2 TB of
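A sketch of the suggestion (Scala; the input path is a placeholder):

    val data = sc.textFile("hdfs:///path/to/input")   // ~15000 partitions straight off the input files
      .coalesce(50)                                   // merge into 50 larger partitions without a shuffle
    data.count()                                      // any action will now run far fewer tasks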

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

2016-07-31 Thread Andrew Ehrlich
You could write each image to a different directory instead of a different file. That can be done by filtering the RDD into one RDD for each image and then saving each. That might not be what you’re after though, in terms of space and speed efficiency. Another way would be to save them multiple
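One way to sketch the filter-and-save-per-key idea (Scala here for brevity; the keys, values, and output path are made up):

    // assume an RDD of (imageId, bytes) pairs; here a tiny stand-in
    val images = sc.parallelize(Seq(("img001", Array[Byte](1, 2)), ("img002", Array[Byte](3, 4))))
    val ids = images.keys.distinct().collect()
    ids.foreach { id =>
      images.filter(_._1 == id)
            .values
            .saveAsObjectFile(s"hdfs:///output/images/$id")   // one output directory per image
    }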

Re: How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Andrew Ash
You sometimes have to hard refresh to get the page to update. On Wed, Jul 27, 2016 at 5:12 PM, Jim O'Flaherty wrote: > Nevermind, it literally just appeared right after I posted this. > > > > -- > View this message in context: >

Re: Bzip2 to Parquet format

2016-07-24 Thread Andrew Ehrlich
You can load the text with sc.textFile() to an RDD[String], then use .map() to convert it into an RDD[Row]. At this point you are ready to apply a schema. Use sqlContext.createDataFrame(rddOfRow, structType) Here is an example on how to define the StructType (schema) that you will combine with
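A minimal sketch of that pipeline (Scala, Spark 1.x API; column names, types, and paths are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    val lines = sc.textFile("hdfs:///data/input.bz2")                            // RDD[String]
    val rows  = lines.map(_.split(",")).map(a => Row(a(0), a(1).trim.toInt))     // RDD[Row]
    val schema = StructType(Seq(
      StructField("name",  StringType,  nullable = true),
      StructField("count", IntegerType, nullable = true)
    ))
    val df = sqlContext.createDataFrame(rows, schema)
    df.write.parquet("hdfs:///data/output_parquet")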

Re: Size exceeds Integer.MAX_VALUE

2016-07-24 Thread Andrew Ehrlich
You can use the .repartition() function on the rdd or dataframe to set the number of partitions higher. Use .partitions.length to get the current number of partitions. (Scala API). Andrew > On Jul 24, 2016, at 4:30 PM, Ascot Moss <ascot.m...@gmail.com> wrote: > > the data set
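That error typically means a single partition (block) has grown past 2 GB, so the fix is more, smaller partitions. A sketch (Scala):

    println(rdd.partitions.length)       // how many partitions you have now
    val finer = rdd.repartition(2000)    // shuffle into more, smaller partitions
    println(finer.partitions.length)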

Re: Size exceeds Integer.MAX_VALUE

2016-07-23 Thread Andrew Ehrlich
number of partitions, or using a more space-efficient data structure inside the RDD, or increasing the amount of memory available to spark and caching the data in memory. Make sure you are using Kryo serialization. Andrew > On Jul 23, 2016, at 9:00 PM, Ascot Moss <ascot.m...@gmail.com> wr
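A sketch of turning on Kryo serialization (Scala; MyRecord is a placeholder for your own class):

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, value: String)       // placeholder for the classes you actually shuffle

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))   // registering avoids writing full class names into the stream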

Re: How to generate a sequential key in rdd across executors

2016-07-23 Thread Andrew Ehrlich
It’s hard to do in a distributed system. Maybe try generating a meaningful key using a timestamp + hashed unique key fields in the record? > On Jul 23, 2016, at 7:53 PM, yeshwanth kumar wrote: > > Hi, > > i am doing bulk load to hbase using spark, > in which i need to
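A sketch of the timestamp-plus-hash idea (Scala; the record fields and sample data are hypothetical):

    case class Event(userId: String, ts: Long, payload: String)   // hypothetical record

    val events = sc.parallelize(Seq(
      Event("u1", 1469000000L, "click"),
      Event("u2", 1469000001L, "view")
    ))
    val keyed = events.map { e =>
      // timestamp gives rough ordering; a hash of the unique fields breaks ties deterministically
      val key = s"${e.ts}-${(e.userId + e.payload).hashCode.toHexString}"
      (key, e)
    }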

Re: spark and plot data

2016-07-23 Thread Andrew Ehrlich
@Gourav, did you find any good inline plotting tools when using the Scala kernel? I found one based on highcharts but it was not frictionless the way matplotlib is. > On Jul 23, 2016, at 2:26 AM, Gourav Sengupta > wrote: > > Hi Pedro, > > Toree is Scala kernel for

Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread Andrew Ehrlich
+1 for the misleading error. Messages about failing to connect often mean that an executor has died. If so, dig into the executor logs and find out why the executor died (out of memory, perhaps). Andrew > On Jul 23, 2016, at 11:39 AM, VG <vlin...@gmail.com> wrote: > > Hi Pe

Re: How to give name to Spark jobs shown in Spark UI

2016-07-23 Thread Andrew Ehrlich
As far as I know, the best you can do is refer to the Actions by line number. > On Jul 23, 2016, at 8:47 AM, unk1102 wrote: > > Hi I have multiple child spark jobs run at a time. Is there any way to name > these child spark jobs so I can identify slow running ones. For e.

Re: Spark Job trigger in production

2016-07-19 Thread Andrew Ehrlich
Another option is Oozie with the spark action: https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html > On Jul 18, 2016, at 12:15 AM, Jagat Singh wrote: > > You can use following options > > *

Re: the spark job is so slow - almost frozen

2016-07-19 Thread Andrew Ehrlich
Try: - filtering down the data as soon as possible in the job, dropping columns you don’t need. - processing fewer partitions of the hive tables at a time - caching frequently accessed data, for example dimension tables, lookup tables, or other datasets that are repeatedly accessed - using the

Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Andrew Ehrlich
There is a Spark<->HBase library that does this. I used it once in a prototype (never tried it in production, though): http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

Re: Heavy Stage Concentration - Ends With Failure

2016-07-19 Thread Andrew Ehrlich
Yea this is a good suggestion; also check 25th percentile, median, and 75th percentile to see how skewed the input data is. If you find that the RDD’s partitions are skewed you can solve it either by changing the partitioner when you read the files like already suggested, or call repartition()

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-19 Thread Andrew Ehrlich
Troubleshooting steps: $ telnet localhost 7077 (on master, to confirm port is open) $ telnet <master-host> 7077 (on slave, to confirm whether the port is blocked) If the port is available on the master from the master, but not on the master from the slave, check firewall settings on the master:

Re: Building standalone spark application via sbt

2016-07-19 Thread Andrew Ehrlich
Yes, spark-core will depend on Hadoop and several other jars. Here’s the list of dependencies: https://github.com/apache/spark/blob/master/core/pom.xml#L35 Whether you need spark-sql depends on whether you will use the DataFrame
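A minimal build.sbt sketch (the versions are illustrative for the Spark 1.6 line being discussed):

    name := "my-spark-app"
    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.6.2" % "provided"   // only needed if you use DataFrames / Spark SQL
    )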

Re: Spark performance testing

2016-07-08 Thread Andrew Ehrlich
Yea, I'm looking for any personal experiences people have had with tools like these. > On Jul 8, 2016, at 8:57 PM, charles li <charles.up...@gmail.com> wrote: > > Hi, Andrew, I've got lots of materials when asking google for "spark > performance test" > > h

Spark performance testing

2016-07-08 Thread Andrew Ehrlich
I’m wondering what others do, and if you do performance testing at all. Also, is anyone generating test data, or just operating on a static set? Is regression testing for performance a thing? Andrew - To unsubscribe e-mail: user

Spark 2.0 preview - How to configure warehouse for Catalyst? always pointing to /user/hive/warehouse

2016-06-17 Thread Andrew Lee
From branch-2.0, Spark 2.0.0 preview, I found it interesting: no matter what you do by configuring spark.sql.warehouse.dir, it will always pull up the default path, which is /user/hive/warehouse. In the code, I notice that at LOC45

pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?

2016-05-27 Thread Andrew Vykhodtsev
Dear list, I am trying to calculate sum and count on the same column: user_id_books_clicks = (sqlContext.read.parquet('hdfs:///projects/kaggle-expedia/input/train.parquet') .groupby('user_id') .agg({'is_booking':'count',

Re: never understand

2016-05-25 Thread Andrew Ehrlich
- Try doing less in each transformation - Try using different data structures within the transformations - Try not caching anything to free up more memory On Wed, May 25, 2016 at 1:32 AM, pseudo oduesp wrote: > hi guys , > -i get this errors with pyspark 1.5.0 under

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-09 Thread Andrew Lee
In fact, it does require ojdbc from Oracle which also requires a username and password. This was added as part of the testing scope for Oracle's docker. I notice this PR and commit in branch-2.0 according to https://issues.apache.org/jira/browse/SPARK-12941. In the comment, I'm not sure what

ERROR SparkContext: Error initializing SparkContext.

2016-05-09 Thread Andrew Holway
Hi, I am having a hard time getting to the bottom of this problem. I'm really not sure where to start with it. Everything works fine in local mode. Cheers, Andrew [testing@instance-16826 ~]$ /opt/mapr/spark/spark-1.5.2/bin/spark-submit --num-executors 21 --executor-cores 5 --master yarn-client

Unsubscribe

2016-04-26 Thread Andrew Heinrichs
Unsubscribe On Apr 22, 2016 3:21 PM, "Mich Talebzadeh" wrote: > > Hi, > > Anyone know which jar file has import org.apache.spark.internal.Logging? > > I tried *spark-core_2.10-1.5.1.jar * > > but does not seem to work > > scala> import

Custom Log4j layout on YARN = ClassNotFoundException

2016-04-22 Thread Rowson, Andrew G. (TR Technology & Ops)
This e-mail is for the sole use of the intended recipient and contains information that may be privileged and/or confidential. If you are not an intended recipient, please notify the sender by return e-mail and delete this e-mail and any attachments. Certain

Unsubscribe

2016-03-28 Thread Andrew Heinrichs
On Mar 29, 2016 8:56 AM, "Alexander Krasnukhin" wrote: > e.g. select max value for column "foo": > > from pyspark.sql.functions import max, col > df.select(max(col("foo"))).show() > > On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson < > a...@santacruzintegration.com> wrote:

Graphx

2016-03-10 Thread Andrew A
which fail my spark job. Thank you, Andrew

Re: No event log in /tmp/spark-events

2016-03-08 Thread Andrew Or
in your spark-defaults.conf are not passed correctly. -Andrew 2016-03-03 19:57 GMT-08:00 PatrickYu <hant...@gmail.com>: > alvarobrandon wrote > > Just write /tmp/sparkserverlog without the file part. > > I don't get your point, what's mean of 'without the file part' > > &

Re: subtractByKey increases RDD size in memory - any ideas?

2016-02-18 Thread Andrew Ehrlich
There could be clues in the different RDD subclasses; rdd1 is ParallelCollectionRDD but rdd3 is SubtractedRDD. On Thu, Feb 18, 2016 at 1:37 PM, DaPsul wrote: > (copy from > > http://stackoverflow.com/questions/35467128/spark-subtractbykey-increases-rdd-cached-memory-size > ) > >

Re: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-18 Thread Andrew Ehrlich
Use the scala method .split(",") to split the string into a collection of strings, and try using .replaceAll() on the field with the "?" to remove it. On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh wrote: > Hi, > > What is the equivalent of this Hive statement in Spark >
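A sketch of that in plain Scala (the sample line is made up):

    val line   = "abc,de?f,ghi"
    val fields = line.split(",")                       // Array(abc, de?f, ghi)
    val clean  = fields.map(_.replaceAll("\\?", ""))   // Array(abc, def, ghi) — "?" must be escaped in the regex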

[OT] Apache Spark Jobs in Kochi, India

2016-02-11 Thread Andrew Holway
for the noise! Cheers, Andrew

Futures timed out after [120 seconds]

2016-02-08 Thread Andrew Milkowski
-Andrew 16/02/08 08:33:37 ERROR YarnScheduler: Lost executor 265 on ip-172-20-35-115.ec2.internal: remote Rpc client disassociated [Stage 4313813:> (0 + 94) / 95]16/02/08 08:35:35 ERROR ContextCleaner: Error cleaning broadcast 4311

Writing to jdbc database from SparkR (1.5.2)

2016-02-06 Thread Andrew Holway
ionality exist in 1.5.2? Thanks, Andrew

Re: Writing to jdbc database from SparkR (1.5.2)

2016-02-06 Thread Andrew Holway
> > df <- read.df(sqlContext, source="jdbc", > url="jdbc:mysql://hostname:3306?user=user=pass", > dbtable="database.table") > I got a bit further but am now getting the following error. This error is being thrown without the database being touched. I tested this by making the database

Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-28 Thread Andrew Zurn
sometime soon). Thanks again for the advice on where to dig further into. Much appreciated. Andrew On Tue, Jan 26, 2016 at 9:18 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Have you tried setting spark.emr.dropCharacters to a lower value? (It > defaults to 8.)

Concatenating tables

2016-01-23 Thread Andrew Holway
this.

    +---------+
    | A B C D |
    +---------+
    | 1 2 3 4 |
    | 5 6 7 8 |
    | 3 5 6 8 |
    | 0 0 0 0 |
    | 8 8 8 8 |
    | 1 1 1 1 |
    +---------+

Thanks, Andrew

Re: SparkContext SyntaxError: invalid syntax

2016-01-21 Thread Andrew Weiner
Thanks Felix. I think I was missing gem install pygments.rb and I also had to roll back to Python 2.7 but I got it working. I submitted the PR submitted with the added explanation in the docs. Andrew On Wed, Jan 20, 2016 at 1:44 AM, Felix Cheung <felixcheun...@hotmail.com> wrote: >

Date / time stuff with spark.

2016-01-21 Thread Andrew Holway
T23:59:53+00:00”} I want to do some date/time operations on this json data but I cannot find clear documentation on how to A) specify the “time” field as a date/time in the schema. B) the format the date should be in to be correctly in the raw data for an easy import. Cheers, Andrew

Re: Date / time stuff with spark.

2016-01-21 Thread Andrew Holway
P.S. We are working with Python. On Thu, Jan 21, 2016 at 8:24 PM, Andrew Holway <andrew.hol...@otternetworks.de> wrote: > Hello, > > I am importing this data from HDFS into a data frame with > sqlContext.read.json(). > > {“a": 42, “a": 56,

Re: SparkContext SyntaxError: invalid syntax

2016-01-18 Thread Andrew Weiner
; Application Master process in cluster mode. See the YARN-related Spark > Properties > <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties> > for more information." > > I might take another crack at building the docs myself if nobody beats me >

Re: SparkContext SyntaxError: invalid syntax

2016-01-17 Thread Andrew Weiner
myself if nobody beats me to this. Andrew On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote: > Glad you got it going! It's wasn't very obvious what needed to be set, > maybe it is worth explicitly stating this in the docs since it seems to > have come up a cou

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
d exporting it in spark-submit *and* in spark-env.sh. Is there somewhere else I need to set this variable? Maybe in one of the hadoop conf files in my HADOOP_CONF_DIR? Andrew On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler <cutl...@gmail.com> wrote: > It seems like it could be the case

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used. Andrew On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner < andrewweiner2...@u.northwestern.edu> wrote: > Indeed! Here is the ou

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
/path/to/python While both this solution and the solution from my prior email work, I believe this is the preferred solution. Sorry for the flurry of emails. Again, thanks for all the help! Andrew On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner < andrewweiner2...@u.northwestern.edu>
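For reference, a setting of that shape in spark-defaults.conf would look roughly like this (the python path is a placeholder, and treating these as the properties the thread converges on is an assumption, not a quote from it):

    spark.yarn.appMasterEnv.PYSPARK_PYTHON  /path/to/python
    spark.executorEnv.PYSPARK_PYTHON        /path/to/python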

Re: SparkContext SyntaxError: invalid syntax

2016-01-15 Thread Andrew Weiner
ARK_PYTHON environment variable to be used in my yarn environment in cluster mode. Thank you for all your help! Best, Andrew On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner < andrewweiner2...@u.northwestern.edu> wrote: > I tried playing around with my environment variables, and here is an

Re: SparkContext SyntaxError: invalid syntax

2016-01-14 Thread Andrew Weiner
tax Is it possible that the PYSPARK_PYTHON environment variable is ignored when jobs are submitted in cluster mode? It seems that Spark or Yarn is going behind my back, so to speak, and using some older version of python I didn't even know was installed. Thanks again for all your help thus far. We are

Re: automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Andrew Or
questions, -Andrew 2016-01-13 11:36 GMT-08:00 Alexander Pivovarov <apivova...@gmail.com>: > Is it possible to automatically unpersist RDDs which are not used for 24 > hours? >

Re: Read Accumulator value while running

2016-01-13 Thread Andrew Or
there is not currently a way (in Spark 1.6 and before) to access the accumulator values until after the tasks that updated them have completed. This will change in Spark 2.0, the next version, however. Please let me know if you have more questions. -Andrew 2016-01-13 11:24 GMT-08:00 Daniel Imberman <daniel.im

Re: SparkContext SyntaxError: invalid syntax

2016-01-13 Thread Andrew Weiner
--executor-memory 2g --executor-cores 1 ./examples/src/main/python/pi.py 10* I get the error I mentioned in the prior email: Error from python worker: python: module pyspark.daemon not found Any thoughts? Best, Andrew On Mon, Jan 11, 2016 at 12:25 PM, Bryan Cutler <cutl...@gmail.co

Re: Windows driver cannot run job on Linux cluster

2016-01-11 Thread Andrew Wooster
Files\Common Files\Intel\WirelessCommon\;C:\Program Files (x86)\QuickTime\QTSystem\;C:\Program Files (x86)\Skype\Phone\;C:\Program Files\Amazon\AWSCLI\;C:\Users\Andrew\Projects\magellan\opt\RDKit_2015_03_1.win64.java;C:\Program Files\OpenBabel-2.3.90;C:\Program Files\Intel\WiFi\bin\;C:\Program Files

Windows driver cannot run job on Linux cluster

2016-01-11 Thread Andrew Wooster
and the job is marked as RUNNING (so it does not appears to be waiting on a worker). I do not see anything out of the ordinary in the master and worker logs. How do I debug a problem like this? -Andrew

Re: SparkContext SyntaxError: invalid syntax

2016-01-08 Thread Andrew Weiner
/container_1450370639491_0136_01_02/py4j-0.9-src.zip java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at [] And then a few more similar pyspark.daemon not found errors... Andrew On Fri, Jan 8, 2016 at 2:31 PM, Bryan Cutler <cutl...@gmail.com> wrote: > Hi Andr

Re: Can't submit job to stand alone cluster

2015-12-30 Thread Andrew Or
, your jars have to be visible to the machine running spark-submit. In cluster mode, your jars have to be visible to all machines running a Worker, since the driver can be launched on any of them. The nice email from Greg is spot-on. Does that make sense? -Andrew 2015-12-30 11:23 GMT-08:00 Spa

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
somewhere then we should add that. Also, the packages problem seems legitimate. Thanks for reporting it. I have filed https://issues.apache.org/jira/browse/SPARK-12559. -Andrew 2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>: > > > On 12/28/15, 5:16 PM, "

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
a.com > <http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_apps.html> > Preview by Yahoo > > > > > On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> > wrote: > > > The confusi

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
plicationMaster; use --jars > option with a globally visible path to said jar > 3. Yarn Client-mode: client and driver run on the same machine. driver > is *NOT* a thread in ApplicationMaster; use --packages to submit a jar > > > On Tuesday, December 29, 2015 1:54 PM, Andr

Re: Opening Dynamic Scaling Executors on Yarn

2015-12-29 Thread Andrew Or
> > External shuffle service is backward compatible, so if you deployed 1.6 > shuffle service on NM, it could serve both 1.5 and 1.6 Spark applications. Actually, it just happens to be backward compatible because we didn't change the shuffle file formats. This may not necessarily be the case

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Andrew Or
ase let me know if there's anything else I can help clarify. Cheers, -Andrew 2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>: > Andrew, > > Now I see where the confusion lays. Standalone cluster mode, your link, is > nothing but a combination of client-mode

Re: Yarn application ID for Spark job on Yarn

2015-12-18 Thread Andrew Or
Hi Roy, I believe Spark just gets its application ID from YARN, so you can just do `sc.applicationId`. -Andrew 2015-12-18 0:14 GMT-08:00 Deepak Sharma <deepakmc...@gmail.com>: > I have never tried this but there is yarn client api's that you can use in > your spark pr

Re: which aws instance type for shuffle performance

2015-12-18 Thread Andrew Or
using off-heap, so much of the memory was actually not used. Yes, if a shuffle file exists locally Spark just reads from disk. -Andrew 2015-12-15 23:11 GMT-08:00 Rastan Boroujerdi <rast...@gmail.com>: > I'm trying to determine whether I should be using 10 r3.8xlarge or 40 > r3.2xlarge

Re: Limit of application submission to cluster

2015-12-18 Thread Andrew Or
Hi Saif, have you verified that the cluster has enough resources for all 4 programs? -Andrew 2015-12-18 5:52 GMT-08:00 <saif.a.ell...@wellsfargo.com>: > Hello everyone, > > I am testing some parallel program submission to a stand alone cluster. > Everything works al

Re: imposed dynamic resource allocation

2015-12-18 Thread Andrew Or
requests to the AM asking for executors. If you did not enable this config, your application will not make such requests. -Andrew 2015-12-11 14:01 GMT-08:00 Antony Mayi <antonym...@yahoo.com.invalid>: > Hi, > > using spark 1.5.2 on yarn (client mode) and was trying to use the dyn
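A sketch of enabling it (Scala SparkConf; the same keys can equally go in spark-defaults.conf):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")   // the external shuffle service is required for dynamic allocation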

Re: Spark job submission REST API

2015-12-10 Thread Andrew Or
would require a detailed design consensus among the community. -Andrew 2015-12-10 8:26 GMT-08:00 mvle <m...@us.ibm.com>: > Hi, > > I would like to use Spark as a service through REST API calls > for uploading and submitting a job, getting results, etc. > > There is a projec

Re: Warning: Master endpoint spark://ip:7077 was not a REST server. Falling back to legacy submission gateway instead.

2015-12-10 Thread Andrew Or
go away. -Andrew 2015-12-10 18:13 GMT-08:00 Andy Davidson <a...@santacruzintegration.com>: > Hi Jakob > > The cluster was set up using the spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 > script > > Given my limited knowledge I think this looks okay? > > Thanks > > A

Re: create a table for csv files

2015-11-19 Thread Andrew Or
There's not an easy way. The closest thing you can do is: import org.apache.spark.sql.functions._ val df = ... df.withColumn("id", monotonicallyIncreasingId()) -Andrew 2015-11-19 8:23 GMT-08:00 xiaohe lan <zombiexco...@gmail.com>: > Hi, > > I have some csv file in HD
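A sketch of what that looks like end to end (Scala, Spark 1.x; the small DataFrame stands in for whatever you build from your CSV files):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val df = Seq(("alice", 1), ("bob", 2)).toDF("name", "value")   // stand-in for your CSV-backed DataFrame
    val withId = df.withColumn("id", monotonicallyIncreasingId())
    withId.show()   // the ids are unique and increasing, but not guaranteed consecutive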

Re: send transformed RDD to s3 from slaves

2015-11-14 Thread Andrew Ehrlich
Maybe you want to be using rdd.saveAsTextFile() ? > On Nov 13, 2015, at 4:56 PM, Walrus theCat wrote: > > Hi, > > I have an RDD which crashes the driver when being collected. I want to send > the data on its partitions out to S3 without bringing it back to the driver.

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2015-10-05 Thread Andrew Or
service (e.g. 1.2) also happens to work with say Spark 1.4 because the shuffle file formats haven't changed. However, there are no guarantees that this will remain the case. -Andrew 2015-10-05 16:37 GMT-07:00 Alex Rovner <alex.rov...@magnetic.com>: > We are running CDH 5.4 with Spark 1

Re: Why are executors on slave never used?

2015-09-21 Thread Andrew Or
need to do `setMaster("yarn")`, assuming that all the Hadoop configurations files such as core-site.xml are already set up properly. -Andrew 2015-09-21 8:53 GMT-07:00 Hemant Bhanawat <hemant9...@gmail.com>: > When you specify master as local[2], it starts the spark components in a >

Re: Executor lost failure

2015-09-01 Thread Andrew Duffy
If you're using YARN with Spark 1.3.1, you could be running into https://issues.apache.org/jira/browse/SPARK-8119, although without more information it's impossible to know. On Tue, Sep 1, 2015 at 11:28 AM, Priya Ch wrote: > Hi All, > > I have a spark streaming

Re: Spark ec2 lunch problem

2015-08-24 Thread Andrew Or
. -Andrew 2015-08-24 5:58 GMT-07:00 Robin East robin.e...@xense.co.uk: spark-ec2 is the way to go however you may need to debug connectivity issues. For example do you know that the servers were correctly setup in AWS and can you access each node using ssh? If no then you need to work out why (it’s

Re: DAG related query

2015-08-20 Thread Andrew Or
have 3 distinct RDDs. All you're doing is reassigning a reference `rdd1`, but the underlying RDD doesn't change. -Andrew 2015-08-20 6:21 GMT-07:00 Sean Owen so...@cloudera.com: No. The third line creates a third RDD whose reference simply replaces the reference to the first RDD in your local
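A tiny illustration of the point (Scala):

    var rdd1 = sc.parallelize(1 to 10)   // RDD #1
    val rdd2 = rdd1.map(_ * 2)           // RDD #2, built on top of #1
    rdd1 = rdd2.filter(_ > 5)            // RDD #3 — only the local variable rdd1 now points elsewhere;
                                         // RDD #1 still exists as the parent in rdd2's lineage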

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Andrew Or
everything on 1 node, it looks like it's not grabbing the extra nodes. On Wed, Aug 19, 2015 at 8:43 AM, Axel Dahl a...@whisperstream.com wrote: That worked great, thanks Andrew. On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or and...@databricks.com wrote: Hi Axel, You can try setting

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread Andrew Or
it provides is broader than that. -Andrew 2015-08-19 5:13 GMT-07:00 canan chen ccn...@gmail.com: Anyone know about this ? Or do I miss something here ? On Fri, Aug 7, 2015 at 4:20 PM, canan chen ccn...@gmail.com wrote: Is there any reason that historyserver use another property for the event

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
Yes, in other words, a bucket is a single file in hash-based shuffle (no consolidation), but a segment of partitioned file in sort-based shuffle. 2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed 11besemja...@seecs.edu.pk: Thanks Andrew for a detailed response, So the reason why key value pairs

Re: Why standalone mode don't allow to set num-executor ?

2015-08-18 Thread Andrew Or
confusion. -Andrew 2015-08-18 2:35 GMT-07:00 canan chen ccn...@gmail.com: num-executor only works for yarn mode. In standalone mode, I have to set the --total-executor-cores and --executor-cores. Isn't this way so intuitive ? Any reason for that ?

Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
. This places much less stress on the file system and requires much fewer I/O operations especially on the read side. -Andrew 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed 11besemja...@seecs.edu.pk : I did check it out and although I did get a general understanding of the various classes used

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Andrew Or
scripts. For more information see: http://spark.apache.org/docs/latest/spark-standalone.html. Feel free to let me know whether it works, -Andrew 2015-08-18 4:49 GMT-07:00 Igor Berman igor.ber...@gmail.com: by default standalone creates 1 executor on every worker machine per application number

Re: dse spark-submit multiple jars issue

2015-08-18 Thread Andrew Or
application jar uses. -Andrew 2015-08-13 3:22 GMT-07:00 Javier Domingo Cansino javier.domi...@fon.com: Please notice that 'jars: null' I don't know why you put ///. but I would propose you just put normal absolute paths. dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld

Re: Programmatically create SparkContext on YARN

2015-08-18 Thread Andrew Or
. In other words, you shouldn't have to do anything more than providing a different value to `--master` to use YARN. -Andrew 2015-08-17 0:34 GMT-07:00 Andreas Fritzler andreas.fritz...@gmail.com: Hi all, when runnig the Spark cluster in standalone mode I am able to create the Spark context from

Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread Andrew Or
Hi Canan, TestSQLContext is no longer a singleton but now a class. It is never meant to be a fully public API, but if you wish to use it you can just instantiate a new one: val sqlContext = new TestSQLContext or just create a new SQLContext from a SparkContext. -Andrew 2015-08-15 20:33 GMT-07

Re: Spark master driver UI: How to keep it after process finished?

2015-08-08 Thread Andrew Or
18080. For more information on how to set this up, visit: http://spark.apache.org/docs/latest/monitoring.html -Andrew 2015-08-07 13:16 GMT-07:00 François Pelletier newslett...@francoispelletier.org: look at spark.history.ui.port, if you use standalone spark.yarn.historyServer.address, if you

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan, `map2` is a broadcast variable, not your map. To access the map on the executors you need to do `map2.value(a)`. -Andrew 2015-07-22 12:20 GMT-07:00 Dan Dong dongda...@gmail.com: Hi, Andrew, If I broadcast the Map: val map2=sc.broadcast(map1) I will get compilation error
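A sketch of the pattern from the thread (Scala; the map contents are made up):

    val map1 = Map("a" -> 1, "b" -> 2)
    val map2 = sc.broadcast(map1)                 // Broadcast[Map[String, Int]], not a Map itself
    val rdd  = sc.parallelize(Seq("a", "b", "a"))
    val looked = rdd.map(k => map2.value(k))      // .value unwraps the broadcast on the executors
    looked.collect()                              // Array(1, 2, 1)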

Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth, It does look like a bug. Did you set `spark.executor.cores` in your application by any chance? -Andrew 2015-07-22 8:05 GMT-07:00 Srikanth srikanth...@gmail.com: Hello, I've set spark.deploy.spreadOut=false in spark-env.sh. export SPARK_MASTER_OPTS=-Dspark.deploy.defaultCores

Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth, I was able to reproduce the issue by setting `spark.cores.max` to a number greater than the number of cores on a worker. I've filed SPARK-9260 which I believe is already being fixed in https://github.com/apache/spark/pull/7274. Thanks for reporting the issue! -Andrew 2015-07-22 11

Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
intensive. -Andrew 2015-07-21 13:47 GMT-07:00 wdbaruni wdbar...@gmail.com: I am new to Spark and I understand that Spark divides the executor memory into the following fractions: *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or .cache() and can be defined by setting

Re: Spark spark.shuffle.memoryFraction has no affect

2015-07-22 Thread Andrew Or
bytes spilled on the UI before and after the change. Let me know if that helped. -Andrew 2015-07-21 13:50 GMT-07:00 wdbaruni wdbar...@gmail.com: Hi I am testing Spark on Amazon EMR using Python and the basic wordcount example shipped with Spark. After running the application, I realized

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
/examples/BroadcastTest.scala . -Andrew 2015-07-21 19:56 GMT-07:00 ayan guha guha.a...@gmail.com: Either you have to do rdd.collect and then broadcast or you can do a join On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote: Hi, All, I am trying to access a Map from RDDs

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew Or, Yes, NodeManager was restarted, I also checked the logs to see if the JARs appear in the CLASSPATH. I have also downloaded the binary distribution and use the JAR spark-1.4.1-bin-hadoop2.4/lib/spark-1.4.1-yarn-shuffle.jar without success. Has anyone successfully enabled

Re: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Or
Hi Andrew, Based on your driver logs, it seems the issue is that the shuffle service is actually not running on the NodeManagers, but your application is trying to provide a spark_shuffle secret anyway. One way to verify whether the shuffle service is actually started is to look

RE: The auxService:spark_shuffle does not exist

2015-07-21 Thread Andrew Lee
Hi Andrew, Thanks for the advice. I didn't see the log in the NodeManager, so apparently, something was wrong with the yarn-site.xml configuration. After digging in more, I realize it was an user error. I'm sharing this with other people so others may know what mistake I have made. When I review

Adding meetup groups to Community page - Moscow, Slovenia, Zagreb

2015-07-17 Thread Andrew Vykhodtsev
Dear all, The page https://spark.apache.org/community.html Says : If you'd like your meetup added, email user@spark.apache.org. So here I am emailing, could please someone add three new groups to the page Moscow : http://www.meetup.com/Apache-Spark-in-Moscow/ Slovenija (Ljubljana)
