Re: spark-shell on standalone cluster gives error no mesos in java.library.path

2014-04-08 Thread Christoph Böhm
Forgot to post the solution. I messed up the master URL. In particular, I gave the host (master), not a URL. My bad. The error message is weird, though. Seems like the URL regex matches master for mesos://... No idea about the Java Runtime Environment Error. On Mar 26, 2014, at 3:52 PM,

Re: Only TraversableOnce?

2014-04-08 Thread wxhsdp
thank you for your help! let me have a try. Nan Zhu wrote: If that’s the case, I think mapPartition is what you need, but it seems that you have to load the partition into memory as a whole by toArray: rdd.mapPartition{D => {val p = D.toArray; ...}} -- Nan Zhu On Tuesday, April

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
Another thing I didn't mention. The AMI and user used: naturally I've created several of my own AMIs with the following characteristics. None of which worked. 1) Enabling ssh as root as per this guide ( http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/). When doing this, I

NPE using saveAsTextFile

2014-04-08 Thread Nick Pentreath
Hi I'm using Spark 0.9.0. When calling saveAsTextFile on a custom hadoop inputformat (loaded with newAPIHadoopRDD), I get the following error below. If I call count, I get the correct count of number of records, so the inputformat is being read correctly... the issue only appears when trying to

Re: AWS Spark-ec2 script with different user

2014-04-08 Thread Marco Costantini
I was able to keep the workaround ...around... by overwriting the generated '/root/.ssh/authorized_keys' file with a known good one, in the '/etc/rc.local' file On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini silvio.costant...@granatads.com wrote: Another thing I didn't mention. The AMI and

Re: Spark and HBase

2014-04-08 Thread Bin Wang
Hi Flavio, I happened to attend, actually am attending, the 2014 Apache Conf, where I heard about a project called Apache Phoenix, which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory-bounded, whereas memory sets up a limit for Spark. It is still in the incubating group and

Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases, HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, while being far more flexible and efficient in dataset-scale aggregations and general computations. So yes, you can easily

Re: Spark and HBase

2014-04-08 Thread Flavio Pompermaier
Thanks for the quick reply Bin. Phoenix is something I'm going to try for sure, but it seems somehow useless if I can use Spark. Probably, as you said, since Phoenix uses a dedicated data structure within each HBase table, it has more effective memory usage, but if I need to deserialize data stored in a

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i tried again with latest master, which includes commit below, but ui page still shows nothing on storage tab. koert commit ada310a9d3d5419e101b24d9b41398f609da1ad3 Author: Andrew Or andrewo...@gmail.com Date: Mon Mar 31 23:01:14 2014 -0700 [Hot Fix #42] Persisted RDD disappears on

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit did work for me. Could you confirm the following: 1) After you called cache(), did you make any actions like count() or reduce()? If you don't materialize the RDD, it won't show up in the storage tab. 2) Did you run ./make-distribution.sh after you switched to the current master?
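For reference, a minimal spark-shell sketch of point 1 (the path is a placeholder): caching is lazy, so an RDD only appears on the storage tab after an action has materialized it.

  val rdd = sc.textFile("hdfs:///some/path").cache()  // marks the RDD for caching; nothing is computed yet
  rdd.count()  // the action evaluates the RDD, and its cached blocks then show up on the storage tab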

RDD creation on HDFS

2014-04-08 Thread gtanguy
I read in the RDD paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf): For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on. And on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html: To minimize

assumption that lib_managed is present

2014-04-08 Thread Koert Kuipers
when i start spark-shell i now see ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or directory we do not package a lib_managed with our spark build (never did). maybe the logic in compute-classpath.sh that searches for datanucleus should check for the existence of

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the spark-shell in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote: i tried again with latest master, which includes

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
sorry, i meant to say: note that for a cached rdd in the spark shell it all works fine. but something is going wrong with the SPARK-APPLICATION-UI in our applications that extensively cache and re-use RDDs On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote: note that for a

Re: ui broken in latest 1.0.0

2014-04-08 Thread Xiangrui Meng
That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run ./make-distribution.sh to re-compile Spark first. -Xiangrui On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i call an action after cache, and i can see that the RDDs are fully cached using context.getRDDStorageInfo which we expose via our own api. i did not run make-distribution.sh, we have our own scripts to build a distribution. however if your question is if i correctly deployed the latest

Re: Spark and HBase

2014-04-08 Thread Nicholas Chammas
Just took a quick look at the overview here (http://phoenix.incubator.apache.org/) and the quick start guide here (http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html). It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic

Re: Pig on Spark

2014-04-08 Thread Mayur Rustagi
Hi Ankit, Thanx for all the work on Pig. Finally got it working. Couple of high-level bugs right now: - Getting it working on Spark 0.9.0 - Getting UDFs working - Getting generate functionality working - Exhaustive test suite for Pig on Spark. Are you maintaining a Jira somewhere? I am

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yes i am definitely using latest On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote: That commit fixed the exact problem you described. That is why I want to confirm that you switched to the master branch. bin/spark-shell doesn't detect code changes, so you need to run

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
i put some println statements in BlockManagerUI. i have RDDs that are cached in memory. I see this: *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
yet at the same time i can see via our own api: storageInfo: { diskSize: 0, memSize: 19944, numCachedPartitions: 1, numPartitions: 1 } On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote: i put some println statements in BlockManagerUI

Re: assumption that lib_managed is present

2014-04-08 Thread Aaron Davidson
Yup, sorry about that. This error message should not produce incorrect behavior, but it is annoying. Posted a patch to fix it: https://github.com/apache/spark/pull/361 Thanks for reporting it! On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers ko...@tresata.com wrote: when i start spark-shell i

java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
I am trying to read a file from HDFS on the Spark shell and getting the error below. When I create the first RDD it works fine, but when I try to do a count on that RDD, it throws me some connection error. I have a single-node hdfs setup and on the same machine I have spark running. Please help. When I run jps

Re: Urgently need help interpreting duration

2014-04-08 Thread Yana Kadiyska
Thank you -- this actually helped a lot. Strangely it appears that the task detail view is not accurate in 0.8 -- that view shows 425ms duration for one of the tasks, but in the driver log I do indeed see Finished TID 125 in 10940ms. On that slow worker I see the following: 14/04/08 18:06:24

Re: java.io.IOException: Call to dev/17.29.25.4:50070 failed on local exception: java.io.EOFException

2014-04-08 Thread reegs
There are a couple of issues here which I was able to find out. 1: We should not use the web port which we use to access the web UI. I was using that initially, so it was not working. 2: All requests should go to the NameNode and not anything else. 3: By replacing localhost:9000 in the above request, it

Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Hi All, I want to measure the total network traffic for a Spark Job. But I did not see related information from the log. Does anybody know how to measure it? Thanks very much in advance. -- View this message in context:

ETL for postgres to hadoop

2014-04-08 Thread Manas Kar
Hi All, I have some spatial data in a Postgres machine. I want to be able to move that data to Hadoop and do some geo-processing. I tried using Sqoop to move the data to Hadoop, but it complained about the position data (which it says it can't recognize). Does anyone have any idea as to

Re: ETL for postgres to hadoop

2014-04-08 Thread andy petrella
Hello Manas, I don't know Sqoop that much but my best guess is that you're probably using Postgis which has specific structures for Geometry and so on. And if you need some spatial operators my gut feeling is that things will be harder ^^ (but a raw import won't need that...). So I did a quick

Spark with SSL?

2014-04-08 Thread kamatsuoka
Can Spark be configured to use SSL for all its network communication? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-SSL-tp3916.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
1) at the end of the callback 2) yes we simply expose sc.getRDDStorageInfo to the user via REST 3) yes exactly. we define the RDDs at startup, all of them are cached. from that point on we only do calculations on these cached RDDs. i will add some more println statements for storageStatusList

Re: ui broken in latest 1.0.0

2014-04-08 Thread Koert Kuipers
our one cached RDD in this run has id 3 *** onStageSubmitted ** rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1); CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B; TachyonSize: 0.0 B; DiskSize: 0.0 B _rddInfoMap: Map(2 -> RDD 2

Re: Spark with SSL?

2014-04-08 Thread Andrew Ash
Not that I know of, but it would be great if that was supported. The way I typically handle security now is to put the Spark servers in their own subnet with strict inbound/outbound firewalls. On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote: Can Spark be configured to use

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread Andrew Ash
If you set up Spark's metrics reporting to write to the Ganglia backend that will give you a good idea of how much network/disk/CPU is being used and on what machines. https://spark.apache.org/docs/0.9.0/monitoring.html On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote: Hi All,
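As a rough sketch, metrics reporting is configured through conf/metrics.properties; something along these lines enables a Ganglia sink (the host and port are placeholders, and depending on the build the Ganglia sink may ship in a separate spark-ganglia-lgpl artifact):

  *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
  *.sink.ganglia.host=your-gmond-host
  *.sink.ganglia.port=8649
  *.sink.ganglia.period=10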

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Sean Owen
If you want the machine that hosts the driver to also do work, you can designate it as a worker too, if I'm not mistaken. I don't think the driver should do work, logically, but, that's not to say that the machine it's on shouldn't do work. -- Sean Owen | Director, Data Science | London On Tue,

Re: How to execute a function from class in distributed jar on each worker node?

2014-04-08 Thread Andrew Ash
One thing you could do is create an RDD of [1,2,3] and set a partitioner that puts all three values on their own nodes. Then .foreach() over the RDD and call your function that will run on each node. Why do you need to run the function on every node? Is it some sort of setup code that needs to
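A minimal sketch of that idea, leaving out the custom partitioner and simply using one element per partition (setupOnNode is a hypothetical stand-in for the per-node function; Spark does not strictly guarantee that every partition lands on a distinct node):

  def setupOnNode(): Unit = { /* node-local initialization goes here */ }
  val numNodes = 3
  sc.parallelize(1 to numNodes, numNodes).foreach(_ => setupOnNode())  // runs once per element, i.e. once per partition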

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nicholas Chammas
Alright, so I guess I understand now why spark-ec2 allows you to select different instance types for the driver node and worker nodes. If the driver node is just driving and not doing any large collect()s or heavy processing, it can be much smaller than the worker nodes. With regards to data

Re: Why doesn't the driver node do any work?

2014-04-08 Thread Nan Zhu
may be unrelated to the question itself, just FYI you can run your driver program in worker node with Spark-0.9 http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster Best, -- Nan Zhu On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas

A series of meetups about machine learning with Spark in San Francisco

2014-04-08 Thread DB Tsai
Hi guys, We're going to hold a series of meetups about machine learning with Spark in San Francisco. The first one will be on April 24. Xiangrui Meng from Databricks will talk about Spark, Spark/Python, features engineering, and MLlib. See

Re: [BLOG] For Beginners

2014-04-08 Thread weida xu
Dears, I'm very interested in this. However, the links mentioned above are not accessible from China. Is there any other way to read the two blog pages? Thanks a lot. 2014-04-08 12:54 GMT+08:00 prabeesh k prabsma...@gmail.com: Hi all, Here I am sharing a blog for beginners, about creating

Re: Measuring Network Traffic for Spark Job

2014-04-08 Thread yxzhao
Thanks Andrew, I will take a look at it. On Tue, Apr 8, 2014 at 3:35 PM, Andrew Ash [via Apache Spark User List] ml-node+s1001560n3920...@n3.nabble.com wrote: If you set up Spark's metrics reporting to write to the Ganglia backend that will give you a good idea of how much network/disk/CPU is

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-08 Thread abhietc31
Anybody, please help with the above query. It's challenging but will open a new horizon for in-memory analysis. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p3968.html Sent from the Apache Spark User List

java.io.NotSerializableException exception - custom Accumulator

2014-04-08 Thread Dhimant Jayswal
Hi, I am getting a java.io.NotSerializableException exception while executing the following program. import org.apache.spark.SparkContext._ import org.apache.spark.SparkContext import org.apache.spark.AccumulatorParam object App { class Vector (val data: Array[Double]) {} implicit object VectorAP
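For context, a hedged sketch of how such an accumulator can be wired up; whether this matches the original program is an assumption, but the usual cause of the NotSerializableException is the element class itself, so Vector extends Serializable here:

  class Vector(val data: Array[Double]) extends Serializable

  implicit object VectorAP extends org.apache.spark.AccumulatorParam[Vector] {
    def zero(init: Vector) = new Vector(new Array[Double](init.data.length))
    def addInPlace(a: Vector, b: Vector) = {
      for (i <- 0 until a.data.length) a.data(i) += b.data(i)  // element-wise sum into the first argument
      a
    }
  }

  val acc = sc.accumulator(new Vector(new Array[Double](10)))  // the size 10 is arbitrary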

Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Dong Mo
Dear list, SBT compiles fine, but when I do the following: sbt/sbt gen-idea, import the project as an SBT project into IDEA 13.1, Make Project, these errors show up: Error:(28, 8) object FileContext is not a member of package org.apache.hadoop.fs import org.apache.hadoop.fs.{FileContext, FileStatus,

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread DB Tsai
Hi Dong, This is pretty much what I did. I ran into the same issue you have. Since I'm not developing yarn-related stuff, I just excluded those two yarn-related projects from IntelliJ, and it works. PS, you may need to exclude the java8 project as well now. Sincerely, DB Tsai

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-08 Thread Sean Owen
I let IntelliJ read the Maven build directly and that works fine. -- Sean Owen | Director, Data Science | London On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo monted...@gmail.com wrote: Dear list, SBT compiles fine, but when I do the following: sbt/sbt gen-idea import project as SBT project to

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-09 Thread Xiangrui Meng
After sbt/sbt gen-idea, do not import as an SBT project but choose open project and point it to the spark folder. -Xiangrui On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen so...@cloudera.com wrote: I let IntelliJ read the Maven build directly and that works fine. -- Sean Owen | Director, Data

Re: Preconditions on RDDs for creating a Graph?

2014-04-09 Thread Ankur Dave
On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak ano...@soe.ucsc.edu wrote: What, exactly, needs to be true about the RDDs that you pass to Graph() to be sure of constructing a valid graph? (Do they need to have the same number of partitions? The same number of partitions and no empty partitions?

KafkaReciever Error when starting ssc (Actor name not unique)

2014-04-09 Thread gaganbm
Hi All, I am getting this exception when doing ssc.start to start the streaming context. ERROR KafkaReceiver - Error receiving data akka.actor.InvalidActorNameException: actor name [NetworkReceiver-0] is not unique! at

Spark on YARN performance

2014-04-09 Thread Flavio Pompermaier
Hi to everybody, I'm new to Spark and I'd like to know if running Spark on top of YARN or Mesos could affect its performance (and how much). Is there any doc about this? Best, Flavio

Re: trouble with join on large RDDs

2014-04-09 Thread Andrew Ash
A JVM can easily be limited in how much memory it uses with the -Xmx parameter, but Python doesn't have memory limits built in in such a first-class way. Maybe the memory limits aren't making it to the python executors. What was your SPARK_MEM setting? The JVM below seems to be using 603201

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistence level is set on an RDD, that RDD is saved to memory/disk/elsewhere when it is evaluated, so that it can be re-used. It's applied to that RDD, so that subsequent uses of the RDD can use the
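For example, a minimal sketch of setting a persistence level (the path is a placeholder):

  import org.apache.spark.storage.StorageLevel
  val rdd = sc.textFile("hdfs:///some/path")
  rdd.persist(StorageLevel.MEMORY_AND_DISK)  // keep partitions in memory, spilling to disk when they don't fit
  rdd.count()  // the first action materializes and stores the RDD for later re-use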

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
Which persistence level are you talking about? MEMORY_AND_DISK ? Sent from my mobile phone On Apr 9, 2014 2:28 PM, Surendranauth Hiraman suren.hira...@velos.io wrote: Thanks, Andrew. That helps. For 1, it sounds like the data for the RDD is held in memory and then only written to disk after

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
Marco, If you call spark-ec2 launch without specifying an AMI, it will default to the Spark-provided AMI. Nick On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini silvio.costant...@granatads.com wrote: Hi there, To answer your question; no there is no reason NOT to use an AMI that Spark has

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
And for the record, that AMI is ami-35b1885c. Again, you don't need to specify it explicitly; spark-ec2 will default to it. On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Marco, If you call spark-ec2 launch without specifying an AMI, it will default to

executors not registering with the driver

2014-04-09 Thread azurecoder
Up until last week we had no problems running a Spark standalone cluster. We now have a problem registering executors with the driver node in any application. Although we can start-all and see the worker on 8080, no executors are registered with the blockmanager. The feedback we have is scant but

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Marco Costantini
Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual AMIs. On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: And for the record, that AMI is ami-35b1885c. Again, you don't need to specify it explicitly; spark-ec2 will default to it.

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Shivaram Venkataraman
The AMI should automatically switch between PVM and HVM based on the instance type you specify on the command line. For reference (note you don't need to specify this on the command line), the PVM ami id is ami-5bb18832 in us-east-1. FWIW we maintain the list of AMI Ids (across regions and pvm,

Re: How does Spark handle RDD via HDFS ?

2014-04-09 Thread Andrew Ash
The typical way to handle that use case would be to join the 3 files together into one RDD and then do the factorization on that. There will definitely be network traffic during the initial join to get everything into one table, and after that there will likely be more network traffic for various

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
The groupByKey would be aware of the subsequent persist -- that's part of the reason why operations are lazy. As far as whether it's materialized in memory first and then flushed to disk vs streamed to disk, I'm not sure of the exact behavior. What I'd expect to happen would be that the RDD is

How to change the parallelism level of input dstreams

2014-04-09 Thread Dong Mo
Dear list, A quick question about spark streaming: Say I have this stage set up in my Spark Streaming cluster: batched TCP stream ==> map(expensive computation) ===> ReduceByKey I know I can set the number of tasks for ReduceByKey. But I didn't find a place to specify the parallelism for the

Re: hbase scan performance

2014-04-09 Thread Jerry Lam
Hi Dave, This is the HBase solution to the poor scan performance issue: https://issues.apache.org/jira/browse/HBASE-8369 I encountered the same issue before. To the best of my knowledge, this is not a mapreduce issue; it is an hbase issue. If you are planning to swap out mapreduce and replace it with

Re: Why doesn't the driver node do any work?

2014-04-09 Thread Mayur Rustagi
Also the Driver can run on one of the slave nodes. (you will still need a spark master though for resource allocation etc). Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Tue, Apr 8, 2014 at 2:46 PM, Nan Zhu

cannot run spark shell in yarn-client mode

2014-04-09 Thread Pennacchiotti, Marco
I am pretty new to Spark and I am trying to run the spark shell on a Yarn cluster from the cli (in yarn-client mode). I am able to start the shell with the following command: SPARK_JAR=../spark-0.9.0-incubating/jars/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar \ SPARK_YARN_APP_JAR=emptyfile

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-09 Thread abhietc31
Never mind...plz return it later with interest -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p4014.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

KafkaInputDStream Stops reading new messages

2014-04-09 Thread Kanwaldeep
Spark Streaming job was running on two worker nodes and then there was an error on one of the nodes. The spark job showed as running but no progress was being made and no new messages were processed. Based on the driver log files I see the following errors. I would expect the stream reading would

is it possible to initiate Spark jobs from Oozie?

2014-04-09 Thread Segerlind, Nathan L
Howdy. Is it possible to initiate Spark jobs from Oozie (presumably as a java action)? If so, are there known limitations to this? And would anybody have a pointer to an example? Thanks, Nate

Re: Spark packaging

2014-04-09 Thread Pradeep baji
Thanks Prabeesh. On Wed, Apr 9, 2014 at 12:37 AM, prabeesh k prabsma...@gmail.com wrote: Please refer http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html Regards, prabeesh On Wed, Apr 9, 2014 at 1:04 PM, Pradeep baji pradeep.chanum...@gmail.comwrote:

Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
Hi everyone, We have just posted Spark 0.9.1, which is a maintenance release with bug fixes, performance improvements, better stability with YARN and improved parity of the Scala and Python API. We recommend all 0.9.0 users to upgrade to this stable release. This is the first release since Spark

Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
A small additional note: Please use the direct download links in the Spark Downloads http://spark.apache.org/downloads.html page. The Apache mirrors take a day or so to sync from the main repo, so may not work immediately. TD On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das

Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi all, I have been able to run LR in local mode, but I am facing a problem running it in cluster mode. below is the source script, and the stack trace when running it in cluster mode. I used sbt package to build the project; not sure what it is complaining about. another question I have is for

Multi master Spark

2014-04-09 Thread Pradeep Ch
Hi, I want to enable Spark Master HA in spark. Documentation specifies that we can do this with the help of ZooKeeper. But what I am worried about is how to configure one master with the other, and similarly how do workers know that they have two masters? where do you specify the multi-master

Re: Spark 0.9.1 released

2014-04-09 Thread Matei Zaharia
Thanks TD for managing this release, and thanks to everyone who contributed! Matei On Apr 9, 2014, at 2:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: A small additional note: Please use the direct download links in the Spark Downloads page. The Apache mirrors take a day or so to

Re: Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jagat Singh
Hi Jenny, How are you packaging your jar? Can you please confirm if you have included the MLlib jar inside the fat jar you have created for your code? libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.1-incubating" Thanks, Jagat Singh On Thu, Apr 10, 2014 at 8:05 AM, Jenny

Re: Multi master Spark

2014-04-09 Thread Dmitriy Lyubimov
The only way i know to do this is to use mesos with zookeepers. you specify a zookeeper url as the spark url, one that contains multiple zookeeper hosts. A leader among the multiple mesos masters is then elected thru zookeeper leader election until the current leader dies, at which point mesos will elect another master (if still

Re: programmatic way to tell Spark version

2014-04-09 Thread Nicholas Chammas
Hey Patrick, I've created SPARK-1458 https://issues.apache.org/jira/browse/SPARK-1458 to track this request, in case the team/community wants to implement it in the future. Nick On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: No use case at the moment.

Re: Multi master Spark

2014-04-09 Thread Pradeep Ch
Thanks Dmitriy. But I want multi master support when running spark standalone. Also I want to know if this multi master thing works if I use spark-shell. On Wed, Apr 9, 2014 at 3:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: The only way i know to do this is to use mesos with zookeepers.

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are: 0.9.0: - https://issues.apache.org/jira/browse/SPARK-1323 SNAPSHOT 2014-03-18: - When persist() used and batchSize=1, java.lang.OutOfMemoryError: Java heap space. To me this indicates a memory

Re: Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi Jagat, yes, I did specify mllib in build.sbt: name := "Spark LogisticRegression" version := "1.0" scalaVersion := "2.10.3" libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating" libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "0.9.0-incubating"

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
A very nice addition for us PySpark users in 0.9.1 is the addition of RDD.repartition(), which is not mentioned in the release notes (http://spark.apache.org/releases/spark-release-0-9-1.html)! This is super helpful for when you create an RDD from a gzipped file and then need to explicitly shuffle

Re: pySpark memory usage

2014-04-09 Thread Matei Zaharia
Okay, thanks. Do you have any info on how large your records and data file are? I’d like to reproduce and fix this. Matei On Apr 9, 2014, at 3:52 PM, Jim Blomo jim.bl...@gmail.com wrote: Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are:

Re: trouble with join on large RDDs

2014-04-09 Thread Brad Miller
I set SPARK_MEM in the driver process by setting spark.executor.memory to 10G. Each machine had 32G of RAM and a dedicated 32G spill volume. I believe all of the units are in pages, and the page size is the standard 4K. There are 15 slave nodes in the cluster and the sizes of the datasets I'm

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
Ah, looks good now. It took me a minute to realize that doing a hard refresh on the docs page was missing the RDD class doc page... And thanks for updating the release notes. On Wed, Apr 9, 2014 at 7:21 PM, Tathagata Das tathagata.das1...@gmail.comwrote: Thanks Nick for pointing that out! I

Re: Best way to turn an RDD back into a SchemaRDD

2014-04-09 Thread Michael Armbrust
Good question. This is something we wanted to fix, but unfortunately I'm not sure how to do it without changing the API to RDD, which is undesirable now that the 1.0 branch has been cut. We should figure something out though for 1.1. I've created https://issues.apache.org/jira/browse/SPARK-1460

Re: Multi master Spark

2014-04-09 Thread Aaron Davidson
It is as Jagat said. The Masters do not need to know about one another, as ZooKeeper manages their implicit communication. As for Workers (and applications, such as spark-shell), once a Worker is registered with *some *Master, its metadata is stored in ZooKeeper such that if another Master is

Re: Only TraversableOnce?

2014-04-09 Thread wxhsdp
thank you, it works. after my operation over p, I return p.toIterator, because mapPartitions has an iterator return type, is that right? rdd.mapPartitions{D => {val p = D.toArray; ...; p.toIterator}} -- View this message in context:
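Spelled out, the pattern being discussed looks roughly like this (the body is a placeholder):

  rdd.mapPartitions { iter =>
    val p = iter.toArray  // load the whole partition into memory as an indexed collection
    // ... operate on p ...
    p.toIterator  // mapPartitions expects an Iterator back
  }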

Re: Only TraversableOnce?

2014-04-09 Thread Nan Zhu
Yeah, should be right -- Nan Zhu On Wednesday, April 9, 2014 at 8:54 PM, wxhsdp wrote: thank you, it works. after my operation over p, I return p.toIterator, because mapPartitions has an iterator return type, is that right? rdd.mapPartitions{D => {val p = D.toArray; ...; p.toIterator}}

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
This dataset is uncompressed text at ~54GB. stats() returns (count: 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min: 343) On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Okay, thanks. Do you have any info on how large your records and data file

Re: NPE using saveAsTextFile

2014-04-10 Thread Matei Zaharia
I haven’t seen this but it may be a bug in Typesafe Config, since this is serializing a Config object. We don’t actually use Typesafe Config ourselves. Do you have any nulls in the data itself by any chance? And do you know how that Config object is getting there? Matei On Apr 9, 2014, at

Re: NPE using saveAsTextFile

2014-04-10 Thread Nick Pentreath
Ok I thought it may be closing over the config option. I am using config for job configuration, but extracting vals from that. So not sure why as I thought I'd avoided closing over it. Will go back to source and see where it is creeping in. On Thu, Apr 10, 2014 at 8:42 AM, Matei Zaharia

Re: Where does println output go?

2014-04-10 Thread wxhsdp
rdd.foreach(p => { print(p) }) The above closure gets executed on workers, you need to look at the logs of the workers to see the output. but if i'm in local mode, where are the logs of the local driver? there are no /logs and /work dirs in $SPARK_HOME, which are set in standalone mode. -- View

Re: Shark CDH5 Final Release

2014-04-10 Thread chutium
hi, you can take a look here: http://www.abcn.net/2014/04/install-shark-on-cdh5-hadoop2-spark.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Shark-CDH5-Final-Release-tp3826p4055.html Sent from the Apache Spark User List mailing list archive at

Re: Pig on Spark

2014-04-10 Thread Konstantin Kudryavtsev
Hi Mayur, I wondered if you could share your findings in some way (github, blog post, etc). I guess your experience will be very interesting/useful for many people sent from Lenovo YogaTablet On Apr 8, 2014 8:48 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi Ankit, Thanx for all the work

Re: How does Spark handle RDD via HDFS ?

2014-04-10 Thread gtanguy
Yes, that helps to better understand how Spark works. But that was also what I was afraid of: I think the network communications will take too much time for my job. I will continue to look for a trick in order to avoid network communications. I saw on the Hadoop website that: To minimize global

Re: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Adnan
You need to use a proper HDFS URI with saveAsTextFile. For example: rdd.saveAsTextFile("hdfs://NameNode:Port/tmp/Iris/output.tmp") Regards, Adnan Asaf Lahav wrote Hi, We are using Spark with data files on HDFS. The files are stored as files for predefined hadoop user (hdfs). The folder is

Re: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Adnan
Then the problem is not on the spark side; you have three options, choose any one of them: 1. Change permissions on the /tmp/Iris folder from a shell on the NameNode with the hdfs dfs -chmod command. 2. Run your hadoop service as the hdfs user. 3. Disable dfs.permissions in conf/hdfs-site.xml. Regards, Adnan avito

Fwd: Spark - ready for prime time?

2014-04-10 Thread Andras Nemeth
Hello Spark Users, With the recent graduation of Spark to a top level project (grats, btw!), maybe a well timed question. :) We are at the very beginning of a large scale big data project and after two months of exploration work we'd like to settle on the technologies to use, roll up our sleeves

Re: Spark - ready for prime time?

2014-04-10 Thread Debasish Das
When you say Spark is one of the forerunners for our technology choice, what are the other options you are looking into? I started cross-validation runs on a 40-core, 160 GB spark job using a script... I woke up in the morning and none of the jobs had crashed! And the project just came out of incubation

RE: Executing spark jobs with predefined Hadoop user

2014-04-10 Thread Shao, Saisai
Hi Asaf, The user who runs SparkContext is decided by the code below in SparkContext; normally this user.name is the user who started the JVM. You can start your application with -Duser.name=xxx to specify the username you want; this specified username will be the one used to communicate with HDFS. val

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Spark has been endorsed by Cloudera as the successor to MapReduce. That says a lot... On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth andras.nem...@lynxanalytics.com wrote: Hello Spark Users, With the recent graduation of Spark to a top level project (grats, btw!), maybe a well timed

Spark 0.9.1 PySpark ImportError

2014-04-10 Thread aazout
I am getting a python ImportError on a Spark standalone cluster. I have set the PYTHONPATH on both worker and slave and the package imports properly when I run the PySpark command line on both machines. This only happens with Master - Slave communication. Here is the error below: 14/04/10 13:40:19
