Re: why a machine learning application runs slowly on the Spark cluster

2014-07-30 Thread Tan Tim
input data is evenly distributed to the executors. The input data is on HDFS, not on the Spark cluster. How can I make the data distributed to the executors? On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng men...@gmail.com wrote: The weight vector is usually dense and if you have many
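As a rough sketch of one common way to spread HDFS input across more partitions (and hence more executors), not necessarily what is suggested later in this thread, you can request more input partitions from textFile or repartition afterwards; the path and counts below are placeholders and an existing SparkContext sc is assumed:

    val data = sc.textFile("hdfs:///user/example/train.data", 64)  // request at least 64 input partitions
    val spread = data.repartition(64)                              // or explicitly shuffle into 64 partitions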

Re: Last step of processing is using too much memory.

2014-07-30 Thread Davies Liu
When you do groupBy(), it wants to load all the data into memory for best performance, so you should specify the number of partitions carefully. In Spark master or the upcoming 1.1 release, PySpark can do an external groupBy(), meaning it will dump the data to disk if there is not enough

Re: zip two RDD in pyspark

2014-07-30 Thread Nick Pentreath
parallelize uses the default Serializer (PickleSerializer) while textFile uses UTF8Serializer. You can get around this with index.zip(input_data._reserialize()) (or index.zip(input_data.map(lambda x: x))) (But if you try to just do this, you run into the issue with different number of

Logging in Spark through YARN.

2014-07-30 Thread Archit Thakur
Hi, I want to manage the logging of containers when I run Spark through YARN. I checked that there is an environment variable exposed for a custom log4j.properties. Setting SPARK_LOG4J_CONF to /dir/log4j.properties should ideally make containers use the /dir/log4j.properties file for logging. This doesn't seem

Converting matrix format

2014-07-30 Thread Chengi Liu
Hi, I have an RDD with n rows and m columns, but most of the entries are 0; it's a sparse matrix. I would like to get only the non-zero entries along with their indices. The equivalent Python code would be: for i, x in enumerate(matrix): for j, y in enumerate(x): if y: print i, j, y
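A sketch of a Spark equivalent, under the assumption that the RDD holds one row per element as a Seq[Double] (the sample rows are invented): zipWithIndex supplies the row index, and the inner collect keeps only the non-zero cells.

    val matrix = sc.parallelize(Seq(Seq(0.0, 5.0, 0.0), Seq(0.0, 0.0, 3.0)))   // placeholder rows
    val nonZero = matrix.zipWithIndex().flatMap { case (row, i) =>
      row.zipWithIndex.collect { case (v, j) if v != 0.0 => (i, j, v) }        // (rowIndex, colIndex, value)
    }
    nonZero.collect().foreach { case (i, j, v) => println(s"$i $j $v") }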

Re: Is it possible to read file head in each partition?

2014-07-30 Thread Cheng Lian
What's the format of the file header? Is it possible to filter them out by prefix string matching or regex? On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO raofeng...@gmail.com wrote: It will certainly cause bad performance, since it reads the whole content of a large file into one value,

Re: How to specify the job to run on specific nodes (machines) in the Hadoop YARN cluster?

2014-07-30 Thread Haiyang Fu
It's really a good question! I'm also working on it. On Wed, Jul 30, 2014 at 11:45 AM, adu dujinh...@hzduozhun.com wrote: Hi all, RT. I want to run a job on two specific nodes in the cluster. How do I configure YARN for that? Does a YARN queue help? Thanks

Spark Ooyala Job Server

2014-07-30 Thread nightwolf
Hi all, I'm trying to get the jobserver working with Spark 1.0.1. I've got it building, tests passing, and it connects to my Spark master (e.g. spark://hadoop-001:7077). I can also pre-create contexts. These show up in the Spark master console, i.e. on hadoop-001:8080. The problem is that after I

Re: Is it possible to read file head in each partition?

2014-07-30 Thread Fengyun RAO
Of course we can filter them out. A typical file header is as below: #Software: Microsoft Internet Information Services 7.5 #Version: 1.0 #Date: 2013-07-04 20:00:00 #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus
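Since every header line in this format starts with '#', a simple prefix filter along the lines Cheng suggests would drop them (the path below is a placeholder). Note that this discards the #Fields line outright, so if it is needed to interpret each file's columns it would have to be read separately.

    val logs = sc.textFile("hdfs:///logs/u_ex130704.log")      // placeholder path
    val records = logs.filter(line => !line.startsWith("#"))   // keep only data lines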

NotSerializableException

2014-07-30 Thread Ron Gonzalez
Hi, I took Avro 1.7.7 and recompiled my distribution to fix the issue when dealing with Avro GenericRecord. The issue I had was resolved; I'm referring to AVRO-1476. I also enabled Kryo registration in SparkConf. That said, I am still seeing a NotSerializableException for
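For reference, a sketch of what enabling Kryo registration in SparkConf typically looks like on 1.0.x; the registrator class and the registered Avro class are placeholders, and this shows only the wiring, not a fix for the GenericRecord serialization issue discussed here.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[org.apache.avro.generic.GenericData.Record])   // illustrative registration
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)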

Re: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Larry Xiao
Hi Jianshi, I've met a similar situation before, and my solution was 'ulimit': you can use -a to see your current settings and -n to set the open-files limit (and other limits also); I set -n to 10240. I see spark.shuffle.consolidateFiles helps by reusing open files (so I don't know to what extent

Re: Debugging Task not serializable

2014-07-30 Thread Juan Rodríguez Hortalá
Akhil, Andy, thanks a lot for your suggestions. I will take a look at those JVM options. Greetings, Juan 2014-07-28 18:56 GMT+02:00 andy petrella andy.petre...@gmail.com: Also check the guides for the JVM option that prints messages for such problems. Sorry, sent from phone and don't know

Re: How to submit Pyspark job in mesos?

2014-07-30 Thread daijia
:240] Master ID: 20140730-165621-1526966464-5050-23977 Hostname: CentOS-19 I0730 16:56:21.127964 23977 master.cpp:322] Master started on 192.168.3.91:5050 I0730 16:56:21.127989 23977 master.cpp:332] Master allowing unauthenticated frameworks to register!! I0730 16:56:21.130705 23979 master.cpp:757

Initialize custom serializer on YARN

2014-07-30 Thread Anthony F
I am creating an API that can access data stored using an Avro schema. The API can only know the Avro schema at runtime, when it is passed as a parameter by a user of the API. I need to initialize a custom serializer with the Avro schema on remote worker and driver processes. I've tried to set the

RE: Example standalone app error!

2014-07-30 Thread Alex Minnaar
Hi Andrew, I'm not sure why an assembly would help (I don't have the Spark source code, I have just included Spark core and Spark streaming in my dependencies in my build file). I did try it though and the error is still occurring. I have tried cleaning and refreshing SBT many times as

Re: Avro Schema + GenericRecord to HadoopRDD

2014-07-30 Thread Laird, Benjamin
That makes sense, thanks Chris. I'm currently reworking my code to use the newAPIHadoopRDD with an AvroSequenceFileInputFormat (see below), but I think I'll run into the same issue. I'll give your suggestion a try. val avroRdd = sc.newAPIHadoopFile(fp,

Streaming on different store types

2014-07-30 Thread Flavio Pompermaier
Hi everybody, I have a scenario where I would like to stream data to different persistence types (i.e. SQL DB, graph DB, HDFS, etc.) and perform some filtering and transformation as the data comes in. The problem is maintaining consistency between all the datastores (some operation could fail)

Re: Spark 0.9.1 - saveAsTextFile() exception: _temporary doesn't exist!

2014-07-30 Thread Andrew Ash
Hi Oleg, Did you ever figure this out? I'm observing the same exception also in 0.9.1 and think it might be related to setting spark.speculation=true. My theory is that multiple attempts at the same task start, the first finishes and cleans up the _temporary directory, and then the second fails

the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread William Cox
*TL;DR 50% of the time I can't SSH into either my master or slave nodes and have to terminate all the machines and restart the EC2 cluster setup process.* Hello, I'm trying to set up a Spark cluster on Amazon EC2. I am finding the setup script to be delicate and unpredictable in terms of reliably

Re: Streaming on different store types

2014-07-30 Thread Jörn Franke
Hello, I fear you have to write your own transaction logic for it (coordination, e.g. via ZooKeeper, a transaction log, and depending on your requirements Raft/Paxos etc.). However, before you embark on this journey, ask yourself whether your application really needs it and what data load you expect.

Re: Is there a way to write spark RDD to Avro files

2014-07-30 Thread Lewis John Mcgibbney
Hi, have you checked out SchemaRDD? There should be an example of writing to Parquet files there. BTW, FYI, I was discussing this with the Spark SQL developers last week and possibly using Apache Gora [0] for achieving this. HTH, Lewis [0] http://gora.apache.org On Wed, Jul 30, 2014 at 5:14 AM,

Re: why a machine learning application runs slowly on the Spark cluster

2014-07-30 Thread Xiangrui Meng
It looks reasonable. You can also try the treeAggregate ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89) instead of normal aggregate if the driver needs to collect a large weight vector from each partition. -Xiangrui On Wed,
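As a rough sketch of the suggestion (the data and the aggregation below are invented, not the poster's job, and treeAggregate lives in mllib's RDDFunctions at this point, as the link shows): partial results are combined on the executors in a tree pattern, so the driver does not have to merge one large vector per partition all at once.

    import org.apache.spark.mllib.rdd.RDDFunctions._   // brings treeAggregate into scope on RDDs

    val dim = 2
    val data = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))   // placeholder "gradient" records
    val total = data.treeAggregate(new Array[Double](dim))(
      (acc, arr) => { for (i <- acc.indices) acc(i) += arr(i); acc },  // seqOp: fold a record into the accumulator
      (a, b)     => { for (i <- a.indices) a(i) += b(i); a },          // combOp: merge two partial sums
      depth = 2)                                                       // two-level combine tree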

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Akhil Das
You need to increase the wait time (-w); the default is 120 seconds, and you may set it to a higher number like 300-400. The problem is that EC2 takes some time to initiate the machines (which can sometimes exceed 120 seconds). Thanks Best Regards On Wed, Jul 30, 2014 at 8:52 PM, William Cox

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Nicholas Chammas
William, The error you are seeing is misleading. There is no need to terminate the cluster and start over. Just re-run your launch command, but with the additional --resume option tacked on the end. As Akhil explained, this happens because AWS is not starting up the instances as quickly as the

Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Zongheng Yang
To add to this: for this many (>= 20) machines I usually use at least --wait 600. On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: William, The error you are seeing is misleading. There is no need to terminate the cluster and start over. Just re-run your

Re: Worker logs

2014-07-30 Thread Andrew Or
They are found in the executors' logs (not the worker's). In general, all code inside foreach or map etc. is executed on the executors. You can find these either through the Master UI (under Running Applications) or manually on the worker machines (under $SPARK_HOME/work). -Andrew 2014-07-30

Implementing percentile through top Vs take

2014-07-30 Thread Bharath Ravi Kumar
I'm looking to select the top n records (by rank) from a data set of a few hundred GB. My understanding is that JavaRDD.top(n, comparator) is entirely a driver-side operation in that all records are sorted in the driver's memory. I prefer an approach where the records are sorted on the cluster

Do I need to know Scala to take full advantage of spark?

2014-07-30 Thread Majid Azimi
Hi guys, I'm very new in Spark land, coming from the old-school MapReduce world. I have no idea about Scala. Can the Java/Python APIs compete with the native Scala API? Is Spark heavily Scala-centric, with bindings for other languages only for starting up and testing, so that serious work in Spark will

Re: Keep state inside map function

2014-07-30 Thread Kevin
Thanks to both of you for your input. Looks like I'll play with the mapPartitions function to start porting MapReduce algorithms to Spark. On Wed, Jul 30, 2014 at 1:23 PM, Sean Owen so...@cloudera.com wrote: Really, the analog of a Mapper is not map(), but mapPartitions(). Instead of:
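A small sketch of Sean's point, with made-up input and parsing: mapPartitions is called once per partition, so per-task setup (much like a Mapper's setup()) happens once per partition rather than once per record.

    import java.text.SimpleDateFormat

    val lines = sc.parallelize(Seq("a,2014-07-30", "b,2014-07-29"))   // placeholder input
    val parsed = lines.mapPartitions { iter =>
      val fmt = new SimpleDateFormat("yyyy-MM-dd")                    // expensive setup, done once per partition
      iter.map { line =>
        val Array(key, date) = line.split(",")
        (key, fmt.parse(date).getTime)                                // emit a (key, value) pair per record
      }
    }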

Re: Do I need to know Scala to take full advantage of spark?

2014-07-30 Thread Matei Zaharia
Java is very close to Scala across the board; the only thing missing in it right now is GraphX (which is still alpha). Python is missing GraphX, streaming, and a few of the ML algorithms, though most of them are there. So it should be fine to start with any of them. See

Partitioner to process data in the same order for each key

2014-07-30 Thread Venkat Subramanian
I have a data file that I need to process using Spark. The file has multiple events for different users, and I need to process the events for each user in the order they appear in the file: User 1: Event 1, User 2: Event 1, User 1: Event 2, User 3: Event 1, User 2: Event 2, User 3: Event 2, etc. I want
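One possible sketch, assuming lines look like "User 1 : Event 2" and that one user's events fit in memory: tag each line with its position in the file, partition and group by user, then sort each user's events by that position to restore file order. The path and partition count are placeholders.

    import org.apache.spark.HashPartitioner

    val lines = sc.textFile("hdfs:///data/events.txt")
    val byUser = lines.zipWithIndex()                                  // (line, position in file)
      .map { case (line, pos) =>
        val Array(user, event) = line.split(":").map(_.trim)
        (user, (pos, event))
      }
      .groupByKey(new HashPartitioner(8))                              // all events for a user land together
      .mapValues(_.toSeq.sortBy(_._1).map(_._2))                       // per-user events, in file order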

Re: Implementing percentile through top Vs take

2014-07-30 Thread Sean Owen
No, it's definitely not done on the driver. It works as you say. Look at the source code for RDD.takeOrdered, which is what top calls. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar
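A small illustration of the reply, with a made-up record type: each partition keeps only its own top n candidates, and the driver merges just those candidates rather than the full data set.

    case class Record(id: String, rank: Double)

    val records = sc.parallelize(Seq(Record("a", 0.9), Record("b", 0.4), Record("c", 0.7)))
    val top2 = records.top(2)(Ordering.by((r: Record) => r.rank))   // highest-ranked 2 records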

Re: evaluating classification accuracy

2014-07-30 Thread SK
I am using 1.0.1 and I am running locally (I am not providing any master URL). But the zip() does not produce the correct count as I mentioned above. So not sure if the issue has been fixed in 1.0.1. However, instead of using zip, I am now using the code that Sean has mentioned and am getting the

Re: Decision Tree requires regression LabeledPoint

2014-07-30 Thread SK
I have also used LabeledPoint or LIBSVM format (for sparse data) for DecisionTree. When I had categorical labels (not features), I mapped the categories to numerical values as part of the data transformation step (i.e. before creating the LabeledPoint).
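A sketch of that mapping step with an invented label set and feature values: string class labels become numeric indices before the LabeledPoints are built.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val labelIndex = Map("setosa" -> 0.0, "versicolor" -> 1.0, "virginica" -> 2.0)   // hypothetical classes
    val raw = sc.parallelize(Seq(("setosa", Array(5.1, 3.5)), ("virginica", Array(6.3, 3.3))))
    val points = raw.map { case (label, features) =>
      LabeledPoint(labelIndex(label), Vectors.dense(features))
    }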

Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1. I have an RDD<String> which I've repartitioned so it has 100 partitions (hoping to increase the parallelism). When I do a transformation (such as filter) on this RDD, I can't seem to get more than 24 tasks (my total number of

Re: Spark SQL JDBC Connectivity

2014-07-30 Thread Venkat Subramanian
For the time being, we decided to take a different route. We created a REST API layer in our app and allowed SQL queries to be passed in via REST. Internally we pass that query to the Spark SQL layer on the RDD and return the results. With this, Spark SQL is supported for our RDDs via the REST API
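A rough sketch of that pattern with invented names, using the Spark 1.0.x API: the REST layer hands a SQL string to Spark SQL, which runs it against an RDD registered as a table.

    import org.apache.spark.sql.SQLContext

    case class Event(name: String, hits: Int)                          // hypothetical record type

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD                                   // implicit RDD -> SchemaRDD conversion
    val events = sc.parallelize(Seq(Event("a", 3), Event("b", 7)))
    events.registerAsTable("events")                                    // expose the RDD as a table
    val result = sqlContext.sql("SELECT name, hits FROM events WHERE hits > 5").collect()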

Re: Spark SQL JDBC Connectivity

2014-07-30 Thread Michael Armbrust
Very cool. Glad you found a solution that works. On Wed, Jul 30, 2014 at 1:04 PM, Venkat Subramanian vsubr...@gmail.com wrote: For the time being, we decided to take a different route. We created a Rest API layer in our app and allowed SQL query passing via the Rest. Internally we pass that

Installing Spark 1.0.1

2014-07-30 Thread cetaylor
Hello, I am attempting to install Spark 1.0.1 on a Windows machine but I've been running into some difficulties. When I attempt to run some examples I am always met with the same response: Failed to find Spark assembly JAR. You need to build Spark with sbt\sbt assembly before running this

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Daniel Siegmann
This is correct behavior. Each core can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 24 cores, you can only run at most 24 tasks at once. You could run multiple workers per node to get more executors. That would give you more cores in

Re: Data from Mysql using JdbcRDD

2014-07-30 Thread chaitu reddy
Kc On Jul 30, 2014 3:55 PM, srinivas kusamsrini...@gmail.com wrote: Hi, I am trying to get data from MySQL using JdbcRDD with the following code. The table has three columns: val url = jdbc:mysql://localhost:3306/studentdata val username = root val password = root val mysqlrdd = new

Spark fault tolerance after a executor failure.

2014-07-30 Thread Sung Hwan Chung
I sometimes see that after fully caching the data, if one of the executors fails for some reason, that portion of the cache gets lost and does not get re-cached, even though there are plenty of resources. Is this a bug or normal behavior (V1.0.1)?

Re: Data from Mysql using JdbcRDD

2014-07-30 Thread Josh Mahonin
Hi Srini, I believe the JdbcRDD requires input splits based on ranges within the query itself. As an example, you could adjust your query to something like: SELECT * FROM student_info WHERE id >= ? AND id <= ? Note that the values you've passed in '1, 20, 2' correspond to the lower bound index,
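A sketch of that suggestion against the poster's table (the column names are assumed): the two '?' placeholders are filled from the lower and upper bounds, and the id range is split into numPartitions sub-ranges, one query per partition.

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val mysqlRdd = new JdbcRDD(sc,
      () => DriverManager.getConnection("jdbc:mysql://localhost:3306/studentdata", "root", "root"),
      "SELECT * FROM student_info WHERE id >= ? AND id <= ?",
      lowerBound = 1, upperBound = 20, numPartitions = 2,
      mapRow = rs => (rs.getInt("id"), rs.getString("name")))          // the "name" column is assumed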

Deploying spark applications from within Eclipse?

2014-07-30 Thread nunarob
Hi all, I'm a new Spark user and I'm interested in deploying Java-based Spark applications directly from Eclipse. Specifically, I'd like to write a Spark/Java application, have it directly submitted to a Spark cluster, and then see the output of the Spark execution directly in the Eclipse

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
Thanks. So, to make sure I understand: since I'm using a 'stand-alone' cluster, I would set SPARK_WORKER_INSTANCES to something like 2 (instead of the default value of 1). Is that correct? But it also sounds like I need to explicitly set a value for SPARK_WORKER_CORES (based on what the

Re: Graphx : Perfomance comparison over cluster

2014-07-30 Thread Ankur Dave
ShreyanshB shreyanshpbh...@gmail.com writes: The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. It'd be great if you can tell me how to configure and invoke this spark version. Sorry for the delay on this. Assuming you're planning to launch an EC2

Re: How do you debug a PythonException?

2014-07-30 Thread Davies Liu
The exception in Python means that the worker tried to read a command from the JVM but reached the end of the socket (the socket had been closed). So it's possible that another exception happened in the JVM. Could you change the log4j log level, then check whether there is any problem inside the JVM? Davies On Wed,

Re: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Jianshi Huang
Ok... but my question is why spark.shuffle.consolidateFiles isn't working (or is it?). Is this a bug? On Wed, Jul 30, 2014 at 4:29 PM, Larry Xiao xia...@sjtu.edu.cn wrote: Hi Jianshi, I've met similar situation before. And my solution was 'ulimit', you can use -a to see your current settings