subscribe

2014-07-28 Thread James Todd

Re: Spark as an application library vs infra

2014-07-28 Thread Sandy Ryza
At Cloudera we recommend bundling your application separately from the Spark libraries. The two biggest reasons are: * No need to modify your application jar when upgrading or applying a patch. * When running on YARN, the Spark jar can be cached as a YARN local resource, meaning it doesn't need
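
A minimal build.sbt sketch of what Sandy describes, assuming the application jar is built with something like sbt-assembly; the project name and version numbers are placeholders. Marking Spark as "provided" keeps it out of the application jar, so the cluster's Spark installation (or the YARN-cached Spark jar) supplies it at runtime:

    // Hypothetical build.sbt: Spark is available at compile time but is not
    // bundled into the application/assembly jar.
    name := "my-spark-app"                     // placeholder project name

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1" % "provided"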

Re: Hadoop Input Format - newAPIHadoopFile

2014-07-28 Thread chang cheng
Here is a tutorial on how to customize your own file format in Hadoop: https://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat Once you have your own file format, you can use it in Spark the same way as TextInputFormat, as you have done in this post. -- View this message in
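
As a rough sketch of the "use it the same way as TextInputFormat" point, here is how any new-API InputFormat plugs into spark-shell. KeyValueTextInputFormat (Hadoop 2.x) stands in for the custom format from the tutorial, and the path is a placeholder:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

    // Any org.apache.hadoop.mapreduce InputFormat can be passed as the third
    // type parameter; a custom format built from the tutorial would go here instead.
    val kv = sc.newAPIHadoopFile[Text, Text, KeyValueTextInputFormat]("hdfs:///path/to/data")
      .map { case (k, v) => (k.toString, v.toString) }   // copy values out of reused Writables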

VertexPartition and ShippableVertexPartition

2014-07-28 Thread shijiaxin
There is a VertexPartition in the EdgePartition, which is created by EdgePartitionBuilder.toEdgePartition, and there is also a ShippableVertexPartition in the VertexRDD. These two partitions have a lot of things in common, like index, data and BitSet. Why is this necessary? -- View this message in

[Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow。

2014-07-28 Thread Earthson
I'm using SparkSQL with Hive 0.13. Here is the SQL for inserting a partition with 2048 buckets: sqlsc.set("spark.sql.shuffle.partitions", "2048") hql(|insert %s table mz_log |PARTITION (date='%s') |select * from tmp_mzlog

Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
The value in (key, value) returned by textFile is exactly one line of the input, but what I want is the field between the two "!!" markers. Hope this makes sense. - Senior in Tsinghua Univ. github: http://www.github.com/uronce-cc -- View this message in context:

NotSerializableException exception while using TypeTag in Scala 2.10

2014-07-28 Thread Aniket Bhatnagar
I am trying to serialize objects contained in RDDs using runtime reflection via TypeTag. However, the Spark job keeps failing with java.io.NotSerializableException on an instance of TypeCreator (auto-generated by the compiler to enable TypeTags). Is there any workaround for this without switching to scala

Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Nope. My input file's format is: !! string1 string2 !! string3 string4 sc.textFile(path) will return RDD(!!, string1, string2, !!, string3, string4). What we need now is to transform this RDD into RDD(string1, string2, string3, string4); your solution may not handle this. - Senior in

Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread Sean Owen
Oh, you literally mean these are different lines, not the structure of a line. You can't solve this in general by reading the entire file into one string. If the input is tens of gigabytes you will probably exhaust memory on any of your machines. (Or, you might as well not bother with Spark

Re: Confusing behavior of newAPIHadoopFile

2014-07-28 Thread chang cheng
Exactly, the fields between the !! markers are a customized (key, value) data structure, so newAPIHadoopFile may be the best practice now. For this specific format, changing the delimiter from the default \n to !!\n is the cheapest fix. This can only be done in Hadoop 2.x; in Hadoop 1.x, this can be done by
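
A sketch of that delimiter approach for Hadoop 2.x, assuming the records really are separated by "!!" on its own line; the input path is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Make Hadoop treat everything between "!!\n" markers as one record.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", "!!\n")

    val records = sc
      .newAPIHadoopFile("hdfs:///path/to/input",
        classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map { case (_, text) => text.toString.trim }
      .filter(_.nonEmpty)   // drop the empty record produced before the first "!!"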

Re: Bad Digest error while doing aws s3 put

2014-07-28 Thread lmk
Hi, I was using saveAsTextFile earlier and it was working fine. When we migrated to spark-1.0, I started getting the following error: java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 java.net.URLClassLoader$1.run(URLClassLoader.java:366)

sbt directory missed

2014-07-28 Thread redocpot
Hi, I have started an EC2 cluster using Spark by running the spark-ec2 script. Just a little confused: I cannot find the sbt/ directory under /spark. I have checked the Spark version; it's 1.0.0 (default). When I was working with 0.9.x, sbt/ was there. Has the script changed in 1.0.x? I can not find any

Re: sbt directory missed

2014-07-28 Thread redocpot
update: Just checked the Python launch script. When retrieving Spark, it refers to this script: https://github.com/mesos/spark-ec2/blob/v3/spark/init.sh where each version number is mapped to a tar file, 0.9.2) if [[ $HADOOP_MAJOR_VERSION == 1 ]]; then wget

Re: Fraud management system implementation

2014-07-28 Thread Sandy Ryza
+user list bcc: dev list It's definitely possible to implement credit fraud management using Spark. A good start would be using some of the supervised learning algorithms that Spark provides in MLlib (logistic regression or linear SVMs). Spark doesn't have any HMM implementation right now.
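
A hedged starting-point sketch along the lines Sandy suggests. The CSV path and the label/feature layout (first column = 1.0 for fraud, remaining columns numeric features) are assumptions for illustration only:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Parse transactions into LabeledPoints: label 1.0 = fraud, 0.0 = legitimate.
    val training = sc.textFile("hdfs:///path/to/transactions.csv").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }.cache()

    val model = LogisticRegressionWithSGD.train(training, 100)   // 100 iterations
    val suspicious = training.filter(p => model.predict(p.features) == 1.0)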

Re: Debugging Task not serializable

2014-07-28 Thread Akhil Das
A quick fix would be to implement java.io.Serializable in those classes which are causing this exception. Thanks Best Regards On Mon, Jul 28, 2014 at 9:21 PM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi all, I was wondering if someone has conceived a method for

Re: Debugging Task not serializable

2014-07-28 Thread andy petrella
Also check the guides for the JVM option that prints messages for such problems. Sorry, sent from phone and don't know it by heart :/ Le 28 juil. 2014 18:44, Akhil Das ak...@sigmoidanalytics.com a écrit : A quick fix would be to implement java.io.Serializable in those classes which are causing
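
A sketch of the quick fix Akhil mentions, plus the JVM switch Andy is most likely thinking of: -Dsun.io.serialization.extendedDebugInfo=true (passed e.g. via spark-submit's --driver-java-options), which makes the NotSerializableException stack trace name the offending field chain. The helper class below is hypothetical:

    // Hypothetical helper that gets captured by a closure; making it
    // Serializable lets Spark ship it to the executors.
    class FeatureExtractor extends Serializable {
      def extract(line: String): Int = line.length
    }

    val extractor = new FeatureExtractor
    val lengths = sc.textFile("hdfs:///path/to/data").map(line => extractor.extract(line))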

Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-28 Thread Debasish Das
Hi Aureliano, Will it be possible for you to give the test case? You can add it to the JIRA as an attachment, I guess... I am preparing the PR for the ADMM-based QuadraticMinimizer... In my MATLAB experiments with scaling the rank to 1000 and beyond (which is too high for ALS but gives a good

Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-28 Thread Shuo Xiang
It is possible that the answers (the final solution vector x) given by two different algorithms (such as the one in MLlib and the one in R) are different, as the problem may not be strictly convex and multiple global optima may exist. However, these answers should admit the same objective value. Can you
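
A small sketch of the comparison Shuo suggests: instead of comparing solution vectors element by element, compare their objective values 0.5 * ||Ax - b||^2 (plain-array version; names are illustrative):

    // Objective value 0.5 * ||A x - b||^2 for a candidate solution x.
    def objective(a: Array[Array[Double]], b: Array[Double], x: Array[Double]): Double = {
      val residual = a.map(row => row.zip(x).map { case (aij, xj) => aij * xj }.sum)
        .zip(b).map { case (ax, bi) => ax - bi }
      0.5 * residual.map(r => r * r).sum
    }

    // Two solvers are "equally right" if objective(a, b, xFromMllib) and
    // objective(a, b, xFromR) agree, even when the vectors themselves differ.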

akka.tcp://spark@localhost:7077/user/MapOutputTracker akka.actor.ActorNotFound

2014-07-28 Thread Andrew Milkowski
19678 fetcher.cpp:102] Downloading resource from 'hdfs://localhost:8020/spark/spark-1.0.0-cdh5.1.0-bin-2.3.0-cdh5.0.3.tgz' to '/tmp/mesos/slaves/20140724-134606-16777343-5050-25095-0/frameworks/20140728-143300-24815808-5050-19059-/executors/20140724-134606-16777343-5050-25095-0/runs/c9c9eaa2-b722

Re: VertexPartition and ShippableVertexPartition

2014-07-28 Thread Ankur Dave
On Mon, Jul 28, 2014 at 4:29 AM, Larry Xiao xia...@sjtu.edu.cn wrote: On 7/28/14, 3:41 PM, shijiaxin wrote: There is a VertexPartition in the EdgePartition,which is created by EdgePartitionBuilder.toEdgePartition. and There is also a ShippableVertexPartition in the VertexRDD. These two

how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
Hey, we used to publish Spark in-house by simply overriding the publishTo setting, but now that the SBT build is integrated with Maven I cannot find it anymore. I tried looking into the pom file, but after reading 1144 lines of XML I 1) haven't found anything that looks like publishing, 2) I feel

javasparksql Hbase

2014-07-28 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please let me know an example program/link for Java Spark SQL to join 2 HBase tables. Regards, Rajesh

Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
And if I want to change the version, it seems I have to change it in all 23 pom files? Mhhh. Is it mandatory for these sub-project pom files to repeat that version info? Useful? spark$ grep 1.1.0-SNAPSHOT * -r | wc -l 23 On Mon, Jul 28, 2014 at 3:05 PM, Koert Kuipers ko...@tresata.com wrote:

Re: how to publish spark inhouse?

2014-07-28 Thread Sean Owen
This is not something you edit yourself. The Maven release plugin manages setting all this. I think virtually everything you're worried about is done for you by this plugin. Maven requires artifacts to set a version and it can't inherit one. I feel like I understood the reason this is necessary

Spark java.lang.AbstractMethodError

2014-07-28 Thread Alex Minnaar
I am trying to run an example Spark standalone app with the following code: import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ object SparkGensimLDA extends App{ val ssc = new StreamingContext("local", "testApp", Seconds(5)) val

Re: how to publish spark inhouse?

2014-07-28 Thread Koert Kuipers
Ah ok, thanks. Guess I am gonna read up about the maven-release-plugin then! On Mon, Jul 28, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote: This is not something you edit yourself. The Maven release plugin manages setting all this. I think virtually everything you're worried about is done

Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi All, Not sure if anyone has run into this problem, but it exists in Spark 1.0.0 when you specify spark.eventLog.dir in conf/spark-defaults.conf as hdfs:///user/$USER/spark/logs to use the $USER env variable. For example, I'm running the command with user 'test'. In

Re: sbt directory missed

2014-07-28 Thread Shivaram Venkataraman
I think the 1.0 AMI only contains the prebuilt packages (i.e. just the binaries) of Spark and not the source code. If you want to build Spark on EC2, you can clone the GitHub repo and then use sbt. Thanks Shivaram On Mon, Jul 28, 2014 at 8:49 AM, redocpot julien19890...@gmail.com wrote:

Re: Spark streaming vs. spark usage

2014-07-28 Thread Nathan Kronenfeld
So after months and months, I finally started to try and tackle this, but my scala ability isn't up to it. The problem is that, of course, even with the common interface, we don't want inter-operability between RDDs and DStreams. I looked into Monads, as per Ashish's suggestion, and I think I

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, thanks for the explanation. 1. You said we have to broadcast m * k centers (with m = number of rows). I thought there were only k centers at any time, which would then have size n * k and need to be broadcast. Is that a typo or did I misunderstand something? And the

zip two RDD in pyspark

2014-07-28 Thread lllll
I have a file in S3 where I want to map each line with an index. Here is my code: input_data = sc.textFile('s3n:/myinput', minPartitions=6).cache() N = input_data.count() index = sc.parallelize(range(N), 6) index.zip(input_data).collect() ... 14/07/28 19:49:31 INFO DAGScheduler: Completed

Re: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Or
Hi Andrew, It's definitely not bad practice to use spark-shell with HistoryServer. The issue here is not with spark-shell, but the way we pass Spark configs to the application. spark-defaults.conf does not currently support embedding environment variables, but instead interprets everything as a
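
One hedged workaround sketch, since spark-defaults.conf takes values literally: expand the user name in application code and set the property on the SparkConf directly. The app name and path layout here are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Expand the current user ourselves instead of relying on $USER in
    // spark-defaults.conf, which is not interpolated.
    val user = sys.props.getOrElse("user.name", "unknown")
    val conf = new SparkConf()
      .setAppName("event-log-example")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", s"hdfs:///user/$user/spark/logs")
    val sc = new SparkContext(conf)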

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
1. I meant in the n (1k) by m (10k) case, we need to broadcast k centers and hence the total size is m * k. In 1.0, the driver needs to send the current centers to each partition one by one. In the current master, we use torrent to broadcast the centers to workers, which should be much faster. 2.

RE: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi Andrew, Thanks for re-confirming the problem. I thought it only happened with my own build. :) By the way, we have multiple users using the spark-shell to explore their datasets, and we are continuously looking into ways to isolate their job history. In the current situation, we can't really ask

RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-28 Thread Andrew Lee
Hi Jianshi, My understanding is 'No', based on how Spark is designed, even with your own log4j.properties in Spark's conf folder. In YARN mode, the Application Master runs inside the cluster and all logs are part of the container logs, which are defined by another log4j.properties file from

Re: sbt directory missed

2014-07-28 Thread redocpot
Thank you for your reply. I need sbt for packaging my project and then submitting it. Could you tell me how to run a Spark project on the 1.0 AMI without sbt? I don't understand why 1.0 only contains the prebuilt packages. I don't think it makes sense, since sbt is essential. Users have to download sbt

Re: Getting the number of slaves

2014-07-28 Thread Sung Hwan Chung
Do getExecutorStorageStatus and getExecutorMemoryStatus both return the number of executors + the driver? E.g., if I submit a job with 10 executors, I get 11 for getExecutorStorageStatus.length and getExecutorMemoryStatus.size On Thu, Jul 24, 2014 at 4:53 PM, Nicolas Mai nicolas@gmail.com

Re: Getting the number of slaves

2014-07-28 Thread Andrew Or
Yes, both of these are derived from the same source, and this source includes the driver. In other words, if you submit a job with 10 executors you will get back 11 for both statuses. 2014-07-28 15:40 GMT-07:00 Sung Hwan Chung coded...@cs.stanford.edu: Do getExecutorStorageStatus and
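
So a hedged way to count only the executors from the driver is:

    // Both statuses include the driver's own block manager, so subtract one.
    val numExecutors = sc.getExecutorStorageStatus.length - 1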

Re: Spark streaming vs. spark usage

2014-07-28 Thread Ankur Dave
On Mon, Jul 28, 2014 at 12:53 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com wrote: But when done processing, one would still have to pull out the wrapped object, knowing what it was, and I don't see how to do that. It's pretty tricky to get the level of type safety you're looking for. I

ssh connection refused

2014-07-28 Thread sparking
I'm trying to launch Spark with this command on AWS: *./spark-ec2 -k keypair_name -i keypair.pem -s 5 -t c1.xlarge -r us-west-2 --hadoop-major-version=2.4.0 launch spark_cluster* This script is erroring out with this message: *ssh: connect to host hostname port 22: Connection refused Error

Re: how to publish spark inhouse?

2014-07-28 Thread Patrick Wendell
All of the scripts we use to publish Spark releases are in the Spark repo itself, so you could follow these as a guideline. The publishing process in Maven is similar to in SBT: https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L65 On Mon, Jul 28, 2014 at 12:39 PM,

evaluating classification accuracy

2014-07-28 Thread SK
Hi, In order to evaluate the ML classification accuracy, I am zipping up the prediction and test labels as follows and then comparing the pairs in predictionAndLabel: val prediction = model.predict(test.map(_.features)) val predictionAndLabel = prediction.zip(test.map(_.label)) However, I am

hdfs.BlockMissingException on Iterator.hasNext() in mapPartitionsWithIndex()

2014-07-28 Thread innowireless TaeYun Kim
Hi, I'm trying to split one large multi-field text file into many single-field text files. My code is like this (somewhat simplified): final Broadcast<ColSchema> bcSchema = sc.broadcast(schema); final String outputPathName = env.outputPathName;

RE: streaming sequence files?

2014-07-28 Thread Barnaby Falls
Running as a Standalone Cluster. From my monitoring console: Spark Master at spark://101.73.54.149:7077 * URL: spark://101.73.54.149:7077 * Workers: 1 * Cores: 2 Total, 0 Used * Memory: 2.4 GB Total, 0.0 B Used * Applications: 0 Running, 24

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow。

2014-07-28 Thread Earthson
spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to ... takes too much time. What should I do? What is the correct configuration? The BlockManager times out if I use a small number of reduce partitions.

Re: ssh connection refused

2014-07-28 Thread Google
This may occur while the EC2 instances are not ready and the SSH port is not open yet. Please allow more time by specifying -w 300; the default should be 120. Thanks, Tracy Sent from my iPhone On July 29, 2014, at 8:17 AM, sparking research...@gmail.com wrote: I'm trying to launch Spark with this command

The function of ClosureCleaner.clean

2014-07-28 Thread Wang, Jensen
Hi, All. Before sc.runJob invokes dagScheduler.runJob, the func performed on the RDD is cleaned by ClosureCleaner.clean. Why does Spark have to do this? What's the purpose?

Re: The function of ClosureCleaner.clean

2014-07-28 Thread Mayur Rustagi
I am not sure about the specific purpose of this function, but Spark needs to remove elements from the closure that may be included by default but are not really needed, so that it can serialize the closure and send it to executors to operate on the RDD. For example, a function in the map function of an RDD may reference
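
A small sketch of the kind of accidental capture Mayur describes (the Pipeline class is made up): referencing a field drags the whole enclosing object into the closure, while copying the field into a local val, which is roughly what the closure cleaner tries to achieve automatically, avoids that.

    import org.apache.spark.rdd.RDD

    class Pipeline(val factor: Int) {         // note: not Serializable
      def scaleBad(data: RDD[Int]): RDD[Int] =
        data.map(_ * factor)                  // captures `this`, i.e. the whole Pipeline

      def scaleGood(data: RDD[Int]): RDD[Int] = {
        val f = factor                        // copy just the needed value
        data.map(_ * f)                       // closure now only captures an Int
      }
    }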

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-28 Thread Jianshi Huang
I see Andrew, thanks for the explanation. On Tue, Jul 29, 2014 at 5:29 AM, Andrew Lee alee...@hotmail.com wrote: I was thinking maybe we can suggest the community to enhance the Spark HistoryServer to capture the last failure exception from the container logs in the last failed stage?

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread durin
Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes in a reasonable time. I guess using torrent helps a lot in this case. Best regards, Simon -- View this message in context:

Reading hdf5 formats with pyspark

2014-07-28 Thread Mohit Singh
Hi, We have set up Spark on an HPC system and are trying to implement some data pipelines and algorithms in place. The input data is in hdf5 (these are very high resolution brain images) and it can be read via the h5py library in Python. So, my current approach (which seems to be working) is writing

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
Great! Thanks for testing the new features! -Xiangrui On Mon, Jul 28, 2014 at 8:58 PM, durin m...@simon-schaefer.net wrote: Hi Xiangrui, using the current master meant a huge improvement for my task. Something that did not even finish before (training with 120G of dense data) now completes

Re: evaluating classification accuracy

2014-07-28 Thread Xiangrui Meng
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and master. If you don't want to switch to 1.0.1 or master, try to cache and count test first. -Xiangrui On Mon, Jul 28, 2014 at 6:07 PM, SK skrishna...@gmail.com wrote: Hi, In order to evaluate the ML classification accuracy, I am
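
Putting Xiangrui's workaround together with the original snippet, a rough sketch for 1.0.0 (assuming the same model and test RDD as in the question):

    // Work around the 1.0.0 zip issue by materializing `test` first.
    test.cache()
    test.count()

    val prediction = model.predict(test.map(_.features))
    val predictionAndLabel = prediction.zip(test.map(_.label))
    val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count().toDouble / test.count()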

Re: Reading hdf5 formats with pyspark

2014-07-28 Thread Xiangrui Meng
That looks good to me since there is no Hadoop InputFormat for HDF5. But remember to specify the number of partitions in sc.parallelize to use all the nodes. You can change `process` to `read` which yields records one-by-one. Then sc.parallelize(files, numPartitions).flatMap(read) returns an RDD
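
For readers following along in Scala, the same pattern looks roughly like this (the thread itself is PySpark/h5py; readRecords is a stand-in for the real per-file HDF5 reader and the paths are made up):

    // Distribute the file list, then let each task read its files record by record.
    val files = Seq("/data/scan1.h5", "/data/scan2.h5", "/data/scan3.h5")
    val numPartitions = files.size

    def readRecords(path: String): Iterator[(String, Int)] =
      (1 to 3).iterator.map(i => (path, i))   // placeholder for real HDF5 parsing

    val records = sc.parallelize(files, numPartitions).flatMap(path => readRecords(path))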

HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-28 Thread nikroy16
Hi, Even though hive.metastore.warehouse.dir in hive-site.xml is set to the default user/hive/warehouse and the permissions are correct in HDFS, HiveContext seems to be creating the metastore locally instead of in HDFS. After looking into the Spark code, I found the following in HiveContext.scala:

Joining spark user group

2014-07-28 Thread jitendra shelar

SparkSQL can not use SchemaRDD from Hive

2014-07-28 Thread Kevin Jung
Hi, I got an error message while using Hive and SparkSQL. This is the code snippet I used (in spark-shell, 1.0.0): val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ val hive = new org.apache.spark.sql.hive.HiveContext(sc) var sample = hive.hql("select * from sample10") // This