Array and RDDs

2014-09-05 Thread Deep Pradhan
Hi, I have an input file from which I have created an RDD consisting of key-value pairs, where the key is the node id and the values are the children of that node. Now I want to associate a byte with each node. For that I have created a byte array. Every time I print out the key-value pair in th

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
I will add dice, overlap, and jaccard similarity in a future PR, probably still for 1.2 On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das wrote: > Awesome...Let me try it out... > > Any plans of putting other similarity measures in future (jaccard is > something that will be useful) ? I guess it mak

Re: Huge matrix

2014-09-05 Thread Debasish Das
Awesome... Let me try it out... Any plans to add other similarity measures in the future (jaccard is something that will be useful)? I guess it makes sense to add some similarity measures to MLlib... On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh wrote: > Yes you're right, calling dimsum with gamm

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Yes, you're right: calling dimsum with gamma as PositiveInfinity turns it into the usual brute-force algorithm for cosine similarity; there is no sampling. This is by design. On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das wrote: > I looked at the code: similarColumns(Double.posInf) is generating t
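A minimal sketch of the two modes being discussed, assuming the RowMatrix.columnSimilarities API as released in MLlib (the PR under discussion still calls it similarColumns); the data and threshold here are purely illustrative, with sc taken from a spark-shell:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Illustrative sparse rows; in practice this is the RDD[(product, vector)] being discussed
    val rows = sc.parallelize(Seq(
      Vectors.sparse(4, Seq((0, 1.0), (2, 3.0))),
      Vectors.sparse(4, Seq((1, 2.0), (3, 4.0)))))
    val mat = new RowMatrix(rows)

    // No threshold: exact, brute-force cosine similarity between all column pairs
    val exact = mat.columnSimilarities()

    // DIMSUM with a threshold: sampling kicks in, with accuracy guarantees for
    // column pairs whose cosine similarity is at or above the threshold
    val approx = mat.columnSimilarities(0.1)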

Re: Huge matrix

2014-09-05 Thread Debasish Das
I looked at the code: similarColumns(Double.posInf) is generating the brute force... Basically, will dimsum with gamma as PositiveInfinity produce the exact same result as doing Cartesian products of RDD[(product, vector)] and computing similarities, or will there be some approximation? Sorry I hav

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
For 60M x 10K brute force and dimsum thresholding should be fine. For 60M x 10M probably brute force won't work depending on the cluster's power, and dimsum thresholding should work with appropriate threshold. Dimensionality reduction should help, and how effective it is will depend on your appli

Re: Huge matrix

2014-09-05 Thread Debasish Das
Also, for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to say ~60M x 50 and then run all-pairs similarity... Did you also try similar ideas and see positive results? On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das wrote: > Ok...ju
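A hedged sketch of that reduce-then-compare idea using MLlib's RowMatrix SVD; whether it is feasible at 60M x 10M depends heavily on the cluster, and mat here is a placeholder for the matrix described above:

    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // mat: a RowMatrix of ~60M sparse rows x 10M columns (placeholder)
    val svd = mat.computeSVD(50, computeU = true)        // keep the top 50 latent dimensions
    val reduced = svd.U.multiply(Matrices.diag(svd.s))   // ~60M x 50 RowMatrix
    // the pairwise-similarity step then runs in this much smaller space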

Re: Huge matrix

2014-09-05 Thread Debasish Das
Ok... just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M and columns are ~10M, say with a billion data points... I have another version that's around 60M and ~10K... I guess for the second one both all-pairs and dimsum will run fine... But for tall and wide, what do you suggest? c

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
You might want to wait until Wednesday, since the interface will be changing in that PR before Wednesday, probably over the weekend, so that you don't have to redo your code. It's your call if you need it sooner than that. Reza On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das wrote: > Ohh cool...all-pairs

Re: Huge matrix

2014-09-05 Thread Debasish Das
Ohh cool...all-pairs brute force is also part of this PR? Let me pull it in and test on our dataset... Thanks. Deb On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh wrote: > Hi Deb, > > We are adding all-pairs and thresholded all-pairs via dimsum in this PR: > https://github.com/apache/spark/pull/1

Re: Huge matrix

2014-09-05 Thread Reza Zadeh
Hi Deb, We are adding all-pairs and thresholded all-pairs via dimsum in this PR: https://github.com/apache/spark/pull/1778 Your question wasn't entirely clear - does this answer it? Best, Reza On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das wrote: > Hi Reza, > > Have you compared with the brute

Re: Huge matrix

2014-09-05 Thread Debasish Das
Hi Reza, Have you compared with the brute force algorithm for similarity computation with something like the following in Spark? https://github.com/echen/scaldingale I am adding cosine similarity computation, but I do want to compute all-pairs similarities... Note that the data is sparse for

Re: spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread marylucy
I set it to 200; it still failed in the second step (map and mapPartitions in the web UI). In the stable Spark 1.0.2 version it works well in the first step, with the same configuration as 1.1.0.

prepending jars to the driver class path for spark-submit on YARN

2014-09-05 Thread Penny Espinoza
Hey - I’m struggling with some dependency issues with org.apache.httpcomponents httpcore and httpclient when using spark-submit with YARN running Spark 1.0.2 on a Hadoop 2.2 cluster. I’ve seen several posts about this issue, but no resolution. The error message is this: Caused by: java.lang.

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
Ah. So there is some kind of a "back and forth" going on. Thanks! Ognen On 9/5/2014 5:34 PM, qihong wrote: Since you are using your home computer, so it's probably not reachable by EC2 from internet. You can try to set "spark.driver.host" to your WAN ip, "spark.driver.port" to a fixed port in S

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread qihong
Since you are using your home computer, it's probably not reachable by EC2 from the internet. You can try to set "spark.driver.host" to your WAN IP and "spark.driver.port" to a fixed port in SparkConf, and open that port in your home network (port forwarding to the computer you are using). See if that
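A hedged sketch of that configuration; every address and port below is an illustrative placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("remote-shell-test")
      .setMaster("spark://ec2-master-hostname:7077")   // placeholder master URL
      .set("spark.driver.host", "203.0.113.5")         // your externally reachable (WAN) IP
      .set("spark.driver.port", "51000")               // fixed port, forwarded through your home router
    val sc = new SparkContext(conf)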

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
That is the command I ran and it still times out. Besides 7077, is there any other port that needs to be open? Thanks! Ognen On 9/5/2014 4:10 PM, qihong wrote: the command should be "spark-shell --master spark://:7077". -- View this message in context: http://apache-spark-user-list.1001560.n3

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread qihong
the command should be "spark-shell --master spark://:7077". -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-spark-shell-or-queries-over-the-network-not-from-master-tp13543p13593.html Sent from the Apache Spark User List mailing list archive at Nabble

Re: Sparse Matrices support in Spark

2014-09-05 Thread yannis_at
You are right, but unfortunately the size of the matrices I am trying to build will exceed the capacity of the machines I have access to; that is why I need sparse libraries. I wish there were a good library for sparse matrices in Java. I will try to check the Scala one you mentioned.

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
On 9/5/2014 3:27 PM, anthonyjschu...@gmail.com wrote: I think that should be possible. Make sure spark is installed on your local machine and is the same version as on the cluster. It is the same version, I can telnet to master:7077 but when I run the spark-shell it times out. --

Re: Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread anthonyjschu...@gmail.com
I think that should be possible. Make sure spark is installed on your local machine and is the same version as on the cluster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-spark-shell-or-queries-over-the-network-not-from-master-tp13543p13590.html

Re: Sparse Matrices support in Spark

2014-09-05 Thread anthonyjschu...@gmail.com
It would probably be much easier to use an existing library in conjunction with MLlib or Spark. I have been using jblas as my matrix backend for machine learning work (I don't use MLlib yet)... but jblas does not support sparse matrices afaik. Consider checking Scala's Breeze. Interestingly, I have fou
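A small hedged sketch of a sparse matrix in Breeze; the dimensions and values are illustrative:

    import breeze.linalg.CSCMatrix

    val builder = new CSCMatrix.Builder[Double](100000, 100000)   // rows, cols
    builder.add(3, 7, 1.5)        // only non-zero entries are stored
    builder.add(42, 99, -2.0)
    val sparse = builder.result()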

Re: Shared variable in Spark Streaming

2014-09-05 Thread Tathagata Das
Good explanation, Chris :) On Fri, Sep 5, 2014 at 12:42 PM, Chris Fregly wrote: > good question, soumitra. it's a bit confusing. > > to break TD's code down a bit: > > dstream.count() is a transformation operation (returns a new DStream), > executes lazily, runs in the cluster on the underlyi

Repartition inefficient

2014-09-05 Thread anthonyjschu...@gmail.com
I wonder if anyone has any tips for using repartition? It seems that when you call the repartition method, the entire RDD gets split up, shuffled, and redistributed... This is an extremely heavy task if you have a large HDFS dataset and all you want to do is make sure your RDD is balanced / data ske
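A hedged sketch of the trade-off, where rdd stands in for an existing RDD:

    // Full shuffle: every record may move, which is the heavy behaviour described above
    val rebalanced = rdd.repartition(200)

    // When only shrinking the partition count, coalesce can avoid the shuffle entirely
    val merged = rdd.coalesce(50, shuffle = false)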

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Marcelo Vanzin
Hi Davies, On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu wrote: > In Douban, we use Moose FS[1] instead of HDFS as the distributed file system, > it's POSIX compatible and can be mounted just as NFS. Sure, if you already have the infrastructure in place, it might be worthwhile to use it. After all,

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Davies Liu
Here is a story about how shared storage simplifies things: In Douban, we use Moose FS[1] instead of HDFS as the distributed file system; it's POSIX compatible and can be mounted just like NFS. We put all the data, tools and code in it, so we can access them easily on all the machines, jus

Re: Shared variable in Spark Streaming

2014-09-05 Thread Chris Fregly
good question, soumitra. it's a bit confusing. to break TD's code down a bit: dstream.count() is a transformation operation (returns a new DStream), executes lazily, runs in the cluster on the underlying RDDs that come through in that batch, and returns a new DStream with a single element repres
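A minimal sketch of the point being made, with dstream standing in for an existing DStream:

    // count() is a transformation: it returns a new DStream[Long] with one element per batch
    val counts = dstream.count()

    // Nothing runs until an output operation is registered; this prints the per-batch count
    counts.foreachRDD(rdd => println(s"records in this batch: ${rdd.first()}"))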

Re: pyspark on yarn hdp hortonworks

2014-09-05 Thread Greg Hill
I'm running into a problem getting this working as well. I have spark-submit and spark-shell working fine, but pyspark in interactive mode can't seem to find the lzo jar: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found This is in /usr/lib/hadoop/lib/hadoo

Re: how to choose right DStream batch interval

2014-09-05 Thread qihong
Repost since the original msg was marked with "This post has NOT been accepted by the mailing list yet." I have some questions regarding the DStream batch interval: 1. if it only takes 0.5 seconds to process the batch 99% of the time, but 1% of batches need 5 seconds to process (due to some random factor or f

Re: spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread Ankur Dave
At 2014-09-05 21:40:51 +0800, marylucy wrote: > But running graphx edgeFileList ,some tasks failed > error:requested array size exceed vm limits Try passing a higher value for minEdgePartitions when calling GraphLoader.edgeListFile. Ankur --
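A hedged sketch of that call; the path and partition count are illustrative:

    import org.apache.spark.graphx.GraphLoader

    // More, smaller edge partitions keep any single partition's arrays under the VM limit
    val graph = GraphLoader.edgeListFile(
      sc, "hdfs:///data/edges.txt", minEdgePartitions = 200)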

Re: spark-streaming-kafka with broadcast variable

2014-09-05 Thread Tathagata Das
I am not sure if there is a good, clean way to do that - broadcast variables are not designed to be used outside Spark job closures. You could try a bit of a hacky approach where you write the serialized variable to a file in HDFS / NFS / a distributed file system, and then use a custom decoder class th

Re: Web UI

2014-09-05 Thread Andrew Or
Sure, you can request it by filing an issue here: https://issues.apache.org/jira/browse/SPARK 2014-09-05 6:50 GMT-07:00 Ruebenacker, Oliver A < oliver.ruebenac...@altisource.com>: > > > Hello, > > > > Thanks for the explanation. So events are stored internally as JSON, but > there is no o

Re: Viewing web UI after fact

2014-09-05 Thread Andrew Or
Hi Grzegorz, Can you verify that there are "APPLICATION_COMPLETE" files in the event log directories? E.g. Does file:/tmp/spark-events/app-name-1234567890/APPLICATION_COMPLETE exist? If not, it could be that your application didn't call sc.stop(), so the "ApplicationEnd" event is not actually logg
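A hedged sketch of the sc.stop() point; the app name and event-log path are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "file:/tmp/spark-events")   // must match the history server's log dir
    val sc = new SparkContext(conf)
    try {
      // ... job logic ...
    } finally {
      sc.stop()   // writes the ApplicationEnd event so the run shows up as complete
    }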

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Marcelo Vanzin
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu wrote: > In daily development, it's common to modify your projects and re-run > the jobs. If using zip or egg to package your code, you need to do > this every time after modification, I think it will be boring. That's why shell scripts were invented. :

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Andrew Or
Hi Oleg, We do support serving python files in zips. If you use --py-files, you can provide a comma delimited list of zips instead of python files. This will allow you to automatically add these files to the python path on the executors without you having to manually copy them to every single slav

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Davies Liu
On Fri, Sep 5, 2014 at 10:21 AM, Oleg Ruchovets wrote: > Ok , I didn't explain my self correct: >In case of java having a lot of classes jar should be used. >All examples for PySpark I found is one py script( Pi , wordcount ...) , > but in real environment analytics has more then one py f

Re: Low Level Kafka Consumer for Spark

2014-09-05 Thread Tathagata Das
Some thoughts on this thread to clarify the doubts. 1. Driver recovery: The current version (1.1, to be released) does not recover the raw data that has been received but not processed. This is because when the driver dies, the executors die and so does the raw data that was stored in them. Only for HDFS, th

spark-streaming-kafka with broadcast variable

2014-09-05 Thread Penny Espinoza
I need to use a broadcast variable inside the Decoder I use for class parameter T in org.apache.spark.streaming.kafka.KafkaUtils.createStream. I am using the override with this signature: createStream

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Oleg Ruchovets
OK, I didn't explain myself correctly: In the case of Java, when there are a lot of classes a jar should be used. All the examples for PySpark I found are a single py script (Pi, wordcount ...), but in a real environment analytics has more than one py file. My question is how to use PySpark on Yarn analytics in c

Re: [Spark Streaming] Tracking/solving 'block input not found'

2014-09-05 Thread Tathagata Das
Hey Gerard, Spark Streaming should just queue the processing and not delete the block data. There are reports of this error and I am still unable to reproduce the problem. One workaround you can try is the configuration "spark.streaming.unpersist = false". This stops Spark Streaming from cleaning up

Re: error: type mismatch while Union

2014-09-05 Thread Yana Kadiyska
Which version are you using -- I can reproduce your issue w/ 0.9.2 but not with 1.0.1...so my guess is that it's a bug and the fix hasn't been backported... No idea on a workaround though.. On Fri, Sep 5, 2014 at 7:58 AM, Dhimant wrote: > Hi, > I am getting type mismatch error while union opera

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Dimension Data, LLC.
Hi: Curious... is there any reason not to use one of the below pyspark options (in red)? Assuming each file is, say 10k in size, is 50 files too much? Does that touch on some practical limitation? Usage: ./bin/pyspark [options] Options: --master MASTER_URL spark://host:port, mesos://h

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
Preprocessing (after loading the data into HDFS). I started with data in JSON format in text files (stored in HDFS), and then loaded the data into parquet files with a bit of preprocessing and now I always retrieve the data by creating a SchemaRDD from the parquet file and using the SchemaRDD to b

Re: How to list all registered tables in a sql context?

2014-09-05 Thread Jianshi Huang
Err... there's no such feature? Jianshi On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang wrote: > Hi, > > How can I list all registered tables in a sql context? > > -- > Jianshi Huang > > LinkedIn: jianshi > Twitter: @jshuang > Github & Blog: http://huangjs.github.com/ > -- Jianshi Huang Lin

Re: Spark that integrates with Kafka 0.7

2014-09-05 Thread Hemanth Yamijala
After that long mail :-), I think I figured it out. I removed the 'provided' tag in my pom.xml and let the jars be directly included using maven's jar-with-dependencies plugin. Things started working after that. Thanks Hemanth On Fri, Sep 5, 2014 at 9:50 PM, Hemanth Yamijala wrote: > After sea

Re: Spark that integrates with Kafka 0.7

2014-09-05 Thread Hemanth Yamijala
After searching a little bit, I came to know that Spark 0.8 supports kafka-0.7. So, I tried to use it this way: In my pom.xml, I specified a Spark dependency as follows: org.apache.spark spark-streaming_2.9.3 0.8.1-incubating And a kafka dependency as follows:

Re: PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Davies Liu
Hi Oleg, In order to simplify the process of packaging and distributing your code, you could deploy shared storage (such as NFS), put your project in it, and mount it on all the slaves as "/projects". In the Spark job scripts, you can access your project by putting the path into sys.path, such as: im

replicated rdd storage problem

2014-09-05 Thread rapelly kartheek
Hi, Whenever I replicate an rdd, I find that the rdd gets replicated only on one node. I have a 3 node cluster. I set rdd.persist(StorageLevel.MEMORY_ONLY_2) in my application. The web UI shows that it's replicated twice. But the rdd storage details show that it's replicated only once and only in
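For reference, a minimal sketch of the setup being described (the path is illustrative); note that replicas are only created once the RDD is actually computed and cached:

    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///input/data.txt")
    data.persist(StorageLevel.MEMORY_ONLY_2)   // request 2 in-memory copies of each block
    data.count()                               // materialize so the blocks (and replicas) get stored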

Re: TimeStamp selection with SparkSQL

2014-09-05 Thread Brad Miller
My approach may be partly influenced by my limited experience with SQL and Hive, but I just converted all my dates to seconds-since-epoch and then selected samples from specific time ranges using integer comparisons. On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao wrote: > There are 2 SQL dialects,
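A hedged sketch of that approach against the Spark 1.1-era SchemaRDD API; the case class, field names and epoch values are illustrative:

    import org.apache.spark.sql.SQLContext

    case class Event(id: String, ts: Long)    // ts = seconds since epoch, computed when loading

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD         // implicit RDD[Event] -> SchemaRDD

    val events = sc.parallelize(Seq(Event("a", 1409900000L), Event("b", 1409990000L)))
    events.registerTempTable("events")

    // A time-range selection becomes a plain integer comparison
    val inRange = sqlContext.sql(
      "SELECT * FROM events WHERE ts >= 1409900000 AND ts < 1409950000")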

Spark-cassandra-connector 1.0.0-rc5: java.io.NotSerializableException

2014-09-05 Thread Shing Hing Man
Hi, My version of Spark is 1.0.2. I am trying to use Spark-cassandra-connector to execute an update csql statement inside an CassandraConnector(conf).withSessionDo block : CassandraConnector(conf).withSessionDo { session => { myRdd.foreach { case (ip, value

Re: Task not serializable

2014-09-05 Thread Sarath Chandra
Hi Akhil, I've done this for the classes which are in my scope. But what should I do with classes that are out of my scope? For example org.apache.hadoop.io.Text. Also I'm using several 3rd-party libraries like "jeval". ~Sarath On Fri, Sep 5, 2014 at 7:40 PM, Akhil Das wrote: > You can bring those c
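One common pattern that may help here, sketched with an illustrative jeval call rather than the poster's actual code: keep non-serializable Hadoop types out of the shipped data (e.g. call toString on Text as early as possible), and construct non-serializable helpers inside the task instead of capturing them in the closure:

    val lines = sc.textFile("hdfs:///input/data.txt")   // plain Strings, nothing Hadoop-specific to ship

    val results = lines.mapPartitions { iter =>
      // Built on the executor, once per partition, so it never has to be serialized
      val evaluator = new net.sourceforge.jeval.Evaluator()
      iter.map(expr => evaluator.evaluate(expr))
    }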

Re: Task not serializable

2014-09-05 Thread Akhil Das
You can bring those classes out of the library and make them serializable (implements Serializable). It is not the right way of doing it, though it solved a few of my similar problems. Thanks Best Regards On Fri, Sep 5, 2014 at 7:36 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Hi,

Task not serializable

2014-09-05 Thread Sarath Chandra
Hi, I'm trying to migrate a map-reduce program to work with spark. I migrated the program from Java to Scala. The map-reduce program basically loads a HDFS file and for each line in the file it applies several transformation functions available in various external libraries. When I execute this o

RE: Web UI

2014-09-05 Thread Ruebenacker, Oliver A
Hello, Thanks for the explanation. So events are stored internally as JSON, but there is no official support for having Spark serve that JSON via HTTP? So if I wanted to write an app that monitors Spark, I would either have to scrape the web UI in HTML or rely on unofficial JSON feature

spark 1.1.0 requested array size exceed vm limits

2014-09-05 Thread marylucy
I am building spark-1.0-rc4 with maven, following http://spark.apache.org/docs/latest/building-with-maven.html But when running the GraphX edgeListFile example, some tasks failed with error: requested array size exceeds VM limits error: executor lost Does anyone know how to fix it

error: type mismatch while Union

2014-09-05 Thread Dhimant
Hi, I am getting a type mismatch error during a union operation. Can someone suggest a solution? / case class MyNumber(no: Int, secondVal: String) extends Serializable with Ordered[MyNumber] { override def toString(): String = this.no.toString + " " + this.secondVal override def compare(th

Re: New sbt plugin to deploy jobs to EC2

2014-09-05 Thread andy petrella
\o/ => will test it soon or sooner, gr8 idea btw aℕdy ℙetrella about.me/noootsab On Fri, Sep 5, 2014 at 12:37 PM, Felix Garcia Borrego wrote: > As far as I know in other to deploy and execute jobs in EC2 you need to > assembly you

New sbt plugin to deploy jobs to EC2

2014-09-05 Thread Felix Garcia Borrego
As far as I know, in order to deploy and execute jobs on EC2 you need to assemble your project, copy your jar onto the cluster, log in using ssh, and submit the job. To avoid having to do this, I've been prototyping an sbt plugin(1) that allows you to create and send Spark jobs to an Amazon EC2 cluster

Running spark-shell (or queries) over the network (not from master)

2014-09-05 Thread Ognen Duzlevski
Is this possible? If I have a cluster set up on EC2 and I want to run spark-shell --master :7077 from my home computer - is this possible at all, or am I wasting my time ;)? I am seeing a connection timeout when I try it. Thanks! Ognen --

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError "GC overhead limit exceeded"

2014-09-05 Thread Yifan LI
Thank you, Ankur! :) But how do I assign a storage level to a new vertex RDD that is mapped from an existing vertex RDD, e.g. *val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out).map{case(id:VertexId, a:Array[VertexId]) => (id, initialHashMap(a))}* The new one will be combined with th
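A hedged variant of the snippet above (graph and initialHashMap come from the original post; the storage level is illustrative) - the derived RDD is still an ordinary RDD, so a storage level can be set on it directly before it is combined with anything else:

    import org.apache.spark.graphx._
    import org.apache.spark.storage.StorageLevel

    val newVertexRDD = graph
      .collectNeighborIds(EdgeDirection.Out)
      .map { case (id: VertexId, a: Array[VertexId]) => (id, initialHashMap(a)) }
      .persist(StorageLevel.MEMORY_AND_DISK)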

Spark that integrates with Kafka 0.7

2014-09-05 Thread Hemanth Yamijala
Hi, Due to some limitations, we are having to stick to Kafka 0.7.x. We would like to use as latest a version of Spark in streaming mode that integrates with Kafka 0.7. The latest version supports only 0.8 it appears. Has anyone solved such a requirement ? Any tips on what can be tried ? FWIW, I t

question on replicate() in blockManager.scala

2014-09-05 Thread rapelly kartheek
Hi, var cachedPeers: Seq[BlockManagerId] = null private def replicate(blockId: String, data: ByteBuffer, level: StorageLevel) { val tLevel = StorageLevel(level.useDisk, level.useMemory, level.deserialized, 1) if (cachedPeers == null) { cachedPeers = master.getPeers(blockManagerId,

Recursion

2014-09-05 Thread Deep Pradhan
Hi, Does Spark support recursive calls?

Re: Viewing web UI after fact

2014-09-05 Thread Grzegorz Białek
Hi Andrew, thank you very much for your answer. Unfortunately it still doesn't work. I'm using Spark 1.0.0, and I start the history server by running sbin/start-history-server.sh, although I also set SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory in conf/spark-env.sh. I also tried other dirs than /t

Re: Mapping Hadoop Reduce to Spark

2014-09-05 Thread Lukas Nalezenec
Hi, FYI: There is a bug in Java mapPartitions - SPARK-3369. In Java, results from mapPartitions and similar functions must fit in memory. Look at the example below - it returns a List. Lukas On 1.9.2014 00:50, Matei Zaharia wrote: mapPartitions jus
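A hedged Scala illustration of the underlying difference (rdd and process are placeholders); SPARK-3369 itself concerns the Java API's return type, but the memory behaviour is the same whenever a whole partition is materialized:

    // Materializes the whole partition before returning - can blow up on large partitions
    val eager = rdd.mapPartitions(iter => iter.map(process).toList.iterator)

    // Returns a lazy Iterator, so records stream through one at a time
    val lazyOne = rdd.mapPartitions(iter => iter.map(process))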

Re: Multiple spark shell sessions

2014-09-05 Thread Andrew Ash
Hi Dhimant, We also cleaned up these needless warnings on port failover in Spark 1.1 -- see https://issues.apache.org/jira/browse/SPARK-1902 Andrew On Thu, Sep 4, 2014 at 7:38 AM, Dhimant wrote: > Thanks Yana, > I am able to execute application and command via another session, i also > receiv

PySpark on Yarn a lot of python scripts project

2014-09-05 Thread Oleg Ruchovets
Hi, We are evaluating PySpark and have successfully executed PySpark examples on Yarn. Next step, what we want to do: We have a python project (a bunch of python scripts using Anaconda packages). Question: What is the way to execute PySpark on Yarn having a lot of python files (~50)?

Re: Spark Streaming into HBase

2014-09-05 Thread Tathagata Das
Hmmm, even I don't understand why TheMain needs to be serializable. It might be cleaner (as in avoiding such mysterious closure issues) to actually create a separate sbt/maven project (instead of a shell) and run the streaming application from there. TD On Thu, Sep 4, 2014 at 10:25 AM, kpeng1 wrote

Re: Serialize input path

2014-09-05 Thread jerryye
Thanks for the response Sean. As a correction. The code I provided actually ended up working. I tried to reduce my code down but I was being overzealous and running count actually works. The minimal code that triggers the problem is this: val userProfiles = lines.map(line => {parse(line)}).map(js