Re: How to specify the job to run on the specific nodes(machines) in the hadoop yarn cluster?

2014-07-29 Thread Haiyang Fu
It's really a good question! I'm also working on it. On Wed, Jul 30, 2014 at 11:45 AM, adu wrote: > Hi all, > RT. I want to run a job on two specific nodes in the cluster. How do I > configure YARN for that? Does the YARN queue help? > > Thanks

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Cheng Lian
What's the format of the file header? Is it possible to filter them out by prefix string matching or regex? On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO wrote: > It will certainly cause bad performance, since it reads the whole content > of a large file into one value, instead of splitting it i

Converting matrix format

2014-07-29 Thread Chengi Liu
Hi, I have an RDD with n rows and m columns, but most of the entries are 0; it's a sparse matrix. I would like to get only the non-zero entries with their indices. The equivalent Python code would be: for i, x in enumerate(matrix): for j, y in enumerate(x): if y: print i, j, y

Logging in Spark through YARN.

2014-07-29 Thread Archit Thakur
Hi, I want to manage the logging of containers when I run Spark through YARN. I checked that there is an environment variable exposed for a custom log4j.properties. Setting SPARK_LOG4J_CONF to "/dir/log4j.properties" should ideally make containers use the "/dir/log4j.properties" file for logging. This doesn't see

Re: zip two RDD in pyspark

2014-07-29 Thread Nick Pentreath
parallelize uses the default Serializer (PickleSerializer) while textFile uses UTF8Serializer. You can get around this with index.zip(input_data._reserialize()) (or index.zip(input_data.map(lambda x: x))) (But if you try to just do this, you run into the issue with different number of partitions

Re: Last step of processing is using too much memory.

2014-07-29 Thread Davies Liu
When you do groupBy(), it wants to load all the data into memory for best performance, so you should specify the number of partitions carefully. In Spark master or the upcoming 1.1 release, PySpark can do an external groupBy(), which means that it will dump the data to disk if there is not enough memor

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
After you load the data in, call `.repartition(number of executors).cache()`. If the data is evenly distributed, it may be hard to guess the root cause. Do the two clusters have the same internode bandwidth? -Xiangrui On Tue, Jul 29, 2014 at 11:06 PM, Tan Tim wrote: > input data is evenly distrib
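A minimal sketch of this advice; the path and the executor count are placeholders, not from the thread.

    val numExecutors = 16
    val data = sc.textFile("hdfs:///data/training")
      .repartition(numExecutors)   // spread the input evenly across executors
      .cache()
    data.count()                   // materializes the cache so later stages read from memory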

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Tan Tim
The input data is evenly distributed to the executors. The input data is on HDFS, not on the Spark cluster. How can I make the data distributed to the executors? On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng wrote: > The weight vector is usually dense and if you have many partitions, > th

Re: the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Ankur Dave
Denis RP writes: > I build it with sbt package, I run it with sbt run, and I do use > SparkConf.set for deployment options and external jars. It seems that > spark-submit can't load extra jars and will lead to a NoClassDefFoundError; > should I pack all the jars into a giant one and give it a try? Ye

Re: How to submit Pyspark job in mesos?

2014-07-29 Thread Davies Liu
Maybe Mesos or Spark was not configured correctly; could you check the log files on the Mesos slaves? They should log the reason when Mesos cannot launch the executor. On Tue, Jul 29, 2014 at 10:39 PM, daijia wrote: > > Actually, it runs okay in my slaves deployed by standalone mode. > When I switch t

Re: zip two RDD in pyspark

2014-07-29 Thread Davies Liu
On Mon, Jul 28, 2014 at 12:58 PM, l wrote: > I have a file in s3 that I want to map each line with an index. Here is my > code: > input_data = sc.textFile('s3n:/myinput', minPartitions=6).cache() N = input_data.count() index = sc.parallelize(range(N), 6) index.zip(input_data).

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
The weight vector is usually dense and if you have many partitions, the driver may slow down. You can also take a look at the driver memory inside the Executor tab in WebUI. Another setting to check is the HDFS block size and whether the input data is evenly distributed to the executors. Are the ha

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow。

2014-07-29 Thread Earthson
Too many GCs. The task runs much faster with more memory (heap space). The CPU load is still too high, and the network load is about 20+ MB/s (not high enough). So what is the correct way to solve this GC problem? Are there other ways besides using more memory? -- View this message in context: http

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Tan Tim
The application is Logistic Regression (OWLQN); we developed a sparse-vector version. The feature dimension is 1M+, but it's very sparse. This application can run on another Spark cluster, where every stage takes about 50 seconds and every executor has high CPU usage. The only difference is the OS (the fast

Re: the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Denis RP
I build it with sbt package, I run it with sbt run, and I do use SparkConf.set for deployment options and external jars. It seems that spark-submit can't load extra jars and will lead to a NoClassDefFoundError; should I pack all the jars into a giant one and give it a try? I run it on a cluster of 8 m

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Fengyun RAO
It will certainly cause bad performance, since it reads the whole content of a large file into one value, instead of splitting it into partitions. Typically one file is 1 GB. Suppose we have 3 large files, in this way, there would only be 3 key-value pairs, and thus 3 tasks at most. 2014-07-30 1

Re: How to submit Pyspark job in mesos?

2014-07-29 Thread daijia
Actually, it runs okay on my slaves deployed in standalone mode. When I switch to Mesos, the error occurs. Anyway, thanks for your reply, and any ideas will help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-submit-Pyspark-job-in-mesos-tp10

Re: Using countApproxDistinct in pyspark

2014-07-29 Thread Davies Liu
Hey Diederik, The data in rdd._jrdd.rdd() is serialized by pickle in batch mode by default, so the number of rows in it is much less then rdd. for example: >>> size = 100 >>> d = [i%size for i in range(1, 10)] >>> rdd = sc.parallelize(d) >>> rdd.count() 9 >>> rdd._jrdd.rdd().count() 98L >

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
Could you share more details about the dataset and the algorithm? For example, if the dataset has 10M+ features, it may be slow for the driver to collect the weights from executors (just a blind guess). -Xiangrui On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim wrote: > Hi, all > > [Setting] > > Input

Re: Using Spark Streaming with Kafka 0.7.2

2014-07-29 Thread Andre Schumacher
Hi, For testing you could also just use the Kafka 0.7.2 console consumer, pipe its output to netcat (nc), and process that as in the example https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala That worked for me. Back

Re: How to submit Pyspark job in mesos?

2014-07-29 Thread Davies Liu
provided. > Attempting to register without authentication > I0729 18:40:50.443234 15036 sched.cpp:391] Framework registered with > 20140729-174911-1526966464-5050-13758-0006 > 14/07/29 18:40:50 INFO mesos.CoarseMesosSchedulerBackend: Registered as > framework ID 20140729-174911-1526966

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Hossein
You can use SparkContext.wholeTextFiles(). Please note that the documentation suggests: "Small files are preferred, large file is also allowable, but may cause bad performance." --Hossein On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > This is an interes
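A minimal sketch of this suggestion, assuming the header is the first four lines of each log file; the glob path is a placeholder.

    val files = sc.wholeTextFiles("hdfs:///logs/*.log")   // RDD[(path, fileContent)]
    val parsed = files.flatMap { case (path, content) =>
      val lines  = content.split("\n")
      val header = lines.take(4)
      lines.drop(4).map(line => (header.mkString("|"), line))  // pair each record with its header
    }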

Re: Is it possible to read file head in each partition?

2014-07-29 Thread Nicholas Chammas
This is an interesting question. I’m curious to know as well how this problem can be approached. Is there a way, perhaps, to ensure that each input file matching the glob expression gets mapped to exactly one partition? Then you could probably get what you want using RDD.mapPartitions(). Nick ​
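A hedged sketch of this idea, assuming each file really does map to exactly one partition (which is not guaranteed for large, splittable files):

    val lines = sc.textFile("hdfs:///logs/*.log")
    val parsed = lines.mapPartitions { iter =>
      // consume the first four lines of this partition as the header
      val header = (1 to 4).map(_ => if (iter.hasNext) iter.next() else "").toVector
      iter.map(line => (header, line))   // remaining lines paired with the header
    }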

Is it possible to read file head in each partition?

2014-07-29 Thread Fengyun RAO
Hi, all. We are migrating from MapReduce to Spark and encountered a problem. Our input files are IIS logs with a file header. It's easy to get the file header if we process only one file, e.g. val lines = sc.textFile("hdfs://*/u_ex14073011.log") val head = lines.take(4) Then we can write our map meth

Re: Spark and Flume integration - do I understand this correctly?

2014-07-29 Thread dapooley
Great, thanks guys, that helped a lot and I've got a sample working. As a follow-up, when do workers/masters become a necessity? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Flume-integration-do-I-understand-this-correctly-tp10879p10908.html Sent

How to specify the job to run on the specific nodes(machines) in the hadoop yarn cluster?

2014-07-29 Thread adu
Hi all, RT. I want to run a job on two specific nodes in the cluster. How do I configure YARN for that? Does the YARN queue help? Thanks

How do you debug a PythonException?

2014-07-29 Thread Nick Chammas
I’m in the PySpark shell and I’m trying to do this: a = sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json', minPartitions=sc.defaultParallelism * 3).cache() a.map(lambda x: len(x)).max() My job dies with the following: 14/07/30 01:46:28 WARN TaskSetManager: Loss was du

How to submit Pyspark job in mesos?

2014-07-29 Thread daijia
sched.cpp:217] New master detected at master@192.168.3.91:5050 I0729 18:40:50.442570 15035 sched.cpp:225] No credentials provided. Attempting to register without authentication I0729 18:40:50.443234 15036 sched.cpp:391] Framework registered with 20140729-174911-1526966464-5050-13758-0006 14/07/29 18:40:50

RE: The function of ClosureCleaner.clean

2014-07-29 Thread Wang, Jensen
I found a detailed explanation here, https://www.quora.com/Apache-Spark/What-does-Closure-cleaner-func-mean-in-Spark . I copy it for convenience: When Scala constructs a closure, it determines which outer variables the closure will use and stores references to them in the closure object. This
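A small illustration (not from the thread) of the kind of capture involved: when a closure refers to a field, Scala captures the whole enclosing object, and the usual workaround is to copy the field into a local val. Tokenizer and records are hypothetical names.

    class Tokenizer(val delimiter: String) extends Serializable {
      def split(records: org.apache.spark.rdd.RDD[String]) = {
        val d = delimiter          // copy the field into a local; the closure
        records.map(_.split(d))    // below then captures only `d`, not `this`
      }
    }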

Re: How true is this about spark streaming?

2014-07-29 Thread Tobias Pfeiffer
Hi, that quoted statement doesn't make much sense to me, either. Maybe if you had a link for us that shows the context (Google doesn't reveal anything but this conversation), we could evaluate that statement better. Tobias On Tue, Jul 29, 2014 at 5:53 PM, Sean Owen wrote: > I'm not sure

RE: Avro Schema + GenericRecord to HadoopRDD

2014-07-29 Thread Severs, Chris
Hi Benjamin, I think the best bet would be to use the Avro code generation stuff to generate a SpecificRecord for your schema and then change the reader to use your specific type rather than GenericRecord. Trying to read up the generic record and then do type inference and spit out a tuple is
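A hedged sketch of this suggestion using the old-API Avro classes; User stands in for a class generated by the Avro compiler from your schema, and the input path is a placeholder.

    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/users.avro")

    val users = sc.hadoopRDD(
        jobConf,
        classOf[AvroInputFormat[User]],   // reads AvroWrapper[User] keys
        classOf[AvroWrapper[User]],
        classOf[NullWritable])
      .map { case (wrapper, _) => wrapper.datum() }   // typed User records, no casting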

Re: java.io.StreamCorruptedException: invalid type code: 00

2014-07-29 Thread Alexis Roos
Just realized that I was missing the JavaSparkContext import; after adding it, the error is: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: java.lang.reflect.Method at org

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, HTTP was the default way of broadcasting. The driver holds the data and the executors request the data via HTTP, making the driver the bottleneck if the data is large. -Xiangrui On Tue, Jul 29, 2014 at 10:32 AM, durin wrote: > Development is really rapid here, that's a great thing

java.io.StreamCorruptedException: invalid type code: 00

2014-07-29 Thread Alexis Roos
Hello, I am porting a data-processing job running on Spark from Scala to Java 8, using lambdas, to see how practical Java 8 is. The first few steps are working (parsing data, creating JavaRDDs), but then it fails while doing a cogroup between two JavaPairRDDs. I am getting a bunch of java.io.StreamCorr

Re: Job using Spark for Machine Learning

2014-07-29 Thread Matei Zaharia
Hi Martin, Job ads are actually not allowed on the list, but thanks for asking. Just posting this for others' future reference. Matei On July 29, 2014 at 8:34:59 AM, Martin Goodson (mar...@skimlinks.com) wrote: I'm not sure if job adverts are allowed on here - please let me know if not.  Othe

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread soumick86
Few lines of my error logs look like 2014-07-29 16:32:16,326 ERROR [ActorSystemImpl] Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-6] shutting down ActorSystem [spark] java.lang.VerifyError: (class: org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker si

Re: Memory & compute-intensive tasks

2014-07-29 Thread Matei Zaharia
Is data being cached? It might be that those two nodes started first and did the first pass of the data, so it's all on them. It's kind of ugly, but you can add a Thread.sleep when your program starts to wait for nodes to come up. Also, have you checked the application web UI at http://:4040 while

Re: Spark and Flume integration - do I understand this correctly?

2014-07-29 Thread Hari Shreedharan
Hi, Deploying Spark with Flume is pretty simple. What you'd need to do is: 1. Start your Spark Flume DStream receiver on some machine using one of the FlumeUtils.createStream methods - where you need to specify the hostname and port of the worker node on which you want the Spark executor to r
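A minimal sketch of step 1, assuming the spark-streaming-flume artifact is on the classpath; the host, port, and app name are placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val conf = new SparkConf().setAppName("FlumeExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    // the receiver binds on the worker that ends up running it; Flume's avro
    // sink must point at this hostname and port
    val flumeStream = FlumeUtils.createStream(ssc, "worker-host", 41414)
    flumeStream.map(e => new String(e.event.getBody.array())).print()

    ssc.start()
    ssc.awaitTermination()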

Re: Example standalone app error!

2014-07-29 Thread Andrew Or
Hi Alex, Very strange. This error occurs when someone tries to call an abstract method. I have run into this before and resolved it with an SBT clean followed by an assembly, so maybe you could give that a try. Let me know if that fixes it, Andrew 2014-07-29 13:01 GMT-07:00 Alex Minnaar : > I

Re: Spark and Flume integration - do I understand this correctly?

2014-07-29 Thread Tathagata Das
Hari, can you help? TD On Tue, Jul 29, 2014 at 12:13 PM, dapooley wrote: > Hi, > > I am trying to integrate Spark onto a Flume log sink and avro source. The > sink is on one machine (the application), and the source is on another. Log > events are being sent from the application server to the av

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
👍 Thanks! On 29 Jul 2014 22:09, "Ankur Dave" wrote: > andy petrella writes: > > Oh I was almost sure that lookup was optimized using the partition info > > It does use the partitioner to run only one task, but within that task it > has to scan the entire partition: > > https://github.com/apache

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
SBT does actually have a runtime scope, but things might indeed not translate perfectly. However, exclusions being ignored by the sbt-pom-reader is a serious issue. Not yet sure how to deal with that one. On Tue, Jul 29, 2014 at 4:08 PM, Sean Owen wrote: > Regarding SBT: I don't think SBT has e

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
Ah, there is an Apache parent that I need to go look at. That's why I can't find so many things. Thanks. On Tue, Jul 29, 2014 at 4:08 PM, Sean Owen wrote: > Regarding SBT: I don't think SBT has equivalents for some Maven > concepts like "runtime" scope. The resulting assembly-like JARs might > not

Re: how to publish spark inhouse?

2014-07-29 Thread Sean Owen
Regarding SBT: I don't think SBT has equivalents for some Maven concepts like "runtime" scope. The resulting assembly-like JARs might not be what's intended. I think that's part of why Maven is the build of reference. You might be able to get the SBT build to do the right thing with fiddling. It wa

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
andy petrella writes: > Oh I was almost sure that lookup was optimized using the partition info It does use the partitioner to run only one task, but within that task it has to scan the entire partition: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDD
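The two approaches from this thread side by side, as a small sketch (vertex id 80 is just the example used above):

    val byFilter = graph.vertices.filter { case (id, _) => id == 80L }.collect()
    val byLookup = graph.vertices.lookup(80L)  // one task via the partitioner, but still a scan inside it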

Example standalone app error!

2014-07-29 Thread Alex Minnaar
I am trying to run an example Spark standalone app with the following code import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ object SparkGensimLDA extends App{ val ssc=new StreamingContext("local","testApp",Seconds(5)) val lines=ssc.textFileStream("/..
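For reference, a hedged reconstruction of a complete minimal version of that app; the body after the truncation is guessed, and the input directory is a placeholder.

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object SparkGensimLDA extends App {
      val ssc = new StreamingContext("local", "testApp", Seconds(5))
      val lines = ssc.textFileStream("/tmp/lda-input")
      lines.flatMap(_.split("\\s+"))        // tokenize each new file's lines
           .map(word => (word, 1))
           .reduceByKey(_ + _)
           .print()
      ssc.start()
      ssc.awaitTermination()
    }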

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
All I want to do is 1) change the version number and 2) publish Spark to our internal Maven repo. I just learned that in Maven terms what I want to do is a "deploy", not a "release" (neither of which makes sense to me... how about "publish"? But OK). Although me wanting to change a version number is a

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
Oh, I was almost sure that lookup was optimized using the partition info. On 29 Jul 2014 21:25, "Ankur Dave" wrote: > Yifan LI writes: > > Maybe you could get the vertex, for instance, whose id is 80, by using: > > > > graph.vertices.filter{case(id, _) => id==80}.collect > > > > but I am not

Re: the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Ankur Dave
Denis RP writes: > [error] (run-main-0) org.apache.spark.SparkException: Job aborted due to > stage failure: Task 6.0:4 failed 4 times, most recent failure: Exception > failure in TID 598 on host worker6.local: java.lang.NullPointerException > [error] scala.collection.Iterator$$anon$11.n

Re: how to publish spark inhouse?

2014-07-29 Thread Koert Kuipers
I just looked at my dependencies in SBT, and when using the CDH 4.5.0 dependencies I see that hadoop-client pulls in JBoss Netty (via ZooKeeper) and ASM 3.x (via jersey-server). So somehow these exclusion rules are not working anymore? I will look into sbt-pom-reader a bit to try to understand what's ha

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Ankur Dave
Yifan LI writes: > Maybe you could get the vertex, for instance, whose id is 80, by using: > > graph.vertices.filter{case(id, _) => id==80}.collect > > but I am not sure this is the most efficient way (it will scan the whole > table? it cannot benefit from the index of the VertexRDD table) Un

Spark and Flume integration - do I understand this correctly?

2014-07-29 Thread dapooley
Hi, I am trying to integrate Spark with a Flume log sink and Avro source. The sink is on one machine (the application), and the source is on another. Log events are being sent from the application server to the Avro source server (a log directory sink on the Avro source prints to verify). The aim

Using countApproxDistinct in pyspark

2014-07-29 Thread Diederik
Heya, I would like to use countApproxDistinct in PySpark. I know that it's an experimental method and that it is not yet available in PySpark. I started by porting the countApproxDistinct unit test to Python, see https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the results are w

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Daniel Siegmann
Sonal's suggestion of looking at the JavaAPISuite is a good idea. Just a few things to note. Pay special attention to what's being done in the setUp and tearDown methods, because that's where the magic is happening. To unit test against Spark, pretty much all you need to do is create a context run
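The thread is about JUnit in Java, but the shape being described looks roughly like this sketch (shown in Scala with ScalaTest for brevity; names are illustrative):

    import org.apache.spark.SparkContext
    import org.scalatest.{BeforeAndAfter, FunSuite}

    class WordCountSuite extends FunSuite with BeforeAndAfter {
      var sc: SparkContext = _

      before { sc = new SparkContext("local", "test") }   // the "setUp" magic
      after  { sc.stop() }                                // the "tearDown" magic

      test("counts words") {
        val counts = sc.parallelize(Seq("a b", "b"))
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("b") === 2)
      }
    }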

Re: iScala or Scala-notebook

2014-07-29 Thread andy petrella
Some people started some work on that topic using the notebook (the original or the n8han one, I cannot remember)... Some issues have been created already ^^ On 29 Jul 2014 19:59, "Nick Pentreath" wrote: > IScala itself seems to be a bit dead unfortunately. > > I did come across this today: htt

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-29 Thread Michael Armbrust
The warehouse and the metastore directories are two different things. The metastore holds the schema information about the tables and will by default be a local directory. With javax.jdo.option.ConnectionURL you can configure it to be something like MySQL. The warehouse directory is the default

python project like spark-jobserver?

2014-07-29 Thread Chris Grier
I'm looking for something like the ooyala spark-jobserver ( https://github.com/ooyala/spark-jobserver) that basically manages a SparkContext for use from a REST or web application environment, but for python jobs instead of scala. Has anyone written something like this? Looking for a project or po

Empty RDD after LzoTextInputFormat in newAPIHadoopFile

2014-07-29 Thread Ivoirians
Hello, There seems to be very little documentation on the usage of newAPIHadoopFile and even less of it in conjunction with opening LZO compressed files. I've hit a wall with some unexpected behavior that I don't know how to interpret. This is a test program I'm running in an effort to get this w

Re: iScala or Scala-notebook

2014-07-29 Thread Nick Pentreath
IScala itself seems to be a bit dead unfortunately. I did come across this today: https://github.com/tribbloid/ISpark On Fri, Jul 18, 2014 at 4:59 AM, ericjohnston1989 < ericjohnston1...@gmail.com> wrote: > Hey everyone, > > I know this was asked before but I'm wondering if there have since bee

RE: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-29 Thread nikroy16
Thanks for the response... hive-site.xml is in the classpath so that doesn't seem to be the issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-is-creating-metastore-warehouse-locally-instead-of-in-hdfs-tp10838p10871.html Sent from the Apache

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread durin
Development is really rapid here, that's a great thing. Out of curiosity, how did communication work before torrent? Did everything have to go back to the master / driver first? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-expensiveness-of-large-v

GraphX Connected Components

2014-07-29 Thread Jeffrey Picard
Hey all, I’m currently trying to run connected components using GraphX on a large graph (~1.8b vertices and ~3b edges, most of them are self edges where the only edge that exists for vertex v is v->v) on emr using 50 m3.xlarge nodes. As the program runs I’m seeing each iteration take longer and

Re: Memory & compute-intensive tasks

2014-07-29 Thread rpandya
OK, I did figure this out. I was running the app (avocado) using spark-submit, when it was actually designed to take command line arguments to connect to a spark cluster. Since I didn't provide any such arguments, it started a nested local Spark cluster *inside* the YARN Spark executor and so of co

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Sonal Goyal
You can take a look at https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java and model your JUnit tests on it. Best Regards, Sonal Nube Technologies On Tue, Jul 29, 2014 at 10:10 PM, K

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread Kostiantyn Kudriavtsev
Hi, try this one: http://simpletoad.blogspot.com/2014/07/runing-spark-unit-test-on-windows-7.html It's more about fixing a Windows-specific issue, but the code snippet gives the general idea: just run the ETL and check the output with assert(s). On Jul 29, 2014, at 6:29 PM, soumick86 wrote: > Is there any example

the pregel operator of graphx throws NullPointerException

2014-07-29 Thread Denis RP
Hi, I'm running a Spark standalone cluster to calculate single-source shortest paths. Here is the code, with VertexRDD[(String, Long)]: String for the path and Long for the distance. The code before these lines is related to reading the graph data from file and building the graph. 71 val sssp = initialG
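For context, the standard GraphX single-source shortest-path pattern the truncated code appears to follow looks roughly like this (here with plain Double distances rather than the (String, Long) attributes mentioned above; graph, its Double edge attributes, and sourceId are assumptions):

    import org.apache.spark.graphx._

    val sourceId: VertexId = 0L
    val initialGraph = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),            // vertex program
      triplet =>                                                 // send messages
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b))                                  // merge messages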

Re: SPARK OWLQN Exception: Iteration Stage is so slow

2014-07-29 Thread Xiangrui Meng
Do you mind sharing more details, for example, specs of nodes and data size? -Xiangrui 2014-07-29 2:51 GMT-07:00 John Wu : > Hi all, > > > > There is a problem we can’t resolve. We implement the OWLQN algorithm in > parallel with SPARK, > > We don’t know why It is very slow in every iteration stag

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread jay vyas
I've been working some on building spark blueprints, and recently tried to generalize one for easy blueprints of spark apps. https://github.com/jayunit100/SparkBlueprint.git It runs the spark app's main method in a unit test, and builds in SBT. You can easily try it out and improve on it. Obvio

Job using Spark for Machine Learning

2014-07-29 Thread Martin Goodson
I'm not sure if job adverts are allowed on here - please let me know if not. Otherwise, if you're interested in using Spark in an R&D machine learning project then please get in touch. We are a startup based in London. Our data sets are on a massive scale- we collect data on over a billion users

Unit Testing (JUnit) with Spark

2014-07-29 Thread soumick86
Is there any example out there for unit testing a Spark application in Java? Even a trivial application like word count would be very helpful. I am very new to this and I am struggling to understand how I can use JavaSparkContext with JUnit. -- View this message in context: http://apache-spark-us

Avro Schema + GenericRecord to HadoopRDD

2014-07-29 Thread Laird, Benjamin
Hi all, I can read Avro files into Spark with HadoopRDD and submit the schema in the jobConf, but with the guidance I've seen so far, I'm left with an Avro GenericRecord of Java objects without types. How do I actually use the schema to have the types inferred? Example: scala> AvroJob.setInputSc

Re: UpdatestateByKey assumptions

2014-07-29 Thread RodrigoB
Just an update on this: looking into the Spark logs, it seems that some partitions are not found and are recomputed. This gives the impression that those are related to the delayed updateStateByKey calls. I'm seeing something like: log line 1 - Partition rdd_132_1 not found, computing it log line N - Found

UpdatestateByKey assumptions

2014-07-29 Thread RodrigoB
Hi all, I'm currently having a load issue with the updateStateByKey function. It seems to run with considerable delay for a few of the state objects when their number increases. I have a 1-sec batch size receiving events from a Kafka stream, which creates state objects and also updates them consequen
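For reference, the basic updateStateByKey shape; the key and value types are assumptions, kafkaPairs is a hypothetical DStream[(String, Int)] built from the Kafka stream, and the checkpoint path is a placeholder. The update function runs for every tracked key on every batch, which is one reason a large number of state objects struggles with a 1-second batch.

    import org.apache.spark.streaming.StreamingContext._

    ssc.checkpoint("hdfs:///checkpoints")   // required for stateful operations

    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)

    val stateDStream = kafkaPairs.updateStateByKey[Int](updateFunc)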

Re: Spark streaming vs. spark usage

2014-07-29 Thread andy petrella
Yep, but RDD/DStream would hardly fit the Monad contract (discussed several times, and still under discussion here and there ;)) For instance, look at the signature of flatMap in both traits. Albeit, an RDD that can generate other RDDs (flatMap) is rather something like a DStream or 'CRDD

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread andy petrella
'lookup' on a (pair) RDD, maybe? On 29 Jul 2014 12:04, "Yifan LI" wrote: > Hi Bin, > > Maybe you could get the vertex, for instance, whose id is 80, by using: > > *graph.vertices.filter{case(id, _) => id==80}.collect* > > but I am not sure this is the most efficient way (it will scan the > w

Re: Spark Streaming timestamps

2014-07-29 Thread Laeeq Ahmed
Hi Bill, Hope the following is what you need. val zerotime = System.currentTimeMillis() Then in foreach do the following //difference = RDDtimeparameter - zerotime //only to find the constant value to be used later starttime = (RDDtimeparameter - (zerotime + difference)) -  intervalsize endt

Re: [GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Yifan LI
Hi Bin, Maybe you could get the vertex, for instance, whose id is 80, by using: graph.vertices.filter{case(id, _) => id==80}.collect but I am not sure this is the most efficient way (it will scan the whole table? it cannot benefit from the index of the VertexRDD table). @Ankur, is there any

SPARK OWLQN Exception: Iteration Stage is so slow

2014-07-29 Thread John Wu
Hi all, There is a problem we can’t resolve. We implemented the OWLQN algorithm in parallel with Spark. We don’t know why it is very slow in every iteration stage, while the CPU and memory load of each executor is so low that it seems impossible for that to make every step slow. And there

[GraphX] How to access a vertex via vertexId?

2014-07-29 Thread Bin
Hi All, I wonder how to access a vertex via its vertexId? I need to get a vertex's attributes after running a graph algorithm. Thanks very much! Best, Bin

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow。

2014-07-29 Thread Earthson
It's really strange that the CPU load is so high while both the disk and network I/O load are so low. CLUSTER BY is just something similar to groupBy; why does it need so much CPU? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SparkSQL-reduce-stage-of-shuffle-i

Re: sbt directory missed

2014-07-29 Thread Hao REN
What is confusing is that spark-0.9.2-bin-hadoop1.tgz contains the source code and sbt, while spark-1.0.1-bin-hadoop1.tgz does not. According

Re: How true is this about spark streaming?

2014-07-29 Thread Sean Owen
I'm not sure I understand this, maybe because the context is missing. An RDD is immutable, so there is no such thing as writing to an RDD. I'm not sure which aspect is being referred to as single-threaded. Is this the Spark Streaming driver? What is the difference between "streaming into Spark" an

Re: evaluating classification accuracy

2014-07-29 Thread Sean Owen
Yes. In addition, I think Xiangrui updated the examples anyway to use a different form that does not rely on zip: test.map(v => (model.predict(v.features), v.label)) It avoids evaluating test twice, and avoids the zip. Although I suppose you have to bear in mind it now calls predict() on each elem
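A small follow-on sketch of turning those pairs into an accuracy figure; model and test (an RDD of LabeledPoint) are assumed from the surrounding discussion.

    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = predictionAndLabel
      .filter { case (prediction, label) => prediction == label }
      .count().toDouble / test.count()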