Re: quickly counting the number of rows in a partition?

2015-01-14 Thread Michael Segel
Sorry, but the accumulator is still going to require you to walk through the RDD to get an accurate count, right? It's not being persisted? On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Alternative to doing a naive toArray is to declare an accumulator per
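A minimal sketch of the accumulator approach under discussion (assuming Spark 1.x, an existing SparkContext sc and RDD rdd); as noted above, the count only materializes after an action walks the RDD:

    val count = sc.accumulator(0L, "rowCount")   // named accumulator, Spark 1.x API
    rdd.foreach(_ => count += 1L)                // a full pass over the data
    println(count.value)                         // readable on the driver only after the action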

Re: How to access OpenHashSet in my standalone program?

2015-01-14 Thread Reynold Xin
The reason is fairly simple actually - we don't want to commit to maintaining the specific APIs exposed. If we expose OpenHashSet, we will have to always keep that in Spark and not change the API. On Tue, Jan 13, 2015 at 12:39 PM, Tae-Hyuk Ahn ahn@gmail.com wrote: Thank, Josh and Reynold.

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-14 Thread Nishanth P S
Yes, we are close to having more than 2 billion users. In this case what is the best way to handle this. Thanks, Nishanth On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng men...@gmail.com wrote: Do you have more than 2 billion users/products? If not, you can pair each user/product id with an integer
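A hedged sketch of the id-pairing idea mentioned here (names are illustrative; assuming Spark 1.x): assign each external user id a compact Int surrogate before feeding ALS, and keep the mapping to translate results back:

    // rawUserIds: RDD[String] of external (big) user ids -- illustrative name
    val userIdMap = rawUserIds.distinct().zipWithIndex()       // (externalId, 0L .. N-1)
      .map { case (extId, idx) => (extId, idx.toInt) }         // safe while N < 2^31
    // join userIdMap against the ratings to swap in Int ids before calling ALS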

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Pala M Muthaia
conversion_action_id is an int. We also tried a string column predicate with a single-quoted string value and hit the same error stack. On Wed, Jan 14, 2015 at 7:11 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Just a guess but what is the type of conversion_action_id? I do queries over an

RE: Why always spilling to disk and how to improve it?

2015-01-14 Thread Shuai Zheng
Thanks a lot! I just realized that Spark is not really an in-memory version of MapReduce :) From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, January 13, 2015 3:53 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Why always spilling to disk and how to improve it?

Re: Issues running spark on cluster

2015-01-14 Thread Akhil Das
An assumption would be your master process is getting killed for some reason, it could be because of OOM kills by the kernel. (I'm assuming you are running your driver program on the master node itself.) Thanks Best Regards On Wed, Jan 14, 2015 at 11:25 PM, TJ Klein tjkl...@gmail.com wrote:

component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use

2015-01-14 Thread Jianguo Li
Hi, I am using the sbt tool to build and run the scala tests related to spark. In my /src/test/scala directory, there are two test classes (TestA, TestB), both of which use the class in Spark for creating SparkContext, something like trait LocalTestSparkContext extends BeforeAndAfterAll { self:

Re: how to run python app in yarn?

2015-01-14 Thread Jian Feng
Thanks Josh and Marcelo! It now works! BTW, just wondering, is there any perf difference between running spark in standalone mode and under yarn? The only goal that I created this cluster is to run spark jobs. So I can set up spark in standalone mode if it runs slow in yarn. best. From:

Re: java.lang.ExceptionInInitializerError/Unable to load YARN support

2015-01-14 Thread maven
All, I'm still facing this issue. Any thoughts on how I can fix this? NR -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ExceptionInInitializerError-Unable-to-load-YARN-support-tp20775p21143.html Sent from the Apache Spark User List mailing list

how to run python app in yarn?

2015-01-14 Thread freedafeng
A cdh5.3.0 with spark is set up. just wondering how to run a python application on it. I used 'spark-submit --master yarn-cluster ./loadsessions.py' but got the error, Error: Cluster deploy mode is currently not supported for python applications. Run with --help for usage help or --verbose for

Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
I quickly went through the code, In ExecutorBackend, we build the actor system with // Create SparkEnv using properties we fetched from the driver. val driverConf = new SparkConf().setAll(props) val env = SparkEnv.createExecutorEnv( driverConf, executorId, hostname, port, cores, isLocal =

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-14 Thread Xiangrui Meng
Hi Nishanth, Just found out where you work :) We had some discussion in https://issues.apache.org/jira/browse/SPARK-2465 . Having long IDs will increase the communication cost, which may not be worth the benefit. Not many companies have more than 1 billion users. If they do, maybe they can mirror the

Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
sorry for the mistake, I found that those akka-related messages are from Spark's Akka-related component (ActorLogReceive), instead of Akka itself. Though it has been enough for the debugging purpose (in my case), the question in this thread is still open…. Best, -- Nan Zhu

Re: *ByKey aggregations: performance + order

2015-01-14 Thread Tobias Pfeiffer
Sean, thanks for your message. On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen so...@cloudera.com wrote: On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer t...@preferred.jp wrote: OK, it seems like even on a local machine (with no network overhead), the groupByKey version is about 5 times slower

Re: enable debug-level log output of akka?

2015-01-14 Thread Ted Yu
I assume you have looked at: http://doc.akka.io/docs/akka/2.0/scala/logging.html http://doc.akka.io/docs/akka/current/additional/faq.html (Debugging, last question) Cheers On Wed, Jan 14, 2015 at 2:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all though

Re: Running beyond memory limits in ConnectedComponents

2015-01-14 Thread Sean Owen
That's not quite what that error means. Spark is not out of memory. It means that Spark is using more memory than it asked YARN for. That in turn is because the default amount of cushion established between the YARN-allowed container size and the JVM heap size is too small. See
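A hedged sketch of widening that cushion (Spark 1.1/1.2 property names; the values are only illustrative):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.executor.memory", "20g")
      .set("spark.yarn.executor.memoryOverhead", "2048")   // extra MB YARN reserves beyond the JVM heap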

Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
for others who have the same question: you can simply set logging level in log4j.properties to DEBUG to achieve this Best, -- Nan Zhu http://codingcat.me On Wednesday, January 14, 2015 at 6:28 PM, Nan Zhu wrote: I quickly went through the code, In ExecutorBackend, we build the
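Concretely, in conf/log4j.properties (copied from log4j.properties.template) that usually means something like:

    log4j.rootCategory=DEBUG, console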

Custom JSON schema inference

2015-01-14 Thread Corey Nolet
I'm working with RDD[Map[String,Any]] objects all over my codebase. These objects were all originally parsed from JSON. The processing I do on RDDs consists of parsing json - grouping/transforming dataset into a feasible report - outputting data to a file. I've been wanting to infer the schemas

Re: how to run python app in yarn?

2015-01-14 Thread freedafeng
Got help from Marcelo and Josh. Now it is running smoothly. In case you need this info - Just use yarn-client instead of yarn-cluster Thanks folks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-run-python-app-in-yarn-tp21141p21142.html Sent from

Re: component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use

2015-01-14 Thread Jianguo Li
I solved the issue. In case anyone else is looking for an answer: by default, scalatest executes all the tests in parallel. To disable this, just put the following line in your build.sbt: parallelExecution in Test := false Thanks On Wed, Jan 14, 2015 at 2:30 PM, Jianguo Li

Re: how to run python app in yarn?

2015-01-14 Thread Josh Rosen
There's an open PR for supporting yarn-cluster mode in PySpark: https://github.com/apache/spark/pull/3976 (currently blocked on reviewer attention / time) On Wed, Jan 14, 2015 at 3:16 PM, Marcelo Vanzin van...@cloudera.com wrote: As the error message says... On Wed, Jan 14, 2015 at 3:14 PM,

Re: enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
Hi, Ted, Thanks, I know how to set it in Akka's context; my question is just how to pass this akka.loglevel=DEBUG to Spark's actor system Best, -- Nan Zhu http://codingcat.me On Wednesday, January 14, 2015 at 6:09 PM, Ted Yu wrote: I assume you have looked at:

RE: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Cheng, Hao
The log showed it failed in parsing, so the typo stuff shouldn't be the root cause. BUT I couldn't reproduce that with the master branch. I did the test as follows: sbt/sbt -Phadoop-2.3.0 -Phadoop-2.3 -Phive -Phive-0.13.1 hive/console scala> sql("SELECT user_id FROM actions where

enable debug-level log output of akka?

2015-01-14 Thread Nan Zhu
Hi, all though https://issues.apache.org/jira/browse/SPARK-609 was closed, I'm still unclear about how to enable debug-level log output in Spark's actor system. Can anyone give a suggestion? BTW, I think we need to document it somewhere, as the user who writes actor-based receiver of

Re: Issues running spark on cluster

2015-01-14 Thread TJ Klein
I got it working. It was a bug in Spark 1.1. After upgrading to 1.2 it worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issues-running-spark-on-cluster-tp21138p21140.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Using a RowMatrix inside a map

2015-01-14 Thread Alex Minnaar
I am working with a RowMatrix and I noticed in the multiply() method that the local matrix with which it is being multiplied is being distributed to all of the rows of the RowMatrix. If this is the case, then is it impossible to multiply a row matrix within a map operation? Because this would

Re: how to run python app in yarn?

2015-01-14 Thread Marcelo Vanzin
As the error message says... On Wed, Jan 14, 2015 at 3:14 PM, freedafeng freedaf...@yahoo.com wrote: Error: Cluster deploy mode is currently not supported for python applications. Use yarn-client instead of yarn-cluster for PySpark apps. -- Marcelo
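For the command quoted earlier in this thread, the working form would be along the lines of:

    spark-submit --master yarn-client ./loadsessions.py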

Re: Using a RowMatrix inside a map

2015-01-14 Thread Xiangrui Meng
Yes, you can only use RowMatrix.multiply() within the driver. We are working on distributed block matrices and linear algebra operations on top of it, which would fit your use cases well. It may take several PRs to finish. You can find the first one here: https://github.com/apache/spark/pull/3200

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
yeah, that makes sense. Pala, are you on a prebuilt version of Spark -- I just tried the CDH4 prebuilt... Here is what I get for the = token: [image: Inline image 1] The literal type shows as 290, not 291, and 290 is numeric. According to this

Re: How to replay consuming messages from kafka using spark streaming?

2015-01-14 Thread Cody Koeninger
Take a look at the implementation linked from here https://issues.apache.org/jira/browse/SPARK-4964 see if that would meet your needs On Wed, Jan 14, 2015 at 9:58 PM, mykidong mykid...@gmail.com wrote: Hi, My Spark Streaming Job is doing like kafka etl to HDFS. For instance, every 10 min.

Re: Is there Spark's equivalent for Storm's ShellBolt?

2015-01-14 Thread Cody Koeninger
Look at the method pipe http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD On Wed, Jan 14, 2015 at 11:16 PM, umanga bistauma...@gmail.com wrote: This is question i originally asked in Quora: http://qr.ae/6qjoI We have some code written in C++ and Python that
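A minimal sketch of RDD.pipe (the script name is illustrative; assuming an existing RDD rdd): each partition's elements are written to the external process's stdin one per line, and its stdout lines come back as an RDD[String]:

    // enrich.py reads lines on stdin and writes enriched lines on stdout
    val enriched = rdd.pipe("python ./enrich.py")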

Spark's equivalent of ShellBolt

2015-01-14 Thread Umanga Bista
This is a question I originally asked on Quora: http://qr.ae/6qjoI We have some code written in C++ and Python that does data enrichment on our data streams. If I use Storm, I could use that code with some small modifications using ShellBolt and IRichBolt. Since the

RE: How to replay consuming messages from kafka using spark streaming?

2015-01-14 Thread Shao, Saisai
I think there're two solutions: 1. Enable write ahead log in Spark Streaming if you're using Spark 1.2. 2. Using third-party Kafka consumer (https://github.com/dibbhatt/kafka-spark-consumer). Thanks Saisai -Original Message- From: mykidong [mailto:mykid...@gmail.com] Sent: Thursday,
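A hedged sketch of option 1 with Spark 1.2 (the checkpoint path is illustrative; the write ahead log also needs a checkpoint directory set on the StreamingContext):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("kafka-etl")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // Spark 1.2 property
    val ssc = new StreamingContext(conf, Seconds(600))                 // 10-minute batches, as in the question
    ssc.checkpoint("hdfs:///checkpoints/kafka-etl")                    // WAL data lives under the checkpoint dir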

Fast HashSets HashMaps - Spark Collection Utils

2015-01-14 Thread Night Wolf
Hi all, I'd like to leverage some of the fast Spark collection implementations in my own code, particularly for doing things like distinct counts in a mapPartitions loop. Are there any plans to make the org.apache.spark.util.collection implementations public? Is there any other library out

Is there Spark's equivalent for Storm's ShellBolt?

2015-01-14 Thread umanga
This is a question I originally asked on Quora: http://qr.ae/6qjoI We have some code written in C++ and Python that does data enrichment on our data streams. If I use Storm, I could use that code with some small modifications using ShellBolt and IRichBolt. Since the functionality is all about

Re:

2015-01-14 Thread Tobias Pfeiffer
Hi, On Thu, Jan 15, 2015 at 12:23 AM, Ted Yu yuzhih...@gmail.com wrote: On Wed, Jan 14, 2015 at 6:58 AM, Jianguo Li flyingfromch...@gmail.com wrote: I am using Spark-1.1.1. When I used sbt test, I ran into the following exceptions. Any idea how to solve it? Thanks! I think somebody posted

How to replay consuming messages from kafka using spark streaming?

2015-01-14 Thread mykidong
Hi, My Spark Streaming Job is doing like kafka etl to HDFS. For instance, every 10 min. my streaming job is retrieving messages from kafka, and save them as avro files onto hdfs. My question is, if worker fails to write avro to hdfs, sometimes, I want to replay consuming messages from the last

Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Hello, From accumulator documentation, it says that if the accumulator is named, it will be displayed in the WebUI. However, I cannot find it anywhere. Do I need to specify anything in the spark ui config? Thanks. Justin

Re: Running beyond memory limits in ConnectedComponents

2015-01-14 Thread Nitin kak
Thanks Sean. I guess Cloudera Manager has parameters executor_total_max_heapsize and worker_max_heapsize which point to the parameters you mentioned above. How much should that cushion between the jvm heap size and yarn memory limit be? I tried setting jvm memory to 20g and yarn to 24g, but it

Re: Issues with constants in Spark HiveQL queries

2015-01-14 Thread Yana Kadiyska
Just a guess but what is the type of conversion_action_id? I do queries over an epoch all the time with no issues (where epoch's type is bigint). You can see the source here https://github.com/apache/spark/blob/v1.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala -- not sure what

[no subject]

2015-01-14 Thread Jianguo Li
I am using Spark-1.1.1. When I used sbt test, I ran into the following exceptions. Any idea how to solve it? Thanks! I think somebody posted this question before, but no one seemed to have answered it. Could it be the version of io.netty I put in my build.sbt? I included a dependency

java.lang.NoClassDefFoundError: io/netty/util/TimerTask Error when running sbt test

2015-01-14 Thread Jianguo Li
I am using Spark-1.1.1. When I used sbt test, I ran into the following exceptions. Any idea how to solve it? Thanks! I think somebody posted this question before, but no one seemed to have answered it. Could it be the version of io.netty I put in my build.sbt? I included a dependency

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread Yana Kadiyska
If the wildcard path you have doesn't work you should probably open a bug -- I had a similar problem with Parquet and it was a bug which recently got closed. Not sure if sqlContext.avroFile shares a codepath with .parquetFile...you can try running with bits that have the fix for .parquetFile or

Re:

2015-01-14 Thread Ted Yu
From pom.xml (master branch): <dependency> <groupId>io.netty</groupId> <artifactId>netty-all</artifactId> <version>4.0.23.Final</version> </dependency> Please check the version of netty Spark 1.1.1 depends on. Cheers On Wed, Jan 14, 2015 at 6:58 AM, Jianguo Li
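If the clash comes from another dependency dragging in a different Netty, a hedged build.sbt sketch of pinning the version (the version shown is the one from the master pom quoted above; the right value is whatever the Spark release in use actually depends on):

    dependencyOverrides += "io.netty" % "netty-all" % "4.0.23.Final"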

Re: Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Found it. Thanks Patrick. Justin On Wed, Jan 14, 2015 at 10:38 PM, Patrick Wendell pwend...@gmail.com wrote: It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote: Hello, From

Serializability: for vs. while loops

2015-01-14 Thread Tobias Pfeiffer
Hi, sorry, I don't like questions about serializability myself, but still... Can anyone give me a hint why for (i <- 0 to (maxId - 1)) { ... } throws a NotSerializableException in the loop body while var i = 0; while (i < maxId) { // same code as in the for loop i += 1 } works

Re: How to use BigInteger for userId and productId in collaborative Filtering?

2015-01-14 Thread Nishanth P S
Hi Xiangrui, Thanks! for the reply, I will explore the suggested solutions. -Nishanth Hi Nishanth, Just found out where you work:) We had some discussion in https://issues.apache.org/jira/browse/SPARK-2465 . Having long IDs will increase the communication cost, which may not worth the benefit.

Re: Accumulator value in Spark UI

2015-01-14 Thread Patrick Wendell
It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote: Hello, From accumulator documentation, it says that if the accumulator is named, it will be displayed in the WebUI. However, I cannot find

Re: Accumulators

2015-01-14 Thread Corey Nolet
Just noticed an error in my wording. Should be I'm assuming it's not immediately aggregating on the driver each time I call the += on the Accumulator. On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet cjno...@gmail.com wrote: What are the limitations of using Accumulators to get a union of a bunch

Accumulators

2015-01-14 Thread Corey Nolet
What are the limitations of using Accumulators to get a union of a bunch of small sets? Let's say I have an RDD[Map[String,Any]] and I want to do: rdd.map(accumulator += Set(_.get(entityType).get)) What implication does this have on performance? I'm assuming it's not immediately aggregating
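A hedged sketch of that pattern using Spark 1.x's collection accumulators (the "entityType" field name is taken from the snippet above; the merged value is only meaningful on the driver, after an action has run):

    import scala.collection.mutable
    val entityTypes = sc.accumulableCollection(mutable.Set[String]())
    rdd.foreach(m => entityTypes += m("entityType").toString)   // executors add elements per task
    println(entityTypes.value)                                   // merged set, driver-side only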

Re: View executor logs on YARN mode

2015-01-14 Thread Brett Meyer
You can view the logs for the particular containers on the YARN UI if you go to the page for a specific node, and then from the Tools menu on the left, select Local Logs. There should be a userlogs directory which will contain the specific application ids for each job that you run. Inside the

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
Should I be able to pass multiple paths separated by commas? I haven't tried but didn't think it'd work. I'd expected a function that accepted a list of strings. On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: If the wildcard path you have doesn't work you should

Re: Spark executors resources. Blocking?

2015-01-14 Thread Brett Meyer
I can't speak to Mesos solutions, but for YARN you can define queues in which to run your jobs, and you can customize the amount of resources the queue consumes. When deploying your Spark job, you can specify the --queue queue_name option to schedule the job to a particular queue. Here are some
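For example (queue, class and jar names are illustrative):

    spark-submit --master yarn-cluster --queue my_spark_queue --class com.example.MyJob my-job.jar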

Re: Reading resource files in a Spark application

2015-01-14 Thread Arun Lists
The problem is that it gives an error message saying something to the effect that: URI is not hierarchical This is consistent with your explanation. Thanks, arun On Wed, Jan 14, 2015 at 1:14 AM, Sean Owen so...@cloudera.com wrote: My hunch is that it is because the URI of a resource in a

reconciling best effort real time with delayed aggregation

2015-01-14 Thread Adrian Mocanu
Hi I have a question regarding design trade offs and best practices. I'm working on a real time analytics system. For simplicity, I have data with timestamps (the key) and counts (the value). I use DStreams for the real time aspect. Tuples w the same timestamp can be across various RDDs and I

FileInputDStream missing files

2015-01-14 Thread Jem Tucker
Hi all, A small number of the files being moved into my landing directory are not being seen by my fileStream receiver. After looking at the code it seems that, in the case of long batches (> 1 minute), if files are created before a batch finishes, but only become visible after that batch finished

Issues running spark on cluster

2015-01-14 Thread TJ Klein
Hi, I am running PySpark on a cluster. Generally it runs. However, frequently I get the warning message (and consequently, the task not being executed): WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have

Can I save RDD to local file system and then read it back on spark cluster with multiple nodes?

2015-01-14 Thread Wang, Ningjun (LNG-NPV)
Can I save RDD to the local file system and then read it back on a spark cluster with multiple nodes? rdd.saveAsObjectFile("file:///home/data/rdd1") val rdd2 = sc.objectFile("file:///home/data/rdd1") This will work if the cluster has only one node.

Re: dockerized spark executor on mesos?

2015-01-14 Thread Josh J
We have dockerized Spark Master and worker(s) separately and are using it in our dev environment. Is this setup available on github or dockerhub? On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian vsubr...@gmail.com wrote: We have dockerized Spark Master and worker(s) separately and are

spark crashes on second or third call first() on file

2015-01-14 Thread Linda Terlouw
I'm new to Spark. When I use the Movie Lens dataset 100k (http://grouplens.org/datasets/movielens/), Spark crashes when I run the following code. The first call to movieData.first() gives the correct result. Depending on which machine I use, it crashed the second or third time. Does anybody

Spark with MongoDB problem

2015-01-14 Thread Oliver Fernandez
Hello, I'm learning to use Spark with MongoDB, but I've encountered a problem that I think is related to the way I use Spark, because it doesn't make any sense to me. My concept test is that I want to filter a collection containing about 800K documents by a certain field. My code is very

about window length and slide interval

2015-01-14 Thread Hoai-Thu Vuong
Could we run Spark Streaming and reduceByWindow with the same window length and slide interval?

Re: use netty shuffle for network cause high gc time

2015-01-14 Thread lihu
I used Spark 1.1 On Wed, Jan 14, 2015 at 2:24 PM, Aaron Davidson ilike...@gmail.com wrote: What version are you running? I think spark.shuffle.use.netty was a valid option only in Spark 1.1, where the Netty stuff was strictly experimental. Spark 1.2 contains an officially supported and

Re: Reading resource files in a Spark application

2015-01-14 Thread Sean Owen
My hunch is that it is because the URI of a resource in a JAR file will necessarily be specific to where the JAR is on the local filesystem and that is not portable or the right way to read a resource. But you didn't specify the problem here. On Jan 14, 2015 5:15 AM, Arun Lists

Re: about window length and slide interval

2015-01-14 Thread francois . garillot
Why would you want to ? That would be equivalent to setting your batch interval to the same duration you’re suggesting to specify for both window length and slide interval, IIUC. Here’s a nice explanation of windowing by TD : https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4

RE: Spark 1.2.0 not getting connected

2015-01-14 Thread Abhideep Chakravarty
Hi Akhil, Config: scala-spark-jar-location="" spark-master-location=spark://172.22.193.138:7007 Error log msg: 14:46:52.443 [sparkDriver-akka.actor.default-dispatcher-2] DEBUG o.a.s.s.c.SparkDeploySchedulerBackend - [actor] received message ReviveOffers

Re: about window length and slide interval

2015-01-14 Thread Hoai-Thu Vuong
I would like to count values in non-overlapping windows, so I think I can do it with the same value for window length and slide interval (note that this value is a multiple of the batch interval). However, nothing is calculated. Thanks! On Wed Jan 14 2015 at 4:16:27 PM francois.garil...@typesafe.com
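A hedged sketch of a tumbling (non-overlapping) count along those lines, assuming a 10-second batch interval so that 30 seconds is a whole multiple of it (names are illustrative):

    import org.apache.spark.streaming.Seconds
    // stream: DStream[String] built elsewhere, e.g. from a socket or Kafka receiver
    val counts = stream.map(_ => 1L).reduceByWindow(_ + _, Seconds(30), Seconds(30))
    counts.print()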

Re: saveAsObjectFile is actually saveAsSequenceFile

2015-01-14 Thread Sean Owen
Yeah it is actually serializing elements after chunking them into arrays of 10 elements each. It's not actually a key-value pair in the SequenceFile for each element. That is how objectFile() reads it and flatMaps it, and the docs say that the intent is that this is an opaque, not-guaranteed

Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
Hi, I have a program that loads a single avro file using spark SQL, queries it, transforms it and then outputs the data. The file is loaded with: val records = sqlContext.avroFile(filePath) val data = records.registerTempTable(data) ... Now I want to run it over tens of thousands of Avro files
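A hedged sketch of two ways this might be done with the spark-avro package and Spark 1.2 SchemaRDDs (paths and variable names are illustrative; whether avroFile accepts glob patterns is exactly what is discussed elsewhere in this thread):

    import com.databricks.spark.avro._
    // option 1: try a glob path
    val all = sqlContext.avroFile("hdfs:///data/avro/*.avro")
    // option 2: load the files individually and union the SchemaRDDs
    val filePaths: Seq[String] = Seq("hdfs:///data/avro/a.avro", "hdfs:///data/avro/b.avro")
    val combined = filePaths.map(p => sqlContext.avroFile(p)).reduce(_ unionAll _)
    combined.registerTempTable("data")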

Re: about window length and slide interval

2015-01-14 Thread Sean Owen
You're just describing the normal operation of Spark Streaming then. Windowing is for when you want overlapping intervals. Here you simply do not window, and ask for intervals of whatever time you want. You get non-overlapping RDDs. On Wed, Jan 14, 2015 at 9:26 AM, Hoai-Thu Vuong

RE: How to integrate Spark with OpenCV?

2015-01-14 Thread jishnu.prathap
Hi Akhil Thanks for the response Our use case is object detection in multiple videos. It's essentially searching for an image in a video by matching the image against all the frames of the video. I am able to do it in normal Java code using the OpenCV lib now but I don't think it is scalable to

View executor logs on YARN mode

2015-01-14 Thread lonely Feb
Hi all, I sadly found that in YARN mode I cannot view executor logs on the YARN web UI nor on the Spark history web UI. On the YARN web UI I can only view AppMaster logs, and on the Spark history web UI I can only view application metric information. If I want to see whether an executor is doing full GC I can only use

[spark-jobserver] Submit Job in yarn-cluster mode (?)

2015-01-14 Thread Pietro Gentile
Hi all, I'm able to submit spark jobs through spark-jobserver. But this allows using spark only in yarn-client mode. I want to use spark also in yarn-cluster mode, but jobserver does not allow it, as the README file says: https://github.com/spark-jobserver/spark-jobserver. Could you tell

Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Akhil Das
Here's an example for detecting the face https://github.com/openreserach/bin2seq/blob/master/src/test/java/com/openresearchinc/hadoop/test/SparkTest.java#L190 you might find it useful. Thanks Best Regards On Wed, Jan 14, 2015 at 3:06 PM, jishnu.prat...@wipro.com wrote: Hi Akhil Thanks for

Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Alonso Isidoro Roman
Thanks Jorn, sorry for the special character your name needs, I don't know how to type it. I was thinking the same. Do you know anybody who has tried this approach? Alonso Isidoro Roman. My favorite quotes (of today): If debugging is the process of removing software errors, then

How does GraphX internally traverse the Graph?

2015-01-14 Thread mas
I want to know the internal traversal of the graph by GraphX. Is it a vertex-and-edges-based traversal or a sequential traversal of RDDs? For example, given a vertex of the graph, I want to fetch only its neighbors, not the neighbors of all the vertices. How will GraphX traverse the graph in this case?

Re: about window length and slide interval

2015-01-14 Thread Hoai-Thu Vuong
Oh I see, thank you all guys. On Wed Jan 14 2015 at 4:35:16 PM Sean Owen so...@cloudera.com wrote: You're just describing the normal operation of Spark Streaming them. Windowing is for when you want overlapping intervals. Here you simply do not window, and ask for intervals of whatever time

java.io.IOException: sendMessageReliably failed without being ACK'd

2015-01-14 Thread Antony Mayi
Hi, running spark 1.1.0 in yarn-client mode (cdh 5.2.1) on a XEN-based cloud and randomly getting my executors failing on errors like the ones below. I suspect it is some cloud networking issue (XEN driver bug?) but wondering if there is any spark/yarn workaround that I could use to mitigate it?

Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Jörn Franke
Basically, you have to think about how to split the data (for pictures this can be for instance 8x8 matrices) and use spark to distribute it to different workers which themselves call opencv with the data. Afterwards you need to combine all results again. It really depends on your image / video
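A hedged sketch of that split-and-distribute shape using Spark 1.2's binaryFiles (the path is illustrative, and matchesTemplate is a stand-in for the user's own OpenCV call, not a real API):

    // stand-in for the user-supplied OpenCV matcher
    def matchesTemplate(imageBytes: Array[Byte]): Boolean = ???

    val frames = sc.binaryFiles("hdfs:///video/frames/")          // RDD[(String, PortableDataStream)]
    val hits = frames
      .map { case (path, stream) => (path, matchesTemplate(stream.toArray())) }  // OpenCV runs on each worker
      .filter(_._2)                                                // keep frames where the object was found
      .collect()                                                   // combine the results on the driver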

Re: *ByKey aggregations: performance + order

2015-01-14 Thread Sean Owen
On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer t...@preferred.jp wrote: Now I don't know (yet) if all of the functions I want to compute can be expressed in this way and I was wondering about *how much* more expensive we are talking about. OK, it seems like even on a local machine (with no

Re: How to integrate Spark with OpenCV?

2015-01-14 Thread Alonso Isidoro Roman
I was thinking about whether it would be possible to use Apache Spark and OpenCV in order to recognize stellar objects from the data sets provided by NASA, ESA, and the other space agencies. I am asking for advice or opinions about it, not example code or anything else, but if there were any,

Re: Trouble with large Yarn job

2015-01-14 Thread Anders Arpteg
Interesting, sounds plausible. Another way to avoid the problem has been to cache intermediate output for large jobs (i.e. split large jobs into smaller ones and then union them together). Unfortunate that this type of tweaking should be necessary though, hopefully better in 1.2.1. On Tue, Jan 13, 2015 at