Global sequential access of elements in RDD

2015-02-26 Thread Wush Wu
Dear all, I want to implement some sequential algorithms on RDDs. For example: val conf = new SparkConf() conf.setMaster("local[2]"). setAppName("SequentialSuite") val sc = new SparkContext(conf) val rdd = sc. parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2). sortBy(x => x, true) r
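One way to get globally sequential access after the sort, assuming the per-element logic can run on the driver, is RDD.toLocalIterator. A minimal sketch building on the example above:

    val sorted = sc.parallelize(Array(1, 3, 2, 7, 1, 4, 2, 5, 1, 8, 9), 2)
      .sortBy(x => x, true)
    // toLocalIterator pulls one partition at a time to the driver,
    // preserving the global sort order across partition boundaries
    sorted.toLocalIterator.foreach { x =>
      println(x) // stand-in for the sequential algorithm step
    }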

Re: Get importerror when i run pyspark with ipython=1

2015-02-26 Thread Jey Kottalam
Hi Sourabh, could you try it with the stable 2.4 version of IPython? On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha wrote: > > > I get the above error when I try to run pyspark with the ipython option. I > do not ge

One of the executor not getting StopExecutor message

2015-02-26 Thread twinkle sachdeva
Hi, I am running a Spark application on YARN in cluster mode. One of my executors appears to be in a hung state for a long time, and finally gets killed by the driver. Unlike the other executors, it has not received the StopExecutor message from the driver. Here are the logs at the end of this c

Re: Reg. KNN on MLlib

2015-02-26 Thread Xiangrui Meng
It is not in MLlib. There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-2336 and Ashutosh has an implementation for integer values. -Xiangrui On Thu, Feb 26, 2015 at 8:18 PM, Deep Pradhan wrote: > Has KNN classification algorithm been implemented on MLlib? > > Thank You > Regards,

Get importerror when i run pyspark with ipython=1

2015-02-26 Thread sourabhguha
I get the above error when I try to run pyspark with the ipython option. I do not get this error when I run it without the ipython option. I have Java 8, Scala 2.10.4 and Enthought Canopy Python on my box. OS Wi

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Of course, breakpointing on every status update and revive offers invocation kept the problem from happening. Where could the race be? On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen wrote: > Love to hear some input on this. I did get a standalone cluster up on my > local machine and the pro

Speedup in Cluster

2015-02-26 Thread Deep Pradhan
What should be the expected performance of Spark Applications with the increase in the number of nodes in a cluster, other parameters being constant? Thank You Regards, Deep

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
Thanks Marcelo. Do you think it would be useful to have spark.executor.extraClassPath pick up an environment variable that can be set from spark-env.sh? Here is an example. spark-env.sh -- executor_extra_cp = get_hbase_jars_for_cp export executor_extra_cp spark-default

Reg. KNN on MLlib

2015-02-26 Thread Deep Pradhan
Has KNN classification algorithm been implemented on MLlib? Thank You Regards, Deep

Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-02-26 Thread Arun Luthra
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import org.roaringbitmap._ ... kryo.register(classOf[org.roaringbitmap.RoaringBitmap]) kryo.register(classOf[org.roaringbitmap.RoaringArray]) kryo.r
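The preview cuts off mid-registration; below is a sketch of the same registrations wired up through a KryoRegistrator instead (whether these registrations suffice is exactly what SPARK-5949 leaves open):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator
    import org.roaringbitmap.{RoaringArray, RoaringBitmap}

    class BitmapRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[RoaringBitmap])
        kryo.register(classOf[RoaringArray])
      }
    }
    // enabled with: conf.set("spark.kryo.registrator", "BitmapRegistrator")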

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Ognen Duzlevski
Thank you all! On Thu, Feb 26, 2015 at 5:50 PM, wrote: > Ignite guys spoke at the bigtop workshop last week at Scale, posted slides > here: > > https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x > > Couple main pts around comments made during the preso.., although > incubating > apache

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Love to hear some input on this. I did get a standalone cluster up on my local machine and the problem didn't present itself. I'm pretty confident that means the problem is in the LocalBackend or something near it. On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen wrote: > Okay I confirmed my

Re: Clean up app folders in worker nodes

2015-02-26 Thread markjgreene
I think the setting you are missing is 'spark.worker.cleanup.appDataTtl'. This setting controls how old a file must be before it is deleted. More info here: https://spark.apache.org/docs/1.0.1/spark-standalone.html. Also, the 'spark.worker.cleanup.interval' you have configured is pretty

RE: Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Shao, Saisai
Cool, great job☺. Thanks Jerry From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com] Sent: Thursday, February 26, 2015 6:11 PM To: user; d...@spark.apache.org Subject: Monitoring Spark with Graphite and Grafana If anyone is curious to try exporting Spark metrics to Graphite, I just publish

Spark-submit not working when application jar is in hdfs

2015-02-26 Thread dilm
I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied my application jar to a directory in hdfs, I get the following exception: Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/sim

Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
If anyone is curious to try exporting Spark metrics to Graphite, I just published a post about my experience doing that, building dashboards in Grafana, and using them to monitor Spark jobs: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/ Code

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Marcelo Vanzin
On Thu, Feb 26, 2015 at 5:12 PM, Kannan Rajah wrote: > Also, I would like to know if there is a localization overhead when we use > spark.executor.extraClassPath. Again, in the case of hbase, these jars would > be typically available on all nodes. So there is no need to localize them > from the no

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf wrote: > So when deciding whether to take on installing/configuring Spark, the size > of the data does not automatically make that decision in your mind. > You got me there ;-) Tobias

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the shuffle setting: spark.sql.shuffle.partitions On Thu, Feb 26, 2015 at 5:51 PM, java8964 wrote: > Imran, thanks for your explaining about the parallelism. That is very > helpful. > > In my test case, I am only use one box cl
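For reference, the setting defaults to 200 (hence the 200 tasks) and can be changed per SQLContext. A minimal sketch:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // default is 200; lower it for small data to cut scheduling overhead
    sqlContext.setConf("spark.sql.shuffle.partitions", "50")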

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Todd Nist
Hi Kannan, Issues with using --jars make sense. I believe you can set the classpath with --conf spark.executor.extraClassPath= or in your driver with .set("spark.executor.extraClassPath", ".") I believe you are correct about the localization as well, as long as you're guaranteed that

Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
So when deciding whether to take on installing/configuring Spark, the size of the data does not automatically make that decision in your mind. Thanks, Gary On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer wrote: > Hi > > On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf > wrote: > >> The honest a

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Hi On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf wrote: > The honest answer is that it is unclear to me at this point. I guess what > I am really wondering is if there are cases where one would find it > beneficial to use Spark against one or more RDBs? > Well, RDBs are all about *storage*, wh

Re: group by order by fails

2015-02-26 Thread Michael Armbrust
Assign an alias to the count in the select clause and use that alias in the order by clause. On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta wrote: > Actually I just realized , I am using 1.2.0. > > Thanks > Tridib > > -- > Date: Thu, 26 Feb 2015 12:37:06 +0530 > Sub
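A sketch of the suggested fix, with hypothetical table and column names:

    sqlContext.sql("""
      SELECT category, COUNT(*) AS cnt
      FROM events
      GROUP BY category
      ORDER BY cnt DESC
    """)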

Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
The honest answer is that it is unclear to me at this point. I guess what I am really wondering is if there are cases where one would find it beneficial to use Spark against one or more RDBs? On Thu, Feb 26, 2015 at 8:06 PM, Tobias Pfeiffer wrote: > Gary, > > On Fri, Feb 27, 2015 at 8:40 AM, Ga

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
There is a usability concern I have with the current way of specifying --jars. Imagine a use case like hbase where a lot of jobs need it in its classpath. This needs to be set every time. If we use spark.executor.extraClassPath, then we just need to set it once. But there is no programmatic way to s
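For a single application there is a programmatic route, as long as the property is set before the SparkContext starts. A sketch (the classpath value is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("HBaseJob")
      // must be set before the executors launch
      .set("spark.executor.extraClassPath", "/opt/hbase/lib/*")
    val sc = new SparkContext(conf)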

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Gary, On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf wrote: > I'm considering whether or not it is worth introducing Spark at my new > company. The data is no-where near Hadoop size at this point (it sits in > an RDS Postgres cluster). > Will it ever become "Hadoop size"? Looking at the overhead

partitions, SQL tables, and Parquet I/O

2015-02-26 Thread Daniel, Ronald (ELS-SDG)
Short story: I want to write some parquet files so they are pre-partitioned by the same key. Then, when I read them back in, joining the two tables on that key should be about as fast as things can be done. Can I do that, and if so, how? I don't see how to control the partitioning of a SQL table

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I think I'm getting more confused the longer this thread goes. So rdd1.dependencies provides immediate parents to rdd1. For now i'm going to walk my internal DAG from the root down and see where running the caching of siblings concurrently gets me. I still like your point, Sean, about trying to do

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
I think we already covered that in this thread. You get dependencies from RDD.dependencies() On Fri, Feb 27, 2015 at 12:31 AM, Zhan Zhang wrote: > Currently in spark, it looks like there is no easy way to know the > dependencies. It is solved at run time. > > Thanks. > > Zhan Zhang > > On Feb 26,

Re: NegativeArraySizeException when doing joins on skewed data

2015-02-26 Thread Tristan Blakers
Hi Imran, I can confirm this still happens when calling Kryo serialisation directly; note I’m using Java. The output file is at about 440mb at the time of the crash. Kryo is version 2.21. When I get a chance I’ll see if I can make a shareable test case and try on Kryo 3.0, I doubt they’d be intere

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
Currently in spark, it looks like there is no easy way to know the dependencies. It is solved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet wrote: Ted. That one I know. It was the dependency part I was curious about On Feb 26, 2015 7:12 P

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
I missed that part. Thanks for the explanation. It is a challenging problem implementation-wise. To do it programmatically: 1. pre-analyze all DAGs to form a complete DAG with the root as the source, and leaves as all actions. 2. Any RDD (node) that has more than one downstream node needs to be marked

Re: spark streaming: stderr does not roll

2015-02-26 Thread Tathagata Das
If the mentioned conf is enabled, the rolling of the stderr should work. If it is not, then there is probably some bug. Take a look at the Worker's logs and see if there is any error about rolling of the Executor's stderr. If there is a bug, then it needs to be fixed (maybe you can take a crack at

Re: Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread Xiangrui Meng
Try the following: df.map { case Row(id: Int, num: Int, value: Double, x: Float) => // replace those with your types (id, Vectors.dense(num, value, x)) }.toDF("id", "features") -Xiangrui On Thu, Feb 26, 2015 at 3:08 PM, mobsniuk wrote: > I've been searching around and see others have asked si
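A self-contained version of that suggestion, assuming the column names and types from the example and a Spark 1.3 SQLContext:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.Row
    import sqlContext.implicits._ // for toDF on an RDD of tuples

    val features = df.map {
      case Row(id: Int, num: Int, value: Double, x: Float) =>
        (id, Vectors.dense(num.toDouble, value, x.toDouble))
    }.toDF("id", "features")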

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Ted. That one I know. It was the dependency part I was curious about On Feb 26, 2015 7:12 PM, "Ted Yu" wrote: > bq. whether or not rdd1 is a cached rdd > > RDD has getStorageLevel method which would return the RDD's current > storage level. > > SparkContext has this method: >* Return informat

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Marcelo Vanzin
SPARK_CLASSPATH is definitely deprecated, but my understanding is that spark.executor.extraClassPath is not, so maybe the documentation needs fixing. I'll let someone who might know otherwise comment, though. On Thu, Feb 26, 2015 at 2:43 PM, Kannan Rajah wrote: > SparkConf.scala logs a warning s

Re: How to augment data to existing MatrixFactorizationModel?

2015-02-26 Thread Xiangrui Meng
It may take some work to do online updates with a MatrixFactorizationModel because you need to update some rows of the user/item factors. You may be interested in spark-indexedrdd (http://spark-packages.org/package/amplab/spark-indexedrdd). We support save/load in Scala/Java. We are going to add

Re: How to tell if one RDD depends on another

2015-02-26 Thread Ted Yu
bq. whether or not rdd1 is a cached rdd RDD has getStorageLevel method which would return the RDD's current storage level. SparkContext has this method: * Return information about what RDDs are cached, if they are in mem or on disk, how much space * they take, etc. */ @DeveloperApi d
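Put together, a sketch of the check being discussed (given some rdd1):

    import org.apache.spark.storage.StorageLevel

    // a cached RDD reports a storage level other than NONE
    val isCached = rdd1.getStorageLevel != StorageLevel.NONE

    // cluster-wide view of what is actually materialized (DeveloperApi)
    sc.getRDDStorageInfo.foreach(println)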

Non-deterministic Accumulator Values

2015-02-26 Thread Peter Thai
Hi all, I'm incrementing several accumulators inside a foreach. Most of the time, the accumulators will return the same value for the same dataset. However, they sometimes differ. I'm not sure how accumulators are implemented. Could this behavior be caused by data not arriving before I print out

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
Yeah, I believe Corey knows that much and is using foreachPartition(i => None) to materialize. The question is, how would you do this with an arbitrary DAG? in this simple example we know what the answer is but he's trying to do it programmatically. On Thu, Feb 26, 2015 at 11:54 PM, Zhan Zhang wr

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhan, This is exactly what I'm trying to do except, as I mentioned in my first message, I am being given rdd1 and rdd2 only and I don't necessarily know at that point whether or not rdd1 is a cached rdd. Further, I don't know at that point whether or not rdd2 depends on rdd1. On Thu, Feb 26, 2015

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Todd Nist
Hi Kannan, I believe you should be able to use --jars for this when invoking the spark-shell or performing a spark-submit. Per the docs: --jars JARS Comma-separated list of local jars to include on the driver and executor classpaths. HTH. -Todd On Thu, Feb

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to finish, probably due to writing to hdfs. A workaround for this particular case may be as follows. val rdd1 = ..cache() val rdd2 = rdd1.map().() rdd1.count future { rdd1.saveAsHadoopFile(...) } future { rdd2.saveAsHadoo
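A cleaned-up sketch of that workaround; the source, transformation, and output paths are stand-ins:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val rdd1 = sc.textFile("input").cache()  // stand-in source
    val rdd2 = rdd1.map(_.toUpperCase)       // stand-in transformation

    rdd1.count() // force rdd1 fully into the cache before forking

    // both saves now read from the cache and can run concurrently
    val f1 = Future { rdd1.saveAsTextFile("out1") }
    val f2 = Future { rdd2.saveAsTextFile("out2") }
    Await.result(f1, Duration.Inf)
    Await.result(f2, Duration.Inf)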

Spark+Cassandra and how to visualize results

2015-02-26 Thread Jan Algermissen
Hi, I am planning to process an event stream in the following way: - write the raw stream through spark streaming to cassandra for later analytics use cases - ‘fork off’ the stream and do some stream analysis and make that information available to build dashboards. Since I am having ElasticSear

RE: Apache Ignite vs Apache Spark

2015-02-26 Thread nate
Ignite guys spoke at the bigtop workshop last week at Scale, posted slides here: https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x Couple main pts around comments made during the preso.., although incubating apache (first code drop was last week I believe).., tech is battle tested with

Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
I'm considering whether or not it is worth introducing Spark at my new company. The data is nowhere near Hadoop size at this point (it sits in an RDS Postgres cluster). I'm wondering at which point it is worth the overhead of adding the Spark infrastructure (deployment scripts, monitoring, etc).

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Jay Vyas
-https://wiki.apache.org/incubator/IgniteProposal has I think been updated recently and has a good comparison. - Although GridGain has been around since the Spark days, Apache Ignite is quite new and just getting started, I think, so - you will probably want to reach out to the developers for

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
> What confused me is the statement of *"The final result is that rdd1 is calculated twice.” *Is it the expected behavior? To be perfectly honest, performing an action on a cached RDD in two different threads and having them (at the partition level) block until the parents are cached would be the

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I should probably mention that my example case is much oversimplified. Let's say I've got a tree, a fairly complex one, where I begin a series of jobs at the root which calculate a bunch of really complex joins, and as I move down the tree, I'm creating reports from the data that's already b

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
The issue is that both RDDs are being evaluated at once. rdd1 is cached, which means that as its partitions are evaluated, they are persisted. Later requests for the partition hit the cached partition. But we have two threads causing two jobs to evaluate partitions of rdd1 at the same time. If they

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
What confused me is the statement of "The final result is that rdd1 is calculated twice.” Is it the expected behavior? Thanks. Zhan Zhang On Feb 26, 2015, at 3:03 PM, Sean Owen wrote: To distill this a bit further, I don't think you actually want rdd2 to wait on r

Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread mobsniuk
I've been searching around and see others have asked similar questions. Given a SchemaRDD I extract a result set that contains numbers, both Ints and Doubles. How do I construct an RDD[Vector]? In 1.2 I wrote the results to a text file and then read them back in, splitting them with some code I found in

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
To distill this a bit further, I don't think you actually want rdd2 to wait on rdd1 in this case. What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD. Otherwise the first partition of rdd2 waits on the final partition of rdd1 even whe

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2015-02-26 Thread hnahak
I am able to run it without any issue from standalone as well as in cluster. spark-submit --class org.graphx.test.GraphFromVerteXEdgeArray --executor-memory 1g --driver-memory 6g --master spark://VM-Master:7077 spark-graphx.jar The code is exactly the same as above -- View this message in context:

Re: Error: no snappyjava in java.library.path

2015-02-26 Thread Marcelo Vanzin
Hi Dan, This is a CDH issue, so I'd recommend using cdh-u...@cloudera.org for those questions. This is an issue that was fixed in recent CM 5.3 updates; if you're not using CM, or want a workaround, you can manually configure "spark.driver.extraLibraryPath" and "spark.executor.extraLibraryPath" to in

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Imran, thanks for your explanation of the parallelism. That is very helpful. In my test case, I am only using a one-box cluster, with one executor. So if I set 10 cores, then 10 concurrent tasks will run within this one executor, which handles more data than in the 4-core case, which led to the OOM. I

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhan, I think it might be helpful to point out that I'm trying to run the RDDs in different threads to maximize the amount of work that can be done concurrently. Unfortunately, right now if I had something like this: val rdd1 = ..cache() val rdd2 = rdd1.map().() future { rdd1.saveAsHaso

Re: Running spark function on parquet without sql

2015-02-26 Thread Zhan Zhang
When you use sql (or the API from SchemaRDD/DataFrame) to read data from parquet, the optimizer will do column pruning, predicate pushdown, etc. Thus you get the benefits of Parquet's columnar format. After that, you can operate on the SchemaRDD (DF) like a regular RDD. Thanks. Zhan Zhang On Feb 26, 20

Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
SparkConf.scala logs a warning saying SPARK_CLASSPATH is deprecated and we should use spark.executor.extraClassPath instead. But the online documentation states that spark.executor.extraClassPath is only meant for backward compatibility. https://spark.apache.org/docs/1.2.0/configuration.html#execu

Error: no snappyjava in java.library.path

2015-02-26 Thread Dan Dong
Hi, All, When I run a small program in spark-shell, I got the following error: ... Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886) at java.lang.Runtime.loadLibrary0(Runtime.java:849) at java.lang

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
You don’t need to know the RDD dependencies to maximize concurrency. Internally the scheduler will construct the DAG and trigger the execution if there are no shuffle dependencies between RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet wrote: > Let's say I'm given 2 RDDs and

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Zhan Zhang
Here is my understanding. When running on top of yarn, the number of cores means the number of tasks that can run in one executor. But all these cores are located in the same JVM. Parallelism typically controls the balance of tasks. For example, if you have 200 cores but only 50 partitions, there will be 150

Re: Cartesian issue with user defined objects

2015-02-26 Thread Marco Gaido
Thanks, my issue was exactly that the function extracting the class from the file reused the same object, only mutating it. Creating a new object for each item solved the issue. Thank you very much for your reply. Best regards. > On Feb 26, 2015, at 22:25, Imran Rashid wrote

Re: [SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4296. It has been fixed in 1.2.1 and 1.3. On Thu, Feb 26, 2015 at 1:22 PM, Yana Kadiyska wrote: > Can someone confirm if they can run UDFs in group by in spark1.2? > > I have two builds running -- one from a custom build from early Decemb

Re: Integrating Spark Streaming with Reactive Mongo

2015-02-26 Thread Tathagata Das
Hey Mike, I quickly looked through the example and I found a major performance issue. You are collecting the RDDs to the driver and then sending them to Mongo in a foreach. Why not do a distributed push to Mongo? WHAT YOU HAVE val mongoConnection = ... WHAT YOU SHOULD DO rdd.foreachPartition {
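The shape being described, with the Mongo calls left as hypothetical stand-ins for the Reactive Mongo API:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // runs on the worker: one connection per partition, not per record
        val connection = createMongoConnection() // hypothetical
        records.foreach(record => write(connection, record)) // hypothetical
        connection.close()
      }
    }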

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping with OOM. I wanted to ask this (hopefully without straying off topic): we can specify the number of cores and the executor memory. But we don't get to specify _how_ the cores are spread among executors. Is it possible that wi

Re: How to tell if one RDD depends on another

2015-02-26 Thread Imran Rashid
no, it does not give you transitive dependencies. You'd have to walk the tree of dependencies yourself, but that should just be a few lines. On Thu, Feb 26, 2015 at 3:32 PM, Corey Nolet wrote: > I see the "rdd.dependencies()" function, does that include ALL the > dependencies of an RDD? Is it s
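The "few lines" might look like this sketch:

    import org.apache.spark.rdd.RDD

    def transitiveDeps(rdd: RDD[_]): Seq[RDD[_]] = {
      val parents = rdd.dependencies.map(_.rdd)
      // duplicates are possible in diamond-shaped DAGs; dedupe if needed
      parents ++ parents.flatMap(transitiveDeps)
    }

    // then: transitiveDeps(rdd2).contains(rdd1)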

Running spark function on parquet without sql

2015-02-26 Thread tridib
Hello Experts, In one of my projects we have parquet files and we are using spark SQL to get our analytics. I am encountering a situation where simple SQL is not getting me what I need or the complex SQL is not supported by Spark Sql. In scenarios like this I am able to get things done using lo

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Imran Rashid
Hi Yong, mostly correct except for: > >- Since we are doing reduceByKey, shuffling will happen. Data will be >shuffled into 1000 partitions, as we have 1000 unique keys. > > no, you will not get 1000 partitions. Spark has to decide how many partitions to use before it even knows how many
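For instance, the post-shuffle partition count comes from the operation (or spark.default.parallelism), not from the number of distinct keys. A sketch with a hypothetical pair RDD:

    val pairs = sc.parallelize(1 to 100000).map(i => (i % 1000, 1))
    val byDefault = pairs.reduceByKey(_ + _)       // partition count from default parallelism
    val explicit = pairs.reduceByKey(_ + _, 1000)  // explicitly request 1000 partitions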

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Okay I confirmed my suspicions of a hang. I made a request that stopped progressing, though the already-scheduled tasks had finished. I made a separate request that was small enough not to hang, and it kicked the hung job enough to finish. I think what's happening is that the scheduler or the local

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I see the "rdd.dependencies()" function, does that include ALL the dependencies of an RDD? Is it safe to assume I can say "rdd2.dependencies.contains(rdd1)"? On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet wrote: > Let's say I'm given 2 RDDs and told to store them in a sequence file and > they have

Re: Iterating on RDDs

2015-02-26 Thread Imran Rashid
val grouped = R.groupBy[VertexId](G).persist(StorageLevel.MEMORY_ONLY_SER) // or whatever persistence makes more sense for you ... while(true) { val res = grouped.flatMap(F) res.collect.foreach(func) if(criteria) break } On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan wrote: > H

How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1 = sparkContext.sequenceFile().cache() val rdd2 = rdd1.map() How would I tell programmatically without being the one who built rdd1 and rdd2 whether or not rdd2 depend

Integrating Spark Streaming with Reactive Mongo

2015-02-26 Thread Mike Trienis
Hi All, I have Spark Streaming setup to write data to a replicated MongoDB database and would like to understand if there would be any issues using the Reactive Mongo library to write directly to the mongoDB? My stack is Apache Spark sitting on top of Cassandra for the datastore, so my thinking is

Re: Cartesian issue with user defined objects

2015-02-26 Thread Imran Rashid
any chance your input RDD is being read from hdfs, and you are running into this issue (in the docs on SparkContext#hadoopFile): * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each * record, directly caching the returned RDD or directly passing it to an aggr

[SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yana Kadiyska
Can someone confirm if they can run UDFs in group by in spark1.2? I have two builds running -- one from a custom build from early December (commit 4259ca8dd12) which works fine, and Spark1.2-RC2. On the latter I get: jdbc:hive2://XXX.208:10001> select from_unixtime(epoch,'yyyy-MM-dd-HH'),count(

Re: GroupByKey causing problem

2015-02-26 Thread Imran Rashid
Hi Tushar, The most scalable option is probably for you to consider doing some approximation. E.g., sample the data first to come up with the bucket boundaries. Then you can assign data points to buckets without needing to do a full groupByKey. You could even have more passes which correct any error

Re: NegativeArraySizeException when doing joins on skewed data

2015-02-26 Thread Imran Rashid
Hi Tristan, at first I thought you were just hitting another instance of https://issues.apache.org/jira/browse/SPARK-1391, but I actually think its entirely related to kryo. Would it be possible for you to try serializing your object using kryo, without involving spark at all? If you are unfamil

Re: Getting to proto buff classes in Spark Context

2015-02-26 Thread Akshat Aranya
My guess would be that you are packaging too many things in your job, which is causing problems with the classpath. When your jar goes in first, you get the correct version of protobuf, but some other version of something else. When your jar goes in later, other things work, but protobuf breaks.

Re: value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

2015-02-26 Thread Deepak Vohra
Sean, Would you kindly suggest which forum, mailing list, or issue tracker to use for questions about the AAS book? Or is no such provision made? regards, Deepak On Thu, 2/26/15, Sean Owen wrote: Subject: Re: value foreach is not a member of java.util.List

Missing tasks

2015-02-26 Thread Akshat Aranya
I am seeing a problem with a Spark job in standalone mode. Spark master's web interface shows a task RUNNING on a particular executor, but the logs of the executor do not show the task being ever assigned to it, that is, such a line is missing from the log: 15/02/25 16:53:36 INFO executor.CoarseG

NullPointerException in TaskSetManager

2015-02-26 Thread gtinside
Hi , I am trying to run a simple hadoop job (that uses CassandraHadoopInputOutputWriter) on spark (v1.2 , Hadoop v 1.x) but getting NullPointerException in TaskSetManager WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost task 14.2 in stage 0.0 (TID 29, devntom003.dev.black

Re: throughput in the web console?

2015-02-26 Thread Tathagata Das
If you have one receiver, and you are doing only map-like operations, then the processing will primarily happen on one machine. To use all the machines, either receive in parallel with multiple receivers, or spread out the computation by explicitly repartitioning the received streams (DStream.repartit

Re: how to map and filter in one step?

2015-02-26 Thread Mark Hamstra
rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be pipelined into a single stage, so there generally isn't any need to complect the map and filter into a single function. Additionally, there is RDD#collect[U](f: PartialFunction[T, U])(implicit arg0: ClassTag[U]): RDD[U], wh
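Side by side, with foo and bar as the hypothetical transformation and predicate from the question:

    // pipelined into a single stage either way
    val out1 = rdd.filter(bar).map(foo)

    // RDD#collect with a PartialFunction fuses the two into one call
    val out2 = rdd.collect { case x if bar(x) => foo(x) }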

Re: how to map and filter in one step?

2015-02-26 Thread Crystal Xing
I see. The reason we can use flatMap to map to null but not map is that flatMap supports mapping to zero or more elements while map only supports a 1-1 mapping? It seems flatMap is more equivalent to Hadoop's map. Thanks, Zheng zhen On Thu, Feb 26, 2015 at 10:44 AM, Sean Owen wrote:

value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

2015-02-26 Thread Deepak Vohra
Ch 6 listing from Advanced Analytics with Spark generates an error. The listing is: def plainTextToLemmas(text: String, stopWords: Set[String], pipeline: StanfordCoreNLP): Seq[String] = { val doc = new Annotation(text) pipeline.annotate(doc) val lemmas = new ArrayBuffer[String]() val
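The error in the subject usually means the Java-to-Scala collection implicits are not in scope; a likely fix (an assumption based on the error message):

    import scala.collection.JavaConversions._
    import edu.stanford.nlp.util.CoreMap

    // with the implicits imported, a java.util.List gains foreach/map/etc.
    def printSentences(sentences: java.util.List[CoreMap]): Unit =
      sentences.foreach(sentence => println(sentence))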

Re: value foreach is not a member of java.util.List[edu.stanford.nlp.util.CoreMap]

2015-02-26 Thread Sean Owen
(Books on Spark are not produced by the Spark project, and this is not the right place to ask about them. This question was already answered offline, too.) On Thu, Feb 26, 2015 at 6:38 PM, Deepak Vohra wrote: > Ch 6 listing from Advanced Analytics with Spark generates error. The > listing is >

Re: how to map and filter in one step?

2015-02-26 Thread Sean Owen
You can flatMap: rdd.flatMap { in => if (condition(in)) { Some(transformation(in)) } else { None } } On Thu, Feb 26, 2015 at 6:39 PM, Crystal Xing wrote: > Hi, > I have a text file input and I want to parse line by line and map each line > to another format. But at the same time, I

How to augment data to existing MatrixFactorizationModel?

2015-02-26 Thread anishm
I am a beginner to the world of Machine Learning and the usage of Apache Spark. I have followed the tutorial at https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html#augmenting-matrix-factors

how to map and filter in one step?

2015-02-26 Thread Crystal Xing
Hi, I have a text file input and I want to parse it line by line, mapping each line to another format. But at the same time, I want to filter out some lines I do not need. I wonder if there is a way to filter out those lines in the map function. Do I have to do two steps, filter and map? In that way,

Augment more data to existing MatrixFactorization Model?

2015-02-26 Thread anishm
I am a beginner to the world of Machine Learning and the usage of Apache Spark. I have followed the tutorial at https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html#augmenting-matrix-factors

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Sean Owen
Ignite is the renaming of GridGain, if that helps. It's like Oracle Coherence, if that helps. These do share some similarities -- fault tolerant, in-memory, distributed processing. The pieces they're built on differ, the architecture differs, the APIs differ. So fairly different in particulars. I n

Getting to proto buff classes in Spark Context

2015-02-26 Thread necro351 .
Hello everyone, We are trying to decode a message inside a Spark job that we receive from Kafka. The message is encoded using Proto Buff. The problem is when decoding we get class-not-found exceptions. We have tried remedies we found online in Stack Exchange and mail list archives but nothing seem

Re: SparkStreaming failing with exception Could not compute split, block input

2015-02-26 Thread Mukesh Jha
On Wed, Feb 25, 2015 at 8:09 PM, Mukesh Jha wrote: > My application runs fine for ~3/4 hours and then hits this issue. > > On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha > wrote: > >> Hi Experts, >> >> My Spark Job is failing with below error. >> >> From the logs I can see that input-3-14248423516

Re: Spark excludes "fastutil" dependencies we need

2015-02-26 Thread Marcelo Vanzin
On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner wrote: > So, should the userClassPathFirst flag work and there is a bug? Sorry for jumping in the middle of conversation (and probably missing some of it), but note that this option applies only to executors. If you're trying to use the class in your

Re: Which OutputCommitter to use for S3?

2015-02-26 Thread Thomas Demoor
FYI. We're currently addressing this at the Hadoop level in https://issues.apache.org/jira/browse/HADOOP-9565 Thomas Demoor On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath wrote: > Just to close the loop in case anyone runs into the same problem I had. > > By setting --hadoop-major-version=2 w

Re: throughput in the web console?

2015-02-26 Thread Saiph Kappa
One more question: while processing the exact same batch I noticed that giving more CPUs to the worker does not decrease the duration of the batch. I tried this with 4 and 8 CPUs. Though I noticed that with only 1 CPU the duration increased; apart from that the values were pretty similar, wh

spark streaming, batch interval, window interval and window sliding interval difference

2015-02-26 Thread Hafiz Mujadid
Can somebody explain the difference between batch interval, window interval, and window sliding interval, with an example? Is there any real-time use case for using these parameters? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-batchin
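A minimal sketch of the three knobs (the socket source is hypothetical):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // batch interval (2s): how often the raw stream is cut into RDDs
    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999)

    // window interval (30s): how much history each windowed RDD covers
    // slide interval (10s): how often a new windowed RDD is emitted
    // both must be multiples of the batch interval
    val windowed = lines.window(Seconds(30), Seconds(10))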

Apache Ignite vs Apache Spark

2015-02-26 Thread Ognen Duzlevski
Can someone with experience briefly share or summarize the differences between Ignite and Spark? Are they complementary? Totally unrelated? Overlapping? Seems like ignite has reached version 1.0, I have never heard of it until a few days ago and given what is advertised, it sounds pretty interestin

Re: throughput in the web console?

2015-02-26 Thread Saiph Kappa
By setting spark.eventLog.enabled to true it is possible to see the application UI after the application has finished its execution, however the Streaming tab is no longer visible. For measuring the duration of batches in the code I am doing something like this: «wordCharValues.foreachRDD(rdd => {
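An alternative to instrumenting foreachRDD is the listener API, which reports per-batch processing time directly. A sketch:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted) {
        val info = batch.batchInfo
        println(s"batch ${info.batchTime}: ${info.processingDelay.getOrElse(-1L)} ms")
      }
    })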
