Re: Re: Many Receiver vs. Many threads per Receiver

2015-02-26 Thread Jeffrey Jedele
As I understand the matter: Option 1) has benefits when you think that your network bandwidth may be a bottleneck, because Spark opens several network connections on possibly several different physical machines. Option 2) - as you already pointed out - has the benefit that you occupy less

Creating hive table on spark ((ERROR))

2015-02-26 Thread sandeep vura
Hi Sparkers, I am trying to create a Hive table in Spark SQL but am unable to. Below are the errors generated so far. java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
To distill this a bit further, I don't think you actually want rdd2 to wait on rdd1 in this case. What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD. Otherwise the first partition of rdd2 waits on the final partition of rdd1 even

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
In this case, waiting for rdd1.saveAsHadoopFile(...) to finish is slow, probably due to writing to HDFS. A workaround for this particular case may be as follows: val rdd1 = ..cache() val rdd2 = rdd1.map().() rdd1.count future { rdd1.saveAsHadoopFile(...) } future {
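A minimal sketch of this pattern (assuming a SparkContext named sc; saveAsTextFile stands in for the original's saveAsHadoopFile, whose arguments the message elides):

    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    val rdd1 = sc.textFile("input").cache()
    val rdd2 = rdd1.map(line => line.length)
    rdd1.count()  // materialize rdd1 into the cache once, up front
    // Both jobs now read the cached partitions instead of recomputing rdd1:
    val f1 = Future { rdd1.saveAsTextFile("out1") }
    val f2 = Future { rdd2.saveAsTextFile("out2") }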

Re: How to augment data to existing MatrixFactorizationModel?

2015-02-26 Thread Xiangrui Meng
It may take some work to do online updates with a MatrixFactorizationModel because you need to update some rows of the user/item factors. You may be interested in spark-indexedrdd (http://spark-packages.org/package/amplab/spark-indexedrdd). We support save/load in Scala/Java. We are going to add

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Marcelo Vanzin
SPARK_CLASSPATH is definitely deprecated, but my understanding is that spark.executor.extraClassPath is not, so maybe the documentation needs fixing. I'll let someone who might know otherwise comment, though. On Thu, Feb 26, 2015 at 2:43 PM, Kannan Rajah kra...@maprtech.com wrote:

Re: NegativeArraySizeException when doing joins on skewed data

2015-02-26 Thread Tristan Blakers
Hi Imran, I can confirm this still happens when calling Kryo serialisation directly, not I’m using Java. The output file is at about 440mb at the time of the crash. Kryo is version 2.21. When I get a chance I’ll see if I can make a shareable test case and try on Kryo 3.0, I doubt they’d be

Re: [SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4296. It has been fixed in 1.2.1 and 1.3. On Thu, Feb 26, 2015 at 1:22 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Can someone confirm if they can run UDFs in group by in spark1.2? I have two builds running -- one from a custom

Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread mobsniuk
I've been searching around and see others have asked similar questions. Given a SchemaRDD I extract a result set that contains numbers, both Ints and Doubles. How do I construct an RDD[Vector]? In 1.2 I wrote the results to a text file and then read them back in, splitting them with some code I found in

Partitions, SQL tables, and Parquet I/O

2015-02-26 Thread Daniel, Ronald (ELS-SDG)
Short story: I want to write some parquet files so they are pre-partitioned by the same key. Then, when I read them back in, joining the two tables on that key should be about as fast as things can be done. Can I do that, and if so, how? I don't see how to control the partitioning of a SQL

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Imran, thanks for your explanation of the parallelism. That is very helpful. In my test case, I am only using a one-box cluster, with one executor. So if I set 10 cores, then 10 concurrent tasks will run within this one executor, which handles more data than the 4-core case, and that led to OOM.

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhan, I think it might be helpful to point out that I'm trying to run the RDDs in different threads to maximize the amount of work that can be done concurrently. Unfortunately, right now if I had something like this: val rdd1 = ..cache() val rdd2 = rdd1.map().() future {

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Todd Nist
Hi Kannan, I believe you should be able to use --jars for this when invoking spark-shell or performing a spark-submit. Per the docs: --jars JARS  Comma-separated list of local jars to include on the driver and executor classpaths. HTH. -Todd On Thu, Feb
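A minimal sketch of such an invocation (class, jar names, and paths are illustrative):

    spark-submit \
      --class com.example.MyApp \
      --jars /opt/hbase/lib/hbase-client.jar,/opt/hbase/lib/hbase-common.jar \
      my-app.jar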

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping with OOM. I wanted to ask this (hopefully without straying off topic): we can specify the number of cores and the executor memory. But we don't get to specify _how_ the cores are spread among executors. Is it possible that

Re: Integrating Spark Streaming with Reactive Mongo

2015-02-26 Thread Tathagata Das
Hey Mike, I quickly looked through the example and I found a major performance issue. You are collecting the RDDs to the driver and then sending them to Mongo in a foreach. Why not do a distributed push to Mongo? WHAT YOU HAVE: val mongoConnection = ... WHAT YOU SHOULD DO: rdd.foreachPartition
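A minimal sketch of the distributed push TD describes (assuming a DStream named dstream; createMongoConnection and insert are hypothetical stand-ins for your Mongo client, not a specific driver's API):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = createMongoConnection()  // one connection per partition, created on the executor
        records.foreach(record => conn.insert(record))
        conn.close()
      }
    }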

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Zhan Zhang
Here is my understanding. When running on top of YARN, the number of cores is the number of tasks that can run in one executor, and all these cores are located in the same JVM. Parallelism typically controls the balance of tasks. For example, if you have 200 cores but only 50 partitions, there will be 150
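A minimal sketch of rebalancing when there are fewer partitions than cores (assuming a SparkContext named sc):

    val rdd = sc.textFile("input")       // suppose this yields only 50 partitions
    val balanced = rdd.repartition(200)  // now up to 200 tasks can run concurrently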

Re: Running spark function on parquet without sql

2015-02-26 Thread Zhan Zhang
When you use SQL (or the API from SchemaRDD/DataFrame) to read data from Parquet, the optimizer will do column pruning, predicate pushdown, etc. Thus you get the benefits of Parquet's columnar format. After that, you can operate on the SchemaRDD (DF) like a regular RDD. Thanks. Zhan Zhang On Feb 26,
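A minimal sketch (1.3-style API, assuming a SQLContext named sqlContext and illustrative column names/types):

    val df = sqlContext.parquetFile("data.parquet")
    val projected = df.select("a", "b")  // column pruning applies to this read
    val lengths = projected.rdd.map(row => row.getString(0).length)  // regular RDD from here on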

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
What confused me is the statement "The final result is that rdd1 is calculated twice." Is that the expected behavior? To be perfectly honest, performing an action on a cached RDD in two different threads and having them (at the partition level) block until the parent is cached would be the

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
Yeah, I believe Corey knows that much and is using foreachPartition(i => None) to materialize. The question is, how would you do this with an arbitrary DAG? In this simple example we know what the answer is, but he's trying to do it programmatically. On Thu, Feb 26, 2015 at 11:54 PM, Zhan Zhang

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
I think we already covered that in this thread. You get dependencies from RDD.dependencies() On Fri, Feb 27, 2015 at 12:31 AM, Zhan Zhang zzh...@hortonworks.com wrote: Currently in spark, it looks like there is no easy way to know the dependencies. It is solved at run time. Thanks. Zhan
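A minimal sketch of checking lineage programmatically via RDD.dependencies (assuming both RDDs are in scope):

    import org.apache.spark.rdd.RDD

    def dependsOn(child: RDD[_], parent: RDD[_]): Boolean =
      child == parent || child.dependencies.exists(dep => dependsOn(dep.rdd, parent))

Note this walks only the visible lineage; a checkpointed RDD's lineage is truncated.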

Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
SparkConf.scala logs a warning saying SPARK_CLASSPATH is deprecated and we should use spark.executor.extraClassPath instead. But the online documentation states that spark.executor.extraClassPath is only meant for backward compatibility.

Error: no snappyjava in java.library.path

2015-02-26 Thread Dan Dong
Hi, All, When I run a small program in spark-shell, I got the following error: ... Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886) at java.lang.Runtime.loadLibrary0(Runtime.java:849) at

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2015-02-26 Thread hnahak
I am able to run it without any issue standalone as well as in a cluster. spark-submit --class org.graphx.test.GraphFromVerteXEdgeArray --executor-memory 1g --driver-memory 6g --master spark://VM-Master:7077 spark-graphx.jar The code is exactly the same as above.

Re: How to tell if one RDD depends on another

2015-02-26 Thread Sean Owen
The issue is that both RDDs are being evaluated at once. rdd1 is cached, which means that as its partitions are evaluated, they are persisted. Later requests for the partition hit the cached partition. But we have two threads causing two jobs to evaluate partitions of rdd1 at the same time. If

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Jay Vyas
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated recently and has a good comparison. - Although GridGain has been around since the early Spark days, Apache Ignite is quite new and just getting started, I think, so you will probably want to reach out to the developers

Re: How to tell if one RDD depends on another

2015-02-26 Thread Ted Yu
bq. whether or not rdd1 is a cached rdd RDD has a getStorageLevel method which returns the RDD's current storage level. SparkContext has this method: * Return information about what RDDs are cached, if they are in mem or on disk, how much space * they take, etc. */ @DeveloperApi
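A minimal sketch using the two APIs Ted points at (assuming a SparkContext named sc and an RDD named rdd1):

    import org.apache.spark.storage.StorageLevel

    val isCached = rdd1.getStorageLevel != StorageLevel.NONE
    val cachedInfo = sc.getRDDStorageInfo  // @DeveloperApi: info on currently cached RDDs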

Re: Error: no snappyjava in java.library.path

2015-02-26 Thread Marcelo Vanzin
Hi Dan, This is a CDH issue, so I'd recommend using cdh-u...@cloudera.org for those questions. This is an issue fixed in recent CM 5.3 updates; if you're not using CM, or want a workaround, you can manually configure spark.driver.extraLibraryPath and spark.executor.extraLibraryPath to
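A minimal sketch of that workaround (spark-defaults.conf; the native-library path is hypothetical and depends on your install):

    spark.driver.extraLibraryPath    /opt/hadoop/lib/native
    spark.executor.extraLibraryPath  /opt/hadoop/lib/native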

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I should probably mention that my example case is much oversimplified. Let's say I've got a tree, a fairly complex one, where I begin a series of jobs at the root which calculate a bunch of really complex joins, and as I move down the tree, I'm creating reports from the data that's already

Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
I'm considering whether or not it is worth introducing Spark at my new company. The data is nowhere near Hadoop size at this point (it sits in an RDS Postgres cluster). I'm wondering at which point it is worth the overhead of adding the Spark infrastructure (deployment scripts, monitoring,

RE: Apache Ignite vs Apache Spark

2015-02-26 Thread nate
Ignite guys spoke at the Bigtop workshop last week at SCALE; slides are posted here: https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x A couple of main points around comments made during the preso: although it's incubating at Apache (first code drop was last week, I believe), the tech is battle-tested with

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhan, This is exactly what I'm trying to do except, as I mentioned in my first message, I am being given rdd1 and rdd2 only and I don't necessarily know at that point whether or not rdd1 is a cached rdd. Further, I don't know at that point whether or not rdd2 depends on rdd1. On Thu, Feb 26,

Non-deterministic Accumulator Values

2015-02-26 Thread Peter Thai
Hi all, I'm incrementing several accumulators inside a foreach. Most of the time, the accumulators will return the same value for the same dataset. However, they sometimes differ. I'm not sure how accumulators are implemented. Could this behavior be caused by data not arriving before I print

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Ted. That one I know. It was the dependency part I was curious about On Feb 26, 2015 7:12 PM, Ted Yu yuzhih...@gmail.com wrote: bq. whether or not rdd1 is a cached rdd RDD has getStorageLevel method which would return the RDD's current storage level. SparkContext has this method: *

Re: Converting SchemaRDD/Dataframe to RDD[vector]

2015-02-26 Thread Xiangrui Meng
Try the following: df.map { case Row(id: Int, num: Int, value: Double, x: Float) => // replace those with your types (id, Vectors.dense(num, value, x)) }.toDF("id", "features") -Xiangrui On Thu, Feb 26, 2015 at 3:08 PM, mobsniuk mobsn...@gmail.com wrote: I've been searching around and see others
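A minimal expanded version of that suggestion (assuming a DataFrame df with the columns shown and a 1.3-era SQLContext; adjust the pattern to your schema):

    import org.apache.spark.sql.Row
    import org.apache.spark.mllib.linalg.Vectors

    val vectors = df.map { case Row(id: Int, num: Int, value: Double, x: Float) =>
      (id, Vectors.dense(num.toDouble, value, x.toDouble))
    }
    // With import sqlContext.implicits._ in scope,
    // vectors.toDF("id", "features") turns it back into a DataFrame.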

Re: How to tell if one RDD depends on another

2015-02-26 Thread Zhan Zhang
Currently in Spark, it looks like there is no easy way to know the dependencies; it is resolved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: Ted. That one I know. It was the dependency part I was curious about On Feb

Re: group by order by fails

2015-02-26 Thread Michael Armbrust
Assign an alias to the count in the select clause and use that alias in the order by clause. On Wed, Feb 25, 2015 at 11:17 PM, Tridib Samanta tridib.sama...@live.com wrote: Actually I just realized , I am using 1.2.0. Thanks Tridib -- Date: Thu, 26 Feb 2015
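A minimal sketch of that aliasing (table and column names are illustrative), assuming a SQLContext named sqlContext:

    val top = sqlContext.sql(
      "SELECT name, COUNT(*) AS cnt FROM people GROUP BY name ORDER BY cnt DESC")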

Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
The honest answer is that it is unclear to me at this point. I guess what I am really wondering is if there are cases where one would find it beneficial to use Spark against one or more RDBs? On Thu, Feb 26, 2015 at 8:06 PM, Tobias Pfeiffer t...@preferred.jp wrote: Gary, On Fri, Feb 27, 2015

Re: Reg. KNN on MLlib

2015-02-26 Thread Xiangrui Meng
It is not in MLlib. There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-2336 and Ashutosh has an implementation for integer values. -Xiangrui On Thu, Feb 26, 2015 at 8:18 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Has KNN classification algorithm been implemented on MLlib?

Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
So when deciding whether to take on installing/configuring Spark, the size of the data does not automatically make that decision in your mind. Thanks, Gary On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Todd Nist
Hi Kannan, The issues with using --jars make sense. I believe you can set the classpath via --conf spark.executor.extraClassPath=... or in your driver with .set("spark.executor.extraClassPath", ...). I believe you are correct about the localization as well, as long as you're guaranteed that all

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Ognen Duzlevski
Thanks you all! On Thu, Feb 26, 2015 at 5:50 PM, n...@reactor8.com wrote: Ignite guys spoke at the bigtop workshop last week at Scale, posted slides here: https://cwiki.apache.org/confluence/display/BIGTOP/SCALE13x Couple main pts around comments made during the preso.., although

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
Thanks Marcelo. Do you think it would be useful to make spark.executor.extraClassPath pick up some environment variable that can be set from spark-env.sh? Here is an example. In spark-env.sh: executor_extra_cp=$(get_hbase_jars_for_cp); export executor_extra_cp
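A minimal sketch of what is being proposed (nothing here is an existing Spark feature; get_hbase_jars_for_cp is a hypothetical helper that prints a classpath string):

    # spark-env.sh
    executor_extra_cp=$(get_hbase_jars_for_cp)
    export executor_extra_cp
    # ...the proposal is that spark.executor.extraClassPath could then
    # resolve a reference such as ${executor_extra_cp}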

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Hi On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf malouf.g...@gmail.com wrote: The honest answer is that it is unclear to me at this point. I guess what I am really wondering is if there are cases where one would find it beneficial to use Spark against one or more RDBs? Well, RDBs are all

Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
If anyone is curious to try exporting Spark metrics to Graphite, I just published a post about my experience doing that, building dashboards in Grafana http://grafana.org/, and using them to monitor Spark jobs: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/ Code
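For reference, a minimal metrics.properties sketch for the Graphite sink (host and port are placeholders; see the post for details):

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds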

Spark-submit not working when application jar is in hdfs

2015-02-26 Thread dilm
I'm trying to run a Spark application using bin/spark-submit. When I reference my application jar in my local filesystem, it works. However, when I copy my application jar to a directory in HDFS, I get the following exception: Warning: Skip remote jar

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Marcelo Vanzin
On Thu, Feb 26, 2015 at 5:12 PM, Kannan Rajah kra...@maprtech.com wrote: Also, I would like to know if there is a localization overhead when we use spark.executor.extraClassPath. Again, in the case of hbase, these jars would be typically available on all nodes. So there is no need to localize

RE: Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Shao, Saisai
Cool, great job☺. Thanks Jerry From: Ryan Williams [mailto:ryan.blake.willi...@gmail.com] Sent: Thursday, February 26, 2015 6:11 PM To: user; d...@spark.apache.org Subject: Monitoring Spark with Graphite and Grafana If anyone is curious to try exporting Spark metrics to Graphite, I just

Re: Clean up app folders in worker nodes

2015-02-26 Thread markjgreene
I think the setting you are missing is 'spark.worker.cleanup.appDataTtl'. This setting controls how old a file must be before it is deleted. More info here: https://spark.apache.org/docs/1.0.1/spark-standalone.html. Also, the 'spark.worker.cleanup.interval' you have configured is pretty
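A minimal sketch of the worker-side cleanup settings (values are illustrative; set on each standalone worker via SPARK_WORKER_OPTS):

    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=1800 \
      -Dspark.worker.cleanup.appDataTtl=604800"  # keep app dirs one week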

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Love to hear some input on this. I did get a standalone cluster up on my local machine and the problem didn't present itself. I'm pretty confident that means the problem is in the LocalBackend or something near it. On Thu, Feb 26, 2015 at 1:37 PM, Victor Tso-Guillen v...@paxata.com wrote: Okay

Re: Scheduler hang?

2015-02-26 Thread Victor Tso-Guillen
Of course, breakpointing on every status update and revive offers invocation kept the problem from happening. Where could the race be? On Thu, Feb 26, 2015 at 7:55 PM, Victor Tso-Guillen v...@paxata.com wrote: Love to hear some input on this. I did get a standalone cluster up on my local

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
Gary, On Fri, Feb 27, 2015 at 8:40 AM, Gary Malouf malouf.g...@gmail.com wrote: I'm considering whether or not it is worth introducing Spark at my new company. The data is no-where near Hadoop size at this point (it sits in an RDS Postgres cluster). Will it ever become Hadoop size? Looking

Re: Is SPARK_CLASSPATH really deprecated?

2015-02-26 Thread Kannan Rajah
There is a usability concern I have with the current way of specifying --jars. Imagine a use case like HBase where a lot of jobs need it in the classpath. This needs to be set every time. If we use spark.executor.extraClassPath, then we just need to set it once. But there is no programmatic way to

Re: Dealing with 'smaller' data

2015-02-26 Thread Tobias Pfeiffer
On Fri, Feb 27, 2015 at 10:57 AM, Gary Malouf malouf.g...@gmail.com wrote: So when deciding whether to take on installing/configuring Spark, the size of the data does not automatically make that decision in your mind. You got me there ;-) Tobias

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the shuffle setting: spark.sql.shuffle.partitions On Thu, Feb 26, 2015 at 5:51 PM, java8964 java8...@hotmail.com wrote: Imran, thanks for your explaining about the parallelism. That is very helpful. In my test case, I am
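A minimal sketch of tuning that setting (assuming a SQLContext named sqlContext):

    sqlContext.setConf("spark.sql.shuffle.partitions", "50")  // default is 200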

Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-02-26 Thread Arun Luthra
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import org.roaringbitmap._ ... kryo.register(classOf[org.roaringbitmap.RoaringBitmap]) kryo.register(classOf[org.roaringbitmap.RoaringArray])
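A minimal sketch of those registrations wrapped in a KryoRegistrator (whether this actually works around SPARK-5949 is exactly what the thread is asking):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
        kryo.register(classOf[org.roaringbitmap.RoaringArray])
      }
    }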

Reg. KNN on MLlib

2015-02-26 Thread Deep Pradhan
Has KNN classification algorithm been implemented on MLlib? Thank You Regards, Deep

Get importerror when i run pyspark with ipython=1

2015-02-26 Thread sourabhguha
http://apache-spark-user-list.1001560.n3.nabble.com/file/n21843/pyspark_error.jpg I get the above error when I try to run pyspark with the ipython option. I do not get this error when I run it without the ipython option. I have Java 8, Scala 2.10.4 and Enthought Canopy Python on my box. OS Win
