Re: How to close resources shared in executor?

2014-10-17 Thread Fengyun RAO
Of course, I could create a connection inside the map:

val result = rdd.map(line => {
  val conf = HBaseConfiguration.create
  val connection = HConnectionManager.createConnection(conf)
  val table = connection.getTable("user")
  ...
  table.close()
  connection.close()
})

but that would be too slow, which is
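A common workaround (a sketch, not from this message; it uses the same HConnectionManager API as the snippet above) is to open one connection per partition with mapPartitions and close it only after the partition's records have been processed:

rdd.mapPartitions { lines =>
  val conf = HBaseConfiguration.create
  val connection = HConnectionManager.createConnection(conf)
  val table = connection.getTable("user")
  val results = lines.map { line =>
    // per-record work against the table goes here
    line
  }.toList            // force evaluation before the connection is closed
  table.close()
  connection.close()
  results.iterator
}

Materializing the results before closing the connection matters because mapPartitions returns a lazy iterator.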

Re: Join with large data set

2014-10-17 Thread Sonal Goyal
Hi Ankur, If your RDDs have common keys, you can look at partitioning both your datasets using a custom partitioner based on those keys, so that you can avoid shuffling and optimize join performance. HTH Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal
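A minimal sketch of the co-partitioning idea (the RDD names and partition count are illustrative, not from the thread):

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(64)                // pick to match your cluster
val left  = leftRdd.partitionBy(p).cache()     // leftRdd, rightRdd: RDD[(K, V)]
val right = rightRdd.partitionBy(p).cache()
val joined = left.join(right)                  // both sides share p, so the join itself needs no extra shuffle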

RE: error when maven build spark 1.1.0 with message You have 1 Scalastyle violation

2014-10-17 Thread Henry Hung
Found the root cause: [SPARK-3372] [MLlib] MLlib doesn't pass maven build / checkstyle due to multi-byte character contained in Gradient.scala. Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #2248 (https://github.com/apache/spark/pull/2248) from sarutak/SPARK-3372 and

Re: Can't create Kafka stream in spark shell

2014-10-17 Thread Akhil Das
This is how you deal with the deduplicate errors:

libraryDependencies ++= Seq(
  ("org.apache.spark" % "spark-streaming_2.10" % "1.1.0" % "provided").
    exclude("org.eclipse.jetty.orbit", "javax.transaction").
    exclude("org.eclipse.jetty.orbit", "javax.mail").
    exclude("org.eclipse.jetty.orbit",

rdd caching and use thereof

2014-10-17 Thread Nathan Kronenfeld
I'm trying to understand two things about how Spark is working. (1) When I try to cache an RDD that fits well within memory (about 60g with about 600g of memory), I get seemingly random levels of caching, from around 60% to 100%, given the same tuning parameters. What governs how much of an RDD
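Not part of the original message, but relevant background: in Spark 1.1 the slice of executor memory available for caching blocks is governed by spark.storage.memoryFraction (default 0.6), e.g.:

val conf = new org.apache.spark.SparkConf()
  .set("spark.storage.memoryFraction", "0.6")   // fraction of executor heap usable for cached blocks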

Re: rdd caching and use thereof

2014-10-17 Thread Nathan Kronenfeld
Oh, I forgot - I've set the following parameters at the moment (besides the standard location, memory, and core setup):

spark.logConf true
spark.shuffle.consolidateFiles true
spark.ui.port 4042
spark.io.compression.codec

how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread eric wong
Hi, I am using the comma-separated style to submit multiple jar files in the following shell command, but it does not work: bin/spark-submit --class org.apache.spark.examples.mllib.JavaKMeans --master yarn-cluster --execur-memory 2g --jars

key class requirement for PairedRDD ?

2014-10-17 Thread Jaonary Rabarisoa
Dear all, Is it possible to use any kind of object as a key in a PairedRDD? When I use a case class key, the groupByKey operation doesn't behave as I expected. I want to use a case class instead of a large tuple, as it is easier to manipulate. Cheers, Jaonary

Re: key class requirement for PairedRDD ?

2014-10-17 Thread Sonal Goyal
We use our own custom classes, which are Serializable and have well-defined hashCode and equals methods, through the Java API. What's the issue you are getting? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 17, 2014 at 12:28 PM, Jaonary
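A sketch of the kind of key class being described (the class and field names are illustrative, not from the thread):

class PersonKey(val id: String, val group: String) extends Serializable {
  override def hashCode: Int = 31 * id.hashCode + group.hashCode
  override def equals(other: Any): Boolean = other match {
    case that: PersonKey => this.id == that.id && this.group == that.group
    case _ => false
  }
}

With consistent hashCode/equals, groupByKey and join send equal keys to the same partition and group them correctly.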

Re: key class requirement for PairedRDD ?

2014-10-17 Thread Jaonary Rabarisoa
Here is what I'm trying to do. My case class is the following: case class PersonID(id: String, group: String, name: String). I want to use PersonID as a key in a PairedRDD, but I think the default equals function doesn't fit my need, because two PersonID("a", "a", "a") instances are not treated as the same key. When I use a tuple
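For reference, a minimal sketch (not from the thread) of what structural equality on a case class key should give in a quick test; if the two identical keys below end up in separate groups, something else, such as defining the case class inside the spark-shell session, is likely interfering:

case class PersonID(id: String, group: String, name: String)
val pairs = sc.parallelize(Seq((PersonID("a", "a", "a"), 1),
                               (PersonID("a", "a", "a"), 2)))
pairs.groupByKey().collect()
// expected: a single key with both values grouped together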

MLlib and pyspark features

2014-10-17 Thread poiuytrez
Hello, I would like to use areaUnderROC from MLlib in Apache Spark. I am currently running Spark 1.1.0 and this function is not available in pyspark but is available in Scala. Is there a feature tracker that tracks the progress of porting the Scala APIs to Python? I have tried to search in

Re: MLlib linking error Mac OS X

2014-10-17 Thread poiuytrez
Hello MLnick, Have you found a solution for how to install MLlib on Mac OS? I also have some trouble installing the dependencies. Best, poiuytrez

why do RDD's partitions migrate between worker nodes in different iterations

2014-10-17 Thread randylu
Dear all, In my test program, there are 3 partitions for each RDD. The iteration procedure is as follows:

var rdd_0 = ...  // init
for (...) {
  rdd_1 = rdd_0.reduceByKey(...).partitionBy(p)   // calculate rdd_1 from rdd_0
  rdd_0 = rdd_0.partitionBy(p).join(rdd_1)...     //

Re: How does reading the data from Amazon S3 works?

2014-10-17 Thread jan.zikes
Hi, I have seen in a video from the Spark Summit that usually (when I use HDFS) the data are distributed across the whole cluster and the computation goes to the data. My question is how does it work when I read the data from Amazon S3? Is the whole input dataset read by the master node

How to assure that there will be run only one map per cluster node?

2014-10-17 Thread jan.zikes
Hi, I have a cluster with several nodes, and every node has several cores. I'd like to run a multi-core algorithm within every map, so I'd like to ensure that only one map task runs per cluster node. Is there some way to ensure this? It seems to me that it should be possible

Gracefully stopping a Spark Streaming application

2014-10-17 Thread Massimiliano Tomassi
Hi all, I have a Spark Streaming application running on a cluster, deployed with the spark-submit script. I was reading here that it's possible to gracefully shut down the application in order to allow the deployment of a new one:

Re: Help required on exercise Data Exploration using Spark SQL

2014-10-17 Thread neeraj
Hi, When I run the given Spark SQL commands in the exercise, they return unexpected results. I'm explaining the results below for quick reference: 1. The output of the query wikiData.count() shows some count of the rows in the file. 2. After running the following query: sqlContext.sql("SELECT username, COUNT(*)

What's wrong with my spark filter? I get org.apache.spark.SparkException: Task not serializable

2014-10-17 Thread shahab
Hi, Probably I am missing a very simple principle, but something is wrong with my filter; I get an org.apache.spark.SparkException: Task not serializable exception. Here is my filter function:

object OBJ {
  def f1(): Boolean = {
    var i = 1
    for (j <- 1 to 10) i = i + 1
    true

Re: What's wrong with my spark filter? I get org.apache.spark.SparkException: Task not serializable

2014-10-17 Thread Sourav Chandra
It might be because the object is nested within some class which may not be serializable. You can also run the application with this JVM parameter to see detailed info about serialization: -Dsun.io.serialization.extendedDebugInfo=true On Fri, Oct 17, 2014 at 4:07 PM, shahab
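A sketch of the usual fix implied above (assuming the culprit is an enclosing non-serializable class being captured): define the predicate in a standalone top-level object so the closure carries no outer reference.

object Filters extends Serializable {
  def f1(x: Int): Boolean = {
    var i = 1
    for (j <- 1 to 10) i = i + 1
    x > 0
  }
}
val kept = rdd.filter(Filters.f1)   // rdd: RDD[Int]; only the stable object reference is serialized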

Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
Hi all, I need to compute a similarity between elements of two large sets of high-dimensional feature vectors. Naively, I create all possible pairs of vectors with features1.cartesian(features2) and then map the resulting paired RDD with my similarity function. The problem is that the cartesian

Designed behavior when master is unreachable.

2014-10-17 Thread preeze
Hi all, I am running a standalone spark cluster with a single master. No HA or failover is configured explicitly (no ZooKeeper etc). What is the default designed behavior for submission of new jobs when a single master went down or became unreachable? I couldn't find it documented anywhere.

Regarding using spark sql with yarn

2014-10-17 Thread twinkle sachdeva
Hi, I have been using Spark SQL with YARN. It works fine in yarn-client mode, but in yarn-cluster mode we are facing 2 issues. Is yarn-cluster mode not recommended for Spark SQL using HiveContext? Problem #1: We are not able to use any query with a very simple filtering operation like,

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Sonal Goyal
Cartesian joins of large datasets are usually going to be slow. If there is a way you can reduce the problem space to make sure you only join subsets with each other, that may be helpful. Maybe if you explain your problem in more detail, people on the list can come up with more suggestions. Best

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, Thanks a lot for your reply. It is true that the Python API has default parameters except for rank (the default number of iterations is 5). At the very beginning, in order to estimate the speed of ALS.trainImplicit, I used ALS.trainImplicit(ratings, rank, 1) and it worked. So I tried ALS with more iterations,

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, Today, I tried again with the following code, but it didn't work... Could you please tell me your running environment?

from pyspark.mllib.recommendation import ALS
from pyspark import SparkContext
sc = SparkContext()
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings =

Re: Submission to cluster fails (Spark SQL; NoSuchMethodError on SchemaRDD)

2014-10-17 Thread Michael Campbell
For posterity's sake, I solved this. The problem was the Cloudera cluster I was submitting to is running 1.0, and I was compiling against the latest 1.1 release. Downgrading to 1.0 on my compile got me past this. On Tue, Oct 14, 2014 at 6:08 PM, Michael Campbell michael.campb...@gmail.com

What is akka-actor_2.10-2.2.3-shaded-protobuf.jar?

2014-10-17 Thread Ruebenacker, Oliver A
Hello, My SBT pulls in, among others, the following dependency for Spark 1.1.0: akka-actor_2.10-2.2.3-shaded-protobuf.jar What is this? How is this different from the regular Akka Actor JAR? How do I reconcile with other libs that use Akka, such as Play? Thanks! Best,

Strange duplicates in data when scaling up

2014-10-17 Thread Jacob Maloney
Issue was solved by clearing hashmap and hashset at the beginning of the call method. From: Jacob Maloney [mailto:jmalo...@conversantmedia.com] Sent: Thursday, October 16, 2014 5:09 PM To: user@spark.apache.org Subject: Strange duplicates in data when scaling up I have a flatmap function that

Re: Join with large data set

2014-10-17 Thread Ankur Srivastava
Hi Sonal, Thank you for the response, but since we are joining to reference data, different partitions of application data would need to join with the same reference data, and thus I am not sure if a Spark join would be a good fit for this. E.g. our application data has a person with a zip code, and then the
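A sketch of the broadcast-lookup alternative for this shape of problem (names and types are assumptions, not from the thread): when the reference data (say zip code to city) fits in memory, broadcast it and enrich each record map-side instead of joining.

val zipToCity: Map[String, String] = referenceRdd.collect().toMap   // referenceRdd: RDD[(String, String)]
val bcast = sc.broadcast(zipToCity)
val enriched = people.map { case (personId, zip) =>                 // people: RDD[(String, String)]
  (personId, zip, bcast.value.getOrElse(zip, "unknown"))
}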

Re: Help required on exercise Data Exploration using Spark SQL

2014-10-17 Thread Michael Armbrust
Looks like this data was encoded with an old version of Spark SQL. You'll need to set the flag to interpret binary data as a string. More info on configuration can be found here: http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration sqlContext.sql("set
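For reference, the command truncated above is presumably the Parquet binary-as-string setting (property name as documented for Spark SQL 1.1; treat the exact string here as an assumption):

sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")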

Re: ALS implicit error pyspark

2014-10-17 Thread Gen
Hi, I created an issue in JIRA: https://issues.apache.org/jira/browse/SPARK-3990. I uploaded the error information to the JIRA. Thanks in advance for your help. Best, Gen. Davies Liu wrote: It seems a bug, could you create a JIRA for it? thanks!

bug with MapPartitions?

2014-10-17 Thread davidkl
Hello, Maybe there is something I do not understand, but I believe this code should not throw any serialization error when I run it in the spark shell. Similar code with map instead of mapPartitions works just fine. import java.io.BufferedInputStream import java.io.FileInputStream

Re: Folding an RDD in order

2014-10-17 Thread Michael Misiewicz
Thank you for sharing this Cheng! This is fantastic. I was able to implement it and it seems like it's working quite well. I'm definitely on the right track now! I'm still having a small problem with the rows inside each partition being out of order - but I suspect this is because in the code

Re: Folding an RDD in order

2014-10-17 Thread Michael Misiewicz
My goal is for rows to be partitioned according to timestamp bins (e.g. with each partition representing an even interval of time), and then ordered by timestamp *within* each partition. Ordering by user ID is not important. In my aggregate function, in the seqOp function, I am checking to verify
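A minimal sketch of that partition-then-sort layout using only Spark 1.1 primitives (the partitioner and the bin parameters are illustrative, not from the thread):

import org.apache.spark.Partitioner

class TimeBinPartitioner(numBins: Int, minTs: Long, maxTs: Long) extends Partitioner {
  private val width = math.max(1L, (maxTs - minTs) / numBins)
  def numPartitions: Int = numBins
  def getPartition(key: Any): Int = {
    val ts = key.asInstanceOf[Long]
    math.min(numBins - 1, math.max(0L, (ts - minTs) / width).toInt)
  }
}

// events: RDD[(Long, String)] keyed by timestamp
val binned = events.partitionBy(new TimeBinPartitioner(24, minTs, maxTs))
val sorted = binned.mapPartitions(iter => iter.toSeq.sortBy(_._1).iterator,
                                  preservesPartitioning = true)

Sorting inside mapPartitions assumes each time bin fits comfortably in executor memory.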

Re: object in an rdd: serializable?

2014-10-17 Thread Duy Huynh
Interesting. Why does a case class work for this? Thanks Boromir! On Thu, Oct 16, 2014 at 10:41 PM, Boromir Widas vcsub...@gmail.com wrote: make it a case class, that should work. On Thu, Oct 16, 2014 at 8:30 PM, ll duy.huynh@gmail.com wrote: I got an exception complaining about serializable.

Re: Spark in cluster and errors

2014-10-17 Thread Cheuk Lam
I wasn't the original person who posted the question, but this helped me! :) Thank you. I had a similar issue today when I tried to connect using the IP address (spark://master_ip:7077). I got it resolved by replacing it with the URL displayed in the Spark web console - in my case it is

Re: Gracefully stopping a Spark Streaming application

2014-10-17 Thread Sean Owen
You will have to write something in your app, like an endpoint or button, that triggers this code. Hi all, I have a Spark Streaming application running on a cluster, deployed with the spark-submit script. I was reading here that it's possible to gracefully shut down the application in
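A sketch of the call such an endpoint or button would invoke (parameter names per the Spark 1.1 streaming API; treat the exact signature as an assumption):

ssc.stop(stopSparkContext = true, stopGracefully = true)   // ssc: StreamingContext; waits for received data to be processed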

Re: What is akka-actor_2.10-2.2.3-shaded-protobuf.jar?

2014-10-17 Thread Chester @work
They should be the same, except the package names are changed to avoid a protobuf conflict. You can use it just like other Akka jars. Chester Sent from my iPhone On Oct 17, 2014, at 5:56 AM, Ruebenacker, Oliver A oliver.ruebenac...@altisource.com wrote: Hello, My SBT pulls in,

Attaching schema to RDD created from Parquet file

2014-10-17 Thread Akshat Aranya
Hi, How can I convert an RDD loaded from a Parquet file back into its original type?

case class Person(name: String, age: Int)
val rdd: RDD[Person] = ...
rdd.saveAsParquetFile("people")
val rdd2 = sqlContext.parquetFile("people")

How can I map rdd2 back into an RDD[Person]? All of the examples just
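One hedged approach (assumes the Spark SQL 1.1 Row accessors and that the column order matches the case class fields):

val people: RDD[Person] = rdd2.map(row => Person(row.getString(0), row.getInt(1)))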

Re: bug with MapPartitions?

2014-10-17 Thread Akshat Aranya
There seems to be some problem with what gets captured in the closure that's passed into the mapPartitions (myfunc in your case). I've had a similar problem before: http://apache-spark-user-list.1001560.n3.nabble.com/TaskNotSerializableException-when-running-through-Spark-shell-td16574.html Try
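A sketch of the usual workaround for this class of problem (an assumption about the cause, not a quote from the thread): copy only the values the function needs into local vals before calling mapPartitions, so the closure does not drag in the enclosing, possibly non-serializable, object.

val threshold = 10   // stand-in for a value otherwise read from a non-serializable member
val filtered = rdd.mapPartitions { iter =>   // rdd: RDD[Int]
  iter.filter(_ > threshold)
}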

Spark/HIVE Insert Into values Error

2014-10-17 Thread arthur.hk.c...@gmail.com
Hi, When trying to insert records into Hive, I got an error. My Spark is 1.1.0 and Hive is 0.12.0. Any idea what would be wrong? Regards, Arthur

hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
OK
hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);

Re: how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread Andrew Or
Hm, it works for me. Are you sure you have provided the right jars? What happens if you pass in the `--verbose` flag? 2014-10-16 23:51 GMT-07:00 eric wong win19...@gmail.com: Hi, I am using the comma-separated style to submit multiple jar files in the following shell command, but it does not work:

Re: how can I make the sliding window in Spark Streaming driven by data timestamp instead of absolute time

2014-10-17 Thread st553
I believe I have a similar question to this. I would like to process an offline event stream for testing/debugging. The stream is stored in a CSV file where each row in the file has a timestamp. I would like to feed this file into Spark Streaming and have the concept of time be driven by the

complexity of each action / transformation

2014-10-17 Thread ll
Hello... is there a list that shows the complexity of each action/transformation? For example, what is the complexity of RDD.map()/filter() or RowMatrix.multiply(), etc.? That would be really helpful. Thanks!

Re: why do RDD's partitions migrate between worker nodes in different iterations

2014-10-17 Thread Sean Owen
The RDDs aren't changing; you are assigning new RDDs to rdd_0 and rdd_1. Operations like join and reduceByKey are making distinct, new partitions that don't correspond 1-1 with old partitions anyway. On Fri, Oct 17, 2014 at 5:32 AM, randylu randyl...@gmail.com wrote: Dear all, In my test
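For what it's worth, a sketch (not from the thread) of keeping one partitioner instance across iterations so that reduceByKey and join at least agree on the key-to-partition mapping; where each partition is physically scheduled is still up to Spark's locality decisions:

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(3)
var rdd0 = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3))).partitionBy(p).cache()
for (_ <- 1 to 5) {
  val rdd1 = rdd0.reduceByKey(p, _ + _)
  rdd0 = rdd0.join(rdd1, p).mapValues { case (v, w) => v + w }.cache()
}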

Re: how to build spark 1.1.0 to include org.apache.commons.math3 ?

2014-10-17 Thread Sean Owen
It doesn't contain commons math3 since Spark does not depend on it. Its tests do, but tests are not built into the Spark assembly. On Thu, Oct 16, 2014 at 9:57 PM, Henry Hung ythu...@winbond.com wrote: HI All, I try to build spark 1.1.0 using sbt with command: sbt/sbt

Re: how to submit multiple jar files when using spark-submit script in shell?

2014-10-17 Thread Marcelo Vanzin
On top of what Andrew said, you shouldn't need to manually add the mllib jar to your jobs; it's already included in the Spark assembly jar. On Thu, Oct 16, 2014 at 11:51 PM, eric wong win19...@gmail.com wrote: Hi, I am using the comma-separated style to submit multiple jar files in the following

Re: Designed behavior when master is unreachable.

2014-10-17 Thread Andrew Ash
I'm not sure what the design is, but I think the current behavior if the driver can't reach the master is to attempt to connect once and fail if that attempt fails. Is that what you're observing? (What version of Spark also?) On Fri, Oct 17, 2014 at 3:51 AM, preeze etan...@gmail.com wrote: Hi

Re: complexity of each action / transformation

2014-10-17 Thread Alec Ten Harmsel
On 10/17/2014 02:08 PM, ll wrote: hello... is there a list that shows the complexity of each action/transformation? for example, what is the complexity of RDD.map()/filter() or RowMatrix.multiply() etc? that would be really helpful. thanks! I'm pretty new to Spark, so I only know about

Unable to connect to Spark thrift JDBC server with pluggable authentication

2014-10-17 Thread Jenny Zhao
Hi, if the Spark thrift JDBC server is started in non-secure mode, it works fine. With a secured mode, in the case of pluggable authentication, I placed the authentication class configuration in conf/hive-site.xml:

<property>
  <name>hive.server2.authentication</name>
  <value>CUSTOM</value>
</property>

Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-17 Thread Gerard Maas
Hi, We have been implementing several Spark Streaming jobs that basically process data and insert it into Cassandra, sorting it among different keyspaces. We've been following the pattern:

dstream.foreachRDD(rdd => val records = rdd.map(elem => record(elem))
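For contrast, a sketch of the two shapes presumably being compared (dstream is assumed to be a DStream[String]; record and save are placeholders defined here only to make the fragment self-contained):

def record(e: String): String = e.toUpperCase                                  // placeholder transformation
def save(rdd: org.apache.spark.rdd.RDD[String]): Unit = rdd.foreach(println)   // placeholder sink

// (a) transform inside foreachRDD
dstream.foreachRDD { rdd =>
  save(rdd.map(elem => record(elem)))
}

// (b) transform the DStream itself, keep only the output action in foreachRDD
dstream.map(elem => record(elem)).foreachRDD(rdd => save(rdd))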

Re: Optimizing pairwise similarity computation or how to avoid RDD.cartesian operation ?

2014-10-17 Thread Jaonary Rabarisoa
Hi Reza, Thank you for the suggestion. The number of points is not that large, about 1000 for each set, so I have 1000x1000 pairs. But my similarity is obtained using metric learning to rank, and from Spark's point of view it is a black box. So my idea was just to distribute the computation of the
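Given sets this small, a hedged alternative to cartesian (not from the thread) is to collect and broadcast one side, then compute all pairs map-side; the dot product below is only a placeholder for the learned black-box metric:

def similarity(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum        // placeholder for the real metric

val bcast = sc.broadcast(features2.collect())      // features1, features2: RDD[Array[Double]]
val sims = features1.flatMap { v1 =>
  bcast.value.map(v2 => ((v1, v2), similarity(v1, v2)))
}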

Re: Spark on YARN driver memory allocation bug?

2014-10-17 Thread Boduo Li
It may also cause a problem when running in yarn-client mode. If --driver-memory is large, YARN has to allocate a lot of memory to the AM container, but the AM doesn't really need the memory. Boduo

mllib.linalg.Vectors vs Breeze?

2014-10-17 Thread ll
Hello... I'm looking at the source code for mllib.linalg.Vectors, and it looks like it's a wrapper around Breeze with very small changes (mostly changing the names). I don't have any problem with using the Spark wrapper around Breeze or Breeze directly. I'm just curious to understand why this wrapper

Re: mllib.linalg.Vectors vs Breeze?

2014-10-17 Thread Nicholas Chammas
I don't know the answer for sure, but just from an API perspective I'd guess that the Spark authors don't want to tie their API to Breeze. If at a future point they swap out a different implementation for Breeze, they don't have to change their public interface. MLlib's interface remains

PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
https://gist.github.com/rjurney/fd5c0110fe7eb686afc9 Any way I try to join my data fails. I can't figure out what I'm doing wrong. -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com

Re: mllib.linalg.Vectors vs Breeze?

2014-10-17 Thread Sean Owen
Yes, I think that's the logic, but then what does toBreezeVector return if it is not based on Breeze? And this is called a lot by client code, since you often have to do something nontrivial to the vector. I suppose you could still have that method return a Breeze vector and use it for no other purpose.

How to disable input split

2014-10-17 Thread Larry Liu
Is it possible to disable input splitting if the input is already small?

Re: How to write a RDD into One Local Existing File?

2014-10-17 Thread Sean Owen
You can save to a local file. What are you trying and what doesn't work? You can output one file by repartitioning to 1 partition but this is probably not a good idea as you are bottlenecking the output and some upstream computation by disabling parallelism. How about just combining the files on
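A hedged sketch of the repartition-to-one option Sean mentions, plus a collect-to-driver alternative for small results (paths are placeholders; rdd is assumed to be an RDD[String]):

rdd.repartition(1).saveAsTextFile("file:///tmp/single-output")   // one part file; serializes the final stage

import java.io.PrintWriter
val pw = new PrintWriter("/tmp/out.txt")        // only safe when the RDD fits on the driver
rdd.collect().foreach(line => pw.println(line))
pw.close()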

Re: PySpark joins fail - please help

2014-10-17 Thread Davies Liu
Hey Russell, join() can only work with RDDs of pairs (key, value), such as rdd1: (k, v1) and rdd2: (k, v2); rdd1.join(rdd2) will be (k, (v1, v2)). Spark SQL will be more useful for you, see http://spark.apache.org/docs/1.1.0/sql-programming-guide.html Davies On Fri, Oct 17, 2014 at 5:01 PM,

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
Is that not exactly what I've done in j3/j4? The keys are identical strings. The k is the same, and the value in both instances is an associative array. devices = devices.map(lambda x: (dh.index('id'), {'deviceid': x[dh.index('id')], 'foo': x[dh.index('foo')], 'bar': x[dh.index('bar')]})) bytes_in_out

Re: input split size

2014-10-17 Thread Larry Liu
Thanks, Andrew. What about reading out of local? On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash and...@andrewash.com wrote: When reading out of HDFS it's the HDFS block size. On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu larryli...@gmail.com wrote: What is the default input split size? How to

Re: PySpark joins fail - please help

2014-10-17 Thread Russell Jurney
There was a bug in the devices line: dh.index('id') should have been x[dh.index('id')]. On Fri, Oct 17, 2014 at 5:52 PM, Russell Jurney russell.jur...@gmail.com wrote: Is that not exactly what I've done in j3/j4? The keys are identical strings. The k is the same, the value in both instances