Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi again, On Wed, Jan 14, 2015 at 10:06 AM, Tobias Pfeiffer t...@preferred.jp wrote: If you think of items.map(x => /* throw exception */).count() then even though the count you want to get does not necessarily require the evaluation of the function in map() (i.e., the number is the
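
A minimal sketch of the per-partition count the thread is after, without materializing each partition as an array (Scala; `rdd` is an illustrative RDD):

    val sizes = rdd.mapPartitionsWithIndex { (idx, iter) =>
      // one (partitionIndex, rowCount) pair per partition; iter.size walks the iterator once
      Iterator((idx, iter.size))
    }.collect()

    sizes.foreach { case (idx, n) => println(s"partition $idx has $n rows") }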

Re: *ByKey aggregations: performance + order

2015-01-13 Thread Tobias Pfeiffer
Hi, On Wed, Jan 14, 2015 at 12:11 PM, Tobias Pfeiffer t...@preferred.jp wrote: Now I don't know (yet) if all of the functions I want to compute can be expressed in this way and I was wondering about *how much* more expensive we are talking about. OK, it seems like even on a local machine

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Night Wolf
Thanks for the tips! Yeah, it's a working SBT project, i.e. if I do an SBT run it picks up Test1 as a main class and runs it for me without error. It's only in IntelliJ. I opened the project from the folder afresh by choosing the build.sbt file. I re-tested by deleting .idea and just choosing the

Re: Reading resource files in a Spark application

2015-01-13 Thread Arun Lists
I experimented with using getResourceAsStream(cls, fileName) instead of cls.getResource(fileName).toURI. That works! I have no idea why the latter method does not work in Spark. Any explanations would be welcome. Thanks, arun On Tue, Jan 13, 2015 at 6:35 PM, Arun Lists lists.a...@gmail.com wrote:
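
A sketch of the working stream-based approach (fileName is illustrative). The likely reason the File/toURI route fails under spark-submit is that the resource then lives inside the application jar, so its URL is a jar: URL and not a file on disk:

    import scala.io.Source

    val stream = getClass.getResourceAsStream(fileName)   // e.g. "/config/settings.txt"
    val lines  = Source.fromInputStream(stream).getLines().toList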

RE: quickly counting the number of rows in a partition?

2015-01-13 Thread Ganelin, Ilya
An alternative to doing a naive toArray is to declare an accumulator per partition and use that. It's specifically what they were designed to do. See the programming guide. -Original Message- From: Tobias Pfeiffer
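
A sketch of the accumulator idea: one named Long accumulator per partition index, incremented inside the partition and read back on the driver (`rdd` and `sc` are illustrative):

    // one accumulator per partition index
    val counts = rdd.partitions.indices.map(i => sc.accumulator(0L, s"rows-in-partition-$i"))

    // bump the matching accumulator for every row, then force evaluation
    rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.foreach(_ => counts(idx) += 1L)
      Iterator.empty
    }.count()

    // note: retried/speculative tasks can over-count accumulators used in transformations
    counts.zipWithIndex.foreach { case (acc, i) => println(s"partition $i: ${acc.value} rows") }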

Re: Save RDD with partition information

2015-01-13 Thread Raghavendra Pandey
I believe the default hash partitioner logic in Spark will send all the same keys to the same machine. On Wed, Jan 14, 2015, 03:03 Puneet Kapoor puneet.cse.i...@gmail.com wrote: Hi, I have a use case wherein I have an hourly Spark job which creates hourly RDDs, which are partitioned by keys. At
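
An illustrative sketch of making that explicit: partition the pair RDD with a HashPartitioner so a given key always maps to the same partition (`pairRdd` and the partition count are made up):

    import org.apache.spark.HashPartitioner

    val partitioned = pairRdd.partitionBy(new HashPartitioner(24)).persist()
    // all records for a given key are now co-located in a single partition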

*ByKey aggregations: performance + order

2015-01-13 Thread Tobias Pfeiffer
Hi, I have an RDD[(Long, MyData)] where I want to compute various functions on lists of MyData items with the same key (these will in general be rather short lists, around 10 items per key). Naturally I was thinking of groupByKey() but was a bit intimidated by the warning: This operation may be
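
A hedged sketch of the *ByKey alternative discussed in this thread: when the per-key function can be expressed as an incremental fold (here a count and a sum; MyData and its value field are illustrative), aggregateByKey avoids building the per-key lists that groupByKey would:

    val perKey = rdd.aggregateByKey((0L, 0.0))(
      (acc, d) => (acc._1 + 1, acc._2 + d.value),   // fold one MyData into the per-partition partial result
      (a, b)   => (a._1 + b._1, a._2 + b._2)        // merge partial results from different partitions
    )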

Re: Can't run Spark java code from command line

2015-01-13 Thread Ye Xianjin
There is no binding issue here. Spark picks the right IP 10.211.55.3 for you. The printed message is just an indication. However I have no idea why spark-shell hangs or stops. Sent from my iPhone. On Jan 14, 2015, at 5:10 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It's just a binding issue with the

use netty shuffle for network cause high gc time

2015-01-13 Thread lihu
Hi, I just tested the groupByKey method on 100GB of data; the cluster is 20 machines, each with 125GB RAM. At first I set conf.set("spark.shuffle.use.netty", "false") and ran the experiment, and then I set conf.set("spark.shuffle.use.netty", "true") and re-ran the experiment, but in the latter

Re: Save RDD with partition information

2015-01-13 Thread lihu
There is no way to avoid a shuffle if you use combineByKey, no matter whether your data is cached in memory, because the shuffle write must write the data to disk. And it seems that Spark cannot guarantee that the same key (K1) goes to Container_X. You can use tmpfs for your shuffle dir, this

Re: Save RDD with partition information

2015-01-13 Thread lihu
By the way, I am not sure whether the same shuffle key will go to the same container.

Spark 1.2.0 not getting connected

2015-01-13 Thread Abhideep Chakravarty
Hi, We are using a Scala app to connect with Spark, and when we start the application we get the following error: 19:52:47.185 [sparkDriver-akka.actor.default-dispatcher-2] INFO o.a.s.d.client.AppClient$ClientActor - Connecting to master spark://172.22.193.138:7077... 19:52:47.198

RE: Spark 1.2.0 not getting connected

2015-01-13 Thread Abhideep Chakravarty
Let me rephrase: it is a Play-Scala app which is trying to connect to Spark. From: Abhideep Chakravarty Sent: Wednesday, January 14, 2015 11:34 AM To: 'user@spark.apache.org' Subject: Spark 1.2.0 not getting connected Hi, We are using a Scala app to connect with Spark and when we

Re: OpenCV + Spark : Where to put System.loadLibrary ?

2015-01-13 Thread Jishnu Prathap
Hi Jaonary Rabarisoa, Were you able to fix this issue? Actually I am trying to integrate OpenCV with Spark. It would be very helpful if you could share your experience in integrating OpenCV with Spark. It would really help me if you could share some code showing how to use Mat, IplImage and Spark RDDs

Re: use netty shuffle for network cause high gc time

2015-01-13 Thread Andrew Ash
To confirm, lihu, are you using Spark version 1.2.0? On Tue, Jan 13, 2015 at 9:26 PM, lihu lihu...@gmail.com wrote: Hi, I just tested the groupByKey method on 100GB of data; the cluster is 20 machines, each with 125GB RAM. At first I set conf.set("spark.shuffle.use.netty", "false") and ran

Re: use netty shuffle for network cause high gc time

2015-01-13 Thread Aaron Davidson
What version are you running? I think spark.shuffle.use.netty was a valid option only in Spark 1.1, where the Netty stuff was strictly experimental. Spark 1.2 contains an officially supported and much more thoroughly tested version under the property spark.shuffle.blockTransferService, which is
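
A sketch of setting the 1.2-era property (app name and chosen value are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-transport-test")
      .set("spark.shuffle.blockTransferService", "netty")   // or "nio"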

How to integrate Spark with OpenCV?

2015-01-13 Thread Jishnu Prathap
Hi, Can someone suggest any video+image processing library which works well with Spark? Currently I am trying to integrate OpenCV with Spark. I am relatively new to both Spark and OpenCV. It would really help me if someone could share some sample code showing how to use Mat, IplImage and Spark

Re: IndexedRDD

2015-01-13 Thread Jerry Lam
Hi guys, I'm interested in the IndexedRDD too. How many rows in the big table match the small table in every run? If the number of rows stays constant, then I think Jem wants the runtime to stay about constant (i.e. ~0.6 seconds for all cases). However, I agree with Andrew. The performance

Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Tamas Jambor
Thanks. The problem is that we'd like it to be picked up by hive. On Tue Jan 13 2015 at 18:15:15 Davies Liu dav...@databricks.com wrote: On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, Is there a way to save dstream RDDs to a single file so that another

Re: train many decision tress with a single spark job

2015-01-13 Thread sourabh chaki
Hi Josh, I was trying out a decision tree ensemble using bagging. Here I am splitting the input using randomSplit and training a tree for each of the splits. Here is sample code: val bags: Int = 10 val models: Array[DecisionTreeModel] = training.randomSplit(Array.fill(bags)(1.0 / bags)).map {
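
A hedged completion of that sketch (tree parameters are illustrative; training is assumed to be an RDD[LabeledPoint]):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.model.DecisionTreeModel
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    val bags: Int = 10
    val models: Array[DecisionTreeModel] =
      training.randomSplit(Array.fill(bags)(1.0 / bags)).map { split =>
        // train one tree per random bag
        DecisionTree.trainClassifier(split, numClasses = 2,
          categoricalFeaturesInfo = Map[Int, Int](),
          impurity = "gini", maxDepth = 5, maxBins = 32)
      }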

Re: Reading one partition at a time

2015-01-13 Thread Imran Rashid
This looks reasonable to me. As you've done, the important thing is just to make isSplitable return false. This shares a bit in common with the sc.wholeTextFiles method. It sounds like you really want something much simpler than what that is doing, but you might be interested in looking at that
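
An illustrative sketch of the non-splittable idea (class name and path are made up): each input file then becomes exactly one partition, so it can be read one partition at a time:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.JobContext
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // never split an input file: one file -> one partition
    class WholeFileTextInputFormat extends TextInputFormat {
      override def isSplitable(context: JobContext, file: Path): Boolean = false
    }

    val oneFilePerPartition = sc.newAPIHadoopFile(
      "hdfs:///path/to/input",
      classOf[WholeFileTextInputFormat],
      classOf[LongWritable], classOf[Text])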

RE: Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-13 Thread Rosner, Frank (Allianz SE)
Yep it is in the REPL. I will try your solution and also to submit the whole thing as a job jar. If this is true, this should be fixed, right? I will check whether there is a ticket already. Somebody pointed me to https://issues.apache.org/jira/browse/SPARK-2620 but I need to investigate.

Unread block data exception when reading from HBase

2015-01-13 Thread jeremy p
Hello all, When I try to read data from an HBase table, I get an unread block data exception. I am running HBase and Spark on a single node (my workstation). My code is in Java, and I'm running it from the Eclipse IDE. Here are the versions I'm using : Cloudera : 2.5.0-cdh5.2.1 Hadoop :

save spark streaming output to single file on hdfs

2015-01-13 Thread jamborta
Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It seems that each slice is saved to a separate folder, using the saveAsTextFiles method. I'm using Spark 1.2 with PySpark. Thanks

Re: Running Spark application from command line

2015-01-13 Thread Arun Lists
Yes, I am running with Scala 2.11. Here is what I see when I do scala -version scala -version Scala code runner version 2.11.4 -- Copyright 2002-2013, LAMP/EPFL On Tue, Jan 13, 2015 at 2:30 AM, Sean Owen so...@cloudera.com wrote: It sounds like possibly a Scala version mismatch? are you sure

ExternalSorter - spilling in-memory map

2015-01-13 Thread akhandeshi
I am using spark 1.2, and I see a lot of messages like: ExternalSorter: Thread 66 spilling in-memory map of 5.0 MB to disk (13160 times so far) I seem to have a lot of memory: URL: spark://hadoop-m:7077 Workers: 4 Cores: 64 Total, 64 Used Memory: 328.0 GB Total, 327.0 GB Used

Re: Issue writing to Cassandra from Spark

2015-01-13 Thread Ankur Srivastava
I realized that I was running the cluster with spark.cassandra.output.concurrent.writes=2; changing it to 1 did the trick. We realized that the issue was that Spark was producing data at a much higher rate than our small Cassandra cluster could write, and so changing the property value to 1

Can't run Spark java code from command line

2015-01-13 Thread jeremy p
Hello all, I wrote some Java code that uses Spark, but for some reason I can't run it from the command line. I am running Spark on a single node (my workstation). The program stops running after this line is executed: SparkContext sparkContext = new SparkContext("spark://myworkstation:7077",

Querying over mutliple (avro) files using Spark SQL

2015-01-13 Thread thomas j
Hi, I have a program that loads a single Avro file using Spark SQL, queries it, transforms it and then outputs the data. The file is loaded with: val records = sqlContext.avroFile(filePath) val data = records.registerTempTable("data") ... Now I want to run it over tens of thousands of Avro files
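
A hedged sketch with the spark-avro package, assuming the files share a schema; Hadoop-style glob patterns are generally accepted in the path, so many files can be loaded and queried as one table (the path is illustrative):

    import com.databricks.spark.avro._

    val records = sqlContext.avroFile("hdfs:///data/events/2015/01/*/*.avro")
    records.registerTempTable("data")
    val counts = sqlContext.sql("SELECT count(*) FROM data")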

Re: IndexedRDD

2015-01-13 Thread Jem Tucker
Hi, Thanks for the replies. I guess I was hoping for a bit better than linear scaling; this was performing IndexedRDD.join(RDD)((id, a, b) => (a, b)). In each join every row in the smaller table is joined to one in the lookup. I ran the same test with standard RDD joins and there was barely any

Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Davies Liu
On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It does not need to be a single file; Spark can pick up any directory as a single RDD. Also, it's easy to union
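
Two illustrative ways of doing what Davies describes (paths and RDD names are made up):

    // read a whole directory tree of streaming output back as one RDD
    val combined = sc.textFile("hdfs:///streaming/output/prefix-*/part-*")

    // or union several RDDs explicitly
    val all = sc.union(Seq(rddA, rddB, rddC))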

Re: spark standalone master with workers on two nodes

2015-01-13 Thread Akhil Das
You could be hitting this issue https://issues.apache.org/jira/browse/SPARK-5098 Thanks Best Regards On Wed, Jan 14, 2015 at 7:47 AM, Josh J joshjd...@gmail.com wrote: Hi, I'm trying to run Spark Streaming standalone on two nodes. I'm able to run on a single node fine. I start both workers

Re: How to integrate Spark with OpenCV?

2015-01-13 Thread Akhil Das
I haven't played with OpenCV yet, but I was just wondering about your use case. What exactly are you trying to do? Thanks Best Regards On Wed, Jan 14, 2015 at 12:02 PM, Jishnu Prathap jishnu.prat...@wipro.com wrote: Hi, Can someone suggest any video+image processing library which works well with

Re: Spark 1.2.0 not getting connected

2015-01-13 Thread Akhil Das
It's clearly saying *Connection refused: 172.22.193.138:7077*. Just make sure your master URL is listed as spark://172.22.193.138:7077 in the web UI (running on port 8080), and also be sure that no firewall/network is blocking the connection (a simple *telnet 172.22.193.138

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Enno Shioji
Have you tried adding this line? "javax.servlet" % "javax.servlet-api" % "3.0.1" % "provided" This made the problem go away for me. It also works without the provided scope. On Wed, Jan 14, 2015 at 5:09 AM, Night Wolf nightwolf...@gmail.com wrote: Thanks for the tips! Yeah it's a working SBT
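
For reference, an illustrative build.sbt fragment combining the Spark dependency from this thread with the javax.servlet workaround (versions as mentioned above):

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"        % "1.2.0",
      "javax.servlet"     % "javax.servlet-api" % "3.0.1" % "provided"
    )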

Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Davies Liu
Right now, you can't. You could load each file as a partition into Hive, or you need to pack the files together with other tools or a Spark job. On Tue, Jan 13, 2015 at 10:35 AM, Tamas Jambor jambo...@gmail.com wrote: Thanks. The problem is that we'd like it to be picked up by hive. On Tue Jan

Re: Failing jobs runs twice

2015-01-13 Thread Anders Arpteg
Yes Andrew, I am. Tried setting spark.yarn.applicationMaster.waitTries to 1 (thanks Sean), but with no luck. Any ideas? On Tue, Jan 13, 2015 at 7:58 PM, Andrew Or and...@databricks.com wrote: Hi Anders, are you using YARN by any chance? 2015-01-13 0:32 GMT-08:00 Anders Arpteg

Re: java.io.IOException: Mkdirs failed to create file:/some/path/myapp.csv while using rdd.saveAsTextFile(fileAddress) Spark

2015-01-13 Thread Prannoy
What path are you giving in saveAsTextFile? Can you show the whole line. On Tue, Jan 13, 2015 at 11:42 AM, shekhar [via Apache Spark User List] ml-node+s1001560n21112...@n3.nabble.com wrote: I am still having this issue with the rdd.saveAsTextFile() method. Thanks, Shekhar reddy

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ganon Pierce
Yeah, upon running the test locally I receive: “Pi is roughly 3.139948” So Spark is working; it’s just the application UI that is not… On Jan 13, 2015, at 1:13 PM, Ganon Pierce ganon.pie...@me.com wrote: My application logs remain stored as .inprogress files, e.g.

Re: Failing jobs runs twice

2015-01-13 Thread Andrew Or
Hi Anders, are you using YARN by any chance? 2015-01-13 0:32 GMT-08:00 Anders Arpteg arp...@spotify.com: Since starting using Spark 1.2, I've experienced an annoying issue with failing apps that gets executed twice. I'm not talking about tasks inside a job, that should be executed multiple

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ganon Pierce
Perhaps I need to change my spark.eventLog.dir to an hdfs directory? Could this have something to do with the “history server” not having access to my application logs? On Jan 13, 2015, at 1:13 PM, Ganon Pierce ganon.pie...@me.com wrote: My application logs remain stored as .inprogress

Re: Failed to save RDD as text file to local file system

2015-01-13 Thread Prannoy
Hi, Could you just try one thing? Make a directory anywhere outside cloudera and then try the same write. Suppose the directory made is testWrite; do r.saveAsTextFile("/home/testWrite/"). I think the cloudera/tmp folder does not have write permission for users other than the cloudera

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ganon Pierce
My application logs remain stored as .inprogress files, e.g. “app-20150113190025-0004.inprogress”, even after completion; could this have something to do with what is going on? @Ted Yu: Where do I find the master log? It’s not very obviously labeled in my /tmp/ directory. Sorry if I should

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ganon Pierce
Also, thanks for everyone’s help so far! On Jan 13, 2015, at 2:04 PM, Ganon Pierce ganon.pie...@me.com wrote: Yeah, upon running the test locally I receive: “Pi is roughly 3.139948” So Spark is working; it’s just the application UI that is not… On Jan 13, 2015, at 1:13 PM, Ganon

Reading resource files in a Spark application

2015-01-13 Thread Arun Lists
In some classes, I initialize some values from resource files using the following snippet: new File(cls.getResource(fileName).toURI) This works fine in SBT. When I run it using spark-submit, I get a bunch of errors because the classes cannot be initialized. What can I do to make such

Re: quickly counting the number of rows in a partition?

2015-01-13 Thread Tobias Pfeiffer
Hi, On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Use the mapPartitions function. It returns an iterator to each partition. Then just get that length by converting to an array. On Tue, Jan 13, 2015 at 2:50 PM, Kevin Burton bur...@spinn3r.com wrote:

spark standalone master with workers on two nodes

2015-01-13 Thread Josh J
Hi, I'm trying to run Spark Streaming standalone on two nodes. I'm able to run on a single node fine. I start both workers and it registers in the Spark UI. However, the application says SparkDeploySchedulerBackend: Asked to remove non-existent executor 2 Any ideas? Thanks, Josh

Failing jobs runs twice

2015-01-13 Thread Anders Arpteg
Since starting to use Spark 1.2, I've experienced an annoying issue with failing apps that get executed twice. I'm not talking about tasks inside a job, which should be executed multiple times before failing the whole app. I'm talking about the whole app, which seems to close the previous Spark

saveAsObjectFile is actually saveAsSequenceFile

2015-01-13 Thread Kevin Burton
This is interesting. I’m using ObjectInputStream to try to read a file written with saveAsObjectFile… but it’s not working. The documentation says: “Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().” … but

Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Ajay Srivastava
Setting spark.sql.hive.convertMetastoreParquet to true has fixed this. Regards, Ajay On Tuesday, January 13, 2015 11:50 AM, Ajay Srivastava a_k_srivast...@yahoo.com.INVALID wrote: Hi, I am trying to read a parquet file using - val parquetFile = sqlContext.parquetFile("people.parquet")

saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Kevin Burton
This is almost funny. I want to dump a computation to the filesystem. It’s just the result of a Spark SQL call reading the data from Cassandra. The problem is that it looks like it’s just calling toString() which is useless. The example is below. I assume this is just a (bad) bug.
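
A workaround sketch until that is fixed: format each Row explicitly before writing (results stands for the SchemaRDD from the Spark SQL query; the output path is illustrative):

    results
      .map(row => row.mkString(","))        // in Spark 1.2 a Row behaves like a Seq[Any]
      .saveAsTextFile("hdfs:///out/results-csv")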

Re: saveAsTextFile just uses toString and Row@37f108

2015-01-13 Thread Reynold Xin
It is just calling RDD's saveAsTextFile. I guess we should really override the saveAsTextFile in SchemaRDD (or make Row.toString comma separated). Do you mind filing a JIRA ticket and copy me? On Tue, Jan 13, 2015 at 12:03 AM, Kevin Burton bur...@spinn3r.com wrote: This is almost funny. I

Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Reynold Xin
What query did you run? Parquet should have predicate and column pushdown, i.e. if your query only needs to read 3 columns, then only 3 will be read. On Mon, Jan 12, 2015 at 10:20 PM, Ajay Srivastava a_k_srivast...@yahoo.com.invalid wrote: Hi, I am trying to read a parquet file using - val

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ganon Pierce
After clean build still receiving the same error. On Jan 6, 2015, at 3:59 PM, Sean Owen so...@cloudera.com wrote: FWIW I do not see any such error, after a mvn -DskipTests clean package and ./bin/spark-shell from master. Maybe double-check you have done a full clean build. On Tue,

Re: Spark executors resources. Blocking?

2015-01-13 Thread Luis Guerra
Thanks for your answer David, It is as I thought then. When you write that there could be some approaches to solve this using Yarn or Mesos, can you give any idea about this? Or better yet, is there any site with documentation about this issue? Currently, we are launching our jobs using Yarn, but

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ted Yu
Ganon: Can you check the master log to see if there is some clue? Cheers On Jan 13, 2015, at 2:03 AM, Robin East robin.e...@xense.co.uk wrote: I’ve just pulled down the latest commits from github, and done the following: 1) mvn clean package -DskipTests builds fine 2)

Spark Sql reading whole table from cache instead of required coulmns

2015-01-13 Thread Surbhit
Hi, I am using spark 1.1.0. I am using the spark-sql shell to run all the below queries. I have created an external parquet table using the below SQL: create external table daily (15 column names) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT

Re: train many decision tress with a single spark job

2015-01-13 Thread Sean Owen
OK, I still wonder whether it's not better to make one big model. The usual assumption is that the user's identity isn't predictive per se. If every customer in your shop is truly unlike the others, most predictive analytics goes out the window. It's factors like our location, income, etc that are

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Robin East
I’ve just pulled down the latest commits from github, and done the following: 1) mvn clean package -DskipTests builds fine 2) ./bin/spark-shell works 3) run SparkPi example with no problems: ./bin/run-example SparkPi 10 4) Started a master ./sbin/start-master.sh grabbed the MasterWebUI

Re: saveAsObjectFile is actually saveAsSequenceFile

2015-01-13 Thread Sean Owen
Yes, that's even what the objectFile javadoc says. It is expecting a SequenceFile with NullWritable keys and BytesWritable values containing the serialized values. This looks correct to me. On Tue, Jan 13, 2015 at 8:39 AM, Kevin Burton bur...@spinn3r.com wrote: This is interesting. I’m using
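
Given that on-disk format, the intended way to read the output back is sc.objectFile rather than a raw ObjectInputStream; a minimal sketch (MyRecord and the path are illustrative):

    import org.apache.spark.rdd.RDD

    val restored: RDD[MyRecord] = sc.objectFile[MyRecord]("hdfs:///out/objects")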

Re: Problem with building spark-1.2.0

2015-01-13 Thread Sean Owen
I don't believe you need any such manual steps. I'd undo what you did there. I have never had to add anything manually to get SBT or Maven builds working. Just follow the docs on the site. On Tue, Jan 13, 2015 at 5:29 AM, Rapelly Kartheek kartheek.m...@gmail.com wrote: Yes, this proxy problem is

Re: Several applications share the same Spark executors (or their cache)

2015-01-13 Thread Oleksii Kostyliev
Thanks a lot for the suggestion! This approach makes perfect sense. I think this what is being addressed by spark-jobserver project: https://github.com/ooyala/spark-jobserver. Do you know any other production-ready similar implementations? On Thu, Jan 8, 2015 at 1:47 PM, Silvio Fiorito

How to define SparkContext with Cassandra connection for spark-jobserver?

2015-01-13 Thread Sasi
Dear All, For our requirement, we need to define a SparkContext with a SparkConf which has Cassandra connection details. And this SparkContext needs to be shared by subsequent runJobs and throughout the application. So, how do we define a SparkContext with a Cassandra connection for spark-jobserver?
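
A hedged sketch of what that SparkConf could look like with the spark-cassandra-connector (host and app name are illustrative); the context built from it can then be reused across jobs:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("jobserver-with-cassandra")
      .set("spark.cassandra.connection.host", "10.0.0.5")
    val sc = new SparkContext(conf)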

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Sean Owen
(This also works for me on a fresh VM, fresh pull, etc.) On Tue, Jan 13, 2015 at 10:03 AM, Robin East robin.e...@xense.co.uk wrote: I’ve just pulled down the latest commits from github, and done the following: 1) mvn clean package -DskipTests builds fine 2) ./bin/spark-shell works 3)

Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Night Wolf
Hi, I'm trying to load up an SBT project in IntelliJ 14 (Windows) running JDK 1.7, SBT 0.13.5 - I seem to be getting errors with the project. The build.sbt file is super simple: name := "scala-spark-test1" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %%

IndexedRDD

2015-01-13 Thread Jem Tucker
Hi, I have been playing around with the indexedRDD ( https://issues.apache.org/jira/browse/SPARK-2365, https://github.com/amplab/spark-indexedrdd) and have been very impressed with its performance. Some performance testing has revealed worse than expected scaling of the join performance*, and I

RE: Failed to save RDD as text file to local file system

2015-01-13 Thread Wang, Ningjun (LNG-NPV)
I want to save to a local directory. I have tried the following and get an error: r.saveAsTextFile("file:/home/cloudera/tmp/out1") r.saveAsTextFile("file:///home/cloudera/tmp/out1") r.saveAsTextFile("file:home/cloudera/tmp/out1") They all generate the following error 15/01/12 08:31:10 WARN

Re: IndexedRDD

2015-01-13 Thread Andrew Ash
Hi Jem, Linear time in scaling on the big table doesn't seem that surprising to me. What were you expecting? I assume you're doing normalRDD.join(indexedRDD). If you were to replace the indexedRDD with a normal RDD, what times do you get? On Tue, Jan 13, 2015 at 5:35 AM, Jem Tucker

RE: Cleaning up spark.local.dir automatically

2015-01-13 Thread michael.england
That’s really useful, thanks. From: Andrew Ash [mailto:and...@andrewash.com] Sent: 09 January 2015 22:42 To: England, Michael (IT/UK) Cc: raghavendra.pan...@gmail.com; user Subject: Re: Cleaning up spark.local.dir automatically That's a worker setting which cleans up the files left behind by

RE: GraphX vs GraphLab

2015-01-13 Thread Buttler, David
Seems a bit early for anyone to have published anything regarding spark 1.2. Direct comparisons between 1.1 and 1.2 seem more likely in the near future; you should be able to extrapolate to comparisons with other systems that have been done in the past. One thing that would be really helpful

Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Ted Yu
Looking at the source code for AbstractGenericUDAFResolver, the following (non-deprecated) method should be called: public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) It is called by hiveUdfs.scala (master branch): val parameterInfo = new

Re: saveAsObjectFile is actually saveAsSequenceFile

2015-01-13 Thread Kevin Burton
Yes.. but this isn’t what the main documentation says. The file format isn’t very discoverable.. Also, the documentation doesn’t say anything about the group by 10.. what’s that about? Kevin On Tue, Jan 13, 2015 at 2:28 AM, Sean Owen so...@cloudera.com wrote: Yes, that's even what the

OOM for HiveFromSpark example

2015-01-13 Thread Zhan Zhang
Hi Folks, I am trying to run hive context in yarn-cluster mode, but met some error. Does anybody know what cause the issue. I use following cmd to build the distribution: ./make-distribution.sh -Phive -Phive-thriftserver -Pyarn -Phadoop-2.4 15/01/13 17:59:42 INFO

Re: Can't run Spark java code from command line

2015-01-13 Thread Akhil Das
It's just a binding issue with the hostnames in your /etc/hosts file. You can set SPARK_LOCAL_IP and SPARK_MASTER_IP in your conf/spark-env.sh file and restart your cluster. (In that case the spark://myworkstation:7077 will change to the IP address that you provided, e.g. spark://10.211.55.3.) Thanks

Why always spilling to disk and how to improve it?

2015-01-13 Thread Shuai Zheng
Hi All, I am trying with a small data set. It is only 200MB, and what I am doing is just a distinct count on it. But there is a lot of spilling happening in the log (attached at the end of the email). Basically I use 10G memory, running on a one-node EMR cluster with an r3.8xlarge instance

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ganon Pierce
Yep did this and can view the masterwebui no problem: 4) Started a master ./sbin/start-master.sh grabbed the MasterWebUI from the master log - Started MasterWebUI at http://x.x.x.x:8080 http://x.x.x.x:8080/ Can view the MasterWebUI from local browser However, cannot see view the app UI in

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Ganon Pierce
Here is the master log: Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp ::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m

Re: How to access OpenHashSet in my standalone program?

2015-01-13 Thread Reynold Xin
It is not meant to be a public API. If you want to use it, maybe copy the code out of the package and put it in your own project. On Fri, Jan 9, 2015 at 7:19 AM, Tae-Hyuk Ahn ahn@gmail.com wrote: Hi, I would like to use OpenHashSet (org.apache.spark.util.collection.OpenHashSet) in my

Re: Why always spilling to disk and how to improve it?

2015-01-13 Thread Akhil Das
You could try setting the following to tweak the application a little bit: .set("spark.rdd.compress", "true") .set("spark.storage.memoryFraction", "1") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") For shuffle behavior, you can look at this document

[SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-13 Thread Yana Kadiyska
Attempting to bump this up in case someone can help out after all. I spent a few good hours stepping through the code today, so I'll summarize my observations both in hope I get some help and to help others that might be looking into this: 1. I am setting *spark.sql.parquet.**filterPushdown=true*

Re: How to access OpenHashSet in my standalone program?

2015-01-13 Thread Tae-Hyuk Ahn
Thanks, Josh and Reynold. Yes, I can incorporate it into my package and use it. But I am still wondering why you designed such useful functions as private. On Tue, Jan 13, 2015 at 3:33 PM, Reynold Xin r...@databricks.com wrote: It is not meant to be a public API. If you want to use it, maybe copy

RE: Failed to save RDD as text file to local file system

2015-01-13 Thread Wang, Ningjun (LNG-NPV)
All right, I removed cloudera totally and installed Spark manually on a bare Linux system, and now r.saveAsTextFile(…) works. Thanks. Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Prannoy [mailto:pran...@sigmoidanalytics.com]

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Akhil Das
I had a similar issue; I downgraded my IntelliJ version to 13.1.4 and then it was gone. Although there was some discussion already happening here, and for some people the following was the solution: Go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the Additional compiler options

Re: Issue writing to Cassandra from Spark

2015-01-13 Thread Akhil Das
Awesome. Thanks Best Regards On Tue, Jan 13, 2015 at 10:35 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: I realized that I was running the cluster with spark.cassandra.output.concurrent.writes=2, changing it to 1 did the trick. We realized that the issue was because spark was

Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Jianshi Huang
Hi, The following SQL query select percentile_approx(variables.var1, 0.95) p95 from model will throw ERROR SparkSqlInterpreter: Error org.apache.hadoop.hive.ql.parse.SemanticException: This UDAF does not support the deprecated getEvaluator() method. at

spark.cleaner questions

2015-01-13 Thread ankits
I am using spark 1.1 with the ooyala job server (which basically creates long running spark jobs as contexts to execute jobs in). These contexts have cached RDDs in memory (via RDD.persist()). I want to enable the spark.cleaner to cleanup the /spark/work directories that are created for each app,

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Enno Shioji
Had the same issue. I can't remember what the issue was but this works: libraryDependencies ++= { val sparkVersion = "1.2.0" Seq( "org.apache.spark" %% "spark-core" % sparkVersion % "provided", "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided", "org.apache.spark" %%

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Jay Vyas
I find importing a working SBT project into IntelliJ is the way to go. How did you load the project into intellij? On Jan 13, 2015, at 4:45 PM, Enno Shioji eshi...@gmail.com wrote: Had the same issue. I can't remember what the issue was but this works: libraryDependencies ++= {

Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Yin Huai
Yeah, it's a bug. It has been fixed by https://issues.apache.org/jira/browse/SPARK-3891 in master. On Tue, Jan 13, 2015 at 2:41 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at the source code for AbstractGenericUDAFResolver, the following (non-deprecated) method should be called: public

Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Jianshi Huang
Ah, thx Ted and Yin! I'll build a new version. :) Jianshi On Wed, Jan 14, 2015 at 7:24 AM, Yin Huai yh...@databricks.com wrote: Yeah, it's a bug. It has been fixed by https://issues.apache.org/jira/browse/SPARK-3891 in master. On Tue, Jan 13, 2015 at 2:41 PM, Ted Yu yuzhih...@gmail.com