Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Nat Padmanabhan
Hi Eric, Something along the lines of the following should work: val fs = getFileSystem(...) // standard Hadoop API call val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter).map(_.getPath.toString).mkString(",") // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
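A fuller sketch of the same idea, ending with one RDD per filtered file and a union (the input format, key/value classes, directory, and filter predicate are illustrative placeholders, not from the original message):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val hadoopConf = new Configuration()
    val fs = FileSystem.get(hadoopConf)

    // Keep only the files of interest, e.g. part files, skipping _SUCCESS markers.
    val partFilter = new PathFilter {
      def accept(p: Path): Boolean = p.getName.startsWith("part-")
    }

    val filteredPaths = fs.listStatus(new Path("/data/topLevelDir"), partFilter)
      .map(_.getPath.toString)

    // One RDD per file, then union them, which sidesteps the question of whether
    // newAPIHadoopFile accepts a comma-separated path list in this Spark version.
    val rdds = filteredPaths.map { p =>
      sc.newAPIHadoopFile(p, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
    }
    val combined = sc.union(rdds)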

Re: Accuracy hit in classification with Spark

2014-09-15 Thread jatinpreet
Hi, I have been able to get the same accuracy with MLlib as Mahout's. The pre-processing phase of Mahout was the reason behind the accuracy mismatch. After studying and applying the same logic in my code, it worked like a charm. Thanks, Jatin - Novice Big Data Programmer -- View this

SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread linkpatrickliu
I started sparkSQL thrift server: sbin/start-thriftserver.sh Then I use beeline to connect to it: bin/beeline !connect jdbc:hive2://localhost:1 op1 op1 I have created a database for user op1. create database dw_op1; And grant all privileges to user op1; grant all on database dw_op1 to user

Re: combineByKey throws ClassCastException

2014-09-15 Thread x
How about this. scala> val rdd2 = rdd.combineByKey( | (v: Int) => v.toLong, | (c: Long, v: Int) => c + v, | (c1: Long, c2: Long) => c1 + c2) rdd2: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[9] at combineByKey at <console>:14 xj @ Tokyo On Mon, Sep 15, 2014 at 3:06 PM,
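A self-contained version of the snippet above (the sample data is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object CombineByKeyExample extends App {
      val sc = new SparkContext(new SparkConf().setAppName("combineByKey-example").setMaster("local[*]"))

      // (key, Int) pairs whose values we want summed into Longs per key.
      val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

      val rdd2 = rdd.combineByKey(
        (v: Int) => v.toLong,             // createCombiner: first value seen for a key
        (c: Long, v: Int) => c + v,       // mergeValue: fold another value into the running sum
        (c1: Long, c2: Long) => c1 + c2)  // mergeCombiners: merge partial sums across partitions

      rdd2.collect().foreach(println)     // (a,4), (b,2)
      sc.stop()
    }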

Re: Developing a spark streaming application

2014-09-15 Thread Santiago Mola
Just for the record, this is being discussed at StackOverflow: http://stackoverflow.com/questions/25663026/developing-a-spark-streaming-application/25766618 2014-08-27 10:28 GMT+02:00 Filip Andrei andreis.fi...@gmail.com: Hey guys, so the problem i'm trying to tackle is the following: - I

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-15 Thread Andrew Ash
Hi Brad and Nick, Thanks for the comments! I opened a ticket to get a more thorough explanation of data locality into the docs here: https://issues.apache.org/jira/browse/SPARK-3526 If you could put any other unanswered questions you have about data locality on that ticket I'll try to

Re: Viewing web UI after fact

2014-09-15 Thread Grzegorz Białek
Hi Andrew, sorry for late response. Thank you very much for solving my problem. There was no APPLICATION_COMPLETE file in log directory due to not calling sc.stop() at the end of program. With stopping spark context everything works correctly, so thank you again. Best regards, Grzegorz On Fri,

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-15 Thread Thorsten Bergler
Hello, When I remove the line and try to execute sbt run, I end up with the following lines: 14/09/15 10:11:35 INFO ui.SparkUI: Stopped Spark web UI at http://base:4040 [...] 14/09/15 10:11:05 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster

Re: Broadcast error

2014-09-15 Thread Chengi Liu
Hi Akhil, So with your config (specifically with set("spark.akka.frameSize", "1000")), I see the error: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 401970046 bytes which exceeds spark.akka.frameSize (10485760 bytes). Consider using broadcast

Re: Broadcast error

2014-09-15 Thread Akhil Das
Try: rdd = sc.broadcast(matrix) Or rdd = sc.parallelize(matrix,100) // Just increase the number of slices, give it a try. Thanks Best Regards On Mon, Sep 15, 2014 at 2:18 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi Akhil, So with your config (specifically with

Re: About SparkSQL 1.1.0 join between more than two table

2014-09-15 Thread Yanbo Liang
Spark SQL supports both SQL and HiveQL, via SQLContext and HiveContext respectively. As far as I know, SQLContext in Spark SQL 1.1.0 cannot handle a three-table join directly. However, you can rewrite your query with a subquery, such as SELECT * FROM (SELECT * FROM youhao_data left join youhao_age on
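A sketch of the nested-subquery workaround (table and column names are hypothetical; it assumes the three tables have already been registered as temp tables on the SQLContext):

    // First join two tables inside a subquery, then join the result with the third.
    val joined = sqlContext.sql("""
      SELECT t.id, t.value, c.label
      FROM (SELECT a.id, a.value, b.cid
            FROM table_a a LEFT JOIN table_b b ON a.id = b.id) t
      LEFT JOIN table_c c ON t.cid = c.cid
    """)
    joined.collect().foreach(println)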

Re: Broadcast error

2014-09-15 Thread Chengi Liu
So.. same result with parallelize(matrix, 1000) with broadcast.. seems like I got a jvm core dump :-/ 14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager host:47978 with 19.2 GB RAM 14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager host:43360 with 19.2 GB RAM Unhandled

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Alon Pe'er
Hi Dibyendu, Thanks for your great work! I'm new to Spark Streaming, so I just want to make sure I understand Driver failure issue correctly. In my use case, I want to make sure that messages coming in from Kafka are always broken into the same set of RDDs, meaning that if a set of messages are

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Dibyendu Bhattacharya
Hi Alon, No, this will not guarantee that the same set of messages will come in the same RDD. This fix just re-plays the messages from the last processed offset in the same order. Again, this is just an interim fix we needed to solve our use case. If you do not need this message re-play feature, just do not

Re: Serving data

2014-09-15 Thread Marius Soutier
Thank you guys, I’ll try Parquet and if that’s not quick enough I’ll go the usual route with either read-only or normal database. On 13.09.2014, at 12:45, andy petrella andy.petre...@gmail.com wrote: however, the cache is not guaranteed to remain, if other jobs are launched in the cluster

Re: Serving data

2014-09-15 Thread andy petrella
I'm using Parquet in ADAM, and I can say that it works pretty fine! Enjoy ;-) aℕdy ℙetrella about.me/noootsab [image: aℕdy ℙetrella on about.me] http://about.me/noootsab On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier mps@gmail.com wrote: Thank you guys, I’ll try Parquet and if that’s not

Re: Serving data

2014-09-15 Thread Marius Soutier
So you are living the dream of using HDFS as a database? ;) On 15.09.2014, at 13:50, andy petrella andy.petre...@gmail.com wrote: I'm using Parquet in ADAM, and I can say that it works pretty fine! Enjoy ;-) aℕdy ℙetrella about.me/noootsab On Mon, Sep 15, 2014 at 1:41 PM, Marius

Upgrading a standalone cluster on ec2 from 1.0.2 to 1.1.0

2014-09-15 Thread Tomer Benyamini
Hi, I would like to upgrade a standalone cluster to 1.1.0. What's the best way to do it? Should I just replace the existing /root/spark folder with the uncompressed folder from http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz ? What about hdfs and other installations? I have spark

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Any tips from anybody on how to do this in PySpark? (Or regular Spark, for that matter.) On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Howdy doody Spark Users, I’d like to somehow write out a single RDD to multiple paths in one go. Here’s an example. I

vertex active/inactive feature in Pregel API ?

2014-09-15 Thread Yifan LI
Hi, I am wondering whether the vertex active/inactive feature (corresponding to the change of its value between two supersteps) is available in the Pregel API of GraphX? If it is not a default setting, how do I enable it in the following? def sendMessage(edge: EdgeTriplet[(Int, HashMap[VertexId, Double]), Int]) =

Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-15 Thread Koert Kuipers
in spark 1.1.0 i get this error: 2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former. i checked my application. i do not set spark.driver.extraClassPath or SPARK_CLASSPATH. SPARK_CLASSPATH is set in spark-env.sh

Re: About SparkSQL 1.1.0 join between more than two table

2014-09-15 Thread Yin Huai
1.0.1 does not have support for outer joins (added in 1.1). Your query should be fine in 1.1. On Mon, Sep 15, 2014 at 5:35 AM, Yanbo Liang yanboha...@gmail.com wrote: Spark SQL can support SQL and HiveSQL which used SQLContext and HiveContext separate. As far as I know, SQLContext of Spark

Compiler issues for multiple map on RDD

2014-09-15 Thread Boromir Widas
Hello Folks, I am trying to chain a couple of map operations and it seems the second map fails with a mismatch in arguments (even though the compiler prints them to be the same). I checked the function and variable types using :t and they look ok to me. Have you seen this earlier? I am posting

File I/O in spark

2014-09-15 Thread rapelly kartheek
Hi, I am trying to perform some read/write file operations in spark. Somehow I am neither able to write to a file nor read. import java.io._ val writer = new PrintWriter(new File("test.txt")) writer.write("Hello Scala") Can someone please tell me how to perform file I/O in spark.

Re: Compiler issues for multiple map on RDD

2014-09-15 Thread Sean Owen
Looks like another instance of https://issues.apache.org/jira/browse/SPARK-1199 which was intended to be fixed in 1.1.0. I'm not clear whether https://issues.apache.org/jira/browse/SPARK-2620 is the same issue and therefore whether it too is resolved in 1.1? On Mon, Sep 15, 2014 at 4:37 PM,

scala 2.11?

2014-09-15 Thread Mohit Jaggi
Folks, I understand Spark SQL uses quasiquotes. Does that mean Spark has now moved to Scala 2.11? Mohit.

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-15 Thread stanley
Thank you. Would the following approaches to this problem be overkill? a. create a ServerSocket in a different thread from the main thread that created the Spark StreamingContext, and listen for a shutdown command there b. create a web service that wraps around the main thread that
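A minimal sketch of approach (a), assuming Spark Streaming's stop(stopSparkContext, stopGracefully) overload (the port and names are illustrative):

    import java.net.ServerSocket
    import org.apache.spark.streaming.StreamingContext

    // Daemon thread that blocks until anything connects to the given port,
    // then asks the StreamingContext to shut down gracefully.
    def startShutdownListener(ssc: StreamingContext, port: Int = 9999): Unit = {
      val t = new Thread(new Runnable {
        def run(): Unit = {
          val server = new ServerSocket(port)
          server.accept()   // e.g. `nc host 9999` from an operator triggers the shutdown
          server.close()
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      })
      t.setDaemon(true)
      t.start()
    }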

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
Is this code running in an executor? You need to make sure the file is accessible on ALL executors. One way to do that is to use a distributed filesystem like HDFS or GlusterFS. On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi I am trying to perform some

Re: Compiler issues for multiple map on RDD

2014-09-15 Thread Sean Owen
(Adding back the user list) Boromir says: Thanks much Sean, verified 1.1.0 does not have this issue. On Mon, Sep 15, 2014 at 4:47 PM, Sean Owen so...@cloudera.com wrote: Looks like another instance of https://issues.apache.org/jira/browse/SPARK-1199 which was intended to be fixed in 1.1.0.

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created on the master node. But no data gets written to it. On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote: Is this code running in an executor? You need to

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
The file gets created on the fly. So I don't know how to make sure that it's accessible to all nodes. On Mon, Sep 15, 2014 at 10:10 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created in the

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
But the above APIs are not for HDFS. On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I see that the file gets created in the master node. But, there wont be any data written to it. On

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10. On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I understand Spark SQL uses quasiquotes. Does that mean Spark has now moved to Scala 2.11? Mohit.

Re: scala 2.11?

2014-09-15 Thread Mohit Jaggi
ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote: No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10. On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I understand Spark SQL uses quasiquotes. Does that

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
I came across these APIs in one of the Scala tutorials on the net. On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi mohitja...@gmail.com wrote: But the above APIs are not for HDFS. On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Yes. I have HDFS. My cluster

Re: File I/O in spark

2014-09-15 Thread rapelly kartheek
Can you please direct me to the right way of doing this. On Mon, Sep 15, 2014 at 10:18 PM, rapelly kartheek kartheek.m...@gmail.com wrote: I came across these APIs in one the scala tutorials over the net. On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi mohitja...@gmail.com wrote: But the

Re: File I/O in spark

2014-09-15 Thread Mohit Jaggi
If your underlying filesystem is HDFS, you need to use the HDFS APIs. A Google search brought up this link, which appears reasonable: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample If you want to use java.io APIs, you have to make sure your filesystem is accessible from all nodes in your
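A minimal sketch of writing through the HDFS API from the driver, along the lines of the HadoopDfsReadWriteExample linked above (the path is illustrative):

    import java.io.PrintWriter
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()    // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs = FileSystem.get(conf)
    val out = fs.create(new Path("/tmp/test.txt"))
    val writer = new PrintWriter(out)
    writer.write("Hello Scala")
    writer.close()                    // flushes and closes the underlying HDFS stream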

Re: File I/O in spark

2014-09-15 Thread Frank Austin Nothaft
Kartheek, What exactly are you trying to do? Those APIs are only for local file access. If you want to access data in HDFS, you’ll want to use one of the reader methods in org.apache.spark.SparkContext which will give you an RDD (e.g., newAPIHadoopFile, sequenceFile, or textFile). If you want

Re: Broadcast error

2014-09-15 Thread Davies Liu
I think 1.1 will be really helpful for you; it's all compatible with 1.0, so it's not hard to upgrade to 1.1. On Mon, Sep 15, 2014 at 2:35 AM, Chengi Liu chengi.liu...@gmail.com wrote: So.. same result with parallelize (matrix,1000) with broadcast.. seems like I got jvm core dump :-/

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Tim Smith
Hi Dibyendu, I am a little confused about the need for rate limiting input from kafka. If the stream coming in from kafka has higher message/second rate than what a Spark job can process then it should simply build a backlog in Spark if the RDDs are cached on disk using persist(). Right? Thanks,

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
In PySpark, I think you could enumerate all the valid files, and create RDD by newAPIHadoopFile(), then union them together. On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman eric.d.fried...@gmail.com wrote: I neglected to specify that I'm using pyspark. Doesn't look like these APIs have been

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Davies Liu
Maybe we should provide an API like saveTextFilesByKey(path); could you create a JIRA for it? There is one in DPark [1] actually. [1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309 On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Any tips
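Until such an API exists, one workaround often used from Scala (not from this thread, and sketched here assuming the old-style Hadoop output API is acceptable) is saveAsHadoopFile with a MultipleTextOutputFormat that routes each record to a file derived from its key:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Write each (key, value) pair under a sub-directory named after the key.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name          // e.g. <outputDir>/<key>/part-00000
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()                 // don't repeat the key inside the file
    }

    val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2"), ("a", "3")))
    pairs.saveAsHadoopFile("/tmp/by-key-output",
      classOf[String], classOf[String], classOf[KeyBasedOutput])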

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Dibyendu Bhattacharya
Hi Tim, I have not tried persisting the RDD. There is some discussion of rate limiting in Spark Streaming in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-rate-limiting-from-kafka-td8590.html There is a Pull Request

Need help with ThriftServer/Spark1.1.0

2014-09-15 Thread Yana Kadiyska
Hi ladies and gents, trying to get Thrift server up and running in an effort to replace Shark. My first attempt to run sbin/start-thriftserver resulted in: 14/09/15 17:09:05 ERROR TThreadPoolServer: Error occurred during processing of message. java.lang.RuntimeException:

MLLib sparse vector

2014-09-15 Thread Sameer Tilak
Hi All, I have transformed the data into the following format: the first column is a user id, and all the other columns are class ids. For a user, only the class ids that appear in this row have value 1 and the others are 0. I need to create a sparse vector from this. Does the API for creating a sparse

Example of Geoprocessing with Spark

2014-09-15 Thread Abel Coronado Iruegas
Here is an example of working code that takes a csv with lat/lon points and intersects them with polygons of the municipalities of Mexico, generating a new version of the file with new attributes. Do you think it could be improved? Thanks. The Code: import org.apache.spark.SparkContext import

Dealing with Time Series Data

2014-09-15 Thread Gary Malouf
I have a use case for our data in HDFS that involves sorting chunks of data into time series format by a specific characteristic and doing computations from that. At large scale, what is the most efficient way to do this? Obviously, having the data sharded by that characteristic would make the

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-15 Thread Jeoffrey Lim
What we did for gracefully shutting down the spark streaming context is extend a Spark Web UI Tab and perform a SparkContext.SparkUI.attachTab(custom web ui). However, the custom Scala Web UI extensions need to be under the package org.apache.spark.ui to get around the package access

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
Scala 2.11 work is under way in open pull requests though, so hopefully it will be in soon. Matei On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote: ah...thanks! On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote: No, not yet.  Spark SQL is

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
Are we going to put 2.11 support into 1.1 or 1.0? Else "will be in soon" applies to the master development branch, but actually being in the Spark 1.2.0 release won't occur until the second half of November at the earliest. On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Re: Write 1 RDD to multiple output paths in one go

2014-09-15 Thread Nicholas Chammas
Davies, That’s pretty neat. I heard there was a pure Python clone of Spark out there—so you were one of the people behind it! I’ve created a JIRA issue about this. SPARK-3533: Add saveAsTextFileByKey() method to RDDs https://issues.apache.org/jira/browse/SPARK-3533 Sean, I think you might be

Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
Hi Sameer, MLLib uses Breeze’s vector format under the hood. You can use that. http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector For example: import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV} val numClasses = classes.distinct.count.toInt val

Efficient way to sum multiple columns

2014-09-15 Thread jamborta
Hi all, I have an RDD that contains around 50 columns. I need to sum each column, which I am doing by running it through a for loop, creating an array and running the sum function as follows: for (i <- 0 until 10) yield { data.map(x => x(i)).sum } Is there a better way to do this? thanks,

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Eric Friedman
That's a good idea and one I had considered too. Unfortunately I'm not aware of an API in PySpark for enumerating paths on HDFS. Have I overlooked one? On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu dav...@databricks.com wrote: In PySpark, I think you could enumerate all the valid files, and

Re: Efficient way to sum multiple columns

2014-09-15 Thread Xiangrui Meng
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui On Mon, Sep 15, 2014 at 1:00 PM, jamborta jambo...@gmail.com wrote: Hi all, I have an RDD that contains around 50 columns. I need to sum each column, which I am doing by running it through a for loop, creating an
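A sketch of that suggestion (it assumes the rows can be turned into MLlib dense vectors; one pass then yields all the per-column sums):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // data: RDD[Array[Double]] is assumed here.
    val rows = data.map(arr => Vectors.dense(arr))
    val summary = Statistics.colStats(rows)                        // single pass over the data
    val columnSums = summary.mean.toArray.map(_ * summary.count)   // per-column sum = mean * count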

Re: Spark Streaming union expected behaviour?

2014-09-15 Thread Varad Joshi
I am seeing the same exact behavior. Shrikar, did you get any response to your post? Varad -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-union-expected-behaviour-tp7206p14284.html Sent from the Apache Spark User List mailing list archive

minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
sc.textFile takes a minimum # of partitions to use. is there a way to get sc.newAPIHadoopFile to do the same? I know I can repartition() and get a shuffle. I'm wondering if there's a way to tell the underlying InputFormat (AvroParquet, in my case) how many partitions to use at the outset. What

Re: minPartitions for non-text files?

2014-09-15 Thread Sean Owen
I think the reason is simply that there is no longer an explicit min-partitions argument for Hadoop InputSplits in the new Hadoop APIs. At least, I didn't see it when I glanced just now. However, you should be able to get the same effect by setting a Configuration property, and you can do so
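A sketch of that suggestion in Scala (the property is the standard new-API FileInputFormat split-size setting, not something confirmed in this thread for Parquet; the path, format, and value are illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Smaller max split size => more input splits => more partitions.
    // TextInputFormat is used here just to keep the sketch compilable; in the
    // thread above the format would be AvroParquetInputFormat instead.
    val hadoopConf = new Configuration()
    hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", (16 * 1024 * 1024).toString)  // 16 MB

    val rdd = sc.newAPIHadoopFile("/data/input",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
    println(rdd.partitions.length)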

Weird aggregation results when reusing objects inside reduceByKey

2014-09-15 Thread kriskalish
I have a pretty simple scala spark aggregation job that is summing up number of occurrences of two types of events. I have run into situations where it seems to generate bad values that are clearly incorrect after reviewing the raw data. First I have a Record object which I use to do my

Re: vertex active/inactive feature in Pregel API ?

2014-09-15 Thread Ankur Dave
At 2014-09-15 16:25:04 +0200, Yifan LI iamyifa...@gmail.com wrote: I am wondering if the vertex active/inactive(corresponding the change of its value between two supersteps) feature is introduced in Pregel API of GraphX? Vertex activeness in Pregel is controlled by messages: if a vertex did
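A minimal sketch of how message-driven activeness looks in a Pregel call (a toy max-value propagation over a Graph[Double, Int], not the HashMap-based graph from the question):

    import org.apache.spark.graphx._

    // graph: Graph[Double, Int] is assumed. A vertex that receives no message in a
    // superstep keeps its value and is effectively inactive in the next one.
    val result = graph.pregel(initialMsg = Double.NegativeInfinity)(
      vprog = (id, attr, msg) => math.max(attr, msg),      // apply the incoming message
      sendMsg = triplet =>
        if (triplet.srcAttr > triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr))       // message only when there is new information
        else
          Iterator.empty,                                  // no message => no further work for dst
      mergeMsg = (a, b) => math.max(a, b))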

Re: Weird aggregation results when reusing objects inside reduceByKey

2014-09-15 Thread Sean Owen
It isn't a question of an item being reduced twice, but of when objects may be reused to represent other items. I don't think you have a guarantee that you can safely reuse the objects in this argument, but I'd also be interested if there was a case where this is guaranteed. For example I'm
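A sketch of the pattern that avoids the issue altogether, by building a new result object instead of mutating and returning one of the inputs (the Record fields are illustrative, not from the original post):

    // Hypothetical record in the spirit of the question: counts of two event types.
    case class Record(eventA: Long, eventB: Long)

    // records: RDD[(String, Record)] is assumed.
    val totals = records.reduceByKey { (a, b) =>
      Record(a.eventA + b.eventA, a.eventB + b.eventB)   // fresh immutable object each time
    }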

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
There is one way to do it in bash: hadoop fs -ls. Maybe you could end up with a bash script to do it. On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman eric.d.fried...@gmail.com wrote: That's a good idea and one I had considered too. Unfortunately I'm not aware of an API in PySpark

Re: PathFilter for newAPIHadoopFile?

2014-09-15 Thread Davies Liu
Or maybe you could give this one a try: https://labs.spotify.com/2013/05/07/snakebite/ On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu dav...@databricks.com wrote: There is one way by do it in bash: hadoop fs -ls , maybe you could end up with a bash scripts to do the things. On Mon, Sep 15,

Re: Define the name of the outputs with Java-Spark.

2014-09-15 Thread Xiangrui Meng
Spark doesn't support MultipleOutput at this time. You can cache the parent RDD. Then create RDDs from it and save them separately. -Xiangrui On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz konstt2...@gmail.com wrote: I would like to define the names of my output in Spark, I have a process
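A minimal sketch of that suggestion, in Scala (paths and the split predicate are illustrative):

    // Cache the parent so the two filtered saves below don't recompute it.
    val parent = sc.textFile("hdfs:///input/events").cache()

    val errors = parent.filter(_.contains("ERROR"))
    val others = parent.filter(line => !line.contains("ERROR"))

    // Each derived RDD gets its own output directory.
    errors.saveAsTextFile("hdfs:///output/errors")
    others.saveAsTextFile("hdfs:///output/others")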

Re: Accuracy hit in classification with Spark

2014-09-15 Thread Xiangrui Meng
Thanks for the update! -Xiangrui On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet jatinpr...@gmail.com wrote: Hi, I have been able to get the same accuracy with MLlib as Mahout's. The pre-processing phase of Mahout was the reason behind the accuracy mismatch. After studying and applying the

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
That would be awesome, but doesn't seem to have any effect. In PySpark, I created a dict with that key and a numeric value, then passed it into newAPIHadoopFile as a value for the conf keyword. The returned RDD still has a single partition. On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen

Re: MLLib sparse vector

2014-09-15 Thread Xiangrui Meng
Or you can use the factory method `Vectors.sparse`: val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0))) where numProducts should be the largest product id plus one. Best, Xiangrui On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore cdg...@cdgore.com wrote: Hi Sameer, MLLib uses
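A small self-contained version of that call (the ids are illustrative):

    import org.apache.spark.mllib.linalg.Vectors

    // Class/product ids that are "on" for this user; everything else is implicitly 0.0.
    val productIds = Seq(0, 3, 7)
    val numProducts = productIds.max + 1                      // largest product id plus one
    val sv = Vectors.sparse(numProducts, productIds.map(i => (i, 1.0)))
    println(sv)                                               // (8,[0,3,7],[1.0,1.0,1.0])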

Re: minPartitions for non-text files?

2014-09-15 Thread Sean Owen
Heh, it's still just a suggestion to Hadoop I guess, not guaranteed. Is it a splittable format? for example, some compressed formats are not splittable and Hadoop has to process whole files at a time. I'm also not sure if this is something to do with pyspark, since the underlying Scala API takes

Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Pramod Biligiri
Hi, I'm running Spark tasks with speculation enabled. I'm noticing that Spark seems to wait in a given stage for all stragglers to finish, even though the speculated alternative might have finished sooner. Is that correct? Is there a way to indicate to Spark not to wait for stragglers to finish?

Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-15 Thread kpeng1
Hi All, I am trying to submit a spark job that I have built in maven using the following command: /usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain --master local[1] /home/cloudera/myjar.jar 100 But I seem to be getting the following error: Exception in thread main

Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-15 Thread Sean Owen
This is more of a Java / Maven issue than Spark per se. I would use the shade plugin to remove signature files in your final META-INF/ dir. As Spark does, in its configuration: <filters> <filter> <artifact>*:*</artifact> <excludes> <exclude>org/datanucleus/**</exclude>

Re: MLLib sparse vector

2014-09-15 Thread Chris Gore
Probably worth noting that the factory methods in mllib create an object of type org.apache.spark.mllib.linalg.Vector which stores data in a similar format as Breeze vectors Chris On Sep 15, 2014, at 3:24 PM, Xiangrui Meng men...@gmail.com wrote: Or you can use the factory method

Convert GraphX Graph to Sparse Matrix

2014-09-15 Thread crockpotveggies
Hi everyone, I'm looking to implement Markov algorithms in GraphX and I'm wondering if it's already possible to auto-convert the Graph into a Sparse Double Matrix? I've seen this implemented in other graphs before, namely JUNG, but still familiarizing myself with GraphX. Example:

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Du Li
There is a parameter spark.speculation that is turned off by default. Look at the configuration doc: http://spark.apache.org/docs/latest/configuration.html From: Pramod Biligiri pramodbilig...@gmail.com Date: Monday, September 15, 2014 at 3:30 PM To:
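A sketch of turning it on via SparkConf (the tuning values are illustrative; the related knobs are the ones documented on that page):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.speculation", "true")             // launch speculative copies of slow tasks
      .set("spark.speculation.interval", "100")     // how often to check for stragglers, in ms
      .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before checking
      .set("spark.speculation.multiplier", "1.5")   // how many times slower than the median counts as slow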

Re: minPartitions for non-text files?

2014-09-15 Thread Eric Friedman
Yes, it's AvroParquetInputFormat, which is splittable. If I force a repartitioning, it works. If I don't, spark chokes on my not-terribly-large 250Mb files. PySpark's documentation says that the dictionary is turned into a Configuration object. @param conf: Hadoop configuration, passed in as a

apply at Option.scala:120 callback in Spark 1.1, but no user code involved?

2014-09-15 Thread John Salvatier
In Spark 1.1, I'm seeing tasks with callbacks that don't involve my code at all! I'd seen something like this before in 1.0.0, but the behavior seems to be back: apply at Option.scala:120 http://localhost:4040/stages/stage?id=52&attempt=0

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Pramod Biligiri
I'm already running with speculation set to true and the speculated tasks are launching, but the issue I'm observing is that Spark does not kill the long running task even if the shorter alternative has finished successfully. Therefore the overall turnaround time is still the same as without

Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
I think the current plan is to put it in 1.2.0, so that's what I meant by soon. It might be possible to backport it too, but I'd be hesitant to do that as a maintenance release on 1.1.x and 1.0.x since it would require nontrivial changes to the build that could break things on Scala 2.10.

SPARK_MASTER_IP

2014-09-15 Thread Mark Grover
Hi Koert, I work on Bigtop and CDH packaging and you are right, based on my quick glance, it doesn't seem to be used. Mark From: Koert Kuipers ko...@tresata.com Date: Sat, Sep 13, 2014 at 7:03 AM Subject: SPARK_MASTER_IP To: user@spark.apache.org a grep for SPARK_MASTER_IP shows that

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Matei Zaharia
It's true that it does not send a kill command right now -- we should probably add that. This code was written before tasks were killable AFAIK. However, the *job* should still finish while a speculative task is running as far as I know, and it will just leave that task behind. Matei On

Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Paul Wais
Dear List, I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for reading SequenceFiles. In particular, I'm seeing: Exception in thread main org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Christian Chua
Hi Paul. I would recommend building your own 1.1.0 distribution. ./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests I downloaded the Pre-built for Hadoop 2.4 binary, and it had this strange behavior where

Re: scala 2.11?

2014-09-15 Thread Mark Hamstra
Okay, that's consistent with what I was expecting. Thanks, Matei. On Mon, Sep 15, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I think the current plan is to put it in 1.2.0, so that's what I meant by soon. It might be possible to backport it too, but I'd be hesitant to do

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread Cheng, Hao
What's your Spark / Hadoop version? And also the hive-site.xml? Most cases like that are caused by an incompatibility between the Hadoop client jar and the Hadoop cluster. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Monday, September 15, 2014 2:35 PM To:

About SpakSQL OR MLlib

2014-09-15 Thread boyingk...@163.com
Hi: I have a dataset with the structure [id, driverAge, TotalKiloMeter, Emissions, date, KiloMeter, fuel], and the data looks like this: [1-980,34,221926,9,2005-2-8,123,14] [1-981,49,271321,15,2005-2-8,181,82] [1-982,36,189149,18,2005-2-8,162,51] [1-983,51,232753,5,2005-2-8,106,92]

Re: SPARK_MASTER_IP

2014-09-15 Thread Koert Kuipers
hey mark, you think that this is on purpose, or is it an omission? thanks, koert On Mon, Sep 15, 2014 at 8:32 PM, Mark Grover m...@apache.org wrote: Hi Koert, I work on Bigtop and CDH packaging and you are right, based on my quick glance, it doesn't seem to be used. Mark From: Koert

Re: About SpakSQL OR MLlib

2014-09-15 Thread Soumya Simanta
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int, fuel:Int) 1. Create an PairedRDD of (age,Car) tuples (pairedRDD) 2. Create a new function fc //returns the interval lower and upper bound def fc(x:Int, interval:Int) : (Int,Int) = { val floor = x - (x%interval)
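A guess at how that sketch might continue; everything past the floor computation is not in the post, so this completion is only illustrative:

    // Returns the lower and upper bound of the interval x falls into, e.g. fc(34, 10) == (30, 40).
    def fc(x: Int, interval: Int): (Int, Int) = {
      val floor = x - (x % interval)
      (floor, floor + interval)
    }

    // pairedRDD: RDD[(Int, Car)] keyed by driver age (step 1 above) is assumed.
    // Re-key each car by its age bucket, then group within each bucket.
    val byAgeBucket = pairedRDD.map { case (age, car) => (fc(age, 10), car) }.groupByKey()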

Re: NullWritable not serializable

2014-09-15 Thread Du Li
Hi Matei, Thanks for your reply. The Writable classes have never been serializable and this is why it is weird. I did try as you suggested to map the Writables to integers and strings. It didn’t pass, either. Similar exceptions were thrown except that the messages became IntWritable, Text are

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread linkpatrickliu
Hi, Hao Cheng, Here is the Spark\Hadoop version: Spark version = 1.1.0 Hadoop version = 2.0.0-cdh4.6.0 And hive-site.xml: <configuration> <property> <name>fs.default.name</name> <value>hdfs://ns</value> </property> <property> <name>dfs.nameservices</name> <value>ns</value> </property>

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread Cheng, Hao
The Hadoop client jar should be assembled into the uber-jar, but (I suspect) it's probably not compatible with your Hadoop cluster. Can you also paste the Spark uber-jar name? Usually it will be under the path lib/spark-assembly-1.1.0-xxx-hadoopxxx.jar. -Original Message- From:

Re: Re: About SpakSQL OR MLlib

2014-09-15 Thread boyingk...@163.com
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int, fuel:Int) 1. Create an PairedRDD of (age,Car) tuples (pairedRDD) 2. Create a new function fc //returns the interval lower and upper bound def fc(x:Int, interval:Int) : (Int,Int) = { val floor = x - (x%interval)

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread linkpatrickliu
Hi, Hao Cheng, This is my spark assembly jar name: spark-assembly-1.1.0-hadoop2.0.0-cdh4.6.0.jar I compiled spark 1.1.0 with the following cmd: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Dhadoop.version=2.0.0-cdh4.6.0 -Phive -Pspark-ganglia-lgpl -DskipTests

How to set executor num on spark on yarn

2014-09-15 Thread hequn cheng
hi~ I want to set the executor number to 16, but it is very strange that executor cores may affect executor num on Spark on YARN. I don't know why, or how to set the executor number. = ./bin/spark-submit --class com.hequn.spark.SparkJoins \ --master

Complexity/Efficiency of SortByKey

2014-09-15 Thread cjwang
I wonder what algorithm is used to implement sortByKey? I assume it is some O(n*log(n)) parallelized on x number of nodes, right? Then, what size of data would make it worthwhile to use sortByKey on multiple processors rather than use standard Scala sort functions on a single processor

Re: NullWritable not serializable

2014-09-15 Thread Matei Zaharia
Can you post the exact code for the test that worked in 1.0? I can't think of much that could've changed. The one possibility is if  we had some operations that were computed locally on the driver (this happens with things like first() and take(), which will try to do the first partition

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-15 Thread Cheng, Hao
Sorry, I am not able to reproduce that. Can you try adding the following entries to hive-site.xml? I know they have default values, but let's make them explicit. hive.server2.thrift.port hive.server2.thrift.bind.host hive.server2.authentication (NONE, KERBEROS, LDAP, PAM or CUSTOM)

Re: Complexity/Efficiency of SortByKey

2014-09-15 Thread Matei Zaharia
sortByKey is indeed O(n log n), it's a first pass to figure out even-sized partitions (by sampling the RDD), then a second pass to do a distributed merge-sort (first partition the data on each machine, then run a reduce phase that merges the data for each partition). The point where it becomes