Can't access remote Hive table from spark

2015-01-25 Thread guxiaobo1982
Hi, I built and started a single node standalone Spark 1.2.0 cluster along with a single node Hive 0.14.0 instance installed by Ambari 1.17.0. On the Spark and Hive node I can create and query tables inside Hive, and on remote machines I can submit the SparkPi example to the Spark master. But I

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Denis Mikhalkin
Hi Nicholas, thanks for your reply. I checked spark-redshift - it's just for the UNLOAD data files stored on Hadoop, not for online result sets from the DB. Do you know of any example of a custom RDD which fetches the data on the fly (not reading from HDFS)? Thanks. Denis From: Nicholas
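One built-in option that pulls rows on the fly (rather than reading files from HDFS) is Spark's JdbcRDD. A minimal sketch; the JDBC URL, credentials and query are hypothetical, and the query needs two '?' placeholders that JdbcRDD fills with per-partition bounds:

```scala
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Each partition runs the query with its own (lower, upper) bounds,
// so rows are fetched lazily from the database by the workers.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass"),
  "SELECT id, value FROM events WHERE id >= ? AND id <= ?",
  lowerBound = 1L, upperBound = 1000000L, numPartitions = 10,
  mapRow = rs => (rs.getLong("id"), rs.getString("value")))

rows.take(5).foreach(println)
```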

where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Larry Liu
I would like to persist RDD TO HDFS or NFS mount. How to change the location?

foreachActive functionality

2015-01-25 Thread kundan kumar
Can someone help me to understand the usage of the foreachActive function introduced for Vectors? I am trying to understand its usage in the MultivariateOnlineSummarizer class for summary statistics. sample.foreachActive { (index, value) => if (value != 0.0) { if (currMax(index)

graph.inDegrees including zero values

2015-01-25 Thread scharissis
Hi, If a vertex has no in-degree then Spark's GraphOps 'inDegrees' does not return it at all. Instead, it would be very useful to me to be able to have that vertex returned with an in-degree of zero. What's the best way to achieve this using the GraphX API? For example, given a graph with nodes

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-25 Thread Niranda Perera
Thanks Michael. A clarification: so the HQL dialect provided by HiveContext, does it use the Catalyst optimizer? I thought HiveContext was only related to Hive integration in Spark! Would be grateful if you could clarify this. Cheers. On Sun, Jan 25, 2015 at 1:23 AM, Michael Armbrust

RE: Can't access remote Hive table from spark

2015-01-25 Thread Skanda Prasad
This happened to me as well, putting hive-site.xml inside conf doesn't seem to work. Instead I added /etc/hive/conf to SPARK_CLASSPATH and it worked. You can try this approach. -Skanda -Original Message- From: guxiaobo1982 guxiaobo1...@qq.com Sent: 25-01-2015 13:50 To:
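A minimal sketch of that workaround (assuming the Hive client configuration lives in /etc/hive/conf, as in the message above):

```bash
# conf/spark-env.sh
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/etc/hive/conf
```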

Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran
Yeah, use streaming to gather the incoming logs and write to a log file, then run a spark job every 5 minutes to process the counts. Got it. Thanks a lot. On 07:07, Mon, 26 Jan 2015 Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
I've got my solution working: https://gist.github.com/cfeduke/3bca88ed793ddf20ea6d I couldn't actually perform the steps I outlined in the previous message in this thread because I would ultimately be trying to serialize a SparkContext to the workers to use during the generation of 1..*n*

No AMI for Spark 1.2 using ec2 scripts

2015-01-25 Thread hajons
Hi, When I try to launch a standalone cluster on EC2 using the scripts in the ec2 directory for Spark 1.2, I get the following error: Could not resolve AMI at: https://raw.github.com/mesos/spark-ec2/v4/ami-list/us-east-1/pvm It seems there is not yet any AMI available on EC2. Any ideas when

RE: Shuffle to HDFS

2015-01-25 Thread Shao, Saisai
Hi Larry, I don’t think Spark’s current shuffle can support HDFS as a shuffle output. Anyway, is there any specific reason to spill shuffle data to HDFS or NFS? This will severely increase the shuffle time. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Sunday, January 25,

Re: Lost task - connection closed

2015-01-25 Thread Aaron Davidson
Please take a look at the executor logs (on both sides of the IOException) to see if there are other exceptions (e.g., OOM) which precede this one. Generally, the connections should not fail spontaneously. On Sun, Jan 25, 2015 at 10:35 PM, octavian.ganea octavian.ga...@inf.ethz.ch wrote: Hi,

Re: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Larry Liu
Hi Charles, Thanks for your reply. Is it possible to persist an RDD to HDFS? What is the default location to persist an RDD with storagelevel DISK_ONLY? On Sun, Jan 25, 2015 at 6:26 AM, Charles Feduke charles.fed...@gmail.com wrote: I think you want to instead use `.saveAsSequenceFile` to save an

Announcement: Generalized K-Means Clustering on Spark

2015-01-25 Thread derrickburns
This project generalizes the Spark MLLIB K-Means clusterer to support clustering of dense or sparse, low or high dimensional data using distance functions defined by Bregman divergences. https://github.com/derrickburns/generalized-kmeans-clustering -- View this message in context:

SVD in pyspark ?

2015-01-25 Thread Andreas Rhode
Is the distributed SVD functionality exposed to Python yet? It seems it's only available to Scala or Java, unless I am missing something; I'm looking for a pyspark equivalent to org.apache.spark.mllib.linalg.SingularValueDecomposition. In case it's not there yet, is there a way to make a wrapper to call

Lost task - connection closed

2015-01-25 Thread octavian.ganea
Hi, I am running a program that executes map-reduce jobs in a loop. The first time the loop runs, everything is ok. After that, it starts giving the following error, first it gives it for one task, then for more tasks and eventually the entire program fails: 15/01/26 01:41:25 WARN

RE: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Shao, Saisai
No, the current RDD persistence mechanism does not support putting data on HDFS. The directory is controlled by spark.local.dir. Instead you can use checkpoint() to save the RDD on HDFS. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Monday, January 26, 2015 3:08 PM To: Charles Feduke Cc:
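A minimal sketch of the checkpoint() suggestion (the HDFS paths are hypothetical):

```scala
// persist(DISK_ONLY) writes to the local directories given by spark.local.dir;
// checkpoint() writes the RDD's contents to the checkpoint directory instead.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val rdd = sc.textFile("hdfs:///data/input").map(_.length)
rdd.checkpoint()
rdd.count()  // the checkpoint is materialized the first time the RDD is computed
```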

Re: spark streaming with checkpoint

2015-01-25 Thread Balakrishnan Narendran

Re: Spark 1.2 – How to change Default (Random) port ….

2015-01-25 Thread Aaron Davidson
This was a regression caused by Netty Block Transfer Service. The fix for this just barely missed the 1.2 release, and you can see the associated JIRA here: https://issues.apache.org/jira/browse/SPARK-4837 Current master has the fix, and the Spark 1.2.1 release will have it included. If you don't
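In the meantime, the other listener ports can be pinned in spark-defaults.conf. A sketch only: the property names below are taken from the Spark 1.x configuration docs, their exact availability depends on the release, and the port numbers are arbitrary examples:

```
spark.driver.port         7001
spark.fileserver.port     7002
spark.broadcast.port      7003
spark.blockManager.port   7005
spark.executor.port       7010
```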

Re: Eclipse on spark

2015-01-25 Thread Jörn Franke
I recommend using a build tool within Eclipse, such as Gradle or Maven. On 24 Jan 2015 19:34, riginos samarasrigi...@gmail.com wrote: How to compile a Spark project in Scala IDE for Eclipse? I got many scala scripts and i no longer want to load them from scala-shell what can i do? --

Re: Shuffle to HDFS

2015-01-25 Thread Larry Liu
Hi Jerry, Thanks for your reply. The reason I have this question is that in Hadoop, mapper intermediate output (shuffle) will be stored in HDFS. I think the default location for Spark is /tmp. Larry On Sun, Jan 25, 2015 at 9:44 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Larry,

RE: Shuffle to HDFS

2015-01-25 Thread Shao, Saisai
Hey Larry, I don’t think Hadoop will put shuffle output in HDFS; its behavior is the same as Spark’s, storing mapper output (shuffle) data on local disks. You might have misunderstood something ☺. Thanks Jerry From: Larry Liu [mailto:larryli...@gmail.com] Sent: Monday, January 26,

Re: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Charles Feduke
I think you want to instead use `.saveAsSequenceFile` to save an RDD to someplace like HDFS or NFS if you are attempting to interoperate with another system, such as Hadoop. `.persist` is for keeping the contents of an RDD around so future uses of that particular RDD don't need to recalculate its
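A short sketch contrasting the two (the HDFS path is hypothetical):

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits on Spark 1.2
import org.apache.spark.storage.StorageLevel

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

// Interoperate with other systems: write the data out to HDFS.
pairs.saveAsSequenceFile("hdfs:///tmp/pairs-seq")

// Reuse within this application only: keep the RDD around on local disk.
pairs.persist(StorageLevel.DISK_ONLY)
```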

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-25 Thread Charles Feduke
I'm facing a similar problem except my data is already pre-sharded in PostgreSQL. I'm going to attempt to solve it like this: - Submit the shard names (database names) across the Spark cluster as a text file and partition it so workers get 0 or more - hopefully 1 - shard name. In this case you
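A rough sketch of that idea, with hypothetical shard names, connection details and query; each shard name becomes one element of an RDD, so the worker that owns it opens the connection:

```scala
import java.sql.DriverManager

val shards = sc.parallelize(Seq("shard_01", "shard_02"), numSlices = 2)

val rows = shards.flatMap { shardDb =>
  // Runs on the worker that was assigned this shard name.
  val conn = DriverManager.getConnection(
    s"jdbc:postgresql://dbhost/$shardDb", "user", "pass")
  try {
    val rs = conn.createStatement().executeQuery("SELECT id, value FROM events")
    Iterator.continually(rs).takeWhile(_.next())
      .map(r => (r.getLong("id"), r.getString("value"))).toList  // force before close
  } finally {
    conn.close()
  }
}
```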

key already cancelled error

2015-01-25 Thread ilaxes
Hi everyone, I'm writing a program that updates a Cassandra table. I've written a first shot where I update the table row by row from an RDD through a map. Now I want to build a batch of updates using the same kind of syntax as in this thread:

Re: Spark webUI - application details page

2015-01-25 Thread ilaxes
Hi, I've a similar problem. I want to see the detailed logs of Completed Applications so I've set in my program: set("spark.eventLog.enabled", "true").set("spark.eventLog.dir", "file:/tmp/spark-events") but when I click on the application in the webui, I get a page with the message: Application history

Re: Eclipse on spark

2015-01-25 Thread Harihar Nahak
Download the pre-built binary for Windows, attach all required jars to your project's Eclipse classpath, and go ahead with your Eclipse. Make sure you have the same Java version. On 25 January 2015 at 07:33, riginos [via Apache Spark User List] ml-node+s1001560n21350...@n3.nabble.com wrote: How to

Re: Pairwise Processing of a List

2015-01-25 Thread Joseph Lust
So you’ve got a point A and you want the sum of distances between it and all other points? Or am I misunderstanding you? // target point, can be Broadcast global sent to all workers val tarPt = (10,20) val pts = Seq((2,2),(3,3),(2,3),(10,2)) val rdd = sc.parallelize(pts) rdd.map( pt => Math.sqrt(
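A completed version of that sketch (not the original author's exact code, just the same idea spelled out with Double coordinates):

```scala
import org.apache.spark.SparkContext._   // numeric-RDD implicits on Spark 1.2

// Target point; could also be a broadcast variable sent to all workers.
val tarPt = (10.0, 20.0)
val pts = Seq((2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (10.0, 2.0))
val rdd = sc.parallelize(pts)

// Sum of distances from every point to the fixed target point.
val totalDistance = rdd.map { pt =>
  math.sqrt(math.pow(pt._1 - tarPt._1, 2) + math.pow(pt._2 - tarPt._2, 2))
}.sum()
```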

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Hi, On Mon, Jan 26, 2015 at 9:32 AM, Steve Nunez snu...@hortonworks.com wrote: I’ve got a list of points: List[(Float, Float)] that represent (x,y) coordinate pairs and need to sum the distance. It’s easy enough to compute the distance: Are you saying you want all combinations (N^2) of

Re: Pairwise Processing of a List

2015-01-25 Thread Sean Owen
If this is really about just Scala Lists, then a simple answer (using tuples of doubles) is: val points: List[(Double,Double)] = ... val distances = for (p1 <- points; p2 <- points) yield { val dx = p1._1 - p2._1 val dy = p1._2 - p2._2 math.sqrt(dx*dx + dy*dy) } distances.sum / 2 It's / 2

Re: Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
Not combinations, linear distances, e.g., given: List[ (x1,y1), (x2,y2), (x3,y3) ], compute the sum of: distance between (x1,y1) and (x2,y2), and distance between (x2,y2) and (x3,y3). Imagine that the list of coordinate points comes from a GPS and describes a trip. - Steve From: Joseph Lust

Re: spark streaming with checkpoint

2015-01-25 Thread Tobias Pfeiffer
Hi, On Tue, Jan 20, 2015 at 8:16 PM, balu.naren balu.na...@gmail.com wrote: I am a beginner to spark streaming, so I have a basic doubt regarding checkpoints. My use case is to calculate the number of unique users by day. I am using reduce by key and window for this, where my window duration is 24
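A minimal sketch of one way to count distinct users over a sliding window (the source and checkpoint directory are hypothetical, and each record in the stream is assumed to be a user id):

```scala
import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(5))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")    // required for windowed state

val users = ssc.socketTextStream("localhost", 9999)   // hypothetical source of user ids

// Distinct users over a sliding 24-hour window, recomputed every 5 minutes.
val uniqueUsers = users
  .countByValueAndWindow(Minutes(24 * 60), Minutes(5))
  .count()

uniqueUsers.print()
ssc.start()
ssc.awaitTermination()
```

Note that a 24-hour window keeps a lot of state in the streaming job, which is part of why this thread settles on writing the raw logs out and counting them in a periodic batch job.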

Re: [SQL] Conflicts in inferred Json Schemas

2015-01-25 Thread Tobias Pfeiffer
Hi, On Thu, Jan 22, 2015 at 2:26 AM, Corey Nolet cjno...@gmail.com wrote: Let's say I have 2 formats for json objects in the same file: schema1 = { "location": "12345 My Lane" } schema2 = { "location": {"houseAddres": "1234 My Lane"} } From my tests, it looks like the current inferSchema() function will

Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
Spark Experts, I've got a list of points: List[(Float, Float)] that represent (x,y) coordinate pairs and need to sum the distance. It's easy enough to compute the distance: case class Point(x: Float, y: Float) { def distance(other: Point): Float = sqrt(pow(x - other.x, 2) + pow(y -

Re: Pairwise Processing of a List

2015-01-25 Thread Sean Owen
(PS the Scala code I posted is a poor way to do it -- it would materialize the entire Cartesian product in memory. You can use .iterator or .view to fix that.) Ah, so you want the sum of distances between successive points. val points: List[(Double,Double)] = ... points.sliding(2).map { case
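Spelled out as a complete snippet of the successive-points approach (plain Scala; the sample coordinates are arbitrary):

```scala
import scala.math.sqrt

val points: List[(Double, Double)] = List((0.0, 0.0), (3.0, 4.0), (3.0, 8.0))

// Sum of distances between successive points (a GPS track), not all pairs.
val tripLength = points.sliding(2).map {
  case List((x1, y1), (x2, y2)) =>
    sqrt((x2 - x1) * (x2 - x1) + (y2 - y1) * (y2 - y1))
}.sum
// 5.0 + 4.0 == 9.0 for the sample points above
```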

Re: Serializability: for vs. while loops

2015-01-25 Thread Tobias Pfeiffer
Aaron, On Thu, Jan 15, 2015 at 5:05 PM, Aaron Davidson ilike...@gmail.com wrote: Scala for-loops are implemented as closures using anonymous inner classes which are instantiated once and invoked many times. This means, though, that the code inside the loop is actually sitting inside a class,

Re: Pairwise Processing of a List

2015-01-25 Thread Tobias Pfeiffer
Sean, On Mon, Jan 26, 2015 at 10:28 AM, Sean Owen so...@cloudera.com wrote: Note that RDDs don't really guarantee anything about ordering though, so this only makes sense if you've already sorted some upstream RDD by a timestamp or sequence number. Speaking of order, is there some reading

Re: Spark webUI - application details page

2015-01-25 Thread Joseph Lust
Perhaps you need to set this in your spark-defaults.conf so that it's already set when your slave/worker processes start. -Joe On 1/25/15, 6:50 PM, ilaxes ila...@hotmail.com wrote: Hi, I've a similar problem. I want to see the detailed logs of Completed Applications so I've set in my program
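A sketch of the corresponding spark-defaults.conf entries (the directory is the one from the message above):

```
spark.eventLog.enabled  true
spark.eventLog.dir      file:/tmp/spark-events
```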

Re: foreachActive functionality

2015-01-25 Thread Reza Zadeh
The idea is to unify the code path for dense and sparse vector operations, which makes the codebase easier to maintain. By handling (index, value) tuples, you can let the foreachActive method take care of checking if the vector is sparse or dense, and running a foreach over the values. On Sun,
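A small illustration of that contract. Note that in Spark 1.2 foreachActive is package-private to Spark, which is why it shows up in MultivariateOnlineSummarizer; this sketch assumes a version where it is callable from user code:

```scala
import org.apache.spark.mllib.linalg.Vectors

val dense  = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// The same closure works for both representations: foreachActive visits
// every index of a dense vector but only the stored entries of a sparse one.
var sum = 0.0
sparse.foreachActive { (index, value) =>
  if (value != 0.0) sum += value
}
```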

Re: graph.inDegrees including zero values

2015-01-25 Thread Ankur Dave
You can do this using leftJoin, as collectNeighbors [1] does: graph.vertices.leftJoin(graph.inDegrees) { (vid, attr, inDegOpt) => inDegOpt.getOrElse(0) } [1] https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala#L145 Ankur On Sun, Jan 25,
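A slightly fuller sketch of the same approach on a toy graph (hypothetical vertex ids and attributes):

```scala
import org.apache.spark.graphx._

// Vertex 3 has no incoming edges, so graph.inDegrees alone would omit it.
val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1)))
val graph    = Graph(vertices, edges)

// leftJoin keeps every vertex; missing in-degrees become 0.
val inDegWithZeros: VertexRDD[Int] =
  graph.vertices.leftJoin(graph.inDegrees) { (vid, attr, inDegOpt) =>
    inDegOpt.getOrElse(0)
  }

inDegWithZeros.collect()  // Array((1,0), (2,1), (3,0)), order may vary
```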

Re: Results never return to driver | Spark Custom Reader

2015-01-25 Thread Harihar Nahak
Hi Yana, As per my custom split code, only three splits are submitted to the system, so three executors are sufficient for that, but it ran 8 executors. The first three executors' logs show exactly the output I want (I did put some syso in the console to debug the code), but the next five have some other

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-25 Thread Michael Armbrust
Yeah, the HiveContext is just a SQLContext that is extended with HQL, access to a metastore, hive UDFs and hive serdes. The query execution however is identical to a SQLContext. On Sun, Jan 25, 2015 at 7:24 AM, Niranda Perera niranda.per...@gmail.com wrote: Thanks Michael. A clarification.
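A minimal illustration of that relationship (the table name is hypothetical):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext is a SQLContext plus HQL parsing, the Hive metastore,
// Hive UDFs and SerDes; the query still goes through Catalyst.
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")
rows.collect().foreach(println)
```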

Re: SVD in pyspark ?

2015-01-25 Thread Chip Senkbeil
Hi Andreas, With regard to the notebook interface, you can use the Spark Kernel ( https://github.com/ibm-et/spark-kernel) as the backend for an IPython 3.0 notebook. The kernel is designed to be the foundation for interactive applications connecting to Apache Spark and uses the IPython 5.0

Re: Spark 1.2 – How to change Default (Random) port ….

2015-01-25 Thread Shailesh Birari
Can anyone please let me know? I don't want to open all ports on the network, so I am interested in the property by which I can configure this new port. Shailesh -- View this message in context:

Re: foreachActive functionality

2015-01-25 Thread DB Tsai
PS, we were using Breeze’s activeIterator originally, as you can see in the old code, but we found there was overhead there, so we wrote our own implementation, which is 4x faster. See https://github.com/apache/spark/pull/3288 for details. Sincerely, DB Tsai