Hi Eric,
Something along the lines of the following should work
val fs = getFileSystem(...) // standard Hadoop API call
val filteredConcatenatedPaths = fs
  .listStatus(topLevelDirPath, pathFilter) // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
  .map(_.getPath.toString)
  .mkString(",")
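For anyone who wants a fuller picture, here is a hedged, self-contained sketch of the same idea; getFileSystem above is pseudocode, and the directory path and suffix filter below are assumptions for illustration, not part of the original snippet.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}

// Sketch only: list the files under a directory that pass a filter,
// then hand Spark a comma-separated list of their paths.
val fs = FileSystem.get(new Configuration())          // or sc.hadoopConfiguration
val topLevelDirPath = new Path("/data/input")         // hypothetical directory
val pathFilter = new PathFilter {
  override def accept(p: Path): Boolean = p.getName.endsWith(".avro")  // hypothetical rule
}
val filteredConcatenatedPaths = fs
  .listStatus(topLevelDirPath, pathFilter)
  .map(_.getPath.toString)
  .mkString(",")
// sc.textFile (and friends) accept a comma-separated list of paths:
// val rdd = sc.textFile(filteredConcatenatedPaths)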
Hi,
I have been able to get the same accuracy with MLlib as Mahout's. The
pre-processing phase of Mahout was the reason behind the accuracy mismatch.
After studying and applying the same logic in my code, it worked like a
charm.
Thanks,
Jatin
-
Novice Big Data Programmer
I started sparkSQL thrift server:
sbin/start-thriftserver.sh
Then I use beeline to connect to it:
bin/beeline
!connect jdbc:hive2://localhost:1 op1 op1
I have created a database for user op1.
create database dw_op1;
And grant all privileges to user op1;
grant all on database dw_op1 to user
How about this.
scala> val rdd2 = rdd.combineByKey(
     |   (v: Int) => v.toLong,
     |   (c: Long, v: Int) => c + v,
     |   (c1: Long, c2: Long) => c1 + c2)
rdd2: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[9] at combineByKey at <console>:14
xj @ Tokyo
On Mon, Sep 15, 2014 at 3:06 PM,
Just for the record, this is being discussed at StackOverflow:
http://stackoverflow.com/questions/25663026/developing-a-spark-streaming-application/25766618
2014-08-27 10:28 GMT+02:00 Filip Andrei andreis.fi...@gmail.com:
Hey guys, so the problem I'm trying to tackle is the following:
- I
Hi Brad and Nick,
Thanks for the comments! I opened a ticket to get a more thorough
explanation of data locality into the docs here:
https://issues.apache.org/jira/browse/SPARK-3526
If you could put any other unanswered questions you have about data
locality on that ticket I'll try to
Hi Andrew,
sorry for the late response. Thank you very much for solving my problem. There
was no APPLICATION_COMPLETE file in log directory due to not calling
sc.stop() at the end of program. With stopping spark context everything
works correctly, so thank you again.
Best regards,
Grzegorz
On Fri,
Hello,
When I remove the line and try to execute sbt run, I end up with the
following lines:
14/09/15 10:11:35 INFO ui.SparkUI: Stopped Spark web UI at http://base:4040
[...]
14/09/15 10:11:05 WARN scheduler.TaskSchedulerImpl: Initial job has not
accepted any resources; check your cluster
Hi Akhil,
So with your config (specifically with set("spark.akka.frameSize", "1000")),
I see the error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Serialized task 0:0 was 401970046 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast
Try:
rdd = sc.broadcast(matrix)
Or
rdd = sc.parallelize(matrix,100) // Just increase the number of slices,
give it a try.
Thanks
Best Regards
On Mon, Sep 15, 2014 at 2:18 PM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi Akhil,
So with your config (specifically with
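For reference, a minimal sketch of how a broadcast variable is typically used here (the matrix contents and slice count are made up); note that sc.broadcast returns a Broadcast handle rather than an RDD, so tasks read it through .value:

// Hypothetical example: ship a large local matrix to the executors once,
// instead of capturing it in every task closure.
val matrix: Array[Array[Double]] = Array.fill(1000, 1000)(1.0)  // stand-in data
val matrixBc = sc.broadcast(matrix)                             // Broadcast[Array[Array[Double]]]

val rowSums = sc.parallelize(0 until matrix.length, 100)        // 100 slices, as suggested above
  .map(i => matrixBc.value(i).sum)                              // tasks read the broadcast value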
Spark SQL supports both SQL and HiveQL, via the separate SQLContext and
HiveContext.
As far as I know, SQLContext in Spark SQL 1.1.0 cannot join three tables
directly.
However, you can rewrite your query with a subquery, such as
SELECT * FROM (SELECT * FROM youhao_data left join youhao_age on
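A hedged sketch of that nested-subquery workaround; youhao_data and youhao_age come from the thread, while youhao_class and all column names are hypothetical:

// Sketch only: join two tables inside a subquery, then join the result with the third table.
val result = sqlContext.sql("""
  SELECT t.id, t.value, t.age, c.label
  FROM (SELECT a.id, a.value, b.age
        FROM youhao_data a LEFT JOIN youhao_age b ON a.id = b.id) t
  LEFT JOIN youhao_class c ON t.id = c.id
""")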
So.. same result with parallelize (matrix,1000)
with broadcast.. seems like I got jvm core dump :-/
14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
host:47978 with 19.2 GB RAM
14/09/15 02:31:22 INFO BlockManagerInfo: Registering block manager
host:43360 with 19.2 GB RAM
Unhandled
Hi Dibyendu,
Thanks for your great work!
I'm new to Spark Streaming, so I just want to make sure I understand Driver
failure issue correctly.
In my use case, I want to make sure that messages coming in from Kafka are
always broken into the same set of RDDs, meaning that if a set of messages
are
Hi Alon,
No, this does not guarantee that the same set of messages will come in the
same RDD. This fix just re-plays the messages from the last processed offset
in the same order. Again, this is just an interim fix we needed to solve our
use case.
If you do not need this message re-play feature, just do not
Thank you guys, I’ll try Parquet and if that’s not quick enough I’ll go the
usual route with either read-only or normal database.
On 13.09.2014, at 12:45, andy petrella andy.petre...@gmail.com wrote:
however, the cache is not guaranteed to remain if other jobs are launched in
the cluster
I'm using Parquet in ADAM, and I can say that it works pretty fine!
Enjoy ;-)
aℕdy ℙetrella
about.me/noootsab
http://about.me/noootsab
On Mon, Sep 15, 2014 at 1:41 PM, Marius Soutier mps@gmail.com wrote:
Thank you guys, I’ll try Parquet and if that’s not
So you are living the dream of using HDFS as a database? ;)
On 15.09.2014, at 13:50, andy petrella andy.petre...@gmail.com wrote:
I'm using Parquet in ADAM, and I can say that it works pretty fine!
Enjoy ;-)
aℕdy ℙetrella
about.me/noootsab
On Mon, Sep 15, 2014 at 1:41 PM, Marius
Hi,
I would like to upgrade a standalone cluster to 1.1.0. What's the best
way to do it? Should I just replace the existing /root/spark folder
with the uncompressed folder from
http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz ? What
about hdfs and other installations?
I have spark
Any tips from anybody on how to do this in PySpark? (Or regular Spark, for
that matter.)
On Sat, Sep 13, 2014 at 1:25 PM, Nick Chammas nicholas.cham...@gmail.com
wrote:
Howdy doody Spark Users,
I’d like to somehow write out a single RDD to multiple paths in one go.
Here’s an example.
I
Hi,
I am wondering whether the vertex active/inactive feature (corresponding to the
change of its value between two supersteps) is available in the Pregel API of
GraphX. If it is not a default setting, how do I enable it in the code below?
def sendMessage(edge: EdgeTriplet[(Int,HashMap[VertexId, Double]), Int]) =
In Spark 1.1.0 I get this error:
2014-09-14 23:17:01 ERROR actor.OneForOneStrategy: Found both
spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
I checked my application; I do not set spark.driver.extraClassPath or
SPARK_CLASSPATH.
SPARK_CLASSPATH is set in spark-env.sh
1.0.1 does not have support for outer joins (added in 1.1). Your query
should be fine in 1.1.
On Mon, Sep 15, 2014 at 5:35 AM, Yanbo Liang yanboha...@gmail.com wrote:
Spark SQL supports both SQL and HiveQL, via the separate SQLContext and
HiveContext.
As far as I know, SQLContext in Spark
Hello Folks,
I am trying to chain a couple of map operations, and it seems the second map
fails with a mismatch in arguments (even though the compiler prints them to
be the same). I checked the function and variable types using :t and they
look OK to me.
Have you seen this earlier? I am posting
Hi
I am trying to perform some read/write file operations in spark. Somehow I
am neither able to write to a file nor read.
import java.io._
val writer = new PrintWriter(new File("test.txt"))
writer.write("Hello Scala")
Can someone please tell me how to perform file I/O in spark.
Looks like another instance of
https://issues.apache.org/jira/browse/SPARK-1199 which was intended to
be fixed in 1.1.0.
I'm not clear whether https://issues.apache.org/jira/browse/SPARK-2620
is the same issue and therefore whether it too is resolved in 1.1?
On Mon, Sep 15, 2014 at 4:37 PM,
Folks,
I understand Spark SQL uses quasiquotes. Does that mean Spark has now moved
to Scala 2.11?
Mohit.
Thank you.
Would the following approaches to address this problem be overkill?
a. create a ServerSocket in a different thread from the main thread that
created the Spark StreamingContext, and listens to shutdown command there
b. create a web service that wraps around the main thread that
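A rough sketch of option (a), assuming a Spark Streaming job; the control port and the polling loop are made up for illustration:

import java.net.ServerSocket
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch: a daemon thread waits for any connection on a control port,
// then flags the main thread to stop the StreamingContext.
val stopRequested = new AtomicBoolean(false)

val controlThread = new Thread(new Runnable {
  def run(): Unit = {
    val server = new ServerSocket(19999)   // hypothetical control port
    try server.accept() finally server.close()
    stopRequested.set(true)
  }
})
controlThread.setDaemon(true)
controlThread.start()

// In the main thread, after ssc.start():
// while (!stopRequested.get) Thread.sleep(1000)
// ssc.stop(stopSparkContext = true)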
Is this code running in an executor? You need to make sure the file is
accessible on ALL executors. One way to do that is to use a distributed
filesystem like HDFS or GlusterFS.
On Mon, Sep 15, 2014 at 8:51 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi
I am trying to perform some
(Adding back the user list)
Boromir says:
Thanks much Sean, verified 1.1.0 does not have this issue.
On Mon, Sep 15, 2014 at 4:47 PM, Sean Owen so...@cloudera.com wrote:
Looks like another instance of
https://issues.apache.org/jira/browse/SPARK-1199 which was intended to
be fixed in 1.1.0.
Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I
see that the file gets created in the master node. But there won't be any
data written to it.
On Mon, Sep 15, 2014 at 10:06 PM, Mohit Jaggi mohitja...@gmail.com wrote:
Is this code running in an executor? You need to
The file gets created on the fly, so I don't know how to make sure that it's
accessible to all nodes.
On Mon, Sep 15, 2014 at 10:10 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I
see that the file gets created in the
But the above APIs are not for HDFS.
On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Yes. I have HDFS. My cluster has 5 nodes. When I run the above commands, I
see that the file gets created in the master node. But, there wont be any
data written to it.
On
No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10.
On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
I understand Spark SQL uses quasiquotes. Does that mean Spark has now
moved to Scala 2.11?
Mohit.
ah...thanks!
On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com
wrote:
No, not yet. Spark SQL is using org.scalamacros:quasiquotes_2.10.
On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
I understand Spark SQL uses quasiquotes. Does that
I came across these APIs in one of the Scala tutorials on the net.
On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi mohitja...@gmail.com wrote:
But the above APIs are not for HDFS.
On Mon, Sep 15, 2014 at 9:40 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Yes. I have HDFS. My cluster
Can you please direct me to the right way of doing this.
On Mon, Sep 15, 2014 at 10:18 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
I came across these APIs in one the scala tutorials over the net.
On Mon, Sep 15, 2014 at 10:14 PM, Mohit Jaggi mohitja...@gmail.com
wrote:
But the
If your underlying filesystem is HDFS, you need to use the HDFS APIs. A Google
search brought up this link, which appears reasonable.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
If you want to use java.io APIs, you have to make sure your filesystem is
accessible from all nodes in your
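As a concrete, hedged illustration of that HDFS-API route (paths and contents are made up), something like the following could replace the java.io PrintWriter from the original question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: write a small text file through the HDFS client API so every node can read it back.
val conf = new Configuration()                     // picks up core-site.xml / hdfs-site.xml on the classpath
val fs = FileSystem.get(conf)
val out = fs.create(new Path("/tmp/test.txt"))     // hypothetical HDFS path
try out.writeBytes("Hello Scala\n") finally out.close()

// Reading it back from any node, e.g. through Spark:
// val rdd = sc.textFile("hdfs:///tmp/test.txt")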
Kartheek,
What exactly are you trying to do? Those APIs are only for local file access.
If you want to access data in HDFS, you’ll want to use one of the reader
methods in org.apache.spark.SparkContext which will give you an RDD (e.g.,
newAPIHadoopFile, sequenceFile, or textFile). If you want
I think 1.1 will be really helpful for you; it's fully compatible with 1.0,
so it's not hard to upgrade to 1.1.
On Mon, Sep 15, 2014 at 2:35 AM, Chengi Liu chengi.liu...@gmail.com wrote:
So.. same result with parallelize (matrix,1000)
with broadcast.. seems like I got jvm core dump :-/
Hi Dibyendu,
I am a little confused about the need for rate limiting input from Kafka. If
the stream coming in from Kafka has a higher messages/second rate than what a
Spark job can process, then it should simply build a backlog in Spark if the
RDDs are cached on disk using persist().
Right?
Thanks,
In PySpark, I think you could enumerate all the valid files, create an RDD for
each with newAPIHadoopFile(), and then union them together.
On Mon, Sep 15, 2014 at 5:49 AM, Eric Friedman
eric.d.fried...@gmail.com wrote:
I neglected to specify that I'm using pyspark. Doesn't look like these APIs
have been
Maybe we should provide an API like saveTextFilesByKey(path); could you create
a JIRA for it?
There is one in DPark [1], actually.
[1] https://github.com/douban/dpark/blob/master/dpark/rdd.py#L309
On Mon, Sep 15, 2014 at 7:08 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Any tips
Hi Tim,
I have not tried persist the RDD.
There is some discussion of rate limiting Spark Streaming in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-rate-limiting-from-kafka-td8590.html
There is a Pull Request
Hi ladies and gents,
trying to get Thrift server up and running in an effort to replace Shark.
My first attempt to run sbin/start-thriftserver resulted in:
14/09/15 17:09:05 ERROR TThreadPoolServer: Error occurred during processing
of message.
java.lang.RuntimeException:
Hi All, I have transformed the data into the following format: the first
column is the user id, and all the other columns are class ids. For a user,
only the class ids that appear in this row have value 1 and the others are 0.
I need to create a sparse vector from this. Does the API for creating a sparse
Here is an example of working code that takes a CSV with lat/lon points and
intersects them with polygons of the municipalities of Mexico, generating a
new version of the file with new attributes.
Do you think that could be improved?
Thanks.
The Code:
import org.apache.spark.SparkContext
import
I have a use case for our data in HDFS that involves sorting chunks of data
into time series format by a specific characteristic and doing computations
from that. At large scale, what is the most efficient way to do this?
Obviously, having the data sharded by that characteristic would make the
What we did for gracefully shutting down the Spark Streaming context is to
extend a Spark Web UI Tab and perform a
SparkContext.SparkUI.attachTab(custom web ui). However, the custom Scala
Web UI extensions need to be under the package org.apache.spark.ui to get
around the package access
Scala 2.11 work is under way in open pull requests though, so hopefully it will
be in soon.
Matei
On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:
ah...thanks!
On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote:
No, not yet. Spark SQL is
Are we going to put 2.11 support into 1.1 or 1.0? Else "will be in soon"
applies to the master development branch, but actually being in the Spark 1.2.0
release won't occur until the second half of November at the earliest.
On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Davies,
That’s pretty neat. I heard there was a pure Python clone of Spark out
there—so you were one of the people behind it!
I’ve created a JIRA issue about this. SPARK-3533: Add saveAsTextFileByKey()
method to RDDs https://issues.apache.org/jira/browse/SPARK-3533
Sean,
I think you might be
Hi Sameer,
MLLib uses Breeze’s vector format under the hood. You can use that.
http://www.scalanlp.org/api/breeze/index.html#breeze.linalg.SparseVector
For example:
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
val numClasses = classes.distinct.count.toInt
val
Hi all,
I have an RDD that contains around 50 columns. I need to sum each column,
which I am doing by running it through a for loop, creating an array and
running the sum function as follows:
for (i <- 0 until 10) yield {
  data.map(x => x(i)).sum
}
Is there a better way to do this?
thanks,
That's a good idea and one I had considered too. Unfortunately I'm not
aware of an API in PySpark for enumerating paths on HDFS. Have I
overlooked one?
On Mon, Sep 15, 2014 at 10:01 AM, Davies Liu dav...@databricks.com wrote:
In PySpark, I think you could enumerate all the valid files, and
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui
On Mon, Sep 15, 2014 at 1:00 PM, jamborta jambo...@gmail.com wrote:
Hi all,
I have an RDD that contains around 50 columns. I need to sum each column,
which I am doing by running it through a for loop, creating an
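A hedged sketch of the colStats approach (assuming each row can be turned into a dense mllib Vector); the per-column sums fall out of the mean and the row count in a single pass:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Hypothetical: `data` is an RDD of rows, each a Seq[Double] with ~50 columns.
val data = sc.parallelize(Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0, 6.0)))

val summary = Statistics.colStats(data.map(row => Vectors.dense(row.toArray)))
val columnSums = summary.mean.toArray.map(_ * summary.count)  // one job instead of one per column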
I am seeing the same exact behavior. Shrikar, did you get any response to
your post?
Varad
sc.textFile takes a minimum number of partitions to use.
Is there a way to get sc.newAPIHadoopFile to do the same?
I know I can repartition() and get a shuffle. I'm wondering if there's a
way to tell the underlying InputFormat (AvroParquet, in my case) how many
partitions to use at the outset.
What
I think the reason is simply that there is no longer an explicit
min-partitions argument for Hadoop InputSplits in the new Hadoop APIs.
At least, I didn't see it when I glanced just now.
However, you should be able to get the same effect by setting a
Configuration property, and you can do so
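A hedged sketch of what that could look like; the split-size property is the standard mapreduce one, and the path, value, and input format are assumptions (the thread's actual format is AvroParquetInputFormat):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Cap the split size so the new-API input format produces more (smaller) splits,
// and hence more partitions in the resulting RDD.
val conf = new Configuration(sc.hadoopConfiguration)
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)  // 64 MB (assumption)

val rdd = sc.newAPIHadoopFile(
  "hdfs:///data/input",        // hypothetical path
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)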
I have a pretty simple Scala Spark aggregation job that sums up the number of
occurrences of two types of events. I have run into situations where it
seems to generate bad values that are clearly incorrect after reviewing the
raw data.
First I have a Record object which I use to do my
At 2014-09-15 16:25:04 +0200, Yifan LI iamyifa...@gmail.com wrote:
I am wondering if the vertex active/inactive(corresponding the change of its
value between two supersteps) feature is introduced in Pregel API of GraphX?
Vertex activeness in Pregel is controlled by messages: if a vertex did
It isn't a question of an item being reduced twice, but of when
objects may be reused to represent other items.
I don't think you have a guarantee that you can safely reuse the
objects in this argument, but I'd also be interested if there was a
case where this is guaranteed.
For example I'm
There is one way to do it via bash: hadoop fs -ls. Maybe you could end up with
a bash script to do this.
On Mon, Sep 15, 2014 at 1:01 PM, Eric Friedman
eric.d.fried...@gmail.com wrote:
That's a good idea and one I had considered too. Unfortunately I'm not
aware of an API in PySpark
Or maybe you could give this one a try:
https://labs.spotify.com/2013/05/07/snakebite/
On Mon, Sep 15, 2014 at 2:51 PM, Davies Liu dav...@databricks.com wrote:
There is one way by do it in bash: hadoop fs -ls , maybe you could
end up with a bash scripts to do the things.
On Mon, Sep 15,
Spark doesn't support MultipleOutput at this time. You can cache the
parent RDD. Then create RDDs from it and save them separately.
-Xiangrui
On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
I would like to define the names of my output in Spark, I have a process
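A minimal sketch of the cache-then-save-separately approach just described, assuming a pair RDD and a small, known set of keys (all names and paths are hypothetical):

// Hypothetical: split one keyed RDD into one output directory per key.
val keyed = sc.parallelize(Seq(("a", "line1"), ("b", "line2"), ("a", "line3")))

keyed.cache()                                      // avoid recomputing the parent for every key
val keys = keyed.keys.distinct().collect()         // assumes the key set is small

for (k <- keys) {
  keyed.filter { case (key, _) => key == k }
       .values
       .saveAsTextFile(s"/tmp/output/$k")          // hypothetical base path
}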
Thanks for the update! -Xiangrui
On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet jatinpr...@gmail.com wrote:
Hi,
I have been able to get the same accuracy with MLlib as Mahout's. The
pre-processing phase of Mahout was the reason behind the accuracy mismatch.
After studying and applying the
That would be awesome, but doesn't seem to have any effect.
In PySpark, I created a dict with that key and a numeric value, then passed
it into newAPIHadoopFile as a value for the conf keyword. The returned
RDD still has a single partition.
On Mon, Sep 15, 2014 at 1:56 PM, Sean Owen
Or you can use the factory method `Vectors.sparse`:
val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0)))
where numProducts should be the largest product id plus one.
Best,
Xiangrui
On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore cdg...@cdgore.com wrote:
Hi Sameer,
MLLib uses
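Putting it together for the original question, a hedged sketch of building one sparse vector per user row; the row layout, class count, and ids are assumptions:

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical row format: Array(userId, classId1, classId2, ...), all zero-based Ints.
val rows = sc.parallelize(Seq(Array(7, 0, 3, 5), Array(8, 1, 3)))

val numClasses = 6                                 // largest class id plus one (assumption)
val userVectors = rows.map { row =>
  val userId = row.head
  val indices = row.tail.distinct.sorted           // class ids that are "1" for this user
  (userId, Vectors.sparse(numClasses, indices.map(i => (i, 1.0))))
}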
Heh, it's still just a suggestion to Hadoop I guess, not guaranteed.
Is it a splittable format? for example, some compressed formats are
not splittable and Hadoop has to process whole files at a time.
I'm also not sure if this is something to do with pyspark, since the
underlying Scala API takes
Hi,
I'm running Spark tasks with speculation enabled. I'm noticing that Spark
seems to wait in a given stage for all stragglers to finish, even though
the speculated alternative might have finished sooner. Is that correct?
Is there a way to indicate to Spark not to wait for stragglers to finish?
Hi All,
I am trying to submit a spark job that I have built in maven using the
following command:
/usr/bin/spark-submit --deploy-mode client --class com.spark.TheMain
--master local[1] /home/cloudera/myjar.jar 100
But I seem to be getting the following error:
Exception in thread main
This is more of a Java / Maven issue than Spark per se. I would use
the shade plugin to remove signature files in your final META-INF/
dir. As Spark does, in its configuration:
<filters>
  <filter>
    <artifact>*:*</artifact>
    <excludes>
      <exclude>org/datanucleus/**</exclude>
Probably worth noting that the factory methods in mllib create an object of
type org.apache.spark.mllib.linalg.Vector which stores data in a similar format
as Breeze vectors
Chris
On Sep 15, 2014, at 3:24 PM, Xiangrui Meng men...@gmail.com wrote:
Or you can use the factory method
Hi everyone,
I'm looking to implement Markov algorithms in GraphX and I'm wondering if
it's already possible to auto-convert the Graph into a Sparse Double Matrix?
I've seen this implemented in other graphs before, namely JUNG, but still
familiarizing myself with GraphX. Example:
There is a parameter spark.speculation that is turned off by default. Look at
the configuration doc: http://spark.apache.org/docs/latest/configuration.html
From: Pramod Biligiri pramodbilig...@gmail.com
Date: Monday, September 15, 2014 at 3:30 PM
To:
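For reference, a hedged sketch of turning speculation on when building the SparkConf (the multiplier and quantile shown are just the documented defaults):

import org.apache.spark.SparkConf

// Sketch: enable speculative execution for the whole application.
val conf = new SparkConf()
  .setAppName("speculation-example")               // hypothetical app name
  .set("spark.speculation", "true")
  .set("spark.speculation.multiplier", "1.5")      // how much slower than the median before re-launching
  .set("spark.speculation.quantile", "0.75")       // fraction of tasks that must finish first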
Yes, it's AvroParquetInputFormat, which is splittable. If I force a
repartitioning, it works. If I don't, Spark chokes on my not-terribly-large
250 MB files.
PySpark's documentation says that the dictionary is turned into a
Configuration object.
@param conf: Hadoop configuration, passed in as a
In Spark 1.1, I'm seeing tasks with callbacks that don't involve my code at
all!
I'd seen something like this before in 1.0.0, but the behavior seems to be
back
apply at Option.scala:120
http://localhost:4040/stages/stage?id=52&attempt=0
I'm already running with speculation set to true and the speculated tasks
are launching, but the issue I'm observing is that Spark does not kill the
long running task even if the shorter alternative has finished
successfully. Therefore the overall turnaround time is still the same as
without
I think the current plan is to put it in 1.2.0, so that's what I meant by
soon. It might be possible to backport it too, but I'd be hesitant to do that
as a maintenance release on 1.1.x and 1.0.x since it would require nontrivial
changes to the build that could break things on Scala 2.10.
Hi Koert,
I work on Bigtop and CDH packaging and you are right, based on my quick
glance, it doesn't seem to be used.
Mark
From: Koert Kuipers ko...@tresata.com
Date: Sat, Sep 13, 2014 at 7:03 AM
Subject: SPARK_MASTER_IP
To: user@spark.apache.org
a grep for SPARK_MASTER_IP shows that
It's true that it does not send a kill command right now -- we should probably
add that. This code was written before tasks were killable AFAIK. However, the
*job* should still finish while a speculative task is running as far as I know,
and it will just leave that task behind.
Matei
On
Dear List,
I'm having trouble getting Spark 1.1 to use the Hadoop 2 API for
reading SequenceFiles. In particular, I'm seeing:
Exception in thread main org.apache.hadoop.ipc.RemoteException:
Server IPC version 7 cannot communicate with client version 4
at
Hi Paul.
I would recommend building your own 1.1.0 distribution.
./make-distribution.sh --name hadoop-personal-build-2.4 --tgz -Pyarn
-Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
I downloaded the pre-built for Hadoop 2.4 binary, and it had this strange
behavior where
Okay, that's consistent with what I was expecting. Thanks, Matei.
On Mon, Sep 15, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
I think the current plan is to put it in 1.2.0, so that's what I meant by
soon. It might be possible to backport it too, but I'd be hesitant to do
What's your Spark / Hadoop version? And also the hive-site.xml? Most cases
like that are caused by an incompatibility between the Hadoop client jar and
the Hadoop cluster.
-Original Message-
From: linkpatrickliu [mailto:linkpatrick...@live.com]
Sent: Monday, September 15, 2014 2:35 PM
To:
Hi:
I have a dataset with the structure [id, driverAge, TotalKiloMeter, Emissions,
date, KiloMeter, fuel], and the data looks like this:
[1-980,34,221926,9,2005-2-8,123,14]
[1-981,49,271321,15,2005-2-8,181,82]
[1-982,36,189149,18,2005-2-8,162,51]
[1-983,51,232753,5,2005-2-8,106,92]
Hey Mark,
do you think that this is on purpose, or is it an omission? Thanks, Koert
On Mon, Sep 15, 2014 at 8:32 PM, Mark Grover m...@apache.org wrote:
Hi Koert,
I work on Bigtop and CDH packaging and you are right, based on my quick
glance, it doesn't seem to be used.
Mark
From: Koert
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int,
fuel:Int)
1. Create a PairedRDD of (age, Car) tuples (pairedRDD)
2. Create a new function fc
//returns the interval lower and upper bound
def fc(x:Int, interval:Int) : (Int,Int) = {
val floor = x - (x%interval)
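A hedged, self-contained completion of that sketch; the interval width, the return value of fc, and the aggregation at the end are assumptions:

import java.util.Date

case class Car(id: String, age: Int, tkm: Int, emissions: Int, date: Date, km: Int, fuel: Int)

// Returns the (lower, upper) bound of the interval containing x, e.g. fc(34, 10) == (30, 40).
def fc(x: Int, interval: Int): (Int, Int) = {
  val floor = x - (x % interval)
  (floor, floor + interval)
}

// Hypothetical usage: key each car by its driver-age interval and average emissions per bucket.
// val cars: org.apache.spark.rdd.RDD[Car] = ...
// val avgEmissionsByAgeBucket = cars.map(c => (fc(c.age, 10), c.emissions))
//   .groupByKey()
//   .mapValues(es => es.sum.toDouble / es.size)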
Hi Matei,
Thanks for your reply.
The Writable classes have never been serializable and this is why it is weird.
I did try as you suggested to map the Writables to integers and strings. It
didn’t pass, either. Similar exceptions were thrown except that the messages
became IntWritable, Text are
Hi, Hao Cheng,
Here is the Spark\Hadoop version:
Spark version = 1.1.0
Hadoop version = 2.0.0-cdh4.6.0
And hive-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ns</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>ns</value>
  </property>
The Hadoop client jar should be assembled into the uber-jar, but (I suspect)
it's probably not compatible with your Hadoop Cluster.
Can you also paste the Spark uber-jar name? Usually will be under the path
lib/spark-assembly-1.1.0-xxx-hadoopxxx.jar.
-Original Message-
From:
Hi, Hao Cheng,
This is my spark assembly jar name:
spark-assembly-1.1.0-hadoop2.0.0-cdh4.6.0.jar
I compiled spark 1.1.0 with following cmd:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Dhadoop.version=2.0.0-cdh4.6.0 -Phive -Pspark-ganglia-lgpl -DskipTests
Hi, I want to set the executor number to 16, but it is very strange that
executor cores may affect the executor number on Spark on YARN. I don't know
why, or how to set the executor number.
=
./bin/spark-submit --class com.hequn.spark.SparkJoins \
--master
I wonder what algorithm is used to implement sortByKey? I assume it is some
O(n*log(n)) parallelized on x number of nodes, right?
Then, what size of data would make it worthwhile to use sortByKey on
multiple processors rather than use standard Scala sort functions on a
single processor
Can you post the exact code for the test that worked in 1.0? I can't think of
much that could've changed. The one possibility is if we had some operations
that were computed locally on the driver (this happens with things like first()
and take(), which will try to do the first partition
Sorry, I am not able to reproduce that.
Can you try adding the following entries to hive-site.xml? I know they have
default values, but let's make them explicit.
hive.server2.thrift.port
hive.server2.thrift.bind.host
hive.server2.authentication (NONE, KERBEROS, LDAP, PAM, or CUSTOM)
sortByKey is indeed O(n log n): it does a first pass to figure out even-sized
partitions (by sampling the RDD), then a second pass to do a distributed
merge-sort (first partition the data on each machine, then run a reduce phase
that merges the data for each partition). The point where it becomes
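A minimal usage sketch of the range-partitioned sort being described (the data is made up):

// sortByKey samples the keys to build even-sized ranges, then sorts within each partition.
val pairs = sc.parallelize(Seq(("banana", 2), ("apple", 1), ("cherry", 3)))
val sorted = pairs.sortByKey(ascending = true, numPartitions = 4)
// Each output partition now holds a contiguous, sorted key range.
sorted.collect().foreach(println)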