Adding a hadoop-2.6 profile is not necessary. Use hadoop-2.4, which
already exists and is intended for 2.4+. In fact this declaration is
missing things that Hadoop 2 needs.
On Thu, Dec 18, 2014 at 3:46 AM, Kyle Lin kylelin2...@gmail.com wrote:
Hi there
The following are my steps. And got the
I could not compile the Spark 1.1.0 source code on CDH4.3.0 with YARN successfully.
Does it support CDH4.3.0 with YARN?
And will Spark 1.2.0 support CDH5.1.2?
Well, it's always a good idea to use matched binary versions. Here it
is more acutely necessary. You can use a pre-built binary -- if you
use it to compile and also run. Why does it not make sense to publish
artifacts?
Not sure what you mean about core vs assembly, as the assembly
contains all
@Rui do you mean the spark-core jar in the Maven central repo
is incompatible with the same version of the official pre-built Spark
binary? That's really weird. I thought they should have used the same code.
Best Regards,
Shixiong Zhu
2014-12-18 17:22 GMT+08:00 Sean Owen
The question is really: will Spark 1.1 work with a particular version
of YARN? Many, but not all, versions of YARN are supported. The
stable versions (2.2.x+) are supported. Before that, support is patchier,
and in fact has been removed in Spark 1.3.
The yarn profile supports YARN stable, which is about
Hi,
I have the following code in my application:
tmpRdd.foreach(item => {
  println("abc: " + item)
})
tmpRdd.foreachPartition(iter => {
  iter.map(item => {
    println("xyz: " + item)
  })
})
In the output, I see only the "abc" prints
Hi, Sean
Thank you for your reply. I will try to use Spark 1.1 and 1.2 on CDH5.X.
:)
2014-12-18 17:38 GMT+08:00 Sean Owen so...@cloudera.com:
The question is really: will Spark 1.1 work with a particular version
of YARN? many, but not all versions of YARN are supported. The
stable
Have a look at https://issues.apache.org/jira/browse/SPARK-2075
It's not quite that the API is different, but indeed building
different 'flavors' of the same version (hadoop1 vs 2) can strangely
lead to this problem, even though the public API is identical and in
theory the API is completely
Hi again,
On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote:
tmpRdd.foreachPartition(iter => {
  iter.map(item => {
    println("xyz: " + item)
  })
})
Uh, with iter.foreach(...) it works... the reason being apparently that
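(A minimal sketch of why, assuming tmpRdd is an RDD[String]: Iterator.map is lazy, so the mapped iterator built inside foreachPartition is never consumed and the println never runs. Consuming the iterator eagerly does the work:)
tmpRdd.foreachPartition(iter => {
  // iter.map(...) only builds a lazy iterator; nothing executes until it
  // is consumed. foreach consumes the elements, so the println actually runs.
  iter.foreach(item => println("xyz: " + item))
})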
I did not encounter this with my Avro records using Spark 1.1.0 (see
https://github.com/medale/spark-mail/blob/master/analytics/src/main/scala/com/uebercomputing/analytics/basic/UniqueSenderCounter.scala).
I do use the default Java serialization but all the fields in my Avro
object are
Quick follow-up: this works sweetly with spark-1.1.1-bin-hadoop2.4.
On Dec 3, 2014, at 3:31 PM, Ian Wilkinson ia...@me.com wrote:
Hi,
I'm trying the Elasticsearch support for Spark (2.1.0.Beta3).
In the following I provide the query (as query dsl):
import org.elasticsearch.spark._
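(A minimal sketch of this kind of query, assuming elasticsearch-spark 2.1.0.Beta3 is on the classpath; the index/type name and the query itself are hypothetical:)
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("es-query").set("es.nodes", "localhost:9200")
val sc = new SparkContext(conf)
// esRDD takes an index/type and a query-DSL JSON string, and returns an
// RDD of (document id, field map) pairs.
val query = """{"query": {"match": {"title": "spark"}}}"""
val docs = sc.esRDD("myindex/mytype", query)
docs.take(10).foreach(println)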
Yes, although once you have multiple ClassLoaders, you are operating
as if in multiple JVMs for most intents and purposes. I think the
request for this kind of functionality comes from use cases where
multiple ClassLoaders wouldn't work, like, wanting to have one app (in
one ClassLoader) managing
Hi,
I’m getting some seemingly invalid results when I collect an RDD. This is
happening in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac.
See the following code snippet:
JavaRDD<Thing> rdd = pairRDD.values();
rdd.foreach( e -> System.out.println( "RDD Foreach: " + e ) );
rdd.collect().forEach( e ->
Hi all, in yarn-cluster mode, can we have the driver run on a specific
machine that we choose in the cluster? Or even on a machine not in the cluster?
NP man,
The thing is that since you're in a distributed env, it'd be cumbersome to do
that. Remember that Spark basically works on blocks/partitions; they are the
unit of distribution and parallelization.
That means that actions have to be run against them **after having been
scheduled on the cluster**.
The
It sounds a lot like your values are mutable classes and you are
mutating or reusing them somewhere? It might work until you actually
try to materialize them all and find that many point to the same object.
On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers tris...@blackfrog.org wrote:
Hi,
I’m
Hi Andy,
Thanks again for your thoughts on this. I haven't found much information
about the internals of Spark, so I find these kinds of explanations about
its low-level mechanisms very useful and interesting. It's also nice to know
that the two-pass approach is a viable solution.
Regards,
Juan
Another pointer for starting to play with updateStateByKey is the
StatefulNetworkWordCount example. See the streaming examples directory in
the Spark repository.
TD
On Thu, Dec 18, 2014 at 6:07 AM, Pierce Lamb
richard.pierce.l...@gmail.com wrote:
I am trying to run stateful Spark Streaming
A more updated version of the streaming programming guide is here
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
Please refer to this until we make the official release of Spark 1.2
TD
On Tue, Dec 16, 2014 at 3:50 PM, smallmonkey...@hotmail.com
Suspected the same thing, but because the underlying data classes are
deserialised by Avro I think they have to be mutable as you need to provide
the no-args constructor with settable fields.
Nothing is being cached in my code anywhere, and this can be reproduced
using data directly out of the
Being mutable is fine; reusing and mutating the objects is the issue.
And yes the objects you get back from Hadoop are reused by Hadoop
InputFormats. You should just map the objects to a clone before using
them where you need them to exist all independently at once, like
before a collect().
(That
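(A minimal sketch of that defensive-copy pattern, assuming records is an RDD[GenericRecord] read from an Avro InputFormat; Avro's deepCopy is one way to clone:)
import org.apache.avro.generic.GenericRecord
import org.apache.avro.specific.SpecificData

// The InputFormat reuses one object per record, so clone each record before
// any operation that needs them all alive independently (e.g. collect()).
val copies = records.map(r => SpecificData.get().deepCopy(r.getSchema, r))
val materialized = copies.collect()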
Hi,
I have a simple app where I am trying to create a table. I am able to create
the table when running the app in yarn-client mode, but not in yarn-cluster mode.
Is this a known issue? Has it already been fixed?
Please note that I am using spark-1.1 over hadoop-2.4.0
App:
-
import
Owen,
Since we have individual module jars published into the central Maven repo for
an official release, we need to make sure the official Spark assembly jar
is assembled exactly from these jars, so there will be no binary
compatibility issue. We can also publish the official
Hi, I had the same problem.
One option (starting with Spark 1.2, which is currently in preview) is to
use the Avro library for Spark SQL.
Another option is using Kryo serialization.
By default Spark uses Java serialization; you can specify Kryo
serialization while creating the Spark context.
val conf = new
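(A minimal sketch of the Kryo setup; MyAvroRecord is a hypothetical class, and registerKryoClasses is available from Spark 1.2 -- on 1.1 you would register classes through a custom KryoRegistrator instead:)
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Switch from the default Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes is optional but keeps the serialized form compact.
  .registerKryoClasses(Array(classOf[MyAvroRecord]))
val sc = new SparkContext(conf)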
Recording the outcome here for the record. Based on Sean’s advice I’ve
confirmed that making defensive copies of records that will be collected
avoids this problem - it does seem like Avro is being a bit too aggressive
when deciding it’s safe to reuse an object for a new record.
On 18 December
Hi,
This was all my fault. It turned out I had a line of code buried in a
library that did a repartition. I used this library to wrap an RDD to
present it to legacy code as a different interface. That's what was causing
the data to spill to disk.
The really stupid thing is it took me the better
Hi,
I'm trying to use pyspark to save a simple rdd to a text file (code below),
but it keeps throwing an error.
- Python Code -
items = ["Hello", "world"]
items2 = sc.parallelize(items)
items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv')
- Error
It seems you are missing HADOOP_HOME in the environment. As it says:
java.io.IOException: Could not locate executable *null*\bin\winutils.exe in
the Hadoop binaries.
That null is supposed to be your HADOOP_HOME.
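(A minimal sketch of the usual Windows workaround; the path is hypothetical: put a winutils.exe matching your Hadoop version under %HADOOP_HOME%\bin, then set the variable before launching pyspark:)
REM winutils.exe lives in C:\hadoop\bin
set HADOOP_HOME=C:\hadoop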
Thanks
Best Regards
On Thu, Dec 18, 2014 at 7:10 PM, mj jone...@gmail.com wrote:
I'm running a very simple Spark application that downloads files from S3,
does a bit of mapping, then uploads new files. Each file is roughly 2MB
and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not
having any download speed issues (Amazon's EMR provides a custom
Hi Pierce,
You shouldn’t have to use groupByKey because updateStateByKey will get a
Seq of all the values for that key already.
I used that for realtime sessionization as well. What I did was key my
incoming events, then send them to updateStateByKey. The updateStateByKey
function then
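(A minimal sessionization sketch along those lines; the input format, Session class, and timeout are all hypothetical:)
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Session(events: Seq[String], lastSeen: Long)

val conf = new SparkConf().setAppName("sessionization")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/checkpoints") // required by stateful operations

// Key each incoming "userId,event" line by user id.
val events = ssc.socketTextStream("localhost", 9999)
  .map(_.split(",")).map(f => (f(0), f(1)))

val timeoutMs = 30 * 60 * 1000L

// updateStateByKey hands us the Seq of new values for a key plus the old
// state; returning None drops the key, which is how timed-out sessions
// get cleaned up.
val sessions = events.updateStateByKey[Session] {
  (newEvents: Seq[String], state: Option[Session]) =>
    val now = System.currentTimeMillis()
    state match {
      case Some(s) if newEvents.isEmpty && now - s.lastSeen > timeoutMs => None
      case Some(s) if newEvents.isEmpty => Some(s)
      case Some(s) => Some(Session(s.events ++ newEvents, now))
      case None => Some(Session(newEvents, now))
    }
}

sessions.print()
ssc.start()
ssc.awaitTermination()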
Is there a planned release date for Spark 1.2? I saw on the Spark Wiki
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we
are already in the latter part of the release window.
It’s on Maven Central already http://search.maven.org/#browse%7C717101892
On 12/18/14, 2:09 PM, Al M alasdair.mcbr...@gmail.com wrote:
Is there a planned release date for Spark 1.2? I saw on the Spark Wiki
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we are
Hi guys.
I run the following command to launch a new cluster:
./spark-ec2 -k test -i test.pem -s 1 --vpc-id vpc-X --subnet-id
subnet-X launch vpc_spark
The instances started OK, but the command never ends, with the following
output:
Setting up security groups...
Searching for existing
Soon enough :)
http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-td9815.html
Hi all,
I have a problem with LogisticRegressionWithSGD. When I train a data set
with one variable (which is an amount of an item) and an intercept, I get
weights of (-0.4021, -207.1749) for the two features, respectively. This
doesn't make sense to me because I ran a logistic regression for the same
Hi,
I am building a Spark-based service which requires initialization of a
SparkContext in a main():
def main(args: Array[String]) {
  val conf = new SparkConf(false)
    .setMaster("spark://foo.example.com:7077")
    .setAppName("foobar")
  val sc = new SparkContext(conf)
  val rdd =
Awesome. Thanks!
Are you sure this is an apples-to-apples comparison? For example, does your
SAS process normalize or otherwise transform the data first?
Is the optimization configured similarly in both cases -- same
regularization, etc.?
Are you sure you are pulling out the intercept correctly? It is a separate
You can build a jar of your project and add it to the SparkContext
(sc.addJar("/path/to/your/project.jar")); then it will get shipped to the
workers and hence no ClassNotFoundException!
Thanks
Best Regards
On Thu, Dec 18, 2014 at 10:06 PM, Akshat Aranya aara...@gmail.com wrote:
Hi,
I am building
Thanks I will try.
From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, December 18, 2014 16:24
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression
Can you try LogisticRegressionWithLBFGS? I verified that this will
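(A minimal sketch of switching to LBFGS, assuming training is an RDD[LabeledPoint]:)
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint

// LBFGS usually converges much more reliably than SGD for logistic regression.
val model = new LogisticRegressionWithLBFGS()
  .setIntercept(true)
  .run(training)
println(s"weights: ${model.weights}, intercept: ${model.intercept}")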
It is implemented in the same way as Hive and interoperates with the Hive
metastore. In 1.2 we are considering adding partitioning to the Spark SQL
data source API as well. However, for now, you should create a Hive
context and a partitioned table. Spark SQL will automatically select
partitions
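(A minimal sketch of that approach; the table name and partition column are hypothetical:)
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// A Hive-managed table partitioned by date.
hiveContext.sql(
  "CREATE TABLE IF NOT EXISTS events (id INT, payload STRING) PARTITIONED BY (dt STRING)")
// A query with a predicate on dt only reads the matching partitions.
hiveContext.sql("SELECT count(*) FROM events WHERE dt = '2014-12-18'")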
Hi All,
Wondering, when caching a table backed by lzo-compressed Parquet data, whether
Spark also compresses it (using lzo/gzip/snappy) along with the column-level
encoding, or just does the column-level encoding, when
*spark.sql.inMemoryColumnarStorage.compressed*
is set to true. This is because when I
This produces the expected output, thank you!
On Thu, Dec 18, 2014 at 12:11 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Ok, I have a better idea of what you’re trying to do now.
I think the problem might be the map. The first time the function runs,
currentValue will be None. Using
Great, glad it worked out! Just keep an eye on memory usage as you roll it
out. Like I said before, if you’ll be running this 24/7 consider cleaning
up sessions by returning None after some sort of timeout.
On 12/18/14, 8:25 PM, Pierce Lamb richard.pierce.l...@gmail.com wrote:
This produces
Hi Spark users,
I wonder if val resultRDD = RDDA.union(RDDB) will always have records in
RDDA before records in RDDB.
Also, will resultRDD.coalesce(1) change this ordering?
Best Regards,
Jerry
Hey Akshat,
What is the class that is not found, is it a Spark class or classes that
you define in your own application? If the latter, then Akhil's solution
should work (alternatively you can also pass the jar through the --jars
command line option in spark-submit).
If it's a Spark class,
Hi All,
I am wondering what the best way is to remove transitive edges with a maximum
spanning tree. For example,
Edges:
1 - 2 (30)
2 - 3 (30)
1 - 3 (25)
where the number in parentheses is the weight of each edge.
Then, I'd like to get the reduced edge graph after transitive reduction,
considering the
We have a very large RDD and I need to create a new RDD whose values are
derived from each record of the original RDD, and we only retain the few new
records that meet a criterion. I want to avoid creating a second large RDD
and then filtering it, since I believe this could tax system resources
Hmmm, how to do that? You mean create an RDD for each file? Then I will have
tons of RDDs.
And my calculation needs to rely on other input, not just the file itself.
Can you show some pseudo code for that logic?
Regards,
Shuai
From: Diego García Valverde [mailto:dgarci...@agbar.es]
You mean the Spark User List? It's pretty easy; check the first email, it
has all the instructions.
On 18 December 2014 at 21:56, csjtx1021 [via Apache Spark User List]
ml-node+s1001560n20759...@n3.nabble.com wrote:
i want to join you
Hi Ted,
I've no idea what Transitive Reduction is, but you can achieve the expected
result with the graph.subgraph(graph.edges.filter()) syntax, which filters
edges by their weight and gives you a new graph per your condition.
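(A minimal sketch with GraphX, assuming graph is a Graph[Int, Int] whose edge attribute is the weight:)
import org.apache.spark.graphx._

// Keep only edges whose weight is at least 30; vertices are kept by default.
val filtered = graph.subgraph(epred = triplet => triplet.attr >= 30)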
On 19 December 2014 at 11:11, Tae-Hyuk Ahn [via Apache Spark User List]
Hi,
I am using Spark 1.1.1 on Yarn. When I try to run K-Means, I see from the Yarn
dashboard that only 3 containers are being used. How do I increase the number
of containers used?
P.S: When I run K-Means on Mahout with the same settings, I see that there are
25-30 containers being
I don't think you can avoid examining each element of the RDD, if
that's what you mean. Your approach is basically the best you can do
in general. You're not making a second RDD here, and even if you did
this in two steps, the second RDD is really more bookkeeping than
a second huge data
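(A minimal sketch of that derive-and-filter-in-one-pass approach; transform and meetsCriterion are hypothetical:)
// flatMap derives and filters in a single pass, so no intermediate RDD of
// all derived records is ever materialized as a separate dataset.
val result = bigRdd.flatMap { record =>
  val derived = transform(record)
  if (meetsCriterion(derived)) Some(derived) else None
}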
Hi Suman,
I'll assume that you are using spark-submit to run your application. You
can pass the --num-executors flag to ask for more containers. If you want
to allocate more memory for each executor, you may also pass in the
--executor-memory flag (this accepts a string in the format 1g, 512m
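(A minimal sketch of such an invocation; the class name, jar, and sizes are hypothetical:)
spark-submit \
  --master yarn-cluster \
  --num-executors 25 \
  --executor-memory 2g \
  --class com.example.KMeansApp \
  myapp.jar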
Hi Jay,
Please try increasing executor memory (if the available memory is more
than 2GB) and reduce numBlocks in ALS. The current implementation
stores all subproblems in memory and hence the memory requirement is
significant when k is large. You can also try reducing k and see
whether the
Thanks dbtsai for the info.
Are you using the case class for:
case (response, vec) => ?
Also, what library do I need to import to use .toBreeze?
Thanks,
tri
-Original Message-
From: dbt...@dbtsai.com [mailto:dbt...@dbtsai.com]
Sent: Friday, December 12, 2014 3:27 PM
To: Bui,
Hi,
An Akka router creates a sqlContext and creates a bunch of routee actors
with the sqlContext as a parameter. The actors then execute queries on that
sqlContext.
Would this pattern be an issue? Any other way sparkContext etc. should be
shared cleanly in Akka routers/routees?
Thanks,
There is only column-level encoding (run-length encoding, delta encoding,
dictionary encoding) and no generic compression.
On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com wrote:
Hi All,
Wondering if when caching a table backed by lzo compressed parquet data,
if spark also
Why do you need a router? I mean, can't you do it with just one actor which
has the SQLContext inside it?
On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
Akka router creates a sqlContext and creates a bunch of routees actors
with sqlContext as parameter. The
All,
I just built Spark-1.2 on my enterprise server (which has Hadoop 2.3 with
YARN). Here're the steps I followed for the build:
$ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
$ export SPARK_HOME=/path/to/spark/folder
$ export HADOOP_CONF_DIR=/etc/hadoop/conf
Hi, does anyone know when Spark 1.2 will be released? 1.2 has many great
features that we can't wait for ;-) Sincerely, Lin wukang. Sent from NetEase
Mail Master
Hi,
Can you clean up the code a little bit? It's hard to read what's going
on. You can use pastebin or gist to share the code.
On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren inv...@gmail.com wrote:
Hi,
I am using SparkSQL on the 1.2.1 branch. The problem comes from the following
4-line code:
*val
It’s on Maven Central already http://search.maven.org/#browse%7C717101892
On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com
vboylin1...@gmail.com wrote:
Hi,
Does anyone know when Spark 1.2 will be released? 1.2 has many great features
that we can't wait for ;-)
Sincerely,
Lin wukang
Interesting, the maven artifacts were dated Dec 10th.
However, the vote for RC2 closed recently:
http://search-hadoop.com/m/JW1q5K8onk2/Patrick+spark+1.2.0subj=Re+VOTE+Release+Apache+Spark+1+2+0+RC2+
Cheers
On Dec 18, 2014, at 10:02 PM, madhu phatak phatak@gmail.com wrote:
It’s on Maven
Patrick is working on the release as we speak -- I expect it'll be out
later tonight (US west coast) or tomorrow at the latest.
On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu yuzhih...@gmail.com wrote:
Interesting, the maven artifacts were dated Dec 10th.
However vote for RC2 closed recently:
Yup, as he posted before, "An Apache infrastructure issue prevented me from
pushing this last night. The issue was resolved today and I should be able
to push the final release artifacts tonight."
On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote:
Patrick is working on the