Yep, already fixed in master:
https://github.com/apache/spark/pull/4977/files
You need a '.toDF()' at the end.
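For reference, a minimal sketch of the corrected flow in Scala (Spark 1.3); the file path and Person fields follow the docs example, and sc is the SparkContext provided by spark-shell:

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._              // brings the .toDF() conversion into scope

  val people = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
    .toDF()                                  // in 1.3 an RDD no longer has registerTempTable

  people.registerTempTable("people")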
On Sat, Mar 14, 2015 at 6:55 PM, Dean Arnold renodino...@gmail.com wrote:
Running 1.3.0 from binary install. When executing the example under the
subject section from within
Hi, I was taking a look through the mllib examples in the official spark
documentation and came across the following:
http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2
specifically the lines:
label = data.map(lambda x: x.label)
features = data.map(lambda x:
No, I don't think that is a bug, since newFilesOnly=false removes
a constraint that otherwise exists, and that's what you see.
However read the closely related:
https://issues.apache.org/jira/browse/SPARK-6061
@tdas open question for you there.
On Sat, Mar 14, 2015 at 8:18 PM, Justin Pihony
Hi Sean,
Thanks very much for your reply.
I tried to configure it with the code below:
sf = SparkConf().setAppName("test").set("spark.executor.memory",
"45g").set("spark.cores.max", "62").set("spark.local.dir", "C:\\tmp")
But I still get the error.
Do you know how I can configure this?
Thanks,
Best,
Peng
On Sat, Mar
All,
Looking into this StackOverflow question
https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469
it appears that there is a bug when utilizing the newFilesOnly parameter in
FileInputDStream. Before creating a ticket, I wanted to verify it here. The
gist is that this
In spark-avro 0.1, the method AvroContext.avroFile returns a SchemaRDD, which
is deprecated in Spark 1.3.0
package com.databricks.spark
import org.apache.spark.sql.{SQLContext, SchemaRDD}
package object avro {
/**
* Adds a method, `avroFile`, to SQLContext that allows reading data
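As an aside, a hedged sketch of the Spark 1.3 style of loading Avro through the generic data source API, which yields a DataFrame rather than a SchemaRDD (the file name is a placeholder, spark-avro must be on the classpath, and sc comes from spark-shell):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  // "com.databricks.spark.avro" is the data source name; episodes.avro is a placeholder path
  val df = sqlContext.load("episodes.avro", "com.databricks.spark.avro")
  df.registerTempTable("episodes")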
Running 1.3.0 from binary install. When executing the example under the
subject section from within spark-shell, I get the following error:
scala> people.registerTempTable("people")
<console>:35: error: value registerTempTable is not a member of
org.apache.spark.rdd.RDD[Person]
And I have 2 TB of free space on the C: drive.
On Sat, Mar 14, 2015 at 8:29 PM, Peng Xia sparkpeng...@gmail.com wrote:
Hi Sean,
Thanks very much for your reply.
I tried to configure it with the code below:
sf = SparkConf().setAppName("test").set("spark.executor.memory",
"45g").set("spark.cores.max",
Large edge partitions could cause java.lang.OutOfMemoryError, and then
Spark tasks fail.
FWIW, each edge partition can have at most 2^32 edges because 64-bit vertex
IDs are mapped into 32-bit ones in each partition.
If #edges is over the limit, GraphX could throw
ArrayIndexOutOfBoundsException,
Programmatically specifying the schema needs
import org.apache.spark.sql.types._
for StructType and StructField to resolve.
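A minimal sketch of the programmatic path with that import, assuming Spark 1.3, a two-column people.txt as in the guide, and sc from spark-shell:

  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.types._   // StructType, StructField, StringType

  val sqlContext = new SQLContext(sc)   // or the sqlContext provided by spark-shell

  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", StringType, nullable = true)))

  val rowRDD = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Row(p(0), p(1).trim))

  val peopleDF = sqlContext.createDataFrame(rowRDD, schema)
  peopleDF.registerTempTable("people")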
On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen so...@cloudera.com wrote:
Yes I think this was already just fixed by:
https://github.com/apache/spark/pull/4977
a .toDF() is
Hi,
If you have heap problems in Spark/GraphX, it'd be better to split
partitions into smaller ones so that each partition fits in memory.
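For example (a sketch only; the path and partition count are assumptions, sc from spark-shell), loading the edge list with more, and therefore smaller, edge partitions:

  import org.apache.spark.graphx.GraphLoader

  // More edge partitions means fewer edges per partition, so each one fits in memory.
  val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
    numEdgePartitions = 512)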
On Sat, Mar 14, 2015 at 12:09 AM, Hlib Mykhailenko
hlib.mykhaile...@inria.fr wrote:
Hello,
I cannot process a graph with 230M edges.
I cloned
Hi,
I am trying to deploy Spark on a standalone cluster of two machines, one for
the master node and one for the worker node. I have defined the two machines
in the conf/slaves file and also in /etc/hosts. When I tried to run the
cluster, the worker node is running but the master node failed to run and
throws this
Hello,
I have a cluster with Spark on YARN. Currently some of its nodes are
running a Spark Streaming program, so their local space is not enough to
support other applications. I wonder whether it is possible to use a
blacklist to avoid using these nodes when running a new Spark program?
I can't reproduce that. 'mvn package' builds everything. You're not
showing additional output from Maven that would explain what it
skipped and why.
On Sat, Mar 14, 2015 at 12:57 AM, sequoiadb
mailing-list-r...@sequoiadb.com wrote:
Guys, is there any easier way to build all modules with mvn?
Hi
I am trying to understand the example provided in
https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html -
Streaming linear regression
Code:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.mllib.linalg.Vectors
import
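For anyone following along, here is a condensed sketch close to what that page shows (the paths, batch interval, and feature count of 3 are assumptions):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

  val conf = new SparkConf().setAppName("StreamingLinearRegression")
  val ssc = new StreamingContext(conf, Seconds(1))

  // Each line is in LabeledPoint text format, e.g. (1.0,[0.5,0.3,0.1])
  val trainingData = ssc.textFileStream("hdfs:///training").map(LabeledPoint.parse)
  val testData = ssc.textFileStream("hdfs:///testing").map(LabeledPoint.parse)

  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(3))   // 3 features assumed

  model.trainOn(trainingData)
  model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

  ssc.start()
  ssc.awaitTermination()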
Hi,
I read this document,
http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried
to build a TF-IDF model of my documents.
I have a list of documents, each word is represented as an Int, and each
document is on one line.
doc_name, int1, int2...
doc_name, int3, int4...
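A hedged sketch of how one might compute TF-IDF over that layout with MLlib (Scala; the path and comma delimiter are assumptions taken from the description above, sc from spark-shell):

  import org.apache.spark.mllib.feature.{HashingTF, IDF}

  // Each line: doc_name, int1, int2, ...
  val docs = sc.textFile("docs.txt").map { line =>
    val fields = line.split(",").map(_.trim)
    (fields.head, fields.tail.toSeq)        // (doc_name, terms)
  }

  val tf = new HashingTF().transform(docs.map(_._2))
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)             // RDD[Vector]; typically SparseVector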
It means pretty much what it says. You ran out of space on an executor
(not driver), because the dir used for serialization temp files is
full (not all volumes). Set spark.local.dir to something more
appropriate and larger.
On Sat, Mar 14, 2015 at 2:10 AM, Peng Xia sparkpeng...@gmail.com wrote:
Which release of hadoop are you using ?
Can you utilize node labels feature ?
See YARN-2492 and YARN-796
Cheers
On Sat, Mar 14, 2015 at 1:49 AM, James alcaid1...@gmail.com wrote:
Hello,
I have a cluster with Spark on YARN. Currently some of its nodes are
running a Spark Streaming
My hadoop version is 2.2.0, and my spark version is 1.2.0
2015-03-14 17:22 GMT+08:00 Ted Yu yuzhih...@gmail.com:
Which release of hadoop are you using ?
Can you utilize node labels feature ?
See YARN-2492 and YARN-796
Cheers
On Sat, Mar 14, 2015 at 1:49 AM, James alcaid1...@gmail.com
Hey, I worked it out myself :)
The Vector is actually a SparseVector, so when it is written as a
string, the format is
(size, [indices], [values])
Simple!
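A quick check of that format (a sketch; the values are arbitrary):

  import org.apache.spark.mllib.linalg.Vectors

  val v = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))
  println(v)   // prints (5,[1,3],[0.5,2.0]) -- size, indices, values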
On Sat, Mar 14, 2015 at 6:05 PM Xi Shen davidshe...@gmail.com wrote:
Hi,
I read this document,
Hi,
You may want to check your Spark environment config in spark-env.sh,
specifically SPARK_LOCAL_IP, and check whether you modified
that value, which may default to localhost.
Thanks,
Sun.
fightf...@163.com
From: sara mustafa
Date: 2015-03-14 15:13
To: user
Subject: deploying
You won't be able to use YARN labels on 2.2.0. However, you only need the
labels if you want to map containers onto specific hardware. In your scenario,
the capacity scheduler in YARN might be the best bet. You can set up separate
queues for the streaming and other jobs to protect a percentage of
I haven't registered my classes with Kryo, but I don't think it would have
such an impact on the stack size.
I'm thinking of using GraphX and I'm wondering how it serializes the graph
object, as it can use Kryo as the serializer.
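If it helps, a hedged sketch of registering classes with Kryo (Scala, Spark 1.2+); MyVertexAttr is a hypothetical application class:

  import org.apache.spark.SparkConf
  import org.apache.spark.graphx.GraphXUtils

  case class MyVertexAttr(id: Long, weight: Double)   // hypothetical vertex attribute

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyVertexAttr]))
  GraphXUtils.registerKryoClasses(conf)   // also registers GraphX's internal classes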
2015-03-14 6:22 GMT+01:00 Ted Yu yuzhih...@gmail.com:
Have you registered
Thanks TD, this is what I was looking for. rdd.context.makeRDD worked.
Laeeq
On Friday, March 13, 2015 11:08 PM, Tathagata Das t...@databricks.com
wrote:
Is the number of top K elements you want to keep small? That is, is K small?
In which case, you can 1. either do it in the
Out of curiosity, a search for 'capacity scheduler deadlock' yielded the
following:
[YARN-3265] CapacityScheduler deadlock when computing absolute max avail
capacity (fix for trunk/branch-2)
[YARN-3251] Fix CapacityScheduler deadlock when computing absolute max
avail capacity (short term fix
Yes I think this was already just fixed by:
https://github.com/apache/spark/pull/4977
a .toDF() is missing
On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath
nick.pentre...@gmail.com wrote:
I've found people.toDF gives you a data frame (roughly equivalent to the
previous Row RDD),
And you can
It’s a long story but there are many dirs with smallish part- files in them
so we create a list of the individual files as input to
sparkContext.textFile(fileList). I suppose we could move them and rename them
to be contiguous part- files in one dir. Would that be better than passing
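For concreteness, a sketch of the comma-delimited list approach described above (the paths are placeholders, sc is the SparkContext):

  // textFile accepts a comma-separated list of paths, so the individual files can be joined.
  val fileList: Seq[String] = Seq(
    "hdfs:///data/day1/part-00000",
    "hdfs:///data/day2/part-00000")
  val text = sc.textFile(fileList.mkString(","))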
I've found people.toDF gives you a data frame (roughly equivalent to the
previous Row RDD), and you can then call registerTempTable on that DataFrame.
So people.toDF().registerTempTable("people") should work.
—
Sent from Mailbox
On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell
Any advice on dealing with a large number of separate input files?
On Mar 13, 2015, at 4:06 PM, Pat Ferrel p...@occamsmachete.com wrote:
We have many text files that we need to read in parallel. We can create a comma
delimited list of files to pass in to sparkContext.textFile(fileList). The
You seem not to have noticed the configuration variable spreadOutApps
and its comment:
// As a temporary workaround before better ways of configuring memory, we
allow users to set
// a flag that will perform round-robin scheduling across the nodes
(spreading out each app
// among all the
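For reference (my understanding, worth double-checking against your Spark version), that flag is exposed as the spark.deploy.spreadOut property of the standalone master and defaults to true, e.g. in conf/spark-defaults.conf:

  spark.deploy.spreadOut  false   # pack each app onto as few nodes as possible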
I am pleased with the release of the DataFrame API. However, I started
playing with it, and neither of the two main examples in the documentation
work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html
Specifically:
- Inferring the Schema Using Reflection
- Programmatically
Hi,
I created a question on StackOverflow:
http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application
I would appreciate your help.
Best,
Tomek
Hi all,
I have the following cluster configurations:
- 5 nodes on a cloud environment.
- Hadoop 2.5.0.
- HBase 0.98.6.
- Spark 1.2.0.
- 8 cores and 16 GB of RAM on each host.
- 1 NFS disk with 300 IOPS mounted on host 1 and 2.
- 1 NFS disk with 300 IOPS mounted on host
The 4.1 GB table has 3 regions. This means that there would be at least 2
nodes that don't host any of its regions.
Can you split this table into 12 (or more) regions ?
BTW what's the value for spark.yarn.executor.memoryOverhead ?
Cheers
On Sat, Mar 14, 2015 at 10:52 AM, francexo83
Here is how I have dealt with many small text files (on S3, though this
should generalize) in the past:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E
From: Michael Armbrust
Do you have an example that reproduces the issue?
On Fri, Mar 13, 2015 at 4:12 PM, gtinside gtins...@gmail.com wrote:
Hi,
I am playing around with Spark SQL 1.3 and noticed that the max function does
not give the correct result, i.e. it doesn't give the maximum value. The same
query works fine in