You can try LinearRegression with sparse input. It converges to the
least-squares solution if the linear system is over-determined, and the
convergence rate depends on the condition number. Applying standard
scaling is a popular heuristic to reduce the condition number.
If you are interested in
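For reference, a minimal sketch of that scaling heuristic (assuming Spark 1.1's MLlib; the file path and line parsing are hypothetical):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    // hypothetical input: one label followed by comma-separated features per line
    val data = sc.textFile("data.txt").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }

    // standardize features to reduce the condition number before solving
    val scaler = new StandardScaler(withMean = true, withStd = true)
      .fit(data.map(_.features))
    val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))

    val model = LinearRegressionWithSGD.train(scaled, 100)

Note that withMean = true densifies the vectors, so with truly sparse input you would typically scale with withStd only.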
There is an undocumented configuration to put user jars in front of the
Spark jar. But I'm not very certain that it works as expected (and
this is why it is undocumented). Please try turning on
spark.yarn.user.classpath.first. -Xiangrui
On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen
Thank you, Aaron, for pointing out the problem. This only happens when I run
this code in spark-shell, but not when I submit the job.
Thanks Tobias,
I also found this: https://issues.apache.org/jira/browse/SPARK-3299
Looks like it's being worked on.
Jianshi
On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Err...
~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;
I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
when trying to run distcp:
ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
Tomer,
To use distcp, you need to have a Hadoop compute cluster up. start-dfs just
restarts HDFS. I don’t have a Spark 1.0.2 cluster up right now, but there
should be a start-mapred*.sh or start-all.sh script that will launch the Hadoop
MapReduce cluster that you will need for distcp.
That should be OK, since the iterator is definitely consumed, and
therefore the connection is actually done with, at the end of a 'foreach'
method. You might put the close in a finally block.
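For example, a sketch of that pattern (createConnection and send are hypothetical; the point is that close() runs in a finally block only after the iterator is fully consumed):

    rdd.foreachPartition { iterator =>
      val connection = createConnection() // hypothetical per-partition resource
      try {
        iterator.foreach(record => connection.send(record))
      } finally {
        connection.close() // runs even if processing a record fails
      }
    }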
On Mon, Sep 8, 2014 at 12:29 AM, Soumitra Kumar
kumar.soumi...@gmail.com wrote:
I have the following code:
Hi,
On Mon, Sep 8, 2014 at 4:39 PM, Sean Owen so...@cloudera.com wrote:
if (rdd.take(1).size == 1) {
rdd foreachPartition { iterator =>
I was wondering: Since take() is an output operation, isn't it computed
twice (once for the take(1), once during the
I see that the Tachyon URL constructed for an RDD partition has the executor id
in it. So if the same partition is being processed by a different executor
on a re-execution of the same computation, it cannot really use the earlier
result. Is this a correct assessment? Will removing the executor id from
Hi,
Can someone tell me how to profile a Spark application?
-Karthik
thank you for the replies.
I am running an insert on a join (INSERT OVERWRITE TABLE new_table select *
from table1 as a join table2 as b on (a.key = b.key)).
The process does not have the right permission to write to that folder, so I
get the following error printed:
chgrp: `/user/x/y': No such
See
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit
On Sep 8, 2014, at 2:48 AM, rapelly kartheek kartheek.m...@gmail.com wrote:
Hi,
Can someone tell me how to profile a Spark application?
-Karthik
Solved.
The problem is the following: the underlying Akka driver uses the
INTERNAL interface address on the Amazon instance (the ones that start
with 10.x.y.z) to present itself to the master, it does not use the
external (public) IP!
Ognen
On 9/7/2014 3:21 PM, Sean Owen wrote:
Also keep
After wasting a lot of time, I've found the problem. Although I wasn't using
Hadoop/HDFS in my application, the Hadoop client still matters. The problem
was the hadoop-client version: it was different from the version of Hadoop
that Spark was built for. Spark's Hadoop version was 1.2.1, but in my application that was
Hello,
I tried to execute a simple Spark application using SparkSQL.
At first try, it worked as I expected, but after that it doesn't run and shows
a stderr like the one below:
Spark Executor Command: java -cp
Hi
I'm building an application that reads events from a Kafka stream. In production
we have 5 consumers that share 10 partitions.
But with Spark Streaming's Kafka integration, the master acts as a consumer and
then distributes the tasks to the workers, so I can have only 1 master acting
as a consumer, but I need more because only 1
Durin,
I have integrated ECOS with Spark, which uses SuiteSparse under the hood for
linear equation solves. I have exposed only the QP solver API in Spark,
since I was comparing IP with proximal algorithms, but we can expose the
SuiteSparse API as well... JNI is used to load up LDL, AMD and ECOS
Xiangrui,
Should I open up a JIRA for this?
Distributed LP/SOCP solver through ECOS/LDL/AMD?
I can open-source it under a GPL license in Spark code, as that's what our
legal cleared (Apache + GPL becomes GPL), and figure out the right way to
call it... ECOS is GPL, but we can definitely use the JNI
Dear all:
I am a brand new Spark user trying out the SimpleApp from the Quick Start
page.
Here is the code:
object SimpleApp {
def main(args: Array[String]) {
val logFile = "/dev/spark-1.0.2-bin-hadoop2/README.md" // Should be some file on your system
val conf = new SparkConf()
Hi,
I am reading data from Cassandra through the DataStax spark-cassandra connector,
converting it into JSON and then running Spark SQL on it. Refer to the code
snippet below:
step 1: val o_rdd = sc.cassandraTable[CassandraRDDWrapper]("keyspace", "column_family")
step 2: val tempObjectRDD =
I am running a very simple example using SVMWithSGD on Amazon EMR. I
haven't gotten any result after a full hour.
My instance-type is: m3.large
instance-count is: 3
The dataset is the sample_svm_data file provided with MLlib.
The number of iterations is: 2
and all other options
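For reference, a minimal sketch of that kind of run (assuming the usual parsing for sample_svm_data.txt: a label followed by space-separated numeric features):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val data = sc.textFile("sample_svm_data.txt")
    val parsed = data.map { line =>
      val parts = line.split(' ').map(_.toDouble)
      LabeledPoint(parts(0), Vectors.dense(parts.tail))
    }.cache()

    // numIterations = 2, as in the run described above
    val model = SVMWithSGD.train(parsed, 2)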
Hi,
I have a key-value RDD, shown below. After a groupBy, I tried to count rows.
But the result is not unique; it is somehow non-deterministic.
Here is the test code:
val step1 = ligneReceipt_cleTable.persist
val step2 = step1.groupByKey
val s1size = step1.count
val s2size =
Thank you Ted.
regards
Karthik
On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote:
See
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit
On Sep 8, 2014, at 2:48 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Hi,
Can someone
Tomer,
Did you try start-all.sh? It worked for me the last time I tried using
distcp, and it worked for this guy too
http://stackoverflow.com/a/18083790/877069.
Nick
On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:
~/ephemeral-hdfs/sbin/start-mapred.sh does not
Still no luck, even when running stop-all.sh followed by start-all.sh.
On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Tomer,
Did you try start-all.sh? It worked for me the last time I tried using
distcp, and it worked for this guy too.
Nick
On Mon, Sep
Update:
Just tested with HashPartitioner(8) and counted each partition. Two runs gave:
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657591), (6,658327), (7,658434))
List((0,657824), (1,658549), (2,659199), (3,658684), (4,659394), (5,657594), (6,658326), (7,658434))
Note that partitions 5 and 6 differ between the two runs.
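For reference, per-partition counts like those above can be produced with something like this (a sketch, assuming the grouped pair RDD from the earlier code):

    val perPartitionCounts = step2
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .toList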
What if, when I traverse an RDD, I need to calculate values in the dataset by
calling an external (blocking) service? How do you think that could be
achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but as RDD is not Traversable,
Future.sequence is
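A common alternative to driver-side Futures is to make the blocking call on the workers, one client per partition (a sketch; BlockingServiceClient and callService are hypothetical):

    // parallelism comes from the number of partitions, not from Futures:
    // each task blocks on its own calls independently
    val results = rdd.mapPartitions { iterator =>
      val client = new BlockingServiceClient() // hypothetical
      iterator.map(value => client.callService(value))
    }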
Hi,
What does the external service provide? Data? Calculations? Can the
service push data to you via Kafka and Spark Streaming? Can you fetch the
necessary data beforehand from the service? The solution to your question
depends on your answers.
I would not recommend connecting to a blocking
What did you see in the log? Was there anything related to MapReduce?
Can you log into your HDFS (data) node, use jps to list all Java processes, and
confirm whether there is a tasktracker process (or nodemanager) running
alongside the datanode process?
--
Ye Xianjin
Hi Jörn, first of all, thanks for your intent to help.
This external service is a native component that is stateless and
performs the calculation based on the data I provide. The data is in an RDD.
I have that component on each worker node, and I would like to get as
much parallelism
What is the driver-side Future for? Are you trying to make the remote
Spark workers execute more requests to your service concurrently? It's
not clear from your messages whether it's something like a web
service, or just local native code.
So the time spent in your processing -- whatever returns
No tasktracker or nodemanager. This is what I see:
On the master:
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.namenode.NameNode
On the data node (slave):
Hello friends:
It was mentioned in another (Y.A.R.N.-centric) email thread that
'SPARK_JAR' was deprecated,
and to use the 'spark.yarn.jar' property instead for YARN submission.
For example:
user$ pyspark [some-options] --driver-java-options
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
user$ pyspark [some-options] --driver-java-options
spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar
This command line does not look correct. spark.yarn.jar is not a JVM
command line option.
Thanks, Sean, I'll try to explain what I'm trying to do.
The native component that I'm talking about is native code that I call
using JNI.
I've written a small test
Here, I traverse the collection to call the native component N (1000) times.
Then I have a result
it means, that
I came across this: https://github.com/xerial/sbt-pack
Until I found this, I was simply using the sbt-assembly plugin (sbt clean
assembly).
mn
On Sep 4, 2014, at 2:46 PM, Aris arisofala...@gmail.com wrote:
Thanks for answering Daniil -
I have SBT version 0.13.5, is that an old version?
Hi,
I'm having problems with a ClassNotFoundException using this simple example:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import java.net.URLClassLoader
import scala.util.Marshal
class ClassToRoundTrip(val id: Int) extends
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads.
user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar='
/usr/lib/spark/assembly/lib/spark-assembly-*.jar'
My question is,
I don't understand what you mean. Can you be more specific?
From: Victor Tso-Guillen v...@paxata.com
Sent: Saturday, September 06, 2014 5:13 PM
To: Penny Espinoza
Cc: Spark
Subject: Re: prepending jars to the driver class path for spark-submit on YARN
I ran
Hi,
So the external service itself creates threads and blocks until they
finish execution? In that case you should not do the threading yourself, but
include it via JNI directly in Spark - it will take care of threading for you.
Best regards
Hi, Jörn, first of all, thanks for you intent to help.
This
Well, this means you didn't start a compute cluster, most likely because a
wrong value of mapreduce.jobtracker.address means the slave node cannot start
the node manager. (I am not familiar with the ec2 script, so I don't know
whether the slave node has a node manager installed or not.)
Can
When you submit the job to yarn with spark-submit, set --conf
spark.yarn.user.classpath.first=true .
On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza
pesp...@societyconsulting.com wrote:
I don't understand what you mean. Can you be more specific?
From: Victor
Could you attach the driver log? -Xiangrui
On Mon, Sep 8, 2014 at 7:23 AM, Hui Li hli161...@gmail.com wrote:
I am running a very simple example using the SVMWithSGD on Amazon EMR. I
haven't got any result after one hour long.
My instance-type is: m3.large
instance-count is: 3
Dataset
I have tried using the spark.files.userClassPathFirst option (which,
incidentally, is documented now, but marked as experimental), but it just
causes different errors. I am using spark-streaming-kafka. If I mark
spark-core and spark-streaming as provided and also exclude them from the
Depending on what you want to do with the result of the scraping, Spark may
not be the best framework for your use case. Take a look at a general Akka
application.
On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh sand...@techaddict.me
wrote:
Hi all,
I am implementing a Crawler and Scraper. It
Is there more information on what the Input column on the Spark UI means?
How is this computed? I am processing a fairly small (but zipped) file
and see the value as
(inline screenshot of the reported Input value omitted)
This does not seem correct?
Thanks,
Arun
I asked Tim whether he would change the license of SuiteSparse to an
Apache-friendly license couple months ago, but the answer was no. So I
don't think we can use SuiteSparse in MLlib through JNI. Please feel
free to create JIRAs for distributed linear programming and SOCP
solvers and run the
Victor - Not sure what you mean. Can you provide more detail about what you
did?
From: Victor Tso-Guillen v...@paxata.com
Sent: Saturday, September 06, 2014 5:13 PM
To: Penny Espinoza
Cc: Spark
Subject: Re: prepending jars to the driver class path for
What is the right way to save a PairRDD in Avro output format?
GraphArray extends SpecificRecord etc.
I have the following java rdd:
JavaPairRDD<GraphArray, NullWritable> pairRDD = ...
and want to save it to avro format:
org.apache.hadoop.mapred.JobConf jc = new
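One way (a sketch in Scala rather than Java, assuming avro-mapred's new-API AvroKeyOutputFormat and an RDD[(GraphArray, NullWritable)]; the output path is hypothetical):

    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job

    val job = new Job()
    AvroJob.setOutputKeySchema(job, GraphArray.getClassSchema())

    // wrap each record in an AvroKey before handing it to the output format
    pairRDD
      .map { case (g, n) => (new AvroKey(g), n) }
      .saveAsNewAPIHadoopFile(
        "hdfs:///path/to/output", // hypothetical
        classOf[AvroKey[GraphArray]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GraphArray]],
        job.getConfiguration)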
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC.
subscripti...@didata.us wrote:
So just to clarify for me: When specifying 'spark.yarn.jar' as I did
above, even if I don't use HDFS to create a
RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is
still necessary to
Yup...this can be a spark community project...I saw a PR for
that...interested users fine with lgpl/gpl code can make use of it...
On Mon, Sep 8, 2014 at 12:37 PM, Xiangrui Meng men...@gmail.com wrote:
I asked Tim whether he would change the license of SuiteSparse to an
Apache-friendly license
Hi,
I was reading the Spark Streaming paper:
Discretized Streams: Fault-Tolerant Streaming Computation at Scale
I read that the performance evaluation used 100-byte input records in the Grep
and WordCount tests.
I don't have much experience and I'd like to know how I can control this
value in my
Hi,
Let me start with: I am new to Spark (be gentle).
I have a large dataset in Parquet (~1.5B rows, 900 columns).
Currently Impala takes ~1-2 seconds for the queries, while Spark SQL is
taking ~30 seconds.
Here is what I am currently doing:
I launch with SPARK_MEM=6g spark-shell
val
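For context, a minimal sketch of that kind of setup (assuming Spark 1.0's SQLContext Parquet API; the path and table name are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val table = sqlContext.parquetFile("hdfs:///path/to/parquet")
    table.registerAsTable("events") // registerTempTable in Spark 1.1+
    sqlContext.sql("select count(*) from events").collect()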
You are probably not getting an error because the exception is happening
inside of Hive. I'd still consider this a bug if you'd like to open a JIRA.
On Mon, Sep 8, 2014 at 3:02 AM, jamborta jambo...@gmail.com wrote:
thank you for the replies.
I am running an insert on a join (INSERT
Hello all,
I've been wrestling with this problem all day and any suggestions would be
greatly appreciated.
I'm trying to test reading a parquet file that's stored in s3 using a spark
cluster deployed on ec2. The following works in the spark shell when run
completely locally on my own machine
I believe DataStax is working on better integration here, but until that is
ready you can use the applySchema API. Basically, you will convert the
CassandraTable into an RDD of Row objects using a .map() and then call
applySchema (provided by SQLContext) to get a SchemaRDD.
More details
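A sketch of that applySchema route (assuming Spark 1.1's SQL API; the column names and CassandraRDDWrapper fields are hypothetical):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)

    // hypothetical: pull two fields out of each Cassandra row
    val rowRDD = o_rdd.map(w => Row(w.id, w.payload))

    val schema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("payload", StringType, nullable = true)))

    val schemaRDD = sqlContext.applySchema(rowRDD, schema)
    schemaRDD.registerTempTable("wrapped")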
How big is the data set? Does it work when you copy it to hdfs?
-Manu
On Mon, Sep 8, 2014 at 2:58 PM, Jim Carroll jimfcarr...@gmail.com wrote:
Hello all,
I've been wrestling with this problem all day and any suggestions would be
greatly appreciated.
I'm trying to test reading a parquet
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-NumberofTasks
It would be great if something like hive.exec.reducers.bytes.per.reducer
could be implemented.
One idea is: get the total size of all target blocks, then set the number of
partitions
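That idea is easy to prototype on the driver (a sketch; the path and the 256 MB target are hypothetical):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val totalBytes = fs.listStatus(new Path("/path/to/table")).map(_.getLen).sum

    // target a fixed number of bytes per partition, in the spirit of
    // hive.exec.reducers.bytes.per.reducer
    val bytesPerPartition = 256L * 1024 * 1024
    val numPartitions =
      math.max(1, ((totalBytes + bytesPerPartition - 1) / bytesPerPartition).toInt)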
Thanks TD. Someone already pointed out to me that calling repartition(...) on
its own isn't the right way; you have to write val partedStream =
repartition(...). Would be nice to have it fixed in the docs.
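In other words (a minimal sketch; the stream name and partition count are hypothetical):

    // repartition returns a new DStream; it does not modify the stream in
    // place, so the result must be captured and used downstream
    val partedStream = inputStream.repartition(8)
    partedStream.foreachRDD(rdd => println(rdd.partitions.length))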
On Fri, Sep 5, 2014 at 10:44 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Some thoughts on this
Mmm, how many days' worth of data / how deep is your data nesting?
I suspect you're running into a current issue with Parquet (a fix is in
master but I don't believe it has been released yet...). It reads all the
metadata on the submitter node as part of scheduling the job. This can cause
long start times (timeouts
Hi guys,
My Spark Streaming application has a java.lang.OutOfMemoryError: GC
overhead limit exceeded error in the Spark Streaming driver program. I have
done the following to debug it:
1. Increased the driver memory from 1GB to 2GB; this error came after 22
hrs. When the memory was 1GB, it
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC.
subscripti...@didata.us wrote:
You're probably right about the above because, as seen *below* for
pyspark (but probably for other Spark
applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is
specified, the app invocation
Hi,
I am running Spark 1.0.2 on a cluster in Mesos mode. I am not able to access
the Spark master web UI at port 8080, but I am able to access it at port 5050.
Is 5050 the standard port?
Also, in standalone mode there is a link to the Application Detail UI
directly from the master UI. I
Hi,
Spark master web UI is only for standalone clusters, where cluster
resources are managed by Spark, not other resource managers.
Mesos master's default port is 5050. Within Mesos, a Spark application is
considered as one of many frameworks, so there's no Spark-specific support
like accessing
Hi,
One of the executors in my Spark cluster shows CANNOT FIND ADDRESS as its
address for one of the stages, which failed. After that stage, I got
cascading failures for all my stages :/ (stages that seem complete but still
appear as active stages in the dashboard; incomplete or failed stages that
are
In a Hadoop jar there is a directory called lib, and all non-provided third-party
jars go there and are included in the classpath of the code. Do jars
for Spark have the same structure? Another way to ask the question: if I
have code to execute Spark and a jar built for Hadoop, can I simply use
Hi,
On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
In the current state of Spark Streaming, creating separate Java processes
each having a streaming context is probably the best approach to
dynamically adding and removing of input sources. All of these
Ron,
On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid
wrote:
I’m trying to figure out how I can run Spark Streaming like an API.
The goal is to have a synchronous REST API that runs the spark data flow
on YARN.
I guess I *may* develop something similar in the
Hi Hemanth,
I think there is a bug in this API in Spark 0.8.1, so you will meet this
exception when using Java code with this API. This bug is fixed in the latest
version, as you can see in the patch (https://github.com/apache/spark/pull/1508).
But it's only for Kafka 0.8+, as you still use Kafka
Tobias,
Let me explain a little more.
I want to create a synchronous REST API that will process some data that is
passed in as some request.
I would imagine that the Spark Streaming Job on YARN is a long running job
that waits on requests from something. What that something is is still not
Hi,
On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
I want to create a synchronous REST API that will process some data that
is passed in as some request.
I would imagine that the Spark Streaming Job on YARN is a long
running job that waits on requests from
hi Ted,
Where do I find the licence keys that I need to copy to the licences
directory?
Thank you!!
On Mon, Sep 8, 2014 at 8:25 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Thank you Ted.
regards
Karthik
On Mon, Sep 8, 2014 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote:
See
Hi Tobias,
So I guess where I was coming from was the assumption that starting up a new
job to be listening on a particular queue topic could be done asynchronously.
For example, let’s say there’s a particular topic T1 in a Kafka queue. If I
have a new set of requests coming from a
Thanks, Shao, for providing the necessary information.
Hemanth
On Tue, Sep 9, 2014 at 8:21 AM, Shao, Saisai saisai.s...@intel.com wrote:
Hi Hemanth,
I think there is a bug in this API in Spark 0.8.1, so you will meet this
exception when using Java code with this API, this bug is fixed in
Hi,
On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
So I guess where I was coming from was the assumption that starting up a
new job to be listening on a particular queue topic could be done
asynchronously.
No, with the current state of Spark Streaming, all data
Hi Daniil,
I have to do some processing of the results, as well as pushing the data to
the front end. Currently I'm using Akka for this application, but I was
thinking maybe Spark Streaming would be a better thing to do, and I could
also use MLlib for processing the results. Any specific reasons
As you mentioned in another mail that you hope to transplant the latest version
of Spark onto Kafka 0.7, there are some notes you should take care of:
1. Kafka 0.7 can only be compiled with Scala 2.8, while Spark is now
compiled with Scala 2.10; there is no binary compatibility between these two Scala