Not directly. If you could access brzPi and brzTheta in the
NaiveBayesModel, you could repeat the same computation it performs in predict() and
exponentiate the result to get back class probabilities, since the input and internal
values are in log space.
Hm I wonder how people feel about exposing those fields or a
Hi,
Can somebody help me to understand why this error occurred?
2014-11-10 00:17:44,512 INFO [Executor task launch worker-0]
receiver.BlockGenerator (Logging.scala:logInfo(59)) - Started BlockGenerator
2014-11-10 00:17:44,513 INFO [Executor task launch worker-0]
Hi,
I've got a huge list of key-value pairs, where the key is an integer and
the value is a long string (around 1 KB). I want to concatenate the strings
with the same keys.
Initially I did something like: pairs.reduceByKey((a, b) => a + " " + b)
Then I tried to save the result to HDFS. But it was
You are suggesting that the String concatenation is slow? It probably is
because of all the allocation.
Consider foldByKey instead which starts with an empty StringBuilder as its
zero value. This will build up the result far more efficiently.
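A minimal sketch of the idea, assuming the pairs are an RDD[(Int, String)]; the sketch uses aggregateByKey rather than foldByKey because its accumulator type (a StringBuilder here) can differ from the value type:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("concat-by-key").setMaster("local[2]"))
val pairs = sc.parallelize(Seq((1, "foo"), (1, "bar"), (2, "baz")))

// Build each value up in a mutable StringBuilder and render it to a String once,
// instead of allocating a new String on every reduce step.
val concatenated = pairs
  .aggregateByKey(new StringBuilder)(
    (sb, s)    => sb.append(s),    // fold one value into the per-partition builder
    (sb1, sb2) => sb1.append(sb2)) // merge builders from different partitions
  .mapValues(_.toString)

concatenated.collect().foreach(println)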
On Nov 10, 2014 8:37 AM, YANG Fan idd...@gmail.com
I want to run the k-means implementation in MLlib on a big dataset. It seems that for big
datasets we need to perform a pre-clustering method such as canopy clustering. By starting
with an initial clustering, the number of more expensive distance measurements
can be significantly reduced by ignoring points outside of
I'm experiencing some strange behavior with closure serialization that is
totally mind-boggling to me. It appears that two arrays of equal size take
up vastly different amounts of space inside closures if they're generated in
different ways.
The basic flow of my app is to run a bunch of tiny
Hi
Recently I wanted to save a big RDD[(k,v)] in the form of index and data files, so I
decided to use a Hadoop MapFile. I tried some examples like this one:
https://gist.github.com/airawat/6538748
The code runs fine and generates an index and a data file. I can use the
command
hadoop fs -text
Thanks for the answer. The variables brzPi and brzTheta are declared private.
I am writing my code in Java; otherwise I could have replicated the Scala
class and performed the desired computation, which, as far as I can see, is a
multiplication of brzTheta with the test vector plus brzPi.
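For reference, a rough sketch (in Scala, using Breeze) of that computation, assuming pi (the log class priors) and theta (the log conditional probability matrix) were available as Breeze structures; the names, shapes and the normalization step are assumptions, not the model's actual API:

import breeze.linalg.{max, sum, DenseMatrix, DenseVector}

// Assumed shapes: pi has length numClasses (log priors),
// theta is numClasses x numFeatures (log conditional probabilities).
def classProbabilities(pi: DenseVector[Double],
                       theta: DenseMatrix[Double],
                       testVector: DenseVector[Double]): DenseVector[Double] = {
  val logPosterior = theta * testVector + pi                      // same log-space computation as predict()
  val shifted = (logPosterior - max(logPosterior)).map(math.exp)  // exponentiate without overflow
  shifted / sum(shifted)                                          // normalize so the classes sum to 1
}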
Hi,
This question was asked earlier and I did it in the way specified. I am
getting java.lang.ClassNotFoundException.
Can somebody explain all the steps required to build a Spark app using IntelliJ
(latest version), starting from creating the project to running it? I searched a
lot but couldn't
Hello,
I'm hoping to understand exactly what happens when a spark compiled app is
submitted to a spark stand-alone cluster master. Say, our master is A, and
workers are W1 and W2. Client machine C is submitting an app to the master
using spark-submit. Here's what I think happens:
* C submits
So far I have tried this and I am able to compile it successfully. There
isn't enough documentation on Spark for its usage with databases. I am using
AbstractFunction0 and AbstractFunction1 here. I am unable to access the
database. The jar just runs without doing anything when submitted. I want
Hello,
I have a big cluster running CDH 5.1.3 which I can't upgrade to 5.2.0 at the
current time.
I would like to run Spark-On-Yarn in that cluster.
I tried to compile Spark with CDH 5.1.3 and I got HDFS to work, but I am having
problems with the connection to Hive:
java.sql.SQLException: Could
How can I remove all the INFO logs that appear on the console when I submit
an application using spark-submit?
It works.
Thanks
On Mon, Nov 10, 2014 at 6:32 PM, YANG Fan idd...@gmail.com wrote:
Hi,
In conf/log4j.properties, change the following
log4j.rootCategory=INFO, console
to
log4j.rootCategory=WARN, console
This works for me.
Best,
Fan
On Mon, Nov 10, 2014 at 8:21 PM, Ritesh
It's hacky, but you could access these fields via reflection. It'd be
better to propose opening them up in a PR.
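A rough sketch of the reflection hack in Scala (the field names are taken from this thread; whether they survive compilation under exactly these names is an assumption, so treat this as illustrative only):

import org.apache.spark.mllib.classification.NaiveBayesModel

// Pull a private field off a model instance via plain Java reflection.
def readPrivateField(model: NaiveBayesModel, name: String): AnyRef = {
  val field = model.getClass.getDeclaredField(name)
  field.setAccessible(true)   // lift the private access check at runtime
  field.get(model)
}

// Hypothetical usage; the concrete types depend on the Spark version:
// val brzPi    = readPrivateField(model, "brzPi")
// val brzTheta = readPrivateField(model, "brzTheta")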
On Mon, Nov 10, 2014 at 9:25 AM, jatinpreet jatinpr...@gmail.com wrote:
Thanks for the answer. The variables brzPi and brzTheta are declared private.
I am writing my code with Java
Hello all, I'm hoping someone can help me with this hardware question. We have
an upcoming need to run our machine learning application on physical hardware.
Up until now, we've just rented a cloud-based high performance cluster, so my
understanding of the real relative performance tradeoffs
Hi,
How can we increase the executor memory of a running Spark cluster on YARN?
We want to increase the executor memory when new nodes are added to the
cluster. We are running Spark version 1.0.2.
Thanks
Mudassar
Hi,
I need a matrix with each row having an index, e.g., index = 0 for the first
row, index = 1 for the second row. Could someone tell me how to generate such an
IndexedRowMatrix from a RowMatrix?
Besides, does anyone have experience with multiplying two
distributed matrices, e.g., two
I see, thanks.
I'm not running on EC2, and I wouldn't like to start copying jars to all the
servers in the cluster.
Any ideas on how I can add this jar in a simple way?
Here are my failed attempts so far:
- adding the math3 jar to the lib folder in my project root. The math3 classes
did appear in
You may use RDD.zipWithIndex.
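A minimal sketch of that, assuming the RowMatrix already exists:

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

def toIndexedRowMatrix(mat: RowMatrix): IndexedRowMatrix = {
  // Pair each row vector with its position and wrap it as an IndexedRow.
  val indexedRows = mat.rows.zipWithIndex.map { case (vector, index) =>
    IndexedRow(index, vector)
  }
  new IndexedRowMatrix(indexedRows)
}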
On 11/10/14 10:03 PM, Lijun Wang wrote:
Hi,
I need a matrix with each row having a index, e.g., index = 0 for first
row, index = 1 for second row. Could someone tell me how to generate such
IndexedRowMatrix from an RowMatrix?
Besides, is there anyone
If you are using spark-submit with --master yarn, you can also pass it as a
flag: --executor-memory
On Mon, Nov 10, 2014 at 8:58 AM, Mudassar Sarwar
mudassar.sar...@northbaysolutions.net wrote:
Hi,
How can we increase the executor memory of a running spark cluster on YARN?
We want to increase
Hey, guys
Feel free to ask for more details if my questions are not clear.
Any insight here?
Thanks, I will try it out and raise a request for making the variables
accessible.
An unrelated question: do you think the probability value thus calculated
will be a good measure of confidence in prediction? I have been reading
mixed opinions about the same.
Jatin
On 11/6/14 1:39 AM, Hao Ren wrote:
Hi,
I would like to understand the pipeline of Spark's operations (transformations
and actions) and some details on block storage.
Let's consider the following code:
val rdd1 = sc.textFile("hdfs://...")
rdd1.map(func1).map(func2).count
For example, we
On Mon, Nov 10, 2014 at 10:52 PM, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:
Tasks are now getting submitted, but many tasks don't happen.
Like, after opening the spark-shell, I load a text file from disk and try
printing its contents as:
-- Forwarded message --
From: Ritesh Kumar Singh riteshoneinamill...@gmail.com
Date: Mon, Nov 10, 2014 at 10:52 PM
Subject: Re: Executor Lost Failure
To: Akhil Das ak...@sigmoidanalytics.com
Tasks are now getting submitted, but many tasks don't happen.
Like, after opening the
Hi,
In my application I am doing something like this: new
StreamingContext(sparkConf, Seconds(10)).textFileStream("logs/"), and I
get some unknown exceptions when I copy a file of about 800 MB into that
folder (logs/). I have a single worker running with 512 MB of memory.
Can anyone tell me if
Hi,
Is there any plan to bump the Kafka version dependency in Spark 1.2 from
0.8.0 to 0.8.1.1?
Current dependency is still on Kafka 0.8.0
https://github.com/apache/spark/blob/branch-1.2/external/kafka/pom.xml
Thanks
Bhaskie
I don't think there are any regular SNAPSHOT builds published to Maven
Central. You can always mvn install the build into your local repo or any
shared repo you want.
If you just want a recentish build of 1.2.0 without rolling your own you
could point to
thanks, that looks good.
Entire file in a window.
On Mon, Nov 10, 2014 at 9:20 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
Hi,
In my application I am doing something like this new
StreamingContext(sparkConf, Seconds(10)).textFileStream(logs/), and I
get some unknown exceptions when I copy a file with about 800 MB
I have some previous experience with Apache Oozie while I was developing in
Apache Pig. Now I am working exclusively with Apache Spark and I am looking
for a tool with similar functionality. Is Oozie recommended? What about
Luigi? What do you use / recommend?
I have used Oozie for all our workflows with Spark apps, but you will have
to use a Java action as the workflow element. I am interested in anyone's
experience with Luigi and/or any other tools.
On Mon, Nov 10, 2014 at 10:34 AM, Adamantios Corais
adamantios.cor...@gmail.com wrote:
I have some
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still
talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version
exactly?
Matei
On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta bhas...@gmail.com wrote:
Hi,
Is there any plan to bump the Kafka
Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka
Yes it can.
0.8.1 is fully compatible with 0.8. It is buried on this page:
http://kafka.apache.org/documentation.html
In addition to the pom version bump, SPARK-2492 would bring the Kafka streaming
receiver (which was originally
Version 0.8.2-beta is published. I'd consider waiting on this; it has quite a
few nice changes coming.
https://archive.apache.org/dist/kafka/0.8.2-beta/RELEASE_NOTES.html
I started the 0.8.1.1 upgrade in a branch a few weeks ago but abandoned it
because I wasn't sure if there was interest beyond
Hi,
Does there exist a way to serialize Row objects to JSON? In the absence of
such a way, is this the right way to go:
* get hold of schema using SchemaRDD.schema
* Iterate through each individual Row as a Seq and use the schema to
convert values in the row to JSON types.
Thanks,
Akshat
Tried --driver-java-options and SPARK_JAVA_OPTS; neither of them worked.
Had to change the default one and rebuild.
Hello Spark and MLlib folks,
So a common problem in the real world of using machine learning is that
some data analysts use tools like R, but the data engineers out
there will use more advanced systems like Spark MLlib or even Python
scikit-learn.
In the real world, I want to have a system
Hello all,
I have some text data that I am running different algorithms on.
I had no problems with LibSVM and Naive Bayes on the same data,
but when I run Decision Tree, the execution hangs in the middle
of DecisionTree.trainClassifier(). The only difference from the example
given on the site
When I have a multi-step process flow like this:
A -> B -> C -> D -> E -> F
I need to store B's and D's results into Parquet files:
B.saveAsParquetFile
D.saveAsParquetFile
If I don't cache/persist any step, Spark might recompute A, B, C, D and E
if something goes wrong in F.
Of course, I'd better
Hello,
CDH 5.1.3 ships with a version of Hive that's not entirely the same as
the Hive Spark 1.1 supports. So when building your custom Spark, you
should make sure you change all the dependency versions to point to
the CDH versions.
IIRC Spark depends on org.spark-project.hive:0.12.0, you'd have
Even after changing
core/src/main/resources/org/apache/spark/log4j-defaults.properties to WARN
followed by a rebuild, the log level is still INFO.
Any other suggestions?
I am trying to use Spark with Spray and I have a dependency problem with
quasiquotes. The issue comes up only when I include the Spark dependencies. I
am not sure how this one can be excluded.
Jianshi: can you let me know what versions of Spray + Akka + Spark you are
using?
[error]
Some console messages:
14/11/10 20:04:33 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:46713
14/11/10 20:04:33 INFO util.Utils: Successfully started service 'HTTP file
server' on port 46713.
14/11/10 20:04:34 INFO server.Server: jetty-8.y.z-SNAPSHOT
14/11/10 20:04:34 INFO
I am trying to use Spark with Spray and I have a dependency problem with
quasiquotes. The issue comes up only when I include the Spark dependencies. I
am not sure how this one can be excluded.
Has anyone tried this before and gotten it to work?
[error] Modules were resolved with conflicting
Hi,
The model weights are not updating for streaming linear regression. The code and
data below are what I am running.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
At 2014-11-10 22:53:49 +0530, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:
Tasks are now getting submitted, but many tasks don't happen.
Like, after opening the spark-shell, I load a text file from disk and try
printing its contents as:
sc.textFile("/path/to/file").foreach(println)
Hi,
I'm running Spark in standalone mode: 1 master, 15 slaves. I started the nodes
with the EC2 script, and I'm currently breaking the job into many small parts
(~2,000) to better examine progress and failures.
Pretty basic - submitting a PySpark job (via spark-submit) to the cluster. The
job
Well you can always create C by loading B from disk, and likewise for
E / D. No need for any custom procedure.
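A small sketch of that pattern against the Spark 1.x SQL API (the helper and path handling are illustrative):

import org.apache.spark.sql.{SQLContext, SchemaRDD}

// Save an intermediate SchemaRDD and hand back a copy that reads from disk,
// so downstream steps no longer depend on the original lineage.
def persistAndReload(sqlContext: SQLContext, step: SchemaRDD, path: String): SchemaRDD = {
  step.saveAsParquetFile(path)
  sqlContext.parquetFile(path)
}

// Hypothetical usage: val bFromDisk = persistAndReload(sqlContext, b, "hdfs:///tmp/b.parquet")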
On Mon, Nov 10, 2014 at 7:33 PM, Benyi Wang bewang.t...@gmail.com wrote:
When I have a multi-step process flow like this:
A -> B -> C -> D -> E -> F
I need to store B and D's results
Nevermind - I don't know what I was thinking with the below. It's just
maxTaskFailures causing the job to fail.
From: Griffiths, Michael (NYC-RPM) [mailto:michael.griffi...@reprisemedia.com]
Sent: Monday, November 10, 2014 4:48 PM
To: user@spark.apache.org
Subject: Spark Master crashes job on
Hi,
I have some data generated by some utilities that return the results as
a List[String]. I would like to join this with a DStream of strings. How
can I do this? I tried the following but get Scala compiler errors:
val list_scalaconverted = ssc.sparkContext.parallelize(listvalues.toArray())
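Not sure if this is what the compiler errors were about, but one sketch (names are illustrative) is to turn the static list into a keyed RDD and join it inside DStream.transform; on older Spark versions you may also need import org.apache.spark.SparkContext._ for the pair-RDD join:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// listValues is the static data; stream is the DStream[String] to join against.
def joinWithList(ssc: StreamingContext,
                 stream: DStream[String],
                 listValues: List[String]): DStream[String] = {
  val staticRdd = ssc.sparkContext.parallelize(listValues).map(s => (s, ()))
  stream
    .map(s => (s, ()))
    .transform(rdd => rdd.join(staticRdd))  // keyed join of each batch against the static RDD
    .map { case (s, _) => s }               // keep only the strings present in both
}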
I am embarrassed to admit it, but I can't get a basic 'word count' to work
under Kafka/Spark Streaming. My code looks like this. I don't see any
word counts in the console output. Also, I don't see any output in the UI. Needless
to say, I am a newbie in both Spark and Kafka.
Please help. Thanks.
Hello,
I am trying to build Spark from source using the following:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean package
This works OK with branch-1.1; when I switch to branch-1.2, I get the
I ran into the same issue, reverting this commit seems to work
https://github.com/apache/spark/commit/bd86cb1738800a0aa4c88b9afdba2f97ac6cbf25
What is the Spark master that you are using? Use local[4], not local,
if you are running locally.
On Mon, Nov 10, 2014 at 3:01 PM, Something Something
mailinglist...@gmail.com wrote:
I am embarrassed to admit but I can't get a basic 'word count' to work under
Kafka/Spark streaming. My code
Ah, thanks. Reverted back a few days and it works now.
I am not running locally. The Spark master is:
spark://machine name:7077
On Mon, Nov 10, 2014 at 3:47 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
What is the Spark master that you are using. Use local[4], not local
if you are running locally.
On Mon, Nov 10, 2014 at 3:01 PM,
I was testing out the Spark Thrift JDBC server by running a simple query in
the beeline client. Spark itself is running on a YARN cluster.
However, when I run a query in beeline, I see no running jobs in the
Spark UI (completely empty), and the YARN UI seems to indicate that the
submitted query
public static void main(String[] args) throws Exception {
    System.out.println("Set Log to Warn");
    Logger rootLogger = Logger.getRootLogger();   // org.apache.log4j.Logger
    rootLogger.setLevel(Level.WARN);              // org.apache.log4j.Level
    ...
works for me
Josh,
On Tue, Nov 11, 2014 at 7:43 AM, Josh J joshjd...@gmail.com wrote:
I have some data generated by some utilities that return the results as
a List[String]. I would like to join this with a DStream of strings. How
can I do this? I tried the following but get Scala compiler errors:
val
Akshat
On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya aara...@gmail.com wrote:
Does there exist a way to serialize Row objects to JSON.
I can't think of any other way than the one you proposed. A Row is more or
less an Array[Object], so you need to read JSON key and data type from the
I have the following code where I'm using RDD 'union' and 'subtractByKey' to
create a new baseline RDD. All of my RDDs are key-value pairs with the 'key' a
String and the 'value' a String (an XML document).
// Merge the daily
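The code above is cut off, but a reduced sketch of that sort of merge (names and contents are made up) might look like this: daily records replace baseline records that share a key, and everything else is carried over.

import org.apache.spark.rdd.RDD

// Both RDDs are (key -> XML document) pairs.
def mergeBaseline(baseline: RDD[(String, String)],
                  daily: RDD[(String, String)]): RDD[(String, String)] = {
  // Drop baseline entries superseded by today's data, then add today's records.
  baseline.subtractByKey(daily).union(daily)
}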
Hi,
I am building a graph from a large CSV file. Each record contains a couple of
nodes and about 10 edges. When I try to load a large portion of the graph,
using multiple partitions, I get inconsistent results in the number of edges
between different runs. However, if I use a single
You should supply more information about your input data.
For example, I generate an IndexedRowMatrix from the ALS algorithm input data
format; my code looks like this:
val inputData = sc.textFile(fname).map {
  line =>
    val parts = line.trim.split(' ')
There is a JIRA for adding this:
https://issues.apache.org/jira/browse/SPARK-4228
Your described approach sounds reasonable.
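For what it's worth, a rough sketch of that per-row conversion (the package names follow newer Spark releases, and the string escaping is deliberately naive):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Walk the schema and the row in parallel and emit one JSON object per row.
def rowToJson(row: Row, schema: StructType): String =
  schema.fields.zipWithIndex.map { case (field, i) =>
    val value = field.dataType match {
      case IntegerType | LongType | DoubleType | FloatType | BooleanType =>
        String.valueOf(row(i))                  // numbers/booleans: emit as-is
      case _ =>
        "\"" + String.valueOf(row(i)) + "\""    // everything else: quote as a string
    }
    "\"" + field.name + "\":" + value
  }.mkString("{", ",", "}")

// Hypothetical usage: schemaRdd.map(r => rowToJson(r, schemaRdd.schema))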
On Mon, Nov 10, 2014 at 5:10 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Akshat
On Tue, Nov 11, 2014 at 4:12 AM, Akshat Aranya aara...@gmail.com wrote:
Does there
Hi, all. I'm not sure whether someone has reported this bug:
There should be a checkpoint() method in EdgeRDD and VertexRDD as follows:
override def checkpoint(): Unit = { partitionsRDD.checkpoint() }
Current EdgeRDD and VertexRDD use *RDD.checkpoint()*, which only checkpoints
the
Hey Sadhan,
I really don't think this is Spark log... Unlike Shark, Spark SQL
doesn't even provide a Hive mode to let you execute queries against
Hive. Would you please check whether there is an existing HiveServer2
running there? Spark SQL HiveThriftServer2 is just a Spark port of
Hi, all. I want to seek suggestions on how to do checkpointing more
efficiently, especially for iterative applications written with GraphX.
For iterative applications, the lineage of a job can be very long, which
easily causes a stack overflow error. A solution is to checkpoint.
However,
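As a rough illustration of the usual pattern (the interval and the toy RDD are made up, and sc is assumed to be the SparkContext), checkpointing every few iterations keeps the lineage from growing without bound:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")

var current = sc.parallelize(1 to 100000).map(i => (i, 0L))
for (iteration <- 1 to 50) {
  current = current.mapValues(_ + 1)   // stands in for one iteration of the real algorithm
  if (iteration % 10 == 0) {
    current.cache()                    // keep the data around so checkpointing doesn't recompute it
    current.checkpoint()               // truncate the lineage
    current.count()                    // force evaluation so the checkpoint is actually written
  }
}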
Nice, we currently encounter a stack overflow error caused by this bug.
We also found that val partitionsRDD: RDD[(PartitionID, EdgePartition[ED,
VD])],
val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) will not
be serialized even without adding @transient.
However, transient can
Hi Srinivas,
Here's the versions I'm using.
<spark.version>1.2.0-SNAPSHOT</spark.version>
<spray.version>1.3.2</spray.version>
<spray.json.version>1.3.0</spray.json.version>
<akka.group>org.spark-project.akka</akka.group>
<akka.version>2.3.4-spark</akka.version>
I'm using
I am trying spark-shell on a single host and got some strange behavior from
spark-shell.
If I run bin/spark-shell without connecting to a master, it can access an HDFS
file on a remote cluster with Kerberos authentication.
scala> val textFile =
Hey Sandy,
Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to
print the contents of the objects. In addition, something else that helps is to
do the following:
{
val _arr = arr
models.map(... _arr ...)
}
Basically, copy the global variable into a local one.
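A slightly fuller illustration of that trick (the class and names are made up): copying the field into a local val means the closure captures only the array, not the whole enclosing object.

import org.apache.spark.rdd.RDD

class ModelRunner {
  val arr: Array[Double] = Array.fill(1000)(1.0)   // a field on the enclosing object

  def run(models: RDD[Int]): RDD[Double] = {
    val localArr = arr                             // copy the field into a local val
    // The closure below captures only localArr, so Spark does not try to
    // serialize the whole ModelRunner instance.
    models.map(i => localArr(i % localArr.length))
  }
}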
You just need to add --driver-library-path <directory> to your submit
command. And on your worker nodes, put the lib in the right work directory.
Hi again,
As Jimmy said, any thoughts about Luigi and/or any other tools? So far it
seems that Oozie is the best and only choice here. Is that right?
On Mon, Nov 10, 2014 at 8:43 PM, Jimmy McErlain jimmy.mcerl...@gmail.com
wrote:
I have used Oozie for all our workflows with Spark apps but you