We implemented chi-squared tests in v1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166
and we will add more after v1.1. Feedback on which tests should come
first would be greatly appreciated. -Xiangrui
On Tue, Aug 19, 2014 at
Hi everyone!
I got an exception when I run my script with spark-shell:
I added
SPARK_JAVA_OPTS=-Dsun.io.serialization.extendedDebugInfo=true
in spark-env.sh to show the following stack:
org.apache.spark.SparkException: Task not serializable
at
It did the job.
Thanks. :)
On 19 August 2014 at 10:20, Sean Owen so...@cloudera.com wrote:
In that case, why not collectAsMap() and have the whole result as a
simple Map in memory? Then lookups are trivial. RDDs aren't
distributed maps.
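A minimal sketch of that suggestion (assuming "pairs" is a pair RDD small enough to fit in driver memory):
val lookup = pairs.collectAsMap()   // scala.collection.Map[K, V], materialized on the driver
val v = lookup.get("someKey")       // Option[V]: a plain in-memory lookup, no job launched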
On Tue, Aug 19, 2014 at 9:17 AM, Emmanuel Castanier
Hi,
I wonder if there is something like a (row) index of the elements in the
RDD. Specifically, my RDD is generated from a series of files, where the
value corresponds to the file contents. Ideally, I would like the keys
to be an enumeration of the file number, e.g. (0, file contents
I have a table with 4 columns: a, b, c, time
What I need is something like:
SELECT a, b, GroupFirst(c)
FROM t
GROUP BY a, b
GroupFirst means the first item of column c in each group,
and by "first" I mean the one with the minimal time in that group.
In Oracle/Sql Server, we could write:
WITH summary AS (
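In the Spark Scala API, one way to express GroupFirst is a reduceByKey that keeps, per (a, b) group, the c whose time is minimal (a sketch; Row4 and "rows" are hypothetical names):
case class Row4(a: String, b: String, c: String, time: Long)
val firstPerGroup = rows                               // rows: RDD[Row4]
  .map(r => ((r.a, r.b), (r.c, r.time)))
  .reduceByKey((x, y) => if (x._2 <= y._2) x else y)   // keep the earlier time
  .map { case ((a, b), (c, _)) => (a, b, c) }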
Hi,
I am a newbie to Spark Streaming, and I am quite confused about JavaDStream
in Spark Streaming. In my situation, after catching a message "Hello world"
from Kafka in a JavaDStream, I want to access the JavaDStream and change this
message to "Hello John", but I could not figure out how to do it.
Any idea
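A minimal sketch of the transformation (shown with the Scala DStream API for brevity; the JavaDStream map call is analogous, and "messages" is an assumed name for the Kafka stream):
val greeted = messages.map { msg =>
  if (msg == "Hello world") "Hello John" else msg
}
greeted.print()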
Looking at the source code of DStream.scala:
/**
 * Return a new DStream in which each RDD has a single element generated by counting each RDD
 * of this DStream.
 */
def count(): DStream[Long] = {
  this.map(_ => (null, 1L))
Hi,
I have a question about broadcast. I'm working on a clustering algorithm
close to KMeans. It seems that KMeans broadcasts the cluster centers at each
step. For the moment I just use my centers as an Array that I reference directly in
my map at each step. Could it be more efficient to use broadcast?
Thanks for the help.
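For comparison, a hedged sketch of the broadcast variant (names like "centers", "points" and "distance" are placeholders):
val bcCenters = sc.broadcast(centers)            // ship the array once per node instead of per task
val assigned = points.map { p =>
  val cs = bcCenters.value                       // read on the executor
  cs.indices.minBy(i => distance(cs(i), p))      // index of the closest center; distance() is a placeholder
}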
I ran this script again with bin/spark-shell --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer.
In the console, I can see:
scala> sc.getConf.getAll.foreach(println)
(spark.tachyonStore.folderName,spark-eaabe986-03cb-41bd-bde5-993c7db3f048)
Hi,
What's the difference between amplab docker
https://github.com/amplab/docker-scripts and spark docker
https://github.com/apache/spark/tree/master/docker?
Thanks,
Josh
Hi ,
How to increase the heap size?
What is the difference between spark executor memory and heap size?
Thanks & Regards,
Meethu M
On Monday, 18 August 2014 12:35 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
I believe spark.shuffle.memoryFraction is the one you are looking for.
How about this:
val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition })
new RDD[V](prev) {
  protected def getPartitions = prev.partitions
  def compute(split: Partition, context: TaskContext) = {
    context.addOnCompleteCallback(() => /*cleanup()*/)
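For reference, a filled-in version of that sketch (assumptions: setup() and cleanup() are the caller's own per-partition hooks, and addOnCompleteCallback is the Spark 1.0-era TaskContext API):
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

def withSetupAndCleanup[V: ClassTag](rdd: RDD[V])(setup: () => Unit, cleanup: () => Unit): RDD[V] = {
  // setup() runs lazily, when each partition's iterator is first computed
  val prev = rdd.mapPartitions { partition => setup(); partition }
  new RDD[V](prev) {
    protected def getPartitions: Array[Partition] = prev.partitions
    def compute(split: Partition, context: TaskContext): Iterator[V] = {
      context.addOnCompleteCallback(() => cleanup())   // runs when the task finishes
      prev.iterator(split, context)
    }
  }
}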
I got some time to look into it. It appears that Spark (latest git)
is doing this operation much more often compared to the Aug 1 version. Here
is the log from the operation I am referring to:
14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not
found, computing it
14/08/19 12:37:26 INFO
And duh, of course, you can do the setup in that new RDD as well :)
On Wed, Aug 20, 2014 at 1:59 AM, Victor Tso-Guillen v...@paxata.com wrote:
How about this:
val prev: RDD[V] = rdd.mapPartitions(partition => { /*setup()*/; partition })
new RDD[V](prev) {
  protected def getPartitions =
Hi,
I don't think there's an NPE issue when using DStream.count() even when there is no
data fed into Spark Streaming. I tested using Kafka in my local settings; both
cases are OK, with and without data consumed.
Actually you can see the details in ReceiverInputDStream: even when there is no data
in this
Hi Meethu,
spark.executor.memory is the Java heap size of the forked executor process.
Increasing spark.executor.memory actually increases the runtime heap
size of the executor process.
For the details of Spark configurations, you can check:
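For illustration, a minimal sketch of setting it programmatically (the 4g value is arbitrary; the master is typically supplied by spark-submit):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.memory", "4g")   // heap size of each forked executor JVM
val sc = new SparkContext(conf)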
Hi
I have this doubt:
I understand that each Java process runs in a different JVM instance. Now,
if I have a single executor on my machine and run several Java processes,
then there will be several JVM instances running.
Now, PROCESS_LOCAL means the data is located on the same JVM as the task
Hi,
Actually, several Java task threads run within a single executor, not separate processes,
so each executor has only one JVM runtime, which is shared by the different
task threads.
Thanks
Jerry
From: rapelly kartheek [mailto:kartheek.m...@gmail.com]
Sent: Wednesday, August 20, 2014 5:29 PM
To:
zipWithIndex() will give you something like an index for each element
in the RDD. If your files are small, you can use
SparkContext.wholeTextFiles() to load an RDD where each element is
(filename, content). Maybe that's what you are looking for if you are
really looking to extract an ID from the
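A short sketch combining the two suggestions (the path is an example):
val files = sc.wholeTextFiles("hdfs:///path/to/dir")   // RDD[(filename, content)]
val indexed = files.zipWithIndex().map { case ((name, content), idx) => (idx, content) }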
Thank you for the reply. I implemented my InputDStream to return None when
there's no data. After changing it to return an empty RDD, the exception is
gone.
I am curious why all the other processing worked correctly with my old,
incorrect implementation, with or without data. My actual code,
Hi,
I am wondering why some stages (like join, filter) are not visible in the
web UI. For example, this code:
val simple = sc.parallelize(Array.range(0, 100))
val simple2 = sc.parallelize(Array.range(0, 100))
val toJoin = simple.map(x => (x, x.toString + x.toString))
val rdd = simple2
.map(x =>
Could I write groupCount() in Scala, and then use it from PySpark? Care to
supply an example? I'm finding them hard to find :)
It's doable, but not so convenient. If you really care about the
performance
difference, you should write your program in Scala.
Is it possible to write my
I am working with Spark SQL and the Thrift server. I ran into an
interesting bug, and I am curious on what information/testing I can provide
to help narrow things down.
My setup is as follows:
Hive 0.12 with a table that has lots of columns (50+) stored as rcfile.
Spark-1.1.0-SNAPSHOT with Hive
Hi,
I tried to write a small program which shows that using cache() can speed up
execution, but the results with and without cache were similar. Could you help me
with this issue? I compute an RDD and use it later in two places, and
I thought the RDD would be recomputed on the second usage, but it seems it isn't:
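For what it's worth, a minimal pattern where cache() should be observable (expensiveTransform is a placeholder for real work):
val rdd = sc.parallelize(1 to 1000000).map(expensiveTransform).cache()
rdd.count()   // first action: computes the RDD and populates the cache
rdd.count()   // second action: served from the cache, should be much faster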
Yes, I'm pretty sure my YARN and HDFS HA configuration is correct. I can use
the UIs and HDFS command-line tools with HA support as expected (failing over
namenodes and resourcemanagers, etc.), so I believe this to be a Spark issue.
Like I mentioned earlier, if I manipulate the
Can't seem to figure this out. I've tried several different approaches without
success. For example, I've tried setting spark.executor.extraJavaOptions in
spark-defaults.conf (prior to starting the spark-shell) but this seems to have
no effect.
Outside of spark-shell (within a Java
Hi
I am using the program below in spark-shell to load and filter data from
the data sets. I am getting exceptions if I run the program multiple
times; if I restart the shell it works fine.
1) Please let me know what I am doing wrong.
2) Also, is there a way to make the program better
This is a long-running Spark Streaming job running in YARN, Spark v1.0.2 on
CDH5. The jobs will run for about 34-37 hours and then die due to this
FileNotFoundException. There's very little CPU or RAM usage; I'm running 2 x
cores, 2 x executors, 4g memory, YARN cluster mode.
Here's the stack
Is Spark SQL Thrift Server part of the 1.0.2 release? If not, which release is
the target?
Thanks,
Ken
What is the best way to run Hive queries in 1.0.2? In my case, Hive queries
will be invoked from a middle-tier webapp. I am thinking of using the Hive JDBC
driver.
Thanks,
Ken
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, August 20, 2014 9:38 AM
To: Tam, Ken K
Cc:
I have made a little progress: by downloading a prebuilt version of Spark
I can call spark-shell.cmd and bring up a Spark shell.
In the shell, things run.
Next I go to my development environment and try to run JavaWordCount.
I try -Dspark.master=spark://local[*]:55519
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote:
An “unaccepted” reply to this thread from Dean Chen suggested to build Spark
with a newer version of Hadoop (2.4.1) and this has worked to some extent.
I’m now able to submit jobs (omitting an explicit
Ah, sorry, forgot to talk about the second issue.
On Wed, Aug 20, 2014 at 8:54 AM, Matt Narrell matt.narr...@gmail.com wrote:
However, now the Spark jobs running in the ApplicationMaster on a given node
fails to find the active resourcemanager. Below is a log excerpt from one
of the assigned
Hi All:
I have a question about how to do the following operation in GraphX.
Suppose I have a graph with the following vertices and scores on the edges:
(V1 {type:B}) --100-- (V2 {type:A}) --10-- (V3 {type:A}) --100-- (V4 {type:B})
I would
Hi All, my dataset is fairly small: a CSV file with around half a million rows
and 600 features. Everything works when I set the maximum depth of the decision
tree to 5 or 6. However, I get this error for larger values of that parameter,
for example when I set it to 10. Have others encountered a
Hi,
I've used HDFS 2.3.0-cdh5.0.1, Mesos 0.19.1 and a recompiled Spark 1.0.2.
For security reasons, we run HDFS and Mesos as the user hdfs, an account
that is not in the root group, and a non-root user submits the Spark job on
Mesos. With no-switch_user, a simple job, which only reads data from
I don't know if Pregel would be necessary, since it's not iterative.
You could filter the graph by looking at edge triplets, and testing whether
source == B, dest == A, and edge value > 5.
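A sketch of that triplet filter (assumptions: the vertex attribute carries a vType field and the edge attribute is the numeric score):
val matching = graph.triplets.filter { t =>
  t.srcAttr.vType == "B" && t.dstAttr.vType == "A" && t.attr > 5
}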
Hi,
I was wondering if the Personalized PageRank algorithm is implemented in
GraphX. If the talks and presentations are to be believed (
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf)
it is, but I can't find the algorithm code (
Marcelo,
Specifying the driver-class-path yields behavior like
https://issues.apache.org/jira/browse/SPARK-2420 and
https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a can
of worms here if I also need to replace the guava dependencies.
Wouldn’t calling
This is likely due to a bug in shuffle file consolidation (which you have
enabled) which was hopefully fixed in 1.1 with this patch:
https://github.com/apache/spark/commit/78f2af582286b81e6dc9fa9d455ed2b369d933bd
Until 1.0.3 or 1.1 are released, the simplest solution is to disable
I want to use opencsv's CSVParser to parse csv lines using a script like
below in spark-shell:
import au.com.bytecode.opencsv.CSVParser;
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.apache.hadoop.fs.{Path, FileSystem}
class MyKryoRegistrator
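One plausible completion of that registrator (a sketch; it just registers CSVParser with Kryo, using the imports above):
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[CSVParser])   // let Kryo serialize the opencsv parser
  }
}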
Hi,
On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com wrote:
Specifying the driver-class-path yields behavior like
https://issues.apache.org/jira/browse/SPARK-2420 and
https://issues.apache.org/jira/browse/SPARK-2848 It feels like opening a
can of worms here if I also
Ok Marcelo,
Thanks for the quick and thorough replies. I’ll keep an eye on these tickets
and the mailing list to see how things move along.
mn
On Aug 20, 2014, at 1:33 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hi,
On Wed, Aug 20, 2014 at 11:59 AM, Matt Narrell matt.narr...@gmail.com
Hi Wang,
Have you tried doing this in your application?
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "yourpackage.MyKryoRegistrator")
You then don't need to specify it via the command line.
Date: Wed, 20 Aug 2014 12:25:14 -0700
From: ssti...@live.com
To: men...@gmail.com
Subject: RE: Decision tree: categorical variables
Date: Wed, 20 Aug 2014 12:09:52 -0700
Hi Xiangrui,
My data is in the following format:
Was able to resolve the parsing issue. Thanks!
From: ssti...@live.com
To: user@spark.apache.org
Subject: FW: Decision tree: categorical variables
Date: Wed, 20 Aug 2014 12:48:10 -0700
From: ssti...@live.com
To: men...@gmail.com
Subject: RE: Decision tree: categorical variables
Date: Wed, 20
I can do that in my application, but I really want to know how I can do it
in spark-shell because I usually prototype in spark-shell before I put the
code into an application.
On Wed, Aug 20, 2014 at 12:47 PM, Sameer Tilak ssti...@live.com wrote:
Hi Wang,
Have you tried doing this in your
Hey, thanks for your response.
And I had seen the triplets, but I'm not quite sure how the triplets would
get me that V1 is connected to V4. Maybe I need to spend more time
understanding it, I guess.
-Cesar
On Wed, Aug 20, 2014 at 10:56 AM, glxc r.ryan.mcc...@gmail.com wrote:
I don't know
I'm still bumping up against this issue: Spark (and Shark) are breaking
my inputs into 64MB-sized splits. Does anyone know where/how to configure
Spark so that it either doesn't split the inputs, or at least uses a
much larger split size? (E.g., 512MB.)
Thanks,
DR
On 07/15/2014 05:58 PM, David
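One knob worth trying (a sketch; the key is the old mapred API name, and 512MB matches the example size above):
sc.hadoopConfiguration.set("mapred.min.split.size", (512L * 1024 * 1024).toString)
val big = sc.textFile("hdfs:///path/to/input")   // path is an example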
At 2014-08-20 10:57:57 -0700, Mohit Singh mohit1...@gmail.com wrote:
I was wondering if Personalized Page Rank algorithm is implemented in graphx.
If the talks and presentation were to be believed
(https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf)
it
Thanks, I’ll go ahead and disable that setting for now.
From: Aaron Davidson ilike...@gmail.com
Date: Wednesday, August 20, 2014 at 3:20 PM
To: Silvio Fiorito silvio.fior...@granturing.com
Cc:
Cc:
At 2014-08-20 10:34:50 -0700, Cesar Arevalo ce...@zephyrhealthinc.com wrote:
I would like to get the type B vertices that are connected through type A
vertices where the edges have a score greater than 5. So, from the example
above I would like to get V1 and V4.
It sounds like you're trying to
Your rdd2 and rdd3 differ in two ways so it's hard to track the exact
effect of caching. In rdd3, in addition to the fact that rdd will be
cached, you are also doing a bunch of extra random number generation. So it
will be hard to isolate the effect of caching.
On Wed, Aug 20, 2014 at 7:48 AM,
For large objects, it will be more efficient to broadcast them. If your array
is small it won't really matter. How many centers do you have? Unless you
are finding that you have very large tasks (and Spark will print a warning
about this), it could be okay to just reference the array directly.
On Wed,
The reason is that some operators get pipelined into a single stage.
rdd.map(XX).filter(YY) executes in a single stage, since there is no
data movement needed between these operations.
If you call toDebugString on the final RDD it will give you some
information about the exact lineage.
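For example (the output shape varies by Spark version, but map and filter show up inside a single stage):
val rdd = sc.parallelize(0 until 100).map(x => (x, x.toString)).filter(_._1 % 2 == 0)
println(rdd.toDebugString)   // prints the lineage; map and filter are pipelined together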
Hi Ankur, thank you for your response. I already looked at the sample code
you sent. And I think the modification you are referring to is on the
tryMatch function of the PartialMatch class. I noticed you have a case in
there that checks for a pattern match, and I think that's the code I need
to
New to Apache Spark, trying to build a ScalaTest suite. Below is the error I'm
consistently seeing. Somehow Spark is trying to load a ScalaTest
AssertionHelper class which is not serializable. The test I have
specified doesn't even have any assertions in it. I added the JVM flag
You could use the programmatic API
(http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables)
to run the Hive queries directly.
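A minimal sketch of that route (the table name is a placeholder; on 1.0.x the method is hql(), later replaced by sql()):
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val results = hc.hql("SELECT key, value FROM some_table LIMIT 10").collect()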
On Wed, Aug 20, 2014 at 9:47 AM, Tam, Ken K ken@verizon.com wrote:
What is the best way to run Hive queries in 1.0.2? In my case. Hive
queries
My guess is that your test is trying to serialize a closure
referencing connectionInfo; that closure will have a reference to
the test instance, since the instance is needed to execute that
method.
Try to make the connectionInfo method local to the method where it's
needed, or declare it in an
Hi,
I doubt that the broadcast variable is your problem, since you are seeing:
org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: org.apache.spark.sql.hive.HiveContext$$anon$3
We have a knowledge base article that explains why this happens - it's
Hi Chris,
We have a knowledge base article to explain what's happening here:
https://github.com/databricks/spark-knowledgebase/blob/master/troubleshooting/javaionotserializableexception.md
Let me know if the article is not clear enough - I would be happy to edit
and improve it.
-Vida
On Wed,
Spark's memory settings are quite confusing to me.
My code is as follows.
spark-1.0.2-bin-2.4.1/bin/spark-submit --class SimpleApp \
--master yarn \
--deploy-mode cluster \
--queue sls_queue_1 \
--num-executors 3 \
--driver-memory 6g \
--executor-memory 10g \
--executor-cores 5 \
If you want to filter the table name, you can use
hc.sql("show tables").filter(row => !"test".equals(row.getString(0)))
Seems making functionRegistry transient can fix the error.
On Wed, Aug 20, 2014 at 8:53 PM, Vida Ha v...@databricks.com wrote:
Hi,
I doubt the the broadcast variable is your
Hi,
I tried to write to a text file from a DStream in Spark Streaming, using
DStream.saveAsTextFile("test", "output"), but it did not work.
Any suggestions?
Thanks in advance.
Cuong
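Possibly relevant: the DStream API's method is saveAsTextFiles (plural), taking a prefix and a suffix. A sketch, assuming "stream" is the DStream in question:
stream.saveAsTextFiles("test", "output")   // writes a test-<batch time>.output directory per batch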
PR is https://github.com/apache/spark/pull/2074.
--
From: Yin Huai huaiyin@gmail.com
Sent: 8/20/2014 10:56 PM
To: Vida Ha v...@databricks.com
Cc: tianyi tia...@asiainfo.com; Fengyun RAO raofeng...@gmail.com;
user@spark.apache.org
Subject: Re: Got
Trying to answer your other question:
One sortByKey is triggered by the range partitioner, which samples the data to calculate the
range boundaries; this in turn triggers the first reduceByKey.
The second sortByKey does the real work of sorting based on the partitioning
calculated, which again triggers the
That command line you mention in your e-mail doesn't look like
something started by Spark. Spark would start one of
ApplicationMaster, ExecutorRunner or CoarseGrainedSchedulerBackend,
not org.apache.hadoop.mapred.YarnChild.
On Wed, Aug 20, 2014 at 6:56 PM, centerqi hu cente...@gmail.com wrote:
I am trying to run SQL queries over streaming data in Spark. This looks
pretty straightforward, but when I try it, I get the error "table not found:
tablename". It is unable to find the table I've registered.
Using Spark SQL with batch data works fine, so I'm thinking it has to do with
how I'm calling
Hi,
On Thu, Aug 21, 2014 at 2:19 PM, praveshjain1991 praveshjain1...@gmail.com
wrote:
Using Spark SQL with batch data works fine so I'm thinking it has to do
with
how I'm calling streamingcontext.start(). Any ideas what is the issue? Here
is the code:
Please have a look at