Thanks for pointing to the issue.
Yes, I think it's the same issue; below is the exception:
ERROR OneForOneStrategy: TestCheckpointStreamingJson$1
java.io.NotSerializableException: TestCheckpointStreamingJson
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at
No, not the same thing then. This just means you accidentally have a
reference to the unserializable enclosing test class in your code.
Just make sure the reference is severed.
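For example, the usual fix in Scala (a minimal sketch; lookup, lines and parse are made-up names) is to copy whatever the closure needs out of the enclosing class into a local val, so only that val is captured instead of this:

// Bad: referencing a field inside the closure drags the whole outer instance in.
// val parsed = lines.map(x => parse(x, this.lookup))
// Good: copy the field into a local val first; only the val gets serialized.
val localLookup = lookup
val parsed = lines.map(x => parse(x, localLookup))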
On Thu, Nov 6, 2014 at 8:00 AM, Vasu C vasuc.bigd...@gmail.com wrote:
Thanks for pointing to the issue.
Yes I think
When we enable checkpointing and use JsonRDD, we get the following error. Is this
a bug?
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.rdd.RDD.<init>(RDD.scala:125)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:103)
I meant above that, in the case of categorical variables, it might be more
efficient to create a node for each categorical value. Is there a reason why
Spark went down the binary route?
thanks,
I haven't seen that done before, which may be most of the reason - I am not
sure that is common practice.
I can see upsides - you need not pick candidate splits to test since there
is only one N-way rule possible. The binary split equivalent is N levels
instead of 1.
The big problem is that you
The default value is infinite, so you need to enable it. Personally I've set up a
couple of cron jobs to clean up /tmp and /var/run/spark.
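For reference, a minimal sketch of turning the TTL-based cleaner on (the value is in seconds; 3600 is just an example):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cleaner-ttl-example")
  .set("spark.cleaner.ttl", "3600") // clean up old metadata/shuffle data after an hour
val sc = new SparkContext(conf)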
On 06.11.2014, at 08:15, Romi Kuntsman r...@totango.com wrote:
Hello,
Spark has an internal cleanup mechanism
(defined by spark.cleaner.ttl, see
Hello,
There is a big compelling reason for binary splits in general for
trees: a split is made if the difference between the two resulting
branches is significant. You also want to compare the significance of
this candidate split vs all the other candidate splits. There are many
statistical tests
Thanks for the reply, Sean.
I can see that splitting on all the categories would probably overfit
the tree; on the other hand, it might give more insight into the
subcategories (it would probably only work if the data is uniformly
distributed across the categories).
I haven't really found any
The only difference between the two setups (if you change the executor
cores) is how many tasks are running in parallel (the number of tasks would
depend on other factors), so try to inspect the stages while running
(probably easier to do that with longer-running tasks) by clicking on one of
Hi Sean,
Below is my Java code, using Spark 1.1.0. I'm still getting the same error.
The Bean class here is serializable. I'm not sure where exactly the problem is.
What am I doing wrong here?
public class StreamingJson {
public static void main(String[] args) throws Exception {
final String HDFS_FILE_LOC
You can imagine this same logic applying to the continuous case. E.g. what
if all the quartiles or deciles of a particular value have different
behavior - this could capture that too. Or what if some combination of
features was highly discriminative, but only into n buckets rather than
two... you
Erm, you are trying to do all the work in the create() method. This is
definitely not what you want to do. It is just supposed to make the
JavaStreamingContext. A further problem is that you're using
anonymous inner classes, which are non-static and contain a reference
to the outer class. The
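For comparison, a rough Scala sketch of the intended pattern, where the factory only builds the context and defines the DStream graph (the checkpoint path and batch interval are made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("streaming-json")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define input DStreams and transformations here, nothing more
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()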
Excellent. Is there an example of this somewhere?
Sent from my iPhone
On Nov 6, 2014, at 1:43 AM, Sean Owen so...@cloudera.com wrote:
Broadcast vars should work fine in Spark streaming. Broadcast vars are
immutable however. If you have some info to cache which might change
from batch to
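A rough illustration of the read-only case (ssc, lines and the map contents are made up):

// Broadcast a small lookup table once; every batch can read it, but it never changes.
val lookup = ssc.sparkContext.broadcast(Map("a" -> 1, "b" -> 2))
val enriched = lines.map(word => (word, lookup.value.getOrElse(word, 0)))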
Hi,
quick question: I found this:
http://docs.sigmoidanalytics.com/index.php/Problems_and_their_Solutions#Multiple_SparkContext:Failed_to_bind_to:.2F127.0.1.1:45916
My main question: is this constraint still valid? Am I not allowed to have
two SparkContexts pointing to the same Spark Master in
Matei,
Thanks for the reply.
I don't worry that much about more code because I'm migrating from MapReduce, so
I have existing code to handle it. But if I want to use a new technology, I will
always prefer the right way over a temporary easy way! I will go with RDD
first to test the performance.
Thanks!
Shuai
Even though the stage UI has min, 25th%, median, 75th%, and max durations,
I am often still left clueless about the distribution. For example, 100 out
of 200 tasks (started at the same time) have completed in 1 hour. How much
longer do I have to wait? I cannot guess well based on the five numbers.
Help please!
I am getting an exception in spark-shell at the following line:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to term
hive
in package org.apache.hadoop which is not available.
It may be completely missing from
Hi Guys,
I am doing some tests with Spark Streaming and Kafka, but I have seen something
strange. I have modified JavaKafkaWordCount to use reduceByKeyAndWindow and
to print on the screen the accumulated counts of the words. In the beginning
Spark works very well; in each interaction the
Can you be more specific: what version of Spark, Hive, Hadoop, etc.?
What are you trying to do? What are the issues you are seeing?
J
*JIMMY MCERLAIN*
DATA SCIENTIST (NERD)
*. . . . . . . . . . . . . . . . . .*
*IF WE CAN’T DOUBLE YOUR SALES,*
*ONE OF US IS IN THE WRONG
What version of Spark are you using? Did you compile your Spark version
and if so, what compile options did you use?
On 11/6/14, 9:22 AM, tridib tridib.sama...@live.com wrote:
Help please!
I am using spark 1.1.0.
I built it using:
./make-distribution.sh -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
-DskipTests
My ultimate goal is to execute a query on a parquet file with a nested structure
and cast a date string to Date. This is required to calculate the age of a Person.
This is my window:
reduceByKeyAndWindow(
  new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) { return i1 + i2; }
  },
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
Hi all,
Is there a way in Spark to send data (RDD[Array]) from the Scala component to
the Python component?
I saw a method that serialises double arrays (serializeDoubleMatrix), but it
requires the data to be collected first. I assume that step would pull all
the data to the driver.
Thanks,
I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
the same bug running the 'sort.py' example. On a smaller data set, it worked
fine. On a larger data set I got this error:
Traceback (most recent call last):
File "/home/skane/spark/examples/src/main/python/sort.py",
Hi,
I am using rdd.sortBy(_._2) to get an RDD sorted by value. The default order
is ascending order. How can I get it sorted in descending order? I could
not find an option to specify the order.
I need to get the top K elements of the list sorted in descending order. If
there is no option to
/usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java
org.apache.spark.deploy.SparkSubmitDriverBootstrapper
When I execute /usr/local/spark-1.1.0/bin/spark-submit local[32] for my
app, I see two processes get spun off. One is the
org.apache.spark.deploy.SparkSubmitDriverBootstrapper and
Yes, you can sort in descending order; you simply pass a boolean value as the
second argument to the sortBy function. The default is ascending.
So it will look like:
rdd.sortBy(_._2, false)
Read more over here
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD
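As an aside, if you only need the top K elements by value (as in the original question), a sketch that avoids sorting the whole RDD:

// Assuming an RDD of (String, Int) pairs: take the 5 largest by value, already in descending order.
val topK: Array[(String, Int)] = rdd.top(5)(Ordering.by(_._2))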
Thanks
Best Regards
On
Thanks. I was looking at older RDD documentation that did not mention the
ordering option.
Hello,
I am new to Spark and trying to run a Spark program (bundled as a jar) in an
EMR cluster.
In one terminal session, I am loading data into a Kinesis stream.
In another window, I am trying to run the Spark Streaming program and
trying to print out the output.
Whenever I run the Spark
It can, but currently that method uses the default Hive SerDe, which is not
very robust (it does not deal well with \n in strings) and probably is not
super fast. You'll also need to be using a HiveContext for it to work.
On Tue, Nov 4, 2014 at 8:20 PM, vdiwakar.malladi vdiwakar.mall...@gmail.com
Those are the same options I used, except I had --tgz to package it and I built
off of the master branch. Unfortunately, my only guess is that these errors
stem from your build environment. In your spark assembly, do you have any
classes which belong to the org.apache.hadoop.hive package?
It should be fixed in 1.1+.
Do you have a script to reproduce it?
On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote:
I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
the same bug running the 'sort.py' example. On a smaller data set, it worked
Benjamin,
Thanks for the snippet. I have tried using it, but unfortunately I get the
following exception. I am clueless as to what might be wrong. Any ideas?
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at
Hello, I'm trying to find the best way of redeploying a Spark Streaming
application. Ideally, I was thinking of a scenario where a build server
packages up a jar and a deployment step submits it to a Spark Master. On the
next successful build, the next version would get deployed taking down the
You’ve basically got it.
The deployment step can simply be scp-ing the file to a known location on the
server and then executing a run script on the server that actually runs
spark-submit.
From: Ashic Mahtab as...@live.com
Date: Thursday, November 6, 2014 at 5:01 PM
To:
See this thread for examples of sparse matrix x sparse matrix:
https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA
We thought about providing matrix multiplies on CoordinateMatrix, however,
the matrices have to be very dense for the overhead of having many little
(i, j, value) objects
Hello all,
I am running the following operations:
val part1 = maOutput.toArray.flatten
val part2 = sc.parallelize(part1)
val reduceOutput = part2.combineByKey(
  (v) => (v, 1),
  (acc: (Double, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Double, Int), acc2:
We are looking at consuming a Kafka stream using Spark Streaming, transforming
it into various subsets (applying some transformation or de-normalizing some
fields, etc.), and feeding it back into Kafka as a different topic for downstream
consumers.
Wanted to know if there are any existing patterns
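Roughly the shape we have in mind (a sketch only; transformed is the derived DStream and sendToKafka is a placeholder for whatever Kafka producer API we end up using):

transformed.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // create or reuse one producer per partition, not one per record
    records.foreach(record => sendToKafka("derived-topic", record))
  }
}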
I reproduced the problem in mllib tests ALSSuite.scala using the following
functions:
val arrayPredict = userProductsRDD.map{ case (user, product) =>
val recommendedProducts = model.recommendProducts(user, products)
val productScore = recommendedProducts.find{ x => x.product
Hello,
Is there a built-in way or connector to store DStream results into an
existing Hive ORC table using the Hive/HCatalog Streaming API?
Otherwise, do you have any suggestions regarding the implementation of such
component?
Thank you,
-Geovani
model.recommendProducts can only be called from the master then? I have a
set of 20% of the users on whom I am performing the test... the 20% of users are in an
RDD... if I have to collect them all to the master node and then call
model.recommendProducts, that's an issue...
Any idea how to optimize this so that
Thanks for the advice!
What seems to work for us is that I define the array type as: "type": {
"type": "array", "items": "string", "java-class": "java.util.ArrayList" }. It
seems to be creating an avro.Generic.List, which Spark doesn't know how to
serialize, instead of a guava.util.List, which Spark likes.
There is a JIRA for it: https://issues.apache.org/jira/browse/SPARK-3066
The easiest case is when one side is small. If both sides are large,
this is a super-expensive operation. We can do block-wise cross
product and then find top-k for each user.
Best,
Xiangrui
On Thu, Nov 6, 2014 at 4:51 PM,
Michael,
Thanks for the explanation. I was able to get this running.
On Wed, Oct 29, 2014 at 3:07 PM, Michael Armbrust mich...@databricks.com
wrote:
We are working on more helpful error messages, but in the meantime let me
explain how to read this output.
Ted, any pointers?
On Thu, Nov 6, 2014 at 4:46 PM, Luiz Geovani Vier lgv...@gmail.com wrote:
Hello,
Is there a built-in way or connector to store DStream results into an
existing Hive ORC table using the Hive/HCatalog Streaming API?
Otherwise, do you have any suggestions regarding the
Geovani,
You can use HiveContext to do inserts into a Hive table in a Streaming app just
as you would in a batch app. A DStream is really a collection of RDDs, so you can
run the insert from within foreachRDD. You just have to be careful that
you're not creating large numbers of small files.
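A rough sketch of that pattern with the Spark 1.x API (the table name, Event and parseEvent are made up; it assumes the Hive table already exists):

import org.apache.spark.sql.hive.HiveContext

case class Event(id: String, value: Int) // hypothetical schema matching the table

dstream.foreachRDD { rdd =>
  val hiveContext = new HiveContext(rdd.sparkContext) // in practice, reuse a single instance
  import hiveContext.createSchemaRDD // implicit RDD[Product] -> SchemaRDD conversion
  val events = rdd.map(parseEvent) // parseEvent: String => Event, hypothetical
  events.insertInto("events")
}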
Hi spark-users,
When I use spark-sql or beeline to query a large dataset, sometimes the
query result may cause driver OOM.
So I wonder: is there a config property in Spark SQL to limit the maximum
returned result size (without a LIMIT clause in the SQL query)?
For example, before the select query, I run
Hi,
I figured out that in standalone mode these configurations should be added to
the worker process's configs, e.g. by adding the following line to
spark-env.sh:
SPARK_WORKER_OPTS="-Dspark.executor.logs.rolling.strategy=time -Dspark.executor.logs.rolling.time.interval=daily"
Thanks for the response!! Will try to see the behaviour with Cache()
Will this work even with Kryo serialization?
Currently spark.closure.serializer must be
org.apache.spark.serializer.JavaSerializer, so the serialization of closure
functions won't involve Kryo. Kryo is only used to
serialize the data.
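In other words, a setting like the following (a minimal sketch) only affects how data is serialized, not how closures are serialized:

val conf = new SparkConf()
  .setAppName("kryo-for-data-only") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")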
Best Regards,
Shixiong Zhu
2014-11-07 12:27
Yes, I have the org.apache.hadoop.hive package in the Spark assembly.
I built spark-1.1.0 on a fresh machine. This issue is gone! Thank you all
for your help.
Thanks Regards
Tridib
Hi Naveen,
Nesting RDDs inside of transformations or actions is not supported. Instead
if you need access to the other RDDs contents you can try doing a join or
(if the data is small enough) collecting and broadcasting the second RDD.
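A quick sketch of the collect-and-broadcast option (names are made up, and it assumes otherRDD is small enough to fit on the driver):

val lookup = sc.broadcast(otherRDD.collect().toMap) // otherRDD: RDD[(K, V)], hypothetical
val joined = mainRDD.map(x => (x, lookup.value.get(x)))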
Cheers,
Holden :)
On Thu, Nov 6, 2014 at 10:28 PM, Naveen
Hi,
JavaRDD<Integer> distData = sc.parallelize(data);
On what basis does parallelize split the data into multiple partitions? How do we
control how many of these partitions are executed per executor?
For example, my data is a list of 1000 integers and I have a 2-node YARN
cluster. It is dividing into
What is the best way to save an MLlib model that you just trained and reload
it in the future? Specifically, I'm using the MLlib word2vec model...
thanks.
Plain old Java serialization is one straightforward approach if you're in
Java/Scala.
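For instance, a minimal sketch of that approach (the path is made up, and it assumes the model class is java-serializable):

import java.io._

// Save
val oos = new ObjectOutputStream(new FileOutputStream("/tmp/word2vec.model"))
oos.writeObject(model)
oos.close()

// Load
val ois = new ObjectInputStream(new FileInputStream("/tmp/word2vec.model"))
val restored = ois.readObject().asInstanceOf[org.apache.spark.mllib.feature.Word2VecModel]
ois.close()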
On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote:
what is the best way to save an mllib model that you just trained and
reload
it in the future? specifically, i'm using the mllib word2vec
That works. Is there a better way in Spark? This seems like the most
common feature for any machine learning work - being able to save your
model after training it and load it later.
On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:
Plain old java serialization is
Hi,
In the documentation I found something like this:
spark.default.parallelism
· Local mode: number of cores on the local machine
· Mesos fine grained mode: 8
· Others: total number of cores on all executor nodes or 2, whichever
is larger
I am using a 2-node cluster
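If I understand correctly, the default can also be overridden explicitly, e.g. (sketch):

// Set a global default for shuffle operations...
val conf = new SparkConf().set("spark.default.parallelism", "16")
// ...or pass the number of partitions per call:
val distData = sc.parallelize(data, 16)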
There's some work going on to support PMML -
https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
merged into master.
What are you used to doing in other environments? In R I'm used to running
save(), same with Matlab. In Python, either pickling things or dumping to
JSON seems
Hi Pawel,
That doc was created in the early days (Spark 0.8.0); you can of
course create multiple SparkContexts in the same driver program now.
Thanks
Best Regards
On Thu, Nov 6, 2014 at 9:30 PM, Paweł Szulc paul.sz...@gmail.com wrote:
Hi,
quick question: I found this: