Hi All,
I've read the official docs of Tachyon, and it doesn't seem to fit my usage. As I
understand it, it just caches files in memory, but I have a file with over a
million lines (about 70 MB), and retrieving the data and mapping it into a Map
variable takes several minutes, which I don't want to repeat every time.
How did you build your Spark 1.1.1?
On Wed, Dec 10, 2014 at 10:41 AM, amin mohebbi aminn_...@yahoo.com.invalid
wrote:
I'm trying to build a very simple Scala standalone app using MLlib,
but I get the following error when trying to build the program:
Object mllib is not a member of
Hello Guys,
Any insights on this??
If I'm not clear enough, my question is: how can I use the Kafka consumer with
Spark Streaming and not lose any data in case of failures?
On Tue, Dec 9, 2014 at 2:53 PM, Mukesh Jha me.mukesh@gmail.com wrote:
Hello Experts,
I'm working on a spark app which
Hi,
I had the same problem, and tried to compile with mvn -Pnetlib-lgpl
$ mvn -Pnetlib-lgpl -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean
package
Unfortunately, the resulting assembly jar still lacked the netlib-system class.
This command :
$ jar tvf
Hi all,
Having a strange issue that I can't find any previous issues for on the
mailing list or stack overflow.
Frequently we are getting "ACTOR SYSTEM CORRUPTED!! A Dispatcher can't have
less than 0 inhabitants!" with a stack trace from Akka in the executor
logs, and the executor is marked as
Hi,
I would like to use Spark to train a model, but use the model in some other
place, e.g. a servlet to do some classification in real time.
What is the best way to do this? Can I just copy a model file or something
and load it in the servlet? Can anybody point me to a good tutorial?
Hi Klaus,
PredictionIO is an open source product based on Spark MLlib for exactly
this purpose.
This is the tutorial for classification in particular:
http://docs.prediction.io/classification/quickstart/
You can add custom serving logic and retrieve prediction results through the
REST API/SDKs at
Hi Mukesh,
There’s been some great work on Spark Streaming reliability lately
I’m not aware of any doc yet (did I miss something ?) but you can look at the
ReliableKafkaReceiver’s test suite:
—
FG
On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Hello
[sorry for the botched half-message]
Hi Mukesh,
There’s been some great work on Spark Streaming reliability lately.
https://www.youtube.com/watch?v=jcJq3ZalXD8
Look at the links from:
https://issues.apache.org/jira/browse/SPARK-3129
I’m not aware of any doc yet (did I miss
Hi
Issue created https://issues.apache.org/jira/browse/SPARK-4816
This is probably a Maven-related question about profiles in child modules.
I couldn't find a clean solution, just a workaround: modify the pom.xml of the
mllib module to force activation of the netlib-lgpl profile.
Hope a Maven expert can help.
Hi Klaus,
There is no ideal method, but there is a workaround.
Train the model on the Spark or YARN cluster, then use RDD.saveAsTextFile
to store the model (its weights and intercept) on HDFS.
Load the weights and intercept back from HDFS, construct a GLM model, and
then run model.predict().
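For example, a rough sketch of that save/reload cycle (the trained model variable, the paths and someFeatureVector are illustrative, not a fixed API):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Save: weights first, then the intercept, one number per line; a single
// partition keeps the order stable.
sc.parallelize(model.weights.toArray :+ model.intercept, 1)
  .saveAsTextFile("hdfs:///models/my-glm")

// In another job: read the numbers back and rebuild the GLM.
val values = sc.textFile("hdfs:///models/my-glm").map(_.toDouble).collect()
val rebuilt = new LogisticRegressionModel(Vectors.dense(values.dropRight(1)), values.last)
val prediction = rebuilt.predict(someFeatureVector)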
Hi!
I have been using spark a lot recently and it's been running really well and
fast, but now when I increase the data size, it's starting to run into problems:
I have an RDD in the form of (String, Iterable[String]) - the Iterable[String]
was produced by a groupByKey() - and I perform a
You are rightly thinking that Spark should be able to just stream
this massive collection of pairs you are creating, and never need to
put it all in memory. That's true, but, your function actually creates
a huge collection of pairs in memory before Spark ever touches it.
This is going to
for (v1 <- values; v2 <- values) yield ((v1, v2), 1) will generate all the data
at once and return all of it to flatMap.
To solve your problem, you should use for (v1 <- values.iterator; v2 <-
values.iterator) yield ((v1, v2), 1), which will generate the data only when
it's necessary.
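To make that concrete, a small sketch (the RDD name grouped is just illustrative for your (String, Iterable[String]) RDD, and the reduceByKey is only an example of what might follow):

val pairCounts = grouped.flatMap { case (_, values) =>
  // The iterators keep the cross product lazy, so Spark can stream the pairs
  // instead of materializing all |values|^2 of them in memory first.
  for (v1 <- values.iterator; v2 <- values.iterator) yield ((v1, v2), 1)
}.reduceByKey(_ + _)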
Best Regards,
You can also serialize the model and use it in other places.
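For example, a plain Java-serialization sketch (the model variable, model class, paths and someFeatureVector are illustrative; MLlib's linear models are Serializable):

import java.io._
import org.apache.spark.mllib.classification.LogisticRegressionModel

// In the Spark job: write the trained model out.
val out = new ObjectOutputStream(new FileOutputStream("/tmp/model.bin"))
out.writeObject(model)
out.close()

// Elsewhere (e.g. in the servlet): read it back and predict.
val in = new ObjectInputStream(new FileInputStream("/tmp/model.bin"))
val loaded = in.readObject().asInstanceOf[LogisticRegressionModel]
in.close()
val score = loaded.predict(someFeatureVector)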
Best Regards,
Sonal
Founder, Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Wed, Dec 10, 2014 at 5:32 PM, Yanbo Liang yanboha...@gmail.com wrote:
Hi Klaus,
There is no ideal method but some
Hi!
Using an iterator solved the problem! I've been chewing on this for days, so
thanks a lot to both of you!! :)
Since an earlier version of my code used a self-join to do the same thing and
ran into the same problems, I just looked at the implementation of
PairRDDFunctions.join
Dear all,
I'm trying to understand the correct use case for ColumnSimilarity as
implemented in RowMatrix.
As far as I know, this function computes the similarities between the columns
of a given matrix. The DIMSUM paper says that it's efficient for large m (rows)
and small n (columns). In this case
Well, you're computing similarity of your features then. Whether it is
meaningful depends a bit on the nature of your features and more on
the similarity algorithm.
On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Dear all,
I'm trying to understand what is the
I have narrowed down my problem to some code plus an input file with a single
very small input (one line). I'm getting a
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 14634430, but as the input is so small I think there's something
else up. I'm not sure what.
Hi guys:
I want to call the RDD API from a servlet running in Tomcat, so I added
spark-core.jar to the WEB-INF/lib of the web project and deployed it to Tomcat.
But spark-core.jar contains the HttpServlet classes that belong to Jetty, so
there is a conflict. Can anybody tell me how to
Hi, I think this post
http://stackoverflow.com/questions/2681759/is-there-anyway-to-exclude-artifacts-inherited-from-a-parent-pom
should help you.
Alonso Isidoro Roman.
My favourite quotes (of today):
If debugging is the process of removing software bugs, then
programming must be the
If you have a tall-and-skinny matrix of m users and n products, column
similarity will give you an n x n matrix (a product x product matrix)... this
is also called a product correlation matrix... it can be cosine, Pearson or
other kinds of correlation... Note that if the entry is unobserved (user
Joanary did
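For reference, a minimal sketch of that call on a tall-and-skinny matrix (this assumes a Spark version that includes RowMatrix.columnSimilarities; the data is illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One row per user, one column per product (many rows, few columns).
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(0.0, 2.0, 1.0),
  Vectors.dense(4.0, 1.0, 0.0)))
val mat = new RowMatrix(rows)

// n x n product-to-product cosine similarities; the threshold variant
// uses DIMSUM sampling to trade a little accuracy for speed.
val exact  = mat.columnSimilarities()
val approx = mat.columnSimilarities(0.1)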
I am running Spark 1.1.0 on AWS EMR with a batch job that seems to be highly
parallelizable, in yarn-client mode. But Spark stops spawning any more
executors after spawning 6 executors, even though the YARN cluster has 15
healthy m1.large nodes. I even tried providing
'--num-executors
I'm hesitant to merge that PR in as it is using a brand new configuration
path that is different from the way that the rest of Spark SQL / Spark are
configured.
I'm suspicious that hitting max iterations is emblematic of some other
issue, as typically resolution happens bottom up, in a
Everything worked fine on Spark 1.1.0 until we upgraded to 1.1.1. For some of
our unit tests we saw the following exceptions. Any idea how to solve it?
Thanks!
java.lang.NoClassDefFoundError: io/netty/util/TimerTask at
org.apache.spark.storage.BlockManager.&lt;init&gt;(BlockManager.scala:72)
As Sean mentioned, you would be computing similar features then.
If you want to find similar users, I suggest running k-means with some
fixed number of clusters. It's not reasonable to try and compute all pairs
of similarities between 1bn items, so k-means with fixed k is more suitable
here.
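A minimal sketch of that k-means route (userVectors, someUserVector, k and the iteration count are all illustrative):

import org.apache.spark.mllib.clustering.KMeans

// userVectors: RDD[Vector] with one feature vector per user.
val model = KMeans.train(userVectors, 1000, 20)  // 1000 clusters, 20 iterations

// Users that land in the same cluster are the "similar" ones;
// look up the cluster of a given user's vector:
val cluster = model.predict(someUserVector)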
I am using SQLContext.jsonFile. If a valid JSON document contains newlines,
Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it works
fine. Is this known?
14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID
28)
com.fasterxml.jackson.core.JsonParseException:
I am using CDH5.1 and Spark 1.0.0.
Trying to configure the resources allocated to each application. How do I
do this? For example, I would like each app to use 2 cores and 8G of RAM. I
have tried using the pyspark command-line parameters --driver-memory and
--driver-cores and see no effect from those
Does spark-submit with SparkPi and spark-examples.jar work?
e.g.
./spark/bin/spark-submit --class org.apache.spark.examples.SparkPi
--master spark://xx.xx.xx.xx:7077 /path/to/examples.jar
On Tue, Dec 9, 2014 at 6:58 PM, Eric Tanner eric.tan...@justenough.com
wrote:
I have set up a cluster
Yep, that's the semantics, because sc.textFile only guarantees that lines will
be preserved across splits. It would be possible to write a custom input
format, but that hasn't been done yet. From the documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
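One possible workaround in the meantime, assuming each input file holds a single (possibly multi-line) JSON document, is to read whole files and feed the strings to jsonRDD:

// wholeTextFiles yields (path, fileContent) pairs; each content string may span lines.
val docs = sc.wholeTextFiles("/path/to/json/dir").map { case (_, content) => content }
val schemaRDD = sqlContext.jsonRDD(docs)
schemaRDD.printSchema()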
Hi All,
When I am running LinearRegressionWithSGD, I get the following error.
Any help on how to debug this further will be highly appreciated.
14/12/10 20:26:02 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 150323 at
Dear Spark'ers,
I'm trying to run a simple job using Spark Streaming (version 1.1.1) and
YARN (yarn-cluster), but unfortunately I'm facing many issues. In short, my
job does the following:
- Consumes a specific Kafka topic
- Writes its content to S3 or HDFS
Records in Kafka are in the form:
Hi folks, wondering if anyone has thoughts. Trying to create something akin
to a materialized view (sqlContext is a HiveContext connected to external
metastore):
val last2HourRdd = sqlContext.sql(s"select * from mytable")
// last2HourRdd.first prints out an org.apache.spark.sql.Row = [...] with
Have you checked to make sure the schema in the metastore matches the
schema in the parquet file? One way to test would be to just use
sqlContext.parquetFile(...) which infers the schema from the file instead
of using the metastore.
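Something along these lines (the table name and path are illustrative):

// Schema as stored in the Parquet file itself:
sqlContext.parquetFile("/user/hive/warehouse/mytable").printSchema()
// Schema as the metastore reports it:
sqlContext.sql("DESCRIBE mytable").collect().foreach(println)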
On Wed, Dec 10, 2014 at 12:46 PM, Yana Kadiyska
Hi all – I’ve been storing the model's userFeatures and productFeatures vectors
(which are generated internally) serialized on disk, and importing them in a
separate job.
From: Sonal Goyal sonalgoy...@gmail.com
Date: Wednesday, December 10, 2014 at 5:31 AM
To: Yanbo Liang
How many worker nodes are these 100 executors located on?
Hi spark users,
I'm trying to filter a JSON file that has the following schema using Spark
SQL:
root
 |-- user_id: string (nullable = true)
 |-- item: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- name:
I recently installed Spark standalone through Cloudera Manager on my CDH
5.2 cluster. CDH 5.2 is running on CentOS release 6.6. The version of
Spark, again through Cloudera, is 1.1. It is standalone.
I have a file in HDFS at /tmp/testfile.txt.
So what I do is I run spark-shell:
scala> val source
Anyone have luck with this? An issue encountered is handling multiple
languages - Python, Java, Scala - within one module: it is unclear how to
select two module SDKs.
Both Python and Scala facets were added to the spark-parent module. But
when the project-level SDK is not set to Python then the
I'm using Spark 1.1.0 and am seeing persisted RDDs being cleaned up too fast.
How can I inspect the size of an RDD in memory and get more information about
why it was cleaned up? There should be more than enough memory available on
the cluster to store them, and by default, the spark.cleaner.ttl is
Hi,
I'm wondering if I can change the RangePartitioner in sortBy to another
partitioner like HashPartitioner.
The first thing that comes into my head is that it cannot be replaced,
because the RangePartitioner is part of the sort algorithm.
If we call mapPartitions on a key-based partition after sorting, we
Hi
How can I do classification of big data using Spark?
Which machine learning algorithm is preferable for that?
--
CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/
Assistant Professor,
U P U Patel Department of Computer Engineering,
Chandubhai S. Patel Institute of Technology,
Hello,
What do you mean by an app that uses 2 cores and 8G of RAM?
Spark apps generally involve multiple processes. The command line
options you used affect only one of them (the driver). You may want to
take a look at similar configuration for executors. Also, check the
documentation:
Good catch. `Join` should use `Iterator`, too. I opened a JIRA here:
https://issues.apache.org/jira/browse/SPARK-4824
Best Regards,
Shixiong Zhu
2014-12-10 21:35 GMT+08:00 Johannes Simon johannes.si...@mail.de:
Hi!
Using an iterator solved the problem! I've been chewing on this for days,
so
The saveAsSequenceFile method works on an RDD; your object csv is a String.
If you are using spark-shell you can enter your object to see its datatype.
Some prefer Eclipse (and its code completion) to make their lives easier.
..Manas
-
Manas Kar
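A small sketch of the difference (in spark-shell; the data and path are illustrative):

// csv is just a String, so it has no saveAsSequenceFile method.
val csv: String = "a,1\nb,2"

// Put the data into an RDD of key/value pairs first, then save it.
val rdd = sc.parallelize(csv.split("\n").toSeq).map { line =>
  val cols = line.split(",")
  (cols(0), cols(1))
}
rdd.saveAsSequenceFile("/tmp/csv-as-seqfile")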
If you want to take out apple and orange you might want to try
dataRDD.filter(_._2 != "apple").filter(_._2 != "orange") and so on.
...Manas
-
Manas Kar
Can you use countApprox as a condition to check for a non-empty RDD?
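A rough sketch of that idea inside the streaming job (dstream, the timeout, confidence and output path are all illustrative):

dstream.foreachRDD { rdd =>
  // countApprox returns quickly with an estimate; treat a positive mean as "non-empty".
  val estimate = rdd.countApprox(1000L, 0.90).initialValue.mean
  if (estimate > 0) {
    rdd.saveAsTextFile("/out/batch-" + System.currentTimeMillis)
  }
}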
..Manas
-
Manas Kar
I am testing the decision tree using the iris.scale data set
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#iris).
In the data set there are three class labels: 1, 2, and 3. However, in the
following code I have to set numClasses = 4; otherwise I get an
ArrayIndexOutOfBoundsException
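Presumably this happens because MLlib expects class labels in the range 0 until numClasses, so with labels 1..3 either keep numClasses = 4 (class 0 simply unused) or shift the labels down. A sketch of the second option (the RDD name data is illustrative):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Shift labels 1,2,3 down to 0,1,2 so numClasses can stay 3.
val zeroBased = data.map(lp => LabeledPoint(lp.label - 1, lp.features))
val model = DecisionTree.trainClassifier(zeroBased,
  numClasses = 3,
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)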
Can anyone provide example code for using categorical features in a decision
tree?
Thanks!
-Yao
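A short sketch of how the categoricalFeaturesInfo argument is typically used (the feature indices, category counts and trainingData are illustrative):

import org.apache.spark.mllib.tree.DecisionTree

// Map of feature index -> number of categories; unlisted features are continuous.
// Here feature 0 is binary and feature 3 has 5 categories.
val model = DecisionTree.trainClassifier(trainingData,
  numClasses = 2,
  categoricalFeaturesInfo = Map(0 -> 2, 3 -> 5),
  impurity = "gini",
  maxDepth = 5,
  maxBins = 32)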
Couple of questions:
1. sqlContext.jsonFile reads a JSON file, infers the schema for the data
stored, and then returns a SchemaRDD. Now, I could also create a SchemaRDD
by reading the file as text (which returns an RDD[String]) and then using the
jsonRDD method. My question: is the jsonFile way of
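For what it's worth, both routes end up in the same place, a SchemaRDD with an inferred schema (the path is illustrative):

val viaFile = sqlContext.jsonFile("/data/events.json")
val viaRdd  = sqlContext.jsonRDD(sc.textFile("/data/events.json"))
viaFile.printSchema()
viaRdd.printSchema()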
Hi,
I have created a Parquet file from a case class using saveAsParquetFile,
then tried to reload it using parquetFile, but it fails.
Sample code is attached.
Any help would be appreciated.
Regards,
Rahul
rahul@...
sample_parquet.sample_parquet
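For comparison, a minimal round trip that works on 1.1 (the case class and paths here are illustrative, not taken from the attached sample):

case class Person(name: String, age: Int)

import sqlContext.createSchemaRDD  // implicit conversion RDD[Person] -> SchemaRDD

val people = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
people.saveAsParquetFile("/tmp/people.parquet")

val reloaded = sqlContext.parquetFile("/tmp/people.parquet")
reloaded.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 1").collect().foreach(println)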
The ContextCleaner uncaches RDDs that have gone out of scope on the driver.
So it's possible that the given RDD is no longer reachable in your
program's control flow, or else it'd be a bug in the ContextCleaner.
On Wed, Dec 10, 2014 at 5:34 PM, ankits ankitso...@gmail.com wrote:
I'm using spark
You could try adding the io.netty jar
(http://mvnrepository.com/artifact/io.netty/netty-all/4.0.23.Final) to the
classpath. Looks like that jar is missing.
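If the project is built with sbt, the corresponding dependency line would look like this (version taken from the link above):

libraryDependencies += "io.netty" % "netty-all" % "4.0.23.Final"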
Thanks
Best Regards
On Thu, Dec 11, 2014 at 12:15 AM, S. Zhou myx...@yahoo.com.invalid wrote:
Everything worked fine on Spark 1.1.0 until we
Thanks Davies. It turns out that was indeed the case, and they fixed it in last
night's nightly build!
https://github.com/elasticsearch/elasticsearch-hadoop/issues/338
On Wed, Dec 10, 2014 at 2:52 AM, Davies Liu dav...@databricks.com wrote:
On Tue, Dec 9, 2014 at 11:32 AM, Mohamed Lrhazi
Looks like you are wondering why you cannot see, via Thrift, the RDD table you
have created?
Based on my own experience with Spark 1.1, an RDD created directly via Spark
SQL (i.e. Spark Shell or Spark-SQL.sh) is not visible through Thrift, since
Thrift has its own session containing its own RDDs.
Spark
Hi,
When my Spark program calls JavaSparkContext.stop(), the following errors
occur.
14/12/11 16:24:19 INFO Main: sc.stop {
14/12/11 16:24:20 ERROR ConnectionManager: Corresponding
SendingConnection to ConnectionManagerId(cluster02,38918) not found