Hi there,
I am trying to run the example LDA code
(http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda)
on Spark 2.0.0/EMR 5.0.0
If I run it with checkpoints enabled (sc.setCheckpointDir("s3n://s3-path/"))
ldaModel = LDA.train(corpus, k=3, maxIterations=200,
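For reference, a complete form of that call as a minimal sketch (the bucket
path is a placeholder, and corpus is assumed to be an RDD of [id, Vector]
documents; checkpointInterval controls how often RDDs are checkpointed):
from pyspark.mllib.clustering import LDA
sc.setCheckpointDir("s3n://s3-path/")
ldaModel = LDA.train(corpus, k=3, maxIterations=200, checkpointInterval=10)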
Hi,
I have an 8 hour job (spark 2.0.0) that writes the results out to parquet
using the standard approach:
processed_images_df.write.format("parquet").save(s3_output_path)
It executes its tasks and writes the results to a _temporary folder, and
in the last step (after all the tasks have completed)
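One commonly suggested mitigation for the slow final commit to S3 is the v2
file output committer, which moves task output as each task finishes rather
than in one final rename pass. A hedged sketch (whether this helps depends on
the S3 filesystem connector in use):
from pyspark import SparkConf
conf = (SparkConf()
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"))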
Hi,
I am trying to take the OnlineLDAOptimizer and apply it iteratively to new
data. My use case would be:
- Train the model using the DistributedLDAModel
- Convert to LocalLDAModel
- Apply to new data as data comes in using the OnlineLDAOptimizer
I cannot see how this can be done without
You could also try to put transform in a companion object.
On Fri, 8 Apr 2016 16:48 mpawashe [via Apache Spark User List], <
ml-node+s1001560n26718...@n3.nabble.com> wrote:
> The class declaration is already marked Serializable ("with Serializable")
>
Hi Alfredo,
I have been building something similar and found that EMR is not suitable
for this, as the GPU instances don't come with nvidia drivers (and the
bootstrap process does not allow restarting instances).
The way I'm setting up is based on the spark-ec2 script where you can use
custom
It depends: if you'd like to multiply matrices for each row in the data,
you could use a Breeze matrix and do that locally on the nodes in a map or
similar.
If you'd like to multiply them across the rows, e.g. a row in your data is a
row in the matrix, then you could use a distributed matrix.
You are probably better off defining your own data structure. A LabeledPoint
can store a label and a vector, but your case is more like a label plus two
vectors. I'd probably use tuples with Breeze sparse arrays:
RDD[(label:Int, vector1:SparseArray[Double], vector2:SparseArray[Double])]
You can declare your class Serializable, as Spark will want to serialize the
whole class.
If you are asking about Scala vectors, it is as simple as this:
val vec = Vector(Vector(1,2), Vector(1,2), Vector(1,2))
val vecrdd = sc.parallelize(vec)
where vecrdd:
org.apache.spark.rdd.RDD[scala.collection.immutable.Vector[Int]]
Hi all,
I came across a really weird error on spark 1.6 (calling LDA from pyspark)
# data is [index, DenseVector]
data1 = corpusZippedDataFiltered.repartition(100).sample(False, 0.1, 100)
data2 = sc.parallelize(data1.collect()).repartition(100)
ldaModel1 = LDA.train(data1, k=10,
Hi all,
I am setting up a spark standalone server with an external hive metastore
(using mysql). There is an issue that after 5 minutes of inactivity, if I try
to reconnect to the metastore (i.e. by executing a new query), it hangs for
about 10 mins then times out. My guess is that datanucleus does
Hi all,
I am trying to set up an external metastore using Microsoft SQL on Azure. It
works OK initially, but after about 5 mins of inactivity it hangs, then times
out after 15 mins with this error:
15/05/20 00:02:49 ERROR ConnectionHandle: Database access problem. Killing
off this connection and all
Hi all,
is it possible to set hive.metastore.warehouse.dir, which is internally
created by spark, to point to external storage (e.g. s3 on aws or wasb on azure)?
thanks,
Ah, just noticed that you are using an external package; you can add that
like this:
conf = (SparkConf().set("spark.jars", jar_path))
or if it is a python package:
sc.addPyFile()
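A minimal sketch of both, with placeholder paths for the jar and the python
dependency:
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.jars", "/path/to/package.jar")
sc = SparkContext(conf=conf)
sc.addPyFile("/path/to/dependency.zip")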
Hi Mark,
That does not look like a Python path issue; the spark-assembly jar should
have those packaged and should make them available to the workers. Have you
built the jar yourself?
hi all,
I have just come across a problem where I have a table that has a few bigint
columns. It seems that if I read that table into a dataframe then collect it
in pyspark, the bigints are stored as integers in python.
(The problem is that if I write it back to another table, I detect the hive type
Hi all,
I have been testing Spark ML algorithms with bigger dataset, and ran into
some problems with linear regression:
It seems the executors stop without any apparent reason:
15/04/22 20:15:05 INFO BlockManagerInfo: Added rdd_12492_80 in memory on
backend-node:48037 (size: 28.5 MB, free: 2.8
Hi all,
I have a three node cluster with identical hardware. I am trying a workflow
that reads data from hdfs, repartitions it, runs a few map operations, then
writes the results back to hdfs.
It looks like all the computation, including the repartitioning and the
maps, completes within
Hi all,
We have a workflow that pulls in data from csv files. The original setup of
the workflow was to parse the data as it came in (turning it into an array),
then store it. This resulted in out-of-memory errors with larger files (as a
result of increased GC?).
It turns out that if the data gets
Hi all,
I have been trying out the new dataframe api in 1.3, which looks great by
the way.
I have found an example to define udfs and add them to select operations,
like this:
slen = F.udf(lambda s: len(s), IntegerType())
df.select(df.age, slen(df.name).alias('slen')).collect()
is it possible
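For reference, a self-contained version of the example above, assuming a
dataframe df with age and name columns:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
slen = F.udf(lambda s: len(s), IntegerType())
df.select(df.age, slen(df.name).alias('slen')).collect()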
Hi all,
We are designing a workflow where we try to stream local files to a Socket
streamer, that would clean and process the files and write them to hdfs. We
have an issue with bigger files when the streamer cannot keep up with the
data, and runs out of memory.
What would be the best way to
Hi all,
I think I have run into an issue with the lazy evaluation of variables in
pyspark. I have the following:
functions = [func1, func2, func3]
for counter in range(len(functions)):
    data = data.map(lambda value: [functions[counter](value)])
it looks like the counter is evaluated when
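The usual fix, as a hedged sketch: the lambdas capture counter by reference
and only look it up when the RDD is evaluated, so they all see its final
value. Binding the function through a default argument (or iterating over the
functions directly) freezes it per iteration:
functions = [func1, func2, func3]
for func in functions:
    # f=func binds the current function at definition time
    data = data.map(lambda value, f=func: [f(value)])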
Hi all,
I am trying to come up with a workflow where I can query streams
asynchronously. The problem I have is that the ssc.awaitTermination() line
blocks the whole thread, so it is not straightforward to me whether it is
possible to get hold of objects from streams once they are started. Any suggestion
hi all,
just wondering if there is a reason why it is not possible to add intercepts
to streaming regression models? I understand that the run method in the
underlying GeneralizedLinearModel does not take intercept as a parameter
either. Any reason for that?
thanks,
Hi,
My guess is that Spark is not picking up the jar where the function is
stored. You might have to add it to sparkcontext or the classpath manually.
You can also register the function:
hc.registerFunction("myfunct", myfunct)
then use it in the query.
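A minimal sketch of the whole pattern, assuming a HiveContext hc, a plain
Python function myfunct, and a placeholder table name:
hc.registerFunction("myfunct", myfunct)
results = hc.sql("SELECT myfunct(some_column) FROM my_table")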
Hi all,
we are building a custom JDBC receiver that would create a stream from sql
tables. I am not sure what the best way is to shut down the stream once all
the data has gone through, as the receiver knows it has completed but cannot
initiate the shutdown of the stream.
Any suggestion on how to structure
hi all,
I am trying to create a spark context programmatically, using
org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop
config that is created during the process is not picking up core-site.xml,
so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in
hi all,
I have been experimenting with creating a sparkcontext - streamingcontext
- a few streams - starting - stopping - creating new streams - starting
a new (or the existing) streamingcontext with the new streams
(I need to keep the existing sparkcontext alive as it would run other spark
Just found this in the documentation:
A SparkContext can be re-used to create multiple StreamingContexts, as long
as the previous StreamingContext is stopped (without stopping the
SparkContext) before the next StreamingContext is created.
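A minimal sketch of that pattern in pyspark, with a hypothetical
build_streams helper standing in for the DStream wiring:
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 10)
build_streams(ssc)
ssc.start()
# later: stop the streaming context but keep the SparkContext alive
ssc.stop(stopSparkContext=False)
ssc2 = StreamingContext(sc, 10)  # re-use the same sc for a new context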
in this case, I assume the error I reported above is a
Hi all,
we have been trying to set up a stream using a custom receiver that would
pick up data from sql databases. We'd like to keep that stream context
running and dynamically change the streams on demand, adding and removing
streams as needed. Alternatively, if a stream is fixed, is it
thanks for the replies. very useful.
Hi all,
just discovered that the streaming folder in pyspark is not included in the
assembly jar (spark-assembly-1.2.0-hadoop2.3.0.jar), but included in the
python folder. Any reason why?
Thanks,
Hi all,
Is there a way to save dstream RDDs to a single file so that another process
can pick it up as a single RDD?
It seems that each slice is saved to a separate folder, using
saveAsTextFiles method.
I'm using spark 1.2 with pyspark
thanks,
Hi all,
We have been building a system where we heavily rely on hive queries
executed through spark to load and manipulate data, running on CDH and yarn.
I have been trying to explore lighter setups where we would not have to
maintain a hadoop cluster, just run the system on spark only.
Is it
many thanks for adding this so quickly.
Hi all,
seems that all the mllib models are declared accessible in the package,
except MatrixFactorizationModel, which is declared private to mllib. Any
reason why?
thanks,
hi Xiangrui,
thanks. that is a very useful feature.
any suggestion on saving/loading the model in the meantime?
it was built with 1.6 (tried 1.7, too)
On Thu, Nov 13, 2014 at 2:52 AM, Andrew Or-2 [via Apache Spark User
List] ml-node+s1001560n18797...@n3.nabble.com wrote:
Hey Jamborta,
What java version did you build the jar with?
2014-11-12 16:48 GMT-08:00 jamborta [hidden email]:
I have figured out
Hi all,
I am trying to run spark with the latest build (from branch-1.2), as far as
I can see, all the paths are set and SparkContext starts up OK, however, I
cannot run anything that goes to the nodes. I get the following error:
Error from python worker:
/usr/bin/python2.7: No module named
Forgot to mention that this setup works in spark standalone mode; the
problem only occurs when I run on yarn.
I have figured out that building the fat jar with sbt does not seem to
include the pyspark scripts when using the following command:
sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
publish-local assembly
however the maven command works OK:
mvn -Pdeb -Pyarn -Phadoop-2.3
thanks, that looks good.
Hello,
I am trying to build Spark from source using the following:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean
package
this works OK with branch-1.1; when I switch to branch-1.2, I get the
Ah, thanks. Reverted back a few days; works now.
I meant above that, in the case of categorical variables, it might be more
efficient to create a node for each categorical value. Is there a reason why
spark went down the binary route?
thanks,
the only difference between the two setups (if you only change the executor
cores) is how many tasks are running in parallel (the number of tasks would
depend on other factors), so try to inspect the stages while running
(probably easier to do that with longer running tasks) by clicking on one of
Hi all,
Is there a way in spark to send data (RDD[Array]) from the scala component to
the python component?
I saw a method that serialises double arrays (serializeDoubleMatrix), but it
requires the data to be collected first. I assume that step would pull all
the data to the driver.
Thanks,
Hi,
Just wondering what is the reason that the decision tree implementation in
spark always does binary splits?
thanks,
If you go to your spark job UI (probably on http://master-node:4040) and
click on the environment tab, you can check whether the settings are
correctly picked up by spark.
Also when you run the job, you can see the subtasks (stages tab); inside a
task you can check what resources are assigned to the
Hi Rok,
you could try to debug it by first collecting your training_set, to see if it
gets you something back, before passing it to the train method. Then go
through each line in the train method, and also the serializer, to check
where exactly it fails.
thanks,
hi Andrew,
Thanks for the reply. I tried to tune the memory, changed it as low as
possible; no luck.
My guess is that this issue is related to what is discussed here
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-td11668.html
that is the
I have two nodes with 96G ram and 16 cores; my setup is as follows:
conf = (SparkConf()
        .setMaster("yarn-cluster")
        .set("spark.executor.memory", "30G")
        .set("spark.cores.max", 32)
        .set("spark.executor.instances", 2)
        .set("spark.executor.cores", 8)
Digging a bit deeper, the executors get lost when the memory use gets close to
the physical memory size:
http://apache-spark-user-list.1001560.n3.nabble.com/file/n15680/memory_usage.png
I'm not clear whether I am allocating too much or too little memory in this case.
thanks,
hi all,
I have a job that runs for about 15 mins; at some point I get an error on
both nodes (all executors) saying:
14/10/02 23:14:38 WARN TaskSetManager: Lost task 80.0 in stage 3.0 (TID 253,
backend-tes): ExecutorLostFailure (executor lost)
In the end, it seems that the job recovers and
Hi Arun,
Have you found a solution? Seems that I have the same problem.
thanks,
Have you found a solution to this problem? (Or at least a cause?)
thanks,
Hi all,
I cannot figure out why this command is not setting the driver memory (it is
setting the executor memory):
conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("test")
        .set("spark.driver.memory", "1G")
:26 AM, jamborta [hidden email] wrote:
Hi all,
I cannot figure out why this command is not setting the driver memory
(it is
setting the executor memory):
conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("test")
        .set
cluster mode apps that way, so the only
way to set this is through the command line / config files.)
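For example, a sketch of the command-line route (the app name is a
placeholder):
spark-submit --driver-memory 1g --master yarn-client my_app.py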
On Wed, Oct 1, 2014 at 9:26 AM, jamborta [hidden email] wrote:
Hi all,
I cannot figure out why this command is not setting the driver
hi all,
I have a problem with my application when I increase the data size over 5GB
(the cluster has about 100GB memory to handle that). First I get this
warning:
WARN TaskSetManager: Lost task 10.1 in stage 4.1 (TID 408, backend-node1):
FetchFailed(BlockManagerId(3, backend-node0, 41484, 0),
hi all,
I have a job that works OK in yarn-client mode, but when I try it in
yarn-cluster mode it returns the following:
WARN YarnClusterScheduler: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
memory
the cluster has
Hi,
you can create a spark context in your python or scala environment and use
that to run your hive queries, pretty much the same way as you'd do it in
the shell.
thanks,
hi,
Yes, I have been using spark sql extensively that way.
I have just tried it, and saveAsTable() works OK on 1.1.0.
Alternatively, you can write the data from schemaRDD to HDFS using
saveAsTextFile, and create an external table on top of it.
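A hedged sketch of that alternative, assuming a HiveContext hc and rows that
flatten to comma-separated strings (names and paths are placeholders):
schemaRDD.map(lambda row: ",".join(str(v) for v in row)).saveAsTextFile("/user/test/my_data")
hc.sql("CREATE EXTERNAL TABLE my_data (col1 STRING, col2 INT) "
       "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
       "LOCATION '/user/test/my_data'")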
thanks,
Just wanted to answer my question in case someone else runs into the same
problem.
It is related to the problem discussed here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html
and here:
https://issues.apache.org/jira/browse/SPARK-2121
Hi all,
I am using mappartitions to do some heavy computing on subsets of the data.
I have a dataset with about 1m rows, running on a 32 core cluster.
Unfortunately, it seems that mappartitions splits the data into two sets, so
it is only running on two cores.
Is there a way to force it to split
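A hedged sketch of the usual remedy: repartition (or load with more
partitions) before the mapPartitions call, so every core gets a partition
(heavy_computation is a placeholder for your per-partition function):
data = data.repartition(32)
result = data.mapPartitions(heavy_computation)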
hi all,
I am getting this strange error about half way through the job (running
spark 1.1 on yarn client mode):
14/09/26 00:54:06 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@4d0155fb
java.nio.channels.CancelledKeyException
at
, Tamas Jambor [hidden email] wrote:
Thank you.
Where is the number of containers set?
On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin [hidden email] wrote:
On Thu, Sep 25, 2014 at 8:55 AM, jamborta [hidden email] wrote:
I am running spark with the default settings in yarn client mode
Hi all,
I have just updated to spark 1.1.0. The new row representation of the data
in spark SQL is very handy.
I have noticed that it does not set None for NULL values coming from Hive if
the column was string type - it seems to work with other types.
Is that something that will be implemented?
Hi all,
I have a java object that contains a ML model which I would like to use for
prediction (in python). I just want to iterate the data through a mapper and
predict for each value. Unfortunately, this fails when it tries to serialise
the object to send it to the nodes.
Is there a trick
23, 2014 at 5:40 PM, Davies Liu [hidden email] wrote:
You should create a pure Python object (copy the attributes from Java
object),
then it could be used in map.
Davies
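A hedged sketch of that suggestion, with placeholder attribute names for
whatever your model exposes:
class LocalModel(object):
    # plain Python object holding only picklable attributes
    def __init__(self, weights, intercept):
        self.weights = weights
        self.intercept = intercept
    def predict(self, x):
        return sum(w * v for w, v in zip(self.weights, x)) + self.intercept
local = LocalModel(list(model.weights), model.intercept)
predictions = data.map(lambda x: local.predict(x))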
On Tue, Sep 23, 2014 at 8:48 AM, jamborta [hidden email] wrote:
Hi all,
I have an RDD that contains around 50 columns. I need to sum each column,
which I am doing by running it through a for loop, creating an array and
running the sum function as follows:
for (i <- 0 until 10) yield {
  data.map(x => x(i)).sum
}
is there a better way to do this?
thanks,
thanks. I will try to do it that way.
Hi,
I am trying to create a new table from a select query as follows:
CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED
BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/test/new_table' AS select * from table
this works in Hive, but in Spark SQL
Thanks. This was actually using HiveContext.
thank you for the replies.
I am running an insert on a join (INSERT OVERWRITE TABLE new_table select *
from table1 as a join table2 as b on (a.key = b.key)),
The process does not have the right permission to write to that folder, so I
get the following error printed:
chgrp: `/user/x/y': No such
Hi,
I am using Spark SQL to run some administrative queries and joins (e.g.
create table, insert overwrite, etc), where the query does not return any
data. I noticed that if the query fails it prints an error message on the
console but does not actually throw an exception (this is spark 1.0.2).
hi,
I ran into a problem with spark sql. When I run a query like this: select
count(*), city, industry from table group by hour, and I would like to take
the results from the schemaRDD:
1, I have to parse each line to get the values out of the dict (e.g. in order
to convert it to a csv)
2, The order is
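For point 1, a minimal sketch of flattening the Row results, assuming
results is the schemaRDD returned by the query (the output path is a
placeholder):
csv_lines = results.map(lambda row: ",".join(str(v) for v in row))
csv_lines.saveAsTextFile("/tmp/query_output")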
Hi all,
Just wondering if there is a way to use logging to print to spark logs some
additional info (similar to debug in scalding).
Thanks,
thanks for the reply.
I was looking for something for the case when it's running outside of the
spark framework: if I declare a sparkcontext or an rdd, could that print
some messages in the log?
The problem I have is that if I print something from the scala object that
runs the spark app, it does
Hi all,
not sure if this is a config issue or it's by design, but when I run the
spark shell, and try to submit another application from elsewhere, the
second application waits for the first to finish and outputs the following:
Initial job has not accepted any resources; check your cluster UI to
Hi all,
I am setting up a system where spark contexts would be created by a web
server that would handle the computation and return the results. I have the
following code (in python)
os.environ['SPARK_HOME'] = '/home/spark/spark-1.0.0-bin-hadoop2/'
sc =
Thanks, will try normalising it.
Hi all,
I am testing the regression methods (SGD) using pyspark. I tried to tune the
parameters, but the results are far off from those obtained using R. Is there
some way to set these parameters more effectively?
thanks,
forgot to mention that I'm running spark 1.0
Thanks for the reply. I am definitely running 1.0.0; I set it up manually.
To answer my own question, I found out from the examples that it needs a new
data type called LabeledPoint instead of a numpy array.