example LDA code ClassCastException

2016-11-03 Thread jamborta
Hi there, I am trying to run the example LDA code (http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda) on Spark 2.0.0/EMR 5.0.0. If I run it with checkpoints enabled (sc.setCheckpointDir("s3n://s3-path/") ldaModel = LDA.train(corpus, k=3, maxIterations=200,
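
For reference, a minimal runnable sketch of the call being described, with a toy in-memory corpus and a placeholder checkpoint path standing in for the real data:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import LDA
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="lda-checkpoint-example")
    sc.setCheckpointDir("s3n://s3-path/checkpoints/")  # placeholder bucket, as in the message

    # corpus is an RDD of [document id, term-count vector]
    corpus = sc.parallelize([
        [0, Vectors.dense([1.0, 2.0, 6.0])],
        [1, Vectors.dense([1.0, 3.0, 0.0])],
        [2, Vectors.dense([1.0, 4.0, 1.0])],
    ])

    ldaModel = LDA.train(corpus, k=3, maxIterations=200, checkpointInterval=10)
    print(ldaModel.topicsMatrix())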

writing to s3 failing to move parquet files from temporary folder

2016-09-29 Thread jamborta
Hi, I have an 8 hour job (spark 2.0.0) that writes the results out to parquet using the standard approach: processed_images_df.write.format("parquet").save(s3_output_path) It executes 1 tasks and writes the results to a _temporary folder, and in the last step (after all the tasks completed)

Iterative update for LocalLDAModel

2016-08-31 Thread jamborta
Hi, I am trying to take the OnlineLDAOptimizer and apply it iteratively to new data. My use case would be: - Train the model using the DistributedLDAModel - Convert to LocalLDAModel - Apply to new data as data comes in using the OnlineLDAOptimizer I cannot see that this can be done without

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-08 Thread jamborta
You could also try to put transform in a companion object. On Fri, 8 Apr 2016 16:48 mpawashe [via Apache Spark User List], < ml-node+s1001560n26718...@n3.nabble.com> wrote: > The class declaration is already marked Serializable ("with Serializable") > > -- > If you

Re: Anyone have a tutorial or guide to implement Spark + AWS + Caffe/CUDA?

2016-04-07 Thread jamborta
Hi Alfredo, I have been building something similar and found that EMR is not suitable for this, as the GPU instances don't come with NVIDIA drivers (and the bootstrap process does not allow you to restart instances). The way I'm setting it up is based on the spark-ec2 script, where you can use custom

Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-07 Thread jamborta
It depends: if you'd like to multiply matrices for each row in the data, then you could use a Breeze matrix and do that locally on the nodes in a map or similar. If you'd like to multiply them across the rows, e.g. a row in your data is a row in the matrix, then you could use a distributed matrix

Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-06 Thread jamborta
You're probably better off defining your own data structure. LabeledPoint can store a label and a vector, but your case is more like a label and two vectors. I'd probably use tuples with Breeze sparse arrays: RDD[(label: Int, vector1: SparseArray[Double], vector2: SparseArray[Double])] -- View this

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-06 Thread jamborta
You can declare your class Serializable, as Spark will want to serialise the whole class. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-NotSerializableException-Methods-Closures-tp26672p26689.html Sent from the Apache Spark User List

Re: How to convert a Vector[Vector] to a RDD[Vector]?

2016-04-06 Thread jamborta
If you are asking about Scala vectors, it is as simple as this: val vec = Vector(Vector(1,2), Vector(1,2), Vector(1,2)); val vecrdd = sc.parallelize(vec), where vecrdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Vector[Int]] -- View this message in context:

MLLIB LDA throws NullPointerException

2016-04-06 Thread jamborta
Hi all, I came across a really weird error on Spark 1.6 (calling LDA from pyspark): # data is [index, DenseVector] data1 = corpusZippedDataFiltered.repartition(100).sample(False, 0.1, 100) data2 = sc.parallelize(data1.collect()).repartition(100) ldaModel1 = LDA.train(data1, k=10,

hive external metastore connection timeout

2015-05-27 Thread jamborta
Hi all, I am setting up a Spark standalone server with an external Hive metastore (using MySQL). There is an issue where, after 5 minutes of inactivity, if I try to reconnect to the metastore (i.e. by executing a new query), it hangs for about 10 mins and then times out. My guess is that DataNucleus does

sparkSQL - Hive metastore connection hangs with MS SQL server

2015-05-19 Thread jamborta
Hi all, I am trying to set up an external metastore using Microsoft SQL on Azure. It works OK initially, but after about 5 mins of inactivity it hangs, then times out after 15 mins with this error: 15/05/20 00:02:49 ERROR ConnectionHandle: Database access problem. Killing off this connection and all

store hive metastore on persistent store

2015-05-14 Thread jamborta
Hi all, is it possible to set hive.metastore.warehouse.dir, which is internally created by Spark, to be stored externally (e.g. S3 on AWS or WASB on Azure)? thanks, -- View this message in context:

Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Ah, just noticed that you are using an external package. You can add that like this: conf = (SparkConf().set("spark.jars", jar_path)), or if it is a Python package: sc.addPyFile() -- View this message in context:
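
Spelled out with placeholder paths (the actual jar and package names were not in the thread), that looks like:

    from pyspark import SparkConf, SparkContext

    jar_path = "/path/to/external-package.jar"  # placeholder
    conf = SparkConf().set("spark.jars", jar_path)
    sc = SparkContext(conf=conf)

    # for a pure Python dependency, ship it to the workers instead
    sc.addPyFile("/path/to/my_package.zip")  # placeholder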

Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Hi Mark, That does not look like a Python path issue; the spark-assembly jar should have those packaged and should make them available to the workers. Have you built the jar yourself? -- View this message in context:

Bigints in pyspark

2015-04-27 Thread jamborta
hi all, I have just come across a problem where I have a table that has a few bigint columns. It seems that if I read that table into a dataframe and then collect it in pyspark, the bigints are stored as integers in Python. (The problem is if I write it back to another table, I detect the hive type

ML regression - spark context dies without error

2015-04-23 Thread jamborta
Hi all, I have been testing Spark ML algorithms with a bigger dataset and ran into some problems with linear regression: it seems the executors stop without any apparent reason: 15/04/22 20:15:05 INFO BlockManagerInfo: Added rdd_12492_80 in memory on backend-node:48037 (size: 28.5 MB, free: 2.8

writing to hdfs on master node much faster

2015-04-20 Thread jamborta
Hi all, I have a three node cluster with identical hardware. I am trying a workflow where it reads data from hdfs, repartitions it and runs a few map operations, then writes the results back to hdfs. It looks like all the computation, including the repartitioning and the maps, completes within

Spark streaming

2015-03-27 Thread jamborta
Hi all, We have a workflow that pulls in data from csv files. The original setup of the workflow was to parse the data as it comes in (turn it into an array), then store it. This resulted in out-of-memory errors with larger files (as a result of increased GC?). It turns out that if the data gets

Dataframe groupby custom functions (python)

2015-03-24 Thread jamborta
Hi all, I have been trying out the new DataFrame API in 1.3, which looks great by the way. I have found an example to define UDFs and add them to select operations, like this: slen = F.udf(lambda s: len(s), IntegerType()) df.select(df.age, slen(df.name).alias('slen')).collect() Is it possible
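
For reference, the same example spelled out with its imports and a toy DataFrame (the toy data is made up here):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, functions as F
    from pyspark.sql.types import IntegerType

    sc = SparkContext(appName="udf-example")
    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([(2, "Alice"), (5, "Bob")], ["age", "name"])

    # a Python UDF used inside select, as in the message
    slen = F.udf(lambda s: len(s), IntegerType())
    print(df.select(df.age, slen(df.name).alias("slen")).collect())

    # built-in aggregate functions can be applied after a groupBy
    print(df.groupBy(df.age).agg(F.count(df.name).alias("n")).collect())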

Buffering for Socket streams

2015-03-20 Thread jamborta
Hi all, We are designing a workflow where we try to stream local files to a Socket streamer, that would clean and process the files and write them to hdfs. We have an issue with bigger files when the streamer cannot keep up with the data, and runs out of memory. What would be the best way to

stack map functions in a loop (pyspark)

2015-02-19 Thread jamborta
Hi all, I think I have run into an issue with the lazy evaluation of variables in pyspark. I have the following: functions = [func1, func2, func3] for counter in range(len(functions)): data = data.map(lambda value: [functions[counter](value)]) It looks like the counter is evaluated when
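
One common workaround (a sketch with toy functions and data standing in for the real ones) is to bind the loop variable eagerly, for example through a default argument, so each lambda keeps the value of counter from its own iteration:

    from pyspark import SparkContext

    sc = SparkContext(appName="closure-loop-example")
    data = sc.parallelize([1, 2, 3])
    functions = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]  # toy stand-ins

    for counter in range(len(functions)):
        # i=counter freezes the current value; a plain closure over `counter`
        # would only be looked up later, when the maps actually run
        data = data.map(lambda value, i=counter: functions[i](value))

    print(data.collect())  # [1, 3, 5] rather than the last function applied three times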

Interact with streams in a non-blocking way

2015-02-13 Thread jamborta
Hi all, I am trying to come up with a workflow where I can query streams asynchronously. The problem I have is that the ssc.awaitTermination() line blocks the whole thread, so it is not clear to me whether it is possible to get hold of objects from streams once they are started. Any suggestion

no option to add intercepts for StreamingLinearAlgorithm

2015-02-05 Thread jamborta
hi all, just wondering if there is a reason why it is not possible to add intercepts for streaming regression models? I understand that run method in the underlying GeneralizedLinearModel does not take intercept as a parameter either. Any reason for that? thanks, -- View this message in

Re: Is there a way to access Hive UDFs in a HiveContext?

2015-02-05 Thread jamborta
Hi, My guess is that Spark is not picking up the jar where the function is stored. You might have to add it to the SparkContext or the classpath manually. You can also register the function with hc.registerFunction("myfunct", myfunct) and then use it in the query. -- View this message in context:
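
As a sketch of the second suggestion (the function, its name, and the table are made up for illustration):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-udf-example")
    hc = HiveContext(sc)

    # register a plain Python function under a name that SQL queries can use
    hc.registerFunction("myfunct", lambda s: s.upper())
    hc.sql("SELECT myfunct(name) FROM my_table")  # my_table is a placeholder Hive table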

Best way to shut down a stream initiated from the recevier

2015-01-27 Thread jamborta
Hi all, we are building a custom JDBC receiver that would create a stream from SQL tables. Not sure what the best way is to shut down the stream once all the data has gone through, as the receiver knows it has completed but cannot initiate the stream shutdown. Any suggestion how to structure

spark context not picking up default hadoop filesystem

2015-01-26 Thread jamborta
hi all, I am trying to create a spark context programmatically, using org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop config that is created during the process is not picking up core-site.xml, so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in

Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
hi all, I have been experimenting with creating a sparkcontext - streamingcontext - a few streams - starting - stopping - creating new streams - starting a new (or the existing) streamingcontext with the new streams (I need to keep the existing sparkcontext alive as it would run other spark

Re: Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
Just found this in the documentation: A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created. in this case, I assume the error I reported above is a

dynamically change receiver for a spark stream

2015-01-20 Thread jamborta
Hi all, we have been trying to set up a stream using a custom receiver that would pick up data from SQL databases. We'd like to keep that streaming context running and dynamically change the streams on demand, adding and removing streams as needed. Alternatively, if a stream is fixed, is it

Re: save spark streaming output to single file on hdfs

2015-01-15 Thread jamborta
thanks for the replies. very useful. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-tp21124p21176.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark streaming python files not packaged in assembly jar

2015-01-15 Thread jamborta
Hi all, just discovered that the streaming folder in pyspark is not included in the assembly jar (spark-assembly-1.2.0-hadoop2.3.0.jar), but included in the python folder. Any reason why? Thanks, -- View this message in context:

save spark streaming output to single file on hdfs

2015-01-13 Thread jamborta
Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It seems that each slice is saved to a separate folder using the saveAsTextFiles method. I'm using Spark 1.2 with pyspark. thanks, -- View this message in context:
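
One commonly suggested sketch (not necessarily what the eventual replies proposed) is to coalesce each batch to a single partition inside foreachRDD; each slice still gets its own directory, but a later job can read them all back as one RDD through a glob path. The paths and stream source below are placeholders:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="single-part-output")
    ssc = StreamingContext(sc, 10)
    lines = ssc.socketTextStream("localhost", 9999)  # placeholder source

    def save_batch(time, rdd):
        # one part file per batch; sc.textFile("hdfs:///output/batch-*") reads them all as one RDD
        rdd.coalesce(1).saveAsTextFile("hdfs:///output/batch-" + time.strftime("%Y%m%d%H%M%S"))

    lines.foreachRDD(save_batch)
    ssc.start()
    ssc.awaitTermination()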

Spark with Hive cluster dependencies

2015-01-07 Thread jamborta
Hi all, We have been building a system where we rely heavily on Hive queries executed through Spark to load and manipulate data, running on CDH and YARN. I have been trying to explore lighter setups where we would not have to maintain a Hadoop cluster, and just run the system on Spark only. Is it

Re: why MatrixFactorizationModel private?

2014-11-26 Thread jamborta
many thanks for adding this so quickly. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763p19855.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

why MatrixFactorizationModel private?

2014-11-25 Thread jamborta
Hi all, seems that all the mllib models are declared accessible in the package, except MatrixFactorizationModel, which is declared private to mllib. Any reason why? thanks, -- View this message in context:

Re: why MatrixFactorizationModel private?

2014-11-25 Thread jamborta
hi Xiangrui, thanks. that is a very useful feature. any suggestion on saving/loading the model in the meantime? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763p19783.html Sent from the Apache Spark User List

Re: No module named pyspark - latest built

2014-11-13 Thread jamborta
it was built with 1.6 (tried 1.7, too) On Thu, Nov 13, 2014 at 2:52 AM, Andrew Or-2 [via Apache Spark User List] ml-node+s1001560n18797...@n3.nabble.com wrote: Hey Jamborta, What java version did you build the jar with? 2014-11-12 16:48 GMT-08:00 jamborta [hidden email]: I have figured out

No module named pyspark - latest built

2014-11-12 Thread jamborta
Hi all, I am trying to run Spark with the latest build (from branch-1.2). As far as I can see, all the paths are set and SparkContext starts up OK; however, I cannot run anything that goes to the nodes. I get the following error: Error from python worker: /usr/bin/python2.7: No module named

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
Forgot to mention that this setup works in Spark standalone mode; the problem only occurs when I run on YARN. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18777.html Sent from the Apache Spark User List mailing list

Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
I have figured out that building the fat jar with sbt does not seem to include the pyspark scripts when using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean publish-local assembly However, the maven command works OK: mvn -Pdeb -Pyarn -Phadoop-2.3

Re: spark SNAPSHOT repo

2014-11-10 Thread jamborta
thanks, that looks good. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-SNAPSHOT-repo-tp18502p18505.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
Hello, I am trying to build Spark from source using the following: export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean package This works OK with branch-1.1; when I switch to branch-1.2, I get the

Re: Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
Ah, thanks. Reverted back a few days; works now. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Building-spark-from-source-assertion-failed-org-eclipse-jetty-server-DispatcherType-tp18529p18532.html Sent from the Apache Spark User List mailing list archive at

Re: why decision trees do binary split?

2014-11-06 Thread jamborta
I meant above that in the case of categorical variables it might be more efficient to create a node for each categorical value. Is there a reason why Spark went down the binary route? thanks, -- View this message in context:

Re: I want to make clear the difference about executor-cores number.

2014-11-06 Thread jamborta
The only difference between the two setups (if you vary the executor cores) is how many tasks are running in parallel (the number of tasks would depend on other factors), so try to inspect the stages while running (probably easier to do that with longer-running tasks) by clicking on one of

most efficient way to send data from Scala to python

2014-11-06 Thread jamborta
Hi all, Is there a way in Spark to send data (RDD[Array]) from the Scala component to the Python component? I saw a method that serialises double arrays (serializeDoubleMatrix), but it requires the data to be collected first. I assume that step would pull all the data to the driver. Thanks,

why decision trees do binary split?

2014-11-05 Thread jamborta
Hi, Just wondering what is the reason that the decision tree implementation in spark always does binary splits? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188.html Sent from the Apache Spark User List

Re: I want to make clear the difference about executor-cores number.

2014-11-05 Thread jamborta
If you go to your Spark job UI (probably on http://master-node:4040) and click on the Environment tab, you can check whether the settings are correctly picked up by Spark. Also, when you run the job, you can see the subtasks (Stages tab); inside the task you can check what resources are assigned to the

Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread jamborta
Hi Rok, you could try to debug it by first collecting your training_set, to see if it gets you something back, before passing it to the train method. Then go through each line in the train method, and the serializer as well, and check where exactly it fails. thanks, -- View this message in context:

Re: executors not created yarn-cluster mode

2014-10-08 Thread jamborta
hi Andrew, Thanks for the reply, I tried to tune the memory, changed it as low as possible, no luck. My guess is that this issue is related to what is discussed here http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-td11668.html that is the

Re: Any issues with repartition?

2014-10-03 Thread jamborta
I have two nodes with 96G RAM and 16 cores; my setup is as follows: conf = (SparkConf() .setMaster("yarn-cluster") .set("spark.executor.memory", "30G") .set("spark.cores.max", 32) .set("spark.executor.instances", 2) .set("spark.executor.cores", 8)

Re: how to debug ExecutorLostFailure

2014-10-03 Thread jamborta
Digging a bit deeper, the executors get lost when the memory gets close to the physical memory size: http://apache-spark-user-list.1001560.n3.nabble.com/file/n15680/memory_usage.png I'm not clear whether I am allocating too much or too little memory in this case. thanks, -- View this message

how to debug ExecutorLostFailure

2014-10-02 Thread jamborta
hi all, I have a job that runs for about 15 mins, and at some point I get an error on both nodes (all executors) saying: 14/10/02 23:14:38 WARN TaskSetManager: Lost task 80.0 in stage 3.0 (TID 253, backend-tes): ExecutorLostFailure (executor lost) In the end, it seems that the job recovers and

Re: Any issues with repartition?

2014-10-02 Thread jamborta
Hi Arun, Have you found a solution? Seems that I have the same problem. thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Any-issues-with-repartition-tp13462p15654.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: new error for me

2014-10-02 Thread jamborta
have you found a solution this problem? (or at least a cause) thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/new-error-for-me-tp10378p15655.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster("yarn-client") .setAppName("test") .set("spark.driver.memory", "1G")

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
:26 AM, jamborta [hidden email] wrote: Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster(yarn-client) .setAppName(test) .set

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
cluster mode apps that way, so the only way to set this is through the command line / config files.) On Wed, Oct 1, 2014 at 9:26 AM, jamborta [hidden email] wrote: Hi all, I cannot figure out why this command is not setting the driver

ExecutorLostFailure kills sparkcontext

2014-09-29 Thread jamborta
hi all, I have a problem with my application when I increase the data size over 5GB (the cluster has about 100GB memory to handle that). First I get this warning: WARN TaskSetManager: Lost task 10.1 in stage 4.1 (TID 408, backend-node1): FetchFailed(BlockManagerId(3, backend-node0, 41484, 0),

yarn does not accept job in cluster mode

2014-09-27 Thread jamborta
hi all, I have a job that works OK in yarn-client mode, but when I try it in yarn-cluster mode it returns the following: WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory the cluster has

Re: How to run hive scripts pro-grammatically in Spark 1.1.0 ?

2014-09-27 Thread jamborta
Hi, you can create a Spark context in your Python or Scala environment and use that to run your Hive queries, in pretty much the same way as you'd do it in the shell. thanks, -- View this message in context:
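
In pyspark that looks roughly like the following (the table and queries are placeholders; on older releases the method may be hc.hql rather than hc.sql):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-from-code")
    hc = HiveContext(sc)

    # any HiveQL statement can be submitted this way, much like in the shell
    hc.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING)")
    print(hc.sql("SELECT count(*) FROM demo_table").collect())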

Re: New user question on Spark SQL: can I really use Spark SQL like a normal DB?

2014-09-27 Thread jamborta
hi, Yes, I have been using Spark SQL extensively that way. I have just tried it, and saveAsTable() works OK on 1.1.0. Alternatively, you can write the data from the SchemaRDD to HDFS using saveAsTextFile, and create an external table on top of it. thanks, -- View this message in context:

Re: Job cancelled because SparkContext was shut down

2014-09-26 Thread jamborta
Just wanted to answer my question in case someone else runs into the same problem. It is related to the problem discussed here: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html and here: https://issues.apache.org/jira/browse/SPARK-2121

mappartitions data size

2014-09-26 Thread jamborta
Hi all, I am using mapPartitions to do some heavy computing on subsets of the data. I have a dataset with about 1m rows, running on a 32-core cluster. Unfortunately, it seems that mapPartitions splits the data into two sets, so it is only running on two cores. Is there a way to force it to split

Job cancelled because SparkContext was shut down

2014-09-25 Thread jamborta
hi all, I am getting this strange error about half way through the job (running spark 1.1 on yarn client mode): 14/09/26 00:54:06 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@4d0155fb java.nio.channels.CancelledKeyException at

Re: Yarn number of containers

2014-09-25 Thread jamborta
, Tamas Jambor [hidden email] wrote: Thank you. Where is the number of containers set? On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin [hidden email] wrote: On Thu, Sep 25, 2014 at 8:55 AM, jamborta [hidden email] wrote: I am running spark with the default settings in yarn client mode

Null values in pyspark Row

2014-09-24 Thread jamborta
Hi all, I have just updated to Spark 1.1.0. The new Row representation of the data in Spark SQL is very handy. I have noticed that it does not set None for NULL values coming from Hive if the column is string type - it seems to work with other types. Is that something that will be implemented?

access javaobject in rdd map

2014-09-23 Thread jamborta
Hi all, I have a Java object that contains an ML model which I would like to use for prediction (in Python). I just want to iterate the data through a mapper and predict for each value. Unfortunately, this fails when it tries to serialise the object to send it to the nodes. Is there a trick

Re: access javaobject in rdd map

2014-09-23 Thread jamborta
23, 2014 at 5:40 PM, Davies Liu [hidden email] wrote: You should create a pure Python object (copy the attributes from Java object), then it could be used in map. Davies On Tue, Sep 23, 2014 at 8:48 AM, jamborta [hidden email]

Efficient way to sum multiple columns

2014-09-15 Thread jamborta
Hi all, I have an RDD that contains around 50 columns. I need to sum each column, which I am doing by running it through a for loop, creating an array and running the sum function as follows: for (i <- 0 until 10) yield { data.map(x => x(i)).sum } Is there a better way to do this? thanks,
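
A single-pass alternative, sketched in Python for consistency with the other snippets in this digest (the same zip-and-add idea carries over to a Scala reduce):

    from pyspark import SparkContext

    sc = SparkContext(appName="column-sums")
    # toy stand-in for an RDD of ~50-column numeric rows
    data = sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])

    # sum every column in one pass instead of launching one sum() job per column
    sums = data.reduce(lambda a, b: [x + y for x, y in zip(a, b)])
    print(sums)  # [12.0, 15.0, 18.0]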

Re: spark sql - create new_table as select * from table

2014-09-12 Thread jamborta
thanks. I will try to do that way. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14090.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark sql - create new_table as select * from table

2014-09-11 Thread jamborta
Hi, I am trying to create a new table from a select query as follows: CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/test/new_table' AS select * from table this works in Hive, but in Spark SQL

Re: spark sql - create new_table as select * from table

2014-09-11 Thread jamborta
Thanks. This was actually using HiveContext. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark SQL check if query is completed (pyspark)

2014-09-08 Thread jamborta
thank you for the replies. I am running an insert on a join (INSERT OVERWRITE TABLE new_table select * from table1 as a join table2 as b on (a.key = b.key)). The process does not have the right permission to write to that folder, so I get the following error printed: chgrp: `/user/x/y': No such

Spark SQL check if query is completed (pyspark)

2014-09-06 Thread jamborta
Hi, I am using Spark SQL to run some administrative queries and joins (e.g. create table, insert overwrite, etc), where the query does not return any data. I noticed if the query fails it prints some error message on the console, but does not actually throw an exception (this is spark 1.0.2).

spark sql results maintain order (in python)

2014-09-04 Thread jamborta
hi, I ran into a problem with Spark SQL when running a query like this: select count(*), city, industry from table group by hour. I would like to take the results from the schemaRDD: 1, I have to parse each line to get the values out of the dict (e.g. in order to convert it to a csv); 2, The order is

Print to spark log

2014-08-28 Thread jamborta
Hi all, Just wondering if there is a way to use logging to print some additional info to the Spark logs (similar to debug in Scalding). Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html Sent from the Apache Spark User

Re: Print to spark log

2014-08-28 Thread jamborta
Thanks for the reply. I was looking for something for the case when it's running outside of the Spark framework. If I declare a SparkContext or an RDD, could that print some messages in the log? The problem I have is that if I print something from the Scala object that runs the Spark app, it does

running multiple applications at the same time

2014-06-26 Thread jamborta
Hi all, not sure if this is a config issue or it's by design, but when I run the spark shell, and try to submit another application from elsewhere, the second application waits for the first to finish and outputs the following: Initial job has not accepted any resources; check your cluster UI to

create SparkContext dynamically

2014-06-18 Thread jamborta
Hi all, I am setting up a system where Spark contexts would be created by a web server that would handle the computation and return the results. I have the following code (in Python): os.environ['SPARK_HOME'] = "/home/spark/spark-1.0.0-bin-hadoop2/" sc =
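
A self-contained sketch of that setup (the master URL and app name are placeholders; the original message is cut off at the context construction):

    import os
    from pyspark import SparkConf, SparkContext

    os.environ['SPARK_HOME'] = "/home/spark/spark-1.0.0-bin-hadoop2/"

    # placeholder master and app name; a web server would point this at its cluster
    conf = SparkConf().setMaster("local[2]").setAppName("web-backend")
    sc = SparkContext(conf=conf)

    result = sc.parallelize(range(10)).sum()
    sc.stop()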

Re: pyspark regression results way off

2014-06-17 Thread jamborta
Thanks, will try normalising it. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7720.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

pyspark regression results way off

2014-06-16 Thread jamborta
Hi all, I am testing the regression methods (SGD) using pyspark. I tried to tune the parameters, but the results are far off from those obtained using R. Is there some way to set these parameters more efficiently? thanks, -- View this message in context:

Re: pyspark regression results way off

2014-06-16 Thread jamborta
forgot to mention that I'm running spark 1.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-30 Thread jamborta
Thanks for the reply. I am definitely running 1.0.0; I set it up manually. To answer my own question, I found out from the examples that it needs a new data type called LabeledPoint instead of a numpy array. -- View this message in context: