example LDA code ClassCastException

2016-11-03 Thread jamborta
Hi there,

I am trying to run the example LDA code
(http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda)
on Spark 2.0.0/EMR 5.0.0

If I run it with checkpointing enabled (sc.setCheckpointDir("s3n://s3-path/")):

ldaModel = LDA.train(corpus, k=3, maxIterations=200, checkpointInterval=10)
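
For reference, a self-contained sketch of the sequence with toy data (my real
corpus is built following the linked example):

from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

# toy corpus: an RDD of [document id, term-count vector]
docs = sc.parallelize([Vectors.dense([1.0, 2.0, 6.0]),
                       Vectors.dense([1.0, 3.0, 0.0]),
                       Vectors.dense([1.0, 4.0, 1.0])])
corpus = docs.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

sc.setCheckpointDir("s3n://s3-path/")  # checkpointing enabled, path illustrative
ldaModel = LDA.train(corpus, k=3, maxIterations=200, checkpointInterval=10)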

I get the following error (sorry, quite long): 

Py4JJavaErrorTraceback (most recent call last)
 in ()
> 1 ldaModel = LDA.train(corpus, k=3, maxIterations=200,
checkpointInterval=10)

/usr/lib/spark/python/pyspark/mllib/clustering.py in train(cls, rdd, k,
maxIterations, docConcentration, topicConcentration, seed,
checkpointInterval, optimizer)
   1037 model = callMLlibFunc("trainLDAModel", rdd, k,
maxIterations,
   1038   docConcentration, topicConcentration,
seed,
-> 1039   checkpointInterval, optimizer)
   1040 return LDAModel(model)
   1041 

/usr/lib/spark/python/pyspark/mllib/common.py in callMLlibFunc(name, *args)
128 sc = SparkContext.getOrCreate()
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131 
132 

/usr/lib/spark/python/pyspark/mllib/common.py in callJavaFunc(sc, func,
*args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124 
125 

/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in
__call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934 
935 for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in
get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(

Py4JJavaError: An error occurred while calling o115.trainLDAModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 458.0 failed 4 times, most recent failure: Lost task 1.3 in stage
458.0 (TID 14827, ip-10-197-192-2.eu-west-1.compute.internal):
java.lang.ClassCastException: scala.Tuple2 cannot be cast to
org.apache.spark.graphx.Edge
at
org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:801)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 

writing to s3 failing to move parquet files from temporary folder

2016-09-29 Thread jamborta
Hi,

I have an 8-hour job (Spark 2.0.0) that writes the results out to parquet
using the standard approach:

processed_images_df.write.format("parquet").save(s3_output_path)

It executes all the tasks and writes the results to a _temporary folder, and
in the last step (after all the tasks have completed) it copies the parquet
files out of the _temporary folder, but after copying about 2-3000 files it
fails with the following error (at first I thought this was a transient s3
failure, but I reran it 3 times and got the same error):

org.apache.spark.SparkException: Job aborted.
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(I
nsertIntoHadoopFsRelationCommand.scala:149)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIn
toHadoopFsRelationCommand.scala:115)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIn
toHadoopFsRelationCommand.scala:115)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelatio
nCommand.scala:115)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.NoHttpResponseException:
s3-bucket.s3.amazonaws.com:443 failed to respon
d
at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:
283)
at
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:259)
at
org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:2
32)
at
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:686)
at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:488)
at
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:326)
at

Iterative update for LocalLDAModel

2016-08-31 Thread jamborta
Hi,

I am trying to take the OnlineLDAOptimizer and apply it iteratively to new
data. My use case would be:

- Train the model using the DistributedLDAModel
- Convert to LocalLDAModel 
- Apply to new data as data comes in using the OnlineLDAOptimizer

I cannot see how this can be done without a code change, as the LDA class
does not have an option to initialise with an existing model. Am I missing
something?

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-update-for-LocalLDAModel-tp27632.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-08 Thread jamborta
You could also try to put transform in a companion object.

On Fri, 8 Apr 2016 16:48 mpawashe [via Apache Spark User List], <
ml-node+s1001560n26718...@n3.nabble.com> wrote:

> The class declaration is already marked Serializable ("with Serializable")
>
> --
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-NotSerializableException-Methods-Closures-tp26672p26718.html




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-NotSerializableException-Methods-Closures-tp26672p26723.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Anyone have a tutorial or guide to implement Spark + AWS + Caffe/CUDA?

2016-04-07 Thread jamborta
Hi Alfredo,

I have been building something similar and found that EMR is not suitable
for this, as the GPU instances don't come with NVIDIA drivers (and the
bootstrap process does not allow restarting instances).

The way I'm setting it up is based on the spark-ec2 script, where you can use
custom AMIs that contain your dockerized application plus all the drivers and
dependencies. I haven't completed my project yet, but I haven't run into any
major blockers so far.

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Anyone-have-a-tutorial-or-guide-to-implement-Spark-AWS-Caffe-CUDA-tp26705p26707.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-07 Thread jamborta
It depends: if you'd like to multiply matrices for each row in the data, then
you could use a Breeze matrix and do that locally on the nodes, in a map or
similar.

If you'd like to multiply them across rows, e.g. a row in your data is a row
in the matrix, then you could use a distributed matrix like IndexedRowMatrix
(http://spark.apache.org/docs/latest/mllib-data-types.html#indexedrowmatrix).
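
For the distributed option, a minimal pyspark sketch of building an
IndexedRowMatrix (toy data, Spark 1.6+ API):

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# each IndexedRow pairs a row index with a vector of features
rows = sc.parallelize([IndexedRow(0, [1.0, 2.0, 3.0]),
                       IndexedRow(1, [4.0, 5.0, 6.0]),
                       IndexedRow(2, [7.0, 8.0, 9.0])])
mat = IndexedRowMatrix(rows)
print(mat.numRows(), mat.numCols())  # 3 3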



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LabeledPoint-with-features-in-matrix-form-word2vec-matrix-tp26629p26696.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-06 Thread jamborta
You are probably better off defining your own data structure. A LabeledPoint
can store a (label, vector) pair, but your case is more like (label, vector,
vector). I'd probably use tuples with Breeze sparse arrays:

RDD[(label: Int, vector1: SparseArray[Double], vector2: SparseArray[Double])]
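
In pyspark, a rough equivalent would be tuples of mllib SparseVectors (the
sizes and values below are just illustrative):

from pyspark.mllib.linalg import SparseVector

# (label, term-count vector, word2vec-style vector) per record
records = sc.parallelize([
    (1, SparseVector(5, [0, 3], [1.0, 2.0]), SparseVector(4, [1], [0.5])),
    (0, SparseVector(5, [2], [4.0]), SparseVector(4, [0, 2], [1.0, 3.0])),
])
labels = records.map(lambda r: r[0])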



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LabeledPoint-with-features-in-matrix-form-word2vec-matrix-tp26629p26690.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-06 Thread jamborta
You can declare your class Serializable, as Spark will want to serialise the
whole class.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-NotSerializableException-Methods-Closures-tp26672p26689.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to convert a Vector[Vector] to a RDD[Vector]?

2016-04-06 Thread jamborta
If you are asking about Scala Vectors, it is as simple as this:

val vec = Vector(Vector(1, 2), Vector(1, 2), Vector(1, 2))
val vecrdd = sc.parallelize(vec)

where vecrdd has type:
org.apache.spark.rdd.RDD[scala.collection.immutable.Vector[Int]]



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-a-Vector-Vector-to-a-RDD-Vector-tp26680p26688.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MLLIB LDA throws NullPointerException

2016-04-06 Thread jamborta
Hi all,

I came across a really weird error on spark 1.6 (calling LDA from pyspark)

# data is [index, DenseVector]
data1 = corpusZippedDataFiltered.repartition(100).sample(False, 0.1, 100)
data2 = sc.parallelize(data1.collect()).repartition(100)

ldaModel1 = LDA.train(data1, k=10, maxIterations=10)
ldaModel2 = LDA.train(data2, k=10, maxIterations=10)

ldaModel2 completes OK (with or without repartitioning), but ldaModel1 fails
with:

An error occurred while calling o1812.trainLDAModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 36
in stage 1681.0 failed 4 times, most recent failure: Lost task 36.3 in stage
1681.0 (TID 60425, ip-10-33-65-169.eu-west-1.compute.internal):
java.lang.NullPointerException

Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1081)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1075)
at
org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:205)
at
org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:192)
at
org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:80)
at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:329)
at
org.apache.spark.mllib.api.python.PythonMLLibAPI.trainLDAModel(PythonMLLibAPI.scala:538)
at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-LDA-throws-NullPointerException-tp26686.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



hive external metastore connection timeout

2015-05-27 Thread jamborta
Hi all,

I am setting up a Spark standalone server with an external Hive metastore
(using MySQL). The issue is that after 5 minutes of inactivity, if I try to
reconnect to the metastore (i.e. by executing a new query), it hangs for
about 10 mins and then times out. My guess is that DataNucleus does not close
the existing connections in the pool, but still tries to create new ones for
some reason.

I tried different types of connection pools; that didn't help either.

thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/hive-external-metastore-connection-timeout-tp23052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



sparkSQL - Hive metastore connection hangs with MS SQL server

2015-05-19 Thread jamborta
Hi all,

I am trying to set up an external metastore using Microsoft SQL Server on
Azure. It works OK initially, but after about 5 mins of inactivity it hangs,
then times out after 15 mins with this error:

15/05/20 00:02:49 ERROR ConnectionHandle: Database access problem. Killing
off this connection and all remaining connections in the connection pool.
SQL State = 08S01
15/05/20 00:02:49 ERROR RetryingHMSHandler: Retrying HMSHandler after 1000
ms (attempt 1 of 1) with error: javax.jdo.JDODataStoreException: SQL Server
did not return a 
response. The connection has been closed.
at
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:275)
at
org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:901)
at
org.apache.hadoop.hive.metastore.ObjectStore.getTable(ObjectStore.java:833)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
NestedThrowablesStackTrace:
com.microsoft.sqlserver.jdbc.SQLServerException: SQL Server did not return a
response. The connection has been closed.
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1668)
at
com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1655)
at
com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:4844)

I have also tried replacing BoneCP with DBCP in
datanucleus.connectionPoolingType, but that didn't help either.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sparkSQL-Hive-metastore-connection-hangs-with-MS-SQL-server-tp22950.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



store hive metastore on persistent store

2015-05-14 Thread jamborta
Hi all,

Is it possible to set hive.metastore.warehouse.dir, which is internally
created by Spark, to point to external storage (e.g. S3 on AWS or WASB on
Azure)?
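
To make the question concrete, something along these lines is what I have in
mind (untested sketch; the bucket path is illustrative):

from pyspark.sql import HiveContext

hc = HiveContext(sc)
# point the warehouse at an external object store instead of the local default
hc.setConf("hive.metastore.warehouse.dir", "s3n://my-bucket/hive/warehouse")
hc.sql("CREATE TABLE IF NOT EXISTS test_tbl (id INT)")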

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/store-hive-metastore-on-persistent-store-tp22891.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Ah, just noticed that you are using an external package. You can add that
like this:

conf = SparkConf().set("spark.jars", jar_path)

or, if it is a Python package:

sc.addPyFile()
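
A slightly more complete sketch of the same idea (the paths and app name are
illustrative):

from pyspark import SparkConf, SparkContext

# ship an extra jar to the executors via spark.jars
conf = (SparkConf()
        .setAppName("my-app")
        .set("spark.jars", "/path/to/extra-package.jar"))
sc = SparkContext(conf=conf)

# and/or ship a pure python dependency to the workers
sc.addPyFile("/path/to/dependency.py")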



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22688.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to add jars to standalone pyspark program

2015-04-28 Thread jamborta
Hi Mark,

That does not look like a Python path issue; the spark-assembly jar should
have those packaged and should make them available to the workers. Did you
build the jar yourself?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-jars-to-standalone-pyspark-program-tp22685p22687.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Bigints in pyspark

2015-04-27 Thread jamborta
hi all,

I have just come across a problem with a table that has a few bigint
columns: it seems that if I read that table into a dataframe and then collect
it in pyspark, the bigints are stored as plain integers in Python.

(The problem is that when I write it back to another table, I detect the Hive
type programmatically from the Python type, so those columns get turned into
int columns.)

Is this intended behaviour or a bug?
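
For what it's worth, the workaround I am considering is to pin the schema
explicitly when writing back, instead of inferring it from the Python types.
A rough sketch (Spark 1.3+ API; the names are illustrative):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# collected rows whose bigint ids have become plain python ints
rows = [(1234567890123, "a"), (9876543210987, "b")]

# declare the bigint column as LongType rather than inferring it
schema = StructType([StructField("big_id", LongType(), True),
                     StructField("name", StringType(), True)])

df = sqlContext.createDataFrame(rows, schema)
df.printSchema()  # big_id: long (i.e. bigint on the hive side), not int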

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Bigints-in-pyspark-tp22668.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



ML regression - spark context dies without error

2015-04-23 Thread jamborta
Hi all,

I have been testing Spark ML algorithms with bigger dataset, and ran into
some problems with linear regression:

It seems the executors stop without any apparent reason:

15/04/22 20:15:05 INFO BlockManagerInfo: Added rdd_12492_80 in memory on
backend-node:48037 (size: 28.5 MB, free: 2.8 GB)
15/04/22 20:15:05 INFO BlockManagerInfo: Added rdd_12493_80 in memory on
backend-node:48037 (size: 37.6 MB, free: 2.7 GB)
15/04/22 20:15:08 INFO BlockManagerInfo: Added rdd_12489_81 in memory on
backend-node:48037 (size: 8.4 MB, free: 2.7 GB)
[E 150422 20:15:12 java_gateway:483] Error while sending or receiving.
Traceback (most recent call last):
  File
/home/azureuser/spark-1.3.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
line 479, in send_command
raise Py4JError(Answer from Java side is empty)
Py4JError: Answer from Java side is empty

Then sparkcontext stops, too :

[E 150422 20:15:12 java_gateway:431] An error occurred while trying to
connect to the Java server

The problem is that it does not happen all the time; it only fails maybe
once in every five attempts.

Any suggestions on where I can get more detailed logs?

Thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ML-regression-spark-context-dies-without-error-tp22633.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



writing to hdfs on master node much faster

2015-04-20 Thread jamborta
Hi all,

I have a three node cluster with identical hardware. I am trying a workflow
where it reads data from hdfs, repartitions it and runs a few map operations
then writes the results back to hdfs.

It looks like all the computation, including the repartitioning and the
maps, completes within similar time intervals on all the nodes, except for
the final write back to HDFS, where the master node does the job much faster
than the slaves (15s for each block as opposed to 1.2 min on the slaves).

Any suggestion what the reason might be?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark streaming

2015-03-27 Thread jamborta
Hi all,

We have a workflow that pulls in data from CSV files. The original setup was
to parse the data as it comes in (turn it into an array) and then store it.
This resulted in out-of-memory errors with larger files (as a result of
increased GC?).

It turns out that if the data gets stored as a string first and parsed
afterwards, the issue does not occur.

Why is that?

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tp22255.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Dataframe groupby custom functions (python)

2015-03-24 Thread jamborta
Hi all,

I have been trying out the new dataframe api in 1.3, which looks great by
the way.

I have found an example to define udfs and add them to select operations,
like this:

slen = F.udf(lambda s: len(s), IntegerType())
df.select(df.age, slen(df.name).alias('slen')).collect()
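
For completeness, a runnable version of that example (assuming a sqlContext;
the toy rows are mine, not from the docs):

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# UDF that computes the length of the name column
slen = F.udf(lambda s: len(s), IntegerType())

df = sqlContext.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])
print(df.select(df.age, slen(df.name).alias('slen')).collect())
# [Row(age=30, slen=5), Row(age=25, slen=3)]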

Is it possible to do something similar with aggregates? Something like this:

gdf = df.groupBy(df.name)
gdf.agg(slen(df.age)).collect()

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-groupby-custom-functions-python-tp22205.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Buffering for Socket streams

2015-03-20 Thread jamborta
Hi all,

We are designing a workflow where we stream local files to a socket streamer
that cleans and processes the files and writes them to HDFS. We have an issue
with bigger files, where the streamer cannot keep up with the data and runs
out of memory.

What would be the best way to implement an approach where the socket stream
receiver notifies the sender to stop sending more data (and stop reading from
disk too?) just before it would run out of memory?

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Buffering-for-Socket-streams-tp22164.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



stack map functions in a loop (pyspark)

2015-02-19 Thread jamborta
Hi all,

I think I have run into an issue with the lazy evaluation of variables in
pyspark. I have the following:

functions = [func1, func2, func3]

for counter in range(len(functions)):
    data = data.map(lambda value: [functions[counter](value)])

It looks like counter is only evaluated when the RDD is computed, so all three
mappers end up with its last value. Is there any way to force it to be
evaluated at definition time? (I am aware that I could persist after each
step, which sounds a bit of a waste.)
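
For reference, the workaround I have so far is the standard Python trick of
binding the current function with a default argument when the lambda is
defined (stand-in functions below; is there a cleaner pyspark-native way?):

# stand-ins for func1/func2/func3 and the source RDD
functions = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
data = sc.parallelize([1.0, 2.0, 3.0])

for counter in range(len(functions)):
    # f=functions[counter] captures the current function at definition time,
    # instead of every lambda reading the final value of counter when the RDD
    # is eventually computed
    data = data.map(lambda value, f=functions[counter]: f(value))

print(data.collect())  # [1.0, 3.0, 5.0]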

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stack-map-functions-in-a-loop-pyspark-tp21722.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Interact with streams in a non-blocking way

2015-02-13 Thread jamborta
Hi all,

I am trying to come up with a workflow where I can query streams
asynchronously. The problem is that the ssc.awaitTermination() line blocks
the whole thread, so it is not clear to me whether it is possible to get hold
of objects from the streams once they are started. Any suggestions on the
best way to implement this?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Interact-with-streams-in-a-non-blocking-way-tp21640.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



no option to add intercepts for StreamingLinearAlgorithm

2015-02-05 Thread jamborta
hi all,

Just wondering if there is a reason why it is not possible to add intercepts
to streaming regression models? I understand that the run method in the
underlying GeneralizedLinearModel does not take an intercept parameter
either. Any reason for that?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/no-option-to-add-intercepts-for-StreamingLinearAlgorithm-tp21526.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is there a way to access Hive UDFs in a HiveContext?

2015-02-05 Thread jamborta
Hi,

My guess is that Spark is not picking up the jar where the function is
stored. You might have to add it to the SparkContext or the classpath manually.

You can also register the function:

hc.registerFunction("myfunct", myfunct)

then use it in the query.
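
A minimal end-to-end sketch of that (the names are illustrative, and I am
assuming a HiveContext with 1.3-style imports):

from pyspark.sql import HiveContext
from pyspark.sql.types import IntegerType

hc = HiveContext(sc)

# plain python function exposed to SQL under the name "myfunct"
def myfunct(s):
    return len(s)

hc.registerFunction("myfunct", myfunct, IntegerType())
hc.sql("SELECT myfunct(name) FROM my_table").show()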



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-access-Hive-UDFs-in-a-HiveContext-tp21510p21527.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Best way to shut down a stream initiated from the recevier

2015-01-27 Thread jamborta
Hi all,

We are building a custom JDBC receiver that creates a stream from SQL
tables. I am not sure what the best way is to shut down the stream once all
the data has gone through: the receiver knows it has completed, but it cannot
initiate the shutdown of the stream itself.

Any suggestions on how to structure this?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-shut-down-a-stream-initiated-from-the-recevier-tp21390.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark context not picking up default hadoop filesystem

2015-01-26 Thread jamborta
hi all,

I am trying to create a Spark context programmatically, using
org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the Hadoop
config created during the process is not picking up core-site.xml, so it
defaults back to the local file system. I have set HADOOP_CONF_DIR in
spark-env.sh and also placed core-site.xml in the conf folder. The whole
thing works if it is executed through the Spark shell.

Just wondering where Spark picks up the Hadoop config path from?

many thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-context-not-picking-up-default-hadoop-filesystem-tp21368.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
hi all,

I have been experimenting with the following sequence: create a SparkContext
-> create a StreamingContext -> add a few streams -> start -> stop -> create
new streams -> start a new (or the existing) StreamingContext with the new
streams.

(I need to keep the existing SparkContext alive, as it also runs other Spark
jobs.)
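
Roughly, the pattern I am attempting looks like this (a pyspark sketch with
illustrative socket streams; my actual streams differ):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)                     # first streaming context
ssc.socketTextStream("localhost", 9999).pprint()  # illustrative stream
ssc.start()
# ... later ...
ssc.stop(stopSparkContext=False)                  # keep the SparkContext alive

ssc2 = StreamingContext(sc, 5)                    # new context on the same sc
ssc2.socketTextStream("localhost", 9998).pprint()
ssc2.start()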

I ran into a few problems:

- I cannot seem to create a new streaming context after another one was shut
down. I get this error:
15/01/21 12:43:16 INFO MetricsSystem: Metrics already registered
java.lang.IllegalArgumentException: A metric named
app-20150121123832-0008.driver.Spark
shell.StreamingMetrics.streaming.lastCompletedBatch_processStartTime already
exists

- Or if I try to start the one that was stopped I get this:
org.apache.spark.SparkException: StreamingContext has already been started

- It seems even after the streaming context is stopped, it still shows up in
the job info (spark web UI).

is there a better way to do this?

thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-restart-or-stop-and-create-a-StreamingContext-tp21291.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Possible to restart (or stop and create) a StreamingContext

2015-01-21 Thread jamborta
Just found this in the documentation:

"A SparkContext can be re-used to create multiple StreamingContexts, as long
as the previous StreamingContext is stopped (without stopping the
SparkContext) before the next StreamingContext is created."

In that case, I assume the error I reported above is a bug.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-restart-or-stop-and-create-a-StreamingContext-tp21291p21294.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



dynamically change receiver for a spark stream

2015-01-20 Thread jamborta
Hi all,

We have been trying to set up a stream using a custom receiver that picks up
data from SQL databases. We'd like to keep that streaming context running and
change the streams dynamically, adding and removing streams on demand.
Alternatively, if the set of streams is fixed, is it possible to stop a
stream, change its config and start it again?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/dynamically-change-receiver-for-a-spark-stream-tp21268.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: save spark streaming output to single file on hdfs

2015-01-15 Thread jamborta
thanks for the replies. very useful.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-tp21124p21176.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming python files not packaged in assembly jar

2015-01-15 Thread jamborta
Hi all, 

Just discovered that the streaming folder of pyspark is not included in the
assembly jar (spark-assembly-1.2.0-hadoop2.3.0.jar), although it is included
in the python folder. Any reason why?

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-python-files-not-packaged-in-assembly-jar-tp21177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



save spark streaming output to single file on hdfs

2015-01-13 Thread jamborta
Hi all,

Is there a way to save DStream RDDs to a single file so that another process
can pick them up as a single RDD?
It seems that each slice is saved to a separate folder when using the
saveAsTextFiles method.

I'm using Spark 1.2 with pyspark.

thanks,







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-tp21124.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark with Hive cluster dependencies

2015-01-07 Thread jamborta
Hi all, 

We have been building a system where we rely heavily on Hive queries
executed through Spark to load and manipulate data, running on CDH and YARN.
I have been trying to explore lighter setups where we would not have to
maintain a Hadoop cluster and could run the system on Spark only.

Is it possible to run Spark standalone and set up Hive alongside it, without
the Hadoop cluster? If not, any suggestions on how we can replicate the
convenience of Hive tables (and Hive SQL) without Hive?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-with-Hive-cluster-dependencies-tp21017.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: why MatrixFactorizationModel private?

2014-11-26 Thread jamborta
many thanks for adding this so quickly.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763p19855.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



why MatrixFactorizationModel private?

2014-11-25 Thread jamborta
Hi all,

seems that all the mllib models are declared accessible in the package,
except MatrixFactorizationModel, which is declared private to mllib. Any
reason why?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: why MatrixFactorizationModel private?

2014-11-25 Thread jamborta
hi Xiangrui,

thanks. that is a very useful feature.

any suggestion on saving/loading the model in the meantime?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-MatrixFactorizationModel-private-tp19763p19783.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No module named pyspark - latest built

2014-11-13 Thread jamborta
it was built with 1.6 (tried 1.7, too)

On Thu, Nov 13, 2014 at 2:52 AM, Andrew Or-2 [via Apache Spark User
List] ml-node+s1001560n18797...@n3.nabble.com wrote:
 Hey Jamborta,

 What java version did you build the jar with?

 2014-11-12 16:48 GMT-08:00 jamborta [hidden email]:

 I have figured out that building the fat jar with sbt does not seem to
 included the pyspark scripts using the following command:

 sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
 publish-local assembly

 however the maven command works OK:

 mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
 clean package

 am I running the correct sbt command?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18787.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: [hidden email]
 For additional commands, e-mail: [hidden email]




 
 If you reply to this email, your message will be added to the discussion
 below:
 http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18797.html




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18833.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

No module named pyspark - latest built

2014-11-12 Thread jamborta
Hi all,

I am trying to run Spark with the latest build (from branch-1.2). As far as
I can see, all the paths are set and the SparkContext starts up OK; however,
I cannot run anything that goes out to the nodes. I get the following error:

Error from python worker:
  /usr/bin/python2.7: No module named pyspark
PYTHONPATH was:
 
/mnt/yarn/nm/usercache/massive/filecache/15/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0.jar
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)

any idea where it is picking up this path from?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
Forgot to mention that this setup works in Spark standalone mode; the
problem only occurs when I run on YARN.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18777.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No module named pyspark - latest built

2014-11-12 Thread jamborta
I have figured out that building the fat jar with sbt does not seem to
include the pyspark scripts when using the following command:

sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
publish-local assembly

however the maven command works OK:

mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
clean package

am I running the correct sbt command?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/No-module-named-pyspark-latest-built-tp18740p18787.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark SNAPSHOT repo

2014-11-10 Thread jamborta
thanks, that looks good.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-SNAPSHOT-repo-tp18502p18505.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
Hello,

I am trying to build Spark from source using the following:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean
package

this works OK with branch-1.1, when I switch to branch-1.2, I get the
following error while building spark-core:

[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING]
/home/spark/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala:50:
inferred existential type (org.apache.spark.scheduler.DirectTaskResult[_$1],
Int) forSome { type _$1 }, which cannot be expressed by wildcards,  should
be enabled
by making the implicit value scala.language.existentials visible.
This can be achieved by adding the import clause 'import
scala.language.existentials'
or by setting the compiler option -language:existentials.
See the Scala docs for value scala.language.existentials for a discussion
why the feature should be explicitly enabled.
[WARNING]   val (result, size) =
serializer.get().deserialize[TaskResult[_]](serializedData) match {
[WARNING]   ^
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[WARNING] Class org.eclipse.jetty.server.DispatcherType not found -
continuing with a stub.
[ERROR] 
 while compiling:
/home/spark/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala
during phase: typer
 library version: version 2.10.4
compiler version: version 2.10.4

...

Exception in thread main java.lang.AssertionError: assertion failed:
org.eclipse.jetty.server.DispatcherType
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1212)
at
scala.reflect.internal.Types$ClassTypeRef$class.baseType(Types.scala:2186)
at 
scala.reflect.internal.Types$TypeRef$$anon$6.baseType(Types.scala:2544)
at scala.reflect.internal.Types$class.firstTry$1(Types.scala:6043)
at scala.reflect.internal.Types$class.isSubType2(Types.scala:6207)
at scala.reflect.internal.Types$class.isSubType(Types.scala:5816)
at scala.reflect.internal.SymbolTable.isSubType(SymbolTable.scala:13)
at scala.reflect.internal.Types$class.isSubArg$1(Types.scala:6005)
at
scala.reflect.internal.Types$$anonfun$isSubArgs$2.apply(Types.scala:6007)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-spark-from-source-assertion-failed-org-eclipse-jetty-server-DispatcherType-tp18529.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Building spark from source - assertion failed: org.eclipse.jetty.server.DispatcherType

2014-11-10 Thread jamborta
ah, thanks, reverted a few days, works now.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-spark-from-source-assertion-failed-org-eclipse-jetty-server-DispatcherType-tp18529p18532.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: why decision trees do binary split?

2014-11-06 Thread jamborta
I meant above that, in the case of categorical variables, it might be more
efficient to create a child node for each categorical value. Is there a
reason why Spark went down the binary route?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188p18265.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: I want to make clear the difference about executor-cores number.

2014-11-06 Thread jamborta
The only difference between the two setups (if you only vary the executor
cores) is how many tasks run in parallel (the total number of tasks depends
on other factors), so try to inspect the stages while the job is running
(probably easier to do with longer-running tasks) by clicking on one of the
stages.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/I-want-to-make-clear-the-difference-about-executor-cores-number-tp18183p18272.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



most efficient way to send data from Scala to python

2014-11-06 Thread jamborta
Hi all,

Is there a way in Spark to send data (an RDD[Array]) from the Scala
component to the Python component?

I saw a method that serialises double arrays (serializeDoubleMatrix), but it
requires the data to be collected first. I assume that step would pull all
the data to the driver.

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/most-efficient-way-to-send-data-from-Scala-to-python-tp18287.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



why decision trees do binary split?

2014-11-05 Thread jamborta
Hi,

Just wondering what is the reason that the decision tree implementation in
spark always does binary splits?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: I want to make clear the difference about executor-cores number.

2014-11-05 Thread jamborta
If you go to your Spark job UI (probably on http://master-node:4040) and
click on the Environment tab, you can check whether the settings are
correctly picked up by Spark.

Also, when you run the job you can see the subtasks (Stages tab); inside a
task you can check what resources are assigned to it. If you are not getting
4 cores assigned (where appropriate), it means something is wrong with your
config.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/I-want-to-make-clear-the-difference-about-executor-cores-number-tp18183p18189.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: using LogisticRegressionWithSGD.train in Python crashes with Broken pipe

2014-11-05 Thread jamborta
Hi Rok, 

You could try to debug it by first collecting your training_set and checking
whether it gets you something back, before passing it to the train method.
Then go through each line in the train method (and the serializer) and check
where exactly it fails.

thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-LogisticRegressionWithSGD-train-in-Python-crashes-with-Broken-pipe-tp18182p18190.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: executors not created yarn-cluster mode

2014-10-08 Thread jamborta
hi Andrew,

Thanks for the reply. I tried to tune the memory and set it as low as
possible, with no luck.

My guess is that this issue is related to what is discussed here:

http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-td11668.html

that is, the executors cannot connect back to the driver (in my case I am not
sure whether they are even started). I could not find a way to debug this, as
the log files don't have any errors in them.

thanks





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/executors-not-created-yarn-cluster-mode-tp15957p15976.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any issues with repartition?

2014-10-03 Thread jamborta
I have two nodes with 96G RAM and 16 cores; my setup is as follows:

conf = (SparkConf()
        .setMaster("yarn-cluster")
        .set("spark.executor.memory", "30G")
        .set("spark.cores.max", "32")
        .set("spark.executor.instances", "2")
        .set("spark.executor.cores", "8")
        .set("spark.akka.timeout", "1")
        .set("spark.akka.askTimeout", "100")
        .set("spark.akka.frameSize", "500")
        .set("spark.cleaner.ttl", "86400")
        .set("spark.task.maxFailures", "16")
        .set("spark.worker.timeout", "150"))

thanks a lot,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Any-issues-with-repartition-tp13462p15674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: how to debug ExecutorLostFailure

2014-10-03 Thread jamborta
Digging a bit deeper, the executors get lost when the memory usage gets close
to the physical memory size:

http://apache-spark-user-list.1001560.n3.nabble.com/file/n15680/memory_usage.png

I'm not clear whether I am allocating too much or too little memory in this case.

thanks, 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-debug-ExecutorLostFailure-tp15646p15680.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how to debug ExecutorLostFailure

2014-10-02 Thread jamborta
hi all,

I have a job that runs for about 15 mins; at some point I get an error on
both nodes (all executors) saying:

14/10/02 23:14:38 WARN TaskSetManager: Lost task 80.0 in stage 3.0 (TID 253,
backend-tes): ExecutorLostFailure (executor lost)

In the end, it seems that the job recovers and completes the task. Just
wondering what the best way is to understand why these tasks failed (I
couldn't find anything in the logs), and how to avoid this in the future?

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-debug-ExecutorLostFailure-tp15646.html



Re: Any issues with repartition?

2014-10-02 Thread jamborta
Hi Arun,

Have you found a solution? Seems that I have the same problem.

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Any-issues-with-repartition-tp13462p15654.html



Re: new error for me

2014-10-02 Thread jamborta
have you found a solution this problem? (or at least a cause)

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/new-error-for-me-tp10378p15655.html



spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Hi all,

I cannot figure out why this command is not setting the driver memory (it is
setting the executor memory):

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("test")
        .set("spark.driver.memory", "1G")
        .set("spark.executor.memory", "1G")
        .set("spark.executor.instances", "2")
        .set("spark.executor.cores", "4"))
sc = SparkContext(conf=conf)

whereas if I run the spark console:
./bin/pyspark --driver-memory 1G

it sets it correctly. Seemingly they both generate the same commands in the
logs.

thanks a lot,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html



Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
when you say respective backend code to launch it, I thought this is
the way to do that.

thanks,
Tamas

On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin [via Apache Spark User
List] ml-node+s1001560n15506...@n3.nabble.com wrote:
 Because that's not how you launch apps in cluster mode; you have to do
 it through the command line, or by calling directly the respective
 backend code to launch it.

 (That being said, it would be nice to have a programmatic way of
 launching apps that handled all this - this has been brought up in a
 few different contexts, but I don't think there's an official
 solution yet.)

 On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor [hidden email] wrote:

 thanks Marcelo.

 What's the reason it is not possible in cluster mode, either?

 On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin [hidden email] wrote:
 You can't set up the driver memory programatically in client mode. In
 that mode, the same JVM is running the driver, so you can't modify
 command line options anymore when initializing the SparkContext.

 (And you can't really start cluster mode apps that way, so the only
 way to set this is through the command line / config files.)

 On Wed, Oct 1, 2014 at 9:26 AM, jamborta [hidden email] wrote:
 Hi all,

 I cannot figure out why this command is not setting the driver memory
 (it is
 setting the executor memory):

 conf = (SparkConf()
 .setMaster(yarn-client)
 .setAppName(test)
 .set(spark.driver.memory, 1G)
 .set(spark.executor.memory, 1G)
 .set(spark.executor.instances, 2)
 .set(spark.executor.cores, 4))
 sc = SparkContext(conf=conf)

 whereas if I run the spark console:
 ./bin/pyspark --driver-memory 1G

 it sets it correctly. Seemingly they both generate the same commands in
 the
 logs.

 thanks a lot,





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: [hidden email]
 For additional commands, e-mail: [hidden email]




 --
 Marcelo



 --
 Marcelo

 -
 To unsubscribe, e-mail: [hidden email]
 For additional commands, e-mail: [hidden email]



 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498p15507.html

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Thank you for the replies. It makes sense for Scala/Java, but in Python the
JVM is launched when the SparkContext is initialised, so it should be able
to set it, I assume.
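
For reference, the alternatives from the replies quoted below boil down to
something like this; the PYSPARK_SUBMIT_ARGS route is an assumption about how
the pyspark gateway picks up submit arguments, not something confirmed in this
thread:

import os

# Option 1 (from the replies): set spark.driver.memory in conf/spark-defaults.conf
# on the submitting machine, or pass --driver-memory 1g to bin/pyspark / spark-submit.
#
# Option 2 (assumed, version-dependent): export the submit args before the first
# SparkContext is created, so the gateway JVM starts with the requested heap.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 1g pyspark-shell"

from pyspark import SparkConf, SparkContext
sc = SparkContext(conf=SparkConf().setMaster("yarn-client").setAppName("test"))
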
On 1 Oct 2014 18:25, Andrew Or-2 [via Apache Spark User List] 
ml-node+s1001560n15510...@n3.nabble.com wrote:

 Hi Tamas,

 Yes, Marcelo is right. The reason why it doesn't make sense to set
 spark.driver.memory in your SparkConf is because your application code,
 by definition, *is* the driver. This means by the time you get to the
 code that initializes your SparkConf, your driver JVM has already started
 with some heap size, and you can't easily change the size of the JVM once
 it has started. Note that this is true regardless of the deploy mode
 (client or cluster).

 Alternatives to set this include the following: (1) You can set
 spark.driver.memory in your `spark-defaults.conf` on the node that
 submits the application, (2) You can use the --driver-memory command line
 option if you are using Spark submit (bin/pyspark goes through this path,
 as you have discovered on your own).

 Does that make sense?


 2014-10-01 10:17 GMT-07:00 Tamas Jambor [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=0:

 when you say respective backend code to launch it, I thought this is
 the way to do that.

 thanks,
 Tamas

 On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=1 wrote:
  Because that's not how you launch apps in cluster mode; you have to do
  it through the command line, or by calling directly the respective
  backend code to launch it.
 
  (That being said, it would be nice to have a programmatic way of
  launching apps that handled all this - this has been brought up in a
  few different contexts, but I don't think there's an official
  solution yet.)
 
  On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=2 wrote:
  thanks Marcelo.
 
  What's the reason it is not possible in cluster mode, either?
 
  On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=3 wrote:
  You can't set up the driver memory programatically in client mode. In
  that mode, the same JVM is running the driver, so you can't modify
  command line options anymore when initializing the SparkContext.
 
  (And you can't really start cluster mode apps that way, so the only
  way to set this is through the command line / config files.)
 
  On Wed, Oct 1, 2014 at 9:26 AM, jamborta [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=4 wrote:
  Hi all,
 
  I cannot figure out why this command is not setting the driver
 memory (it is
  setting the executor memory):
 
  conf = (SparkConf()
  .setMaster(yarn-client)
  .setAppName(test)
  .set(spark.driver.memory, 1G)
  .set(spark.executor.memory, 1G)
  .set(spark.executor.instances, 2)
  .set(spark.executor.cores, 4))
  sc = SparkContext(conf=conf)
 
  whereas if I run the spark console:
  ./bin/pyspark --driver-memory 1G
 
  it sets it correctly. Seemingly they both generate the same commands
 in the
  logs.
 
  thanks a lot,
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=5
  For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=6
 
 
 
 
  --
  Marcelo
 
 
 
  --
  Marcelo

 -
 To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=7
 For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=15510i=8





ExecutorLostFailure kills sparkcontext

2014-09-29 Thread jamborta
hi all,

I have a problem with my application when I increase the data size over 5GB
(the cluster has about 100GB memory to handle that). First I get this
warning:

 WARN TaskSetManager: Lost task 10.1 in stage 4.1 (TID 408, backend-node1):
FetchFailed(BlockManagerId(3, backend-node0, 41484, 0), shuffleId=1, mapId=0,
reduceId=18)

then this one:

14/09/29 23:26:44 WARN TaskSetManager: Lost task 2.0 in stage 5.2 (TID 418,
backend-node1): ExecutorLostFailure (executor lost)

a few seconds later all the executors shut down:

14/09/29 23:26:53 ERROR YarnClientSchedulerBackend: Yarn application already
ended: FINISHED
14/09/29 23:26:53 INFO SparkUI: Stopped Spark web UI at
http://backend-node0:4040
14/09/29 23:26:53 INFO YarnClientSchedulerBackend: Shutting down all
executors

even the SparkContext stops.

Not sure how to debug this; there is nothing in the logs apart from the above.
I have given enough memory to all the executors.

thanks for the help,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ExecutorLostFailure-kills-sparkcontext-tp15370.html



yarn does not accept job in cluster mode

2014-09-27 Thread jamborta
hi all,

I have a job that works OK in yarn-client mode, but when I try yarn-cluster
mode it returns the following:

WARN YarnClusterScheduler: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
memory

the cluster has plenty of memory and resources. I am running this from
python using this context:

conf = (SparkConf()
        .setMaster("yarn-cluster")
        .setAppName("spark_tornado_server")
        .set("spark.executor.memory", "1024m")
        .set("spark.cores.max", "16")
        .set("spark.driver.memory", "1024m")
        .set("spark.executor.instances", "2")
        .set("spark.executor.cores", "8")
        .set("spark.eventLog.enabled", "false"))

HADOOP_HOME and HADOOP_CONF_DIR are also set in spark-env.

Not sure if I am missing some config.

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-does-not-accept-job-in-cluster-mode-tp15281.html



Re: How to run hive scripts pro-grammatically in Spark 1.1.0 ?

2014-09-27 Thread jamborta
Hi,

you can create a spark context in your python or scala environment and use
that to run your hive queries, pretty much the same way as you'd do it in
the shell.
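
A minimal pyspark sketch of what I mean (the table name is made up; assumes a
Hive-enabled build - on 1.0.x the method was hql() rather than sql()):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive_scripts")
hive = HiveContext(sc)

# Run HiveQL the same way as in the hive shell; results come back as rows you can collect().
hive.sql("CREATE TABLE IF NOT EXISTS test_table (key INT, value STRING)")
row_count = hive.sql("SELECT COUNT(*) FROM test_table").collect()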

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-hive-scripts-pro-grammatically-in-Spark-1-1-0-tp15225p15282.html



Re: New user question on Spark SQL: can I really use Spark SQL like a normal DB?

2014-09-27 Thread jamborta
hi,

Yes, I have been using Spark SQL extensively that way.

I have just tried it, and saveAsTable() works OK on 1.1.0.

Alternatively, you can write the data from the schemaRDD to HDFS using
saveAsTextFile and create an external table on top of it.
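
Roughly what I mean by the alternative - a sketch only, with made-up paths and
schema (hc is an existing HiveContext):

# Write the schemaRDD out as delimited text, then point an external table at it.
results = hc.sql("SELECT key, value FROM source_table")
results.map(lambda r: "%s,%s" % (r[0], r[1])).saveAsTextFile("/user/test/new_table_data")

hc.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS new_table (key INT, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/test/new_table_data'
""")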

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/New-user-question-on-Spark-SQL-can-I-really-use-Spark-SQL-like-a-normal-DB-tp15248p15283.html



Re: Job cancelled because SparkContext was shut down

2014-09-26 Thread jamborta
Just wanted to answer my question in case someone else runs into the same
problem.

It is related to the problem discussed here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html

and here:

https://issues.apache.org/jira/browse/SPARK-2121

It seems YARN kills some of the executors because they request more memory than
expected.
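
The corresponding fix is to leave more headroom in the YARN container - a
minimal sketch, assuming this Spark-on-YARN version supports
spark.yarn.executor.memoryOverhead (the values are illustrative):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.memory", "20g")                 # executor heap
        .set("spark.yarn.executor.memoryOverhead", "2048"))  # extra MB of off-heap headroom per container
sc = SparkContext(conf=conf)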



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-cancelled-because-SparkContext-was-shut-down-tp15189p15216.html



mappartitions data size

2014-09-26 Thread jamborta
Hi all,

I am using mapPartitions to do some heavy computing on subsets of the data.
I have a dataset with about 1m rows, running on a 32-core cluster.
Unfortunately, it seems that mapPartitions splits the data into only two
partitions, so it is only running on two cores.

Is there a way to force it to split into smaller chunks? 
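
Something like this is what I am after - would forcing more partitions before
the mapPartitions call be the way to go? (A sketch; the partition count is
illustrative and heavy_compute stands for the existing per-partition function.)

rdd = sc.textFile("hdfs:///data/input")    # made-up input
rdd = rdd.repartition(64)                  # mapPartitions runs one task per partition,
result = rdd.mapPartitions(heavy_compute)  # so more partitions = more parallel tasks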

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mappartitions-data-size-tp15231.html



Job cancelled because SparkContext was shut down

2014-09-25 Thread jamborta
hi all,

I am getting this strange error about halfway through the job (running Spark
1.1 in yarn-client mode):

14/09/26 00:54:06 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@4d0155fb
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/09/26 00:54:06 INFO YarnClientSchedulerBackend: Executor 1 disconnected,
so removing it

then a few minutes later the whole process dies:

14/09/26 01:00:12 ERROR YarnClientSchedulerBackend: Yarn application already
ended: FINISHED
14/09/26 01:00:13 INFO SparkUI: Stopped Spark web UI at
http://backend-dev:4040
14/09/26 01:00:13 INFO YarnClientSchedulerBackend: Shutting down all
executors
14/09/26 01:00:13 INFO YarnClientSchedulerBackend: Asking each executor to
shut down
[E 140926 01:00:13 base:56] Request failed
14/09/26 01:00:13 INFO YarnClientSchedulerBackend: Stopped
[E 140926 01:00:13 base:57] {'error_msg': type 'exceptions.Exception',
org.apache.spark.SparkException: Job cancelled because SparkContext was shut
down, traceback object at 0x4483cb0}

any idea what's going on here?

thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Job-cancelled-because-SparkContext-was-shut-down-tp15189.html



Re: Yarn number of containers

2014-09-25 Thread jamborta
thanks.


On Thu, Sep 25, 2014 at 10:25 PM, Marcelo Vanzin [via Apache Spark
User List] ml-node+s1001560n15177...@n3.nabble.com wrote:
 From spark-submit --help:

  YARN-only:
   --executor-cores NUMNumber of cores per executor (Default: 1).
   --queue QUEUE_NAME  The YARN queue to submit to (Default:
 default).
   --num-executors NUM Number of executors to launch (Default: 2).
   --archives ARCHIVES Comma separated list of archives to be
 extracted into the
   working directory of each executor.

 On Thu, Sep 25, 2014 at 2:20 PM, Tamas Jambor [hidden email] wrote:

 Thank you.

 Where is the number of containers set?

 On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin [hidden email] wrote:
 On Thu, Sep 25, 2014 at 8:55 AM, jamborta [hidden email] wrote:
 I am running spark with the default settings in yarn client mode. For
 some
 reason yarn always allocates three containers to the application
 (wondering
 where it is set?), and only uses two of them.

 The default number of executors in Yarn mode is 2; so you have 2
 executors + the application master, so 3 containers.

 Also the cpus on the cluster never go over 50%, I turned off the fair
 scheduler and set high spark.cores.max. Is there some additional
 settings I
 am missing?

 You probably need to request more cores (--executor-cores). Don't
 remember if that is respected in Yarn, but should be.

 --
 Marcelo



 --
 Marcelo

 -
 To unsubscribe, e-mail: [hidden email]
 For additional commands, e-mail: [hidden email]



 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-number-of-containers-tp15148p15191.html

Null values in pyspark Row

2014-09-24 Thread jamborta
Hi all,

I have just updated to spark 1.1.0. The new row representation of the data
in spark SQL is very handy.

I have noticed that it does not set None for NULL values coming from Hive if
the column is of string type - it seems to work with other types.

Is that something that will be implemented?

Thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Null-values-in-pyspark-Row-tp15065.html



access javaobject in rdd map

2014-09-23 Thread jamborta
Hi all,

I have a java object that contains an ML model which I would like to use for
prediction (in python). I just want to iterate the data through a mapper and
predict for each value. Unfortunately, this fails when it tries to serialise
the object to send it to the nodes.

Is there a trick around this? Surely, this object could be picked up by
reference at the nodes.

many thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/access-javaobject-in-rdd-map-tp14898.html



Re: access javaobject in rdd map

2014-09-23 Thread jamborta
Great. Thanks a lot.
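
For anyone else hitting this, the pure-Python-copy suggestion quoted below ends
up looking roughly like this (the accessor names on the Java model are made up):

# Copy just what prediction needs out of the JVM model into plain Python data,
# broadcast that, and use it inside map().
weights = list(java_model.weights())
intercept = float(java_model.intercept())
bc = sc.broadcast((weights, intercept))

def predict(features):
    w, b = bc.value
    return sum(wi * fi for wi, fi in zip(w, features)) + b

predictions = data.map(predict)
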
On 23 Sep 2014 18:44, Davies Liu-2 [via Apache Spark User List] 
ml-node+s1001560n14908...@n3.nabble.com wrote:

 Right now, there is no way to access JVM in Python worker, in order
 to make this happen, we need to do:

 1. setup py4j in Python worker
 2. serialize the JVM objects and transfer to executors
 3. link the JVM objects and py4j together to get an interface

 Before these happens, maybe you could try to setup a service
 for the model (such as RESTful service), access it map via RPC.

 On Tue, Sep 23, 2014 at 9:48 AM, Tamas Jambor [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14908i=0 wrote:

  Hi Davies,
 
  Thanks for the reply. I saw that you guys do that way in the code. Is
  there no other way?
 
  I have implemented all the predict functions in scala, so I prefer not
  to reimplement the whole thing in python.
 
  thanks,
 
 
  On Tue, Sep 23, 2014 at 5:40 PM, Davies Liu [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14908i=1 wrote:
  You should create a pure Python object (copy the attributes from Java
 object),
   then it could be used in map.
 
  Davies
 
  On Tue, Sep 23, 2014 at 8:48 AM, jamborta [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14908i=2 wrote:
  Hi all,
 
  I have a java object that contains a ML model which I would like to
 use for
  prediction (in python). I just want to iterate the data through a
 mapper and
  predict for each value. Unfortunately, this fails when it tries to
 serialise
  the object to sent it to the nodes.
 
  Is there a trick around this? Surely, this object could be picked up
 by
  reference at the nodes.
 
  many thanks,
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/access-javaobject-in-rdd-map-tp14898.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14908i=3
  For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14908i=4
 

 -
 To unsubscribe, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14908i=5
 For additional commands, e-mail: [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14908i=6








--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/access-javaobject-in-rdd-map-mllib-tp14898p14920.html

Efficient way to sum multiple columns

2014-09-15 Thread jamborta
Hi all,

I have an RDD that contains around 50 columns. I need to sum each column,
which I am doing by running it through a for loop, creating an array and
running the sum function as follows:

for (i <- 0 until 10) yield {
   data.map(x => x(i)).sum
}

Is there a better way to do this?
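
For example, would a single element-wise reduce be the better way? A
pyspark-style sketch of the idea (the Scala version is analogous; assumes every
row is a numeric sequence of equal length):

# One pass over the data instead of one job per column.
sums = data.map(list).reduce(lambda a, b: [x + y for x, y in zip(a, b)])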

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-sum-multiple-columns-tp14281.html



Re: spark sql - create new_table as select * from table

2014-09-12 Thread jamborta
Thanks, I will try it that way.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14090.html



spark sql - create new_table as select * from table

2014-09-11 Thread jamborta
Hi, 

I am trying to create a new table from a select query as follows:

CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED
BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/test/new_table' AS select * from table

this works in Hive, but in Spark SQL (1.0.2) I am getting an "Unsupported
language features in query" error.

Could you suggest why I am getting this?

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006.html



Re: spark sql - create new_table as select * from table

2014-09-11 Thread jamborta
Thanks, this was actually using a HiveContext.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html



Re: Spark SQL check if query is completed (pyspark)

2014-09-08 Thread jamborta
thank you for the replies.

I am running an insert on a join (INSERT OVERWRITE TABLE new_table select *
from table1 as a join table2 as b on (a.key = b.key)).

The process does not have the right permission to write to that folder, so I
get the following error printed:
chgrp: `/user/x/y': No such file or directory
chmod: `/user/x/y': No such file or directory
and it returns an empty RDD without raising an exception.

thanks,




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630p13685.html



Spark SQL check if query is completed (pyspark)

2014-09-06 Thread jamborta
Hi,

I am using Spark SQL to run some administrative queries and joins (e.g.
create table, insert overwrite, etc), where the query does not return any
data. I noticed that if the query fails it prints an error message on the
console but does not actually throw an exception (this is Spark 1.0.2).

Is there any way to get these errors from the returned object?

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-check-if-query-is-completed-pyspark-tp13630.html



spark sql results maintain order (in python)

2014-09-04 Thread jamborta
hi,

I ran into a problem with Spark SQL: when I run a query like "select
count(*), city, industry from table group by hour" and want to take the
results from the schemaRDD,

1. I have to parse each line to get the values out of the dict (e.g. in order
to convert it to a csv, see the sketch below);
2. The order is not kept in a Python dict - I couldn't find a way to maintain
the original order (especially a problem in this case, where the column names
are derived).
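
The kind of workaround I have in mind is roughly this (a sketch; assumes the
dict-style rows described above, with the column list written out by hand
because the derived names and order aren't reliable):

columns = ["cnt", "city", "industry"]      # made-up names for the derived columns
rows = sqlContext.sql(query).collect()     # sqlContext and query as set up elsewhere
csv_lines = [",".join(str(row[c]) for c in columns) for row in rows]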

thanks, 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-results-maintain-order-in-python-tp13445.html



Print to spark log

2014-08-28 Thread jamborta
Hi all,

Just wondering if there is a way to use logging to print some additional info
to the Spark logs (similar to debug in Scalding).
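
For example, would something along these lines be the right approach? (This
goes through the gateway's private _jvm handle, so presumably it is not an
official API.)

# Reuse Spark's own log4j so the messages land in the same logs as Spark's.
log4j = sc._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("my_app")
logger.info("additional info from the driver")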

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html



Re: Print to spark log

2014-08-28 Thread jamborta
Thanks for the reply.

I was looking for something for the case when it's running outside of the
Spark framework: if I declare a SparkContext or an RDD, could that print some
messages in the log?

The problem I have is that if I print something from the Scala object that runs
the Spark app, it does not print to the console for some reason.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035p13057.html



running multiple applications at the same time

2014-06-26 Thread jamborta
Hi all,

Not sure if this is a config issue or by design, but when I run the Spark shell
and try to submit another application from elsewhere, the second application
waits for the first to finish and outputs the following:

Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient memory. 

I have four workers; each has some spare resources to take on the new
application.

thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/running-multiple-applications-at-the-same-time-tp8333.html


create SparkContext dynamically

2014-06-18 Thread jamborta
Hi all,

I am setting up a system where spark contexts would be created by a web
server that would handle the computation and return the results. I have the
following code (in Python):

os.environ['SPARK_HOME'] = "/home/spark/spark-1.0.0-bin-hadoop2/"
sc = SparkContext(master="spark://ip-xx-xx-xx-xx:7077", appName="Simple App")
l = sc.parallelize([1, 2, 3, 4])
c = l.count()

but it throws an unrelated error 'TypeError: an integer is required' in the
last line.

I assume I did not set up the environment properly. I have added the spark
home and the py4j source to the path; not sure what is missing.
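
For completeness, the environment setup before the snippet above looks roughly
like this (the paths and the py4j zip name depend on the install, so treat them
as placeholders):

import os, sys

os.environ["SPARK_HOME"] = "/home/spark/spark-1.0.0-bin-hadoop2/"
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python/lib/py4j-0.8.1-src.zip"))

from pyspark import SparkContext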

thanks,





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/create-SparkContext-dynamically-tp7872.html


Re: pyspark regression results way off

2014-06-17 Thread jamborta
Thanks, will try normalising it.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7720.html


pyspark regression results way off

2014-06-16 Thread jamborta
Hi all,

I am testing the regression methods (SGD) using pyspark. I tried to tune the
parameters, but the results are far off from those obtained using R. Is there
a way to set these parameters more effectively?
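
For example, is scaling the features and then shrinking the step size the right
direction? A sketch with illustrative values (the scale factor is made up):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# SGD is very sensitive to feature scale, so normalise first, then tune step/iterations.
scaled = data.map(lambda lp: LabeledPoint(lp.label, [x / 1000.0 for x in lp.features]))
model = LinearRegressionWithSGD.train(scaled, iterations=500, step=0.01)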

thanks,  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672.html


Re: pyspark regression results way off

2014-06-16 Thread jamborta
forgot to mention that I'm running spark 1.0



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html


Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-30 Thread jamborta
Thanks for the reply. I am definitely running 1.0.0; I set it up manually.

To answer my own question, I found out from the examples that it needs a new
data type called LabeledPoint instead of a numpy array.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-MLlib-examples-don-t-work-with-Spark-1-0-0-tp6546p6579.html