Spark-submit and multiple files

2015-03-17 Thread poiuytrez
Hello guys, 

I am having a hard time understanding how spark-submit behaves with multiple
files. I have created two code snippets; each consists of a main.py and a
work.py. The code works if I paste work.py and then main.py into a pyspark
shell. However, neither snippet works when run with spark-submit, and each one
generates a different error.

Function add_1 definition outside
http://www.codeshare.io/4ao8B
https://justpaste.it/jzvj

Embedded add_1 function definition
http://www.codeshare.io/OQJxq
https://justpaste.it/jzvn

I am trying to find a way to make this work.

Thank you for your support.
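
For reference, here is a minimal sketch of the layout I am describing (the
names and contents are illustrative; the real snippets are in the links above):

# work.py -- helper module defining the add_1 function
def add_1(x):
    return x + 1

# main.py -- driver script that uses the helper on an RDD
from pyspark import SparkContext
from work import add_1

if __name__ == "__main__":
    sc = SparkContext(appName="add-one-example")
    print(sc.parallelize(range(10)).map(add_1).collect())
    sc.stop()

# submitted with work.py shipped to the executors via --py-files:
# spark-submit --master spark://spark-m:7077 --py-files work.py main.py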






Error when using multiple python files spark-submit

2015-03-16 Thread poiuytrez
I have a spark app which is composed of multiple files. 

When I launch Spark using: 

../hadoop/spark-install/bin/spark-submit main.py --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077

I am getting an error:

15/03/13 15:54:24 INFO TaskSetManager: Lost task 6.3 in stage 413.0 (TID 5817) on executor spark-w-3.c.databerries.internal: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/home/hadoop/spark-install/python/pyspark/worker.py", line 90, in main
    command = pickleSer._read_with_length(infile)
  File "/home/hadoop/spark-install/python/pyspark/serializers.py", line 151, in _read_with_length
    return self.loads(obj)
  File "/home/hadoop/spark-install/python/pyspark/serializers.py", line 396, in loads
    return cPickle.loads(obj)
ImportError: No module named naive

This is strange because I am not serializing anything explicitly, and naive.py
is available on every machine at the same path.

Any insight on what could be going on? The issue does not happen on my
laptop.

PS: I am using Spark 1.2.0.
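
One thing worth double-checking (a guess on my side, not something confirmed
in this thread): spark-submit treats everything that comes after the
application script as arguments to that script, so --py-files and --master
placed after main.py may never be seen by spark-submit itself. Reordering the
invocation, with the same paths, would give:

../hadoop/spark-install/bin/spark-submit \
  --master spark://spark-m:7077 \
  --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py \
  main.py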






Spark 1.3 dataframe documentation

2015-02-24 Thread poiuytrez
Hello, 

I have built Spark 1.3 and can successfully use the DataFrame API. However, I
am not able to find its API documentation for Python. Do you know when this
documentation will be available?

Best Regards,
poiuytrez






Movie Recommendation tutorial

2015-02-23 Thread poiuytrez
Hello,

I am following the "Movie Recommendation with MLlib" tutorial
(https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html).
However, I get RMSE values that are much larger than what is written at step 7. I get:

"The best model was trained with rank = 8 and lambda = 1.0, and numIter = 10,
and its RMSE on the test set is 1.357078."

instead of

"The best model was trained using rank 8 and lambda 10.0, and its RMSE on
test is 0.8808492431998702."

Is this a mistake in the tutorial, or am I doing something wrong?

Thank you
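
For context, here is a minimal sketch of the kind of grid search the tutorial
describes (variable names are illustrative; training and validation are
assumed to be RDDs of Rating objects):

from math import sqrt
from pyspark.mllib.recommendation import ALS

best_rmse, best_model = float("inf"), None
for rank in [8, 12]:
    for lmbda in [0.1, 1.0, 10.0]:
        model = ALS.train(training, rank, iterations=10, lambda_=lmbda)
        pairs = validation.map(lambda r: (r.user, r.product))
        preds = model.predictAll(pairs).map(lambda r: ((r.user, r.product), r.rating))
        truth = validation.map(lambda r: ((r.user, r.product), r.rating))
        # root mean squared error over the validation set
        rmse = sqrt(truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean())
        if rmse < best_rmse:
            best_rmse, best_model = rmse, model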






Re: Movie Recommendation tutorial

2015-02-23 Thread poiuytrez
What do you mean? 






Implicit ALS with multiple features

2015-02-19 Thread poiuytrez
Hello, 

I would like to use the Spark MLlib recommendation (collaborative filtering)
library. My goal is to predict what a user would like to buy based on what
they bought before.

I read in the Spark documentation that Spark supports implicit feedback;
however, there is no example for this kind of application. Would implicit
feedback work for my business case, and how?

Can ALS accept multiple features? Currently I have:
(userId, productId, nbPurchased)
I would like to add another feature:
(userId, productId, nbPurchased, frequency)

Is this possible with ALS?

Thank you for your reply
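
For the implicit-feedback part, here is a minimal sketch of what I have in
mind (assuming an RDD of (userId, productId, nbPurchased) tuples called
purchases, where the purchase count acts as the confidence value; as far as I
understand, ALS only takes one value per user/product pair, so an extra
feature such as frequency would have to be folded into that single value):

from pyspark.mllib.recommendation import ALS, Rating

ratings = purchases.map(lambda t: Rating(int(t[0]), int(t[1]), float(t[2])))
# trainImplicit treats the value as a confidence weight rather than a rating
model = ALS.trainImplicit(ratings, rank=10, iterations=10, lambda_=0.01, alpha=40.0)

# score a single (user, product) pair
score = model.predict(42, 123)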






Re: OutOfMemoryError with random forest and small training dataset

2015-02-12 Thread poiuytrez
Very interesting. It works. 

When I set SPARK_DRIVER_MEMORY=83971m in spark-env.sh or spark-defaults.conf,
it works. However, when I set the --driver-memory option with spark-submit,
the memory is not allocated to the Spark master (the web UI shows the correct
value of spark.driver.memory (83971m), but the memory is not correctly
allocated, as can be seen on the web UI executor page).

I am going to file an issue in the bug tracker. 

Thank you for your help.
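
For reference, these are the three equivalent places I was comparing for the
driver memory setting (values mirror the ones above):

# spark-env.sh
export SPARK_DRIVER_MEMORY=83971m

# spark-defaults.conf
spark.driver.memory 83971m

# spark-submit flag
spark-submit --driver-memory 83971m ...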






Re: OutOfMemoryError with random forest and small training dataset

2015-02-11 Thread poiuytrez
cat ../hadoop/spark-install/conf/spark-env.sh 
export SCALA_HOME=/home/hadoop/scala-install 
export SPARK_WORKER_MEMORY=83971m 
export SPARK_MASTER_IP=spark-m 
export SPARK_DAEMON_MEMORY=15744m 
export SPARK_WORKER_DIR=/hadoop/spark/work 
export SPARK_LOCAL_DIRS=/hadoop/spark/tmp 
export SPARK_LOG_DIR=/hadoop/spark/logs 
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.2-hadoop1.jar
export MASTER=spark://spark-m:7077 

poiuytrez@spark-m:~$ cat ../hadoop/spark-install/conf/spark-defaults.conf 
spark.master spark://spark-m:7077 
spark.eventLog.enabled true 
spark.eventLog.dir gs://-spark/spark-eventlog-base/spark-m 
spark.executor.memory 83971m 
spark.yarn.executor.memoryOverhead 83971m 


I am using spark-submit.






Re: OutOfMemoryError with random forest and small training dataset

2015-02-11 Thread poiuytrez
cat ../hadoop/spark-install/conf/spark-env.sh
export SCALA_HOME=/home/hadoop/scala-install
export SPARK_WORKER_MEMORY=83971m
export SPARK_MASTER_IP=spark-m
export SPARK_DAEMON_MEMORY=15744m
export SPARK_WORKER_DIR=/hadoop/spark/work
export SPARK_LOCAL_DIRS=/hadoop/spark/tmp
export SPARK_LOG_DIR=/hadoop/spark/logs
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.2-hadoop1.jar
export MASTER=spark://spark-m:7077

poiuytrez@spark-m:~$ cat ../hadoop/spark-install/conf/spark-defaults.conf
spark.master spark://spark-m:7077
spark.eventLog.enabled true
spark.eventLog.dir gs://databerries-spark/spark-eventlog-base/spark-m
spark.executor.memory 83971m
spark.yarn.executor.memoryOverhead 83971m


I am using spark-submit.






OutOfMemoryError with random forest and small training dataset

2015-02-11 Thread poiuytrez
Hello guys, 

I am trying to run a Random Forest on 30 MB of data. I have a cluster of 4
machines. Each machine has 106 GB of RAM and 16 cores.

I am getting: 
15/02/11 11:01:23 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem
[sparkDriver]
java.lang.OutOfMemoryError: Java heap space

That's very weird. Any idea of what's wrong with my configuration? 

PS: I am running Spark 1.2.
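
For context, a minimal sketch of the kind of MLlib call involved (variable
names are illustrative, data is assumed to be an RDD of LabeledPoint, and my
actual parameters may differ):

from pyspark.mllib.tree import RandomForest

# data: RDD[LabeledPoint], roughly 30 MB of training data
model = RandomForest.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={},
    numTrees=100, featureSubsetStrategy="auto",
    impurity="gini", maxDepth=4, maxBins=32)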






Re: MLlib linking error Mac OS X

2014-10-20 Thread poiuytrez
This is my error:
14/10/17 10:24:56 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/10/17 10:24:56 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS

However, it seems to work. What does it mean?






MLlib and pyspark features

2014-10-17 Thread poiuytrez
Hello,

I would like to use areaUnderROC from MLlib in Apache Spark. I am currently
running Spark 1.1.0, and this function is not available in pyspark but is
available in Scala.

Is there a feature tracker that follows the progress of porting the Scala
APIs to Python?

I have tried searching the official JIRA, but I could not find any ticket
corresponding to this.

Best,
poiuytrez








Re: MLlib linking error Mac OS X

2014-10-17 Thread poiuytrez
Hello MLnick,

Have you found a solution for installing MLlib on Mac OS? I am also having
some trouble installing the dependencies.

Best,
poiuytrez






Re: Spark SQL - Exception only when using cacheTable

2014-10-13 Thread poiuytrez
This is how the table was created:

transactions = parts.map(lambda p: Row(customer_id=long(p[0]),
    chain=int(p[1]), dept=int(p[2]), category=int(p[3]), company=int(p[4]),
    brand=int(p[5]), date=str(p[6]), productsize=float(p[7]),
    productmeasure=str(p[8]), purchasequantity=int(p[9]),
    purchaseamount=float(p[10])))

# Infer the schema, and register the SchemaRDD as a table
schemaTransactions = sqlContext.inferSchema(transactions)
schemaTransactions.registerTempTable("transactions")
sqlContext.cacheTable("transactions")

t = sqlContext.sql("SELECT * FROM transactions WHERE purchaseamount = 50")
t.count()


Thank you,
poiuytrez






Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread poiuytrez
I am using the Python API. Unfortunately, I cannot find an equivalent of the
isCached method in the documentation
(https://spark.apache.org/docs/1.1.0/api/python/index.html), in the SQLContext
section.






Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread poiuytrez
Hi Cheng, 

I am using Spark 1.1.0. 
This is the stack trace:
14/10/10 12:17:40 WARN TaskSetManager: Lost task 120.0 in stage 7.0 (TID 2235, spark-w-0.c.db.internal): java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
    scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
    org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:146)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:105)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:92)
    org.apache.spark.sql.columnar.BasicColumnBuilder.appendFrom(ColumnBuilder.scala:72)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$NullableColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:57)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:76)
    org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:65)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:50)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)


This was also printed on the driver:

14/10/10 12:17:43 ERROR TaskSetManager: Task 120 in stage 7.0 failed 4
times; aborting job
14/10/10 12:17:43 INFO TaskSchedulerImpl: Cancelling stage 7
14/10/10 12:17:43 INFO TaskSchedulerImpl: Stage 7 was cancelled
14/10/10 12:17:43 INFO DAGScheduler: Failed to run collect at
SparkPlan.scala:85
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hadoop/spark-install/python/pyspark/sql.py", line 1606, in count
    return self._jschema_rdd.count()
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 7.0 failed 4 times, most recent failure: Lost task 120.3 in stage 7.0 (TID 2248, spark-w-0.c.db.internal): java.lang.ClassCastException:

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at ...

Debug a spark task

2014-10-09 Thread poiuytrez
Hi, 

I am parsing a CSV file with Spark using the map function. One of the lines of
the CSV file makes a task fail (and then the whole job fails). Is there a way
to do some debugging to find the line that fails?

Best regards,
poiuytrez
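
One approach I am considering (just a sketch; parse_line and the file path
stand in for my actual parsing function and data): wrap the parser so that bad
lines are tagged instead of killing the task, then inspect them separately:

def safe_parse(line):
    try:
        return ("ok", parse_line(line))
    except Exception as e:
        return ("bad", (line, repr(e)))

tagged = sc.textFile("data.csv").map(safe_parse).cache()
bad = tagged.filter(lambda t: t[0] == "bad").map(lambda t: t[1])
print(bad.take(5))          # show a few offending lines with their errors
good = tagged.filter(lambda t: t[0] == "ok").map(lambda t: t[1])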






Spark SQL - Exception only when using cacheTable

2014-10-09 Thread poiuytrez
Hello, 

I have a weird issue: this request works fine:

sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

However, when I cache the table before making the request:

sqlContext.cacheTable("transactions")
sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

I am getting an exception on one of the tasks:

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 104.0 failed 4 times, most recent failure: Lost task 120.3 in stage 104.0 (TID 20537, spark-w-0.c.internal): java.lang.ClassCastException:

(I have no details after the ':')

Any ideas of what could be wrong? 







Re: Debug a spark task

2014-10-09 Thread poiuytrez
Thanks for the tip!


