Spark-submit and multiple files
Hello guys, I am having a hard time understanding how spark-submit behaves with multiple files. I have created two code snippets, each composed of a main.py and a work.py. The code works if I paste work.py and then main.py into a pyspark shell. However, neither snippet works with spark-submit, and each generates a different error.

add_1 function defined outside: http://www.codeshare.io/4ao8B https://justpaste.it/jzvj
Embedded add_1 function definition: http://www.codeshare.io/OQJxq https://justpaste.it/jzvn

I am trying to find a way to make it work. Thank you for your support.
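For reference, a minimal sketch of the layout I am describing (the file contents and the submit command below are a simplified stand-in, not the exact snippets from the links): work.py defines the helper, main.py imports it, and the helper module is shipped to the executors with --py-files.

    # work.py (hypothetical minimal version)
    def add_1(x):
        return x + 1

    # main.py (hypothetical minimal version)
    from pyspark import SparkContext
    from work import add_1

    sc = SparkContext(appName="multi-file-example")
    print(sc.parallelize([1, 2, 3]).map(add_1).collect())

    # submit: note that --py-files comes before the application script
    # spark-submit --py-files work.py main.py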
Error when using multiple Python files with spark-submit
I have a Spark app which is composed of multiple files. When I launch Spark using:

../hadoop/spark-install/bin/spark-submit main.py --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077

I am getting an error:

15/03/13 15:54:24 INFO TaskSetManager: Lost task 6.3 in stage 413.0 (TID 5817) on executor spark-w-3.c.databerries.internal: org.apache.spark.api.python.PythonException (Traceback (most recent call last):
  File "/home/hadoop/spark-install/python/pyspark/worker.py", line 90, in main
    command = pickleSer._read_with_length(infile)
  File "/home/hadoop/spark-install/python/pyspark/serializers.py", line 151, in _read_with_length
    return self.loads(obj)
  File "/home/hadoop/spark-install/python/pyspark/serializers.py", line 396, in loads
    return cPickle.loads(obj)
ImportError: No module named naive

It is weird because I do not serialize anything. naive.py is also available on every machine at the same path. Any insight on what could be going on? The issue does not happen on my laptop.

PS: I am using Spark 1.2.0.
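One hedged guess: in the command above, --py-files and --master appear after main.py, and spark-submit treats everything after the application script as arguments to the script itself, so those options may never reach spark-submit. A sketch of the reordered command (same paths as above), plus an alternative that ships the modules from the driver with SparkContext.addPyFile:

    # spark-submit options must come before the application script; everything
    # after main.py is passed to main.py as program arguments.
    ../hadoop/spark-install/bin/spark-submit \
      --master spark://spark-m:7077 \
      --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py \
      main.py

    # Alternative sketch: distribute the modules from inside the driver instead.
    sc.addPyFile("/home/poiuytrez/naive.py")
    sc.addPyFile("/home/poiuytrez/processing.py")
    sc.addPyFile("/home/poiuytrez/settings.py")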
Spark 1.3 dataframe documentation
Hello, I have built Spark 1.3 and I can successfully use the DataFrame API. However, I am not able to find its Python API documentation. Do you know when the documentation will be available? Best Regards, poiuytrez
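For context, a minimal example of the kind of thing that does work for me (the file name and columns here are just placeholders, and sc is the shell's SparkContext):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    df = sqlContext.jsonFile("people.json")
    df.printSchema()
    df.filter(df.age > 21).select("name").show()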
Movie Recommendation tutorial
Hello, I am following the Movie Recommendation with MLlib tutorial (https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html). However, I get RMSE values that are much larger than what is written at step 7. I get:

"The best model was trained with rank = 8 and lambda = 1.0, and numIter = 10, and its RMSE on the test set is 1.357078."

instead of the tutorial's:

"The best model was trained using rank 8 and lambda 10.0, and its RMSE on test is 0.8808492431998702."

Is it a mistake in the tutorial or am I doing something wrong? Thank you
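In case the difference comes from how the RMSE is computed, this is roughly the shape of the computation I mean (my own paraphrase, not the tutorial's exact code; training and test are assumed to be RDDs of (userID, movieID, rating) tuples):

    from math import sqrt
    from pyspark.mllib.recommendation import ALS

    model = ALS.train(training, rank=8, iterations=10, lambda_=1.0)

    predictions = model.predictAll(test.map(lambda r: (r[0], r[1]))) \
                       .map(lambda p: ((p.user, p.product), p.rating))
    ratings_and_preds = test.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
    rmse = sqrt(ratings_and_preds.map(lambda x: (x[1][0] - x[1][1]) ** 2).mean())
    print("RMSE = %s" % rmse)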
Re: Movie Recommendation tutorial
What do you mean?
Implicit ALS with multiple features
Hello, I would like to use the Spark MLlib recommendation (collaborative filtering) library. My goal is to predict what a user would like to buy based on what he bought before. I read in the Spark documentation that Spark supports implicit feedback, but there is no example for this kind of application. Would implicit feedback work for my business case, and how? Can ALS accept multiple parameters? Currently I have: (userId, productId, nbPurchased). I would like to add another parameter: (userId, productId, nbPurchased, frequency). Is it possible with ALS? Thank you for your reply
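A minimal sketch of what I have in mind for the implicit case (my own guess, not from the docs; the rank and alpha values are placeholders). As far as I can tell, ALS takes exactly one value per (user, product) pair, so an extra feature such as frequency would have to be folded into that single confidence value rather than passed as a separate column:

    from pyspark.mllib.recommendation import ALS, Rating

    # purchases is an RDD of (userId, productId, nbPurchased) tuples; the
    # purchase count acts as the implicit "confidence" for that pair
    ratings = purchases.map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

    model = ALS.trainImplicit(ratings, rank=10, iterations=10, alpha=0.01)

    # score the affinity of user 42 for product 7
    print(model.predict(42, 7))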
Re: OutOfMemoryError with random forest and small training dataset
Very interesting, it works. When I set SPARK_DRIVER_MEMORY=83971m in spark-env.sh or spark-defaults.conf it works. However, when I set the --driver-memory option with spark-submit, the memory is not allocated to the Spark master (the web UI shows the correct value of spark.driver.memory (83971m), but the memory is not correctly allocated, as we can see on the web UI executors page). I am going to file an issue in the bug tracker. Thank you for your help.
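For anyone comparing, these are the three ways of setting the driver heap that I am talking about (the value is just the one from my cluster):

    # spark-env.sh (this works for me)
    export SPARK_DRIVER_MEMORY=83971m

    # spark-defaults.conf (this also works for me)
    spark.driver.memory 83971m

    # command line (this is the one that does not take effect for me)
    spark-submit --driver-memory 83971m main.py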
Re: OutOfMemoryError with random forest and small training dataset
cat ../hadoop/spark-install/conf/spark-env.sh
export SCALA_HOME=/home/hadoop/scala-install
export SPARK_WORKER_MEMORY=83971m
export SPARK_MASTER_IP=spark-m
export SPARK_DAEMON_MEMORY=15744m
export SPARK_WORKER_DIR=/hadoop/spark/work
export SPARK_LOCAL_DIRS=/hadoop/spark/tmp
export SPARK_LOG_DIR=/hadoop/spark/logs
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/hadoop-install/lib/gcs-connector-1.3.2-hadoop1.jar
export MASTER=spark://spark-m:7077

poiuytrez@spark-m:~$ cat ../hadoop/spark-install/conf/spark-defaults.conf
spark.master spark://spark-m:7077
spark.eventLog.enabled true
spark.eventLog.dir gs://-spark/spark-eventlog-base/spark-m
spark.executor.memory 83971m
spark.yarn.executor.memoryOverhead 83971m

I am using spark-submit.
OutOfMemoryError with random forest and small training dataset
Hello guys, I am trying to run a random forest on 30 MB of data. I have a cluster of 4 machines. Each machine has 106 GB of RAM and 16 cores. I am getting:

15/02/11 11:01:23 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space

That's very weird. Any idea of what's wrong with my configuration?

PS: I am running Spark 1.2
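For completeness, this is roughly the shape of the call I am making (a simplified sketch, not my exact code; the class count and tree parameters are placeholders):

    from pyspark.mllib.tree import RandomForest

    # data is an RDD of LabeledPoint built from the 30 MB training set
    model = RandomForest.trainClassifier(
        data,
        numClasses=2,
        categoricalFeaturesInfo={},
        numTrees=100,
        featureSubsetStrategy="auto",
        impurity="gini",
        maxDepth=10,
    )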
Re: MLlib linking error Mac OS X
This is my error:

14/10/17 10:24:56 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/10/17 10:24:56 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

However, it seems to work. What does it mean?
MLlib and pyspark features
Hello, I would like to use areaUnderROC from MLlib in Apache Spark. I am currently running Spark 1.1.0, and this function is available in Scala but not in pyspark. Is there a feature tracker that follows the progress of porting the Scala APIs to Python? I have tried to search the official JIRA but I could not find any ticket corresponding to this. Best, poiuytrez
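For context, this is the kind of workaround I would otherwise have to write by hand (my own sketch, not an MLlib API; it assumes sortBy and zipWithIndex are available in this pyspark version, and it ignores tied scores):

    def area_under_roc(score_and_labels):
        # score_and_labels is an RDD of (score, label) pairs with labels in {0.0, 1.0}
        ranked = score_and_labels.sortBy(lambda sl: sl[0]).zipWithIndex()
        n_pos = score_and_labels.filter(lambda sl: sl[1] == 1.0).count()
        n_neg = score_and_labels.count() - n_pos
        # Mann-Whitney U statistic from the rank sum of the positive examples
        rank_sum_pos = (ranked.filter(lambda x: x[0][1] == 1.0)
                              .map(lambda x: x[1] + 1)
                              .sum())
        return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / float(n_pos * n_neg)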
Re: MLlib linking error Mac OS X
Hello MLnick, Have you found a solution for installing MLlib on Mac OS? I also have some trouble installing the dependencies. Best, poiuytrez
Re: Spark SQL - Exception only when using cacheTable
This is how the table was created:

transactions = parts.map(lambda p: Row(
    customer_id=long(p[0]), chain=int(p[1]), dept=int(p[2]),
    category=int(p[3]), company=int(p[4]), brand=int(p[5]),
    date=str(p[6]), productsize=float(p[7]), productmeasure=str(p[8]),
    purchasequantity=int(p[9]), purchaseamount=float(p[10])))

# Infer the schema, and register the SchemaRDD as a table
schemaTransactions = sqlContext.inferSchema(transactions)
schemaTransactions.registerTempTable("transactions")
sqlContext.cacheTable("transactions")

t = sqlContext.sql("SELECT * FROM transactions WHERE purchaseamount = 50")
t.count()

Thank you, poiuytrez
Re: Spark SQL - Exception only when using cacheTable
I am using the Python API. Unfortunately, I cannot find the equivalent of the isCached method in the documentation: https://spark.apache.org/docs/1.1.0/api/python/index.html in the SQLContext section.
Re: Spark SQL - Exception only when using cacheTable
Hi Cheng, I am using Spark 1.1.0. This is the stack trace:

14/10/10 12:17:40 WARN TaskSetManager: Lost task 120.0 in stage 7.0 (TID 2235, spark-w-0.c.db.internal): java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
    scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
    org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:146)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:105)
    org.apache.spark.sql.columnar.INT$.getField(ColumnType.scala:92)
    org.apache.spark.sql.columnar.BasicColumnBuilder.appendFrom(ColumnBuilder.scala:72)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$NullableColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:57)
    org.apache.spark.sql.columnar.NativeColumnBuilder.org$apache$spark$sql$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:76)
    org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:88)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:65)
    org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:50)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

This was also printed on the driver:

14/10/10 12:17:43 ERROR TaskSetManager: Task 120 in stage 7.0 failed 4 times; aborting job
14/10/10 12:17:43 INFO TaskSchedulerImpl: Cancelling stage 7
14/10/10 12:17:43 INFO TaskSchedulerImpl: Stage 7 was cancelled
14/10/10 12:17:43 INFO DAGScheduler: Failed to run collect at SparkPlan.scala:85
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/hadoop/spark-install/python/pyspark/sql.py", line 1606, in count
    return self._jschema_rdd.count()
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/hadoop/spark-install/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o100.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 7.0 failed 4 times, most recent failure: Lost task 120.3 in stage 7.0 (TID 2248, spark-w-0.c.db.internal): java.lang.ClassCastException:
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at
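For what it's worth, the trace shows the cast failing while the in-memory columnar buffers are built, on a column stored as INT that receives a Long value. One workaround I could try is to skip inferSchema and declare the schema explicitly; a rough sketch (my own untested guess, not something from this thread; applySchema is the Spark 1.1 API for attaching an explicit schema, and the chosen types are assumptions):

    from pyspark.sql import StructType, StructField, LongType, IntegerType, StringType, DoubleType

    # Hypothetical workaround: declare every column type up front so that no
    # column ends up narrower than its data.
    schema = StructType([
        StructField("customer_id", LongType(), True),
        StructField("chain", IntegerType(), True),
        StructField("dept", IntegerType(), True),
        StructField("category", IntegerType(), True),
        StructField("company", IntegerType(), True),
        StructField("brand", IntegerType(), True),
        StructField("date", StringType(), True),
        StructField("productsize", DoubleType(), True),
        StructField("productmeasure", StringType(), True),
        StructField("purchasequantity", IntegerType(), True),
        StructField("purchaseamount", DoubleType(), True),
    ])

    rows = parts.map(lambda p: (long(p[0]), int(p[1]), int(p[2]), int(p[3]),
                                int(p[4]), int(p[5]), str(p[6]), float(p[7]),
                                str(p[8]), int(p[9]), float(p[10])))
    schemaTransactions = sqlContext.applySchema(rows, schema)
    schemaTransactions.registerTempTable("transactions")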
Debug a spark task
Hi, I am parsing a CSV file with Spark using the map function. One of the lines of the CSV file makes a task fail (and then the whole job fails). Is there a way to do some debugging to find the line that fails? Best regards, poiuytrez
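For concreteness, the kind of thing I have in mind: wrap the parser so bad lines are flagged instead of killing the task (a rough sketch; parse_line and the file name are placeholders for my real code):

    def parse_or_flag(line):
        try:
            return ("ok", parse_line(line))      # parse_line is the map function that currently fails
        except Exception as e:
            return ("bad", (line, str(e)))

    tagged = sc.textFile("transactions.csv").map(parse_or_flag).cache()
    # show a few offending lines together with their errors
    print(tagged.filter(lambda t: t[0] == "bad").map(lambda t: t[1]).take(5))
    good = tagged.filter(lambda t: t[0] == "ok").map(lambda t: t[1])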
Spark SQL - Exception only when using cacheTable
Hello, I have a weird issue. This request works fine:

sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

However, when I cache the table before making the request:

sqlContext.cacheTable("transactions")
sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count()

I am getting an exception on one of the tasks:

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 120 in stage 104.0 failed 4 times, most recent failure: Lost task 120.3 in stage 104.0 (TID 20537, spark-w-0.c.internal): java.lang.ClassCastException:

(I have no details after the ':') Any ideas of what could be wrong?
Re: Debug a spark task
Thanks for the tip!