Re: example LDA code ClassCastException
Thanks for the reply. Asher, have you experienced this problem when checkpoints are not enabled as well? If we have a large number of iterations (over 150) and checkpoints are not enabled, the process just hangs (with no error) at around iteration 120-140 (on Spark 2.0.0). I could not reproduce this outside of our data, unfortunately. On Fri, Nov 4, 2016 at 2:53 AM, Asher Krim wrote: > There is an open Jira for this issue (https://issues.apache.org/jira/browse/SPARK-14804). There have been a few proposed fixes so far. > > On Thu, Nov 3, 2016 at 2:20 PM, jamborta wrote: > >> Hi there, >> >> I am trying to run the example LDA code >> (http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda) >> on Spark 2.0.0/EMR 5.0.0 >> >> If I run it with checkpoints enabled (sc.setCheckpointDir("s3n://s3-path/")) >> >> ldaModel = LDA.train(corpus, k=3, maxIterations=200, >> checkpointInterval=10) >> >> I get the following error (sorry, quite long): >> >> Py4JJavaErrorTraceback (most recent call last) >> in () >> > 1 ldaModel = LDA.train(corpus, k=3, maxIterations=200, >> checkpointInterval=10) >> >> /usr/lib/spark/python/pyspark/mllib/clustering.py in train(cls, rdd, k, >> maxIterations, docConcentration, topicConcentration, seed, >> checkpointInterval, optimizer) >>1037 model = callMLlibFunc("trainLDAModel", rdd, k, >> maxIterations, >>1038 docConcentration, >> topicConcentration, >> seed, >> -> 1039 checkpointInterval, optimizer) >>1040 return LDAModel(model) >>1041 >> >> /usr/lib/spark/python/pyspark/mllib/common.py in callMLlibFunc(name, >> *args) >> 128 sc = SparkContext.getOrCreate() >> 129 api = getattr(sc._jvm.PythonMLLibAPI(), name) >> --> 130 return callJavaFunc(sc, api, *args) >> 131 >> 132 >> >> /usr/lib/spark/python/pyspark/mllib/common.py in callJavaFunc(sc, func, >> *args) >> 121 """ Call Java Function """ >> 122 args = [_py2java(sc, a) for a in args] >> --> 123 return _java2py(sc, func(*args)) >> 124 >> 125 >> >> /usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in >> __call__(self, *args) >> 931 answer = self.gateway_client.send_command(command) >> 932 return_value = get_return_value( >> --> 933 answer, self.gateway_client, self.target_id, >> self.name) >> 934 >> 935 for temp_arg in temp_args: >> >> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw) >> 61 def deco(*a, **kw): >> 62 try: >> ---> 63 return f(*a, **kw) >> 64 except py4j.protocol.Py4JJavaError as e: >> 65 s = e.java_exception.toString() >> >> /usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in >> get_return_value(answer, gateway_client, target_id, name) >> 310 raise Py4JJavaError( >> 311 "An error occurred while calling >> {0}{1}{2}.\n". >> --> 312 format(target_id, ".", name), value) >> 313 else: >> 314 raise Py4JError( >> >> Py4JJavaError: An error occurred while calling o115.trainLDAModel. >> : org.apache.spark.SparkException: Job aborted due to stage failure: >> Task 1 >> in stage 458.0 failed 4 times, most recent failure: Lost task 1.3 in stage >> 458.0 (TID 14827, ip-10-197-192-2.eu-west-1.compute.internal): >> java.lang.ClassCastException: scala.Tuple2 cannot be cast to >> org.apache.spark.graphx.Edge >> at >> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1. >> apply(EdgeRDD.scala:107) >> at scala.collection.Iterator$class.foreach(Iterator.scala:893) >> at >> org.apache.spark.InterruptibleIterator.foreach(Interruptible >> Iterator.scala:28) >> at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.
>> scala:107) >> at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD. >> scala:105) >> at >> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$ >> anonfun$apply$25.apply(RDD.scala:801) >> at >> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$ >> anonfun$apply$25.apply(RDD.scala:801) >> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR >> DD.scala:38) >> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala: >> 319) >> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) >> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR >> DD.scala:38) >> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala: >> 319) >> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) >> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) >> at >> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator >> $1.apply(BlockManager.scala:919) >> at >>
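The stack trace above matches SPARK-14804, where checkpointing of the GraphX edge RDD built by LDA's EM optimizer fails. A minimal, untested sketch of one way to sidestep that code path is to switch to the online optimizer, which does not go through GraphX at all; whether the resulting topics are acceptable for this data would need separate validation.

from pyspark.mllib.clustering import LDA

# "online" uses variational Bayes instead of the GraphX-backed EM optimizer,
# so it avoids the EdgeRDD checkpointing path that throws the ClassCastException
ldaModel = LDA.train(corpus, k=3, maxIterations=200,
                     checkpointInterval=10, optimizer="online")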
Re: store hive metastore on persistent store
ah, that explains it, many thanks! On Sat, May 16, 2015 at 7:41 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: oh...metastore_db location is not controlled by hive.metastore.warehouse.dir -- one is the location of your metastore DB, the other is the physical location of your stored data. Checkout this SO thread: http://stackoverflow.com/questions/13624893/metastore-db-created-wherever-i-run-hive On Sat, May 16, 2015 at 9:07 AM, Tamas Jambor jambo...@gmail.com wrote: Gave it another try - it seems that it picks up the variable and prints out the correct value, but still puts the metatore_db folder in the current directory, regardless. On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor jambo...@gmail.com wrote: Thank you for the reply. I have tried your experiment, it seems that it does not print the settings out in spark-shell (I'm using 1.3 by the way). Strangely I have been experimenting with an SQL connection instead, which works after all (still if I go to spark-shell and try to print out the SQL settings that I put in hive-site.xml, it does not print them). On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: My point was more to how to verify that properties are picked up from the hive-site.xml file. You don't really need hive.metastore.uris if you're not running against an external metastore. I just did an experiment with warehouse.dir. My hive-site.xml looks like this: configuration property namehive.metastore.warehouse.dir/name value/home/ykadiysk/Github/warehouse_dir/value descriptionlocation of default database for the warehouse/description /property /configuration and spark-shell code: scala val hc= new org.apache.spark.sql.hive.HiveContext(sc) hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3036c16f scala hc.sql(show tables).collect 15/05/15 14:12:57 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15 14:12:57 INFO ObjectStore: ObjectStore, initialize called 15/05/15 14:12:57 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 14:13:03 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/05/15 14:13:03 INFO ObjectStore: Initialized ObjectStore 15/05/15 14:13:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.12.0-protobuf-2.5 15/05/15 14:13:05 INFO HiveMetaStore: 0: get_tables: db=default pat=.* 15/05/15 14:13:05 INFO audit: ugi=ykadiysk ip=unknown-ip-addr cmd=get_tables: db=default pat=.* 15/05/15 14:13:05 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 14:13:05 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. res0: Array[org.apache.spark.sql.Row] = Array() scala hc.getConf(hive.metastore.warehouse.dir) res1: String = /home/ykadiysk/Github/warehouse_dir I have not tried an HDFS path but you should be at least able to verify that the variable is being read. 
It might be that your value is read but is otherwise not liked... On Fri, May 15, 2015 at 2:03 PM, Tamas Jambor jambo...@gmail.com wrote: thanks for the reply. I am trying to use it without hive setup (spark-standalone), so it prints something like this: hive_ctx.sql(show tables).collect() 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/05/15 17:59:04 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block manager :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819) [0/1844] 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming we
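For reference, the setting that actually controls where metastore_db lands is the Derby connection URL, not hive.metastore.warehouse.dir, which only controls where table data files go. A hedged PySpark sketch follows; the path is a placeholder, the property must be applied before the metastore is first touched, and the durable place for it is hive-site.xml.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="metastore-location-test")
hc = HiveContext(sc)
# the Derby connection URL decides where metastore_db is created;
# hive.metastore.warehouse.dir only decides where table data files go
hc.setConf("javax.jdo.option.ConnectionURL",
           "jdbc:derby:;databaseName=/data/hive/metastore_db;create=true")
hc.sql("show tables").collect()
print(hc.getConf("hive.metastore.warehouse.dir", ""))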
Re: store hive metastore on persistent store
Gave it another try - it seems that it picks up the variable and prints out the correct value, but still puts the metatore_db folder in the current directory, regardless. On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor jambo...@gmail.com wrote: Thank you for the reply. I have tried your experiment, it seems that it does not print the settings out in spark-shell (I'm using 1.3 by the way). Strangely I have been experimenting with an SQL connection instead, which works after all (still if I go to spark-shell and try to print out the SQL settings that I put in hive-site.xml, it does not print them). On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: My point was more to how to verify that properties are picked up from the hive-site.xml file. You don't really need hive.metastore.uris if you're not running against an external metastore. I just did an experiment with warehouse.dir. My hive-site.xml looks like this: configuration property namehive.metastore.warehouse.dir/name value/home/ykadiysk/Github/warehouse_dir/value descriptionlocation of default database for the warehouse/description /property /configuration and spark-shell code: scala val hc= new org.apache.spark.sql.hive.HiveContext(sc) hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3036c16f scala hc.sql(show tables).collect 15/05/15 14:12:57 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15 14:12:57 INFO ObjectStore: ObjectStore, initialize called 15/05/15 14:12:57 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 14:13:03 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/05/15 14:13:03 INFO ObjectStore: Initialized ObjectStore 15/05/15 14:13:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.12.0-protobuf-2.5 15/05/15 14:13:05 INFO HiveMetaStore: 0: get_tables: db=default pat=.* 15/05/15 14:13:05 INFO audit: ugi=ykadiysk ip=unknown-ip-addr cmd=get_tables: db=default pat=.* 15/05/15 14:13:05 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 14:13:05 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. res0: Array[org.apache.spark.sql.Row] = Array() scala hc.getConf(hive.metastore.warehouse.dir) res1: String = /home/ykadiysk/Github/warehouse_dir I have not tried an HDFS path but you should be at least able to verify that the variable is being read. It might be that your value is read but is otherwise not liked... On Fri, May 15, 2015 at 2:03 PM, Tamas Jambor jambo...@gmail.com wrote: thanks for the reply. 
I am trying to use it without hive setup (spark-standalone), so it prints something like this: hive_ctx.sql(show tables).collect() 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/05/15 17:59:04 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block manager :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819) [0/1844] 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : . 15/05/15 17:59:20 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:20 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:28 INFO Datastore: The class
Re: store hive metastore on persistent store
Thank you for the reply. I have tried your experiment, it seems that it does not print the settings out in spark-shell (I'm using 1.3 by the way). Strangely I have been experimenting with an SQL connection instead, which works after all (still if I go to spark-shell and try to print out the SQL settings that I put in hive-site.xml, it does not print them). On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: My point was more to how to verify that properties are picked up from the hive-site.xml file. You don't really need hive.metastore.uris if you're not running against an external metastore. I just did an experiment with warehouse.dir. My hive-site.xml looks like this: configuration property namehive.metastore.warehouse.dir/name value/home/ykadiysk/Github/warehouse_dir/value descriptionlocation of default database for the warehouse/description /property /configuration and spark-shell code: scala val hc= new org.apache.spark.sql.hive.HiveContext(sc) hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3036c16f scala hc.sql(show tables).collect 15/05/15 14:12:57 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15 14:12:57 INFO ObjectStore: ObjectStore, initialize called 15/05/15 14:12:57 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 14:13:03 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/05/15 14:13:03 INFO ObjectStore: Initialized ObjectStore 15/05/15 14:13:04 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.12.0-protobuf-2.5 15/05/15 14:13:05 INFO HiveMetaStore: 0: get_tables: db=default pat=.* 15/05/15 14:13:05 INFO audit: ugi=ykadiysk ip=unknown-ip-addr cmd=get_tables: db=default pat=.* 15/05/15 14:13:05 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 14:13:05 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. res0: Array[org.apache.spark.sql.Row] = Array() scala hc.getConf(hive.metastore.warehouse.dir) res1: String = /home/ykadiysk/Github/warehouse_dir I have not tried an HDFS path but you should be at least able to verify that the variable is being read. It might be that your value is read but is otherwise not liked... On Fri, May 15, 2015 at 2:03 PM, Tamas Jambor jambo...@gmail.com wrote: thanks for the reply. 
I am trying to use it without hive setup (spark-standalone), so it prints something like this: hive_ctx.sql(show tables).collect() 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/05/15 17:59:04 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block manager :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819) [0/1844] 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : . 15/05/15 17:59:20 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:20 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:28 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:29 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15
Re: store hive metastore on persistent store
thanks for the reply. I am trying to use it without hive setup (spark-standalone), so it prints something like this: hive_ctx.sql(show tables).collect() 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored 15/05/15 17:59:04 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies) 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block manager :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819) [0/1844] 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : . 15/05/15 17:59:20 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:20 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:28 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:29 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 15/05/15 17:59:31 INFO ObjectStore: Initialized ObjectStore 15/05/15 17:59:32 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.13.1aa 15/05/15 17:59:33 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-azure-file-system.properties,hadoop-metrics2.properties 15/05/15 17:59:33 INFO MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 15/05/15 17:59:33 INFO MetricsSystemImpl: azure-file-system metrics system started 15/05/15 17:59:33 INFO HiveMetaStore: Added admin role in metastore 15/05/15 17:59:34 INFO HiveMetaStore: Added public role in metastore 15/05/15 17:59:34 INFO HiveMetaStore: No user is added in admin role, since config is empty 15/05/15 17:59:35 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 15/05/15 17:59:37 INFO HiveMetaStore: 0: get_tables: db=default pat=.* 15/05/15 17:59:37 INFO audit: ugi=testuser ip=unknown-ip-addr cmd=get_tables: db=default pat=.* not sure what to put in hive.metastore.uris in this case? On Fri, May 15, 2015 at 2:52 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: This should work. Which version of Spark are you using? Here is what I do -- make sure hive-site.xml is in the conf directory of the machine you're using the driver from. 
Now let's run spark-shell from that machine: scala val hc= new org.apache.spark.sql.hive.HiveContext(sc) hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6e9f8f26 scala hc.sql(show tables).collect 15/05/15 09:34:17 INFO metastore: Trying to connect to metastore with URI thrift://hostname.com:9083 -- here should be a value from your hive-site.xml 15/05/15 09:34:17 INFO metastore: Waiting 1 seconds before next connection attempt. 15/05/15 09:34:18 INFO metastore: Connected to metastore. res0: Array[org.apache.spark.sql.Row] = Array([table1,false], scala hc.getConf(hive.metastore.uris) res13: String = thrift://hostname.com:9083 scala hc.getConf(hive.metastore.warehouse.dir) res14: String = /user/hive/warehouse The first line tells you which metastore it's trying to connect to -- this should be the string specified under hive.metastore.uris property in your hive-site.xml file. I have not mucked with warehouse.dir too much but I know that the value of the metastore URI is in fact picked up from there as I regularly point to different systems... On Thu, May 14, 2015 at 6:26 PM, Tamas Jambor jambo...@gmail.com wrote: I have tried to put the hive-site.xml file in the conf/ directory with, seems it is not picking up from there. On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust mich...@databricks.com wrote: You can configure Spark SQLs hive interaction by placing a hive-site.xml file in the conf/ directory. On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote: Hi all, is it possible to set hive.metastore.warehouse.dir, that is internally create by spark, to be stored externally (e.g. s3 on aws or wasb
Re: store hive metastore on persistent store
I have tried putting the hive-site.xml file in the conf/ directory, but it seems it is not being picked up from there. On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust mich...@databricks.com wrote: You can configure Spark SQL's hive interaction by placing a hive-site.xml file in the conf/ directory. On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote: Hi all, is it possible to set hive.metastore.warehouse.dir, which is internally created by spark, to be stored externally (e.g. s3 on aws or wasb on azure)? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/store-hive-metastore-on-persistent-store-tp22891.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
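For the original question about an external warehouse, pointing the warehouse at object storage amounts to setting hive.metastore.warehouse.dir. A hedged sketch: the bucket name is made up, and the S3/WASB filesystem jars and credentials still have to be available on the classpath.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="external-warehouse")
hc = HiveContext(sc)
# newly created managed tables should land under this location;
# the same property can live in conf/hive-site.xml instead
hc.setConf("hive.metastore.warehouse.dir", "s3n://my-bucket/hive/warehouse")
hc.sql("create table test_table (id int)")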
Re: writing to hdfs on master node much faster
Not sure what would slow it down: the repartition completes equally fast on all nodes, implying that the data is available on all of them, and the few computation steps that follow are not local to the master either. On Mon, Apr 20, 2015 at 12:57 PM, Sean Owen so...@cloudera.com wrote: What machines are HDFS data nodes -- just your master? that would explain it. Otherwise, is it actually the write that's slow or is something else you're doing much faster on the master for other reasons maybe? like you're actually shipping data via the master first in some local computation? so the master's executor has the result much faster? On Mon, Apr 20, 2015 at 12:21 PM, jamborta jambo...@gmail.com wrote: Hi all, I have a three node cluster with identical hardware. I am trying a workflow where it reads data from hdfs, repartitions it and runs a few map operations then writes the results back to hdfs. It looks like all the computation, including the repartitioning and the maps, completes within similar time intervals on all the nodes, except when writing back to HDFS, where the master node does the job much faster than the slaves (15s for each block as opposed to 1.2 min for the slaves). Any suggestion what the reason might be? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark streaming
It is just a comma-separated file, about 10 columns wide, which we append with a unique id and a few additional values. On Fri, Mar 27, 2015 at 2:43 PM, Ted Yu yuzhih...@gmail.com wrote: jamborta : Please also describe the format of your csv files. Cheers On Fri, Mar 27, 2015 at 6:42 AM, DW @ Gmail deanwamp...@gmail.com wrote: Show us the code. This shouldn't happen for the simple process you described Sent from my rotary phone. On Mar 27, 2015, at 5:47 AM, jamborta jambo...@gmail.com wrote: Hi all, We have a workflow that pulls in data from csv files; the original setup of the workflow was to parse the data as it comes in (turn it into an array), then store it. This resulted in out of memory errors with larger files (as a result of increased GC?). It turns out that if the data gets stored as a string first, then parsed, the issue does not occur. Why is that? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tp22255.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: multiple sparkcontexts and streamingcontexts
thanks for the reply. Actually, our main problem is not really about sparkcontext; the problem is that spark does not allow streaming contexts to be created dynamically, and once a stream is shut down, a new one cannot be created in the same sparkcontext. So we cannot create a service that would create and manage multiple streams - the same way that is possible with batch jobs. On Mon, Mar 2, 2015 at 2:52 PM, Sean Owen so...@cloudera.com wrote: I think everything there is to know about it is on JIRA; I don't think that's being worked on. On Mon, Mar 2, 2015 at 2:50 PM, Tamas Jambor jambo...@gmail.com wrote: I have seen there is a card (SPARK-2243) to enable that. Is that still going ahead? On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote: It is still not something you're supposed to do; in fact there is a setting (disabled by default) that throws an exception if you try to make multiple contexts. On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote: hi all, what is the current status and direction on enabling multiple sparkcontexts and streamingcontext? I have seen a few issues open on JIRA, which seem to be there for quite a while. thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/multiple-sparkcontexts-and-streamingcontexts-tp21876.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: multiple sparkcontexts and streamingcontexts
I have seen there is a card (SPARK-2243) to enable that. Is that still going ahead? On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote: It is still not something you're supposed to do; in fact there is a setting (disabled by default) that throws an exception if you try to make multiple contexts. On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote: hi all, what is the current status and direction on enabling multiple sparkcontexts and streamingcontext? I have seen a few issues open on JIRA, which seem to be there for quite a while. thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/multiple-sparkcontexts-and-streamingcontexts-tp21876.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: multiple sparkcontexts and streamingcontexts
Sorry, I meant once the stream is started, it's not possible to create new streams in the existing streaming context, and it's not possible to create new streaming context if another one is already running. So the only feasible option seemed to create new sparkcontexts for each stream (tried using spark-jobserver for that). On Mon, Mar 2, 2015 at 3:07 PM, Sean Owen so...@cloudera.com wrote: You can make a new StreamingContext on an existing SparkContext, I believe? On Mon, Mar 2, 2015 at 3:01 PM, Tamas Jambor jambo...@gmail.com wrote: thanks for the reply. Actually, our main problem is not really about sparkcontext, the problem is that spark does not allow to create streaming context dynamically, and once a stream is shut down, a new one cannot be created in the same sparkcontext. So we cannot create a service that would create and manage multiple streams - the same way that is possible with batch jobs. On Mon, Mar 2, 2015 at 2:52 PM, Sean Owen so...@cloudera.com wrote: I think everything there is to know about it is on JIRA; I don't think that's being worked on. On Mon, Mar 2, 2015 at 2:50 PM, Tamas Jambor jambo...@gmail.com wrote: I have seen there is a card (SPARK-2243) to enable that. Is that still going ahead? On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote: It is still not something you're supposed to do; in fact there is a setting (disabled by default) that throws an exception if you try to make multiple contexts. On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote: hi all, what is the current status and direction on enabling multiple sparkcontexts and streamingcontext? I have seen a few issues open on JIRA, which seem to be there for quite a while. thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/multiple-sparkcontexts-and-streamingcontexts-tp21876.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
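For what it's worth, the pattern Sean describes would look roughly like the sketch below, untested against the job-server setup discussed here: stop the streaming context without stopping the underlying SparkContext, then build a fresh streaming context on that same SparkContext.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-manager")

ssc1 = StreamingContext(sc, 5)
# ... define the first set of DStreams here, then ssc1.start() and run it ...
ssc1.stop(stopSparkContext=False, stopGraceFully=True)

# a second streaming context on the same, still-running SparkContext
ssc2 = StreamingContext(sc, 5)
# ... define the new DStreams here, then ssc2.start() ...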
Re: Interact with streams in a non-blocking way
Thanks for the reply. I am trying to set up a streaming-as-a-service approach, using the framework that is used for spark-jobserver. For that I would need to handle asynchronous operations that are initiated from outside of the stream. Do you think it is not possible? On Fri Feb 13 2015 at 10:14:18 Sean Owen so...@cloudera.com wrote: You call awaitTermination() in the main thread, and indeed it blocks there forever. From there Spark Streaming takes over, and is invoking the operations you set up. Your operations have access to the data of course. That's the model; you don't make external threads that reach in to Spark Streaming's objects, but can easily create operations that take whatever actions you want and invoke them in Streaming. On Fri, Feb 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, I am trying to come up with a workflow where I can query streams asynchronously. The problem I have is that the ssc.awaitTermination() line blocks the whole thread, so it is not straightforward to me whether it is possible to get hold of objects from streams once they are started. Any suggestion on what is the best way to implement this? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Interact-with-streams-in-a-non-blocking-way-tp21640.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
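One hedged way to square this with a service-style driver, sketched below: let a background thread own the blocking awaitTermination() call so the main thread stays free to serve requests and to stop the stream on demand.

import threading

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-as-a-service")
ssc = StreamingContext(sc, 5)
# ... define the DStream operations here ...
ssc.start()

def _block():
    ssc.awaitTermination()   # blocks only this background thread

t = threading.Thread(target=_block)
t.daemon = True
t.start()
# the main thread stays free to serve requests, inspect state, or call ssc.stop()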
Re: one is the default value for intercepts in GeneralizedLinearAlgorithm
Thanks for the reply. Seems it is all set to zero in the latest code - I was checking 1.2 last night. On Fri Feb 06 2015 at 07:21:35 Sean Owen so...@cloudera.com wrote: It looks like the initial intercept term is 1 only in the addIntercept numOfLinearPredictor == 1 case. It does seem inconsistent; since it's just an initial weight it may not matter to the final converged value. You can see a few notes in the class about how numOfLinearPredictor == 1 is handled a bit inconsistently and how a smarter choice of initial intercept could help convergence. So I don't know if this rises to the level of bug but I don't know that the difference is on purpose. On Thu, Feb 5, 2015 at 5:40 PM, jamborta jambo...@gmail.com wrote: hi all, I have been going through the GeneralizedLinearAlgorithm to understand how intercepts are handled in regression. Just noticed that the initial setting for the intercept is set to one (whereas the initial setting for the rest of the coefficients is set to zero) using the same piece of code that adds the 1 in front of each line in the data. Is this a bug? thanks, -- View this message in context: http://apache-spark-user-list. 1001560.n3.nabble.com/one-is-the-default-value-for-intercepts-in- GeneralizedLinearAlgorithm-tp21525.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark context not picking up default hadoop filesystem
thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning that it was deprecated (didn't solve the problem), also tried to run with --driver-class-path, which did not work either. I am trying this locally. On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com wrote: You can also trying adding the core-site.xml in the SPARK_CLASSPATH, btw are you running the application locally? or in standalone mode? Thanks Best Regards On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote: hi all, I am trying to create a spark context programmatically, using org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop config that is created during the process is not picking up core-site.xml, so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in spark-env.sh, also core-site.xml in in the conf folder. The whole thing works if it is executed through spark shell. Just wondering where spark is picking up the hadoop config path from? many thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-context-not-picking-up-default-hadoop-filesystem-tp21368.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
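In case it helps anyone hitting the same thing, two knobs worth checking are whether HADOOP_CONF_DIR is set in the environment of the process that launches the driver JVM, and the spark.hadoop.* passthrough, which Spark copies into the Hadoop Configuration it builds. A hedged sketch with placeholder hostname and paths:

import os

from pyspark import SparkConf, SparkContext

# has to be visible before the driver JVM is launched from this process
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"

conf = (SparkConf()
        .setAppName("hdfs-default-fs-test")
        # spark.hadoop.* entries are copied into the Hadoop Configuration Spark builds
        .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020"))
sc = SparkContext(conf=conf)
print(sc.textFile("/tmp/somefile").first())   # should now resolve against HDFS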
Re: dynamically change receiver for a spark stream
thanks for the replies. is this something we can get around? Tried to hack into the code without much success. On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I don't think current Spark Streaming support this feature, all the DStream lineage is fixed after the context is started. Also stopping a stream is not supported, instead currently we need to stop the whole streaming context to meet what you want. Thanks Saisai -Original Message- From: jamborta [mailto:jambo...@gmail.com] Sent: Wednesday, January 21, 2015 3:09 AM To: user@spark.apache.org Subject: dynamically change receiver for a spark stream Hi all, we have been trying to setup a stream using a custom receiver that would pick up data from sql databases. we'd like to keep that stream context running and dynamically change the streams on demand, adding and removing streams based on demand. alternativel, if a stream is fixed, is it possible to stop a stream, change to config and start again? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/dynamically-change-receiver-for-a-spark-stream-tp21268.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: dynamically change receiver for a spark stream
we were thinking along the same line, that is to fix the number of streams and change the input and output channels dynamically. But could not make it work (seems that the receiver is not allowing any change in the config after it started). thanks, On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas gerard.m...@gmail.com wrote: One possible workaround could be to orchestrate launch/stopping of Streaming jobs on demand as long as the number of jobs/streams stay within the boundaries of the resources (cores) you've available. e.g. if you're using Mesos, Marathon offers a REST interface to manage job lifecycle. You will still need to solve the dynamic configuration through some alternative channel. On Wed, Jan 21, 2015 at 11:30 AM, Tamas Jambor jambo...@gmail.com wrote: thanks for the replies. is this something we can get around? Tried to hack into the code without much success. On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I don't think current Spark Streaming support this feature, all the DStream lineage is fixed after the context is started. Also stopping a stream is not supported, instead currently we need to stop the whole streaming context to meet what you want. Thanks Saisai -Original Message- From: jamborta [mailto:jambo...@gmail.com] Sent: Wednesday, January 21, 2015 3:09 AM To: user@spark.apache.org Subject: dynamically change receiver for a spark stream Hi all, we have been trying to setup a stream using a custom receiver that would pick up data from sql databases. we'd like to keep that stream context running and dynamically change the streams on demand, adding and removing streams based on demand. alternativel, if a stream is fixed, is it possible to stop a stream, change to config and start again? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/dynamically-change-receiver-for-a-spark-stream-tp21268.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: dynamically change receiver for a spark stream
Hi Gerard, thanks, that makes sense. I'll try that out. Tamas On Wed, Jan 21, 2015 at 11:14 AM, Gerard Maas gerard.m...@gmail.com wrote: Hi Tamas, I meant not changing the receivers, but starting/stopping the Streaming jobs. So you would have a 'small' Streaming job for a subset of streams that you'd configure-start-stop on demand. I haven't tried myself yet, but I think it should also be possible to create a Streaming Job from the Spark Job Server ( https://github.com/spark-jobserver/spark-jobserver). Then you would have a REST interface that even gives you the possibility of passing a configuration. -kr, Gerard. On Wed, Jan 21, 2015 at 11:54 AM, Tamas Jambor jambo...@gmail.com wrote: we were thinking along the same line, that is to fix the number of streams and change the input and output channels dynamically. But could not make it work (seems that the receiver is not allowing any change in the config after it started). thanks, On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas gerard.m...@gmail.com wrote: One possible workaround could be to orchestrate launch/stopping of Streaming jobs on demand as long as the number of jobs/streams stay within the boundaries of the resources (cores) you've available. e.g. if you're using Mesos, Marathon offers a REST interface to manage job lifecycle. You will still need to solve the dynamic configuration through some alternative channel. On Wed, Jan 21, 2015 at 11:30 AM, Tamas Jambor jambo...@gmail.com wrote: thanks for the replies. is this something we can get around? Tried to hack into the code without much success. On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I don't think current Spark Streaming support this feature, all the DStream lineage is fixed after the context is started. Also stopping a stream is not supported, instead currently we need to stop the whole streaming context to meet what you want. Thanks Saisai -Original Message- From: jamborta [mailto:jambo...@gmail.com] Sent: Wednesday, January 21, 2015 3:09 AM To: user@spark.apache.org Subject: dynamically change receiver for a spark stream Hi all, we have been trying to setup a stream using a custom receiver that would pick up data from sql databases. we'd like to keep that stream context running and dynamically change the streams on demand, adding and removing streams based on demand. alternativel, if a stream is fixed, is it possible to stop a stream, change to config and start again? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/dynamically-change-receiver-for-a-spark-stream-tp21268.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: save spark streaming output to single file on hdfs
Thanks. The problem is that we'd like it to be picked up by hive. On Tue Jan 13 2015 at 18:15:15 Davies Liu dav...@databricks.com wrote: On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote: Hi all, Is there a way to save dstream RDDs to a single file so that another process can pick it up as a single RDD? It does not need to a single file, Spark can pick any directory as a single RDD. Also, it's easy to union multiple RDDs into single one. It seems that each slice is saved to a separate folder, using saveAsTextFiles method. I'm using spark 1.2 with pyspark thanks, -- View this message in context: http://apache-spark-user-list. 1001560.n3.nabble.com/save-spark-streaming-output-to-single- file-on-hdfs-tp21124.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
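A hedged sketch of one way to bridge this to Hive: write each batch through foreachRDD, coalesce to a single part file, and drop it into a partition-style directory that Hive can be pointed at. Paths and the table name are made up, and rdd.isEmpty() needs Spark 1.3+.

def save_batch(time, rdd):
    if not rdd.isEmpty():   # on Spark 1.2, check rdd.take(1) instead
        out = "hdfs:///warehouse/events/dt=%s" % time.strftime("%Y%m%d%H%M%S")
        rdd.coalesce(1).saveAsTextFile(out)   # one part file per batch
        # then: ALTER TABLE events ADD PARTITION (dt='...') LOCATION '<out>'

dstream.foreachRDD(save_batch)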
Re: No module named pyspark - latest built
Thanks. Will it work with sbt at some point? On Thu, 13 Nov 2014 01:03 Xiangrui Meng men...@gmail.com wrote: You need to use maven to include python files. See https://github.com/apache/spark/pull/1223 . -Xiangrui On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote: I have figured out that building the fat jar with sbt does not seem to included the pyspark scripts using the following command: sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean publish-local assembly however the maven command works OK: mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests clean package am I running the correct sbt command? -- View this message in context: http://apache-spark-user-list. 1001560.n3.nabble.com/No-module-named-pyspark-latest- built-tp18740p18787.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: why decision trees do binary split?
Thanks for the reply, Sean. I can see that splitting on all the categories would probably overfit the tree, on the other hand, it might give more insight on the subcategories (probably only would work if the data is uniformly distributed between the categories). I haven't really found any comparison between the two methods in terms of performance and interpretability. thanks, On Thu, Nov 6, 2014 at 9:46 AM, Sean Owen so...@cloudera.com wrote: I haven't seen that done before, which may be most of the reason - I am not sure that is common practice. I can see upsides - you need not pick candidate splits to test since there is only one N-way rule possible. The binary split equivalent is N levels instead of 1. The big problem is that you are always segregating the data set entirely, and making the equivalent of those N binary rules, even when you would not otherwise bother because they don't add information about the target. The subsets matching each child are therefore unnecessarily small and this makes learning on each independent subset weaker. On Nov 6, 2014 9:36 AM, jamborta jambo...@gmail.com wrote: I meant above, that in the case of categorical variables it might be more efficient to create a node on each categorical value. Is there a reason why spark went down the binary route? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188p18265.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: pass unique ID to mllib algorithms pyspark
Hi Xiangrui, Thanks for the reply. is this still due to be released in 1.2 (SPARK-3530 is still open)? Thanks, On Wed, Nov 5, 2014 at 3:21 AM, Xiangrui Meng men...@gmail.com wrote: The proposed new set of APIs (SPARK-3573, SPARK-3530) will address this issue. We carry over extra columns with training and prediction and then leverage on Spark SQL's execution plan optimization to decide which columns are really needed. For the current set of APIs, we can add `predictOnValues` to models, which carries over the input keys. StreamingKMeans and StreamingLinearRegression implement this method. -Xiangrui On Tue, Nov 4, 2014 at 2:30 AM, jamborta jambo...@gmail.com wrote: Hi all, There are a few algorithms in pyspark where the prediction part is implemented in scala (e.g. ALS, decision trees) where it is not very easy to manipulate the prediction methods. I think it is a very common scenario that the user would like to generate prediction for a datasets, so that each predicted value is identifiable (e.g. have a unique id attached to it). this is not possible in the current implementation as predict functions take a feature vector and return the predicted values where, I believe, the order is not guaranteed, so there is no way to join it back with the original data the predictions are generated from. Is there a way around this at the moment? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pass-unique-ID-to-mllib-algorithms-pyspark-tp18051.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
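Until those APIs land, the usual workaround (the same pattern the MLlib docs use for evaluation) is sketched below: call predict() on the features RDD and zip the result back against the ids, relying on predict() being a plain map that keeps partition sizes aligned. The id and features field names are illustrative.

features = data.map(lambda p: p.features)
ids = data.map(lambda p: p.id)
# predict() maps partition-for-partition over `features`, so zip stays aligned
id_and_prediction = ids.zip(model.predict(features))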
Re: partition size for initial read
That would work - I normally use hive queries through spark sql, and I have not seen anything like that there. On Thu, Oct 2, 2014 at 3:13 PM, Ashish Jain ashish@gmail.com wrote: If you are using textFiles() to read data in, it also takes in a parameter the number of minimum partitions to create. Would that not work for you? On Oct 2, 2014 7:00 AM, jamborta jambo...@gmail.com wrote: Hi all, I have been testing repartitioning to ensure that my algorithms get a similar amount of data. Noticed that repartitioning is very expensive. Is there a way to force Spark to create a certain number of partitions when the data is read in? How does it decide on the partition size initially? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/partition-size-for-initial-read-tp15603.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
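For the raw-RDD path, the parameter Ashish mentions looks like this (the path is a placeholder):

rdd = sc.textFile("hdfs:///data/input", minPartitions=64)
print(rdd.getNumPartitions())   # at least 64, depending on how the input splits

For Hive tables read through Spark SQL there is no direct equivalent; the read partitioning follows the underlying input splits, so the split-size settings (or a repartition afterwards) are the levers there.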
Re: spark.driver.memory is not set (pyspark, 1.1.0)
thanks Marcelo. What's the reason it is not possible in cluster mode, either? On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com wrote: You can't set up the driver memory programatically in client mode. In that mode, the same JVM is running the driver, so you can't modify command line options anymore when initializing the SparkContext. (And you can't really start cluster mode apps that way, so the only way to set this is through the command line / config files.) On Wed, Oct 1, 2014 at 9:26 AM, jamborta jambo...@gmail.com wrote: Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster(yarn-client) .setAppName(test) .set(spark.driver.memory, 1G) .set(spark.executor.memory, 1G) .set(spark.executor.instances, 2) .set(spark.executor.cores, 4)) sc = SparkContext(conf=conf) whereas if I run the spark console: ./bin/pyspark --driver-memory 1G it sets it correctly. Seemingly they both generate the same commands in the logs. thanks a lot, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.driver.memory is not set (pyspark, 1.1.0)
when you say respective backend code to launch it, I thought this is the way to do that. thanks, Tamas On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin van...@cloudera.com wrote: Because that's not how you launch apps in cluster mode; you have to do it through the command line, or by calling directly the respective backend code to launch it. (That being said, it would be nice to have a programmatic way of launching apps that handled all this - this has been brought up in a few different contexts, but I don't think there's an official solution yet.) On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor jambo...@gmail.com wrote: thanks Marcelo. What's the reason it is not possible in cluster mode, either? On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com wrote: You can't set up the driver memory programatically in client mode. In that mode, the same JVM is running the driver, so you can't modify command line options anymore when initializing the SparkContext. (And you can't really start cluster mode apps that way, so the only way to set this is through the command line / config files.) On Wed, Oct 1, 2014 at 9:26 AM, jamborta jambo...@gmail.com wrote: Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster(yarn-client) .setAppName(test) .set(spark.driver.memory, 1G) .set(spark.executor.memory, 1G) .set(spark.executor.instances, 2) .set(spark.executor.cores, 4)) sc = SparkContext(conf=conf) whereas if I run the spark console: ./bin/pyspark --driver-memory 1G it sets it correctly. Seemingly they both generate the same commands in the logs. thanks a lot, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark.driver.memory is not set (pyspark, 1.1.0)
Thank you for the replies. It makes sense for scala/java, but in python the JVM is launched when the spark context is initialised, so it should be able to set it, I assume. On Wed, Oct 1, 2014 at 6:24 PM, Andrew Or and...@databricks.com wrote: Hi Tamas, Yes, Marcelo is right. The reason why it doesn't make sense to set spark.driver.memory in your SparkConf is because your application code, by definition, is the driver. This means by the time you get to the code that initializes your SparkConf, your driver JVM has already started with some heap size, and you can't easily change the size of the JVM once it has started. Note that this is true regardless of the deploy mode (client or cluster). Alternatives to set this include the following: (1) You can set spark.driver.memory in your `spark-defaults.conf` on the node that submits the application, (2) You can use the --driver-memory command line option if you are using Spark submit (bin/pyspark goes through this path, as you have discovered on your own). Does that make sense? 2014-10-01 10:17 GMT-07:00 Tamas Jambor jambo...@gmail.com: when you say respective backend code to launch it, I thought this is the way to do that. thanks, Tamas On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin van...@cloudera.com wrote: Because that's not how you launch apps in cluster mode; you have to do it through the command line, or by calling directly the respective backend code to launch it. (That being said, it would be nice to have a programmatic way of launching apps that handled all this - this has been brought up in a few different contexts, but I don't think there's an official solution yet.) On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor jambo...@gmail.com wrote: thanks Marcelo. What's the reason it is not possible in cluster mode, either? On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com wrote: You can't set up the driver memory programatically in client mode. In that mode, the same JVM is running the driver, so you can't modify command line options anymore when initializing the SparkContext. (And you can't really start cluster mode apps that way, so the only way to set this is through the command line / config files.) On Wed, Oct 1, 2014 at 9:26 AM, jamborta jambo...@gmail.com wrote: Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster(yarn-client) .setAppName(test) .set(spark.driver.memory, 1G) .set(spark.executor.memory, 1G) .set(spark.executor.instances, 2) .set(spark.executor.cores, 4)) sc = SparkContext(conf=conf) whereas if I run the spark console: ./bin/pyspark --driver-memory 1G it sets it correctly. Seemingly they both generate the same commands in the logs. thanks a lot, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Marcelo -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
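For the pure-Python case discussed here, one workaround that is sometimes used, sketched below and version-sensitive (the exact PYSPARK_SUBMIT_ARGS format differs across Spark releases), is to put the option into the environment that the Python-side gateway reads when it launches the driver JVM, before the SparkContext is created.

import os

# has to happen before the first SparkContext is created in this process
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 2g pyspark-shell"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn-client").setAppName("test")
sc = SparkContext(conf=conf)   # the gateway JVM now starts with a 2 GB driver heap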
Re: yarn does not accept job in cluster mode
thanks for the reply. As I mentioned above, all works in yarn-client mode, the problem starts when I try to run it in yarn-cluster mode. (seems that spark-shell does not work in yarn-cluster mode, so cannot debug that way). On Mon, Sep 29, 2014 at 7:30 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try running the spark-shell in yarn-cluster mode? ./bin/spark-shell --master yarn-client Read more over here http://spark.apache.org/docs/1.0.0/running-on-yarn.html Thanks Best Regards On Sun, Sep 28, 2014 at 7:08 AM, jamborta jambo...@gmail.com wrote: hi all, I have a job that works ok in yarn-client mode,but when I try in yarn-cluster mode it returns the following: WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory the cluster has plenty of memory and resources. I am running this from python using this context: conf = (SparkConf() .setMaster(yarn-cluster) .setAppName(spark_tornado_server) .set(spark.executor.memory, 1024m) .set(spark.cores.max, 16) .set(spark.driver.memory, 1024m) .set(spark.executor.instances, 2) .set(spark.executor.cores, 8) .set(spark.eventLog.enabled, False) HADOOP_HOME and HADOOP_CONF_DIR are also set in spark-env. thanks, not sure if I am missing some config -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/yarn-does-not-accept-job-in-cluster-mode-tp15281.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Yarn number of containers
Thank you. Where is the number of containers set? On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin van...@cloudera.com wrote: On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote: I am running spark with the default settings in yarn client mode. For some reason yarn always allocates three containers to the application (wondering where it is set?), and only uses two of them. The default number of executors in Yarn mode is 2; so you have 2 executors + the application master, so 3 containers. Also the cpus on the cluster never go over 50%, I turned off the fair scheduler and set high spark.cores.max. Is there some additional settings I am missing? You probably need to request more cores (--executor-cores). Don't remember if that is respected in Yarn, but should be. -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
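For reference, in yarn-client mode the container count follows directly from the executor count, plus one container for the application master, so the knobs are the executor instance and core settings. A hedged sketch using the property names from Spark 1.x on YARN:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("container-count-test")
        .set("spark.executor.instances", "4")   # 4 executors -> 5 containers incl. the AM
        .set("spark.executor.cores", "8"))
sc = SparkContext(conf=conf)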
Re: access javaobject in rdd map
Hi Davies, Thanks for the reply. I saw that you guys do it that way in the code. Is there no other way? I have implemented all the predict functions in scala, so I would prefer not to reimplement the whole thing in python. thanks, On Tue, Sep 23, 2014 at 5:40 PM, Davies Liu dav...@databricks.com wrote: You should create a pure Python object (copy the attributes from Java object), then it could be used in map. Davies On Tue, Sep 23, 2014 at 8:48 AM, jamborta jambo...@gmail.com wrote: Hi all, I have a java object that contains a ML model which I would like to use for prediction (in python). I just want to iterate the data through a mapper and predict for each value. Unfortunately, this fails when it tries to serialise the object to send it to the nodes. Is there a trick around this? Surely, this object could be picked up by reference at the nodes. many thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/access-javaobject-in-rdd-map-tp14898.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
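A hedged illustration of what Davies suggests, assuming the Java-backed model exposes its parameters (a linear model is used purely as an example): copy the parameters into plain Python values, broadcast those, and do the prediction arithmetic in the mapper, so the Java object never has to be shipped to the workers.

# pull picklable parameters out of the Java-backed model (names assume a linear model)
weights = [float(w) for w in model.weights]
intercept = float(model.intercept)
bc_w = sc.broadcast(weights)
bc_b = sc.broadcast(intercept)

def predict(features):
    margin = sum(w * x for w, x in zip(bc_w.value, features)) + bc_b.value
    return 1 if margin > 0 else 0   # e.g. a logistic/SVM-style decision rule

predictions = data.map(lambda row: (row[0], predict(row[1])))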