Re: example LDA code ClassCastException

2016-11-04 Thread Tamas Jambor
thanks for the reply.

Asher, have you experienced problems when checkpoints are not enabled as
well? If we have a large number of iterations (over 150) and checkpoints are
not enabled, the process just hangs (without any error) at around iteration
120-140 (on Spark 2.0.0). I could not reproduce this outside of our data,
unfortunately.

On Fri, Nov 4, 2016 at 2:53 AM, Asher Krim  wrote:

> There is an open Jira for this issue (https://issues.apache.org/
> jira/browse/SPARK-14804). There have been a few proposed fixes so far.
>
> On Thu, Nov 3, 2016 at 2:20 PM, jamborta  wrote:
>
>> Hi there,
>>
>> I am trying to run the example LDA code
>> (http://spark.apache.org/docs/latest/mllib-clustering.html#l
>> atent-dirichlet-allocation-lda)
>> on Spark 2.0.0/EMR 5.0.0
>>
>> If I run it with checkpoints enabled (sc.setCheckpointDir("s3n://s3-path/")),
>>
>> ldaModel = LDA.train(corpus, k=3, maxIterations=200,
>> checkpointInterval=10)
>>
>> I get the following error (sorry, quite long):
>>
>> Py4JJavaErrorTraceback (most recent call last)
>>  in ()
>> > 1 ldaModel = LDA.train(corpus, k=3, maxIterations=200,
>> checkpointInterval=10)
>>
>> /usr/lib/spark/python/pyspark/mllib/clustering.py in train(cls, rdd, k,
>> maxIterations, docConcentration, topicConcentration, seed,
>> checkpointInterval, optimizer)
>>1037 model = callMLlibFunc("trainLDAModel", rdd, k,
>> maxIterations,
>>1038   docConcentration,
>> topicConcentration,
>> seed,
>> -> 1039   checkpointInterval, optimizer)
>>1040 return LDAModel(model)
>>1041
>>
>> /usr/lib/spark/python/pyspark/mllib/common.py in callMLlibFunc(name,
>> *args)
>> 128 sc = SparkContext.getOrCreate()
>> 129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
>> --> 130 return callJavaFunc(sc, api, *args)
>> 131
>> 132
>>
>> /usr/lib/spark/python/pyspark/mllib/common.py in callJavaFunc(sc, func,
>> *args)
>> 121 """ Call Java Function """
>> 122 args = [_py2java(sc, a) for a in args]
>> --> 123 return _java2py(sc, func(*args))
>> 124
>> 125
>>
>> /usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in
>> __call__(self, *args)
>> 931 answer = self.gateway_client.send_command(command)
>> 932 return_value = get_return_value(
>> --> 933 answer, self.gateway_client, self.target_id,
>> self.name)
>> 934
>> 935 for temp_arg in temp_args:
>>
>> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>>  61 def deco(*a, **kw):
>>  62 try:
>> ---> 63 return f(*a, **kw)
>>  64 except py4j.protocol.Py4JJavaError as e:
>>  65 s = e.java_exception.toString()
>>
>> /usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in
>> get_return_value(answer, gateway_client, target_id, name)
>> 310 raise Py4JJavaError(
>> 311 "An error occurred while calling
>> {0}{1}{2}.\n".
>> --> 312 format(target_id, ".", name), value)
>> 313 else:
>> 314 raise Py4JError(
>>
>> Py4JJavaError: An error occurred while calling o115.trainLDAModel.
>> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 1
>> in stage 458.0 failed 4 times, most recent failure: Lost task 1.3 in stage
>> 458.0 (TID 14827, ip-10-197-192-2.eu-west-1.compute.internal):
>> java.lang.ClassCastException: scala.Tuple2 cannot be cast to
>> org.apache.spark.graphx.Edge
>> at
>> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.
>> apply(EdgeRDD.scala:107)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>> at
>> org.apache.spark.InterruptibleIterator.foreach(Interruptible
>> Iterator.scala:28)
>> at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.
>> scala:107)
>> at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.
>> scala:105)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$
>> anonfun$apply$25.apply(RDD.scala:801)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$
>> anonfun$apply$25.apply(RDD.scala:801)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 319)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsR
>> DD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:
>> 319)
>> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>> at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>> at
>> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator
>> $1.apply(BlockManager.scala:919)
>> at
>> 

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
ah, that explains it, many thanks!

On Sat, May 16, 2015 at 7:41 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 oh...metastore_db location is not controlled by
 hive.metastore.warehouse.dir -- one is the location of your metastore DB,
 the other is the physical location of your stored data. Checkout this SO
 thread:
 http://stackoverflow.com/questions/13624893/metastore-db-created-wherever-i-run-hive


 On Sat, May 16, 2015 at 9:07 AM, Tamas Jambor jambo...@gmail.com wrote:

 Gave it another try - it seems that it picks up the variable and prints
 out the correct value, but still puts the metastore_db folder in the current
 directory, regardless.

 On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor jambo...@gmail.com wrote:

 Thank you for the reply.

 I have tried your experiment, it seems that it does not print the
 settings out in spark-shell (I'm using 1.3 by the way).

 Strangely I have been experimenting with an SQL connection instead,
 which works after all (still if I go to spark-shell and try to print out
 the SQL settings that I put in hive-site.xml, it does not print them).


 On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 My point was more to how to verify that properties are picked up from
 the hive-site.xml file. You don't really need hive.metastore.uris if
 you're not running against an external metastore.  I just did an
 experiment with warehouse.dir.

 My hive-site.xml looks like this:

 <configuration>
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>/home/ykadiysk/Github/warehouse_dir</value>
     <description>location of default database for the warehouse</description>
   </property>
 </configuration>

 and spark-shell code:

 scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3036c16f

 scala> hc.sql("show tables").collect
 15/05/15 14:12:57 INFO HiveMetaStore: 0: Opening raw store with 
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 14:12:57 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 14:12:57 INFO Persistence: Property datanucleus.cache.level2 
 unknown - will be ignored
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:13:03 INFO ObjectStore: Setting MetaStore object pin classes 
 with 
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 14:13:03 INFO ObjectStore: Initialized ObjectStore
 15/05/15 14:13:04 WARN ObjectStore: Version information not found in 
 metastore. hive.metastore.schema.verification is not enabled so recording 
 the schema version 0.12.0-protobuf-2.5
 15/05/15 14:13:05 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO audit: ugi=ykadiysk  ip=unknown-ip-addr  
 cmd=get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as 
 embedded-only so does not have its own datastore table.
 res0: Array[org.apache.spark.sql.Row] = Array()

 scala> hc.getConf("hive.metastore.warehouse.dir")
 res1: String = /home/ykadiysk/Github/warehouse_dir

 I have not tried an HDFS path but you should be at least able to verify
 that the variable is being read. It might be that your value is read but is
 otherwise not liked...

 On Fri, May 15, 2015 at 2:03 PM, Tamas Jambor jambo...@gmail.com
 wrote:

 thanks for the reply. I am trying to use it without hive setup
 (spark-standalone), so it prints something like this:

 hive_ctx.sql("show tables").collect()
 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2
 unknown - will be ignored
 15/05/15 17:59:04 INFO Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block
 manager :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819)

 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin
 classes with
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed,
 assuming we

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
Gave it another try - it seems that it picks up the variable and prints out
the correct value, but still puts the metastore_db folder in the current
directory, regardless.

On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor jambo...@gmail.com wrote:

 Thank you for the reply.

 I have tried your experiment, it seems that it does not print the settings
 out in spark-shell (I'm using 1.3 by the way).

 Strangely I have been experimenting with an SQL connection instead, which
 works after all (still if I go to spark-shell and try to print out the SQL
 settings that I put in hive-site.xml, it does not print them).


 On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com
 wrote:

 My point was more to how to verify that properties are picked up from
 the hive-site.xml file. You don't really need hive.metastore.uris if
 you're not running against an external metastore.  I just did an
 experiment with warehouse.dir.

 My hive-site.xml looks like this:

 <configuration>
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>/home/ykadiysk/Github/warehouse_dir</value>
     <description>location of default database for the warehouse</description>
   </property>
 </configuration>

 and spark-shell code:

 scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3036c16f

 scala> hc.sql("show tables").collect
 15/05/15 14:12:57 INFO HiveMetaStore: 0: Opening raw store with 
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 14:12:57 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 14:12:57 INFO Persistence: Property datanucleus.cache.level2 
 unknown - will be ignored
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:13:03 INFO ObjectStore: Setting MetaStore object pin classes 
 with 
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 14:13:03 INFO ObjectStore: Initialized ObjectStore
 15/05/15 14:13:04 WARN ObjectStore: Version information not found in 
 metastore. hive.metastore.schema.verification is not enabled so recording 
 the schema version 0.12.0-protobuf-2.5
 15/05/15 14:13:05 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO audit: ugi=ykadiysk  ip=unknown-ip-addr  
 cmd=get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only 
 so does not have its own datastore table.
 res0: Array[org.apache.spark.sql.Row] = Array()

 scala> hc.getConf("hive.metastore.warehouse.dir")
 res1: String = /home/ykadiysk/Github/warehouse_dir

 I have not tried an HDFS path but you should be at least able to verify
 that the variable is being read. It might be that your value is read but is
 otherwise not liked...

 On Fri, May 15, 2015 at 2:03 PM, Tamas Jambor jambo...@gmail.com wrote:

 thanks for the reply. I am trying to use it without hive setup
 (spark-standalone), so it prints something like this:

 hive_ctx.sql("show tables").collect()
 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2
 unknown - will be ignored
 15/05/15 17:59:04 INFO Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block
 manager :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819)

 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin classes
 with
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming
 we are not on mysql: Lexical error at line 1, column 5.  Encountered: @
 (64), after : .
 15/05/15 17:59:20 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/05/15 17:59:20 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/05/15 17:59:28 INFO Datastore: The class

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
Thank you for the reply.

I have tried your experiment; it seems that it does not print the settings
out in spark-shell (I'm using 1.3, by the way).

Strangely, I have been experimenting with a SQL connection instead, which
works after all (still, if I go to spark-shell and try to print out the SQL
settings that I put in hive-site.xml, it does not print them).


On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 My point was more to how to verify that properties are picked up from
 the hive-site.xml file. You don't really need hive.metastore.uris if
 you're not running against an external metastore.  I just did an
 experiment with warehouse.dir.

 My hive-site.xml looks like this:

 <configuration>
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>/home/ykadiysk/Github/warehouse_dir</value>
     <description>location of default database for the warehouse</description>
   </property>
 </configuration>

 and spark-shell code:

 scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@3036c16f

 scala> hc.sql("show tables").collect
 15/05/15 14:12:57 INFO HiveMetaStore: 0: Opening raw store with implemenation 
 class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 14:12:57 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 14:12:57 INFO Persistence: Property datanucleus.cache.level2 unknown 
 - will be ignored
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:12:58 WARN Connection: BoneCP specified but not present in 
 CLASSPATH (or one of dependencies)
 15/05/15 14:13:03 INFO ObjectStore: Setting MetaStore object pin classes with 
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 14:13:03 INFO ObjectStore: Initialized ObjectStore
 15/05/15 14:13:04 WARN ObjectStore: Version information not found in 
 metastore. hive.metastore.schema.verification is not enabled so recording the 
 schema version 0.12.0-protobuf-2.5
 15/05/15 14:13:05 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO audit: ugi=ykadiysk  ip=unknown-ip-addr  
 cmd=get_tables: db=default pat=.*
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as 
 embedded-only so does not have its own datastore table.
 15/05/15 14:13:05 INFO Datastore: The class 
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only 
 so does not have its own datastore table.
 res0: Array[org.apache.spark.sql.Row] = Array()

 scala> hc.getConf("hive.metastore.warehouse.dir")
 res1: String = /home/ykadiysk/Github/warehouse_dir

 I have not tried an HDFS path but you should be at least able to verify
 that the variable is being read. It might be that your value is read but is
 otherwise not liked...

 On Fri, May 15, 2015 at 2:03 PM, Tamas Jambor jambo...@gmail.com wrote:

 thanks for the reply. I am trying to use it without hive setup
 (spark-standalone), so it prints something like this:

 hive_ctx.sql("show tables").collect()
 15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with
 implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called
 15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2
 unknown - will be ignored
 15/05/15 17:59:04 INFO Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in
 CLASSPATH (or one of dependencies)
 15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block manager
 :42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819)

 15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin classes
 with
 hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
 15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming
 we are not on mysql: Lexical error at line 1, column 5.  Encountered: @
 (64), after : .
 15/05/15 17:59:20 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/05/15 17:59:20 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15/05/15 17:59:28 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
 embedded-only so does not have its own datastore table.
 15/05/15 17:59:29 INFO Datastore: The class
 org.apache.hadoop.hive.metastore.model.MOrder is tagged as
 embedded-only so does not have its own datastore table.
 15

Re: store hive metastore on persistent store

2015-05-15 Thread Tamas Jambor
thanks for the reply. I am trying to use it without a Hive setup
(Spark standalone), so it prints something like this:

hive_ctx.sql("show tables").collect()
15/05/15 17:59:03 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/05/15 17:59:03 INFO ObjectStore: ObjectStore, initialize called
15/05/15 17:59:04 INFO Persistence: Property datanucleus.cache.level2
unknown - will be ignored
15/05/15 17:59:04 INFO Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/05/15 17:59:04 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
15/05/15 17:59:05 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
15/05/15 17:59:08 INFO BlockManagerMasterActor: Registering block manager
:42819 with 3.0 GB RAM, BlockManagerId(2, xxx, 42819)

15/05/15 17:59:18 INFO ObjectStore: Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/05/15 17:59:18 INFO MetaStoreDirectSql: MySQL check failed, assuming we
are not on mysql: Lexical error at line 1, column 5.  Encountered: @
(64), after : .
15/05/15 17:59:20 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
embedded-only so does not have its own datastore table.
15/05/15 17:59:20 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MOrder is tagged as
embedded-only so does not have its own datastore table.
15/05/15 17:59:28 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
embedded-only so does not have its own datastore table.
15/05/15 17:59:29 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MOrder is tagged as
embedded-only so does not have its own datastore table.
15/05/15 17:59:31 INFO ObjectStore: Initialized ObjectStore
15/05/15 17:59:32 WARN ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording
the schema version 0.13.1aa
15/05/15 17:59:33 WARN MetricsConfig: Cannot locate configuration: tried
hadoop-metrics2-azure-file-system.properties,hadoop-metrics2.properties
15/05/15 17:59:33 INFO MetricsSystemImpl: Scheduled snapshot period at 10
second(s).
15/05/15 17:59:33 INFO MetricsSystemImpl: azure-file-system metrics system
started
15/05/15 17:59:33 INFO HiveMetaStore: Added admin role in metastore
15/05/15 17:59:34 INFO HiveMetaStore: Added public role in metastore
15/05/15 17:59:34 INFO HiveMetaStore: No user is added in admin role, since
config is empty
15/05/15 17:59:35 INFO SessionState: No Tez session required at this point.
hive.execution.engine=mr.
15/05/15 17:59:37 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
15/05/15 17:59:37 INFO audit: ugi=testuser ip=unknown-ip-addr
 cmd=get_tables: db=default pat=.*

not sure what to put in hive.metastore.uris in this case?


On Fri, May 15, 2015 at 2:52 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 This should work. Which version of Spark are you using? Here is what I do
 -- make sure hive-site.xml is in the conf directory of the machine you're
 using the driver from. Now let's run spark-shell from that machine:

 scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@6e9f8f26

 scala> hc.sql("show tables").collect
 15/05/15 09:34:17 INFO metastore: Trying to connect to metastore with URI thrift://hostname.com:9083  -- here should be a value from your hive-site.xml
 15/05/15 09:34:17 INFO metastore: Waiting 1 seconds before next connection attempt.
 15/05/15 09:34:18 INFO metastore: Connected to metastore.
 res0: Array[org.apache.spark.sql.Row] = Array([table1,false],

 scala> hc.getConf("hive.metastore.uris")
 res13: String = thrift://hostname.com:9083

 scala> hc.getConf("hive.metastore.warehouse.dir")
 res14: String = /user/hive/warehouse

 The first line tells you which metastore it's trying to connect to -- this
 should be the string specified under hive.metastore.uris property in your
 hive-site.xml file. I have not mucked with warehouse.dir too much but I
 know that the value of the metastore URI is in fact picked up from there as
 I regularly point to different systems...


 On Thu, May 14, 2015 at 6:26 PM, Tamas Jambor jambo...@gmail.com wrote:

  I have tried to put the hive-site.xml file in the conf/ directory, but it
  seems it is not picking it up from there.


 On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust mich...@databricks.com
  wrote:

  You can configure Spark SQL's Hive interaction by placing a hive-site.xml
 file in the conf/ directory.

 On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote:

 Hi all,

  is it possible to set hive.metastore.warehouse.dir, that is internally
  created by spark, to be stored externally (e.g. s3 on aws or wasb on azure)?

Re: store hive metastore on persistent store

2015-05-14 Thread Tamas Jambor
I have tried to put the hive-site.xml file in the conf/ directory, but it
seems it is not picking it up from there.


On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust mich...@databricks.com
wrote:

 You can configure Spark SQL's Hive interaction by placing a hive-site.xml
 file in the conf/ directory.

 On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote:

 Hi all,

 is it possible to set hive.metastore.warehouse.dir, that is internally
 created by spark, to be stored externally (e.g. s3 on aws or wasb on
 azure)?

 thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/store-hive-metastore-on-persistent-store-tp22891.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: writing to hdfs on master node much faster

2015-04-20 Thread Tamas Jambor
Not sure what would slow it down, as the repartition completes equally fast
on all nodes, implying that the data is available on all of them; after that
there are a few computation steps, none of them local to the master.

On Mon, Apr 20, 2015 at 12:57 PM, Sean Owen so...@cloudera.com wrote:

 What machines are HDFS data nodes -- just your master? that would
 explain it. Otherwise, is it actually the write that's slow or is
 something else you're doing much faster on the master for other
 reasons maybe? like you're actually shipping data via the master first
 in some local computation? so the master's executor has the result
 much faster?

 On Mon, Apr 20, 2015 at 12:21 PM, jamborta jambo...@gmail.com wrote:
  Hi all,
 
  I have a three node cluster with identical hardware. I am trying a
 workflow
  where it reads data from hdfs, repartitions it and runs a few map
 operations
  then writes the results back to hdfs.
 
  It looks like all the computation, including the repartitioning and the
  maps, completes within similar time intervals on all the nodes, except when
  it writes the results back to HDFS, where the master node does the job much
  faster than the slaves (15s for each block as opposed to 1.2 min for the
  slaves).
 
  Any suggestion what the reason might be?
 
  thanks,
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/writing-to-hdfs-on-master-node-much-faster-tp22570.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: Spark streaming

2015-03-27 Thread Tamas Jambor
It is just a comma-separated file, about 10 columns wide, which we append
with a unique id and a few additional values.
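
For reference, a minimal PySpark sketch of the two approaches discussed further
down in the quoted thread (parsing each line into a list as it arrives vs.
keeping the raw string and parsing later). The paths, batch interval and the
split-on-comma parsing are assumptions for illustration, not our actual code.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="csv-stream-sketch")
ssc = StreamingContext(sc, 10)  # 10-second batches (assumption)

lines = ssc.textFileStream("hdfs:///incoming/csv")  # hypothetical input path

# Approach 1: parse eagerly into Python lists as the data arrives
# (the variant that ran into out-of-memory errors on larger files)
parsed = lines.map(lambda line: line.split(","))
parsed.count().pprint()

# Approach 2: keep each record as a raw string, store it, and parse downstream
raw = lines.map(lambda line: ("record", line))       # id generation omitted
raw.saveAsTextFiles("hdfs:///staging/raw")           # hypothetical output prefix

ssc.start()
ssc.awaitTermination()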

On Fri, Mar 27, 2015 at 2:43 PM, Ted Yu yuzhih...@gmail.com wrote:

 jamborta :
 Please also describe the format of your csv files.

 Cheers

 On Fri, Mar 27, 2015 at 6:42 AM, DW @ Gmail deanwamp...@gmail.com wrote:

 Show us the code. This shouldn't happen for the simple process you
 described

 Sent from my rotary phone.


  On Mar 27, 2015, at 5:47 AM, jamborta jambo...@gmail.com wrote:
 
  Hi all,
 
  We have a workflow that pulls in data from csv files. The original setup of
  the workflow was to parse the data as it comes in (turn it into an array),
  then store it. This resulted in out-of-memory errors with larger files (as a
  result of increased GC?).
 
  It turns out that if the data gets stored as a string first and parsed later,
  the issue does not occur.
 
  Why is that?
 
  Thanks,
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tp22255.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Tamas Jambor
thanks for the reply.

Actually, our main problem is not really about the sparkcontext; the problem is
that spark does not allow creating streaming contexts dynamically, and once
a stream is shut down, a new one cannot be created in the same
sparkcontext. So we cannot create a service that would create and manage
multiple streams - the same way that is possible with batch jobs.

On Mon, Mar 2, 2015 at 2:52 PM, Sean Owen so...@cloudera.com wrote:

 I think everything there is to know about it is on JIRA; I don't think
 that's being worked on.

 On Mon, Mar 2, 2015 at 2:50 PM, Tamas Jambor jambo...@gmail.com wrote:
  I have seen there is a card (SPARK-2243) to enable that. Is that still
 going
  ahead?
 
  On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote:
 
  It is still not something you're supposed to do; in fact there is a
  setting (disabled by default) that throws an exception if you try to
  make multiple contexts.
 
  On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote:
   hi all,
  
   what is the current status and direction on enabling multiple
   sparkcontexts
   and streamingcontext? I have seen a few issues open on JIRA, which
 seem
   to
   be there for quite a while.
  
   thanks,
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/multiple-sparkcontexts-and-streamingcontexts-tp21876.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
 



Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Tamas Jambor
I have seen there is a card (SPARK-2243) to enable that. Is that still
going ahead?

On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote:

 It is still not something you're supposed to do; in fact there is a
 setting (disabled by default) that throws an exception if you try to
 make multiple contexts.

 On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote:
  hi all,
 
  what is the current status and direction on enabling multiple
 sparkcontexts
  and streamingcontext? I have seen a few issues open on JIRA, which seem
 to
  be there for quite a while.
 
  thanks,
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/multiple-sparkcontexts-and-streamingcontexts-tp21876.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: multiple sparkcontexts and streamingcontexts

2015-03-02 Thread Tamas Jambor
Sorry, I meant that once the stream is started, it's not possible to create new
streams in the existing streaming context, and it's not possible to create a
new streaming context if another one is already running.
So the only feasible option seemed to be to create a new sparkcontext for each
stream (we tried using spark-jobserver for that).
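
For context, a minimal PySpark sketch of the sequence Sean suggests in the reply
quoted below (stopping a StreamingContext without stopping the SparkContext, then
creating a new StreamingContext on the same sc). This is the pattern that did not
work reliably for us, so treat it as an illustration of the suggestion rather than
a confirmed fix; the socket sources and batch interval are made up.

import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="restartable-streams")

# first stream
ssc1 = StreamingContext(sc, 5)
ssc1.socketTextStream("localhost", 9999).pprint()
ssc1.start()
time.sleep(60)                       # let the first stream run for a while
ssc1.stop(stopSparkContext=False)    # keep the SparkContext alive

# second stream on the same SparkContext
ssc2 = StreamingContext(sc, 5)
ssc2.socketTextStream("localhost", 9998).pprint()
ssc2.start()
ssc2.awaitTermination()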

On Mon, Mar 2, 2015 at 3:07 PM, Sean Owen so...@cloudera.com wrote:

 You can make a new StreamingContext on an existing SparkContext, I believe?

 On Mon, Mar 2, 2015 at 3:01 PM, Tamas Jambor jambo...@gmail.com wrote:
  thanks for the reply.
 
  Actually, our main problem is not really about sparkcontext, the problem
 is
  that spark does not allow to create streaming context dynamically, and
 once
  a stream is shut down, a new one cannot be created in the same
 sparkcontext.
  So we cannot create a service that would create and manage multiple
 streams
  - the same way that is possible with batch jobs.
 
  On Mon, Mar 2, 2015 at 2:52 PM, Sean Owen so...@cloudera.com wrote:
 
  I think everything there is to know about it is on JIRA; I don't think
  that's being worked on.
 
  On Mon, Mar 2, 2015 at 2:50 PM, Tamas Jambor jambo...@gmail.com
 wrote:
   I have seen there is a card (SPARK-2243) to enable that. Is that still
   going
   ahead?
  
   On Mon, Mar 2, 2015 at 2:46 PM, Sean Owen so...@cloudera.com wrote:
  
   It is still not something you're supposed to do; in fact there is a
   setting (disabled by default) that throws an exception if you try to
   make multiple contexts.
  
   On Mon, Mar 2, 2015 at 2:43 PM, jamborta jambo...@gmail.com wrote:
hi all,
   
what is the current status and direction on enabling multiple
sparkcontexts
and streamingcontext? I have seen a few issues open on JIRA, which
seem
to
be there for quite a while.
   
thanks,
   
   
   
--
View this message in context:
   
   
 http://apache-spark-user-list.1001560.n3.nabble.com/multiple-sparkcontexts-and-streamingcontexts-tp21876.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.
   
   
 -
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
   
  
  
 
 



Re: Interact with streams in a non-blocking way

2015-02-13 Thread Tamas Jambor
Thanks for the reply. I am trying to set up a streaming-as-a-service
approach, using the framework that is used for spark-jobserver. For that I
would need to handle asynchronous operations that are initiated from
outside of the stream. Do you think it is not possible?
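
To make the model described in the quoted reply below concrete, here is a minimal
PySpark sketch (an illustration only, not our service code): the work that needs
to be reachable from outside is registered as a foreachRDD operation before
start(), and the main thread simply blocks in awaitTermination(). The shared dict
and the socket source are assumptions.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

latest = {}  # updated by the streaming operation, readable from other driver threads

def publish(time, rdd):
    # runs on the driver for every batch; push the result somewhere queryable
    latest["count"] = rdd.count()
    latest["time"] = str(time)

sc = SparkContext(appName="async-query-sketch")
ssc = StreamingContext(sc, 5)
ssc.socketTextStream("localhost", 9999).foreachRDD(publish)

ssc.start()
ssc.awaitTermination()  # blocks here; `publish` keeps updating `latest` per batch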

On Fri Feb 13 2015 at 10:14:18 Sean Owen so...@cloudera.com wrote:

 You call awaitTermination() in the main thread, and indeed it blocks
 there forever. From there Spark Streaming takes over, and is invoking
 the operations you set up. Your operations have access to the data of
 course. That's the model; you don't make external threads that reach
 in to Spark Streaming's objects, but can easily create operations that
 take whatever actions you want and invoke them in Streaming.

 On Fri, Feb 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote:
  Hi all,
 
  I am trying to come up with a workflow where I can query streams
  asynchronously. The problem I have is a ssc.awaitTermination() line
 blocks
  the whole thread, so it is not straightforward to me whether it is
 possible
  to get hold of objects from streams once they are started. any
 suggestion on
  what is the best way to implement this?
 
  thanks,
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Interact-with-streams-in-a-
 non-blocking-way-tp21640.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: one is the default value for intercepts in GeneralizedLinearAlgorithm

2015-02-06 Thread Tamas Jambor
Thanks for the reply. Seems it is all set to zero in the latest code - I
was checking 1.2 last night.

On Fri Feb 06 2015 at 07:21:35 Sean Owen so...@cloudera.com wrote:

 It looks like the initial intercept term is 1 only in the addIntercept &&
  numOfLinearPredictor == 1 case. It does seem inconsistent; since
 it's just an initial weight it may not matter to the final converged
 value. You can see a few notes in the class about how
 numOfLinearPredictor == 1 is handled a bit inconsistently and how a
 smarter choice of initial intercept could help convergence. So I don't
 know if this rises to the level of bug but I don't know that the
 difference is on purpose.

 On Thu, Feb 5, 2015 at 5:40 PM, jamborta jambo...@gmail.com wrote:
  hi all,
 
  I have been going through the GeneralizedLinearAlgorithm to understand
 how
  intercepts are handled in regression. Just noticed that the initial
 setting
  for the intercept is set to one (whereas the initial setting for the
 rest of
  the coefficients is set to zero) using the same piece of code that adds
 the
  1 in front of each line in the data. Is this a bug?
 
  thanks,
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/one-is-the-default-value-for-intercepts-in-
 GeneralizedLinearAlgorithm-tp21525.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: spark context not picking up default hadoop filesystem

2015-01-26 Thread Tamas Jambor
thanks for the reply. I have tried adding SPARK_CLASSPATH, but I got a warning
that it was deprecated (and it didn't solve the problem); I also tried running
with --driver-class-path, which did not work either. I am trying this locally.
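
One workaround sketch (it relies on the private sc._jsc attribute, so both the
approach and the namenode URI below are assumptions to verify): set the default
filesystem directly on the context's Hadoop configuration after the SparkContext
is created.

from pyspark import SparkContext

sc = SparkContext(appName="default-fs-sketch")

# reach the underlying Hadoop Configuration through the Java SparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.defaultFS", "hdfs://namenode:8020")   # hypothetical namenode

rdd = sc.textFile("/data/input.txt")   # now resolved against HDFS rather than file://
print(rdd.count())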



On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com wrote:

 You can also trying adding the core-site.xml in the SPARK_CLASSPATH, btw
 are you running the application locally? or in standalone mode?

 Thanks
 Best Regards

 On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote:

 hi all,

 I am trying to create a spark context programmatically, using
 org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the
 hadoop
 config that is created during the process is not picking up core-site.xml,
 so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in
 spark-env.sh and also put core-site.xml in the conf folder. The whole thing
 works if it is executed through the spark shell.

 Just wondering where spark is picking up the hadoop config path from?

 many thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-context-not-picking-up-default-hadoop-filesystem-tp21368.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
thanks for the replies.

Is this something we can get around? I tried to hack into the code without
much success.

On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai saisai.s...@intel.com wrote:

 Hi,

 I don't think current Spark Streaming support this feature, all the
 DStream lineage is fixed after the context is started.

 Also stopping a stream is not supported, instead currently we need to stop
 the whole streaming context to meet what you want.

 Thanks
 Saisai

 -Original Message-
 From: jamborta [mailto:jambo...@gmail.com]
 Sent: Wednesday, January 21, 2015 3:09 AM
 To: user@spark.apache.org
 Subject: dynamically change receiver for a spark stream

 Hi all,

 we have been trying to set up a stream using a custom receiver that would
 pick up data from sql databases. We'd like to keep that stream context
 running and dynamically change the streams on demand, adding and removing
 streams as needed. Alternatively, if a stream is fixed, is it possible
 to stop a stream, change the config and start it again?

 thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/dynamically-change-receiver-for-a-spark-stream-tp21268.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org




Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
We were thinking along the same lines, that is, to fix the number of streams
and change the input and output channels dynamically.

But we could not make it work (it seems that the receiver does not allow any
change in the config after it has started).

thanks,

On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas gerard.m...@gmail.com wrote:

 One possible workaround could be to orchestrate launch/stopping of
 Streaming jobs on demand as long as the number of jobs/streams stay
 within the boundaries of the resources (cores) you've available.
 e.g. if you're using Mesos, Marathon offers a REST interface to manage job
 lifecycle. You will still need to solve the dynamic configuration through
 some alternative channel.

 On Wed, Jan 21, 2015 at 11:30 AM, Tamas Jambor jambo...@gmail.com wrote:

 thanks for the replies.

 is this something we can get around? Tried to hack into the code without
 much success.

 On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai saisai.s...@intel.com
 wrote:

 Hi,

 I don't think current Spark Streaming support this feature, all the
 DStream lineage is fixed after the context is started.

 Also stopping a stream is not supported, instead currently we need to
 stop the whole streaming context to meet what you want.

 Thanks
 Saisai

 -Original Message-
 From: jamborta [mailto:jambo...@gmail.com]
 Sent: Wednesday, January 21, 2015 3:09 AM
 To: user@spark.apache.org
 Subject: dynamically change receiver for a spark stream

 Hi all,

 we have been trying to setup a stream using a custom receiver that would
 pick up data from sql databases. we'd like to keep that stream context
 running and dynamically change the streams on demand, adding and removing
 streams based on demand. alternativel, if a stream is fixed, is it possible
 to stop a stream, change to config and start again?

 thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/dynamically-change-receiver-for-a-spark-stream-tp21268.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
 additional commands, e-mail: user-h...@spark.apache.org






Re: dynamically change receiver for a spark stream

2015-01-21 Thread Tamas Jambor
Hi Gerard,

thanks, that makes sense. I'll try that out.

Tamas

On Wed, Jan 21, 2015 at 11:14 AM, Gerard Maas gerard.m...@gmail.com wrote:

 Hi Tamas,

 I meant not changing the receivers, but starting/stopping the Streaming
 jobs. So you would have a 'small' Streaming job for a subset of streams
 that you'd configure-start-stop  on demand.
 I haven't tried myself yet, but I think it should also be possible to
 create a Streaming Job from the Spark Job Server (
 https://github.com/spark-jobserver/spark-jobserver). Then you would have
 a REST interface that even gives you the possibility of passing a
 configuration.

 -kr, Gerard.

 On Wed, Jan 21, 2015 at 11:54 AM, Tamas Jambor jambo...@gmail.com wrote:

 we were thinking along the same line, that is to fix the number of
 streams and change the input and output channels dynamically.

 But could not make it work (seems that the receiver is not allowing any
 change in the config after it started).

 thanks,

 On Wed, Jan 21, 2015 at 10:49 AM, Gerard Maas gerard.m...@gmail.com
 wrote:

 One possible workaround could be to orchestrate launch/stopping of
 Streaming jobs on demand as long as the number of jobs/streams stay
 within the boundaries of the resources (cores) you've available.
 e.g. if you're using Mesos, Marathon offers a REST interface to manage
 job lifecycle. You will still need to solve the dynamic configuration
 through some alternative channel.

 On Wed, Jan 21, 2015 at 11:30 AM, Tamas Jambor jambo...@gmail.com
 wrote:

 thanks for the replies.

 is this something we can get around? Tried to hack into the code
 without much success.

 On Wed, Jan 21, 2015 at 3:15 AM, Shao, Saisai saisai.s...@intel.com
 wrote:

 Hi,

 I don't think current Spark Streaming support this feature, all the
 DStream lineage is fixed after the context is started.

 Also stopping a stream is not supported, instead currently we need to
 stop the whole streaming context to meet what you want.

 Thanks
 Saisai

 -Original Message-
 From: jamborta [mailto:jambo...@gmail.com]
 Sent: Wednesday, January 21, 2015 3:09 AM
 To: user@spark.apache.org
 Subject: dynamically change receiver for a spark stream

 Hi all,

 we have been trying to setup a stream using a custom receiver that
 would pick up data from sql databases. we'd like to keep that stream
 context running and dynamically change the streams on demand, adding and
 removing streams based on demand. alternativel, if a stream is fixed, is 
 it
 possible to stop a stream, change to config and start again?

 thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/dynamically-change-receiver-for-a-spark-stream-tp21268.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
 additional commands, e-mail: user-h...@spark.apache.org








Re: save spark streaming output to single file on hdfs

2015-01-13 Thread Tamas Jambor
Thanks. The problem is that we'd like it to be picked up by hive.
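
For completeness, a small PySpark sketch of the two options from the reply quoted
below, written with the Hive use case in mind: either point a Hive external table
at the parent directory that saveAsTextFiles() writes under, or collapse each
batch to a single part file with coalesce(1). The paths and the external-table
idea are assumptions, not something confirmed in this thread.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-output-sketch")
ssc = StreamingContext(sc, 60)
lines = ssc.textFileStream("hdfs:///incoming")   # hypothetical input path

# Option 1: one directory per batch under a common prefix; a Hive external table
# over hdfs:///warehouse/stream_out/ could read them together (subdirectory
# handling may need extra Hive settings).
lines.saveAsTextFiles("hdfs:///warehouse/stream_out/batch")

# Option 2: force a single part file per batch.
lines.foreachRDD(lambda rdd: rdd.coalesce(1).saveAsTextFile(
    "hdfs:///warehouse/stream_single/rdd-%d" % rdd.id()))

ssc.start()
ssc.awaitTermination()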

On Tue Jan 13 2015 at 18:15:15 Davies Liu dav...@databricks.com wrote:

 On Tue, Jan 13, 2015 at 10:04 AM, jamborta jambo...@gmail.com wrote:
  Hi all,
 
  Is there a way to save dstream RDDs to a single file so that another
 process
  can pick it up as a single RDD?

  It does not need to be a single file; Spark can pick up any directory as a
  single RDD.

 Also, it's easy to union multiple RDDs into single one.

  It seems that each slice is saved to a separate folder, using
  saveAsTextFiles method.
 
  I'm using spark 1.2 with pyspark
 
  thanks,
 
 
 
 
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/save-spark-streaming-output-to-single-
 file-on-hdfs-tp21124.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: No module named pyspark - latest built

2014-11-12 Thread Tamas Jambor
Thanks. Will it work with sbt at some point?

On Thu, 13 Nov 2014 01:03 Xiangrui Meng men...@gmail.com wrote:

 You need to use maven to include python files. See
 https://github.com/apache/spark/pull/1223 . -Xiangrui

 On Wed, Nov 12, 2014 at 4:48 PM, jamborta jambo...@gmail.com wrote:
  I have figured out that building the fat jar with sbt does not seem to
   include the pyspark scripts when using the following command:
 
  sbt/sbt -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive clean
  publish-local assembly
 
  however the maven command works OK:
 
  mvn -Pdeb -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
  clean package
 
  am I running the correct sbt command?
 
 
 
  --
  View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/No-module-named-pyspark-latest-
 built-tp18740p18787.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 



Re: why decision trees do binary split?

2014-11-06 Thread Tamas Jambor
Thanks for the reply, Sean.

I can see that splitting on all the categories would probably overfit
the tree; on the other hand, it might give more insight into the
subcategories (it would probably only work if the data is uniformly
distributed between the categories).

I haven't really found any comparison between the two methods in terms
of performance and interpretability.

thanks,
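
For anyone comparing, a minimal MLlib sketch of how a categorical feature is
declared in pyspark; internally the tree still searches binary splits over
subsets of the categories rather than making one N-way split. The toy data and
the arity are made up for illustration.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

sc = SparkContext(appName="binary-split-sketch")

# feature 0 is categorical with values {0, 1, 2}; feature 1 is continuous
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.2]),
    LabeledPoint(1.0, [1.0, 0.3]),
    LabeledPoint(1.0, [2.0, 0.5]),
    LabeledPoint(0.0, [0.0, 0.9]),
])

model = DecisionTree.trainClassifier(
    data,
    numClasses=2,
    categoricalFeaturesInfo={0: 3},   # feature index -> number of categories
    impurity="gini",
    maxDepth=3)

# the printed splits on feature 0 are binary ("feature 0 in {...}")
print(model.toDebugString())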

On Thu, Nov 6, 2014 at 9:46 AM, Sean Owen so...@cloudera.com wrote:
 I haven't seen that done before, which may be most of the reason - I am not
 sure that is common practice.

 I can see upsides - you need not pick candidate splits to test since there
 is only one N-way rule possible. The binary split equivalent is N levels
 instead of 1.

 The big problem is that you are always segregating the data set entirely,
 and making the equivalent of those N binary rules, even when you would not
 otherwise bother because they don't add information about the target. The
 subsets matching each child are therefore unnecessarily small and this makes
 learning on each independent subset weaker.

 On Nov 6, 2014 9:36 AM, jamborta jambo...@gmail.com wrote:

 I meant above, that in the case of categorical variables it might be more
 efficient to create a node on each categorical value. Is there a reason
 why
 spark went down the binary route?

 thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/why-decision-trees-do-binary-split-tp18188p18265.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pass unique ID to mllib algorithms pyspark

2014-11-05 Thread Tamas Jambor
Hi Xiangrui,

Thanks for the reply. Is this still due to be released in 1.2
(SPARK-3530 is still open)?

Thanks,
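
For reference, a possible workaround sketch in the meantime; it assumes that
model.predict(rdd) behaves as a one-to-one map that preserves partitioning
(which is the very property questioned below), so zip() can pair the ids back
with the predictions. The toy data is made up.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

sc = SparkContext(appName="predict-with-ids-sketch")

# (id, LabeledPoint) pairs; toy data for illustration
rows = sc.parallelize([
    (101, LabeledPoint(1.0, [1.0, 2.0])),
    (102, LabeledPoint(0.0, [0.5, 1.5])),
    (103, LabeledPoint(1.0, [2.0, 0.1])),
])

model = LinearRegressionWithSGD.train(rows.map(lambda kv: kv[1]), iterations=10)

ids = rows.map(lambda kv: kv[0])
predictions = model.predict(rows.map(lambda kv: kv[1].features))

# relies on both RDDs having the same partitioning and per-partition order
id_with_prediction = ids.zip(predictions)
print(id_with_prediction.collect())   # [(101, p1), (102, p2), (103, p3)]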

On Wed, Nov 5, 2014 at 3:21 AM, Xiangrui Meng men...@gmail.com wrote:
 The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
 this issue. We carry over extra columns with training and prediction
 and then leverage on Spark SQL's execution plan optimization to decide
 which columns are really needed. For the current set of APIs, we can
 add `predictOnValues` to models, which carries over the input keys.
 StreamingKMeans and StreamingLinearRegression implement this method.
 -Xiangrui

 On Tue, Nov 4, 2014 at 2:30 AM, jamborta jambo...@gmail.com wrote:
 Hi all,

 There are a few algorithms in pyspark where the prediction part is
 implemented in scala (e.g. ALS, decision trees) where it is not very easy to
 manipulate the prediction methods.

 I think it is a very common scenario that the user would like to generate
 predictions for a dataset, so that each predicted value is identifiable
 (e.g. has a unique id attached to it). This is not possible in the current
 implementation as predict functions take a feature vector and return the
 predicted values where, I believe, the order is not guaranteed, so there is
 no way to join it back with the original data the predictions are generated
 from.

 Is there a way around this at the moment?

 thanks,



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/pass-unique-ID-to-mllib-algorithms-pyspark-tp18051.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: partition size for initial read

2014-10-02 Thread Tamas Jambor
That would work. I normally use Hive queries through Spark SQL, and I
have not seen something like that there.
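
For the plain-RDD case, the suggestion in the reply quoted below looks roughly
like this (a small sketch with made-up paths); for Hive/Spark SQL reads I am not
aware of a direct equivalent, so a repartition() on the result is the usual
fallback.

from pyspark import SparkContext

sc = SparkContext(appName="min-partitions-sketch")

# ask for at least 64 partitions at read time instead of repartitioning afterwards
# (the actual count still depends on the input splits / HDFS block size)
rdd = sc.textFile("hdfs:///data/events.csv", minPartitions=64)
print(rdd.getNumPartitions())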

On Thu, Oct 2, 2014 at 3:13 PM, Ashish Jain ashish@gmail.com wrote:
 If you are using textFile() to read data in, it also takes a parameter for
 the minimum number of partitions to create. Would that not work for you?

 On Oct 2, 2014 7:00 AM, jamborta jambo...@gmail.com wrote:

 Hi all,

  I have been testing repartitioning to ensure that my algorithms get similar
  amounts of data.

  I noticed that repartitioning is very expensive. Is there a way to force Spark
  to create a certain number of partitions when the data is read in? How does
  it decide on the partition size initially?

 Thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/partition-size-for-initial-read-tp15603.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
thanks Marcelo.

What's the reason it is not possible in cluster mode, either?

On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com wrote:
 You can't set up the driver memory programatically in client mode. In
 that mode, the same JVM is running the driver, so you can't modify
 command line options anymore when initializing the SparkContext.

 (And you can't really start cluster mode apps that way, so the only
 way to set this is through the command line / config files.)

 On Wed, Oct 1, 2014 at 9:26 AM, jamborta jambo...@gmail.com wrote:
 Hi all,

 I cannot figure out why this command is not setting the driver memory (it is
 setting the executor memory):

 conf = (SparkConf()
         .setMaster("yarn-client")
         .setAppName("test")
         .set("spark.driver.memory", "1G")
         .set("spark.executor.memory", "1G")
         .set("spark.executor.instances", "2")
         .set("spark.executor.cores", "4"))
 sc = SparkContext(conf=conf)

 whereas if I run the spark console:
 ./bin/pyspark --driver-memory 1G

 it sets it correctly. Seemingly they both generate the same commands in the
 logs.

 thanks a lot,





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
When you say "respective backend code to launch it", I thought this was
the way to do that.

thanks,
Tamas

On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin van...@cloudera.com wrote:
 Because that's not how you launch apps in cluster mode; you have to do
 it through the command line, or by calling directly the respective
 backend code to launch it.

 (That being said, it would be nice to have a programmatic way of
 launching apps that handled all this - this has been brought up in a
 few different contexts, but I don't think there's an official
 solution yet.)

 On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor jambo...@gmail.com wrote:
 thanks Marcelo.

 What's the reason it is not possible in cluster mode, either?

 On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com wrote:
 You can't set up the driver memory programatically in client mode. In
 that mode, the same JVM is running the driver, so you can't modify
 command line options anymore when initializing the SparkContext.

 (And you can't really start cluster mode apps that way, so the only
 way to set this is through the command line / config files.)

 On Wed, Oct 1, 2014 at 9:26 AM, jamborta jambo...@gmail.com wrote:
 Hi all,

 I cannot figure out why this command is not setting the driver memory (it 
 is
 setting the executor memory):

  conf = (SparkConf()
          .setMaster("yarn-client")
          .setAppName("test")
          .set("spark.driver.memory", "1G")
          .set("spark.executor.memory", "1G")
          .set("spark.executor.instances", "2")
          .set("spark.executor.cores", "4"))
  sc = SparkContext(conf=conf)

 whereas if I run the spark console:
 ./bin/pyspark --driver-memory 1G

 it sets it correctly. Seemingly they both generate the same commands in the
 logs.

 thanks a lot,





 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 Marcelo



 --
 Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
Thank you for the replies. It makes sense for scala/java, but in
python the JVM is launched when the spark context is initialised, so
it should be able to set it, I assume.
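
On that note, a sketch of the trick this reasoning suggests: set the submit
arguments in the environment before the JVM is launched. Whether
PYSPARK_SUBMIT_ARGS is honoured this way, and whether the trailing
"pyspark-shell" token is needed, depends on the PySpark version, so treat all
of it as an assumption to verify rather than a supported API.

import os

# must happen before pyspark launches the JVM (i.e. before SparkContext is created)
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 1G pyspark-shell"

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("test"))
sc = SparkContext(conf=conf)
print(sc._conf.get("spark.driver.memory"))   # private attribute, used here only to check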



On Wed, Oct 1, 2014 at 6:24 PM, Andrew Or and...@databricks.com wrote:
 Hi Tamas,

 Yes, Marcelo is right. The reason why it doesn't make sense to set
 spark.driver.memory in your SparkConf is because your application code, by
 definition, is the driver. This means by the time you get to the code that
 initializes your SparkConf, your driver JVM has already started with some
 heap size, and you can't easily change the size of the JVM once it has
 started. Note that this is true regardless of the deploy mode (client or
 cluster).

 Alternatives to set this include the following: (1) You can set
 spark.driver.memory in your `spark-defaults.conf` on the node that submits
 the application, (2) You can use the --driver-memory command line option if
 you are using Spark submit (bin/pyspark goes through this path, as you have
 discovered on your own).
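
 For example (the 2g value and the script name below are just placeholders):

   # (1) in conf/spark-defaults.conf on the submitting node
   spark.driver.memory   2g

   # (2) on the command line
   ./bin/pyspark --driver-memory 2g
   ./bin/spark-submit --driver-memory 2g your_script.py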

 Does that make sense?


 2014-10-01 10:17 GMT-07:00 Tamas Jambor jambo...@gmail.com:

 Regarding "calling directly the respective backend code to launch it": I
 thought this was the way to do that.

 thanks,
 Tamas

 On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  Because that's not how you launch apps in cluster mode; you have to do
  it through the command line, or by calling directly the respective
  backend code to launch it.
 
  (That being said, it would be nice to have a programmatic way of
  launching apps that handled all this - this has been brought up in a
  few different contexts, but I don't think there's an official
  solution yet.)
 
  On Wed, Oct 1, 2014 at 9:59 AM, Tamas Jambor jambo...@gmail.com wrote:
  thanks Marcelo.
 
  What's the reason it is not possible in cluster mode, either?
 
  On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com
  wrote:
  You can't set the driver memory programmatically in client mode. In
  that mode, the same JVM is already running the driver, so you can't modify
  its command-line options anymore when initializing the SparkContext.
 
  (And you can't really start cluster mode apps that way, so the only
  way to set this is through the command line / config files.)
 
  On Wed, Oct 1, 2014 at 9:26 AM, jamborta jambo...@gmail.com wrote:
  Hi all,
 
  I cannot figure out why this command is not setting the driver memory (it
  is setting the executor memory):
 
  conf = (SparkConf()
  .setMaster("yarn-client")
  .setAppName("test")
  .set("spark.driver.memory", "1G")
  .set("spark.executor.memory", "1G")
  .set("spark.executor.instances", "2")
  .set("spark.executor.cores", "4"))
  sc = SparkContext(conf=conf)
 
  whereas if I run the spark console:
  ./bin/pyspark --driver-memory 1G
 
  it sets it correctly. Seemingly they both generate the same commands
  in the
  logs.
 
  thanks a lot,
 
 
 
 
 
  --
  View this message in context:
  http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-memory-is-not-set-pyspark-1-1-0-tp15498.html
  Sent from the Apache Spark User List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 
  --
  Marcelo
 
 
 
  --
  Marcelo

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: yarn does not accept job in cluster mode

2014-09-29 Thread Tamas Jambor
thanks for the reply.

As I mentioned above, everything works in yarn-client mode; the problem
starts when I try to run it in yarn-cluster mode.

(It seems that spark-shell does not work in yarn-cluster mode, so I cannot
debug that way.)
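
(For reference, cluster-mode apps are normally launched through spark-submit,
e.g. something like

./bin/spark-submit --master yarn-cluster --executor-memory 1024m your_app.py

where the script name and memory value are just placeholders.)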


On Mon, Sep 29, 2014 at 7:30 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 Can you try running the spark-shell in yarn-cluster mode?

 ./bin/spark-shell --master yarn-client

 Read more over here http://spark.apache.org/docs/1.0.0/running-on-yarn.html

 Thanks
 Best Regards

 On Sun, Sep 28, 2014 at 7:08 AM, jamborta jambo...@gmail.com wrote:

 hi all,

 I have a job that works OK in yarn-client mode, but when I try it in
 yarn-cluster mode it returns the following:

 WARN YarnClusterScheduler: Initial job has not accepted any resources;
 check
 your cluster UI to ensure that workers are registered and have sufficient
 memory

 the cluster has plenty of memory and resources. I am running this from
 python using this context:

 conf = (SparkConf()
 .setMaster("yarn-cluster")
 .setAppName("spark_tornado_server")
 .set("spark.executor.memory", "1024m")
 .set("spark.cores.max", "16")
 .set("spark.driver.memory", "1024m")
 .set("spark.executor.instances", "2")
 .set("spark.executor.cores", "8")
 .set("spark.eventLog.enabled", False))

 HADOOP_HOME and HADOOP_CONF_DIR are also set in spark-env.

 thanks,

 not sure if I am missing some config




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/yarn-does-not-accept-job-in-cluster-mode-tp15281.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Yarn number of containers

2014-09-25 Thread Tamas Jambor
Thank you.

Where is the number of containers set?

On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin van...@cloudera.com wrote:
 On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote:
 I am running spark with the default settings in yarn client mode. For some
 reason yarn always allocates three containers to the application (wondering
 where it is set?), and only uses two of them.

 The default number of executors in Yarn mode is 2; so you have 2
 executors + the application master, so 3 containers.
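
  For example (illustrative numbers only), submitting with

  ./bin/spark-submit --master yarn-client --num-executors 4 --executor-cores 2 my_app.py

  would ask YARN for 4 executor containers plus one container for the
  application master, i.e. 5 containers in total.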

 Also, the CPUs on the cluster never go over 50%; I turned off the fair
 scheduler and set a high spark.cores.max. Are there any additional settings I
 am missing?

 You probably need to request more cores (--executor-cores). Don't
 remember if that is respected in Yarn, but should be.

 --
 Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: access javaobject in rdd map

2014-09-23 Thread Tamas Jambor
Hi Davies,

Thanks for the reply. I saw that you guys do it that way in the code. Is
there no other way?

I have implemented all the predict functions in Scala, so I would prefer not
to reimplement the whole thing in Python.

thanks,


On Tue, Sep 23, 2014 at 5:40 PM, Davies Liu dav...@databricks.com wrote:
 You should create a pure Python object (copy the attributes from the Java
 object); it can then be used in map.

 Davies
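
 A minimal sketch of that idea (the names java_model, data_rdd, weights() and
 intercept() below are placeholders; adapt them to whatever your Scala model
 actually exposes):

 class PyModel(object):
     def __init__(self, weights, intercept):
         # keep only plain Python state so the object pickles cleanly
         self.weights = list(weights)
         self.intercept = float(intercept)

     def predict(self, features):
         return sum(w * x for w, x in zip(self.weights, features)) + self.intercept

 # on the driver: copy the numbers out of the py4j JavaObject ...
 py_model = PyModel(java_model.weights(), java_model.intercept())

 # ... and ship only the pure Python copy into the map
 predictions = data_rdd.map(lambda features: py_model.predict(features))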

 On Tue, Sep 23, 2014 at 8:48 AM, jamborta jambo...@gmail.com wrote:
 Hi all,

 I have a Java object that contains an ML model which I would like to use for
 prediction (in Python). I just want to iterate the data through a mapper and
 predict for each value. Unfortunately, this fails when it tries to serialise
 the object to send it to the nodes.

 Is there a trick around this? Surely, this object could be picked up by
 reference at the nodes.

 many thanks,



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/access-javaobject-in-rdd-map-tp14898.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org