Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-28 Thread Felix Cheung
Ah, it's there in spark-submit and pyspark. Seems like it should be added for spark_ec2 as well.

streaming from Flume, save to Cassandra and Solr with Banana as search engine

2015-11-28 Thread Oleksandr Yermolenko
Hi, The aim: collect syslogs, filter (discard really unneeded events), save the rest to a Cassandra table, and later integrate a Solr-based search engine. Environment: Spark 1.5.2 / Scala 2.10.4, spark-cassandra-connector_2.11-1.5.0-M2.jar, Flume 1.6.0. I have reviewed a lot of examples
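Not from the thread itself, but a minimal sketch of the pipeline shape being described (receive from Flume, filter, persist to Cassandra), assuming hypothetical hostnames, keyspace, and table names:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils
import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStreams

case class LogEvent(host: String, body: String) // hypothetical event model

val conf = new SparkConf()
  .setAppName("syslog-to-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val ssc = new StreamingContext(conf, Seconds(10))

FlumeUtils.createStream(ssc, "0.0.0.0", 4141)             // Flume avro sink target
  .map(e => new String(e.event.getBody.array(), "UTF-8")) // raw syslog line
  .filter(line => !line.contains("DEBUG"))                // discard unneeded events
  .map(line => LogEvent("myhost", line))                  // parse as needed
  .saveToCassandra("logs", "events")                      // keyspace, table

ssc.start()
ssc.awaitTermination()
```

Solr indexing could hang off the same DStream later, e.g. in a separate foreachRDD.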

How to get a single available message from kafka (case where OffsetRange.fromOffset == OffsetRange.untilOffset)

2015-11-28 Thread Nikos Viorres
Hi, I am using KafkaUtils.createRDD to retrieve data from Kafka for batch processing. When invoking KafkaUtils.createRDD with an OffsetRange where OffsetRange.fromOffset == OffsetRange.untilOffset for a particular partition, I get an empty RDD. The documentation is clear that until is exclusive and
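Since untilOffset is exclusive, a range of [n, n) is empty by definition; to fetch the single message sitting at offset n, the range has to be [n, n + 1). A hedged sketch against the Spark 1.5 Kafka API, with hypothetical broker and topic names:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

def readOne(sc: SparkContext, topic: String, partition: Int, offset: Long) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  // untilOffset is exclusive: [offset, offset) would be empty,
  // [offset, offset + 1) yields exactly the message at `offset`.
  val range = OffsetRange(topic, partition, offset, offset + 1)
  KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, Array(range))
}
```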

df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-28 Thread Don Drake
I have a 2TB dataset in a DataFrame that I am attempting to partition by 2 fields, and my YARN job seems to write the partitioned dataset successfully. I can see the output in HDFS once all Spark tasks are done. After the Spark tasks are done, the job appears to keep running for over an
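For context, a minimal sketch of the write pattern in question (Spark 1.5 DataFrameWriter), with hypothetical column and path names:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical partition columns; each task writes one file per distinct
// (year, month) combination it encounters, which is where memory for open
// Parquet writers accumulates.
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("year", "month")
  .parquet("hdfs:///data/output")
```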

StructType for oracle.sql.STRUCT

2015-11-28 Thread Pieter Minnaar
Hi, I need to read Oracle Spatial SDO_GEOMETRY tables into Spark. I need to know how to create a StructField to use in the schema definition for the Oracle geometry columns. In standard JDBC, the values are read as oracle.sql.STRUCT types. How can I get the same values in Spark SQL?
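Not a direct STRUCT mapping, but one hedged workaround: have Oracle serialize the geometry to WKT inside a pushed-down subquery, so the column reaches Spark as a plain string (StringType in the schema). Table, column, and connection details are hypothetical:

```scala
// The subquery runs on the Oracle side; SDO_UTIL.TO_WKTGEOMETRY turns the
// SDO_GEOMETRY STRUCT into WKT text before JDBC ever sees it.
val query =
  """(SELECT id, SDO_UTIL.TO_WKTGEOMETRY(geom) AS geom_wkt
    |   FROM spatial_table) src""".stripMargin

val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", query)
  .load() // geom_wkt is StringType; parse it downstream with a geometry library
```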

General question on using StringIndexer in SparkML

2015-11-28 Thread Vishnu Viswanath
Hi All, I have a general question on using StringIndexer. StringIndexer gives an index to each label in the feature, starting from 0 (0 for the most frequent label). Suppose I am building a model and I use StringIndexer to transform one of my columns. E.g., suppose A was the most frequent word
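For reference, a minimal sketch of the transformer being discussed (Spark 1.5 ML API), with hypothetical column names; indices are assigned by descending label frequency, so the most frequent label gets index 0:

```scala
import org.apache.spark.ml.feature.StringIndexer

val df = sqlContext.createDataFrame(Seq(
  (0, "A"), (1, "A"), (2, "A"), (3, "B"), (4, "C")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

// "A" is most frequent here, so it maps to 0.0; "B" and "C" follow.
indexer.fit(df).transform(df).show()
```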

storing result of aggregation of spark streaming

2015-11-28 Thread Amir Rahnama
Hi, I am going to store the results of my streaming job in a database; which database has native support (if any)? -- Thanks and Regards, Amir Hossein Rahnama *Tel: +46 (0) 761 681 102* Website: www.ambodi.com Twitter: @_ambodi

Re: StructType for oracle.sql.STRUCT

2015-11-28 Thread andy petrella
Warf... such a heavy task, man! I'd love to follow your work on that (I've long experience in geospatial too); is there a repo available already for that? The hard part will be to support all the descendant types, I guess (lines, multilines, and so on), then creating the spatial operators. The only

Spark and simulated annealing

2015-11-28 Thread marfago
Hi All, I would like to implement a simulated annealing algorithm with Spark. What is the best way to do that with Python or Scala? Is there any library already implementing it? Thank you in advance. Marco
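There is no simulated annealing in MLlib itself; one common pattern is embarrassingly parallel annealing: run many independent chains, one per task, and keep the best result. A minimal sketch with a toy objective standing in for the real cost function:

```scala
import scala.util.Random

def energy(x: Double): Double = (x - 3.0) * (x - 3.0) // toy objective to minimize

def anneal(seed: Long, steps: Int): (Double, Double) = {
  val rng  = new Random(seed)
  var x    = rng.nextDouble() * 10.0 // random starting point
  var best = x
  var temp = 1.0
  for (_ <- 1 to steps) {
    val candidate = x + (rng.nextDouble() - 0.5) * temp // neighbour move
    val delta     = energy(candidate) - energy(x)
    if (delta < 0 || rng.nextDouble() < math.exp(-delta / temp))
      x = candidate                                     // Metropolis acceptance
    if (energy(x) < energy(best)) best = x
    temp *= 0.999                                       // geometric cooling
  }
  (energy(best), best)
}

// 1000 independent chains across the cluster; min() picks the lowest energy.
val (bestEnergy, bestX) = sc.parallelize(1L to 1000L).map(anneal(_, 10000)).min()
```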

Retrieve best parameters from CrossValidator

2015-11-28 Thread BenFradet
Hi everyone, Is there a better way to retrieve the best model parameters obtained from cross-validation than inspecting the logs issued while calling the fit method (with the call here:
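A hedged sketch of one log-free alternative, assuming Spark 1.5's CrossValidatorModel, which exposes avgMetrics aligned by index with the candidate param maps:

```scala
// cv: a configured org.apache.spark.ml.tuning.CrossValidator
val cvModel = cv.fit(training)

// Pair each candidate ParamMap with its average cross-validation metric
// and keep the winner (assumes the evaluator's metric is larger-is-better).
val bestParams = cvModel.getEstimatorParamMaps
  .zip(cvModel.avgMetrics)
  .maxBy(_._2)
  ._1

val best = cvModel.bestModel // the refitted best model itself
```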

Confirm this won't parallelize/partition?

2015-11-28 Thread Jim Lohse
Hi, I got a good answer on the main question elsewhere; would anyone please confirm the first code is the right approach? For an MVCE I am trying to adapt this example, and it seems like I am having Java issues with types (but this is basically the right approach?): int count =
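Not the poster's Java MVCE, but a minimal Scala sketch of the approach being confirmed: request an explicit slice count up front and verify it, rather than inferring the partitioning from runtime behaviour:

```scala
// Ask for 8 partitions explicitly; parallelize does split the local
// collection across them.
val rdd = sc.parallelize(1 to 1000000, 8)

println(rdd.partitions.length) // should print 8
val count = rdd.count()        // runs as 8 parallel tasks
```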

Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Yu Zhang
BTW, if you decide to try MongoDB, please use the 3.0+ version with the "wiredtiger" storage engine. On Sat, Nov 28, 2015 at 11:30 PM, Yu Zhang wrote: > If you need to construct multiple indexes, HBase will perform better; the > writing speed is slow in MongoDB with many indexes

Re: Spark Streaming on mesos

2015-11-28 Thread Renjie Liu
Hi Nagaraj, Thanks for the response, but this does not solve my problem. I think executor memory should be proportional to the number of cores, or the number of cores in each executor should be the same. On Sat, Nov 28, 2015 at 1:48 AM Nagaraj Chandrashekar < nchandrashe...@innominds.com> wrote: > Hi
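For reference, a hedged sketch of the knobs Spark 1.5 exposes on Mesos: in coarse-grained mode the total core cap and per-executor memory are settable, but cores per executor are not independently controllable, which is the coupling under discussion. The master URL is hypothetical:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")
  .set("spark.mesos.coarse", "true")  // one long-running executor per slave
  .set("spark.cores.max", "16")       // cap on total cores across the cluster
  .set("spark.executor.memory", "8g") // memory per executor
```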

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-28 Thread swetha kasireddy
Yes, I mean killing the Spark job from the UI. Also, I use context.awaitTermination(). On Wed, Nov 25, 2015 at 6:23 PM, Tathagata Das wrote: > What do you mean by killing the streaming job using UI? Do you mean that > you are clicking the "kill" link in the Jobs page in the

Parquet files not getting coalesced to smaller number of files

2015-11-28 Thread SRK
Hi, I have the following code that saves the Parquet files from my hourly batch to HDFS. My idea is to coalesce the files to 1500 smaller files. On the first run it gives me 1500 files in HDFS. On subsequent runs the number of files keeps increasing even though I coalesce. It's not getting coalesced to
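A hedged sketch of why the count can grow: coalesce only bounds the files written by the current run, so with SaveMode.Append each hourly batch stacks another ~1500 files onto the same directory. Writing each batch under its own (hypothetical) hourly path keeps the per-directory count bounded:

```scala
import org.apache.spark.sql.SaveMode

// ~1500 files from *this* batch; appending to a fresh hourly subdirectory
// instead of one shared path stops the totals from accumulating run over run.
df.coalesce(1500)
  .write
  .mode(SaveMode.Append)
  .parquet("hdfs:///data/parquet/dt=2015-11-28/hour=13")
```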

Re: storing result of aggregation of spark streaming

2015-11-28 Thread Michael Spector
Hi Amir, You can store the results of stream transformations in Cassandra using: https://github.com/datastax/spark-cassandra-connector Regards, Michael On Sun, Nov 29, 2015 at 1:41 AM, Amir Rahnama wrote: > Hi, > > I am gonna store the results of my stream job into a db,
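A minimal sketch of what the connector usage can look like for an aggregation, assuming `events` is an existing DStream[String] and that a matching keyspace/table (key text, total int) already exists:

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._ // saveToCassandra on DStreams
import org.apache.spark.streaming.Seconds

events
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60)) // per-minute counts
  .saveToCassandra("metrics", "word_counts", SomeColumns("key", "total"))
```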

Re: Experiences about NoSQL databases with Spark

2015-11-28 Thread Jörn Franke
I would not use MongoDB because it does not fit well into the Spark or Hadoop architecture. You can use it if your data volume is very small and already pre-aggregated, but this is a very limited use case. You can use HBase, or future versions of Hive (if they use Tez > 0.8). For interactive