Improve performance using Spark Streaming + Spark SQL

2015-01-24 Thread Subacini B
Hi All, I have a cluster of 3 nodes [each 8 cores/32 GB memory]. My program uses Spark Streaming with Spark SQL [Spark 1.1] and writes incoming JSON to Elasticsearch and HBase. Below is my code; I receive JSON files [input data varies from 30 MB to 300 MB] every 10 seconds. Irrespective of 3 nodes
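A minimal sketch of the pattern described above — 10-second Spark Streaming batches of JSON indexed into Elasticsearch via the elasticsearch-hadoop connector. The endpoint, landing directory, and index names are assumptions, not taken from the thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.elasticsearch.spark.rdd.EsSpark // elasticsearch-hadoop, assumed on the classpath

    object StreamJsonToEs {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("json-to-es")
          .set("es.nodes", "es-host:9200") // hypothetical Elasticsearch endpoint
        // 10-second batches, matching the interval mentioned in the post
        val ssc = new StreamingContext(conf, Seconds(10))

        // hypothetical landing directory for the incoming JSON files
        val lines = ssc.textFileStream("hdfs:///incoming/")
        lines.foreachRDD { rdd =>
          // each line is already a JSON document, so index it as-is
          EsSpark.saveJsonToEs(rdd, "events/json") // index/type
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }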

Re: Integrate Spark Streaming and Kafka, but get bad symbolic reference error

2015-01-24 Thread mykidong
Maybe you can use an alternative Kafka receiver which I wrote: https://github.com/mykidong/spark-kafka-simple-consumer-receiver - Kidong. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Integrerate-Spark-Streaming-and-Kafka-but-get-bad-symbolic-reference-erro

Re: Need some help to create user defined type for ML pipeline

2015-01-24 Thread Joseph Bradley
Hi Jao, You're right that defining serialize and deserialize is the main task in implementing a UDT. They are basically translating between your native representation (ByteImage) and SQL DataTypes. The sqlType you defined looks correct, and you're correct to use a row of length 4. Other than th
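For reference, a skeletal UDT along the lines discussed — the ByteImage fields are hypothetical, and the exact package and method signatures shifted across Spark 1.x releases, so treat this as a sketch rather than a drop-in implementation:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Hypothetical native class that the UDT wraps
    class ByteImage(val width: Int, val height: Int, val channels: Int, val data: Array[Byte])

    class ByteImageUDT extends UserDefinedType[ByteImage] {
      // A row of length 4, as discussed above
      override def sqlType: StructType = StructType(Seq(
        StructField("width", IntegerType, nullable = false),
        StructField("height", IntegerType, nullable = false),
        StructField("channels", IntegerType, nullable = false),
        StructField("data", BinaryType, nullable = false)))

      // Native object -> SQL representation
      override def serialize(obj: Any): Any = obj match {
        case img: ByteImage => Row(img.width, img.height, img.channels, img.data)
      }

      // SQL representation -> native object
      override def deserialize(datum: Any): ByteImage = datum match {
        case r: Row =>
          new ByteImage(r.getInt(0), r.getInt(1), r.getInt(2), r.getAs[Array[Byte]](3))
      }

      override def userClass: Class[ByteImage] = classOf[ByteImage]
    }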

Re: [mllib] Decision Tree - prediction probabilites of label classes

2015-01-24 Thread Joseph Bradley
There is a JIRA...but not a PR yet. Here's the JIRA: https://issues.apache.org/jira/browse/SPARK-3727 I'm not aware of current work on it, but I agree it would be nice to have! Joseph On Thu, Jan 22, 2015 at 2:50 AM, Sean Owen wrote: > You are right that this isn't implemented. I presume you c

RE: spark 1.1.0 save data to hdfs failed

2015-01-24 Thread ey-chih chow
I modified my pom.xml according to the Spark pom.xml. It is working right now. Hadoop 2 classes are no longer packaged into my jar. Thanks.

JDBC sharded solution

2015-01-24 Thread Charles Feduke
I'm trying to figure out the best approach to getting sharded data from PostgreSQL into Spark. Our production PGSQL cluster has 12 shards with TiB of data on each shard. (I won't be accessing all of the data on a shard at once, but I don't think it's feasible to use Sqoop to copy tables whose data
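One possible approach, sketched under assumptions (host, table, and column names are invented): build one JdbcRDD per shard, partitioned on a numeric key, then union them into a single RDD.

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.JdbcRDD

    // One RDD per shard; JdbcRDD binds the partition bounds to the two ?s
    def shardRdd(sc: SparkContext, host: String) =
      new JdbcRDD(sc,
        () => DriverManager.getConnection(s"jdbc:postgresql://$host/mydb", "user", "pass"),
        "SELECT id, payload FROM events WHERE id >= ? AND id <= ?",
        1L, 100000000L, 20, // key range and partition count (example values)
        (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))

    // Union the 12 shards into a single RDD:
    // val all = sc.union((1 to 12).map(i => shardRdd(sc, s"shard$i.internal")))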

RE: Full per node replication level (architecture question)

2015-01-24 Thread Ashic Mahtab
You could look at using Cassandra for storage. Spark integrates nicely with Cassandra, and a combination of Spark + Cassandra would give you fast access to structured data in Cassandra, while enabling analytic scenarios via Spark. Cassandra would take care of the replication, as it's one of the
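A minimal sketch of that combination using the DataStax spark-cassandra-connector (keyspace, table, and host names are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._ // spark-cassandra-connector, assumed on the classpath

    val conf = new SparkConf()
      .setAppName("cassandra-demo")
      .set("spark.cassandra.connection.host", "cassandra-host") // hypothetical
    val sc = new SparkContext(conf)

    // Read a Cassandra table as an RDD and analyze it in Spark
    val rows = sc.cassandraTable("ks", "events")
    println(rows.count())

    // Write results back; Cassandra handles replication across nodes
    sc.parallelize(Seq((1, "a"), (2, "b"))).saveToCassandra("ks", "kv", SomeColumns("id", "value"))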

Full per node replication level (architecture question)

2015-01-24 Thread Matan Safriel
Hi, I wonder whether any of the file systems supported by Spark supports a replication level whereby each node has a full copy of the data. I realize this was not the main intended scenario of Spark/Hadoop, but it may be a good fit for a compute cluster that needs to be very fast over its in

cannot run spark-shell interactively against cluster from remote host - confusing memory warnings

2015-01-24 Thread Joseph Lust
I’ve set up a Spark cluster in the last few weeks and everything is working, but I cannot run spark-shell interactively against the cluster from a remote host. * Deploy .jar to cluster from remote (laptop) spark-submit and have it run – Check * Run .jar on spark-shell locally – Check *
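A common cause in this situation — offered as an assumption, not a confirmed diagnosis of this report — is that the executors cannot connect back to a driver running outside the cluster. A sketch of pinning the driver's advertised address (master URL and IP are examples):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")   // hypothetical master URL
      .setAppName("remote-shell-test")
      .set("spark.driver.host", "203.0.113.5") // the laptop's cluster-routable IP (example)
    val sc = new SparkContext(conf)

    // A trivial job to confirm executors can reach the driver
    println(sc.parallelize(1 to 100).sum())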

Re: Support for SQL on unions of tables (merge tables?)

2015-01-24 Thread Michael Armbrust
> > I have never used Hive, so I'll have to investigate further. To clarify, I wasn't recommending you use Apache Hive, but instead the HiveContext provided by Spark SQL. This will allow you to create views in a hiv
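A sketch of what that looks like with the HiveContext (the table and view names are invented; sc is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)

    // A view over a union of per-period tables, queryable like any table
    hc.sql("""CREATE VIEW IF NOT EXISTS all_events AS
              SELECT * FROM events_2015_01
              UNION ALL
              SELECT * FROM events_2015_02""")

    hc.sql("SELECT count(*) FROM all_events").collect().foreach(println)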

Re: what is the roadmap for Spark SQL dialect in the coming releases?

2015-01-24 Thread Michael Armbrust
I generally recommend people use the HQL dialect provided by the HiveContext when possible: http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started I'll also note that this is distinct from the Hive on Spark project, which is based on the Hive query optimizer / execution eng

Project ideas for spark

2015-01-24 Thread shrenik
Hello guys, I'm a graduate student and have to do a semester-long project based on Data Intensive Computing. I'm interested in using Spark in my project, so I would really appreciate it if anybody could suggest project ideas involving Spark. Thank you in advance. -- View this message i

Re: spark 1.2 - Writing Parquet fails for timestamp with "Unsupported datatype TimestampType"

2015-01-24 Thread Michael Armbrust
Those annotations actually don't work because the timestamp type in SQL has optional nanosecond precision. However, there is a PR to add support using Parquet's INT96 type: https://github.com/apache/spark/pull/3820 On Fri, Jan 23, 2015 at 12:08 PM, Manoj Samel wrote: > Looking further at the trace a
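Until that lands, one workaround (a sketch, not from the thread, using the Spark 1.1/1.2-style case-class API) is to carry the timestamp as epoch milliseconds and convert before writing:

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext

    case class Event(id: Int, ts: Timestamp)
    case class EventFlat(id: Int, tsMillis: Long) // timestamp carried as a plain long

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
    import sqlContext.createSchemaRDD   // Spark 1.1/1.2 implicit for case-class RDDs

    val events = sc.parallelize(Seq(Event(1, new Timestamp(System.currentTimeMillis()))))
    // Convert before writing, since the 1.2 Parquet writer rejects TimestampType
    events.map(e => EventFlat(e.id, e.ts.getTime)).saveAsParquetFile("/tmp/events.parquet")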

Re: Can't access nested types with sql

2015-01-24 Thread Michael Armbrust
You need to use lateral view explode: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView On Fri, Jan 23, 2015 at 7:02 AM, matthes wrote: > I try to work with nested parquet data. To read and write the parquet file > is > actually working now but when I try to query a nes
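For example, assuming a table people with an array<string> column emails, registered in a HiveContext (all names invented):

    val flattened = hiveContext.sql(
      """SELECT name, email
        |FROM people
        |LATERAL VIEW explode(emails) e AS email""".stripMargin)
    flattened.collect().foreach(println)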

Re: Checksum validation failed pulling spark 1.2.0 from Maven Central

2015-01-24 Thread Rory Douglas
Yes SPARK-5308 should take care of it (I'm using leiningen here). I'll disable checksum verification until this is resolved. Thanks! On Sat Jan 24 2015 at 1:39:25 PM Sean Owen wrote: > Weird, because the .md5 and .sha1 files are present in the repo: > > https://repo1.maven.org/maven2/org/apache

Re: Checksum validation failed pulling spark 1.2.0 from Maven Central

2015-01-24 Thread Sean Owen
Weird, because the .md5 and .sha1 files are present in the repo: https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.2.0/ I'm actually not sure what that indicates, that some metadata isn't quite right at Central? PS I was just looking at a closely related, but seemingly different,

Eclipse on spark

2015-01-24 Thread riginos
How do I compile a Spark project in Scala IDE for Eclipse? I have many Scala scripts and I no longer want to load them from the Scala shell; what can I do? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Eclipse-on-spark-tp21350.html Sent from the Apache Spark User

Checksum validation failed pulling spark 1.2.0 from Maven Central

2015-01-24 Thread Rory Douglas
I'm unable to pull spark_2.10 v1.2.0 from Maven Central. It fails with a checksum validation exception. I noticed someone else experienced the same issue ( http://apache-spark-user-list.1001560.n3.nabble.com/Spark-core-maven-error-tc20871.html ). I'm able to pull 1.1.1 fine. I've also tried blo

Re: java.io.IOException: connection closed.

2015-01-24 Thread Kartheek.R
When I increase the executor memory, it runs smoothly without any errors. On Sat, Jan 24, 2015 at 9:29 PM, Rapelly Kartheek wrote: > Hi, > While running a Spark application, I get the following exception, leading to > several failed stages. > > Exception in thread "Thread-46" org.apache.sp
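For reference, executor memory can be raised either with --executor-memory on spark-submit or programmatically (the value below is just an example):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.executor.memory", "4g") // example value; size to your workload
    val sc = new SparkContext(conf)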

Re: Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-24 Thread Nicholas Chammas
I believe Databricks provides an RDD interface to Redshift. Did you check spark-packages.org? On Sat, Jan 24, 2015 at 6:45 AM, Denis Mikhalkin wrote: > Hello, > > we've got some analytics data in AWS Redshift. The data is being > constantly updated. > > I'd like to be able to write a query against

java.io.IOException: connection closed.

2015-01-24 Thread Kartheek.R
Hi, While running a Spark application, I get the following exception, leading to several failed stages. Exception in thread "Thread-46" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 262, s

SparkException: Task not serializable - Jackson Json

2015-01-24 Thread mickdelaney
Hi, I'm getting an exception using Jackson with Spark & Scala. I'm using Scala case classes to represent the JSON that I'm trying to parse. It's being caused by: *Caused by: java.io.NotSerializableException: com.fasterxml.jackson.module.scala.modifiers.SetTypeModifier$ * I'm using a modif
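A common remedy — sketched under the assumption that the mapper is being captured in a task closure — is to keep the ObjectMapper in a singleton object so it is created once per JVM and never serialized:

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    case class LogEntry(level: String, message: String) // hypothetical payload

    object Json {
      // Created lazily on each executor; never shipped inside a task closure
      @transient lazy val mapper: ObjectMapper = {
        val m = new ObjectMapper()
        m.registerModule(DefaultScalaModule)
        m
      }
      def parse(line: String): LogEntry = mapper.readValue(line, classOf[LogEntry])
    }

    // usage: rdd.map(Json.parse) -- the closure references the object, not the mapper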

RE: spark 1.1.0 save data to hdfs failed

2015-01-24 Thread ey-chih chow
Thanks for the information. I changed the dependency for the Spark jar as follows:

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.0</version>
      <scope>provided</scope>
    </dependency>

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-24 Thread Fengyun RAO
Hi Davies, the log shows that LogParser initializes and loads data once per executor, thus I think the singleton still works. I changed the code to sc.textFile(inputPath).flatMap(line => LogParser.parseLine(line)).foreach(_ => {}) to avoid shuffle IO, but it's slower. I thought it may be caused by

Analyzing data from non-standard data sources (e.g. AWS Redshift)

2015-01-24 Thread Denis Mikhalkin
Hello, we've got some analytics data in AWS Redshift. The data is being constantly updated. I'd like to be able to write a query against Redshift which would return a subset of data, and then run a Spark job (Pyspark) to do some analysis. I could not find an RDD which would let me do it OOB (Pyt

Re: spark-shell has syntax error on windows.

2015-01-24 Thread Vladimir Protsenko
I have created an issue https://issues.apache.org/jira/browse/SPARK-5396. Yana, it is something different so I created a new one. 2015-01-24 2:23 GMT+04:00 Yana Kadiyska : > https://issues.apache.org/jira/browse/SPARK-5389 > > I marked as minor since I also just discovered that I can run it unde

summary for all columns (numeric, strings) in a dataset

2015-01-24 Thread kundan kumar
Hi, is there something like the summary function in Spark, as in R? The summary calculation which comes with Spark (MultivariateStatisticalSummary) operates only on numeric types. I am interested in getting the results for string types as well, such as the four most frequently occurring strings (groupby ki
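There is no built-in equivalent for strings, but a frequency-based summary is straightforward to sketch (the file path and the choice of top 4 are examples):

    import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.x

    val col = sc.textFile("/data/names.txt") // one string value per line
    val top4 = col
      .map(s => (s, 1L))
      .reduceByKey(_ + _)
      .top(4)(Ordering.by(_._2)) // the four most frequent values
    top4.foreach { case (value, count) => println(s"$value: $count") }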

RE: Starting a spark streaming app in init.d

2015-01-24 Thread Ashic Mahtab
Cool. I was thinking of waiting a second and doing ps aux | grep java | grep jarname.jar, and I guess checking port 4040 would work as well. Thanks for the input. Regards, Ashic.

Re: While Loop

2015-01-24 Thread Ted Yu
Please check the ulimit setting. Cheers > On Jan 23, 2015, at 11:19 PM, Deep Pradhan wrote: > > Ted, when I added --driver-memory 2g to my Spark submit command, I got error > which says "Too many files open" > >> On Sat, Jan 24, 2015 at 10:59 AM, Deep Pradhan >> wrote: >> Version of Spar

Re: spark 1.1.0 save data to hdfs failed

2015-01-24 Thread Sean Owen
Hadoop 2's artifact is hadoop-common rather than hadoop-core but I assume you looked for that too. To answer your earlier question, no, Spark works with both Hadoop 1 and Hadoop 2 and is source-compatible with both. It can't be binary-compatible with both at once though. The code you cite is correc

Re: Large number of pyspark.daemon processes

2015-01-24 Thread Sven Krasser
Hey Davies, Sure thing, it's filed here now: https://issues.apache.org/jira/browse/SPARK-5395 As far as a repro goes, what is a normal number of workers I should expect? Even shortly after kicking the job off, I see workers in the double-digits per container. Here's an example using pstree on a w