Re: Spark/HIVE Insert Into values Error

2014-10-26 Thread arthur.hk.c...@gmail.com
Hi, I have already found a way to do “INSERT INTO HIVE_TABLE VALUES (…..)”. Regards, Arthur. On 18 Oct, 2014, at 10:09 pm, Cheng Lian lian.cs@gmail.com wrote: Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO ... VALUES ... syntax. On 10/18/14 1:33 AM,
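
A minimal sketch of one common workaround under Hive 0.12 (not necessarily Arthur's solution): stage the literal values as an RDD, register it as a temporary table, and use INSERT INTO ... SELECT, which Hive 0.12 does support. The table and column names (my_table, k, v) are made up for illustration.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    case class Record(k: Int, v: String)

    object InsertValuesWorkaround {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("insert-values"))
        val hive = new HiveContext(sc)
        import hive.createSchemaRDD   // RDD[Record] -> SchemaRDD (Spark 1.1 API)

        // The rows we would have liked to write with INSERT INTO ... VALUES
        val values = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
        values.registerTempTable("staging")

        // Hive 0.12 accepts INSERT INTO TABLE ... SELECT
        hive.sql("INSERT INTO TABLE my_table SELECT k, v FROM staging")
      }
    }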

Create table error from Hive in spark-assembly-1.0.2.jar

2014-10-26 Thread Jacob Chacko - Catalyst Consulting
Hi All, We are trying to create a table in Hive from the spark-assembly-1.0.2.jar file: CREATE TABLE IF NOT EXISTS src (key INT, value STRING). JavaSparkContext sc = CC2SparkManager.sharedInstance().getSparkContext(); JavaHiveContext sqlContext = new JavaHiveContext(sc); sqlContext.sql("CREATE
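
For reference, a rough Scala sketch of the same DDL through a HiveContext (the thread above uses the Java API, but the call is equivalent). It assumes a reachable Hive metastore, which is what Cheng Hao's reply later in this digest addresses; note that on the 1.0.x releases Hive DDL generally had to go through hql(...) rather than sql(...).

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("create-table"))
    val hiveContext = new HiveContext(sc)

    // The DDL is executed against the metastore configured in hive-site.xml
    hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")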

Re: scala.collection.mutable.ArrayOps$ofRef$.length$extension since Spark 1.1.0

2014-10-26 Thread Marius Soutier
I tried that already, same exception. I also tried using an accumulator to collect all filenames. The filename is not the problem. Even this crashes with the same exception: sc.parallelize(files.value).map { fileName => println(s"Scanning $fileName") try { println(s"Scanning

Implement Count by Minute in Spark Streaming

2014-10-26 Thread Ji ZHANG
Hi, Suppose I have a stream of logs and I want to count them by minute. The result is like: 2014-10-26 18:38:00 100 2014-10-26 18:39:00 150 2014-10-26 18:40:00 200 One way to do this is to set the batch interval to 1 min, but each batch would be quite large. Or I can use updateStateByKey where
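
A minimal sketch of one approach, assuming the input can be turned into a DStream[(Long, String)] of (timestampMillis, logLine): keep a small batch interval, bucket each record to its minute, and fold the per-batch counts into running totals with updateStateByKey (which requires ssc.checkpoint(...) to be set).

    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (Spark 1.1)
    import org.apache.spark.streaming.dstream.DStream

    def countByMinute(logs: DStream[(Long, String)]): DStream[(Long, Long)] = {
      val perBatch = logs
        .map { case (ts, _) => (ts / 60000 * 60000, 1L) }   // floor timestamp to the minute
        .reduceByKey(_ + _)

      // Merge each batch's partial counts into a running total per minute key
      perBatch.updateStateByKey[Long] { (newCounts: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newCounts.sum)
      }
    }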

Re: Bug in Accumulators...

2014-10-26 Thread octavian.ganea
Sorry, I forgot to say that this gives the above error only when run on a cluster, not in local mode. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-tp17263p17277.html Sent from the Apache Spark User List mailing list archive at

Re: Spark as Relational Database

2014-10-26 Thread Peter Wolf
My understanding is that Spark SQL allows one to access Spark data as if it were stored in a relational database. It compiles SQL queries into a series of calls to the Spark API. I need the performance of a SQL database, but I don't care about doing queries with SQL. I create the input to MLib by
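
A hedged sketch of that pattern: query structured data with Spark SQL and feed the result straight into MLlib, with no external SQL database in between. The table and column names (events, label, f1, f2) are hypothetical, and sc is an existing SparkContext.

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    val sqlContext = new SQLContext(sc)

    // Query the (already registered) table and map each row to a labeled point
    val rows = sqlContext.sql("SELECT label, f1, f2 FROM events WHERE f1 > 0")
    val training = rows.map { r =>
      LabeledPoint(r.getDouble(0), Vectors.dense(r.getDouble(1), r.getDouble(2)))
    }
    // `training` can now be handed to any MLlib algorithm that takes LabeledPoints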

Re: Implement Count by Minute in Spark Streaming

2014-10-26 Thread Asit Parija
Hi, You can use Redis to store each minute key and its count, updating the value whenever you receive an event for that minute; being an in-memory database it would be faster than SQL. You can do an update at the end of each batch to update the count of the key if it exists or create it in case
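
A rough sketch of that suggestion using the Jedis client (an assumption; any Redis client would do). Here minuteCounts is assumed to be a DStream[(Long, Long)] of (minuteMillis, count), for example the output of the count-by-minute sketch above; INCRBY creates the key if it does not exist yet.

    import redis.clients.jedis.Jedis

    minuteCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val jedis = new Jedis("redis-host", 6379)     // hypothetical host/port
        partition.foreach { case (minute, count) =>
          jedis.incrBy(s"count:$minute", count)       // update or create the minute key
        }
        jedis.close()
      }
    }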

Re: Spark as Relational Database

2014-10-26 Thread Rick Richardson
Spark's API definitely covers all of the things that a relational database can do. It will probably outperform a relational star schema if all of your *working* data set can fit into RAM on your cluster. It will still perform quite well if most of the data fits and some has to spill over to disk.

Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-26 Thread Akhil Das
Just tried the code below and it works for me; not sure why sparkContext is being sent inside the mapPartitions function in your case. Can you try a simple map() instead of mapPartitions? val ac = sc.accumulator(0) val or = sc.parallelize(1 to 1) val ps = or.map(x => (x, x+2)).map(x => ac += 1)

Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
@Peter - as Rick said - Spark's main usage is data analysis and not storage. Spark allows you to plugin different storage layers based on your use cases and quality attribute requirements. So in essence if your relational database is meeting your storage requirements you should think about how to

Re: Spark as Relational Database

2014-10-26 Thread Helena Edelson
Hi, It is very easy to integrate Cassandra in a use case such as this. For instance, do your joins in Spark and do your data storage in Cassandra, which allows a very flexible schema, unlike a relational DB, and is much faster, fault tolerant, and with Spark and colocation WRT data
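
A hedged sketch of the pattern Helena describes, assuming the spark-cassandra-connector is on the classpath and the keyspace/tables already exist (all names here are hypothetical; sc is an existing SparkContext): read from Cassandra, join in Spark, write the result back.

    import com.datastax.spark.connector._

    val users  = sc.cassandraTable("ks", "users")
                   .map(r => (r.getInt("user_id"), r.getString("name")))
    val events = sc.cassandraTable("ks", "events")
                   .map(r => (r.getInt("user_id"), r.getString("event")))

    // Do the join in Spark rather than in a relational database
    val joined = users.join(events).map { case (id, (name, event)) => (id, name, event) }

    // Persist the joined result back to Cassandra
    joined.saveToCassandra("ks", "user_events", SomeColumns("user_id", "name", "event"))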

what classes are needed to register in KryoRegistrator, e.g. Row?

2014-10-26 Thread Fengyun RAO
In Tuning Spark https://spark.apache.org/docs/latest/tuning.html, it says, Spark automatically includes Kryo serializers for the *many commonly-used core Scala classes* covered in the AllScalaRegistrar from the Twitter chill https://github.com/twitter/chill library. I looked into the
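
For classes that are not covered by the AllScalaRegistrar (typically your own application classes), a minimal registrator sketch looks like this; MyCaseClass is a stand-in for whatever actually flows through your shuffles.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    case class MyCaseClass(id: Int, value: String)   // stand-in application class

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyCaseClass])
        kryo.register(classOf[Array[MyCaseClass]])
      }
    }

    // Enabled via the Spark configuration:
    //   spark.serializer        org.apache.spark.serializer.KryoSerializer
    //   spark.kryo.registrator  mypackage.MyRegistrator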

Re: Accumulators : Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-10-26 Thread octavian.ganea
Hi Akhil, Please see this related message. http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Accumulators-td17263.html I am curious if this works for you also. -- View this message in context:

How do you use the thrift-server to get data from a Spark program?

2014-10-26 Thread Edward Sargisson
Hi all, This feels like a dumb question but bespeaks my lack of understanding: what is the Spark thrift-server for? Especially if there's an existing Hive installation. Background: We want to use Spark to do some processing starting from files (probably in MapRFS). We want to be able to read the

Spark optimization

2014-10-26 Thread Morbious
I wonder if there is any tool to tweak Spark (worker and master). I have 6 workers (192 GB RAM, 32 CPU cores each) with 2 masters and see only a small difference between Hadoop MapReduce and Spark. I've tested word count on a 50 GB file. During the tests Spark hung on 2 nodes for a few minutes with
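
For what it's worth, most standalone-mode tuning happens through spark-env.sh and spark-defaults.conf rather than a dedicated tool; the values below are purely illustrative, not recommendations for this cluster.

    # conf/spark-env.sh on each worker
    SPARK_WORKER_CORES=32
    SPARK_WORKER_MEMORY=150g

    # conf/spark-defaults.conf on the submitting machine
    spark.executor.memory      20g
    spark.cores.max            96
    spark.default.parallelism  384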

Re: How do you use the thrift-server to get data from a Spark program?

2014-10-26 Thread Michael Armbrust
This is very experimental and mostly unsupported, but you can start the JDBC server from within your own programs https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L45 by passing it the HiveContext. On
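
A sketch of what that looks like, with a hypothetical case class standing in for whatever the program computes; the exact entry point may differ slightly across versions, per the experimental caveat above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    case class Result(id: Int, score: Double)

    object EmbeddedThriftServer {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("embedded-thrift"))
        val hiveContext = new HiveContext(sc)
        import hiveContext.createSchemaRDD

        // Register whatever the Spark program has computed as a temporary table
        val results = sc.parallelize(Seq(Result(1, 0.5), Result(2, 0.9)))
        results.registerTempTable("results")

        // Expose this context over the Thrift JDBC interface
        HiveThriftServer2.startWithContext(hiveContext)
      }
    }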

Spark SQL configuration

2014-10-26 Thread Pagliari, Roberto
I'm a newbie with Spark. After installing it on all the machines I want to use, do I need to tell it about the Hadoop configuration, or will it be able to find it by itself? Thank you,
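
In case it helps: Spark picks up an existing Hadoop installation's settings if HADOOP_CONF_DIR is exported, for example in conf/spark-env.sh (the path below is illustrative).

    # conf/spark-env.sh
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    # export YARN_CONF_DIR=/etc/hadoop/conf    # when running on YARN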

Re: Spark as Relational Database

2014-10-26 Thread Rick Richardson
I agree with Soumya. A relational database is usually the worst kind of database to receive a constant event stream. That said, the best solution is one that already works :) If your system is meeting your needs, then great. When you get so many events that your db can't keep up, I'd look into

Re: Spark LIBLINEAR

2014-10-26 Thread Chih-Jen Lin
Debasish Das writes: If the SVM is not already migrated to BFGS, that's the first thing you should try... Basically, following the LBFGS logistic regression, come up with an LBFGS-based linear SVM... About integrating TRON in MLlib, David already has a version of TRON in breeze but someone
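
As a hedged illustration of what "an LBFGS-based linear SVM" could look like with the existing mllib.optimization primitives (mirroring how the LBFGS logistic regression is assembled), not the MLlib or LIBLINEAR implementation under discussion:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.optimization.{HingeGradient, LBFGS, SquaredL2Updater}
    import org.apache.spark.rdd.RDD

    // data: (label in {0, 1}, feature vector); returns the learned weight vector
    def trainLinearSvm(data: RDD[(Double, Vector)], numFeatures: Int): Vector = {
      val optimizer = new LBFGS(new HingeGradient(), new SquaredL2Updater())
        .setRegParam(0.01)
        .setNumIterations(100)
      optimizer.optimize(data, Vectors.dense(new Array[Double](numFeatures)))
    }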

Spark 1.1.0 ClassNotFoundException issue when submit with multi jars using CLUSTER MODE

2014-10-26 Thread xing_bing
Hi, I am using Spark 1.1.0 configured with the STANDALONE cluster manager and CLUSTER deploy mode. The logic is: I want to submit multiple jars with spark-submit using the --jars option, but I got a ClassNotFoundException. By the way, in my code I also use the thread context class loader to load
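
For reference, a hedged example of the submission being described (class name, paths and master URL are placeholders). --jars takes a comma-separated list; in standalone cluster mode the driver runs on a worker node, so jar paths that only exist on the submitting machine are a common source of ClassNotFoundException.

    ./bin/spark-submit \
      --class com.example.Main \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --jars /path/to/libA.jar,/path/to/libB.jar \
      /path/to/app.jar arg1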

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread guoxu1231
Any update? I encountered the same issue in my environment. Here are my steps as usual: git clone https://github.com/apache/spark; mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package. It builds successfully with Maven. Import into IDEA as a Maven project, click Build > Make

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Stephen Boesch
Yes it is necessary to do a mvn clean when encountering this issue. Typically you would have changed one or more of the profiles/options - which leads to this occurring. 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I started building Spark / running Spark tests this

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Ryan Williams
I heard from one person offline who regularly builds Spark on OSX and Linux and they felt like they only ever saw this error on OSX; if anyone can confirm whether they've seen it on Linux, that would be good to know. Stephen: good to know re: profiles/options. I don't think changing them is a

Re: Setting only master heap

2014-10-26 Thread Keith Simmons
Hi Guys, Here are some lines from the log file before the OOM. They don't look that helpful, so let me know if there's anything else I should be sending. I am running in standalone mode. spark-pulse-org.apache.spark.deploy.master.Master-1-hadoop10.pulse.io.out.5:java.lang.OutOfMemoryError: Java
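
If it turns out the master simply needs more heap: each daemon reads its own local conf/spark-env.sh, so a setting like the one below on the master host only raises only the master's heap (values illustrative).

    # conf/spark-env.sh on the master host
    SPARK_DAEMON_MEMORY=4g
    # Optional extra JVM flags for the daemons, e.g. to capture a heap dump on OOM
    SPARK_DAEMON_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError"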

Spark SQL Exists Clause

2014-10-26 Thread agg212
Hey, I'm trying to run TPC-H Query 4 (shown below) and get the following error: Exception in thread main java.lang.RuntimeException: [11.25] failure: ``UNION'' expected but `select' found. It seems like Spark SQL doesn't support the EXISTS clause. Is this true? select o_orderpriority,
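
The usual workaround when the EXISTS subquery is not accepted is to express it as a LEFT SEMI JOIN and run the query through a HiveContext (the error above looks like it comes from the basic SQLContext parser). A hedged rewrite of Q4, assuming the standard TPC-H schema and the stock Q4 date range, since the original query text is truncated above:

    SELECT o_orderpriority, COUNT(*) AS order_count
    FROM orders
    LEFT SEMI JOIN lineitem
      ON l_orderkey = o_orderkey AND l_commitdate < l_receiptdate
    WHERE o_orderdate >= '1993-07-01'
      AND o_orderdate < '1993-10-01'
    GROUP BY o_orderpriority
    ORDER BY o_orderpriority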

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Stephen Boesch
I see the errors regularly on linux under the conditions of having changed profiles. 2014-10-26 20:49 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I heard from one person offline who regularly builds Spark on OSX and Linux and they felt like they only ever saw this error on OSX; if

Re: Spark as Relational Database

2014-10-26 Thread Michael Hausenblas
Given that you are storing event data (which is basically things that have happened in the past AND cannot be modified) you should definitely look at Event sourcing. http://martinfowler.com/eaaDev/EventSourcing.html Agreed. In this context: a lesser known fact is that the Lambda

RE: Create table error from Hive in spark-assembly-1.0.2.jar

2014-10-26 Thread Cheng, Hao
Can you paste your hive-site.xml? Most of the time when I meet this exception, it is because the JDBC driver for the Hive metastore is not set correctly or the wrong driver classes are included in the assembly jar. By default, the assembly jar contains derby.jar, which is the embedded Derby JDBC driver. From:
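
For reference, an illustrative hive-site.xml for a MySQL-backed metastore (host, database and credentials are placeholders); the matching JDBC driver jar also has to be on the classpath instead of relying on the embedded Derby driver bundled in the assembly:

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://metastore-host:3306/hive?createDatabaseIfNotExist=true</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
      </property>
    </configuration>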

Re: RDD to DStream

2014-10-26 Thread Jianshi Huang
I have a similar requirement. But instead of grouping it by chunkSize, I would have the timeStamp be part of the data. So the function I want has the following signature: // RDD of (timestamp, value) def rddToDStream[T](data: RDD[(Long, T)], timeWindow: Long)(implicit ssc: StreamingContext):
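
A minimal sketch of one way such a function could look (not Jianshi's implementation): bucket the RDD by time window, then replay the per-window RDDs through queueStream. It assumes the set of distinct window keys is small enough to collect to the driver.

    import scala.collection.mutable
    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext._          // pair RDD operations (Spark 1.1)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    def rddToDStream[T: ClassTag](data: RDD[(Long, T)], timeWindow: Long)
                                 (implicit ssc: StreamingContext): DStream[T] = {
      val keyed   = data.map { case (ts, v) => (ts / timeWindow, v) }
      val windows = keyed.keys.distinct().collect().sorted
      val queue   = new mutable.Queue[RDD[T]]()
      windows.foreach { w => queue += keyed.filter(_._1 == w).values }
      // One queued RDD is emitted per batch, preserving time order
      ssc.queueStream(queue, oneAtATime = true)
    }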