Use of nscala-time within spark-shell

2015-02-16 Thread Hammam CHAMSI
Hi All, Thanks in advance for your help. I have a timestamp which I need to convert to a datetime using Scala. A folder contains the three needed jar files: joda-convert-1.5.jar joda-time-2.4.jar nscala-time_2.11-1.8.0.jar Using the Scala REPL and adding the jars: scala -classpath *.jar I can use
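
For reference, a minimal sketch of the intended usage, assuming a Scala 2.10 build of nscala-time is put on the spark-shell classpath (the reply further down points out the 2.11/2.10 mismatch in the jars listed above):

  // started e.g. with: spark-shell --jars joda-time-2.4.jar,joda-convert-1.5.jar,nscala-time_2.10-1.8.0.jar
  import com.github.nscala_time.time.Imports._

  val ts = 1424080800000L                        // example epoch milliseconds
  val dt = new DateTime(ts)                      // DateTime comes from the nscala-time imports
  val formatted = dt.toString("yyyy-MM-dd HH:mm:ss")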

How to retrieve the value from sql.row by column name

2015-02-16 Thread Eric Bell
Is it possible to reference a column from a SchemaRDD using the column's name instead of its number? For example, let's say I've created a SchemaRDD from an avro file: val sqlContext = new SQLContext(sc) import sqlContext._ val

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Corey Nolet
We've been using Commons Configuration to pull our properties out of properties files and system properties (prioritizing system properties over the others). We add those properties to our Spark conf explicitly, and we use ArgoPartser to get the command-line argument for which property file to load.
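
A rough sketch of that pattern, assuming Apache Commons Configuration 1.x (the file name and details here are illustrative, not the actual code from the thread):

  import scala.collection.JavaConverters._
  import org.apache.commons.configuration.{CompositeConfiguration, PropertiesConfiguration, SystemConfiguration}
  import org.apache.spark.SparkConf

  // System properties are added first, so they win over the file-based values.
  val config = new CompositeConfiguration()
  config.addConfiguration(new SystemConfiguration())
  config.addConfiguration(new PropertiesConfiguration("job.properties"))  // hypothetical file name

  // Copy everything into the SparkConf explicitly, as described above.
  val sparkConf = new SparkConf()
  config.getKeys.asScala.foreach { key =>
    sparkConf.set(key, config.getString(key))
  }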

Re: How to retrieve the value from sql.row by column name

2015-02-16 Thread Eric Bell
I am just learning Scala, so I don't actually understand what your code snippet is doing, but thank you; I will learn more so I can figure it out. I am new to all of this and still trying to make the mental shift from normal programming to distributed programming, but it seems to me that the

Re: Array in broadcast can't be serialized

2015-02-16 Thread Ted Yu
Is it possible to port WrappedArraySerializer.scala to your app ? Pardon me for not knowing how to integrate Chill with Spark. Cheers On Mon, Feb 16, 2015 at 12:31 AM, Tao Xiao xiaotao.cs@gmail.com wrote: Thanks Ted After searching for a whole day, I still don't know how to let spark

Spark newbie desires feedback on first program

2015-02-16 Thread Eric Bell
I'm a Spark newbie working on his first attempt to write an ETL program. I could use some feedback to make sure I'm on the right path. I've written a basic proof of concept that runs without errors and seems to work, although I might be missing some issues when this is actually run on more

Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Mohamed Lrhazi
Hello all, Trying the example code from this package (https://github.com/Parsely/pyspark-cassandra), I always get this error... Can you see what I am doing wrong? From googling around it seems that the jar is not found somehow... The Spark log shows the JAR was processed at least.

Re: Spark Streaming and SQL checkpoint error: (java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf)

2015-02-16 Thread Michael Armbrust
You probably want to mark the HiveContext as @transient as it's not valid to use it on the slaves anyway. On Mon, Feb 16, 2015 at 1:58 AM, Haopu Wang hw...@qilinsoft.com wrote: I have a streaming application which registered temp table on a HiveContext for each batch duration. The
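
A minimal sketch of that suggestion, assuming the HiveContext lives in a class whose instances end up inside serialized closures or checkpointed state:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.hive.HiveContext

  class BatchHandler(@transient val sc: SparkContext) extends Serializable {
    // @transient + lazy: never shipped with the closure; only ever used on the driver.
    @transient lazy val hiveContext = new HiveContext(sc)
  }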

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Tathagata Das
Correct, brute force clean up is not useful. Since Spark 1.0, Spark can do automatic cleanup of files based on which RDDs are used/garbage collected by JVM. That would be the best way, but depends on the JVM GC characteristics. If you force a GC periodically in the driver that might help you get

Re: How to retrieve the value from sql.row by column name

2015-02-16 Thread Michael Armbrust
For efficiency the row objects don't contain the schema so you can't get the column by name directly. I usually do a select followed by pattern matching. Something like the following: caper.select('ran_id).map { case Row(ranId: String) => } On Mon, Feb 16, 2015 at 8:54 AM, Eric Bell

Re: How to retrieve the value from sql.row by column name

2015-02-16 Thread Michael Armbrust
I can unpack the code snippet a bit: caper.select('ran_id) is the same as saying SELECT ran_id FROM table in SQL. It's always a good idea to explicitly request the columns you need right before using them. That way you are tolerant of any changes to the schema that might happen upstream. The
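
Spelled out slightly (a sketch along the lines of Michael's snippet; it assumes import sqlContext._ is in scope, as in the original question):

  import org.apache.spark.sql.Row

  // Select only the column you need, then pattern match on the resulting rows.
  val ranIds = caper.select('ran_id).map { case Row(ranId: String) => ranId }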

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Mohamed Lrhazi
Yes, I am sure the system can't find the jar.. but how do I fix that... my submit command includes the jar: /spark/bin/spark-submit --py-files /spark/pyspark_cassandra.py --jars /spark/pyspark-cassandra-0.1-SNAPSHOT.jar --driver-class-path /spark/pyspark-cassandra-0.1-SNAPSHOT.jar /spark/test2.py

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
I cannot comment about the correctness of Python code. I will assume your caper_kv is keyed on something that uniquely identifies all the rows that make up the person's record so your group by key makes sense, as does the map. (I will also assume all of the rows that comprise a single person's

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Charles Feduke
My first problem was somewhat similar to yours. You won't find a whole lot of JDBC to Spark examples since I think a lot of the adoption for Spark is from teams already experienced with Hadoop and already have an established big data solution (so their data is already extracted from whatever

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
It also needs the Cassandra jar: com.datastax.spark.connector.CassandraJavaUtil. Is it included in /spark/pyspark-cassandra-0.1-SNAPSHOT.jar? On Mon, Feb 16, 2015 at 1:20 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Yes, am sure the system cant find the jar.. but how do I fix

Re: Spark newbie desires feedback on first program

2015-02-16 Thread Eric Bell
Thanks Charles. I just realized a few minutes ago that I neglected to show the step where I generated the key on the person ID. Thanks for the pointer on the HDFS URL. The next step is to process data from multiple RDDs. My data originates from 7 tables in a MySQL database. I used sqoop to create

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Mohamed Lrhazi
Oh, I don't know. thanks a lot Davies, gonna figure that out now On Mon, Feb 16, 2015 at 5:31 PM, Davies Liu dav...@databricks.com wrote: It also need the Cassandra jar: com.datastax.spark.connector.CassandraJavaUtil Is it included in /spark/pyspark-cassandra-0.1-SNAPSHOT.jar ? On

OOM error

2015-02-16 Thread Harshvardhan Chauhan
Hi All, I need some help with Out Of Memory errors in my application. I am using Spark 1.1.0 and my application is using the Java API. I am running my app on EC2 with 25 m3.xlarge (4 cores, 15GB memory) instances. The app only fails sometimes. Lots of mapToPair tasks are failing. My app is configured to

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
It seems that the jar for Cassandra is not loaded; you should have them in the classpath. On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: Hello all, Trying the example code from this package (https://github.com/Parsely/pyspark-cassandra) , I always get

Unable to broadcast dimension tables with Spark SQL

2015-02-16 Thread Sunita Arvind
Hi Experts, I have a large table with 54 million records (fact table) being joined with 6 small tables (dimension tables). The size on disk of the small tables is within 5 KB and the record counts are in the range of 4 - 200. All the worker nodes have 32 GB of RAM allocated for Spark. I have tried the
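
One knob worth checking in this situation (a sketch, not a guaranteed fix, assuming a SQLContext or HiveContext named sqlContext): Spark SQL only broadcasts a table automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold, and size estimates for some sources can be missing or too large:

  // Raise the automatic broadcast threshold (in bytes); 10 MB here is just an illustrative value.
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)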

Re: Re: Question about spark streaming+Flume

2015-02-16 Thread bit1...@163.com
Hi Arush, With your code, I still didn't see the output "Received X flume events".. bit1...@163.com From: bit1...@163.com Date: 2015-02-17 14:08 To: Arush Kharbanda CC: user Subject: Re: Re: Question about spark streaming+Flume Ok, you are missing a letter in foreachRDD.. let me proceed..

Identify the performance bottleneck from a hardware perspective

2015-02-16 Thread jalafate
Hi there, I am trying to scale up the data size that my application is handling. This application is running on a cluster with 16 slave nodes. Each slave node has 60GB of memory. It is running in standalone mode. The data is coming from HDFS, which is also in the same local network. In order to have an

Re: How to retrieve the value from sql.row by column name

2015-02-16 Thread Reynold Xin
BTW we merged this today: https://github.com/apache/spark/pull/4640 This should allow us in the future to address column by name in a Row. On Mon, Feb 16, 2015 at 11:39 AM, Michael Armbrust mich...@databricks.com wrote: I can unpack the code snippet a bit: caper.select('ran_id) is the same
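
For reference, a sketch of the shape this took in later releases (name-based access on Row; not available at the time of this thread):

  // given a row: Row coming out of a query that carries its schema
  val ranId = row.getAs[String]("ran_id")
  val idx = row.fieldIndex("ran_id")    // or look the index up once and reuse it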

PySpark and Cassandra

2015-02-16 Thread Rumph, Frens Jan
Hi, I'm trying to connect to Cassandra through PySpark using the spark-cassandra-connector from datastax based on the work of Mike Sukmanowsky. I can use Spark and Cassandra through the datastax connector in Scala just fine. Where things fail in PySpark is that an exception is raised in

Re: OOM error

2015-02-16 Thread Akhil Das
Increase your executor memory. Also, you can play around with increasing the number of partitions/parallelism, etc. Thanks Best Regards On Tue, Feb 17, 2015 at 3:39 AM, Harshvardhan Chauhan ha...@gumgum.com wrote: Hi All, I need some help with Out Of Memory errors in my application. I am
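
A sketch of the kind of settings being suggested (the numbers are placeholders to tune against the 15 GB m3.xlarge nodes, not recommendations):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.memory", "10g")       // more heap per executor
    .set("spark.default.parallelism", "400")   // more partitions for shuffles

  // ...or widen a specific stage before the heavy mapToPair step:
  // val wider = rdd.repartition(400)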

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Davies Liu
For the last question, you can trigger GC in the JVM from Python by: sc._jvm.System.gc() On Mon, Feb 16, 2015 at 4:08 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: thanks, that looks promising but can't find any reference giving me more details - can you please point me to something? Also

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Mohamed Lrhazi
Will do. Thanks a lot. On Mon, Feb 16, 2015 at 7:20 PM, Davies Liu dav...@databricks.com wrote: Can you try the example in pyspark-cassandra? If not, you could create an issue there. On Mon, Feb 16, 2015 at 4:07 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: So I tried building

Re: Use of nscala-time within spark-shell

2015-02-16 Thread Kevin (Sangwoo) Kim
What Scala version was used to build your Spark? It seems your nscala-time library's Scala version is 2.11, while the default Spark Scala version is 2.10. On Tue Feb 17 2015 at 1:51:47 AM Hammam CHAMSI hscha...@hotmail.com wrote: Hi All, Thanks in advance for your help. I have timestamp which I need
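
A sketch of the matching sbt dependency; %% appends the project's Scala binary version, so building against 2.10 pulls nscala-time_2.10 rather than the _2.11 jar listed in the original post:

  // build.sbt (sketch)
  scalaVersion := "2.10.4"

  libraryDependencies += "com.github.nscala-time" %% "nscala-time" % "1.8.0"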

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Mohamed Lrhazi
So I tried building the connector from: https://github.com/datastax/spark-cassandra-connector which seems to include the Java class referenced in the error message: [root@devzero spark]# unzip -l

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Antony Mayi
thanks, that looks promising but I can't find any reference giving me more details - can you please point me to something? Also, is it possible to force GC from pyspark (as I am using pyspark)? thanks, Antony. On Monday, 16 February 2015, 21:05, Tathagata Das tathagata.das1...@gmail.com

Re: Shuffle on joining two RDDs

2015-02-16 Thread Davies Liu
This will be fixed by https://github.com/apache/spark/pull/4629 On Fri, Feb 13, 2015 at 10:43 AM, Imran Rashid iras...@cloudera.com wrote: yeah I thought the same thing at first too, I suggested something equivalent w/ preservesPartitioning = true, but that isn't enough. the join is done by

Re: Problem getting pyspark-cassandra and pyspark working

2015-02-16 Thread Davies Liu
Can you try the example in pyspark-cassandra? If not, you could create an issue there. On Mon, Feb 16, 2015 at 4:07 PM, Mohamed Lrhazi mohamed.lrh...@georgetown.edu wrote: So I tried building the connector from: https://github.com/datastax/spark-cassandra-connector which seems to include the

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
I've managed to solve this, but I still don't know exactly why my solution works: In my code I was trying to force Spark to output via: jsonIn.print(); jsonIn being a JavaDStream<String>. When I removed the code above, and added the code below to force the output operation, hence the

Re: hive-thriftserver maven artifact

2015-02-16 Thread Ted Yu
I searched for 'spark-hive-thriftserver_2.10' on this page: http://mvnrepository.com/artifact/org.apache.spark Looks like it is not published. On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote: Hi, I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive

Re: Extract hour from Timestamp in Spark SQL

2015-02-16 Thread Wush Wu
Dear Cheng Hao, You are right! After using the HiveContext, the issue is solved. Thanks, Wush 2015-02-15 10:42 GMT+08:00 Cheng, Hao hao.ch...@intel.com: Are you using the SQLContext? I think the HiveContext is recommended. Cheng Hao *From:* Wush Wu [mailto:w...@bridgewell.com]

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
Sean, In this case, I've been testing the code on my local machine and using Spark locally, so all the log output was available on my terminal. And I've used the .print() method to have an output operation, just to force Spark to execute. And I was not using foreachRDD, I was only using print()

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Sean Owen
How are you deciding whether files are processed or not? It doesn't seem possible from this code. Maybe it just seems so. On Feb 16, 2015 12:51 PM, Emre Sevinc emre.sev...@gmail.com wrote: I've managed to solve this, but I still don't know exactly why my solution works: In my code I was

hive-thriftserver maven artifact

2015-02-16 Thread Marco
Hi, I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the artifact in a public repository? I have not found it at Maven Central. Thanks, Marco

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
Hello Sean, I did not understand your question very well, but what I do is checking the output directory (and I have various logger outputs at various stages showing the contents of an input file being processed, the response from the web service, etc.). By the way, I've already solved my

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Sean Owen
Materialization shouldn't be relevant. The collect by itself doesn't let you detect whether it happened. Print should print some results to the console but on different machines, so may not be a reliable way to see what happened. Yes I understand your real process uses foreachRDD and that's what

Re: Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Akhil Das
Instead of print you should do jsonIn.count().print(). A straightforward approach is to use foreachRDD :) Thanks Best Regards On Mon, Feb 16, 2015 at 6:48 PM, Emre Sevinc emre.sev...@gmail.com wrote: Hello Sean, I did not understand your question very well, but what I do is checking the
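
A sketch of both suggestions (Scala here for brevity; the thread's application is Java):

  jsonIn.count().print()   // a cheap output operation that forces each batch to be computed

  jsonIn.foreachRDD { rdd =>
    // Any action inside foreachRDD triggers the processing of that batch.
    rdd.foreach(record => println(record))
  }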

Question about spark streaming+Flume

2015-02-16 Thread bit1...@163.com
Hi, I am trying the Spark Streaming + Flume example: 1. Code object SparkFlumeNGExample { def main(args : Array[String]) { val conf = new SparkConf().setAppName("SparkFlumeNGExample") val ssc = new StreamingContext(conf, Seconds(10)) val lines =

Identify the performance bottleneck from a hardware perspective

2015-02-16 Thread Julaiti Alafate
Hi there, I am trying to scale up the data size that my application is handling. This application is running on a cluster with 16 slave nodes. Each slave node has 60GB of memory. It is running in standalone mode. The data is coming from HDFS, which is also in the same local network. In order to have an

Re: Question about spark streaming+Flume

2015-02-16 Thread Arush Kharbanda
Hi Can you try this val lines = FlumeUtils.createStream(ssc,"localhost",) // Print out the count of events received from this server in each batch lines.count().map(cnt => "Received " + cnt + " flume events. at " + System.currentTimeMillis() ) lines.forechRDD(_.foreach(println)) Thanks

Re: Re: Question about spark streaming+Flume

2015-02-16 Thread bit1...@163.com
Ok, you are missing a letter in foreachRDD.. let me proceed.. bit1...@163.com From: Arush Kharbanda Date: 2015-02-17 14:31 To: bit1...@163.com CC: user Subject: Re: Question about spark streaming+Flume Hi Can you try this val lines = FlumeUtils.createStream(ssc,localhost,) // Print

Re: Re: Question about spark streaming+Flume

2015-02-16 Thread bit1...@163.com
Thanks Arush.. With your code, a compile error occurs: Error:(19, 11) value forechRDD is not a member of org.apache.spark.streaming.dstream.ReceiverInputDStream[org.apache.spark.streaming.flume.SparkFlumeEvent] lines.forechRDD(_.foreach(println)) ^ From: Arush Kharbanda Date: 2015-02-17
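
With the missing letter and the stripped string quotes restored, Arush's suggestion reads roughly as follows (the host and port are placeholders for wherever the Flume sink pushes events):

  import org.apache.spark.streaming.flume.FlumeUtils

  val lines = FlumeUtils.createStream(ssc, "localhost", port)   // port: placeholder

  // Print the count of events received from the server in each batch.
  lines.count()
    .map(cnt => "Received " + cnt + " flume events at " + System.currentTimeMillis())
    .print()

  lines.foreachRDD(_.foreach(println))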

java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

2015-02-16 Thread siqi chen
Hello, I have a simple Kafka Spark Streaming example which I am still developing in standalone mode. Here is what is puzzling me: if I build the assembly jar and use bin/spark-submit to run it, it works fine. But if I want to run the code from within the IntelliJ IDE, then it will cry for this
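
A common cause, though it is an assumption here since the build file is not shown: the Spark dependency is in provided scope, which works for spark-submit against an installed Spark but leaves SparkConf off the IDE's run classpath:

  // build.sbt (sketch)
  // "provided" keeps spark-core out of the assembly, but also out of an IntelliJ run configuration;
  // either drop it while developing locally or configure the IDE to include provided dependencies.
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"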

Re: hive-thriftserver maven artifact

2015-02-16 Thread Arush Kharbanda
You can build your own Spark with the option -Phive-thriftserver. You can publish the jars locally. I hope that would solve your problem. On Mon, Feb 16, 2015 at 8:54 PM, Marco marco@gmail.com wrote: Ok, so will it be only available for the next version (1.3.0)? 2015-02-16 15:24 GMT+01:00 Ted

MLib usage on Spark Streaming

2015-02-16 Thread Spico Florin
Hello! I'm a newbie to Spark and I have the following case study: 1. A client sends the following data every 100 ms: {uniqueId, timestamp, measure1, measure2} 2. Every 30 seconds I would like to correlate the data collected in the window with some predefined double vector pattern for each given
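
A rough sketch of the shape this can take, assuming the messages are already parsed into a DStream of records (correlating the two measures stands in here for the predefined-pattern correlation; Statistics.corr is from MLlib):

  import org.apache.spark.mllib.stat.Statistics
  import org.apache.spark.streaming.Seconds

  case class Measurement(uniqueId: String, timestamp: Long, measure1: Double, measure2: Double)

  // Hypothetical: `measurements` is a DStream[Measurement] built from the incoming messages.
  measurements.window(Seconds(30), Seconds(30)).foreachRDD { rdd =>
    if (rdd.count() > 0) {
      val corr = Statistics.corr(rdd.map(_.measure1), rdd.map(_.measure2), "pearson")
      println(s"correlation over the last 30s window: $corr")
    }
  }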

Re: hive-thriftserver maven artifact

2015-02-16 Thread Marco
Ok, so will it only be available in the next version (1.3.0)? 2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com: I searched for 'spark-hive-thriftserver_2.10' on this page: http://mvnrepository.com/artifact/org.apache.spark Looks like it is not published. On Mon, Feb 16, 2015 at 5:44

Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Emre Sevinc
Hello, I'm using Spark 1.2.1 and have a module.properties file, and in it I have non-Spark properties, as well as Spark properties, e.g.: job.output.dir=file:///home/emre/data/mymodule/out I'm trying to pass it to spark-submit via: spark-submit --class com.myModule --master local[4]

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Sean Owen
Since SparkConf is only for Spark properties, I think it will in general only pay attention to and preserve spark.* properties. You could experiment with that. In general I wouldn't rely on Spark mechanisms for your configuration, and you can use any config mechanism you like to retain your own

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Emre Sevinc
Sean, I'm trying this as an alternative to what I currently do. Currently I have my module.properties file for my module in the resources directory, and that file is put inside the über JAR file when I build my application with Maven, and then when I submit it using spark-submit, I can read that

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Charles Feduke
I haven't actually tried mixing non-Spark settings into the Spark properties. Instead I package my properties into the jar and use the Typesafe Config[1] - v1.2.1 - library (along with Ficus[2] - Scala specific) to get at my properties: Properties file: src/main/resources/integration.conf (below
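
A sketch of that approach with Typesafe Config alone (Ficus only adds nicer Scala syntax on top); the key name mirrors the property from the original question and is illustrative:

  import com.typesafe.config.ConfigFactory

  // Loads integration.conf from the classpath (falling back to reference.conf defaults).
  val config = ConfigFactory.load("integration")
  val outputDir = config.getString("job.output.dir")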

Re: Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde

2015-02-16 Thread Cheng Lian
Hi Jianshi, When accessing a Hive table with Parquet SerDe, Spark SQL tries to convert it into Spark SQL's native Parquet support for better performance. And yes, predicate push-down, column pruning are applied here. In 1.3.0, we'll also cover the write path except for writing partitioned table.

Which OutputCommitter to use for S3?

2015-02-16 Thread Mingyu Kim
Hi all, The default OutputCommitter used by RDDs, which is FileOutputCommitter, seems to require moving files at the commit step, which is not a constant operation in S3, as discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E. People
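
A commonly used workaround, sketched here with a hypothetical DirectOutputCommitter class (nothing like it ships with Spark itself): point the old Hadoop API at a committer that writes task output straight to its final S3 location so the commit step has nothing to rename:

  // JobConf.setOutputCommitter stores the committer class under this key for the old mapred API.
  sc.hadoopConfiguration.set("mapred.output.committer.class",
    "com.example.hadoop.DirectOutputCommitter")   // hypothetical class name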

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Antony Mayi
spark.cleaner.ttl is not the right way - it seems to be really designed for streaming. Although it keeps the disk usage under control, it also causes loss of RDDs and broadcasts that are required later, leading to a crash. Is there any other way? thanks, Antony. On Sunday, 15 February 2015, 21:42,

Re: Array in broadcast can't be serialized

2015-02-16 Thread Tao Xiao
Thanks Ted. After searching for a whole day, I still don't know how to let Spark use Twitter chill serialization - there are very few documents about how to integrate Twitter chill into Spark for serialization. I tried the following, but an exception of java.lang.ClassCastException:
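
For reference, the usual way to plug a custom Kryo setup into Spark; Spark's KryoSerializer is itself built on chill, so a ported WrappedArray serializer would be registered the same way (class names here are illustrative):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      // Register the classes carried by the broadcast (and any custom serializers for them).
      kryo.register(classOf[Array[String]])
    }
  }

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "my.package.MyRegistrator")  // fully-qualified name in a real app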

Spark Streaming and SQL checkpoint error: (java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf)

2015-02-16 Thread Haopu Wang
I have a streaming application which registered temp table on a HiveContext for each batch duration. The application runs well in Spark 1.1.0. But I get below error from 1.1.1. Do you have any suggestions to resolve it? Thank you! java.io.NotSerializableException:

Re: Writing to HDFS from spark Streaming

2015-02-16 Thread Sean Owen
PS this is the real fix to this issue: https://issues.apache.org/jira/browse/SPARK-5795 I'd like to merge it as I don't think it breaks the API; it actually fixes it to work as intended. On Mon, Feb 16, 2015 at 3:25 AM, Bahubali Jain bahub...@gmail.com wrote: I used the latest assembly jar and

Magic number 16: Why doesn't Spark Streaming process more than 16 files?

2015-02-16 Thread Emre Sevinc
Hello, I have an application in Java that uses Spark Streaming 1.2.1 in the following manner: - Listen to the input directory. - If a new file is copied to that input directory, process it. - Process: contact a RESTful web service (also running locally and responsive), send the contents of the