Spark on Hadoop with Java 8

2014-08-27 Thread jatinpreet
Hi, I am contemplating the use of Hadoop with Java 8 in a production system. I will be using Apache Spark for doing most of the computations on data stored in HBase. Although Hadoop seems to support JDK 8 with some tweaks, the official HBase site states the following for version 0.98, Running

Re: Upgrading 1.0.0 to 1.0.2

2014-08-27 Thread Victor Tso-Guillen
Ah, thanks. On Tue, Aug 26, 2014 at 7:32 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Victor, the issue with having different versions in the driver and cluster is that the master will shut down your application due to the inconsistent SerialVersionID in ExecutorState Best, -- Nan

RE: What is a Block Manager?

2014-08-27 Thread Liu, Raymond
The framework has this info to manage cluster status, and it (e.g. worker number) is also available through the Spark metrics system. From the user application's point of view, though, can you give an example of why you need this info and what you plan to do with it? Best Regards,

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Antonio Jesus Navarro
Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at

Is there a way to insert data into existing parquet file using spark ?

2014-08-27 Thread rafeeq s
Hi, *Is there a way to insert data into an existing parquet file using spark?* I am using Spark Streaming and Spark SQL to store real-time data into parquet files and then query it using Impala. Spark creates multiple subdirectories of parquet files, which makes loading it a challenge

Re: Spark Streaming Output to DB

2014-08-27 Thread Akhil Das
Like Mayur said, it's better to use mapPartition instead of map. Here's a piece of code which reads a text file and inserts each row into the database. I haven't tested it; it might throw up some serialization errors, in that case you gotta serialize them! JavaRDD<String>
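
For readers following along, here is a minimal Scala sketch of that per-partition pattern (using foreachPartition, the action counterpart of mapPartitions). The JDBC URL, table name, and input path are made up for illustration; it assumes the JDBC driver jar is on the executor classpath and that sc comes from the spark-shell:

    import java.sql.DriverManager

    // Hypothetical connection string and table; adjust for your database.
    val jdbcUrl = "jdbc:mysql://db-host:3306/mydb?user=spark&password=secret"

    val lines = sc.textFile("hdfs:///data/input.txt")
    lines.foreachPartition { partition =>
      // One connection per partition, so the connection itself is never serialized.
      val conn = DriverManager.getConnection(jdbcUrl)
      val stmt = conn.prepareStatement("INSERT INTO lines (line) VALUES (?)")
      try {
        partition.foreach { line =>
          stmt.setString(1, line)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }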

Developing a spark streaming application

2014-08-27 Thread Filip Andrei
Hey guys, so the problem I'm trying to tackle is the following: - I need a data source that emits messages at a certain frequency - There are N neural nets that need to process each message individually - The outputs from all neural nets are aggregated and only when all N outputs for each message

hive on spark yarn

2014-08-27 Thread centerqi hu
Hi all, When I run a simple SQL query, I encountered the following error. hive: 0.12 (metastore in mysql), hadoop 2.4.1, spark 1.0.2 built with hive. My hql code: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ import org.apache.spark.sql.hive.LocalHiveContext object

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-27 Thread BertrandR
Thank you for your answers, and sorry for my lack of understanding. So I tried what you suggested, with/without unpersisting and with .cache() (also persist(StorageLevel.MEMORY_AND_DISK) but this is not allowed for msg because you can't change the Storage level apparently) for msg, g and newVerts,

Example File not running

2014-08-27 Thread Hingorani, Vineet
Hello all, I am able to use Spark in the shell but I am not able to run a spark file. I am using sbt and the jar is created but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. I installed a prebuilt version of spark and

Replicate RDDs

2014-08-27 Thread rapelly kartheek
Hi, I have a three-node Spark cluster. I restricted the resources per application by setting appropriate parameters and I could run two applications simultaneously. Now, I want to replicate an RDD and run two applications simultaneously. Can someone help me figure out how to go about doing this? I replicated

Re: Example File not running

2014-08-27 Thread Akhil Das
The statement java.io.IOException: Could not locate executable null\bin\winutils.exe explains that the null is received when expanding or replacing an Environment Variable. I'm guessing that you are missing *HADOOP_HOME* in the environment variables. Thanks Best Regards On Wed, Aug 27, 2014
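
If installing Hadoop is not an option, one commonly used local-mode workaround is to point the hadoop.home.dir system property (the programmatic equivalent of HADOOP_HOME) at a folder containing bin\winutils.exe before the SparkContext is created. A sketch, with the path purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalOnWindows {
      def main(args: Array[String]) {
        // Hypothetical folder that contains bin\winutils.exe; Hadoop's Shell
        // utility reads this property when HADOOP_HOME is not set.
        System.setProperty("hadoop.home.dir", "C:\\hadoop")

        val conf = new SparkConf().setAppName("Local Test").setMaster("local[*]")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 10).count())
        sc.stop()
      }
    }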

RE: Installation On Windows machine

2014-08-27 Thread Mishra, Abhishek
Thank you for the reply Matei. Is there something which we missed? I am able to run the Spark instance on my local system, i.e. Windows 7, but the same set of steps does not allow me to run it on a Windows Server 2012 machine. The black screen just appears for a fraction of a second and disappears,

RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
What should I put as the value of that environment variable? I want to run the scripts locally on my machine and do not have any Hadoop installed. Thank you From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Wednesday, 27 August 2014 12:54 To: Hingorani, Vineet Cc: user@spark.apache.org

External dependencies management with spark

2014-08-27 Thread Jaonary Rabarisoa
Dear all, I'm looking for an efficient way to manage external dependencies. I know that one can add .jar or .py dependencies easily, but how can I handle other types of dependencies? Specifically, I have some data processing algorithms implemented in other languages (ruby, octave, matlab, c++) and

Re: Example File not running

2014-08-27 Thread Akhil Das
It should point to your Hadoop installation directory (like C:\hadoop\). Since you don't have Hadoop installed, what is the code that you are running? Thanks Best Regards On Wed, Aug 27, 2014 at 4:50 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: What should I put as the value of that

Re: spark and matlab

2014-08-27 Thread Jaonary Rabarisoa
Thank you Matei. I found a solution using pipe and a matlab engine (an executable that can call matlab behind the scenes and uses stdin and stdout to communicate). I just need to fix two other issues: - how can I handle my dependencies? My matlab script needs other matlab files that need to be
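
A rough sketch of the pipe-based approach being discussed, assuming a hypothetical wrapper script run_matlab.sh that is installed at the same path on every worker, reads one record per line on stdin, and writes one result per line on stdout:

    import org.apache.spark.{SparkConf, SparkContext}

    object PipeExample {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("Pipe to external tool"))

        val input = sc.textFile("hdfs:///data/records.txt")
        // Each partition's records are fed to the script's stdin, one per line;
        // whatever the script prints on stdout becomes the resulting RDD.
        val results = input.pipe("/opt/tools/run_matlab.sh")
        results.saveAsTextFile("hdfs:///data/results")

        sc.stop()
      }
    }

Companion .m files would have to be deployed alongside the script on each node (or shipped with SparkContext.addFile), which is exactly the dependency question raised above.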

RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
The code is the example given on Spark site: /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile =
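
For reference, the quick-start example being referred to looks roughly like this (the log file path is the placeholder from the guide):

    /* SimpleApp.scala */
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "YOUR_SPARK_HOME/README.md" // should be some file on your system
        val conf = new SparkConf().setAppName("Simple Application")
        val sc = new SparkContext(conf)
        val logData = sc.textFile(logFile, 2).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
      }
    }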

NotSerializableException while doing rdd.saveToCassandra

2014-08-27 Thread lmk
Hi All, I am using spark-1.0.0 to parse a json file and save to values to cassandra using case class. My code looks as follows: case class LogLine(x1:Option[String],x2: Option[String],x3:Option[List[String]],x4:

How to get prerelease thriftserver working?

2014-08-27 Thread Matt Chu
(apologies for sending this twice, first via nabble; didn't realize it wouldn't get forwarded) Hey, I know it's not officially released yet, but I'm trying to understand (and run) the Thrift-based JDBC server, in order to enable remote JDBC access to our dev cluster. Before asking about details,

RE: Installation On Windows machine

2014-08-27 Thread Mishra, Abhishek
I got it working, Matei, thank you. I was giving the wrong directory path. Thank you...!! Thanks, Abhishek Mishra -Original Message- From: Mishra, Abhishek [mailto:abhishek.mis...@xerox.com] Sent: Wednesday, August 27, 2014 4:38 PM To: Matei Zaharia Cc: user@spark.apache.org Subject: RE:

Re: Example File not running

2014-08-27 Thread Akhil Das
You can install Hadoop 2 by following this doc: https://wiki.apache.org/hadoop/Hadoop2OnWindows Once you are done with it, you can set the environment variable HADOOP_HOME and then it should work. Also, not sure if it will work, but can you add file:// at the front and give it a go? I don't see any

RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
It didn't work after adding file:// at the front. I compiled it again and ran it. The same errors are coming. Do you think there could be some problem with the Java dependency? Also, I don't want to install Hadoop; I just want to run it on my local machine. The reason is, whenever I install these

Re: Does HiveContext support Parquet?

2014-08-27 Thread Silvio Fiorito
What Spark and Hadoop versions are you on? I have it working in my Spark app with the parquet-hive-bundle-1.5.0.jar bundled into my app fat-jar. I'm running Spark 1.0.2 and CDH5. Try bin/spark-shell --master local[*] --driver-class-path ~/parquet-hive-bundle-1.5.0.jar to see if that works? On

Saddle structure in Spark

2014-08-27 Thread LPG
Hello everyone, Is it possible to use an external data structure, such as Saddle, in Spark? As far as I know, an RDD is a kind of wrapper or container that has a certain data structure inside. So I was wondering whether this data structure has to be either a basic (or native) structure or any

Re: Spark Streaming Output to DB

2014-08-27 Thread Ravi Sharma
Thank you Akhil and Mayur. It will be really helpful. Thanks, On 27 Aug 2014 13:19, Akhil Das ak...@sigmoidanalytics.com wrote: Like Mayur said, it's better to use mapPartition instead of map. Here's a piece of code which reads a text file and inserts each row into the database. I

Reference Accounts Large Node Deployments

2014-08-27 Thread Steve Nunez
All, Does anyone have specific references to customers, use cases and large-scale deployments of Spark Streaming? By 'large scale' I mean both throughput and number of nodes. I'm attempting an objective comparison of Streaming and Storm, and while this data is known for Storm, there appears to

Re: Execute HiveFormSpark ERROR.

2014-08-27 Thread Du Li
As suggested in the error messages, double-check your class path. From: CharlieLin chury...@gmail.commailto:chury...@gmail.com Date: Tuesday, August 26, 2014 at 8:29 PM To: user@spark.apache.orgmailto:user@spark.apache.org user@spark.apache.orgmailto:user@spark.apache.org Subject: Execute

Re: What is a Block Manager?

2014-08-27 Thread Victor Tso-Guillen
I have long-lived state I'd like to maintain on the executors, which I'd like to initialize during some bootstrap phase, and to update the master when such an executor leaves the cluster. On Tue, Aug 26, 2014 at 11:18 PM, Liu, Raymond raymond@intel.com wrote: The framework has this info to

RE: Save an RDD to a SQL Database

2014-08-27 Thread bdev
I have a similar requirement to export the data to MySQL. Just wanted to know what the best approach is so far, after the research you guys have done. Currently thinking of saving to HDFS and using Sqoop to handle the export. Is that the best approach, or is there any other way to write to MySQL? Thanks!

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Wei Tan
Thank you all. Actually I was looking at JCUDA. Function-wise this may be a perfect solution to offload computation to the GPU. Will see how the performance turns out, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J.

Re: Out of memory on large RDDs

2014-08-27 Thread Jianshi Huang
I have the same issue (I'm using the latest 1.1.0-SNAPSHOT). I've increased my driver memory to 30G, executor memory to 10G, and spark.akka.askTimeout to 180. Still no good. My other configurations are: spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryoserializer.buffer.mb
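
For anyone trying to reproduce this setup, the settings listed above can also be applied programmatically; a minimal sketch with illustrative values only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Values mirror the configuration described above; tune for your cluster.
    val conf = new SparkConf()
      .setAppName("Large RDD job")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "64")
      .set("spark.akka.askTimeout", "180")

    val sc = new SparkContext(conf)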

Re: Specifying classpath

2014-08-27 Thread Ashish Jain
I solved this issue by putting hbase-protobuf in Hadoop classpath, and not in the spark classpath. export HADOOP_CLASSPATH=/path/to/jar/hbase-protocol-0.98.1-cdh5.1.0.jar On Tue, Aug 26, 2014 at 5:42 PM, Ashish Jain ashish@gmail.com wrote: Hello, I'm using the following version of

Re: Issue Connecting to HBase in spark shell

2014-08-27 Thread kpeng1
It looks like the issue I had is that I didn't pull in htrace-core jar into the spark class path. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-Connecting-to-HBase-in-spark-shell-tp12855p12924.html Sent from the Apache Spark User List mailing list

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Xiangrui Meng
Hi Wei, Please keep us posted about the performance result you get. This would be very helpful. Best, Xiangrui On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation

Re: Spark 1.1. doesn't work with hive context

2014-08-27 Thread S Malligarjunan
It is my mistake; somehow I had added the io.compression.codec property value as the above-mentioned class. Resolved the problem now. Thanks and Regards, Sankar S. On Wednesday, 27 August 2014, 1:23, S Malligarjunan smalligarju...@yahoo.com wrote: Hello all, I have just checked out

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Bharat Venkat
Hi Dibyendu, That would be great. One of the biggest drawbacks of Kafka utils, as well as your implementation, is that I am unable to scale out processing. I am relatively new to Spark and Spark Streaming - from what I read and what I observe with my deployment, having the RDD created on one

Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi All, I am planning to run the AMPLab benchmark suite to evaluate the performance of our cluster. I looked at https://amplab.cs.berkeley.edu/benchmark/ and it mentions data availability at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix] where /tiny/,

Re: Amplab: big-data-benchmark

2014-08-27 Thread Burak Yavuz
Hi Sameer, I've faced this issue before. They don't show up on http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: `sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")` The gotcha is that you also need to supply which dataset you want: crawl, uservisits, or rankings
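
Putting the suggestion together, a small spark-shell sketch; the credential keys are placeholders and assume s3n access is not already configured elsewhere:

    // Assumes the spark-shell's sc; fill in real AWS credentials.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    // Pick the dataset (crawl, uservisits, or rankings) and size suffix in the path.
    val crawl = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")
    println(crawl.count())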

Re: Does HiveContext support Parquet?

2014-08-27 Thread lyc
Thanks a lot. Finally, I can create a parquet table using your command --driver-class-path. I am using hadoop 2.3. Now, I will try to load data into the tables. Thanks, lyc -- View this message in context:

RE: Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi Burak, Thanks, I will then start benchmarking the cluster. Date: Wed, 27 Aug 2014 11:52:05 -0700 From: bya...@stanford.edu To: ssti...@live.com CC: user@spark.apache.org Subject: Re: Amplab: big-data-benchmark Hi Sameer, I've faced this issue before. They don't show up on

Re: Does HiveContext support Parquet?

2014-08-27 Thread Michael Armbrust
I'll note the parquet jars are included by default in 1.1. On Wed, Aug 27, 2014 at 11:53 AM, lyc yanchen@huawei.com wrote: Thanks a lot. Finally, I can create a parquet table using your command --driver-class-path. I am using hadoop 2.3. Now, I will try to load data into the tables.

Re: How to get prerelease thriftserver working?

2014-08-27 Thread Michael Armbrust
I would expect that to work. What exactly is the error? On Wed, Aug 27, 2014 at 6:02 AM, Matt Chu m...@kabam.com wrote: (apologies for sending this twice, first via nabble; didn't realize it wouldn't get forwarded) Hey, I know it's not officially released yet, but I'm trying to understand

RE: Execution time increasing with increase of cluster size

2014-08-27 Thread Sameer Tilak
Can you tell which nodes were doing the computation in each case? Date: Wed, 27 Aug 2014 20:29:38 +0530 Subject: Execution time increasing with increase of cluster size From: sarathchandra.jos...@algofusiontech.com To: user@spark.apache.org Hi, I've written a simple scala program which reads a

Re: hive on spark yarn

2014-08-27 Thread Michael Armbrust
You need to have the datanucleus jars on your classpath. It is not okay to merge them into an uber jar. On Wed, Aug 27, 2014 at 1:44 AM, centerqi hu cente...@gmail.com wrote: Hi all, When I run a simple SQL query, I encountered the following error. hive: 0.12 (metastore in mysql), hadoop 2.4.1

Spark N.C.

2014-08-27 Thread am
Looking for fellow Spark enthusiasts based in and around Research Triangle Park, Raleigh, Durham, and Chapel Hill, North Carolina. Please get in touch off-list for an employment opportunity. Must be local. Thanks! -Andrew -

MLBase status

2014-08-27 Thread Sameer Tilak
Hi All, I was wondering if someone could please tell me the status of MLbase and its roadmap in terms of software releases. We are very interested in exploring it for our applications.

[Streaming] Cannot get executors to stay alive

2014-08-27 Thread Yana Kadiyska
Hi, I asked a similar question before and didn't get any answers, so I'll try again: I am using updateStateByKey, pretty much exactly as shown in the examples shipping with Spark: def createContext(master: String, dropDir: String, checkpointDirectory: String) = { val updateFunc = (values:
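
For context, a stripped-down version of that pattern, modeled on the stateful word-count example that ships with Spark; the socket source, paths, and batch interval are illustrative only:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object StatefulCount {
      def createContext(master: String, checkpointDirectory: String): StreamingContext = {
        val updateFunc = (values: Seq[Int], state: Option[Int]) =>
          Some(values.sum + state.getOrElse(0))

        val conf = new SparkConf().setMaster(master).setAppName("StatefulCount")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDirectory)

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        words.map(w => (w, 1)).updateStateByKey[Int](updateFunc).print()
        ssc
      }

      def main(args: Array[String]) {
        val checkpointDir = "/tmp/checkpoint"
        // Rebuilds the context from the checkpoint after a restart, or creates it fresh.
        val ssc = StreamingContext.getOrCreate(checkpointDir,
          () => createContext("local[2]", checkpointDir))
        ssc.start()
        ssc.awaitTermination()
      }
    }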

Re: disable log4j for spark-shell

2014-08-27 Thread Yana
You just have to tell Spark which log4j properties file to use. I think --driver-java-options=-Dlog4j.configuration=log4j.properties should work but it didn't for me. set SPARK_JAVA_OPTS=-Dlog4j.configuration=log4j.properties did work though (this was on Windows, in local mode, assuming you put a

Historic data and clocks

2014-08-27 Thread Frank van Lankvelt
Hi, In an attempt to keep processing logic as simple as possible, I'm trying to use spark streaming for processing historic as well as real-time data. This works quite well, using big intervals that match the window size for historic data, and small intervals for real-time. I found this

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Frank van Lankvelt
you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close enough? cheers, Frank 1. https://github.com/ochafik/ScalaCL On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this

SchemaRDD

2014-08-27 Thread Koert Kuipers
I feel like SchemaRDD has usage beyond just SQL. Perhaps it belongs in core?

Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks, -Soumitra.

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks,

minPartitions ignored for bz2?

2014-08-27 Thread jerryye
Hi, I'm running on the master branch and I noticed that textFile ignores minPartition for bz2 files. Is anyone else seeing the same thing? I tried varying minPartitions for a bz2 file and rdd.partitions.size was always 1 whereas doing it for a non-bz2 file worked. Not sure if this matters or not

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote:

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar
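
Applied to the original DStream question, the hack looks roughly like this (stream is a hypothetical DStream[String]; the multiplier assumes no RDD in the stream exceeds a billion records):

    import org.apache.spark.SparkContext._   // pair-RDD functions for mapValues

    // Each element is paired with an id that stays unique across batches.
    val withIds = stream.transform { rdd =>
      rdd.zipWithUniqueId().mapValues(uid => rdd.id.toLong * 1000000000L + uid)
    }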

Re: minPartitions ignored for bz2?

2014-08-27 Thread Xiangrui Meng
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files before 1.2 (or a later version). But due to a bug (https://issues.apache.org/jira/browse/HADOOP-10614), you should try hadoop-2.5.0. -Xiangrui On Wed, Aug 27, 2014 at 2:49 PM, jerryye jerr...@gmail.com wrote: Hi, I'm running
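
A quick way to observe the behavior and regain downstream parallelism until a bz2-splitting Hadoop is in place (at the cost of a shuffle); a sketch assuming the spark-shell's sc and an illustrative path:

    // minPartitions is ignored because the old codec cannot split bz2,
    // so the whole file lands in a single partition.
    val raw = sc.textFile("hdfs:///data/logs.bz2", 16)
    println(raw.partitions.size)    // 1 on pre-1.2 Hadoop

    // Repartitioning after the read spreads the data back out.
    val spread = raw.repartition(16)
    println(spread.partitions.size) // 16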

Re: SchemaRDD

2014-08-27 Thread Matei Zaharia
I think this will increasingly be its role, though it doesn't make sense to move it to core because it is clearly just a client of the core APIs. What usage do you have in mind in particular? It would be nice to know how the non-SQL APIs for this could be better. Matei On August 27, 2014 at

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Thanks. Just to double check, rdd.id would be unique for a batch in a DStream? On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion

Re: How to get prerelease thriftserver working?

2014-08-27 Thread Cheng Lian
Hey Matt, if you want to access existing Hive data, you still need to run a Hive metastore service and provide a proper hive-site.xml (just drop it in $SPARK_HOME/conf). Could you provide the error log you saw? On Wed, Aug 27, 2014 at 12:09 PM, Michael Armbrust mich...@databricks.com

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD. On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Thanks. Just to double check, rdd.id would be unique for a batch in a DStream? On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD

FileNotFoundException (No space left on device) writing to S3

2014-08-27 Thread Daniil Osipov
Hello, I've been seeing the following errors when trying to save to S3: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 4058 in stage 2.1 failed 4 times, most recent failure: Lost task 4058.3 in stage 2.1 (TID 12572,

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
I see an issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there was a DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the

SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
Hi, Michael. I used HiveContext to create a table with a field of type Array. However, in the hql results, this field was returned as type ArrayBuffer which is mutable. Would it make more sense to be an Array? The Spark version of my test is 1.0.2. I haven’t tested it on SQLContext nor newer

Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra

2014-08-27 Thread Yana
I'm not so sure that your error is coming from the Cassandra write. You have val data = test.map(..).map(..), so data will actually not get created until you try to save it. Can you try to do something like data.count() or data.take(k) after this line and see if you even get to the cassandra
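
A tiny sketch of that isolation step, with the two map calls stood in by hypothetical functions; if the exception already fires here, the problem is in the maps (or their closures), not in the Cassandra write:

    // parse and toLogLine are placeholders for the two map stages in the original code.
    val data = test.map(parse).map(toLogLine)

    // Force evaluation before Cassandra is involved at all.
    println(data.count())
    data.take(5).foreach(println)

    // Only if the above succeeds, attempt the save, e.g.:
    // data.saveToCassandra("my_keyspace", "my_table")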

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
Arrays in the JVM are also mutable. However, you should not be relying on the exact type here. The only promise is that you will get back something of type Seq[_]. On Wed, Aug 27, 2014 at 4:27 PM, Du Li l...@yahoo-inc.com wrote: Hi, Michael. I used HiveContext to create a table with a
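
A sketch of coding against that Seq promise rather than a concrete collection type, assuming an existing hiveContext and with the table and column names made up for illustration (Spark 1.0.x HiveContext API):

    // Only Seq[_] is promised for array columns; don't assume Array or ArrayBuffer.
    val tags = hiveContext.hql("SELECT tags FROM articles")
      .map(row => row(0).asInstanceOf[Seq[String]])

    tags.first().foreach(println)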

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
I found this discrepancy when writing unit tests for my project. Basically the expectation was that the returned type should match that of the input data. Although it's easy to work around, it just felt a bit weird. Is there a better reason to return ArrayBuffer? From: Michael Armbrust

Kafka stream receiver stops input

2014-08-27 Thread Tim Smith
Hi, I have Spark (1.0.0 on CDH5) running with Kafka 0.8.1.1. I have a streaming job that reads from a Kafka topic and writes output to another Kafka topic. The job starts fine but after a while the input stream stops getting any data. I think these messages show no incoming data on the stream:

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
In general the various language interfaces try to return the natural type for the language. In Python we return lists; in Scala we return Seqs. Arrays on the JVM have all sorts of messy semantics (e.g. they are invariant and don't have erasure). On Wed, Aug 27, 2014 at 5:34 PM, Du Li

Re: MLBase status

2014-08-27 Thread Ameet Talwalkar
Hi Sameer, MLbase started out as a set of three ML components on top of Spark. The lowest level, MLlib, is now a rapidly growing component within Spark and is maintained by the Spark community. The two higher-level components (MLI and MLOpt) are experimental components that serve as testbeds for

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Dibyendu Bhattacharya
I agree. This issue should be fixed in Spark rather than relying on replay of Kafka messages. Dib On Aug 28, 2014 6:45 AM, RodrigoB rodrigo.boav...@aspect.com wrote: Dibyendu, Tnks for getting back. I believe you are absolutely right. We were under the assumption that the raw data was being

Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98, My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 edit project/SparkBuild.scala, set HBASE_VERSION // HBase version; set as appropriate. val

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
See SPARK-1297 The pull request is here: https://github.com/apache/spark/pull/1893 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: (correction: Compilation Error: Spark 1.0.2 with HBase 0.98, please ignore if duplicated) Hi, I need to use

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote: See SPARK-1297 The pull request is here:

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Henry Hung
Update: I use a shell script to execute spark-shell; inside my-script.sh: $SPARK_HOME/bin/spark-shell < $HOME/test.scala > $HOME/test.log 2>&1 Although it correctly finishes the println(hallo world), the strange thing is that my-script.sh finished before spark-shell even finished executing

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thank you so much!!

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED; can you please advise how to get through this error? Or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, Thanks. I tried patch -p1 -i 1893.patch but got Hunk #1 FAILED at 45. Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Matei Zaharia
You can use spark-shell -i file.scala to run that. However, that keeps the interpreter open at the end, so you need to make your file end with System.exit(0) (or even more robustly, do stuff in a try {} and add that in finally {}). In general it would be better to compile apps and run them
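
A sketch of the script shape Matei describes, so spark-shell -i exits cleanly even when the body throws; the file name and input path are illustrative:

    // test.scala -- run with: spark-shell -i test.scala
    try {
      val data = sc.textFile("hdfs:///data/input.txt")
      println("line count: " + data.count())
    } finally {
      // Without this, the interpreter stays open after the script finishes.
      System.exit(0)
    }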

Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi, I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1. I just tried to compile Spark 1.0.2, but got an error on Spark Project Hive. Can you please advise which repository has org.spark-project.hive:hive-metastore:jar:0.13.1? FYI, below is my repository setting in maven which

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297. You can download it and replace examples/pom.xml with the downloaded pom. I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package

Re: Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5wwgyL1/Working+Formula+for+Hive+0.13subj=Re+Working+Formula+for+Hive+0+13+ On Wed, Aug 27, 2014 at 8:54 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1. I

Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra

2014-08-27 Thread lmk
Hi Yana, I have done take and confirmed the existence of data. Also checked that it is getting connected to Cassandra. That is why I suspect that this particular RDD is not serializable. Thanks, Lmk On Aug 28, 2014 5:13 AM, Yana [via Apache Spark User List] ml-node+s1001560n12960...@n3.nabble.com

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt Cheers On Wed, Aug 27, 2014 at

Update on Pig on Spark initiative

2014-08-27 Thread Mayur Rustagi
Hi, We have migrated Pig functionality on top of Spark, passing 100% of e2e tests for success cases in the Pig test suite. That means UDFs, joins and other functionality are working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks). Meanwhile if

Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote: Hi, We have migrated Pig functionality on top of Spark passing 100% e2e for success cases in pig test suite. That means UDF, Joins other

Submitting multiple files pyspark

2014-08-27 Thread Chengi Liu
Hi, I have two files: main_app.py and helper.py. main_app.py calls some functions in helper.py. I want to use spark-submit to submit a job, but how do I specify helper.py? Basically, how do I specify multiple files in Spark? Thanks