Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
Does my question make sense or require some elaboration? Sasi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-for-Spark-JobServer-setup-on-Maven-for-Java-programming-tp20849p20896.html Sent from the Apache Spark User List mailing list archive at

Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread abhishek
Hey, why specifically Maven? We set up a spark-jobserver through SBT, which is an easy way to get the job server up and running. On 30 Dec 2014 13:32, Sasi [via Apache Spark User List] ml-node+s1001560n20896...@n3.nabble.com wrote: Does my question make sense or require some elaboration? Sasi

Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
The reason being, we have a Vaadin (Java framework) application which displays data from a Spark RDD, which in turn gets its data from Cassandra. As we know, we need to use Maven for building against the Spark API in Java. We tested the spark-jobserver using SBT and were able to run it. However, for our requirement, we

Re: Clustering text data with MLlib

2014-12-30 Thread xhudik
K-means really needs the number of clusters identified in advance. There are multiple algorithms (X-means, ART, ...) which do not need this information. Unfortunately, none of them is implemented in MLlib at the moment (you can give a hand and help the community). Anyway, it seems to me you will

Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread abhishek
Ohh... Just curious: we did a use case similar to yours, getting data out of Cassandra. Since the job server is a REST architecture, all we need is a URL to access it. Why does integrating with your framework matter here, when all we need is a URL? On 30 Dec 2014 14:05, Sasi [via Apache Spark User List]

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-30 Thread Mukesh Jha
Thanks Sandy, it was the issue with the number of cores. Another issue I was facing is that tasks are not getting distributed evenly among all executors and are running at the NODE_LOCAL locality level, i.e. all the tasks are running on the same executor where my kafkareceiver(s) are running, even

Spark SQL implementation error

2014-12-30 Thread sachin Singh
I have a table (CSV file) and loaded its data by creating a POJO as per the table structure, and created a SchemaRDD as under: JavaRDD<Test1> testSchema = sc.textFile("D:/testTable.csv").map(GetTableData); /* GetTableData will transform all the table data into testTable objects */ JavaSchemaRDD schemaTest =

Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
Thanks Abhishek. We understand your point and will try using the REST URL. One concern, however: we presently have around 1 lakh (100,000) rows in our Cassandra table. Can the REST URL result withstand that response size? -- View this message in context:

Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Jahagirdar, Madhu
Foreach iterates through the partitions in the RDD and executes the operations for each partition, I guess. On 29-Dec-2014, at 10:19 pm, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi All, Please clarify: can we say 1 RDD is generated every batch interval? If the above is true, then is

Spark SQL insert overwrite table failed.

2014-12-30 Thread Mars Max
While doing a JOIN operation on three tables using Spark 1.1.1, I always got the following error. However, I never hit this exception in Spark 1.1.0 with the same operation and the same data. Has anyone met this problem? 14/12/30 17:49:33 ERROR CliDriver:

Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Sean Owen
The DStream model is one RDD of data per interval, yes. foreachRDD performs an operation on each RDD in the stream, which means it is executed once* for the one RDD in each interval. * ignoring the possibility here of failure and retry of course On Mon, Dec 29, 2014 at 4:49 PM, SamyaMaiti
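A minimal sketch of that model (the app name and socket source are made up for illustration): the body passed to foreachRDD runs once per batch interval, on that interval's single RDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("OneRddPerInterval")
    val ssc = new StreamingContext(conf, Seconds(10)) // one RDD every 10 seconds
    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source

    // Executed once per batch interval, on that interval's single RDD.
    lines.foreachRDD { rdd =>
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()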

Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
Hi, I'm facing a weird issue. Any help appreciated. When I execute the below code and compare input and output, each record in the output has some extra trailing data appended to it, and hence corrupted. I'm just reading and writing, so the input and output should be exactly the same. I'm using

Re: Can we say 1 RDD is generated every batch interval?

2014-12-30 Thread Maiti, Samya
Thanks Sean. That was helpful. Regards, Sam On Dec 30, 2014, at 4:12 PM, Sean Owen so...@cloudera.com wrote: The DStream model is one RDD of data per interval, yes. foreachRDD performs an operation on each RDD in the stream, which means it is executed once* for the one RDD in each interval.

[SOLVED] Re: Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
This poor soul had the exact same problem and solution: http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji eshi...@gmail.com wrote: Hi, I'm facing a weird issue. Any help

Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread abhishek
Frankly speaking, I never tried this volume in practice, but I believe it should work. On 30 Dec 2014 15:26, Sasi [via Apache Spark User List] ml-node+s1001560n20902...@n3.nabble.com wrote: Thanks Abhishek. We understand your point and will try using the REST URL. One concern, however: we presently have

Re: building spark1.2 meet error

2014-12-30 Thread xhudik
Hi, well, Spark 1.2 was prepared for Scala 2.10. If you want a stable and fully functional tool, I'd compile it with this default compiler. *I was able to compile Spark 1.2 with Java 7 and Scala 2.10 seamlessly.* I also tried Java 8 and Scala 2.11 (no -Dscala.usejavacp=true), but I failed for some other

How to collect() each partition in scala ?

2014-12-30 Thread DEVAN M.S.
Hi all, I have one large data set. When I get the number of partitions, it shows 43. We can't collect() the large data set into memory, so I am thinking like this: collect() each partition so that it will be small in size. Any thoughts?

Re: Need help for Spark-JobServer setup on Maven (for Java programming)

2014-12-30 Thread Sasi
Thanks Abhishek. We are good now with an answer to try. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-for-Spark-JobServer-setup-on-Maven-for-Java-programming-tp20849p20906.html Sent from the Apache Spark User List mailing list archive at

Re: How to collect() each partition in scala ?

2014-12-30 Thread Sean Owen
collect()-ing a partition still implies copying it to the driver, but you're suggesting you can't collect() the whole data set to the driver. What do you mean: collect() 1 partition? or collect() some smaller result from each partition? On Tue, Dec 30, 2014 at 11:54 AM, DEVAN M.S.

SparkContext with error from PySpark

2014-12-30 Thread Jaggu
Hi Team, I was trying to execute PySpark code on a cluster. It gives me the following error. (When I run the same job in local mode it works fine :-() Error from python worker: /usr/lib/spark-1.2.0-bin-hadoop2.3/python/pyspark/context.py:209: Warning: 'with' will become a reserved

Re: SparkContext with error from PySpark

2014-12-30 Thread Eric Friedman
The Python installed in your cluster is 2.5. You need at least 2.6. Eric Friedman On Dec 30, 2014, at 7:45 AM, Jaggu jagana...@gmail.com wrote: Hi Team, I was trying to execute PySpark code on a cluster. It gives me the following error. (When I run the same job in local it is

Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
Dear all :) We're trying to make a graph using large input data and get a subgraph by applying some filter. Now we want to save this graph to HDFS so that we can load it later. Is it possible to store a graph or subgraph directly in HDFS and load it back as a graph for future use? We will be glad for your

Re: Is it possible to store graph directly into HDFS?

2014-12-30 Thread Xuefeng Wu
How about saving it as an object? Yours, Xuefeng Wu (吴雪峰) On 30 Dec 2014, at 9:27 PM, Jason Hong begger3...@gmail.com wrote: Dear all :) We're trying to make a graph using large input data and get a subgraph by applying some filter. Now we want to save this graph to HDFS so that we can load it later.
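A minimal sketch of the save-as-object idea, assuming an existing Graph named graph, a SparkContext sc, and made-up HDFS paths (VA and EA are placeholders for your vertex and edge attribute types): a GraphX graph is just two RDDs, so saving and later reloading those two RDDs rebuilds the graph.

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // Save the graph's two underlying RDDs as Hadoop object files.
    graph.vertices.saveAsObjectFile("hdfs://namenode/graphs/vertices")
    graph.edges.saveAsObjectFile("hdfs://namenode/graphs/edges")

    // Later: reload both RDDs and reassemble the graph.
    val vertices = sc.objectFile[(VertexId, VA)]("hdfs://namenode/graphs/vertices")
    val edges = sc.objectFile[Edge[EA]]("hdfs://namenode/graphs/edges")
    val reloaded: Graph[VA, EA] = Graph(vertices, edges)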

Spark Standalone Cluster not correctly configured

2014-12-30 Thread frodo777
Hi. I'm trying to configure a Spark standalone cluster, with three master nodes (bigdata1, bigdata2 and bigdata3) managed by ZooKeeper. It seems there's a configuration problem, since every one of them claims to be the cluster leader: 14/12/30 13:54:59 INFO Master: I have been

Re: How to collect() each partition in scala ?

2014-12-30 Thread Cody Koeninger
I'm not sure exactly what you're trying to do, but take a look at rdd.toLocalIterator if you haven't already. On Tue, Dec 30, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote: collect()-ing a partition still implies copying it to the driver, but you're suggesting you can't collect() the
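A minimal sketch of the difference, assuming an RDD named rdd that is too large to collect whole: toLocalIterator streams one partition at a time to the driver, so only a single partition has to fit in driver memory.

    // collect() copies the entire data set to the driver at once and may OOM:
    // val all = rdd.collect()

    // toLocalIterator fetches partitions one at a time instead:
    rdd.toLocalIterator.foreach { record =>
      // process each record on the driver; at most one partition is resident
      println(record)
    }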

Re: Anaconda iPython notebook working with CDH Spark

2014-12-30 Thread Sebastián Ramírez
Some time ago I did approach (2): I installed Anaconda on every node. But to avoid screwing up RedHat (it was CentOS in my case, which is much the same) I installed Anaconda on every node as the yarn user and made it the default Python only for that user. After you install it, Anaconda asks if it

Re: SchemaRDD to RDD[String]

2014-12-30 Thread Yana
Do your debug printlns show values? I.e., what would you see if in rowToString you output println("row to string " + row + " " + sub)? Another thing to check would be to do schemaRDD.take(3) or something similar to make sure you actually have data. You can also try this: rowToString(schemaRDD.first, list) and

Host Error on EC2 while accessing hdfs from standalone

2014-12-30 Thread Laeeq Ahmed
Hi, I am using Spark standalone on EC2. I can access the ephemeral HDFS from the spark-shell interface, but I can't access HDFS from a standalone application. I am using Spark 1.2.0 with Hadoop 2.4.0 and launched the cluster from the ec2 folder on my local machine. In my POM file I have given the hadoop client as

Spark 1.2 and Mesos 0.21.0 spark.executor.uri issue?

2014-12-30 Thread Denny Lee
I've been working with Spark 1.2 and Mesos 0.21.0 and while I have set the spark.executor.uri within spark-env.sh (and directly within bash as well), the Mesos slaves do not seem to be able to access the spark tgz file via HTTP or HDFS as per the message below. 14/12/30 15:57:35 INFO SparkILoop:

Re: Host Error on EC2 while accessing hdfs from standalone

2014-12-30 Thread Aniket Bhatnagar
Did you check the firewall rules in the security groups? On Tue, Dec 30, 2014, 9:34 PM Laeeq Ahmed laeeqsp...@yahoo.com.invalid wrote: Hi, I am using Spark standalone on EC2. I can access the ephemeral HDFS from the spark-shell interface, but I can't access HDFS from a standalone application. I am using spark

Re: Mapping directory structure to columns in SparkSQL

2014-12-30 Thread Michael Davies
Hi Michael, I’ve looked through the example and the test cases and I think I understand what we need to do - so I’ll give it a go. I think what I’d like to try to do is allow files to be added at any time, so perhaps I can cache partition info, and also what may be useful for us would be to

Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey all, Since upgrading to 1.2.0 a pyspark job that worked fine in 1.1.1 fails during shuffle. I've tried reverting from the sort-based shuffle back to the hash one, and that fails as well. Does anyone see similar problems or has an idea on where to look next? For the sort-based shuffle I get a

Re: Mllib native netlib-java/OpenBLAS

2014-12-30 Thread xhudik
I'm half-way there; so far I have: 1. compiled and installed the OpenBLAS library 2. ln -s libopenblas_sandybridgep-r0.2.13.so /usr/lib/libblas.so.3 3. compiled and built Spark: mvn -Pnetlib-lgpl -DskipTests clean compile package So far so good. Then I ran into problems when testing the solution:

Re: S3 files , Spark job hungsup

2014-12-30 Thread Sven Krasser
This here may also be of help: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html. Make sure to spread your objects across multiple partitions to not be rate limited by S3. -Sven On Mon, Dec 22, 2014 at 10:20 AM, durga katakam durgak...@gmail.com wrote: Yes . I

Trying to make spark-jobserver work with yarn

2014-12-30 Thread Fernando O.
Hi all, I'm investigating Spark for a new project and I'm trying to use spark-jobserver because... I need to reuse and share RDDs, and from what I read in the forum that's the standard :D Turns out that spark-jobserver doesn't seem to work on YARN, or at least it does not on 1.1.1. My config

Re: Cached RDD

2014-12-30 Thread Rishi Yadav
Without caching, each action triggers a recomputation. So, assuming rdd2 and rdd3 result in separate actions, the answer is yes. On Mon, Dec 29, 2014 at 7:53 PM, Corey Nolet cjno...@gmail.com wrote: If I have 2 RDDs which depend on the same RDD, like the following: val rdd1 = ... val rdd2 = rdd1.groupBy()...
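A minimal sketch of that point, with made-up transformations: without the cache() call, each of the two count() actions recomputes rdd1 from the input.

    val rdd1 = sc.textFile("hdfs://namenode/input").map(_.toUpperCase)
    rdd1.cache() // without this, both actions below re-read and re-map the input

    val rdd2 = rdd1.filter(_.startsWith("A"))
    val rdd3 = rdd1.filter(_.startsWith("B"))

    rdd2.count() // computes rdd1 and caches it
    rdd3.count() // reuses the cached rdd1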

Re: Spark SQL implementation error

2014-12-30 Thread Michael Armbrust
Anytime you see java.lang.NoSuchMethodError it means that you have multiple conflicting versions of a library on the classpath, or you are trying to run code that was compiled against the wrong version of a library. On Tue, Dec 30, 2014 at 1:43 AM, sachin Singh sachin.sha...@gmail.com wrote: I

Re: SparkContext with error from PySpark

2014-12-30 Thread JAGANADH G
Hi, I am using Anaconda Python. Is there any way to specify which Python we have to use for running PySpark in a cluster? Best regards, Jagan On Tue, Dec 30, 2014 at 6:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: The Python installed in your cluster is 2.5. You need at least 2.6.

Spark Accumulators exposed as Metrics to Graphite

2014-12-30 Thread Łukasz Stefaniak
Hi, does Spark have a built-in possibility of exposing the current value of an Accumulator [1] using Monitoring and Instrumentation [2]? Unfortunately I couldn't find anything in the sources which could be used. Does it mean the only way to expose the current accumulator value is to implement a new Source, which would

Re: Spark Streaming: HiveContext within Custom Actor

2014-12-30 Thread Tathagata Das
I am not sure that can be done. Receivers are designed to be run only on the executors/workers, whereas a SQLContext (for using Spark SQL) can only be defined on the driver. On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote: Hi Could Spark-SQL be used from within a custom actor

Re: word count aggregation

2014-12-30 Thread Tathagata Das
For windows that large (1 hour), you will probably also have to increase the batch interval for efficiency. TD On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can use reduceByKeyAndWindow for that. Here's a pretty clean example
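A minimal sketch of such a windowed count, assuming a DStream of (word, 1) pairs named pairs and a StreamingContext ssc; the inverse-function variant shown here also requires a checkpoint directory (the path is made up).

    import org.apache.spark.streaming.Minutes

    ssc.checkpoint("hdfs://namenode/checkpoints") // required by the inverse-function form

    val hourlyCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // fold in values entering the window
      (a: Int, b: Int) => a - b, // subtract values leaving the window
      Minutes(60),               // window length: one hour
      Minutes(5))                // slide interval: update every five minutes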

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Josh Rosen
Hi Sven, Do you have a small example program that you can share which will allow me to reproduce this issue? If you have a workload that runs into this, you should be able to keep iteratively simplifying the job and reducing the data set size until you hit a fairly minimal reproduction (assuming

Re: SparkContext with error from PySpark

2014-12-30 Thread Josh Rosen
To configure the Python executable used by PySpark, see the Using the Shell Python section in the Spark Programming Guide: https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell You can set the PYSPARK_PYTHON environment variable to choose the Python executable that will be

Re: Spark Streaming: HiveContext within Custom Actor

2014-12-30 Thread Ranga
Thanks. Will look at other options. On Tue, Dec 30, 2014 at 11:43 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I am not sure that can be done. Receivers are designed to be run only on the executors/workers, whereas a SQLContext (for using Spark SQL) can only be defined on the driver.

Location of logs in local mode

2014-12-30 Thread Brett Meyer
I'm submitting a script using spark-submit in local mode for testing, and I'm having trouble figuring out where the logs are stored. The documentation indicates that they should be in the work folder in the directory in which Spark lives on my system, but I see no such folder there. I've set the

Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)

2014-12-30 Thread RK
Here is the code for my streaming job: val sparkConf = new SparkConf().setAppName("SparkStreamingJob") sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") sparkConf.set("spark.default.parallelism",

Kafka + Spark streaming

2014-12-30 Thread SamyaMaiti
Hi experts, a few general queries: 1. Can a single block/partition in an RDD have more than one Kafka message, or will there be only one Kafka message per block? More broadly, is the message count related to the block in any way, or is it just that any message received within a particular

Trouble using MultipleTextOutputFormat with Spark

2014-12-30 Thread Arpan Ghosh
Hi, I am trying to use MultipleTextOutputFormat to rename the output files of my Spark job to something different from the default part-N. I have implemented a custom MultipleTextOutputFormat class as follows: class DriveOutputRenameMultipleTextOutputFormat extends
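The class definition is cut off above; as a hedged sketch (assumed shape, not the poster's actual code), the usual pattern overrides generateFileNameForKeyValue and passes the subclass to saveAsHadoopFile:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class DriveOutputRenameMultipleTextOutputFormat
        extends MultipleTextOutputFormat[Any, Any] {
      // Name each output file after its key instead of the default part-NNNNN.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
    }

    // Assuming a pair RDD named pairRdd and a made-up output path:
    pairRdd.saveAsHadoopFile("hdfs://namenode/out", classOf[Any], classOf[Any],
      classOf[DriveOutputRenameMultipleTextOutputFormat])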

JetS3T settings spark

2014-12-30 Thread durga
I am not sure how I can pass a jets3t.properties file to spark-submit; the --file option seems not to be working. Can someone please help me? My production Spark jobs sporadically hang when reading S3 files. Thanks, -D -- View this message in context:

Re: Long-running job cleanup

2014-12-30 Thread Ganelin, Ilya
Hi Patrick, to follow up on the below discussion, I am including a short code snippet that produces the problem on 1.1. This is kind of stupid code since it’s a greatly simplified version of what I’m actually doing but it has a number of the key components in place. I’m also including some

Re: building spark1.2 meet error

2014-12-30 Thread j_soft
No, it still fails using mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -Dscala-2.10 -X -DskipTests clean package ... [DEBUG] /opt/xdsp/spark-1.2.0/core/src/main/scala [DEBUG] includes = [**/*.scala,**/*.java,] [DEBUG] excludes = [] [WARNING] Zinc server is not available

Re: JetS3T settings spark

2014-12-30 Thread Matei Zaharia
This file needs to be on your CLASSPATH actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR. You can put it in src/main/resources in a Maven or SBT project, and check that it makes it into the JAR using jar tf yourfile.jar. Matei

Re: Shuffle Problems in 1.2.0

2014-12-30 Thread Sven Krasser
Hey Josh, I am still trying to prune this to a minimal example, but it has been tricky since scale seems to be a factor. The job runs over ~720GB of data (the cluster's total RAM is around ~900GB, split across 32 executors). I've managed to run it over a vastly smaller data set without issues.

Re: python: module pyspark.daemon not found

2014-12-30 Thread Davies Liu
Could you share a link about this? It's common to use Java 7; it would be nice if we could fix this. On Mon, Dec 29, 2014 at 1:27 PM, Eric Friedman eric.d.fried...@gmail.com wrote: Was your Spark assembly jarred with Java 7? There's a known issue with jar files made with that version. It

Re: Python:Streaming Question

2014-12-30 Thread Davies Liu
There is a known bug with the local scheduler; it will be fixed by https://github.com/apache/spark/pull/3779 On Sun, Dec 21, 2014 at 10:57 PM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: I’m trying to run the stateful network word count at

Re: Kafka + Spark streaming

2014-12-30 Thread Tathagata Das
1. Of course, a single block / partition has many Kafka messages, and from different Kafka topics interleaved together. The message count is not related to the block count. Any message received within a particular block interval will go in the same block. 2. Yes, the receiver will be started on

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-30 Thread Tathagata Das
That is kind of expected due to data locality. Though you should see some tasks running on the other executors as the data gets replicated to other nodes, which can therefore run tasks based on locality. You have two solutions: 1. kafkaStream.repartition() to explicitly repartition the received data
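A minimal sketch of option 1, with made-up names and a made-up partition count: repartition() redistributes the received data across the cluster before the heavy processing runs.

    import org.apache.spark.streaming.kafka.KafkaUtils

    // zkQuorum, groupId and topicMap are assumed to be defined elsewhere.
    val kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)

    // Spread the received data across executors instead of leaving every
    // task NODE_LOCAL to the receiver's node; 32 is a made-up figure,
    // and roughly the cluster's total core count is a common starting point.
    val rebalanced = kafkaStream.repartition(32)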

Re: Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)

2014-12-30 Thread Tathagata Das
Which version of Spark Streaming are you using? When the batch processing time increases to 15-20 seconds, could you compare the task times to the task times when the application has just launched? Basically, is the increase from 6 seconds to 15-20 seconds caused by an increase in

Re: Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)

2014-12-30 Thread RK
I am running the job on 1.1.1. I will let the job run overnight and send you more info on computation vs. GC time tomorrow. BTW, do you know what the stage description named getCallSite at DStream.scala:294 might mean? Thanks, RK On Tuesday, December 30, 2014 6:02 PM, Tathagata Das

Re: python: module pyspark.daemon not found

2014-12-30 Thread Eric Friedman
https://issues.apache.org/jira/browse/SPARK-1911 Is one of several tickets on the problem. On Dec 30, 2014, at 8:36 PM, Davies Liu dav...@databricks.com wrote: Could you share a link about this? It's common to use Java 7, that will be nice if we can fix this. On Mon, Dec 29, 2014 at

Spark app performance

2014-12-30 Thread Raghavendra Pandey
I have a Spark app that involves a series of mapPartitions operations followed by a keyBy operation. I have measured the time inside the mapPartitions function blocks; these blocks take trivial time. Still, the application takes way too much time, and even the Spark UI shows that much time. So I was wondering where

Re: serialize protobuf messages

2014-12-30 Thread Chen Song
Does anyone have suggestions? On Tue, Dec 23, 2014 at 3:08 PM, Chen Song chen.song...@gmail.com wrote: Silly question: what is the best way to shuffle protobuf messages in a Spark (Streaming) job? Can I use Kryo on top of the protobuf Message type? -- Chen Song
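One possible approach, offered here as an assumption rather than something confirmed in this thread, is to register chill-protobuf's ProtobufSerializer for the generated message class through a Kryo registrator (MyMessage is a placeholder for your protoc-generated class):

    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.protobuf.ProtobufSerializer
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class ProtoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // MyMessage is a placeholder for a protoc-generated Message class.
        kryo.register(classOf[MyMessage], new ProtobufSerializer())
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[ProtoRegistrator].getName)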

Re: JetS3T settings spark

2014-12-30 Thread durga katakam
Thanks Matei. -D On Tue, Dec 30, 2014 at 4:49 PM, Matei Zaharia matei.zaha...@gmail.com wrote: This file needs to be on your CLASSPATH actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR. You can put it in src/main/resources in a

Re: building spark1.2 meet error

2014-12-30 Thread Sean Owen
This is still using a non-existent hadoop-2.5 profile, and -Dscala-2.10 won't do anything. These don't matter though; this error is just some scalac problem. I don't see this error when compiling. On Wed, Dec 31, 2014 at 12:48 AM, j_soft zsof...@gmail.com wrote: no,it still fail use mvn -Pyarn

Re: Is it possible to store graph directly into HDFS?

2014-12-30 Thread Jason Hong
Thanks for your answer, Xuefeng Wu. But I don't understand how to save a graph as an object. :( Do you have any sample code?

How to set local property in beeline connect to the spark thrift server

2014-12-30 Thread Xiaoyu Wang
Hi all! I use Spark SQL 1.2 to start the thrift server on YARN. I want to use the fair scheduler in the thrift server. I set the properties in spark-defaults.conf like this: spark.scheduler.mode FAIR spark.scheduler.allocation.file /opt/spark-1.2.0-bin-2.4.1/conf/fairscheduler.xml In the thrift server