RDD for Storm Streaming in Spark

2014-12-23 Thread Ajay
Hi, Can we use Storm Streaming as an RDD in Spark? Or is there any way to get Spark to work with Storm? Thanks Ajay

Re: Spark Job hangs up on multi-node cluster but passes on a single node

2014-12-23 Thread Akhil
That's because somewhere in your code you have specified localhost instead of the IP address of the machine running the service. When you run it in local mode it works fine, because everything happens on that machine and hence it is able to connect to the localhost which runs the service, now

How to export data from hive into hdfs in spark program?

2014-12-23 Thread LinQili
Hi all: I wonder if there is a way to export data from a Hive table into HDFS using Spark, like this: INSERT OVERWRITE DIRECTORY '/user/linqili/tmp/src' SELECT * FROM $DB.$tableName

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
Hi, I'm not sure what you are asking: Whether we can use spouts and bolts in Spark (= no) or whether we can do streaming in Spark: http://spark.apache.org/docs/latest/streaming-programming-guide.html -kr, Gerard. On Tue, Dec 23, 2014 at 9:03 AM, Ajay ajay.ga...@gmail.com wrote: Hi, Can

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Ajay
Hi, The question is how to do streaming in Spark with Storm (not using Spark Streaming). The idea is to use Spark as an in-memory computation engine, with static data coming from Cassandra/HBase and streaming data from Storm. Thanks Ajay On Tue, Dec 23, 2014 at 2:03 PM, Gerard Maas

Re: Interpreting MLLib's linear regression o/p

2014-12-23 Thread Sean Owen
(In your libsvm example, your indices are not ascending.) The first weight corresponds to the first feature, of course. An indexing scheme doesn't change that or somehow make the first feature map to the second (where would the last one go then?). You'll find the first weight at offset 0 in an
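A minimal sketch of that indexing (MLlib 1.x; the file path, iteration count, and the use of sc are placeholder assumptions):

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    // libsvm feature indices must be strictly ascending within each line
    val data = MLUtils.loadLibSVMFile(sc, "data.libsvm")
    val model = LinearRegressionWithSGD.train(data, 100)
    val firstWeight = model.weights(0) // weight for the first feature, at offset 0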

JavaRDD (Data Aggregation) based on key

2014-12-23 Thread sachin Singh
Hi, I have a CSV file with fields a, b, c. I want to do aggregation (sum, average, ...) based on any field (a, b, or c) as per user input, using the Apache Spark Java API. Please help, urgent! Thanks in advance, Regards Sachin

Re: Consistent hashing of RDD row

2014-12-23 Thread lev
After checking the Spark code, I now realize that an RDD that was cached to disk can't be evicted, so I will just persist the RDD to disk after the random numbers are created.

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Gerard Maas
I'm not aware of a project trying to achieve this integration. At some point Summingbird had the intention of adding a Spark port, and that could potentially bridge Storm and Spark. Not sure if that evolved into something concrete. In any case, an attempt to bring Storm and Spark together will

Re: RDD for Storm Streaming in Spark

2014-12-23 Thread Ajay
Right. I contacted the SummingBird users as well. It doesn't support Spark Streaming currently. We are heading towards Storm as it is most widely used. Is Spark Streaming production-ready? Thanks Ajay On Tue, Dec 23, 2014 at 3:47 PM, Gerard Maas gerard.m...@gmail.com wrote: I'm not aware

ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
Is anybody experiencing this? It looks like a bug in JetS3t to me, but I thought I'd sanity-check before filing an issue. I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with an S3 URL (s3://fake-test/1234). The code does write to S3, but with double forward slashes

Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
I filed a new issue, HADOOP-11444. According to HADOOP-10372, s3 is likely to be deprecated anyway in favor of s3n. Also, the comment section notes that Amazon has implemented an EmrFileSystem for S3 which is built using the AWS SDK rather than JetS3t. On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji

Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Vladimir Protsenko
I am installing Spark 1.2.0 on CentOS 6.6. I just downloaded the code from GitHub, installed Maven, and am trying to compile: git clone https://github.com/apache/spark.git git checkout v1.2.0 mvn -DskipTests clean package leads to an OutOfMemoryException. How much memory does it require?

Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Sean Owen
You might try a little more. The official guidance suggests 2GB: https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage On Tue, Dec 23, 2014 at 2:57 PM, Vladimir Protsenko protsenk...@gmail.com wrote: I am installing Spark 1.2.0 on CentOS 6.6. Just downloaded

Re: How to export data from hive into hdfs in spark program?

2014-12-23 Thread Cheng Lian
This depends on which output format you want. For Parquet, you can simply do this: hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file") On 12/23/14 5:22 PM, LinQili wrote: Hi Leo: Thanks for your reply. I am talking about using Hive from Spark to export data from
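Putting the two approaches side by side, a minimal sketch (assuming a Spark 1.2-era HiveContext; sc, the database, table, and paths are placeholders):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Run the HiveQL export directly, as in the original question...
    hiveContext.sql(
      "INSERT OVERWRITE DIRECTORY '/user/linqili/tmp/src' SELECT * FROM some_db.some_table")
    // ...or write the table's rows out as a Parquet file, as suggested above:
    hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file")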

RE: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Guru Medasani
Hi Vladimir, From the link Sean posted, there is the following note if you use Java 8: "Note: For Java 8 and above this step is not required." So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:04:42 +

Re: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Sean Owen
The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is actually the -XX:MaxPermSize=512M, since Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir,
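For reference, the setting from that page looks like this; the MaxPermSize flag is the part Java 8 ignores:

    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"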

RE: Spark Installation Maven PermGen OutOfMemoryException

2014-12-23 Thread Guru Medasani
Thanks for the clarification, Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is

Re: SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-23 Thread David Allan
Doh... figured it out.

Spark UI port issue when deploying Spark driver on YARN in yarn-cluster mode on EMR

2014-12-23 Thread Roberto Coluccio
Hello folks, I'm trying to deploy a Spark driver on Amazon EMR in yarn-cluster mode, expecting to be able to access the Spark UI at the spark-master-ip:4040 address (default port). The problem is that the Spark UI port is always chosen randomly at runtime, although I also tried to specify

Re: SchemaRDD.sample problem

2014-12-23 Thread Hao Ren
Update: t1 is good. After collecting on t1, I find that all rows are ok (is_new = 0). Just after sampling, there are some rows where is_new = 1 which should have been filtered out by the WHERE clause.

removing first record from RDD[String]

2014-12-23 Thread Hafiz Mujadid
Hi dears! Is there some efficient way to drop the first line of an RDD[String]? Any suggestion? Thanks

Re: removing first record from RDD[String]

2014-12-23 Thread Jörg Schad
Hi, maybe the drop function is helpful for you (even though this is probably more than you need, it is still an interesting read): http://erikerlandson.github.io/blog/2014/07/27/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/ Joerg On Tue, Dec 23, 2014 at 5:45 PM, Hao Ren

retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Thomas Kwan
Hi there, We are using MLlib 1.1.1 and doing logistic regression with a dataset of about 150M rows. The training part usually goes pretty smoothly without any retries. But during the prediction stage and the BinaryClassificationMetrics stage, I am seeing retries with fetch failure errors. The

Re: removing first record from RDD[String]

2014-12-23 Thread Erik Erlandson
There is also a lazy implementation: http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/ I generated a PR for it -- there was also an alternate proposal for having it be a library in the new Spark Packages site:

Single worker locked at 100% CPU

2014-12-23 Thread Phil Wills
I've been attempting to run a job based on MLlib's ALS implementation for a while now and have hit an issue I'm having a lot of difficulty getting to the bottom of. On a moderate size set of input data it works fine, but against larger (still well short of what I'd think of as big) sets of data,

Re: Tuning Spark Streaming jobs

2014-12-23 Thread Timothy Chen
Hi Gerard, SPARK-4286 is the ticket I am working on, which besides supporting shuffle service it also supports the executor scaling callbacks (kill/request total) for coarse grain mode. I created SPARK-4940 to discuss more about the distribution problem, and let's bring our discussions there.

Re: removing first record from RDD[String]

2014-12-23 Thread Michael Quinlan
Hafiz, You can probably use the RDD.mapPartitionsWithIndex method. Mike On Tue, Dec 23, 2014 at 8:35 AM, Hafiz Mujadid [via Apache Spark User List] ml-node+s1001560n20834...@n3.nabble.com wrote: hi dears! Is there some efficient way to drop first line of an RDD[String]? any suggestion?
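A minimal sketch of that approach (rdd is the assumed RDD[String]; only partition 0 can hold the first record):

    val withoutFirst = rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter  // drop the header from the first partition only
    }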

Re: MLlib + Streaming

2014-12-23 Thread Xiangrui Meng
We have streaming linear regression (since v1.1) and k-means (v1.2) in MLlib. You can check the user guide: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering -Xiangrui On Tue,
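A minimal sketch following the linked guide (the stream names and feature count are assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    // trainingData and testData are assumed DStream[LabeledPoint]s
    val numFeatures = 3
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))
    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()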

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Xiangrui Meng
Sean's PR may be relevant to this issue (https://github.com/apache/spark/pull/3702). As a workaround, you can try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643) before sending them to BinaryClassificationMetrics. This may not work well if the score distribution is very skewed. See
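A minimal sketch of that rounding workaround (scoreAndLabels is an assumed RDD[(Double, Double)] of raw score and label):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // Round each raw score to 4 decimal places so far fewer distinct
    // score values reach the per-score counters.
    val rounded = scoreAndLabels.map { case (score, label) =>
      (math.rint(score * 10000) / 10000.0, label)
    }
    val metrics = new BinaryClassificationMetrics(rounded)
    println(metrics.areaUnderROC())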

Re: removing first record from RDD[String]

2014-12-23 Thread Hafiz Mujadid
Yep Michael Quinlan, it's working as suggested by Hao Ren. Thanks to you and Hao Ren.

Spark Port Configuration

2014-12-23 Thread Dan H.
Hi all, I'm trying to lock down ALL Spark ports and have tried both spark-defaults.conf and the SparkContext. (The example below was run in local[*] mode, but all attempts to run in local mode, or via spark-submit.sh on the cluster via a jar, give the same results.) My goal is to define all
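For reference, a spark-defaults.conf sketch of the port properties the Spark 1.x configuration page documents (the specific port numbers here are arbitrary examples):

    spark.driver.port          7101
    spark.fileserver.port      7102
    spark.broadcast.port       7103
    spark.replClassServer.port 7104
    spark.blockManager.port    7105
    spark.executor.port        7106
    spark.ui.port              4040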

Re: MLlib + Streaming

2014-12-23 Thread Fernando O.
Hey Xiangrui, Is there any plan to have a streaming-compatible ALS version? Or, if it's currently doable, is there any example? On Tue, Dec 23, 2014 at 4:31 PM, Xiangrui Meng men...@gmail.com wrote: We have streaming linear regression (since v1.1) and k-means (v1.2) in MLlib. You can

Re: MLLib beginner question

2014-12-23 Thread boci
Xiangrui: Hi, I want to use this with streaming and with a job too. I am using Kafka (streaming) and Elasticsearch (job) as sources and want to calculate a sentiment value from the input text. Simon: great, do you have any doc on how I can embed it into my application without using the HTTP interface? (how can I

Re: retry in combineByKey at BinaryClassificationMetrics.scala

2014-12-23 Thread Sean Owen
Yes, my change is slightly downstream of this point in the processing though. The code is still creating a counter for each distinct score value, and then binning. I don't think that would cause a failure - just might be slow. At the extremes, you might see 'fetch failure' as a symptom of things

serialize protobuf messages

2014-12-23 Thread Chen Song
Silly question: what is the best way to shuffle protobuf messages in a Spark (Streaming) job? Can I use Kryo on top of the protobuf Message type? -- Chen Song
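One hedged possibility, not something the thread itself names: Twitter's chill-protobuf serializer registered through a Kryo registrator (MyProto stands in for a generated protobuf Message class, my.pkg for your package):

    import com.esotericsoftware.kryo.Kryo
    import com.twitter.chill.protobuf.ProtobufSerializer
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class ProtoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyProto], new ProtobufSerializer) // MyProto: placeholder
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "my.pkg.ProtoRegistrator")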

Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jon Chase
I've had a lot of difficulties with the s3:// prefix; s3n:// seems to work much better. Can't find the link ATM, but I seem to recall that s3:// (Hadoop's original block format for S3) is no longer recommended for use. Amazon's EMR goes so far as to remap the s3:// to s3n:// behind the
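For example, the s3n form is just a prefix change (bucket and path are placeholders):

    rdd.saveAsTextFile("s3n://my-bucket/output/run1")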

Re: JavaRDD (Data Aggregation) based on key

2014-12-23 Thread Jon Chase
Have a look at RDD.groupBy(...) and reduceByKey(...) On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh sachin.sha...@gmail.com wrote: Hi, I have a csv file having fields as a,b,c . I want to do aggregation(sum,average..) based on any field(a,b or c) as per user input, using Apache Spark Java
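In Scala for brevity (the JavaPairRDD mapToPair/reduceByKey methods mirror this); the file name and column indices are assumptions:

    val rows = sc.textFile("data.csv").map(_.split(","))
    // Sum of field c keyed by field a (indices 2 and 0); swap indices per user input.
    val sums = rows.map(r => (r(0), r(2).toDouble)).reduceByKey(_ + _)
    // Average per key: reduce (sum, count) pairs, then divide.
    val avgs = rows.map(r => (r(0), (r(2).toDouble, 1L)))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
      .mapValues { case (s, n) => s / n }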

SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Jerry Lam
Hi spark users, I'm trying to create external table using HiveContext after creating a schemaRDD and saving the RDD into a parquet file on hdfs. I would like to use the schema in the schemaRDD (rdd_table) when I create the external table. For example:

Re: S3 files , Spark job hungsup

2014-12-23 Thread Jon Chase
http://www.jets3t.org/toolkit/configuration.html Put the following properties in a file named jets3t.properties and make sure it is available while your Spark job runs (just place it in ~/ and pass a reference to it when calling spark-submit with --files ~/jets3t.properties)
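A sketch of that invocation (the class and jar names are placeholders):

    spark-submit --files ~/jets3t.properties --class com.example.MyJob my-job.jar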

Re: Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-23 Thread Jon Chase
Turns out I was using the s3:// prefix (in a standalone Spark cluster). It was writing a LOT of block_* files to my S3 bucket, which was the cause of the slowness. I was coming from Amazon EMR, where Amazon's underlying FS implementation has re-mapped s3:// to s3n://, which doesn't use the

weights not changed with different reg param

2014-12-23 Thread Thomas Kwan
Hi there, We are on MLlib 1.1.1 and trying different regularization parameters. We noticed that the regParam doesn't affect the weights at all. Is setting the reg param via the optimizer the right thing to do? Do we need to set our own updater? Anyone else seeing the same behaviour? Thanks again

Exception after changing RDDs

2014-12-23 Thread kai.lu
Hi All, We are getting an exception after adding one RDD to another. We first declared an empty RDD A, then received a new DStream B from Kafka; for each RDD in DStream B, we kept adding it to the existing RDD A. The error happened when we tried to use the updated RDD A. Could anybody

Re: S3 files , Spark job hungsup

2014-12-23 Thread durga
Hi All, It seems the problem is a little more complicated. If the job is hung up reading an S3 file, even if I kill the Unix process that started the job, the Spark job is not killed; it is still hung there. Now the questions are: How do I find a Spark job based on its name? How do I kill the

Re: SchemaRDD.sample problem

2014-12-23 Thread Cheng Lian
Here is a more cleaned-up version, which can be used in ./sbt/sbt hive/console to easily reproduce this issue: sql("SELECT * FROM src WHERE key % 2 = 0").sample(withReplacement = false, fraction = 0.05).registerTempTable("sampled") println(table("sampled").queryExecution) val query = sql("SELECT

Re: Single worker locked at 100% CPU

2014-12-23 Thread Andrew Ash
Hi Phil, This sounds a lot like a deadlock in Hadoop's Configuration object that I ran into a while back. If you jstack the JVM and see a thread that looks like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546 Executor task launch worker-6 daemon prio=10

Re: S3 files , Spark job hungsup

2014-12-23 Thread Denny Lee
You should be able to kill the job using the web UI or via spark-class. More info can be found in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html. HTH! On Tue, Dec 23, 2014 at 4:47 PM, durga durgak...@gmail.com wrote:
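For a driver launched on a standalone cluster, the spark-class route from that thread looks roughly like this (the master URL and driver ID are placeholders):

    ./bin/spark-class org.apache.spark.deploy.Client kill spark://master-host:7077 driver-20141223000000-0000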

Re: Spark SQL job block when use hive udf from_unixtime

2014-12-23 Thread Cheng Lian
Could you please provide a complete stack trace? It would also be good if you could share your hive-site.xml. On 12/23/14 4:42 PM, Dai, Kevin wrote: Hi there, When I use the Hive UDF from_unixtime with the HiveContext, the job blocks and the log is as follows: sun.misc.Unsafe.park(Native

Debugging a Spark application using Eclipse throws SecurityException

2014-12-23 Thread ey-chih chow
I am using Eclipse to develop a Spark application (using Spark 1.1.0). I use the ScalaTest framework to test the application. But I was blocked by the following exception: java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information

RE: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Cheng, Hao
Hi Lam, I can confirm this is a bug with the latest master, and I filed a JIRA issue for it: https://issues.apache.org/jira/browse/SPARK-4944 Hope to come up with a solution soon. Cheng Hao From: Jerry Lam [mailto:chiling...@gmail.com] Sent: Wednesday, December 24, 2014 4:26 AM To:

Re: Debugging a Spark application using Eclipse throws SecurityException

2014-12-23 Thread ey-chih chow
It's working now. Probably I didn't specify the exclusion list correctly. I kept revising it and now it's working. Thanks. Ey-Chih Chow

Re: Re: Serialization issue when using HBase with Spark

2014-12-23 Thread yangliuyu
Thanks all. Finally, the working code is below: object PlayRecord { def getUserActions(accounts: RDD[String], idType: Int, timeStart: Long, timeStop: Long, cacheSize: Int, filterSongDays: Int, filterPlaylistDays: Int): RDD[(String, (Int, Set[Long], Set[Long]))] = {

Re: How to run an action and get output?

2014-12-23 Thread Tobias Pfeiffer
Hi, On Fri, Dec 19, 2014 at 6:53 PM, Ashic Mahtab as...@live.com wrote: def doSomething(entry: SomeEntry, session: Session): SomeOutput = { val result = session.SomeOp(entry); SomeOutput(entry.Key, result.SomeProp) } I could use a transformation for rdd.map, but in case of failures,

How to build Spark against the latest

2014-12-23 Thread guxiaobo1982
Hi, The official pom.xml file only has a profile for Hadoop version 2.4 as the latest version, but I installed Hadoop 2.6.0 with Ambari. How can I build Spark against it: just using mvn -Dhadoop.version=2.6.0, or how do I make a corresponding profile for it? Regards, Xiaobo

Re: How to build Spark against the latest

2014-12-23 Thread Ted Yu
See http://search-hadoop.com/m/JW1q5Cew0j On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote: Hi, The official pom.xml file only have profile for hadoop version 2.4 as the latest version, but I installed hadoop version 2.6.0 with ambari, how can I build spark against it,
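For reference, the building-spark page for this era reuses the hadoop-2.4 profile for any later 2.x version, overriding hadoop.version (the -Pyarn flag is only needed for YARN support):

    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package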

Fwd: Mesos resource allocation

2014-12-23 Thread Tim Chen
Hi Josh, If you want to cap the amount of memory per executor in coarse-grained mode, then yes, you only get 240GB of memory as you mentioned. What's the reason you don't want to raise the amount of memory you use per executor? In coarse-grained mode the Spark executor is long-lived and it

Not Serializable exception when integrating SQL and Spark Streaming

2014-12-23 Thread bigdata4u
I am trying to use SQL over Spark Streaming using Java, but I am getting a serialization exception. public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("NumberCount"); JavaSparkContext jc = new JavaSparkContext(sparkConf); JavaStreamingContext jssc =

SchemaRDD to RDD[String]

2014-12-23 Thread Hafiz Mujadid
Hi dears! I want to convert a SchemaRDD into an RDD of String. How can we do that? Currently I am doing it like this, which is not converting correctly: there is no exception, but the resulting strings are empty. Here is my code: def SchemaRDDToRDD(schemaRDD: SchemaRDD): RDD[String] = { var
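A hedged sketch of one common approach: in Spark 1.x a Row is a Seq[Any], so mkString flattens each row (the separator is an arbitrary choice):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SchemaRDD

    def schemaRDDToRDD(schemaRDD: SchemaRDD): RDD[String] =
      schemaRDD.map(row => row.mkString(","))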

Re: Fetch Failure

2014-12-23 Thread Stefano Ghezzi
I've eliminated the fetch failures with these parameters (I don't know which one was the right one for the problem) passed to spark-submit, running with 1.2.0: --conf spark.shuffle.compress=false \ --conf spark.file.transferTo=false \ --conf spark.shuffle.manager=hash \ --conf

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-23 Thread Jianshi Huang
FYI, the latest Hive 0.14/Parquet will have column renaming support. Jianshi On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust mich...@databricks.com wrote: You might also try out the recently added support for views. On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com wrote:

Best to execute SQL in Streaming data

2014-12-23 Thread bigdata4u
Hi, I have an existing batch processing system which uses SQL queries to extract information from data. I want to replace it with a real-time system. I am coding in Java, and for using SQL on streaming data I found a few examples, but none of them is complete.

Re: Best to execute SQL in Streaming data

2014-12-23 Thread Akhil Das
Here are the Java API docs: https://spark.apache.org/docs/latest/api/java/index.html You can start with this example and convert it into Java (it's pretty straightforward): http://stackoverflow.com/questions/25484879/sql-over-spark-streaming E.g., in Scala: val sparkConf = new
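A hedged Scala sketch along the lines of the linked example (ssc and lines are assumed names for a StreamingContext and a DStream[String]):

    import org.apache.spark.sql.SQLContext

    case class Record(word: String)

    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion
    lines.foreachRDD { rdd =>
      // Register each micro-batch as a temp table, then query it with SQL.
      rdd.flatMap(_.split(" ")).map(Record(_)).registerTempTable("records")
      sqlContext.sql("SELECT word, COUNT(*) FROM records GROUP BY word")
        .collect().foreach(println)
    }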