Printing the RDDs in SparkPageRank

2014-08-24 Thread Deep Pradhan
Hi, I was going through the SparkPageRank code and want to see the intermediate steps, i.e. the RDDs formed along the way. Here is a part of the code along with the lines that I added in order to print the RDDs. I want to print the *parts* in the code (denoted by the comment in bold
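The code in the message is elided, but the shape of the question can be sketched in plain Python: a minimal PageRank loop (hypothetical three-page link graph, not the poster's data) that prints the intermediate ranks each iteration, which is what collecting and printing an RDD per iteration would show in Spark.

```python
# Plain-Python sketch of the SparkPageRank iteration loop, printing the
# intermediate "ranks" state each round. The link graph here is a made-up
# example; in Spark the analogous step is collecting the ranks RDD each
# iteration and printing it.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in links}

for i in range(10):
    # Each page splits its rank evenly among its outgoing links.
    contribs = {}
    for page, neighbors in links.items():
        share = ranks[page] / len(neighbors)
        for n in neighbors:
            contribs[n] = contribs.get(n, 0.0) + share
    # Standard damping: 0.15 base plus 0.85 of received contributions.
    ranks = {page: 0.15 + 0.85 * contribs.get(page, 0.0) for page in links}
    print(f"iteration {i}: {ranks}")
```

With the 0.15/0.85 damping above, the total rank mass stays at the number of pages, so the printed dictionaries converge quickly.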

Re: Spark Streaming with Flume event

2014-08-24 Thread Spidy
Anybody? An example of how to deserialize FlumeEvent data using Scala -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Flume-event-tp12569p12709.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to make Spark Streaming write its output so that Impala can read it?

2014-08-24 Thread rafeeq s
I have the following problem with the Spark Streaming API. I am currently streaming input data via Kafka to Spark Streaming, with which I plan to do some preprocessing of the data. Then, I'd like to save the data as Parquet files and query them with Impala. However, Spark is writing the data

Re: Spark SQL Parser error

2014-08-24 Thread S Malligarjunan
Hello Yin, An additional note: with ./bin/spark-shell --jars s3n:/mybucket/myudf.jar I got the following message in the console: Warning: skipped external jar...   Thanks and Regards, Sankar S.   On , S Malligarjunan smalligarju...@yahoo.com wrote: Hello Yin, I have tried to use sc.addJar and

Re: Printing the RDDs in SparkPageRank

2014-08-24 Thread Jörn Franke
Hi, What kind of error do you receive? Best regards, Jörn On 24 Aug 2014 08:29, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I was going through the SparkPageRank code and want to see the intermediate steps, like the RDDs formed in the intermediate steps. Here is a part of the

Spark Streaming API and Performance Clarifications

2014-08-24 Thread didi
I am new to Spark Streaming and have some questions that I can't find any documentation to answer. I believe a lot of Spark users in general, and Spark Streaming users in particular, use it for analysis of events by computing large distributed aggregations. In case I have to digest

amp lab spark streaming twitter example

2014-08-24 Thread Forest D
Hi folks, I have been trying to run the AMPLab’s Twitter streaming example (http://ampcamp.berkeley.edu/big-data-mini-course/realtime-processing-with-spark-streaming.html) for the last 2 days. I have encountered the same error messages as shown below: 14/08/24 17:14:22 ERROR

Re: amp lab spark streaming twitter example

2014-08-24 Thread Jonathan Haddad
Could you be hitting this? https://issues.apache.org/jira/browse/SPARK-3178 On Sun, Aug 24, 2014 at 10:21 AM, Forest D dev24a...@gmail.com wrote: Hi folks, I have been trying to run the AMPLab’s twitter streaming example

Return multiple [K,V] pairs from a Java Function

2014-08-24 Thread Tom
Hi, I would like to create multiple key-value pairs, where all keys can still be reduced. For instance, I have the following 2 lines: A,B,C and B,D. I would like to return the following pairs for the first line: (A,B) (A,C) (B,A) (B,C) (C,A) (C,B), and for the second: (B,D) (D,B). After a reduce by key, I want to end

Re: Return multiple [K,V] pairs from a Java Function

2014-08-24 Thread Sean Owen
You are looking for the method flatMapToPair. It takes a PairFlatMapFunction, which is something that returns an Iterable of Tuple2 of K,V. You end up with a JavaPairRDD of K and V as desired. On Sun, Aug 24, 2014 at 9:15 PM, Tom thubregt...@gmail.com wrote: Hi, I would like to create multiple
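The logic Sean describes (flatMapToPair followed by a reduce by key) can be sketched in plain Python with Tom's example lines. The final reduction Tom wants is elided in his message, so counting values per key below is just an illustrative stand-in.

```python
from collections import defaultdict
from itertools import permutations

# The "flatMap" step Sean describes (flatMapToPair in the Java API):
# each input line yields every ordered pair of its tokens.
lines = [["A", "B", "C"], ["B", "D"]]
pairs = [p for tokens in lines for p in permutations(tokens, 2)]
# pairs: [('A','B'), ('A','C'), ('B','A'), ('B','C'), ('C','A'), ('C','B'),
#         ('B','D'), ('D','B')]

# A stand-in "reduceByKey" step: here we simply count values per key,
# since the reduction Tom actually wants is cut off in the message.
counts = defaultdict(int)
for k, v in pairs:
    counts[k] += 1
```

In the Java API, the first step is a `flatMapToPair` whose `PairFlatMapFunction` returns that iterable of tuples, and the second is `reduceByKey` on the resulting `JavaPairRDD`.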

Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-24 Thread Xiangrui Meng
Thanks for the reference! Many tests are not designed for big data: http://magazine.amstat.org/blog/2010/09/01/statrevolution/ . So we need to understand which tests are proper. Feel free to create a JIRA and let's move our discussion there. -Xiangrui On Fri, Aug 22, 2014 at 8:44 PM, guxiaobo1982

pipe raw binary data

2014-08-24 Thread Emeric, Viel
Hello, I am trying to use the RDD pipe method to integrate Spark with external commands to be executed on each partition. My program roughly looks like: rdd.pipe(cmd1).pipe(cmd2) The output of cmd1 and input of cmd2 is raw binary data. However, the pipe method in RDD requires converting data
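Since RDD.pipe converts records to and from text lines, a common workaround is to drive the external command yourself from within each partition, passing raw bytes through the subprocess's stdin/stdout. A minimal sketch of that per-partition step, in plain Python (the `cat` command stands in for the poster's cmd1/cmd2):

```python
import subprocess

def pipe_binary(partition_bytes, cmd):
    """Feed raw bytes to an external command and return its raw stdout,
    with no line-based text conversion (unlike RDD.pipe)."""
    proc = subprocess.run(cmd, input=partition_bytes,
                          stdout=subprocess.PIPE, check=True)
    return proc.stdout

# Hypothetical binary payload with NUL bytes and embedded newlines intact.
data = b"\x00\x01binary\nchunk\x02"
out = pipe_binary(data, ["cat"])
```

In Spark, the equivalent would be calling a helper like this from inside `mapPartitions`, so each partition's bytes flow through the command untouched; chaining two commands is then two such calls back to back.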

Spark Stream + HDFS Append

2014-08-24 Thread Dean Chen
We are using HDFS for log storage where logs are flushed to HDFS every minute, with a new file created for each hour. We would like to consume these logs using Spark Streaming.  The docs state that new HDFS files will be picked up, but does Spark Streaming support HDFS appends? — Dean Chen

Re: Spark Stream + HDFS Append

2014-08-24 Thread Tobias Pfeiffer
Hi, On Mon, Aug 25, 2014 at 9:56 AM, Dean Chen deanch...@gmail.com wrote: We are using HDFS for log storage where logs are flushed to HDFS every minute, with a new file created for each hour. We would like to consume these logs using spark streaming. The docs state that new HDFS will be
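Spark Streaming's file-based sources pick up newly created files in the monitored directory, not new data appended to an existing file, so consuming hourly files that are appended to every minute needs some form of offset tracking. A plain-Python sketch of that idea (a hypothetical helper, not a Spark API):

```python
# Sketch: remember how far we've read in each log file, so data appended
# since the last poll (the HDFS-append pattern in this thread) can be
# consumed incrementally. This is a hypothetical helper illustrating what
# Spark's new-files-only file stream does NOT do for you.
offsets = {}

def read_new_bytes(path):
    """Return only the bytes written to `path` since the previous call."""
    pos = offsets.get(path, 0)
    with open(path, "rb") as f:
        f.seek(pos)
        chunk = f.read()
    offsets[path] = pos + len(chunk)
    return chunk
```

The common alternative, which fits the file-stream model directly, is to flush each minute's batch as a separate new file instead of appending.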

Re: multiple windows from the same DStream ?

2014-08-24 Thread Tobias Pfeiffer
Hi, computations are triggered by an output operation. No output operation, no computation. Therefore in your code example, On Thu, Aug 21, 2014 at 11:58 PM, Josh J joshjd...@gmail.com wrote: JavaPairReceiverInputDStream&lt;String, String&gt; messages =