Create DataFrame from textFile with unknown columns

2015-04-05 Thread olegshirokikh
Assuming there is a text file with an unknown number of columns, how would one create a DataFrame? I have followed the example in the Spark docs where one first creates an RDD of Rows, but it seems that you have to know the exact number of columns in the file and can't just do this: val rowRDD =
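
One way to handle a file whose column count is only known at runtime is to build the schema programmatically. A minimal sketch against the Spark 1.3-era API, assuming a hypothetical comma-delimited file with string-typed columns:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType}

    val lines = sc.textFile("path/file")                 // hypothetical path
    val nCols = lines.first().split(",").length          // infer the column count from the first line
    val schema = StructType((1 to nCols).map(i => StructField(s"c$i", StringType, nullable = true)))
    val rowRDD = lines.map(l => Row.fromSeq(l.split(",", -1).toSeq))
    val df = sqlContext.createDataFrame(rowRDD, schema)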

Re: NoSuchMethodException KafkaUtils.

2015-04-05 Thread Yamini
I customized spark-streaming-kafka_2.10-1.1.0.jar and included a new method in the KafkaUtils class to handle the byte array format. That helped. - Thanks, Yamini

Re: newAPIHadoopRDD Multiple scan result return from HBase

2015-04-05 Thread Ted Yu
bq. HBase scan operation like scan StartROW and EndROW in RDD? I don't think RDD supports the concept of a start row and an end row. In HBase, please take a look at the following methods of Scan: public Scan setStartRow(byte[] startRow) and public Scan setStopRow(byte[] stopRow). Cheers On Sun,
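
For context, a minimal sketch of how such a range scan is typically handed to newAPIHadoopRDD, with the start/stop rows set on the HBase Scan itself (the table name and row keys are hypothetical):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    import org.apache.hadoop.hbase.util.Bytes

    val scan = new Scan()
    scan.setStartRow(Bytes.toBytes("row-000"))   // hypothetical start row (inclusive)
    scan.setStopRow(Bytes.toBytes("row-999"))    // hypothetical stop row (exclusive)

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table name
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])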

Re: Need help with ALS Recommendation code

2015-04-05 Thread Xiangrui Meng
Could you try `sbt package` or `sbt compile` and see whether there are errors? It seems that you haven't reached the ALS code yet. -Xiangrui On Sat, Apr 4, 2015 at 5:06 AM, Phani Yadavilli -X (pyadavil) pyada...@cisco.com wrote: Hi , I am trying to run the following command in the Movie

Re: newAPIHadoopRDD Multiple scan result return from HBase

2015-04-05 Thread Jeetendra Gangele
I am already using STARTROW and ENDROW in HBase from newAPIHadoopRDD. Can I do something similar with an RDD? Let's say I use a filter on the RDD to get only those records which match the same criteria as in the STARTROW and STOPROW. Will it be much faster than HBase querying? On 6 April 2015 at 03:15, Ted Yu
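
To make the trade-off concrete, a hedged sketch of the RDD-side alternative being asked about, assuming hbaseRDD was created as in the previous sketch. A Spark filter works, but every row still has to be read out of HBase first, whereas STARTROW/STOPROW let the region servers skip the data entirely:

    import org.apache.hadoop.hbase.util.Bytes

    val filtered = hbaseRDD.filter { case (key, _) =>
      val row = Bytes.toString(key.get(), key.getOffset, key.getLength)
      row >= "row-000" && row < "row-999"        // hypothetical key range
    }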

Re: Re: About Waiting batches on the spark streaming UI

2015-04-05 Thread bit1...@163.com
Thanks Tathagata for the explanation! bit1...@163.com From: Tathagata Das Date: 2015-04-04 01:28 To: Ted Yu CC: bit1129; user Subject: Re: About Waiting batches on the spark streaming UI Maybe that should be marked as waiting as well. Will keep that in mind. We plan to update the ui soon, so

Re: newAPIHadoopRDD Multiple scan result return from HBase

2015-04-05 Thread Ted Yu
You do need to apply the patch since 0.96 doesn't have this feature. For JavaSparkContext.newAPIHadoopRDD, can you check the region server metrics to see where the overhead might be (compared to creating a scan and firing the query using the native client)? Thanks On Sun, Apr 5, 2015 at 2:00 PM, Jeetendra

Re: newAPIHadoopRDD Multiple scan result return from HBase

2015-04-05 Thread Jeetendra Gangele
Sure I will check. On 6 April 2015 at 02:45, Ted Yu yuzhih...@gmail.com wrote: You do need to apply the patch since 0.96 doesn't have this feature. For JavaSparkContext.newAPIHadoopRDD, can you check region server metrics to see where the overhead might be (compared to creating scan and

RE: Need help with ALS Recommendation code

2015-04-05 Thread Phani Yadavilli -X (pyadavil)
Hi Xiangrui, Thank you for the response. I tried sbt package and sbt compile; both commands give me a success result. sbt compile [info] Set current project to machine-learning (in build file:/opt/mapr/spark/spark-1.2.1/SparkTraining/machine-learning/) [info] Updating

Re: conversion from java collection type to scala JavaRDDObject

2015-04-05 Thread Dean Wampler
The runtime attempts to serialize everything required by records, and also any lambdas/closures you use. Small, simple types are less likely to run into this problem. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe
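
A minimal illustration of the closure-capture issue being described (class and field names are hypothetical): referencing a field of a non-serializable enclosing object pulls the whole object into the closure, while copying the field into a local val keeps the closure small.

    class Lookup {                                  // not Serializable
      val prefix = "user-"
      def tag(rdd: org.apache.spark.rdd.RDD[String]) = {
        // rdd.map(prefix + _)                      // captures `this` -> NotSerializableException
        val p = prefix                              // copy the field into a local val
        rdd.map(p + _)                              // only the small String gets serialized
      }
    }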

Add row IDs column to data frame

2015-04-05 Thread olegshirokikh
What would be the most efficient and neat method to add a column with row IDs to a DataFrame? I can think of something like the below, but it completes with errors (at line 3), and anyway doesn't look like the best route possible: var dataDF = sc.textFile("path/file").toDF() val rowDF = sc.parallelize(1 to

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Sorry, it should be toDF("text", "id"). On Sun, Apr 5, 2015 at 9:21 PM, Xiangrui Meng men...@gmail.com wrote: Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote: What would be the most efficient and neat method to

Re: Add row IDs column to data frame

2015-04-05 Thread Xiangrui Meng
Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote: What would be the most efficient and neat method to add a column with row IDs to a DataFrame? I can think of something as below, but it completes with errors (at
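
Putting the two replies together, a minimal sketch (Spark 1.3-style, assuming a hypothetical path and that the SQLContext implicits are in scope); zipWithIndex pairs each line with its index, so the column names must follow that (value, index) order:

    import sqlContext.implicits._

    val withIds = sc.textFile("path/file")
      .zipWithIndex()          // RDD[(String, Long)]: (line, rowId)
      .toDF("text", "id")      // column names follow the tuple order
    withIds.show()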

Re: Spark Streaming program questions

2015-04-05 Thread Sean Owen
The DAG can't change. You can create many DStreams, but they have to belong to one StreamingContext. You can try these things to see. On Sun, Apr 5, 2015 at 2:13 AM, nickos168 nickos...@yahoo.com.invalid wrote: I have two questions: 1) In a Spark Streaming program, after the various DStream
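
A minimal sketch of the point about DStreams and the StreamingContext (hosts and ports are hypothetical); both streams are created from the same context, and no new DStreams can be added once start() has been called:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val a = ssc.socketTextStream("host1", 9999)   // hypothetical source
    val b = ssc.socketTextStream("host2", 9999)   // hypothetical source
    a.union(b).count().print()
    ssc.start()                                   // the DStream graph is fixed from this point on
    ssc.awaitTermination()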

Spark streaming with Kafka - couldn't find KafkaUtils

2015-04-05 Thread Priya Ch
Hi All, I configured a Kafka cluster on a single node, and I have a streaming application which reads data from a Kafka topic using KafkaUtils. When I execute the code in local mode from the IDE, the application runs fine. But when I submit the same to the Spark cluster in standalone mode, I end up

Re: Spark streaming with Kafka - couldn't find KafkaUtils

2015-04-05 Thread Akhil Das
How are you submitting the application? Use a standard build tool like Maven or sbt to build your project; it will download all the dependency jars. When you submit your application (if you are using spark-submit), use the --jars option to add those jars which are causing the ClassNotFoundException.
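
A hypothetical spark-submit invocation along those lines; the class name, master URL, and jar names/versions are assumptions and must match what is actually on the cluster:

    spark-submit \
      --class com.example.MyStreamingApp \
      --master spark://master-host:7077 \
      --jars spark-streaming-kafka_2.10-1.3.0.jar,kafka_2.10-0.8.1.1.jar \
      my-streaming-app.jar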

Re: Pseudo Spark Streaming ?

2015-04-05 Thread Jörn Franke
Hallo, only because you receive the log files hourly it does not mean that you have to use Spark Streaming. Spark Streaming is often used if you receive new events each minute/second, potentially at an irregular frequency. Of course your analysis window can be larger. I think your use case justifies

Re: input size too large | Performance issues with Spark

2015-04-05 Thread Ted Yu
Reading Sandy's blog, there seems to be one typo. bq. Similarly, the heap size can be controlled with the --executor-cores flag or the spark.executor.memory property. '--executor-memory' should be the right flag. BTW bq. It defaults to max(384, .07 * spark.executor.memory) Default memory

Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi, I have a requirement in which I plan to use Spark Streaming. I am supposed to calculate the access count for certain webpages. I receive the webpage access information through log files. By access count I mean how many times the page was accessed *till now*. I have the log files for the past 2
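
One way to keep a running "till now" count per page in Spark Streaming is updateStateByKey; a hedged sketch, assuming the log lines can be parsed to a page URL (extractPage is a hypothetical parser) and that a checkpoint directory is available, since stateful operations require checkpointing:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    ssc.checkpoint("hdfs:///tmp/checkpoints")                // hypothetical path; required for state

    val lines = ssc.textFileStream("hdfs:///logs/incoming")  // hypothetical directory receiving new log files
    val hits  = lines.map(line => (extractPage(line), 1L))   // extractPage: hypothetical parser

    val totals = hits.updateStateByKey[Long] { (newCounts: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newCounts.sum)              // running count "till now"
    }
    totals.print()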

Re: 4 seconds to count 13M lines. Does it make sense?

2015-04-05 Thread Horia
Are you pre-caching them in memory? On Apr 4, 2015 3:29 AM, SamyaMaiti samya.maiti2...@gmail.com wrote: Reduce *spark.sql.shuffle.partitions* from the default of 200 to the total number of cores.
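
A minimal sketch of the two suggestions in this thread (the path and the partition count are assumptions, not recommendations):

    val lines = sc.textFile("path/file")                      // hypothetical path
    lines.cache()                                             // pre-cache in memory
    lines.count()                                             // first count reads from disk and fills the cache
    lines.count()                                             // later counts read from memory
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")   // e.g. the total number of cores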

Re: Spark + Kinesis

2015-04-05 Thread Vadim Bichutskiy
Hi all, Below is the output that I am getting. My Kinesis stream has 1 shard, and my Spark cluster on EC2 has 2 slaves (I think that's fine?). I should mention that my Kinesis producer is written in Python where I followed the example

Diff between foreach and foreachAsync

2015-04-05 Thread Jeetendra Gangele
Hi, can somebody explain to me the difference between foreach and foreachAsync as RDD actions? Which one will give a good result / maximum throughput? Does foreach run in a parallel way?
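
For reference, a hedged sketch of the two actions being compared (process is a hypothetical function): both run the function on the executors across all partitions in parallel; the difference is that foreach blocks the driver until the job finishes, while foreachAsync returns a FutureAction immediately.

    import scala.concurrent.Await
    import scala.concurrent.duration._

    val rdd = sc.parallelize(1 to 1000000)

    rdd.foreach(x => process(x))                  // blocking action
    val fut = rdd.foreachAsync(x => process(x))   // returns a FutureAction right away
    // ... the driver can do other work here ...
    Await.result(fut, 10.minutes)                 // wait for the async job to complete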

Sending RDD object over the network

2015-04-05 Thread raggy
For a class project, I am trying to have 2 Spark applications communicate with each other by passing an RDD object that was created by one application to another Spark application. The first application is developed in Scala and creates an RDD and sends it to the 2nd application over the

Re: newAPIHadoopRDD Multiple scan result return from HBase

2015-04-05 Thread Ted Yu
Looks like MultiRowRangeFilter would serve your need. See HBASE-11144. HBase 1.1 would be released in May. You can also backport it to the HBase release you're using. On Sat, Apr 4, 2015 at 8:45 AM, Jeetendra Gangele gangele...@gmail.com wrote: Here is my conf object passing first parameter
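
A hedged sketch of what using MultiRowRangeFilter might look like on a version of HBase that includes HBASE-11144 (row keys and ranges are hypothetical):

    import java.util.Arrays
    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange
    import org.apache.hadoop.hbase.util.Bytes

    val ranges = Arrays.asList(
      new RowRange(Bytes.toBytes("a-000"), true, Bytes.toBytes("a-999"), false),
      new RowRange(Bytes.toBytes("k-000"), true, Bytes.toBytes("k-999"), false))

    val scan = new Scan()
    scan.setFilter(new MultiRowRangeFilter(ranges))
    // Serialize the scan into the Hadoop conf and pass it to newAPIHadoopRDD as in the earlier sketch.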