Assuming there is a text file with an unknown number of columns, how would one
create a data frame? I have followed the example in the Spark docs where one
first creates an RDD of Rows, but it seems that you have to know the exact
number of columns in the file and can't just do this:
val rowRDD =
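
One way around this is to build the schema at runtime; here is a minimal sketch (Spark 1.3-era API; the path and tab delimiter are placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val lines = sc.textFile("path/file")
// infer the column count from the first line instead of hard-coding it
val numCols = lines.first().split("\t").length
val schema = StructType((1 to numCols).map(i => StructField(s"col$i", StringType, nullable = true)))
val rowRDD = lines.map(line => Row.fromSeq(line.split("\t").toSeq))
val df = sqlContext.createDataFrame(rowRDD, schema)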
Customized spark-streaming-kafka_2.10-1.1.0.jar. Included a new method in the
KafkaUtils class to handle the byte array format. That helped.
-
Thanks,
Yamini
bq. HBase scan operation like scan StartROW and EndROW in RDD?
I don't think RDD supports the concept of a start row and an end row.
In HBase, please take a look at the following methods of Scan:
public Scan setStartRow(byte[] startRow)
public Scan setStopRow(byte[] stopRow)
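
For example, a minimal sketch of wiring such a Scan into newAPIHadoopRDD (table name and row keys are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

val scan = new Scan()
scan.setStartRow(Bytes.toBytes("row-0001"))   // inclusive
scan.setStopRow(Bytes.toBytes("row-0999"))    // exclusive

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])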
Cheers
On Sun,
Could you try `sbt package` or `sbt compile` and see whether there are
errors? It seems that you haven't reached the ALS code yet. -Xiangrui
On Sat, Apr 4, 2015 at 5:06 AM, Phani Yadavilli -X (pyadavil)
pyada...@cisco.com wrote:
Hi ,
I am trying to run the following command in the Movie
I am already using STARTROW and ENDROW in HBase from newAPIHadoopRDD.
Can I do something similar with an RDD? Let's say, use filter on the RDD to get
only those records which match the same criteria specified in STARTROW and
STOPROW. Will it be much faster than querying HBase?
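
For reference, a minimal sketch of that filter-based approach (reusing the hbaseRDD sketched earlier in this thread; row keys are placeholders). Note it still scans the whole table and filters on the Spark side, so the server-side STARTROW/STOPROW pruning will usually be faster, not slower:

import org.apache.hadoop.hbase.util.Bytes

val bounded = hbaseRDD.filter { case (key, _) =>
  val k = key.copyBytes()
  Bytes.compareTo(k, Bytes.toBytes("row-0001")) >= 0 &&
    Bytes.compareTo(k, Bytes.toBytes("row-0999")) < 0
}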
On 6 April 2015 at 03:15, Ted Yu
Thanks Tathagata for the explanation!
bit1...@163.com
From: Tathagata Das
Date: 2015-04-04 01:28
To: Ted Yu
CC: bit1129; user
Subject: Re: About Waiting batches on the spark streaming UI
Maybe that should be marked as waiting as well. Will keep that in mind. We plan
to update the UI soon, so
You do need to apply the patch since 0.96 doesn't have this feature.
For JavaSparkContext.newAPIHadoopRDD, can you check region server metrics
to see where the overhead might be (compared to creating scan and firing
query using native client) ?
Thanks
On Sun, Apr 5, 2015 at 2:00 PM, Jeetendra
Sure I will check.
On 6 April 2015 at 02:45, Ted Yu yuzhih...@gmail.com wrote:
You do need to apply the patch since 0.96 doesn't have this feature.
For JavaSparkContext.newAPIHadoopRDD, can you check region server metrics
to see where the overhead might be (compared to creating scan and
Hi Xiangrui,
Thank you for the response. I tried both sbt package and sbt compile; both
commands give me a success result:
sbt compile
[info] Set current project to machine-learning (in build
file:/opt/mapr/spark/spark-1.2.1/SparkTraining/machine-learning/)
[info] Updating
The runtime attempts to serialize everything required by records, and also
any lambdas/closures you use. Small, simple types are less likely to run
into this problem.
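
For illustration, a small sketch with a made-up, non-serializable Parser class:

class Parser { def parse(s: String): Int = s.trim.toInt }   // not Serializable

val parser = new Parser
// sc.textFile("data.txt").map(parser.parse)   // fails: Task not serializable,
// because the lambda captures the whole parser instance

val parsed = sc.textFile("data.txt").mapPartitions { lines =>
  val p = new Parser   // created on the executor, never serialized
  lines.map(p.parse)
}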
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe
What would be the most efficient, neat method to add a column with row IDs to a
DataFrame?
I can think of something like the code below, but it completes with errors (at
line 3), and anyway it doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to
Sorry, it should be toDF("text", "id").
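
Putting the corrected one-liner together, a minimal sketch ("path/file" is a placeholder); zipWithIndex pairs each element with its index, so the element column comes first:

import sqlContext.implicits._

val df = sc.textFile("path/file")
  .zipWithIndex()          // RDD[(String, Long)]
  .toDF("text", "id")
df.select("id", "text").show()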
On Sun, Apr 5, 2015 at 9:21 PM, Xiangrui Meng men...@gmail.com wrote:
Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui
On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote:
What would be the most efficient, neat method to
Try: sc.textFile("path/file").zipWithIndex().toDF("id", "text") -Xiangrui
On Sun, Apr 5, 2015 at 7:50 PM, olegshirokikh o...@solver.com wrote:
What would be the most efficient, neat method to add a column with row IDs to a
DataFrame?
I can think of something like the code below, but it completes with errors (at
The DAG can't change. You can create many DStreams, but they have to
belong to one StreamingContext. You can try these things to see.
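
A small sketch of that constraint (hosts and ports are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val s1 = ssc.socketTextStream("host1", 9999)   // many DStreams are fine...
val s2 = ssc.socketTextStream("host2", 9999)   // ...on the same context
s1.union(s2).count().print()

ssc.start()            // the DAG cannot change after this point
ssc.awaitTermination()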
On Sun, Apr 5, 2015 at 2:13 AM, nickos168 nickos...@yahoo.com.invalid wrote:
I have two questions:
1) In a Spark Streaming program, after the various DStream
Hi All,
I configured a Kafka cluster on a single node, and I have a streaming
application which reads data from a Kafka topic using KafkaUtils. When I
execute the code in local mode from the IDE, the application runs fine.
But when I submit the same to a Spark cluster in standalone mode, I end up
How are you submitting the application? Use a standard build tool like
Maven or sbt to build your project; it will download all the dependency
jars. When you submit your application (if you are using spark-submit),
use the --jars option to add those jars which are causing the
ClassNotFoundException.
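
For example, a hypothetical spark-submit invocation (class name, master URL, and jar versions are placeholders):

./bin/spark-submit \
  --class com.example.KafkaStreamingApp \
  --master spark://master:7077 \
  --jars spark-streaming-kafka_2.10-1.2.1.jar,kafka_2.10-0.8.1.1.jar \
  target/my-app.jar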
Hello,
Just because you receive the log files hourly does not mean that you have to
use Spark Streaming. Spark Streaming is often used when you receive new events
each minute/second, potentially at an irregular frequency. Of course, your
analysis window can be larger.
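
For instance, a sketch with a batch interval much smaller than the analysis window (values are placeholders):

import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Minutes(1))
ssc.checkpoint("checkpoint")   // required by the windowed count below

val events = ssc.textFileStream("hdfs:///incoming")
events.countByWindow(Minutes(60), Minutes(1)).print()   // hourly window, sliding every minute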
I think your use case justifies
Reading Sandy's blog, there seems to be one typo.
bq. Similarly, the heap size can be controlled with the --executor-cores flag
or the spark.executor.memory property.
'--executor-memory' should be the right flag.
BTW
bq. It defaults to max(384, .07 * spark.executor.memory)
Default memory
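
To make that concrete with the quoted formula: with --executor-memory 10g, the
overhead is max(384, 0.07 * 10240) = 717 MB, so the container requested from
YARN is roughly 10240 + 717 = 10957 MB.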
Hi,
I have a requirement in which I plan to use Spark Streaming.
I am supposed to calculate the access count for certain webpages. I receive
the webpage access information through log files.
By access count I mean how many times the page was accessed *till now*.
I have the log files for the past 2
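
For the *till now* part, a hedged sketch using updateStateByKey (the log-parsing helper, paths, and batch interval are placeholders):

import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Minutes(1))
ssc.checkpoint("checkpoint")   // required for stateful operations

// placeholder: extract the page URL from a common-log-format line
def pageOf(line: String): String = line.split(" ")(6)

val counts = ssc.textFileStream("hdfs:///logs")
  .map(line => (pageOf(line), 1))
  .updateStateByKey[Int] { (batch: Seq[Int], total: Option[Int]) =>
    Some(total.getOrElse(0) + batch.sum)   // running count across all batches
  }
counts.print()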
Are you pre-caching them in memory?
On Apr 4, 2015 3:29 AM, SamyaMaiti samya.maiti2...@gmail.com wrote:
Reduce *spark.sql.shuffle.partitions* from the default of 200 to the total
number of cores.
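
One way to apply that suggestion (a sketch; 8 stands in for the total number of cores):

sqlContext.setConf("spark.sql.shuffle.partitions", "8")
// or: sqlContext.sql("SET spark.sql.shuffle.partitions=8")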
Hi all,
Below is the output that I am getting. My Kinesis stream has 1 shard, and
my Spark cluster on EC2 has 2 slaves (I think that's fine?).
I should mention that my Kinesis producer is written in Python where I
followed the example
Hi
Can somebody explain to me the difference between the foreach and foreachAsync
actions on an RDD? Which one will give a good result / maximum throughput?
Does foreach run in a parallel way?
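
For reference, a small sketch of the difference (both run the function on the partitions in parallel across the cluster; the async variant only changes whether the driver thread blocks):

import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.apache.spark.SparkContext._   // for foreachAsync on pre-1.3 versions

val rdd = sc.parallelize(1 to 1000)

rdd.foreach(println)                 // blocks the driver until the job finishes

val f = rdd.foreachAsync(println)    // returns a FutureAction immediately
// the driver can submit other jobs here ...
Await.ready(f, Duration.Inf)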
For a class project, I am trying to get 2 Spark applications to communicate
with each other by passing an RDD object that was created by one application to
another Spark application. The first application is developed in Scala and
creates an RDD and sends it to the 2nd application over the
Looks like MultiRowRangeFilter would serve your need.
See HBASE-11144.
HBase 1.1 would be released in May.
You can also backport it to the HBase release you're using.
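
A hedged sketch of what that could look like once you have a release with HBASE-11144 (row keys are placeholders):

import java.util.Arrays
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange
import org.apache.hadoop.hbase.util.Bytes

val ranges = Arrays.asList(
  new RowRange(Bytes.toBytes("a"), true, Bytes.toBytes("d"), false),
  new RowRange(Bytes.toBytes("m"), true, Bytes.toBytes("p"), false))

val scan = new Scan()
scan.setFilter(new MultiRowRangeFilter(ranges))   // one Scan, several disjoint ranges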
On Sat, Apr 4, 2015 at 8:45 AM, Jeetendra Gangele gangele...@gmail.com
wrote:
Here is my conf object, passing the first parameter