Hi,
You can use the pipe operator if you are running a shell or Perl script on
some data. More information is on my blog:
http://blog.madhukaraphatak.com/pipe-in-spark/.
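For reference, a minimal sketch of RDD.pipe (the script path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipeExample"))
    val data = sc.parallelize(Seq("line1", "line2", "line3"))
    // Each element is written to the script's stdin, one per line;
    // each line the script prints to stdout becomes an element of the result.
    val piped = data.pipe("/path/to/script.sh") // hypothetical script
    piped.collect().foreach(println)
    sc.stop()
  }
}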
Regards,
Madhukara Phatak
http://datamantra.io/
On Mon, May 25, 2015 at 8:02 AM, luohui20...@sina.com wrote:
Thanks Akhil,
Hi all
I recently started learning the Spark source code. I cloned the Spark code
from Git, executed the sbt gen-idea command, and imported the project into
IntelliJ, but got the error below:
Could anyone help me? The Spark version is 1.4 and the operating system is
Windows 7.
Hi,
I am sure you can use Spark for this, but it seems like a problem that should be
delegated to a text-based indexing technology like Elasticsearch, or something
based on Lucene, to serve the requests. Spark can be used to prepare the data
that is fed to the indexing service.
Using spark
Hi all,
I only have one stage, which is a mapToPair, and inside the function I have a
for loop which runs about 133433 times.
But then it becomes slow; when I replace 133433 with just 133, it works very
fast.
But I think this is just a simple operation, even in plain Java.
You can look at the
I am not sure what happened. According to your snapshot, it just shows a warning
message instead of an error. But I suggest you try using Maven with: mvn
idea:idea.
On Monday, May 25, 2015 2:48 PM, huangzheng 1106944...@qq.com wrote:
Yes, Spark will be useful for the following areas of your application:
1. Running the same function on every CV in parallel and scoring it
2. Improving the scoring function through better access to classification and
clustering algorithms, within and beyond MLlib.
These are the first benefits you can start with, and then
Hi, Ankur!
Thanks for your reply!
CVs are just a bunch of IDs; each ID represents some object of some class
(e.g. class=JOB, object=SW Developer). We have already processed the texts and
extracted all the facts, so we don't need to do any text processing in Spark,
just to run the scoring function on many, many
I was digging into the possibilities for WebSphere MQ as a data source for
Spark Streaming because it is needed in one of our use cases. I got to know
that MQTT (http://mqtt.org/) is the protocol that supports
communication with MQ data structures, but since I am a newbie to Spark
Streaming I
thanks, madhu and Akhil
I modified my code as below; however, I think it is not so distributed. Do
you guys have a better idea to run this app more efficiently and in a more
distributed way?
I have added some comments with my understanding:
import org.apache.spark._
import www.celloud.com.model._
object GeneCompare3 {
Hi All,
I am new to Spark and trying to execute Spark SQL through Java code as
below:
package com.ce.sql;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import
Hi Umesh,
You can connect Spark Streaming with MQTT; refer to the example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala
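In short, it looks like this (a minimal sketch; the broker URL and topic are
placeholders, and the spark-streaming-mqtt external module must be on the
classpath):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

val conf = new SparkConf().setAppName("MQTTExample")
val ssc = new StreamingContext(conf, Seconds(2))
// Subscribe to a topic on the broker; each MQTT message arrives as a string
val lines = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "some/topic")
lines.print()
ssc.start()
ssc.awaitTermination()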
Thanks
Arush
On Mon, May 25, 2015 at 3:43 PM, umesh9794 umesh.chaudh...@searshc.com
Hello,
I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 and ZooKeeper, running
on a cluster with 3 nodes on 64-bit Ubuntu.
My application is compiled against Spark 1.3.1 (apparently with a Mesos
0.21.0 dependency), Hadoop 2.5.1-mapr-1503 and Akka 2.3.10. Only with
this combination I have
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups sp...@orbit-x.de wrote:
Hello,
I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 and ZooKeeper, running
on a cluster with 3 nodes on 64-bit Ubuntu.
My application is compiled against Spark 1.3.1 (apparently with a Mesos 0.21.0
dependency),
Can you tell us what exactly you are trying to achieve?
Thanks
Best Regards
On Mon, May 25, 2015 at 5:00 PM, luohui20...@sina.com wrote:
thanks, madhu and Akhil
I modified my code as below; however, I think it is not so distributed.
Do you guys have a better idea to run this app more
Try this way:
import org.apache.log4j.Logger

object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

val someRdd = spark.parallelize(List(1, 2, 3))
someRdd.map { element =>
  // Logging happens on the executor; Holder's logger is initialized there lazily
  Holder.log.info(s"$element will be processed")
  element + 1
}
If you want to be notified after every batch completes, then you can simply
implement the StreamingListener
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
interface, which has methods like onBatchCompleted, onBatchStarted, etc., in
which
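A minimal sketch of such a listener (the notification body is a placeholder):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchNotifier extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // Replace the println with your own notification logic
    println(s"Batch at ${info.batchTime} done, processing delay: ${info.processingDelay}")
  }
}

// Register it on the StreamingContext before starting it:
// ssc.addStreamingListener(new BatchNotifier)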
Hi.
In a DataFrame, how can I execute a conditional expression in an
aggregation? For example, can I translate this SQL statement to the DataFrame API?
SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE table.col1)
FROM table
GROUP BY name
Thanks
--
Regards.
Miguel
Hello,
I assume I am running Spark in fine-grained mode, since I haven't
changed the default here.
One question regarding 1.4.0-RC1: is there a Maven snapshot repository I
could use for my project config? (I know that I have to download the source
and run make-distribution for the executor as well.)
Hello!
I would like to use the logging mechanism provided by log4j, but I'm
getting
Exception in thread "main" org.apache.spark.SparkException: Task not
serializable - Caused by: java.io.NotSerializableException:
org.apache.log4j.Logger
The code (and the problem) that I'm using resembles
Further experimentation indicates these problems only occur when the master is
local[*].
There are no issues if a standalone cluster is used.
Hi Kevin,
Did you try adding a host name for the IPv6 address? I have a few IPv6 boxes;
Spark failed for me when I used just the IPv6 addresses, but it works fine
when I use the host names.
Here's an entry in my /etc/hosts:
2607:5300:0100:0200::::0a4d hacked.work
My spark-env.sh file:
Great hints, you guys!
Yes, spark-shell worked fine with Mesos as master. I haven't tried to
execute multiple RDD actions in a row, though (I did a couple of
successful counts on the HBase tables I am working with in several
experiments, but nothing that would compare to the stuff my Spark jobs
are
Here is a link for builds of 1.4 RC2:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/
For a Maven repo, I believe the RC2 artifacts are here:
https://repository.apache.org/content/repositories/orgapachespark-1104/
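For example, an sbt build could point a resolver at that staging repository
(a hedged sketch; I am assuming the RC2 artifacts there carry the plain 1.4.0
version):

// build.sbt
resolvers += "Apache Spark RC staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1104/"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"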
A few experiments you might try:
1. Does spark-shell work?
Hello,
I am using Spark SQL 1.3.0 and Hive 0.13.1 on AWS YARN.
My Hive table, an external table, is partitioned by date and hour.
I expected that a query over certain partitions would read only the data
files of those partitions.
I turned on TRACE-level logging for the ThriftServer, since the query
CASE WHEN col2 > 100 THEN 1 ELSE col2 END
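With the DataFrame API that would look roughly like this (a sketch assuming
Spark 1.4's functions.when/otherwise; column names taken from the question):

import org.apache.spark.sql.functions.{col, sum, when}

val result = df.groupBy("name")
  .agg(sum(when(col("col2") > 100, 1).otherwise(col("col1"))))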
On 26 May 2015 00:25, Masf masfwo...@gmail.com wrote:
Hi.
In a DataFrame, how can I execute a conditional expression in an
aggregation? For example, can I translate this SQL statement to the DataFrame API?
SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE
Hello,
I have a custom data source and I want to load the data into Spark to
perform some computations. For this I see that I might need to implement a
new RDD for my data source.
I am a complete Scala noob and I am hoping that I can implement the RDD in
Java only. I looked around the internet
I am right now trying to run some shell scripts in my Spark app, hoping they run
more concurrently on my Spark cluster. However, I am not sure whether my code
will run concurrently on my executors. Diving into my code, you can see that I am
trying to:
1. split both db and sample into 21 small files. That
The reason it didn't work for you is that the function you registered with
someRdd.map will be running on the worker/executor side, not in your
driver program. You need to be careful not to accidentally close
over objects instantiated in your driver program, like the log
object in
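In code, the captured-logger anti-pattern looks roughly like this (hypothetical
names):

// The logger lives in the driver and is NOT serializable; the closure
// passed to map captures it, so Spark fails with NotSerializableException.
val log = org.apache.log4j.Logger.getLogger("driver")
someRdd.map { element =>
  log.info(s"processing $element") // closes over the driver-side logger
  element + 1
}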
Сергей,
A simple implementation would be to create a DataFrame of CVs by issuing a
Spark SQL query against your Postgres database, persist it in memory, and
then map F over it at query time and return the top
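A hedged sketch of that approach (Spark 1.3-era API; the connection details
and the scoring function F are placeholders):

// Load the CVs table from Postgres as a DataFrame and keep it in memory
val cvs = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://host:5432/db",
  "dbtable" -> "cvs"))
cvs.cache()

// At query time, score every CV with F and take the top N
val topN = cvs.rdd
  .map(row => (F(row), row)) // F: Row => Double, the user's scoring function
  .sortByKey(ascending = false)
  .take(10)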
Hi Yuming - I was running into the same issue with larger worker nodes a few
weeks ago.
The way I managed to get around the high GC time, as per the suggestion of
some others, was to break each worker node up into individual workers of
around 10G in size. Divide your cores accordingly.
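In standalone mode that split can be configured in spark-env.sh, for example
(illustrative numbers for a 60G, 24-core node):

# spark-env.sh: run several ~10G workers per node instead of one big worker
SPARK_WORKER_INSTANCES=6
SPARK_WORKER_MEMORY=10g
SPARK_WORKER_CORES=4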
The other
My data is in S3 and is indexed in Dynamo. For example, if I want to load
data for a given time range, I will first need to query Dynamo for the S3 file
keys for the corresponding time range and then load them in Spark. The
files may not always be under the same S3 path prefix, hence
Please take a look at:
core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala
core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java
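For orientation, here is a minimal custom RDD in Scala (an assumed in-memory
source; a Java version would extend the same abstract class, as the files
above show):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class SinglePartition(override val index: Int) extends Partition

class CustomSourceRDD(sc: SparkContext, data: Seq[String])
    extends RDD[String](sc, Nil) {
  // One partition for the whole (toy) data set
  override def getPartitions: Array[Partition] = Array(new SinglePartition(0))
  // A real data source would read only this partition's slice here
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    data.iterator
}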
Cheers
On Mon, May 25, 2015 at 8:39 PM, swaranga sarma.swara...@gmail.com wrote:
Has this changed now? Can a new RDD be implemented in Java?
--
Thanks, I'll give it a try!
Best regards, Сергей Мелехин.
2015-05-26 12:56 GMT+10:00 Alex Chavez alexkcha...@gmail.com:
Сергей,
A simple implementation would be to create a DataFrame of CVs by issuing a
Spark SQL query against your Postgres database, persist it in memory, and
then map F