Re: Re: how to run a bash shell distributed in spark

2015-05-25 Thread madhu phatak
Hi, you can use the pipe operator if you are running a shell or Perl script over some data. More information is on my blog: http://blog.madhukaraphatak.com/pipe-in-spark/. Regards, Madhukara Phatak http://datamantra.io/ On Mon, May 25, 2015 at 8:02 AM, luohui20...@sina.com wrote: Thanks Akhil,
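A minimal sketch of the pipe approach in Scala, assuming a hypothetical script /path/to/score.sh that reads records from stdin and writes results to stdout:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("pipe-example"))
    val records = sc.parallelize(Seq("rec1", "rec2", "rec3"))
    // each partition's elements are fed to the script's stdin, one per line;
    // the script's stdout lines become the elements of the resulting RDD
    val piped = records.pipe("/path/to/score.sh")
    piped.collect().foreach(println)

Since pipe runs one copy of the script per partition, the work is spread across whichever executors hold the partitions.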

IntelliJ IDEA import Spark source code error

2015-05-25 Thread huangzheng
Hi all, I want to start learning the Spark source code. I cloned the Spark code from git, executed the sbt gen-idea command, and imported the project into IntelliJ, but I get the error below. Could anyone help me? The Spark version is 1.4 and the operating system is Windows 7.

RE: Using Spark like a search engine

2015-05-25 Thread ankur chauhan
Hi, I am sure you can use Spark for this, but it seems like a problem that should be delegated to a text-indexing technology such as Elasticsearch, or something else based on Lucene, to serve the requests. Spark can be used to prepare the data that is fed to the indexing service. Using Spark

The stage is slow when I have a for loop inside (Java)

2015-05-25 Thread allanjie
Hi all, I have only one stage, a mapToPair, and inside its function I have a for loop that iterates about 133433 times. It then becomes slow; when I replace 133433 with just 133, it runs very fast. But I think this should be a simple operation even in plain Java. You can look at the

Re: IntelliJ IDEA import Spark source code error

2015-05-25 Thread Yi Zhang
I am not sure what happened. According to your screenshot, it just shows a warning message rather than an error. But I suggest you try Maven instead: mvn idea:idea. On Monday, May 25, 2015 2:48 PM, huangzheng 1106944...@qq.com wrote:

Re: Using Spark like a search engine

2015-05-25 Thread ayan guha
Yes, Spark will be useful for the following areas of your application: 1. running the same function on every CV in parallel and scoring it; 2. improving the scoring function through better access to classification and clustering algorithms, within and beyond MLlib. These are the first benefits you can start with, and then
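A minimal sketch of point 1, where cvs: Seq[CV] and score: CV => Double are hypothetical stand-ins for the poster's data and scoring function (both must be serializable, since the closure is shipped to the executors):

    val ranked = sc.parallelize(cvs)
      .map(cv => (score(cv), cv))
      .top(20)(Ordering.by(_._1)) // the 20 best-scoring CVs, computed in parallel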

Re: Using Spark like a search engine

2015-05-25 Thread Сергей Мелехин
Hi, Ankur! Thanks for your reply! CVs are just a bunch of IDs; each ID represents some object of some class (e.g. class=JOB, object=SW Developer). We have already processed the texts and extracted all the facts, so we don't need to do any text processing in Spark, just to run the scoring function on many, many

WebSphere MQ as a data source for Apache Spark Streaming

2015-05-25 Thread umesh9794
I was digging into the possibilities for WebSphere MQ as a data source for Spark Streaming because it is needed in one of our use cases. I learned that MQTT http://mqtt.org/ is the protocol that supports communication with MQ data structures, but since I am a newbie to Spark Streaming I

Re: Re: Re: how to run a bash shell distributed in spark

2015-05-25 Thread luohui20001
Thanks, madhu and Akhil. I modified my code as below; however, I think it is not very distributed. Do you have a better idea for running this app more efficiently and in a more distributed way? I added some comments with my understanding: import org.apache.spark._ import www.celloud.com.model._ object GeneCompare3 {

Spark SQL through Java code: facing an issue

2015-05-25 Thread vinayak
Hi all, I am new to Spark and trying to execute Spark SQL through Java code as below: package com.ce.sql; import java.util.List; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import

Re: WebSphere MQ as a data source for Apache Spark Streaming

2015-05-25 Thread Arush Kharbanda
Hi Umesh, you can connect Spark Streaming to MQTT; refer to the example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala Thanks Arush On Mon, May 25, 2015 at 3:43 PM, umesh9794 umesh.chaudh...@searshc.com
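For reference, a minimal sketch of wiring MQTT into Spark Streaming along the lines of that example, assuming the spark-streaming-mqtt artifact is on the classpath (the broker URL and topic below are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.mqtt.MQTTUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("MQFeed"), Seconds(10))
    // each MQTT message published on the topic arrives as one string in the stream
    val lines = MQTTUtils.createStream(ssc, "tcp://broker.example.com:1883", "queue/updates")
    lines.print()
    ssc.start()
    ssc.awaitTermination()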

Tasks randomly stall when running on mesos

2015-05-25 Thread Reinis Vicups
Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with ZooKeeper, running on a cluster with 3 nodes on 64-bit Ubuntu. My application is compiled against Spark 1.3.1 (apparently with a Mesos 0.21.0 dependency), Hadoop 2.5.1-mapr-1503 and Akka 2.3.10. Only with this combination have I

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Iulian Dragoș
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups sp...@orbit-x.de wrote: Hello, I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with ZooKeeper, running on a cluster with 3 nodes on 64-bit Ubuntu. My application is compiled against Spark 1.3.1 (apparently with a Mesos 0.21.0 dependency),

Re: Re: Re: how to run a bash shell distributed in spark

2015-05-25 Thread Akhil Das
Can you tell us what exactly you are trying to achieve? Thanks Best Regards On Mon, May 25, 2015 at 5:00 PM, luohui20...@sina.com wrote: Thanks, madhu and Akhil. I modified my code as below; however, I think it is not very distributed. Do you have a better idea for running this app more

Re: Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Akhil Das
Try this way: object Holder extends Serializable { @transient lazy val log = Logger.getLogger(getClass.getName) } val someRdd = spark.parallelize(List(1, 2, 3)) someRdd.map { element => Holder.log.info(s"$element will be processed"); element + 1
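(The @transient lazy val pattern works because the logger is excluded from serialization and re-created lazily on each executor the first time it is touched, so no Logger instance ever needs to cross the wire.)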

Re: How to use zookeeper in Spark Streaming

2015-05-25 Thread Akhil Das
If you want a notification after every batch completes, then you can simply implement the StreamingListener https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener interface, which has methods like onBatchCompleted, onBatchStarted, etc., in which
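A minimal sketch of such a listener, assuming a StreamingContext named ssc:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchNotifier extends StreamingListener {
      // invoked on the driver after each batch finishes processing
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        println(s"batch at ${batch.batchInfo.batchTime} completed")
      }
    }

    ssc.addStreamingListener(new BatchNotifier)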

DataFrame. Conditional aggregation

2015-05-25 Thread Masf
Hi. In a DataFrame, how can I execute a conditional expression inside an aggregation? For example, can I translate this SQL statement to a DataFrame?: SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE table.col1) FROM table GROUP BY name Thanks -- Regards. Miguel

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Reinis Vicups
Hello, I assume I am running Spark in fine-grained mode, since I haven't changed the default here. One question regarding 1.4.0-RC1: is there a Maven snapshot repository I could use for my project config? (I know that I have to download the source and run make-distribution for the executor as well)

Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Spico Florin
Hello! I would like to use the logging mechanism provided by log4j, but I'm getting: Exception in thread main org.apache.spark.SparkException: Task not serializable - Caused by: java.io.NotSerializableException: org.apache.log4j.Logger. The code (and the problem) that I'm using resembles

Re: Spark updateStateByKey fails with class leak when using case classes - resend

2015-05-25 Thread rsearle
Further experimentation indicates these problems only occur when master is local[*]. There are no issues if a standalone cluster is used.

Re: IPv6 support

2015-05-25 Thread Akhil Das
Hi Kevin, did you try adding a host name for the IPv6 address? I have a few IPv6 boxes; Spark failed for me when I used just the IPv6 addresses, but it works fine when I use the host names. Here's an entry in my /etc/hosts: 2607:5300:0100:0200::::0a4d hacked.work My spark-env.sh file:

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Reinis Vicups
Great hints, you guys! Yes, spark-shell worked fine with Mesos as master. I haven't tried to execute multiple RDD actions in a row, though (I did a couple of successful counts on the HBase tables I am working with in several experiments, but nothing that would compare to the stuff my Spark jobs are

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Dean Wampler
Here is a link to builds of 1.4 RC2: http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/ For a Maven repo, I believe the RC2 artifacts are here: https://repository.apache.org/content/repositories/orgapachespark-1104/ A few experiments you might try: 1. Does spark-shell work?

SparkSQL's performance: contacting the namenode and datanode to unnecessarily check all partitions for a query of specific partitions

2015-05-25 Thread ogoh
Hello, I am using SparkSQL 1.3.0 and Hive 0.13.1 on AWS YARN. My Hive table, an external table, is partitioned by date and hour. I expected that a query over certain partitions would read only the data files of those partitions. I turned on TRACE-level logging for the ThriftServer, since the query

Re: DataFrame. Conditional aggregation

2015-05-25 Thread ayan guha
CASE WHEN col2 > 100 THEN 1 ELSE col2 END On 26 May 2015 00:25, Masf masfwo...@gmail.com wrote: Hi. In a DataFrame, how can I execute a conditional expression inside an aggregation? For example, can I translate this SQL statement to a DataFrame?: SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE
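In the DataFrame API this would look roughly as follows, a sketch assuming Spark 1.4's when/otherwise column functions and a DataFrame named df:

    import org.apache.spark.sql.functions._

    // SELECT name, SUM(CASE WHEN col2 > 100 THEN 1 ELSE col1 END) FROM table GROUP BY name
    val result = df.groupBy("name")
      .agg(sum(when(col("col2") > 100, lit(1)).otherwise(col("col1"))).as("total"))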

Implementing custom RDD in Java

2015-05-25 Thread Swaranga Sarma
Hello, I have a custom data source and I want to load the data into Spark to perform some computations. For this I see that I might need to implement a new RDD for my data source. I am a complete Scala noob and I am hoping that I can implement the RDD in Java only. I looked around the internet

Re: Re: Re: Re: how to run a bash shell distributed in spark

2015-05-25 Thread luohui20001
I am indeed trying to run some shell scripts in my Spark app, hoping they run more concurrently on my Spark cluster. However, I am not sure whether my code will run concurrently on my executors. Diving into my code, you can see that I am trying to: 1. split both db and sample into 21 small files. That

Re: Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Wesley Miao
The reason it didn't work for you is that the function you registered with someRdd.map will run on the worker/executor side, not in your driver program. You need to be careful not to accidentally close over objects instantiated in your driver program, like the log object in

Re: Using Spark like a search engine

2015-05-25 Thread Alex Chavez
Сергей, A simple implementation would be to create a DataFrame of CVs by issuing a Spark SQL query against your Postgres database, persist it in memory, and then map F over it at query time and return the top
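A sketch of that first step with the Spark 1.3-era JDBC data source (the connection URL and table name are hypothetical):

    val cvs = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:postgresql://db.example.com:5432/recruiting",
      "dbtable" -> "cvs"))
    cvs.persist() // keep the CVs in memory across scoring queries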

Re: Spark SQL High GC time

2015-05-25 Thread Nick Travers
Hi Yuming - I was running into the same issue with larger worker nodes a few weeks ago. The way I managed to get around the high GC time, as per the suggestion of some others, was to break each worker node up into individual workers of around 10G in size, dividing the cores accordingly. The other
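A sketch of what that can look like in spark-env.sh on each standalone worker node; the sizes below are illustrative, not a recommendation:

    # run several smaller workers per node instead of one large heap
    export SPARK_WORKER_INSTANCES=6
    export SPARK_WORKER_MEMORY=10g
    export SPARK_WORKER_CORES=4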

Re: Implementing custom RDD in Java

2015-05-25 Thread Swaranga Sarma
My data is in S3 and is indexed in Dynamo. For example, if I want to load data for a given time range, I will first need to query Dynamo for the S3 file keys for the corresponding time range and then load them in Spark. The files may not always be in the same S3 path prefix, hence

Re: Re: is there any easier way to define a custom RDD in Java

2015-05-25 Thread Ted Yu
Please take a look at: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala core/src/test/java/org/apache/spark/JavaJdbcRDDSuite.java Cheers On Mon, May 25, 2015 at 8:39 PM, swaranga sarma.swara...@gmail.com wrote: Has this changed now? Can a new RDD be implemented in Java?

Re: Using Spark like a search engine

2015-05-25 Thread Сергей Мелехин
Thanks, I'll give it a try! Best regards, Сергей Мелехин. 2015-05-26 12:56 GMT+10:00 Alex Chavez alexkcha...@gmail.com: Сергей, A simple implementation would be to create a DataFrame of CVs by issuing a Spark SQL query against your Postgres database, persist it in memory, and then map F