What is a taskBinary for a ShuffleMapTask? What is its purpose?

2015-09-21 Thread Muler
Hi, What is the purpose of the taskBinary for a ShuffleMapTask? What does it contain and how is it useful? Is it the representation of all the RDD operations that will be applied for the partition that task will be processing? (in the case below the task will process stage 0, partition 0) If it

Re: question building spark in a virtual machine

2015-09-21 Thread Eyal Altshuler
Anyone? On Sun, Sep 20, 2015 at 7:49 AM, Eyal Altshuler wrote: > I allocated almost 6GB of RAM to the ubuntu virtual machine and got the > same problem. > I will go over this post and try to zoom in into the java vm settings. > > meanwhile - can someone with a working

Re: Class cast exception : Spark 1.5

2015-09-21 Thread sim
You likely need to add the Cassandra connector JAR to spark.jars so it is available to the executors. Hope this helps, Sim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Class-cast-exception-Spark-1-5-tp24732p24753.html Sent from the Apache Spark User
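
For reference, a minimal sketch of the suggestion above, assuming the connector assembly JAR sits at a placeholder path reachable by the driver:

    // Hedged sketch: ship the Cassandra connector classes to the executors via spark.jars.
    // The JAR path below is a placeholder, not a real location.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("cassandra-example")
      .set("spark.jars", "/path/to/spark-cassandra-connector-assembly.jar")
    val sc = new org.apache.spark.SparkContext(conf)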

Re: word count (group by users) in spark

2015-09-21 Thread Aniket Bhatnagar
Unless I am mistaken, in a group by operation, it spills to disk in case values for a key don't fit in memory. Thanks, Aniket On Mon, Sep 21, 2015 at 10:43 AM Huy Banh wrote: > Hi, > > If your input format is user -> comment, then you could: > > val comments =
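
For reference, a minimal sketch of a per-user word count that never needs all values for one key in memory at once (input assumed to be an RDD of (user, comment) pairs):

    // Hedged sketch: count words per user with reduceByKey, which combines map-side
    // instead of collecting every value for a key the way a group-by would.
    val counts = comments
      .flatMap { case (user, comment) =>
        comment.split("\\s+").map(word => ((user, word), 1))
      }
      .reduceByKey(_ + _)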

Re: Problem at sbt/sbt assembly

2015-09-21 Thread Sean Owen
Sbt asked for a bigger initial heap than the host had space for. It is a JVM error you can and should search for first. You will need more memory. On Mon, Sep 21, 2015, 2:11 AM Aaroncq4 <475715...@qq.com> wrote: > When I used “sbt/sbt assembly" to compile spark code of spark-1.5.0,I got a >

Fwd: Issue with high no of skipped task

2015-09-21 Thread Saurav Sinha
Hi Users, I am new to Spark and I have written a flow. When we deployed our code it was completing jobs in 4-5 min, but now it is taking 20+ min with almost the same set of data. Can you please help me figure out the reason for it? -- Thanks and Regards, Saurav Sinha Contact: 9742879062 --

Deploying spark-streaming application on production

2015-09-21 Thread Jeetendra Gangele
Hi All, I have a Spark Streaming application with batch (10 ms) which is reading the MQTT channel and dumping the data from MQTT to HDFS. So suppose I have to deploy a new application jar (with changes in the Spark Streaming application), what is the best way to deploy? Currently I am doing as below

Spark Lost executor && shuffle.FetchFailedException

2015-09-21 Thread biyan900116
Hi All: When I write data to the Hive dynamic partition table, many errors and warnings like the following happen... Is the reason that the shuffle output is so large? = 15/09/21 14:53:09 ERROR cluster.YarnClusterScheduler: Lost executor 402 on dn03.datanode.com: remote Rpc client

Issue with high no of skipped task

2015-09-21 Thread Saurav Sinha
Hi Users, I am new to Spark and I have written a flow. When we deployed our code it was completing jobs in 4-5 min, but now it is taking 20+ min with almost the same set of data. Can you please help me figure out the reason for it? -- Thanks and Regards, Saurav Sinha Contact: 9742879062 --

Hbase Spark streaming issue.

2015-09-21 Thread Siva
Hi, I'm seeing a strange error while inserting data from Spark Streaming into HBase. I am able to write data from Spark (without streaming) to HBase successfully, but when I use the same code to write a DStream I see the error below. I tried setting the parameters below, still didn't

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, I don't think that's what I want. There's no "zero value" in my use case. On Mon, Sep 21, 2015 at 8:20 PM, Sean Owen wrote: > I think foldByKey is much more what you want, as it has more a notion > of building up some result per key by encountering values serially. >

Spark Ingestion into Relational DB

2015-09-21 Thread Sri
Hi, We have a use case where we get data from different systems and finally the data will be consolidated into an Oracle database. Is this a valid use case for Spark? Currently we also don't have any big data component. In case we go with Spark to ingest data, does it require

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
The zero value here is None. Combining None with any row should yield Some(row). After that, combining is a no-op for other rows. On Tue, Sep 22, 2015 at 4:27 AM, Philip Weaver wrote: > Hmm, I don't think that's what I want. There's no "zero value" in my use > case. > >
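
For reference, a minimal sketch of this approach, assuming String values; as noted later in the thread, "first" still presupposes a stable ordering of the RDD:

    // Hedged sketch: lift values into Option, use None as the zero value, and always
    // prefer the value already accumulated, so the first value seen per key survives.
    val firstPerKey = rdd
      .mapValues(v => Option(v))
      .foldByKey(Option.empty[String])((acc, v) => acc.orElse(v))
      .flatMapValues(v => v)   // unwrap, dropping any key that somehow stayed None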

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
Yes, that's right, though "in order" depends on the RDD having an ordering, but so does the zip-based solution. Actually, I'm going to walk that back a bit, since I don't see a guarantee that foldByKey behaves like foldLeft. The implementation underneath, in combineByKey, appears that it will act

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
Hmm, ok, but I'm not seeing why foldByKey is more appropriate than reduceByKey? Specifically, is foldByKey guaranteed to walk the RDD in order, but reduceByKey is not? On Mon, Sep 21, 2015 at 8:41 PM, Sean Owen wrote: > The zero value here is None. Combining None with any

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Utkarsh Sengar
We are using "spark-1.4.1-bin-hadoop2.4" on mesos (not EMR) with s3 to read and write data and haven't noticed any inconsistencies with it, so 1 (mostly) and 2 definitely should not be a problem. Regarding 3, are you setting the file system impl in spark config?

Invalid checkpoint url

2015-09-21 Thread srungarapu vamsi
I am using reduceByKeyAndWindow (with an inverse reduce function) in my code. In order to use this, it seems the checkpointDirectory I use should be a Hadoop-compatible file system. Does that mean I should set up Hadoop on my system? I googled this and I found in an S.O. answer
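
For reference, a minimal sketch with a hypothetical HDFS URI; any Hadoop-compatible file system URI works, and a plain local path is only suitable for single-machine testing:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hedged sketch: the namenode address and path below are placeholders.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")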

Re: Spark Ingestion into Relational DB

2015-09-21 Thread ayan guha
No, it does not require Hadoop. 1. However, I doubt this is a good use case for Spark; you would probably be better off and gain better performance with sqlloader. On Tue, Sep 22, 2015 at 3:13 PM, Sri wrote: > Hi, > > We have a usecase where we get the

Re: Spark Ingestion into Relational DB

2015-09-21 Thread Jörn Franke
You do not need Hadoop. However, you should think about using it. If you use Spark to load data directly from Oracle then your database might see unexpected load once a Spark node fails. Additionally, the Oracle database, if it is not based on local disk, may have a storage

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-21 Thread tridib
Did you get any solution to this? I am getting same issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Long-GC-pauses-with-Spark-SQL-1-3-0-and-billion-row-tables-tp22750p24759.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Philip Weaver
I am processing a single file and want to remove duplicate rows by some key by always choosing the first row in the file for that key. The best solution I could come up with is to zip each row with the partition index and local index, like this: rdd.mapPartitionsWithIndex { case (partitionIndex,
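
For reference, a minimal sketch of the zip-with-indices approach described above, assuming an RDD of (key, value) pairs with String values:

    // Hedged sketch: tag each row with (partitionIndex, localIndex) and keep the row
    // with the smallest tag per key, i.e. the first occurrence in file order.
    val deduped = rdd
      .mapPartitionsWithIndex { (partitionIndex, rows) =>
        rows.zipWithIndex.map { case ((key, value), localIndex) =>
          (key, ((partitionIndex, localIndex), value))
        }
      }
      .reduceByKey { (a, b) =>
        if (Ordering[(Int, Int)].lt(a._1, b._1)) a else b   // keep the earlier tag
      }
      .mapValues(_._2)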

Re: passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Can I use this sparkContext on executors? In my application, I have a scenario of reading from a DB for certain records in an RDD, hence I need the sparkContext to read from the DB (Cassandra in our case). If the sparkContext can't be sent to executors, what is the workaround for this? On Mon, Sep 21,

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Romi Kuntsman
RDD is a set of data rows (in your case numbers), there is no meaning for the order of the items. What exactly are you trying to accomplish? *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 2015 at 2:29 PM, Zhiliang Zhu wrote: > Dear , >

Re: passing SparkContext as parameter

2015-09-21 Thread Priya Ch
Yes, but I need to read from the Cassandra DB within a Spark transformation, something like: dstream.foreachRDD { rdd => rdd.foreach { message => sc.cassandraTable() . . . } } Since rdd.foreach gets executed on workers, how can I make the sparkContext available on the workers?

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
The sparkContext is available on the driver, not on executors. To read from Cassandra, you can use something like this: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21, 2015 at 2:27 PM,
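
For reference, a minimal sketch of driver-side reading with the connector linked above; the keyspace and table names are placeholders:

    import com.datastax.spark.connector._

    // Hedged sketch: cassandraTable comes from the connector's implicits and
    // returns an RDD of CassandraRow, computed on the executors.
    val wordsRdd = sc.cassandraTable("test_keyspace", "words")
    wordsRdd.take(10).foreach(println)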

Re: passing SparkContext as parameter

2015-09-21 Thread Ted Yu
You can use a broadcast variable for passing connection information. Cheers > On Sep 21, 2015, at 4:27 AM, Priya Ch wrote: > > can i use this sparkContext on executors ?? > In my application, i have scenario of reading from db for certain records in > rdd. Hence I

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
foreach is something that runs on the driver, not the workers. if you want to perform some function on each record from cassandra, you need to do cassandraRdd.map(func), which will run distributed on the spark workers *Romi Kuntsman*, *Big Data Engineer* http://www.totango.com On Mon, Sep 21,

Re: Spark does not yet support its JDBC component for Scala 2.11.

2015-09-21 Thread Ted Yu
I think the document should be updated to reflect the integration of SPARK-8013 Cheers On Mon, Sep 21, 2015 at 3:48 AM, Petr Novak wrote: > Nice, thanks. > > So the note in build instruction for 2.11 is obsolete? Or there

Re: Spark does not yet support its JDBC component for Scala 2.11.

2015-09-21 Thread Petr Novak
Nice, thanks. So the note in build instruction for 2.11 is obsolete? Or there are still some limitations? http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 On Fri, Sep 11, 2015 at 2:19 PM, Petr Novak wrote: > Nice, thanks. > > So the note in

How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Dear all, I have spent lots of days thinking about this issue, however, without any success... I would appreciate all your kind help. There is an RDD rdd1; I would like to get a new RDD rdd2, where each row rdd2[i] = rdd1[i] - rdd1[i-1]. What kind of API or function would I use... Thanks very

Re: DataGenerator for streaming application

2015-09-21 Thread Hemant Bhanawat
Why are you using rawSocketStream to read the data? I believe rawSocketStream waits for a big chunk of data before it can start processing it. I think what you are writing is a String and you should use socketTextStream which reads the data on a per line basis. On Sun, Sep 20, 2015 at 9:56 AM,
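
For reference, a minimal sketch of the suggested call; socketTextStream reads newline-delimited UTF-8 text, one record per line (host and port are placeholders):

    // Hedged sketch: per-line text stream followed by a simple word count.
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()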

Re: HiveQL Compatibility (0.12.0, 0.13.0???)

2015-09-21 Thread Michael Armbrust
In general we welcome pull requests for these kinds of updates. In this case it's already been fixed in master and branch-1.5 and will be updated when we release 1.5.1 (hopefully soon). On Mon, Sep 21, 2015 at 1:21 PM, Dominic Ricard < dominic.ric...@tritondigital.com> wrote: > Hi, >here's a

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Saisai Shao
I think you need to increase the memory size of executor through command arguments "--executor-memory", or configuration "spark.executor.memory". Also yarn.scheduler.maximum-allocation-mb in Yarn side if necessary. Thanks Saisai On Mon, Sep 21, 2015 at 5:13 PM, Alexander Pivovarov

Re: word count (group by users) in spark

2015-09-21 Thread Zhang, Jingyu
Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory. However, it flushes out the data to disk one key at a time - so if a single key has more key-value pairs than can fit in memory, an out of memory exception occurs. Cheers, Jingyu On

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Sandy Ryza
The warning you're seeing in Spark is no issue. The scratch space lives inside the heap, so it'll never result in YARN killing the container by itself. The issue is that Spark is using some off-heap space on top of that. You'll need to bump the spark.yarn.executor.memoryOverhead property to give
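
For reference, a minimal sketch pulling the two suggestions in this thread together; the values are illustrative, not tuned:

    // Hedged sketch: a larger executor heap plus extra off-heap headroom for YARN.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "6g")                 // heap per executor
      .set("spark.yarn.executor.memoryOverhead", "1024")  // MB of off-heap headroom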

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I repartitioned input RDD from 4,800 to 24,000 partitions After that the stage (24000 tasks) was done in 22 min on 100 boxes Shuffle read/write: 905 GB / 710 GB Task Metrics (Dur/GC/Read/Write) Min: 7s/1s/38MB/30MB Med: 22s/9s/38MB/30MB Max:1.8min/1.6min/38MB/30MB On Mon, Sep 21, 2015 at 5:55

Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Balaji Vijayan
Howdy, I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that I'm seeing in 2 of my local Spark/Scala environments (Scala for Jupyter and Scala IDE) but not the 3rd (Spark Shell). The following code throws the following stack trace error in the former 2 environments but

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Alexis Gillain
As Igor said, the header must be available on each partition, so the solution is broadcasting it. About the difference between the REPL and the Scala IDE, it may come from the sparkContext setup, as the REPL defines one by default. 2015-09-22 8:41 GMT+08:00 Igor Berman : > Try to broadcast
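
For reference, a minimal sketch of the broadcast suggestion, assuming `lines` is the RDD[String] of the input and the header is its first line:

    // Hedged sketch: compute the header once on the driver, broadcast it, and filter
    // it out on every partition.
    val header = lines.first()
    val headerBc = sc.broadcast(header)
    val dataRows = lines.filter(_ != headerBc.value)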

Re: How does one use s3 for checkpointing?

2015-09-21 Thread Jerry Lam
Hi Amit, Have you looked at Amazon EMR? Most people using EMR use s3 for persistency (both as input and output of spark jobs). Best Regards, Jerry Sent from my iPhone > On 21 Sep, 2015, at 9:24 pm, Amit Ramesh wrote: > > > A lot of places in the documentation mention using

Spark Streaming distributed job

2015-09-21 Thread nibiau
Hello, could you please explain to me what exactly is distributed when I launch a Spark Streaming job over a YARN cluster? My code is something like: JavaDStream customReceiverStream = ssc.receiverStream(streamConfig.getJmsReceiver()); JavaDStream incoming_msg = customReceiverStream.map(

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Ted Yu
Which release are you using? From the line number in ClosureCleaner, it seems you're using 1.4.x Cheers On Mon, Sep 21, 2015 at 4:07 PM, Balaji Vijayan wrote: > Howdy, > > I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that > I'm seeing in

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Hi Romi, Yes, you understand it correctly. And the rdd1 keys cross with the rdd2 keys, that is, there are lots of keys common to rdd1 and rdd2, there are some keys in rdd1 but not in rdd2, and there are also some keys in rdd2 but not in rdd1. Then the rdd3 keys would be the same as the rdd1 keys, rdd3 will

spark.mesos.coarse impacts memory performance on mesos

2015-09-21 Thread Utkarsh Sengar
I am running Spark 1.4.1 on mesos. The spark job does a "cartesian" of 4 RDDs (aRdd, bRdd, cRdd, dRdd) of size 100, 100, 7 and 1 respectively. Let's call it productRDD. Creation of "aRdd" needs a data pull from multiple data sources, merging it and creating a tuple of JavaRdd; finally aRDD looks

How does one use s3 for checkpointing?

2015-09-21 Thread Amit Ramesh
A lot of places in the documentation mention using s3 for checkpointing, however I haven't found any examples or concrete evidence of anyone having done this. 1. Is this a safe/reliable option given the read-after-write consistency for PUTS in s3? 2. Is s3 access broken for hadoop 2.6

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Igor Berman
Try to broadcast the header. On Sep 22, 2015 08:07, "Balaji Vijayan" wrote: > Howdy, > > I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that > I'm seeing in 2 of my local Spark/Scala environments (Scala for Jupyter and > Scala IDE) but not the 3rd

Iterator-based streaming, how is it efficient ?

2015-09-21 Thread Samuel Hailu
Hi, In Spark's in-memory logic, without cache, elements are accessed in an iterator-based streaming style [ http://www.slideshare.net/liancheng/dtcc-14-spark-runtime-internals?next_slideshow=1 ] I have two questions: 1. if elements are read one line at a time from HDFS (disk) and then

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I noticed that some executors have issue with scratch space. I see the following in yarn app container stderr around the time when yarn killed the executor because it uses too much memory. -- App container stderr -- 15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache rdd_6_346

Re: Remove duplicate keys by always choosing first in file.

2015-09-21 Thread Sean Owen
I think foldByKey is much more what you want, as it has more a notion of building up some result per key by encountering values serially. You would take the first and ignore the rest. Note that "first" depends on your RDD having an ordering to begin with, or else you rely on however it happens to

Re: Why are executors on slave never used?

2015-09-21 Thread Hemant Bhanawat
When you specify master as local[2], it starts the spark components in a single jvm. You need to specify the master correctly. I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I run a Spark process, it works fine -- but only on the master, as if it were standalone. The web-UI

Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Spark Developers, I just ran some very simple operations on a dataset. I was surprised by the execution plan of take(1), head() or first(). For your reference, this is what I did in pyspark 1.5: df=sqlContext.read.parquet("someparquetfiles") df.head() The above lines take over 15 minutes. I

Python Packages in Spark w/Mesos

2015-09-21 Thread John Omernik
Hey all - Curious about the best way to include Python packages (such as NLTK) in my Spark installation. Basically I am running on Mesos and would like to find a way to include the package in the binary distribution, since I don't want to install packages on all nodes. We should be able to

AWS_CREDENTIAL_FILE

2015-09-21 Thread Michel Lemay
Hi, It looks like spark does read AWS credentials from environment variable AWS_CREDENTIAL_FILE like awscli does. Mike

Re: Count for select not matching count for group by

2015-09-21 Thread Richard Hillegas
For what it's worth, I get the expected result that "filter" behaves like "group by" when I run the same experiment against a DataFrame which was loaded from a relational store: import org.apache.spark.sql._ import org.apache.spark.sql.types._ val df = sqlContext.read.format("jdbc").options(
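
For reference, a minimal sketch of the JDBC load in the truncated snippet above; the URL, table and column names are placeholders:

    // Hedged sketch: load over JDBC, then compare a filtered count with the grouped counts.
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/mydb",
      "dbtable" -> "my_table"
    )).load()

    df.filter(df("status") === "active").count()   // should agree with...
    df.groupBy("status").count().show()            // ...the corresponding group-by count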

Why are executors on slave never used?

2015-09-21 Thread Joshua Fox
I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I run a Spark process, it works fine -- but only on the master, as if it were standalone. The web-UI and logging code shows only 1 executor, the localhost. How can I diagnose this? (I create *SparkConf, *in Python, with

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Romi, Thanks very much for your kind help and comments~~ In fact there is some valid background to the application; it is about R data analysis. #fund_nav_daily is an M x N (or M x 1) matrix or data.frame, where each column is a daily fund return and each row is the daily date. #fund_return_daily needs to count the

Re: SparkR - calling as.vector() with rdd dataframe causes error

2015-09-21 Thread Ellen Kraffmiller
Thank you for the link! I was using http://apache-spark-user-list.1001560.n3.nabble.com/, and I didn't see replies there. Regarding your code example, I'm doing the same thing and successfully creating the rdd, but the problem is that when I call a clustering algorithm like amap::hcluster(), I

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
Hi Zhiliang, Would something like this work? val rdd2 = rdd1.sliding(2).map(v => v(1) - v(0)) -sujit On Mon, Sep 21, 2015 at 7:58 AM, Zhiliang Zhu wrote: > Hi Romi, > > Thanks very much for your kind help comment~~ > > In fact there is some valid backgroud of
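
For reference, a minimal sketch of the sliding-window difference; sliding(2) comes from MLlib's RDDFunctions and is pulled in by the implicit import:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // Hedged sketch with toy data: each output element is the difference of two
    // adjacent input elements.
    val rdd1 = sc.parallelize(Seq(1.0, 4.0, 9.0, 16.0))
    val rdd2 = rdd1.sliding(2).map(w => w(1) - w(0))   // 3.0, 5.0, 7.0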

Exception initializing JavaSparkContext

2015-09-21 Thread ekraffmiller
Hi, I’m trying to run a simple test program to access Spark though Java. I’m using JDK 1.8, and Spark 1.5. I’m getting an Exception from the JavaSparkContext constructor. My initialization code matches all the sample code I’ve found online, so not sure what I’m doing wrong. Here is my code:

Re: Python Packages in Spark w/Mesos

2015-09-21 Thread Tim Chen
Hi John, Sorry haven't get time to respond to your questions over the weekend. If you're running client mode, to use the Docker/Mesos integration minimally you just need to set the image configuration 'spark.mesos.executor.docker.image' as stated in the documentation, which Spark will use this

how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Dear Romi, Priya, Sujit, Shivaram and all, I have spent lots of days thinking about this issue, however, without any good enough solution... I would appreciate all your kind help. There is an RDD rdd1, and another RDD rdd2 (rdd2 can be a PairRDD, or a DataFrame with two columns

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Seems 1.4 has the same issue. On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > btw, does 1.4 has the same problem? > > On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > >> Hi Jerry, >> >> Looks like it is a Python-specific issue. Can you create

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
Hi Yin, You are right! I just tried the scala version with the above lines, it works as expected. I'm not sure if it happens also in 1.4 for pyspark but I thought the pyspark code just calls the scala code via py4j. I didn't expect that this bug is pyspark specific. That surprises me actually a

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Jerry Lam
I just noticed you found 1.4 has the same issue. I added that as well in the ticket. On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam wrote: > Hi Yin, > > You are right! I just tried the scala version with the above lines, it > works as expected. > I'm not sure if it happens

JDBCRdd issue

2015-09-21 Thread Saurabh Malviya (samalviy)
Hi, While using a reference within a JdbcRDD, it is throwing a serializable exception. Does JdbcRDD not accept references from other parts of the code? confMap= ConfFactory.getConf(ParquetStreaming) val jdbcRDD = new JdbcRDD(sc, () => {

Fwd: Issue with high no of skipped task

2015-09-21 Thread Saurav Sinha
-- Forwarded message -- From: "Saurav Sinha" Date: 21-Sep-2015 11:48 am Subject: Issue with high no of skipped task To: Cc: Hi Users, I am new to Spark and I have written a flow. When we deployed our code it was completing jobs in 4-5 min.

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Romi, I must show my sincere appreciation for your kind and helpful reply. One more question: currently I am using Spark for financial data analysis, so lots of operations on R data.frame/matrix and stat/regression are always called. However, SparkR currently is not that strong, most

Re: Exception initializing JavaSparkContext

2015-09-21 Thread Marcelo Vanzin
What Spark package are you using? In particular, which hadoop version? On Mon, Sep 21, 2015 at 9:14 AM, ekraffmiller wrote: > Hi, > I’m trying to run a simple test program to access Spark though Java. I’m > using JDK 1.8, and Spark 1.5. I’m getting an Exception

Re: DataGenerator for streaming application

2015-09-21 Thread Saiph Kappa
Thanks a lot. Now it's working fine. I wasn't aware of "socketTextStream", not sure if it was documented in the spark programming guide. On Mon, Sep 21, 2015 at 12:46 PM, Hemant Bhanawat wrote: > Why are you using rawSocketStream to read the data? I believe >

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Looks like the problem is df.rdd does not work very well with limit. In scala, df.limit(1).rdd will also trigger the issue you observed. I will add this in the jira. On Mon, Sep 21, 2015 at 10:44 AM, Jerry Lam wrote: > I just noticed you found 1.4 has the same issue. I

Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Ashish Soni
Hi All, Just wanted to find out if there is any benefit to installing Kafka brokers and Spark nodes on the same machines. Is it possible for Spark to pull data from Kafka if it is local to the node, i.e. the broker or partition is on the same machine? Thanks, Ashish

Re: Docker/Mesos with Spark

2015-09-21 Thread Tim Chen
Hi John, There is no other blog post yet, I'm thinking to do a series of posts but so far haven't get time to do that yet. Running Spark in docker containers makes distributing spark versions easy, it's simple to upgrade and automatically caches on the slaves so the same image just runs right

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
Hi Jerry, Looks like it is a Python-specific issue. Can you create a JIRA? Thanks, Yin On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > Hi Spark Developers, > > I just ran some very simple operations on a dataset. I was surprise by the > execution plan of take(1),

Re: Spark SQL DataFrame 1.5.0 is extremely slow for take(1) or head() or first()

2015-09-21 Thread Yin Huai
btw, does 1.4 has the same problem? On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai wrote: > Hi Jerry, > > Looks like it is a Python-specific issue. Can you create a JIRA? > > Thanks, > > Yin > > On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam wrote: > >> Hi

Re: spark + parquet + schema name and metadata

2015-09-21 Thread Cheng Lian
Currently Spark SQL doesn't support customizing the schema name and metadata. May I know why these two matter in your use case? Some Parquet data models, like parquet-avro, do support it, while some others don't (e.g. parquet-hive). Cheng On 9/21/15 7:13 AM, Borisa Zivkovic wrote: Hi, I am

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Sujit, I appreciate your kind help very much~ It seems to be OK; however, do you know the corresponding Spark Java API? Is there any Java API like Scala's sliding? It also seems that I cannot find the Spark Scala doc about sliding ... Thank you very much~ Zhiliang On

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Zhiliang Zhu
Hi Sujit, Thanks very much for your kind help. I have found the sliding doc in both Scala and Java Spark; it is from MLlib's RDDFunctions, though in the doc there are never enough examples. Best Regards, Zhiliang On Monday, September 21, 2015 11:48 PM, Sujit Pal

Re: How to get a new RDD by ordinarily subtract its adjacent rows

2015-09-21 Thread Sujit Pal
Hi Zhiliang, Haven't used the Java API but found this Javadoc page, may be helpful to you. https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/mllib/rdd/RDDFunctions.html I think the equivalent Java code snippet might go something like this: RDDFunctions.fromRDD(rdd1,

sqlContext.read.avro broadcasting files from the driver

2015-09-21 Thread Daniel Haviv
Hi, I'm loading 1000 files using the spark-avro package: val df = sqlContext.read.avro("/incoming/") When I perform an action on this df it seems like for each file a broadcast is being created and sent to the workers (instead of the workers reading their data-local files): scala>

spark + parquet + schema name and metadata

2015-09-21 Thread Borisa Zivkovic
Hi, I am trying to figure out how to write parquet metadata when persisting DataFrames to parquet using Spark (1.4.1) I could not find a way to change schema name (which seems to be hardcoded to root) and also how to add data to key/value metadata in parquet footer.

Re: passing SparkContext as parameter

2015-09-21 Thread Cody Koeninger
That isn't accurate, I think you're confused about foreach. Look at http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Mon, Sep 21, 2015 at 7:36 AM, Romi Kuntsman wrote: > foreach is something that runs on the

Count for select not matching count for group by

2015-09-21 Thread Michael Kelly
Hi, I'm seeing some strange behaviour with spark 1.5, I have a dataframe that I have built from loading and joining some hive tables stored in s3. The dataframe is cached in memory, using df.cache. What I'm seeing is that the counts I get when I do a group by on a column are different from what

Re: Deploying spark-streaming application on production

2015-09-21 Thread Petr Novak
In short, there is no direct support for it in Spark AFAIK. You will either manage it in MQTT or have to add another layer of indirection - either in-memory based (observable streams, in-mem db) or disk based (Kafka, HDFS files, a db) - which will keep your unprocessed events. Now realizing, there is

Re: passing SparkContext as parameter

2015-09-21 Thread Petr Novak
add @transient? On Mon, Sep 21, 2015 at 11:36 AM, Petr Novak wrote: > add @transient? > > On Mon, Sep 21, 2015 at 11:27 AM, Priya Ch > wrote: > >> Hello All, >> >> How can i pass sparkContext as a parameter to a method in an object. >>
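
For reference, a minimal sketch of the @transient idea: keep the context as driver-side state so that serializing the enclosing instance never tries to serialize the SparkContext itself (class and method names are hypothetical):

    // Hedged sketch: @transient excludes sc from serialization of Processor instances.
    class Processor(@transient val sc: org.apache.spark.SparkContext) extends Serializable {
      def run(rdd: org.apache.spark.rdd.RDD[String]): Long =
        rdd.filter(_.nonEmpty).count()   // the closure only captures what it uses
    }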

mongo-hadoop with Spark is slow for me, and adding nodes doesn't seem to make any noticeable difference

2015-09-21 Thread cscarioni
Hi, I appreciate any help or pointers in the right direction. My current test scenario is the following: I want to process a MongoDB collection, anonymising some fields on it, and store it in another collection. The size of the collection is around 900 GB with 2.5 million documents. Following is

Re: Spark + Druid

2015-09-21 Thread Petr Novak
Great work. On Fri, Sep 18, 2015 at 6:51 PM, Harish Butani wrote: > Hi, > > I have just posted a Blog on this: > https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani > > regards, > Harish Butani. > > On Tue, Sep 1, 2015 at

spark with internal ip

2015-09-21 Thread ZhuGe
Hi there: We recently added one NIC to each node of the cluster (standalone) for larger bandwidth, and we modified the /etc/hosts file so that the hostname points to the new NIC's internal IP address. What we want to achieve is that communication between nodes goes through the new NIC. It seems

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Romi Kuntsman
Hi, If I understand correctly: rdd1 contains keys (of type StringDate) rdd2 contains keys and values and rdd3 contains all the keys, and the values from rdd2? I think you should make rdd1 and rdd2 PairRDD, and then use outer join. Does that make sense? On Mon, Sep 21, 2015 at 8:37 PM Zhiliang
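
For reference, a minimal sketch of the outer-join idea, assuming the keys are date strings, rdd2 is a PairRDD of (key, Double), and 0.0 is an acceptable default for keys missing from rdd2:

    // Hedged sketch: key rdd1, left-outer-join against rdd2, and unwrap the Option.
    val keyed1 = rdd1.map(key => (key, ()))
    val rdd3 = keyed1
      .leftOuterJoin(rdd2)   // keeps every rdd1 key
      .map { case (key, (_, value)) => (key, value.getOrElse(0.0)) }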

Slow Performance with Apache Spark Gradient Boosted Tree training runs

2015-09-21 Thread vkutsenko
I'm experimenting with Gradient Boosted Trees learning algorithm from ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of

Re: Null Value in DecimalType column of DataFrame

2015-09-21 Thread Reynold Xin
+dev list Hi Dirceu, The answer to whether throwing an exception is better or null is better depends on your use case. If you are debugging and want to find bugs with your program, you might prefer throwing an exception. However, if you are running on a large real-world dataset (i.e. data is

HiveQL Compatibility (0.12.0, 0.13.0???)

2015-09-21 Thread Dominic Ricard
Hi, here's a statement from the Spark 1.5.0 Spark SQL and DataFrame Guide : *Compatibility with Apache Hive* Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs.

Re: Why are executors on slave never used?

2015-09-21 Thread Andrew Or
Hi Joshua, What cluster manager are you using, standalone or YARN? (Note that standalone here does not mean local mode). If standalone, you need to do `setMaster("spark://[CLUSTER_URL]:7077")`, where CLUSTER_URL is the machine that started the standalone Master. If YARN, you need to do
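
For reference, a minimal sketch for the standalone case; CLUSTER_URL stands for whichever host started the standalone Master, and this is the Scala analogue of the Python setup in the question:

    // Hedged sketch: point the app at the standalone Master instead of local mode.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("my-app")
      .setMaster("spark://CLUSTER_URL:7077")
    val sc = new org.apache.spark.SparkContext(conf)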

Re: Spark data type guesser UDAF

2015-09-21 Thread Ruslan Dautkhanov
Does it deserve to be a JIRA in Spark / Spark MLlib? How do you guys normally determine data types? Frameworks like h2o automatically determine data types by scanning a sample of the data, or the whole dataset. So then one can decide e.g. if a variable should be a categorical variable or a numerical one. Another

Re: Exception initializing JavaSparkContext

2015-09-21 Thread Ellen Kraffmiller
I found the problem - the pom.xml I was using also contained an old dependency on a Mahout library, which was pulling in the old hadoop-core. Removing that fixed the problem. Thank you! On Mon, Sep 21, 2015 at 2:54 PM, Ted Yu wrote: > bq. hadoop-core-0.20.204.0 > > How come

Re: passing SparkContext as parameter

2015-09-21 Thread Romi Kuntsman
Cody, that's a great reference! As shown there, the best way to connect to an external database from the workers is to create a connection pool on (each) worker. The driver may pass, via broadcast, the connection string, but not the connection object itself and not the spark context. On Mon, Sep
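
For reference, a minimal sketch of that pattern; `ConnectionPool` is a hypothetical helper that lazily builds one pool per executor JVM from a plain connection string, which is all the driver broadcasts:

    // Hedged sketch: only the connection string crosses the wire; the pool and its
    // sessions are created lazily on each executor.
    val connStrBc = sc.broadcast("cassandra-host:9042")   // placeholder connection string

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val session = ConnectionPool.getOrCreate(connStrBc.value)   // hypothetical helper
        records.foreach(record => session.execute(record))          // runs on the executor
        // the pooled session is reused across batches, so it is not closed here
      }
    }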

Re: Using Spark for portfolio manager app

2015-09-21 Thread Adrian Tanase
1. reading from kafka has exactly once guarantees - we are using it in production today (with the direct receiver) * you will probably have 2 topics, loading both into spark and joining / unioning as needed is not an issue * tons of optimizations you can do there, assuming
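
For reference, a minimal sketch of the direct-receiver setup in point 1; the broker list and topic names are placeholders, and it assumes the spark-streaming-kafka artifact and an existing StreamingContext `ssc`:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Hedged sketch: one direct stream covering both topics.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topicA", "topicB"))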

Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Adrian Tanase
We do - using Spark streaming, Kafka, HDFS all collocated on the same nodes. Works great so far. Spark picks up the location information and reads data from the partitions hosted by the local broker, showing up as NODE_LOCAL in the UI. You also need to look at the locality options in the

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Adrian Tanase
I've been using spray-json for general JSON ser/deser in scala (spark app), mostly for config files and data exchange. Haven't used it in conjunction with jobs that process large JSON data sources, so can't speak for those use cases. -adrian

Re: Spark on Mesos with Jobs in Cluster Mode Documentation

2015-09-21 Thread Alan Braithwaite
That could be the behavior but spark.mesos.executor.home being unset still raises an exception inside the dispatcher preventing a docker from even being started. I can see if other properties are inherited from the default environment when that's set, if you'd like. I think the main problem is

Serialization Error with PartialFunction / immutable sets

2015-09-21 Thread Chaney Courtney
Hi, I’m receiving a task not serializable exception using Spark GraphX (Scala 2.11.6 / JDK 1.8 / Spark 1.5) My vertex data is of type (VertexId, immutable Set), My edge data is of type PartialFunction[ISet[E], ISet[E]] where each ED has a precomputed function. My vertex program: val

Re: Deploying spark-streaming application on production

2015-09-21 Thread Adrian Tanase
I'm wondering, isn't this the canonical use case for WAL + reliable receiver? As far as I know you can tune the MQTT server to wait for an ack on messages (QoS level 2?). With some support from the client library you could achieve exactly-once semantics on the read side, if you ack messages only after
