Re: spark and binary files

2015-05-12 Thread tog
Thanks for the answer, but this seems to apply to files that have a key-value structure, which I currently don't have. My file is a generic binary file encoding data from sensors over time. I am just looking at recreating some objects by assigning splits (i.e. contiguous chunks of bytes) to each

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Sean Owen
The question is really whether all the third-party integrations should be built into Spark's main assembly. I think reasonable people could disagree, but I think the current state (not built in) is reasonable. It means you have to bring the integration with you. That is, no, third-party queue

Re: How to speed up data ingestion with Spark

2015-05-12 Thread Akhil Das
This article http://www.virdata.com/tuning-spark/ gives you a pretty good start on the Spark streaming side. And this article https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines is for the Kafka side; it has a nice explanation of how message size and

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Lee McFadden
Thanks for explaining Sean and Cody, this makes sense now. I'd like to help improve this documentation so other python users don't run into the same thing, so I'll look into that today. On Tue, May 12, 2015 at 9:44 AM Cody Koeninger c...@koeninger.org wrote: One of the packages just contains

Re: Duplicate entries in output of mllib column similarities

2015-05-12 Thread Reza Zadeh
Great! Reza On Tue, May 12, 2015 at 7:42 AM, Richard Bolkey rbol...@gmail.com wrote: Hi Reza, That was the fix we needed. After sorting, the transposed entries are gone! Thanks a bunch, rick On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh r...@databricks.com wrote: Hi Richard, One reason

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Sean Owen
I think Java-land users will understand to look for an assembly jar in general, but it's not as obvious outside the Java ecosystem. Assembly = this thing, plus all its transitive dependencies. No, there is nothing wrong with Kafka at all. You need to bring everything it needs for it to work at

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
@Vadim What happened when you tried unioning using DStream.union in python? TD On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote: I can confirm it does work in Java *From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] *Sent:* Tuesday, May 12, 2015 5:53 PM

Re: value toDF is not a member of RDD object

2015-05-12 Thread Dean Wampler
It's the import statement Olivier showed that makes the method available. Note that you can also use `sc.createDataFrame(myRDD)`, without the need for the import statement. I personally prefer this approach. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition
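A minimal sketch of the two approaches (assuming sqlContext is an instantiated SQLContext; the case class and data are illustrative):

    // The case class must be top-level, not nested inside a method
    case class Person(name: String, age: Int)

    val myRDD = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))

    // Option 1: toDF, which needs the implicits import in scope
    import sqlContext.implicits._
    val df1 = myRDD.toDF()

    // Option 2: createDataFrame, no import required
    val df2 = sqlContext.createDataFrame(myRDD)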

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Vadim Bichutskiy
@TD I kept getting an empty RDD (i.e. rdd.take(1) was False). On Tue, May 12, 2015 at 12:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote: @Vadim What happened when you tried unioning using DStream.union in python? TD On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov

Spark Example Project 0.3.0 released

2015-05-12 Thread Alex Dean
Hi all, Just to let you know that we've released a new version of our (Scala) Spark Example Project. It now targets Spark 1.3.0 and has much cleaner Elastic MapReduce support using boto/invoke: http://snowplowanalytics.com/blog/2015/05/10/spark-example-project-0.3.0-released/ Hope it's of

RE: DStream Union vs. StreamingContext Union

2015-05-12 Thread Evo Eftimov
I can confirm it does work in Java From: Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] Sent: Tuesday, May 12, 2015 5:53 PM To: Evo Eftimov Cc: Saisai Shao; user@spark.apache.org Subject: Re: DStream Union vs. StreamingContext Union Thanks Evo. I tried chaining Dstream unions like

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Lee McFadden
Thanks again for all the help folks. I can confirm that simply switching to `--packages org.apache.spark:spark-streaming-kafka-assembly_2.10:1.3.1` makes everything work as intended. I'm not sure what the difference is between the two packages honestly, or why one should be used over the other,

Re: Why so slow

2015-05-12 Thread Jianshi Huang
Hi Olivier, Here it is. == Physical Plan == Aggregate false, [PartialGroup#155], [PartialGroup#155 AS is_bad#108,Coalesce(SUM(PartialCount#152L),0) AS count#109L,(CAST(SUM(PartialSum#153), DoubleType) / CAST(SUM(PartialCount#154L), DoubleType)) AS avg#110] Exchange (HashPartitioning

Re: Spark SQL ArrayIndexOutOfBoundsException

2015-05-12 Thread Michael Armbrust
val trainRDD = rawTrainData.map(rawRow => Row(rawRow.split(",").map(_.toInt))) The above is creating a Row with a single column that contains a sequence. You need to extract the sequence using varargs: val trainRDD = rawTrainData.map(rawRow => Row(rawRow.split(",").map(_.toInt): _*)) You
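A self-contained sketch of the difference, assuming rawTrainData is an RDD[String] of comma-separated integers (the data is illustrative):

    import org.apache.spark.sql.Row

    val rawTrainData = sc.parallelize(Seq("1,2,3", "4,5,6"))

    // Wrong: creates a Row with ONE column holding the whole Array[Int]
    val single = rawTrainData.map(rawRow => Row(rawRow.split(",").map(_.toInt)))

    // Right: varargs expansion (: _*) makes each Int its own column
    val perColumn = rawTrainData.map(rawRow => Row(rawRow.split(",").map(_.toInt): _*))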

[SparkSQL] Partition Autodiscovery (Spark 1.3)

2015-05-12 Thread Yana Kadiyska
Hi folks, I'm trying to use automatic partition discovery as described here: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html /data/year=2014/file.parquet, /data/year=2015/file.parquet … SELECT * FROM table WHERE year = 2015 I have an official 1.3.1 CDH4
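A hedged sketch of how the discovery is typically driven in Spark 1.3, assuming the directory layout above (path and table name are illustrative):

    // Reading the parent directory lets Spark SQL discover the year=... partitions
    val df = sqlContext.parquetFile("/data")
    df.registerTempTable("table")
    sqlContext.sql("SELECT * FROM table WHERE year = 2015").show()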

Reasons for Pregel being slow

2015-05-12 Thread dawiss
Hello, I tried running GraphX Pregel for single-source shortest path (very similar to the example in https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api) using around 17K vertices and 36K edges. On a simple 8-vertex, 10-edge graph the Pregel algorithm works very well. When I

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Vadim Bichutskiy
Thanks Saisai. That makes sense. Just seems redundant to have both. On Mon, May 11, 2015 at 10:36 PM, Saisai Shao sai.sai.s...@gmail.com wrote: DStream.union can only union two DStreams, one being itself. While StreamingContext.union can union an array of DStreams; internally DStream.union is a
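A minimal pyspark sketch of the two forms (the socket sources, host, and ports are illustrative):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="UnionSketch")
    ssc = StreamingContext(sc, 10)

    s1 = ssc.socketTextStream("host", 9999)
    s2 = ssc.socketTextStream("host", 9998)
    s3 = ssc.socketTextStream("host", 9997)

    # Pairwise chaining: each DStream.union takes exactly one other stream
    chained = s1.union(s2).union(s3)

    # StreamingContext.union accepts any number of streams in one call
    combined = ssc.union(s1, s2, s3)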

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Vadim Bichutskiy
Thanks Evo. I tried chaining DStream unions like what you have and it didn't work for me. But passing multiple arguments to StreamingContext.union worked fine. Any idea why? I am using Python, BTW. On Tue, May 12, 2015 at 12:45 PM, Evo Eftimov evo.efti...@isecc.com wrote: You can also union

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
I wonder if that may be a bug in the Python API. Please file it as a JIRA along with sample code to reproduce it and the sample output you get. On Tue, May 12, 2015 at 10:00 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: @TD I kept getting an empty RDD (i.e. rdd.take(1) was False). On

how to set random seed

2015-05-12 Thread Charles Hayden
In pySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed, so that successive runs will produce the same shuffle. I am looking for a way to set this same random seed once on each worker. Is there any simple way to do it?
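One hedged approach is to seed each partition deterministically with mapPartitionsWithIndex, combining a fixed base seed with the partition index (a sketch; the base seed and data are illustrative):

    import random
    from pyspark import SparkContext

    sc = SparkContext(appName="SeededShuffle")

    def shuffle_partition(index, iterator):
        # Same base seed + partition index => reproducible shuffle per partition
        random.seed(42 + index)
        rows = list(iterator)
        random.shuffle(rows)
        return iter(rows)

    result = sc.parallelize(range(100), 4).mapPartitionsWithIndex(shuffle_partition).collect()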

Worker Core in Spark

2015-05-12 Thread guoqing0...@yahoo.com.hk
Assume that I had several machines with 8 cores each. Which one is better: 1 core per worker with 8 workers, or 8 cores per worker with 1 worker?

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-12 Thread fightf...@163.com
Hi there, which version are you using? Actually the problem seems to be gone after we changed our Spark version from 1.2.0 to 1.3.0. Not sure what the internal changes did. Best, Sun. fightf...@163.com From: Night Wolf Date: 2015-05-12 22:05 To: fightf...@163.com CC: Patrick Wendell; user; dev

question about customize kmeans distance measure

2015-05-12 Thread June
Dear list, I am new to spark, and I want to use the kmeans algorithm in mllib package. I am wondering whether it is possible to customize the distance measure used by kmeans, and how? Many thanks! June

Re: how to load some of the files in a dir and monitor new file in that dir in spark streaming without missing?

2015-05-12 Thread Akhil Das
I believe fileStream would pick up the new files (maybe you should increase the batch duration). You can see the implementation details for finding new files here
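A minimal sketch of directory monitoring via textFileStream, the simple text form of fileStream (assuming an existing SparkContext sc; the directory and batch interval are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))

    // Picks up files that appear in the directory after the stream starts
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()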

Re: Event generation

2015-05-12 Thread anshu shukla
I don't know how to simulate this type of input even for Spark. On Tue, May 12, 2015 at 3:02 PM, Steve Loughran ste...@hortonworks.com wrote: I think you may want to try emailing things to the storm users list, not the spark one On 11 May 2015, at 15:42, Tyler Mitchell

Re: Master HA

2015-05-12 Thread James King
Thanks Akhil, I'm using Spark in standalone mode so I guess Mesos is not an option here. On Tue, May 12, 2015 at 1:27 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Mesos has an HA option (of course it includes zookeeper) Thanks Best Regards On Tue, May 12, 2015 at 4:53 PM, James King

Re: Spark and RabbitMQ

2015-05-12 Thread Akhil Das
I found two examples: a Java version https://github.com/deepakkashyap/Spark-Streaming-with-RabbitMQ-/blob/master/example/Spark_project/CustomReceiver.java, and a Scala version https://github.com/d1eg0/spark-streaming-toy Thanks Best Regards On Tue, May 12, 2015 at 2:31 AM, dgoldenberg

Why so slow

2015-05-12 Thread Jianshi Huang
Hi, I have a SQL query on tables containing big Map columns (thousands of keys). I found it to be very slow. select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as avg from test where date between '2014-04-01' and '2014-04-30' group by meta['is_bad'] =

value toDF is not a member of RDD object

2015-05-12 Thread SLiZn Liu
Hi User Group, I’m trying to reproduce the example on Spark SQL Programming Guide https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection, and got a compile error when packaging with sbt: [error] myfile.scala:30: value toDF is not a member of

Re: How to get Master UI with ZooKeeper HA setup?

2015-05-12 Thread michal.klo...@gmail.com
I've been querying ZooKeeper directly via the ZooKeeper client tools; it has the IP of the current master leader in the master_status data. We are also running Exhibitor for ZooKeeper, which has a nice UI for exploring if you want to look things up manually. Thanks, Michal On May 12, 2015, at 1:28

Re: Master HA

2015-05-12 Thread Akhil Das
Mesos has an HA option (of course it includes zookeeper) Thanks Best Regards On Tue, May 12, 2015 at 4:53 PM, James King jakwebin...@gmail.com wrote: I know that it is possible to use Zookeeper and File System (not for production use) to achieve HA. Are there any other options now or in the

Re: TwitterPopularTags Long Processing Delay

2015-05-12 Thread Akhil Das
Are you using checkpointing/WAL etc? If yes, then it could be blocking on disk IO. Thanks Best Regards On Mon, May 11, 2015 at 10:33 PM, Seyed Majid Zahedi zah...@cs.duke.edu wrote: Hi, I'm running TwitterPopularTags.scala on a single node. Everything works fine for a while (about 30min),

Spark on Yarn : Map outputs lifetime ?

2015-05-12 Thread Ashwin Shankar
Hi, In Spark on YARN, when running spark_shuffle as an auxiliary service on the node manager, do map spills of a stage get cleaned up once the next stage completes, OR are they preserved till the app completes (i.e., until all the stages complete)? -- Thanks, Ashwin

Re: how to use rdd.countApprox

2015-05-12 Thread Du Li
Hi, I tested the following in my streaming app and hoped to get an approximate count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed to always return only after it finished completely, just like rdd.count(), which often exceeded 5 seconds. The values for low, mean, and high

Fast big data analytics with Spark on Tachyon in Baidu

2015-05-12 Thread Haoyuan Li
Dear all, We’re organizing a meetup http://www.meetup.com/Tachyon/events/222485713/ on May 28th at IBM in Foster City that might be of interest to the Spark community. The focus is a production use case of Spark and Tachyon at Baidu. You can sign up here:

Content based filtering

2015-05-12 Thread Yasemin Kaya
Hi, is content-based filtering available for Spark in MLlib? If it isn't, what can I use as an alternative? Thank you. Have a nice day yasemin -- hiç ender hiç

Re: Cassandra number of Tasks

2015-05-12 Thread Vijay Pawnarkar
Thanks! We can somewhat approximate the number of rows returned by where(), and as a result we can approximate the number of partitions, so the repartition approach will work. Let's say the .where() had resulted in widely varying numbers of rows; we would not have been able to approximate the # of partitions, and that would

Reading Real Time Data only from Kafka

2015-05-12 Thread James King
What I want is: if the driver dies for some reason and is restarted, I want to read only messages that arrived in Kafka following the restart of the driver program and re-connection to Kafka. Has anyone done this? Any links or resources that can help explain this? Regards jk

Re: Content based filtering

2015-05-12 Thread Nick Pentreath
Content-based filtering is a pretty broad term - do you have any particular approach in mind? MLlib does not have any purely content-based methods. Your main alternative is ALS collaborative filtering. However, using a system like Oryx / PredictionIO / elasticsearch etc. you can combine

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Yep, you can try this low-level Kafka receiver https://github.com/dibbhatt/kafka-spark-consumer. It's much more flexible/reliable than the one that comes with Spark. Thanks Best Regards On Tue, May 12, 2015 at 5:15 PM, James King jakwebin...@gmail.com wrote: What I want is if the driver dies for some

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
Very nice! will try and let you know, thanks. On Tue, May 12, 2015 at 2:25 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Yep, you can try this lowlevel Kafka receiver https://github.com/dibbhatt/kafka-spark-consumer. Its much more flexible/reliable than the one comes with Spark. Thanks

Re: Spark and RabbitMQ

2015-05-12 Thread Dmitry Goldenberg
Thanks, Akhil. It looks like in the second example, for Rabbit they're doing this: https://www.rabbitmq.com/mqtt.html. On Tue, May 12, 2015 at 7:37 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I found two examples Java version

Re: how to use rdd.countApprox

2015-05-12 Thread Tathagata Das
From the code it seems that as soon as the rdd.countApprox(5000) returns, you can call pResult.initialValue() to get the approximate count at that point of time (that is after timeout). Calling pResult.getFinalValue() will further block until the job is over, and give the final correct values
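A sketch of the distinction described above (timeout is in milliseconds; the data is illustrative):

    val rdd = sc.parallelize(1 to 1000000)
    val pResult = rdd.countApprox(5000)

    // Available as soon as the timeout expires: a BoundedDouble with mean/low/high
    val approx = pResult.initialValue

    // Blocks until the job actually finishes, then returns the exact count
    val exact = pResult.getFinalValue()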

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Night Wolf
I'm seeing a similar thing with a slightly different stack trace. Ideas? org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150) org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Cody Koeninger
Akhil, I hope I'm misreading the tone of this. If you have personal issues at stake, please take them up outside of the public list. If you have actual factual concerns about the kafka integration, please share them in a jira. Regarding reliability, here's a screenshot of a current production

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
Many thanks both, appreciate the help. On Tue, May 12, 2015 at 4:18 PM, Cody Koeninger c...@koeninger.org wrote: Yes, that's what happens by default. If you want to be super accurate about it, you can also specify the exact starting offsets for every topic/partition. On Tue, May 12, 2015

Re: Why so slow

2015-05-12 Thread Olivier Girardot
can you post the explain too? On Tue, May 12, 2015 at 12:11, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I have a SQL query on tables containing big Map columns (thousands of keys). I found it to be very slow. select meta['is_bad'] as is_bad, count(*) as count, avg(nvar['var1']) as

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Hi Cody, I was just saying that I found more success and higher throughput with the low-level Kafka API prior to KafkaRDDs, which is the future it seems. My apologies if you felt it that way. :) On 12 May 2015 19:47, Cody Koeninger c...@koeninger.org wrote: Akhil, I hope I'm misreading the tone of

Re: Event generation

2015-05-12 Thread Steve Loughran
I think you may want to try emailing things to the storm users list, not the spark one On 11 May 2015, at 15:42, Tyler Mitchell tyler.mitch...@actian.com wrote: I've had good success with the splunk generator. https://github.com/coccyx/eventgen/blob/master/README.md

Master HA

2015-05-12 Thread James King
I know that it is possible to use Zookeeper and File System (not for production use) to achieve HA. Are there any other options now or in the near future?
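For reference, a sketch of the standalone ZooKeeper recovery settings, typically placed in spark-env.sh (the ZooKeeper addresses and directory are illustrative):

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"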

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-05-12 Thread Night Wolf
Seeing similar issues, did you find a solution? One solution would be to increase the number of partitions if you're doing lots of object creation. On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com fightf...@163.com wrote: Hi, patrick Really glad to get your reply. Yes, we are doing group by

Re: Using sc.HadoopConfiguration in Python

2015-05-12 Thread Ram Sriharsha
yes, the SparkContext in the Python API has a reference to the JavaSparkContext (jsc) https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext through which you can access the hadoop configuration On Tue, May 12, 2015 at 6:39 AM, ayan guha guha.a...@gmail.com wrote: Hi
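A hedged pyspark sketch (note that _jsc is an internal attribute, and which Hadoop property actually controls the split size for sc.binaryRecords is an assumption here; the property, path, and record length are illustrative):

    # Access the Hadoop Configuration through the underlying JavaSparkContext
    hadoop_conf = sc._jsc.hadoopConfiguration()

    # Illustrative only: a smaller block size to get more, smaller partitions
    hadoop_conf.set("fs.local.block.size", str(1024 * 1024))

    records = sc.binaryRecords("/path/to/file.bin", recordLength=16)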

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Cody Koeninger
Yes, that's what happens by default. If you want to be super accurate about it, you can also specify the exact starting offsets for every topic/partition. On Tue, May 12, 2015 at 9:01 AM, James King jakwebin...@gmail.com wrote: Thanks Cody. Here are the events: - Spark app connects to
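A hedged Scala sketch of pinning exact starting offsets with the direct API (Spark 1.3+; assuming an existing StreamingContext ssc; broker, topic, and offset values are illustrative):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Start reading topic "events", partition 0, at offset 11 exactly
    val fromOffsets = Map(TopicAndPartition("events", 0) -> 11L)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))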

Re: value toDF is not a member of RDD object

2015-05-12 Thread Olivier Girardot
you need to instantiate a SQLContext: val sc: SparkContext = ... val sqlContext = new SQLContext(sc); import sqlContext.implicits._ On Tue, May 12, 2015 at 12:29, SLiZn Liu sliznmail...@gmail.com wrote: I added `libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.3.1"` to `build.sbt`

Re: Duplicate entries in output of mllib column similarities

2015-05-12 Thread Richard Bolkey
Hi Reza, That was the fix we needed. After sorting, the transposed entries are gone! Thanks a bunch, rick On Sat, May 9, 2015 at 5:17 PM, Reza Zadeh r...@databricks.com wrote: Hi Richard, One reason that could be happening is that the rows of your matrix are using SparseVectors, but the

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Sean Owen
Yeah, fair point about Python. spark-streaming-kafka should not contain third-party dependencies. However there's nothing stopping the build from producing an assembly jar from these modules. I think there is an assembly target already though? On Tue, May 12, 2015 at 3:37 PM, Lee McFadden

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Dibyendu Bhattacharya
The low-level consumer which Akhil mentioned has been running at Pearson for the last 4-5 months without any downtime. I think this one is the reliable receiver-based Kafka consumer for Spark as of today, if you want to put it that way. Prior to Spark 1.3, other receiver-based consumers have used Kafka

Re: value toDF is not a member of RDD object

2015-05-12 Thread SLiZn Liu
Thanks folks, really appreciate all your replies! I tried each of your suggestions and in particular, *Animesh*'s second suggestion of *making the case class definition global* helped me get out of the trap. Plus, I should have pasted my entire code in this mail to help the diagnosis. REGARDS, Todd

Re: The explanation of input text format using LDA in Spark

2015-05-12 Thread keegan
This matrix is in the format of a document-term matrix. Each row represents all the words in a single document, each column represents just one of the possible words, and the elements of the matrix are the corresponding word counts. Simple example here
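A tiny illustrative corpus in that format, as MLlib's LDA expects it (the vocabulary and counts are made up):

    import org.apache.spark.mllib.linalg.Vectors

    // Vocabulary: [apple, banana, cherry]
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(2.0, 1.0, 0.0)),  // doc 0: "apple banana apple"
      (1L, Vectors.dense(0.0, 0.0, 2.0))   // doc 1: "cherry cherry"
    ))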

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Cody Koeninger
I don't think it's accurate for Akhil to claim that the linked library is much more flexible/reliable than what's available in Spark at this point. James, what you're describing is the default behavior for the createDirectStream api available as part of spark since 1.3. The kafka parameter

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Hi Cody, If you are so sure, can you share a benchmark (which you ran for days, maybe?) that you have done with the Kafka APIs provided by Spark? Thanks Best Regards On Tue, May 12, 2015 at 7:22 PM, Cody Koeninger c...@koeninger.org wrote: I don't think it's accurate for Akhil to claim that

Using sc.HadoopConfiguration in Python

2015-05-12 Thread ayan guha
Hi, I found this method in the Scala API but not in the Python API (1.3.1). Basically, I want to change the blocksize in order to read a binary file using sc.binaryRecords but with multiple partitions (for testing I want to generate partitions smaller than the default blocksize). Is it possible in Python? If

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread James King
Thanks Cody. Here are the events: - Spark app connects to Kafka the first time and starts consuming - Messages 1 - 10 arrive at Kafka, then the Spark app gets them - Now the driver dies - Messages 11 - 15 arrive at Kafka - Spark driver program reconnects - Then messages 16 - 20 arrive at Kafka What I want is

Re: Specify Python interpreter

2015-05-12 Thread Bin Wang
Hi Felix and Tomoas, Thanks a lot for your information. I figured out that the environment variable PYSPARK_PYTHON is the secret key. My current approach is to start the iPython notebook on the namenode: export PYSPARK_PYTHON=/opt/local/anaconda/bin/ipython; /opt/local/anaconda/bin/ipython notebook

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Ted Yu
Currently external/kafka/pom.xml doesn't cite yammer metrics as a dependency. $ ls -l ~/.m2/repository/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar -rw-r--r-- 1 tyu staff 82123 Dec 17 2013 /Users/tyu/.m2/repository/com/yammer/metrics/metrics-core/2.2.0/metrics-core-2.2.0.jar

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Matei Zaharia
It could also be that your hash function is expensive. What is the key class you have for the reduceByKey / groupByKey? Matei On May 12, 2015, at 10:08 AM, Night Wolf nightwolf...@gmail.com wrote: I'm seeing a similar thing with a slightly different stack trace. Ideas?

How to speed up data ingestion with Spark

2015-05-12 Thread dgoldenberg
Hi, I'm looking at a data ingestion implementation which streams data out of Kafka with Spark Streaming, then uses a multi-threaded pipeline engine to process the data in each partition. Have folks looked at ways of speeding up this type of ingestion? Let's say the main part of the ingest

Re: Does long-lived SparkContext hold on to executor resources?

2015-05-12 Thread Josh Rosen
I would be cautious regarding use of spark.cleaner.ttl, as it can lead to confusing error messages if time-based cleaning deletes resources that are still needed. See my comment at

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-12 Thread Ted Yu
bq. it is already in the assembly Yes. Verified: $ jar tvf ~/Downloads/spark-streaming-kafka-assembly_2.10-1.3.1.jar | grep yammer | grep Gauge 1329 Sat Apr 11 04:25:50 PDT 2015 com/yammer/metrics/core/Gauge.class On Tue, May 12, 2015 at 8:05 AM, Sean Owen so...@cloudera.com wrote: It