Re: JavaPairDStream saveAsTextFile

2014-10-09 Thread Sean Owen
Yeah it's not there. I imagine it was simply never added, and that there's not a good reason it couldn't be. On Thu, Oct 9, 2014 at 4:53 AM, SA sadhu.a...@gmail.com wrote: Hi, I am looking at the documentation for the Java API for Streams. The Scala library has an option to save the file locally, but

Re: JavaPairDStream saveAsTextFile

2014-10-09 Thread Mayur Rustagi
That's a cryptic way to say there should be a Jira for it :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, Oct 9, 2014 at 11:46 AM, Sean Owen so...@cloudera.com wrote: Yeah it's not there. I imagine it was simply

Re: Dedup

2014-10-09 Thread Akhil Das
If you are looking to eliminate duplicate rows (or similar) then you can define a key from the data and on that key you can do reduceByKey. Thanks Best Regards On Thu, Oct 9, 2014 at 10:30 AM, Sonal Goyal sonalgoy...@gmail.com wrote: What is your data like? Are you looking at exact matching or
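
A minimal sketch of the approach Akhil describes, assuming the rows are delimited strings and the fields that define a duplicate are known in advance (the delimiter and field indices below are illustrative); run in the spark-shell where sc is available:

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

// rows: RDD[String] of delimited records
val deduped = rows
  .map { line =>
    val cols = line.split(",")
    ((cols(0), cols(1)), line)        // key on the fields that define "duplicate"
  }
  .reduceByKey((a, b) => a)           // keep one representative row per key
  .values
```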

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sean Owen
Yes, I think this is another operation that is not deterministic even for the same RDD. If a partition is lost and recalculated, the ordering can be different in the partition. Sorting the RDD makes the ordering deterministic. On Thu, Oct 9, 2014 at 7:51 AM, Sung Hwan Chung coded...@cs.stanford.edu
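
A small sketch of the workaround Sean describes, assuming the element type has an Ordering; sorting first pins down the per-partition ordering, so a recomputed partition feeds the same elements to the shuffle:

```scala
// Sorting makes the ordering deterministic before the shuffled coalesce,
// so recomputation of a lost partition reproduces the same layout.
val stable = rdd
  .sortBy(identity)                        // needs an Ordering on the element type
  .coalesce(numPartitions = 8, shuffle = true)
```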

Re: Dedup

2014-10-09 Thread Sean Owen
I think the question is about copying the argument. If it's an immutable value like String, yes just return the first argument and ignore the second. If you're dealing with a notoriously mutable value like a Hadoop Writable, you need to copy the value you return. This works fine although you will
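
A sketch of the distinction Sean draws, assuming stringPairs and textPairs are existing pair RDDs whose values are String and Hadoop Text respectively (both names are illustrative):

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.SparkContext._

// Immutable values: returning either argument as-is is fine.
val firstString = stringPairs.reduceByKey((a, b) => a)

// Mutable Writables: Hadoop reuses the same object, so copy the value you keep.
val firstText = textPairs.reduceByKey((a, b) => new Text(a))
```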

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Sung Hwan Chung
Are there a large number of non-deterministic lineage operators? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and Scala. E.g., making sure that there's no randomness whatsoever in RDD transformations seems critical.

Re: How could I start new spark cluster with hadoop2.0.2

2014-10-09 Thread Akhil Das
There is no --hadoop-minor-version, but you can try --hadoop-major-version=2.0.2. Does it break anything if you use the 2.0.0 version of Hadoop? Thanks Best Regards On Wed, Oct 8, 2014 at 8:44 PM, st553 sthompson...@gmail.com wrote: Hi, Were you able to figure out how to choose a specific

Re: Any issues with repartition?

2014-10-09 Thread Akhil Das
After a bit of research, I figured out that one of the workers was hung on GC cleanup and the connection usually times out since the default is 60 seconds, so I set it to a higher number and it eliminated this issue. You may want to try this:
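
The snippet is cut off before the actual setting; as an assumption, the property being raised is spark.core.connection.ack.wait.timeout, whose default in Spark 1.x is 60 seconds:

```scala
import org.apache.spark.SparkConf

// Raise the ack timeout so a worker stuck in a long GC pause is not
// treated as dead. The value 600 (seconds) is illustrative.
val conf = new SparkConf()
  .set("spark.core.connection.ack.wait.timeout", "600")
```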

Re: SparkStreaming program does not start

2014-10-09 Thread Akhil Das
You cannot give a file to spark-shell. You can open the spark-shell (./spark-shell --master=local[2]) and paste the code there and it will run, or else you have to create a jar and submit it through spark-submit or run it independently. Thanks Best Regards On Wed, Oct 8, 2014 at 11:07 AM, Sean

Re: Same code --works in spark 1.0.2-- but not in spark 1.1.0

2014-10-09 Thread Akhil Das
Can you try decreasing the level of parallelism that you are giving for those functions? I had this issue when I gave a value of 500 and it was gone when I dropped it to 200. Thanks Best Regards On Wed, Oct 8, 2014 at 9:28 AM, Andrew Ash and...@andrewash.com wrote: Hi Meethu, I believe you may

Re: Spark Standalone on EC2

2014-10-09 Thread Akhil Das
You must have those hostnames in your /etc/hosts file; if you are not able to access it using the hostnames then you won't be able to access it with the IP address either, I believe. What are you trying to do here? Running your Eclipse locally and connecting to your EC2 cluster? Thanks Best

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-09 Thread jan.zikes
I've tried to add / at the end of the path, but the result was exactly the same. I also guess that there will be some problem at the level of Hadoop - S3 communication. Do you know if there is some way to run scripts from Spark on, for example, a different Hadoop version from the

Re: akka.remote.transport.netty.NettyTransport

2014-10-09 Thread Akhil Das
This issue is related to your cluster. Can you paste your spark-env.sh? Also you can try starting a spark-shell like $SPARK_HOME/bin/spark-shell --master=spark://ip-172-31-24-183.ec2.internal:7077 and try the same code in it. Is this the same URI spark://ip-172-31-24-183.ec2.internal:7077

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-10-09 Thread DB Tsai
Nice to hear that your experiment is consistent with my assumption. The current L1/L2 will penalize the intercept as well, which is not ideal. I'm working on GLMNET in Spark using OWLQN, and I can get exactly the same solution as R but with scalability in the number of rows and columns. Stay tuned! Sincerely,

DIMSUM item similarity tests

2014-10-09 Thread Clive Cox
Hi, I'm trying out the DIMSUM item similarity from github master commit 69c3f441a9b6e942d6c08afecd59a0349d61cc7b. My matrix is: Num items: 8860 Number of users: 5138702 Implicit 1.0 values Running item similarity with threshold: 0.5 I have a 2-slave Spark cluster on EC2 with m3.xlarge (13G

Re: Error reading from Kafka

2014-10-09 Thread Antonio Jesus Navarro
Hi Saisai, thank you for your help, all is working OK now. Cheers! 2014-10-09 2:49 GMT+02:00 Shao, Saisai saisai.s...@intel.com: Hi, I think you have to change the code like this to specify the type info, like this: val kafkaStream: ReceiverInputDStream[(String, String)] =

Re: SparkContext.wholeTextFiles() java.io.FileNotFoundException: File does not exist:

2014-10-09 Thread Rahul Kumar Singh
I faced a similar issue with the wholeTextFiles function due to version compatibility. Spark 1.0 with Hadoop 2.4.1 worked. Did you try another function such as textFile to check if the issue is specific to wholeTextFiles? Spark needs to be re-compiled for different Hadoop versions. However, you can keep

RE: Dedup

2014-10-09 Thread Ge, Yao (Y.)
Yes. I was using a String array as the arguments in reduceByKey. I think a String array is actually immutable, and simply returning the first argument without cloning should work. I will look into mapPartitions as we can have up to 40% duplicates. Will follow up on this if necessary. Thanks very

Bug a spark task

2014-10-09 Thread poiuytrez
Hi, I am parsing a CSV file in Spark using the map function. One of the lines of the CSV file makes a task fail (and then the whole job fails). Is there a way to do some debugging to find the line which fails? Best regards, poiuytrez -- View this message in context:

Re: coalesce with shuffle or repartition is not necessarily fault-tolerant

2014-10-09 Thread Eric Friedman
+1 Eric Friedman On Oct 9, 2014, at 12:11 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Are there a large number of non-deterministic lineage operators? This seems like a pretty big caveat, particularly for casual programmers who expect consistent semantics between Spark and

Spark SQL - Exception only when using cacheTable

2014-10-09 Thread poiuytrez
Hello, I have a weird issue, this query works fine: sqlContext.sql("SELECT customer_id FROM transactions WHERE purchaseamount = 200").count() However, when I cache the table before making the request: sqlContext.cacheTable("transactions") sqlContext.sql("SELECT customer_id FROM transactions WHERE

Re: Bug a spark task

2014-10-09 Thread Akhil Das
Have a try/catch inside the map. See the following example: val csvRDD = myRDD.map(x => { var index: String = null; try { index = x.toString.split(",")(0) } catch { case e: Exception => println("Exception!! => " + e) }; (index, x) }) Thanks Best Regards On Thu, Oct 9, 2014

Re: Spark on YARN driver memory allocation bug?

2014-10-09 Thread Greg Hill
$MASTER is 'yarn-cluster' in spark-env.sh spark-submit --driver-memory 12424m --class org.apache.spark.examples.SparkPi /usr/lib/spark-yarn/lib/spark-examples*.jar 1000 OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0006fd28, 4342677504, 0) failed; error='Cannot allocate

RE: Dedup

2014-10-09 Thread Sean Owen
Arrays are not immutable and do not have the equals semantics you need to use them as a key. Use a Scala immutable List. On Oct 9, 2014 12:32 PM, Ge, Yao (Y.) y...@ford.com wrote: Yes. I was using String array as arguments in the reduceByKey. I think String array is actually immutable and
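
A quick illustration of Sean's point: arrays compare by reference, so equal contents do not group together under a shuffle, while immutable Lists compare by value. The pairs RDD below is a hypothetical RDD[(Array[String], Int)]:

```scala
import org.apache.spark.SparkContext._

Array("a", "b") == Array("a", "b")   // false: reference equality
List("a", "b") == List("a", "b")     // true: structural equality

// Convert the array key before reducing so grouping works as expected.
val fixed = pairs.map { case (k, v) => (k.toList, v) }.reduceByKey(_ + _)
```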

IOException and appcache FileNotFoundException in Spark 1.02

2014-10-09 Thread Ilya Ganelin
On Oct 9, 2014 10:18 AM, Ilya Ganelin ilgan...@gmail.com wrote: Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have any issues. My sequence of operations is

Re: Bug a spark task

2014-10-09 Thread poiuytrez
Thanks for the tip ! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Debug-a-spark-task-tp16029p16035.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To

Re: Spark Standalone on EC2

2014-10-09 Thread Ankur Srivastava
Thank you Akhil will try this out. We are able to access the machines using the public IP and even the private as they are on our subnet. Thanks Ankur On Oct 9, 2014 12:41 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You must be having those hostnames in your /etc/hosts file, if you are

Re: Interactive interface tool for spark

2014-10-09 Thread andy petrella
Hey, Regarding python libs, I'd say it's not supported out of the box, however it must be quite easy to generate plots using jFreeChart and automatically add 'em to the DOM. Nevertheless, I added extensible support for javascript manipulation of results; using that one it's rather easy to plot

Re: Is there a way to look at RDD's lineage? Or debug a fault-tolerance error?

2014-10-09 Thread Pierre B
To add a bit on this one, if you look at RDD.scala in the Spark code, you'll see that both the parent and firstParent methods are protected[spark]. I guess, for good reasons that I must admit I don't understand completely, you are not supposed to explore an RDD's lineage programmatically... I had a

java.lang.OutOfMemoryError: Java heap space when running job via spark-submit

2014-10-09 Thread Jaonary Rabarisoa
Dear all, I have a Spark job with the following configuration: val conf = new SparkConf() .setAppName("My Job") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrator", "value.serializer.Registrator") .setMaster("local[4]")

Re: DIMSUM item similarity tests

2014-10-09 Thread Xiangrui Meng
Please re-try with --driver-memory 10g. The default is 256m. -Xiangrui On Thu, Oct 9, 2014 at 2:33 AM, Clive Cox clive@rummble.com wrote: Hi, I'm trying out the DIMSUM item similarity from github master commit 69c3f441a9b6e942d6c08afecd59a0349d61cc7b . My matrix is: Num items : 8860

spark-sql failing for some tables in hive

2014-10-09 Thread sadhan
We have a Hive deployment on which we tried running spark-sql. When we try to do describe table_name for some of the tables, spark-sql fails with this, while it works for some of the other tables. Confused and not sure what's happening here. The same describe command works in Hive. What's

Re: Help with using combineByKey

2014-10-09 Thread Sean Owen
You have a typo in your code at var acc:, and the map from opPart1 to opPart2 looks like a no-op, but those aren't the problem I think. It sounds like you intend the first element of each pair to be a count of nonzero values, but you initialize the first element of the pair to v, not 1, in v =

Re: Help with using combineByKey

2014-10-09 Thread Yana Kadiyska
If you just want the ratio of positive to all values per key (if I'm reading right) this works: val reduced = input.groupByKey().map(grp => grp._2.filter(v => v > 0).size.toFloat / grp._2.size) reduced.foreach(println) I don't think you need reduceByKey or combineByKey as you're not doing anything where the

Performance with activeSetOpt in GraphImpl.mapReduceTriplets()

2014-10-09 Thread Cheuk Lam
When using the activeSetOpt in GraphImpl.mapReduceTriplets(), can we expect a performance that is only proportional to the size of the active set and independent of the size of the original data set? Or there is still a fixed overhead that depends on the size of the original data set? Thank you!

Re: Spark Standalone on EC2

2014-10-09 Thread Akhil Das
Another workaround would be to add the hostnames with IP addresses to every machine's /etc/hosts file. Thanks Best Regards On Thu, Oct 9, 2014 at 8:49 PM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Thank you Akhil will try this out. We are able to access the machines using the public

Re: Help with using combineByKey

2014-10-09 Thread HARIPRIYA AYYALASOMAYAJULA
Hello Sean, Thank you, but changing from v to 1 doesn't help me either. I am trying to count the number of non-zero values using the first accumulator. val newlist = List(("LAX", 6), ("LAX", 0), ("LAX", 7), ("SFO", 0), ("SFO", 0), ("SFO", 9)) val plist = sc.parallelize(newlist) val part1 = plist.combineByKey(

Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-09 Thread Ranga
Resolution: After realizing that the SerDe (OpenCSV) was causing all the fields to be defined as String type, I modified the Hive load statement to use the default serializer. I was able to modify the CSV input file to use a different delimiter. Although this is a workaround, I am able to proceed

MLUtil.kfold generates overlapped training and validation set?

2014-10-09 Thread Nan Zhu
Hi all, When we use MLUtils.kFold to generate training and validation sets for cross validation, we found that there is an overlapping part in the two sets… From the code, it samples twice from the same dataset: @Experimental def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed:
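
A small sketch of how the reported overlap can be checked, assuming data is an RDD of distinct records (the name and fold count are illustrative); kFold returns an array of (training, validation) pairs, so a non-empty intersection confirms the behaviour:

```scala
import org.apache.spark.mllib.util.MLUtils

val folds = MLUtils.kFold(data, numFolds = 3, seed = 42)

folds.zipWithIndex.foreach { case ((training, validation), i) =>
  val overlap = training.intersection(validation).count()
  println(s"fold $i: $overlap records appear in both sets")
}
```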

Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
Hi Xiangrui, I am trying to implement the TF-IDF as per the instructions you sent in your response to Jatin. I am getting an error in the IDF step. Here are my steps, which run until the last line, where the compile fails: val labeledDocs = sc.textFile(title_subcategory) val stopwords =

Re: java.lang.OutOfMemoryError: Java heap space when running job via spark-submit

2014-10-09 Thread Jaonary Rabarisoa
In fact, with --driver-memory 2G I can get it working. On Thu, Oct 9, 2014 at 6:20 PM, Xiangrui Meng men...@gmail.com wrote: Please use --driver-memory 2g instead of --conf spark.driver.memory=2g. I'm not sure whether this is a bug. -Xiangrui On Thu, Oct 9, 2014 at 9:00 AM, Jaonary Rabarisoa

How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread bdev
I'm using KafkaUtils.createStream for the input stream to pull messages from Kafka, which seems to return a ReceiverInputDStream. I do not see saveAsNewAPIHadoopFile available on ReceiverInputDStream and obviously run into this error: saveAsNewAPIHadoopFile is not a member of

Spark in cluster [ remote.EndpointWriter: AssociationError]

2014-10-09 Thread Morbious
Hi, Recently I've configured Spark in a cluster with ZooKeeper. I have 2 masters (active/standby) and 6 workers. I've begun my installation with samples from the examples directory. Everything worked fine when I only used memory. When I used the word count example I got messages like the ones below:

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread Sean Owen
I think you have not imported org.apache.spark.streaming.StreamingContext._ ? This gets you the implicits that provide these methods. On Thu, Oct 9, 2014 at 8:40 PM, bdev buntu...@gmail.com wrote: I'm using KafkaUtils.createStream for the input stream to pull messages from kafka which seems to

Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Soumya Simanta
I have a SchemaRDD that I want to convert to an RDD of Strings. How do I convert the Row inside the SchemaRDD to a String?

Re: Spark Streaming Fault Tolerance (?)

2014-10-09 Thread Massimiliano Tomassi
Hello all, I wrote a blog post around the issue I reported before: http://metabroadcast.com/blog/design-your-spark-streaming-cluster-carefully Can I ask for some feedback from those who are already using Spark Streaming in production? How do you deal with fault tolerance and scalability? Thanks a lot for

Spark job doesn't clean after itself

2014-10-09 Thread Rohit Pujari
Hello Folks: I'm running a Spark job on YARN. After the execution, I would expect the Spark job to clean the staging area, but it seems every run creates a new staging directory. Is there a way to force the Spark job to clean up after itself? Thanks, Rohit -- CONFIDENTIALITY NOTICE NOTICE: This message

Debug Spark in Cluster Mode

2014-10-09 Thread Rohit Pujari
Hello Folks: What're some best practices to debug Spark in cluster mode? Thanks, Rohit -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from

Could Spark make use of Intel Xeon Phi?

2014-10-09 Thread Lang Yu
Hi, I have set up Spark 1.0.2 on the cluster using standalone mode and the input is managed by HDFS. One node of the cluster has an Intel Xeon Phi 5110P coprocessor. Is there any possibility that Spark could be aware of the existence of the Phi and run jobs on the Xeon Phi, or recognize the Phi as an

Re: Could Spark make use of Intel Xeon Phi?

2014-10-09 Thread Andrew Ash
Hi Lang, What special features of the Xeon Phi do you want Spark to take advantage of? On Thu, Oct 9, 2014 at 4:50 PM, Lang Yu lysubscr...@gmail.com wrote: Hi, I have set up Spark 1.0.2 on the cluster using standalone mode and the input is managed by HDFS. One node of the cluster has Intel

Re: Spark on YARN driver memory allocation bug?

2014-10-09 Thread Sandy Ryza
I filed https://issues.apache.org/jira/browse/SPARK-3884 to address this. -Sandy On Thu, Oct 9, 2014 at 7:05 AM, Greg Hill greg.h...@rackspace.com wrote: $MASTER is 'yarn-cluster' in spark-env.sh spark-submit --driver-memory 12424m --class org.apache.spark.examples.SparkPi

Memory Leaks? 1GB input file turns into 8GB memory use in JVM... from parsing CSV

2014-10-09 Thread Aris
Hello Spark folks, I am doing a simple parsing of a CSV input file, and the input file is very large (~1GB). It seems I have a memory leak here and I am destroying my server. After using jmap to generate a Java heap dump and using the Eclipse Memory Analyzer, I basically learned that when I read

Re: GroupBy Key and then sort values with the group

2014-10-09 Thread Davies Liu
There is a new API called repartitionAndSortWithinPartitions() in master; it may help in this case, but then you need to do the `groupBy()` yourself. On Wed, Oct 8, 2014 at 4:03 PM, chinchu chinchu@gmail.com wrote: Sean, I am having a similar issue, but I have a lot of data for a group I
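
A sketch of the API Davies mentions (in master at the time, released in Spark 1.2), assuming a pair RDD keyed by group id; note it sorts by key, so sorting values within a group means folding the sort field into the key (secondary sort):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // OrderedRDDFunctions implicits in Spark 1.x

// pairs: RDD[(String, Int)] — partition by key and sort within each partition
// in a single shuffle, then walk each partition group by group.
val partitionedAndSorted =
  pairs.repartitionAndSortWithinPartitions(new HashPartitioner(16))
```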

One pass compute() to produce multiple RDDs

2014-10-09 Thread Akshat Aranya
Hi, Is there a good way to materialize derivative RDDs from, say, a HadoopRDD while reading in the data only once? One way to do so would be to cache the HadoopRDD and then create derivative RDDs, but that would require enough RAM to cache the HadoopRDD which is not an option in my case. Thanks,

Re: Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Matei Zaharia
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString). Or if you want to get a particular field of the row, you can do rdd.map(row => row(3).toString). Matei On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote: I've a SchemaRDD that I want to

Re: where are my python lambda functions run in yarn-client mode?

2014-10-09 Thread Davies Liu
When you call rdd.take() or rdd.first(), it may[1] execute the job locally (in the driver); otherwise, all the jobs are executed in the cluster. There is a config called `spark.localExecution.enabled` (since 1.1) to change this; it's not enabled by default, so all the functions will be executed in
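
A minimal sketch of the setting Davies refers to (Spark 1.1+); with it enabled, take()/first() may run their first job in the driver instead of launching cluster tasks:

```scala
import org.apache.spark.SparkConf

// Off by default; enabling it allows local execution of small first jobs.
val conf = new SparkConf().set("spark.localExecution.enabled", "true")
```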

Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
Did some digging in the documentation. Looks like IDFModel.transform only accepts an RDD as input, and not individual elements. Is this a bug? I am saying this because HashingTF.transform accepts both an RDD as well as individual vectors as its input. From your post replying to Jatin, it looks like

Re: java.io.IOException Error in task deserialization

2014-10-09 Thread Davies Liu
This exception should be caused by another one, could you paste all of them here? Also, it would be great if you could provide a script to reproduce this problem. Thanks! On Fri, Sep 26, 2014 at 6:11 AM, Arun Ahuja aahuj...@gmail.com wrote: Has anyone else seen this error in task

Re: java.io.IOException Error in task deserialization

2014-10-09 Thread Davies Liu
Could you provide a script to reproduce this problem? Thanks! On Wed, Oct 8, 2014 at 9:13 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: This is also happening to me on a regular basis, when the job is large with relatively large serialized objects used in each RDD lineage. A bad thing

Re: GroupBy Key and then sort values with the group

2014-10-09 Thread Chinchu Sup
Thanks Davies.. I'll try it when it gets released (I am on 1.1.0 currently). For now I am using a custom partitioner with the ShuffleRDD() to keep the same groups together, so I don't have to shuffle all data to a single partition. On Thu, Oct 9, 2014 at 2:34 PM, Davies Liu dav...@databricks.com

Re: java.io.IOException Error in task deserialization

2014-10-09 Thread Sung Hwan Chung
I don't think that I saw any other error message. This is all I saw. I'm currently experimenting to see if this can be alleviated by using HttpBroadcastFactory instead of TorrentBroadcast. So far, with HttpBroadcast, I haven't seen this recurring as of yet. I'll keep you posted. On Thu, Oct 9,

Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
Hi Folks, We have a spark job that is occasionally running out of memory and hanging (I believe in GC). This is its own issue we're debugging, but in the meantime, there's another unfortunate side effect. When the job is killed (most often because of GC errors), each worker attempts to kill

Re: Spark Streaming : Could not compute split, block not found

2014-10-09 Thread Tian Zhang
I have figured out why I am getting this error: We have a lot of data in Kafka and the DStream from Kafka used MEMORY_ONLY_SER, so once memory was low, Spark started to discard data that was needed later ... So once I changed to MEMORY_AND_DISK_SER, the error was gone. Tian -- View this
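
A sketch of the change Tian describes, using the KafkaUtils.createStream overload that takes a storage level; ssc, the ZooKeeper quorum, group id and topic map are placeholders:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// Spill received blocks to disk instead of dropping them when memory runs low.
val stream = KafkaUtils.createStream(
  ssc,
  "zk1:2181,zk2:2181",           // ZooKeeper quorum (placeholder)
  "my-consumer-group",           // consumer group id (placeholder)
  Map("events" -> 2),            // topic -> receiver threads (placeholder)
  StorageLevel.MEMORY_AND_DISK_SER)
```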

getting tweets for a specified handle

2014-10-09 Thread SK
Hi, I am using Spark 1.1.0. Is there a way to get the complete tweets corresponding to a handle (e.g. @Delta)? I tried using the following example that extracts just the hashtags and replaced the # with @ as follows. I need the complete tweet and not just the tags. // val hashTags =
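
A sketch of one way to keep whole tweets for a handle rather than extracting tags, assuming the spark-streaming-twitter receiver with credentials set via the usual twitter4j system properties; matching on text is a rough filter (it also catches replies and retweets):

```scala
import org.apache.spark.streaming.twitter.TwitterUtils

val tweets = TwitterUtils.createStream(ssc, None)
// Alternatively, pass track keywords: TwitterUtils.createStream(ssc, None, Seq("@Delta"))

val deltaTweets = tweets
  .filter(status => status.getText.contains("@Delta"))
  .map(status => status.getText)      // full tweet text, not just the tags

deltaTweets.print()
```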

IOException and appcache FileNotFoundException in Spark 1.02

2014-10-09 Thread Ilya Ganelin
Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have any issues. My sequence of operations is as follows – unless otherwise specified, I am not caching: Map a

Combined HDFS/Kafka Processing

2014-10-09 Thread Tobias Pfeiffer
Hi, I have a setting where data arrives in Kafka and is stored to HDFS from there (maybe using Camus or Flume). I want to write a Spark Streaming app where - first all files in that HDFS directory are processed, - and then the stream from Kafka is processed, starting with the first item

Unable to share Sql between HiveContext and JDBC Thrift Server

2014-10-09 Thread Steve Arnold
I am writing a Spark job to persist data using HiveContext so that it can be accessed via the JDBC Thrift server. Although my code doesn't throw an error, I am unable to see my persisted data when I query from the Thrift server. I tried three different ways to get this to work: 1) val

Executor and BlockManager memory size

2014-10-09 Thread Larry Xiao
Hi all, I'm confused about Executor and BlockManager, why they have different memory. 14/10/10 08:50:02 INFO AppClient$ClientActor: Executor added: app-20141010085001-/2 on worker-20141010004933-brick6-35657 (brick6:35657) with 6 cores 14/10/10 08:50:02 INFO SparkDeploySchedulerBackend:

Re: Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
Actually, it looks like even when the job shuts down cleanly, there can be a few minutes of overlap between the time the next job launches and the first job actually terminates its process. Here's some relevant lines from my log: 14/10/09 20:49:20 INFO Worker: Asked to kill executor

Re: Spark SQL Percentile UDAF

2014-10-09 Thread Michael Armbrust
Please file a JIRA: https://issues.apache.org/jira/browse/SPARK/ On Thu, Oct 9, 2014 at 6:48 PM, Anand Mohan chinn...@gmail.com wrote: Hi, I just noticed the

Re: [SQL] Set Parquet block size?

2014-10-09 Thread Michael Allman
Hi Pierre, I'm setting the Parquet (and HDFS) block size as follows: val ONE_GB = 1024 * 1024 * 1024 sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB) sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB) Here, sc is a reference to the spark context. I've tested this and it

Re: Spark SQL Percentile UDAF

2014-10-09 Thread Anand Mohan Tumuluri
Filed https://issues.apache.org/jira/browse/SPARK-3891 Thanks, Anand Mohan On Thu, Oct 9, 2014 at 7:13 PM, Michael Armbrust mich...@databricks.com wrote: Please file a JIRA:https://issues.apache.org/jira/browse/SPARK/

How to benchmark SPARK apps?

2014-10-09 Thread Theodore Si
Hi all, What tools should I use to benchmark SPARK applications? BR, Theo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: How to benchmark SPARK applications?

2014-10-09 Thread Theodore Si
How can I get figures like those in the Evaluation part of the following paper? http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf On 10/10/2014 10:35 AM, Theodore Si wrote: Hi all, What tools should I use to benchmark SPARK applications? BR, Theo

Re: How to benchmark SPARK apps?

2014-10-09 Thread 牛兆捷
You can try https://github.com/databricks/spark-perf

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread Buntu Dev
Thanks Sean, but I'm importing org.apache.spark.streaming.StreamingContext._ Here are the Spark imports: import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.kafka._ import org.apache.spark.SparkConf val stream

Re: How to benchmark SPARK apps?

2014-10-09 Thread Theodore Si
What can I get from it? Can you show me some results, please? On 10/10/2014 10:46 AM, 牛兆捷 wrote: You can try https://github.com/databricks/spark-perf - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

Re: Help with using combineByKey

2014-10-09 Thread HARIPRIYA AYYALASOMAYAJULA
Sean, Thank you. It works. But I am still confused about the function. Can you kindly throw some light on it? I was going through the example mentioned in https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html Is there any better source through which I can learn

Re: One pass compute() to produce multiple RDDs

2014-10-09 Thread Sean Owen
Although caching is synonymous with persisting in memory, you can also just persist the result (partially) on disk. At least you would use as much RAM as you can. Obviously that requires re-reading the RDD (partially) from HDFS, and the point is to avoid reading data from HDFS several times. But
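
A minimal sketch of Sean's suggestion, assuming sc is a live SparkContext and the path is illustrative; MEMORY_AND_DISK keeps what fits in RAM and spills the rest, so the derived RDDs reread spilled partitions from local disk rather than from HDFS:

```scala
import org.apache.spark.storage.StorageLevel

val source = sc.textFile("hdfs:///data/input")
  .persist(StorageLevel.MEMORY_AND_DISK)

// Both derivatives reuse the persisted partitions instead of rescanning HDFS.
val derivedA = source.filter(_.contains("ERROR"))
val derivedB = source.map(_.length)
```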

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-09 Thread Sean Owen
Your RDD does not contain pairs, since you .map(_._2) (BTW that can just be .values). Hadoop files means SequenceFiles and those store key-value pairs. That's why the method only appears for RDD[(K,V)]. On Fri, Oct 10, 2014 at 3:50 AM, Buntu Dev buntu...@gmail.com wrote: Thanks Sean, but I'm
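
A sketch following Sean's point, keeping the Kafka (key, value) pairs instead of mapping to values only; with a pair DStream the save method (saveAsNewAPIHadoopFiles, plural, on pair DStreams) becomes available. The output path and Text/TextOutputFormat choice are illustrative:

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits in Spark 1.x

// stream: ReceiverInputDStream[(String, String)] from KafkaUtils.createStream
val writable = stream.map { case (k, v) => (new Text(k), new Text(v)) }

writable.saveAsNewAPIHadoopFiles(
  "hdfs:///user/me/kafka-out",      // prefix (illustrative)
  "txt",                            // suffix
  classOf[Text],
  classOf[Text],
  classOf[TextOutputFormat[Text, Text]])
```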

Intermittent checkpointing failure.

2014-10-09 Thread Sung Hwan Chung
I'm getting a DFS closed channel exception every now and then when I run checkpointing. I do checkpointing every 15 minutes or so. This usually happens after running the job for 1-2 hours. Has anyone seen this before? Job aborted due to stage failure: Task 6 in stage 70.0 failed 4 times, most recent

Re: Help with using combineByKey

2014-10-09 Thread Sean Owen
It's the exact same reason you wrote: (acc: (Int, Int), v) => (if (v > 0) acc._1 + 1 else acc._1, acc._2 + 1), right? The first function establishes an initial value for a count. The value is either (0,1) or (1,1) depending on whether the value is 0 or not. You're otherwise using the method just
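
Putting the thread's example together as a sketch: per airport code, count the non-zero values and the total, with the three combineByKey functions shaped as Sean describes:

```scala
import org.apache.spark.SparkContext._

val plist = sc.parallelize(List(("LAX", 6), ("LAX", 0), ("LAX", 7),
                                ("SFO", 0), ("SFO", 0), ("SFO", 9)))

val counts = plist.combineByKey(
  (v: Int) => (if (v > 0) 1 else 0, 1),                                          // createCombiner
  (acc: (Int, Int), v: Int) => (if (v > 0) acc._1 + 1 else acc._1, acc._2 + 1),  // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2))                  // mergeCombiners

counts.collect().foreach { case (airport, (nonZero, total)) =>
  println(s"$airport: $nonZero non-zero of $total")
}
```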

Is possible to invoke updateStateByKey twice on the same RDD

2014-10-09 Thread qihong
I need to implement the following logic in a Spark Streaming app: for the incoming DStream, do some transformation, and invoke updateStateByKey to update the state object for each key (marking data entries that are updated as dirty for the next step), then let state objects produce event(s) (based on

Re: Processing order in Spark

2014-10-09 Thread x
I doubt Spark has such an ability to arrange the order of task execution. You could try these approaches: 1. Write your own partitioner to group your data. 2. Sort elements in partitions, e.g. with a row index. 3. Control the order of the results obtained from Spark in your application. xj

Re: Processing order in Spark

2014-10-09 Thread Sean Owen
Since an RDD doesn't have any ordering guarantee to begin with, I don't think there is any guarantee about the order in which data is encountered. It can change when the same RDD is reevaluated even. As you say, your scenario 1 is about the best you can do. You can achieve this if you can define