Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Xiangrui Meng
1. No. 2. The seed per partition is fixed. So it should generate non-overlapping subsets. 3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1. Best, Xiangrui On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, all When we use MLUtils.kfold to generate training

Re: Help with using combineByKey

2014-10-10 Thread Davies Liu
Maybe this version is easier to use: plist.mapValues((v) => (if (v > 0) 1 else 0, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) It has similar behavior to combineByKey(), and will be faster than the groupByKey() version. On Thu, Oct 9, 2014 at 9:28 PM, HARIPRIYA AYYALASOMAYAJULA
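For reference, a minimal self-contained Scala sketch of the reduceByKey approach described above; the sample pairs standing in for plist are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("CountPositives").setMaster("local[2]"))
    // hypothetical (key, value) pairs standing in for plist
    val plist = sc.parallelize(Seq(("a", 3), ("a", -1), ("b", 7)))
    // per key: (number of positive values, total number of values)
    val counts = plist
      .mapValues(v => (if (v > 0) 1 else 0, 1))
      .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    counts.collect().foreach(println)   // prints (a,(1,2)) and (b,(1,1))
    sc.stop()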

Re: How to save ReceiverInputDStream to Hadoop using saveAsNewAPIHadoopFile

2014-10-10 Thread Akhil Das
You can convert this ReceiverInputDStream http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.ReceiverInputDStream into PairRDDFunctions http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions and call the
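One possible sketch of the pattern being suggested, mapping each batch to Hadoop Writables inside foreachRDD and calling saveAsNewAPIHadoopFile; the socket source, output path and key/value layout here are assumptions for illustration:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("SaveStream"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)   // any ReceiverInputDStream works here
    lines.foreachRDD { (rdd, time) =>
      // turn each batch into (key, value) pairs so the PairRDDFunctions save methods apply
      rdd.map(line => (new Text(time.milliseconds.toString), new Text(line)))
        .saveAsNewAPIHadoopFile(
          s"hdfs:///tmp/stream-out/${time.milliseconds}",  // hypothetical output path
          classOf[Text], classOf[Text],
          classOf[TextOutputFormat[Text, Text]])
    }
    ssc.start()
    ssc.awaitTermination()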

How to patch sparkSQL on EC2?

2014-10-10 Thread Christos Kozanitis Christos Kozanitis
Hi I have written a few extensions for sparkSQL (for version 1.1.0) and I am trying to deploy my new jar files (one for catalyst and one for sql/core) on ec2. My approach was to create a new spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar that merged the contents of the old one with the

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Akhil Das
You could be hitting this issue https://issues.apache.org/jira/browse/SPARK-3633 (or similar). You can try the following workarounds: sc.set("spark.core.connection.ack.wait.timeout","600") sc.set("spark.akka.frameSize","50") Also reduce the number of partitions, you could be hitting the kernel's ulimit.
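A sketch of how those workarounds are typically applied on the SparkConf before the context is created; the values simply mirror the suggestion above and are not tuned recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyJob")
      .set("spark.core.connection.ack.wait.timeout", "600")  // seconds to wait for an ack before failing
      .set("spark.akka.frameSize", "50")                     // max driver/executor message size, in MB
    val sc = new SparkContext(conf)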

Re: Spark in cluster [ remote.EndpointWriter: AssociationError]

2014-10-10 Thread Akhil Das
Can you paste your spark-env.sh file? Looks like you have it misconfigured. Thanks Best Regards On Fri, Oct 10, 2014 at 1:43 AM, Morbious knowledgefromgro...@gmail.com wrote: Hi, Recently I've configured spark in cluster with zookeper. I have 2 masters ( active/standby) and 6 workers. I've

Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hello spark users and developers! I am using hdfs + spark sql + hive schema + parquet as the storage format. I have a lot of parquet files - one file fits one hdfs block for one day. The strange thing is a very slow first query in spark sql. To reproduce the situation I use only one core and I get 97sec

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Sean Owen
You could try setting -Xcomp for executors to force JIT compilation upfront. I don't know if it's a good idea overall, but it might show whether the upfront compilation really helps. I doubt it. However, isn't this almost surely due to caching somewhere, in Spark SQL or HDFS? I really doubt hotspot makes
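One hedged way to run that experiment is to pass the flag through spark.executor.extraJavaOptions; whether -Xcomp actually helps is exactly what is in question here:

    val conf = new SparkConf()
      .setAppName("HotspotTest")
      // force the executor JVMs to JIT-compile every method up front (experiment only)
      .set("spark.executor.extraJavaOptions", "-Xcomp")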

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Alexey Romanchuk
Hey Sean and spark users! Thanks for the reply. I tried -Xcomp just now and the start time was about a few minutes (as expected), but the first query was as slow as before: Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 30 columns in 12897

Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread visakh
Can you try checking whether the table is being cached? You can use the isCached method. More details are here - http://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/sql/SQLContext.html -- View this message in context:
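A minimal sketch of the suggested check; the table name "records" is hypothetical:

    sqlContext.cacheTable("records")
    println(sqlContext.isCached("records"))   // true once the table is marked as cached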

Can I run examples on cluster?

2014-10-10 Thread Theodore Si
Hi all, I want to use two nodes for a test, one as the master, the other as a worker. Can I submit an example application included in the Spark source code tarball on the master and have it run on the worker? What should I do? BR, Theo

Re: Delayed hotspot optimizations in Spark

2014-10-10 Thread Guillaume Pitel
Hi Could it be due to GC ? I read it may happen if your program starts with a small heap. What are your -Xms and -Xmx values ? Print GC stats with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps Guillaume Hello spark users and developers! I am using hdfs + spark sql + hive schema +
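A sketch of setting the executor heap and the suggested GC logging flags; the 4g heap is a placeholder:

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")   // placeholder executor heap size
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")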

Re: Windowed Operations

2014-10-10 Thread julyfire
hi Diego, I have the same problem. // reduce by key in the first window val w1 = one.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10)) w1.count().print() // reduce by key in the second window based on the results of the first window val w2 = w1.reduceByKeyAndWindow(_ + _,

Spark on Mesos Issue - Do I need to install Spark on Mesos slaves

2014-10-10 Thread Bijoy Deb
Hi, I am trying to submit a Spark job on Mesos using spark-submit from my Mesos-Master machine. My SPARK_HOME = /vol1/spark/spark-1.0.2-bin-hadoop2 I have uploaded the spark-1.0.2-bin-hadoop2.tgz to hdfs so that the mesos slaves can download it to invoke the Mesos Spark backend executor. But

Re: Can I run examples on cluster?

2014-10-10 Thread Akhil Das
This is how the spark cluster looks. So your driver program (the example application) can be run on the master (or anywhere which has access to the master / cluster manager) and the workers will execute it. Thanks Best Regards On Fri, Oct 10, 2014 at 2:47 PM, Theodore Si sjyz...@gmail.com

Re: Spark SQL parser bug?

2014-10-10 Thread Cheng Lian
Hi Mohammed, Would you mind sharing the DDL of the table |x| and the complete stack trace of the exception you got? A full Spark shell session history would be more than helpful. PR #2084 was merged into master in Aug, and the timestamp type is supported in 1.1. I tried the following

Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread Cheng Lian
Hi Poiuytrez, what version of Spark are you using? Exception details like stacktrace are really needed to investigate this issue. You can find them in the executor logs, or just browse the application stderr/stdout link from Spark Web UI. On 10/9/14 9:37 PM, poiuytrez wrote: Hello, I have a

Re: Unable to share Sql between HiveContext and JDBC Thrift Server

2014-10-10 Thread Cheng Lian
Which version are you using? Also |.saveAsTable()| saves the table to Hive metastore, so you need to make sure your Spark application points to the same Hive metastore instance as the JDBC Thrift server. For example, put |hive-site.xml| under |$SPARK_HOME/conf|, and run |spark-shell| and

Re: Spark in cluster [ remote.EndpointWriter: AssociationError]

2014-10-10 Thread Morbious
Sorry, but your solution doesn't work. I can see port 7077 open and listening on my master, and workers connected, but I don't understand why it's trying to connect to itself ... = Master is running on the specific host netstat -at | grep 7077 You will get something similar to: tcp 0 0

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
Thank you - I will try this. If I drop the partition count am I not more likely to hit memory issues? Especially if the dataset is rather large? On Oct 10, 2014 3:19 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You could be hitting this issue https://issues.apache.org/jira/browse/SPARK-3633

Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread poiuytrez
I am using the python api. Unfortunately, I cannot find the isCached method equivalent in the documentation: https://spark.apache.org/docs/1.1.0/api/python/index.html in the SQLContext section. -- View this message in context:

Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread poiuytrez
Hi Cheng, I am using Spark 1.1.0. This is the stack trace: 14/10/10 12:17:40 WARN TaskSetManager: Lost task 120.0 in stage 7.0 (TID 2235, spark-w-0.c.db.internal): java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer

Disabling log4j in Spark-Shell on ec2 stopped working on Wednesday (Oct 8)

2014-10-10 Thread Darin McBeath
For weeks, I've been using the following trick to successfully disable log4j in the spark-shell when running a cluster on ec2 that was started by the Spark provided ec2 scripts. cp ./conf/log4j.properties.template ./conf/log4j.properties I then change log4j.rootCategory=INFO to

Re: Can I run examples on cluster?

2014-10-10 Thread Theodore Si
But I cannot do this by using ./bin/run-example SparkPi 10, right? On Fri, Oct 10, 2014 at 6:04 PM, Akhil Das ak...@sigmoidanalytics.com wrote: This is how the spark-cluster looks like So your driver program (example application) can be ran on the master (or anywhere which has access to

Re: Can I run examples on cluster?

2014-10-10 Thread Theodore Si
Should I pack the example into a jar file and submit it on master? On Fri, Oct 10, 2014 at 9:32 PM, Theodore Si sjyz...@gmail.com wrote: But I cannot do this via using ./bin/run-example SparkPi 10 right? On Fri, Oct 10, 2014 at 6:04 PM, Akhil Das ak...@sigmoidanalytics.com wrote: This

Re: Can I run examples on cluster?

2014-10-10 Thread Akhil Das
Yes, you can run it with --master=spark://your-spark-uri:7077, I believe. Thanks Best Regards On Fri, Oct 10, 2014 at 7:03 PM, Theodore Si sjyz...@gmail.com wrote: Should I pack the example into a jar file and submit it on master? On Fri, Oct 10, 2014 at 9:32 PM, Theodore Si sjyz...@gmail.com

Re: Can I run examples on cluster?

2014-10-10 Thread HARIPRIYA AYYALASOMAYAJULA
spark-submit --class Classname --master yarn-cluster jarfile (with complete path) This should work. On Fri, Oct 10, 2014 at 8:36 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Yes, you can run it with --master=spark://your-spark-uri:7077 i believe. Thanks Best Regards On Fri, Oct 10,
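A hypothetical complete invocation for the two-node standalone cluster discussed earlier (class name, jar path and master URL are placeholders; swap --master for yarn-cluster on YARN):

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://your-master-host:7077 \
      /path/to/spark-examples.jar 10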

How does the Spark Accumulator work under the covers?

2014-10-10 Thread Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
Hello, I was wondering what the Spark accumulator does under the covers. I’ve implemented my own associative addInPlace function for the accumulator; where is this function being run? Let’s say you call something like myRdd.map(x => sum += x); is “sum” being accumulated locally in any way,

Re: MLUtil.kfold generates overlapped training and validation set?

2014-10-10 Thread Nan Zhu
Thanks, Xiangrui, I found the reason for the overlapping training and test sets…. Another counter-intuitive issue related to https://github.com/apache/spark/pull/2508 Best, -- Nan Zhu On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote: 1. No. 2. The seed per partition

Re: How does the Spark Accumulator work under the covers?

2014-10-10 Thread HARIPRIYA AYYALASOMAYAJULA
If you use parallelize, the data is distributed across the available nodes, and the sum is computed individually within each partition and later merged. The driver manages the entire process. Is my understanding correct? Can someone please correct me if I am wrong? On Fri, Oct 10, 2014 at 9:37

RE: Spark on Mesos Issue - Do I need to install Spark on Mesos slaves

2014-10-10 Thread Yangcheng Huang
spark-defaults.conf spark.executor.uri hdfs://:9000/user//spark-1.1.0-bin-hadoop2.4.tgz From: Bijoy Deb [mailto:bijoy.comput...@gmail.com] Sent: den 10 oktober 2014 11:59 To: user@spark.apache.org Subject: Spark on Mesos Issue - Do I need to install Spark on Mesos slaves Hi, I
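A Scala sketch of applying the same setting programmatically for a Mesos-backed context; the ZooKeeper URL and HDFS path are placeholders:

    val conf = new SparkConf()
      .setAppName("MesosJob")
      .setMaster("mesos://zk://your-zk-host:2181/mesos")   // placeholder Mesos master
      // executors fetch and unpack this distribution themselves, so nothing needs
      // to be pre-installed on the slaves
      .set("spark.executor.uri",
           "hdfs://your-namenode:9000/user/you/spark-1.1.0-bin-hadoop2.4.tgz")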

Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Matei Zaharia
Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Debasish Das
Awesome news Matei ! Congratulations to the databricks team and all the community members... On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Mridul Muralidharan
Brilliant stuff ! Congrats all :-) This is indeed really heartening news ! Regards, Mridul On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which

SparkSQL LEFT JOIN problem

2014-10-10 Thread invkrh
Hi, I am exploring SparkSQL 1.1.0 and I have a problem with LEFT JOIN. Here is the query: select * from customer left join profile on customer.account_id = profile.account_id The two tables' schemas are shown as follows: // Table: customer root |-- account_id: string (nullable = false) |--
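A minimal Spark shell sketch that reproduces this kind of query; the case classes and sample rows are made up for illustration:

    import org.apache.spark.sql.SQLContext

    case class Customer(account_id: String, name: String)
    case class Profile(account_id: String, birthday: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    sc.parallelize(Seq(Customer("1", "alice"), Customer("2", "bob"))).registerTempTable("customer")
    sc.parallelize(Seq(Profile("1", "1990-01-01"))).registerTempTable("profile")

    // rows from customer with no matching profile should come back with nulls on the right side
    sqlContext.sql(
      "select * from customer left join profile on customer.account_id = profile.account_id"
    ).collect().foreach(println)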

Re: Does Ipython notebook work with spark? trivial example does not work. Re: bug with IPython notebook?

2014-10-10 Thread jay vyas
PySpark definitely works for me in an IPython notebook. A good way to debug is to do setMaster(local) on your python sc object and see if that works. Then from there, modify it to point to the real spark server. Also, I added a hack where I did a sys.path.insert of the path to pyspark in my python notebook

Re: Executor and BlockManager memory size

2014-10-10 Thread Boromir Widas
Hey Larry, I have been trying to figure this out for standalone clusters as well. http://apache-spark-user-list.1001560.n3.nabble.com/What-is-a-Block-Manager-td12833.html has an answer as to what block manager is for. From the documentation, what I understood was if you assign X GB to each

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Nan Zhu
Great! Congratulations! -- Nan Zhu On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote: Brilliant stuff ! Congrats all :-) This is indeed really heartening news ! Regards, Mridul On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia matei.zaha...@gmail.com

Re: java.io.IOException Error in task deserialization

2014-10-10 Thread Sung Hwan Chung
I haven't seen this at all since switching to HttpBroadcast. It seems TorrentBroadcast might have some issues? On Thu, Oct 9, 2014 at 4:28 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: I don't think that I saw any other error message. This is all I saw. I'm currently experimenting to
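For anyone who wants to try the same switch, a sketch of the relevant Spark 1.x setting; this simply selects the HTTP-based broadcast implementation explicitly:

    val conf = new SparkConf()
      .setAppName("BroadcastTest")
      // use HttpBroadcast instead of TorrentBroadcast
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")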

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread arthur.hk.c...@gmail.com
Wonderful !! On 11 Oct, 2014, at 12:00 am, Nan Zhu zhunanmcg...@gmail.com wrote: Great! Congratulations! -- Nan Zhu On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote: Brilliant stuff ! Congrats all :-) This is indeed really heartening news ! Regards, Mridul

Rdd repartitioning

2014-10-10 Thread rapelly kartheek
Hi, I was facing GC overhead errors while executing an application with 570MB of data (with RDD replication). In order to fix the heap errors, I repartitioned the RDD into 10 partitions: val logData = sc.textFile("hdfs:/text_data/text data.txt").persist(StorageLevel.MEMORY_ONLY_2) val
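A sketch of the repartition-then-persist pattern being described; the path and partition count come from the message, but the right values depend on the cluster:

    import org.apache.spark.storage.StorageLevel

    val logData = sc.textFile("hdfs:/text_data/text data.txt")
      .repartition(10)                       // spread the data across 10 partitions
      .persist(StorageLevel.MEMORY_ONLY_2)   // cache each partition, replicated on two nodes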

Spark job (not Spark streaming) doesn't delete un-needed checkpoints.

2014-10-10 Thread Sung Hwan Chung
Un-needed checkpoints are not getting automatically deleted in my application. I.e. the lineage looks something like this and checkpoints simply accumulate in a temporary directory (every lineage point, however, does zip with a globally permanent): PermanentRDD:Global zips with all the

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Steve Nunez
Great stuff. Wonderful to see such progress in so short a time. How about some links to code and instructions so that these benchmarks can be reproduced? Regards, - Steve From: Debasish Das debasish.da...@gmail.com Date: Friday, October 10, 2014 at 8:17 To: Matei Zaharia

Re: How does the Spark Accumulator work under the covers?

2014-10-10 Thread Jayant Shekhar
Hi Areg, Check out http://spark.apache.org/docs/latest/programming-guide.html#accumulators val sum = sc.accumulator(0) // accumulator created from an initial value in the driver The accumulator variable is created in the driver. Tasks running on the cluster can then add to it. However, they
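A short runnable sketch of the pattern; note that each task adds to its own local copy and only the merged value on the driver is readable:

    val sum = sc.accumulator(0)                          // created from an initial value on the driver
    sc.parallelize(1 to 100, 4).foreach(x => sum += x)   // tasks add to their local accumulator; Spark merges them
    println(sum.value)                                   // only the driver can read the value: 5050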

Application failure in yarn-cluster mode

2014-10-10 Thread Christophe Préaud
Hi, After updating from spark-1.0.0 to spark-1.1.0, my spark applications failed most of the time (but not always) in yarn-cluster mode (but not in yarn-client mode). Here is my configuration: * spark-1.1.0 * hadoop-2.2.0 And the hadoop.tmp.dir definition in the hadoop core-site.xml

Re: SparkSQL LEFT JOIN problem

2014-10-10 Thread Liquan Pei
Hi, Can you try select birthday from customer left join profile on customer.account_id = profile.account_id to see if the problem remains on your entire data? Thanks, Liquan On Fri, Oct 10, 2014 at 8:20 AM, invkrh inv...@gmail.com wrote: Hi, I am exploring SparkSQL 1.1.0, I have a problem

RE: Spark on Mesos Issue - Do I need to install Spark on Mesos slaves

2014-10-10 Thread Malte
I have actually had the same problem. spark.executor.uri on HDFS did not work so I had to put it in a local folder -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Mesos-Issue-Do-I-need-to-install-Spark-on-Mesos-slaves-tp16129p16165.html Sent from

Re: Help with using combineByKey

2014-10-10 Thread HARIPRIYA AYYALASOMAYAJULA
Thank you guys! It was very helpful and now I understand it better. On Fri, Oct 10, 2014 at 1:38 AM, Davies Liu dav...@databricks.com wrote: Maybe this version is easier to use: plist.mapValues((v) => (if (v > 0) 1 else 0, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)) It has

Re: java.io.IOException Error in task deserialization

2014-10-10 Thread Davies Liu
Maybe, TorrentBroadcast is more complicated than HttpBroadcast, could you tell us how to reproduce this issue? That will help us a lot to improve TorrentBroadcast. Thanks! On Fri, Oct 10, 2014 at 8:46 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: I haven't seen this at all since switching

RE: Spark SQL parser bug?

2014-10-10 Thread Mohammed Guller
Hi Chen, Thanks for looking into this. It looks like the bug may be in the Spark Cassandra connector code. Table x is a table in Cassandra. However, while trying to troubleshoot this issue, I noticed another issue. This time I did not use Cassandra; instead created a table on the fly. I am not

Re: Akka disassociation on Java SE Embedded

2014-10-10 Thread bhusted
How do you increase the spark block manager timeout? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Akka-disassociation-on-Java-SE-Embedded-tp6266p16176.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Akka disassociation on Java SE Embedded

2014-10-10 Thread Nan Zhu
https://github.com/CodingCat/spark/commit/c5cee24689ac4ad1187244e6a16537452e99e771 -- Nan Zhu On Friday, October 10, 2014 at 4:31 PM, bhusted wrote: How do you increase the spark block manager timeout? -- View this message in context:

Issue with java spark broadcast

2014-10-10 Thread Jacob Maloney
I'm trying to broadcast an accumulator I generated earlier in my app. However, I get a NullPointerException whenever I reference the value. // The start of my accumulator generation LookupKeyToIntMap keyToIntMapper = new LookupKeyToIntMap();

Re: getting tweets for a specified handle

2014-10-10 Thread SK
Thanks. I made the change and ran the code. But I don't get any tweets for my handle, although I do see the tweets when I search for it on Twitter. Does Spark allow us to get tweets from the past (say, the last 100 tweets, or tweets that appeared in the last 10 minutes)? thanks -- View this
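For reference, a sketch of filtering the live stream for a handle with TwitterUtils; the handle is a placeholder, and the OAuth credentials are assumed to be set through twitter4j system properties. The streaming API only delivers tweets posted after the stream starts, so fetching the last 100 historical tweets would need Twitter's REST API instead:

    import org.apache.spark.streaming.twitter.TwitterUtils

    val filters = Seq("@some_handle")   // placeholder handle
    val tweets = TwitterUtils.createStream(ssc, None, filters)
    tweets.map(status => status.getText).print()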

Spark Streaming KafkaUtils Issue

2014-10-10 Thread Abraham Jacob
Hi Folks, I am seeing some strange behavior when using the Spark Kafka connector in Spark streaming. I have a Kafka topic which has 8 partitions. I have a kafka producer that pumps some messages into this topic. On the consumer side I have a spark streaming application that has 8 executors

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
Would you mind sharing the code leading to your createStream? Are you also setting group.id? Thanks, Sean On Oct 10, 2014, at 4:31 PM, Abraham Jacob abe.jac...@gmail.com wrote: Hi Folks, I am seeing some strange behavior when using the Spark Kafka connector in Spark streaming. I

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Abraham Jacob
Sure... I do set the group.id for all the consumers to be the same. Here is the code --- SparkConf sparkConf = new SparkConf().setMaster("yarn-cluster").setAppName("Streaming WordCount"); sparkConf.set("spark.shuffle.manager", "SORT"); sparkConf.set("spark.streaming.unpersist", "true");
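A Scala sketch of the equivalent receiver setup with an explicit group.id; the ZooKeeper quorum, topic name and group name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk-host:2181",   // placeholder ZooKeeper quorum
      "group.id" -> "streaming-wordcount")     // same group for every receiver
    // one receiver per Kafka partition, unioned into a single DStream
    val streams = (1 to 8).map { _ =>
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    val messages = ssc.union(streams).map(_._2)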

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
How long do you let the consumers run for? Is it less than 60 seconds by chance? auto.commit.interval.ms defaults to 60000 (60 seconds). If so, that may explain why you are seeing that behavior. Cheers, Sean On Oct 10, 2014, at 4:47 PM, Abraham Jacob

Running Example in local mode with input files

2014-10-10 Thread maxpar
Hi folks, I have just upgraded to Spark 1.1.0 and tried some examples like: ./run-example SparkPageRank pagerank_data.txt 5 It turns out that Spark keeps trying to connect to my name node and read the file from HDFS rather than the local FS: Client: Retrying connect to server:
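One thing worth trying, though it is only an assumption that the default filesystem from your Hadoop configuration is the cause, is pointing the example at an explicit local URI:

    ./bin/run-example SparkPageRank file:///full/path/to/pagerank_data.txt 5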

how to find the sources for spark-project

2014-10-10 Thread sadhan
We have our own customization on top of parquet serde that we've been using for hive. In order to make it work with spark-sql, we need to be able to re-build spark with this. It'll be much easier to rebuild spark with this patch once I can find the sources for org.spark-project.hive. Not sure

Re: where are my python lambda functions run in yarn-client mode?

2014-10-10 Thread Evan
Thank you! I was looking for a config variable to that end, but I was looking in Spark 1.0.2 documentation, since that was the version I had the problem with. Is this behavior documented in 1.0.2's documentation? Evan On 10/09/2014 04:12 PM, Davies Liu wrote: When you call rdd.take() or

small bug in pyspark

2014-10-10 Thread Andy Davidson
Hi I am running spark on an ec2 cluster. I need to update python to 2.7. I have been following the directions on http://nbviewer.ipython.org/gist/JoshRosen/6856670 https://issues.apache.org/jira/browse/SPARK-922 I noticed that when I start a shell using pyspark, I correctly got python2.7, how

RE: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Shao, Saisai
Hi Abraham, You are correct, the “auto.offset.reset“ behavior in KafkaReceiver is different from the original Kafka semantics: if you set this configuration, KafkaReceiver will clean the related ZK metadata immediately, but for Kafka this configuration is just a hint which will be effective only when the offset is

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Abraham Jacob
Thanks Jerry, So, from what I can understand from the code, if I leave out auto.offset.reset, it should theoretically read from the last commit point... Correct? -abe On Fri, Oct 10, 2014 at 5:40 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi Abraham, You are correct, the

Maryland Meetup

2014-10-10 Thread Brian Husted
Please add the Maryland Spark meetup to the Spark website http://www.meetup.com/Apache-Spark-Maryland/ Thanks Brian Husted

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Sean McNamara
This jira and comment sums up the issue: https://issues.apache.org/jira/browse/SPARK-2492?focusedCommentId=14069708page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14069708 Basically the offset param was renamed and had slightly different semantics between kafka 0.7

RE: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Shao, Saisai
Hi abe, You can see the details in KafkaInputDStream.scala, here is the snippet // When auto.offset.reset is defined, it is our responsibility to try and whack the // consumer group zk node. if (kafkaParams.contains("auto.offset.reset")) {

Re: Spark Streaming KafkaUtils Issue

2014-10-10 Thread Abraham Jacob
Ah, I see... much clearer now... Because auto.offset.reset will trigger KafkaReceiver to delete the ZK metadata; when control passes over to the Kafka consumer API, it will see that there is no offset available for the partition. This then will trigger the smallest or largest logic to execute in

What if I port Spark from TCP/IP to RDMA?

2014-10-10 Thread Theodore Si
Hi, Let's say that I managed to port Spark from TCP/IP to RDMA. What tool or benchmark can I use to test the performance improvement? BR, Theo

Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-10 Thread nitinkak001
Thanks @category_theory, the post was of great help!! I had to learn a few things before I could understand it completely. However, I am facing the issue of partitioning the data (using partitionBy) without providing a hardcoded value for the number of partitions. The partitions need to be driven by
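A sketch of deriving the partition count from the context instead of hardcoding it; keyedRdd is a hypothetical pair RDD, and whether defaultParallelism is the right driver for your data is an assumption:

    import org.apache.spark.HashPartitioner

    // reuse either the RDD's current partition count or the cluster's default parallelism
    val numPartitions = math.max(keyedRdd.partitions.size, sc.defaultParallelism)
    val partitioned = keyedRdd.partitionBy(new HashPartitioner(numPartitions))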

Re: add Boulder-Denver Spark meetup to list on website

2014-10-10 Thread Matei Zaharia
Added you, thanks! (You may have to shift-refresh the page to see it updated). Matei On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com wrote: Please add the Boulder-Denver Spark meetup group to the list on the website.

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
Hi Akhil - I tried your suggestions and tried varying my partition sizes. Reducing the number of partitions led to memory errors (presumably - I saw IOExceptions much sooner). With the settings you provided the program ran for longer but ultimately crashes in the same way. I would like to

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Ilya Ganelin
Hi Matei - I read your post with great interest. Could you possibly comment in more depth on some of the issues you guys saw when scaling up spark and how you resolved them? I am interested specifically in spark-related problems. I'm working on scaling up spark to very large datasets and have been

Re: Debug Spark in Cluster Mode

2014-10-10 Thread Ilya Ganelin
I would also be interested in knowing more about this. I have used the cloudera manager and the spark resource interface (clientnode:4040) but would love to know if there are other tools out there - either for post processing or better observation during execution. On Oct 9, 2014 4:50 PM, Rohit

Re: Spark SQL parser bug?

2014-10-10 Thread Cheng Lian
Hmm, there is a “T” in the timestamp string, which makes the string not a valid timestamp representation. Internally Spark SQL uses |java.sql.Timestamp.valueOf| to cast a string to a timestamp. On 10/11/14 2:08 AM, Mohammed Guller wrote: scala> rdd.registerTempTable("x") scala> val sRdd
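A small standalone check that illustrates the point (not Spark-specific):

    // java.sql.Timestamp.valueOf expects the "yyyy-mm-dd hh:mm:ss[.f...]" form
    java.sql.Timestamp.valueOf("2014-10-11 02:08:00")    // parses fine
    java.sql.Timestamp.valueOf("2014-10-11T02:08:00")    // throws IllegalArgumentException because of the "T"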

RDD size in memory - Array[String] vs. case classes

2014-10-10 Thread Liam Clarke-Hutchinson
Hi all, I'm playing with Spark currently as a possible solution at work, and I've been recently working out a rough correlation between our input data size and RAM needed to cache an RDD that will be used multiple times in a job. As part of this I've been trialling different methods of

Re: where are my python lambda functions run in yarn-client mode?

2014-10-10 Thread Davies Liu
This is a kind of implementation detail, so it's not documented :-( If you think this is a blocker for you, you could create a JIRA; maybe it could be fixed in 1.0.3+. Davies On Fri, Oct 10, 2014 at 5:11 PM, Evan evan.sama...@gmail.com wrote: Thank you! I was looking for a config variable to