Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-07 Thread Eustache DIEMERT
I tried to adjust stepSize between 1e-4 and 1, it doesn't seem to be the problem. Actually the problem is that the model doesn't use the intercept. So what happens is that it tries to compensate with super heavy weights ( 1e40) and ends up overflowing the model coefficients. MSE is exploding too,

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-07 Thread Eustache DIEMERT
Well, why not, but IMHO MLLib Logistic Regression is unusable right now. The inability to use an intercept is just a no-go. I could hack a column of ones to inject the intercept into the data but frankly it's a pity to have to do so. 2014-07-05 23:04 GMT+02:00 DB Tsai dbt...@dbtsai.com: You may
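
A minimal sketch of that ones-column workaround, assuming the MLlib LabeledPoint/Vectors API of the time (the helper name is made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Prepend a constant 1.0 feature so the weight learned for that column
    // plays the role of the intercept.
    def addInterceptColumn(points: RDD[LabeledPoint]): RDD[LabeledPoint] =
      points.map { p =>
        LabeledPoint(p.label, Vectors.dense(1.0 +: p.features.toArray))
      }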

Re: Spark SQL user defined functions

2014-07-07 Thread Martin Gammelsæter
Hi again, and thanks for your reply! On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust mich...@databricks.com wrote: Sweet. Any idea about when this will be merged into master? It is probably going to be a couple of weeks. There is a fair amount of cleanup that needs to be done. It works

Broadcast variable in Spark Java application

2014-07-07 Thread Praveen R
I need a variable to be broadcast from the driver to executor processes in my Spark Java application. I tried using the Spark broadcast mechanism to achieve this, but no luck there. Could someone help me do this, and perhaps share some code? Thanks, Praveen R
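
For reference, a minimal broadcast sketch (shown in Scala; the Java API exposes the same call as JavaSparkContext.broadcast, read back with .value()):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-example"))
    // Ship a read-only lookup table from the driver to every executor once.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val mapped = sc.parallelize(Seq("a", "b", "a"))
      .map(key => lookup.value.getOrElse(key, 0))   // read it inside a task
      .collect()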

Re: Broadcast variable in Spark Java application

2014-07-07 Thread Cesar Arevalo
Hi Praveen: It may be easier for other people to help you if you provide more details about what you are doing. It may be worthwhile to also mention which spark version you are using. And if you can share the code which doesn't work for you, that may also give others more clues as to what you

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-07 Thread Eustache DIEMERT
Ok, I've tried to add the intercept term myself (code here [1]), but with no luck. It seems that adding a column of ones doesn't help with convergence either. I may have missed something in the coding as I'm quite a noob in Scala, but printing the data seems to indicate I succeeded in adding the

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
guys, I'm not talking about running Spark on a VM, I don't have a problem with that. What confuses me is the following: 1) Hortonworks describes the installation process as RPMs on each node 2) the Spark home page says that everything I need is YARN. And I'm stuck on understanding what I need to do to run Spark on YARN

Re: Spark memory optimization

2014-07-07 Thread Igor Pernek
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling cache/persist), do I still need to use the DISK_ONLY storage level? However, I do use reduceByKey and sortByKey. Mayur, you mentioned that sortByKey requires data to fit the memory. Is there any way to work around this

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-07 Thread sparkuser2345
Thank you for all the replies! Realizing that I can't distribute the modelling with different cross-validation folds to the cluster nodes this way (but to the threads only), I decided not to create nfolds data sets but to parallelize the calculation (threadwise) over folds and to zip the

which Spark package (wrt. graphX) should I install to do graph computation on a cluster?

2014-07-07 Thread Yifan LI
Hi, I am planning to do graph (social network) computation on a cluster (hadoop has been installed), but it seems there is a pre-built package for hadoop which I am NOT sure includes graphX. Or should I install another released version (where graphX has obviously been included)?

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Krishna Sankar
Konstantin, 1. You need to install the hadoop rpms on all nodes. If it is Hadoop 2, the nodes would have HDFS and YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
thank you Krishna! Could you please explain why I need to install Spark on each node if the Spark official site says: If you have a Hadoop 2 cluster, you can run Spark without any installation needed? I have HDP 2 (YARN) and that's why I hope I don't need to install Spark on each node. Thank you,

Re: Spark memory optimization

2014-07-07 Thread Surendranauth Hiraman
Using persist() is a sort of a hack or a hint (depending on your perspective :-)) to make the RDD use disk, not memory. As I mentioned though, the disk io has consequences, mainly (I think) making sure you have enough disks to not let io be a bottleneck. Increasing partitions I think is the other
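
A one-line sketch of the hint being described, using the storage level API (the path and the sc handle are assumed to exist):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///logs/input")        // illustrative path
    val onDisk = lines.persist(StorageLevel.DISK_ONLY)   // spill to local disk instead of memory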

spark-submit conflicts with dependencies

2014-07-07 Thread Robert James
When I use spark-submit (along with spark-ec2), I get dependency conflicts. spark-assembly includes older versions of apache commons codec and httpclient, and these conflict with many of the libs our software uses. Is there any way to resolve these? Or, if we use the precompiled spark, can we

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Chester @work
In Yarn cluster mode, you can either have Spark on all the cluster nodes or supply the Spark jar yourself. In the 2nd case, you don't need to install Spark on the cluster at all, as you supply the Spark assembly as well as your app jar together. I hope this makes it clear. Chester Sent from my iPhone

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
you could only do the deep check if the hashcodes are the same and design hashcodes that do not take all elements into account. the alternative seems to be putting cache statements all over graphx, as is currently the case, which is trouble for any long lived application where caching is

Possible bug in Spark Streaming :: TextFileStream

2014-07-07 Thread Luis Ángel Vicente Sánchez
I have a basic spark streaming job that is watching a folder, processing any new file and updating a column family in cassandra using the new cassandra-spark-driver. I think there is a problem with SparkStreamingContext.textFileStream... if I start my job in local mode with no files in the folder

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Hi Chester, Thank you very much, it is clear now - just two different ways to support Spark on a cluster. Thank you, Konstantin Kudryavtsev On Mon, Jul 7, 2014 at 3:22 PM, Chester @work ches...@alpinenow.com wrote: In Yarn cluster mode, you can either have spark on all the cluster nodes or

Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
Hi, I was wondering what was the state of the Pig+Spark initiative now that the execution engine of Pig is pluggable? Granted, it was done in order to use Tez but could it be used by Spark? I know about a 'theoretical' project called Spork but I don't know any stable and maintained version of it.

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
Hi, We have fixed many major issues around Spork and are deploying it with some customers. Would be happy to provide a working version for you to try out. We are looking for more folks to try it out and submit bugs. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Comparative study

2014-07-07 Thread santosh.viswanathan
Hello Experts, I am doing some comparative study on the below: Spark vs Impala, Spark vs MapReduce. Is it worth migrating from an existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
That version is old :). We are not forking pig but cleanly separating out pig execution engine. Let me know if you are willing to give it a go. Also would love to know what features of pig you are using ? Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Control number of tasks per stage

2014-07-07 Thread Konstantin Kudryavtsev
Hi all, is there any way to control the number of tasks per stage? Currently I see a situation where only 2 tasks are created per stage and each of them is very slow, while at the same time the cluster has a huge number of unused nodes. Thank you, Konstantin Kudryavtsev

Re: Java sample for using cassandra-driver-spark

2014-07-07 Thread Piotr Kołaczkowski
Hi, we're planning to add a basic Java-API very soon, possibly this week. There's a ticket for it here: https://github.com/datastax/cassandra-driver-spark/issues/11 We're open to any ideas. Just let us know what you need the API to have in the comments. Regards, Piotr Kołaczkowski 2014-07-05

Re: Control number of tasks per stage

2014-07-07 Thread Daniel Siegmann
The default number of tasks when reading files is based on how the files are split among the nodes. Beyond that, the default number of tasks after a shuffle is based on the property spark.default.parallelism. (see http://spark.apache.org/docs/latest/configuration.html). You can use
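
A short sketch of the two knobs mentioned above, the global shuffle default and the explicit per-operation partition count (all values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parallelism-example")
      .set("spark.default.parallelism", "64")   // default task count after shuffles
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///input", 64)   // minimum number of input splits
      .map(line => (line, 1))
      .reduceByKey(_ + _, 64)                       // explicit task count for this shuffle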

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-07 Thread Xiangrui Meng
It seems to me to be a setup issue. I just tested news20.binary (1355191 features) on a 2-node EC2 cluster and it worked well. I added one line to conf/spark-env.sh: export SPARK_JAVA_OPTS="-Dspark.akka.frameSize=20" and launched spark-shell with --driver-memory 20g. Could you re-try with an EC2

Re: Dense to sparse vector converter

2014-07-07 Thread Xiangrui Meng
No, but it should be easy to add one. -Xiangrui On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, Is there a method in Spark/MLlib to convert DenseVector to SparseVector? Best regards, Alexander
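
A small helper along those lines, assuming the public MLlib vector API (this conversion was not in MLlib at the time, so the function itself is hypothetical):

    import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}

    // Keep only the non-zero entries and rebuild the vector in sparse form.
    def toSparse(dv: DenseVector): Vector = {
      val nonZero = dv.toArray.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
      Vectors.sparse(dv.size, nonZero)
    }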

tiers of caching

2014-07-07 Thread Koert Kuipers
i noticed that some algorithms such as graphx liberally cache RDDs for efficiency, which makes sense. however it can also leave a long trail of unused yet cached RDDs, that might push other RDDs out of memory. in a long-lived spark context i would like to decide which RDDs stick around. would it

Re: tiers of caching

2014-07-07 Thread Ankur Dave
I think tiers/priorities for caching are a very good idea and I'd be interested to see what others think. In addition to letting libraries cache RDDs liberally, it could also unify memory management across other parts of Spark. For example, small shuffles benefit from explicitly keeping the

spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
spark-submit includes a spark-assembly uber jar, which has older versions of many common libraries. These conflict with some of the dependencies we need. I have been racking my brain trying to find a solution (including experimenting with ProGuard), but haven't been able to: when we use

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Koert Kuipers
spark has a setting to put user jars in front of classpath, which should do the trick. however i had no luck with this. see here: https://issues.apache.org/jira/browse/SPARK-1863 On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote: spark-submit includes a spark-assembly
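
If I recall the Spark 1.0 configuration docs correctly, the setting being referred to is the experimental spark.files.userClassPathFirst flag (executor side only); a sketch, with the property name as an assumption:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("user-jars-first")
      // Experimental in Spark 1.0: try the user's jars before the assembly's
      // classes on executors; the JIRA above tracks cases where it is not honored.
      .set("spark.files.userClassPathFirst", "true")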

Re: tiers of caching

2014-07-07 Thread Andrew Or
Others have also asked for this on the mailing list, and hence there's a related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur brings up a good point in that any current implementation of in-memory shuffles will compete with application RDD blocks. I think we should definitely add

Error while launching spark cluster manaually

2014-07-07 Thread Sameer Tilak
Hi All, I am having the following issue -- may be an fqdn/ip resolution issue, but not sure, any help with this will be great! On the master node I get the following error: I start master using ./start-master.sh: starting org.apache.spark.deploy.master.Master, logging to

Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Hi guys, Not sure if you have similar issues. Did not find relevant tickets in JIRA. When I deploy Spark Streaming to YARN, I have the following two issues: 1. The UI port is random. It is not the default 4040. I have to look at the container's log to check the UI port. Is this supposed to be this

Re: Kafka - streaming from multiple topics

2014-07-07 Thread Sergey Malov
I opened a JIRA issue with Spark, as an improvement though, not as a bug. Hopefully someone there will notice it. From: Tobias Pfeiffer t...@preferred.jp Reply-To: user@spark.apache.org Date:

Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Juan Rodríguez Hortalá
Hi list, I'm writing a Spark Streaming program that reads from a kafka topic, performs some transformations on the data, and then inserts each record in a database with foreachRDD. I was wondering which is the best way to handle the connection to the database so each worker, or even each task,

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
I will assume that you are running in yarn-cluster mode. Because the driver is launched in one of the containers, it doesn't make sense to expose port 4040 for the node that contains the container. (Imagine if multiple driver containers are launched on the same node. This will cause a port

[no subject]

2014-07-07 Thread Juan Rodríguez Hortalá
Hi all, I'm writing a Spark Streaming program that uses reduceByKeyAndWindow(), and when I change the window length or sliding interval I get the following exceptions, running in local mode 14/07/06 13:03:46 ERROR actor.OneForOneStrategy: key not found: 1404677026000 ms

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Hi Andrew, Thanks for the quick reply. It works with the yarn-client mode. One question about the yarn-cluster mode: actually I was checking the AM for the log, since the spark driver is running in the AM, the UI should also work, right? But that is not true in my case. Best, Fang, Yan

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
I don't have experience deploying to EC2. can you use add.jar conf to add the missing jar at runtime ? I haven't tried this myself. Just a guess. On Mon, Jul 7, 2014 at 12:16 PM, Chester Chen ches...@alpinenow.com wrote: with provided scope, you need to provide the provided jars at the

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
Thanks - that did solve my error, but instead got a different one: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/FileInputFormat It seems like with that setting, spark can't find Hadoop. On 7/7/14, Koert Kuipers ko...@tresata.com wrote: spark has a setting to put user

SparkSQL with sequence file RDDs

2014-07-07 Thread Gary Malouf
Has anyone reported issues using SparkSQL with sequence files (all of our data is in this format within HDFS)? We are considering whether to burn the time upgrading to Spark 1.0 from 0.9 now and this is a main decision point for us.

Re: Comparative study

2014-07-07 Thread Daniel Siegmann
From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can
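
A minimal sketch of the local-mode context Daniel describes for unit tests:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
    try {
      val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()
      assert(doubled.sameElements(Array(2, 4, 6)))
    } finally {
      sc.stop()   // always stop the context so tests don't leak it
    }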

RE: Comparative study

2014-07-07 Thread santosh.viswanathan
Thanks Daniel for sharing this info. Regards, Santosh Karthikeyan From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Tuesday, July 08, 2014 1:10 AM To: user@spark.apache.org Subject: Re: Comparative study From a development perspective, I vastly prefer Spark to MapReduce. The

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
@Yan, the UI should still work. As long as you look into the container that launches the driver, you will find the SparkUI address and port. Note that in yarn-cluster mode the Spark driver doesn't actually run in the Application Manager; just like the executors, it runs in a container that is

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Thank you, Andrew. That makes sense to me now. I was confused by In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster in http://spark.apache.org/docs/latest/running-on-yarn.html . After your explanation, it's clear now. Thank you.

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Chester Chen
@Andrew Yes, the link points to the same redirected http://localhost/proxy/application_1404443455764_0010/ I suspect it is something to do with the cluster setup. I will let you know once I find something. Chester On Mon, Jul 7, 2014 at 1:07 PM, Andrew Or and...@databricks.com wrote:

Re: NoSuchMethodError in KafkaReciever

2014-07-07 Thread mcampbell
xtrahotsauce wrote I had this same problem as well. I ended up just adding the necessary code in KafkaUtil and compiling my own spark jar. Something like this for the raw stream: def createRawStream( jssc: JavaStreamingContext, kafkaParams: JMap[String, String],

acl for spark ui

2014-07-07 Thread Koert Kuipers
i was testing using the acl for spark ui in secure mode on yarn in client mode. it works great. my spark 1.0.0 configuration has: spark.authenticate = true spark.ui.acls.enable = true spark.ui.view.acls = koert spark.ui.filters =

RE: Spark logging strategy on YARN

2014-07-07 Thread Andrew Lee
Hi Kudryavtsev, Here's what I am doing as a common practice and reference. I don't want to say it is best practice since it requires a lot of customer experience and feedback, but from a development and operating standpoint, it will be great to separate the YARN container logs from the Spark

Cannot create dir in Tachyon when running Spark with OFF_HEAP caching (FileDoesNotExistException)

2014-07-07 Thread Teng Long
Hi guys, I'm running Spark 1.0.0 with Tachyon 0.4.1, both in single node mode. Tachyon's own tests (./bin/tachyon runTests) pass, and manual file system operations like mkdir work well. But when I tried to run a very simple Spark task with RDD persist as OFF_HEAP, I got the following

Re: reading compress lzo files

2014-07-07 Thread Nicholas Chammas
I found it quite painful to figure out all the steps required and have filed SPARK-2394 https://issues.apache.org/jira/browse/SPARK-2394 to track improving this. Perhaps I have been going about it the wrong way, but it seems way more painful than it should be to set up a Spark cluster built using

memory leak query

2014-07-07 Thread Michael Lewis
Hi, I hope someone can help as I’m not sure if I’m using Spark correctly. Basically, in the simple example below I create an RDD which is just a sequence of random numbers. I then have a loop where I just invoke rdd.count() what I can see is that the memory use always nudges upwards. If I

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-07-07 Thread Nan Zhu
Hey, Cheney, Is the problem still occurring? Sorry for the delay, I’m starting to look at this issue. Best, -- Nan Zhu On Tuesday, May 6, 2014 at 10:06 PM, Cheney Sun wrote: Hi Nan, In worker's log, I see the following exception thrown when trying to launch an executor. (The

Re: Comparative study

2014-07-07 Thread Nabeel Memon
For Scala API on map/reduce (hadoop engine) there's a library called Scalding. It's built on top of Cascading. If you have a huge dataset or if you consider using map/reduce engine for your job, for any reason, you can try Scalding. However, Spark vs Impala doesn't make sense to me. It should've

Re: Comparative study

2014-07-07 Thread Sean Owen
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote: For Scala API on map/reduce (hadoop engine) there's a library called Scalding. It's built on top of Cascading. If you have a huge dataset or if you consider using map/reduce engine for your job, for any reason, you can try

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
I haven't heard any reports of this yet, but I don't see any reason why it wouldn't work. You'll need to manually convert the objects that come out of the sequence file into something where SparkSQL can detect the schema (i.e. scala case classes or java beans) before you can register the RDD as a
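
A sketch of that conversion against the Spark 1.0 SQL API, using a case class to carry the schema (the Person class and sample data are made up; sc is an existing SparkContext, and in practice the rows would come out of the sequence file via a map()):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)   // schema is inferred from the fields

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD           // implicit RDD[Product] -> SchemaRDD

    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    people.registerAsTable("people")
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")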

Re: Comparative study

2014-07-07 Thread Soumya Simanta
Daniel, Do you mind sharing the size of your cluster and the production data volumes ? Thanks Soumya On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained;

Re: Spark SQL : Join throws exception

2014-07-07 Thread Yin Huai
Hi Subacini, Just want to follow up on this issue. SPARK-2339 has been merged into the master and 1.0 branch. Thanks, Yin On Tue, Jul 1, 2014 at 2:00 PM, Yin Huai huaiyin@gmail.com wrote: Seems it is a bug. I have opened https://issues.apache.org/jira/browse/SPARK-2339 to track it.

Re: Data loading to Parquet using spark

2014-07-07 Thread Michael Armbrust
SchemaRDDs, provided by Spark SQL, have a saveAsParquetFile command. You can turn a normal RDD into a SchemaRDD using the techniques described here: http://spark.apache.org/docs/latest/sql-programming-guide.html This should work with Impala, but if you run into any issues please let me know.
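
A short sketch of that round trip, assuming the Spark 1.0 SQL API (the Event class and the paths are illustrative):

    import org.apache.spark.sql.SQLContext

    case class Event(id: Long, value: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    val events = sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b")))
    events.saveAsParquetFile("hdfs:///warehouse/events.parquet")

    // Reading it back gives a SchemaRDD that can be registered as a table.
    val loaded = sqlContext.parquetFile("hdfs:///warehouse/events.parquet")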

The number of cores vs. the number of executors

2014-07-07 Thread innowireless TaeYun Kim
Hi, I'm trying to understand the relationship of the number of cores and the number of executors when running a Spark job on YARN. The test environment is as follows: - # of data nodes: 3 - Data node machine spec: - CPU: Core i7-4790 (# of cores: 4, # of threads: 8) - RAM: 32GB (8GB

usage question for spark run on YARN

2014-07-07 Thread Cheng Ju Chuang
Hi, I am running some simple samples for my project. Right now the spark sample is running on Hadoop 2.2 with YARN. My question is: what is the main difference when we run as spark-client vs. spark-cluster, apart from the different way to submit our job? And what is the specific way to configure the job

RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
Hi Gary, Like Michael mentioned, you need to take care of the scala case classes or java beans, because SparkSQL needs the schema. Currently we are trying to insert our data into HBase with Scala 2.10.4 and Spark 1.0. All the data are tables. We created one case class for each row, which means

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
We know Scala 2.11 has removed the limitation on the number of parameters, but Spark 1.0 is not compatible with it. So now we are considering using java beans instead of Scala case classes. You can also manually create a class that implements scala's Product interface. Finally, SPARK-2179

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread DB Tsai
Actually, the mode that requires installing the jar on each individual node is standalone mode, which works for both MR1 and MR2. Cloudera and Hortonworks currently support spark in this way as far as I know. For both yarn-cluster and yarn-client, Spark will distribute the jars through the distributed cache

Re: Data loading to Parquet using spark

2014-07-07 Thread Soren Macbeth
I typed spark parquet into google and the top result was this blog post about reading and writing parquet files from spark http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ On Mon, Jul 7, 2014 at 5:23 PM, Michael Armbrust mich...@databricks.com wrote: SchemaRDDs, provided by Spark

Re: how to set spark.executor.memory and heap size

2014-07-07 Thread Alex Gaudio
Hi All, This is a bit late, but I found it helpful. Piggy-backing on Wang Hao's comment, spark will ignore the spark.executor.memory setting if you add it to SparkConf via: conf.set("spark.executor.memory", "1g") What you actually should do depends on how you run spark. I found some official

Powered By Spark: Can you please add our org?

2014-07-07 Thread Alex Gaudio
Hi, Sailthru is also using Spark. Could you please add us to the Powered By Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark page when you have a chance? Organization Name: Sailthru URL: www.sailthru.com Short Description: Our data science platform uses Spark to build

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Tobias Pfeiffer
Juan, I am doing something similar, just not inserting into an SQL database, but issuing some RPC call. I think mapPartitions() may be helpful to you. You could do something like dstream.mapPartitions(iter => { val db = new DbConnection() // maybe only do the above if !iter.isEmpty iter.map(item =>
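
A fuller sketch of the per-partition pattern, using foreachPartition rather than mapPartitions since nothing needs to be returned (DbConnection and insert() stand in for whatever database client is actually used):

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        if (records.nonEmpty) {
          val db = new DbConnection()          // one connection per partition/task
          try {
            records.foreach(record => db.insert(record))
          } finally {
            db.close()                         // don't leak connections on failure
          }
        }
      }
    }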

Re: usage question for spark run on YARN

2014-07-07 Thread DB Tsai
spark-client mode runs the driver in your application's JVM, while spark-cluster mode runs the driver in the YARN cluster. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Jul 7, 2014 at 5:44 PM,

Re: Spark SQL user defined functions

2014-07-07 Thread Michael Armbrust
The names of the directories that are created for the metastore are different (metastore vs metastore_db), but that should be it. Really we should get rid of LocalHiveContext as it is mostly redundant and the current state is kind of confusing. I've created a JIRA to figure this out before the

the Pre-built packages for CDH4 can not support yarn ?

2014-07-07 Thread ch huang
hi, mailing list: I downloaded the pre-built spark packages for CDH4, but it says it cannot support yarn. Why? Do I need to build it myself with yarn support enabled?

RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
Hi Michael, Thanks for the reply. Actually last week I tried to play with the Product interface, but I'm not really sure whether I did it correctly or not. Here is what I did: 1. Created an abstract class A with the Product interface, which has 20 parameters. 2. Created case class B extending A, and B has 20

Is the order of messages guaranteed in a DStream?

2014-07-07 Thread Yan Fang
I know the order of processing DStream is guaranteed. Wondering if the order of messages in one DStream is guaranteed. My gut feeling is yes for the question because RDD is immutable. Some simple tests prove this. Want to hear from authority to persuade myself. Thank you. Best, Fang, Yan

RE: The number of cores vs. the number of executors

2014-07-07 Thread innowireless TaeYun Kim
For your information, I've attached the Ganglia monitoring screen capture on the Stack Overflow question. Please see: http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread 张包峰
Hi guys, previously I checked out the old spork and updated it to Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1, see my github project https://github.com/pelick/flare-spork It is also highly experimental, and just directly maps pig physical operations to spark RDD transformations/actions.

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
Here is a simple example of registering an RDD of Products as a table. It is important that all of the fields are val defined in the constructor and that you implement canEqual, productArity and productElement. class Record(val x1: String) extends Product with Serializable { def canEqual(that:
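
Filling out that pattern with a second field for illustration (a sketch only, against the Spark 1.0 SchemaRDD API):

    class Record(val x1: String, val x2: Int) extends Product with Serializable {
      def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
      def productArity: Int = 2
      def productElement(n: Int): Any = n match {
        case 0 => x1
        case 1 => x2
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }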

RE: Spark RDD Disk Persistance

2014-07-07 Thread Shao, Saisai
Hi Madhu, I don't think you can reuse the persisted RDD the next time you run the program, because the folder for RDD materialization will have changed, and Spark will also lose the information on how to retrieve the previously persisted RDD. AFAIK Spark has a fault tolerance mechanism; node failure

Re: SparkSQL - Partitioned Parquet

2014-07-07 Thread Michael Armbrust
The only partitioning that is currently supported is through Hive partitioned tables. Supporting this for parquet as well is on our radar, but probably won't happen for 1.1. On Sun, Jul 6, 2014 at 10:00 PM, Raffael Marty ra...@pixlcloud.com wrote: Does SparkSQL support partitioned parquet

Help for the large number of the input data files

2014-07-07 Thread innowireless TaeYun Kim
Hi, A help for the implementation best practice is needed. The operating environment is as follows: - Log data file arrives irregularly. - The size of a log data file is from 3.9KB to 8.5MB. The average is about 1MB. - The number of records of a data file is from 13 lines to 22000

Spark Installation

2014-07-07 Thread Srikrishna S
Hi All, Does anyone know what the command line arguments to mvn are to generate the pre-built binary for spark on Hadoop 2/CDH5? I would like to pull in a recent bug fix in spark-master and rebuild the binaries in the exact same way that was used for those provided on the website. I have tried

Re: Is the order of messages guaranteed in a DStream?

2014-07-07 Thread Mayur Rustagi
If you receive data through multiple receivers across the cluster. I don't think any order can be guaranteed. Order in distributed systems is tough. On Tuesday, July 8, 2014, Yan Fang yanfang...@gmail.com wrote: I know the order of processing DStream is guaranteed. Wondering if the order of

Re: the Pre-built packages for CDH4 can not support yarn ?

2014-07-07 Thread Matei Zaharia
They are for CDH4 without YARN, since YARN is experimental in that. You can download one of the Hadoop 2 packages if you want to run on YARN. Or you might have to build specifically against CDH4's version of YARN if that doesn't work. Matei On Jul 7, 2014, at 9:37 PM, ch huang