Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Patrick Wendell
You may need to add the -Phadoop-2.4 profile. When building our release packages for Hadoop 2.4 we use the following flags: -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn. - Patrick On Thu, Mar 5, 2015 at 12:47 PM, Kelly, Jonathan jonat...@amazon.com wrote: I confirmed that this has nothing to

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Marcelo Vanzin
It seems from the excerpt below that your cluster is set up to use the Yarn ATS, and the code is failing in that path. I think you'll need to apply the following patch to your Spark sources if you want this to work: https://github.com/apache/spark/pull/3938 On Thu, Mar 5, 2015 at 10:04 AM, Todd

Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Kelly, Jonathan
That's probably a good thing to have, so I'll add it, but unfortunately it did not help this issue. It looks like the hadoop-2.4 profile only sets these properties, which don't seem like they would affect anything related to Netty: <properties> <hadoop.version>2.4.0</hadoop.version>

Re: Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Kelly, Jonathan
I confirmed that this has nothing to do with BigTop by running the same mvn command directly in a fresh clone of the Spark package at the v1.2.1 tag. I got the same exact error. Jonathan Kelly Elastic MapReduce - SDE Port 99 (SEA35) 08.220.C2 From: Kelly, Jonathan Kelly

Training Random Forest

2015-03-05 Thread drarse
I am testing the Random Forest in Spark, but I have a question... If I train a second time, will the decision trees already created be updated, or are they created anew? That is, will the system keep learning from each dataset, or only from the first? Thanks for everything -- View this

Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread Julaiti Alafate
Hi Mitch, I think it is normal. The network utilization will be high when there is some shuffling process happening. After that, the network utilization should come down, while each slave nodes will do the computation on the partitions assigned to them. At least it is my understanding. Best,

Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Cui Lin
Hi, Ted, Thanks for your reply. I noticed from the link below that partitions.size will not work for checking for an empty RDD in streams. It seems that the problem can be solved in Spark 1.3, which there is no way to download at this time? https://issues.apache.org/jira/browse/SPARK-5270 Best regards, Cui Lin

Question about the spark assembly deployed to the cluster with the ec2 scripts

2015-03-05 Thread Darin McBeath
I've downloaded spark 1.2.0 to my laptop. In the lib directory, it includes spark-assembly-1.2.0-hadoop2.4.0.jar When I spin up a cluster using the ec2 scripts with 1.2.0 (and set --hadoop-major-version=2) I notice that in the lib directory for the master/slaves the assembly is for

Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread jalafate
Hi David, It is a great point. It is actually one of the reasons that my program is slow. I found that the major cause of my program running slow is the huge garbage collection time. I created too many small objects in the map procedure, which triggers the GC mechanism frequently. After I improved my
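A minimal sketch of that kind of fix (assuming the hot path builds strings inside map and that rdd is an RDD[String]; names are hypothetical): move the work into mapPartitions so one buffer is reused per partition instead of allocated per record.

    // Reuse one StringBuilder per partition rather than creating one per record
    val formatted = rdd.mapPartitions { iter =>
      val sb = new StringBuilder
      iter.map { rec =>
        sb.setLength(0)                          // reset instead of reallocating
        sb.append(rec).append("-suffix").toString
      }
    }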

Re: Spark Streaming - Duration 1s not matching reality

2015-03-05 Thread Tathagata Das
Hint: Print() just gives a sample of what is in the data, and does not enforce the processing of all the data (only the first partition of the rdd is computed to get 10 items). Count() actually processes all the data. This is all due to lazy eval: if you don't need to use all the data, don't
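A minimal sketch of the difference, assuming a DStream of counted pairs named wordCounts (the name is illustrative):

    wordCounts.foreachRDD { rdd =>
      // count() evaluates every partition of the batch,
      // unlike print(), which only computes enough to show ~10 records
      println(s"records in this batch: ${rdd.count()}")
    }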

Re: Driver disassociated

2015-03-05 Thread Thomas Gerber
Thanks. I was already setting those (and I checked they were in use through the environment tab in the UI). They were set at 10 times their default value: 6 and 1 respectively. I'll start poking at spark.shuffle.io.retryWait. Thanks! On Wed, Mar 4, 2015 at 7:02 PM, Ted Yu

Re: Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
Please ignore my question, you can simply specify the root directory and it looks like redshift takes care of the rest. copy mobile from 's3://BUCKET_NAME/' credentials json 's3://BUCKET_NAME/jsonpaths.json' On Thu, Mar 5, 2015 at 3:33 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi

Spark Streaming - Duration 1s not matching reality

2015-03-05 Thread eleroy
Hello, Getting started with Spark. Got JavaNetworkWordcount working on a 3-node cluster, with netcat running an infinite loop printing random numbers 0-100. With a duration of 1 sec, I do see a list of (word, count) values every second. The list is limited to 10 values (as per the docs). The count

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Zhan Zhang
In addition, you may need following patch if it is not in 1.2.1 to solve some system property issue if you use HDP 2.2. https://github.com/apache/spark/pull/3409 You can follow the following link to set hdp.version for java options.

Re: LBGFS optimizer performace

2015-03-05 Thread DB Tsai
PS, I will recommend you compress the data when you cache the RDD. There will be some overhead in compression/decompression, and serialization/deserialization, but it will help a lot for iterative algorithms with ability to caching more data. Sincerely, DB Tsai
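A hedged sketch of that suggestion (the RDD and app names are assumptions): cache the training data in serialized form and turn on RDD compression.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("lbfgs-example")             // hypothetical app name
      .set("spark.rdd.compress", "true")       // compress serialized cached blocks

    // Serialized caching trades CPU for memory, and compresses well
    trainingData.persist(StorageLevel.MEMORY_ONLY_SER)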

RE: spark standalone with multiple executors in one work node

2015-03-05 Thread Judy Nash
I meant from one app, yes. Was asking this because our previous tuning experiments show spark-on-yarn runs faster when overloading workers with executors (i.e. if a worker has 4 cores, creating 2 executors that each use 4 cores shows a speed boost over 1 executor with 4 cores). I have found an

Building Spark 1.3 for Scala 2.11 using Maven

2015-03-05 Thread Night Wolf
Hey guys, Trying to build Spark 1.3 for Scala 2.11. I'm running with the following Maven command: -DskipTests -Dscala-2.11 clean install package *Exception*: [ERROR] Failed to execute goal on project spark-core_2.10: Could not resolve dependencies for project

Re: Spark with data on NFS v HDFS

2015-03-05 Thread Tobias Pfeiffer
Hi, On Thu, Mar 5, 2015 at 10:58 PM, Ashish Mukherjee ashish.mukher...@gmail.com wrote: I understand Spark can be used with Hadoop or standalone. I have certain questions related to use of the correct FS for Spark data. What is the efficiency trade-off in feeding data to Spark from NFS v

Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Ted Yu
See following thread for 1.3.0 release: http://search-hadoop.com/m/JW1q5hV8c4 Looks like the release is around the corner. On Thu, Mar 5, 2015 at 3:26 PM, Cui Lin cui@hds.com wrote: Hi, Ted, Thanks for your reply. I noticed from the below link partitions.size will not work for

Re: External Data Source in Spark

2015-03-05 Thread Michael Armbrust
Currently we have implemented the External Data Source API and are able to push filters and projections. Could you provide some info on how perhaps the joins could be pushed to the original Data Source if both the data sources are from the same database? First a disclaimer: This is an

Re: Spark code development practice

2015-03-05 Thread fightf...@163.com
Hi, You can first establish a Scala IDE to develop and debug your Spark program, let's say, IntelliJ IDEA or Eclipse. Thanks, Sun. fightf...@163.com From: Xi Shen Date: 2015-03-06 09:19 To: user@spark.apache.org Subject: Spark code development practice Hi, I am new to Spark. I see every

spark-ec2 script problems

2015-03-05 Thread roni
Hi, I used the spark-ec2 script to create an ec2 cluster. Now I am trying to copy data from s3 into hdfs. I am doing this: [root@ip-172-31-21-160 ephemeral-hdfs]$ bin/hadoop distcp s3://xxx/home/mydata/small.sam hdfs://ec2-52-11-148-31.us-west-2.compute.amazonaws.com:9010/data1

Re: Spark code development practice

2015-03-05 Thread Xi Shen
Thanks guys, this is very useful :) @Stephen, I know spark-shell will create a SC for me. But I don't understand why we still need to do new SparkContext(...) in our code. Shouldn't we get it from somewhere? e.g. SparkContext.get. Another question: if I want my spark code to run in YARN later,
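One common pattern (a sketch, not quoted from the thread; the app name is made up) is to build the context from a SparkConf and let spark-submit supply the master, so the same code runs locally or with --master yarn-client:

    import org.apache.spark.{SparkConf, SparkContext}

    // No setMaster() here: spark-shell/spark-submit provide the master,
    // so the same jar can run locally or on YARN without code changes.
    val conf = new SparkConf().setAppName("my-app")
    val sc = new SparkContext(conf)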

Re: spark-ec2 script problems

2015-03-05 Thread Akhil Das
It works pretty fine for me with the script that comes with the 1.2.0 release. Here's a few things which you can try: - Add your s3 credentials to the core-site.xml: <property><name>fs.s3.awsAccessKeyId</name><value>ID</value></property> <property><name>fs.s3.awsSecretAccessKey</name><value>SECRET</value></property> - Do a

Re: Managing permissions when saving as text file

2015-03-05 Thread Akhil Das
Why not setup HDFS? Thanks Best Regards On Thu, Mar 5, 2015 at 4:03 PM, didmar marin.did...@gmail.com wrote: Hi, I'm having a problem involving file permissions on the local filesystem. On a first machine, I have two different users : - launcher, which launches my job from an uber jar

Writing to S3 and retrieving folder names

2015-03-05 Thread Mike Trienis
Hi All, I am receiving data from AWS Kinesis using Spark Streaming and am writing the data collected in the dstream to s3 using the output function: dstreamData.saveAsTextFiles("s3n://XXX:XXX@/") After running the application for several seconds, I end up with a sequence of directories in S3 that
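For reference, a hedged sketch of the API involved (bucket and credentials below are placeholders): DStream.saveAsTextFiles(prefix, suffix) writes one directory per batch named prefix-TIME_IN_MS.suffix, which is why a sequence of timestamped directories appears.

    // Each batch interval produces a directory such as
    //   s3n://KEY:SECRET@bucket/output/data-1425600000000.txt/
    dstreamData.saveAsTextFiles("s3n://KEY:SECRET@bucket/output/data", "txt")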

Re: Building Spark 1.3 for Scala 2.11 using Maven

2015-03-05 Thread Marcelo Vanzin
I've never tried it, but I'm pretty sure in the very least you want -Pscala-2.11 (not -D). On Thu, Mar 5, 2015 at 4:46 PM, Night Wolf nightwolf...@gmail.com wrote: Hey guys, Trying to build Spark 1.3 for Scala 2.11. I'm running with the folllowng Maven command; -DskipTests -Dscala-2.11

Re: Building Spark 1.3 for Scala 2.11 using Maven

2015-03-05 Thread Marcelo Vanzin
Ah, and you may have to use dev/change-version-to-2.11.sh. (Again, never tried compiling with scala 2.11.) On Thu, Mar 5, 2015 at 4:52 PM, Marcelo Vanzin van...@cloudera.com wrote: I've never tried it, but I'm pretty sure in the very least you want -Pscala-2.11 (not -D). On Thu, Mar 5, 2015

Re: SparkSQL JSON array support

2015-03-05 Thread Michael Armbrust
You can do what you want with lateral view explode, but what seems to be missing is that jsonRDD converts json objects into structs (fixed keys with a fixed order) and fields in a struct are accessed using a `.` val myJson = sqlContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
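A hedged sketch of the full pattern being described, assuming a HiveContext is used (LATERAL VIEW is HiveQL) and that the struct fields are as in the example above:

    val myJson = hiveContext.jsonRDD(sc.parallelize("""{"foo":[{"bar":1},{"baz":2}]}""" :: Nil))
    myJson.registerTempTable("JsonTest")

    // Explode the array, then reach into each struct element with `.`
    hiveContext.sql(
      "SELECT f.bar, f.baz FROM JsonTest LATERAL VIEW explode(foo) t AS f").collect()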

Re: External Data Source in Spark

2015-03-05 Thread Michael Armbrust
One other caveat: While writing up this example I realized that we make SparkPlan private and we are already packaging 1.3-RC3... So you'll need a custom build of Spark for this to run. We'll fix this in the next release. On Thu, Mar 5, 2015 at 5:26 PM, Michael Armbrust mich...@databricks.com

Re: Spark code development practice

2015-03-05 Thread Koen Vantomme
Use the spark-shell command and the shell will open. Type :paste and then paste your code, then press Ctrl-D. To open spark-shell: spark/bin ./spark-shell Sent from my iPhone. On 6 Mar 2015 at 02:28, fightf...@163.com fightf...@163.com wrote: Hi, You can first

SparkSQL JSON array support

2015-03-05 Thread Justin Pihony
Are there any plans to support JSON arrays more fully? Take for example: val myJson = sqlContext.jsonRDD(List("""{"foo":[{"bar":1},{"baz":2}]}""")) myJson.registerTempTable("JsonTest") I would like a way to pull out parts of the array data based on a key: sql("SELECT foo[bar] FROM JsonTest")

Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Todd Nist
I am running Spark on a HortonWorks HDP Cluster. I have deployed their prebuilt version but it is only for Spark 1.2.0, not 1.2.1, and there are a few fixes and features in 1.2.1 that I would like to leverage. I just downloaded the spark-1.2.1 source and built it to support Hadoop 2.6 by doing the

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Pei-Lun Lee
Thanks for the DirectOutputCommitter example. However, I found it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass of FileOutputCommitter. On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor

Re: using log4j2 with spark

2015-03-05 Thread Akhil Das
You may exclude the log4j dependency while building. You can have a look at this build file to see how to exclude libraries http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/missing_dependencies_in_jar_files.html Thanks Best Regards On Thu, Mar 5, 2015 at 1:20

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Aaron Davidson
Yes, unfortunately that direct dependency makes this injection much more difficult for saveAsParquetFile. On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote: Thanks for the DirectOutputCommitter example. However I found it only works for saveAsHadoopFile. What about

Re: Having lots of FetchFailedException in join

2015-03-05 Thread Aaron Davidson
However, Executors were dying when using Netty as well, so it is possible that the OOM was occurring then too. It is also possible only one of your Executors OOMs (due to a particularly large task) and the others display various exceptions while trying to fetch the shuffle blocks from the failed

Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Akhil Das
When you use KafkaUtils.createStream with StringDecoders, it will return String objects inside your messages stream. To access the elements from the json, you could do something like the following: val mapStream = messages.map(x => { val mapper = new ObjectMapper() with ScalaObjectMapper
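A hedged completion of that idea (assuming the jackson-module-scala dependency is on the classpath and that messages is the (key, value) DStream from createStream, as in the quoted snippet):

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule
    import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

    val parsed = messages.map { case (_, json) =>
      val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue[Map[String, Any]](json)   // each message becomes a Map[String, Any]
    }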

Connection PHP application to Spark Sql thrift server

2015-03-05 Thread fanooos
We have two applications that need to connect to the Spark Sql thrift server. The first application is developed in Java. Having the spark sql thrift server running, we followed the steps in this link https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC and the

Re: Having lots of FetchFailedException in join

2015-03-05 Thread Jianshi Huang
Thanks. I was about to submit a ticket for this :) Also there's a ticket for sort-merge based groupbykey: https://issues.apache.org/jira/browse/SPARK-3461 BTW, any idea why running with netty didn't output OOM error messages? It's very confusing in troubleshooting. Jianshi On Thu, Mar 5, 2015 at

Spark with data on NFS v HDFS

2015-03-05 Thread Ashish Mukherjee
Hello, I understand Spark can be used with Hadoop or standalone. I have certain questions related to use of the correct FS for Spark data. What is the efficiency trade-off in feeding data to Spark from NFS v HDFS? If one is not using Hadoop, is it still usual to house data in HDFS for Spark to

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Sean Owen
Jackson 1.9.13? and codehaus.jackson.version? that's already set by the profile hadoop-2.4. On Thu, Mar 5, 2015 at 6:13 PM, Ted Yu yuzhih...@gmail.com wrote: Please add the following to build command: -Djackson.version=1.9.3 Cheers On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist

IncompatibleClassChangeError

2015-03-05 Thread ey-chih chow
Hi, I am using CDH5.3.2 now for a Spark project. I got the following exception: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected I used all the CDH5.3.2 jar files in my pom file to generate the application jar file.

Spark v1.2.1 failing under BigTop build in External Flume Sink (due to missing Netty library)

2015-03-05 Thread Kelly, Jonathan
I'm running into an issue building Spark v1.2.1 (as well as the latest in branch-1.2 and v1.3.0-rc2 and the latest in branch-1.3) with BigTop (v0.9, which is not quite released yet). The build fails in the External Flume Sink subproject with the following error: [INFO] Compiling 5 Scala

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Victor Tso-Guillen
That particular class you did find is under parquet/... which means it was shaded. Did you build your application against a hadoop2.6 dependency? The maven central repo only has 2.2 but HDP has its own repos. On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote: I am running

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Ted Yu
Please add the following to build command: -Djackson.version=1.9.3 Cheers On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote: I am running Spark on a HortonWorks HDP Cluster. I have deployed there prebuilt version but it is only for Spark 1.2.0 not 1.2.1 and there are a few

Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Todd Nist
@Victor, I'm pretty sure I built it correctly; I specified -Dhadoop.version=2.6.0. Am I missing something here? Followed the docs on this but I'm open to suggestions. make-distribution.sh --name hadoop2.6 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests

Re: IncompatibleClassChangeError

2015-03-05 Thread M. Dale
In Hadoop 1.x TaskAttemptContext is a class (for example, https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/mapred/TaskAttemptContext.html) In Hadoop 2.x TaskAttemptContext is an interface (https://hadoop.apache.org/docs/r2.4.0/api/org/apache/hadoop/mapreduce/TaskAttemptContext.html)

Re: Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread Daniel Siegmann
An RDD is a Resilient *Distributed* Data set. The partitioning and distribution of the data happens in the background. You'll occasionally need to concern yourself with it (especially to get good performance), but from an API perspective it's mostly invisible (some methods do allow you to specify
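An illustrative sketch of that point (file path and partition count are assumptions): the distribution happens behind the scenes, and reduce folds the partial results from every node into one value.

    // Read the file into (for example) 8 partitions spread over the cluster
    val nums = sc.textFile("/path/to/numbers.txt", 8).map(_.toDouble)

    // reduce() combines per-partition results from all nodes into one value on the driver
    val total = nums.reduce(_ + _)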

Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Helena Edelson
Hi Cui, What version of Spark are you using? There was a bug ticket that may be related to this, fixed in core/src/main/scala/org/apache/spark/rdd/RDD.scala, that is merged into versions 1.3.0 and 1.2.1. If you are using 1.1.1 that may be the reason, but it’s a stretch

spark-shell --master yarn-client fail on Windows

2015-03-05 Thread Xi Shen
Hi, My HDFS and YARN services are started, and my spark-shell can work in local mode. But when I try spark-shell --master yarn-client, a job can be created at the YARN service, but it will fail very soon. The Diagnostics are: Application application_1425559747310_0002 failed 2 times due to AM

Re: Extra output from Spark run

2015-03-05 Thread Sean Owen
In the console, you'd find this draws a progress bar illustrating the current stage progress. In logs, it shows up as this sort of 'pyramid' since CR makes a newline. You can turn it off with spark.ui.showConsoleProgress = false On Thu, Mar 5, 2015 at 2:11 AM, cjwang c...@cjwang.us wrote: When

Managing permissions when saving as text file

2015-03-05 Thread didmar
Hi, I'm having a problem involving file permissions on the local filesystem. On a first machine, I have two different users : - launcher, which launches my job from an uber jar file - spark, which runs the master On a second machine, I have a user spark (same uid/gid as the other) which runs the

Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread davidkl
Hello Julaiti, Maybe I am just asking the obvious :-) but did you check disk IO? Depending on what you are doing that could be the bottleneck. In my case none of the HW resources was a bottleneck, but using some distributed features that were blocking execution (e.g. Hazelcast). Could that be

Nullpointer Exception on broadcast variables (YARN Cluster mode)

2015-03-05 Thread samriddhac
Hi All, I have a simple spark application, where I am trying to broadcast a String type variable on a YARN Cluster. But every time I try to access the broadcasted variable's value, I get null within the Task. It would be really helpful if you could suggest what I am doing wrong
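For reference, a hedged sketch of the usual broadcast pattern (whether this matches the poster's code is unknown; rdd is a placeholder): create the broadcast on the driver and read .value inside the task closure.

    // Driver side
    val bc = sc.broadcast("some shared string")

    // Executor side: capture only the Broadcast handle and call .value in the closure
    val tagged = rdd.map(record => bc.value + ":" + record)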

Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Ted Yu
Cui: You can check messages.partitions.size to determine whether messages is an empty RDD. Cheers On Thu, Mar 5, 2015 at 12:52 AM, Akhil Das ak...@sigmoidanalytics.com wrote: When you use KafkaUtils.createStream with StringDecoders, it will return String objects inside your messages stream.
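A hedged sketch of that check inside a streaming job; note the related reply in this thread points out that partitions.size may not reliably detect an empty RDD before Spark 1.3, so take(1) is shown as a heavier fallback.

    messages.foreachRDD { rdd =>
      // partitions.size > 0 is the suggestion above; take(1).isEmpty is a
      // slower but more reliable emptiness test on pre-1.3 Spark
      if (rdd.partitions.size > 0 && !rdd.take(1).isEmpty) {
        // process the non-empty batch ...
      }
    }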

Partitioning Dataset and Using Reduce in Apache Spark

2015-03-05 Thread raggy
I am trying to use Apache spark to load up a file, and distribute the file to several nodes in my cluster and then aggregate the results and obtain them. I don't quite understand how to do this. From my understanding the reduce action enables Spark to combine the results from different nodes and

Map task in Trident.

2015-03-05 Thread Vladimir Protsenko
There is a map function in Clojure so you can map one collection to another. The most resembling operation is *each*, however when f is applied to an input tuple we get a tuple with two fields: f([field-a]) = [field-a field-b]. How could I realize the same operation on a Trident stream?

Re: Extra output from Spark run

2015-03-05 Thread davidkl
If you do not want those progress indications to appear, just set spark.ui.showConsoleProgress to false, e.g.: System.setProperty("spark.ui.showConsoleProgress", "false"); Regards -- View this message in context:

RE: Connection PHP application to Spark Sql thrift server

2015-03-05 Thread Cheng, Hao
Can you query upon Hive? Let's confirm if it's a bug of SparkSQL in your PHP code first. -Original Message- From: fanooos [mailto:dev.fano...@gmail.com] Sent: Thursday, March 5, 2015 4:57 PM To: user@spark.apache.org Subject: Connection PHP application to Spark Sql thrift server We

Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Helena Edelson
Great point :) Cui, Here's a cleaner way than I had before, w/out the use of spark sql for the mapping: KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( ssc, kafka.kafkaParams, Map("github" -> 5), StorageLevel.MEMORY_ONLY) .map { case (k, v) =>

Re: Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Evan R. Sparks
Hi Wush, I'm CC'ing user@spark.apache.org (which is the new list) and BCC'ing u...@spark.incubator.apache.org. In Spark 1.3, schemaRDD is in fact being renamed to DataFrame (see: https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html ) As for a
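A hedged sketch of the manual conversion being discussed - mapping each Row of a SchemaRDD to an MLlib LabeledPoint by position; the column layout (label first, two numeric features) is an assumption.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Assumes column 0 is a numeric label and columns 1..2 are numeric features
    val points = schemaRDD.map { row =>
      LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
    }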

Spark code development practice

2015-03-05 Thread Xi Shen
Hi, I am new to Spark. I see every spark program has a main() function. I wonder if I can run the spark program directly, without using spark-submit. I think it will be easier for early development and debug. Thanks, David

Re: Spark code development practice

2015-03-05 Thread Stephen Boesch
Hi Xi, Yes, you can do the following: val sc = new SparkContext("local[2]", "mptest") // or: val sc = new SparkContext("spark://master:7070", "mptest") val fileDataRdd = sc.textFile("/path/to/dir") val fileLines = fileDataRdd.take(100) The key here - i.e. the answer to your specific question -

why my YoungGen GC takes so long time?

2015-03-05 Thread lisendong
I found my task takes a very long time for YoungGen GC. I set the young gen size to about 1.5G, and I wonder why it takes so long? Not all the tasks take such a long time, only about 1% of the tasks... 180.426: [GC [PSYoungGen: 9916105K->1676785K(14256640K)] 26201020K->18690057K(53403648K), 17.3581500

Re: Managing permissions when saving as text file

2015-03-05 Thread didmar
Ok, I solved this problem by : - changing the primary group of launcher to spark - adding umask 002 in launcher's .bashrc and spark's init.d script -- View this message in context:

spark-stream programme failed on yarn-client

2015-03-05 Thread fenghaixiong
Hi all, I'm trying to write a Spark Streaming programme so I read the

Compile Spark with Maven Zinc Scala Plugin

2015-03-05 Thread Night Wolf
Hey, Trying to build the latest Spark 1.3 with Maven using -DskipTests clean install package. But I'm getting errors with zinc; in the logs I see: [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-network-common_2.11 --- ... [error] Required file not found:

Re: Compile Spark with Maven Zinc Scala Plugin

2015-03-05 Thread ??
try it with mvn -DskipTests -Pscala-2.11 clean install package

Re: Training Random Forest

2015-03-05 Thread Xiangrui Meng
We don't support warm starts or online updates for decision trees. So if you call train twice, only the second dataset is used for training. -Xiangrui On Thu, Mar 5, 2015 at 12:31 PM, drarse drarse.a...@gmail.com wrote: I am testing the Random Forest in Spark, but I have a question... If I train
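Given that answer, one hedged workaround sketch is to union the old and new data and retrain from scratch; dataset names and parameters below are placeholders.

    import org.apache.spark.mllib.tree.RandomForest

    val combined = oldData.union(newData)   // both RDD[LabeledPoint]
    val model = RandomForest.trainClassifier(
      combined,
      2,                  // numClasses
      Map[Int, Int](),    // categoricalFeaturesInfo
      100,                // numTrees
      "auto",             // featureSubsetStrategy
      "gini",             // impurity
      5,                  // maxDepth
      32)                 // maxBins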

Construct model matrix from SchemaRDD automatically

2015-03-05 Thread Wush Wu
Dear all, I am a new spark user from R. After exploring the schemaRDD, I notice that it is similar to data.frame. Is there a feature like `model.matrix` in R to convert schemaRDD to model matrix automatically according to the type without explicitly converting them one by one? Thanks, Wush