Re: Removing the Mesos fine-grained mode

2015-11-30 Thread Adam McElwee
To eliminate any skepticism around whether cpu is a good performance metric
for this workload, I did a couple comparison runs of an example job to
demonstrate a more universal change in performance metrics (stage/job time)
between coarse and fine-grained mode on mesos.

The workload is identical here - pulling tgz archives from s3, parsing json
lines from the files and ultimately creating documents to index into solr.
The tasks are not inserting into solr (just to let you know that there's no
network side-effect of the map task). The runs are on the same exact
hardware in ec2 (m2.4xlarge, with 68GB of ram and 45G executor memory),
exact same jvm and it's not dependent on order of running the jobs, meaning
I get the same results whether I run the coarse first or whether I run the
fine-grained first. No other frameworks/tasks are running on the mesos
cluster during the test. I see the same results whether it's a 3-node
cluster, or whether it's a 200-node cluster.

With the CMS collector in fine-grained mode, the map stage takes roughly
2.9h, and coarse-grained mode takes 3.4h. Because both modes initially
start out performing similarly, the total execution time gap widens as the
job size grows. To put that another way, the difference is much smaller for
jobs/stages < 1 hour. When I submit this job for a much larger dataset that
takes 5+ hours, the difference in total stage time moves closer and closer
to roughly 20-30% longer execution time.

With the G1 collector in fine-grained mode, the map stage takes roughly
2.2h, and coarse-grained mode takes 2.7h. Again, the fine and coarse-grained
execution tests are on the exact same machines, exact same dataset, and
only changing spark.mesos.coarse to true/false.

Let me know if there's anything else I can provide here.

Thanks,
-Adam


On Mon, Nov 23, 2015 at 11:27 AM, Adam McElwee  wrote:

>
>
> On Mon, Nov 23, 2015 at 7:36 AM, Iulian Dragoș  wrote:
>
>>
>>
>> On Sat, Nov 21, 2015 at 3:37 AM, Adam McElwee  wrote:
>>
>>> I've used fine-grained mode on our mesos spark clusters until this week,
>>> mostly because it was the default. I started trying coarse-grained because
>>> of the recent chatter on the mailing list about wanting to move the mesos
>>> execution path to coarse-grained only. The odd thing is, coarse-grained vs
>>> fine-grained yields drastically different cluster utilization metrics for
>>> any of the jobs I've tried out this week.
>>>
>>> If this is best as a new thread, please let me know, and I'll try not to
>>> derail this conversation. Otherwise, details below:
>>>
>>
>> I think it's ok to discuss it here.
>>
>>
>>> We monitor our spark clusters with ganglia, and historically, we
>>> maintain at least 90% cpu utilization across the cluster. Making a single
>>> configuration change to use coarse-grained execution instead of
>>> fine-grained consistently yields a cpu utilization pattern that starts
>>> around 90% at the beginning of the job, and then it slowly decreases over
>>> the next 1-1.5 hours to level out around 65% cpu utilization on the
>>> cluster. Does anyone have a clue why I'd be seeing such a negative effect
>>> of switching to coarse-grained mode? GC activity is comparable in both
>>> cases. I've tried 1.5.2, as well as the 1.6.0 preview tag that's on github.
>>>
>>
>> I'm not very familiar with Ganglia, and how it computes utilization. But
>> one thing comes to mind: did you enable dynamic allocation
>> 
>> on coarse-grained mode?
>>
>
> Dynamic allocation is definitely not enabled. The only delta between runs
> is adding --conf "spark.mesos.coarse=true" to the job submission. Ganglia is
> just pulling stats from the procfs, and I've never seen it report bad
> results. If I sample any of the 100-200 nodes in the cluster, dstat
> reflects the same average cpu that I'm seeing reflected in ganglia.
>
>>
>> iulian
>>
>
>


Re: Problem in running MLlib SVM

2015-11-30 Thread Fazlan Nazeem
You should never use the training data to measure your prediction accuracy.
Always use a fresh dataset (test data) for this purpose.
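
For example, something along these lines in Scala (an untested sketch; it
assumes an existing SparkContext `sc` and a libsvm input path, and the 70/30
split and iteration count are arbitrary):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, inputPath)
    // hold out 30% as a test set instead of scoring on the training data
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    val model = SVMWithSGD.train(training, 100)

    // with the default threshold, predict() returns a 0/1 label
    val accuracy = test.filter(p => model.predict(p.features) == p.label)
      .count().toDouble / test.count()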

On Sun, Nov 29, 2015 at 8:36 AM, Jeff Zhang  wrote:

> I think this should represent the label of LabledPoint (0 means negative 1
> means positive)
> http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
>
> The document you mention is for the mathematical formula, not the
> implementation.
>
> On Sun, Nov 29, 2015 at 9:13 AM, Tarek Elgamal 
> wrote:
>
>> According to the documentation, by
>> default, if wTx≥0 then the outcome is positive, and negative otherwise. I
>> suppose that wTx is the "score" in my case. If score is more than 0 and the
>> label is positive, then I return 1 which is correct classification and I
>> return zero otherwise. Do you have any idea how to classify a point as
>> positive or negative using this score or another function ?
>>
>> On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang  wrote:
>>
>>> if((score >=0 && label == 1) || (score <0 && label == 0))
>>>  {
>>>   return 1; //correct classification
>>>  }
>>>  else
>>>   return 0;
>>>
>>>
>>>
>>> I suspect score is always between 0 and 1
>>>
>>>
>>>
>>> On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal  wrote:
>>>
 Hi,

 I am trying to run the straightforward SVM example, but I am getting
 low accuracy (around 50%) when I predict using the same data I used for
 training. I am probably doing the prediction the wrong way. My code is
 below. I would appreciate any help.


 import java.util.List;

 import org.apache.spark.SparkConf;
 import org.apache.spark.SparkContext;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.Function;
 import org.apache.spark.api.java.function.Function2;
 import org.apache.spark.mllib.classification.SVMModel;
 import org.apache.spark.mllib.classification.SVMWithSGD;
 import org.apache.spark.mllib.regression.LabeledPoint;
 import org.apache.spark.mllib.util.MLUtils;

 import scala.Tuple2;
 import edu.illinois.biglbjava.readers.LabeledPointReader;

 public class SimpleDistSVM {
   public static void main(String[] args) {
     SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
     SparkContext sc = new SparkContext(conf);
     String inputPath = args[0];

     // Read training data
     JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, inputPath).toJavaRDD();

     // Run training algorithm to build the model.
     int numIterations = 3;
     final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations);

     // Clear the default threshold.
     model.clearThreshold();

     // Predict points in test set and map to an RDD of 0/1 values where
     // 0 is misclassification and 1 is correct classification
     JavaRDD<Integer> classification = data.map(new Function<LabeledPoint, Integer>() {
       public Integer call(LabeledPoint p) {
         int label = (int) p.label();
         Double score = model.predict(p.features());
         if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
           return 1; // correct classification
         } else {
           return 0;
         }
       }
     });

     // sum up all values in the rdd to get the number of correctly classified examples
     int sum = classification.reduce(new Function2<Integer, Integer, Integer>() {
       public Integer call(Integer arg0, Integer arg1) throws Exception {
         return arg0 + arg1;
       }
     });

     // compute accuracy as the percentage of the correctly classified examples
     double accuracy = ((double) sum) / ((double) classification.count());
     System.out.println("Accuracy = " + accuracy);
   }
 }

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Thanks & Regards,

Fazlan Nazeem

*Software Engineer*

*WSO2 Inc*
Mobile : +94772338839
fazl...@wso2.com


Re: Need suggestions on monitor Spark progress

2015-11-30 Thread Jacek Laskowski
Hi,

My limited understanding of Spark tells me that a task is the smallest
unit of work, and Spark itself won't give you much below that. I
wouldn't expect it to, since "account" is a business entity, not a
Spark one.

What about using mapPartitions* to know the details of partitions and
do whatever you want (log to stdout or whatever)? Just a thought.
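
A rough, untested sketch of that idea (accountsRDD, processAccount and the
logging interval are placeholders, not Spark APIs):

    val processed = accountsRDD.mapPartitionsWithIndex { (partId, accounts) =>
      var count = 0
      accounts.map { account =>
        val result = processAccount(account)  // ~1s of work per account
        count += 1
        if (count % 100 == 0) {
          // lands in that executor's stdout log for the running task
          println(s"partition $partId: processed $count accounts")
        }
        result
      }
    }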

Pozdrawiam,
Jacek

--
Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
http://blog.jaceklaskowski.pl
Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
Follow me at https://twitter.com/jaceklaskowski
Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski


On Sun, Nov 29, 2015 at 3:12 PM, Yuhao Yang  wrote:
> Hi all,
>
> I got a simple processing job for about 20k accounts on 8 partitions, i.e.
> roughly 2500 accounts on each partition. Each account will take about 1s to
> complete the computation. That means each partition will take about 2500
> seconds to finish the batch.
>
> My question is how can I get the detailed progress of how many accounts have
> been processed for each partition during the computation. An ideal solution
> would allow me to know how many accounts have been processed periodically
> (like every minute) so I can monitor and take action to save some time.
> Right now, on the UI I can only see that the task is running.
>
> I know one solution is to split the data horizontally on driver and submit
> to spark in mini batches, yet I think that would waste some cluster resource
> and create extra complexity for result handling.
>
> Any experience or best practice is welcome. Thanks a lot.
>
> Regards,
> Yuhao

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Need suggestions on monitor Spark progress

2015-11-30 Thread Alex Rovner
In these scenarios it's fairly standard to report the metrics either
directly or through accumulators (
http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka)
to a time series database such as Graphite (http://graphite.wikidot.com/)
or OpenTSDB (http://opentsdb.net/) and monitor the progress through the UI
provided by the DB.
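
Roughly something like this (an untested sketch against the 1.x accumulator
API; accountsRDD, processAccount and the reporting call are placeholders):

    // count processed accounts on the executors
    val processedAccounts = sc.accumulator(0L, "processed accounts")

    val results = accountsRDD.map { account =>
      val r = processAccount(account)
      processedAccounts += 1L
      r
    }
    // the map above only runs once an action is triggered on `results`

    // on the driver, poll the value from a background thread while the job
    // runs and push it to Graphite/OpenTSDB, e.g.:
    // reportGauge("accounts.processed", processedAccounts.value)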

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052


On Mon, Nov 30, 2015 at 1:43 PM, Jacek Laskowski  wrote:

> Hi,
>
> My limited understanding of Spark tells me that a task is the smallest
> unit of work, and Spark itself won't give you much below that. I
> wouldn't expect it to, since "account" is a business entity, not a
> Spark one.
>
> What about using mapPartitions* to know the details of partitions and
> do whatever you want (log to stdout or whatever)? Just a thought.
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Sun, Nov 29, 2015 at 3:12 PM, Yuhao Yang  wrote:
> > Hi all,
> >
> > I got a simple processing job for about 20k accounts on 8 partitions, i.e.
> > roughly 2500 accounts on each partition. Each account will take about 1s
> > to complete the computation. That means each partition will take about
> > 2500 seconds to finish the batch.
> >
> > My question is how can I get the detailed progress of how many accounts
> > have been processed for each partition during the computation. An ideal
> > solution would allow me to know how many accounts have been processed
> > periodically (like every minute) so I can monitor and take action to save
> > some time. Right now, on the UI I can only see that the task is running.
> >
> > I know one solution is to split the data horizontally on driver and
> submit
> > to spark in mini batches, yet I think that would waste some cluster
> resource
> > and create extra complexity for result handling.
> >
> > Any experience or best practice is welcome. Thanks a lot.
> >
> > Regards,
> > Yuhao
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: question about combining small parquet files

2015-11-30 Thread Nezih Yigitbasi
This looks interesting, thanks Ruslan. But compaction with Hive is as
simple as an INSERT OVERWRITE statement, since Hive supports
CombineFileInputFormat. Is it possible to do the same with Spark?
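
For what it's worth, a rough, untested sketch of the size-based heuristic
described in the quoted mail below (it assumes a flat, non-partitioned table
directory, and the paths and the 256MB target are made up):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val targetBytesPerFile = 256L * 1024 * 1024
    val fs = FileSystem.get(sc.hadoopConfiguration)

    // estimate the total input size by summing the table's file lengths
    val inputBytes = fs.listStatus(new Path("/warehouse/my_table"))
      .filter(_.isFile)
      .map(_.getLen)
      .sum

    val numPartitions = math.max(1, (inputBytes / targetBytesPerFile).toInt)

    // rewrite the table with fewer, larger parquet files
    sqlContext.table("my_table")
      .coalesce(numPartitions)
      .write.parquet("/warehouse/my_table_compacted")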

On Thu, Nov 26, 2015 at 9:47 AM, Ruslan Dautkhanov 
wrote:

> An interesting compaction approach of small files is discussed recently
>
> http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
>
>
> AFAIK Spark supports views too.
>
>
> --
> Ruslan Dautkhanov
>
> On Thu, Nov 26, 2015 at 10:43 AM, Nezih Yigitbasi <
> nyigitb...@netflix.com.invalid> wrote:
>
>> Hi Spark people,
>> I have a Hive table that has a lot of small parquet files and I am
>> creating a data frame out of it to do some processing, but since I have a
>> large number of splits/files my job creates a lot of tasks, which I don't
>> want. Basically what I want is the same functionality that Hive provides,
>> that is, to combine these small input splits into larger ones by specifying
>> a max split size setting. Is this currently possible with Spark?
>>
>> I look at coalesce() but with coalesce I can only control the number
>> of output files not their sizes. And since the total input dataset size
>> can vary significantly in my case, I cannot just use a fixed partition
>> count as the size of each output file can get very large. I then looked for
>> getting the total input size from an rdd to come up with some heuristic to
>> set the partition count, but I couldn't find any ways to do it (without
>> modifying the spark source).
>>
>> Any help is appreciated.
>>
>> Thanks,
>> Nezih
>>
>> PS: this email is the same as my previous email as I learned that my
>> previous email ended up as spam for many people since I sent it through
>> nabble, sorry for the double post.
>>
>
>


Re: Export BLAS module on Spark MLlib

2015-11-30 Thread DB Tsai
The workaround is to have your code in the same package, or to write a small
utility wrapper in that package so you can use those routines in your own
code. Mostly we implemented the BLAS routines for our own needs, and we don't
have general use cases in mind. As a result, opening them up prematurely
would add to our API maintenance cost. Once they mature and people ask for
them, we will gradually make them public.
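
For reference, a minimal sketch of the "wrapper in the same package"
workaround: since mllib's BLAS object is private[spark], a file compiled under
the org.apache.spark namespace can forward to it (the BLASProxy name and the
handful of methods shown are just for illustration, so double-check the exact
signatures against your Spark version):

    package org.apache.spark.mllib.linalg

    // thin forwarding layer over the package-private BLAS object
    object BLASProxy {
      // y += a * x
      def axpy(a: Double, x: Vector, y: Vector): Unit = BLAS.axpy(a, x, y)
      // dot(x, y)
      def dot(x: Vector, y: Vector): Double = BLAS.dot(x, y)
      // x = a * x
      def scal(a: Double, x: Vector): Unit = BLAS.scal(a, x)
    }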

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sat, Nov 28, 2015 at 5:20 AM, Sasaki Kai  wrote:
> Hello
>
> I'm developing a Spark package that manipulates Vector and Matrix for
> machine learning.
> This package uses mllib.linalg.Vector and mllib.linalg.Matrix in order to
> achieve compatible interface to mllib itself. But mllib.linalg.BLAS module
> is private inside spark package. We cannot use BLAS from spark package.
> Due to this, there is no way to manipulate mllib.linalg.{Vector, Matrix}
> from spark package side.
>
> Is there any reason why BLAS module is not set public?
> If we cannot use BLAS, what is the reasonable option to manipulate Vector
> and Matrix from spark package?
>
> Regards
> Kai Sasaki(@Lewuathe)
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Removing the Mesos fine-grained mode

2015-11-30 Thread Timothy Chen
Hi Adam,

Thanks for the graphs and the tests; definitely interested in digging a
bit deeper to find out what could be causing this.

Do you have the spark driver logs for both runs?

Tim

On Mon, Nov 30, 2015 at 9:06 AM, Adam McElwee  wrote:
> To eliminate any skepticism around whether cpu is a good performance metric
> for this workload, I did a couple comparison runs of an example job to
> demonstrate a more universal change in performance metrics (stage/job time)
> between coarse and fine-grained mode on mesos.
>
> The workload is identical here - pulling tgz archives from s3, parsing json
> lines from the files and ultimately creating documents to index into solr.
> The tasks are not inserting into solr (just to let you know that there's no
> network side-effect of the map task). The runs are on the same exact
> hardware in ec2 (m2.4xlarge, with 68GB of ram and 45G executor memory),
> exact same jvm and it's not dependent on order of running the jobs, meaning
> I get the same results whether I run the coarse first or whether I run the
> fine-grained first. No other frameworks/tasks are running on the mesos
> cluster during the test. I see the same results whether it's a 3-node
> cluster, or whether it's a 200-node cluster.
>
> With the CMS collector in fine-grained mode, the map stage takes roughly
> 2.9h, and coarse-grained mode takes 3.4h. Because both modes initially start
> out performing similarly, the total execution time gap widens as the job
> size grows. To put that another way, the difference is much smaller for
> jobs/stages < 1 hour. When I submit this job for a much larger dataset that
> takes 5+ hours, the difference in total stage time moves closer and closer
> to roughly 20-30% longer execution time.
>
> With the G1 collector in fine-grained mode, the map stage takes roughly
> 2.2h, and coarse-grained mode takes 2.7h. Again, the fine and coarse-grained
> execution tests are on the exact same machines, exact same dataset, and only
> changing spark.mesos.coarse to true/false.
>
> Let me know if there's anything else I can provide here.
>
> Thanks,
> -Adam
>
>
> On Mon, Nov 23, 2015 at 11:27 AM, Adam McElwee  wrote:
>>
>>
>>
>> On Mon, Nov 23, 2015 at 7:36 AM, Iulian Dragoș
>>  wrote:
>>>
>>>
>>>
>>> On Sat, Nov 21, 2015 at 3:37 AM, Adam McElwee  wrote:

 I've used fine-grained mode on our mesos spark clusters until this week,
 mostly because it was the default. I started trying coarse-grained because
 of the recent chatter on the mailing list about wanting to move the mesos
 execution path to coarse-grained only. The odd thing is, coarse-grained vs
 fine-grained yields drastically different cluster utilization metrics for
 any of the jobs I've tried out this week.

 If this is best as a new thread, please let me know, and I'll try not to
 derail this conversation. Otherwise, details below:
>>>
>>>
>>> I think it's ok to discuss it here.
>>>

 We monitor our spark clusters with ganglia, and historically, we
 maintain at least 90% cpu utilization across the cluster. Making a single
 configuration change to use coarse-grained execution instead of 
 fine-grained
 consistently yields a cpu utilization pattern that starts around 90% at the
 beginning of the job, and then it slowly decreases over the next 1-1.5 
 hours
 to level out around 65% cpu utilization on the cluster. Does anyone have a
 clue why I'd be seeing such a negative effect of switching to 
 coarse-grained
 mode? GC activity is comparable in both cases. I've tried 1.5.2, as well as
 the 1.6.0 preview tag that's on github.
>>>
>>>
>>> I'm not very familiar with Ganglia, and how it computes utilization. But
>>> one thing comes to mind: did you enable dynamic allocation on coarse-grained
>>> mode?
>>
>>
>> Dynamic allocation is definitely not enabled. The only delta between runs
>> is adding --conf "spark.mesos.coarse=true" to the job submission. Ganglia is
>> just pulling stats from the procfs, and I've never seen it report bad
>> results. If I sample any of the 100-200 nodes in the cluster, dstat reflects
>> the same average cpu that I'm seeing reflected in ganglia.
>>>
>>>
>>> iulian
>>
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Bringing up JDBC Tests to trunk

2015-11-30 Thread Josh Rosen
The JDBC drivers are currently being pulled in as test-scope dependencies
of the `sql/core` module:
https://github.com/apache/spark/blob/f2fbfa444f6e8d27953ec2d1c0b3abd603c963f9/sql/core/pom.xml#L91

In SBT, these wind up on the Docker JDBC tests' classpath as a transitive
dependency of the `spark-sql` test JAR. However, what we *should* be doing
is adding them as explicit test dependencies of the
`docker-integration-tests` subproject, since Maven handles transitive test
JAR dependencies differently than SBT (see
https://github.com/apache/spark/pull/9876#issuecomment-158593498 for some
discussion). If you choose to make that fix as part of your PR, be sure to
move the version handling into the root POM so that the versions in both
modules stay in sync. We might also be able to simply move the JDBC driver
dependencies to docker-integration-tests' POM if it turns out that they're
not used anywhere else (that's my hunch).

On Sun, Nov 22, 2015 at 6:49 PM, Luciano Resende 
wrote:

> Hey Josh,
>
> Thanks for helping bring this up. I have just pushed a WIP PR for getting
> the DB2 tests running on Docker, and I have a question about how the JDBC
> drivers are actually being set up for the other data sources (MySQL and
> PostgreSQL). Are they set up directly on the Jenkins slaves? I didn't see
> the jars or anything specific in the pom or other files...
>
>
> Thanks
>
> On Wed, Oct 21, 2015 at 1:26 PM, Josh Rosen  wrote:
>
>> Hey Luciano,
>>
>> This sounds like a reasonable plan to me. One of my colleagues has
>> written some Dockerized MySQL testing utilities, so I'll take a peek at
>> those to see if there are any specifics of their solution that we should
>> adapt for Spark.
>>
>> On Wed, Oct 21, 2015 at 1:16 PM, Luciano Resende 
>> wrote:
>>
>>> I have started looking into PR-8101 [1] and what is required to merge it
>>> into trunk which will also unblock me around SPARK-10521 [2].
>>>
>>> So here is the minimal plan I was thinking about :
>>>
>>> - make the docker image version fixed so we make sure we are using the
>>> same image all the time
>>> - pull the required images on the Jenkins executors so tests are not
>>> delayed/timed out waiting for docker images to download
>>> - create a profile to run the JDBC tests
>>> - create daily jobs for running the JDBC tests
>>>
>>>
>>> In parallel, I learned that Alan Chin from my team is working with the
>>> AmpLab team to expand the build capacity for Spark, so I will use some of
>>> the nodes he is preparing to test/run these builds for now.
>>>
>>> Please let me know if there is anything else needed around this.
>>>
>>>
>>> [1] https://github.com/apache/spark/pull/8101
>>> [2] https://issues.apache.org/jira/browse/SPARK-10521
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Export BLAS module on Spark MLlib

2015-11-30 Thread Burak Yavuz
Or you could also use reflection like in this Spark Package:
https://github.com/brkyvz/lazy-linalg/blob/master/src/main/scala/com/brkyvz/spark/linalg/BLASUtils.scala

Best,
Burak

On Mon, Nov 30, 2015 at 12:48 PM, DB Tsai  wrote:

> The workaround is to have your code in the same package, or to write a small
> utility wrapper in that package so you can use those routines in your own
> code. Mostly we implemented the BLAS routines for our own needs, and we don't
> have general use cases in mind. As a result, opening them up prematurely
> would add to our API maintenance cost. Once they mature and people ask for
> them, we will gradually make them public.
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Sat, Nov 28, 2015 at 5:20 AM, Sasaki Kai  wrote:
> > Hello
> >
> > I'm developing a Spark package that manipulates Vector and Matrix for
> > machine learning.
> > This package uses mllib.linalg.Vector and mllib.linalg.Matrix in order to
> > achieve compatible interface to mllib itself. But mllib.linalg.BLAS
> module
> > is private inside spark package. We cannot use BLAS from spark package.
> > Due to this, there is no way to manipulate mllib.linalg.{Vector, Matrix}
> > from spark package side.
> >
> > Is there any reason why BLAS module is not set public?
> > If we cannot use BLAS, what is the reasonable option to manipulate Vector
> > and Matrix from spark package?
> >
> > Regards
> > Kai Sasaki(@Lewuathe)
> >
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Export BLAS module on Spark MLlib

2015-11-30 Thread DB Tsai
I used reflection initially, but I found it's very slow, especially in a
tight loop. Maybe caching the reflection lookup would help, though I never
tried it.
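
If anyone wants to try it, a rough, untested sketch of caching the lookup so
that only Method.invoke is paid inside the loop (the class and method names
are assumptions based on how Scala objects compile, not a verified recipe):

    import java.lang.reflect.Method
    import org.apache.spark.mllib.linalg.Vector

    object CachedBlas {
      // a Scala object compiles to a BLAS$ class with a static MODULE$ field
      private val clazz = Class.forName("org.apache.spark.mllib.linalg.BLAS$")
      private val instance = clazz.getField("MODULE$").get(null)
      // look the Method up once instead of on every call
      private val dotMethod: Method =
        clazz.getMethod("dot", classOf[Vector], classOf[Vector])

      def dot(x: Vector, y: Vector): Double =
        dotMethod.invoke(instance, x, y).asInstanceOf[Double]
    }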

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Nov 30, 2015 at 2:15 PM, Burak Yavuz  wrote:
> Or you could also use reflection like in this Spark Package:
> https://github.com/brkyvz/lazy-linalg/blob/master/src/main/scala/com/brkyvz/spark/linalg/BLASUtils.scala
>
> Best,
> Burak
>
> On Mon, Nov 30, 2015 at 12:48 PM, DB Tsai  wrote:
>>
>> The workaround is to have your code in the same package, or to write a small
>> utility wrapper in that package so you can use those routines in your own
>> code. Mostly we implemented the BLAS routines for our own needs, and we don't
>> have general use cases in mind. As a result, opening them up prematurely
>> would add to our API maintenance cost. Once they mature and people ask for
>> them, we will gradually make them public.
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Sat, Nov 28, 2015 at 5:20 AM, Sasaki Kai  wrote:
>> > Hello
>> >
>> > I'm developing a Spark package that manipulates Vector and Matrix for
>> > machine learning.
>> > This package uses mllib.linalg.Vector and mllib.linalg.Matrix in order
>> > to
>> > achieve compatible interface to mllib itself. But mllib.linalg.BLAS
>> > module
>> > is private inside spark package. We cannot use BLAS from spark package.
>> > Due to this, there is no way to manipulate mllib.linalg.{Vector, Matrix}
>> > from spark package side.
>> >
>> > Is there any reason why BLAS module is not set public?
>> > If we cannot use BLAS, what is the reasonable option to manipulate
>> > Vector
>> > and Matrix from spark package?
>> >
>> > Regards
>> > Kai Sasaki(@Lewuathe)
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Problem in running MLlib SVM

2015-11-30 Thread Joseph Bradley
model.predict should return a 0/1 predicted label.  The example code is
misleading when it calls the prediction a "score."
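
To illustrate (a minimal Scala sketch, assuming an already-trained SVMModel
`model` and a LabeledPoint `p`):

    // with the default threshold (0.0), predict() returns a 0/1 label
    val predictedLabel: Double = model.setThreshold(0.0).predict(p.features)

    // after clearThreshold(), predict() returns the raw margin w^T x, which
    // you then have to threshold yourself, as the quoted example code does
    val rawScore: Double = model.clearThreshold().predict(p.features)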

On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem  wrote:

> You should never use the training data to measure your prediction
> accuracy. Always use a fresh dataset (test data) for this purpose.
>
> On Sun, Nov 29, 2015 at 8:36 AM, Jeff Zhang  wrote:
>
>> I think this should represent the label of LabledPoint (0 means negative
>> 1 means positive)
>> http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
>>
>> The document you mention is for the mathematical formula, not the
>> implementation.
>>
>> On Sun, Nov 29, 2015 at 9:13 AM, Tarek Elgamal 
>> wrote:
>>
>>> According to the documentation, by
>>> default, if wTx≥0 then the outcome is positive, and negative otherwise. I
>>> suppose that wTx is the "score" in my case. If score is more than 0 and the
>>> label is positive, then I return 1 which is correct classification and I
>>> return zero otherwise. Do you have any idea how to classify a point as
>>> positive or negative using this score or another function ?
>>>
>>> On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang  wrote:
>>>
 if((score >=0 && label == 1) || (score <0 && label == 0))
  {
   return 1; //correct classification
  }
  else
   return 0;



 I suspect score is always between 0 and 1



 On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal <
 tarek.elga...@gmail.com> wrote:

> Hi,
>
> I am trying to run the straightforward SVM example, but I am getting
> low accuracy (around 50%) when I predict using the same data I used for
> training. I am probably doing the prediction the wrong way. My code is
> below. I would appreciate any help.
>
>
> import java.util.List;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.api.java.function.Function2;
> import org.apache.spark.mllib.classification.SVMModel;
> import org.apache.spark.mllib.classification.SVMWithSGD;
> import org.apache.spark.mllib.regression.LabeledPoint;
> import org.apache.spark.mllib.util.MLUtils;
>
> import scala.Tuple2;
> import edu.illinois.biglbjava.readers.LabeledPointReader;
>
> public class SimpleDistSVM {
>   public static void main(String[] args) {
>     SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
>     SparkContext sc = new SparkContext(conf);
>     String inputPath = args[0];
>
>     // Read training data
>     JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, inputPath).toJavaRDD();
>
>     // Run training algorithm to build the model.
>     int numIterations = 3;
>     final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations);
>
>     // Clear the default threshold.
>     model.clearThreshold();
>
>     // Predict points in test set and map to an RDD of 0/1 values where
>     // 0 is misclassification and 1 is correct classification
>     JavaRDD<Integer> classification = data.map(new Function<LabeledPoint, Integer>() {
>       public Integer call(LabeledPoint p) {
>         int label = (int) p.label();
>         Double score = model.predict(p.features());
>         if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
>           return 1; // correct classification
>         } else {
>           return 0;
>         }
>       }
>     });
>
>     // sum up all values in the rdd to get the number of correctly classified examples
>     int sum = classification.reduce(new Function2<Integer, Integer, Integer>() {
>       public Integer call(Integer arg0, Integer arg1) throws Exception {
>         return arg0 + arg1;
>       }
>     });
>
>     // compute accuracy as the percentage of the correctly classified examples
>     double accuracy = ((double) sum) / ((double) classification.count());
>     System.out.println("Accuracy = " + accuracy);
>   }
> }
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Thanks & Regards,
>
> Fazlan Nazeem
>
> *Software Engineer*
>
> *WSO2 Inc*
> Mobile : +94772338839
> fazl...@wso2.com
>


Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+.

On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar  wrote:

>
> Hi folks,
>
> Does anyone know whether the Grid Search capability is enabled since the
> issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
> column doesn't exist" when trying to perform a grid search with Spark 1.4.0.
>
> Cheers,
> Ardo
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Grid search with Random Forest

2015-11-30 Thread Ndjido Ardo BAR
Hi Joseph,

Yes, Random Forest supports Grid Search on Spark 1.5+. But I'm getting a
"rawPredictionCol field does not exist" exception on Spark 1.5.2 for the
Gradient Boosting Trees classifier.


Ardo
On Tue, 1 Dec 2015 at 01:34, Joseph Bradley  wrote:

> It should work with 1.5+.
>
> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar 
> wrote:
>
>>
>> Hi folks,
>>
>> Does anyone know whether the Grid Search capability is enabled since the
>> issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
>> column doesn't exist" when trying to perform a grid search with Spark 1.4.0.
>>
>> Cheers,
>> Ardo
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


FOSDEM 2016 - take action by 4th of December 2015

2015-11-30 Thread Roman Shaposhnik
As most of you probably know FOSDEM 2016 (the biggest,
100% free open source developer conference) is right 
around the corner:
   https://fosdem.org/2016/

We hope to have an ASF booth and we would love to see as
many ASF projects as possible present at various tracks
(AKA Developer rooms):
   https://fosdem.org/2016/schedule/#devrooms

This year, for the first time, we are running a dedicated
Big Data and HPC Developer Room and given how much of that
open source development is done at ASF it would be great
to have folks submit talks to:
   https://hpc-bigdata-fosdem16.github.io

The CFPs for different Developer Rooms follow slightly different
schedules, but if you submit by the end of this week you should be fine.

Finally if you don't want to fish for CFP submission URL,
here it is:
   https://fosdem.org/submit

If you have any questions -- please email me *directly* and
hope to see as many of you as possible in two months! 

Thanks,
Roman.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Grid search with Random Forest

2015-11-30 Thread Benjamin Fradet
Hi Ndjido,

This is because GBTClassifier doesn't yet have a rawPredictionCol like the
RandomForestClassifier has.
Cf:
http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
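
One rough, untested way around it for GBT (Spark 1.5 ML API; the column names
and grid values below are just illustrative) is to evaluate on the prediction
column, which GBTClassifier does produce, instead of rawPrediction:

    import org.apache.spark.ml.classification.GBTClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val gbt = new GBTClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val grid = new ParamGridBuilder()
      .addGrid(gbt.maxDepth, Array(3, 5))
      .addGrid(gbt.maxIter, Array(10, 20))
      .build()

    // scores the "prediction" column, so no rawPredictionCol is needed
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")

    val cv = new CrossValidator()
      .setEstimator(gbt)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    // val cvModel = cv.fit(trainingDF)  // trainingDF: DataFrame with label/features columns
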
On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR"  wrote:

> Hi Joseph,
>
> Yes, Random Forest supports Grid Search on Spark 1.5+. But I'm getting a
> "rawPredictionCol field does not exist" exception on Spark 1.5.2 for the
> Gradient Boosting Trees classifier.
>
>
> Ardo
> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley  wrote:
>
>> It should work with 1.5+.
>>
>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar 
>> wrote:
>>
>>>
>>> Hi folks,
>>>
>>> Does anyone know whether the Grid Search capability is enabled since the
>>> issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
>>> column doesn't exist" when trying to perform a grid search with Spark 1.4.0.
>>>
>>> Cheers,
>>> Ardo
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: How to add 1.5.2 support to ec2/spark_ec2.py ?

2015-11-30 Thread Alexander Pivovarov
just want to follow up
On Nov 25, 2015 9:19 PM, "Alexander Pivovarov"  wrote:

> Hi Everyone
>
> I noticed that spark ec2 script is outdated.
> How to add 1.5.2 support to ec2/spark_ec2.py?
> What else (except of updating spark version in the script) should be done
> to add 1.5.2 support?
>
> We also need to update scala to 2.10.4 (currently it's 2.10.3)
>
> Alex
>


Re: Grid search with Random Forest

2015-11-30 Thread Ndjido Ardo BAR
Hi Benjamin,

Thanks, the documentation you sent is clear.
Is there any other way to perform a Grid Search with GBT?


Ndjido
On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet 
wrote:

> Hi Ndjido,
>
> This is because GBTClassifier doesn't yet have a rawPredictionCol like
> the RandomForestClassifier has.
> Cf:
> http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR"  wrote:
>
>> Hi Joseph,
>>
>> Yes, Random Forest supports Grid Search on Spark 1.5+. But I'm getting a
>> "rawPredictionCol field does not exist" exception on Spark 1.5.2 for the
>> Gradient Boosting Trees classifier.
>>
>>
>> Ardo
>> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley 
>> wrote:
>>
>>> It should work with 1.5+.
>>>
>>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar 
>>> wrote:
>>>

 Hi folks,

 Does anyone know whether the Grid Search capability is enabled since
 the issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
 column doesn't exist" when trying to perform a grid search with Spark 
 1.4.0.

 Cheers,
 Ardo




 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>